Error resolution of Zalando Research Flair NLP package installation on CentOS 7, “Failed building wheel for regex…”

I was working on an NLP tool for evaluation purposes and ran into an issue while creating the environment. The authors had set everything up on Ubuntu, so they may not have faced this issue, but I was replicating it on CentOS 7 and hit an error. Hope this will help someone. The project is based on PyTorch 0.4+… Continue reading Error resolution of Zalando Research Flair NLP package installation on CentOS 7, “Failed building wheel for regex…”
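Once the regex wheel builds and the install finishes, a quick way to confirm the environment works is to load Flair and tag a sentence. This is a minimal smoke test added here for illustration (not code from the post), and it assumes the pretrained “ner” model can be downloaded from your machine:

```python
# Minimal smoke test for a fresh Flair install (assumes the regex wheel built
# and PyTorch 0.4+ is present; the "ner" model is downloaded on first use).
import torch
from flair.data import Sentence
from flair.models import SequenceTagger

print("torch:", torch.__version__)

# Load a pretrained named-entity tagger and annotate a sample sentence.
tagger = SequenceTagger.load("ner")
sentence = Sentence("Zalando Research is based in Berlin .")
tagger.predict(sentence)
print(sentence.to_tagged_string())
```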


How to create an Apache Beam data pipeline and deploy it using Cloud Dataflow in Java

Cloud Dataflow is a fully managed Google service for executing data processing pipelines built with Apache Beam. What does fully managed mean? Like BigQuery, Cloud Dataflow dynamically provisions the optimal quantity and type of resources (i.e. CPU and memory instances) based on the volume and the specific resource requirements of your job. Cloud Dataflow is a serverless… Continue reading How to create an Apache Beam data pipeline and deploy it using Cloud Dataflow in Java
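To illustrate what that management looks like from the pipeline’s side, the code only declares the transforms and names the runner; Dataflow provisions and scales the workers. The sketch below uses the Python SDK rather than the Java used in the post, and the project, region, and bucket values are placeholders:

```python
# Minimal Beam pipeline handed to the Dataflow service (Python sketch of the
# idea; the post itself builds this in Java). All GCP values are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",             # let the managed service execute it
    project="my-gcp-project",            # placeholder project id
    region="us-central1",                # placeholder region
    temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Create" >> beam.Create(["hello", "dataflow"])
        | "Upper" >> beam.Map(str.upper)
        | "Log" >> beam.Map(print)       # shows up in worker logs on Dataflow
    )
```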


Google Dataflow Python ValueError: Unable to get the Filesystem for path gs://myprojetc/digport/ports.csv.gz

I am using Google Cloud to create an event-driven load from Cloud Storage to BigQuery using the Apache Beam Python library. I executed the ETL in “DirectRunner” mode and found no issue, but later, when I moved everything onto Dataflow to execute, I found an error. The command below was used to upload the file, and I… Continue reading Google Dataflow Python ValueError: Unable to get the Filesystem for path gs://myprojetc/digport/ports.csv.gz
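For context, Beam resolves the gs:// scheme through its FileSystems registry, and in my experience the GCS filesystem is only registered when the GCP extras are installed (pip install "apache-beam[gcp]") in the environment that runs the pipeline. Below is a small sketch of reading such a path through that registry, assuming those extras are present:

```python
# Sketch: resolving a gs:// path via Beam's FileSystems registry. If the GCP
# extras are missing, this is where "Unable to get the Filesystem for path
# gs://..." is raised. Assumes `pip install "apache-beam[gcp]"` and access
# to the bucket.
from apache_beam.io.filesystems import FileSystems

path = "gs://myprojetc/digport/ports.csv.gz"  # path from the error message

# Match the pattern, then open and read the first bytes of each object
# (the .gz is decompressed automatically by the default AUTO compression).
for metadata in FileSystems.match([path])[0].metadata_list:
    with FileSystems.open(metadata.path) as f:
        print(f.read(200))
```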


Python: Stream the ingestion of data into a database in real time using Dataflow.

In my previous articles, we solved real-time data ingestion problems using tools such as Apache Kafka, Storm, Flink, and Spark, and I showed in detail how to create such pipelines for real-time processing. In this blog, we will try to simulate a similar problem using Apache Beam and Dataflow with Python. Let’s say… Continue reading Python: Stream the ingestion of data into a database in real time using Dataflow.
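A bare-bones version of such a streaming job in the Beam Python SDK could look like the sketch below. The Pub/Sub topic, BigQuery table, and GCP project are placeholders (not the names used in the post), and the target table’s schema is reduced to a single string column for the sake of the example:

```python
# Streaming sketch: Pub/Sub -> Dataflow -> BigQuery (all names are placeholders).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    streaming=True,                      # keep the job running for live data
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-gcp-project/topics/events")   # placeholder topic
        | "Wrap" >> beam.Map(lambda msg: {"raw": msg.decode("utf-8")})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-gcp-project:mydataset.events",               # placeholder table
            schema="raw:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```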


Sample Dataflow Pipeline featuring Cloud Pub/Sub, Dataflow, and BigQuery…

Streaming data in Google Cloud Platform is typically published to Cloud Pub/Sub, a serverless real-time messaging service. Cloud Pub/Sub provides reliable delivery and can scale to more than a million messages per second. It stores copies of messages in multiple zones to provide “at least once” guaranteed delivery to subscribers, and there can be many… Continue reading Sample Dataflow Pipeline featuring Cloud Pub/Sub, Dataflow, and BigQuery…
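To give a feel for the publishing side of that flow, the snippet below pushes a single message to a topic with the google-cloud-pubsub client library. The project and topic names are placeholders, and the library plus application default credentials are assumed to be set up:

```python
# Publish one message to Cloud Pub/Sub (placeholder project/topic; assumes the
# google-cloud-pubsub package and application default credentials are set up).
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "events")  # placeholders

# publish() returns a future; result() blocks until Pub/Sub has stored the
# message and returns its server-assigned message id.
future = publisher.publish(topic_path, data=b"hello pubsub")
print("Published message id:", future.result())
```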


Solved: Protocol tcp Port Exclusion issues when running Hadoop on Windows Docker

If you’re looking for a simple and painless Hadoop deployment, Docker is the right tool for you. We mostly use Docker Community Edition (CE) (https://docs.docker.com/docker-for-windows/install/) on Microsoft Windows, and under system requirements it clearly says that the “Hyper-V and Containers Windows features must be enabled” to run Docker on Windows. In case you are using Docker Engine – Enterprise (EE), you… Continue reading Solved: Protocol tcp Port Exclusion issues when running Hadoop on Windows Docker
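As a starting point for diagnosing the port issue, the small helper below shells out to netsh to list the TCP port ranges Windows has reserved (Hyper-V/WinNAT reservations are the usual suspects when a Hadoop daemon cannot bind its port). This is an illustrative sketch, not code from the post, and it only works on Windows:

```python
# List Windows' excluded TCP port ranges, which Hyper-V/WinNAT can reserve and
# which may collide with the ports Hadoop daemons try to bind inside Docker.
# Windows-only sketch; run it from a normal command prompt.
import subprocess

result = subprocess.run(
    ["netsh", "int", "ipv4", "show", "excludedportrange", "protocol=tcp"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```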
