
Big Data Blogs

Recent Uploads

LAMP stack in Cloud: Building a Scalable, Secure and Highly Available architecture using AWS...

1. Requirement Overview: The acronym LAMP (Linux, Apache, MySQL, PHP) refers to an open-source stack used to serve the dynamic and static content of web applications. A small startup organization runs its software on the LAMP stack. The dynamic nature of demand and the projected growth in traffic drive the need for a massively scalable solution to enable the availability (and reliability) of its web-based application. This document presents an enhanced solution built on AWS cloud services. Moving to the AWS cloud can provide a much greater impetus to improve on the current state of the art, reduce maintenance, and move the infrastructure to a more secure and scalable environment. This solution does not include event and streaming processing…

Reference architecture of a big data solution in GCP and Azure...

This article showcases a reference architecture approach for the financial sector, where stream and batch processing are a common part of the solution alongside other designs. Requirement analysis is the first step in defining the implementation of any use case, so before moving to the reference architecture we first need to understand Requirements Engineering. Requirements Engineering is regarded as one of the most important steps in software engineering and takes about 30% of project time. When done properly, it provides a good foundation for system design and development, as the functionality and components needed for the system become clear during the process. It is also important to understand what a requirement is. Generally speaking…

Error resolution of Zalando Research Flair NLP package installation on Centos 7, "Failed building...

I was working on an NLP tool for evaluation purposes and ran into an issue while creating the environment. The authors had set up everything on Ubuntu, so they might not face this issue, but I was replicating the setup on CentOS 7 and hit an error. Hope this helps someone. The project is based on PyTorch 0.4+ and Python 3.6+; I described how I set up Python 3.6 on CentOS 7 in my previous article. While installing the Flair library with the command below, I get an issue: pip3.6 install flair .... .... Collecting docutils>=0.10 (from botocore<1.13.0,>=1.12.71->boto3->pytorch-pretrained-bert==0.3.0->flair) Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host…

How to install and create Python 3.6 virtualenv on HDP 3.0.1...

Many times the default Python version bundled with HDP, i.e. 2.7, is not sufficient to explore certain libraries, and an additional version has to be installed; in my case I need Python 3.6+ to explore NLP libraries. I have done this many times on other versions of HDP, but this time I want to create a cheat sheet so that it is fire-and-forget in the future. Vanilla CentOS 7 and the HDP 3.0.1 virtual box still do not ship a package for Python 3, so it's EPEL to the rescue (you need to be root for this). *Be careful not to run the command below on a dev/prod environment where many users are connected, as it refreshes packages/iptables with new system packages. #yum -y install epel-release In my case, it shows up at the latest version. Now let…

How to create an Apache Beam data pipeline and deploy it using Cloud Dataflow in Java...

Cloud Dataflow is a fully managed Google service for executing data processing pipelines built with Apache Beam. What does fully managed mean? Like BigQuery, Cloud Dataflow dynamically provisions the optimal quantity and type of resources (i.e. CPU or memory instances) based on the volume and specific resource requirements of your job. Cloud Dataflow is a serverless, auto-scaling service. Dataflow and Spark: Google Cloud Dataflow is closely analogous to Apache Spark in terms of API and engine; both are directed acyclic graph (DAG) based data processing engines. However, there are aspects of Dataflow that aren't directly comparable to Spark. Where Spark is strictly an API and engine with supporting technologies, Google Cloud Dataflow is all that plus Google's…
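
The article builds and deploys its pipeline in Java; purely as an illustration of the "fully managed" deployment model, here is a minimal Beam sketch in Python targeting the Dataflow runner. The project, region and bucket names are placeholders, not values from the article.

# Minimal Apache Beam pipeline submitted to the Dataflow runner (placeholder names).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",           # hand execution to the managed service
    project="my-gcp-project",          # placeholder project id
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "dataflow"])
     | "Upper" >> beam.Map(str.upper)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result"))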

Google Dataflow Python ValueError: Unable to get the Filesystem for path gs://myprojetc/digport/p...

I am using Google Cloud to ingest data from Cloud Storage into BigQuery using the Apache Beam Python library. Executing the ETL in "DirectRunner" mode showed no issue, but when I later moved everything onto Dataflow I hit an error. The command below was used to upload the file, and I can see the file is present at that location: gsutil cp datapip.csv.gz gs://myproject/data/datapip.csv.gz Sadly, whenever I run the command below to execute the pipeline in cloud mode, I get an error: python dfmypy.py -p myproject -b mybucket -d mydataset Correcting timestamps and writing to BigQuery dataset flights Traceback (most recent call last): File "df06.py", line 171, in <module> run(project=args['project'], bucket=args['bucket'], dataset=args['dataset'…
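
The excerpt stops before the resolution. One common cause of this particular ValueError is an Apache Beam installation that is missing the GCP extras, so the gs:// scheme has no registered filesystem; a quick, hedged way to check that from Python (the bucket path below is a placeholder):

# Diagnostic sketch: if apache-beam was installed without the [gcp] extra,
# matching a gs:// path raises the same kind of "Unable to get filesystem" ValueError.
from apache_beam.io.filesystems import FileSystems

try:
    FileSystems.match(["gs://myproject/data/*"])   # placeholder bucket/prefix
    print("gs:// filesystem is registered")
except ValueError as err:
    print("gs:// not registered - try: pip install 'apache-beam[gcp]'")
    print(err)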

Python: Stream data ingestion into a database in real time using Dataflow...

In my previous articles, we solved real-time data ingestion problems using tools like Apache Kafka, Storm, Flink and Spark, and I showed in detail how to create such pipelines for real-time processing. In this blog, we will simulate a similar problem using Apache Beam and Dataflow with Python. Let's say we have the sample data below, where the FL_DATE and DEP_TIME columns represent local dates and times without a timezone. You can find the dataset and Python code in my GitHub repository. This is a flight dataset spanning countries with different timezones, and the timezone offset is not present in the sample data. Since the timezone depends on the airport location, we will add a timezone offset to the dataset and convert the times to Coordinated Universal Time (UTC). Therefore…
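
As a rough sketch of that conversion (not the article's code), assuming each record already carries its airport's UTC offset in hours; the helper name and sample values are illustrative:

# Illustrative helper: combine FL_DATE and DEP_TIME with an assumed
# per-airport UTC offset (in hours) and convert the local time to UTC.
from datetime import datetime, timedelta, timezone

def as_utc(fl_date, dep_time, utc_offset_hours):
    local = datetime.strptime(fl_date + dep_time, "%Y-%m-%d%H%M")
    local = local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)

print(as_utc("2015-01-04", "0730", -5))   # -> 2015-01-04 12:30:00+00:00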

Sample Java Program on Google Cloud Pub/Sub...

Overview: This article contains a sample Java program for Google Cloud's Pub/Sub to publish messages. The solution is simple: set up the environment, create a topic, subscribe to that topic, and read the messages using a Java program. Prerequisites: create a new GCP project, enable the Pub/Sub API, set environment variables, Java 1.8, Java SDK, Eclipse. Setup Pub/Sub. Create a topic with Cloud Pub/Sub: open the Google Cloud Shell and create a new Pub/Sub topic using the command below: export PUBSUB_TOPIC=mynewtopic gcloud pubsub topics create $PUBSUB_TOPIC Create a new Pub/Sub subscription: open the Google Cloud Shell and create a new Pub/Sub subscription using the command below: export PUBSUB_SUBSCRIPTION=mynewsub gcloud pubsub subscriptions create…
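
The article's sample code is in Java; as a quick way to sanity-check the topic from Python, the google-cloud-pubsub client can publish to it. The project id below is a placeholder; the topic name matches the gcloud command above.

# Publish a test message with the google-cloud-pubsub Python client.
from google.cloud import pubsub_v1

project_id = "my-gcp-project"   # placeholder project id
topic_id = "mynewtopic"         # topic created with the gcloud command above

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

future = publisher.publish(topic_path, data=b"hello pub/sub")
print(future.result())          # prints the server-assigned message id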

Sample Dataflow Pipeline featuring Cloud Pub/Sub, Dataflow, and BigQuery......

Streaming data in Google Cloud Platform is typically published to Cloud Pub/Sub, a serverless real-time messaging service. Cloud Pub/Sub provides reliable delivery and can scale to more than a million messages per second. It stores copies of messages in multiple zones to provide "at least once" guaranteed delivery to subscribers, and there can be many simultaneous subscribers. The simulation code that we are writing here is only for quick experimentation with streaming data, hence I will not take the extra effort needed to make it fault-tolerant. If we had to do so, we could make the simulation fault-tolerant by starting from a BigQuery query that is bounded in terms of a time range, with the start of that time range automatically inferred from the last-notified record…
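
Not the article's pipeline, just a minimal Python sketch of that shape: read from a placeholder Pub/Sub topic and stream a single string column into a placeholder BigQuery table.

# Sketch of a streaming pipeline: Pub/Sub -> Dataflow/Beam -> BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           topic="projects/my-gcp-project/topics/mynewtopic")
     | "ToRow" >> beam.Map(lambda msg: {"event_data": msg.decode("utf-8")})
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-gcp-project:mydataset.events",
           schema="event_data:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))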

Solved: Protocol tcp Port Exclusion issues when running Hadoop on Windows Docker...

If you're looking for a simple and painless Hadoop deployment, Docker is the right tool for you. We mostly use Docker Community Edition (CE) (https://docs.docker.com/docker-for-windows/install/) on Microsoft Windows; under system requirements it clearly says "Hyper-V and Containers Windows features must be enabled" to run Docker on Windows. If you are using Docker Engine Enterprise (EE) you might not require Hyper-V, but I think we developers are happy enough with Docker CE. The issue with Hyper-V is that it reserves some ports that Hadoop needs for inter-process communication. So you get my point: Hadoop uses certain ports, such as 50070, that are required to communicate with the DataNode and expose the HDFS URI, but these ports…

Technical debt: Understand and manage it...

As we write code or build a solution, we always create some sort of technical debt. That is not always a bad thing, but it is something we should learn over time to manage, control and track. Early in my career I always aimed for a perfect solution, and sometimes that mode of thinking put me in a situation where my manager thought I was putting more effort into analysing the solution than into delivering it fast. Believe it or not, in the six years since I entered Big Data, people have been impressed with me not because I learned how to create a perfect solution, but because I learned how to deliver fast, with some technical debt in my deliverables to pay off later. I agree that I…

PowerShell script wrappers using the Microsoft Azure AzCopy.exe tool...

Use case: We are building a data lake in Azure using Azure containers, ADF, Azure DWH, Databricks and many other Azure services. After ingesting a wide variety of data sources via APIs, on-premises databases, flat files and reporting servers, we learned that clients also need to push files into Azure Blob storage. Users can locate the files on their local systems but don't know the actual folder hierarchy in Azure needed to run the process. The requirement extends to helping them test their processes in the UAT and Dev environments. Introduction: AzCopy is a command-line utility that you can use to copy blobs or files to or from a storage account. This article helps you understand how it works and how to parameterize AzCopy.exe to export and import…

Azure Arc - redefine hybrid cloud......

Azure delivered 59% revenue growth in the latest quarter, more than expected and ahead of Microsoft's other products. Microsoft keeps introducing new cloud services and acquisitions, giving it an edge over its rivals Amazon and Google. https://www.zdnet.com/article/azure-synapse-analytics-combines-data-warehouse-lake-and-pipelines/ https://www.cnbc.com/2019/11/04/microsofts-azure-arc-lets-customers-use-its-tools-on-other-clouds.html "Azure Arc enables customers to have a central, unified, and self-service approach to manage their Windows and Linux Servers, Kubernetes clusters, and Azure data services wherever they are," writes Jeremy Winter, director of Program Management for Microsoft Azure. "Azure Arc also extends adoption of cloud practices like DevOps and Azure…

Apache Storm key takeaways......

Hadoop moves the code to the data; Storm moves the data to the code. This behavior makes more sense in a stream-processing system, because the data set isn't known beforehand, unlike in a batch job, and the data is continuously flowing through the code. A Storm cluster consists of two types of nodes: the master node and the worker nodes. The master node runs a daemon called Nimbus, and each worker node runs a daemon called a Supervisor. The master node can be thought of as the control center; in addition to its other responsibilities, this is where you'd run commands such as activate, deactivate, rebalance, or kill that are available in a Storm cluster (more on these commands on the Storm site). The worker nodes are where the logic in the spouts…

Approach to execute Machine Learning project, "Halt the Hate"......

Disclaimer: the analysis done in this project touches a sensitive issue in India, so I am not trying to convince anybody to trust my model. Real human society is so complex that "all the things may be interconnected in a different way than in the model." Imagine you are presented with a dataset of "Hate Crimes" in India and asked how to minimize these crimes by analyzing other factors. This is the problem I am taking on, to solve and analyze with a minimum of resources. Some will say that education and government-provided jobs for India's youth could solve this problem, and yes, you are right; you will see that relationship soon. You can also make your best guess by visualizing the many other factors that I will present here. In the next post, I'll create the Machine…

Fundamentals of Apache Spark...

You can view my other articles on Spark RDD at the links below: Apache Spark RDD API using Pyspark… Tips and Tricks for Apache Spark RDD API, Dataframe API. How did Spark become so efficient in data processing compared to MapReduce? It comes with a very advanced Directed Acyclic Graph (DAG) data processing engine, meaning that for every Spark job a DAG of tasks is created to be executed by the engine. In mathematical parlance, a DAG consists of a set of vertices and directed edges connecting them, and the tasks are executed as per the DAG layout. In the MapReduce case, the DAG consists of only two vertices, one for the map task and the other for the reduce task, with the edge directed from the map vertex to the reduce vertex. The in-memory data processing…
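
To make that concrete, here is a tiny PySpark job (illustrative, not from the article) whose DAG has exactly the two-stage map-then-reduce shape described above:

# A two-stage PySpark job whose DAG mirrors the map -> reduce example above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 101))          # source data
squared = rdd.map(lambda x: x * x)           # map vertex (lazy transformation)
total = squared.reduce(lambda a, b: a + b)   # reduce vertex triggers execution
print(total)                                 # 338350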

Bayesian-posterior imagination and applications......

Before going into Bayes and posterior probability, let us first understand a few terms we are going to use. Conditional probability and independence: a conditional probability is the probability of one event given that another event has occurred. In the "die-toss" example, the probability of event A, three dots showing, is P(A) = 1/6 on a single toss. But what if we know that event B, at least three dots showing, occurred? Then there are only four possible outcomes, one of which is A. The probability of A = {3} is 1/4, given that B = {3, 4, 5, 6} occurred. The conditional probability of A given B is written P(A|B). Event A is independent of B if the conditional probability of A given B is the same as the unconditional probability of A. That…
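
A quick way to check the die-toss numbers by enumerating the outcomes in Python (illustrative):

# Enumerate the die-toss example to check P(A|B) = P(A and B) / P(B) = 1/4.
from fractions import Fraction

outcomes = set(range(1, 7))        # faces of a fair die
A = {3}                            # event A: exactly three dots
B = {3, 4, 5, 6}                   # event B: at least three dots

p_B = Fraction(len(B), len(outcomes))
p_A_and_B = Fraction(len(A & B), len(outcomes))
print(p_A_and_B / p_B)             # -> 1/4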

Tips and Tricks for Apache Spark RDD API, Dataframe API- Part -1...

I am planning to share my knowledge of the Apache Spark RDD API, the Dataframe API and some tips and tricks. Combining everything into one would make for a very lengthy article, so I am dividing it into three separate articles, of which this is the first: the Spark RDD API, the Dataframe API, and tips and tricks on both APIs. Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, the Spark representation of a set of data spread across multiple machines, with APIs that let you act on it. An RDD can come from any data source, e.g. text files, JSON, CSV files, a database via JDBC, etc. In the demo I am using the Scala spark-shell prompt to show the API usage, like below: [root@victoria bin]#…
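
The demo itself uses the Scala spark-shell; for readers following along in Python, the same RDD operations in PySpark look roughly like this (the sample data is made up):

# RDD basics in PySpark: build an RDD and run a word count on it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "rdd", "api", "spark"])   # or sc.textFile("path/to/file")
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())   # e.g. [('spark', 2), ('rdd', 1), ('api', 1)]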

In-depth Kafka Message queue principles of high-reliability...

At present, many open-source distributed processing systems such as Cloudera, Apache Storm, Spark and others support integration with Kafka. Kafka is increasingly favored by many internet shops, which use it as one of their core messaging engines. The reliability of Kafka messaging can be regarded as that of a commercial-grade messaging middleware solution. In this article, we will look at Kafka's storage mechanism, replication principle, synchronization principle, and durability assurance to analyze its reliability. As shown in the figure above, a typical Kafka architecture includes several Producers (which can emit server logs, business data, page views generated by the front end, etc.) and several brokers (Kafka supports horizontal expansion)…
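
As a minimal illustration of the producer side (a sketch using the kafka-python client, not code from the article; the broker address and topic name are placeholders):

# Minimal Kafka producer; acks="all" waits for the in-sync replicas, which is
# the durability trade-off the article discusses.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",
    retries=3,
)
producer.send("page-views", b'{"page": "/home", "user": "u123"}')
producer.flush()   # block until buffered messages are acknowledged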

Python numpy exercises...

The Python numpy exercises are available in my Git repository at this location. Happy Machine Learning...
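
A couple of warm-up exercises of the kind collected there (illustrative, not taken from the repository):

# Build a 3x4 matrix, take column-wise means, and filter with boolean indexing.
import numpy as np

a = np.arange(12).reshape(3, 4)
print(a.mean(axis=0))    # column-wise means -> [4. 5. 6. 7.]
print(a[a % 2 == 0])     # keep only the even entries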