Head to Head

Many Blogs, One Place

Recent Uploads

Demo: Delta Lake on big data workloads...

First, what's the difference between Delta Lake and Change Data Capture (CDC)? CDC is simply the log of changes on a relational table, whereas Delta Lake provides more native administrative capabilities to a data lake implementation (schemas, transactions, cataloging). Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

For more information on Delta Lake, you can visit here. Now let's start with the demo, assuming you have gained some knowledge of data lakes and Spark Delta Lake. The Delta Lake package is made available with the --packages option. Run the command below on your Spark node to install the required packages:
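As a rough illustration only (the package coordinates and version below are assumptions; match them to your Spark and Scala build), starting a PySpark session with the Delta Lake package via the --packages / spark.jars.packages option and exercising a small Delta table looks like this:

    # Minimal sketch: pull the Delta Lake package at startup and write/read a Delta table.
    # io.delta:delta-core_2.12:1.0.0 is an assumed version; pick one matching your Spark build.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("delta-lake-demo")
        .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .getOrCreate()
    )

    # Write a small DataFrame in Delta format, then read it back with ACID guarantees.
    spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta-demo")
    spark.read.format("delta").load("/tmp/delta-demo").show()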

My Big Data solution using AWS services...

A global advertising agency that manages marketing for different customers in Asia, Europe, and the US needed a Big Data platform developed. The company's data analysts required a Big Data solution to run their models and reports, with the development effort handled by their own IT team. The company is looking for recommendations on how to set up a Big Data platform that will allow them to analyse trends and patterns over time across different clients. They would also need a presentation layer to provide reporting capabilities to individual clients on only their specific data. Therefore, the main goal of this document is to propose a flexible, elastic, fault-tolerant, cost-efficient, scalable, secure, and high-performance Big Data platform that the IT team, analysts, and other stakeholders can manage.

Operators teach Kubernetes how to simplify stateful applications...

This is the first article in a series showcasing how we use an Operator to leverage Kubernetes for running a stateful application such as a Kafka cluster. An Operator is a way to package, run, and maintain a Kubernetes application. An Operator builds on Kubernetes to automate the entire lifecycle of the software it manages, and because Operators extend Kubernetes, they provide application-specific automation.

Before we begin to describe how Operators do these jobs, let's define a few Kubernetes terms to provide context. How Kubernetes works: Kubernetes automates the lifecycle of a stateless application, such as a static web server. Without state, any instances of an application are inter...

Reference architecture of a big data solution in GCP and Azure...

This article showcases a reference architecture approach for the financial sector, where stream and batch processing are a common part of the solution alongside other designs. Requirements analysis is the first step in defining the implementation of any use case, so before moving to the reference architecture we first need to understand Requirements Engineering.

Requirements Engineering is regarded as one of the most important steps in software engineering and takes about 30% of project time. When done properly, it can provide a good foundation for the system design and development as the functionality and components needed for the system become clear during the Requirements Engineering process...

Error resolution for the Zalando Research Flair NLP package installation on CentOS 7...

I was working with an NLP tool for evaluation purposes and ran into an issue while creating the environment. The project team had set up everything on Ubuntu, so they might not face this issue, but I was replicating it on CentOS 7 and hit an error. I hope this will help someone. The project is based on PyTorch 0.4+ and Python 3.6+; you can find how I set up Python 3.6 on CentOS 7 in my previous article.

Now, while installing the Flair library using the command below, I get the following issue:

How to create an Apache Beam data pipeline and deploy it using Cloud Dataflow in Java

Cloud Dataflow is a fully managed Google service for executing data processing pipelines built with Apache Beam. What does fully managed mean? Cloud Dataflow, like BigQuery, dynamically provisions the optimal quantity and type of resources (i.e., CPU or memory instances) based on the volume and the specific resource requirements of your job. Cloud Dataflow is a serverless, auto-scaling service.

Dataflow and Spark: Google Cloud Dataflow is closely analogous to Apache Spark in terms of API and engine; both are directed acyclic graph (DAG) based data processing engines. However, there are aspects of Dataflow that aren't directly comparable to Spark. Where Spark is strictly an API and engine with supporting technologies, Google Cloud Dataflow is all that plus Google's underlying infrastructure and operational support.
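To make the Beam programming model concrete, here is a minimal pipeline sketch. The article builds its pipeline in Java; this Python version is purely illustrative, and the transform names and output path are assumptions:

    # Minimal Apache Beam pipeline: count words and write the result to text files.
    # Pass --runner=DataflowRunner plus project/staging options to run it on Cloud Dataflow.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "Create" >> beam.Create(["hello", "world", "hello"])
            | "PairWithOne" >> beam.Map(lambda w: (w, 1))
            | "CountPerWord" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
            | "Write" >> beam.io.WriteToText("/tmp/wordcount")
        )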

Google Dataflow Python ValueError: Unable to get the Filesystem for path...

I am using Google Cloud to build an event-driven pipeline from Cloud Storage to BigQuery using the Apache Beam Python library. I was executing an ETL job in “DirectRunner” mode and found no issue, but later, when I moved everything to Dataflow for execution, I found an error.
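For context, here is a sketch of how the same Beam Python pipeline is pointed at the local DirectRunner versus the Dataflow runner; the project, region, and bucket names are placeholders, and gs:// paths require the GCP extras (apache_beam[gcp]) to be installed:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Local run for quick testing.
    direct_opts = PipelineOptions(["--runner=DirectRunner"])

    # Dataflow run; gs:// locations need the GCS filesystem shipped with apache_beam[gcp].
    dataflow_opts = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-gcp-project",            # placeholder
        "--region=us-central1",                # placeholder
        "--temp_location=gs://my-bucket/tmp",  # placeholder
    ])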

The command below was used to upload the file, and I can see that my file is present at the same location:

Python: Stream data ingestion into the database in real time using Dataflow...

In my previous articles, we solved real-time data ingestion problems using various tools like Apache Kafka, Storm, Flink, and Spark, and I showed you in detail how to create such pipelines for real-time processing. In this blog, we will simulate a similar problem using Apache Beam and Dataflow with Python. Let's say we have the sample data below, where the FL_DATE and DEP_TIME columns represent local dates and times without a timezone.

You can find the dataset and Python code in my GitHub repository too. This is a flight dataset covering two countries with different timezones, and the timezone offset is not present in the sample data. Since the timezone depends on the airport location, we will use the timezone offset to convert our dataset to Coordinated Universal Time (UTC). Therefore, let us first start with the transformation of our sample dataset and convert all time fields to UTC. Additionally, we add three fields for the destination airport: the latitude, longitude, and timezone offset.
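As a small sketch of that conversion (FL_DATE and DEP_TIME come from the dataset; the sample values and the +9 hour offset are illustrative assumptions):

    # Combine a local flight date and HHMM departure time, then shift to UTC
    # using the airport's timezone offset.
    from datetime import datetime, timedelta, timezone

    def as_utc(fl_date, dep_time, tz_offset_hours):
        local = datetime.strptime(fl_date + " " + dep_time, "%Y-%m-%d %H%M")
        local = local.replace(tzinfo=timezone(timedelta(hours=tz_offset_hours)))
        return local.astimezone(timezone.utc)

    print(as_utc("2015-01-01", "0730", 9))  # an airport at UTC+9 -> 2014-12-31 22:30:00+00:00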

Sample Java Program on Google Cloud Pub/Sub...

Overview: This article contains a sample Java program for Google Cloud's Pub/Sub that publishes messages from Google storage. The solution is simple: set up the environment, create a topic, subscribe to that topic, and read those messages using a Java program. Prerequisites: create a new GCP project, enable the Pub/Sub API, set the environment variables, and have Java 1.8, the Java SDK, and Eclipse installed.

Setup Pub/Sub: Create topics with Cloud Pub/Sub. Open the Google Cloud Shell and create a new Pub/Sub topic using the commands below:
export PUBSUB_TOPIC=mynewtopic
gcloud pubsub topics create $PUBSUB_TOPIC
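The article's sample code is in Java; as a quick illustration of the same publish-and-pull flow, a Python sketch (the project id and subscription name below are assumptions, and the subscription must already exist) would look like this:

    # Publish one message to the topic created above, then pull it back and acknowledge it.
    from google.cloud import pubsub_v1

    project_id = "my-gcp-project"        # assumption
    topic_id = "mynewtopic"              # topic created in the gcloud step above
    subscription_id = "mynewtopic-sub"   # assumption: created beforehand

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    publisher.publish(topic_path, b"hello from pub/sub").result()

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(project_id, subscription_id)
    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
    for msg in response.received_messages:
        print(msg.message.data)
    if response.received_messages:
        subscriber.acknowledge(request={"subscription": sub_path,
                                        "ack_ids": [m.ack_id for m in response.received_messages]})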

Sample Dataflow Pipeline featuring Cloud Pub/Sub, Dataflow, and BigQuery...

Streaming data in Google Cloud Platform is typically published to Cloud Pub/Sub, a serverless real-time messaging service. Cloud Pub/Sub provides reliable delivery and can scale to more than a million messages per second. It stores copies of messages in multiple zones to provide “at least once” guaranteed delivery to subscribers, and there can be many simultaneous subscribers.

The simulation code that we are writing here is only for quick experimentation with streaming data. Hence, I will not take the extra effort needed to make it fault-tolerant. If we had to do so, we could make the simulation fault-tolerant by starting from a BigQuery query that is bounded in terms of a time range with the start of that time range automatically inferred from the last-notified record in Cloud Pub/Sub. Because Cloud Pub/Sub subscriptions are not retroactive, we need to maintain...
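A minimal sketch of the kind of streaming pipeline the title describes, reading from Cloud Pub/Sub and writing to BigQuery with Beam's Python SDK; the topic, table, and schema here are placeholders rather than the article's actual job:

    # Streaming Beam pipeline: Pub/Sub -> decode -> BigQuery.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(["--streaming"])  # add Dataflow runner/project/temp_location to deploy

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
            | "Decode" >> beam.Map(lambda b: {"raw": b.decode("utf-8")})
            | "WriteBQ" >> beam.io.WriteToBigQuery(
                  "my-project:demo.events",  # placeholder table
                  schema="raw:STRING",
                  write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )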

Solved: Protocol tcp Port Exclusion issues when running Hadoop on Windows Docker...

If you're looking for a simple and painless Hadoop deployment, Docker is the right tool for you. We mostly use Docker Community Edition (CE) (https://docs.docker.com/docker-for-windows/install/) on Microsoft Windows; under system requirements it clearly says that “Hyper-V and Containers Windows features must be enabled” to run Docker on Windows. In case you are using Docker Engine – Enterprise (EE), you might not require Hyper-V. Now, the issue with Hyper-V is that it reserves some ports that are required by Hadoop for inter-process communication. So by now you get my point: Hadoop uses certain ports, such as 50070, that are required to communicate with the data node and expose the URI for HDFS, but these ports are reserved by Hyper-V. One reason Hyper-V reserves some ports is to switch communication between Linux and Windows systems. To view the reserved ports, run...

Technical debt: Understand and manage it...

As we write code or build a solution, we always create some sort of technical debt. It is not always a bad thing, but it is something we should learn over time how to manage, control, and track. Early in my career, I always aimed to create a perfect solution, and sometimes that mode of thinking put me into situations where my manager thought I was putting a lot of effort into analysing the solution rather than delivering it fast.

You would not believe that in the six years since I entered Big Data, people have been impressed with me not because I have learned how to create a perfect solution but because I...

PowerShell script wrappers using the Microsoft Azure AzCopy.exe tool...

Use case: We are building a data lake in Azure using Azure containers, ADF, Azure DWH, Databricks, and many other Azure services. After ingesting a wide variety of data sources via APIs, on-premise databases, flat files, and reporting servers, we learned that clients also need to push files into Azure Blob Storage. Users can locate the files on their local system but don't know the actual folder hierarchy in Azure needed to run the process.

This requirement extends to helping them test their processes in the UAT and Dev environments. Introduction: AzCopy is a command-line utility that you can use to copy blobs or files to or from a storage account. This article helps you understand how it works and how to parameterize AzCopy.exe to export and import Azure files. You can find the PowerShell script to copy local files to a Blob Storage account using AzCopy in my Git repository, which uses azCo...

Azure Arc – redefining hybrid cloud...

Azure delivered 59% revenue growth in the latest quarter, which is more than expected compared with Microsoft's other products. Microsoft keeps introducing new cloud services and making acquisitions, giving it an edge over its rivals Amazon and Google. Simply put, Azure Arc provides you with the opportunity to run Azure services anywhere. It provides three main functions:

- Azure SQL Database and Azure Database for PostgreSQL Hyperscale (scale-out by sharding) can be deployed to Kubernetes in any cloud environment, such as AWS, GCP, or on-premises, with centralized management including patch application, automatic updates, and security policy application.
- Integrated management of Windows Server, Linux Server, Kubernetes clusters, etc. running on any platform, including AWS and on-premises.
- Deploy applications to Kuber...

Leader in Me...

Information cascade: I learned a ton very quickly, every day, about completely new things, and I was able to do so because I could feel the momentum and the help of my mentors' guidance and encouragement. A cascading reflex encouraged me to write blogs and articles on technology, and I completed 100-plus blogs in a year's time. This was the time when I got caught up in the likes/views/upvotes/comments/badges on my blogs, along with various other projects.

I held the number one rank for a complete quarter in the Hortonworks community. The end of 2018 put me in a whole new situation and brought a much-needed change in lifestyle. I moved to a job with a startup company and then went onsite to Japan (short story; actually, it is more complex). I personally wanted that change because “good/bad contribution” and “agree/disagree” seemed to be taking me nowhere in life.

My experience with HCL interview...

I have 13+ years of experience and got a call from HR in Hyderabad (bspraviya_b@hcl.com) for an interview for the position of Solution Architect in July 2018. After the HR discussion, the first technical interview was conducted by an employee named Pawan from Noida. They then called me for an in-person interview at Greater Noida. I live in Ambala Cantt, Haryana, and spent Rs 2000 to reach there, but the concerned person was not aware that I was coming for a face-to-face interview.

Somehow, after an hour of struggle, HR arranged my interview, and another round of technical interviews took place. They then told me to leave for the day, as the manager was busy in a meeting and couldn't join to conduct the next level of interview.

Apache Storm key takeaways...

Hadoop moves the code to the data; Storm moves the data to the code. This behavior makes more sense in a stream-processing system, because the data set isn't known beforehand, unlike in a batch job. Also, the data set is continuously flowing through the code. A Storm cluster consists of two types of nodes: the master node and the worker nodes. The master node runs a daemon called Nimbus, and the worker nodes each run a daemon called a Supervisor. The master node can be thought of as the control center. In addition to its other responsibilities, this is where you would run any of the commands such as activate, deactivate, rebalance, or kill that are available in a Storm cluster (more on these commands on the Storm site).

Approach to executing a Machine Learning project, “Halt the Hate”...

Disclaimer: The analysis done in this project touches on a sensitive issue in India, so I never try to convince anybody to trust my model. A real human society is so complex that “all the things may be interconnected in a different way than in the model.” Imagine you are presented with a dataset of “Hate Crimes” in India and asked how to minimize these crimes by analyzing other factors. This is the problem I am taking in hand to solve and analyze with a minimum number of resources. Some may say that education and the government providing jobs to youth in India could solve this problem, and yes, you are right; you will see that relationship soon. You can also make your best guess by visualizing the many other factors that I will present here. In the next post, I'll create the Machine Learning model a...

Fundamentals of Apache Spark...

How did Spark become so much more efficient at data processing than MapReduce? It comes with a very advanced Directed Acyclic Graph (DAG) data processing engine. What this means is that for every Spark job, a DAG of tasks is created to be executed by the engine. In mathematical parlance, the DAG consists of a set of vertices and directed edges connecting them. The tasks are executed as per the DAG layout. In the MapReduce case, the DAG consists of only two vertices, with one vertex for the map task and the other for the reduce task, and the edge directed from the map vertex to the reduce vertex. The in-memory data processing combined with its DAG-based data processing engine makes Spark very efficient. In Spark's case, the DAG of tasks can be as complicated as it needs to be. Thankfully, Spark comes with utilities that can give an excellent visualization of the DAG of any Spark job that is running.
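To make this concrete, each transformation in the tiny PySpark job below adds a vertex to the job's DAG, which you can then inspect in the Spark UI's DAG visualization while the job runs (the data here is illustrative):

    # Tiny PySpark job whose DAG has a map stage and a shuffle into a reduce stage.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dag-demo").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(1, 1001))

    result = (rdd
              .map(lambda x: (x % 10, x))        # map vertex
              .reduceByKey(lambda a, b: a + b)   # shuffle edge into a reduce vertex
              .collect())                        # action that triggers execution of the DAG
    print(result[:3])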

Bayesian-posterior imagination and applications...

A conditional probability is the probability of one event given that another event occurred. In the “die-toss” example, the probability of event A, three dots showing, is P(A) = 1/6 on a single toss. But what if we know that event B, at least three dots showing, occurred? Then there are only four possible outcomes, one of which is A. The probability of A = {3} is 1/4, given that B = {3, 4, 5, 6} occurred. The conditional probability of A given B is written P(A|B).
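In other words, P(A|B) = P(A ∩ B) / P(B) = (1/6) / (4/6) = 1/4. A quick, purely illustrative check by enumeration:

    # Verify the die-toss example: A = {3}, B = {3, 4, 5, 6}.
    from fractions import Fraction

    A = {3}
    B = {3, 4, 5, 6}
    p_B = Fraction(len(B), 6)
    p_A_and_B = Fraction(len(A & B), 6)
    print(p_A_and_B / p_B)  # 1/4, matching the text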

Tips and Tricks for Apache Spark RDD API, DataFrame API – Part 1...

I am planning to share my knowledge of the Apache Spark RDD API and DataFrame API, along with some tips and tricks. If I combined everything into one piece, it would be a very lengthy article, so I am dividing it into three separate articles, and this is the first in the series: the Spark RDD API, the DataFrame API, and tips and tricks on the RDD and DataFrame APIs. Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you...
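As a small sketch of the two APIs this series covers (the data and column names are illustrative, not taken from the article):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

    # RDD API: low-level, functional transformations on arbitrary Python objects.
    rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print(rdd.reduceByKey(lambda x, y: x + y).collect())

    # DataFrame API: named columns and SQL-style operations, optimized by Catalyst.
    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
    df.groupBy("key").sum("value").show()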

Better late than never: Time to replace your micro-service architecture with Kafka...

Kafka already powers and facilitates many organizations in the world of microservices architecture. If Kafka is still not part of your infrastructure, it's high time for you to adopt it. I am not claiming Kafka is better than any other message queue system, as many articles about this subject are already floating around the internet.

Kafka's uniqueness is that it provides both simple file-system and bridge functions. A Kafka broker's most basic task is to write messages to, and read messages from, the log on disk as quickly as possible. A queued message will not be lost after persistence, wh...
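As a tiny sketch of that write-to-the-log / read-from-the-log flow using the kafka-python client (the broker address and topic name are assumptions):

    # Persist a message to a topic's log, then read it back by offset.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("orders", b"order-created:42")
    producer.flush()  # message is now written to the topic's log on the broker

    consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest", consumer_timeout_ms=5000)
    for record in consumer:
        print(record.offset, record.value)  # records come back from the log in offset order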

In-depth: Kafka message queue principles of high reliability...

At present, many open-source distributed processing systems such as Cloudera, Apache Storm, Spark, and others support integration with Kafka. Kafka is increasingly favored by many internet companies, which use it as one of their core messaging engines. The reliability of Kafka messages is comparable to that of commercial-grade messaging middleware solutions.

In this article, we will examine Kafka's storage mechanism, replication principle, synchronization principle, and durability guarantees to analyze its reliability.

JavaScript Issue resolution in JupyterLab Notebook...

The graphs were not appearing in my JupyterLab notebook, and the error message said “JavaScript output is disabled in JupyterLab”. At first, it seemed I just needed to enable it from the notebook itself, but a few sites say that JupyterLab does not support it yet, which is frustrating. So, to solve this issue and enable the extension, first stop your notebook and use the command below.

Now invoke “jupyter lab” and you can see your plots: