Many Blogs, One Place
First, what’s the difference between Delta Lake and Change Data Capture? CDC is simply the log of changes made to a relational table. Delta Lake, by contrast, adds native administrative capabilities to a data lake implementation (schemas, transactions, cataloging). It is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.
A global advertising agency that manages marketing for customers in Asia, Europe, and the US required a solution for building a Big Data platform. The company’s data analysts needed a Big Data platform to run their models and reports, with the development effort handled by their own IT team. The company is looking for recommendations
on how to set up a Big Data platform that will allow them to analyse trends and patterns over time across different clients. They also need a presentation layer that provides reporting capabilities to individual clients on only their own data. The main goal of this document, therefore, is to propose a solution for the IT team, analysts, and other stakeholders: a flexible, elastic, fault-tolerant, cost-efficient, scalable, secure, and high-performance Big Data platform that can be managed easily.
This is the first article in a series showcasing how an Operator can leverage Kubernetes to run a stateful application such as a Kafka cluster. An Operator is a way to package, run, and maintain a Kubernetes application. It builds on Kubernetes to automate the entire lifecycle of the software it manages, and because Operators extend Kubernetes, they provide application-specific automation.
This article showcases a reference-architecture approach for the financial sector, where stream and batch processing are a common part of most solutions. Requirement analysis is the first step in defining the implementation of any use case, so before moving to the reference architecture we first need to understand Requirements Engineering.
Requirements Engineering is regarded as one of the most important steps in software engineering and takes about 30% of project time. When done properly, it provides a good foundation for system design and development, as the functionality and components needed for the system become clear during the Requirements Engineering process...
I was working with an NLP tool for evaluation purposes and ran into an issue while creating the environment. The authors had set everything up on Ubuntu, so they might not have faced this issue, but I was replicating it on CentOS 7 and found an error. I hope this helps someone. The project is based on PyTorch 0.4+ and Python 3.6+. You can find how I set up Python 3.6 on CentOS 7 in my previous article.
Cloud Dataflow is a fully managed Google service for executing data processing pipelines built with Apache Beam. What does “fully managed” mean? Like BigQuery, Cloud Dataflow dynamically provisions the optimal quantity and type of resources (i.e. CPU or memory instances) based on the volume and specific resource requirements of your job. Cloud Dataflow is a serverless, auto-scaling service.
Dataflow and Spark: Google Cloud Dataflow is closely analogous to Apache Spark in terms of API and engine. Both are directed-acyclic-graph-based (DAG) data processing engines. However, there are aspects of Dataflow that aren’t directly comparable to Spark. Where Spark is strictly an API and engine with supporting technologies, Google Cloud Dataflow is all that plus Google’s underlying infrastructure and operational support.
I am using Google Cloud to build an event-driven pipeline from Cloud Storage to BigQuery using the Apache Beam Python library. I executed the ETL in “DirectRunner” mode and found no issue, but later, when I moved everything to Dataflow, I hit an error.
In my previous articles, we solved real-time data ingestion problems using various tools such as Apache Kafka, Storm, Flink, and Spark, and I showed in detail how to create such pipelines for real-time processing. In this blog, we will simulate a similar problem using Apache Beam and Dataflow with Python. Let’s say we have the sample data below, where the FL_DATE and DEP_TIME columns represent local dates and times without a timezone.
You can find the dataset and Python code in my GitHub repository as well. This flight dataset spans two countries with different timezones, and the timezone offset is not present in the sample data. Since the timezone depends on the airport location, we will apply a timezone offset to convert our dataset to Coordinated Universal Time (UTC). So let us first start with the transformation of our sample dataset and convert all time fields to UTC. Additionally, we add three fields for the destination airport: the latitude, longitude, and time zone offset.
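The core of that transformation can be sketched in plain Python (field names and formats here are assumptions for illustration; in the real pipeline this logic would run inside a Beam transform):

```python
from datetime import datetime, timedelta, timezone

def to_utc(fl_date: str, dep_time: str, tz_offset_hours: float) -> datetime:
    """Combine a local FL_DATE and DEP_TIME with an airport's timezone
    offset and return the departure timestamp in UTC."""
    local = datetime.strptime(f"{fl_date} {dep_time}", "%Y-%m-%d %H%M")
    local = local.replace(tzinfo=timezone(timedelta(hours=tz_offset_hours)))
    return local.astimezone(timezone.utc)

# Example: a 13:30 local departure at UTC+5:30 is 08:00 UTC.
utc = to_utc("2015-01-01", "1330", 5.5)
print(utc.isoformat())  # 2015-01-01T08:00:00+00:00
```

Once every time field carries an explicit UTC timestamp, records from airports in different countries can be compared and windowed consistently.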
Overview: This article contains a sample Java program that uses Google Cloud Pub/Sub to publish messages from Google Cloud Storage. The solution is simple: set up the environment, create a topic, subscribe to that topic, and read the messages using a Java program. Prerequisites: create a new GCP project, enable the Pub/Sub API, set environment variables, Java 1.8, the Java SDK, and Eclipse.
Streaming data in Google Cloud Platform is typically published to Cloud Pub/Sub, a serverless real-time messaging service. Cloud Pub/Sub provides reliable delivery and can scale to more than a million messages per second. It stores copies of messages in multiple zones to provide “at least once” guaranteed delivery to subscribers, and there can be many simultaneous subscribers.
The simulation code that we are writing here is only for quick experimentation with streaming data, so I will not take the extra effort needed to make it fault-tolerant. If we had to, we could make the simulation fault-tolerant by starting from a BigQuery query bounded to a time range, with the start of that range automatically inferred from the last-notified record in Cloud Pub/Sub. Because Cloud Pub/Sub subscriptions are not retroactive, we need to maintain...
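To make the “at least once” guarantee concrete, here is a toy in-memory model (not the Pub/Sub API): a message stays eligible for redelivery until the subscriber explicitly acknowledges it, so a subscriber that crashes before acking sees the same message again.

```python
from collections import deque

class ToyAtLeastOnceQueue:
    """Toy model of at-least-once delivery: pulled messages are held
    as 'unacked' and redelivered if never acknowledged."""
    def __init__(self):
        self._pending = deque()
        self._unacked = {}
        self._next_id = 0

    def publish(self, data):
        self._pending.append((self._next_id, data))
        self._next_id += 1

    def pull(self):
        msg_id, data = self._pending.popleft()
        self._unacked[msg_id] = data   # held until acked
        return msg_id, data

    def ack(self, msg_id):
        self._unacked.pop(msg_id, None)

    def redeliver_unacked(self):
        """Simulate subscriber failure: everything unacked goes back."""
        for msg_id, data in sorted(self._unacked.items()):
            self._pending.append((msg_id, data))
        self._unacked.clear()

q = ToyAtLeastOnceQueue()
q.publish("event-1")
msg_id, data = q.pull()        # delivered once...
q.redeliver_unacked()          # ...subscriber dies before acking...
msg_id2, data2 = q.pull()      # ...so the same message arrives again
q.ack(msg_id2)
print(data, data2)  # event-1 event-1
```

This is also why subscribers to an at-least-once system must be idempotent: the same event can legitimately arrive more than once.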
If you’re looking for a simple and painless Hadoop deployment, Docker is the right tool for you. We mostly use Docker Community Edition (CE) (https://docs.docker.com/docker-for-windows/install/) on Microsoft Windows; under system requirements it clearly says “Hyper-V and Containers Windows features must be enabled” to run Docker on Windows. In case you are using
Docker Engine Enterprise (EE), you might not require Hyper-V. Now, the issue with Hyper-V is that it reserves some ports that are required by Hadoop for inter-process communication. By now you get my point: Hadoop uses certain ports, such as 50070, to communicate with the data node and expose the URI for HDFS, but these ports are reserved by Hyper-V. One reason Hyper-V reserves ports is to switch communication between Linux and Windows systems. To view the reserved ports, run...
As we write code or build a solution, we always create some sort of technical debt. That is not always a bad thing, but it is something we should learn over time to manage, control, and track. Early in my career, I always aimed to create a perfect solution, and sometimes that mode of thinking put me in a situation where my manager thought I was putting a lot of effort into analysing the solution rather than delivering it fast.
Use case: We are building a data lake in Azure using Azure containers, ADF, Azure DWH, Databricks, and many other Azure services. After ingesting a wide variety of data sources via APIs, on-premises databases, flat files, and reporting servers, we learned that clients need to push files into Azure Blob storage themselves. Users can locate the files on their local system but don’t know the actual folder hierarchy in Azure needed to run the process.
This requirement extends to helping them test their processes in the UAT and Dev environments. Introduction: AzCopy is a command-line utility that you can use to copy blobs or files to or from a storage account. This article helps you understand how it works and how to parameterize AzCopy.exe to export and import Azure files. You can find a PowerShell script that copies local files to a Blob storage account using AzCopy in my Git repository, which uses azCo...
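As a rough sketch of the parameterization idea (the account, container, and paths below are made up for illustration), the AzCopy invocation can be assembled from variables instead of hard-coding the destination hierarchy, so the same script works across Dev, UAT, and production:

```python
def build_azcopy_copy(source: str, account: str, container: str,
                      dest_path: str, sas_token: str,
                      recursive: bool = True) -> list:
    """Assemble an `azcopy copy` command as an argv list, with the
    destination blob URL built from parameters."""
    dest = (f"https://{account}.blob.core.windows.net/"
            f"{container}/{dest_path}?{sas_token}")
    cmd = ["azcopy", "copy", source, dest]
    if recursive:
        cmd.append("--recursive")
    return cmd

# Hypothetical example values; hand the result to subprocess.run(...)
cmd = build_azcopy_copy(r"C:\data\uploads", "mystorageacct", "landing",
                        "client-a/2020/01", "sv=...sig=...")
print(" ".join(cmd))
```

Keeping the command as an argv list (rather than one concatenated string) avoids shell-quoting problems when paths contain spaces.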
Azure delivered 59% revenue growth in the latest quarter, more than was expected relative to Microsoft’s other products. Microsoft keeps introducing new cloud services and making acquisitions, giving it an edge over rivals Amazon and Google. Simply put, Azure Arc gives you the opportunity to run Azure services anywhere. It provides three main functions:
Azure SQL Database and Azure Database for PostgreSQL Hyperscale (which scales out via sharding) can be deployed to Kubernetes in any cloud environment, such as AWS, GCP, or on-premises, with centralized management including patch application, automatic updates, and security policy enforcement. Integrated management of Windows Server, Linux Server, Kubernetes clusters, and more, running on any platform including AWS and on-premises. Deployment of applications to Kuber...
Information cascade: I learned a ton, very quickly and every day, about completely new subjects, and I was able to do so because I could feel the momentum, along with the help and encouragement of my mentors. A cascade reflex pushed me to write blogs and articles on technology, and I completed 100-plus blogs in a year’s time. This was when I became absorbed in the likes/views/upvotes/comments/badges on my blogs, along with various other projects.
I held rank 1 for a complete quarter in the HortonWorks community. The end of 2018 put me in a whole new situation and brought a much-needed change in lifestyle. I moved to a startup company and then went onsite to Japan (short story; it is actually more complex). I personally wanted that change, because chasing “good/bad contribution” and “agree/disagree” seemed to be taking me nowhere in life.
I have 13+ years of experience and got a call from an HR representative located in Hyderabad (firstname.lastname@example.org) about an interview for the position of Solution Architect in July 2018. After the HR discussion, the first technical interview was conducted by an employee named Pawan from Noida. They then called me for a personal interview at Greater Noida. I live in Ambala Cantt, Haryana, and spent 2000 Rs to get there, but the person concerned was not aware that I was coming for a face-to-face interview.
Hadoop moves the code to the data; Storm moves the data to the code. This behavior makes more sense in a stream-processing system, because the data set isn’t known beforehand, unlike in a batch job, and the data is continuously flowing through the code. A Storm cluster consists of two types of nodes: the master node and the worker nodes. The master node runs a daemon called Nimbus, and the worker nodes each run a daemon called a Supervisor. The master node
Disclaimer: The analysis done in this project touches a sensitive issue in India, so I will never try to convince anybody to trust my model. Real human society is so complex that “all the things may be interconnected in a different way than in the model.” Imagine you are presented with a dataset of “hate crimes” in India and asked how to minimize these crimes by analyzing other factors. This is the problem I am taking
in hand to solve and analyze with a minimum number of resources. Some may say that education and government-provided jobs for India’s youth could solve this problem, and yes, you would be right; you will see that relationship soon. You can also make your best guess by visualizing the many other factors I will present here. In the next post, I’ll create the machine learning model a...
How did Spark become so much more efficient at data processing than MapReduce? It comes with a very advanced Directed Acyclic Graph (DAG) data processing engine. What this means is that for every Spark job, a DAG of tasks is created for the engine to execute. In mathematical parlance, a DAG consists of a set of vertices and directed edges connecting them; the tasks are executed as per the DAG layout. In the MapReduce case, the
DAG consists of only two vertices, one for the map task and the other for the reduce task, with the edge directed from the map vertex to the reduce vertex. In-memory data processing combined with the DAG-based engine makes Spark very efficient. In Spark’s case, the DAG of tasks can be as complicated as needed, and thankfully Spark comes with utilities that give an excellent visualization of the DAG of any running Spark job.
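As a language-neutral illustration (this is not Spark’s actual scheduler), executing tasks “as per the DAG layout” just means running each task only after all of its upstream tasks have finished; the MapReduce job described above is the two-vertex special case:

```python
from graphlib import TopologicalSorter

# Each node maps to the set of tasks it depends on
# (edges point from upstream task to downstream task).
dag = {
    "map": [],          # no dependencies
    "reduce": ["map"],  # reduce runs only after map finishes
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['map', 'reduce']
```

A Spark job simply generalizes this: the graph can hold arbitrarily many stages (filters, joins, aggregations), and independent branches of the DAG can run in parallel.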
A conditional probability is the probability of one event given that another event has occurred. In the “die-toss” example, the probability of event A, three dots showing, is P(A) = 1/6 on a single toss. But what if we know that event B, at least three dots showing, occurred? Then there are only four possible outcomes,
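Completing the arithmetic of this example: B = {3, 4, 5, 6} has four equally likely outcomes, A ∩ B = {3}, and the standard definition P(A|B) = P(A ∩ B) / P(B) gives the answer exactly:

```python
from fractions import Fraction

p_a_and_b = Fraction(1, 6)  # only outcome 3 lies in both A and B
p_b = Fraction(4, 6)        # outcomes {3, 4, 5, 6}
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 1/4
```

Knowing that B occurred shrinks the sample space from six outcomes to four, which is why the probability of A rises from 1/6 to 1/4.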
I am planning to share my knowledge of the Apache Spark RDD and DataFrame APIs, along with some tips and tricks. Combining everything into one article would make it very lengthy, so I am dividing it into three separate articles, of which this is the first: the Spark RDD API, the DataFrame API, and tips and tricks on the RDD API.
Kafka has already enabled and facilitated microservices architectures at many organizations. If Kafka is still not part of your infrastructure, it’s high time for you to adopt it. I am not claiming Kafka is better than any other message queue system; many articles on that subject are already floating around the internet.
At present, many open-source distributed processing systems, such as Cloudera, Apache Storm, Spark, and others, support integration with Kafka. Kafka is increasingly favored by many internet companies, which use it as one of their core messaging engines. The reliability of Kafka messaging is comparable to that of a commercial-grade messaging middleware solution.