Holding Everything With Bond...

How did Spark become so efficient in data processing as compared to MapReduce?…

It comes with a very advanced Directed Acyclic Graph (DAG) data processing engine. This means that for every Spark job, a DAG of tasks is created for the engine to execute. In mathematical terms, a DAG consists of a set of vertices and the directed edges connecting them. The tasks are executed according to the DAG layout. In the MapReduce case, the DAG consists of only two vertices, one for the map task and the other for the reduce task, with the edge directed from the map vertex to the reduce vertex. In Spark’s case, the DAG of tasks can be arbitrarily complex. In-memory data processing combined with this DAG-based engine is what makes Spark so efficient. Thankfully, Spark comes with utilities that give an excellent visualization of the DAG of any running Spark job. Read More…
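
As a rough illustration of the DAG idea (this is not Spark's actual scheduler), the two-vertex MapReduce DAG and a slightly larger job can be modelled and ordered with Python's standard `graphlib` module. The vertex names and the `execution_order` helper are hypothetical, chosen only for this sketch.

```python
from graphlib import TopologicalSorter


def execution_order(dag):
    """Return one valid execution order for a DAG given as {vertex: prerequisites}."""
    return list(TopologicalSorter(dag).static_order())


# MapReduce: two vertices, one edge directed from map to reduce.
mapreduce_dag = {"reduce": {"map"}}

# A hypothetical multi-stage job: two reads feed a join, which feeds an aggregate.
spark_like_dag = {
    "join": {"read_a", "read_b"},
    "aggregate": {"join"},
}

print(execution_order(mapreduce_dag))   # ['map', 'reduce']
print(execution_order(spark_like_dag))  # both reads come before join, join before aggregate
```

The only point of the sketch is that a task runs after everything it depends on; a real engine also decides what can run in parallel and where.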

 

Approach to execute Machine Learning project, “Halt the Hate”…

Imagine you are presented with a dataset of “Hate Crimes” in India and asked how to minimize these crimes by analyzing other factors. This is the problem I set out to analyze and solve with a minimum of resources. Some would say that education and government-provided jobs for India's youth could solve this problem, and yes, you are right. You will see that relationship soon. You can also make your own best guess by visualizing the many other factors that I present here. Read More…

 

Encourage you to switch to Jupyter Lab...

Notebooks are great for prototyping and for longer pipelines or processes. If you use PyCharm or Jupyter Notebook and are an exploratory data scientist, I would encourage you to switch to Jupyter Lab. Read More…

 

Why and when we need Machine Learning...

I have worked in data management and data quality for several years. When I ask people what their data management processes are, they simply reply, “well, we have some of our data stored in a database and other data stored on file shares with proper permissions.” This isn’t data management; it’s data storage. If you or your organization don’t have good, clean data, you are most definitely not ready for machine learning. Data management should be your first step before diving into any other data project. Read More…

 

Python Lists and Lambda Learning...

There are many ways to use Python's lists and lambdas. Here I am going to show some useful tips and tricks. So let's start with lists. Read More…

 

Apache Spark RDD API using Pyspark…

In my previous article, I used Scala to show the usability of the Spark RDD API. Many of us use PySpark to work with RDDs and lambda functions. Though the function names and output are the same as in Scala, the PySpark syntax for RDD operations is different. Here I'll explain PySpark RDDs using a different approach and from a different perspective on solving the problem. Read More…

 

How to convert Python list, tuples, strings to each other...

Lists, tuples, and strings are three built-in sequence types in Python. The three functions str(), tuple(), and list() convert between them, as in the following example:.. Read More…
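
A minimal sketch of these conversions, using a made-up sample string. Note that `str()` on a list returns its printable representation, so `"".join()` is what rebuilds the original text from characters.

```python
text = "spark"

as_list = list(text)               # ['s', 'p', 'a', 'r', 'k']
as_tuple = tuple(as_list)          # ('s', 'p', 'a', 'r', 'k')
back_to_list = list(as_tuple)      # tuple -> list

# str() gives the printable form of the list, not the joined characters.
list_as_str = str(as_list)         # "['s', 'p', 'a', 'r', 'k']"

# To get the original string back, join the characters instead.
back_to_text = "".join(as_tuple)   # 'spark'

print(as_list, as_tuple, list_as_str, back_to_text)
```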

 

Tips and Tricks for Apache Spark RDD API, Dataframe API- Part -1...

I am planning to share my knowledge of the Apache Spark RDD and DataFrame APIs, along with some tips and tricks. Combining everything into one article would make it very lengthy, so I am dividing it into three separate articles; this is the first in the series. Read More…

 

Better late than never : Time to replace your micro-service architecture with Kafka...

Kafka has already helped many organizations in the micro-services architecture world. If Kafka is still not part of your infrastructure, it's high time you adopted it. I am not promoting Kafka as better than any other message queue system, as many articles on that subject are already floating around the internet. Read More…

 

In-depth Kafka Message queue principles of high-reliability...

At present, many open source distributed processing systems such as Cloudera, Apache Storm, and Spark support integration with Kafka. Kafka is increasingly favored by many internet companies, which use it as one of their core messaging engines. Kafka's message reliability is what one would expect of a commercial-grade messaging middleware solution. Read More…

 

Consume JSON Messages From Kafka Using Kafka-Python's Deserializer…

I hope you are here because you want to take a ride with Python and Apache Kafka. kafka-python is one of the most popular Kafka client libraries for Python. For documentation on this library, visit https://kafka-python.readthedocs.io/en/master/. kafka-python is designed to function much like the official Java client and is best used with newer brokers (0.9+). Read More…
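
A minimal sketch of the kind of deserializer the title refers to, assuming messages are UTF-8 encoded JSON. The function name, topic name, and broker address are placeholders for illustration; the consumer wiring is left as comments because it needs kafka-python installed and a running broker.

```python
import json


def json_deserializer(raw_bytes):
    """Decode a UTF-8 message payload into a Python object; None stays None."""
    if raw_bytes is None:
        return None
    return json.loads(raw_bytes.decode("utf-8"))


# With kafka-python and a broker available, the function would be passed to the
# consumer through its value_deserializer argument:
#
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer(
#       "my-topic",                          # hypothetical topic name
#       bootstrap_servers="localhost:9092",  # assumed broker address
#       value_deserializer=json_deserializer,
#   )
#   for message in consumer:
#       print(message.value)                 # already a dict, not raw bytes

print(json_deserializer(b'{"id": 1, "name": "spark"}'))
```

Keeping the deserializer as a plain function makes it easy to test without a broker.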

 

Digital Files Analytics(DFA) System’s Ingestion Platform—A real time data ingestion platform.

Here I want to introduce the real-time data ingestion platform used by the Digital Files Analytics (DFA) system to stream data extracted from heterogeneous sources such as images, PDFs, and movies. You can find more detail about DFA using this link. Read More…

 

Moving to communication of events between subsystems-CQRS-ES with open source…...

Before going into the definitions of EP, CEP, and CQRS, let us start with some basic database terms and the problem we are trying to address here. Commercial databases and database professionals have publicized CRUD operations a lot. This one-row-per-pattern approach works well in most projects and is enough to build an application quickly and securely. I have probably implemented 100 CRUD projects (including web applications), and we work that way because budgets are limited and projects have deadlines. Read More…

 

Storm Topology design paradigm:-Breaking down Topology into functional components...

If we’re building a racecar, we need to keep performance in mind starting on day one. We can’t refactor our engine to improve it later if it wasn’t built for performance from the ground up. Here I am going to show a functional design approach: looking at a stream processing problem, breaking it down into constructs that fit within a Storm topology, and getting performance and scalability out of it by optimizing those functional components. Read More…

 

Try Kill batch processing with unified log stream processing...

Logs are an abstract view of your application's functionality and behavior, whether the application is a web application or a cryptocurrency. I love to create analogies, and here I compare the logs of any system to white blood cells. White blood cells help our body fight infection by recognizing invading particles before they cause disease. Read More…

 

SolrCloud vs HDPSearch…

Let us start by removing some of the confusion around SolrCloud and HDPSearch. First, what is SolrCloud? Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability. Called SolrCloud, these capabilities provide distributed indexing and search, supporting the following features: Read More…

 

AWS and GCE both great! Some more powerful configuration of load balancing puts GCE over the top…

I work with Hadoop, so I often come across, or management asks me, a common question: “Why do we need Hadoop in the cloud?” To answer it, I keep my key points ready, like the ones below… > The cloud is your data center; no need to deal with reliability and scaling issues. > Pay for what you need. > Deployed in minutes. > Cloud storage enables economic flexibility, scale, and rich features. Read More…

 

Benefits of Blogging!

Yes, blogging has many benefits. The first is money; I’m not earning money from writing blogs right now, but many bloggers who do make money enjoy it. You could be looking at blogging with other goals in mind, but here are a few from my perspective: Read More…

 

Install and smoketest R and RHadoop on Hortonworks Data Platform (HDP25-CentOS7)...

Before going to the installation steps, I'd like to give a small introduction to RHadoop. What is RHadoop? RHadoop is an open source project that combines R and Hadoop. It contains 4 different packages to combine different projects from Hadoop and 1 package to enhance some functions to fit the MapReduce framework. Read More…

 

Content Data Store(CDS) Compressing and enhancing technique......

We are aggressively adding new features to the Content Data Store (CDS) system. One of the features I am going to discuss here is our compression technique (a Big Data application is incomplete without compression). And what if I tell you that in CDS we use compression along with enhancement of visual images and scanned documents? Our compression technique has two additional features: Read More…

 

ZeroMQ Part-1...

Programs, like people, need to communicate, and for that we have protocols such as UDP, TCP, HTTP, IPX, and WebSocket to connect applications. But working directly with the underlying protocols is difficult. We need something high-level, scalable, and easy to use, and that is ZeroMQ (ØMQ). ØMQ gives us advanced levels of availability and speed. Read More…

 

Google Cloud Platform(GCP) overview...

Google Cloud Platform (GCP) is a collection of SaaS, PaaS, and IaaS services, and new services are still being launched day by day. There are a lot of services, including beta services, so I will not introduce everything in detail in this article; I am mainly going to introduce the services that become important when considering infrastructure for games. Read More…

 

Multiple WAL in Apache HBase 1.3 and performance enhancements!!!...

Apache HBase 1.3.0 was released mid-January 2017 and ships with support for date-based tiered compaction and improvements in multiple areas, like write-ahead log (WAL), and a new RPC scheduler, among others. The release includes almost 1,700 resolved issues in total. Read More…

 

Book Review : The Folly of Fools....

I finished reading this book and developed a fondness for its author, Robert L. Trivers, for his remarkable writing. Although the book does not seem to contain rigorous data collection or statistical analysis, it certainly has theoretical foundations for what he came up with. Trivers makes a good point about self-deception among people in authority, and his observation that self-deception can be incredibly expensive seems very true. The points made on the "Iraq war" and "willful ignorance in NASA" lack data but are convincing. Read More…

 

Apache Eagle: Real-time security monitoring solution....

On January 10, 2017, the Apache Software Foundation, which consists of more than 350 open source projects and innovation initiatives, all developed, governed, and incubated by volunteers, announced that Apache Eagle has graduated from the Apache Incubator Program. Read More…

 

We just need to be better to each other before talking to AI....

I'm not going to talk about statistics, machine learning, or AI, or even compare any data showing errors made by AI versus manual work. I believe that the fundamental problems of our time are ethical, not technological. If we can figure out that part, the technology should take care of itself. Read More…

 

SolrCloud : CAP theorem world, this makes Solr a CP system, and keep availability in certain circumstances....

A SolrCloud cluster holds one or more distributed indexes which are called Collections. Each Collection is divided into shards (to increase write capacity) and each shard has one or more replicas (to increase query capacity). One replica from each shard is elected as a leader, who performs the additional task of adding a ‘version’ to each update before streaming it to available replicas. This means that write traffic for a particular shard hits the shard’s leader first and is then synchronously replicated to all available replicas. One Solr node (a JVM instance) may host a few replicas belonging to different shards or even different collections. Read More…

 

Sumo Logic : Log Management Tool...

This is my first face-off with "Sumo Logic". If you want a quick introduction to Sumo Logic without going into the detailed documentation, this topic will be helpful. Sumo Logic is designed to help you manage and analyze your log files. It started out attempting to be a SaaS version of Splunk and went its own way as it matured, but as a result of its beginnings, it is one of the most feature-rich and enterprise-focused SaaS log management tools. Read More…

 

'Open source' and 'free software'…

This is my Materialist vs Idealist thinking at work. If you do not find it matches your reality, be patient with my arguments. Think back 20 years to a time when the internet was still a DARPA project and the web was but a glimmer in Tim Berners-Lee’s eyes. At that time, someone who could create software or build computers was pretty special. In fact, unless you worked for a software vendor or attended class in a computer science department, programming was pretty much a black art understood by an elite few. There were some computer users, but computer hobbyists weren’t exactly mainstream. Read More…

 

Cloud Databases & Cloud Blob…

Since Apache Spot was started at Intel and Cloudera earlier this year, the project's momentum has been growing, with unanimous support from Anomali, Centrify, Cloudwick, Cybraics, eBay, Endgame, Jask, Streamsets, Webroot, and other partners. Read More…

 

Cloud Databases & Cloud Blob…

Cloud computing is the next stage in evolution of the Internet. The cloud in cloud computing provides the means through which everything — from computing power to computing infrastructure, applications, business processes to personal collaboration Read More…

 

Zombie project and toxic workplace!!!

My advice for working on a Zombie project in a toxic workplace… I would like to say that there’s a chance of salvaging the Zombie project, but there probably isn’t, and it's not your fault. I have been in this kind of situation before, and I would suggest that your course of action depends on your core morals. Read More…

 

Almost Everything in Python!!!

A curated list of Python frameworks, libraries, software and resources. Inspired by awesome-php. Awesome Python Environment Management Package Management Package Repositories.... Read More…

 

Advertisement attributes or Ad Attributes…An Idea!!!

Some time ago I was working on an idea called Ad Attributes, or Advertisement Attributes. I’d like to share my thoughts on this idea with the audience. Advertisement attributes are for creating a favorable selling climate.... Read More…

 

Converting PDF to Text using Tesseract…

Tesseract is unable to handle PDF files directly, so files are first converted to TIFF using Ghostscript before being passed to Tesseract. In addition, Tesseract cannot process multi-page TIFF images, so Ghostscript goes along with it to complete the task..... Read More…

 

OCR – “Optical Character Recognition”, Set up Tesseract OCR on CentOS 6.8…

OCR means “Optical Character Recognition”, and Tesseract is an OCR engine licensed under the Apache License v2.0. A system configured with Tesseract OCR can convert images with embedded text into text files. This “how to install” tutorial is meant as a practical guide; it does not cover the theoretical background of OCR or the algorithms used in Tesseract. Those are treated in many other documents on the web.... Read More…

 

A Step-by-Step Guide to HDFS Data Protection Solution for Your Organization on Cloudera CDH

A comprehensive encryption offering protects data wherever it resides, including structured and unstructured data at rest and data in motion. HDFS Encryption implements transparent, end-to-end encryption of data read from and written to HDFS, without requiring changes to application code.... Read More…

 

PG-Storm: Let PostgreSQL run faster on the GPU

The PostgreSQL extension PG-Storm allows users to customize data scans and run queries faster. CPU-intensive workloads are identified and transferred to the GPU to take advantage of its powerful parallel execution ability. Compared with a CPU's small number of cores and limited RAM bandwidth, the GPU has a unique advantage: GPUs typically have hundreds of processor cores and RAM bandwidth several times larger than CPUs. They can handle large numbers of computations in parallel, so their operations are very efficient... Read More…

 

Past and Future of Apache Kylin!!!

Apache Kylin (Chinese: Kirin) has appeared, and it can solve problems on Hadoop-based platforms.... The origin of Apache Kylin: in today's era of big data, Hadoop has become the de facto standard, and a large number of tools have been built one after another around the Hadoop platform to address the needs of different scenarios... Read More…

 

Tephra is an open-source project that adds complete transaction support to Apache HBase...

Transaction support in HBase? Yes, a wide range of use cases require transaction support. Firstly, we want the client to have great insight into, and fine-grained control of, what the transaction system can do. Having full control on the client side not only allows you to make the best decisions when optimizing for specific use cases, but also makes integration with third-party systems simpler... Read More…

 

Hive Naming conventions and database naming...

Short Description: Naming conventions make it easier for programmers and architects to understand what is going on inside a business. Read More…

 

HBase Replication and comparison with popular online backup programs...

Short Description: HBase Replication: an HBase replication solution can address cluster security, data security, read/write separation, and operation Read More…

 

Apache Shiro design is intuitive and a simple way to ensure the safety of the application...

Short Description: Apache Shiro’s design goals are to simplify application security by being intuitive and easy to use… Read More…

 

Heterogeneous Storage in HDFS(Part-1)

An introduction to heterogeneous storage types and the flexible configuration of heterogeneous storage! Hadoop version 2.6.0 introduced a new feature: heterogeneous storage. Heterogeneous storage lets each storage medium play to its own read and write strengths. This is very suitable for cold data, which needs large capacity but not high read and write performance, so the most common disks can be used; hot data can be stored on SSDs instead. When we require very efficient reads, ten or even a hundred times faster than ordinary disk, data can even be kept directly in memory and lazily loaded into HDFS. Read More…

 

The ACID properties and the CAP theorem are two concepts in data management for distributed systems.

Started working on HBase again!! I thought, why not refresh a few concepts before proceeding to the actual work. The important things that come to mind when we work with NoSQL in a distributed environment are sharding and partitions. Let’s dive into the ACID properties of databases and the CAP theorem for distributed systems... Read More…

 

Coding Tips and Best Practice in Hive and Oozie…

Many times during code review I have found common mistakes made by developers. Here are a few of the... Read More…

 

HPL/SQL Make SQL-on-Hadoop More Dynamic

Think about the old days when we solved many business problems using dynamic SQL, exception handling, flow-of-control, and iterations. While working on a couple of migration projects, I found a few business rules that needed to be transformed into Hive-compatible form (some of them are very complex and nearly impossible)... Read More…

 

Best Practices for Hive Authorization when using connector to HiveServer2

Recently we have been working with Presto and configuring its Hive connector. It connected successfully following the steps given at prestodb.io/docs/current/connector/hive.html. An overview of our architecture: Presto runs on its own machine (the Presto Machine) and uses the Hive connector to communicate with a Hadoop cluster running on different machines. The Presto Machine has a hive.properties file, which tells Presto to use a Thrift connection to the Hive client, plus hdfs-site.xml and core-site.xml files for HDFS... Read More…

 

HDFS is really not designed for many small files!!!

A few of my friends who are new to Hadoop frequently ask what a good file size is for Hadoop and how to decide on it. Obviously files should not be small, and file size should be in line with the block size. HDFS is really not designed for many small files... Read More…

 

Kafka: A detailed introduction

We need Kafka when building a real-time processing system, as Kafka is a high-performance, highly scalable publish-subscribe messaging system. Traditional systems are unable to process such large volumes of data and are mainly used for offline analysis. Kafka is a solution to the real-time problems of any software solution; that is to say, it unifies offline and online data processing and routes data to multiple consumers quickly... Read More…

 

Out of the Box(Why Women Live Longer than Men)

The fact is that men enjoy life more, but in the end the winners are women, because they always get extra bits of years (these bits sometimes amount to GBs, as much as ten extra years of life compared to men)... Read More…

 

Introduction to Spark…

Spark As a Unified Stack and Computational Engine is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines... Read More…

 

Performance utilities in Hive…

Best Practice
Before taking you into the details of the utilities provided by Hive, let me explain a few components so you understand the execution flow and where the related information is stored in the system. Read More…

 

Data Analysis Approach to a successful outcome…

Setting the Scene
I did the data analysis for one of my projects using the approach below, and hopefully it will help you understand the underlying subject. Soon I’ll post my data analysis project with a detailed description of the technology used: Python (web scraping for data collection), Hadoop, Spark, and R. Read More…

 

JRuby code to purge data on Hbase over Hive table…

Problem to Solve:-
How to delete, update, or query values stored in binary format in a column of an HBase column family. For a Hive-over-HBase table, where we can't use the standard API and are unable to apply filters on binary values, you can use the solution below for programmability. Read More…

 

Python and Python bites.

Python and Python bites “lambda”
Hi everyone, this article shows you one powerful feature of the Python programming language called “lambda”. It can solve many small problems in a single line of code. So let's begin with what may be your interesting, or maybe future, programming language. Read More...
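
A few one-line uses of `lambda`, as a small sketch with made-up sample data. Each lambda is an anonymous function passed straight to a built-in.

```python
words = ["hive", "spark", "hbase", "pig"]

# Sort by word length; sorted() is stable, so equal lengths keep their input order.
by_length = sorted(words, key=lambda w: len(w))

# Square each number in 0..4.
squares = list(map(lambda n: n * n, range(5)))

# Keep only the even numbers in 0..9.
evens = list(filter(lambda n: n % 2 == 0, range(10)))

print(by_length)  # ['pig', 'hive', 'spark', 'hbase']
print(squares)    # [0, 1, 4, 9, 16]
print(evens)      # [0, 2, 4, 6, 8]
```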

 
 
© 2018 Mukesh Kumar | Blogs | Blogs | Contact us