My experience with HCL interview…

I have 13+ years of experience and received a call from HR in Hyderabad (bspraviya_b@hcl.com) for an interview for the position of Solution Architect in July 2018. After the HR discussion, the first technical interview was conducted by an employee named Pawan from Noida. They then called me for a personal interview at Greater Noida. I live in Ambala Cantt, Haryana, and spent Rs. 2000 to reach there, but the concerned person was not aware that I was coming for a face-to-face interview. Somehow, after an hour of struggle, HR arranged my interview and another round of technical interview took place. They told […]

Analytics, Apache Spark, Apache Storm, Bigdata, Hadoop

Apache Storm key takeaways…

Hadoop moves the code to the data; Storm moves the data to the code. This behavior makes more sense in a stream-processing system, because the data set isn’t known beforehand, unlike in a batch job, and the data is continuously flowing through the code. A Storm cluster consists of two types of nodes: the master node and the worker nodes. The master node runs a daemon called Nimbus, and the worker nodes each run a daemon called a Supervisor. The master node can be thought of as the control center. In addition to its other responsibilities, this is where […]
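As a toy analogy (not the real Storm API, which is Java-based; all names here are illustrative), a spout can be pictured as a generator emitting tuples and a bolt as a function consuming the stream as it flows past:

```python
# Toy analogy for Storm's model (not the real Storm API): the data
# streams through the code, instead of code shipping to the data.
def spout():
    # A "spout" is a source that keeps emitting tuples into the topology.
    for word in ["storm", "moves", "data", "to", "code"]:
        yield word

def count_bolt(stream):
    # A "bolt" processes each tuple as it flows past.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

assert count_bolt(spout())["data"] == 1
```

In real Storm the spout and bolts run as daemons on separate worker nodes and tuples are shipped between them over the network; the sketch only shows the direction of data flow.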

Analytics, Data Science, Exploratory Data Analysis, Hadoop

Approach to execute Machine Learning project, “Halt the Hate”…

Disclaimer: The analysis done in this project touches a sensitive issue in India, so I will never try to convince anybody to trust my model. A real human society is so complex that “all the things may be interconnected in a different way than in the model.” Imagine you are presented with a dataset of “hate crimes” in India and asked how to minimize these crimes by analyzing other factors. This is the problem I have taken in hand to solve and analyze with a minimum number of resources. Some may say that education and providing jobs to the youth in India by […]


Fundamentals of Apache Spark…

You can view my other articles on the Spark RDD API at the links below… Apache Spark RDD API using PySpark… Tips and Tricks for the Apache Spark RDD API and DataFrame API. How did Spark become so efficient in data processing compared to MapReduce? It comes with a very advanced Directed Acyclic Graph (DAG) data-processing engine, which means that for every Spark job, a DAG of tasks is created to be executed by the engine. A DAG in mathematical parlance consists of a set of vertices and the directed edges connecting them. The tasks are executed as per the DAG layout. […]
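To illustrate the idea of executing tasks as per a DAG layout (a sketch only, nothing Spark-specific; the task names and the graph are made up), each task runs only after every task it depends on has run:

```python
# Minimal DAG scheduler sketch: execute tasks in a topological order so
# that every dependency edge is respected. The graph is hypothetical.
deps = {
    "load": [],
    "filter": ["load"],
    "map": ["load"],
    "join": ["filter", "map"],
}

def topo_order(deps):
    done, order = set(), []
    def visit(task):
        if task in done:
            return
        for parent in deps[task]:   # run all dependencies first
            visit(parent)
        done.add(task)
        order.append(task)
    for task in deps:
        visit(task)
    return order

order = topo_order(deps)
assert order.index("join") > order.index("filter")
```

Spark builds a DAG like this from the RDD lineage and additionally collapses consecutive narrow transformations into single stages, which is a large part of its advantage over rigid map-then-reduce rounds.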


My intuition to understand eigenvalues and eigenvectors…

One of my biggest hurdles in learning linear algebra was getting an intuition for the algebra itself. Eigenvalues and eigenvectors are one of those things that pop up in a million places because they’re so useful, but to recognize where they may be useful you need intuition as to what they’re doing. The eigenvectors are the “axes” of the transformation represented by the matrix. Consider spinning a globe (the universe of vectors): every location faces a new direction, except the poles. The eigenvalue is the amount the eigenvector is scaled up or down when going through the matrix. Eigenvalues are special numbers […]
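For a 2x2 matrix the eigenvalues can be computed directly from the characteristic polynomial λ² − tr(A)·λ + det(A) = 0. A sketch (the example matrix is chosen for illustration):

```python
import math

# Eigenvalues of a 2x2 matrix [[a, b], [c, d]] via the characteristic
# polynomial: lambda^2 - (a + d)*lambda + (a*d - b*c) = 0.
def eig2x2(a, b, c, d):
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4 * det)  # assumes real eigenvalues
    return (tr + disc) / 2, (tr - disc) / 2

# For the diagonal matrix [[2, 0], [0, 3]] the coordinate axes are the
# eigenvectors, and the eigenvalues are just the diagonal entries:
lam1, lam2 = eig2x2(2, 0, 0, 3)
assert sorted([lam1, lam2]) == [2.0, 3.0]
```

The diagonal case makes the "axes of the transformation" picture literal: vectors along an axis are only stretched (by 2 or by 3), never rotated.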

Analytics, Bigdata, Hadoop, Python

How to convert Python list, tuples, strings to each other…

Lists, tuples, and strings are three built-in sequence types in Python. The built-in functions str(), tuple(), and list() convert between them, as the following example shows:

>>> s = '123456'
>>> list(s)
['1', '2', '3', '4', '5', '6']
>>> tuple(s)
('1', '2', '3', '4', '5', '6')
>>> tuple(list(s))
('1', '2', '3', '4', '5', '6')
>>> list(tuple(s))
['1', '2', '3', '4', '5', '6']
>>> "".join(tuple(s))
'123456'
>>> "".join(list(s))
'123456'
>>> str(tuple(s))
"('1', '2', '3', '4', '5', '6')"
>>> str(list(s))
"['1', '2', '3', '4', '5', '6']"

Best Practices, Bigdata, Hadoop, Kafka

Better late than never: Time to replace your micro-service architecture with Kafka…

Kafka already powers and facilitates many organizations’ microservices architectures. If Kafka is still not part of your infrastructure, it’s high time for you to adopt it. I am not claiming Kafka is better than other message-queue systems, as many articles on that subject are already floating around the internet. Kafka’s uniqueness is that it provides both a simple file-system abstraction and bridging functions. A Kafka broker’s most basic task is to write messages to, and read messages from, the log on disk as quickly as possible. A queued message will not be lost after persistence, which is […]

Analytics, Apache Spark, Hadoop, Kafka, Python, Spark

Consume JSON Messages From Kafka Using Kafka-Python’s Deserializer

I hope you are here because you want to take a ride on Python and Apache Kafka. kafka-python is the most popular Kafka client library for Python; for its documentation, visit https://kafka-python.readthedocs.io/en/master/. kafka-python is designed to function much like the official Java client. It is best used with newer brokers (0.9+), but is backwards-compatible with older versions (down to 0.8.0); some features will only be enabled on newer brokers. So instead of showing you a simple example that runs a Kafka producer and consumer separately, I’ll show the JSON serializer and deserializer. Preparing the environment: let’s start by installing the Python package using […]
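A minimal sketch of such a serializer/deserializer pair (the function names are mine; kafka-python's KafkaProducer and KafkaConsumer do accept callables like these via their value_serializer and value_deserializer arguments):

```python
import json

# JSON value (de)serializers of the kind kafka-python accepts through the
# value_serializer / value_deserializer constructor arguments.
def json_serializer(value):
    return json.dumps(value).encode("utf-8")

def json_deserializer(raw_bytes):
    return json.loads(raw_bytes.decode("utf-8"))

# Wiring them up would look roughly like this (broker address and topic
# are hypothetical; requires a running Kafka broker):
# producer = KafkaProducer(bootstrap_servers="localhost:9092",
#                          value_serializer=json_serializer)
# consumer = KafkaConsumer("my-topic",
#                          value_deserializer=json_deserializer)

# Round trip: what the producer writes is what the consumer reads back.
msg = {"id": 1, "event": "signup"}
assert json_deserializer(json_serializer(msg)) == msg
```

Keeping the (de)serialization in these two small functions means the producer and consumer code never deal with raw bytes themselves.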


AWS and GCE are both great! Some more powerful load-balancing configurations put GCE over the top…

I work with Hadoop, so I come across, or sometimes management asks me, a common question: “Why do we need Hadoop in the cloud?” To answer this question I keep my bold points ready, like below… The cloud is your data center; no need to deal with reliability and scaling issues. Pay for what you need. Deployed in minutes. Cloud storage enables economic flexibility, scale, and rich features. Size clusters independently of storage needs, while prices continue to decrease. Geo-redundancy allows for business-continuity/disaster-recovery planning. They then move forward to ask me for a detailed comparison, to find out the difference between GCP […]


PIR Sensor, a pyroelectric device…

After working on sensors with Arduino, I have decided to pass on my knowledge via blogs. I will start sharing a few projects that are already done, and some that are in scope but yet to be materialized, in Robotics and AI. I will be posting regular articles here as I build a data-collection system using Arduino, Python, and sensors. Visualization of the collected data is also in scope. My projects are mainly based on home security systems, therefore from now on I call myself “The Agent 360!” I have created a separate menu, “Robotics”, at www.ammozon.co.in for all my robotics blogs and projects to […]

Administration, Bigdata, Hadoop

Apache Solr Search Installation on HDP2.6 using Yum Repo

As we know, “HDP 2.6” is not bundled with “HDP Search”, which includes Solr. Therefore, in this two-part article I am going to explain the ways to install Solr/SolrCloud/HDP Search: 1. Apache Solr Search installation on HDP 2.6 using a Yum repo. 2. Apache Solr Search installation on HDP 2.6 using an Ambari management pack. The two use different approaches, so I have divided them into two articles. Recently I installed HDP 2.6 in one of my development environments. Now it’s time to bring the same services back one by one, as we are running HDP 2.5 in the production environment; one of them is […]

Bigdata, Database, Hadoop

Cloud Databases & Cloud Blob…

Cloud computing is the next stage in the evolution of the Internet. The cloud in cloud computing provides the means through which everything, from computing power to computing infrastructure, and from applications and business processes to personal collaboration, can be delivered to you as a service wherever and whenever you need it. Cloud databases are web-based services designed for running queries on structured data stored on cloud data services. Most of the time, these services work in conjunction with cloud compute resources to give users the capability to store, process, and query data sets within the cloud environment. These services are designed to […]

Analytics, Bigdata, Framework, Hadoop

Sumo Logic : Log Management Tool

This is my first face-off with “Sumo Logic”. If you want a quick introduction to Sumo Logic, this topic will be helpful without going into the detailed documentation. Sumo Logic is designed to help you manage and analyze your log files. It started out attempting to be a SaaS version of Splunk and has gone its own way as it matured, but as a result of those beginnings it is one of the most feature-rich and enterprise-focused SaaS log-management tools. Installation: Sumo Logic is a SaaS model, which means you’ll be setting up communication out to the Sumo Logic […]

Bigdata, Hadoop

SolrCloud: in CAP-theorem terms, Solr is a CP system, yet it keeps availability in certain circumstances.

A SolrCloud cluster holds one or more distributed indexes which are called Collections. Each Collection is divided into shards (to increase write capacity) and each shard has one or more replicas (to increase query capacity). One replica from each shard is elected as a leader, who performs the additional task of adding a ‘version’ to each update before streaming it to available replicas. This means that write traffic for a particular shard hits the shard’s leader first and is then synchronously replicated to all available replicas. One Solr node (a JVM instance) may host a few replicas belonging to different […]
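A toy sketch of the document-to-shard routing idea (SolrCloud's default compositeId router hashes the document id; here I use CRC32 purely for illustration, and the shard count is made up):

```python
import zlib

# Toy shard router: a document id is hashed and mapped onto one of the
# collection's shards, so all writes for that id hit the same shard
# (whose leader then replicates them to the shard's replicas).
# CRC32 and 4 shards are illustrative choices, not what Solr uses.
NUM_SHARDS = 4

def shard_for(doc_id):
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS

# Routing is deterministic: the same id always lands on the same shard,
# which is what makes per-shard versioned, leader-ordered updates possible.
assert shard_for("doc-42") == shard_for("doc-42")
assert 0 <= shard_for("doc-42") < NUM_SHARDS
```

Deterministic routing is also why adding shards to a collection is not free: changing the shard count changes where existing ids map.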


We just need to be better to each other before talking to AI

I’m not going to talk about statistics, machine learning, or AI, nor even compare any database that shows errors made by AI work versus manual work. I believe that the fundamental problems of our time are ethical, not technological. If we can figure out that part, the technology should take care of itself. I would love to live in a post-scarcity utopia where we all run around self-actualizing. I don’t think we need to give up AI to get there; in fact, I think technology will be the key that unlocks the gates. We have to have the wisdom to […]


Apache Eagle: Real-time security monitoring solution

On January 10, 2017, the Apache Software Foundation, which hosts more than 350 open-source projects and innovation initiatives, all developed by volunteer communities and incubator contributors, announced that Apache Eagle had graduated from the Apache Incubator. Eagle originated at eBay, built first to solve large-scale Hadoop cluster monitoring problems. The team quickly realized that it would also be useful to the wider community, so in October 2015 the project was submitted to the Apache Incubator. Since then, Eagle has gained the attention of developers and organizations for its extensive usage scenarios, such as system/service […]


Google Cloud Platform(GCP) overview

Google Cloud Platform – GCP is a collection of various SaaS, PaaS, and IaaS services, and new services are still being launched day by day. There are a lot of services, including beta services, and I will not introduce everything in detail in this article; I am mainly going to introduce the services that become important when considering game infrastructure. GCP service classification and overview of each service: in GCP, as with other clouds, there are many services, but in this document we classify those services into five categories: “execution environment service”, “storage service”, “network service”, “data […]

Analytics, Bigdata, Framework, Hadoop, RHadoop

Install and smoketest R and RHadoop on Hortonworks Data Platform (HDP25-CentOS7)

Before going into the installation steps, I’d like to give a small introduction to RHadoop. What is RHadoop? RHadoop is an open-source project for combining R and Hadoop. It contains four packages that integrate different Hadoop projects with R and one package that adapts familiar functions to the MapReduce framework: rhdfs combines Hadoop’s HDFS with R; rhbase combines Hadoop’s HBase with R; rmr2 combines Hadoop’s MapReduce 2 with R; ravro combines Hadoop’s Avro with R; plyrmr provides a familiar plyr-like interface over MapReduce. You can reference the official RHadoop GitHub: https://github.com/RevolutionAnalytics/RHadoop Requirements: first of all, I have installed HDP 2.5 […]

Analytics, Hadoop

Advertisement attributes or Ad Attributes…An Idea!!!

Some time ago I was working on an idea called Ad Attributes, or Advertisement Attributes, and I’d like to share my thoughts on it with the audience. Advertisement attributes are for creating a favorable selling climate. Today, consumers are constantly targeted with product information by marketing companies and are faced with numerous advertisements carrying vast information on products. Thus consumers use a heuristic approach to help them make their purchasing decisions. This approach basically uses mental shortcuts to streamline the selection process cognitively, to avoid being puzzled or paralyzed by the huge number of products offered in the […]

Hadoop, Hbase, Hive

JRuby code to purge data on Hbase over Hive table…

Problem to solve: how to delete/update/query binary-format values stored in a column of an HBase column family. With a Hive table over HBase, where we can’t use the standard API and are unable to apply filters on binary values, you can use the solution below programmatically. Find the JRuby source code at github.com/mkjmkumar/JRuby_HBase_API. This program, written in JRuby, purges data using the HBase shell and deletes the required data by applying a filter on a given binary column. You have already heard the many advantages of storing data in HBase (especially in binary block format) and creating a Hive table on top of it to query your data. I am not going to explain the use case for this, why […]

Hadoop, Hive, Java, Pig, Python

Python and Python bites

Hi everyone, this article shows you one powerful feature of the Python programming language called “lambda”. It can solve many a small problem in a single line of code. So let’s start at the beginning of your interesting, and maybe future, programming language. Anonymous functions created at runtime are known as lambda functions. The lines below define an ordinary function in Python:

>>> def f(x): return x + 42
>>> print f(21)
63

With a lambda function:

>>> calc = lambda x: x + 42
>>> calc(21)
63

A lambda definition does not include a “return” statement; it always contains a single expression, which is returned. Also […]
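Where lambdas shine is as throwaway key functions passed to other functions, for example sorted() (the word list is made up for illustration):

```python
# A lambda as an inline key function: sort words by their length,
# without defining and naming a separate one-line function.
words = ["kafka", "hive", "pig", "hadoop"]
by_length = sorted(words, key=lambda w: len(w))
assert by_length == ["pig", "hive", "kafka", "hadoop"]
```

The same pattern works anywhere a small function is expected once: map(), filter(), min()/max() keys, and so on.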

Hadoop, Kylin, Security

Past and Future of Apache Kylin!!!

Short Description: Apache Kylin (Chinese name: Kirin) appears to solve these problems on top of Hadoop. Article: the origin of Apache Kylin. In today’s era of big data, Hadoop has become the de facto standard, and a large number of tools have been built one after another around the Hadoop platform to address the needs of different scenarios. For example, Hive is a data-warehouse tool for Hadoop: data files stored on the HDFS distributed file system can be mapped to database tables and queried with SQL. Hive’s execution engine converts the SQL into MapReduce tasks to run, which is ideally suited for data-warehouse analysis. […]

Hadoop, HDFS

Heterogeneous Storage in HDFS(Part-1)…

An introduction to heterogeneous storage types and the flexible configuration of heterogeneous storage! Heterogeneous Storage in HDFS: Hadoop version 2.6.0 introduced a new feature, heterogeneous storage. Heterogeneous storage lets HDFS use different storage media according to their respective read/write characteristics, so that each medium plays to its strengths. This is very suitable for the cold storage of data: cold data calls for large capacity while high read/write performance is not required, so the most common disks suffice, while for hot data SSDs can be used instead. On the other hand, when we require […]

Best Practices, Hadoop, Hive

Performance utilities in Hive

Before taking you into the details of the utilities provided by Hive, let me explain a few components, to convey the execution flow and where the related information is stored in the system. Hive is data-warehouse software best suited to OLAP (OnLine Analytical Processing) workloads, handling queries over vast volumes of data residing in distributed storage. The Hadoop Distributed File System (HDFS) is the ecosystem in which Hive maintains data reliably and survives hardware failures. Hive was among the first SQL-like relational big-data warehousing approaches developed on top of Hadoop. HiveQL, as described, is an SQL-like query language for […]

Hadoop, Hive, Oozie

Coding Tips and Best Practice in Hive and Oozie…

Many times during code review I have found common mistakes made by developers. Here are a few of them… Workflow mandatory item: add this property to all workflows that have a Hive action. This property makes sure that the Hive job runs with the necessary number of reducers instead of just one. <property><name>mapreduce.job.reduces</name><value>-1</value></property> HQL items: setting properties: keep the set properties in the HQL to a minimum and let them take their default values; add only what is absolutely necessary for that script. If you are using older code as a template, do not […]

Hadoop, HDFS

HDFS is really not designed for many small files!!!

A few of my friends who are new to Hadoop frequently ask what a good file size for Hadoop is and how to decide on it. Obviously files should not be small; file size should be in line with the block size. HDFS is really not designed for many small files. For each file, the client has to talk to the NameNode, which gives it the location(s) of the block(s) of the file, and then the client streams the data from the DataNode. Now, in the best case, the client does this once, and then finds that it is the machine with […]
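A back-of-envelope illustration of why small files hurt: the NameNode keeps an in-heap object for every file and every block. The commonly cited figure of roughly 150 bytes per object is an estimate, not an exact number, and the file counts below are made up:

```python
# Rough NameNode heap estimate (assumed ~150 bytes per file/block object).
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # common default HDFS block size

def namenode_bytes(num_files, blocks_per_file):
    # one object per file plus one per block
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 10 million 1 KB files vs. the same ~10 GB packed into 128 MB files:
small = namenode_bytes(10_000_000, 1)                       # ~3 GB of heap
large = namenode_bytes(10_000_000 * 1024 // BLOCK_SIZE, 1)  # a few KB
assert small > 100 * large  # orders of magnitude more NameNode memory
```

The data volume is identical in both cases; only the file count differs, and the NameNode pays per file, not per byte.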

Hadoop, HDFS, Hive

HBase Replication and comparison with popular online backup programs…

Short Description: the HBase replication solution can address cluster security, data security, read/write separation, and operations. Article: this is the first of a series of three articles; the coming articles will include some code and the replication mechanisms present in the latest versions of HBase. HBase Replication: the HBase replication solution can address cluster security, data security, read/write separation, operations and maintenance, and operator errors, and its ease of management and configuration provides powerful support for online applications. HBase replication is currently rarely used in the industry, because there are many aspects, such […]

Hadoop, Kafka

Kafka: A detail introduction

I’ll cover Kafka in detail, with an introduction to programming it, and will try to cover almost its full architecture. So here it goes: we need Kafka when there is a need for building a real-time processing system, as Kafka is a high-performance publisher-subscriber-based messaging system with highly scalable properties. Traditional systems are unable to process such large data and are mainly used for offline analysis; Kafka is a solution to the real-time problems of any software solution; that is to say, it unifies offline and online data processing and routes it to multiple consumers quickly. Below are the characteristics of Kafka: Persistent messaging: […]

Bigdata, Hadoop, NoSql

The ACID properties and the CAP theorem are two key concepts in data management for distributed systems.

Started working on HBase again!! I thought, why not refresh a few concepts before proceeding to the actual work? Important things that come to mind when we work with NoSQL in a distributed environment are sharding and partitioning. Let’s dive into the ACID properties of databases and the CAP theorem for distributed systems. The ACID properties and the CAP theorem are two concepts in data management for distributed systems; the funny thing is that both come with a “C”, with totally different meanings. What is ACID? It is a set of rules that means a lot for RDBMSs, because all RDBMSs are ACID compliant. A = Atomicity, meaning all or nothing: if I […]
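Atomicity, the "all or nothing" property, can be demonstrated with Python's built-in sqlite3 module (the table and values are made up for illustration):

```python
import sqlite3

# Atomicity demo: two inserts in one transaction; when the second fails,
# the first is rolled back too, so the table ends up unchanged.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INT)")
con.commit()

try:
    with con:  # the with-block is one transaction
        con.execute("INSERT INTO accounts VALUES (1, 100)")
        con.execute("INSERT INTO accounts VALUES (1, 200)")  # PK violation!
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back

rows = con.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
assert rows == 0  # all or nothing: the first insert did not survive
```

This is exactly the guarantee that becomes expensive to keep once the data is sharded across many nodes, which is where the CAP trade-offs enter.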

Analytics, Hadoop

Data Analysis Approach to a successful outcome

I have done data analysis for one of my projects using the approach below, and hopefully it may help you understand the underlying subject. Soon I’ll post my data-analysis project with a detailed description of the technologies used: Python (web scraping for data collection), Hadoop, Spark, and R. Data analysis is a highly iterative and non-linear process, better reflected as a series of cyclic steps, in which information is learned at each step, which then informs whether (and how) to refine and redo the step that was just performed, or whether (and how) to proceed to the next step. Setting the scene: data analysis is […]