Hadoop

My experience with HCL interview…

I have 13+ years of experience and got a call from an HR representative located in Hyderabad (bspraviya_b@hcl.com) for the position of Solution Architect in July 2018. After the HR discussion, the first technical interview was conducted by an employee named Pawan from Noida. They then called me for a personal interview at Greater Noida. I live in Ambala Cantt, Haryana, and spent Rs 2000 to reach there, but the concerned person was not aware that I was coming for a face-to-face interview. Somehow, after an hour of struggle, HR arranged my interview and another round of technical interview happened. They told […]

Analytics, Apache Spark, Apache Storm, Bigdata, Hadoop

Apache Storm key takeaways…

Hadoop moves the code to the data; Storm moves the data to the code. This behavior makes more sense in a stream-processing system, because the data set isn’t known beforehand, unlike in a batch job, and the data set is continuously flowing through the code. A Storm cluster consists of two types of nodes: the master node and the worker nodes. The master node runs a daemon called Nimbus, and the worker nodes each run a daemon called a Supervisor. The master node can be thought of as the control center. In addition to its other responsibilities, this is where […]

Analytics, Data Science, Exploratory Data Analysis, Hadoop

Approach to execute Machine Learning project, “Halt the Hate”…

Disclaimer: The analysis done in this project touches a sensitive issue in India, so I am not trying to convince anybody to trust my model. A real human society is so complex that “all the things may be interconnected in a different way than in the model.” Imagine you are presented with a dataset of “Hate Crimes” in India and asked how to minimize these crimes by analyzing other factors. This is the problem I am taking up to solve and analyze with a minimum number of resources. Some can say that education and providing jobs to youth in India by […]

Hadoop

Fundamentals of Apache Spark…

You can view my other articles on Spark RDD at the links below: Apache Spark RDD API using Pyspark… and Tips and Tricks for Apache Spark RDD API, Dataframe API. How did Spark become so efficient in data processing compared to MapReduce? It comes with a very advanced Directed Acyclic Graph (DAG) data processing engine. What this means is that for every Spark job, a DAG of tasks is created to be executed by the engine. A DAG, in mathematical parlance, consists of a set of vertices and directed edges connecting them. The tasks are executed as per the DAG layout. In […]
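To make the DAG idea concrete, here is a minimal PySpark sketch (names and data are my own, purely illustrative): the map and filter calls below only add vertices to the DAG; nothing runs until the collect action is invoked.

from pyspark import SparkContext

sc = SparkContext("local", "dag-demo")

nums = sc.parallelize([1, 2, 3, 4, 5])   # source vertex of the DAG
doubled = nums.map(lambda x: x * 2)      # transformation: recorded, not executed
evens = doubled.filter(lambda x: x > 4)  # another transformation, still lazy
print(evens.collect())                   # action: DAG is scheduled and run -> [6, 8, 10]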

Analytics, Bigdata, Data Science, Exploratory Data Analysis, Machine Learning

Bayesian-posterior imagination and applications…

Before going into Bayes and posterior probability, let us first understand a few terms we are going to use. Conditional probability and independence: a conditional probability is the probability of one event given that another event occurred. In the “die-toss” example, the probability of event A, three dots showing, is P(A) = 1/6 on a single toss. But what if we know that event B, at least three dots showing, occurred? Then there are only four possible outcomes, one of which is A. The probability of A = {3} is 1/4, given that B = {3, 4, 5, 6} occurred. […]
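The die-toss numbers can be checked by brute-force enumeration; a small sketch (purely illustrative):

from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}          # sample space of one die toss
A = {3}                             # event A: three dots showing
B = {3, 4, 5, 6}                    # event B: at least three dots showing

P = lambda e: Fraction(len(e), len(omega))
# P(A|B) = P(A and B) / P(B)
print(P(A))             # 1/6
print(P(A & B) / P(B))  # 1/4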

Analytics, Best Practices, Bigdata, Exploratory Data Analysis, Machine Learning

Residual Plots for Regression Analysis…

As discussed in my last article, I want to show you the parameters used to understand the accuracy and predictions of a regression model, but before going into that we first need to understand the importance of the residual plot. Without understanding residual plots, the discussion of regression would be incomplete. Using residual analysis we can verify whether our model is linear or nonlinear. Residual plots reveal unwanted residual patterns that indicate biased results, and you just need to master them by visualization. In residual analysis we check that the residuals are randomly scattered around zero for the entire range of fitted values. […]
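A minimal residual-plot sketch with numpy and matplotlib (synthetic data, illustrative only): fit a straight line, then plot residuals against fitted values and look for structure.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)   # linear signal plus noise

slope, intercept = np.polyfit(x, y, 1)  # ordinary least squares line
fitted = slope * x + intercept
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0)                # residuals should scatter randomly around this line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()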

Bigdata, Notebook, Python

JavaScript Issue resolution in JupyterLab Notebook

The graphs were not appearing in my JupyterLab notebook, and the error message said “JavaScript output is disabled in JupyterLab”. At first it seemed that I just needed to enable it from the notebook itself, but a few sites saying JupyterLab does not support it yet was frustrating.

# matplotlib submodule pyplot
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(0, 100, 0.5)
y = 2 * np.sqrt(x)
plt.plot(x, y)
plt.show()

“JavaScript output is disabled in JupyterLab”

So, to solve this issue and enable the extension, first stop your notebook and use the command below:

C:\Users\victor>jupyter nbextension enable --py --sys-prefix widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension… – […]

Best Practices, Bigdata, Data Science, Exploratory Data Analysis, Machine Learning

Ordinary least squares regression (OLSR)

Ordinary least squares regression (OLSR), invented in 1795 by Carl Friedrich Gauss, is considered one of the earliest known general prediction methods. OLSR is a generalized linear modeling technique. It is used for estimating the unknown parameters in a linear regression model, the goal being to minimize the sum of the squares of the differences between the observed values and those predicted by a linear function of the explanatory variables. Ordinary least squares regression is also known as ordinary least squares or least squared errors regression. Let’s start with a linear regression model like the one below. Here is a little terminology we use when we […]
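A quick numerical sketch of OLS with numpy (synthetic data of my own; the column of ones is the intercept term):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 4.0 + 2.5 * x + rng.normal(0, 1, 50)   # true intercept 4, slope 2.5

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
# beta minimizes ||y - X @ beta||^2, the OLS objective
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [4.0, 2.5]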

Bigdata, Data Science, Exploratory Data Analysis, Machine Learning

ROC curve and performance parameters of a classification model…

When we evaluate a model we analyze a few parameters to verify its performance. These parameters demonstrate the performance of our model using confusion matrices. The most frequently used performance parameters are Accuracy, Precision, Recall and F1 score. Let me give you an idea of what they are in this article, so that when we talk about our model in the next articles you will not be confused by the terms. So let’s say our model is ready and we want to know how good it is. These terms help the audience of our hypothesis understand how good the predictions are. […]
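A small sketch computing these four parameters with scikit-learn (the labels are made up for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total -> 0.8
print(precision_score(y_true, y_pred))    # TP / (TP + FP)    -> 0.8
print(recall_score(y_true, y_pred))       # TP / (TP + FN)    -> 0.8
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall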

Data Science, Exploratory Data Analysis, Machine Learning

Understanding distribution functions…

This article helps in understanding distribution functions and their usage in Exploratory Data Analysis in Data Science. In the next article, I’ll take you through some practical usages, on my sample project, of the terms defined here. Exploratory Data Analysis is the combination of many small tasks, like data cleansing, data munging and creating visualizations, to understand the value in data. Through the distribution of data, we actually try to extract value out of it. Distribution also matters once the data is ready for analysis and we have received another set of sample data, since then we do […]

Hadoop

My intuition to understand eigenvalues and eigenvectors…

One of my biggest hurdles in learning linear algebra was getting the intuition behind it. Eigenvalues and eigenvectors are one of those things that pop up in a million places because they’re so useful, but to recognize where they may be useful you need intuition as to what they’re doing. The eigenvectors are the “axes” of the transformation represented by the matrix. Consider spinning a globe (the universe of vectors): every location faces a new direction, except the poles. The eigenvalue is the amount the eigenvector is scaled up or down when going through the matrix. Eigenvalues are special numbers […]
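A tiny numpy check of this intuition (the matrix is chosen arbitrarily for illustration): multiplying an eigenvector by the matrix only rescales it by its eigenvalue.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

vals, vecs = np.linalg.eig(A)
print(vals)          # eigenvalues, here 3 and 1

v = vecs[:, 0]       # first eigenvector (a column of vecs)
print(A @ v)         # same direction as v...
print(vals[0] * v)   # ...just scaled by the eigenvalue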

Bigdata, Database

Encourage you to switch to Jupyter Lab…

Notebooks are great for prototyping longer pipelines or processes. If you are a user of PyCharm or Jupyter Notebook and an exploratory data scientist, I would encourage you to switch to Jupyter Lab. For Jupyter Lab installation steps, go here. Below are some of the advantages that I see in using Jupyter Lab over Jupyter Notebook: The terminal is now a tab view, easier to use by comparison. The ability to lay out multiple windows easily, much like an IDE. This will make working on a remote server so much nicer: just start Jupyter Lab and an ssh tunnel, and you have a […]

Analytics, Bigdata, Database

Why and when we need Machine Learning…

I’ve been in data management/data quality for several years. When I ask some people what their data management processes are, they simply reply, “well, we have some of our data stored in a database and other data stored on file shares with proper permissions.” This isn’t data management… it’s data storage. If you and/or your organization don’t have good, clean data, you are most definitely not ready for machine learning. Data management should be your first step before diving into any other data project(s). Now, if you do have good data management and data tagged for machine learning, give yourself a pause and […]

Bigdata, Database, Python

Python Lists and Lambda Learning…

There are many ways to use Python’s lists and lambdas. Here I am going to show some useful tips and tricks. So let’s first start with lists. Below are the list operations we use most of the time:

>>> a = [66.6, 333, 333, 1, 1234.5]
>>> print a.count(333), a.count(66.6), a.count('x')
2 1 0
>>> a.insert(2, -1)
>>> a.append(333)
>>> a
[66.6, 333, -1, 333, 1, 1234.5, 333]
>>> a.index(333)
1
>>> a.index(333, 2)
3
>>> a.remove(333)
>>> a
[66.6, -1, 333, 1, 1234.5, 333]
>>> a.reverse()
>>> a
[333, 1234.5, 1, […]
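Since the title also promises lambdas, here is a small illustrative sketch of my own (not from the article) of the lambda-with-list idioms this kind of post builds toward:

a = [66.6, 333, 333, 1, 1234.5]

doubled = list(map(lambda x: x * 2, a))    # apply a lambda to every element
big = list(filter(lambda x: x > 300, a))   # keep elements matching a predicate
by_size = sorted(a, key=lambda x: abs(x))  # use a lambda as a sort key

print(doubled)
print(big)      # [333, 333, 1234.5]
print(by_size)  # [1, 66.6, 333, 333, 1234.5]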

Apache Spark, Bigdata, Database

Apache Spark RDD API using Pyspark…

In my previous article, I used Scala to show the usability of the Spark RDD API. Many of us utilize PySpark to work with RDDs and lambda functions. Though the function names and outputs are the same as what we have in Scala, the syntax for RDD operations is different in PySpark. I’ll explain PySpark RDDs here using a different approach and with a different perspective to solve the problem. Let us consider that we are streaming data using Spark, have created an RDD from this streaming application, and want to perform RDD operations on this stream of data in a particular time interval. Here I am […]
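As a flavor of the PySpark syntax discussed here, a classic word-count sketch using only lambdas (data is inline for illustration; a streaming source would feed RDDs in instead):

from pyspark import SparkContext

sc = SparkContext("local", "rdd-lambdas")

lines = sc.parallelize(["spark makes rdds", "rdds love lambdas", "spark loves lambdas"])
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(counts.collect())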

Analytics, Bigdata, Python

How to convert Python list, tuples, strings to each other…

Lists, tuples, and strings are three of Python’s built-in sequence types. The three functions str(), tuple() and list() convert between them, as the following example shows:

>>> s = '123456'
>>> list(s)
['1', '2', '3', '4', '5', '6']
>>> tuple(s)
('1', '2', '3', '4', '5', '6')
>>> tuple(list(s))
('1', '2', '3', '4', '5', '6')
>>> list(tuple(s))
['1', '2', '3', '4', '5', '6']
>>> "".join(tuple(s))
'123456'
>>> "".join(list(s))
'123456'
>>> str(tuple(s))
"('1', '2', '3', '4', '5', '6')"
>>> str(list(s))
"['1', '2', '3', '4', '5', '6']"

Analytics, Apache Spark, Bigdata, Database

Tips and Tricks for Apache Spark RDD API, Dataframe API – Part 1

I am planning to share my knowledge of the Apache Spark RDD and Dataframe APIs along with some tips and tricks. If I combined everything into one piece it would be a very lengthy article, so I am dividing it into three separate articles, of which this is the first: 1. Spark RDD API; 2. Dataframe API; 3. Tips and tricks on the RDD and Dataframe APIs. Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could […]

Analytics, Bigdata, Kafka

Better late than never: Time to replace your micro-service architecture with Kafka…

Kafka has already spawned and facilitated many organizations in the micro-services architecture world. If Kafka is still not part of your infrastructure, it’s high time for you to go with it. I am not promoting Kafka as better than any other message queue system, as many articles on this subject are already floating around the internet. Kafka’s uniqueness is that it provides both simple file system and bridge functions. A Kafka broker’s most basic task is to write messages to and read messages from the log on disk as quickly as possible. Queued messages will not be lost after persistence, which is […]

Analytics, Bigdata, Kafka

In-depth: Kafka message queue principles of high reliability

At present, many open source distributed processing systems such as Cloudera, Apache Storm, Spark and others support integration with Kafka. Kafka is increasingly being favored by many internet companies, which use it as one of their core messaging engines. The reliability of Kafka messages can be imagined as that of a commercial-grade messaging middleware solution. In this article, we will examine Kafka’s storage mechanism, replication principle, synchronization principle, and durability assurance to analyze its reliability. As shown in the figure above, a typical Kafka architecture includes several Producers (which can be server logs, business data, page views generated by […]

Analytics, Apache Spark, Hadoop, Kafka, Python, Spark

Consume JSON Messages From Kafka Using Kafka-Python’s Deserializer

Hope you are here because you want to take a ride on Python and Apache Kafka. kafka-python is the most popular Kafka client library for Python. For documentation on this library, visit https://kafka-python.readthedocs.io/en/master/. kafka-python is designed to function much like the official Java client. kafka-python is best used with newer brokers (0.9+), but is backwards-compatible with older versions (to 0.8.0); some features will only be enabled on newer brokers. So instead of showing you a simple example of running a Kafka producer and consumer separately, I’ll show the JSON serializer and deserializer. Preparing the environment: let’s start with installing the python package using […]
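A minimal serializer/deserializer sketch with kafka-python (the broker address and topic name are placeholders of my own):

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: serialize dicts to JSON bytes before sending
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))
producer.send('demo-topic', {'user': 'victor', 'action': 'login'})
producer.flush()

# Consumer: deserialize JSON bytes back into dicts
consumer = KafkaConsumer(
    'demo-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')))
for message in consumer:
    print(message.value)   # a plain Python dict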

Analytics, Bigdata, Kafka

Moving to communication of events between subsystems — CQRS-ES with open source…

Before going into definitions of EP, CEP and CQRS, let us start with some basic database terms and the problem we are trying to address here. We have commercial databases, and database professionals have publicized CRUD operations a lot. This one-row-per-pattern approach works well in most projects and is enough to build an application quickly and securely. I have probably implemented 100 CRUD projects (including web applications), and we do it that way because we have limited budgets and projects have deadlines. CRUD works well until someone asks for historical data, and I have seen a few managers complaining about the lack […]

Bigdata, HDP Search, Solr, SolrCloud

SolrCloud vs HDPSearch…

Let us start by removing some confusion related to SolrCloud and HDPSearch. First, what is SolrCloud? Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability, called SolrCloud. These capabilities provide distributed indexing and search, supporting the following features: central configuration for the entire cluster; automatic load balancing and fail-over for queries; ZooKeeper integration for cluster coordination and configuration. Let’s also clear up some confusion about Solr and SolrCloud (ZooKeeper coordination, Solr with HDFS, HA mode): Solr and SolrCloud are not separate things; Solr is the application while […]
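For a concrete taste of SolrCloud’s distributed indexing, this is roughly how a sharded, replicated collection is created with the stock Solr CLI (the collection name and counts here are illustrative):

# create a collection with 2 shards, each with 2 replicas,
# spread across the nodes of the SolrCloud cluster
bin/solr create -c my_collection -shards 2 -replicationFactor 2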

Hadoop

AWS and GCE are both great! Some more powerful load-balancing configuration puts GCE over the top…

I work with Hadoop, so I come across, or sometimes management asks me, a common question: “Why do we need Hadoop in the cloud?” To answer this question I keep my bold points ready, like below: The cloud is your data center; no need to deal with reliability and scaling issues. Pay for what you need. Deployed in minutes. Cloud storage enables economic flexibility, scale, and rich features. Size clusters independent of storage needs, and the price continues decreasing. Geo-redundancy allows for business continuity/disaster recovery planning. Now they move forward to ask me for a detailed comparison, to find out the difference between GCP […]

Love what you like, Uncategorized

Benefits of Blogging!

Yes, blogging has many benefits. The first thing is money; I’m not earning money writing blogs right now, but many bloggers who make money get pleasure from it. You could be looking at blogs with other goals in mind, but here are a few from my perspective: 1. Fame — A successful blog has the potential to get you noticed and build you a more visible profile in your business market, community or social media. 2. Contacts — Blogs are an excellent way to get to know people and network. With blogs naturally leading to conversation, a well-read blog will put you in […]

Hadoop

PIR Sensor, a pyroelectric device…

After working on sensors with Arduino, I have decided to pass on my knowledge via blogs. I will start sharing a few projects that are already done and some that are in scope but yet to be materialized in Robotics and AI. I will be posting regular articles here as I build a data collection system using Arduino, Python and sensors. Visualization on the collected data is also in scope. My projects are mainly based on home security systems, therefore from now onward I call myself “The Agent 360!” I have created a separate menu “Robotics” at www.ammozon.co.in for all my robotics blogs and projects to […]

Administration, Bigdata, Hadoop

Apache Solr Search Installation on HDP2.6 using Yum Repo

As we know, “HDP 2.6” is not bundled with “HDP Search”, which includes Solr. Therefore, in two parts of this article I am going to explain ways to install Solr/SolrCloud/HDP Search: 1. Apache Solr Search installation on HDP 2.6 using a Yum repo. 2. Apache Solr Search installation on HDP 2.6 using an Ambari management pack. Both use a different approach, therefore I have divided this into two articles. Recently I installed HDP 2.6 on one of my development environments. Now it’s time to bring the same services back one by one as we are running on HDP 2.5 in the production environment; one of them is […]

Python

pyshark, tshark and wireshark installation…

pyshark is a Python wrapper for tshark, allowing Python packet parsing using Wireshark dissectors. Installation (all platforms): we are going to use Python pip for installation; if you don’t have pip, please use the commands below to install it:

# sudo yum install python-pip
# sudo yum install python-wheel

Once done, install pyshark using pip:

# pip install pyshark

Now install tshark; as pip does not provide it, we go with the yum whatprovides tool:

# yum whatprovides *tshark*

Confirm the tshark version once done:

# tshark -v

Now install Wireshark:

# yum install wireshark

Now go to the Python shell and use the commands below to sniff the network:

>>> import pyshark
>>> capture […]
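For reference, a typical pyshark sniffing sketch looks something like this (the interface name and packet count are assumptions on my part):

import pyshark

# capture live packets from an interface using Wireshark's dissectors
capture = pyshark.LiveCapture(interface='eth0')
for packet in capture.sniff_continuously(packet_count=5):
    print(packet.highest_layer, packet.length)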

Apache Spark, Hbase

Multiple WAL in Apache HBase 1.3 and performance enhancements!!!

Apache HBase 1.3.0 was released in mid-January 2017 and ships with support for date-based tiered compaction and improvements in multiple areas, like the write-ahead log (WAL) and a new RPC scheduler, among others. The release includes almost 1,700 resolved issues in total. Below are some bold points on the enhancements made in HBase 1.3.0: The “date-based tiered compaction” support shipped in HBase 1.3.0 is beneficial where data is infrequently deleted or updated and recent data is scanned more often than older data. Record time-to-live (TTL) can be easily enforced with this new compaction strategy. Improved multiple WAL support in Apache HBase […]

Bigdata, Database, Hadoop

Cloud Databases & Cloud Blob…

Cloud computing is the next stage in the evolution of the Internet. The cloud in cloud computing provides the means through which everything — from computing power to computing infrastructure, and from applications and business processes to personal collaboration — can be delivered to you as a service wherever and whenever you need it. Cloud databases are web-based services designed for running queries on structured data stored on cloud data services. Most of the time, these services work in conjunction with cloud compute resources to provide users the capability to store, process, and query data sets within the cloud environment. These services are designed to […]

Apache Spark, open source

Apache Spot, the open source community to continue the fight against cybercrime…

Apache Spot is a force of the Apache community in the fight against cybercrime. Since Apache Spot started earlier this year at Intel and Cloudera, the momentum of the project has been growing, with Anomali, Centrify, Cloudwick, Cybraics, eBay, Endgame, Jask, Streamsets, Webroot and other partners giving unanimous support. It uses Apache Hadoop to achieve unlimited scale for log management and data storage, as well as Apache Spark to achieve near real-time machine learning and anomaly detection, bringing network security new data analysis functions. With Apache Spot, we can make more effective use of the technology provided by Big Data ecosystems and can detect unknown network […]

Free Software, open source

‘Open source’ and ‘free software’

It’s my materialist vs. idealist thought going on here. If you do not find it matching your reality, be patient with my arguments. First of all, what is software, and how has the demand for software, with respect to programmers, changed over time? Think back 20 years to a time when the internet was still a DARPA project and the web was but a glimmer in Tim Berners-Lee’s eyes. At that time, someone who could create software or build computers was pretty special. In fact, unless you worked for a software vendor or attended class in a computer science department, […]

Analytics, Bigdata, Framework, Hadoop

Sumo Logic : Log Management Tool

This is my first face-off with “Sumo Logic”. If you want a quick introduction to “Sumo Logic”, this topic will be helpful, without going into the detailed documentation. Sumo Logic is designed to help you manage and analyze your log files. It started out attempting to be a SaaS version of Splunk and has gone its own way as it matured, but as a result of those beginnings, it is one of the most feature-rich and enterprise-focused SaaS log management tools. Installation: Sumo Logic is a SaaS model, which means you’ll be setting up communication out to the Sumo Logic […]

Bigdata, Hadoop

SolrCloud in the CAP theorem world: a CP system that keeps availability in certain circumstances.

A SolrCloud cluster holds one or more distributed indexes which are called Collections. Each Collection is divided into shards (to increase write capacity) and each shard has one or more replicas (to increase query capacity). One replica from each shard is elected as a leader, who performs the additional task of adding a ‘version’ to each update before streaming it to available replicas. This means that write traffic for a particular shard hits the shard’s leader first and is then synchronously replicated to all available replicas. One Solr node (a JVM instance) may host a few replicas belonging to different […]

Hadoop

We just need to be better to each other before talking to AI

I’m not going to talk about statistics, machine learning, or AI, nor even compare data showing errors made by AI versus manual work. I believe that the fundamental problems of our time are ethical, not technological. If we can figure out that part, the technology should take care of itself. I would love to live in a post-scarcity utopia where we all run around self-actualizing. I don’t think we need to give up AI to get there — in fact, I think technology will be the key that unlocks the gates. We have to have the wisdom to […]

Hadoop

Apache Eagle: Real-time security monitoring solution

On January 10, 2017, the Apache Software Foundation, which consists of more than 350 open source projects and innovation initiatives, all developed under volunteer development, volunteer governance and incubator volunteers, announced that Apache Eagle had graduated from the Apache Incubator Program. Eagle originated at eBay, first built to solve large-scale Hadoop cluster monitoring issues. The team quickly realized that it would also be useful for the entire community, so in October 2015 the project was submitted to the Apache Incubator. Since then, Eagle has gained the attention of developers and organizations for its extensive usage scenarios, such as system/service […]

Fun

Book Review : The Folly of Fools

The Folly of Fools: The Logic of Deceit and Self-Deception in Human Life. I finished reading this book and developed an affinity with the author, Robert L. Trivers, for his remarkable writing. Although the book doesn’t seem to contain rigorous data collection or statistical analysis, it certainly has the theoretical foundations for what he came up with. Trivers makes a good point about self-deception among people in authority, and the observation that self-deception can be incredibly expensive seems very true. The points made on the “Iraq war” and “willful ignorance in NASA” are lacking data but convincing. I’ll close by observing that no matter […]

Hadoop

Google Cloud Platform(GCP) overview

Google Cloud Platform – GCP is a collection of various SaaS, PaaS, and IaaS services, and new services are still being launched day by day. There are a lot of services, including beta services, and we will not introduce everything in detail in this article; I am mainly going to introduce the services which become important when considering infrastructure. GCP service classification and overview of each service: in GCP, as in other clouds, there are many services, but in this document we classify those services into five categories: “execution environment service”, “storage service”, “network service”, “data […]

Hbase, HbaseFcsk

Hbase Administration using HBaseFsck (hbck) and other tools…

HBaseFsck (hbck) is a tool for checking region consistency and table integrity problems and for repairing a corrupted HBase. Sometimes we need to run hbck at regular intervals because some inconsistencies can be transient (e.g. the cluster is starting up or a region is splitting). Operationally, you may want to run hbck regularly and set up an alert (e.g. via Nagios) if it repeatedly reports inconsistencies. A run of hbck will report a list of inconsistencies along with a brief description of the regions and tables affected. Simple commands to run hbck are below: hbase hbck or hbase hbck -details If you […]
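Beyond the read-only check, hbck in HBase 1.x ships a family of repair flags; a few commonly cited ones are sketched below (run repairs only after understanding the inconsistency being reported):

# report only, no changes
hbase hbck -details

# repair region assignment problems (regions not deployed, or deployed on the wrong server)
hbase hbck -fixAssignments

# repair holes or inconsistencies in the hbase:meta table
hbase hbck -fixMeta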

Fun, Uncategorized

My advice for working under a Zombie project and in a toxic workplace…

I would like to say that there’s a chance of salvaging a Zombie project, but there probably isn’t, and it’s not your fault. I have been in this kind of situation before, and I would suggest that your course of action depends on your core morals. Here are a few recommendations from my past to save your future: 1. Do your job, find a better place, switch. 2. If respect is not reciprocal in your project/company, quit. 3. If you get only canned questions, quit. 4. If questions come like machine-gun fire without any follow-up, quit. 5. If your boss is […]

Bigdata, Messaging System

ZeroMQ Part-1

Programs, like people, need to communicate, and for them we have UDP, TCP, HTTP, IPX, the WebSocket protocol and other related application protocols to connect. But the underlying protocols are difficult to work with directly. We need something high-level, scalable, and easy to use, and that is ZeroMQ (ØMQ). ØMQ gives us advanced levels of availability and speed. ØMQ is a neat messaging library that allows us to build our own messaging infrastructure. It can help build a framework that scales, where services could be handled by different applications. I am inclined to Python programming, and luckily pyzmq provides Python bindings […]
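As a first taste of pyzmq, a minimal request/reply pair (the port number is arbitrary; run the server in one process and the client in another):

# server.py - replies to each request
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)          # reply socket
sock.bind("tcp://*:5555")
while True:
    msg = sock.recv_string()
    sock.send_string("echo: " + msg)

# client.py - sends a request and waits for the reply
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)          # request socket
sock.connect("tcp://localhost:5555")
sock.send_string("hello")
print(sock.recv_string())           # "echo: hello"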

Python

Content Data Store(CDS) Compressing and enhancing technique…

We are aggressively adding new features to the Content Data Store (CDS) system. One of the features that I am going to discuss here is the compression technique (a BigData application is incomplete without compression). And what if I tell you that in CDS we use compression along with enhancement of visual images/scanned documents? Our compression technique has two additional features: Smaller — reduce file size and save 80% of space compared to your image/scanned document. Clearer — isolate the foreground color by identifying the background color and choosing a small number of representative colors. Another important feature is performance. We don’t use the APIs provided by Office Lens or others; instead we have a small python […]

Analytics, Bigdata, Framework, Hadoop, RHadoop

Install and smoketest R and RHadoop on Hortonworks Data Platform (HDP25-CentOS7)

Before going into the installation steps, I’d like to give a small introduction to RHadoop. What is RHadoop? RHadoop is an open source project for combining R and Hadoop. It contains 4 packages that combine R with different projects from the Hadoop ecosystem, and 1 package that adapts some functions to fit the MapReduce framework: rhdfs combines Hadoop’s HDFS with R; rhbase combines Hadoop’s HBase with R; rmr2 combines Hadoop’s MapReduce 2 with R; ravro combines Hadoop’s Avro with R; plyrmr provides a familiar plyr-like interface over MapReduce. You can reference the official GitHub of RHadoop: https://github.com/RevolutionAnalytics/RHadoop Requirements: First of all, I have installed HDP2.5 […]

Framework, Python

Almost Everything in Python!!!

A curated list of Python frameworks, libraries, software and resources, inspired by awesome-php. Awesome Python covers: Environment Management, Package Management, Package Repositories, Distribution, Build Tools, Interactive Interpreter, Files, Date and Time, Text Processing, Specific Formats Processing, Natural Language Processing, Documentation, Configuration, Command-line Tools, Downloader, Imagery, OCR, Audio, Video, Geolocation, HTTP, Database, Database Drivers, ORM, Web Frameworks, Serverless Frameworks, Permissions, CMS, E-commerce, RESTful API, Serialization, Authentication, Template Engine, Queue, Search, News Feed, Asset Management, Caching, Email, Internationalization, URL Manipulation, HTML Manipulation, Web Crawling, Web Content Extracting, Forms, Data Validation, Anti-spam, Tagging, Admin Panels, Static Site Generator, Processes, Concurrency and Parallelism, Networking, WebSocket […]

Tesseract

Converting PDF to Text using Tesseract…

Tesseract is unable to handle PDF files directly, therefore files are first converted to TIFF using Ghostscript before being passed to Tesseract. Since Tesseract cannot process PDF files, and in addition cannot process multi-page TIFFs (images), Ghostscript goes along with it to complete the task. I am using the command below to process multiple TIFF files: for i in *.tiff ; do tesseract $i $i; done; When we run Ghostscript and pass it a PDF file to process, it generates one TIFF file for each page of our PDF. Run the command below to process a PDF file using Ghostscript: gs -dNOPAUSE […]

Analytics, Tesseract

OCR – “Optical Character Recognition”, Set up Tesseract OCR on Centos 6.8…

OCR means “Optical Character Recognition”, and Tesseract is licensed under the Apache License v2.0. A system with Tesseract OCR configured is able to convert images with embedded text into text files. This “how to install” tutorial is meant as a practical guide; it does not cover the theoretical background/concepts of OCR or the algorithms used in Tesseract. They are treated in lots of other documents on the web. Tesseract installation is supported beautifully on Ubuntu without issues (thanks to apt-get), but on CentOS it requires some effort and the correct versions to build. Please follow the steps below for Tesseract installation on CentOS: 1. OS update using yum. Setup CentOS 6.8 […]

Analytics, Hadoop

Advertisement attributes or Ad Attributes…An Idea!!!

Some time ago I was working on an idea called Ad Attributes, or Advertisement Attributes. I’d like to share my thoughts on this idea with the audience. Advertisement attributes are for creating a favorable selling climate. Today consumers are constantly targeted with product information by marketing companies. Consumers are faced with numerous advertisements carrying vast information on products; thus consumers use a heuristics approach to help them make their purchasing decisions. This approach basically uses mental shortcuts to streamline the selection process cognitively, to avoid being puzzled or paralyzed by the huge number of products offered in the […]

Hadoop, Hbase, Hive

JRuby code to purge data on Hbase over Hive table…

Problem to solve: how to delete/update/query binary-format values stored in an HBase column family column. With a Hive table over an HBase table, where we can’t use the standard API and are unable to apply filters on binary values, you can use the solution below for programmability. Find the JRuby source code at github.com/mkjmkumar/JRuby_HBase_API. This program, written in JRuby, purges data using the HBase shell and deletes the required data by applying a filter on a given binary column. You have already heard many advantages of storing data in HBase (especially in binary block format) and creating a Hive table on top of that to query your data. I am not going to explain the use case for this, why […]

Hadoop, Hive, Java, Pig, Python

Python and Python bites

Python and Python bites: “lambda”. Hi everyone, this article shows you one powerful feature of the Python programming language called “lambda”. It can solve a small problem in a single line of code. So let’s start at the beginning of your interesting, or maybe future, programming language. Anonymous functions created at runtime are known as lambda functions. The lines below define an ordinary function in Python:

>>> def f(x): return x + 42
>>> print f(21)
63

For lambda functions:

>>> calc = lambda x: x + 42
>>> calc(21)
63

A lambda definition does not include a “return” statement; it always contains a single expression, which is returned. Also […]

Database, GPU, PostgreSQL

PG-Storm: Let PostgreSQL run faster on the GPU

The PostgreSQL extension PG-Storm allows users to customize the data scan and run queries faster. CPU-intensive workloads are identified and transferred to the GPU to take advantage of its powerful parallel execution ability to complete the data task. Compared with CPUs, which combine a small number of cores with limited RAM bandwidth, the GPU has a unique advantage: GPUs typically have hundreds of processor cores and RAM bandwidths that are several times larger than CPUs’. They can handle large numbers of computations in parallel, so their operations are very efficient. PG-Storm is based on two basic ideas: on-the-fly native GPU code generation […]

Hadoop, Kylin, Security

Past and Future of Apache Kylin!!!

Short Description: Apache Kylin (Chinese: Kirin) appeared to solve analytical problems on Hadoop. Article: the origin of Apache Kylin. In today’s era of big data, Hadoop has become the de facto standard, and a large number of tools have been built one after another around the Hadoop platform to address the needs of different scenarios. For example, Hadoop Hive is a data warehouse tool: data files stored on the HDFS distributed file system can be mapped to a database table and queried with SQL. Hive’s execution engine converts SQL into MapReduce tasks to run, which is ideally suited for data warehouse analysis. […]

Hadoop, HDFS

Heterogeneous Storage in HDFS(Part-1)…

An introduction to heterogeneous storage types and the flexible configuration of heterogeneous storage! Heterogeneous Storage in HDFS: Hadoop version 2.6.0 introduced a new feature, heterogeneous storage. Heterogeneous storage lets each kind of storage medium play to its own read and write characteristics. This is very suitable for cold storage of data: cold data calls for storage with large capacity where high read and write performance is not required, such as the most common disks, while for hot data an SSD can be used instead. On the other hand, when we require […]
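In practice, the feature is driven by tagging each datanode directory with a storage type and then attaching a storage policy to a path; a sketch of the common shape (the paths and the chosen policy are illustrative):

<!-- hdfs-site.xml: tag each data dir with its medium -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[DISK]/hadoop/hdfs/disk0,[SSD]/hadoop/hdfs/ssd0,[ARCHIVE]/hadoop/hdfs/archive0</value>
</property>

Then a policy is pinned onto a directory from the command line:

# keep a directory of cold data on ARCHIVE storage
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD
hdfs storagepolicies -getStoragePolicy -path /data/cold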

Cloudera, encryption, HDFS, Security

A Step-by-Step Guide to HDFS Data Protection Solution for Your Organization on Cloudera CHD

An enterprise-ready encryption solution should provide the following: comprehensive encryption of data wherever it resides, including structured and unstructured data at rest and data in motion. HDFS encryption implements transparent, end-to-end encryption of data read from and written to HDFS, without requiring changes to application code. Centralized encryption and key management: a centralized solution will enable you to protect and manage both the data and the keys, securing the data by encrypting or tokenizing it while controlling access to the protected data. This guide will help you through enabling HDFS encryption on your cluster, using the default Java KeyStore KMS. If […]
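The core of the procedure boils down to creating a key in the KMS and marking a directory as an encryption zone; a sketch with placeholder names:

# create an encryption key in the KMS
hadoop key create mykey

# make an empty directory and turn it into an encryption zone backed by that key
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName mykey -path /secure

# verify
hdfs crypto -listZones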

Security, Shiro

Apache Shiro design is intuitive and a simple way to ensure the safety of the application…

Short Description: Apache Shiro’s design goals are to simplify application security by being intuitive and easy to use… Article: Apache Shiro’s design is an intuitive and simple way to ensure the safety of an application. Software design is generally realized from user stories; that is, the user interface or service API is designed based on how users interact with the system. For example, a user story might say that after a user logs on, a button is displayed to view personal account information, and if the user is not registered, a registration button is displayed. This user story implies the major application user […]

Best Practices, Hadoop, Hive

Performance utilities in Hive

Before taking you into the details of the utilities provided by Hive, let me explain a few components to convey the execution flow and where the related information is stored in the system. Hive is a data warehouse software best suited for OLAP (OnLine Analytical Processing) workloads, to handle and query vast volumes of data residing in distributed storage. The Hadoop Distributed File System (HDFS) is the ecosystem in which Hive maintains data reliably and survives hardware failures. Hive is the only SQL-like relational big data warehousing approach developed on top of Hadoop. HiveQL, as described, is an SQL-like query language for […]
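To anchor the discussion, two stock Hive utilities along these lines (the table name is illustrative; whether the article covers these exact ones is cut off by the excerpt):

-- show the execution plan Hive generates for a query
EXPLAIN SELECT category, count(*) FROM products GROUP BY category;

-- gather table statistics that the optimizer can use
ANALYZE TABLE products COMPUTE STATISTICS;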

Best Practices, Database, Hive

Best Practices for Hive Authorization when using connector to HiveServer2

Recently we have been working with Presto, configuring the Hive connector for it. It connected successfully with the steps given at prestodb.io/docs/current/connector/hive.html. An overview of our architecture: Presto runs on a different machine (the Presto machine) and uses the Hive connector to communicate with the Hadoop cluster, which runs on other machines. The Presto machine has a hive.properties file, which tells Presto to use a Thrift connection to the Hive client, and hdfs-site.xml and core-site.xml files for HDFS. Below is the architecture of our environment. Below is the command to invoke Presto: /presto --server XX.X.X.XX:9080 --catalog hive No presto user exists in […]

Database, Hbase, Tephra

Tephra is an open-source project that adds complete transaction support to Apache HBase…

Transaction support in HBase? Yes, a wide range of use cases require transaction support. Firstly, we want the client to have great insight into, and fine-grained control over, what the transaction system can do. Having full control on the client side not only allows you to make the best decisions for optimizing specific use cases, but it also makes integration with third-party systems simpler. Secondly, when different types of components in your application share data and update it in multiple data stores in many different ways (Hadoop applications), it is important for the transaction system to support you. Thirdly, […]

Database, HPL

HPL/SQL Make SQL-on-Hadoop More Dynamic

Think about the old days when we solved many business problems using dynamic SQL, exception handling, flow-of-control and iterations. Now, having worked with a couple of migration projects, I found a few business rules that need to be transformed to be Hive-compatible (some of them are very complex and nearly impossible). The solution is HPL/SQL (formerly PL/HQL), a language translation and execution layer developed by Dmitry Tolpeko (http://www.hplsql.org/). Why HPL/SQL? The role of Hadoop in data warehousing is huge. But to implement comprehensive ETL, reporting, analytics and data mining processes you not only need distributed processing engines such as MapReduce, Spark or Tez, you […]

Hadoop, Hive, Oozie

Coding Tips and Best Practice in Hive and Oozie…

Many times during code review I have found some common mistakes made by developers. Here are a few of them… Workflow mandatory item: add this property in all workflows that have a Hive action. This property will make sure that the Hive job runs with the necessary number of reducers instead of just 1:

<property>
  <name>mapreduce.job.reduces</name>
  <value>-1</value>
</property>

HQL items — setting properties: keep the set properties in the HQL to a minimum. Let them take the default values, and add only what is absolutely necessary for that script. If you are using older code as a template, do not […]

Health

Out of the Box(Why Women Live Longer than Men)

The fact is men enjoy life more, but in the end the winners are women, because they always get extra bits of years (these bits are sometimes in GBs of ten years of extra life compared to men). I am not a subject matter expert, but some questions around me led me to dig more and find a few possible connections to life. Hope you’ll enjoy this article and gain more understanding of life (after all, we all have one). Below are the few factors that matter/contribute to why women, as a group, live longer than men. 1. The death rates for women are lower than those for […]

Hadoop, HDFS

HDFS is really not designed for many small files!!!

A few of my friends new to Hadoop frequently ask what a good file size is for Hadoop and how to decide file size. Obviously it should not be small, and file size should be in line with the block size. HDFS is really not designed for many small files. For each file, the client has to talk to the namenode, which gives it the location(s) of the block(s) of the file, and then the client streams the data from the datanode. Now, in the best case, the client does this once, and then finds that it is the machine with […]
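One way to feel the cost: each file and each block is an object in namenode memory, commonly estimated at roughly 150 bytes apiece; a back-of-the-envelope sketch (the 150-byte figure is the usual rule of thumb, not an exact constant):

BYTES_PER_OBJECT = 150          # rough rule of thumb for namenode heap per object

def namenode_heap_gb(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)   # one file object + its block objects
    return objects * BYTES_PER_OBJECT / 1024**3

print(namenode_heap_gb(10_000_000))     # ~2.8 GB of heap for 10M one-block small files
print(namenode_heap_gb(100_000, 100))   # ~1.4 GB for 100K large files of 100 blocks each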

Hadoop, HDFS, Hive

HBase Replication and comparison with popular online backup programs…

Short Description: HBase Replication — an HBase replication solution can address cluster security, data security, read/write separation, and operations. Article: This is the first of three articles; the next articles will come with some code and the mechanisms present in the latest version of HBase supporting HBase replication. HBase Replication: an HBase replication solution can address cluster security, data security, read/write separation, operations and maintenance, and operator errors, and with its ease of management and configuration it provides powerful support for online applications. HBase replication is currently rarely used in the industry, because there are many aspects, such […]

Apache Spark, Spark

Introduction to Spark

Introduction to Apache Spark: Spark, as a unified stack and computational engine, is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines. Over time the big data experts around the world have derived specialized systems on top of Hadoop to solve certain problems like graph processing, implementation of efficient iterative algorithms, real-time query engines, etc. As you may know, all the other components like Impala, Mahout, Tez, GraphLab etc. are derived from Hadoop for different purposes. What is Apache Spark? Apache Spark is a generalized engine which combines the specialties of all […]

Hadoop, Kafka

Kafka: A detail introduction

I’ll cover Kafka in detail, with an introduction to programmability, and will try to cover almost the full architecture of it. So here it goes: we need Kafka when there is a need for building a real-time processing system, as Kafka is a high-performance publisher-subscriber-based messaging system with highly scalable properties. Traditional systems are unable to process such large data and are mainly meant for offline analysis; Kafka is a solution to the real-time problems of any software solution — that is to say, it unifies offline and online data processing and routes it to multiple consumers quickly. Below are the characteristics of Kafka: Persistent messaging: – […]

Best Practices, Hive

Hive Naming conventions and database naming…

Short Description: Naming conventions help both programmers and architects understand what is going on inside a business. Article: I have worked with almost 20 to 25 applications. Whenever I start working, I first have to understand each application’s naming convention, and I keep thinking: why do we all not follow a single naming convention? As Hadoop is evolving rapidly, I would like to share my naming convention, so that if you come to my project you will feel comfortable, and so will I if you follow it too. Database names: If an application serves a technology, then the database name would be […]

Bigdata, Hadoop, NoSql

The ACID properties and the CAP theorem are two important concepts in data management for distributed systems.

Started working on HBase again!! I thought, why not refresh a few concepts before proceeding to the actual work? The important things that come to mind when we work with NoSQL in a distributed environment are sharding and partitions. Let’s dive into the ACID properties of databases and the CAP theorem for distributed systems. The ACID properties and the CAP theorem are two concepts in data management for distributed systems. The funny thing is they both come with a “C” with totally different meanings. What is ACID? It is a set of rules that means a lot for RDBMSs, because all RDBMSs are ACID compliant. A = Atomicity, meaning all or nothing: if I […]

Analytics, Hadoop

Data Analysis Approach to a successful outcome

I have done data analysis for one of my projects using the approach below, and hopefully it may help you understand the underlying subject. Soon I’ll post my project on data analysis and a detailed description of the technology used: Python (web scraping for data collection), Hadoop, Spark and R. Data analysis is a highly iterative and non-linear process, better reflected as a series of cyclic processes, in which information is learned at each step, which then informs whether (and how) to refine, and redo, the step that was just performed, or whether (and how) to proceed to the next step. Setting the scene: data analysis is […]