Analytics, Apache Spark, Apache Storm, Bigdata, Hadoop

Apache Storm key takeaways…

Hadoop moves the code to the data; Storm moves the data to the code. This behavior makes more sense in a stream-processing system because, unlike in a batch job, the data set isn’t known beforehand. Instead, the data flows continuously through the code. A Storm cluster consists of two types of nodes: the master node and the worker nodes. The master node runs a daemon called Nimbus, and each worker node runs a daemon called a Supervisor. The master node can be thought of as the control center. In addition to the other responsibilities, this is where […]

Bigdata, Data Science, Exploratory Data Analysis, Machine Learning

ROC curve and performance parameters of a classification model…

When we evaluate a model, we analyze a few parameters to verify its performance. These parameters are derived from the confusion matrix. The most frequently used performance parameters are accuracy, precision, recall and F1 score. Let me explain what they are in this article so that when we talk about our model in later articles, we are not confused by the terms. So let’s say our model is ready and we want to know how good it is. These terms help the audience of our hypothesis understand how good the predictions are. […]
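As a minimal sketch of how these four parameters follow from a binary confusion matrix, the Python snippet below computes them from raw counts. The function name `classification_metrics` and the example counts are assumptions made for illustration, not taken from the article.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative and
    true-negative counts of a binary classifier (hypothetical helper).
    """
    total = tp + fp + fn + tn
    # Accuracy: share of all predictions that were correct.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything predicted positive, how much really was positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, purely for illustration.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The guards against division by zero are a design choice for the sketch; libraries differ in how they report a metric that is undefined for a given matrix.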

Bigdata, HDP Search, Solr, SolrCloud

SolrCloud vs HDPSearch…

Let us start by clearing up some confusion about SolrCloud and HDPSearch. First, what is SolrCloud? Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability. Called SolrCloud, these capabilities provide distributed indexing and search, supporting the following features: central configuration for the entire cluster; automatic load balancing and fail-over for queries; ZooKeeper integration for cluster coordination and configuration. Let’s also clear up some confusion about Solr and SolrCloud (ZooKeeper coordination, Solr with HDFS, HA mode): Solr and SolrCloud are not separate things; Solr is the application while […]

Apache Spark, Hbase

Multiple WAL in Apache HBase 1.3 and performance enhancements!!!

Apache HBase 1.3.0 was released in mid-January 2017 and ships with support for date-based tiered compaction and improvements in multiple areas, such as the write-ahead log (WAL) and a new RPC scheduler, among others. The release includes almost 1,700 resolved issues in total. Below are some highlights of the enhancements made in HBase 1.3.0: The “date-based tiered compaction” support shipped in HBase 1.3.0 is beneficial for workloads where data is infrequently deleted or updated and recent data is scanned more often than older data. Record time-to-live (TTL) can be easily enforced with this new compaction strategy. Improved multiple WAL support in Apache HBase […]


Converting PDF to Text using Tesseract…

Tesseract is unable to handle PDF files directly, so each file is first converted to TIFF using Ghostscript before being passed to Tesseract. In addition, Tesseract cannot process multi-page TIFF images, so Ghostscript is used alongside it to complete the task. I am using the command below to process multiple TIFF files: for i in *.tiff; do tesseract "$i" "$i"; done When we run Ghostscript and pass it a PDF file, it generates one TIFF file for each page of the PDF. Run the command below to process a PDF file using Ghostscript: gs -dNOPAUSE […]

Analytics, Hadoop

Advertisement attributes or Ad Attributes…An Idea!!!

Some time ago I was working on an idea called Ad Attributes, or Advertisement Attributes. I’d like to share my thoughts on this idea with the audience. Advertisement attributes are for creating a favorable selling climate. Today, consumers are constantly targeted with product information by marketing companies and face numerous advertisements carrying vast amounts of information on products. Consumers therefore use a heuristic approach to help them make their purchasing decisions. This approach basically uses mental shortcuts to streamline the selection process cognitively. This is to avoid being puzzled or paralyzed by the huge number of products offered in the […]

Hadoop, Hive, Java, Pig, Python

Python and Python bites

Python and Python bites “lambda”. Hi everyone, this article shows you one powerful feature of the Python programming language called “lambda”. It can solve many small problems in a single line of code. So let’s get started with this interesting, and maybe your future, programming language. Anonymous functions created at runtime are known as lambda functions. The line below defines an ordinary function in Python: >>> def f(x): return x + 42 >>> print(f(21)) 63 For lambda functions: >>> calc = lambda x: x + 42 >>> calc(21) 63 A lambda definition does not include a “return” statement; it always contains an expression whose value is returned. Also […]
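The comparison in the excerpt can be written as a small runnable Python 3 script. This is only a sketch: the `words` list and the `by_length` name are assumptions added to show a typical place where a lambda is handy, namely as a throwaway `key` function.

```python
# An ordinary named function.
def f(x):
    return x + 42

# The equivalent lambda: a single expression whose value is returned implicitly.
calc = lambda x: x + 42

print(f(21))     # 63
print(calc(21))  # 63

# Lambdas are most useful inline, e.g. as a sort key (illustrative data).
words = ["pig", "hadoop", "hive"]
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['pig', 'hive', 'hadoop']
```

Binding a lambda to a name, as `calc` does, mirrors the excerpt. In everyday code a `def` is usually preferred for anything that needs a name.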

Security, Shiro

Apache Shiro’s design is an intuitive and simple way to ensure the safety of the application…

Short Description: Apache Shiro’s design goals are to simplify application security by being intuitive and easy to use… Article: Apache Shiro’s design is an intuitive and simple way to ensure the safety of the application. Software design is generally driven by user stories; that is, the user interface or service API is designed based on how users interact with the system. For example, a user story might say that after a user logs in, a button to view personal account information is displayed; if the user is not registered, a registration button is displayed instead. This user story implies major application user […]

Database, HPL

HPL/SQL Makes SQL-on-Hadoop More Dynamic

Think about the old days when we solved many business problems using dynamic SQL, exception handling, flow of control and iteration. While working on a couple of migration projects, I found a few business rules that needed to be transformed to be Hive compatible (some of them are very complex and nearly impossible to convert). The solution is HPL/SQL (formerly PL/HQL), a language translation and execution layer developed by Dmitry Tolpeko. Why HPL/SQL? The role of Hadoop in data warehousing is huge. But to implement comprehensive ETL, reporting, analytics and data mining processes you not only need distributed processing engines such as MapReduce, Spark or Tez, you […]

Hadoop, Kafka

Kafka: A detailed introduction

I’ll cover Kafka in detail, with an introduction to programmability, and will try to cover almost its full architecture. So here it goes: We need Kafka when there is a need to build a real-time processing system, as Kafka is a high-performance, highly scalable publisher-subscriber messaging system. Traditional systems are unable to process data at this scale and are mainly used for offline analysis. Kafka is a solution to the real-time problems of any software solution; that is to say, it unifies offline and online data processing and routes the data to multiple consumers quickly. Below are the characteristics of Kafka: Persistent messaging: […]

Bigdata, Hadoop, NoSql

The ACID properties and the CAP theorem are two concepts in data management for distributed systems.

Started working on HBase again!! I thought, why not refresh a few concepts before proceeding to the actual work? The important things that come to mind when we work with NoSQL in a distributed environment are sharding and partitions. Let’s dive into the ACID properties of databases and the CAP theorem for distributed systems. The ACID properties and the CAP theorem are two concepts in data management for distributed systems. Funny thing: they both come with a “C”, but with totally different meanings. What is ACID: It is a set of rules that means a lot for RDBMSs, because RDBMSs are generally ACID compliant. A = Atomicity, which means all or nothing; if I […]
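To make the “all or nothing” idea of atomicity concrete, here is a small sketch using Python’s built-in `sqlite3` module. The `accounts` table, the `transfer` helper and the balances are hypothetical and chosen only for illustration; the article itself is about RDBMSs in general, not SQLite.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic unit (hypothetical helper)."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back every statement in the block if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds: both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # fails: the debit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer debits the source account before the check fails, yet the final balances show no trace of that debit: either both updates happen or neither does.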

Analytics, Hadoop

Data Analysis Approach to a successful outcome

I did the data analysis for one of my projects using the approach below, and hopefully it helps you understand the underlying subject. Soon I’ll post my data analysis project with a detailed description of the technology used: Python (web scraping for data collection), Hadoop, Spark and R. Data analysis is a highly iterative and non-linear process, better reflected by a series of cycles in which information is learned at each step, which then informs whether (and how) to refine and redo the step that was just performed, or whether (and how) to proceed to the next step. Setting the Scene: Data analysis is […]