Analytics, Apache Spark, Apache Storm, Bigdata, Hadoop

Apache Storm key takeaways…

Hadoop moves the code to the data; Storm moves the data to the code. This behavior makes more sense in a stream-processing system, because the data set isn’t known beforehand, unlike in a batch job. Also, the data set is continuously flowing through the code. A Storm cluster consists of two types of nodes: the master node and the worker nodes. The master node runs a daemon called Nimbus, and the worker nodes each run a daemon called a Supervisor. The master node can be thought of as the control center. In addition to its other responsibilities, this is where […]
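To make “moving the data to the code” concrete, here is a minimal bolt sketch using the streamparse Python bindings (an assumption of mine; the article itself does not name a client library, and the topology wiring is omitted). Tuples stream through the process method as the Supervisor-managed workers receive them:

```python
# Sketch of a word-count bolt; streamparse and a topology project are assumed.
from collections import Counter
from streamparse import Bolt

class WordCountBolt(Bolt):
    def initialize(self, conf, ctx):
        # Called once when the worker starts this bolt.
        self.counts = Counter()

    def process(self, tup):
        # Called for every tuple that streams through this bolt.
        word = tup.values[0]
        self.counts[word] += 1
        self.emit([word, self.counts[word]])
```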

Analytics, Bigdata, Data Science, Exploratory Data Analysis, Machine Learning

Bayesian-posterior imagination and applications…

Before going into Bayes’ theorem and posterior probability, let us first understand a few terms we are going to use. Conditional probability and independence: a conditional probability is the probability of one event given that another event occurred. In the “die-toss” example, the probability of event A, three dots showing, is P(A) = 1/6 on a single toss. But what if we know that event B, at least three dots showing, occurred? Then there are only four possible outcomes, one of which is A. The probability of A = {3} is 1/4, given that B = {3, 4, 5, 6} occurred. […]
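The die-toss numbers follow directly from the definition P(A|B) = P(A ∩ B) / P(B). A quick sketch that verifies them by enumeration (the variable names are mine):

```python
# Verify the die-toss conditional probability by counting outcomes.
from fractions import Fraction

A = {3}             # event A: three dots showing
B = {3, 4, 5, 6}    # event B: at least three dots showing

p_A = Fraction(len(A), 6)
p_B = Fraction(len(B), 6)
p_A_given_B = Fraction(len(A & B), 6) / p_B

print(p_A)           # 1/6
print(p_A_given_B)   # 1/4
```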

Analytics, Best Practices, Bigdata, Exploratory Data Analysis, Machine Learning

Residual Plots for Regression Analysis…

In my last article I promised to show you the parameters used to judge the accuracy and predictions of a regression model, but before going into that we first need to understand the importance of residual plots. Without understanding residual plots, the discussion on regression would be incomplete. Using residual analysis we can verify whether our model is linear or nonlinear. Residual plots reveal unwanted residual patterns that indicate biased results; you just need to learn to spot them through visualization. In residual analysis we check that the residuals are randomly scattered around zero for the entire range of fitted values. […]
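Here is a minimal sketch of such a plot (synthetic data of my own; numpy and matplotlib assumed). For a well-specified linear model, the points should show no pattern around the zero line:

```python
# Fit a simple linear model to roughly linear data and plot its residuals.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.5 * x + rng.normal(0, 1, 100)

slope, intercept = np.polyfit(x, y, 1)   # ordinary least-squares line
fitted = slope * x + intercept
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, color="red")              # residuals should scatter randomly around zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```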

Bigdata, Notebook, Python

JavaScript Issue resolution in JupyterLab Notebook

Graphs were not appearing in my JupyterLab notebook, and the error message said “JavaScript output is disabled in JupyterLab”. At first it seemed I just needed to enable it from the notebook itself, but a few sites say JupyterLab does not support it yet, which is frustrating.

```python
# matplotlib submodule pyplot
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 100, 0.5)
y = 2 * np.sqrt(x)
plt.plot(x, y)
plt.show()
```

This produced only “JavaScript output is disabled in JupyterLab”. So to solve this issue and enable the extension, first stop your notebook and use the command below.

```
C:\Users\victor>jupyter nbextension enable --py --sys-prefix widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension…
```

[…]

Best Practices, Bigdata, Data Science, Exploratory Data Analysis, Machine Learning

Ordinary least squares regression (OLSR)

Invented in 1795 by Carl Friedrich Gauss, ordinary least squares regression (OLSR) is considered one of the earliest known general prediction methods. OLSR is a generalized linear modeling technique. It is used for estimating all unknown parameters involved in a linear regression model, the goal being to minimize the sum of the squared differences between the observed values and the values predicted from the explanatory variables. Ordinary least squares regression is also known as ordinary least squares or least squared errors regression. Let’s start with a linear regression model like the one below. Here are a few terms we use when we […]
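As a concrete sketch of that goal (my own toy data; numpy assumed), the estimated coefficients are exactly the ones minimizing the sum of squared residuals:

```python
# Estimate intercept and slope by ordinary least squares.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 50)   # true intercept 2, slope 3, plus noise

X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                                # approximately [2.0, 3.0]
```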

Bigdata, Data Science, Exploratory Data Analysis, Machine Learning

ROC curve and performance parameters of a classification model…

When we evaluate a model, we analyze a few parameters to verify its performance. These parameters demonstrate the performance of our model using confusion matrices. The most frequently used performance parameters are Accuracy, Precision, Recall, and F1 score. Let me give you an idea of what they are in this article, so that when we talk about our model in the next articles you will not be confused by the terms. So let’s say our model is ready and we want to know how good it is. These terms help the audience of our hypothesis understand how good the predictions are. […]
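A quick sketch of the four metrics on toy labels (scikit-learn assumed; the data is made up):

```python
# Compute accuracy, precision, recall, and F1 from toy predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1 score :", f1_score(y_true, y_pred))         # 0.75
```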

Bigdata, Database

Encourage you to switch to Jupyter Lab…

Notebooks are great for prototyping and for longer pipelines or processes. If you are a user of PyCharm or Jupyter Notebook and an exploratory data scientist, I would encourage you to switch to JupyterLab. For JupyterLab installation steps go here. Below are some of the advantages that I see in using JupyterLab over Jupyter Notebook: The new terminal opens as a tab, easier to use by comparison. The ability to lay out multiple windows easily, much like an IDE. This will make working on a remote server so much nicer; just start JupyterLab and an ssh tunnel and you have a […]

Analytics, Bigdata, Database

Why and when we need Machine Learning…

I’ve been in data management/data quality for several years. When I ask some people what their data management processes are, they simply reply, “well, we have some of our data stored in a database and other data stored on file shares with proper permissions.” This isn’t data management…it’s data storage. If you and/or your organization don’t have good, clean data, you are most definitely not ready for machine learning. Data management should be your first step before diving into any other data project(s). Now I’d say if you have good data management and data tagged for machine learning, give yourself a pause and […]

Bigdata, Database, Python

Python Lists and Lambda Learning…

There are many ways to use Python’s lists and lambdas. Here I am going to show some useful tips and tricks, starting with lists. Below are the list operations we use most of the time.

```python
>>> a = [66.6, 333, 333, 1, 1234.5]
>>> print a.count(333), a.count(66.6), a.count('x')
2 1 0
>>> a.insert(2, -1)
>>> a.append(333)
>>> a
[66.6, 333, -1, 333, 1, 1234.5, 333]
>>> a.index(333)
1
>>> a.index(333, 2)
3
>>> a.remove(333)
>>> a
[66.6, -1, 333, 1, 1234.5, 333]
>>> a.reverse()
>>> a
[333, 1234.5, 1, […]
```
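The excerpt cuts off before the lambda examples, so here is a small sketch of the kind of list-plus-lambda usage the title promises (my own data, not from the article):

```python
>>> a = [66.6, 333, -1, 1234.5]
>>> sorted(a, key=lambda v: abs(v))      # sort by magnitude with a lambda key
[-1, 66.6, 333, 1234.5]
>>> list(filter(lambda v: v > 0, a))     # keep only the positive values
[66.6, 333, 1234.5]
```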

Apache Spark, Bigdata, Database

Apache Spark RDD API using Pyspark…

In my previous article I used Scala to show the usability of the Spark RDD API. Many of us use PySpark to work with RDDs and lambda functions. Though the function names and output are the same as in Scala, the syntax for RDD operations in PySpark is different. I’ll explain PySpark RDDs here using a different approach and a different perspective to solve the problem. Let us consider that we are streaming data using Spark, we have created an RDD from this streaming application, and we want to perform RDD operations on this stream of data in a particular time interval. Here I am […]
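As a warm-up, here is a minimal PySpark RDD-plus-lambda sketch (a local SparkContext and made-up numbers; the article itself works against a streaming source):

```python
# Apply lambda-based transformations to an RDD and collect the result.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")
rdd = sc.parallelize([1, 2, 3, 4, 5])

squares = rdd.map(lambda x: x * x)            # transform each element
evens = squares.filter(lambda x: x % 2 == 0)  # keep even squares only
print(evens.collect())                        # [4, 16]
sc.stop()
```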

Analytics, Bigdata, Python

How to convert Python list, tuples, strings to each other…

Lists, tuples, and strings are three built-in sequence types in Python. The three functions str(), tuple(), and list() convert between them, as the following example shows:

```python
>>> s = '123456'
>>> list(s)
['1', '2', '3', '4', '5', '6']
>>> tuple(s)
('1', '2', '3', '4', '5', '6')
>>> tuple(list(s))
('1', '2', '3', '4', '5', '6')
>>> list(tuple(s))
['1', '2', '3', '4', '5', '6']
>>> "".join(tuple(s))
'123456'
>>> "".join(list(s))
'123456'
>>> str(tuple(s))
"('1', '2', '3', '4', '5', '6')"
>>> str(list(s))
"['1', '2', '3', '4', '5', '6']"
```

Analytics, Apache Spark, Bigdata, Database

Tips and Tricks for Apache Spark RDD API, Dataframe API- Part -1

I am planning to share my knowledge of the Apache Spark RDD and Dataframe APIs along with some tips and tricks. If I combined everything into one piece it would be a very lengthy article, so I am dividing it into three separate articles, of which this is the first: 1. Spark RDD API; 2. Dataframe API; 3. Tips and tricks on the RDD API and Dataframe API. Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could […]
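That “spread across multiple machines” property is visible even on a local run; a short sketch (local SparkContext, toy data of mine):

```python
# An RDD is a partitioned dataset with functional APIs layered on top.
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-basics")
rdd = sc.parallelize(range(10), numSlices=4)  # data split across 4 partitions
print(rdd.getNumPartitions())                 # 4
print(rdd.map(lambda x: x + 1).sum())         # act on the data via the API: 55
sc.stop()
```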

Analytics, Bigdata, Kafka

Better late than never: Time to replace your micro-service architecture with Kafka…

Kafka has already spread through, and facilitated, many organizations in the micro-services architecture world. If Kafka is still not part of your infrastructure, it’s high time for you to adopt it. I am not promoting Kafka as better than any other message queue system, as many articles on this subject are already floating around the internet. Kafka’s uniqueness is that it provides both simple file-system and bridge functions. A Kafka broker’s most basic task is to write messages to, and read messages from, the log on disk as quickly as possible. A queued message will not be lost after persistence, which is […]
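A tiny producer sketch against that log (using the kafka-python client; the broker address and topic name are assumptions of mine):

```python
# Send one message to a Kafka topic; the broker appends it to its on-disk log.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"order-created")   # "events" is a hypothetical topic
producer.flush()                            # block until the send completes
producer.close()
```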

Analytics, Bigdata, Kafka

In-depth Kafka Message queue principles of high-reliability

At present, many open source distributed processing systems such as Cloudera, Apache Storm, Spark, and others support integration with Kafka. Kafka is increasingly being favored by many internet companies, which use it as one of their core messaging engines. The reliability of its messages is what lets Kafka stand as a commercial-grade messaging middleware solution. In this article, we will go through Kafka’s storage mechanism, replication principle, synchronization principle, and durability assurance to analyze its reliability. As shown in the figure above, a typical Kafka architecture includes several Producers (which can be server logs, business data, page views generated by […]
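On the producer side, much of that reliability is a configuration concern; a sketch of the durability-related knobs (kafka-python again; the values shown are illustrative, not recommendations):

```python
# Producer settings that trade latency for stronger delivery guarantees.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    acks="all",      # wait until all in-sync replicas have persisted the message
    retries=5,       # retry transient send failures before giving up
)
```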

Analytics, Bigdata, Kafka

Moving to communication of events between subsystems — CQRS-ES with open source…

Before going into definitions of EP, CEP, and CQRS, let us start with some basic database terms and the problem we are trying to address here. We have commercial databases, and database professionals have publicized CRUD operations a lot. This one-row-per-record pattern works well in most projects and is enough to build an application quickly and securely. I have probably implemented 100 CRUD projects (including web applications), and we do it that way because we have limited budgets and projects have deadlines. CRUD works well until someone asks for historical data, and I have seen a few managers complaining about the lack […]
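The historical-data complaint is exactly what event sourcing answers: instead of updating a row in place, every change is appended as an immutable event, and the current state is a fold over that history. A toy sketch (all names here are hypothetical):

```python
# Event sourcing in miniature: state is derived by replaying appended events.
events = []  # stand-in for an event store

def apply_event(state, event):
    kind, amount = event
    if kind == "AccountOpened":
        state["balance"] = amount
    elif kind == "MoneyDeposited":
        state["balance"] += amount
    return state

events.append(("AccountOpened", 100))
events.append(("MoneyDeposited", 50))

state = {}
for e in events:            # full history stays available for audits and queries
    state = apply_event(state, e)
print(state)                # {'balance': 150}
```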

Bigdata, HDP Search, Solr, SolrCloud

SolrCloud vs HDPSearch…

Let us start by removing some confusion related to SolrCloud and HDPSearch. First, what is SolrCloud? Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability, called SolrCloud. These capabilities provide distributed indexing and search, supporting the following features: central configuration for the entire cluster; automatic load balancing and fail-over for queries; ZooKeeper integration for cluster coordination and configuration. Let’s also clear up some confusion on Solr versus SolrCloud (ZooKeeper coordination, Solr with HDFS, HA mode): Solr and SolrCloud are not separate things; Solr is the application while […]

Administration, Bigdata, Hadoop

Apache Solr Search Installation on HDP2.6 using Yum Repo

As we know, “HDP 2.6” is not bundled with “HDP Search”, which includes Solr. Therefore, in a two-part article I am going to explain the ways to install Solr/SolrCloud/HDP_Search: 1. Apache Solr Search installation on HDP 2.6 using a yum repo. 2. Apache Solr Search installation on HDP 2.6 using an Ambari management pack. Each uses a different approach, which is why I have divided this into two articles. Recently I installed HDP 2.6 in one of my development environments. Now it’s time to bring the same services back one by one, as we are running HDP 2.5 in the production environment; one of them is […]

Bigdata, Database, Hadoop

Cloud Databases & Cloud Blob…

Cloud computing is the next stage in the evolution of the Internet. The cloud in cloud computing provides the means through which everything, from computing power to computing infrastructure, and from applications and business processes to personal collaboration, can be delivered to you as a service wherever and whenever you need it. Cloud databases are web-based services designed for running queries on structured data stored in cloud data services. Most of the time, these services work in conjunction with cloud compute resources to provide users the capability to store, process, and query data sets within the cloud environment. These services are designed to […]

Analytics, Bigdata, Framework, Hadoop

Sumo Logic : Log Management Tool

This is my first face-off with “Sumo Logic”. If you want a quick introduction to “Sumo Logic”, this article will be helpful without going into the detailed documentation. Sumo Logic is designed to help you manage and analyze your log files. It started out attempting to be a SaaS version of Splunk and has gone its own way as it matured, but as a result of those beginnings, it is one of the most feature-rich and enterprise-focused SaaS log management tools. Installation: Sumo Logic is a SaaS model, which means you’ll be setting up communication out to the Sumo Logic […]

Bigdata, Hadoop

SolrCloud in the CAP theorem world: a CP system that keeps availability in certain circumstances

A SolrCloud cluster holds one or more distributed indexes, which are called Collections. Each Collection is divided into shards (to increase write capacity) and each shard has one or more replicas (to increase query capacity). One replica from each shard is elected as a leader, which performs the additional task of adding a ‘version’ to each update before streaming it to the available replicas. This means that write traffic for a particular shard hits the shard’s leader first and is then synchronously replicated to all available replicas. One Solr node (a JVM instance) may host a few replicas belonging to different […]
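A sketch of how that Collection/shard/replica layout is requested through the SolrCloud Collections API (the host, collection name, and counts are assumptions of mine; any HTTP client works, here Python’s requests):

```python
# Create a collection with 2 shards and 2 replicas per shard.
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "CREATE",
        "name": "mycollection",     # hypothetical collection name
        "numShards": 2,             # shards increase write capacity
        "replicationFactor": 2,     # replicas per shard increase query capacity
    },
)
print(resp.json())
```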

Bigdata, Messaging System

ZeroMQ Part-1

Programs, like people, need to communicate, and for that we have UDP, TCP, HTTP, IPX, WebSocket, and other related protocols to connect applications. But working directly against those underlying protocols is difficult. We need something high level, scalable, and easy to use, and that is ZeroMQ (ØMQ). ØMQ gives us advanced levels of availability and speed. ØMQ is a neat messaging library that allows us to build our own messaging infrastructure. It can help build frameworks that scale, where services could be handled by different applications. I am inclined to Python programming, and luckily pyzmq provides Python bindings […]
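For a feel of how little boilerplate ØMQ needs, here is a minimal reply-socket sketch with pyzmq (the port is arbitrary; a matching REQ socket would connect to tcp://localhost:5555):

```python
# One round of a request/reply service over ZeroMQ.
import zmq

context = zmq.Context()
socket = context.socket(zmq.REP)     # REP: the reply side of the REQ/REP pattern
socket.bind("tcp://*:5555")

message = socket.recv()              # block until a request arrives
socket.send(b"ack: " + message)      # answer the requester
```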

Analytics, Bigdata, Framework, Hadoop, RHadoop

Install and smoketest R and RHadoop on Hortonworks Data Platform (HDP25-CentOS7)

Before going into the installation steps I’d like to give a small introduction to RHadoop. What is RHadoop? RHadoop is an open source project for combining R and Hadoop. It contains four packages that combine R with different projects from the Hadoop ecosystem and one package that enhances some functions to fit the MapReduce framework. rhdfs: combines Hadoop’s HDFS with R. rhbase: combines Hadoop’s HBase with R. rmr2: combines Hadoop’s MapReduce 2 with R. ravro: combines Hadoop’s Avro with R. plyrmr: provides a familiar plyr-like interface over MapReduce. You can reference the official GitHub of RHadoop: https://github.com/RevolutionAnalytics/RHadoop. Requirements: first of all, I have installed HDP2.5 […]

Bigdata, Hadoop, NoSql

The ACID properties and the CAP theorem are two concepts in data management for distributed systems.

Started working on HBase again!! Thought why not refresh a few concepts before proceeding to the actual work. Important things that come to mind when we work with NoSQL in a distributed environment are sharding and partitioning. Let’s dive into the ACID properties of databases and the CAP theorem for distributed systems. The ACID properties and the CAP theorem are two concepts in data management for distributed systems. The funny thing is they both come with a “C” that has a totally different meaning. What is ACID? It is a set of rules that means a lot for RDBMS, because all RDBMS are ACID compliant. A = Atomicity, meaning all or nothing: if I […]
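Atomicity is the easiest of the four properties to see in code; a tiny sketch with Python’s built-in sqlite3 (in-memory database, hypothetical table): either both updates below commit or neither does.

```python
# All-or-nothing: a transaction that moves money between two accounts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")

try:
    with conn:  # opens a transaction: commit on success, rollback on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
except sqlite3.Error:
    pass  # on any failure, neither update is applied

print(conn.execute("SELECT * FROM accounts").fetchall())  # [('a', 50), ('b', 50)]
```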