Analytics, Data Science, Exploratory Data Analysis, Hadoop

Approach to executing the Machine Learning project “Halt the Hate”…

Disclaimer: The analysis done in this project touches a sensitive issue in India, so I am not trying to convince anybody to trust my model. A real human society is so complex that “all the things may be interconnected in a different way than in the model.” Imagine you are presented with a dataset of “Hate Crimes” in India and asked how to minimize these crimes by analyzing other factors. This is the problem I am taking in hand to solve and analyze with a minimum number of resources. Some can say that education and providing jobs to youth in India by […]

Analytics, Bigdata, Database

Why and when we need Machine Learning…

I have been in data management/data quality for several years. When I ask some people what their data management processes are, they simply reply, “well, we have some of our data stored in a database and other data stored on file shares with proper permissions.” This isn’t data management…it’s data storage. If you and/or your organization don’t have good, clean data, you are most definitely not ready for machine learning. Data management should be your first step before diving into any other data project(s). Now I’d say, if you have good data management and data tagged for machine learning, then give yourself a pause and […]

Analytics, Bigdata, Python

How to convert Python lists, tuples, and strings to each other…

Lists, tuples, and strings are three of Python’s built-in sequence types. The built-in functions str(), tuple(), and list() convert between them, as the following example shows:

>>> s = '123456'
>>> list(s)
['1', '2', '3', '4', '5', '6']
>>> tuple(s)
('1', '2', '3', '4', '5', '6')
>>> tuple(list(s))
('1', '2', '3', '4', '5', '6')
>>> list(tuple(s))
['1', '2', '3', '4', '5', '6']
>>> ''.join(tuple(s))
'123456'
>>> ''.join(list(s))
'123456'
>>> str(tuple(s))
"('1', '2', '3', '4', '5', '6')"
>>> str(list(s))
"['1', '2', '3', '4', '5', '6']"

Analytics, Apache Spark, Bigdata, Database

Tips and Tricks for Apache Spark RDD API, Dataframe API – Part 1

I am planning to share my knowledge of the Apache Spark RDD and Dataframe APIs along with some tips and tricks. If I combined everything into one piece it would be a very lengthy article, so I am dividing it into three separate articles, of which this is the first:

Spark RDD API
Dataframe API
Tips and tricks on the RDD API and Dataframe API

Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, Spark’s representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could […]
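As a minimal sketch of that RDD model (assuming a local PySpark installation; the app and variable names are illustrative, not from the article):

# Minimal PySpark RDD sketch; assumes pyspark is installed locally.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# parallelize() distributes a local collection into an RDD across partitions.
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations (filter, map) are lazy; actions (collect, reduce) trigger execution.
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(even_squares.collect())              # [4, 16, 36]
print(numbers.reduce(lambda a, b: a + b))  # 21

sc.stop()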

Analytics, Bigdata, Kafka

Better late than never: Time to replace your micro-service architecture with Kafka…

Kafka already powers and facilitates many organizations in the microservices world. If Kafka is still not part of your infrastructure, it’s high time for you to go with it. I am not promoting Kafka as better than any other message queue system, as many articles on that subject are already floating around the internet. Kafka’s uniqueness is that it provides both simple file system and bridge functions. A Kafka broker’s most basic task is to write messages to, and read messages from, the log on disk as quickly as possible. A queued message will not be lost after being persisted, which is […]
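To make that log-centric model concrete, here is a minimal kafka-python sketch (the broker address and topic name are my assumptions, not taken from the article):

# Append a message to the log, then read the log back from the beginning.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("demo-topic", b"hello, log")
producer.flush()  # block until the broker has accepted the write

consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start reading from the beginning of the log
    consumer_timeout_ms=5000,      # stop iterating once no new messages arrive
)
for message in consumer:
    print(message.offset, message.value)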

Analytics, Bigdata, Kafka

In-depth: Kafka message queue principles of high reliability

At present, many open source distributed processing systems such as Cloudera, Apache Storm, and Spark support integration with Kafka. Kafka is increasingly favored by many internet companies, which use it as one of their core messaging engines. Kafka’s message reliability can be regarded as that of a commercial-grade messaging middleware solution. In this article, we will look at Kafka’s storage mechanism, replication principle, synchronization principle, and durability assurance to analyze its reliability. As shown in the figure above, a typical Kafka architecture includes several Producers (which can be server logs, business data, page views generated by […]
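As one small illustration of those durability assurances (a sketch assuming kafka-python and a broker at localhost:9092; the topic name is illustrative), the producer can be told to wait for all in-sync replicas:

# Producer-side durability settings with kafka-python.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",  # wait until all in-sync replicas have persisted the message
    retries=5,   # retry transient failures instead of silently dropping data
)
future = producer.send("reliable-topic", b"must-not-be-lost")
metadata = future.get(timeout=10)  # raises if the write was never acknowledged
print(metadata.topic, metadata.partition, metadata.offset)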

Analytics, Apache Spark, Hadoop, Kafka, Python, Spark

Consume JSON Messages From Kafka Using Kafka-Python’s Deserializer

Hope you are here because you want to take a ride with Python and Apache Kafka. kafka-python is the most popular Kafka client library for Python; for documentation on this library, visit https://kafka-python.readthedocs.io/en/master/. kafka-python is designed to function much like the official Java client. It is best used with newer brokers (0.9+), but is backwards-compatible with older versions (to 0.8.0); some features will only be enabled on newer brokers. So instead of showing you a simple example that runs a Kafka producer and consumer separately, I’ll show the JSON serializer and deserializer. Preparing the environment: let’s start by installing the python package using […]
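A minimal sketch of that JSON round trip with kafka-python (the broker address and topic name here are my assumptions, not the post’s):

import json
from kafka import KafkaProducer, KafkaConsumer

# Serialize dicts to JSON bytes on the way in...
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("json-topic", {"user": "alice", "action": "login"})
producer.flush()

# ...and deserialize JSON bytes back to dicts on the way out.
consumer = KafkaConsumer(
    "json-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value["user"], message.value["action"])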

Analytics, Bigdata, Kafka

Moving to communication of events between subsystems — CQRS-ES with open source…

Before going into definitions of EP, CEP, and CQRS, let us start with some basic database terms and the problem we are trying to address here. We have commercial databases and database professionals who have publicized CRUD operations a lot. This one-row-per-record pattern works well in most projects and is enough to build an application quickly and securely. I have probably implemented 100 CRUD projects (including web applications), and we do it that way because we have limited budgets and projects have deadlines. CRUD works well until someone asks for historical data, and I have seen a few managers complaining about the lack […]
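As a toy sketch of the event-sourcing half of CQRS-ES, in plain Python with no framework (all names are illustrative): instead of updating a row in place, you append immutable events and replay them, which answers exactly the kind of historical questions CRUD struggles with.

events = []  # the append-only event log

def record(event_type, **data):
    events.append({"type": event_type, **data})

def balance(upto=None):
    # Replay the log (or a prefix of it) to derive state at any point in time.
    total = 0
    for event in events[:upto]:
        if event["type"] == "deposited":
            total += event["amount"]
        elif event["type"] == "withdrawn":
            total -= event["amount"]
    return total

record("deposited", amount=100)
record("withdrawn", amount=30)
record("deposited", amount=50)
print(balance())        # current state: 120
print(balance(upto=2))  # historical state after the first two events: 70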

Analytics, Bigdata, Framework, Hadoop

Sumo Logic: Log Management Tool

This is my first face-off with “Sumo Logic”. If you want a quick introduction to “Sumo Logic”, this post will be helpful without going into the detailed documentation. Sumo Logic is designed to help you manage and analyze your log files. It started out attempting to be a SaaS version of Splunk and has gone its own way as it matured, but as a result of those beginnings, it is one of the most feature-rich and enterprise-focused SaaS log management tools. Installation: Sumo Logic is a SaaS model, which means you’ll be setting up communication out to the Sumo Logic […]

Analytics, Bigdata, Framework, Hadoop, RHadoop

Install and smoke-test R and RHadoop on Hortonworks Data Platform (HDP25-CentOS7)

Before going into the installation steps, I’d like to give a small introduction to RHadoop. What is RHadoop? RHadoop is an open source project for combining R and Hadoop. It contains four packages that combine R with different Hadoop projects and one package that adapts familiar functions to the MapReduce framework:

rhdfs: combines Hadoop’s HDFS with R.
rhbase: combines Hadoop’s HBase with R.
rmr2: combines Hadoop’s MapReduce 2 with R.
ravro: combines Hadoop’s Avro with R.
plyrmr: provides a familiar plyr-like interface over MapReduce.

You can reference the official GitHub of RHadoop: https://github.com/RevolutionAnalytics/RHadoop. Requirements: first of all, I have installed HDP 2.5 […]

Analytics, Tesseract

OCR – “Optical Character Recognition”, Set up Tesseract OCR on CentOS 6.8…

OCR means “Optical Character Recognition”, and Tesseract is licensed under the Apache License v2.0. A system configured with Tesseract OCR is able to convert images with embedded text into text files. This “how to install” tutorial is meant as a practical guide; it does not cover the theoretical background/concepts of OCR or the algorithms used in Tesseract, which are treated in a lot of other documents on the web. Tesseract installation is supported beautifully on Ubuntu without issues (thanks to apt-get), but CentOS requires some effort and the correct versions to build. Please follow the steps below for Tesseract installation on CentOS: 1. OS update using yum. Set up CentOS 6.8 […]
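Once the build succeeds, a quick smoke test is possible from Python via the pytesseract wrapper (my addition, not part of the CentOS guide itself; the image file name is illustrative):

# pip install pytesseract pillow; assumes the tesseract binary is on PATH.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)  # the text embedded in the image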

Analytics, Hadoop

Advertisement attributes or Ad Attributes…An Idea!!!

Some time ago I was working on an idea called Ad Attributes, or advertisement attributes, and I’d like to share my thoughts on it with the audience. Advertisement attributes are for creating a favorable selling climate. Today consumers are constantly targeted with product information by marketing companies and are faced with numerous advertisements carrying vast amounts of product information. Thus consumers use a heuristics approach to help them make their purchasing decisions. This approach basically uses mental shortcuts to streamline the selection process cognitively, to avoid being puzzled or paralyzed by the huge number of products offered in the […]

Analytics, Hadoop

Data Analysis Approach to a successful outcome

I have done data analysis for one of my projects using the approach below, and hopefully it may help you understand the underlying subject. Soon I’ll post the project itself, with a detailed description of the technologies used: Python (web scraping for data collection), Hadoop, Spark, and R. Data analysis is a highly iterative and non-linear process, better reflected by a series of cyclic processes, in which information is learned at each step, which then informs whether (and how) to refine and redo the step that was just performed, or whether (and how) to proceed to the next step. Setting the Scene: Data analysis is […]