Analytics, Bigdata, Hadoop, Python

How to convert Python list, tuples, strings to each other…

Python has three built-in sequence types: lists, tuples, and strings. The built-in functions str(), tuple(), and list(), together with the string method join(), convert between them, as the following example shows:

>>> s = '123456'
>>> list(s)
['1', '2', '3', '4', '5', '6']
>>> tuple(s)
('1', '2', '3', '4', '5', '6')
>>> tuple(list(s))
('1', '2', '3', '4', '5', '6')
>>> list(tuple(s))
['1', '2', '3', '4', '5', '6']
>>> "".join(tuple(s))
'123456'
>>> "".join(list(s))
'123456'
>>> str(tuple(s))
"('1', '2', '3', '4', '5', '6')"
>>> str(list(s))
"['1', '2', '3', '4', '5', '6']"
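The last two results in the example above are worth a second look: str() on a tuple or list returns its printed representation, not the original string. The sketch below is an addition, not part of the original post. It shows one way to get back from that representation using the standard library's ast.literal_eval, and how join() can be used when the items are not strings. The variable names are illustrative only.

```python
import ast

s = "123456"

# str() on a list gives the list's printed form, not the original string.
as_text = str(list(s))

# ast.literal_eval safely parses that printed form back into a real list.
restored = ast.literal_eval(as_text)

# "".join() then rebuilds the original string from the characters.
round_trip = "".join(restored)

# join() only accepts strings, so convert other items with str() first.
digits = (1, 2, 3)
joined_digits = "".join(str(d) for d in digits)

print(as_text)
print(restored)
print(round_trip)
print(joined_digits)
```

ast.literal_eval is used here instead of eval() because it only accepts Python literals, so it will not run arbitrary code found in the string.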

Analytics, Apache Spark, Best Practices, Bigdata, Framework

Tips and Tricks for Apache Spark RDD API, Dataframe API- Part -1

I am planning to share my knowledge of the Apache Spark RDD and Dataframe APIs, along with some tips and tricks. Combining everything into one article would make it very lengthy, so I am dividing it into three separate articles, and this is the first in the series: (1) Spark RDD API, (2) Dataframe API, (3) Tips and tricks on RDD API and Dataframe API. Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. RDD could […]

Analytics, Apache Spark, Bigdata, Kafka, Messaging System

In-depth Kafka Message queue principles of high-reliability

At present, many open source distributed processing systems such as Cloudera, Apache Storm, Spark and others support integration with Kafka. Kafka is increasingly favored by internet companies, many of which use it as one of their core messaging engines. For a commercial-grade messaging middleware solution, the importance of message reliability is easy to imagine. In this article, we will look at Kafka's storage mechanism, replication principle, synchronization principle, and durability assurance to analyze its reliability. As shown in the figure above, a typical Kafka architecture includes several Producers (which can be server logs, business data, page views generated by […]

Analytics, Apache Spark, Hadoop, Kafka, Python, Spark

Consume JSON Messages From Kafka Using Kafka-Python’s Deserializer

Hopefully you are here because you want to take a ride on Python and Apache Kafka. kafka-python is the most popular Kafka library for Python. For details, see the library's documentation page. kafka-python is designed to function much like the official Java client. It is best used with newer brokers (0.9+), but is backwards-compatible with older versions (to 0.8.0). Some features will only be enabled on newer brokers. So instead of showing you a simple example that runs a Kafka Producer and Consumer separately, I'll show the JSON serializer and deserializer. Preparing the Environment: Let's start by installing the Python package using […]
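As a rough illustration of the JSON serializer and deserializer idea mentioned in this excerpt, here is a minimal sketch that runs without a Kafka broker. It is not taken from the original post. The function names serialize_json and deserialize_json, the sample message, and the topic name in the comments are all hypothetical. The comments show where such functions would plug into kafka-python through the value_serializer and value_deserializer parameters.

```python
import json


def serialize_json(value):
    """Turn a Python object into UTF-8 encoded JSON bytes."""
    return json.dumps(value).encode("utf-8")


def deserialize_json(raw_bytes):
    """Turn UTF-8 encoded JSON bytes back into a Python object."""
    return json.loads(raw_bytes.decode("utf-8"))


# With kafka-python, these would typically be passed as:
#   KafkaProducer(value_serializer=serialize_json)
#   KafkaConsumer("some-topic", value_deserializer=deserialize_json)
# "some-topic" is a placeholder, not a name from the original article.

# Hypothetical sample message, round-tripped locally without a broker.
message = {"user": "alice", "clicks": 3}
payload = serialize_json(message)
decoded = deserialize_json(payload)

print(payload)
print(decoded)
```

Kafka itself only moves bytes, which is why the producer side encodes to bytes and the consumer side decodes them again.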

Analytics, Bigdata, Kafka

Moving to communication of events between subsystems — CQRS-ES with open source…

Before going into definitions of EP, CEP, and CQRS, let us start with some basic database terms and the problem we are trying to address here. Commercial databases and database professionals have publicized CRUD operations a lot. The one-row-per-pattern approach works well in most projects and is enough to build an application quickly and securely. I have probably implemented 100 CRUD projects (including web applications), and we do it that way because we have limited budgets and projects have deadlines. CRUD works well until someone asks for historical data, and I have seen a few managers complaining about the lack […]

Analytics, Bigdata, Framework, Hadoop

Sumo Logic : Log Management Tool

This is my first face-off with "Sumo Logic". If you want a quick introduction to "Sumo Logic", this topic will be helpful without going into the detailed documentation. Sumo Logic is designed to help you manage and analyze your log files. It started out attempting to be a SaaS version of Splunk and has gone its own way as it matured, but as a result of its beginnings, it is one of the most feature-rich and enterprise-focused SaaS log management tools. Installation: Sumo Logic is a SaaS model, which means you'll be setting up a communication out to the Sumo Logic […]

Analytics, Bigdata, Framework, Hadoop, RHadoop

Install and smoketest R and RHadoop on Hortonworks Data Platform (HDP25-CentOS7)

Before going to the installation steps, I'd like to give a small introduction to RHadoop. What is RHadoop? RHadoop is an open source project that combines R and Hadoop. It contains four packages that each integrate a different Hadoop project with R, and one package that enhances some functions to fit the MapReduce framework. rhdfs: Combines Hadoop's HDFS with R. rhbase: Combines Hadoop's HBase with R. rmr2: Combines Hadoop's MapReduce 2 with R. ravro: Combines Hadoop's Avro with R. plyrmr: Provides a familiar plyr-like interface with MapReduce. You can reference the official GitHub of RHadoop: Requirements First of all, I have installed HDP2.5 […]

Analytics, Tesseract

OCR – “Optical Character Recognition”, Set up Tesseract OCR on Centos 6.8…

OCR means "Optical Character Recognition", and Tesseract is licensed under the Apache License v2.0. A system configured with Tesseract OCR can convert images with embedded text into text files. This "How to install" tutorial is meant as a practical guide; it does not cover the theoretical background or concepts of OCR, or the algorithms used in Tesseract. Those are covered in many other documents on the web. Tesseract installs smoothly on Ubuntu (thanks to apt-get), but on CentOS it requires some effort and the correct versions to build. Please follow the steps below for Tesseract installation on CentOS: 1. OS update using yum. Setup Centos 6.8 […]

Analytics, Hadoop

Advertisement attributes or Ad Attributes…An Idea!!!

Some time ago I was working on an idea called Ad Attributes, or Advertisement Attributes. I'd like to share my thoughts on this idea with the audience. Advertisement attributes are for creating a favorable selling climate. Today consumers are constantly targeted with product information by marketing companies. Consumers are faced with numerous advertisements carrying vast amounts of product information. Thus consumers use a heuristic approach to help them make their purchasing decisions. This approach basically uses mental shortcuts to streamline the selection process cognitively. This is to avoid being puzzled or paralyzed by the huge number of products offered in the […]

Analytics, Hadoop

Data Analysis Approach to a successful outcome

I have done data analysis for one of my projects using the approach below, and hopefully it will help you understand the underlying subject. Soon I'll post my data analysis project with a detailed description of the technologies used: Python (web scraping for data collection), Hadoop, Spark and R. Data analysis is a highly iterative and non-linear process, better reflected by a series of cyclic processes, in which information is learned at each step, which then informs whether (and how) to refine and redo the step that was just performed, or whether (and how) to proceed to the next step. Setting the Scene: Data analysis is […]