Analytics, Bigdata, Hadoop, Python

How to convert Python list, tuples, strings to each other…

There are three built-in functions in Python : lists, tuples, and strings. The three functions, str (), tuple (), and list (), convert to each other using the following example: >>> s = ‘123456’ >>> list(s) [‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’] >>> tuple(s) (‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’) >>> tuple(list(s)) (‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’) >>> list(tuple(s)) [‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’] >>> “”.join(tuple(s)) ‘123456’ >>> “”.join(list(s)) ‘123456’ >>> str(tuple(s)) “(‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’)” >>> str(list(s)) “[‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’]”

Analytics, Apache Spark, Best Practices, Bigdata, Framework

Tips and Tricks for Apache Spark RDD API, Dataframe API- Part -1

I am planning to share my knowledge on Apache Spark RDD, Dataframes API and some tips and tricks. If I combine everything into one then it would be a very lengthy article. Therefore I am dividing the long article into three separate articles and this article is the first series in that continuation. Spark RDD API Dataframe API Tips and tricks on RDD API and Dataframe API. Let us start with basics of RDD API. Resilient Distributed Dataset(RDD) is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. RDD could […]

Best Practices, Bigdata, Hadoop, Kafka

Better late then never : Time to replace your micro-service architecture with Kafka…

Kafka already spawns and facilitated many organizations on micro-services architecture world. If Kafka is still not part of your infrastructure, its high time for you to go with it. I am not promoting Kafka better then any other message queue systems as many articles are already floating on the internet about this subject. Kafka’s uniqueness is that it provides both simple file system and bridge functions. A Kafka broker’s most basic task is to write messages to and read messages from the log on disk as quickly as possible. Queue message will not be lost after the persistence, which is […]

Analytics, Apache Spark, Bigdata, Kafka, Messaging System

In-depth Kafka Message queue principles of high-reliability

At present many open source distributed processing systems such as Cloudera, Apache Storm, Spark and others support the integration with Kafka. Kafka is increasingly being favored by many internet shops and they use Kafka as one of its core messaging engines. The reliability of the Kafka message can be imagined as a commercial-grade messaging middleware solution. In this article, we will understand Kakfa storage mechanism, replication principle, synchronization principle, and durability assurance to analyze its reliability. As shown in the figure above, a typical Kafka architecture includes several Producers (which can be server logs, business data, page views generated by […]

Analytics, Bigdata, Kafka

Moving to communication of events between subsystems — CQRS-ES with open source…

Before going into definitions of EP, CEP, and QSQS let us start with some basic database term and what problem we are trying to address here. We have commercial databases and database professionals those who publicized CRUD operations a lot. It is one-row-per-pattern works well in most of the projects and enough to build an application more quickly and securely. I have probably implemented 100 CRUD projects (including web applications) and we do that way because we have limited budgets and projects have deadlines. CRUD work well until someone asked for historical data and I saw few managers complaining lack […]

Bigdata, HDP Search, Solr, SolrCloud

SolrCloud vs HDPSearch…

Let us start to remove some confusion we have related to SolrCloud and HDPSearch. First what is the SolrCloud:- Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability, called SolrCloud, these capabilities provide distributed indexing and search capabilities, supporting the following features: Central configuration for the entire cluster. Automatic load balancing and fail-over for queries. ZooKeeper integration for cluster coordination and configuration. Lets clear some confusion on Solr and SolrCloud(ZooKeeper coordinating, Solr with HDFS, HA mode) too:- Solr and SolrCloud are not separate things; Solr is the application while […]

Administration, Bigdata, Hadoop

Apache Solr Search Installation on HDP2.6 using Yum Repo

As we know that “HDP 2.6” is not bundle with “HDP Search” which includes Solr. Therefore here in two parts of article i am going to explain ways to install Solr/SolrCloud/HDP_Search:- 1. Apache Solr Search Installation on HDP2.6 using Yum Repo. 2. Apache Solr Search Installation on HDP2.6 using Ambari Management Pack. Both are using different approach therefore i have divided it into two articles Recently i have installed HDP2.6 on one of my development environment. Now its time to bring same services back one by one as we are running on HDP2.5 in production environment, one of them is […]

Bigdata, Database, Hadoop

Cloud Databases & Cloud Blob…

Cloud computing is the next stage in evolution of the Internet. The cloud in cloud computing provides the means through which everything — from computing power to computing infrastructure, applications, business processes to personal collaboration — can be delivered to you as a service wherever and whenever you need. Cloud databases are web-based services, designed for running queries on structured data stored on cloud data services. Most of the time, these services work in conjunction with cloud compute resources to provide users the capability to store, process, and query data sets within the cloud environment. These services are designed to […]

Analytics, Bigdata, Framework, Hadoop

Sumo Logic : Log Management Tool

This is my first face off with “Sumo Logic”. If you want a quick introduction on “Sumo Logic”, this topic will be helpful without going into details documentation. Sumo Logic designed to help you manage and analyze your log files. It has started out attempting to be a SaaS version of Splunk and have gone their own way as matured, but as a result of their beginnings, it is one of the most feature-rich and enterprise-focused SaaS log management tools. Installation: Sumo Logic is a SaaS model, which means you’ll be setting up a communication out to the Sumo Logic […]

Bigdata, Hadoop

SolrCloud : CAP theorem world, this makes Solr a CP system, and keep availability in certain circumstances.

A SolrCloud cluster holds one or more distributed indexes which are called Collections. Each Collection is divided into shards (to increase write capacity) and each shard has one or more replicas (to increase query capacity). One replica from each shard is elected as a leader, who performs the additional task of adding a ‘version’ to each update before streaming it to available replicas. This means that write traffic for a particular shard hits the shard’s leader first and is then synchronously replicated to all available replicas. One Solr node (a JVM instance) may host a few replicas belonging to different […]

Bigdata, Messaging System

ZeroMQ Part-1

  Programs like people need to communicate, and for them we have the UDP, TCP, HTTP, IPX, WebSocket protocol to connect and other related applications. But the underlying protocol is difficult to achieve. We need a high level of abstraction, scalable, and easy to use things, it is ZeroMQ(ØMQ). ØMQ gives us advanced levels of availability and speed. ØMQ is a neat messaging library that allows us to build our own messaging infrastructure. It can help build framework that scales where services could be handled by different applications. I am inclined to python programming and luckily pyzmq provides python bindings […]

Analytics, Bigdata, Framework, Hadoop, RHadoop

Install and smoketest R and RHadoop on Hortonworks Data Platform (HDP25-CentOS7)

Before going to Installation steps i’d like to give a small introduction on RHADOOP. What is RHadoop? RHadoop is an open source project for combine R and Hadoop together. It contains 4 different packages to combine different project from Hadoop and 1 package to enhance some functions to fit MapReduce framework. rhdfs: Combine Hadoop’s HDFS with R. rhbase: Combine Hadoop’s HBase with R. rmr2: Combine Hadoop’s MapReduce 2 with R. ravro: Combine Hadoop’s Avro with R. plyrmr: Provides a familiar plyr-like interface with MapReduce. You can reference the official GitHub of RHadoop: Requirements First at all, I have installed HDP2.5 […]

Bigdata, Hadoop, NoSql

The ACID properties and the CAP theorem are two concepts in data management to distributed system.

Started working on HBase again!! Thought why not refresh few concepts before proceeding to actual work. Important things comes into mind when we work with NoSQL is distributed environment are sharding and partitions.  Let’s dive into ACID properties of database and CAP theorem for distributed system. The ACID properties and the CAP theorem are two concepts in data management to distributed system. Funny thing they both comes with “C” with totally different meaning. What is ACID: – It is a rule and meant a lot for RDMBS because all RDBMS are ACID compliance. A=Atomicity, means all or nothing, if I […]