Analytics, Apache Spark, Apache Storm, Bigdata, Hadoop

Apache Storm key takeaways…

Hadoop moves the code to the data; Storm moves the data to the code. This behavior makes more sense in a stream-processing system, because the data set isn't known beforehand, unlike in a batch job. Also, the data set is continuously flowing through the code. A Storm cluster consists of two types of nodes: the master node and the worker nodes. The master node runs a daemon called Nimbus, and the worker nodes each run a daemon called a Supervisor. The master node can be thought of as the control center. In addition to its other responsibilities, this is where […]
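To make the master/worker split concrete, here is a minimal storm.yaml sketch; the host name and port numbers are illustrative, not from the article:

```yaml
# storm.yaml (sketch; host name and ports are illustrative)
# The master node runs the Nimbus daemon:
nimbus.seeds: ["nimbus-host.example.com"]
# Each worker node runs a Supervisor daemon and offers worker slots,
# one per port listed here:
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
```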

Apache Spark, Bigdata, Database

Apache Spark RDD API using Pyspark…

In my previous article, I used Scala to show the usability of the Spark RDD API. Many of us use PySpark to work with RDDs and lambda functions. Although the function names and output are the same as in Scala, the PySpark syntax for RDD operations is different. Here I'll explain PySpark RDDs using a different approach and from a different perspective. Suppose we are streaming data using Spark, have created an RDD from this streaming application, and want to perform RDD operations on this stream of data in a particular time interval. Here I am […]
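As a taste of what the article covers, here is a minimal PySpark sketch of lambda-based RDD operations; the data and variable names are made up for illustration:

```python
# Minimal PySpark RDD sketch; data and names are illustrative.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

nums = sc.parallelize([1, 2, 3, 4, 5])        # build an RDD from a local list
squares = nums.map(lambda x: x * x)           # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)  # transformation (lazy)
print(evens.collect())                        # action triggers execution -> [4, 16]

sc.stop()
```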

Analytics, Apache Spark, Bigdata, Database

Tips and Tricks for Apache Spark RDD API, DataFrame API - Part 1

I am planning to share my knowledge of the Apache Spark RDD and DataFrame APIs, along with some tips and tricks. Combining everything into one article would make it very lengthy, so I am dividing it into three separate articles, of which this is the first: the Spark RDD API; the DataFrame API; and tips and tricks on the RDD and DataFrame APIs. Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, Spark's representation of a set of data, spread across multiple machines, with APIs that let you act on it. An RDD could […]
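For orientation, here is a hedged sketch contrasting the two APIs on the same data; the names, columns, and values are made up for illustration:

```python
# Sketch: the same data queried via the RDD API and the DataFrame API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
print(rdd.filter(lambda r: r[1] > 30).collect())  # RDD API -> [('alice', 34)]

df = spark.createDataFrame(rdd, ["name", "age"])  # give the tuples a schema
df.filter(df.age > 30).show()                     # DataFrame API, same result

spark.stop()
```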

Analytics, Apache Spark, Bigdata, Kafka, Messaging System

In-depth Kafka Message queue principles of high-reliability

At present, many open source distributed processing systems and platforms, such as Cloudera, Apache Storm, and Spark, support integration with Kafka. Kafka is increasingly favored by many internet companies, which use it as one of their core messaging engines. The reliability of Kafka messages is comparable to that of commercial-grade messaging middleware. In this article, we will examine Kafka's storage mechanism, replication principle, synchronization principle, and durability guarantees in order to analyze its reliability. As shown in the figure above, a typical Kafka architecture includes several Producers (which can be server logs, business data, page views generated by […]
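As a small illustration of the durability knobs the article analyzes, here is a hedged kafka-python producer sketch; the broker address, topic name, and payload are placeholders:

```python
# Sketch: a kafka-python producer tuned for reliability; names are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",   # wait until all in-sync replicas acknowledge the write
    retries=3,    # retry transient send failures
)
producer.send("server-logs", b"page-view event")
producer.flush()  # block until outstanding messages are delivered
producer.close()
```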

Analytics, Apache Spark, Hadoop, Kafka, Python, Spark

Consume JSON Messages From Kafka Using Kafka-Python’s Deserializer

Hope you are here because you want to take a ride with Python and Apache Kafka. Kafka-Python is the most popular Kafka client library for Python; for documentation, visit https://kafka-python.readthedocs.io/en/master/. kafka-python is designed to function much like the official Java client. It is best used with newer brokers (0.9+) but is backwards-compatible with older versions (to 0.8.0); some features will only be enabled on newer brokers. So instead of showing you a simple example that runs a Kafka producer and consumer separately, I'll show the JSON serializer and deserializer. Preparing the Environment: Let's start by installing the Python package using […]
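The core of the approach looks roughly like this; the broker address, topic name, and message payload are illustrative:

```python
# Sketch: JSON serialization/deserialization with kafka-python; names are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # dict -> JSON bytes
)
producer.send("events", {"user": "alice", "action": "login"})
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),  # JSON bytes -> dict
)
for message in consumer:
    print(message.value)  # -> {'user': 'alice', 'action': 'login'}
    break
```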

Apache Spark, Hbase

Multiple WAL in Apache HBase 1.3 and performance enhancements!!!

Apache HBase 1.3.0 was released in mid-January 2017 and ships with support for date-based tiered compaction and improvements in multiple areas, such as the write-ahead log (WAL) and a new RPC scheduler, among others. The release includes almost 1,700 resolved issues in total. Below are some highlights of the enhancements made in HBase 1.3.0: The date-based tiered compaction support shipped in HBase 1.3.0 is beneficial where data is infrequently deleted or updated and recent data is scanned more often than older data. Record time-to-live (TTL) can be easily enforced with this new compaction strategy. Improved multiple WAL support in Apache HBase […]
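For reference, enabling multiple WALs is a configuration change on the region servers; a hedged hbase-site.xml sketch (the group count here is illustrative, not a recommendation from the article) might look like:

```xml
<!-- hbase-site.xml sketch: enable multiple WALs per region server.
     The number of groups is illustrative; tune it for your disks. -->
<property>
  <name>hbase.wal.provider</name>
  <value>multiwal</value>
</property>
<property>
  <name>hbase.wal.regiongrouping.numgroups</name>
  <value>2</value>
</property>
```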

Apache Spark, open source

Apache Spot: the open source community continues the fight against cybercrime…

Apache Spot joins forces with the Apache community to fight cybercrime. Since Apache Spot started at Intel and Cloudera earlier this year, the project's momentum has been growing, with Anomali, Centrify, Cloudwick, Cybraics, eBay, Endgame, Jask, StreamSets, Webroot, and other partners giving their unanimous support. It uses Apache Hadoop for log management and data storage at virtually unlimited scale, and Apache Spark for near-real-time machine learning and anomaly detection, bringing new data analysis capabilities to network security. With Apache Spot, we can make more effective use of the technology provided by the big data ecosystem and detect unknown network […]

Apache Spark, Spark

Introduction to Spark

Introduction to Apache Spark: Spark, as a unified stack and computational engine, is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines. Over time, big data experts around the world have derived specialized systems on top of Hadoop to solve certain problems, such as graph processing, efficient iterative algorithms, and real-time query engines. As you may know, components like Impala, Mahout, Tez, and GraphLab were all derived from Hadoop for different purposes. What is Apache Spark? Apache Spark is a generalized engine that combines the specialties of all […]