Hadoop

Fundamentals of Apache Spark…

You can view my other articles on the Spark RDD API at the links below… Apache Spark RDD API using PySpark… Tips and Tricks for Apache Spark RDD API, DataFrame API. How did Spark become so much more efficient at data processing than MapReduce? It comes with a very advanced Directed Acyclic Graph (DAG) data-processing engine: for every Spark job, a DAG of tasks is created to be executed by the engine. In mathematical parlance, a DAG consists of a set of vertices and the directed edges connecting them, and the tasks are executed as per the DAG layout. In […]
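
To make the DAG idea concrete, here is a minimal PySpark sketch (the app name and local[*] master are illustrative, not from the article): each transformation only extends the DAG, and nothing executes until the action submits the job.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "dag-demo")  # hypothetical master and app name

    nums = sc.parallelize(range(1, 11))            # source RDD: a vertex in the DAG
    squares = nums.map(lambda x: x * x)            # transformation: DAG grows, nothing runs
    evens = squares.filter(lambda x: x % 2 == 0)   # another transformation, still lazy

    print(evens.collect())  # action: the DAG is turned into tasks and executed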

Analytics, Data Science, Exploratory Data Analysis, Hadoop

Approach to executing a Machine Learning project, “Halt the Hate”…

Disclaimer: The analysis done in this project touches a sensitive issue in India, so I am not trying to convince anybody to trust my model; a real human society is so complex that “all the things may be interconnected in a different way than in the model.” Imagine you are presented with a dataset of “Hate Crimes” in India and asked how to minimize these crimes by analyzing other factors. This is the problem I have taken in hand to solve and analyze with a minimum number of resources. Some may say that education and providing jobs to youth in India by […]

Bigdata, HDP Search, Solr, SolrCloud

SolrCloud vs HDPSearch…

Let us start by removing some of the confusion around SolrCloud and HDPSearch. First, what is SolrCloud? Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability; called SolrCloud, these capabilities provide distributed indexing and search, supporting the following features: central configuration for the entire cluster; automatic load balancing and failover for queries; ZooKeeper integration for cluster coordination and configuration. Let’s also clear up some confusion about Solr versus SolrCloud (ZooKeeper coordination, Solr with HDFS, HA mode): Solr and SolrCloud are not separate things; Solr is the application, while […]
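
As a quick, hedged illustration of querying a collection through a single node while ZooKeeper coordinates the rest of the cluster, here is a small sketch with the pysolr client (the URL and collection name are placeholders):

    import pysolr

    # Any node of the SolrCloud cluster can serve the request; ZooKeeper keeps
    # configuration and shard state consistent across all nodes.
    solr = pysolr.Solr("http://localhost:8983/solr/my_collection", timeout=10)

    solr.add([{"id": "doc1", "title": "SolrCloud vs HDPSearch"}], commit=True)  # distributed indexing
    for result in solr.search("title:SolrCloud"):
        print(result["id"])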

Apache Spark, Hbase

Multiple WAL in Apache HBase 1.3 and performance enhancements!!!

Apache HBase 1.3.0 was released in mid-January 2017 and ships with support for date-based tiered compaction and improvements in multiple areas, like the write-ahead log (WAL) and a new RPC scheduler, among others. The release includes almost 1,700 resolved issues in total. Below are some highlights of the enhancements made in HBase 1.3.0: The date-based tiered compaction support shipped in HBase 1.3.0 is beneficial where data is infrequently deleted or updated and recent data is scanned more often than older data. Record time-to-live (TTL) can be easily enforced with this new compaction strategy. Improved multiple WAL support in Apache HBase […]
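
For reference, multiple WALs are enabled per RegionServer by switching the WAL provider in hbase-site.xml; a minimal sketch (the number of WAL groups is an example value to tune for your disks):

    <!-- hbase-site.xml: use multiple write-ahead logs per RegionServer -->
    <property>
      <name>hbase.wal.provider</name>
      <value>multiwal</value>
    </property>
    <property>
      <name>hbase.wal.regiongrouping.numgroups</name>
      <value>2</value>
    </property>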

Analytics, Bigdata, Framework, Hadoop, RHadoop

Install and smoketest R and RHadoop on Hortonworks Data Platform (HDP 2.5, CentOS 7)

Before going into the installation steps I’d like to give a short introduction to RHadoop. What is RHadoop? RHadoop is an open-source project for combining R and Hadoop. It contains four packages that integrate R with different Hadoop projects and one package that adapts plyr-style functions to the MapReduce framework: rhdfs combines Hadoop’s HDFS with R; rhbase combines Hadoop’s HBase with R; rmr2 combines Hadoop’s MapReduce 2 with R; ravro combines Hadoop’s Avro with R; plyrmr provides a familiar plyr-like interface over MapReduce. You can refer to the official RHadoop GitHub: https://github.com/RevolutionAnalytics/RHadoop Requirements: First of all, I have installed HDP 2.5 […]

Analytics, Hadoop

Advertisement attributes or Ad Attributes… An Idea!!!

Some time ago I was working on an idea called Ad Attributes, or Advertisement Attributes. I’d like to share my thoughts on this idea with the audience. Advertisement attributes are for creating a favorable selling climate. Today consumers are constantly targeted with product information by marketing companies and are faced with numerous advertisements carrying vast amounts of product information, so they use a heuristic approach to help them make their purchasing decisions. This approach basically uses mental shortcuts to streamline the selection process cognitively, to avoid being puzzled or paralyzed by the huge number of products offered in the […]

Hadoop, Hbase, Hive

JRuby code to purge data in HBase over a Hive table…

Problem to solve: how to delete/update/query binary-format values stored in an HBase column-family column. For a Hive table over HBase, where we can’t use the standard API and are unable to apply filters on binary values, you can use the solution below for programmability. Find the JRuby source code at github.com/mkjmkumar/JRuby_HBase_API. This program, written in JRuby, purges data using the HBase shell, deleting the required data by applying a filter on the given binary column. You have probably already heard the many advantages of storing data in HBase (especially in binary block format) and creating a Hive table on top of it to query your data. I am not going to explain the use case for this, why […]
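
The post’s actual implementation is the JRuby program linked above; as an analogous sketch only, a Python client such as happybase can express the same scan-with-filter-then-delete pattern (the table, column, and filter value here are hypothetical, and real binary values need proper escaping):

    import happybase

    conn = happybase.Connection("thrift-gateway-host")  # assumes an HBase Thrift server
    table = conn.table("my_hive_backed_table")

    # Server-side filter on the column value, then purge every matching row.
    flt = "SingleColumnValueFilter('cf', 'col', =, 'binary:purge_me')"
    for row_key, _ in table.scan(filter=flt):
        table.delete(row_key, columns=[b"cf:col"])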

Hadoop, Hive, Java, Pig, Python

Python and Python bites

Python and Python bites: “lambda”. Hi everyone, this article shows you one powerful function in the Python programming language called “lambda”. It can solve many a small problem in a single line of code. So let’s start on what may be the beginning of your interesting, or maybe future, programming language. Anonymous functions created at runtime are known as lambda functions. The lines below define an ordinary function in Python:

    >>> def f(x): return x + 42
    >>> print(f(21))
    63

With a lambda function:

    >>> calc = lambda x: x + 42
    >>> calc(21)
    63

A lambda definition does not include a “return” statement; it always contains a single expression whose value is returned. Also […]
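
A typical place where lambda shines is as a throwaway key function; one more small example (the data is made up):

    >>> pairs = [("hive", 3), ("spark", 1), ("kafka", 2)]
    >>> sorted(pairs, key=lambda p: p[1])
    [('spark', 1), ('kafka', 2), ('hive', 3)]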

Hadoop, Kylin, Security

Past and Future of Apache Kylin!!!

Short Description: how Apache Kylin (Chinese: Kirin) appeared and which Hadoop-based problems it can solve. Article: The origin of Apache Kylin. In today’s era of big data, Hadoop has become the de facto standard, and a large number of tools have been built around the Hadoop platform one after another to address the needs of different scenarios. For example, Apache Hive is a data-warehouse tool: data files stored on the HDFS distributed file system can be mapped to database tables and queried with SQL. Hive’s execution engine converts the SQL into MapReduce tasks to run, making it ideally suited to data-warehouse-style analysis. […]

Hadoop, HDFS

Heterogeneous Storage in HDFS (Part 1)…

An introduction to heterogeneous storage types and the flexible configuration of heterogeneous storage! Heterogeneous Storage in HDFS: Hadoop 2.6.0 introduced a new feature, heterogeneous storage. With heterogeneous storage, HDFS can use different storage media according to their read/write characteristics and so play to the respective strengths of each medium. This is very suitable for the cold storage of data: cold data needs large capacity but not high read/write performance, so the most common disks suffice, while for hot data SSDs can be used instead. On the other hand, when we require […]
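
A taste of the flexible configuration this series covers: storage types are declared per DataNode data directory in hdfs-site.xml, and a storage policy is then attached to a path (the directory names below are examples only):

    <!-- hdfs-site.xml: tag each DataNode directory with its storage medium -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>[SSD]/grid/ssd/dn,[DISK]/grid/disk/dn,[ARCHIVE]/grid/archive/dn</value>
    </property>

    # Then pin cold data to archival media (example path and policy)
    hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD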

Cloudera, encryption, HDFS, Security

A Step-by-Step Guide to an HDFS Data Protection Solution for Your Organization on Cloudera CDH

An enterprise-ready encryption solution should provide the following: comprehensive encryption of data wherever it resides, including structured and unstructured data at rest and data in motion. HDFS encryption implements transparent, end-to-end encryption of data read from and written to HDFS, without requiring changes to application code. Centralized encryption and key management: a centralized solution enables you to protect and manage both the data and the keys, securing the data by encrypting or tokenizing it while controlling access to the protected data. This guide will take you through enabling HDFS encryption on your cluster using the default Java KeyStore KMS. If […]
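
Once the KMS is running, the core workflow the guide builds toward is short (the key and zone names below are placeholders):

    # 1. Create an encryption key in the KMS (backed by the Java KeyStore here)
    hadoop key create my_key

    # 2. Create an empty directory and declare it an encryption zone
    hdfs dfs -mkdir /secure_zone
    hdfs crypto -createZone -keyName my_key -path /secure_zone

    # 3. Files written under the zone are encrypted and decrypted transparently
    hdfs dfs -put localfile /secure_zone/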

Best Practices, Hadoop, Hive

Performance utilities in Hive

Before taking you into the details of the utilities provided by Hive, let me explain a few components, the execution flow, and where the related information is stored in the system. Hive is a data-warehouse software best suited for OLAP (OnLine Analytical Processing) workloads, handling queries over vast volumes of data residing in distributed storage. The Hadoop Distributed File System (HDFS) is the ecosystem in which Hive maintains data reliably and survives hardware failures. Hive is one of the leading SQL-like relational big-data warehousing approaches developed on top of Hadoop. HiveQL, as described, is an SQL-like query language for […]
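
Two of the simplest of these utilities, shown here on a made-up table, are EXPLAIN, which prints the execution plan, and ANALYZE TABLE, which gathers the statistics the optimizer relies on:

    -- Inspect the plan Hive will execute for a query
    EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept;

    -- Gather table- and column-level statistics for the optimizer
    ANALYZE TABLE employees COMPUTE STATISTICS;
    ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS;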

Best Practices, Database, Hive

Best Practices for Hive Authorization when using a connector to HiveServer2

Recently we have been working with Presto and configuring its Hive connector. It connected successfully with the steps given at prestodb.io/docs/current/connector/hive.html. An overview of our architecture: Presto runs on a separate machine (the Presto machine) and uses the Hive connector to communicate with a Hadoop cluster running on other machines. The Presto machine has a hive.properties file, which tells Presto to use a Thrift connection to the Hive metastore, and hdfs-site.xml and core-site.xml files for HDFS. Below is the architecture of our environment, and the command to invoke Presto: /presto --server XX.X.X.XX:9080 --catalog hive. No presto user exists in […]
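
For reference, a minimal hive.properties for this layout looks roughly like the sketch below (the host name and paths are placeholders, not our real values):

    # etc/catalog/hive.properties on the Presto machine
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore-host:9083
    hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml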

Database, HPL

HPL/SQL Makes SQL-on-Hadoop More Dynamic

Think back to the old days when we solved many business problems using dynamic SQL, exception handling, flow of control, and iteration. Recently, while working on a couple of migration projects, I found a few business rules that needed to be transformed into Hive-compatible form (some of them very complex and nearly impossible). The solution is HPL/SQL (formerly PL/HQL), a language translation and execution layer developed by Dmitry Tolpeko (http://www.hplsql.org/). Why HPL/SQL? The role of Hadoop in data warehousing is huge, but to implement comprehensive ETL, reporting, analytics and data-mining processes you not only need distributed processing engines such as MapReduce, Spark or Tez, you […]
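
As a taste of the procedural constructs HPL/SQL adds on top of plain HiveQL, here is a hedged sketch of flow control and exception handling (the sales table is made up):

    -- run with: hplsql -f load_report.sql
    DECLARE cnt INT DEFAULT 0;
    FOR i IN 1..3 LOOP
      PRINT 'Loading partition ' || i;
    END LOOP;
    BEGIN
      SELECT COUNT(*) INTO cnt FROM sales;
      PRINT 'Row count: ' || cnt;
    EXCEPTION WHEN OTHERS THEN
      PRINT 'Query failed, applying fallback logic';
    END;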

Hadoop, Hive, Oozie

Coding Tips and Best Practices in Hive and Oozie…

Many times during code review I have found common mistakes made by developers. Here are a few of them… Workflow mandatory item: add this property to every workflow that has a Hive action. It makes sure the Hive job runs with the necessary number of reducers instead of just one: <property><name>mapreduce.job.reduces</name><value>-1</value></property> HQL items: Setting properties: keep the set properties in the HQL to a minimum and let them take their default values; add only what is absolutely necessary for that script. If you are using older code as a template, do not […]

Hadoop, HDFS

HDFS is really not designed for many small files!!!

A few of my friends new to Hadoop frequently ask what a good file size for Hadoop is and how to decide on it. Obviously the files should not be small, and file size should be in line with the block size. HDFS is really not designed for many small files. For each file, the client has to talk to the NameNode, which gives it the location(s) of the block(s) of the file, and then the client streams the data from the DataNodes. Now, in the best case, the client does this once, and then finds that it is the machine with […]
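
The NameNode cost is easy to estimate with the common rule of thumb that every file, directory, and block object takes roughly 150 bytes of NameNode heap; a quick illustrative calculation (the figures are approximations, not measurements):

    # Rule of thumb: ~150 bytes of NameNode heap per file/block object
    OBJ_BYTES = 150

    small = 10_000_000 * 2 * OBJ_BYTES                 # 10M small files: 1 file + 1 block each
    big = (10_000_000 // 1000) * (1 + 8) * OBJ_BYTES   # same data in 1000x larger files, ~8 blocks each

    print(f"10M small files: ~{small / 1e9:.1f} GB of heap")   # ~3.0 GB
    print(f"Same data, fewer files: ~{big / 1e9:.2f} GB")      # ~0.01 GB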

Hadoop, HDFS, Hive

HBase Replication and comparison with popular online backup programs…

Short Description: an HBase replication solution can address cluster security, data security, read/write separation, and operations. Article: This article is the first in a series of three; the coming articles will include some code and the replication mechanism present in the latest versions of HBase. HBase Replication: an HBase replication solution can address cluster security, data security, read/write separation, operations and maintenance, and operator errors, and with its ease of management and configuration it provides powerful support for online applications. HBase replication is still rarely used in the industry, for many reasons, such […]

Apache Spark, Spark

Introduction to Spark

Introduction to Apache Spark: Spark, as a unified stack and computational engine, is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines. Over time, the big-data experts around the world derived specialized systems on top of Hadoop to solve certain problems like graph processing, implementation of efficient iterative algorithms, and real-time query engines. As you may know, components like Impala, Mahout, Tez, and GraphLab were all derived from Hadoop for different purposes. What is Apache Spark? Apache Spark is the generalized engine which combines the specialties of all […]

Hadoop, Kafka

Kafka: A detailed introduction

I’ll cover Kafka in detail, with an introduction to its programmability, and will try to cover almost the full architecture of it. So here it goes: we need Kafka when building a real-time processing system, as Kafka is a high-performance publisher-subscriber-based messaging system with highly scalable properties. Traditional systems are unable to process this large data and are mainly used for offline analysis; Kafka is a solution to the real-time problems of any software solution, that is to say, it unifies offline and online data processing and routes it to multiple consumers quickly. Below are the characteristics of Kafka: Persistent messaging: […]
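
Before the details, the pub-sub core in one glance, as a minimal sketch with the kafka-python client (the broker address and topic name are placeholders):

    from kafka import KafkaProducer, KafkaConsumer

    # Publisher: append a message to a topic; the brokers persist it to disk.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b"page_view:42")
    producer.flush()

    # Subscriber: independently read the same stream, online or later (offline).
    consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)
        break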

Bigdata, Hadoop, NoSql

The ACID properties and the CAP theorem are two concepts in data management for distributed systems.

Started working on HBase again!! Thought why not refresh a few concepts before proceeding to the actual work. Important things that come to mind when working with NoSQL in a distributed environment are sharding and partitioning. Let’s dive into the ACID properties of databases and the CAP theorem for distributed systems. The ACID properties and the CAP theorem are two concepts in data management for distributed systems; the funny thing is they both come with a “C” with totally different meanings. What is ACID? It is a set of rules that matters a lot for RDBMSs, because all RDBMSs are ACID-compliant. A = Atomicity, meaning all or nothing: if I […]

Analytics, Hadoop

Data Analysis Approach to a successful outcome

I have done data analysis for one of my projects using the approach below, and hopefully it may help you understand the underlying subject. Soon I’ll post my project on data analysis with a detailed description of the technologies used: Python (web scraping for data collection), Hadoop, Spark, and R. Data analysis is a highly iterative and non-linear process, better reflected as a series of cyclic processes in which information is learned at each step, which then informs whether (and how) to refine and redo the step that was just performed, or whether (and how) to proceed to the next step. Setting the Scene: Data analysis is […]