Hadoop, HBase, Hive

JRuby code to purge data from a Hive-over-HBase table…

Problem to solve: how to delete, update, or query values stored in binary format in a column of an HBase column family. For a Hive table over HBase, where the standard API cannot be used and filters cannot be applied to binary values, the solution below gives you programmatic access. The JRuby source code is on GitHub at github.com/mkjmkumar/JRuby_HBase_API. The program is written in JRuby; it purges data through the HBase shell, deleting the required data by applying a filter on a given binary column. You have probably already heard about the advantages of storing data in HBase (especially in binary block format) and creating a Hive table on top of it to query your data. I am not going to explain the use case for this, why […]

Hadoop, Hive, Java, Pig, Python

Python and Python bites

Python and Python bites: “lambda”. Hi everyone, this article shows you one powerful feature of the Python programming language called “lambda”. It lets you express a small function in a single line of code. So let's start with the basics of what may be your next programming language. Anonymous functions created at runtime are known as lambda functions. The line below defines and calls an ordinary function in Python: >>> def f(x): return x+42 >>> print(f(21)) 63 The same with a lambda function: >>> calc = lambda x: x+42 >>> calc(21) 63 A lambda definition does not include a “return” statement. It always contains a single expression, whose value is returned. Also […]
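
The comparison in the excerpt above can be run as a small script. This is only a minimal sketch: the names `f` and `calc` come from the excerpt, while the `words` list and the `sorted` call are illustrative additions (not from the original post) showing the most common use of a lambda, as a short inline callback.

```python
# Ordinary named function, as in the excerpt above.
def f(x):
    return x + 42


# Equivalent lambda: no "return" keyword; the value of the single
# expression after the colon is what the call returns.
calc = lambda x: x + 42

# Illustrative assumption (not from the original post): a lambda passed
# inline as the sort key. The word list is made-up sample data.
words = ["hive", "pig", "hadoop"]
by_length = sorted(words, key=lambda w: len(w))

if __name__ == "__main__":
    print(f(21))       # 63
    print(calc(21))    # 63
    print(by_length)   # ['pig', 'hive', 'hadoop']
```

Binding a lambda to a name, as `calc` does, mirrors the excerpt; in everyday code a `def` is usually preferred for named functions, and lambdas are kept for short throwaway callbacks like the sort key.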

Best Practices, Hadoop, Hive

Performance utilities in Hive

Before going into the details of the utilities Hive provides, let me explain a few components so that you understand the execution flow and where the related information is stored in the system. Hive is data warehouse software best suited to OLAP (OnLine Analytical Processing) workloads, handling and querying vast volumes of data that reside in distributed storage. The Hadoop Distributed File System (HDFS) is the storage layer in which Hive keeps its data reliably and which tolerates hardware failures. Hive is a SQL-like relational data warehousing approach built on top of Hadoop. HiveQL, as described, is a SQL-like query language for […]

Best Practices, Database, Hive

Best Practices for Hive Authorization when using a connector to HiveServer2

Recently we have been working with Presto and configuring its Hive connector. It connected successfully with the steps given at prestodb.io/docs/current/connector/hive.html. An overview of our architecture: Presto runs on a separate machine (the Presto Machine) and uses the Hive connector to communicate with the Hadoop cluster, which runs on different machines. The Presto Machine has a hive.properties file, which tells Presto to use a thrift connection to the Hive client, and the hdfs-site.xml and core-site.xml files for HDFS. Below is the architecture of our environment. Below is the command to invoke Presto… /presto --server XX.X.X.XX:9080 --catalog hive No presto user exists in […]

Hadoop, Hive, Oozie

Coding Tips and Best Practices in Hive and Oozie…

Many times during code review I have found the same common mistakes made by developers. Here are a few of them… Workflow mandatory item: Add this property to every workflow that has a Hive action. It makes sure that the Hive job runs with the necessary number of reducers instead of just 1. <property> <name>mapreduce.job.reduces</name> <value>-1</value> </property> HQL items: Setting properties: Keep the set properties in the HQL to a minimum and let everything else take its default value. Add only what is absolutely necessary for that script. If you are using older code as a template, do not […]

Hadoop, HDFS, Hive

HBase Replication and comparison with popular online backup programs…

Short Description: HBase Replication: an HBase replication solution can address cluster security, data security, read/write separation, and operation. Article: This is the first in a series of three articles; the following articles will cover some code and the mechanisms in the latest version of HBase that support replication. HBase Replication: An HBase replication solution can address cluster security, data security, read/write separation, operation and maintenance, and user operating errors; it also eases management and configuration and provides strong support for online applications. HBase replication is currently rarely used in the industry, for many reasons, such […]

Best Practices, Hive

Hive Naming conventions and database naming…

Short Description: Naming conventions help programmers and architects understand what is going on inside a business. Article: I have worked with almost 20 to 25 applications. Whenever I start on one, I first have to understand that application's naming convention, and I keep wondering why we do not all follow a single naming convention. As Hadoop is evolving rapidly, I would like to share my naming convention, so that if you come to my project you will feel comfortable, and so will I if you follow it too. Database Names: If an application serves a technology, then the database name would be […]