Hive Best Practices

Before taking you into the details of the utilities provided by Hive, let me explain a few components to show the execution flow and where the related information is stored in the system.

Hive is data warehouse software best suited for OLAP (OnLine Analytical Processing) workloads, handling queries over vast volumes of data residing in distributed storage. The Hadoop Distributed File System (HDFS) is the ecosystem in which Hive maintains the data reliably and survives hardware failures. Hive was among the first SQL-like relational big data warehousing approaches developed on top of Hadoop.

HiveQL, as described, is an SQL-like query language for expressing queries in Hive. After a query is issued through an interface in Hive, it undergoes several processing phases, including parsing, semantic analysis, logical plan generation, and physical plan generation; ultimately the plan is transformed into a sequence of MapReduce operations, which are then executed over Hadoop.
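For example, a simple aggregation like the following (the employee table and its columns are assumptions for illustration) is typically compiled into a single MapReduce job, with the GROUP BY realized through the shuffle and reduce phases:

```sql
-- Hypothetical example: table 'employee' with columns dept and salary.
-- The map phase reads rows and emits (dept, salary) pairs; the shuffle
-- groups them by dept, and the reduce phase computes each average.
SELECT dept, AVG(salary) AS avg_salary
FROM employee
GROUP BY dept;
```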

What is the Hive Metastore:-

The Hive system catalog contains schemas, tables, columns and their types, table locations, statistics, and other information essential for data management. Since metadata should be available fast, Hive uses a traditional RDBMS (e.g., Apache Derby, MySQL, etc.) to manage metadata rather than using HDFS. Because Java works with objects and an RDBMS uses the relational model, an Object Relational Mapping (ORM) layer, called DataNucleus, accompanies the RDBMS to translate between objects and the relational schema.
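As a sketch, a MySQL-backed metastore is typically configured in hive-site.xml through the standard JDO connection properties; the host name, database name, and credentials below are placeholders, not values from this document:

```xml
<!-- hive-site.xml: pointing the metastore at a MySQL database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```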

Driver:-

The Driver is the component that receives the query after the UI accepts it from the user, and it manages the lifespan of a query inside Hive. It also implements the notion of session handles and retrieves session statistics. Session has the same meaning as in traditional databases: “The SQL operations that are performed while a connection is active form a session.”

The Driver consists of three main components, namely, the Compiler, the Optimizer, and the Executor. The compiler translates HiveQL into a DAG (Directed Acyclic Graph) of MapReduce tasks, which are executed by the executor (execution engine) in the order of their dependencies. The optimizer sits between the compiler and executor to improve performance.

Hive Server:-

The Hive Server or Thrift Server allows access to Hive through a single port, that is, it allows programmatic access to Hive remotely. Therefore it provides a means to integrate Hive with other applications. Thrift is a scalable cross-language service development framework, or simply, a binary communication protocol. Clients in different programming languages can communicate seamlessly with Hive using the “thrift interface”. Here, by client we mean any source that issues a query.
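For instance, assuming a HiveServer2 instance is running on its default port 10000 (the host name and user below are placeholders), the bundled Beeline client can connect to it over JDBC:

```
# Connect to a remote HiveServer2 instance via JDBC (default port 10000)
beeline -u "jdbc:hive2://hiveserver-host:10000/default" -n hive_user
```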

How do the components of the Hive architecture interact with each other?
A user submits a query via the Hive CLI/Hive Web Interface, JDBC/ODBC, or the Thrift interface. The Driver receives the query and passes it to the compiler. The compiler does the typical parsing, type checking, and semantic analysis, consulting the metastore if needed. Finally, it generates a logical query plan that is sent to the optimizer. The optimized query plan is converted into a DAG of MapReduce jobs. The executor executes these jobs on Hadoop in the order of their dependencies. Figure 6 shows the steps of this process.

Hive provides the EXPLAIN and ANALYZE statements, which can be used as utilities to inspect and tune the performance of queries.

The EXPLAIN statement:-
Hive provides the EXPLAIN command to return a query execution plan without running the query. We can use EXPLAIN on a query when we have a doubt or a concern about its performance; it helps us see the difference between two or more queries written for the same purpose. The syntax for EXPLAIN is as follows:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query

The following keywords can be used:-
EXTENDED: This provides additional information for the operators in the plan, such as file pathname and abstract syntax tree.
DEPENDENCY: This provides a JSON format output that contains a list of tables and partitions that the query depends on. It is available since Hive 0.10.0.
AUTHORIZATION: This lists all entities needed to be authorized, including the inputs and outputs required to run the Hive query, and authorization failures, if any. It is available since Hive 0.14.0.
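As a sketch (the employee table is an assumption, matching the later ANALYZE examples), EXPLAIN might be issued as follows; the output describes the stages the query compiles into and the dependencies between them:

```sql
-- Show the execution plan without running the query.
-- The output typically lists stage dependencies and per-stage plans.
EXPLAIN
SELECT dept, COUNT(*) AS cnt
FROM employee
GROUP BY dept;
```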

The ANALYZE statement:-
Hive statistics are a collection of data that describe details, such as the number of rows, the number of files, and the raw data size, of the objects in the Hive database. Statistics are metadata about Hive data. Hive supports statistics at the table, partition, and column level.
These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), an optimizer that picks the query plan with the lowest cost in terms of the system resources required to complete the query.
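As a sketch, whether the CBO runs and whether it actually consults the gathered statistics is controlled by a few configuration properties (defaults vary by Hive version):

```sql
-- Enable the cost-based optimizer (available since Hive 0.14.0)
SET hive.cbo.enable=true;
-- Let the optimizer fetch column and partition statistics from the metastore
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
```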
The statistics are gathered through the ANALYZE statement (available since Hive 0.10.0) on tables, partitions, and columns, as shown in the following examples:

ANALYZE TABLE employee COMPUTE STATISTICS;

ANALYZE TABLE employee_partitioned PARTITION(year=2014, month=12) COMPUTE STATISTICS;

ANALYZE TABLE employee_id COMPUTE STATISTICS FOR COLUMNS employee_id;

Once the statistics are built, we can check them with the DESCRIBE EXTENDED/FORMATTED statement. In the table/partition output, the statistics appear inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}. The following is an example:

DESCRIBE EXTENDED employee_partitioned PARTITION(year=2014, month=12);
or
DESCRIBE EXTENDED employee;

Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting:
SET hive.stats.autogather=true;
