Many time during the code review found some common mistakes done by the developer. Here are few of them…
Workflow mandatory item :
Add this property in all workflows that have a Hive action. This property will make sure that the hive job runs with the necessary number of reducers instead of just 1.
<name> mapreduce.job.reduces </name> <value>-1</value>
HQL items : Setting properties:
Keep the set properties in the HQL to a minimum. Let it take the default values. Add only what is absolutely necessary for that script. If you are using older code as template do not carry over all the set properties.
HQL items : Container properties:
Do not add any memory settings for the hql unless, it’s absolutely needed and has been verified by the Leads. This is very important, as assigning large amounts of memory will cause cluster imbalance. Specifically
Not to Turn of Autoconvert:
Do not turn off autoconvert joins ( set hive.tez.autoconvert property is defaulted to TRUE)
HQL items : Table joins
When joining 2 tables just use the table name or a data restricting query for the table Example
Select a,b….. From external_table inner JOIN orc_table On key1= key1 and key2 = key2
Select a, b….. From external_table inner JOIN (select col1, col2 from orc_table Where key1 = key1 and key2 = key2) On key1= key1 and key2 = key2
Do not use this:
Select a, b….. From external_table inner JOIN (select col1, col2 from orc_table) On key1= key1 and key2 = key2
This will cause all the data of orc_table to be loaded.
For left outer joins
For left outer joins , restrict the search criteria for the last 30 days ( or whatever is the business need ),if not it might throw outOfMemory
Multiple left outer joins
When there are multiple left outer joins, do not put them in one query. Create temporary tables progressively
If the query has left joins with 4 tables a, b, c, d then don’t write one query with left joins
Select … From a left join b on … Left join c on … Left join d on …
Instead create temp tables progressively
Create temp table1 as Select …from a Left join b on …
Create temp table2 as Select … from table1 Left join c on …
Crate temp table3 as Select … from table2 Left join d on ….
A query with group by
When doing a query with group by , between a lower level and a higher level table, first create a temp table for the group by and then left join with the higher level table
Change unix_timestamp to to_unix_timestamp (current_timestamp ()) in the Where clause. It’s ok to use unix_timestamp in the select clause.
Please add your code review comments or best practice in comment section!!!