Hive Cheat Sheet

1. Hive options

Set execution engine

set hive.execution.engine=mr/spark/tez;

Set queuename

set mapreduce.job.queuename=default;

2. Hive optimizations

Vectorized execution

hive.vectorized.execution.enabled

https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

Map Joins

hive.auto.convert.join;

https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

When hive.auto.convert.join is activated, Hive will make a MAPJOIN in case the n-1 tables of the join are smaller or equal to the value of hive.auto.convert.join.noconditionaltask.size. But this paramter should also be lower than the heap size of the HiverServer or HiveCli (depending how you acess Hive). Thus, we need to lower the value of hive.auto.convert.join.noconditionaltask.size to avoid OOM errors.

Note that the ORC compression is not included in the hive.auto.convert.join.noconditionaltask.size. Which means a table of 1GB in ORC is actually 10 times bigger once uncompressed and it should fit into the heap memory.

Conclusion: With hive.auto.convert.join.noconditionaltask.size = 256MB, we may have in memory up to 10x more, i.e 2GB (which should correspond to the heap of the Hiveserver)

3. Hive queries

Deduplicate lines

select  <needed fields>
            from 
select *, row_number() over (partition by id order by tech_timestampchargement desc) as rank
from [database.table]
where rank = 1  // to keep only rhe first occurence

4. Hive procedures

Tranfert data between clusters If it’s an external table, you just need to execute a “show create table” from the source cluster and execute it on the new cluster. You just need to copy paste the content of the table folder from a ccuster to the other in the path specified in the create statement.

For internal tables, the process is the same but you may need to execute an “msck repar table” in order to update the tables pointers.

comments powered by Disqus