Hive Cheat Sheet

1. Hive options

Set execution engine

set hive.execution.engine=mr/spark/tez;

Set queuename

set mapreduce.job.queuename=default;

2. Hive optimizations

Vectorized execution


Map Joins;

When is activated, Hive will make a MAPJOIN in case the n-1 tables of the join are smaller or equal to the value of But this paramter should also be lower than the heap size of the HiverServer or HiveCli (depending how you acess Hive). Thus, we need to lower the value of to avoid OOM errors.

Note that the ORC compression is not included in the Which means a table of 1GB in ORC is actually 10 times bigger once uncompressed and it should fit into the heap memory.

Conclusion: With = 256MB, we may have in memory up to 10x more, i.e 2GB (which should correspond to the heap of the Hiveserver)

3. Hive queries

Deduplicate lines

select  <needed fields>
select *, row_number() over (partition by id order by tech_timestampchargement desc) as rank
from [database.table]
where rank = 1  // to keep only rhe first occurence

4. Hive procedures

Tranfert data between clusters If it’s an external table, you just need to execute a “show create table” from the source cluster and execute it on the new cluster. You just need to copy paste the content of the table folder from a ccuster to the other in the path specified in the create statement.

For internal tables, the process is the same but you may need to execute an “msck repar table” in order to update the tables pointers.

comments powered by Disqus