1. Hive options
Set execution engine
set hive.execution.engine=mr/spark/tez;
Set queuename
set mapreduce.job.queuename=default;
2. Hive optimizations
Vectorized execution
hive.vectorized.execution.enabled
https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
Map Joins
hive.auto.convert.join;
https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
When hive.auto.convert.join is activated, Hive will make a MAPJOIN in case the n-1 tables of the join are smaller or equal to the value of hive.auto.convert.join.noconditionaltask.size. But this paramter should also be lower than the heap size of the HiverServer or HiveCli (depending how you acess Hive). Thus, we need to lower the value of hive.auto.convert.join.noconditionaltask.size to avoid OOM errors.
Note that the ORC compression is not included in the hive.auto.convert.join.noconditionaltask.size. Which means a table of 1GB in ORC is actually 10 times bigger once uncompressed and it should fit into the heap memory.
Conclusion: With hive.auto.convert.join.noconditionaltask.size = 256MB, we may have in memory up to 10x more, i.e 2GB (which should correspond to the heap of the Hiveserver)
3. Hive queries
Deduplicate lines
select <needed fields>
from
select *, row_number() over (partition by id order by tech_timestampchargement desc) as rank
from [database.table]
where rank = 1 // to keep only rhe first occurence
4. Hive procedures
Tranfert data between clusters If it’s an external table, you just need to execute a “show create table” from the source cluster and execute it on the new cluster. You just need to copy paste the content of the table folder from a ccuster to the other in the path specified in the create statement.
For internal tables, the process is the same but you may need to execute an “msck repar table” in order to update the tables pointers.
comments powered by Disqus