Experiment FindingsPDF Available

A study of recent contributions on Hive Big Data query engine Optimization

Authors:

Abstract

In this research, efforts have been made to address the challenges posed by query optimization in big data environments, including how to improve queries on large volume of data. The architecture of the Hive query engine are investigated in detail, and are referred to how to optimize queries that are specifically done by the compiler and optimizer units. The details of the optimization methods are pretty discussed and, at the end, the correlation-based approach is described in detail with various examples. Finally, the result of the efficiency of the general methods is presented in the form of an analytical chart.

w w w . j d c s . i r


Hive

*

     
   






Hive





Keywords:
Big Data
Hive Query engine
Query Optimization
Correlation-based Query
Optimization

. .   Hive 
  
* Corresponding author   
   Hive
     



       
 
[1]
 IBM     
[2]        
         
         

       
 [3]
.

        

       
       

    
 
.
        
   


      
   
 
      

      


 
 
      
      
       



  
   .   
      
       
 
.
       
  
PigHivePrestoSparkImpala
       
  

Presto     
     
Apache Hadoop
Distributed Computation Environment
   Hive
     

       

       
       

    
  Hive    


Hive


SQL-Like
HQL


OLAP

OLTP









[1]
























Schema on write
Schema on read
   Hive
      









Hive




HDFS


 





HDFSNFS

[1]

DDL[6]

Fast initial load
Hadoop Distributed File System



 [4]


Hash


[4]
 Hive[1]
Hadoop
       
       .
[9]       
      
      
        

        
      
       [10]
 
         
   Hive
     




    
  
       
       
        
[9]

  
      

  


      
      

   
 

HDFS[10]
Commodity hardware
Reliability
Hadoop Distributed File System
Strongly connected
Name node
Data nodes
       


  
HDFS      
       
         
Heartbeat



       
       
 





 
        

      
 



Replication
   Hive
      



 
       
.

   

       


[2]
      
      

      
      
      
      
         
Map-reduce
        
:
      
     
.



.


.
       

      
        

  [9]

      

Bottleneck
Single point of failure
   Hive
     

     [9]






Parquet






 [8]
https://parquet.apache.org
 ORC



HDFS



ORC


stripes








countminmaxsum





Optimized Row Columnar
   Hive
      

minmax

  ORC
ORC

ORC
ACID



[8]





[8]
Built-in indexes

Map
Union

[8][9]

        
     
      
       
Hive

   [9]

 [3]
Structs
   Hive
      


UI
     
     

       
 [7]



Session
JDBC/ODBC
 [7]

       
    
      
     

       
    
 
 [7]
      


User Interface
Driver
Compiler
Execution Plan
       
[2]


       
HDFS   
    

 [7]   


      

  
  

      
     

HiveQL
SQL
SQL
        
  
      
Metastore
Execution Engine
Directed Acyclic Graph (DAG)
Streaming Interface
   Hive
      



        
 [2] 

   
   

HiveQL

   
  
 [7]

   
  DDL
DML

      
   DDL   
  HDFS  LOAD
        
        

        


Row-based
Multi-table Insert
Scan
Data Manipulation Language
Parser













      


       
      

       

o Join
  Join
       
Semantic Analyzer
Block-based
Logical Plan Generator
Optimizer
   Hive
      

     

o      
    
    
   
mapreduce


o      
     


o     
    

o     



  
       
repartition union all 

Physical Execution Plan Generator
     


   
[5]





 
       


   Hive
      

  
 



Push down
select-select


     

     
        
ORC
     
    




 
sampling

union

Projection pruning
Deducing transitive predicates
Merging select-select queries in to single operator
Multiway
Query Rewrite to accommodate for Join skew on some
column values
Partition pruning
Scan pruning based on partitions and bucketing
Remove unnecessary reduce sink operators


  group by


      


ORC

      
ORC

ORC
  

     [11]
      

       
       
Cost-based optimization
   Hive
      


   
   

ORC
       





select

  
   
     
       

 


        
       

       


SELECT tl.key, SUM(value)
FROM t1 JOIN t2 ON (tl.key = t2.key)
Correlation-based optimization
GROUP BY tl.key

      
group by
group by



        

         
     group by 
         
        
       



SELECT tmpl.key, COUNT(*)
FROM (SELECT key, AVG(value) AS avg
FROM t1
GROUP BY /*AGG1*/ key) tmpl
JOIN /*JOIN1*/ tl
ON (tmpl.key = t2.key)
WHERE tl.value > tmpl.avg
GROUP BY /*AGG2*/ tmpl.key

      


Shuffle
   Hive
      


  




 



SELECT tmpl.key,COUNT(*)
FROM tl
JOIN /*JOIN1*/ (SELECT KEY, AVG(value) AS AVG
FROM tl
GROUP BY /*AGG1*/ key ) tmpl
ON ( tl.key = tmpl.key )
JOIN /*JOIN1*/ t2 ON ( tmpl.key = t2.key )
WHERE t2.value > tmpl.avg
GROUP BY /*AGG2*/ tl.key;

   
       
        


 



   t1    
group byt1
   Hive
      

      
      


  


HiveQL
      
    
Hint

STREAMTABLE

      

      
      
 [7]
SELECT /* STREAMTABLE(a) */
a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.keyl)
JOIN c ON (c.key = b.key1)
JOIN d ON (d.key = c.key)
JOIN a ON (e.key = d.key)
a

  

MAPJOIN
cache
 


bcde
 [7]
SELECT /* MAPJOIN(b,c,d,e) */
a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.keyl)
JOIN c ON (c.key = b.key1)
JOIN d ON (d.key = c.key)
JOIN a ON (e.key = d.key)
 















   Hive
      






 
[1]
"Hive Website," [Online]. Available:
http://hive.apache.org/.
[2]
A. Thusoo, Joydeep Sen Sarma, Namit
Jain, Zheng Shao, Prasad Chakka, Suresh
Anthony, Hao Liu, Pete Wyckoff and
Raghotham Murthy, "Hive: a warehousing
solution over a map-reduce framework," in
Proceedings of the VLDB Endowment 2,
no. 2, 2009.
[3]
Thusoo, Joydeep Sen Sarma, Namit Jain,
Zheng Shao, Prasad Chakka, Ning Zhang,
Suresh Antony, Hao Liu and Raghotham
Murthy, "Hive-a petabyte scale data
warehouse using hadoop," in Data
Engineering (ICDE), 2010 IEEE 26th
International Conference on, pp. 996-1005.
IEEE, 2010., 2010.
[4]
A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao,
P. Chakka, S. Anthony, H. Liu, P. Wyckoff
and R. Murthy, "Hive: a warehousing
solution over a map-reduce framework".
[5]
Vlad Mihai MIREL, Benchmarking Big
Data SQL Frameworks., Aalto University,
2016.
[6]
"Apache ORC," [Online]. Available:
https://orc.apache.org/.
[7]
"Hive design and architecture," [Online].
Available:
https://cwiki.apache.org/confluence/display
/Hive/Design.
[8]
Hortonworks, Apache Hive Performance
Tuning, Hortonworks Docs, 2017.
[9]
W. Tom, Hadoop: The definitive guide.,
O'Reilly Media, 2012.
[10]
K. Shvachko, Hairong Kuang, Sanjay
Radia and Robert Chansler, The hadoop
distributed file system, Mass storage
systems and technologies (MSST), 2010
IEEE 26th symposium on (pp. 1-10), 2010.
[11]
Z. Tariq, "Query optimization techniques in
Apache Hive," 2015. [Online]. Available:
https://www.slideshare.net/ZaraTariq/query
-optimization-techniques-in-apache-hive.
   Hive
      







     



pasdari.meysam@znu.ac.ir






 



.

dmp@znu.ac.ir
ResearchGate has not been able to resolve any citations for this publication.
Article
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
Conference Paper
The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation which is being used in companies like Yahoo, Facebook etc. to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug in custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog - Metastore - that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation. In Facebook, the Hive warehouse contains tens of thousands of tables and stores over 700TB of data and is being used extensively for both reporting and ad-hoc analyses by more than 200 users per month.
Article
The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.
Benchmarking Big Data SQL Frameworks
  • Vlad Mihai
Vlad Mihai MIREL, Benchmarking Big Data SQL Frameworks., Aalto University, 2016.
Apache Hive Performance Tuning, Hortonworks Docs
  • Hortonworks
Hortonworks, Apache Hive Performance Tuning, Hortonworks Docs, 2017.
Query optimization techniques in Apache Hive
  • Z Tariq
Z. Tariq, "Query optimization techniques in Apache Hive," 2015. [Online]. Available: https://www.slideshare.net/ZaraTariq/query -optimization-techniques-in-apache-hive.