Content uploaded by Maysam Pasdari
Author content
All content in this area was uploaded by Maysam Pasdari on Jan 06, 2019
Content may be subject to copyright.
w w w . j d c s . i r
Hive
*
Hive
Keywords:
Big Data
Hive Query engine
Query Optimization
Correlation-based Query
Optimization
. . Hive
* Corresponding author
Hive
[1]
IBM
[2]
[3]
.
.
.
.
PigHivePrestoSparkImpala
Presto
1
Apache Hadoop
2
Distributed Computation Environment
Hive
Hive
Hive
SQL-Like
HQL
OLAP
OLTP
[1]
3
Schema on write
4
Schema on read
Hive
Hive
HDFS
HDFSNFS
[1]
DDL[6]
5
Fast initial load
6
Hadoop Distributed File System
[4]
Hash
[4]
Hive[1]
Hadoop
.
[9]
[10]
Hive
[9]
HDFS[10]
7
Commodity hardware
8
Reliability
9
Hadoop Distributed File System
10
Strongly connected
11
Name node
12
Data nodes
HDFS
Heartbeat
13
Replication
Hive
14
.
[2]
14
Map-reduce
:
.
.
.
[9]
15
Bottleneck
16
Single point of failure
Hive
[9]
Parquet
[8]
17
https://parquet.apache.org
ORC
HDFS
ORC
stripes
countminmaxsum
18
Optimized Row Columnar
Hive
minmax
ORC
ORC
ORC
ACID
[8]
[8]
19
Built-in indexes
Map
Union
[8][9]
Hive
[9]
[3]
20
Structs
Hive
UI
[7]
Session
JDBC/ODBC
[7]
[7]
21
User Interface
22
Driver
23
Compiler
24
Execution Plan
[2]
HDFS
[7]
HiveQL
SQL
SQL
25
Metastore
26
Execution Engine
27
Directed Acyclic Graph (DAG)
28
Streaming Interface
Hive
[2]
HiveQL
[7]
DDL
DML
DDL
HDFS LOAD
29
Row-based
30
Multi-table Insert
31
Scan
32
Data Manipulation Language
33
Parser
o Join
Join
34
Semantic Analyzer
35
Block-based
36
Logical Plan Generator
37
Optimizer
Hive
o
mapreduce
o
o
o
repartition union all
38
Physical Execution Plan Generator
[5]
Hive
Push down
select-select
ORC
sampling
union
39
Projection pruning
40
Deducing transitive predicates
41
Merging select-select queries in to single operator
42
Multiway
43
Query Rewrite to accommodate for Join skew on some
column values
44
Partition pruning
45
Scan pruning based on partitions and bucketing
46
Remove unnecessary reduce sink operators
group by
ORC
ORC
ORC
[11]
47
Cost-based optimization
Hive
ORC
select
SELECT tl.key, SUM(value)
FROM t1 JOIN t2 ON (tl.key = t2.key)
48
Correlation-based optimization
GROUP BY tl.key
group by
group by
group by
SELECT tmpl.key, COUNT(*)
FROM (SELECT key, AVG(value) AS avg
FROM t1
GROUP BY /*AGG1*/ key) tmpl
JOIN /*JOIN1*/ tl
ON (tmpl.key = t2.key)
WHERE tl.value > tmpl.avg
GROUP BY /*AGG2*/ tmpl.key
49
Shuffle
Hive
SELECT tmpl.key,COUNT(*)
FROM tl
JOIN /*JOIN1*/ (SELECT KEY, AVG(value) AS AVG
FROM tl
GROUP BY /*AGG1*/ key ) tmpl
ON ( tl.key = tmpl.key )
JOIN /*JOIN1*/ t2 ON ( tmpl.key = t2.key )
WHERE t2.value > tmpl.avg
GROUP BY /*AGG2*/ tl.key;
t1
group byt1
Hive
HiveQL
Hint
STREAMTABLE
[7]
SELECT /* STREAMTABLE(a) */
a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.keyl)
JOIN c ON (c.key = b.key1)
JOIN d ON (d.key = c.key)
JOIN a ON (e.key = d.key)
a
MAPJOIN
cache
bcde
[7]
SELECT /* MAPJOIN(b,c,d,e) */
a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.keyl)
JOIN c ON (c.key = b.key1)
JOIN d ON (d.key = c.key)
JOIN a ON (e.key = d.key)
Hive
[1]
"Hive Website," [Online]. Available:
http://hive.apache.org/.
[2]
A. Thusoo, Joydeep Sen Sarma, Namit
Jain, Zheng Shao, Prasad Chakka, Suresh
Anthony, Hao Liu, Pete Wyckoff and
Raghotham Murthy, "Hive: a warehousing
solution over a map-reduce framework," in
Proceedings of the VLDB Endowment 2,
no. 2, 2009.
[3]
Thusoo, Joydeep Sen Sarma, Namit Jain,
Zheng Shao, Prasad Chakka, Ning Zhang,
Suresh Antony, Hao Liu and Raghotham
Murthy, "Hive-a petabyte scale data
warehouse using hadoop," in Data
Engineering (ICDE), 2010 IEEE 26th
International Conference on, pp. 996-1005.
IEEE, 2010., 2010.
[4]
A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao,
P. Chakka, S. Anthony, H. Liu, P. Wyckoff
and R. Murthy, "Hive: a warehousing
solution over a map-reduce framework".
[5]
Vlad Mihai MIREL, Benchmarking Big
Data SQL Frameworks., Aalto University,
2016.
[6]
"Apache ORC," [Online]. Available:
https://orc.apache.org/.
[7]
"Hive design and architecture," [Online].
Available:
https://cwiki.apache.org/confluence/display
/Hive/Design.
[8]
Hortonworks, Apache Hive Performance
Tuning, Hortonworks Docs, 2017.
[9]
W. Tom, Hadoop: The definitive guide.,
O'Reilly Media, 2012.
[10]
K. Shvachko, Hairong Kuang, Sanjay
Radia and Robert Chansler, The hadoop
distributed file system, Mass storage
systems and technologies (MSST), 2010
IEEE 26th symposium on (pp. 1-10), 2010.
[11]
Z. Tariq, "Query optimization techniques in
Apache Hive," 2015. [Online]. Available:
https://www.slideshare.net/ZaraTariq/query
-optimization-techniques-in-apache-hive.
Hive
pasdari.meysam@znu.ac.ir
.
dmp@znu.ac.ir