Vectorization vs. Compilation in Query Execution

Juliusz Sompolski¹
VectorWise B.V.
julek@vectorwise.com

Marcin Zukowski
VectorWise B.V.
marcin@vectorwise.com

Peter Boncz²
Vrije Universiteit Amsterdam
p.a.boncz@vu.nl
ABSTRACT
Compiling database queries into executable (sub-)programs provides substantial benefits compared to traditional interpreted execution. Many of these benefits, such as reduced interpretation overhead, better instruction code locality, and opportunities to use SIMD instructions, have previously been provided by redesigning query processors to use a vectorized execution model. In this paper, we try to shed light on the question of how state-of-the-art compilation strategies relate to vectorized execution for analytical database workloads on modern CPUs. For this purpose, we carefully investigate the behavior of vectorized and compiled strategies inside the Ingres VectorWise database system in three use cases: Project, Select and Hash Join. One of the findings is that compilation should always be combined with block-wise query execution. Another contribution is identifying three cases where "loop-compilation" strategies are inferior to vectorized execution. As such, a careful merging of these two strategies is proposed for optimal performance: either by incorporating vectorized execution principles into compiled query plans or by using query compilation to create building blocks for vectorized processing.
1. INTRODUCTION
Database systems provide many useful abstractions such as data independence, ACID properties, and the possibility to pose declarative complex ad-hoc queries over large amounts of data. This flexibility implies that a database server has no advance knowledge of the queries until run-time, which has traditionally led most systems to implement their query evaluators using an interpretation engine. Such an engine evaluates plans consisting of algebraic operators, such as Scan, Join, Project, Aggregation and Select.
¹This work is part of a MSc thesis being written at Vrije Universiteit Amsterdam.
²The author also remains affiliated with CWI Amsterdam.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Proceedings of the Seventh International Workshop on Data Management on New Hardware (DaMoN 2011), June 13, 2011, Athens, Greece.
Copyright 2011 ACM 978-1-4503-0658-4 ...$10.00.
The operators internally include expressions, which can be boolean conditions used in Joins and Select, calculations used to introduce new columns in Project, and functions like MIN, MAX and SUM used in Aggregation. Most query interpreters follow the so-called iterator-model (as described in Volcano [5]), in which each operator implements an API that consists of open(), next() and close() methods. Each next() call produces one new tuple, and query evaluation follows a "pull" model in which next() is called recursively to traverse the operator tree from the root downwards, with the result tuples being pulled upwards.
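To make the pull model concrete, the following is a minimal sketch of such an iterator interface with a Select operator on top of it. It is our own illustration, not code from any of the systems discussed in this paper; all names are hypothetical.

// Volcano-style iterator API: every operator exposes open()/next()/close(),
// and next() produces a single tuple, pulled recursively from the plan root.
#include <optional>
#include <vector>

using Tuple = std::vector<int>;  // simplified: a tuple of integer attributes

struct Operator {
  virtual void open() = 0;
  // Returns one tuple per call, or std::nullopt when the input is exhausted.
  virtual std::optional<Tuple> next() = 0;
  virtual void close() = 0;
  virtual ~Operator() = default;
};

// Example: a Select operator pulls tuples from its child one at a time and
// forwards only those satisfying its predicate; the per-tuple virtual calls
// are the interpretation overhead discussed in the text.
struct Select : Operator {
  Operator* child;
  bool (*pred)(const Tuple&);
  Select(Operator* c, bool (*p)(const Tuple&)) : child(c), pred(p) {}
  void open() override { child->open(); }
  std::optional<Tuple> next() override {
    while (auto t = child->next())
      if (pred(*t)) return t;
    return std::nullopt;
  }
  void close() override { child->close(); }
};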
It has been observed that the tuple-at-a-time model leads to interpretation overhead: the situation that much more time is spent in evaluating the query plan than in actually calculating the query result. Additionally, this tuple-at-a-time interpretation model particularly affects the high performance features introduced in modern CPUs [13]. For instance, the fact that units of actual work are hidden in the stream of interpreting code and function calls prevents compilers and modern CPUs from getting the benefits of deep CPU pipelining and SIMD instructions, because for these the work instructions should be adjacent in the instruction stream and independent of each other.
Related Work: Vectorized execution. MonetDB [2] reduced interpretation overhead by using bulk processing, where each operator fully processes its input, and only then invokes the next execution stage. This idea has been further improved in the X100 project [1], later evolving into VectorWise, with vectorized execution. It is a form of block-oriented query processing [8], where the next() method produces a block (typically 100-10000) of tuples rather than a single tuple. In the vectorized model, data is represented as small single-dimensional arrays (vectors), easily accessible for CPUs. The effect is (i) that the percentage of instructions spent in interpretation logic is reduced by a factor equal to the vector-size, and (ii) that the functions that perform work now typically process an array of values in a tight loop. Such tight loops can be optimized well by compilers, e.g. unrolled when beneficial, and enable compilers to generate SIMD instructions automatically. Modern CPUs also do well on such loops, as function calls are eliminated, branches get more predictable, and out-of-order execution in CPUs often takes multiple loop iterations into execution concurrently, exploiting the deeply pipelined resources of modern CPUs. It was shown that vectorized execution can improve data-intensive (OLAP) queries by a factor 50.
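For contrast with the tuple-at-a-time interface above, a vectorized engine exposes a block-oriented variant of the same API. The sketch below is our illustration only: the names, the column-block layout and the vector size of 1024 are assumptions, not the actual VectorWise interfaces.

// Block-oriented (vectorized) execution: next() fills small per-column arrays
// ("vectors") and returns how many tuples it produced, so interpretation cost
// is paid once per vector instead of once per tuple.
#include <cstddef>

constexpr std::size_t VECTOR_SIZE = 1024;   // typically 100-10000 tuples

struct VectorBlock {
  int         l_extprice[VECTOR_SIZE];      // one contiguous array per column
  signed char l_discount[VECTOR_SIZE];
  signed char l_tax[VECTOR_SIZE];
};

struct VectorizedOperator {
  virtual void open() = 0;
  // Fills 'out' with up to VECTOR_SIZE tuples; returns 0 at end of stream.
  virtual std::size_t next(VectorBlock& out) = 0;
  virtual void close() = 0;
  virtual ~VectorizedOperator() = default;
};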
Related Work: Loop-compilation. An alternative strategy for eliminating the ill effects of interpretation is using Just-In-Time (JIT) query compilation. On receiving a query for the first time, the query processor compiles (part of) the query into a routine that gets subsequently executed. In Java engines, this can be done through the generation of new Java classes that are loaded using reflection (and JIT compiled by the virtual machine) [10]. In C or C++, source code text is generated, compiled, dynamically loaded, and executed. System R originally skipped compilation by generating assembly directly, but the non-portability of that approach led to its abandonment [4]. Depending on the compilation strategy, the generated code may either solve the whole query ("holistic" compilation [7]) or only certain performance-critical pieces. Other systems that are known to use compilation are ParAccel [9] and the recently announced HyPer system [6]. We will generalise the current state-of-the-art using the term "loop-compilation" strategies, as these typically try to compile the core of the query into a single loop that iterates over tuples. This can be contrasted with vectorized execution, which decomposes operators into multiple basic steps, and executes a separate loop for each basic step ("multi-loop").
Compilation removes interpretation overhead and can lead to very concise and CPU-friendly code. In this paper, we put compilation in its most favourable light by assuming that compilation-time is negligible. This is often true in OLAP queries, which tend to be rather long-running, and technologies such as JIT in Java and the LLVM framework for C/C++ [12] nowadays provide low (milliseconds) latencies for compiling and linking.
Roadmap: vectorization vs. compilation. Vectorized expressions process one or more input arrays and store the result in an output array. Even though systems like VectorWise go through lengths to ensure that these arrays are CPU cache-resident, this materialization constitutes extra load/store work. Compilation can avoid this work by keeping results in CPU registers as they flow from one expression to the other. Also, compilation as a general technique is orthogonal to any execution strategy, and can only improve performance. We used the VectorWise DBMS³ to investigate three interesting use cases that highlight the issues around the relationship between compilation and vectorization.
As our first case, Section 2 shows how, for Project expression calculations, loop-compilation tends to provide the best results, but that this hinges on using block-oriented processing. Thus, compiling expressions in a tuple-at-a-time engine may improve performance somewhat, but falls short of the gains that are possible. In Section 3, our second case is Select, where we show that branch mispredictions hurt loop-compilation when evaluating conjunctive predicates. In contrast, the vectorized approach avoids this problem, as it can transform control-dependencies into data-dependencies for evaluating booleans (along [11]). The third case in Section 4 concerns probing large hash-tables, using a Hash Join as an example. Here, loop-compilation gets CPU cache miss stalled while processing linked lists (i.e., hash bucket chains). We show that a mixed approach that uses vectorization for chain-following is fastest, and robust to the various parameters of the key lookup problem. These findings lead to a set of conclusions which we present in Section 5.
³See www.ingres.com/vectorwise. Data storage and query evaluation in VectorWise is based on the X100 project [1].
Algorithm 1: Implementation of an example query using vectorized and compiled modes. Map-primitives are statically compiled functions for combinations of operations (OP), types (T) and input formats (col/val). Dynamically compiled primitives, such as c000(), follow the same pattern as pre-generated vectorized primitives, but may take arbitrarily complex expressions as OP.
// General vectorized primitive pattern
map_OP_T_col_T_col(idx n, T* res, T* col1, T* col2) {
  for(int i = 0; i < n; i++)
    res[i] = OP(col1[i], col2[i]);
}

// The micro-benchmark uses data stored in:
const idx LEN = 1024;
chr tmp1[LEN], tmp2[LEN], one = 100;
sht tmp3[LEN];
int tmp4[LEN];  // final result

// Vectorized code:
map_sub_chr_val_chr_col(LEN, tmp1, &one, l_discount);  // 100 - l_discount
map_add_chr_val_chr_col(LEN, tmp2, &one, l_tax);       // 100 + l_tax
map_mul_chr_col_chr_col(LEN, tmp3, tmp1, tmp2);
map_mul_int_col_sht_col(LEN, tmp4, l_extprice, tmp3);

// Compiled equivalent of this expression:
c000(idx n, int* res, int* col1, chr* col2, chr* col3) {
  for(idx i = 0; i < n; i++)
    res[i] = col1[i] * ((100 - col2[i]) * (100 + col3[i]));
}
2. CASE STUDY: PROJECT
Inspired by the expressions in Q1 of TPC-H, we focus on the following simple Scan-Project query as micro-benchmark:
SELECT l_extprice*(1-l_discount)*(1+l_tax) FROM lineitem
The scanned columns are all decimals with precision two. VectorWise represents these internally as integers, using the value multiplied by 100 in this case. After scanning and decompression it chooses the smallest integer type that, given the actual value domain, can represent the numbers. The same happens for calculation expressions, where the destination type is chosen to be the minimal-width integer type, such that overflow is prevented. In the TPC-H case, the price column is a 4-byte integer and the other two are single-byte columns. The addition and subtraction again produce single bytes, their multiplication a 2-byte integer. The final multiplication multiplies a 4-byte with a 2-byte integer, creating a 4-byte result.
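As an illustration of this type derivation, the following sketch traces the scaled-integer arithmetic for one lineitem tuple. The concrete values are made up by us for illustration; only the widths follow the derivation above.

// Worked example of the scaled (x100) decimal representation and the
// minimal-width result types chosen for the TPC-H Q1 expression.
#include <cstdint>
#include <cassert>

int main() {
  int32_t l_extprice = 15000;     // 150.00 scaled by 100, needs 4 bytes
  int8_t  l_discount = 6;         // 0.06 scaled by 100, fits in 1 byte
  int8_t  l_tax      = 4;         // 0.04 scaled by 100, fits in 1 byte

  int8_t  a = 100 - l_discount;   // 94;  stays within a single byte
  int8_t  b = 100 + l_tax;        // 104; stays within a single byte
  int16_t c = int16_t(a) * b;     // 9776; at most < 100*200, fits 2 bytes
  int32_t r = l_extprice * int32_t(c);  // 4-byte * 2-byte -> 4-byte result
  // 150.00 * 0.94 * 1.04 = 146.64, scaled by 100*100*100 = 10^6:
  assert(r == 146640000);
  return 0;
}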
Vectorized Execution. VectorWise executes functions inside a Project as so-called map-primitives. Algorithm 1 shows the example code for a binary primitive. In this, chr, sht, int and lng represent internal types for 1-, 2-, 4- and 8-byte integers, and idx represents an internal type for representing sizes, indexes or offsets within columns of data (implemented as an integer of the required width). A val suffix in the primitive name indicates a constant (non-columnar) parameter. VectorWise pre-generates primitives for all needed combinations of operations, types and parameter modes. All functions to support SQL fit in ca. 9000 lines of macro-code, which expands into roughly 3000 different primitive functions, producing 200K LOC and a 5MB binary.
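The sketch below shows how such macro-code could expand into the pre-generated primitives of Algorithm 1. It is only our guess at the flavour of the mechanism; the macro names and helper typedefs are ours, not VectorWise's actual macro definitions.

// One macro definition expands into many typed primitives such as
// map_add_chr_val_chr_col() or map_mul_int_col_sht_col().
typedef unsigned long idx;
typedef signed char   chr;
typedef short         sht;

#define ADD(a, b) ((a) + (b))
#define SUB(a, b) ((a) - (b))
#define MUL(a, b) ((a) * (b))

#define DEFINE_MAP_VAL_COL(NAME, OP, TRES, T1, T2)                  \
  void map_##NAME##_##T1##_val_##T2##_col(idx n, TRES* res,         \
                                          T1* val1, T2* col2) {     \
    for (idx i = 0; i < n; i++)                                     \
      res[i] = OP(val1[0], col2[i]);                                \
  }

#define DEFINE_MAP_COL_COL(NAME, OP, TRES, T1, T2)                  \
  void map_##NAME##_##T1##_col_##T2##_col(idx n, TRES* res,         \
                                          T1* col1, T2* col2) {     \
    for (idx i = 0; i < n; i++)                                     \
      res[i] = OP(col1[i], col2[i]);                                \
  }

// Instantiations for the combinations used in Algorithm 1:
DEFINE_MAP_VAL_COL(add, ADD, chr, chr, chr)   // map_add_chr_val_chr_col
DEFINE_MAP_VAL_COL(sub, SUB, chr, chr, chr)   // map_sub_chr_val_chr_col
DEFINE_MAP_COL_COL(mul, MUL, sht, chr, chr)   // map_mul_chr_col_chr_col
DEFINE_MAP_COL_COL(mul, MUL, int, int, sht)   // map_mul_int_col_sht_col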
It is reasonable to expect that a compiler that claims support for SIMD should be able to vectorize the trivial loop in the map_ functions. On x86 systems, gcc (we used 4.5.1) usually does so and the Intel compiler icc never fails to.
[Figure 1: Project micro-benchmark: with and without {compilation, vectorization, SIMD}. X-axis: vector size (1 up to 4M tuples); Y-axis: cycles per tuple (log scale). Lines: Interpreted; Vectorized and Compiled, each with SIMD, without SIMD, and SIMD-sht; "Comp. per tuple" marks tuple-at-a-time compilation. The "SIMD-sht" lines work around the alignment sub-optimality in icc SIMD code generation.]
With a single SSE instruction, modern x86 systems can add and subtract 16 single-byte values, multiply 8 single-byte integers into a 2-byte result, or multiply four 4-byte integers. Thus, 16 tuples in this expression could be processed with 8 SIMD instructions: one 8-bit addition, one 8-bit subtraction, two 8-bit multiplications with 16-bit results, and four 32-bit multiplications. All of these instructions store one result, and the first two operations load one input (with the other parameter being a constant) while the other six load two inputs. With these 22 (2*2+6*3) load/stores, we roughly need 30 instructions; in reality some additional instructions for casts, padding and looping are required, such that the total for processing 16 tuples is around 60. In comparison, without SIMD we would need 4 instructions (2 loads, 1 calculation, 1 store) per calculation, such that a single tuple requires 16 instructions, >4 times more than with SIMD.
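Spelling the tally out (our reconstruction of the arithmetic in the previous paragraph):

\begin{align*}
\text{SIMD ops per 16 tuples} &= 1_{\mathrm{add}} + 1_{\mathrm{sub}} + 2_{\mathrm{mul,8\to16}} + 4_{\mathrm{mul,32}} = 8\\
\text{load/stores} &= 2\cdot(1+1) + 6\cdot(2+1) = 2\cdot 2 + 6\cdot 3 = 22\\
\text{core instructions} &\approx 22 + 8 = 30,\quad \text{with casts, padding, looping} \approx 60\ \text{per 16 tuples} \approx 3.75\ \text{per tuple}\\
\text{no-SIMD} &= 4\ \text{calculations} \times 4\ \text{instructions} = 16\ \text{per tuple}\;(>4\times\ \text{more})
\end{align*}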
The vectorized "SIMD" and "no-SIMD" lines in Figure 1 show an experiment in which expression results are calculated, using different vector-sizes. We used a 2.67GHz Nehalem core, on a 64-bit Linux machine with 12GB of RAM. The no-SIMD vectorized code, produced by explicitly disabling SIMD generation in the compiler (icc 11.0⁴, here), is indeed 4 times slower than SIMD. The general trend of decreasing interpretation overhead with increasing vector-size until around one thousand, and performance deteriorating due to cache misses if vectors start exceeding the L1 and L2 caches, has been described already in detail in [13, 1].
⁴Compiler options are -O3 for gcc, supplemented for icc with -xSSE4.2 -mp1 -unroll.

Compiled Execution. The lower part of Algorithm 1 shows the compiled code that a modified version of VectorWise can now generate on-the-fly: it combines vectorization with compilation. Such a combination in itself is not new ("compound primitives" [1]), and the end result is similar to what a holistic query compiler like HIQUE [7] generates for this Scan-Project plan, though it would also add Scan code. However, if we assume a HIQUE with a simple main-memory storage system and take l_tax, etc. to be pointers into a column-wise stored table, then c000() would be the exact product of a "loop-compilation" strategy.
The main benefit of the compiled approach is the absence of load/stores. The vectorized approach needs 22 load/stores, but only the bottom three loads and the top-level store are needed by the compiled strategy. Comparing vectorized with compiled, we are surprised to see that the vectorized version is significantly faster (compiled: 4 vs. vectorized: 2 cycles/tuple). Close investigation of the generated code revealed that icc chooses in its SIMD generation to align all calculations on the widest unit (here: 4-byte integers). Hence, the opportunities for 1-byte and 2-byte SIMD operations are lost. Arguably, this is a compiler bug or sub-optimality.
In order to show what compilation could achieve, we re-tried the same, now assuming that l_extprice would fit into a 2-byte integer; these are the "SIMD-sht" lines in Figure 1. Here, we see compilation beating vectorized execution, as one would normally expect in Project tasks. A final observation is that compiled map-primitives are less sensitive to cache size (only to L2, not L1), such that a hybrid vectorized/compiled engine can use large vector-sizes.
Tuple-at-a-time compilation. The black star and diamond in Figure 1 correspond to situations where primitive functions work tuple-at-a-time. The non-compiled strategy is called "interpreted" here. An engine like MySQL, whose whole iterator interface is tuple-at-a-time, can only use such functions, as it has just one tuple to operate on at any moment in time. Tuple-at-a-time primitives are conceptually very close to the functions in Algorithm 1 with vector-size n=1, but lack the for-loop. We implemented them separately for fairness, because these for-loops introduce loop-overhead. This experiment shows that if one would contemplate introducing compilation in an engine like MySQL without breaking its tuple-at-a-time operator API, the gain in expression calculation performance could be a factor 3 (23 vs 7 cycles/tuple). The absolute performance is clearly below what block-wise query processing offers (7 vs 1.2 cycles/tuple), mainly due to missed SIMD opportunities, but also because the virtual method call for every tuple inhibits speculative execution across tuples in the CPU. Worse, in tuple-at-a-time query evaluation, function primitives in OLAP queries only make up a small percentage (<5%) of overall time [1], because most effort goes into the tuple-at-a-time operator APIs. The overall gain of using compilation without changing the tuple-at-a-time nature of a query engine can therefore at most be a few percent, making such an endeavour questionable.
3. CASE STUDY: SELECT
We now turn our attention to a micro-benchmark that tests conjunctive selections:
WHERE col1 < v1 AND col2 < v2 AND col3 < v3
Selection primitives, shown in Algorithm 2, create vectors of indexes for which the condition evaluates to true, called selection vectors.
Algorithm 2: Implementations of < selection primitives. All algorithms return the number of selected items (return j). For mid selectivities, branching instructions lead to branch mispredictions. In a vectorized implementation such branching can be avoided. VectorWise dynamically selects the best method depending on the observed selectivity, but in the micro-benchmark we show the results for both methods.
// Two vectorized implementations
// (1.) medium selectivity: non-branching code
idx sel_lt_T_col_T_val(idx n, T* res, T* col1, T* val2, idx* sel) {
  idx i, j = 0;
  if (sel == NULL) {
    for(i = 0; i < n; i++) {
      res[j] = i; j += (col1[i] < *val2);
    }
  } else {
    for(i = 0; i < n; i++) {
      res[j] = sel[i]; j += (col1[sel[i]] < *val2);
    }
  }
  return j;
}
// (2.) else: branching selection
idx sel_lt_T_col_T_val(idx n, T* res, T* col1, T* val2, idx* sel) {
  idx i, j = 0;
  if (sel == NULL) {
    for(i = 0; i < n; i++)
      if (col1[i] < *val2) res[j++] = i;
  } else {
    for(i = 0; i < n; i++)
      if (col1[sel[i]] < *val2) res[j++] = sel[i];
  }
  return j;
}
// Vectorized conjunction implementation:
const idx LEN = 1024;
idx sel1[LEN], sel2[LEN], res[LEN], ret1, ret2, ret3;
ret1 = sel_lt_T_col_T_val(LEN,  sel1, col1, &v1, NULL);
ret2 = sel_lt_T_col_T_val(ret1, sel2, col2, &v2, sel1);
ret3 = sel_lt_T_col_T_val(ret2, res,  col3, &v3, sel2);
Selection primitives can also take a selection vector as parameter, to evaluate the condition only on elements of the vectors at the positions pointed to by the selection vector⁵. A vectorized conjunction is implemented by chaining selection primitives, with the output selection vector of the previous one being the input selection vector of the next one, working on a tightening subset of the original vectors, evaluating this conjunction lazily only on those elements for which the previous conditions passed.
Each condition may be evaluated with one of two implementations of the selection primitive. The naive "branching" implementation of selection evaluates conditions lazily and branches out if any of the predicates fails. If the selectivity of the conditions is neither very low nor very high, CPU branch predictors are unable to correctly guess the branch outcome. This prevents the CPU from filling its pipelines with useful future code and hinders performance. In [11] it was shown that a branch (control-dependency) in the selection code can be transformed into a data dependency for better performance. The sel_lt functions in Algorithm 2 contain both approaches. The VectorWise implementation of selections uses a mechanism that chooses either the branching or non-branching strategy depending on the observed selectivity⁶.
⁵In fact, other primitives are also able to work with selection vectors, but this was omitted from the code snippets where not necessary for the discussed experiments.
Algorithm 3: Four compiled implementations of a conjunctive selection. Branching cannot be avoided in loop-compilation, which combines selection with other operations, without executing these operations eagerly. The four implementations balance between branching and eager computation.
// (1.) all predicates branching ("lazy")
idx c0001(idx n, T* res, T* col1, T* col2, T* col3,
          T* v1, T* v2, T* v3) {
  idx i, j = 0;
  for(i = 0; i < n; i++)
    if (col1[i] < *v1 && col2[i] < *v2 && col3[i] < *v3)
      res[j++] = i;
  return j;  // return number of selected items
}
// (2.) branching 1,2, non-br. 3
idx c0002(idx n, T* res, T* col1, T* col2, T* col3,
          T* v1, T* v2, T* v3) {
  idx i, j = 0;
  for(i = 0; i < n; i++)
    if (col1[i] < *v1 && col2[i] < *v2) {
      res[j] = i; j += col3[i] < *v3;
    }
  return j;
}
// (3.) branching 1, non-br. 2,3
idx c0003(idx n, T* res, T* col1, T* col2, T* col3,
          T* v1, T* v2, T* v3) {
  idx i, j = 0;
  for(i = 0; i < n; i++)
    if (col1[i] < *v1) {
      res[j] = i; j += (col2[i] < *v2) & (col3[i] < *v3);
    }
  return j;
}
// (4.) non-branching 1,2,3 ("compute-all")
idx c0004(idx n, T* res, T* col1, T* col2, T* col3,
          T* v1, T* v2, T* v3) {
  idx i, j = 0;
  for(i = 0; i < n; i++) {
    res[j] = i;
    j += (col1[i] < *v1) & (col2[i] < *v2) & (col3[i] < *v3);
  }
  return j;
}
As such, its performance achieves the minimum of the vectorized branching and non-branching lines in Figure 2.
In this experiment, each of the columns col1, col2, col3 is an integer column, and the values v1, v2 and v3 are constants, adjusted to control the selectivity of each condition. Here, we keep the selectivity of each branch equal, hence equal to the cube root of the overall selectivity, which we vary from 0 to 1. We performed the experiment on 1K input tuples.
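For example, assuming uniformly distributed column values (our assumption for illustration, not a detail given in the paper), the constants can be derived from a target overall selectivity as in the following sketch:

// Per-condition bound for a target overall selectivity s: with three
// independent, equally selective conditions, p^3 = s, so p = s^(1/3).
#include <cmath>
#include <cstdint>

int32_t bound_for(double overall_sel, int32_t max_value) {
  double per_condition = std::cbrt(overall_sel);       // p = s^(1/3)
  return static_cast<int32_t>(per_condition * max_value);
}
// e.g. overall_sel = 0.125 gives p = 0.5, so each v_i is about max_value/2.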
Figure 2 shows that compilation of a conjunctive Select is inferior to the pure vectorized approach. The lazy compiled program does slightly outperform vectorized branching, but for medium selectivities branching is by far not the best option. The gist of the problem is that the trick of (i) converting all control de