
BlockQuicksort: How Branch Mispredictions don't affect Quicksort

Stefan Edelkamp¹ and Armin Weiß²

1 TZI, Universität Bremen, Am Fallturm 1, D-28239 Bremen, Germany
2 Stevens Institute of Technology, 1 Castle Point Terrace, Hoboken, NJ 07030, USA

Abstract

Since the work of Kaligosi and Sanders (2006), it is well-known that Quicksort – which is commonly considered as one of the fastest in-place sorting algorithms – suffers in an essential way from branch mispredictions. We present a novel approach to address this problem by partially decoupling control from data flow: in order to perform the partitioning, we split the input in blocks of constant size (we propose 128 data elements); then, all elements in one block are compared with the pivot and the outcomes of the comparisons are stored in a buffer. In a second pass, the respective elements are rearranged. By doing so, we avoid conditional branches based on outcomes of comparisons at all (except for the final Insertionsort). Moreover, we prove that for a static branch predictor the average total number of branch mispredictions is at most εn log n + O(n) for some small ε depending on the block size when sorting n elements.

Our experimental results are promising: when sorting random integer data, we achieve an increase in speed (number of elements sorted per second) of more than 80% over the GCC implementation of C++ std::sort. Also for many other types of data and non-random inputs, there is still a significant speedup over std::sort. Only in few special cases like sorted or almost sorted inputs can std::sort beat our implementation. Moreover, on random input permutations, our implementation is even slightly faster than an implementation of the highly tuned Super Scalar Sample Sort, which uses a linear amount of additional space.

1998 ACM Subject Classification F.2.2 Nonnumerical Algorithms and Problems

Keywords and phrases in-place sorting, Quicksort, branch mispredictions, lean programs

1 Introduction

Sorting a sequence of elements of some totally ordered universe remains one of the most fascinating and well-studied topics in computer science. Moreover, it is an essential part of many practical applications. Thus, efficient sorting algorithms directly transfer to a performance gain for many applications. One of the most widely used sorting algorithms is Quicksort, which was introduced by Hoare in 1962 [14] and is considered to be one of the most efficient sorting algorithms. For sorting an array, it works as follows: first, it chooses an arbitrary pivot element and then rearranges the array such that all elements smaller than the pivot are moved to the left side and all elements larger than the pivot are moved to the right side of the array – this is called partitioning. Then, the left and right sides are both sorted recursively. Although its average¹ number of comparisons is not optimal – 1.38 n log n + O(n) vs. n log n + O(n) for Mergesort – its overall instruction count is very low. Moreover, by choosing the pivot element as median of some larger sample, the leading term 1.38 n log n for the average number of comparisons can be made smaller – even down to n log n when choosing the pivot as median of some sample of growing size [21].

¹ Here and in the following, the average case refers to a uniform distribution of all input permutations assuming all elements are different.

arXiv:1604.06697v2 [cs.DS] 23 Jun 2016

Other advantages of

Quicksort are that it is easy to implement and that it does not need extra memory except the recursion stack of logarithmic size (even in the worst case, if properly implemented). A major drawback of Quicksort is its quadratic worst-case running time. Nevertheless, there are efficient ways to circumvent a really bad worst case. The most prominent is Introsort (introduced by Musser [22]), which is applied in the GCC implementation of std::sort: as soon as the recursion depth exceeds a certain limit, the algorithm switches to Heapsort.

Another deficiency of Quicksort is that it suffers from branch mispredictions (or branch misses) in an essential way. On modern processors with long pipelines (14 stages for Intel Haswell, Broadwell, and Skylake processors – for the older Pentium 4 processors even more than twice as many), every branch misprediction causes a rather long interruption of the execution since the pipeline has to be filled anew. In [15], Kaligosi and Sanders analyzed the number of branch mispredictions incurred by Quicksort. They examined different simple branch prediction schemes (static prediction and 1-bit, 2-bit predictors) and showed that with all of them, Quicksort with a random element as pivot causes on average cn log n + O(n) branch mispredictions for some constant c = 0.34 (resp. c = 0.46, c = 0.43). In particular, in Quicksort with a random pivot element, every fourth comparison is followed by a mispredicted branch. The reason is that for partitioning, each element is compared with the pivot and, depending on the outcome, either it is swapped with some other element or not. Since for an optimal pivot (the median), the probability of being smaller than the pivot is 50%, there is no way to predict these branches.

Kaligosi and Sanders also established that choosing skewed pivot elements (far off the median) might even decrease the running time because it makes branches more predictable. This also explains why, although theoretically larger samples for pivot selection were shown to be superior, in practice the median-of-three variant turned out to be the best. In [6], the skewed pivot phenomenon is confirmed experimentally. Moreover, in [20], precise theoretical bounds on the number of branch misses for Quicksort are given – establishing also the theoretical superiority of skewed pivots under the assumption that branch mispredictions are expensive.

In [8], Brodal and Moruz proved a general lower bound on the number of branch mispredictions given that every comparison is followed by a conditional branch which depends on the outcome of the comparison. In this case there are Ω(n log_d n) branch mispredictions for a sorting algorithm which performs O(dn log n) comparisons. As Elmasry and Katajainen remarked in [10], this theorem does not hold anymore if the results of comparisons are not used for conditional branches. Indeed, they showed that every program can be transformed into a program which induces only a constant number of branch misses and whose running time is linear in the running time of the original program. However, this general transformation introduces a huge constant-factor overhead. Still, in [10] and [11], Elmasry, Katajainen and Stenmark showed how to efficiently implement many algorithms related to sorting with only few branch mispredictions. They call such programs lean. In particular, they present variants of Mergesort and Quicksort suffering only very little from branch misses. Their Quicksort variant (called Tuned Quicksort; for details on the implementation, see [16]) is very fast for random permutations – however, it does not behave well with duplicate elements because it applies Lomuto's uni-directional partitioner (see e.g. [9]).

Another development in recent years is multi-pivot Quicksort (i.e. several pivots in each partitioning stage [4, 5, 18, 27, 28]). It started with the introduction of Yaroslavskiy's dual-pivot Quicksort [30] – which, surprisingly, was faster than known Quicksort variants and, thus, became the standard sorting implementation in Oracle Java 7 and Java 8. Concerning branch mispredictions, all these multi-pivot variants behave essentially like ordinary Quicksort [20]; however, they have one advantage: every data element is accessed only a few times (this is also referred to as the number of scans). As outlined in [5], increasing the number of pivot elements further (up to 127 or 255) leads to Super Scalar Sample Sort, which has been introduced by Sanders and Winkel [24]. Super Scalar Sample Sort not only has the advantage of few scans, but is also based on the idea of avoiding conditional branches. Indeed, the correct bucket (the position between two pivot elements) can be found by converting the results of comparisons to integers and then simply performing integer arithmetic. In their experiments, Sanders and Winkel show that Super Scalar Sample Sort is approximately twice as fast as Quicksort (std::sort) when sorting random integer data. However, Super Scalar Sample Sort has one major drawback: it uses a linear amount of extra space (for sorting n data elements, it requires space for another n data elements and additionally for more than n integers). In the conclusion of [15], Kaligosi and Sanders raised the question:

However, an in-place sorting algorithm that is better than Quicksort with skewed pivots is an open problem.

(Here, in-place means that it needs only a constant or logarithmic amount of extra space.)

In this work, we solve the problem by presenting our block partition algorithm, which allows us to implement Quicksort without any branch mispredictions incurred by conditional branches based on results of comparisons (except for the final Insertionsort – also, there are still conditional branches based on the control flow, but their number is relatively small). We call the resulting algorithm BlockQuicksort. Our work is inspired by Tuned Quicksort from [11], from where we also borrow parts of our implementation. The difference is that by doing the partitioning block-wise, we can use Hoare's partitioner, which is far better with duplicate elements than Lomuto's partitioner (although Tuned Quicksort can be made to work with duplicates by applying a check for duplicates similar to what we propose for BlockQuicksort as one of the further improvements in Section 3.2). Moreover, BlockQuicksort is also superior to Tuned Quicksort for random permutations of integers.

Our Contributions

We present a variant of the partition procedure that only incurs few branch mispredictions by storing results of comparisons in constant-size buffers.

We prove an upper bound of εn log n + O(n) branch mispredictions on average, where ε < 1/16 for our proposed block size (Theorem 1).

We propose some improvements over the basic version.

We implemented our algorithm with an STL-style interface².

We conduct experiments and compare BlockQuicksort with std::sort, Yaroslavskiy's dual-pivot Quicksort and Super Scalar Sample Sort – on random integer data it is faster than all of these, and also faster than Katajainen et al.'s Tuned Quicksort.

Outline

Section 2 introduces some general facts on branch predictors and mispredictions, and gives a short account of standard improvements of Quicksort. In Section 3, we give a precise description of our block partition method and establish our main theoretical result – the bound on the number of branch mispredictions. Finally, in Section 4, we experimentally evaluate different block sizes and different pivot selection strategies, and compare our implementation with other state-of-the-art implementations of Quicksort and Super Scalar Sample Sort.

² Code available at https://github.com/weissan/BlockQuicksort

2 Preliminaries

Logarithms denoted by log are always base 2. The term average case refers to a uniform distribution of all input permutations assuming all elements are different. In the following, std::sort always refers to its GCC implementation.

Branch Misses

Branch mispredictions can occur when the code contains conditional jumps (i.e. if statements, loops, etc.). Whenever the execution flow reaches such a statement, the processor has to decide in advance which branch to follow and decode the subsequent instructions of that branch. Because of the length of the pipeline of modern microprocessors, a wrongly predicted branch causes a large delay since, before continuing the execution, the instructions for the other branch have to be decoded.

Branch Prediction Schemes

Precise branch prediction schemes of most modern processors are not disclosed to the public. However, the simplest schemes suffice to make BlockQuicksort induce only few mispredictions.

The easiest branch prediction scheme is the static predictor: for every conditional jump, the compiler marks the more likely branch. In particular, that means that for every if statement, we can assume that there is a misprediction if and only if the if branch is not taken; for every loop statement, there is precisely one misprediction for every time the execution flow reaches that loop: when the execution leaves the loop. For more information about branch prediction schemes, we refer to [13, Section 3.3].

How to avoid Conditional Branches

The usual implementation of sorting algorithms performs conditional jumps based on the outcome of comparisons of data elements. There are at least two methods by which these conditional jumps can be avoided – both are supported by the hardware of modern processors:

Conditional moves (CMOVcc instructions on x86 processors) – or, more generally, conditional execution. In C++, compilation to a conditional move can (often) be triggered by

i = (x < y) ? j : i;

Casting Boolean variables to integer (SETcc instructions on x86 processors). In C++:

int i = (x < y);

Also many other instruction sets support these methods (e.g. ARM [1], MIPS [23]). Still, the Intel Architecture Optimization Reference Manual [2] advises to use these instructions only for avoiding unpredictable branches (as is the case for sorting), since correctly predicted branches are still faster. For more examples of how to apply these methods to sorting, see [11].
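To make the two techniques concrete, here is a small self-contained sketch (the function names are ours, not from the paper): the ternary assignment is the conditional-move pattern, and the Boolean-to-integer conversion is the SETcc pattern, so neither requires a conditional jump based on the comparison. Whether a conditional move is actually emitted depends on the compiler and optimization level.

```cpp
#include <cstddef>

// Conditional assignment: with optimization enabled, compilers typically
// emit a CMOVcc instruction here instead of a conditional jump.
int conditional_select(int x, int y, int i, int j) {
    i = (x < y) ? j : i;
    return i;
}

// Bool-to-int (SETcc): the comparison result is used as a 0/1 integer,
// here to count elements smaller than the pivot without branching.
std::size_t count_less(const int* a, std::size_t n, int pivot) {
    std::size_t count = 0;
    for (std::size_t k = 0; k < n; ++k)
        count += static_cast<std::size_t>(a[k] < pivot);
    return count;
}
```

The second pattern is exactly what the scanning phase of the block partitioner relies on: the comparison outcome is consumed arithmetically rather than by a branch.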

Quicksort and improvements

The central part of Quicksort is the partitioning procedure. Given some pivot element, it returns a pointer p to an element in the array and rearranges the array such that all elements left of p are smaller than or equal to the pivot and all elements on the right are greater than or equal to the pivot. Quicksort first chooses some pivot element, then performs the partitioning, and, finally, recurses on the elements smaller and larger than the pivot –


Algorithm 1 Quicksort
1: procedure Quicksort(A[ℓ, . . . , r])
2:     if r > ℓ then
3:         pivot ← choosePivot(A[ℓ, . . . , r])
4:         cut ← partition(A[ℓ, . . . , r], pivot)
5:         Quicksort(A[ℓ, . . . , cut − 1])
6:         Quicksort(A[cut, . . . , r])
7:     end if
8: end procedure

see Algorithm 1. We call the procedure which organizes the calls to the partitioner the Quicksort main loop.

There are many standard improvements for Quicksort. For our optimized Quicksort main loop (which is a modified version of Tuned Quicksort [11, 16]), we implemented the following:

A very basic optimization due to Sedgewick [26] avoids recursion partially (e.g. std::sort) or totally (here – this requires the introduction of an explicit stack).

Introsort [22]: there is an additional counter for the number of recursion levels. As soon as it exceeds some bound (std::sort uses 2 log n – we use 2 log n + 3), the algorithm stops Quicksort and switches to Heapsort [12, 29] (only for the respective sub-array). By doing so, a worst-case running time of O(n log n) is guaranteed.

Sedgewick [26] also proposed to switch to Insertionsort (see e.g. [17, Section 5.2.1]) as soon as the array size is less than some fixed small constant (16 for std::sort and our implementation). There are two possibilities for when to apply Insertionsort: either during the recursion, when the array size becomes too small, or at the very end after Quicksort has finished. We implemented the first possibility (in contrast to std::sort) because for sorting integers it hardly made a difference, but for larger data elements there was a slight speedup (in [19] this was proposed as memory-tuned Quicksort).

After partitioning, the pivot is moved to its correct position and not included in the recursive calls (not applied in std::sort).

The basic version of Quicksort uses a random or fixed element as pivot. A slight improvement is to choose the pivot as median of three elements – typically the first, the one in the middle, and the last. This is applied in std::sort and many other Quicksort implementations. Sedgewick [26] already remarked that choosing the pivot from an even larger sample does not provide a significant increase of speed. In view of the experiments with skewed pivots [15], this is no surprise. For BlockQuicksort, a pivot closer to the median turns out to be beneficial (Figure 2 in Section 4). Thus, it makes sense to invest more time to find a better pivot element. In [21], Martinez and Roura show that the number of comparisons incurred by Quicksort is minimal if the pivot element is selected as median of Θ(√n) elements. Another variant is to choose the pivot as median of three (resp. five) elements which themselves are medians of three (resp. five) elements. We implemented all these variants for our experiments – see Section 4.
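Pivot selection itself can also be made branch-free. As a minimal sketch (our own illustration, not the authors' Appendix B code), the median of three values can be expressed purely with std::min and std::max, which mainstream compilers typically compile to conditional moves for scalar types:

```cpp
#include <algorithm>

// Median of three without conditional jumps: median(a, b, c) equals
// max(min(a, b), min(max(a, b), c)). On ints, std::min/std::max are
// usually compiled to CMOV instructions rather than branches.
int median_of_three(int a, int b, int c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}
```

The identity holds because min(a, b) is the candidate from below, and clamping c between the two order statistics of a and b selects the middle value.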

Our main contribution is the block partitioner, which we describe in the next section.
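A minimal sketch of how the listed improvements fit together in an iterative main loop may look as follows. This is our own illustration, not the authors' implementation: a plain Lomuto partitioner with a middle-element pivot stands in for the block partitioner, and all names are ours.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Insertionsort for small ranges [lo, hi).
void insertion_sort(int* A, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i < hi; ++i) {
        int v = A[i];
        std::size_t j = i;
        while (j > lo && A[j - 1] > v) { A[j] = A[j - 1]; --j; }
        A[j] = v;
    }
}

// A plain Lomuto partitioner stands in for the block partitioner here;
// it returns the final position of the pivot (the middle element).
std::size_t partition_stub(int* A, std::size_t lo, std::size_t hi) {
    std::swap(A[lo + (hi - lo) / 2], A[hi - 1]);
    int pivot = A[hi - 1];
    std::size_t p = lo;
    for (std::size_t i = lo; i + 1 < hi; ++i)
        if (A[i] < pivot) std::swap(A[i], A[p++]);
    std::swap(A[p], A[hi - 1]);
    return p;
}

// Iterative Quicksort main loop: explicit stack instead of recursion,
// depth limit 2 log n + 3 with a Heapsort fallback (Introsort),
// Insertionsort below 16 elements, and the pivot excluded from the
// sub-ranges pushed for further processing.
void quicksort_loop(int* A, std::size_t n) {
    struct Task { std::size_t lo, hi; int depth; };
    const int limit =
        2 * static_cast<int>(std::log2(static_cast<double>(n > 0 ? n : 1))) + 3;
    std::vector<Task> stack;
    stack.push_back({0, n, 0});
    while (!stack.empty()) {
        Task t = stack.back();
        stack.pop_back();
        if (t.hi - t.lo < 16) { insertion_sort(A, t.lo, t.hi); continue; }
        if (t.depth > limit) {                    // worst-case stopper
            std::make_heap(A + t.lo, A + t.hi);
            std::sort_heap(A + t.lo, A + t.hi);
            continue;
        }
        std::size_t cut = partition_stub(A, t.lo, t.hi);
        stack.push_back({t.lo, cut, t.depth + 1});      // pivot at cut is final
        stack.push_back({cut + 1, t.hi, t.depth + 1});
    }
}
```

Replacing partition_stub by the block partitioner of Section 3 yields the overall structure of BlockQuicksort.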

3 Block Partitioning

The idea of block partitioning is quite simple. Recall how Hoare's original partition procedure works (Algorithm 2): two pointers start at the leftmost and rightmost elements of the array and move towards the middle. In every step, the current element is compared to the pivot


Algorithm 2 Hoare's Partitioning
1: procedure Partition(A[ℓ, . . . , r], pivot)
2:     while ℓ < r do
3:         while A[ℓ] < pivot do ℓ++ end while
4:         while A[r] > pivot do r−− end while
5:         if ℓ < r then swap(A[ℓ], A[r]); ℓ++; r−− end if
6:     end while
7:     return ℓ
8: end procedure

(Lines 3 and 4). If some element on the right side is less than or equal to the pivot (resp. some element on the left side is greater than or equal to it), the respective pointer stops and the two elements found this way are swapped (Line 5). Then the pointers continue moving towards the middle.

The idea of BlockQuicksort (Algorithm 3) is to separate Lines 3 and 4 of Algorithm 2 from Line 5: fix some block size B; we introduce two buffers offsetsL[0, . . . , B − 1] and offsetsR[0, . . . , B − 1] for storing pointers to elements (offsetsL will store pointers to elements on the left side of the array which are greater than or equal to the pivot element – likewise offsetsR for the right side). The main loop of Algorithm 3 consists of two stages: the scanning phase (Lines 5 to 18) and the rearrangement phase (Lines 19 to 26).

Algorithm 3 Block partitioning
1: procedure BlockPartition(A[ℓ, . . . , r], pivot)
2:     integer offsetsL[0, . . . , B − 1], offsetsR[0, . . . , B − 1]
3:     integer startL, startR, numL, numR ← 0
4:     while r − ℓ + 1 > 2B do                                ▷ start main loop
5:         if numL = 0 then                                   ▷ if left buffer is empty, refill it
6:             startL ← 0
7:             for i = 0, . . . , B − 1 do
8:                 offsetsL[numL] ← i
9:                 numL += (pivot ≤ A[ℓ + i])                 ▷ scanning phase for left side
10:            end for
11:        end if
12:        if numR = 0 then                                   ▷ if right buffer is empty, refill it
13:            startR ← 0
14:            for i = 0, . . . , B − 1 do
15:                offsetsR[numR] ← i
16:                numR += (pivot ≥ A[r − i])                 ▷ scanning phase for right side
17:            end for
18:        end if
19:        integer num ← min(numL, numR)
20:        for j = 0, . . . , num − 1 do
21:            swap(A[ℓ + offsetsL[startL + j]], A[r − offsetsR[startR + j]])  ▷ rearrangement phase
22:        end for
23:        numL, numR −= num; startL, startR += num
24:        if numL = 0 then ℓ += B end if
25:        if numR = 0 then r −= B end if
26:    end while                                              ▷ end main loop
27:    compare and rearrange remaining elements
28: end procedure

As for the classical Hoare partitioner, we also start with two pointers (or indices, as in the pseudocode) to the leftmost and rightmost elements of the array. First, the scanning phase takes place: the buffers which are empty are refilled. In order to do so, we move the respective pointer towards the middle and compare each element with the pivot. However, instead of stopping at the first element which should be swapped, only a pointer to the element is stored in the respective buffer (Lines 8 and 9, resp. 15 and 16 – actually, the pointer is always stored, but depending on the outcome of the comparison, a counter holding the number of pointers in the buffer is increased or not) and the pointer continues moving towards the middle. After an entire block of B elements has been scanned (either on both sides of the array or only on one side), the rearrangement phase begins: it starts with the first positions of the two buffers and swaps the data elements they point to (Line 21); then it continues until one of the buffers contains no more pointers to elements which should be swapped. Now the scanning phase is restarted and the buffer that has run empty is filled again.

The algorithm continues this way until fewer elements than two times the block size remain. Now, the simplest variant is to switch to the usual Hoare partitioning method for the remaining elements (in the experiments, this is the variant with suffix Hoare finish). But we can also continue with the idea of block partitioning: the algorithm scans the remaining elements as one or two final blocks (of smaller size) and performs a last rearrangement phase. After that, some elements to swap might still remain in one of the two buffers, while the other buffer is empty. With one run through the buffer, all these elements can be moved to the left resp. right (similar to what is done in the Lomuto partitioning method, but without performing actual comparisons). We do not present the details of this final rearrangement here because, on the one hand, it gets a little tedious, and on the other hand, it neither provides much insight into the algorithm nor is it necessary to prove our result on branch mispredictions. The C++ code of this basic variant can be found in Appendix B.
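To make the two phases concrete, the following compact C++ sketch shows the structure of the block partitioner. It is our simplified illustration, not the authors' tuned implementation from Appendix B: it is specialized to int, assumes n ≥ 1, and hands the tail of fewer than 2B elements to std::partition instead of scanning one or two smaller final blocks as described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Simplified block partitioner (illustrative sketch): scan blocks of size B
// from both ends, recording offsets of misplaced elements branch-free, then
// swap the recorded pairs in the rearrangement phase.
static const std::size_t B = 128;

std::size_t block_partition(int* A, std::size_t n, int pivot) {
    std::size_t offsetsL[B], offsetsR[B];
    std::size_t startL = 0, startR = 0, numL = 0, numR = 0;
    std::size_t l = 0, r = n - 1;              // assumes n >= 1

    while (r - l + 1 > 2 * B) {                // main loop
        if (numL == 0) {                       // refill left buffer
            startL = 0;
            for (std::size_t i = 0; i < B; ++i) {
                offsetsL[numL] = i;
                numL += (pivot <= A[l + i]);   // branch-free: flag elements >= pivot
            }
        }
        if (numR == 0) {                       // refill right buffer
            startR = 0;
            for (std::size_t i = 0; i < B; ++i) {
                offsetsR[numR] = i;
                numR += (pivot >= A[r - i]);   // flag elements <= pivot
            }
        }
        std::size_t num = std::min(numL, numR);
        for (std::size_t j = 0; j < num; ++j)  // rearrangement phase
            std::swap(A[l + offsetsL[startL + j]],
                      A[r - offsetsR[startR + j]]);
        numL -= num; numR -= num;
        startL += num; startR += num;
        if (numL == 0) l += B;                 // left block fully processed
        if (numR == 0) r -= B;                 // right block fully processed
    }
    // Tail: at most 2B unprocessed elements remain within [l, r];
    // std::partition keeps the sketch short here.
    int* mid = std::partition(A + l, A + r + 1,
                              [pivot](int x) { return x < pivot; });
    return static_cast<std::size_t>(mid - A);
}
```

The returned index satisfies the Hoare-style invariant: every element before it is at most the pivot, and every element from it onwards is at least the pivot.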

3.1 Analysis

If the input consists of random permutations (all data elements different), the average numbers of comparisons and swaps are the same as for usual Quicksort with median-of-three. This is because both Hoare's partitioner and the block partitioner preserve randomness of the array.

The number of scanned elements (total number of elements loaded to the registers) is increased by two times the number of swaps, because for every swap the data elements have to be loaded again. However, the idea is that due to the small block size, the data elements still remain in L1 cache when being swapped – so the additional scan has no negative effect on the running time. In Section 4, we see that for larger data types and beyond a certain threshold, an increasing block size has a negative effect on the running time. Therefore, the block size should not be chosen too large – we propose B = 128 and fix this constant throughout (thus, already for inputs of moderate size, the buffers do not require much more space than the stack for Quicksort).

Branch mispredictions

The next theorem is our main theoretical result. For simplicity, we assume here that BlockQuicksort is implemented without the worst-case stopper Heapsort (i.e. there is no limit on the recursion depth). Since there is only a low probability that a high recursion depth is reached while the array is still large, this assumption is not a real restriction. We analyze a static branch predictor: there is a misprediction every time a loop is left and a misprediction every time the if branch of an if statement is not taken.


▶ Theorem 1. Let C be the average number of comparisons of Quicksort with a constant-size pivot sample. Then BlockQuicksort (without limit on the recursion depth and with the same pivot selection method) with block size B induces at most (6/B) · C + O(n) branch mispredictions on average. In particular, BlockQuicksort with median-of-three induces less than (8/B) n log n + O(n) branch mispredictions on average.

Theorem 1 shows that when choosing the block size sufficiently large, the n log n term becomes very small and – for real-world inputs – we can basically assume a linear number of branch mispredictions. Moreover, Theorem 1 can be generalized to samples of non-constant size for pivot selection. Since the proof might become tedious, we stick to the basic variant here. The constant 6 in Theorem 1 can be replaced by 4 when implementing Lines 19, 24, and 25 of Algorithm 3 with conditional moves.

Proof. First, we show that every execution of the block partitioner (Algorithm 3) on an array of length n induces at most (6/B) n + c branch mispredictions for some constant c. In order to do so, we only need to look at the main loop (Lines 4 to 26) of Algorithm 3 because the final scanning and rearrangement phases consider only a constant (at most 2B) number of elements. Inside the main loop there are three for loops (starting at Lines 7, 14, 20), four if statements (starting at Lines 5, 12, 24, 25) and the min calculation (whose straightforward implementation is an if statement – Line 19). We know that in every execution of the main loop at least one of the conditions of the if statements in Lines 5 and 12 is true because in every rearrangement phase at least one buffer runs empty. The same holds for the two if statements in Lines 24 and 25. Therefore, we obtain at most two branch mispredictions for the ifs, three for the for loops and one for the min in every execution of the main loop. In every execution of the main loop, there are at least B comparisons of elements with the pivot. Thus, the number of branch misses in the main loop is at most 6/B times the number of comparisons. Hence, for every input permutation the total number of branch mispredictions of BlockQuicksort is at most

(6/B) · #comparisons + (c + c′) · #calls to partition + O(n),

where c′ is the number of branch mispredictions of one execution of the main loop of Quicksort (including pivot selection, which only needs a constant number of instructions) and the O(n) term comes from the final Insertionsort. The number of calls to partition is bounded by n because each element can be chosen as pivot only once (since the pivots are not contained in the arrays for the recursive calls). Thus, by taking the average over all input permutations, the first statement follows.

The second statement follows because Quicksort with median-of-three incurs 1.18 n log n + O(n) comparisons on average [25]. ◀

▶ Remark. The O(n) term in Theorem 1 can be bounded by 3n by taking a closer look at the final rearrangement phase.

We give a rough heuristic estimate: it is safe to assume that the average length of arrays on which Insertionsort is called is at least 8 (recall that we switch to Insertionsort as soon as the array size is less than 17). For Insertionsort, there is one branch miss for each element (when exiting the loop for finding the position) plus one for each call of Insertionsort (when exiting the loop over all elements to insert). Furthermore, there are at most two branch misses in the main Quicksort loop (Lines 177 and 196 in Appendix B) for every call to Insertionsort. Hence, we have approximately (11/8) n branch misses due to Insertionsort.

It remains to count the constant number of branch mispredictions incurred during every call of the partitioner: after exiting the main loop of block partitioning, there is one more scanning and rearrangement phase with a smaller block size. This leads to at most 7 branch mispredictions (one extra because there is the additional case that both buffers are empty).

The final rearranging incurs at most three branch misses (Lines 118, 136, 140). Selecting the pivot as median-of-three (Line 11) induces no branch misses since all conditional statements are compiled to conditional moves. Finally, there is at most one branch miss in the main Quicksort loop for every call to partition (Line 180). This sums up to at most 13 branch misses per call to partition. Because the average size of arrays treated by Insertionsort is at least 8, the number of calls to partition is less than n/8.

Thus, in total, the O(n) term in Theorem 1 consists of at most (11/8) n + (13/8) n = 3n branch mispredictions.

3.2 Further Tuning of Block Partitioning

We propose and implemented further tunings for our block partitioner:

1. Loop unrolling: since the block size is a power of two, the loops of the scanning phase can be unrolled four or even eight times without causing additional overhead.

2. Cyclic permutations instead of swaps: we replace

1: for j = 0, . . . , num − 1 do
2:     swap(A[ℓ + offsetsL[startL + j]], A[r − offsetsR[startR + j]])
3: end for

by the following code, which does not perform exactly the same data movements, but still in the end all elements less than the pivot are on the left and all elements greater are on the right:

1: temp ← A[ℓ + offsetsL[startL]]
2: A[ℓ + offsetsL[startL]] ← A[r − offsetsR[startR]]
3: for j = 1, . . . , num − 1 do
4:     A[r − offsetsR[startR + j − 1]] ← A[ℓ + offsetsL[startL + j]]
5:     A[ℓ + offsetsL[startL + j]] ← A[r − offsetsR[startR + j]]
6: end for
7: A[r − offsetsR[startR + num − 1]] ← temp

Note that this is also a standard improvement for partitioning – see e.g. [3].
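A C++ rendering of this cyclic rearrangement might look as follows (our own sketch with plain indices, not the authors' code). One element is held in a temporary while the others are shifted along the cycle, so num misplaced pairs cost about 2·num + 1 element moves instead of the 3·num moves that num swaps would take.

```cpp
#include <cstddef>

// Cyclic-permutation rearrangement phase: shift the misplaced elements
// along one cycle through a single temporary instead of swapping pairwise.
void cyclic_rearrange(int* A, std::size_t l, std::size_t r,
                      const std::size_t* offsetsL, std::size_t startL,
                      const std::size_t* offsetsR, std::size_t startR,
                      std::size_t num) {
    if (num == 0) return;
    int temp = A[l + offsetsL[startL]];                   // open the cycle
    A[l + offsetsL[startL]] = A[r - offsetsR[startR]];
    for (std::size_t j = 1; j < num; ++j) {
        A[r - offsetsR[startR + j - 1]] = A[l + offsetsL[startL + j]];
        A[l + offsetsL[startL + j]] = A[r - offsetsR[startR + j]];
    }
    A[r - offsetsR[startR + num - 1]] = temp;             // close the cycle
}
```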

In the following, we always assume these two improvements since they are of a very basic nature (plus one more small change in the final rearrangement phase). We call the variant without them block_partition_simple – its C++ code can be found in Appendix B.

The next improvement is a slight change of the algorithm: in our experiments we noticed that for small arrays with many duplicates, the recursion depth often becomes higher than the threshold for switching to Heapsort. A way to circumvent this is an additional check for duplicates equal to the pivot if one of the following two conditions applies:

the pivot occurs twice in the sample for pivot selection (in the case of median-of-three),

the partitioning turns out very unbalanced for an array of small size.

The check for duplicates takes place after the partitioning is completed. Only the larger half of the array is searched for elements equal to the pivot. This check works similarly to Lomuto's partitioner (indeed, we used the implementation from [16]): starting from the position of the pivot, the respective half of the array is scanned for elements equal to the pivot (this can be done with one less-than comparison since the elements are already known to be greater or equal (resp. less or equal) to the pivot). Elements which are equal to the pivot are moved to the side of the pivot. The scan continues as long as at least every fourth element is equal to the pivot (instead of every fourth, one could take any other ratio – this guarantees that the check stops soon if there are only few duplicates).

After this check, all elements which are identified as being equal to the pivot remain in the middle of the array (between the elements larger and the elements smaller than the pivot); thus, they can be excluded from further recursive calls. We denote this version with the suffix duplicate check (dc).

4 Experiments

We ran thorough experiments with implementations in C++ on different machines with different types of data and different kinds of input sequences. If not specified explicitly, the experiments were run on an Intel Core i5-2500K CPU (3.30GHz, 4 cores, 32KB L1 instruction and data cache, 256KB L2 cache per core and 6MB L3 shared cache) with 16GB RAM and operating system Ubuntu Linux 64bit version 14.04.4. We used GNU's g++ (4.8.4), optimized with flags -O3 -march=native.

For time measurements, we used std::chrono::high_resolution_clock; for generating random inputs, the Mersenne Twister pseudo-random generator std::mt19937. All time measurements were repeated with the same 20 deterministically chosen seeds – the displayed numbers are the averages of these 20 runs. Moreover, for each time measurement, at least 128MB of data were sorted – if the array size is smaller, then several arrays were sorted and the total elapsed time measured. Our running time plots all display the actual time divided by the number of elements to sort on the y-axis.
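The measurement setup can be sketched like this (our own minimal sketch; the seed and array size are illustrative, whereas the actual experiments average over 20 fixed seeds and sort at least 128MB per measurement):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Fill an array with a random permutation using std::mt19937 seeded
// deterministically, sort it, and report nanoseconds per element.
double ns_per_element(std::size_t n, std::uint32_t seed) {
    std::vector<int> a(n);
    std::iota(a.begin(), a.end(), 0);
    std::mt19937 gen(seed);                       // fixed seed: reproducible runs
    std::shuffle(a.begin(), a.end(), gen);
    auto t0 = std::chrono::high_resolution_clock::now();
    std::sort(a.begin(), a.end());                // algorithm under test
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}
```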

We performed our running time experiments with three different data types:
- int: signed 32-bit integers.
- Vector: 10-dimensional array of 64-bit floating-point numbers (double). The order is defined via the Euclidean norm – for every comparison the sums of the squares of the components are computed and then compared.
- Record: 21-dimensional array of 32-bit integers. Only the first component is compared.

The code of our implementation of BlockQuicksort as well as the other algorithms and our running time experiments is available at https://github.com/weissan/BlockQuicksort.

Diﬀerent Block Sizes

Figure 1 shows experimental results on random permutations for different data types and block sizes ranging from 4 up to 2^24.

[Figure 1: Different block sizes for random permutations; time per element [ns] vs. block size (4 to 2^24), for Record (n = 2^20, 2^24), Vector (n = 2^20, 2^24), and int (n = 2^20, 2^27).]

We see that for integers only

at the end there is a slight negative effect when increasing the block size. Presumably this is because up to a block size of 2^19, two blocks still fit entirely into the L3 cache of the CPU. On the other hand, for Vector a block size of 64 and for Record one of 8 seem to be optimal – with considerably increasing running times for larger block sizes.

As a compromise, we chose to fix the block size to 128 elements for all further experiments. An alternative approach would be to choose a fixed number of bytes for one block and adapt the block size according to the size of the data elements.
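The byte-based alternative could look like this (a sketch; the 512-byte budget and the lower bound of 4 are our own illustrative choices, picked so that 4-byte ints again yield 128-element blocks):

```cpp
#include <cstddef>

// Derive the block size (in elements) from a fixed byte budget per block,
// with a small lower bound so that very large elements still form blocks.
template <typename T>
constexpr std::size_t block_size() {
    return 512 / sizeof(T) > 4 ? 512 / sizeof(T) : 4;
}
```

With this choice, int (4 bytes) gives 128 elements per block, double gives 64, and an 80-byte Vector would give 6.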


[Figure 2: Sorting random permutations of 32-bit integers with skewed pivot; time per element [ns] vs. skew factor (2 to 16), for block partition and GCC partition at n = 2^20 and n = 2^24. A skew factor k means that the n/k-th element is chosen as pivot of an array of length n.]

Skewed Pivot Experiments

We repeated the experiments from [15] with skewed pivot for both the usual Hoare partitioner (std::__unguarded_partition from the GCC implementation of std::sort) and our block partition method. For both partitioners we used our tuned Quicksort loop. The results can be seen in Figure 2: classic Quicksort benefits from a skewed pivot, whereas BlockQuicksort works best with the exact median. Therefore, for BlockQuicksort it makes sense to invest more effort in finding a good pivot.

Different Pivot Selection Methods

We implemented several strategies for pivot selection:
- median-of-three, median-of-five, median-of-twenty-three,
- median-of-three-medians-of-three, median-of-three-medians-of-five, median-of-five-medians-of-five: first calculate three (resp. five) times the median of three (resp. five) elements, then take the pivot as the median of these three (resp. five) medians,
- median-of-√n.

All pivot selection strategies switch to median-of-three for small arrays. Moreover, the median-of-√n variant switches to median-of-five-medians-of-five for arrays of length below 20000 (for smaller n even the number of comparisons was better with median-of-five-medians-of-five). The medians of larger samples are computed with std::nth_element. Despite the results on skewed pivots (Figure 2), there was no big difference between the different pivot selection strategies, as can be seen in Figure 3. As expected, median-of-three was always the slowest for larger arrays. Median-of-five-medians-of-five was the fastest for int and median-of-√n for Vector. We think that the small difference between all strategies is due to the large overhead of calculating the median of a large sample – and maybe because the array is rearranged in a way that is not favorable for the next recursive calls.
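A median-of-√n selection along these lines might be sketched as follows (our own illustrative code, not the paper's implementation; only the use of std::nth_element for the median of a larger sample is taken from the text, and copying the sample out is a simplification):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Take roughly sqrt(n) evenly spaced elements as a sample and return the
// median of the sample, computed with std::nth_element.
template <typename T, typename Less>
T median_of_sqrt(const std::vector<T>& a, Less less) {
    std::size_t n = a.size();
    std::size_t s = static_cast<std::size_t>(std::sqrt(double(n)));
    if (s == 0) s = 1;
    std::vector<T> sample;
    sample.reserve(s);
    for (std::size_t i = 0; i < s; ++i)
        sample.push_back(a[i * (n / s)]);          // evenly spaced positions
    std::nth_element(sample.begin(), sample.begin() + s / 2, sample.end(), less);
    return sample[s / 2];
}
```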

4.1 Comparison with other Sorting Algorithms

We compare variants of BlockQuicksort with the GCC implementation of std::sort^3 (which is known to be one of the most efficient Quicksort implementations – see e.g. [7]), Yaroslavskiy's

^3 For the source code see e.g. https://gcc.gnu.org/onlinedocs/gcc-4.7.2/libstdc++/api/a01462_source.html – be aware that in newer versions of GCC the implementation is slightly different: the old version uses the first, middle and last element as the sample for pivot selection, whereas the new version uses the second, middle and last element. For decreasingly sorted arrays the newer version works far better – for random permutations and increasingly sorted arrays, the old one is better. We used the old version for our experiments. The new version is included in some plots (Figures 9 and 10 in the appendix); this reveals an enormous difference between the two versions for particular inputs and underlines the importance of proper pivot selection.


[Figure 3: Different pivot selection strategies with random permutations; time divided by n log n [ns] vs. number of elements n (2^17 to 2^24). Upper left: Record; upper right: Vector; lower left: int. Strategies: median-of-three, median-of-five, median-of-23, median-of-3-medians-of-3, median-of-3-medians-of-5, median-of-5-medians-of-5, median-of-√n. Be aware that the y-axis here displays the time divided by n log n.]

dual-pivot Quicksort [30] (we converted the Java code of [30] to C++) and an implementation of Super Scalar Sample Sort [24] by Hübschle-Schneider^4. For random permutations and random values modulo √n, we also test Tuned Quicksort [16] and three-pivot Quicksort implemented by Aumüller and Bingmann^5 from [5] (which is based on [18]) – for other types of inputs we omit these algorithms because of their poor behavior with duplicate elements.

Branch mispredictions

We experimentally determined the number of branch mispredictions of BlockQuicksort and the other algorithms with the cachegrind branch prediction profiler, which is part of the profiling tool valgrind^6. The results of these experiments on random int data can be seen in Figure 4 – the y-axis shows the number of branch mispredictions divided by the array size. We only display the median-of-three variant of BlockQuicksort since all the variants are very much alike. We also added plots of BlockQuicksort and Tuned Quicksort skipping the final Insertionsort (i.e. the arrays remain partially unsorted).

We see that both std::sort and Yaroslavskiy's dual-pivot Quicksort incur Θ(n log n) branch mispredictions. The up and down for Super Scalar Sample Sort presumably is because of the variation in the size of the arrays where the base case sorting algorithm std::sort is

^4 URL: https://github.com/lorenzhs/ssssort/blob/b931c024cef3e6d7b7e7fd3ee3e67491d875e021/ssssort.h – retrieved April 12, 2016
^5 URL: http://eiche.theoinf.tu-ilmenau.de/Quicksort-experiments/ – retrieved March, 2016
^6 For more information on valgrind, see http://valgrind.org/. To perform the measurements we used the same Python script as in [11, 16], which first measures the number of branch mispredictions of the whole program including the generation of test cases and then, in a second run, measures the number of branch mispredictions incurred by the generation of test cases.


[Figure 4: Number of branch mispredictions; branch misses per element vs. number of elements n (2^10 to 2^24), for Yaroslavskiy, BlockQS, BlockQS (no IS), Tuned QS, Tuned QS (no IS), SSSSort, and std::sort.]

applied to. For BlockQuicksort there is an almost invisible n log n term in the number of branch mispredictions. Indeed, we computed an approximation of 0.02 n log n + 1.75 n branch mispredictions. Thus, the actual number of branch mispredictions is still better than our bounds in Theorem 1. There are two factors which contribute to this discrepancy: our rough estimates in the mentioned results, and the fact that the actual branch predictor of a modern CPU might be much better than a static branch predictor. Also note that approximately one half of the branch mispredictions are incurred by Insertionsort – only the other half by the actual block partitioning and main Quicksort loop.

Finally, Figure 4 shows that Katajainen et al.'s Tuned Quicksort is still more efficient with respect to branch mispredictions (only O(n)). This is no surprise since it does not need any checks whether buffers are empty etc. Moreover, we see that over 80% of the branch misses of Tuned Quicksort come from the final Insertionsort.

[Figure 5: Random permutation of int; time per element [ns] vs. number of elements (2^10 to 2^28), for Yaroslavskiy, BlockQS, BlockQS (mo-sq, dc), BlockQS (Hoare finish), BlockQS simple, Tuned QS, 3-pivot QS, SSSSort, and std::sort.]

Running Time Experiments

In Figure 5 we present running times on random int permutations of different BlockQuicksort variants and the other algorithms, including Katajainen's Tuned Quicksort and Aumüller and Bingmann's three-pivot Quicksort. The optimized BlockQuicksort variants need around 45ns per element when sorting 2^28 elements, whereas std::sort needs 85ns per element – thus, there is a speed increase of 88% (i.e. the number of elements sorted per second is increased by 88%)^7. The same algorithms are displayed in Figure 6 for sorting random ints between 0 and √n

[Figure 6: Random int values between 0 and √n; time per element [ns] vs. number of elements, for Yaroslavskiy, BlockQS, BlockQS (mo-sq, dc), BlockQS (Hoare finish), BlockQS simple, Tuned QS, 3-pivot QS, SSSSort, and std::sort.]

. Here, we observe that Tuned Quicksort is much worse than all the other algorithms (already for n = 2^12 it moves outside the displayed range). All variants of BlockQuicksort are faster than std::sort – the duplicate check (dc) version is almost twice as fast.

Figure 7 presents experiments with data containing a lot of duplicates and having specific structures – thus, maybe coming closer to "real-world" inputs (although it is not clear what that means). Since Tuned Quicksort and three-pivot Quicksort are much slower here than all the other algorithms, we exclude these two from the plots. The array for the left plot contains long already-sorted runs. This is most likely the reason that std::sort and Yaroslavskiy's dual-pivot Quicksort have running times similar to BlockQuicksort (for sorted sequences the conditional branches can be predicted easily, which explains the fast running time). The arrays for the middle and right plots start with sorted runs and become more and more erratic; the array for the right one also contains an extremely high number of duplicates. Here the advantage of BlockQuicksort – avoiding conditional branches – can be observed again. In all three plots the check for duplicates (dc) established a considerable improvement.

In Figure 8, we show the results of selected algorithms for random permutations of Vector and Record. We conjecture that the good results of Super Scalar Sample Sort on Records are because of its better cache behavior (since Records are large data elements with very cheap comparisons). More running time experiments, also on other machines and with other compiler flags, can be found in Appendix A.

^7 In an earlier version of this work, we presented slightly different outcomes of our experiments. One reason is the usage of another random number generator. Otherwise, we introduced only minor changes in the test environment – and no changes at all in the sorting algorithms themselves.


[Figure 7: Arrays A of int with duplicates; time per element [ns] vs. number of elements. Left: A[i] = i mod √n; middle: A[i] = i^2 + n/2 mod n; right: A[i] = i^8 + n/2 mod n. Since n is always a power of two, the value n/2 occurs approximately n^(7/8) times in the last case. Algorithms: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort.]
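The duplicate-heavy input of the right plot can be generated as in the following sketch (our own code, with an illustrative function name; it assumes, as in the experiments, that n is a power of two):

```cpp
#include <cstdint>
#include <vector>

// Generate A[i] = i^8 + n/2 mod n. For a power-of-two n, every even i gives
// i^8 ≡ 0 (mod n) once i^8 has enough factors of two, so the value n/2
// occurs extremely often.
std::vector<std::uint64_t> figure7_right(std::uint64_t n) {
    std::vector<std::uint64_t> a(n);
    for (std::uint64_t i = 0; i < n; ++i) {
        std::uint64_t p = 1;
        for (int k = 0; k < 8; ++k)
            p = (p * (i % n)) % n;     // i^8 mod n, without overflow in between
        a[i] = (p + n / 2) % n;
    }
    return a;
}
```

The other two patterns (i mod √n and i^2 + n/2 mod n) are analogous.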

[Figure 8: Random permutations; time per element [ns] vs. number of elements n. Left: Vector; right: Record. Algorithms: Yaroslavskiy, BlockQS (mo-√n, dc), BlockQS simple, SSSSort, std::sort.]

More Statistics

Table 1 shows the numbers of branches taken and branches mispredicted as well as the instruction count and cache misses. Although std::sort has a much lower instruction count than the other algorithms, it induces the most branch misses and (except for Tuned Quicksort) the most L1 cache misses (= L3 refs, since no L2 cache is simulated). BlockQuicksort not only has a low number of branch mispredictions, but also a good cache behavior – one reason for this is that Insertionsort is applied during the recursion and not at the very end.

5 Conclusions and Future Research

We have established an efficient in-place general-purpose sorting algorithm, which avoids branch mispredictions by converting results of comparisons to integers. In the experiments we have seen that it is competitive on different kinds of data. Moreover, in several benchmarks it is almost twice as fast as std::sort. Future research might address the following issues:
- We used Insertionsort as recursion stopper – inducing a linear number of branch misses. Is there a more efficient recursion stopper that induces fewer branch mispredictions?
- More efficient usage of the buffers: in our implementation the buffers on average are not even half filled. To use the space more efficiently, one could address the buffers cyclically and scan until one buffer is filled. By doing so, both buffers could also be filled in the same loop – however, at the cost of introducing additional overhead.


algorithm            | branches taken | branch misses | instructions | L1 refs | L3 refs | L3 misses
---------------------|----------------|---------------|--------------|---------|---------|----------
std::sort            | 37.81          | 10.23         | 174.82       | 51.96   | 1.05    | 0.41
SSSSort              | 16.2           | 3.87          | 197.06       | 68.47   | 0.82    | 0.5
Yaroslavskiy         | 52.92          | 9.51          | 218.42       | 59.82   | 0.79    | 0.27
BlockQS (mo-√n, dc)  | 20.55          | 2.39          | 322.08       | 89.9    | 0.77    | 0.27
BlockQS (mo5-mo5)    | 20.12          | 2.31          | 321.49       | 88.63   | 0.78    | 0.28
BlockQS              | 20.51          | 2.25          | 337.27       | 92.45   | 0.88    | 0.3
BlockQS (no IS)      | 15.38          | 1.09          | 309.85       | 84.66   | 0.88    | 0.3
Tuned QS             | 29.66          | 1.44          | 461.88       | 105.43  | 1.23    | 0.39
Tuned QS (no IS)     | 24.53          | 0.26          | 434.53       | 97.65   | 1.22    | 0.39

Table 1: Instruction count, branch and cache misses when sorting random int permutations of size 16777216 = 2^24. All displayed numbers are divided by the number of elements.

- The final rearrangement of the block partitioner is not optimal: for small arrays, similar problems with duplicates arise as for Lomuto's partitioner.
- Pivot selection strategy: though theoretically optimal, median-of-√n pivot selection is not best in practice. We also want to emphasize that not only the sample size but also the selection method is important (compare the different behavior of the two versions of std::sort for sorted and reversed permutations). It might even be beneficial to use a fast pseudo-random generator (e.g. a linear congruential generator) for selecting samples for pivot selection.
- Parallel versions: the block structure is very well suited for parallelism.
- A three-pivot version might be interesting, but efficient multi-pivot variants are not trivial: our first attempt was much slower.

Acknowledgments. Thanks to Jyrki Katajainen and Max Stenmark for allowing us to use their Python scripts for measuring branch mispredictions and cache misses, and to Lorenz Hübschle-Schneider for his implementation of Super Scalar Sample Sort. We are also indebted to Jan Philipp Wächter for all his help with creating the plots, to Daniel Bahrdt for answering many C++ questions, and to Christoph Greulich for his help with the experiments.

References
1. ARMv8 Instruction Set Overview, 2011. Document number: PRD03-GENC-010197 15.0.
2. Intel 64 and IA-32 Architecture Optimization Reference Manual, 2016. Order Number: 248966-032.
3. D. Abhyankar and M. Ingle. Engineering of a quicksort partitioning algorithm. Journal of Global Research in Computer Science, 2(2):17–23, 2011.
4. Martin Aumüller and Martin Dietzfelbinger. Optimal partitioning for dual pivot quicksort (extended abstract). In ICALP, pages 33–44, 2013.
5. Martin Aumüller, Martin Dietzfelbinger, and Pascal Klaue. How good is multi-pivot quicksort? CoRR, abs/1510.04676, 2015.
6. Paul Biggar, Nicholas Nash, Kevin Williams, and David Gregg. An experimental study of sorting and branch prediction. J. Exp. Algorithmics, 12:1.8:1–39, 2008.
7. Gerth Stølting Brodal, Rolf Fagerberg, and Kristoffer Vinther. Engineering a cache-oblivious sorting algorithm. J. Exp. Algorithmics, 12:2.2:1–23, 2008.
8. Gerth Stølting Brodal and Gabriel Moruz. Tradeoffs between branch mispredictions and comparisons for sorting algorithms. In WADS, volume 3608 of LNCS, pages 385–395. Springer, 2005.
9. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.
10. Amr Elmasry and Jyrki Katajainen. Lean programs, branch mispredictions, and sorting. In FUN, volume 7288 of LNCS, pages 119–130. Springer, 2012.
11. Amr Elmasry, Jyrki Katajainen, and Max Stenmark. Branch mispredictions don't affect mergesort. In SEA, pages 160–171, 2012.
12. Robert W. Floyd. Algorithm 245: Treesort 3. Communications of the ACM, 7(12):701, 1964.
13. John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 5th edition, 2011.
14. Charles A. R. Hoare. Quicksort. The Computer Journal, 5(1):10–16, 1962.
15. Kanela Kaligosi and Peter Sanders. How branch mispredictions affect quicksort. In ESA, pages 780–791, 2006.
16. Jyrki Katajainen. Sorting programs executing fewer branches. CPH STL Report 2263887503, Department of Computer Science, University of Copenhagen, 2014.
17. Donald E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison Wesley Longman, 2nd edition, 1998.
18. Shrinu Kushagra, Alejandro López-Ortiz, Aurick Qiao, and J. Ian Munro. Multi-pivot quicksort: Theory and experiments. In ALENEX, pages 47–60, 2014.
19. Anthony LaMarca and Richard E. Ladner. The influence of caches on the performance of sorting. J. Algorithms, 31(1):66–104, 1999.
20. Conrado Martínez, Markus E. Nebel, and Sebastian Wild. Analysis of branch misses in quicksort. In Workshop on Analytic Algorithmics and Combinatorics, ANALCO 2015, San Diego, CA, USA, January 4, 2015, pages 114–128, 2015.
21. Conrado Martínez and Salvador Roura. Optimal sampling strategies in quicksort and quickselect. SIAM J. Comput., 31(3):683–705, 2001.
22. David R. Musser. Introspective sorting and selection algorithms. Software—Practice and Experience, 27(8):983–993, 1997.
23. Charles Price. MIPS IV Instruction Set, 1995.
24. Peter Sanders and Sebastian Winkel. Super Scalar Sample Sort. In ESA, pages 784–796, 2004.
25. Robert Sedgewick. The analysis of quicksort programs. Acta Inf., 7(4):327–355, 1977.
26. Robert Sedgewick. Implementing quicksort programs. Commun. ACM, 21(10):847–857, 1978.
27. Sebastian Wild and Markus E. Nebel. Average case analysis of Java 7's dual pivot quicksort. In ESA, pages 825–836, 2012.
28. Sebastian Wild, Markus E. Nebel, and Ralph Neininger. Average case and distributional analysis of dual-pivot quicksort. ACM Transactions on Algorithms, 11(3):22:1–42, 2015.
29. J. W. J. Williams. Algorithm 232: Heapsort. Communications of the ACM, 7(6):347–348, 1964.
30. Vladimir Yaroslavskiy. Dual-Pivot Quicksort algorithm, 2009. URL: http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf.


A More Experimental Results

In Figures 9 and 10 we also included the new GCC implementation of std::sort (GCC version 4.8.4), marked as std::sort (new). The very small difference in the implementation of choosing the second instead of the first element as part of the sample for pivot selection makes an enormous difference when sorting special permutations like decreasingly sorted arrays. This shows how important not only the size of the pivot sample but also the proper selection is. In the other benchmarks both implementations were relatively close, so we do not show both of them.

[Figure 9: Permutations of int; time per element [ns] vs. number of elements. Left: sorted; middle: reversed; right: transposition – A[i] = i + n/2 mod n. Algorithms: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort (new), std::sort.]

[Figure 10: Random permutations of int with at most k inversions (k random swaps of neighboring elements); time per element [ns] vs. number of elements. Left: k = √n; middle: k = n; right: k = n log n. Algorithms: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort (new), std::sort.]


[Figure 11: Arrays A of int with many duplicates; time per element [ns] vs. number of elements. Left: constant; middle: A[i] = 0 for i < n/2 and A[i] = 1 otherwise; right: random 0-1 values. Algorithms: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort.]

[Figure 12: Sorting Vector; time per element [ns] vs. number of elements n. Left: random 0-1 values; middle: random values between 0 and √n; right: sorted. Algorithms: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort.]

[Figure 13: Sorting Record; time per element [ns] vs. number of elements n. Left: random 0-1 values; middle: random values between 0 and √n; right: sorted. Algorithms: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort.]


[Figure 14: Random permutations of int with other compiler optimizations; time per element [ns] vs. number of elements n. Left: -O3 -march=native -funroll-loops; right: -O1 -march=native. Algorithms: Yaroslavskiy, BlockQS (mo-√n, dc), BlockQS simple, SSSSort, std::sort.]

[Figure 15: Running time experiments with int on an Intel Xeon E5-4627v2 CPU (3.30GHz, 8 cores, 32KB L1 instruction and data cache, 256KB L2 cache per core and 16MB L3 shared cache) with 128GB RAM and operating system Windows Server 2012 R2; Cygwin g++ (4.9.3), optimized with flags -O3 -march=native. Left: random permutation; middle: random values between 0 and √n; right: sorted.]

[Figure 16: Running time experiments with int on a laptop with Intel Core i7-5500U CPU (2.40GHz, 2 cores, 32KB L1 instruction and data cache, 256KB L2 cache per core and 4MB L3 shared cache) with 8GB RAM and operating system Windows 10; Mingw g++ (5.3.0), optimized with flags -O3 -march=native. Left: random permutation; middle: random values between 0 and √n; right: sorted.]


B C++ Code

Here, we give the C++ code of the basic BlockQuicksort variant (the final rearranging is also done in block style, but no loop unrolling etc. is applied).

1tem p l a t e <ty pen am e iter , typ e na me Compare>

2i n l i n e v oi d s o r t _ p a i r ( i t e r i 1 , i t e r i2 , Com pare l e s s ) {

3t y p e d e f t yp en am e s td : : i t e r a t o r _ t r a i t s < i t e r > :: v al ue _ ty pe T;

4bool s m a l l e r = l e s s ( ∗i 2 , ∗i 1 ) ;

5T tem p = st d : : mov e ( s m a l l e r ? ∗i1 : tem p ) ;

6∗i 1 = st d : : mo ve ( s m a l l e r ? ∗i 2 : ∗i 1 ) ;

7∗i 2 = st d : : mo ve ( s m a l l e r ? temp : ∗i 2 ) ;

8}

9

10 tem p l a t e <ty pen am e iter , typ e na me Compare>

11 inline i t e r m edi an_ of_3 ( i t e r i 1 , i t e r i 2 , i t e r i 3 , Co mpare l e s s ) {

12 s o r t _ p a i r ( i 1 , i 2 , l e s s ) ;

13 s o r t _ p a i r ( i 2 , i 3 , l e s s ) ;

14 s o r t _ p a i r ( i 1 , i 2 , l e s s ) ;

15 r e t u r n i 2 ;

16 }

17

18 tem p l a t e <ty pen am e iter , typ e na me Compare>

19 inline i t e r h o a r e _ b l o c k _ p a r t i t i o n _ s i m p l e ( i t e r b e g in , i t e r e nd , i t e r p i v o t _ p o s i t i on

, Co mpare l e s s ) {

20 t y p e d e f t yp en am e s t d : : i t e r a t o r _ t r a i t s <i t e r > : : d i f f e r e n c e _ t y p e i n d ex ;

21 i nd e x in d ex L [ BLOCKSIZE ] , in de x R [ BLOCKSIZE ] ;

22

23 i t e r l a s t = e nd −1 ;

24 s t d : : i t er _ s w ap ( p i v o t _ p o s i t i o n , l a s t ) ;

25 c o n s t ty p en am e s td : : i t e r a t o r _ t r a i t s < i t e r > :: v al ue _ ty p e & p iv o t = ∗l a s t ;

26 pivot_position = la st ;

27 last −−;

28

29 i n t num_left = 0 ;

30 i n t num_right = 0 ;

31 i n t s t a r t _ l e f t = 0 ;

32 i n t start_right = 0;

33 i n t num ;

34 // m ai n l o o p

35 while ( l a s t −b e g i n + 1 > 2 ∗BLOCKSIZE)

36 {

37 // Com pare and s t o r e i n b u f f e r s

38 i f ( nu m_ le ft == 0 ) {

39 s t a r t _ l e f t = 0 ;

40 for ( in d e x j = 0 ; j < BLOCKSIZE; j++) {

41 in de xL [ nu m_l eft ] = j ;

42 nu m _l ef t += ( ! ( l e s s ( b eg i n [ j ] , p i v o t ) ) ) ;

43 }

44 }

45 i f ( nu m_r ig ht == 0 ) {

46 start_right = 0;

47 for ( in d e x j = 0 ; j < BLOCKSIZE; j++) {

48 in dex R [ num _r ight ] = j ;

49 nu m_r igh t += ! ( l e s s ( p i vo t , ∗( l a s t −j ) ) ) ;

50 }

51 }

52 // r e a r r a n ge e l e m e n t s

53 num = st d : : m in ( nu m_ l ef t , num _ ri gh t ) ;

54 for (int j = 0 ; j < num; j ++)

55 s td : : i te r _s wa p ( b eg i n + i nd ex L [ s t a r t _ l e f t + j ] , l a s t −i nd ex R [ s t a r t _ r i g h t +

j ] ) ;

56

57 num_left −= num;

58 num_right −= num;

59 s t a r t _ l e f t += num ;

60 start_right += num;

61 b e g i n += ( nu m_ l ef t == 0 ) ? BLOCKSIZE : 0 ;

62 l a s t −= ( nu m_r ig ht == 0 ) ? BLOCKSIZE : 0 ;

63

64 }// e nd m ain l o o p

65

66 // Com pare and s t o r e i n b u f f e r s f i n a l i t e r a t i o n

67 i nd e x s h i f t R = 0 , s h i f t L = 0 ;

68 i f ( n um_ ri gh t == 0 && n u m _l e f t == 0) { / / f o r s m a ll a r r ay s o r i n t h e u n l i k e l y

c as e t ha t b ot h b u f f e r s a r e emp ty

69 s h i f t L = ( ( l a s t −be g i n ) + 1) / 2 ;

22 BlockQuicksort

70 s h i f t R = ( l a s t −be g i n ) + 1 −s h i f t L ;

71 s t a r t _ l e f t = 0 ; s t a r t _ r i g h t = 0 ;

72 for ( i n de x j = 0 ; j < s h i f t L ; j ++) {

73 in de xL [ nu m_l eft ] = j ;

74 nu m _l ef t += ( ! l e s s ( b e g i n [ j ] , p i v o t ) ) ;

75 in dex R [ num _r ight ] = j ;

76 nu m_r igh t += ! l e s s ( p iv o t , ∗( l a s t −j ) ) ;

77 }

78 i f ( s h i f t L < s h i f t R )

79 {

80 in de x R [ nu m_ ri gh t ] = s h i f t R −1;

81 nu m_r igh t += ! l e s s ( p iv o t , ∗( l a s t −s h i f t R + 1 ) ) ;

82 }

83 }

84 e l s e i f ( num_right != 0) {

85 s h i f t L = ( l a s t −b e g i n ) −BLOCKSIZE + 1;

86 s h i f t R = BLOCKSIZE;

87 s t a r t _ l e f t = 0 ;

88 for ( i n de x j = 0 ; j < s h i f t L ; j ++) {

89 in de xL [ nu m_l eft ] = j ;

90 nu m _l ef t += ( ! l e s s ( b e g i n [ j ] , p i v o t ) ) ;

91 }

92 }

93 e l s e {

94 s h i f t L = BLOCKSIZE ;

95 s h i f t R = ( l a s t −be g i n ) −BLOCKSIZE + 1 ;

96 start_right = 0;

97 for ( i n d e x j = 0 ; j < s h i f t R ; j ++) {

98 in dex R [ num _r ight ] = j ;

99 nu m_r igh t += ! ( l e s s ( p i vo t , ∗( l a s t −j ) ) ) ;

100 }

101 }

102

103 // r ea r r a ng e f i n a l i t e r a t i o n

104 num = st d : : m in ( nu m_ l ef t , num _ ri gh t ) ;

105 for (int j = 0 ; j < num; j ++)

106 s td : : i te r _s wa p ( b eg i n + i nd ex L [ s t a r t _ l e f t + j ] , l a s t −i nd ex R [ s t a r t _ r i g h t +

j ] ) ;

107

108 num_left −= num;

109 num_right −= num;

110 s t a r t _ l e f t += num ;

111 start_right += num;

112 b e g i n += ( nu m_ l ef t == 0 ) ? s h i f t L : 0 ;

113 l a s t −= ( nu m_r ig ht == 0 ) ? s h i f t R : 0 ;

114 // en d f i n a l i t e r a t i o n

115

    // rearrange elements remaining in buffer
    if (num_left != 0)
    {
        int lowerI = start_left + num_left - 1;
        index upper = last - begin;
        // search first element to be swapped
        while (lowerI >= start_left && indexL[lowerI] == upper) {
            upper--; lowerI--;
        }
        while (lowerI >= start_left)
            std::iter_swap(begin + upper--, begin + indexL[lowerI--]);

        std::iter_swap(pivot_position, begin + upper + 1); // fetch the pivot
        return begin + upper + 1;
    }
    else if (num_right != 0) {
        int lowerI = start_right + num_right - 1;
        index upper = last - begin;
        // search first element to be swapped
        while (lowerI >= start_right && indexR[lowerI] == upper) {
            upper--; lowerI--;
        }

        while (lowerI >= start_right)
            std::iter_swap(last - upper--, last - indexR[lowerI--]);

        std::iter_swap(pivot_position, last - upper); // fetch the pivot
        return last - upper;
    }
    else { // no remaining elements


        std::iter_swap(pivot_position, begin); // fetch the pivot
        return begin;
    }
}


template<typename iter, typename Compare>
struct Hoare_block_partition_simple {
    static inline iter partition(iter begin, iter end, Compare less) {
        // choose pivot
        iter mid = median_of_3(begin, begin + (end - begin) / 2, end, less);
        // partition
        return hoare_block_partition_simple(begin + 1, end - 1, mid, less);
    }
};


// Quicksort main loop. Implementation based on Tuned Quicksort (Elmasry,
// Katajainen, Stenmark)
template<template<class, class> class Partitioner, typename iter, typename Compare>
inline void qsort(iter begin, iter end, Compare less) {
    const int depth_limit = 2 * ilogb((double)(end - begin)) + 3;
    iter stack[80];
    iter* s = stack;
    int depth_stack[40];
    int depth = 0;
    int* d_s_top = depth_stack;
    *s = begin;
    *(s + 1) = end;
    s += 2;
    *d_s_top = 0;
    ++d_s_top;
    do {
        if (depth < depth_limit && end - begin > IS_THRESH) {
            iter pivot = Partitioner<iter, Compare>::partition(begin, end, less);
            // push large side to stack and continue on small side
            if (pivot - begin > end - pivot) {
                *s = begin;
                *(s + 1) = pivot;
                begin = pivot + 1;
            }
            else {
                *s = pivot + 1;
                *(s + 1) = end;
                end = pivot;
            }
            s += 2;
            depth++;
            *d_s_top = depth;
            ++d_s_top;
        }
        else {
            if (end - begin > IS_THRESH) // recursion depth limit exceeded: Heapsort fallback
                std::partial_sort(begin, end, end, less);
            else
                Insertionsort::insertion_sort(begin, end, less); // copy of std::__insertion_sort (GCC 4.7.2)

            // pop new subarray from stack
            s -= 2;
            begin = *s;
            end = *(s + 1);
            --d_s_top;
            depth = *d_s_top;
        }
    } while (s != stack);
}


// example invocation of qsort
int main(void) {
    std::vector<int> v;
    //
    // assign values to v
    //
    qsort<Hoare_block_partition_simple>(v.begin(), v.end(), std::less<int>());
}