
BlockQuicksort: How Branch Mispredictions don't affect Quicksort

Stefan Edelkamp¹ and Armin Weiß²
¹ TZI, Universität Bremen,
Am Fallturm 1, D-28239 Bremen, Germany
² Stevens Institute of Technology,
1 Castle Point Terrace, Hoboken, NJ 07030, USA
Abstract
Since the work of Kaligosi and Sanders (2006), it is well-known that Quicksort – which is
commonly considered as one of the fastest in-place sorting algorithms – suffers in an essential
way from branch mispredictions. We present a novel approach to address this problem by partially
decoupling control from data flow: in order to perform the partitioning, we split the input in
blocks of constant size (we propose 128 data elements); then, all elements in one block are
compared with the pivot and the outcomes of the comparisons are stored in a buffer. In a second
pass, the respective elements are rearranged. By doing so, we avoid conditional branches based on
outcomes of comparisons at all (except for the final Insertionsort). Moreover, we prove that for
a static branch predictor the average total number of branch mispredictions is at most
$\epsilon n \log n + O(n)$ for some small $\epsilon$ depending on the block size when sorting $n$ elements.
Our experimental results are promising: when sorting random integer data, we achieve an
increase in speed (number of elements sorted per second) of more than 80% over the GCC
implementation of C++ std::sort. Also for many other types of data and non-random inputs,
there is still a signiﬁcant speedup over std::sort. Only in few special cases like sorted or
almost sorted inputs, std::sort can beat our implementation. Moreover, even on random input
permutations, our implementation is even slightly faster than an implementation of the highly
tuned Super Scalar Sample Sort, which uses a linear amount of additional space.
1998 ACM Subject Classiﬁcation F.2.2 Nonnumerical Algorithms and Problems
Keywords and phrases in-place sorting, Quicksort, branch mispredictions, lean programs
1 Introduction
Sorting a sequence of elements of some totally ordered universe remains one of the most
fascinating and well-studied topics in computer science. Moreover, it is an essential part
of many practical applications. Thus, eﬃcient sorting algorithms directly transfer to a
performance gain for many applications. One of the most widely used sorting algorithms is
Quicksort, which was introduced by Hoare in 1962 [14] and is considered to be one of the
most eﬃcient sorting algorithms. For sorting an array, it works as follows: ﬁrst, it chooses an
arbitrary pivot element and then rearranges the array such that all elements smaller than the
pivot are moved to the left side and all elements larger than the pivot are moved to the right
side of the array – this is called partitioning. Then, the left and right side are both sorted
recursively. Although its average¹ number of comparisons is not optimal – $1.38 n \log n + O(n)$
vs. $n \log n + O(n)$ for Mergesort –, its overall instruction count is very low. Moreover, by
choosing the pivot element as median of some larger sample, the leading term $1.38 n \log n$
for the average number of comparisons can be made smaller – even down to $n \log n$ when
choosing the pivot as median of some sample of growing size [21]. Other advantages of
Quicksort are that it is easy to implement and that it does not need extra memory except
the recursion stack of logarithmic size (even in the worst case if properly implemented). A
major drawback of Quicksort is its quadratic worst-case running time. Nevertheless, there
are efficient ways to circumvent a really bad worst-case. The most prominent is Introsort
(introduced by Musser [22]), which is applied in the GCC implementation of std::sort: as soon
as the recursion depth exceeds a certain limit, the algorithm switches to Heapsort.
¹ Here and in the following, the average case refers to a uniform distribution of all input permutations
assuming all elements are different.
arXiv:1604.06697v2 [cs.DS] 23 Jun 2016
Another deﬁciency of Quicksort is that it suﬀers from branch mispredictions (or branch
misses) in an essential way. On modern processors with long pipelines (14 stages for Intel
Haswell, Broadwell, Skylake processors – for the older Pentium 4 processors even more
than twice as many) every branch misprediction causes a rather long interruption of the
execution since the pipeline has to be filled anew. In [15], Kaligosi and Sanders analyzed the
number of branch mispredictions incurred by Quicksort. They examined different simple
branch prediction schemes (static prediction and 1-bit, 2-bit predictors) and showed that
with all of them, Quicksort with a random element as pivot causes on average $cn \log n + O(n)$
branch mispredictions for some constant $c = 0.34$ (resp. $c = 0.46$, $c = 0.43$). In particular, in
Quicksort with random pivot element, every fourth comparison is followed by a mispredicted
branch. The reason is that for partitioning, each element is compared with the pivot and
depending on the outcome either it is swapped with some other element or not. Since for an
optimal pivot (the median), the probability of being smaller than the pivot is 50%, there is no
way to predict these branches.
Kaligosi and Sanders also established that choosing skewed pivot elements (far oﬀ the
median) might even decrease the running time because it makes branches more predictable.
This also explains why, although theoretically larger samples for pivot selection were shown
to be superior, in practice the median-of-three variant turned out to be the best. In [6], the
skewed pivot phenomenon is confirmed experimentally. Moreover, in [20], precise theoretical
bounds on the number of branch misses for Quicksort are given – establishing also theoretical
superiority of skewed pivots under the assumption that branch mispredictions are expensive.
In [8] Brodal and Moruz proved a general lower bound on the number of branch mispredictions
given that every comparison is followed by a conditional branch which depends on
the outcome of the comparison. In this case there are $\Omega(n \log_d n)$ branch mispredictions for
a sorting algorithm which performs $O(dn \log n)$ comparisons. As Elmasry and Katajainen
remarked in [10], this theorem does not hold anymore if the results of comparisons are not used
for conditional branches. Indeed, they showed that every program can be transformed into a
program which induces only a constant number of branch misses and whose running time
is linear in the running time of the original program. However, this general transformation
introduces a huge constant factor overhead. Still, in [10] and [11] Elmasry, Katajainen and
Stenmark showed how to efficiently implement many algorithms related to sorting with only
few branch mispredictions. They call such programs lean. In particular, they present variants
of Mergesort and Quicksort suffering only very little from branch misses. Their Quicksort
variant (called Tuned Quicksort, for details on the implementation, see [16]) is very fast for
random permutations – however, it does not behave well with duplicate elements because it
applies Lomuto’s uni-directional partitioner (see e. g. [9]).
Another development in recent years is multi-pivot Quicksort (i. e. several pivots in each
partitioning stage [4, 5, 18, 27, 28]). It started with the introduction of Yaroslavskiy’s
dual-pivot Quicksort [30] – which, surprisingly, was faster than known Quicksort variants and, thus,
became the standard sorting implementation in Oracle Java 7 and Java 8. Concerning branch
mispredictions all these multi-pivot variants behave essentially like ordinary Quicksort [20];
however, they have one advantage: every data element is accessed only a few times (this is
also referred to as the number of scans). As outlined in [5], increasing the number of pivot
elements further (up to 127 or 255), leads to Super Scalar Sample Sort, which has been
introduced by Sanders and Winkel [24]. Super Scalar Sample Sort not only has the advantage
of few scans, but also is based on the idea of avoiding conditional branches. Indeed, the
correct bucket (the position between two pivot elements) can be found by converting the
results of comparisons to integers and then simply performing integer arithmetic. In their
experiments Sanders and Winkel show that Super Scalar Sample Sort is approximately twice
as fast as Quicksort (std::sort) when sorting random integer data. However, Super Scalar
Sample Sort has one major drawback: it uses a linear amount of extra space (for sorting $n$
data elements, it requires space for another $n$ data elements and additionally for more than
$n$ integers). In the conclusion of [15], Kaligosi and Sanders raised the question:
    However, an in-place sorting algorithm that is better than Quicksort with skewed pivots
    is an open problem.
(Here, in-place means that it needs only a constant or logarithmic amount of extra space.)
In this work, we solve the problem by presenting our block partition algorithm, which makes it
possible to implement Quicksort without any branch mispredictions incurred by conditional branches
based on results of comparisons (except for the final Insertionsort – also there are still
conditional branches based on the control-flow, but their amount is relatively small). We call
the resulting algorithm BlockQuicksort. Our work is inspired by Tuned Quicksort from [11],
from where we also borrow parts of our implementation. The diﬀerence is that by doing the
partitioning block-wise, we can use Hoare’s partitioner, which is far better with duplicate
elements than Lomuto’s partitioner (although Tuned Quicksort can be made working with
duplicates by applying a check for duplicates similar to what we propose for BlockQuicksort
as one of the further improvements in Section 3.2). Moreover, BlockQuicksort is also superior
to Tuned Quicksort for random permutations of integers.
Our Contributions
We present a variant of the partition procedure that only incurs few branch mispredictions
by storing results of comparisons in constant size buﬀers.
We prove an upper bound of $\epsilon n \log n + O(n)$ branch mispredictions on average, where
$\epsilon < \frac{1}{16}$ for our proposed block size (Theorem 1).
We propose some improvements over the basic version.
We implemented our algorithm with an stl-style interface².
We conduct experiments and compare BlockQuicksort with std::sort, Yaroslavskiy’s
dual-pivot Quicksort and Super Scalar Sample Sort – on random integer data it is faster
than all of these and also Katajainen et al.’s Tuned Quicksort.
Outline
Section 2 introduces some general facts on branch predictors and mispredictions,
and gives a short account of standard improvements of Quicksort. In Section 3, we give
a precise description of our block partition method and establish our main theoretical
result – the bound on the number of branch mispredictions. Finally, in Section 4, we
experimentally evaluate different block sizes, different pivot selection strategies and compare
our implementation with other state of the art implementations of Quicksort and Super
Scalar Sample Sort.
² Code available at https://github.com/weissan/BlockQuicksort
2 Preliminaries
Logarithms denoted by $\log$ are always base 2. The term average case refers to a uniform
distribution of all input permutations assuming all elements are diﬀerent. In the following
std::sort always refers to its GCC implementation.
Branch Misses
Branch mispredictions can occur when the code contains conditional jumps
(i. e. if statements, loops, etc.). Whenever the execution ﬂow reaches such a statement,
the processor has to decide in advance which branch to follow and decode the subsequent
instructions of that branch. Because of the length of the pipeline of modern microprocessors,
a wrongly predicted branch causes a large delay since, before continuing the execution, the
instructions for the other branch have to be decoded.
Branch Prediction Schemes
Precise branch prediction schemes of most modern processors
are not disclosed to the public. However, the simplest schemes suﬃce to make BlockQuicksort
induce only few mispredictions.
The easiest branch prediction scheme is the static predictor: for every conditional jump
the compiler marks the more likely branch. In particular, that means that for every if
statement, we can assume that there is a misprediction if and only if the if branch is not
taken; for every loop statement, there is precisely one misprediction for every time the
execution ﬂow reaches that loop: when the execution leaves the loop. For more information
about branch prediction schemes, we refer to [13, Section 3.3].
How to avoid Conditional Branches
The usual implementation of sorting algorithms
performs conditional jumps based on the outcome of comparisons of data elements. There
are at least two methods how these conditional jumps can be avoided – both are supported
by the hardware of modern processors:
Conditional moves (CMOVcc instructions on x86 processors) – or, more generally, conditional
execution. In C++, compilation to a conditional move can (often) be triggered by
    i = (x < y) ? j : i;
Casting Boolean variables to integers (SETcc instructions on x86 processors). In C++:
    int i = (x < y);
Also many other instruction sets support these methods (e. g. ARM [1], MIPS [23]). Still, the
Intel Architecture Optimization Reference Manual [2] advises only to use these instructions to
avoid unpredictable branches (as it is the case for sorting) since correctly predicted branches
are still faster. For more examples how to apply these methods to sorting, see [11].
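The two idioms above can be illustrated with a small sketch. The function names `count_smaller` and `index_of_min` are hypothetical and chosen only for this illustration; whether a compiler actually emits SETcc or CMOVcc for them depends on the compiler and its flags, so this shows the source-level pattern rather than a guaranteed instruction sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts elements smaller than `pivot` without branching on the comparison:
// the Boolean result is converted to an integer and added to the counter.
std::size_t count_smaller(const std::vector<int>& a, int pivot) {
    std::size_t count = 0;
    for (int x : a) {
        count += static_cast<std::size_t>(x < pivot);
    }
    return count;
}

// Returns the index of a minimum element. The ternary on plain integers is a
// pattern that compilers often (but not always) turn into a conditional move.
std::size_t index_of_min(const std::vector<int>& a) {
    assert(!a.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < a.size(); ++i) {
        best = (a[i] < a[best]) ? i : best;
    }
    return best;
}
```

In both loops the only conditional jump left is the loop condition itself, which is easy to predict.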
Quicksort and improvements
The central part of Quicksort is the partitioning procedure.
Given some pivot element, it returns a pointer $p$ to an element in the array and rearranges
the array such that all elements left of $p$ are smaller than or equal to the pivot and all elements
on the right are greater than or equal to the pivot. Quicksort first chooses some pivot element, then
performs the partitioning, and, finally, recurses on the elements smaller and larger than the pivot –
see Algorithm 1. We call the procedure which organizes the calls to the partitioner the
Quicksort main loop.
Algorithm 1 Quicksort
procedure Quicksort(A[ℓ, ..., r])
    if r > ℓ then
        pivot ← choosePivot(A[ℓ, ..., r])
        cut ← partition(A[ℓ, ..., r], pivot)
        Quicksort(A[ℓ, ..., cut − 1])
        Quicksort(A[cut, ..., r])
    end if
end procedure
There are many standard improvements for Quicksort. For our optimized Quicksort main
loop (which is a modified version of Tuned Quicksort [11, 16]), we implemented the following:
A very basic optimization due to Sedgewick [26] avoids recursion partially (e.g. std::sort)
or totally (here – this requires the introduction of an explicit stack).
Introsort [22]: there is an additional counter for the number of recursion levels. As soon
as it exceeds some bound (std::sort uses $2 \log n$ – we use $2 \log n + 3$), the algorithm
stops Quicksort and switches to Heapsort [12, 29] (only for the respective sub-array). By
doing so, a worst-case running time of $O(n \log n)$ is guaranteed.
Sedgewick [26] also proposed to switch to Insertionsort (see e. g. [17, Section 5.2.1]) as
soon as the array size is less than some fixed small constant (16 for std::sort and our
implementation). There are two possibilities when to apply Insertionsort: either during
the recursion, when the array size becomes too small, or at the very end after Quicksort
has finished. We implemented the first possibility (in contrast to std::sort) because for
sorting integers, it hardly made a difference, but for larger data elements there was a
slight speedup (in [19] this was proposed as memory-tuned Quicksort).
After partitioning, the pivot is moved to its correct position and not included in the
recursive calls (not applied in std::sort).
The basic version of Quicksort uses a random or fixed element as pivot. A slight
improvement is to choose the pivot as median of three elements – typically the first,
the middle and the last. This is applied in std::sort and many other Quicksort
implementations. Sedgewick [26] already remarked that choosing the pivots from an even
larger sample does not provide a significant increase of speed. In view of the experiments
with skewed pivots [15], this is no surprise. For BlockQuicksort, a pivot closer to the
median turns out to be beneficial (Figure 2 in Section 4). Thus, it makes sense to invest
more time to find a better pivot element. In [21], Martinez and Roura show that the
number of comparisons incurred by Quicksort is minimal if the pivot element is selected
as median of $\Theta(\sqrt{n})$ elements. Another variant is to choose the pivot as median of three
(resp. five) elements which themselves are medians of three (resp. five) elements. We
implemented all these variants for our experiments – see Section 4.
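The pivot selection variants mentioned above can be sketched as follows. This is an illustration only, not the code used in the experiments: the function names are hypothetical, the nine sample positions are assumed to be evenly spread over the array, and the comparisons are written as ordinary if statements (which a compiler may or may not turn into conditional moves).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the index (one of i, j, k) that holds the median of a[i], a[j], a[k].
std::size_t median_of_three(const std::vector<int>& a,
                            std::size_t i, std::size_t j, std::size_t k) {
    assert(i < a.size() && j < a.size() && k < a.size());
    if (a[j] < a[i]) std::swap(i, j);  // now a[i] <= a[j]
    if (a[k] < a[j]) std::swap(j, k);  // now a[k] is a maximum
    if (a[j] < a[i]) std::swap(i, j);  // now a[i] <= a[j] <= a[k]
    return j;
}

// Median of three medians of three. The sample positions t * (n - 1) / 8 for
// t = 0, ..., 8 are an assumption made for this sketch.
std::size_t median_of_medians_of_three(const std::vector<int>& a) {
    assert(a.size() >= 9);
    std::size_t medians[3];
    for (std::size_t g = 0; g < 3; ++g) {
        const std::size_t p0 = (3 * g) * (a.size() - 1) / 8;
        const std::size_t p1 = (3 * g + 1) * (a.size() - 1) / 8;
        const std::size_t p2 = (3 * g + 2) * (a.size() - 1) / 8;
        medians[g] = median_of_three(a, p0, p1, p2);
    }
    return median_of_three(a, medians[0], medians[1], medians[2]);
}
```

Only the local copies of the indices are swapped, so the array itself is left untouched by the selection.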
Our main contribution is the block partitioner, which we describe in the next section.
3 Block Partitioning
The idea of block partitioning is quite simple. Recall how Hoare’s original partition procedure
works (Algorithm 2): Two pointers start at the leftmost and rightmost elements of the array
and move towards the middle. In every step the current element is compared to the pivot
(Line 3 and 4). If some element on the right side is less than or equal to the pivot (resp. some element
on the left side is greater than or equal), the respective pointer stops and the two elements found
this way are swapped (Line 5). Then the pointers continue moving towards the middle.
Algorithm 2 Hoare’s Partitioning
1: procedure Partition(A[ℓ, ..., r], pivot)
2:     while ℓ < r do
3:         while A[ℓ] < pivot do ℓ++ end while
4:         while A[r] > pivot do r−− end while
5:         if ℓ < r then swap(A[ℓ], A[r]); ℓ++; r−− end if
6:     end while
7:     return ℓ
8: end procedure
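For comparison with the block-wise variant, the following sketch shows a runnable two-pointer partition. It is the common textbook formulation with the first element as pivot, not a literal transcription of Algorithm 2; in particular, its return convention (the last index of the left part) is an assumption of this sketch. The two inner loops are exactly the comparison-dependent branches that are hard to predict.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Partitions a[lo..hi] (inclusive) around pivot = a[lo] and returns an index
// cut with lo <= cut < hi such that a[lo..cut] <= pivot <= a[cut+1..hi].
std::ptrdiff_t hoare_partition(std::vector<int>& a,
                               std::ptrdiff_t lo, std::ptrdiff_t hi) {
    assert(0 <= lo && lo < hi && hi < static_cast<std::ptrdiff_t>(a.size()));
    const int pivot = a[lo];
    std::ptrdiff_t i = lo - 1;
    std::ptrdiff_t j = hi + 1;
    while (true) {
        do { ++i; } while (a[i] < pivot);  // branch depends on a comparison
        do { --j; } while (a[j] > pivot);  // branch depends on a comparison
        if (i >= j) return j;
        std::swap(a[i], a[j]);
    }
}
```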
The idea of BlockQuicksort (Algorithm 3) is to separate Lines 3 and 4 of Algorithm 2
from Line 5: fix some block size $B$; we introduce two buffers offsetsL[0, ..., B − 1] and
offsetsR[0, ..., B − 1] for storing pointers to elements (offsetsL will store pointers to elements
on the left side of the array which are greater than or equal to the pivot element – likewise
offsetsR for the right side). The main loop of Algorithm 3 consists of two stages: the scanning
phase (Lines 5 to 18) and the rearrangement phase (Lines 19 to 26).
Algorithm 3 Block partitioning
1: procedure BlockPartition(A[ℓ, ..., r], pivot)
2:     integer offsetsL[0, ..., B − 1], offsetsR[0, ..., B − 1]
3:     integer startL, startR, numL, numR ← 0
4:     while r − ℓ + 1 > 2B do                          ▷ start main loop
5:         if numL = 0 then                             ▷ if left buffer is empty, refill it
6:             startL ← 0
7:             for i = 0, ..., B − 1 do
8:                 offsetsL[numL] ← i
9:                 numL += (pivot ≤ A[ℓ + i])           ▷ scanning phase for left side
10:            end for
11:        end if
12:        if numR = 0 then                             ▷ if right buffer is empty, refill it
13:            startR ← 0
14:            for i = 0, ..., B − 1 do
15:                offsetsR[numR] ← i
16:                numR += (pivot ≥ A[r − i])           ▷ scanning phase for right side
17:            end for
18:        end if
19:        integer num = min(numL, numR)
20:        for j = 0, ..., num − 1 do
21:            swap(A[ℓ + offsetsL[startL + j]], A[r − offsetsR[startR + j]])    ▷ rearrangement phase
22:        end for
23:        numL, numR −= num; startL, startR += num
24:        if (numL = 0) then ℓ += B end if
25:        if (numR = 0) then r −= B end if
26:    end while                                        ▷ end main loop
27:    compare and rearrange remaining elements
28: end procedure
Like for classical Hoare partition, we also start with two pointers (or indices as in the
pseudocode) to the leftmost and rightmost element of the array. First, the scanning phase
takes place: the buﬀers which are empty are reﬁlled. In order to do so, we move the respective
pointer towards the middle and compare each element with the pivot. However, instead
of stopping at the ﬁrst element which should be swapped, only a pointer to the element is
stored in the respective buﬀer (Lines 8 and 9 resp. 15 and 16 – actually the pointer is always
stored, but depending on the outcome of the comparison a counter holding the number of
pointers in the buﬀer is increased or not) and the pointer continues moving towards the
middle. After an entire block of $B$ elements has been scanned (either on both sides of the
array or only on one side), the rearranging phase begins: it starts with the ﬁrst positions of
the two buﬀers and swaps the data elements they point to (Line 21); then it continues until
one of the buﬀers contains no more pointers to elements which should be swapped. Now the
scanning phase is restarted and the buﬀer that has run empty is ﬁlled again.
The algorithm continues this way until fewer elements than two times the block size
remain. Now, the simplest variant is to switch to the usual Hoare partition method for
the remaining elements (in the experiments with suffix Hoare finish). But, we also can
continue with the idea of block partitioning: the algorithm scans the remaining elements
as one or two ﬁnal blocks (of smaller size) and performs a last rearrangement phase. After
that, some elements to swap in one of the two buﬀers might still remain, while the other
buﬀer is empty. With one run through the buﬀer, all these elements can be moved to the left
resp. right (similar as it is done in the Lomuto partitioning method, but without performing
actual comparisons). We do not present the details for this ﬁnal rearranging here because
on one hand it gets a little tedious and on the other hand it does neither provide a lot of
insight into the algorithm nor is it necessary to prove our result on branch mispredictions.
The C++ code of this basic variant can be found in Appendix B.
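As a rough illustration of the main loop described above, here is a minimal sketch on a vector of integers. It is not the code from Appendix B: the name `block_partition`, the half-open index range, and the buffers of type `std::size_t` are choices made for this sketch, and the final phase is simplified by handing the remaining (fewer than 2B plus leftover) elements to `std::partition` instead of the block-wise or Hoare finish discussed in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kBlockSize = 128;  // B, the block size proposed in the text

// Partitions `a` around the value `pivot` and returns an index cut such that
// a[0..cut) <= pivot and a[cut..n) >= pivot.
std::size_t block_partition(std::vector<int>& a, int pivot) {
    std::size_t offsets_l[kBlockSize], offsets_r[kBlockSize];
    std::size_t start_l = 0, start_r = 0, num_l = 0, num_r = 0;
    std::size_t lo = 0, hi = a.size();  // unprocessed range is a[lo..hi)

    while (hi - lo > 2 * kBlockSize) {
        if (num_l == 0) {  // left buffer empty: scan the next left block
            start_l = 0;
            for (std::size_t i = 0; i < kBlockSize; ++i) {
                offsets_l[num_l] = i;  // always stored, kept only if counted
                num_l += static_cast<std::size_t>(!(a[lo + i] < pivot));
            }
        }
        if (num_r == 0) {  // right buffer empty: scan the next right block
            start_r = 0;
            for (std::size_t i = 0; i < kBlockSize; ++i) {
                offsets_r[num_r] = i;
                num_r += static_cast<std::size_t>(!(pivot < a[hi - 1 - i]));
            }
        }
        // Rearrangement phase: no branch depends on a comparison with the pivot.
        const std::size_t num = std::min(num_l, num_r);
        for (std::size_t j = 0; j < num; ++j) {
            std::swap(a[lo + offsets_l[start_l + j]],
                      a[hi - 1 - offsets_r[start_r + j]]);
        }
        num_l -= num;  num_r -= num;
        start_l += num;  start_r += num;
        if (num_l == 0) lo += kBlockSize;
        if (num_r == 0) hi -= kBlockSize;
    }
    // Simplified finish (assumption of this sketch): rescan the remainder.
    assert(lo <= hi);
    auto mid = std::partition(a.begin() + static_cast<std::ptrdiff_t>(lo),
                              a.begin() + static_cast<std::ptrdiff_t>(hi),
                              [pivot](int x) { return x < pivot; });
    return static_cast<std::size_t>(mid - a.begin());
}
```

A block that was scanned but not fully rearranged when the loop ends is still inside a[lo..hi), so rescanning it in the finish keeps the result correct at the cost of a few repeated comparisons.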
3.1 Analysis
If the input consists of random permutations (all data elements diﬀerent), the average
numbers of comparisons and swaps are the same as for usual Quicksort with median-of-three.
This is because both Hoare’s partitioner and the block partitioner preserve randomness of
the array.
The number of scanned elements (total number of elements loaded to the registers) is
increased by two times the number of swaps, because for every swap, the data elements have
to be loaded again. However, the idea is that due to the small block size, the data elements
still remain in L1 cache when being swapped – so the additional scan has no negative eﬀect on
the running time. In Section 4 we see that for larger data types and from a certain threshold
on, an increasing size of the blocks has a negative eﬀect on the running time. Therefore,
the block size should not be chosen too large – we propose $B = 128$ and fix this constant
throughout (thus, already for inputs of moderate size, the buﬀers also do not require much
more space than the stack for Quicksort).
Branch mispredictions
The next theorem is our main theoretical result. For simplicity we
assume here that BlockQuicksort is implemented without the worst-case-stopper Heapsort
(i. e. there is no limit on the recursion depth). Since there is only a low probability that a
high recursion depth is reached while the array is still large, this assumption is not a real
restriction. We analyze a static branch predictor: there is a misprediction every time a loop
is left and a misprediction every time the if branch of an if statement is not taken.
▶ Theorem 1. Let $C$ be the average number of comparisons of Quicksort with constant size
pivot sample. Then BlockQuicksort (without limit to the recursion depth and with the same
pivot selection method) with block size $B$ induces at most $\frac{6}{B} \cdot C + O(n)$ branch mispredictions on
average. In particular, BlockQuicksort with median-of-three induces less than $\frac{8}{B} n \log n + O(n)$
branch mispredictions on average.
Theorem 1 shows that when choosing the block size sufficiently large, the $n \log n$-term
becomes very small and – for real-world inputs – we can basically assume a linear number of
branch mispredictions. Moreover, Theorem 1 can be generalized to samples of non-constant
size for pivot selection. Since the proof might become tedious, we stick to the basic variant
here. The constant 6 in Theorem 1 can be replaced by 4 when implementing Lines 19, 24,
and 25 of Algorithm 3 with conditional moves.
Proof. First, we show that every execution of the block partitioner Algorithm 3 on an array
of length $n$ induces at most $\frac{6}{B} n + c$ branch mispredictions for some constant $c$. In order
to do so, we only need to look at the main loop (Line 4 to 27) of Algorithm 3 because the
final scanning and rearrangement phases consider only a constant (at most $2B$) number of
elements. Inside the main loop there are three for loops (starting Lines 7, 14, 20), four if
statements (starting Lines 5, 12, 24, 25) and the min calculation (whose straightforward
implementation is an if statement – Line 19). We know that in every execution of the main
loop at least one of the conditions of the if statements in Line 5 and 12 is true because in
every rearrangement phase at least one buﬀer runs empty. The same holds for the two if
statements in Line 24 and 25. Therefore, we obtain at most two branch mispredictions for
the if s, three for the for loops and one for the min in every execution of the main loop.
In every execution of the main loop, there are at least $B$ comparisons of elements with the
pivot. Thus, the number of branch misses in the main loop is at most $\frac{6}{B}$ times the number of
comparisons. Hence, for every input permutation the total number of branch mispredictions
of BlockQuicksort is at most
    $\frac{6}{B} \cdot \#\text{comparisons} + (c + c') \cdot \#\text{calls to partition} + O(n)$,
where $c'$ is the number of branch mispredictions of one execution of the main loop of Quicksort
(including pivot selection, which only needs a constant number of instructions) and the $O(n)$
term comes from the final Insertionsort. The number of calls to partition is bounded by $n$
because each element can be chosen as pivot only once (since the pivots are not contained in
the arrays for the recursive calls). Thus, by taking the average over all input permutations,
the first statement follows.
The second statement follows because Quicksort with median-of-three incurs $1.18 n \log n + O(n)$
comparisons on average [25]. ◀
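As a worked check of the second statement, one can plug the median-of-three comparison count into the bound of Theorem 1 and then insert the proposed block size. Nothing here goes beyond the constants already stated in the text; since $B$ is a constant, the factor $\frac{6}{B}$ applied to the $O(n)$ term is again $O(n)$.

```latex
\[
  \frac{6}{B}\cdot\bigl(1.18\,n\log n + O(n)\bigr) + O(n)
  \;=\; \frac{7.08}{B}\,n\log n + O(n)
  \;<\; \frac{8}{B}\,n\log n + O(n),
\]
\[
  B = 128:\qquad \frac{8}{B} = \frac{8}{128} = \frac{1}{16},
  \quad\text{so}\quad \epsilon < \frac{1}{16}.
\]
```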
▶ Remark. The $O(n)$-term in Theorem 1 can be bounded by $3n$ by taking a closer look at
the final rearranging phase.
We give a rough heuristic estimate: it is safe to assume that the average length of arrays
on which Insertionsort is called is at least 8 (recall that we switch to Insertionsort as soon as
the array size is less than 17). For Insertionsort there is one branch miss for each element
(when exiting the loop for finding the position) plus one for each call of Insertionsort (when
exiting the loop over all elements to insert). Furthermore, there are at most two branch
misses in the main Quicksort loop (Lines 177 and 196 in Appendix B) for every call to
Insertionsort. Hence, we have approximately $\frac{11}{8} n$ branch misses due to Insertionsort.
It remains to count the constant number of branch mispredictions incurred during every
call of partitioning: After exiting the main loop of block partition, there is one more scan
and rearrangement phase with a smaller block size. This leads to at most 7 branch
mispredictions (one extra because there is an additional case that both buffers are empty).
The ﬁnal rearranging incurs at most three branch misses (Lines 118, 136, 140). Selecting the
pivot as median-of-three (Line 11) induces no branch misses since all conditional statements
are compiled to conditional moves. Finally, there is at most one branch miss in the main
Quicksort loop for every call to partition (Line 180). This sums up to at most 13 branch
misses per call to partition. Because the average size of arrays treated by Insertionsort is at
least 8, the number of calls to partition is less than n/8.
Thus, in total the $O(n)$-term in Theorem 1 consists of at most $\frac{11}{8} n + \frac{13}{8} n = 3n$ branch
mispredictions.
3.2 Further Tuning of Block Partitioning
We propose and implemented further tunings for our block partitioner:
1. Loop unrolling: since the block size is a power of two, the loops of the scanning phase
can be unrolled four or even eight times without causing additional overhead.
2. Cyclic permutations instead of swaps: We replace
    for j = 0, ..., num − 1 do
        swap(A[ℓ + offsetsL[startL + j]], A[r − offsetsR[startR + j]])
    end for
by the following code, which does not perform exactly the same data movements, but
still in the end all elements less than the pivot are on the left and all elements greater
are on the right:
    temp ← A[ℓ + offsetsL[startL]]
    A[ℓ + offsetsL[startL]] ← A[r − offsetsR[startR]]
    for j = 1, ..., num − 1 do
        A[r − offsetsR[startR + j − 1]] ← A[ℓ + offsetsL[startL + j]]
        A[ℓ + offsetsL[startL + j]] ← A[r − offsetsR[startR + j]]
    end for
    A[r − offsetsR[startR + num − 1]] ← temp
Note that this is also a standard improvement for partitioning – see e. g. [3].
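The cyclic permutation can be written out as a small runnable sketch. To keep it self-contained, the hypothetical function `cyclic_rearrange` takes absolute array indices of the misplaced elements (`left` for the left side, `right` for the right side) rather than offsets relative to the two pointers; that simplification is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Moves `num` misplaced pairs with one cyclic permutation instead of `num`
// swaps: each element is written once, and only one temporary is needed.
void cyclic_rearrange(std::vector<int>& a,
                      const std::vector<std::size_t>& left,
                      const std::vector<std::size_t>& right,
                      std::size_t num) {
    assert(num <= left.size() && num <= right.size());
    if (num == 0) return;
    const int temp = a[left[0]];
    a[left[0]] = a[right[0]];
    for (std::size_t j = 1; j < num; ++j) {
        a[right[j - 1]] = a[left[j]];
        a[left[j]] = a[right[j]];
    }
    a[right[num - 1]] = temp;
}
```

For example, with pivot 5 and the array {9, 1, 8, 2, 6, 3, 7, 4}, the misplaced left positions are 0 and 2 and the misplaced right positions (scanned from the right) are 7 and 5. The call leaves {4, 1, 3, 2, 6, 9, 7, 8}, which differs from the result of pairwise swaps but is partitioned in the same way.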
In the following, we always assume these two improvements since they are of very basic
nature (plus one more small change in the ﬁnal rearrangement phase). We call the variant
without them block_partition_simple – its C++ code can be found in Appendix B.
The next improvement is a slight change of the algorithm: in our experiments we noticed
that for small arrays with many duplicates the recursion depth becomes often higher than
the threshold for switching to Heapsort – a way to circumvent this is an additional check for
duplicates equal to the pivot if one of the following two conditions applies:
the pivot occurs twice in the sample for pivot selection (in the case of median-of-three),
the partitioning results very unbalanced for an array of small size.
The check for duplicates takes place after the partitioning is completed. Only the larger half
of the array is searched for elements equal to the pivot. This check works similar to Lomuto’s
partitioner (indeed, we used the implementation from [
16
]): starting from the position of the
pivot, the respective half of the array is scanned for elements equal to the pivot (this can be
done by one less than comparison since elements are already known to be greater or equal
(resp. less or equal) the pivot)). Elements which are equal to the pivot are moved to the side
of the pivot. The scan continues as long as at least every fourth element is equal to the pivot
(instead every fourth one could take any other ratio – this guarantees that the check stops
soon if there are only few duplicates).
After this check, all elements which are identiﬁed as being equal to the pivot remain
in the middle of the array (between the elements larger and the elements smaller than the
pivot); thus, they can be excluded from further recursive calls. We denote this version with
the suﬃx duplicate check (dc).
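A minimal sketch of such a duplicate check for the right part of the array is given below. It is not the implementation from [16]; the function name, the `std::vector` interface and the exact stopping rule (the ratio is tested after every fourth scanned element) are assumptions made for illustration. The left part would be handled symmetrically.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical duplicate check on a[p+1 .. end).
// Precondition: a[p] is the pivot and every a[i] with p < i < end is >= a[p].
// Returns the index one past the last element known to be equal to the pivot,
// so that a[p .. result) can be excluded from further recursive calls.
template <typename T>
std::size_t collect_pivot_duplicates_right(std::vector<T>& a, std::size_t p, std::size_t end) {
    std::size_t boundary = p + 1;  // a[p .. boundary) holds elements equal to the pivot
    std::size_t scanned = 0;
    for (std::size_t i = p + 1; i < end; ++i) {
        ++scanned;
        // a[i] >= pivot is already known, so "not greater" means "equal":
        // a single less-than comparison suffices.
        if (!(a[p] < a[i])) {
            std::swap(a[boundary], a[i]);  // move the duplicate next to the pivot
            ++boundary;
        }
        const std::size_t found = boundary - (p + 1);
        // Assumed stopping rule: give up once fewer than every fourth element was a duplicate.
        if (scanned % 4 == 0 && 4 * found < scanned) break;
    }
    return boundary;
}
```

With few duplicates the scan inspects only a handful of elements, so the extra cost stays small in the common case.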
4 Experiments
We ran thorough experiments with implementations in C++ on diﬀerent machines with
diﬀerent types of data and diﬀerent kinds of input sequences. If not speciﬁed explicitly, the
experiments are run on an Intel Core i5-2500K CPU (3.30GHz, 4 cores, 32KB L1 instruction
and data cache, 256KB L2 cache per core and 6MB L3 shared cache) with 16GB RAM and
operating system Ubuntu Linux 64bit version 14.04.4. We used GNU's g++ (4.8.4), optimized
with flags -O3 -march=native.
For time measurements, we used std::chrono::high_resolution_clock; for generating
random inputs, the Mersenne Twister pseudo-random generator std::mt19937. All time
measurements were repeated with the same 20 deterministically chosen seeds – the displayed
numbers are the average of these 20 runs. Moreover, for each time measurement, at least
128MB of data were sorted – if the array size is smaller, then for this time measurement
several arrays have been sorted and the total elapsed time measured. Our running time plots
all display the actual time divided by the number of elements to sort on the y-axis.
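The measurement procedure described above could be sketched as follows. The harness is hypothetical: the function name `ns_per_element`, its parameters and the use of the values 0, 1, 2, ... as seeds are assumptions, and it is restricted to 32-bit integers for brevity.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical harness: average running time per element in nanoseconds.
// sort_fn   : callable taking (begin, end) iterators of a std::vector<std::int32_t>
// n         : number of elements per array (must be > 0)
// min_bytes : minimal amount of data sorted per measurement (128MB in the text)
// num_seeds : number of deterministic seeds (20 in the text)
template <typename Sorter>
double ns_per_element(Sorter sort_fn, std::size_t n, std::size_t min_bytes, int num_seeds) {
    using clock = std::chrono::high_resolution_clock;
    const std::size_t array_bytes = n * sizeof(std::int32_t);
    // If one array is smaller than min_bytes, sort several arrays per measurement.
    const std::size_t repeats =
        std::max<std::size_t>(1, (min_bytes + array_bytes - 1) / array_bytes);
    double total_ns = 0.0;
    for (int seed = 0; seed < num_seeds; ++seed) {
        std::mt19937 gen(static_cast<std::uint32_t>(seed));  // same seeds for every algorithm
        std::vector<std::vector<std::int32_t>> arrays(repeats, std::vector<std::int32_t>(n));
        for (auto& a : arrays) {
            std::iota(a.begin(), a.end(), 0);
            std::shuffle(a.begin(), a.end(), gen);  // random permutation
        }
        const auto t0 = clock::now();  // input generation is not timed
        for (auto& a : arrays) sort_fn(a.begin(), a.end());
        const auto t1 = clock::now();
        total_ns += std::chrono::duration<double, std::nano>(t1 - t0).count();
    }
    return total_ns / (static_cast<double>(num_seeds) * repeats * n);
}
```

Reusing the same seeds for every algorithm means that all algorithms sort identical inputs, so differences in the averages are not caused by different input sequences.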
We performed our running time experiments with three diﬀerent data types:
- int: signed 32-bit integers.
- Vector: 10-dimensional array of 64-bit floating-point numbers (double). The order is
  defined via the Euclidean norm: for every comparison the sums of the squares of the
  components are computed and then compared.
- Record: 21-dimensional array of 32-bit integers. Only the first component is compared.
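For illustration, the two compound data types might be declared as below. The struct layout and the member name `components` are assumptions; only the element counts, the element types and the comparison rules come from the description above.

```cpp
#include <array>
#include <cstdint>

// Hypothetical declaration of the Vector type: expensive comparison, medium-sized element.
struct Vector {
    std::array<double, 10> components;
};

// Order by Euclidean norm; the sums of squares are recomputed for every comparison.
inline bool operator<(const Vector& a, const Vector& b) {
    double sa = 0.0, sb = 0.0;
    for (double x : a.components) sa += x * x;
    for (double x : b.components) sb += x * x;
    return sa < sb;  // comparing squared norms gives the same order as comparing norms
}

// Hypothetical declaration of the Record type: cheap comparison, large element.
struct Record {
    std::array<std::int32_t, 21> components;
};

// Only the first component takes part in the comparison.
inline bool operator<(const Record& a, const Record& b) {
    return a.components[0] < b.components[0];
}
```

The two types stress different costs: Vector makes each comparison expensive, whereas Record makes each element move expensive.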
The code of our implementation of BlockQuicksort as well as the other algorithms and our
running time experiments is available at https://github.com/weissan/BlockQuicksort.
Different Block Sizes
Figure 1 shows experimental results on random permutations for different data types and
block sizes ranging from 4 up to 2^24. We see that for integers only at the end there is a
slight negative effect when increasing the block size. Presumably this is because up to a
block size of 2^19, still two blocks fit entirely into the L3 cache of the CPU. On the other
hand, for Vector a block size of 64 and for Record a block size of 8 seem to be optimal, with
a considerably increasing running time for larger block sizes.

[Figure 1: Different block sizes for random permutations. x-axis: block size (2^1 to 2^23);
y-axis: time per element [ns]. Curves: Record (n = 1048576 and n = 16777216), Vector
(n = 1048576 and n = 16777216), int (n = 1048576 and n = 134217728).]

As a compromise we chose to ﬁx the block size to 128 elements for all further experiments.
An alternative approach would be to choose a ﬁxed number of bytes for one block and adapt
the block size according to the size of the data elements.
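The alternative mentioned above, a fixed byte budget per block, could look like the following sketch. The function name `block_size_for` and the default budget of 512 bytes are assumptions; the budget is chosen here only because it yields 128 elements for 4-byte integers.

```cpp
#include <cstddef>

// Hypothetical: derive the number of elements per block from a byte budget.
// Returns at least 1 so that very large element types still get a valid block.
template <typename T>
constexpr std::size_t block_size_for(std::size_t block_bytes = 512) {
    return block_bytes / sizeof(T) > 0 ? block_bytes / sizeof(T) : 1;
}
```

Whether a single byte budget suits all data types would have to be checked experimentally, since the optimal block sizes reported above differ considerably between the types.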
[Figure 2: Sorting random permutations of 32-bit integers with skewed pivot. A skew factor k
means that the (n/k)-th element is chosen as pivot of an array of length n. x-axis: skew
factor (2 to 16); y-axis: time per element [ns]. Curves: block partition and GCC partition,
each for n = 1048576 and n = 16777216.]

Skewed Pivot Experiments
We repeated the experiments from [15] with skewed pivot for both the usual Hoare partitioner
(std::__unguarded_partition, from the GCC implementation of std::sort) and our block
partition method. For both partitioners we used our tuned Quicksort loop. The results can be
seen in Figure 2: classic Quicksort benefits from a skewed pivot, whereas BlockQuicksort works
best with the exact median. Therefore, for BlockQuicksort it makes sense to invest more effort
to find a good pivot.
Different Pivot Selection Methods
We implemented several strategies for pivot selection:
- median-of-three, median-of-five, median-of-twenty-three,
- median-of-three-medians-of-three, median-of-three-medians-of-five,
  median-of-five-medians-of-five: first calculate three (resp. five) times the median of
  three (resp. five) elements, then take the pivot as median of these three (resp. five)
  medians,
- median-of-√n.
All pivot selection strategies switch to median-of-three for small arrays. Moreover, the
median-of-√n variant switches to median-of-five-medians-of-five for arrays of length below
20000 (for smaller n even the number of comparisons was better with
median-of-five-medians-of-five). The medians of larger samples are computed with
std::nth_element. Despite the results on skewed pivots (Figure 2), there was no big difference
between the different pivot selection strategies, as can be seen in Figure 3. As expected,
median-of-three was always the slowest for larger arrays. Median-of-five-medians-of-five was
the fastest for int and median-of-√n for Vector. We think that the small difference between
all strategies is due to the large overhead for the calculation of the median of a large
sample, and maybe because the array is rearranged in a way that is not favorable for the next
recursive calls.
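As an illustration of the nested strategies, the following sketch computes a median-of-three-medians-of-three pivot index. It is not the authors' code: the function names and the choice of nine evenly spaced sample positions are assumptions, and the sample elements are left in place instead of being rearranged.

```cpp
#include <cstddef>
#include <vector>

// Index of the median of a[i], a[j], a[k] (two or three comparisons).
template <typename T>
std::size_t median3_index(const std::vector<T>& a, std::size_t i, std::size_t j, std::size_t k) {
    if (a[i] < a[j]) {
        if (a[j] < a[k]) return j;
        return (a[i] < a[k]) ? k : i;
    }
    if (a[i] < a[k]) return i;
    return (a[j] < a[k]) ? k : j;
}

// Hypothetical median-of-three-medians-of-three on a[lo .. hi), hi exclusive.
// Requires hi - lo >= 9; samples nine evenly spaced positions (an assumption).
template <typename T>
std::size_t median_of_3_medians_of_3(const std::vector<T>& a, std::size_t lo, std::size_t hi) {
    const std::size_t step = (hi - lo) / 9;
    std::size_t medians[3];
    for (std::size_t g = 0; g < 3; ++g) {
        const std::size_t base = lo + 3 * g * step;  // first sample position of group g
        medians[g] = median3_index(a, base, base + step, base + 2 * step);
    }
    return median3_index(a, medians[0], medians[1], medians[2]);
}
```

The sketch uses at most twelve comparisons per pivot, which is negligible for large arrays but explains why all strategies fall back to median-of-three for small ones.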
4.1 Comparison with other Sorting Algorithms
We compare variants of BlockQuicksort with the GCC implementation of std::sort³ (which is
known to be one of the most efficient Quicksort implementations, see e. g. [7]), Yaroslavskiy's
dual-pivot Quicksort [30] (we converted the Java code of [30] to C++) and an implementation
of Super Scalar Sample Sort [24] by Hübschle-Schneider⁴. For random permutations and
random values modulo √n, we also test Tuned Quicksort [16] and three-pivot Quicksort
implemented by Aumüller and Bingmann⁵ from [5] (which is based on [18]); for other types
of inputs we omit these algorithms because of their poor behavior with duplicate elements.

³ For the source code see e.g.
https://gcc.gnu.org/onlinedocs/gcc-4.7.2/libstdc++/api/a01462_source.html; be aware that in
newer versions of GCC the implementation is slightly different: the old version uses the
first, middle and last element as sample for pivot selection, whereas the new version uses the
second, middle and last element. For decreasingly sorted arrays the newer version works far
better; for random permutations and increasingly sorted arrays, the old one is better. We used
the old version for our experiments. The new version is included in some plots (Figures 9 and
10 in the appendix); this reveals an enormous difference between the two versions for
particular inputs and underlines the importance of proper pivot selection.

[Figure 3: Different pivot selection strategies with random permutations. Upper left: Record;
upper right: Vector; lower left: int. x-axis: number of elements n (2^17 to 2^24); y-axis:
time divided by n log n [ns] (be aware that this differs from the other running time plots).
Curves: median-of-three, median-of-five, median-of-23, median-of-3-medians-of-3,
median-of-3-medians-of-5, median-of-5-medians-of-5, median-of-√n.]

Branch mispredictions
We experimentally determined the number of branch mispredictions of BlockQuicksort and the
other algorithms with the cachegrind branch prediction profiler, which is part of the
profiling tool valgrind⁶. The results of these experiments on random int data can be seen in
Figure 4; the y-axis shows the number of branch mispredictions divided by the array size. We
only display the median-of-three variant of BlockQuicksort since all the variants are very
much alike. We also added plots of BlockQuicksort and Tuned Quicksort skipping the final
Insertionsort (i. e. the arrays remain partially unsorted).

[Figure 4: Number of branch mispredictions.]

⁴ URL: https://github.com/lorenzhs/ssssort/blob/b931c024cef3e6d7b7e7fd3ee3e67491d875e021/ssssort.h,
retrieved April 12, 2016.
⁵ URL: http://eiche.theoinf.tu-ilmenau.de/Quicksort-experiments/, retrieved March, 2016.
⁶ http://valgrind.org/. To perform the measurements we used the same Python script as in
[11, 16], which first measures the number of branch mispredictions of the whole program
including generation of test cases and then, in a second run, measures the number of branch
mispredictions incurred by the generation of test cases.

We see that both std::sort and Yaroslavskiy's dual-pivot Quicksort incur Θ(n log n)
branch mispredictions. The up and down for Super Scalar Sample Sort presumably is because
of the variation in the size of the arrays where the base case sorting algorithm std::sort is
applied to. For BlockQuicksort there is an almost non-visible n log n term for the number of
branch mispredictions. Indeed, we computed an approximation of 0.02 n log n + 1.75 n branch
mispredictions. Thus, the actual number of branch mispredictions is still better than our
bounds in Theorem 1. There are two factors which contribute to this discrepancy: our rough
estimates in the mentioned results, and that the actual branch predictor of a modern CPU
might be much better than a static branch predictor. Also note that approximately one half
of the branch mispredictions are incurred by Insertionsort; only the other half by the actual
block partitioning and main Quicksort loop.
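As a quick plausibility check, the fitted approximation can be evaluated per element. The helper below is hypothetical, and reading log as the logarithm to base 2 is an assumption. For n = 2^24 it gives 0.02 · 24 + 1.75 = 2.23 mispredictions per element, which is close to the 2.25 branch misses per element listed for BlockQS in Table 1.

```cpp
#include <cmath>

// Fitted estimate from the text, 0.02 n log n + 1.75 n, divided by n.
// Assumption: log denotes the logarithm to base 2.
inline double estimated_mispredictions_per_element(double n) {
    return 0.02 * std::log2(n) + 1.75;
}
```

Under this reading, the n log n term contributes only about 0.48 of the 2.23 mispredictions per element at n = 2^24, which matches the remark that it is almost invisible in the plot.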
Finally, Figure 4 shows that Katajainen et al.'s Tuned Quicksort is still more efficient
with respect to branch mispredictions (only O(n)). This is no surprise since it does not need
any checks whether buffers are empty etc. Moreover, we see that over 80% of the branch
misses of Tuned Quicksort come from the final Insertionsort.
[Figure 5: Random permutation of int. x-axis: number of elements (2^10 to 2^28); y-axis:
time per element [ns]. Curves: Yaroslavskiy, BlockQS, BlockQS (mo-sq, dc), BlockQS (Hoare
finish), BlockQS simple, Tuned QS, 3-pivot QS, SSSSort, std::sort.]

Running Time Experiments
In Figure 5 we present running times on random int permutations of different BlockQuicksort
variants and the other algorithms including Katajainen's Tuned Quicksort and Aumüller and
Bingmann's three-pivot Quicksort. The optimized BlockQuicksort variants need around 45ns per
element when sorting 2^28 elements, whereas std::sort needs 85ns per element; thus, there is a
speed increase of 88% (i. e. the number of elements sorted per second is increased by 88%)⁷.
The same algorithms are displayed in Figure 6 for sorting random ints between 0 and √n. Here,
we observe that Tuned Quicksort is much worse than all the other algorithms (already for
n = 2^12 it moves outside the displayed range). All variants of BlockQuicksort are faster than
std::sort; the duplicate check (dc) version is almost twice as fast.

[Figure 6: Random int values between 0 and √n. Axes and curves as in Figure 5.]

Figure 7 presents experiments with data containing a lot of duplicates and having specific
structures, thus maybe coming closer to “real-world” inputs (although it is not clear what
that means). Since here Tuned Quicksort and three-pivot Quicksort are much slower than all
the other algorithms, we exclude these two algorithms from the plots. The array for the left
plot contains long already sorted runs. This is most likely the reason that std::sort and
Yaroslavskiy's dual-pivot Quicksort have similar running times to BlockQuicksort (for sorted
sequences the conditional branches can be easily predicted, which explains the fast running
time). The arrays for the middle and right plot start with sorted runs and become more and
more erratic; the array for the right one also contains an extremely high number of duplicates.
Here the advantage of BlockQuicksort, avoiding conditional branches, can be observed
again. In all three plots the check for duplicates (dc) established a considerable improvement.
In Figure 8, we show the results of selected algorithms for random permutations of Vector
and Record. We conjecture that the good results of Super Scalar Sample Sort on Records
are because of its better cache behavior (since Records are large data elements with very
cheap comparisons). More running time experiments, also on other machines and with other
compiler flags, can be found in Appendix A.
⁷ In an earlier version of this work, we presented slightly different outcomes of our
experiments. One reason is the usage of another random number generator. Otherwise, we
introduced only minor changes in the test environment, and no changes at all in the sorting
algorithms themselves.

[Figure 7: Arrays A of int with duplicates. Left: A[i] = i mod √n; middle:
A[i] = i^2 + n/2 mod n; right: A[i] = i^8 + n/2 mod n. Since n is always a power of two, the
value n/2 occurs approximately n^(7/8) times in the last case. x-axis: number of elements
(2^10 to 2^28); y-axis: time per element [ns]. Curves: Yaroslavskiy, BlockQS (mo-sq, dc),
BlockQS simple, SSSSort, std::sort.]

[Figure 8: Random permutations. Left: Vector; right: Record. x-axis: number of elements n
(2^10 to 2^24); y-axis: time per element [ns]. Curves: Yaroslavskiy, BlockQS (mo-√n, dc),
BlockQS simple, SSSSort, std::sort.]

More Statistics
Table 1 shows the number of branches taken / branches mispredicted as well as the instruction
count and cache misses. Although std::sort has a much lower instruction count than the other
algorithms, it induces most branch misses and (except Tuned Quicksort) most L1 cache misses
(= L3 refs since no L2 cache is simulated). BlockQuicksort does not only have a low number of
branch mispredictions, but also a good cache behavior; one reason for this is that
Insertionsort is applied during the recursion and not at the very end.
5 Conclusions and Future Research
We have established an efficient in-place general purpose sorting algorithm, which avoids
branch mispredictions by converting results of comparisons to integers. In the experiments we
have seen that it is competitive on different kinds of data. Moreover, in several benchmarks
it is almost twice as fast as std::sort. Future research might address the following issues:
- We used Insertionsort as recursion stopper, inducing a linear number of branch misses.
  Is there a more efficient recursion stopper that induces fewer branch mispredictions?
- More efficient usage of the buffers: in our implementation the buffers on average are not
  even half filled. To use the space more efficiently one could address the buffers cyclically
  and scan until one buffer is filled. By doing so, also both buffers could be filled in the
  same loop, however at the cost of introducing additional overhead.

algorithm             branches taken   branch misses   instructions   L1 refs   L3 refs   L3 misses
std::sort                      37.81           10.23         174.82     51.96      1.05        0.41
SSSSort                        16.2             3.87         197.06     68.47      0.82        0.5
Yaroslavskiy                   52.92            9.51         218.42     59.82      0.79        0.27
BlockQS (mo-√n, dc)            20.55            2.39         322.08     89.9       0.77        0.27
BlockQS (mo5-mo5)              20.12            2.31         321.49     88.63      0.78        0.28
BlockQS                        20.51            2.25         337.27     92.45      0.88        0.3
BlockQS (no IS)                15.38            1.09         309.85     84.66      0.88        0.3
Tuned QS                       29.66            1.44         461.88    105.43      1.23        0.39
Tuned QS (no IS)               24.53            0.26         434.53     97.65      1.22        0.39

Table 1: Instruction count, branch and cache misses when sorting random int permutations of
size 16777216 = 2^24. All displayed numbers are divided by the number of elements.

- The final rearrangement of the block partitioner is not optimal: for small arrays, similar
  problems with duplicates arise as for Lomuto's partitioner.
- Pivot selection strategy: though theoretically optimal, median-of-√n pivot selection is
  not best in practice. Also we want to emphasize that not only the sample size but also
  the selection method is important (compare the different behavior of the two versions of
  std::sort for sorted and reversed permutations). It might even be beneficial to use a
  fast pseudo-random generator (e. g. a linear congruential generator) for selecting samples
  for pivot selection.
- Parallel versions: the block structure is very well suited for parallelism.
- A three-pivot version might be interesting, but efficient multi-pivot variants are not
  trivial: our first attempt was much slower.
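To make the suggestion of a fast pseudo-random generator concrete, a linear congruential generator for drawing sample positions might look as follows. This is a sketch, not part of the presented implementation: the struct name `Lcg` is hypothetical, and the multiplier and increment are the widely used 32-bit constants from Numerical Recipes, chosen here only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Minimal linear congruential generator: state = a * state + c (mod 2^32).
struct Lcg {
    std::uint32_t state;
    explicit Lcg(std::uint32_t seed) : state(seed) {}

    std::uint32_t next() {
        state = 1664525u * state + 1013904223u;  // unsigned arithmetic wraps modulo 2^32
        return state;
    }

    // Pseudo-random index in [0, n) for n > 0. The low-order bits of an LCG are weak,
    // so they are shifted out; the slight modulo bias is ignored in this sketch.
    std::size_t index(std::size_t n) {
        return static_cast<std::size_t>(next() >> 8) % n;
    }
};
```

A pivot sample could then be drawn by calling `index(end - begin)` a few times; one multiplication and one addition per sample position keep the overhead low.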
Acknowledgments Thanks to Jyrki Katajainen and Max Stenmark for allowing us to use
their Python scripts for measuring branch mispredictions and cache misses and to Lorenz
Hübschle-Schneider for his implementation of Super Scalar Sample Sort. We are also indebted
to Jan Philipp Wächter for all his help with creating the plots, to Daniel Bahrdt for answering
many C++ questions, and to Christoph Greulich for his help with the experiments.
References
1 ARMv8 Instruction Set Overview, 2011. Document number: PRD03-GENC-010197 15.0.
2 Intel 64 and IA-32 Architecture Optimization Reference Manual, 2016. Order Number:
248966-032.
3 D. Abhyankar and M. Ingle. Engineering of a quicksort partitioning algorithm. Journal of
Global Research in Computer Science, 2(2):17–23, 2011.
4 Martin Aumüller and Martin Dietzfelbinger. Optimal partitioning for dual pivot quicksort
- (extended abstract). In ICALP, pages 33–44, 2013.
5 Martin Aumüller, Martin Dietzfelbinger, and Pascal Klaue. How good is multi-pivot
quicksort? CoRR, abs/1510.04676, 2015.
6 Paul Biggar, Nicholas Nash, Kevin Williams, and David Gregg. An experimental study of
sorting and branch prediction. J. Exp. Algorithmics, 12:1.8:1–39, 2008.
7 Gerth Stølting Brodal, Rolf Fagerberg, and Kristoffer Vinther. Engineering a
cache-oblivious sorting algorithm. J. Exp. Algorithmics, 12:2.2:1–23, 2008.
8 Gerth Stølting Brodal and Gabriel Moruz. Tradeoffs between branch mispredictions and
comparisons for sorting algorithms. In WADS, volume 3608 of LNCS, pages 385–395.
Springer, 2005.
9 Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction
to Algorithms. The MIT Press, 3rd edition, 2009.
10 Amr Elmasry and Jyrki Katajainen. Lean programs, branch mispredictions, and sorting.
In FUN, volume 7288 of LNCS, pages 119–130. Springer, 2012.
11 Amr Elmasry, Jyrki Katajainen, and Max Stenmark. Branch mispredictions don’t aﬀect
mergesort. In SEA, pages 160–171, 2012.
12 Robert W. Floyd. Algorithm 245: Treesort 3. Comm. of the ACM, 7(12):701, 1964.
13 John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative
Approach. Morgan Kaufmann, 5th edition, 2011.
14 Charles A. R. Hoare. Quicksort. The Computer Journal, 5(1):10–16, 1962.
15 Kanela Kaligosi and Peter Sanders. How branch mispredictions aﬀect quicksort. In ESA,
pages 780–791, 2006.
16 Jyrki Katajainen. Sorting programs executing fewer branches. CPH STL Report
2263887503, Department of Computer Science, University of Copenhagen, 2014.
17 Donald E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming.
Addison Wesley Longman, 2nd edition, 1998.
18 Shrinu Kushagra, Alejandro López-Ortiz, Aurick Qiao, and J. Ian Munro. Multi-pivot
quicksort: Theory and experiments. In ALENEX, pages 47–60, 2014.
19 Anthony LaMarca and Richard E Ladner. The inﬂuence of caches on the performance of
sorting. J. Algorithms, 31(1):66–104, 1999.
20 Conrado Martínez, Markus E. Nebel, and Sebastian Wild. Analysis of branch misses in
quicksort. In Workshop on Analytic Algorithmics and Combinatorics, ANALCO 2015, San
Diego, CA, USA, January 4, 2015, pages 114–128, 2015.
21 Conrado Martínez and Salvador Roura. Optimal Sampling Strategies in Quicksort and
Quickselect. SIAM J. Comput., 31(3):683–705, 2001.
22 David R. Musser. Introspective sorting and selection algorithms. Software—Practice and
Experience, 27(8):983–993, 1997.
23 Charles Price. MIPS IV Instruction Set, 1995.
24 Peter Sanders and Sebastian Winkel. Super Scalar Sample Sort. In ESA, pages 784–796,
2004.
25 Robert Sedgewick. The analysis of quicksort programs. Acta Inf., 7(4):327–355, 1977.
26 Robert Sedgewick. Implementing quicksort programs. Commun. ACM, 21(10):847–857,
1978.
27 Sebastian Wild and Markus E. Nebel. Average case analysis of java 7’s dual pivot quicksort.
In ESA, pages 825–836, 2012.
28 Sebastian Wild, Markus E. Nebel, and Ralph Neininger. Average case and distributional
analysis of dual-pivot quicksort. ACM Transactions on Algorithms, 11(3):22:1–42, 2015.
29 J. W. J. Williams. Algorithm 232: Heapsort. Communications of the ACM, 7(6):347–348,
1964.
30 Vladimir Yaroslavskiy. Dual-Pivot Quicksort algorithm, 2009. URL: http://codeblab.
A More Experimental Results
In Figure 9 and Figure 10 we also included the new GCC implementation of std::sort (GCC
version 4.8.4), marked as std::sort (new). The very small difference in the implementation,
choosing the second instead of the first element as part of the sample for pivot selection,
makes an enormous difference when sorting special permutations like decreasingly sorted
arrays. This shows how important not only the size of the pivot sample but also the proper
selection is. In the other benchmarks both implementations were relatively close, so we do
not show both of them.

[Figure 9: Permutations of int. Left: sorted; middle: reversed; right: transposition,
A[i] = i + n/2 mod n. x-axis: number of elements (2^10 to 2^28); y-axis: time per element
[ns]. Curves: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort (new),
std::sort.]

[Figure 10: Random permutations of int with at most k inversions (k random swaps of
neighboring elements). Left: k = √n; middle: k = n; right: k = n log n. x-axis: number of
elements (2^10 to 2^28); y-axis: time per element [ns]. Curves: Yaroslavskiy, BlockQS
(mo-sq, dc), BlockQS simple, SSSSort, stl, std::sort.]
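Inputs with a bounded number of inversions, as used in Figure 10, can be generated along the following lines. The generator is a sketch under assumptions: the function name `nearly_sorted`, the seed parameter and the uniform choice of swap positions are not taken from the authors' test code.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical generator: the sorted array 0, ..., n-1 perturbed by k random swaps of
// neighboring elements. Each such swap changes the number of inversions by exactly one,
// so the result has at most k inversions.
inline std::vector<std::int32_t> nearly_sorted(std::size_t n, std::size_t k, std::uint32_t seed) {
    std::vector<std::int32_t> a(n);
    std::iota(a.begin(), a.end(), 0);
    if (n < 2) return a;  // no neighboring pair to swap
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> pos(0, n - 2);
    for (std::size_t s = 0; s < k; ++s) {
        const std::size_t i = pos(gen);
        std::swap(a[i], a[i + 1]);
    }
    return a;
}
```

For small k the result consists of long sorted runs with a few local disturbances, which is the kind of input on which well-predicted branches favor the classical partitioners.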

[Figure 11: Arrays A of int with many duplicates. Left: constant; middle: A[i] = 0 for
i < n/2 and A[i] = 1 otherwise; right: random 0-1 values. x-axis: number of elements
(2^10 to 2^28); y-axis: time per element [ns]. Curves: Yaroslavskiy, BlockQS (mo-sq, dc),
BlockQS simple, SSSSort, std::sort.]

[Figure 12: Sorting Vector. Left: random 0-1 values; middle: random values between 0 and √n;
right: sorted. x-axis: number of elements n (2^10 to 2^24); y-axis: time per element [ns].
Curves: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort.]

[Figure 13: Sorting Record. Left: random 0-1 values; middle: random values between 0 and √n;
right: sorted. x-axis: number of elements n (2^10 to 2^24); y-axis: time per element [ns].
Curves: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort.]

[Figure 14: Random permutations of int with other compiler optimizations. Left:
-O3 -march=native -funroll-loops; right: -O1 -march=native. x-axis: number of elements n
(2^10 to 2^24); y-axis: time per element [ns]. Curves: Yaroslavskiy, BlockQS (mo-√n, dc),
BlockQS simple, SSSSort, std::sort.]

[Figure 15: Running time experiments with int on an Intel Xeon E5-4627v2 CPU (3.30GHz,
8 cores, 32KB L1 instruction and data cache, 256KB L2 cache per core and 16MB L3 shared
cache) with 128GB RAM and operating system Windows Server 2012 R2. We used Cygwin g++
(4.9.3), optimized with flags -O3 -march=native. Left: random permutation; middle: random
values between 0 and √n; right: sorted. x-axis: number of elements n; y-axis: time per
element [ns]. Curves: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort, std::sort.]

[Figure 16: Running time experiments with int on a laptop with Intel Core i7-5500U CPU
(2.40GHz, 2 cores, 32KB L1 instruction and data cache, 256KB L2 cache per core and 4MB L3
shared cache) with 8GB RAM and operating system Windows 10. We used Mingw g++ (5.3.0),
optimized with flags -O3 -march=native. Left: random permutation; middle: random values
between 0 and √n; right: sorted. x-axis: number of elements n (2^10 to 2^24); y-axis: time
per element [ns]. Curves: Yaroslavskiy, BlockQS (mo-sq, dc), BlockQS simple, SSSSort,
std::sort.]

B C++ Code
Here, we give the C++ code of the basic BlockQuicksort variant (the ﬁnal rearranging is
also in block style, but there is no loop unrolling etc. applied).

    #include <algorithm>  // std::min, std::iter_swap
    #include <cmath>      // ilogb
    #include <iterator>   // std::iterator_traits
    #include <utility>    // std::move

    // BLOCKSIZE and IS_THRESH are assumed to be defined elsewhere
    // (the block size is fixed to 128 elements in the experiments).

    template<typename iter, typename Compare>
    inline void sort_pair(iter i1, iter i2, Compare less) {
        typedef typename std::iterator_traits<iter>::value_type T;
        bool smaller = less(*i2, *i1);
        // Conditional swap without an explicit if. Note that temp appears in its own
        // initializer and only receives a meaningful value if smaller is true.
        T temp = std::move(smaller ? *i1 : temp);
        *i1 = std::move(smaller ? *i2 : *i1);
        *i2 = std::move(smaller ? temp : *i2);
    }

    template<typename iter, typename Compare>
    inline iter median_of_3(iter i1, iter i2, iter i3, Compare less) {
        sort_pair(i1, i2, less);
        sort_pair(i2, i3, less);
        sort_pair(i1, i2, less);
        return i2;
    }

    template<typename iter, typename Compare>
    inline iter hoare_block_partition_simple(iter begin, iter end, iter pivot_position, Compare less) {
        typedef typename std::iterator_traits<iter>::difference_type index;
        index indexL[BLOCKSIZE], indexR[BLOCKSIZE];

        iter last = end - 1;
        std::iter_swap(pivot_position, last);
        const typename std::iterator_traits<iter>::value_type & pivot = *last;
        pivot_position = last;
        last--;

        int num_left = 0;
        int num_right = 0;
        int start_left = 0;
        int start_right = 0;
        int num;
        // main loop
        while (last - begin + 1 > 2 * BLOCKSIZE)
        {
            // Compare and store in buffers
            if (num_left == 0) {
                start_left = 0;
                for (index j = 0; j < BLOCKSIZE; j++) {
                    indexL[num_left] = j;
                    num_left += (!(less(begin[j], pivot)));
                }
            }
            if (num_right == 0) {
                start_right = 0;
                for (index j = 0; j < BLOCKSIZE; j++) {
                    indexR[num_right] = j;
                    num_right += !(less(pivot, *(last - j)));
                }
            }
            // rearrange elements
            num = std::min(num_left, num_right);
            for (int j = 0; j < num; j++)
                std::iter_swap(begin + indexL[start_left + j], last - indexR[start_right + j]);

            num_left -= num;
            num_right -= num;
            start_left += num;
            start_right += num;
            begin += (num_left == 0) ? BLOCKSIZE : 0;
            last -= (num_right == 0) ? BLOCKSIZE : 0;

        } // end main loop

        // Compare and store in buffers final iteration
        index shiftR = 0, shiftL = 0;
        if (num_right == 0 && num_left == 0) { // for small arrays or in the unlikely case that both buffers are empty
            shiftL = ((last - begin) + 1) / 2;
            shiftR = (last - begin) + 1 - shiftL;
            start_left = 0; start_right = 0;
            for (index j = 0; j < shiftL; j++) {
                indexL[num_left] = j;
                num_left += (!less(begin[j], pivot));
                indexR[num_right] = j;
                num_right += !less(pivot, *(last - j));
            }
            if (shiftL < shiftR)
            {
                indexR[num_right] = shiftR - 1;
                num_right += !less(pivot, *(last - shiftR + 1));
            }
        }
        else if (num_right != 0) {
            shiftL = (last - begin) - BLOCKSIZE + 1;
            shiftR = BLOCKSIZE;
            start_left = 0;
            for (index j = 0; j < shiftL; j++) {
                indexL[num_left] = j;
                num_left += (!less(begin[j], pivot));
            }
        }
        else {
            shiftL = BLOCKSIZE;
            shiftR = (last - begin) - BLOCKSIZE + 1;
            start_right = 0;
            for (index j = 0; j < shiftR; j++) {
                indexR[num_right] = j;
                num_right += !(less(pivot, *(last - j)));
            }
        }

        // rearrange final iteration
        num = std::min(num_left, num_right);
        for (int j = 0; j < num; j++)
            std::iter_swap(begin + indexL[start_left + j], last - indexR[start_right + j]);

        num_left -= num;
        num_right -= num;
        start_left += num;
        start_right += num;
        begin += (num_left == 0) ? shiftL : 0;
        last -= (num_right == 0) ? shiftR : 0;
        // end final iteration


117 // r e a r r a n ge e l e m e n t s re m a i n i n g in b u f f e r
118 i f ( nu m_ le ft != 0 )
119 {
120 i n t l ow e r I = s t a r t _ l e f t + num _l ef t 1 ;
121 i nd ex up pe r = l a s t b e g i n ;
122 // s e a r c h f i r s t e l e m en t to be sw ap pe d
123 while ( l o w e r I >= s t a r t _ l e f t && ind e x L [ l o w e r I ] == u pp e r ) {
124 upper−−; lowerI −−;
125 }
126 while ( l o w e rI >= s t a r t _ l e f t )
127 s t d : : i t e r _ s wa p ( be g i n + u p pe r −−, be g i n + i n d ex L [ l o w e r I −− ]) ;
128
129 s t d : : i t er _ s w ap ( p i v o t _ p o s i t i o n , b e g in + u p p er + 1) ; // f e t c h t he p i v o t
130 r e t u r n b e g i n + u p p er + 1 ;
131 }
132 e l s e i f ( num_right != 0) {
133 i n t l o we r I = s t a r t _ r i g h t + n um _ri gh t 1 ;
134 i nd ex up pe r = l a s t b e g i n ;
135 // s e a r c h f i r s t e l e m en t to be sw ap pe d
136 while ( l o w e r I >= s t a r t _ r i g h t && i nd ex R [ l o w er I ] == u p pe r ) {
137 upper−−; lowerI −−;
138 }
139
140 while ( l o w e r I >= s t a r t _ r i g h t )
141 s t d : : i t e r _s w a p ( l a s t upper −−, l a s t i n de xR [ l o w e r I ]) ;
142
143 s t d : : i t er _ s w ap ( p i v o t _ p o s i t i o n , l a s t upper) ; / / f e t c h th e p i v o t
144 r e t u r n l a s t upper ;
145 }
146 e l s e {/ / no r e m a i n i n g e l e m e n t s
S. Edelkamp and A. Weiß 23
147 s t d : : i t er _ s w ap ( p i v o t _ p o s i t i o n , b e g in ) ; // f e t c h t he p i v o t
148 r e t u r n b e g i n ;
149 }
150 }

template<typename iter, typename Compare>
struct Hoare_block_partition_simple {
    static inline iter partition(iter begin, iter end, Compare less) {
        // choose pivot
        iter mid = median_of_3(begin, begin + (end - begin) / 2, end, less);
        // partition
        return hoare_block_partition_simple(begin + 1, end - 1, mid, less);
    }
};

// Quicksort main loop. Implementation based on Tuned Quicksort (Elmasry, Katajainen, Stenmark)
template<template<class, class> class Partitioner, typename iter, typename Compare>
inline void qsort(iter begin, iter end, Compare less) {
    const int depth_limit = 2 * ilogb((double)(end - begin)) + 3;
    iter stack[80];
    iter* s = stack;
    int depth_stack[40];
    int depth = 0;
    int* d_s_top = depth_stack;
    *s = begin;
    *(s + 1) = end;
    s += 2;
    *d_s_top = 0;
    ++d_s_top;
    do {
        if (depth < depth_limit && end - begin > IS_THRESH) {
            iter pivot = Partitioner<iter, Compare>::partition(begin, end, less);
            // Push large side to stack and continue on small side
            if (pivot - begin > end - pivot) {
                *s = begin;
                *(s + 1) = pivot;
                begin = pivot + 1;
            }
            else {
                *s = pivot + 1;
                *(s + 1) = end;
                end = pivot;
            }
            s += 2;
            depth++;
            *d_s_top = depth;
            ++d_s_top;
        }
        else {
            if (end - begin > IS_THRESH) // if recursion depth limit exceeded
                std::partial_sort(begin, end, end, less);
            else
                Insertionsort::insertion_sort(begin, end, less); // copy of std::__insertion_sort (GCC 4.7.2)

            // pop new subarray from stack
            s -= 2;
            begin = *s;
            end = *(s + 1);
            --d_s_top;
            depth = *d_s_top;
        }
    } while (s != stack);
}

// example invocation of qsort
int main(void) {
    std::vector<int> v;
    //
    // assign values to v
    //
    qsort<Hoare_block_partition_simple>(v.begin(), v.end(), std::less<int>());
}
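
The rearrangement steps in the listing rely on offset buffers (indexL, indexR) that are filled without any branch depending on a comparison outcome. As a minimal sketch of that idea, and not the authors' implementation, the following C++ fragment performs one compare, buffer, and swap step on a single pair of equally sized blocks. The name swap_block_pair, its parameters, and the use of std::vector for the buffers are assumptions made for illustration only; the listing itself uses fixed-size arrays and carries leftover offsets over to the next iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical helper (not part of the paper's listing).
// Left block:  [left, left + len)
// Right block: (right_last - len, right_last], scanned from right_last backwards.
// The pivot must not be stored inside either block.
// Returns the number of element pairs that were swapped.
template <typename Iter, typename T, typename Compare>
int swap_block_pair(Iter left, Iter right_last, int len, const T& pivot, Compare less) {
    assert(len >= 0);
    std::vector<int> offL(len), offR(len);
    int numL = 0, numR = 0;

    // Pass 1: compare and buffer. Each offset is stored unconditionally and
    // the counter advances by the comparison result (0 or 1), so the loop
    // body has no branch that depends on the data.
    for (int j = 0; j < len; ++j) {
        offL[numL] = j;
        numL += !less(left[j], pivot);            // left element is not less than the pivot
        offR[numR] = j;
        numR += !less(pivot, *(right_last - j));  // right element is not greater than the pivot
    }

    // Pass 2: rearrange. Swap as many buffered pairs as both sides provide.
    const int num = std::min(numL, numR);
    for (int j = 0; j < num; ++j)
        std::iter_swap(left + offL[j], right_last - offR[j]);
    return num;
}
```

For example, with pivot value 4 and the elements 5, 1, 6, 2, 0, 7, 3, 8 split into two blocks of four, the call swap_block_pair(v.begin(), v.end() - 1, 4, 4, std::less<int>()) swaps two pairs and leaves 3, 1, 0, 2, 6, 7, 5, 8. Unlike the listing, this sketch simply discards the unmatched offsets of the fuller buffer instead of keeping them for the next iteration.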