Branch Mispredictions Don't Affect Mergesort

Amr Elmasry (1), Jyrki Katajainen (1,2), and Max Stenmark (2)

(1) Department of Computer Science, University of Copenhagen
    Universitetsparken 1, 2100 Copenhagen East, Denmark
(2) Jyrki Katajainen and Company
    Thorsgade 101, 2200 Copenhagen North, Denmark

© 2012 Springer-Verlag. This is the authors' version of the work. The original publication is available at www.springerlink.com.
Abstract. In quicksort, due to branch mispredictions, a skewed pivot-selection strategy can lead to better performance than the exact-median pivot-selection strategy, even if the exact median is given for free. In this paper we investigate the effect of branch mispredictions on the behaviour of mergesort. By decoupling element comparisons from branches, we can avoid most negative effects caused by branch mispredictions. When sorting a sequence of n elements, our fastest version of mergesort performs n log₂ n + O(n) element comparisons and induces at most O(n) branch mispredictions. We also describe an in-situ version of mergesort that provides the same bounds, but uses only O(log₂ n) words of extra memory. On our test computers, when sorting integer data, mergesort was the fastest sorting method, then came quicksort, and in-situ mergesort was the slowest of the three. We did a similar kind of decoupling for quicksort, but the transformation made it slower.
1 Introduction
Branch mispredictions may have a significant effect on the speed of programs. For example, Kaligosi and Sanders [8] showed that in quicksort [6] it may be more advantageous to select a skewed pivot instead of finding a pivot close to the median. The reason for this is that for a comparison against the median the outcome has a fifty percent chance of being smaller or larger, whereas the outcome of a comparison against a skewed pivot is easier to predict. All in all, a skewed pivot will lead to better branch prediction and, possibly, a decrease in computation time. In the same vein, Brodal and Moruz [3] showed that skewed binary search trees can perform better than perfectly balanced search trees.

In this paper we tackle the following question posed in [8]: Given a random permutation of the integers {0, 1, ..., n-1}, does there exist a faster in-situ sorting algorithm than quicksort with skewed pivots for this particular type of input? We use the word in-situ to indicate that the algorithm is allowed to use O(log₂ n) extra words of memory (as any careful implementation of quicksort does).
It is often claimed that quicksort is faster than mergesort. To check the correctness of this claim, we performed some simple benchmarks for the quicksort (std::sort) and mergesort (std::stable_sort) programs available in the GNU implementation (g++ version 4.6.1) of the C++ standard library; std::sort is an implementation of Musser's introsort [13] and std::stable_sort is an implementation of bottom-up mergesort.
Table 1. The execution time [ns], the number of conditional branches, and the number of mispredictions on two of our computers (Per and Ares), each per n log₂ n, for the quicksort and mergesort programs taken from the C++ standard library.

                std::sort                             std::stable_sort
        Time [ns]                             Time [ns]
  n     Per    Ares   Branches  Mispredicts   Per    Ares   Branches  Mispredicts
  2^10  6.5    5.3    1.47      0.45          6.2    5.0    2.05      0.14
  2^15  6.2    5.2    1.50      0.43          5.9    4.7    2.02      0.09
  2^20  6.2    5.1    1.50      0.43          6.3    4.7    2.01      0.07
  2^25  6.1    5.1    1.51      0.43          6.1    4.6    2.01      0.05
In our test environment³, for integer data, the two programs had the same speed within the measurement accuracy (see Table 1). An inspection of the assembly-language code produced by g++ revealed that in the performance-critical inner loop of std::stable_sort all element comparisons were followed by conditional moves. A conditional move is written in C as if (a ⋄ b) x = y, where a, b, x, and y are some variables (or constants), and ⋄ is some comparison operator. This instruction, or some of its restricted forms, is supported as a hardware primitive by most computers. By using a branch-prediction profiler (valgrind) we could confirm that the number of branch mispredictions per n log₂ n (referred to as the branch-misprediction ratio) was much lower for std::stable_sort than for std::sort. Based on these initial observations, we concluded that mergesort is a noteworthy competitor to quicksort.

³ The experiments discussed in the paper were carried out on two computers:
Per: Model: Intel Core 2 CPU T5600 @ 1.83 GHz; main memory: 1 GB; L2 cache: 8-way associative, 2 MB; cache line: 64 B.
Ares: Model: Intel Core i3 CPU M 370 @ 2.4 GHz × 4; main memory: 2.6 GB; L2 cache: 12-way associative, 3 MB; cache line: 64 B.
Both computers ran under Ubuntu 11.10 (Linux kernel 3.0.0-15-generic), and the g++ compiler (gcc version 4.6.1) with optimization level -O3 was used. According to the documentation, at optimization level -O3 this compiler always attempted to transform conditional branches into branch-less equivalents. Micro-benchmarking showed that on Per conditional moves were faster than conditional branches when the result of the branch condition was unpredictable. On Ares the opposite was true. All execution times were measured using gettimeofday from sys/time.h. For a problem of size n, each experiment was repeated 2^26/n times and the average was reported.
Our main results in this paper are:

1. We optimize the standard-library mergesort (reducing the number of branches executed and the number of branch mispredictions induced) so that it becomes faster than quicksort for our task at hand (Section 2).
2. We describe an in-situ version of this optimized mergesort (Section 3). Even though an ideal translation of its inner loop contains only 18 assembly-language instructions, in our experiments it was slower than quicksort.
3. We eliminated all branches from the performance-critical loop of quicksort (Section 4). After this transformation the program induces O(n) branch mispredictions on the average. However, in our experiments the branch-optimized versions of quicksort were slower than std::sort.
4. We made a number of experiments for quicksort with skewed pivots (Section 4). We could reproduce the findings reported in [8], but the performance improvement obtained by selecting a skewed pivot was not very large. For our mergesort programs the branch-misprediction ratio is significantly lower than that reported for quicksort with skewed pivots in [8].
We took the idea of decoupling element comparisons from branches from Mortensen [12]. He described a variant of mergesort that performs n log₂ n + O(n) element comparisons and induces O(n) branch mispredictions. However, the performance-critical loop of the standard-library mergesort only contains 14 assembly-language instructions, whereas that of Mortensen's program contains more. This could be a reason why Mortensen's implementation is slower than the standard-library implementation. Our key improvement is to keep the instruction count down while doing the branch optimization.
The idea of decoupling element comparisons from branches was also used by Sanders and Winkel [15] in their samplesort. The resulting program performs n log₂ n + O(n) element comparisons and induces O(n) branch mispredictions in the expected case. Like Mortensen's mergesort, samplesort needs O(n) extra space. Using the technique described in [15], one can modify heapsort such that it will achieve the same bound on the number of branch mispredictions in addition to its normal characteristics. In particular, heapsort is fully in-place, but it suffers from bad cache behaviour [5].
Brodal and Moruz [2] proved that any comparison-based program that sorts a sequence of n elements using O(βn log₂ n) element comparisons, for β > 1, must induce Ω(n log_β n) branch mispredictions. However, this result only holds under the assumption that every element comparison is followed by a conditional branch depending on the outcome of the comparison. In particular, after decoupling element comparisons from branches the lower bound is no longer valid.
The branch-prediction features of a number of sorting programs were experimentally studied in [1]; also a few optimizations were made to known methods. In a companion paper [5] it is shown that any program can be transformed into a form that induces only O(1) branch mispredictions. The resulting programs are called lean. However, for a program of length κ, the transformation may make the lean counterpart a factor of κ slower. In practice, the slowdown is not that big, but the experiments showed that lean programs were often slower than moderately branch-optimized programs. In [5], lean versions of mergesort and heapsort are presented. In the current paper, we work towards speeding up mergesort even further and include quicksort in the study.

A reader who is unfamiliar with the branch-prediction techniques employed at the hardware level should recap the basic facts from a textbook on computer organization (e.g. [14]). In our theoretical analysis, we assume that the branch predictor used is static. A typical static predictor assumes that forward branches are not taken and backward branches are taken. Hence, for a conditional branch at the end of a loop the prediction is correct except for the last iteration when stepping out of the loop.
2 Tuned Mergesort
In the C++ standard library shipped with our compiler, std::stable_sort is an implementation of bottom-up mergesort. First, it divides the input into chunks of seven elements and sorts these chunks using insertionsort. Second, it merges the sequences sorted so far pairwise, level by level, starting with the sorted chunks, until the whole sequence is sorted. If possible, it allocates an extra buffer of size n, where n is the size of the input; it then alternately moves the elements between the input and the buffer, and produces the final output in the place of the original input. If no extra memory is available, it reverts to an in-situ sorting strategy, which is asymptotically slower than the one using extra space.

One reason for the execution speed is a tight (compact) inner loop. We reproduce it in a polished form below on the left together with its assembly-language translation on the right. When illustrating the assembly-language translations, we use pure C [10], which is a glorified assembly language with the syntax of C [11]. In the following code extracts, p and q are iterators pointing to the current elements of the two input sequences, r is an iterator indicating the current output position, t1 and t2 are iterators indicating the first positions beyond the input sequences, and less is the comparator used in element comparisons. The additional variables are temporary: s and t store iterators, x and y elements, and done and smaller Boolean values.
/* merge loop in C (shown on the left in the paper) */
while (p != t1 && q != t2) {
  if (less(*q, *p)) {
    *r = *q;
    ++q;
  }
  else {
    *r = *p;
    ++p;
  }
  ++r;
}
/* pure-C translation (shown on the right in the paper) */
test:
  done = (q == t2);
  if (done) goto exit;
entrance:
  x = *p;
  s = p + 1;
  y = *q;
  t = q + 1;
  smaller = less(y, x);
  if (smaller) q = t;
  if (!smaller) p = s;
  if (!smaller) y = x;
  *r = y;
  ++r;
  done = (p == t1);
  if (!done) goto test;
exit:
Since the two branches of the if statement are so short and symmetric, a good compiler will compile them using conditional moves. The assembly-language translation corresponding to the pure-C code was produced by the g++ compiler. As shown on the right above, the output contained 14 instructions.
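For readers who prefer to stay in portable C++, the same decoupling can also be expressed with conditional expressions. The following is a minimal sketch of such a merge loop (our own illustration, not the paper's code); it assumes random-access iterators so that an iterator can be advanced by a Boolean converted to 0 or 1.

#include <algorithm>

// Minimal sketch of a merge step with the element comparison decoupled from
// branches: the comparison result only selects values and advances iterators,
// so the only remaining branches are the loop-control tests.
// Assumes random-access iterators (e.g. pointers or std::vector iterators).
template <typename Iter, typename OutIter, typename Compare>
OutIter merge_decoupled(Iter p, Iter t1, Iter q, Iter t2, OutIter r, Compare less) {
    while (p != t1 && q != t2) {
        bool smaller = less(*q, *p);   // take from q only if strictly smaller (stable)
        *r = smaller ? *q : *p;        // candidate for a conditional move
        q += smaller;                  // exactly one of the two inputs advances
        p += !smaller;
        ++r;
    }
    r = std::copy(p, t1, r);           // copy whatever remains in either input
    return std::copy(q, t2, r);
}

Whether the conditional expressions actually become conditional moves is up to the compiler and the target architecture.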
By decoupling element comparisons from branches, each merging phase of two subsequences induces at most O(1) branch mispredictions. In total, the merge procedure is invoked O(n) times, so the number of branch mispredictions induced is O(n). Other characteristics of bottom-up mergesort remain the same; it performs n log₂ n + O(n) element comparisons and element moves.
Table 2. The execution time [ns], the number of conditional branches, and the number of mispredictions, each per n log₂ n, for bottom-up mergesort taken from the C++ standard library and our optimized mergesort.

                std::stable_sort                      Optimized mergesort
        Time [ns]                             Time [ns]
  n     Per    Ares   Branches  Mispredicts   Per    Ares   Branches  Mispredicts
  2^10  6.2    5.0    2.05      0.14          4.4    3.5    0.75      0.06
  2^15  5.9    4.7    2.02      0.09          4.4    3.5    0.66      0.04
  2^20  6.3    4.7    2.01      0.07          5.2    3.7    0.62      0.03
  2^25  6.1    4.6    2.01      0.05          5.2    3.7    0.60      0.02
To reduce the number of branches executed and the number of branch mispredictions induced even further, we implemented the following optimizations:

Handle small subproblems differently: Instead of using insertionsort, we sort each chunk of size four with straight-line code that has no branches. In brief, we simulate a sorting network for four elements using conditional moves (see the sketch after this list). Insertionsort induces one branch misprediction per element, whereas our routine only induces O(1) branch mispredictions in total.

Unfold the main loop responsible for merging: When merging two subsequences, we move four elements to the output sequence in each iteration. We do this as long as each of the two subsequences to be merged has at least four elements. Hereafter, in this performance-critical loop, the instructions involved in the conditional branches, testing whether or not one of the input subsequences is exhausted, are only executed every fourth element comparison. If one or both subsequences have fewer than four elements, we handle the remaining elements by a normal (not-unfolded) loop.
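As an illustration of the first optimization, the following is a minimal sketch (our own code, not the authors' implementation) of a branch-free routine for four elements: a five-comparator sorting network whose compare-exchange steps are written with conditional expressions that a compiler can lower to conditional moves.

// Minimal sketch of a 5-comparator sorting network for four elements.
// Each compare_exchange is branch-free at the source level; whether the
// compiler emits cmov instructions depends on the target and the flags.
template <typename T>
inline void compare_exchange(T& a, T& b) {
    T lo = (b < a) ? b : a;   // candidates for conditional moves
    T hi = (b < a) ? a : b;
    a = lo;
    b = hi;
}

template <typename T>
void sort4(T& a, T& b, T& c, T& d) {
    compare_exchange(a, b);
    compare_exchange(c, d);
    compare_exchange(a, c);
    compare_exchange(b, d);
    compare_exchange(b, c);
}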
To see whether or not our improvements are effective in practice, we tested
our optimized mergesort against std::stable_sort. Our results are reported in
Table 2. From these results, it is clear that even improvements in the linear term
can be significant for the efficiency of a sorting program.
3 Tuned In-Situ Mergesort
Since the results for mergesort were so good, we set ourselves the goal of showing that some variation of the in-place mergesort algorithm of Katajainen et al. [9] would be faster than quicksort. We have to admit that this goal was too ambitious, but we came quite close. We should also point out that, as with quicksort, the resulting sorting algorithm is no longer stable.
The basic step used in [9] is to sort half of the elements using the other half as a working area. This idea can be utilized in different ways. We rely on the simplest approach: before applying the basic step, partition the elements around the median. In principle, the standard-library routine std::nth_element can accomplish this task by performing a quicksort-type partitioning. After partitioning and sorting, the other half of the elements can be handled recursively. We stop the recursion when the number of remaining elements is less than n/log₂ n and use introsort to handle them. An iterative procedure-level description of this sorting program is given below. Its interface is the same as that of std::sort.
template <typename iterator, typename comparator>
void sort(iterator p, iterator r, comparator less) {
  typedef typename std::iterator_traits<iterator>::difference_type index;
  index n = r - p;
  index threshold = n / ilogb(2 + n);
  while (n > threshold) {
    iterator q_1 = p + n / 2;
    iterator q_2 = r - n / 2;
    converse_relation<comparator> greater(less);
    std::nth_element(p, q_1, r, greater);
    mergesort(p, q_1, q_2, less);
    r = q_1;
    n = r - p;
  }
  std::sort(p, r, less);
}
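The helpers converse_relation and mergesort are not shown in the extract above; mergesort refers to the tuned mergesort of Section 2, and the comparator adapter could, as an assumption on our part, look as follows: it flips the comparator so that std::nth_element, when called with greater, places the larger half of the elements in front.

// Assumed helper (not shown in the paper): adapts a "less" comparator into
// its converse, i.e. greater(a, b) holds exactly when less(b, a) holds.
template <typename comparator>
class converse_relation {
public:
    explicit converse_relation(comparator less) : less(less) {}

    template <typename T>
    bool operator()(const T& a, const T& b) const {
        return less(b, a);
    }

private:
    comparator less;
};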
Most of the work is done in the basic steps, and each step only uses O(1)
extra space in addition to the input sequence. Compared to normal mergesort,
the inner loop is not much longer. In the following code extracts, the variables
have the same meaning as those used in tuned mergesort: p, q, r, s, t, t1, and
t2 store iterators; x and y elements; and done and smaller Boolean values.
/* merge loop of the basic step in C (shown on the left in the paper) */
while (p != t1 && q != t2) {
  if (less(*q, *p)) {
    s = q;
    ++q;
  }
  else {
    s = p;
    ++p;
  }
  x = *r;
  *r = *s;
  *s = x;
  ++r;
}
/* pure-C translation (shown on the right in the paper) */
test:
  done = (q == t2);
  if (done) goto exit;
entrance:
  x = *p;
  s = p + 1;
  y = *q;
  t = q + 1;
  smaller = less(y, x);
  if (smaller) s = t;
  if (smaller) q = t;
  if (!smaller) p = s;
  if (!smaller) y = x;
  x = *r;
  *r = y;
  --s;
  *s = x;
  ++r;
  done = (p == t1);
  if (!done) goto test;
exit:
As shown on the right above, an ideal translation of the loop contains 18 assembly-language instructions, which is only four more than that required by the inner loop of mergesort. Because of register spilling, the actual code produced by the g++ compiler was a bit longer; it contained 26 instructions. Again, the two branches of the if statement were compiled using conditional moves.

For an input of size m, the worst-case cost of std::nth_element and std::sort is O(m) and O(m log₂ m), respectively [13]. Thus, the overhead caused by these subroutines is linear in the input size. Both of these routines require at most a logarithmic amount of extra space. To sum up, we rely on standard-library components and ensure that our program only induces O(n) branch mispredictions.
Table 3. The execution time [ns], the number of conditional branches, and the number of mispredictions, each per n log₂ n, for two in-situ variants of mergesort.

                In-situ std::stable_sort              In-situ mergesort
        Time [ns]                             Time [ns]
  n     Per     Ares   Branches  Mispredicts  Per    Ares   Branches  Mispredicts
  2^10  49.2    29.7    9.0      2.08         7.3    5.7    1.93      0.26
  2^15  57.6    35.0   11.1      2.38         7.1    5.6    1.94      0.15
  2^20  62.7    38.5   12.9      2.53         7.4    5.7    1.92      0.11
  2^25  68.0    41.3   14.4      2.62         7.6    5.7    1.92      0.09
In our experiments, we compared our in-situ mergesort against the space-economical mergesort provided by the C++ standard library. The library routine
is recursive, so (due to the recursion stack) it requires a logarithmic amount of
extra space. The performance difference between the two programs is stunning,
as seen in Table 3. We admit that this comparison is unfair; the library routine
promises to sort the elements stably, whereas our in-situ mergesort does not.
However, this comparison shows how well our in-situ mergesort performs.
4 Comparison to Quicksort
In the C++ standard library shipped with our compiler, std::sort is an implementation of introsort [13], which is a variant of median-of-three quicksort [6]. Introsort is half-recursive; it coarsens the base case by leaving small subproblems (of size 16 or smaller) unsorted, it calls insertionsort to finalize the sorting process, and it calls heapsort if the recursion depth becomes too large. Since introsort is known to be fast, it was natural to use it as our starting point.
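To make this structure concrete, here is a minimal, self-contained sketch of an introsort-style driver (our own illustration, not the library code): quicksort-style recursion with a median-of-three pivot, ranges of at most 16 elements left for one final insertion-sort pass, and a heapsort fallback when the depth budget is exhausted.

#include <algorithm>
#include <cmath>

// Minimal sketch of introsort's control structure (not the GNU library code).
template <typename Iter>
void introsort_loop(Iter first, Iter last, int depth_limit) {
    while (last - first > 16) {
        if (depth_limit-- == 0) {
            std::make_heap(first, last);        // heapsort fallback when the
            std::sort_heap(first, last);        // recursion depth grows too large
            return;
        }
        // median-of-three pivot of the first, middle, and last elements
        Iter mid = first + (last - first) / 2;
        auto a = *first, b = *mid, c = *(last - 1);
        auto pivot = std::max(std::min(a, b), std::min(std::max(a, b), c));
        Iter cut = std::partition(first, last,
                                  [&pivot](const decltype(pivot)& x) { return x < pivot; });
        introsort_loop(cut, last, depth_limit); // recurse on one side ("half-recursive")
        last = cut;                             // iterate on the other side
    }
}

template <typename Iter>
void introsort(Iter first, Iter last) {
    if (last - first > 1) {
        introsort_loop(first, last,
                       2 * static_cast<int>(std::log2(static_cast<double>(last - first))));
        // blocks of at most 16 elements were left unsorted; finish with insertionsort
        for (Iter i = first + 1; i != last; ++i) {
            auto key = *i;
            Iter j = i;
            while (j != first && key < *(j - 1)) {
                *j = *(j - 1);
                --j;
            }
            *j = key;
        }
    }
}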
The performance-critical loop of quicksort is tight, as shown on the left below; p and r are iterators indicating how far the partitioning process has proceeded from the beginning and the end, respectively; v is the pivot, and less is the comparator used in element comparisons; the four additional variables are temporary: x and y store elements, and smaller and cross Boolean values.
/* partitioning loop in C (shown on the left in the paper) */
while (true) {
  while (less(*p, v)) {
    ++p;
  }
  --r;
  while (less(v, *r)) {
    --r;
  }
  if (p >= r) {
    return p;
  }
  x = *p;
  *p = *r;
  *r = x;
  ++p;
}
/* pure-C translation (shown on the right in the paper) */
  --p;
  goto first_loop;
swap:
  *p = y;
  *r = x;
first_loop:
  ++p;
  x = *p;
  smaller = less(x, v);
  if (smaller) goto first_loop;
second_loop:
  --r;
  y = *r;
  smaller = less(v, y);
  if (smaller) goto second_loop;
  cross = (p < r);
  if (cross) goto swap;
  return p;
In the assembly-language translation displayed in pure C [10] on the right above, both of the innermost while loops contain four instructions and, after rotating the instructions of the outer loop such that the conditional branch becomes its last instruction, the outer loop contains four instructions as well. Due to instruction scheduling and register allocation, the picture was a bit more complicated for the code produced by the compiler, but the simplified code displayed on the right above is good enough for our purposes.
For the basic version of mergesort, the number of instructions executed per n log₂ n, called the instruction-execution ratio, is 14. Let us now analyse this ratio for quicksort. It is known that for the basic version of quicksort the expected number of element comparisons performed is about 2n ln n ≈ 1.39 n log₂ n and the expected number of element exchanges is one sixth of this number [16]. Combining this with the number of instructions executed in different parts of the performance-critical loop, the expected instruction-execution ratio is

  (4 + (1/6) × 4) × 1.39 ≈ 6.48.

This number is extremely low; even for our improved mergesort the instruction-execution ratio is higher (11 instructions).
The key issue is the conditional branches at the end of the innermost while loops; their outcome is unpredictable. The performance-critical loop can be made lean using the program transformation described in [5]. A bit more efficient code is obtained by numbering the code blocks and executing the moves inside the code blocks conditionally. We identify three code blocks in the performance-critical loop of Hoare's partitioning algorithm: the two innermost loops and the swap. By converting the while loops to do-while loops, we could avoid some code repetition. The outcome of the program transformation is given on the left below; the variable lambda indicates the code block under execution. It turns out that the corresponding code is much shorter for Lomuto's partitioning algorithm described, for example, in [4]. Now the performance-critical loop only contains one if statement, and the swap inside it can be executed conditionally. The code obtained by applying the program transformation is shown on the right below.
/* branch-optimized Hoare partitioning (shown on the left in the paper) */
int lambda = 2;
--p;
do {
  if (lambda == 1) *p = y;
  if (lambda == 1) *r = x;
  if (lambda == 1) lambda = 2;
  if (lambda == 2) ++p;
  x = *p;
  smaller = less(x, v);
  if (lambda != 2) smaller = true;
  if (!smaller) lambda = 3;
  if (lambda == 3) --r;
  y = *r;
  smaller = less(v, y);
  if (lambda != 3) smaller = true;
  if (!smaller) lambda = 1;
} while (p < r);
return p;
/* branch-optimized Lomuto partitioning (shown on the right in the paper) */
while (q < r) {
  x = *q;
  condition = less(x, v);
  if (condition) ++p;
  if (condition) *q = *p;
  if (condition) *p = x;
  ++q;
}
return p;
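For completeness, the same branch-reduced Lomuto partitioning can be packaged as a self-contained routine in portable C++. The following sketch is our own illustration (the loop is rearranged slightly, but the idea is the one above): the comparison result is kept in a Boolean that only drives conditional assignments, which a compiler may lower to conditional moves.

#include <cstddef>

// Minimal sketch of Lomuto-style partitioning with the element comparison
// decoupled from branches. On return, a[first..result) holds exactly the
// elements that are smaller than the pivot.
template <typename T>
std::size_t partition_branch_light(T* a, std::size_t first, std::size_t last,
                                   const T& pivot) {
    std::size_t p = first;              // boundary of the "smaller than pivot" region
    for (std::size_t q = first; q != last; ++q) {
        T x = a[q];
        bool smaller = x < pivot;
        T y = a[p];
        a[p] = smaller ? x : y;         // conditional swap of a[p] and a[q]
        a[q] = smaller ? y : x;
        p += smaller;                   // grow the region only when x was smaller
    }
    return p;                           // first index of the right-hand side
}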
Table 4. The execution time [ns], the number of conditional branches, and the number of mispredictions, each per n log₂ n, for two branch-optimized variants of quicksort.

                Optimized quicksort (Hoare)           Optimized quicksort (Lomuto)
        Time [ns]                             Time [ns]
  n     Per    Ares   Branches  Mispredicts   Per    Ares   Branches  Mispredicts
  2^10  9.2    6.5    2.93      0.43          6.5    5.2    2.14      0.40
  2^15  9.5    6.4    3.24      0.42          6.5    5.1    2.34      0.40
  2^20  9.7    6.5    3.33      0.42          6.6    5.2    2.40      0.40
  2^25  9.8    6.5    3.46      0.42          6.7    5.3    2.42      0.40
The resulting programs are interesting in several respects. When we rely on Hoare's partitioning algorithm, in each iteration of the loop two element comparisons are performed. Since the loop is executed 2n ln n times on the average, the expected number of element comparisons increases to 2.78 n log₂ n. This increase does not occur for Lomuto's partitioning algorithm. Note that in the above C code we allow conditional moves between memory locations, and even allow conditional arithmetic. If we are restricted to using only conditional moves (as in pure C), these instructions need to be substituted by pure-C instructions. Because the underlying hardware only supports conditional moves to registers, the assembly-language instruction counts were a bit higher than those indicated by the C code above; the actual counts were 20 and 11, respectively. This means that the expected instruction-execution ratio (that is, a 1.39 factor of the instruction counts) is around 27.8 when Hoare's partitioning is in use and around 15.29 when Lomuto's partitioning is in use. Thus, the cost of eliminating the unpredictable branches is high in both cases!
When testing these branch-optimized versions of quicksort, we observed that the compiler was not able to handle that many conditional moves. In some architectures each such move requires more than one clock cycle, so it may be more efficient to use conditional branches. The performance of our branch-optimized quicksort programs is reported in Table 4. Compared to introsort (see Table 1), these programs are slower. To avoid branch mispredictions, it would be necessary to write the programs in assembly language.
We also tested other variants of introsort by trying different pivot-selection strategies: random element, first element, median of the first, middle, and last elements, and α-skewed hypothetical element. Since in our setup the elements are given in random order, the simplest pivot-selection strategy (select the first element as the pivot) was already fast, but it was slower than the median-of-three pivot-selection strategy used by introsort. On the other hand, the selection of a skewed pivot indeed improved the performance. In our test environment, for small problem instances the median pivot worked best (i.e. α = 1/2), whereas for large problem instances α = 1/5 turned out to be the best choice. The results for the naive and α-skewed pivot-selection strategies are given in Table 5.
From these experiments, our conclusion is that the performance of the sorting programs considered is ranked as follows: mergesort, quicksort, and in-situ mergesort. Still, quicksort is the fastest method for in-situ sorting.
Table 5. The execution time [ns], the number of conditional branches, and the number of mispredictions, each per n log₂ n, for two other variants of introsort.

                Introsort (naive pivot)               Introsort ((1/5)-skewed pivot)
        Time [ns]                             Time [ns]
  n     Per    Ares   Branches  Mispredicts   Per    Ares   Branches  Mispredicts
  2^10  7.0    5.6    1.78      0.45          6.4    5.1    1.48      0.37
  2^15  6.6    5.3    1.78      0.43          6.1    4.8    1.53      0.36
  2^20  6.5    5.1    1.74      0.42          6.0    4.7    1.55      0.35
  2^25  6.4    5.1    1.72      0.42          6.0    4.7    1.56      0.34
5 Advice for Practitioners
Like sorting programs, most programs can be optimized with respect to different criteria: the number of branch mispredictions, cache misses, element comparisons, or element moves. Optimizing one of the parameters may mean that the optimality with respect to another is lost. Not even the optimal bounds are the best in practice; the best choice depends on the environment where the programs are run. The task of a programmer is difficult. As with any activity involving design, good programming requires good compromises.
In this paper we were interested in reducing the cost caused by branch mispredictions. In principle, there are two ways of removing branches from programs:

1. Store the result of a comparison in a Boolean variable and use this value in normal integer arithmetic (i.e. rely on the setcc family of instructions available in Intel processors).
2. Move the data from one place to another conditionally (i.e. rely on the cmovcc family of instructions available in Intel processors).

Both idioms are illustrated by the small sketch after this list.
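The following small sketch (ours, not from the paper) shows both idioms in portable C++; whether the compiler actually emits setcc or cmov instructions depends on the target architecture and the optimization level.

#include <cstdio>

// (1) setcc style: use the result of a comparison as the integer 0 or 1.
// (2) cmov style: select one of two values without a branch.
// Assumes n >= 1.
int count_and_min(const int* a, int n, int pivot) {
    int count = 0;
    int smallest = a[0];
    for (int i = 0; i < n; ++i) {
        count += static_cast<int>(a[i] < pivot);         // setcc idiom
        smallest = (a[i] < smallest) ? a[i] : smallest;   // cmov idiom
    }
    std::printf("%d elements below the pivot, minimum %d\n", count, smallest);
    return count;
}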
In Intel's architecture optimization reference manual [7, Section 3.4.1], a clear guideline is given on how these two types of optimizations should be used:

  "Use the setcc and cmov[cc] instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch). In addition, converting a conditional branch to setcc or cmov[cc] trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine. When tuning, note that all Intel ... processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch."
Complicated optimizations often mean complicated programs with many branches. As a result, it will be more difficult to remove the branches by hand. Fortunately, an automatic way of eliminating all branches, except one, is known [5]. However, the performance of the obtained program is not necessarily good due to the high constant factor introduced in the running time. In our work we got the best results by starting from a simple program and reducing branches from it by hand.
Implicitly, we assumed that the elements being manipulated are small. For large elements, it may be necessary to execute each element comparison and element move in a loop. However, in order for the general O(n log₂ n) bound for sorting to be valid, element comparisons and element moves must be constant-time operations. If this were not the case, e.g. if we were sorting strings of characters, the comparison-based methods would not be optimal any more. On the other hand, if the elements were large but of fixed length, the loops involved in element comparisons and element moves could be unfolded and conditional branches could be avoided. Nonetheless, the increase in the number of element moves can become significant.
At this point we can reveal that we started this research by trying to make quicksort lean (as was done with heapsort and mergesort in [5]). However, we had big problems forcing the compiler(s) to use conditional moves, and our hand-written assembly-language code was slower than the code produced by the compiler. So be warned: it is not always easy to eliminate unpredictable branches without a significant penalty in performance.
6 Afterword
We leave it for the reader to decide whether quicksort should still be considered the quickest sorting algorithm. It is definitely an interesting randomized algorithm. However, many of the implementation enhancements proposed for it seem to have little relevance on contemporary computers.

We are clearly in favour of mergesort instead of quicksort. If extra memory is allowed, mergesort is stable. Multi-way mergesort removes most of the problems associated with expensive element moves. The algorithm itself does not use random access, even though our in-situ mergesort does. This would facilitate an extension to the interface of the C++ standard-library sort function: the input sequence would only need to support forward iterators, not random-access iterators.

In algorithm education at many universities a programming language is used that is far from a raw machine. Under such circumstances it makes little sense to talk about the branch-prediction features of sorting programs. A cursory examination showed that on one of our test computers (Ares) the Python standard-library sort was a factor of 135 slower than the C++ standard-library sort when sorting a million integers.
References

1. Biggar, P., Nash, N., Williams, K., Gregg, D.: An experimental study of sorting and branch prediction. ACM Journal of Experimental Algorithmics 12, Article 1.8 (2008)
2. Brodal, G.S., Moruz, G.: Tradeoffs between branch mispredictions and comparisons for sorting algorithms. In: Proceedings of the 9th International Workshop on Algorithms and Data Structures. Lecture Notes in Computer Science, vol. 3608, pp. 385–395. Springer-Verlag (2005)
3. Brodal, G.S., Moruz, G.: Skewed binary search trees. In: Proceedings of the 14th Annual European Symposium on Algorithms. Lecture Notes in Computer Science, vol. 4168, pp. 708–719. Springer-Verlag (2006)
4. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press, 3rd edn. (2009)
5. Elmasry, A., Katajainen, J.: Lean programs, branch mispredictions, and sorting. In: Proceedings of the 6th International Conference on Fun with Algorithms. Lecture Notes in Computer Science, vol. 7288, pp. 119–130. Springer-Verlag (2012)
6. Hoare, C.A.R.: Quicksort. The Computer Journal 5(1), 10–16 (1962)
7. Intel Corporation: Intel 64 and IA-32 Architectures Optimization Reference Manual, version 025 (1997–2011)
8. Kaligosi, K., Sanders, P.: How branch mispredictions affect quicksort. In: Proceedings of the 14th Annual European Symposium on Algorithms. Lecture Notes in Computer Science, vol. 4168, pp. 780–791. Springer-Verlag (2006)
9. Katajainen, J., Pasanen, T., Teuhola, J.: Practical in-place mergesort. Nordic Journal of Computing 3(1), 27–40 (1996)
10. Katajainen, J., Träff, J.L.: A meticulous analysis of mergesort programs. In: Proceedings of the 3rd Italian Conference on Algorithms and Complexity. Lecture Notes in Computer Science, vol. 1203, pp. 217–228. Springer-Verlag (1997)
11. Kernighan, B.W., Ritchie, D.M.: The C Programming Language. Prentice Hall, 2nd edn. (1988)
12. Mortensen, S.: Refining the pure-C cost model. Master's Thesis, Department of Computer Science, University of Copenhagen (2001)
13. Musser, D.R.: Introspective sorting and selection algorithms. Software: Practice and Experience 27(8), 983–993 (1997)
14. Patterson, D.A., Hennessy, J.L.: Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, 4th edn. (2009)
15. Sanders, P., Winkel, S.: Super scalar sample sort. In: Proceedings of the 12th Annual European Symposium on Algorithms. Lecture Notes in Computer Science, vol. 3221, pp. 784–796. Springer-Verlag (2004)
16. Sedgewick, R.: The analysis of Quicksort programs. Acta Informatica 7(4), 327–355 (1977)