Branch Mispredictions Don't Affect Mergesort*

Amr Elmasry¹, Jyrki Katajainen¹,², and Max Stenmark²

¹ Department of Computer Science, University of Copenhagen
Universitetsparken 1, 2100 Copenhagen East, Denmark

² Jyrki Katajainen and Company
Thorsgade 101, 2200 Copenhagen North, Denmark

Abstract. In quicksort, due to branch mispredictions, a skewed pivot-selection strategy can lead to a better performance than the exact-median pivot-selection strategy, even if the exact median is given for free. In this paper we investigate the effect of branch mispredictions on the behaviour of mergesort. By decoupling element comparisons from branches, we can avoid most negative effects caused by branch mispredictions. When sorting a sequence of n elements, our fastest version of mergesort performs $n \log_2 n + O(n)$ element comparisons and induces at most $O(n)$ branch mispredictions. We also describe an in-situ version of mergesort that provides the same bounds, but uses only $O(\log_2 n)$ words of extra memory. On our test computers, when sorting integer data, mergesort was the fastest sorting method, then came quicksort, and in-situ mergesort was the slowest of the three. We did a similar kind of decoupling for quicksort, but the transformation made it slower.

1 Introduction

Branch mispredictions may have a signiﬁcant eﬀect on the speed of programs.

For example, Kaligosi and Sanders [8] showed that in quicksort [6] it may be

more advantageous to select a skewed pivot instead of ﬁnding a pivot close to

the median. The reason for this is that for a comparison against the median

the outcome has a ﬁfty percent chance of being smaller or larger, whereas the

outcome of comparisons against a skewed pivot is easier to predict. All in all, a

skewed pivot will lead to a better branch prediction and—possibly—a decrease

in computation time. In the same vein, Brodal and Moruz [3] showed that skewed

binary search trees can perform better than perfectly balanced search trees.

In this paper we tackle the following question posed in [8]. Given a random permutation of the integers $\{0, 1, \ldots, n - 1\}$, does there exist a faster in-situ sorting algorithm than quicksort with skewed pivots for this particular type of input? We use the word in-situ to indicate that the algorithm is allowed to use $O(\log_2 n)$ extra words of memory (as does any careful implementation of quicksort).

It is often claimed that quicksort is faster than mergesort. To check the correctness of this claim, we performed some simple benchmarks for the quicksort (std::sort) and mergesort (std::stable_sort) programs available in the GNU implementation (g++ version 4.6.1) of the C++ standard library; std::sort is an implementation of Musser's introsort [13] and std::stable_sort is an implementation of bottom-up mergesort. In our test environment (see footnote 3), for integer data, the two programs had the same speed within the measurement accuracy (see Table 1). An inspection of the assembly-language code produced by g++ revealed that in the performance-critical inner loop of std::stable_sort all element comparisons were followed by conditional moves. A conditional move is written in C as if (a op b) x = y, where a, b, x, and y are some variables (or constants), and op is some comparison operator. This instruction, or some of its restricted forms, is supported as a hardware primitive by most computers. By using a branch-prediction profiler (valgrind) we could confirm that the number of branch mispredictions per $n \log_2 n$, referred to as the branch-misprediction ratio, was much lower for std::stable_sort than for std::sort. Based on these initial observations, we concluded that mergesort is a noteworthy competitor to quicksort.

Table 1. The execution time [ns], the number of conditional branches, and the number of mispredictions on two of our computers (Per and Ares), each per $n \log_2 n$, for the quicksort and mergesort programs taken from the C++ standard library.

        std::sort                           std::stable_sort
        Time        Branches  Mispredicts   Time        Branches  Mispredicts
n       Per   Ares                          Per   Ares
2^10    6.5   5.3   1.47      0.45          6.2   5.0   2.05      0.14
2^15    6.2   5.2   1.50      0.43          5.9   4.7   2.02      0.09
2^20    6.2   5.1   1.50      0.43          6.3   4.7   2.01      0.07
2^25    6.1   5.1   1.51      0.43          6.1   4.6   2.01      0.05

* © 2012 Springer-Verlag. This is the authors' version of the work. The original publication is available at www.springerlink.com.

Our main results in this paper are:

1. We optimize the standard-library mergesort (reducing the number of branches executed and branch mispredictions induced) so that it becomes faster than quicksort for our task at hand (Section 2).

2. We describe an in-situ version of this optimized mergesort (Section 3). Even though an ideal translation of its inner loop only contains 18 assembly-language instructions, in our experiments it was slower than quicksort.

3. We eliminated all branches from the performance-critical loop of quicksort (Section 4). After this transformation the program induces $O(n)$ branch mispredictions on the average. However, in our experiments the branch-optimized versions of quicksort were slower than std::sort.

4. We made a number of experiments for quicksort with skewed pivots (Section 4). We could repeat the findings reported in [8], but the performance improvement obtained by selecting a skewed pivot was not very large. For our mergesort programs the branch-misprediction ratio is significantly lower than that reported for quicksort with skewed pivots in [8].

Footnote 3: The experiments discussed in the paper were carried out on two computers:

Per: Model: Intel® Core™ 2 CPU T5600 @ 1.83GHz; main memory: 1 GB; L2 cache: 8-way associative, 2 MB; cache line: 64 B.

Ares: Model: Intel® Core™ i3 CPU M 370 @ 2.4GHz × 4; main memory: 2.6 GB; L2 cache: 12-way associative, 3 MB; cache line: 64 B.

Both computers ran Ubuntu 11.10 (Linux kernel 3.0.0-15-generic), and the g++ compiler (gcc version 4.6.1) with optimization level -O3 was used. According to the documentation, at optimization level -O3 this compiler always attempted to transform conditional branches into branch-less equivalents. Micro-benchmarking showed that in Per conditional moves were faster than conditional branches when the result of the branch condition was unpredictable. In Ares the opposite was true. All execution times were measured using gettimeofday in sys/time.h. For a problem of size n, each experiment was repeated $2^{26}/n$ times and the average was reported.

We took the idea of decoupling element comparisons from branches from Mortensen [12]. He described a variant of mergesort that performs $n \log_2 n + O(n)$ element comparisons and induces $O(n)$ branch mispredictions. However, the performance-critical loop of the standard-library mergesort only contains 14 assembly-language instructions, whereas that of Mortensen's program contains more. This could be a reason why Mortensen's implementation is slower than the standard-library implementation. Our key improvement is to keep the instruction count down while doing the branch optimization.

The idea of decoupling element comparisons from branches was also used by Sanders and Winkel [15] in their samplesort. The resulting program performs $n \log_2 n + O(n)$ element comparisons and induces $O(n)$ branch mispredictions in the expected case. As for Mortensen's mergesort, samplesort needs $O(n)$ extra space. Using the technique described in [15], one can modify heapsort such that it will achieve the same bound on the number of branch mispredictions in addition to its normal characteristics. In particular, heapsort is fully in-place, but it suffers from bad cache behaviour [5].

Brodal and Moruz [2] proved that any comparison-based program that sorts a sequence of n elements using $O(\beta n \log_2 n)$ element comparisons, for $\beta > 1$, must induce $\Omega(n \log_\beta n)$ branch mispredictions. However, this result only holds under the assumption that every element comparison is followed by a conditional branch depending on the outcome of the comparison. In particular, after decoupling element comparisons from branches the lower bound is no longer valid.

The branch-prediction features of a number of sorting programs were exper-

imentally studied in [1]; also a few optimizations were made to known methods.

In a companion paper [5] it is shown that any program can be transformed into

a form that induces only O(1) branch mispredictions. The resulting programs

are called lean. However, for a program of length κ, the transformation may

make the lean counterpart a factor of κ slower. In practice, the slowdown is

not that big, but the experiments showed that lean programs were often slower

than moderately branch-optimized programs. In [5], lean versions of mergesort

and heapsort are presented. In the current paper, we work towards speeding up

mergesort even further and include quicksort in the study.

A reader who is unfamiliar with the branch-prediction techniques employed

at the hardware level should recap the basic facts from a textbook on computer

organization (e.g. [14]). In our theoretical analysis, we assume that the branch

predictor used is static. A typical static predictor assumes that forward branches

are not taken and backward branches are taken. Hence, for a conditional branch

at the end of a loop the prediction is correct except for the last iteration when

stepping out of the loop.
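The static rule can be made concrete with a small model. The following C++ sketch is an illustration only: the function name and the model itself are assumptions of this example, not code from the paper. It replays the backward branch that closes a loop and counts how often the prediction "backward branches are taken" is wrong.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a static predictor applied to the backward branch
// at the end of a loop that runs `iterations` times (iterations >= 1).
// Returns the number of mispredicted executions of that branch.
std::size_t loop_branch_mispredictions(std::size_t iterations) {
    std::size_t mispredictions = 0;
    for (std::size_t i = 1; i <= iterations; ++i) {
        bool taken = (i != iterations);  // the branch jumps back except in the last iteration
        bool predicted_taken = true;     // static rule: backward branches are predicted taken
        if (taken != predicted_taken) {
            ++mispredictions;
        }
    }
    return mispredictions;
}
```

Under this model a loop of any positive length costs exactly one misprediction, which is why the loop-control branches of a merge contribute only a constant per call.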

2 Tuned Mergesort

In the C++ standard library shipped with our compiler, std::stable_sort is an implementation of bottom-up mergesort. First, it divides the input into chunks of seven elements and sorts these chunks using insertionsort. Second, it merges the sequences sorted so far pairwise, level by level, starting with the sorted chunks, until the whole sequence is sorted. If possible, it allocates an extra buffer of size n, where n is the size of the input; then it alternately moves the elements between the input and the buffer, and produces the final output in the place of the original input. If no extra memory is available, it reverts to an in-situ sorting strategy, which is asymptotically slower than the one using extra space.

One reason for the execution speed is a tight (compact) inner loop. We reproduce it in a polished form below, followed by its assembly-language translation. When illustrating the assembly-language translations, we use pure C [10], which is a glorified assembly language with the syntax of C [11]. In the following code extracts, p and q are iterators pointing to the current elements of the two input sequences, r is an iterator indicating the current output position, t1 and t2 are iterators indicating the first positions beyond the input sequences, and less is the comparator used in element comparisons. The additional variables are temporary: s and t store iterators, x and y elements, and done and smaller Boolean values.

    // Inner loop of the merge
    while (p != t1 && q != t2) {
        if (less(*q, *p)) {
            *r = *q;
            ++q;
        }
        else {
            *r = *p;
            ++p;
        }
        ++r;
    }

    // Pure-C translation of the same loop (14 instructions)
    test:
        done = (q == t2);
        if (done) goto exit;
    entrance:
        x = *p;
        s = p + 1;
        y = *q;
        t = q + 1;
        smaller = less(y, x);
        if (smaller) q = t;
        if (!smaller) p = s;
        if (!smaller) y = x;
        *r = y;
        ++r;
        done = (p == t1);
        if (!done) goto test;
    exit:

Since the two branches of the if statement are so short and symmetric, a good compiler will compile them using conditional moves. The assembly-language translation corresponding to the pure-C code was produced by the g++ compiler. As the pure-C listing above shows, the output contained 14 instructions.

By decoupling element comparisons from branches, each merging phase of two subsequences induces at most $O(1)$ branch mispredictions. In total, the merge procedure is invoked $O(n)$ times, so the number of branch mispredictions induced is $O(n)$. Other characteristics of bottom-up mergesort remain the same; it performs $n \log_2 n + O(n)$ element comparisons and element moves.
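To make the decoupling concrete, here is a minimal C++ sketch of a merge whose element comparison feeds data-flow operations instead of an if-else. It is written for int vectors, and the name merge_with_selects is hypothetical. Whether the selects actually become conditional moves depends on the compiler and the target, so this illustrates the idea and is not the tuned program measured in the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merges two sorted vectors. The outcome of each element comparison is kept
// as a Boolean value and used as data; the only conditional branches left
// are the loop tests.
std::vector<int> merge_with_selects(const std::vector<int>& a,
                                    const std::vector<int>& b) {
    std::vector<int> out(a.size() + b.size());
    std::size_t i = 0, j = 0, k = 0;
    while (i != a.size() && j != b.size()) {
        int x = a[i];
        int y = b[j];
        bool smaller = y < x;        // comparison result stored, not branched on
        out[k++] = smaller ? y : x;  // candidate for a conditional move
        j += smaller;                // exactly one of the two cursors advances
        i += !smaller;
    }
    // Copy whatever remains; these loops are predictable.
    while (i != a.size()) out[k++] = a[i++];
    while (j != b.size()) out[k++] = b[j++];
    return out;
}
```

On ties the element from the first input is emitted first, so the sketch keeps the stability of the original loop.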

Table 2. The execution time [ns], the number of conditional branches, and the number of mispredictions, each per $n \log_2 n$, for bottom-up mergesort taken from the C++ standard library and our optimized mergesort.

        std::stable_sort                    Optimized mergesort
        Time        Branches  Mispredicts   Time        Branches  Mispredicts
n       Per   Ares                          Per   Ares
2^10    6.2   5.0   2.05      0.14          4.4   3.5   0.75      0.06
2^15    5.9   4.7   2.02      0.09          4.4   3.5   0.66      0.04
2^20    6.3   4.7   2.01      0.07          5.2   3.7   0.62      0.03
2^25    6.1   4.6   2.01      0.05          5.2   3.7   0.60      0.02

To reduce the number of branches executed and the number of branch mis-

predictions induced even further, we implemented the following optimizations:

– Handle small subproblems differently: Instead of using insertionsort, we sort each chunk of size four with straight-line code that has no branches. In brief, we simulate a sorting network for four elements using conditional moves. Insertionsort induces one branch misprediction per element, whereas our routine only induces $O(1)$ branch mispredictions in total.

– Unfold the main loop responsible for merging: When merging two subsequences, we move four elements to the output sequence in each iteration. We do this as long as each of the two subsequences to be merged has at least four elements. Consequently, in this performance-critical loop the instructions involved in the conditional branches, testing whether or not one of the input subsequences is exhausted, are executed only once per four element comparisons. If one or both subsequences have fewer than four elements, we handle the remaining elements by a normal (not-unfolded) loop.
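The first optimization can be sketched as follows. The paper does not list the network it uses, so this example assumes the standard five-comparator network for four elements; the names compare_exchange and sort4 are hypothetical, and the code is written for int only.

```cpp
#include <cassert>

// Compare-exchange expressed with selects rather than an if statement, so a
// compiler is free to translate it into conditional moves.
inline void compare_exchange(int& a, int& b) {
    int lo = b < a ? b : a;
    int hi = b < a ? a : b;
    a = lo;
    b = hi;
}

// Straight-line sort of four elements: a fixed sequence of five
// compare-exchange steps, with no loops and no data-dependent branches.
inline void sort4(int* v) {
    compare_exchange(v[0], v[1]);
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);  // smallest element reaches v[0]
    compare_exchange(v[1], v[3]);  // largest element reaches v[3]
    compare_exchange(v[1], v[2]);  // order the two middle elements
}
```

Because the sequence of comparisons is fixed in advance, the work per chunk is constant regardless of the input order, which is the property the optimization relies on. A network of this kind is not stable in general; that does not matter for plain integers.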

To see whether or not our improvements are effective in practice, we tested our optimized mergesort against std::stable_sort. Our results are reported in Table 2. From these results, it is clear that even improvements in the linear term can be significant for the efficiency of a sorting program.

3 Tuned In-Situ Mergesort

Since the results for mergesort were so good, we set ourselves a goal to show

that some variation of the in-place mergesort algorithm of Katajainen et al. [9]

will be faster than quicksort. We have to admit that this goal was too ambitious,

but we came quite close. We should also point out that, similar to quicksort, the

resulting sorting algorithm is no longer stable.

The basic step used in [9] is to sort half of the elements using the other half as a working area. This idea could be utilized in different ways. We rely on the simplest approach: Before applying the basic step, partition the elements around the median. In principle, the standard-library routine std::nth_element can accomplish this task by performing a quicksort-type partitioning. After partitioning and sorting, the other half of the elements can be handled recursively. We stop the recursion when the number of remaining elements is less than $n / \log_2 n$ and use introsort to handle them. An iterative procedure-level description of this sorting program is given below. Its interface is the same as that for std::sort.

    template <typename iterator, typename comparator>
    void sort(iterator p, iterator r, comparator less) {
        typedef typename std::iterator_traits<iterator>::difference_type index;
        index n = r - p;
        index threshold = n / ilogb(2 + n);  // roughly n / log2(n)
        while (n > threshold) {
            iterator q_1 = p + n / 2;
            iterator q_2 = r - n / 2;
            converse_relation<comparator> greater(less);
            std::nth_element(p, q_1, r, greater);  // partition around the median
            mergesort(p, q_1, q_2, less);          // basic step (see [9])
            r = q_1;
            n = r - p;
        }
        std::sort(p, r, less);  // introsort handles the remaining elements
    }

Most of the work is done in the basic steps, and each step only uses O(1)

extra space in addition to the input sequence. Compared to normal mergesort,

the inner loop is not much longer. In the following code extracts, the variables

have the same meaning as those used in tuned mergesort: p, q, r, s, t, t1, and

t2 store iterators; x and y elements; and done and smaller Boolean values.

    // Inner loop of the in-situ merge: elements are swapped, not copied
    while (p != t1 && q != t2) {
        if (less(*q, *p)) {
            s = q;
            ++q;
        }
        else {
            s = p;
            ++p;
        }
        x = *r;
        *r = *s;
        *s = x;
        ++r;
    }

    // Pure-C translation of the same loop (18 instructions)
    test:
        done = (q == t2);
        if (done) goto exit;
    entrance:
        x = *p;
        s = p + 1;
        y = *q;
        t = q + 1;
        smaller = less(y, x);
        if (smaller) s = t;
        if (smaller) q = t;
        if (!smaller) p = s;
        if (!smaller) y = x;
        x = *r;
        *r = y;
        --s;
        *s = x;
        ++r;
        done = (p == t1);
        if (!done) goto test;
    exit:

As the pure-C listing above shows, an ideal translation of the loop contains 18 assembly-language instructions, which is only four more than required by the inner loop of mergesort. Because of register spilling, the actual code produced by the g++ compiler was a bit longer; it contained 26 instructions. Again, the two branches of the if statement were compiled using conditional moves.

For an input of size m, the worst-case cost of std::nth_element and std::sort is $O(m)$ and $O(m \log_2 m)$, respectively [13]. Thus, the overhead caused by these subroutines is linear in the input size. Both of these routines require at most a logarithmic amount of extra space. To sum up, we rely on standard-library components and ensure that our program only induces $O(n)$ branch mispredictions.
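The linear bound can be checked with a short calculation. It is sketched here under two assumptions taken from the procedure above: each round of the loop halves the number of remaining elements, and the final std::sort call receives $m \le n / \log_2 n$ elements.

```latex
\[
  \underbrace{\sum_{i \ge 0} O\!\left(\frac{n}{2^{i}}\right)}_{\text{all calls to std::nth\_element}} = O(n),
  \qquad
  \underbrace{O(m \log_2 m)}_{\text{final call to std::sort}}
  \le O\!\left(\frac{n}{\log_2 n} \cdot \log_2 n\right) = O(n).
\]
```

Neither routine can induce more branch mispredictions than it executes instructions, so the same linear bound covers the mispredictions they contribute.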

Table 3. The execution time [ns], the number of conditional branches, and the number of mispredictions, each per $n \log_2 n$, for two in-situ variants of mergesort.

        In-situ std::stable_sort            In-situ mergesort
        Time        Branches  Mispredicts   Time        Branches  Mispredicts
n       Per   Ares                          Per   Ares
2^10    49.2  29.7  9.0       2.08          7.3   5.7   1.93      0.26
2^15    57.6  35.0  11.1      2.38          7.1   5.6   1.94      0.15
2^20    62.7  38.5  12.9      2.53          7.4   5.7   1.92      0.11
2^25    68.0  41.3  14.4      2.62          7.6   5.7   1.92      0.09

In our experiments, we compared our in-situ mergesort against the space-

economical mergesort provided by the C++ standard library. The library routine

is recursive, so (due to the recursion stack) it requires a logarithmic amount of

extra space. The performance diﬀerence between the two programs is stunning,

as seen in Table 3. We admit that this comparison is unfair; the library routine

promises to sort the elements stably, whereas our in-situ mergesort does not.

However, this comparison shows how well our in-situ mergesort performs.

4 Comparison to Quicksort

In the C++ standard library shipped with our compiler, std::sort is an implementation of introsort [13], which is a variant of median-of-three quicksort [6]. Introsort is half-recursive, coarsens the base case by leaving small subproblems (of size 16 or smaller) unsorted, calls insertionsort to finalize the sorting process, and calls heapsort if the recursion depth becomes too large. Since introsort is known to be fast, it was natural to use it as our starting point.

The performance-critical loop of quicksort is tight, as the first listing below shows; p and r are iterators indicating how far the partitioning process has proceeded from the beginning and the end, respectively; v is the pivot, and less is the comparator used in element comparisons; the four additional variables are temporary: x and y store elements, and smaller and cross Boolean values.

    // Hoare-style partitioning loop
    while (true) {
        while (less(*p, v)) {
            ++p;
        }
        --r;
        while (less(v, *r)) {
            --r;
        }
        if (p >= r) {
            return p;
        }
        x = *p;
        *p = *r;
        *r = x;
        ++p;
    }

    // Pure-C translation of the same loop
        --p;
        goto first_loop;
    swap:
        *p = y;
        *r = x;
    first_loop:
        ++p;
        x = *p;
        smaller = less(x, v);
        if (smaller) goto first_loop;
    second_loop:
        --r;
        y = *r;
        smaller = less(v, y);
        if (smaller) goto second_loop;
        cross = (p < r);
        if (cross) goto swap;
        return p;

In the assembly-language translation displayed in pure C [10] in the second listing above, both of the innermost while loops contain four instructions and, after rotating the instructions of the outer loop such that the conditional branch becomes its last instruction, the outer loop contains four instructions as well. Due to instruction scheduling and register allocation, the picture was a bit more complicated for the code produced by the compiler, but the simplified pure-C code is good enough for our purposes.

For the basic version of mergesort, the number of instructions executed per $n \log_2 n$, called the instruction-execution ratio, is 14. Let us now analyse this ratio for quicksort. It is known that for the basic version of quicksort the expected number of element comparisons performed is about $2n \ln n \approx 1.39 n \log_2 n$ and the expected number of element exchanges is one sixth of this number [16]. Combining this with the number of instructions executed in different parts of the performance-critical loop, the expected instruction-execution ratio is

$$(4 + (1/6) \times 4) \times 1.39 \approx 6.48.$$

This number is extremely low; even for our improved mergesort the instruction-execution ratio is higher (11 instructions).

The key issue is the conditional branches at the end of the innermost while loops; their outcome is unpredictable. The performance-critical loop can be made lean using the program transformation described in [5]. Slightly more efficient code is obtained by numbering the code blocks and executing the moves inside the code blocks conditionally. We identify three code blocks in the performance-critical loop of Hoare's partitioning algorithm: the two innermost loops and the swap. By converting the while loops to do-while loops, we could avoid some code repetition. The outcome of the program transformation is given in the first listing below; variable lambda indicates the code block under execution. It turns out that the corresponding code is much shorter for Lomuto's partitioning algorithm described, for example, in [4]. Now the performance-critical loop only contains one if statement, and the swap inside it can be executed conditionally. The code obtained by applying the program transformation is shown in the second listing below.

    // Branch-optimized partitioning based on Hoare's algorithm
    int lambda = 2;
    --p;
    do {
        if (lambda == 1) *p = y;
        if (lambda == 1) *r = x;
        if (lambda == 1) lambda = 2;
        if (lambda == 2) ++p;
        x = *p;
        smaller = less(x, v);
        if (lambda != 2) smaller = true;
        if (!smaller) lambda = 3;
        if (lambda == 3) --r;
        y = *r;
        smaller = less(v, y);
        if (lambda != 3) smaller = true;
        if (!smaller) lambda = 1;
    } while (p < r);
    return p;

    // Branch-optimized partitioning based on Lomuto's algorithm
    while (q < r) {
        x = *q;
        condition = less(x, v);
        if (condition) ++p;
        if (condition) *q = *p;
        if (condition) *p = x;
        ++q;
    }
    return p;
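The conditional-execution idea behind the Lomuto variant can be tried in portable C++. The sketch below is an assumption-laden illustration, not the authors' program: it works on a std::vector<int>, uses the hypothetical name partition_with_selects, stores to both positions unconditionally (selecting the values instead of guarding the stores), and advances the boundary after the swap instead of before it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lomuto-style partitioning without a data-dependent branch in the loop body.
// Moves all elements smaller than v to the front and returns how many there are.
std::size_t partition_with_selects(std::vector<int>& a, int v) {
    std::size_t p = 0;  // a[0..p) holds the elements known to be smaller than v
    for (std::size_t q = 0; q != a.size(); ++q) {
        int x = a[q];
        int y = a[p];
        bool condition = x < v;    // comparison result kept as data
        a[q] = condition ? y : x;  // swap a[p] and a[q] if condition holds,
        a[p] = condition ? x : y;  // otherwise write back the old values
        p += condition;            // conditional arithmetic instead of a branch
    }
    return p;
}
```

The price of removing the branch is visible here: every iteration performs two stores, even when no swap is needed, which matches the observation that the branch-free loops execute more instructions.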

The resulting programs are interesting in several respects. When we rely on Hoare's partitioning algorithm, in each iteration of the loop two element comparisons are performed. Since the loop is executed $\sim 2n \ln n$ times on the average, the expected number of element comparisons increases to $\sim 2.78 n \log_2 n$. This increase does not occur for Lomuto's partitioning algorithm. Note that in the above C code we allow conditional moves between memory locations, and even allow conditional arithmetic. If we are restricted to only use conditional moves (as in pure C), these instructions need to be substituted by pure-C instructions. Because the underlying hardware only supports conditional moves to registers, the assembly-language instruction counts were a bit higher than indicated by the C code above; the actual counts were 20 and 11, respectively. This means that the expected instruction-execution ratio (that is, 1.39 times the instruction count) is around 27.8 when Hoare's partitioning is in use and around 15.29 when Lomuto's partitioning is in use. Thus, the cost of eliminating the unpredictable branches is high in both cases!

Table 4. The execution time [ns], the number of conditional branches, and the number of mispredictions, each per $n \log_2 n$, for two branch-optimized variants of quicksort.

        Optimized quicksort (Hoare)         Optimized quicksort (Lomuto)
        Time        Branches  Mispredicts   Time        Branches  Mispredicts
n       Per   Ares                          Per   Ares
2^10    9.2   6.5   2.93      0.43          6.5   5.2   2.14      0.40
2^15    9.5   6.4   3.24      0.42          6.5   5.1   2.34      0.40
2^20    9.7   6.5   3.33      0.42          6.6   5.2   2.40      0.40
2^25    9.8   6.5   3.46      0.42          6.7   5.3   2.42      0.40

When testing these branch-optimized versions of quicksort, we observed that

the compiler was not able to handle that many conditional moves. In some

architectures each such move requires more than one clock cycle, so it may

be more eﬃcient to use conditional branches. The performance of our branch-

optimized quicksort programs is reported in Table 4. Compared to introsort (see

Table 1), these programs are slower. To avoid branch mispredictions, it would

be necessary to write the programs in assembly language.

We also tested other variants of introsort by trying diﬀerent pivot-selection

strategies: random element, ﬁrst element, median of the ﬁrst, middle and last

element, and α-skewed hypothetical element. Since in our setup the elements

are given in random order, the simplest pivot-selection strategy—select the ﬁrst

element as the pivot—was already fast, but it was slower than the median-of-

three pivot-selection strategy used by introsort. On the other hand, the selection

of a skewed pivot indeed improved the performance. In our test environment, for

small problem instances the median pivot worked best (i.e. α = 1/2), whereas

for large problem instances α = 1/5 turned out to be the best choice. The results

for the naive and α-skewed pivot-selection strategies are given in Table 5.

From these experiments, our conclusion is that the performance of the sort-

ing programs considered is ranked as follows: mergesort, quicksort, and in-situ

mergesort. Still, quicksort is the fastest method for in-situ sorting.

Table 5. The execution time [ns], the number of conditional branches, and the number of mispredictions, each per $n \log_2 n$, for two other variants of introsort.

        Introsort (naive pivot)             Introsort ((1/5)-skewed pivot)
        Time        Branches  Mispredicts   Time        Branches  Mispredicts
n       Per   Ares                          Per   Ares
2^10    7.0   5.6   1.78      0.45          6.4   5.1   1.48      0.37
2^15    6.6   5.3   1.78      0.43          6.1   4.8   1.53      0.36
2^20    6.5   5.1   1.74      0.42          6.0   4.7   1.55      0.35
2^25    6.4   5.1   1.72      0.42          6.0   4.7   1.56      0.34

5 Advice for Practitioners

Like sorting programs, most programs can be optimized with respect to diﬀerent

criteria: the number of branch mispredictions, cache misses, element compari-

sons, or element moves. Optimizing one of the parameters may mean that the

optimality with respect to another is lost. Not even the optimal bounds are the

best in practice; the best choice depends on the environment where the programs

are run. The task of a programmer is diﬃcult. As any activity involving design,

good programming requires good compromises.

In this paper we were interested in reducing the cost caused by branch mispre-

dictions. In principle, there are two ways of removing branches from programs:

1. Store the result of a comparison in a Boolean variable and use this value

in normal integer arithmetic (i.e. rely on the setcc family of instructions

available in Intel processors).

2. Move the data from one place to another conditionally (i.e. rely on the cmovcc

family of instructions available in Intel processors).
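The two approaches can be illustrated at the source level with a short C++ sketch. The function names are hypothetical, and the mapping to setcc or cmovcc is only what a compiler may choose to emit; nothing in the language guarantees it.

```cpp
#include <cassert>
#include <cstddef>

// Way 1: use the Boolean result of a comparison in integer arithmetic.
// A compiler may materialize the comparison with a setcc-style instruction
// and add it to the counter, with no conditional branch per element.
std::size_t count_less(const int* first, const int* last, int pivot) {
    std::size_t count = 0;
    for (; first != last; ++first) {
        count += (*first < pivot);
    }
    return count;
}

// Way 2: move data conditionally. The short one-sided if has the shape
// "if (a op b) x = y", the pattern a conditional move implements.
int select_min(int a, int b) {
    int result = a;
    if (b < a) result = b;
    return result;
}
```

In both functions the comparison outcome never decides which instruction runs next; it only decides which value is produced.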

In Intel's architecture optimization reference manual [7, Section 3.4.1], a clear guideline is given on how these two types of optimizations should be used.

Use the setcc and cmov[cc] instructions to eliminate unpredictable con-

ditional branches where possible. Do not do this for predictable branches.

Do not use these instructions to eliminate all unpredictable conditional

branches (because using these instructions will incur execution overhead

due to the requirement for executing both paths of a conditional branch).

In addition, converting a conditional branch to setcc or cmov[cc] trades

oﬀ control ﬂow dependence for data dependence and restricts the ca-

pability of the out-of-order engine. When tuning, note that all Intel ...

processors usually have very high branch prediction rates. Consistently

mispredicted branches are generally rare. Use these instructions only if

the increase in computation time is less than the expected cost of a

mispredicted branch.

Complicated optimizations often mean complicated programs with many

branches. As a result, it will be more diﬃcult to remove the branches by hand.

Fortunately, an automatic way of eliminating all branches, except one, is known

[5]. However, the performance of the obtained program is not necessarily good

due to the high constant factor introduced in the running time. In our work we

got the best results by starting from a simple program and reducing branches

from it by hand.

Implicitly, we assumed that the elements being manipulated are small. For large elements, it may be necessary to execute each element comparison and element move in a loop. However, in order for the general $O(n \log_2 n)$ bound for sorting to be valid, element comparisons and element moves must be constant-time operations. If this were not the case, e.g. if we were sorting strings of characters, the comparison-based methods would not be optimal any more. On the other hand, if the elements were large but of fixed length, the loops involved in element comparisons and element moves could be unfolded and conditional branches could be avoided. Nonetheless, the increase in the number of element moves can become significant.

At this point we can reveal that we started this research by trying to make

quicksort lean (as was done with heapsort and mergesort in [5]). However, we

had big problems in forcing the compiler(s) to use conditional moves, and our

hand-written assembly-language code was slower than the code produced by the

compiler. So be warned; it is not always easy to eliminate unpredictable branches

without a signiﬁcant penalty in performance.

6 Afterword

We leave it for the reader to decide whether quicksort should still be considered

the quickest sorting algorithm. It is deﬁnitely an interesting randomized algo-

rithm. However, many of the implementation enhancements proposed for it seem

to have little relevance in contemporary computers.

We are clearly in favour of mergesort instead of quicksort. If extra memory is

allowed, mergesort is stable. Multi-way mergesort removes most of the problems

associated with expensive element moves. The algorithm itself does not—even

though our in-situ mergesort does—use random access. This would facilitate an

extension to the interface of the C++ standard-library sort function: The input

sequence should only support forward iterators, not random-access iterators.

In algorithm education at many universities a programming language is used that is far from a raw machine. Under such circumstances it makes little sense to talk about the branch-prediction features of sorting programs. A cursory examination showed that in one of our test computers (Ares) the Python standard-library sort was a factor of 135 slower than the C++ standard-library sort when sorting a million integers.

References

1. Biggar, P., Nash, N., Williams, K., Gregg, D.: An experimental study of sorting

and branch prediction. ACM Journal of Experimental Algorithmics 12, Article 1.8

(2008)

2. Brodal, G.S., Moruz, G.: Tradeoﬀs between branch mispredictions and compari-

sons for sorting algorithms. In: Proceedings of the 9th International Workshop on

Algorithms and Data Structures. Lecture Notes in Computer Science, vol. 3608,

pp. 385–395. Springer-Verlag (2005)

3. Brodal, G.S., Moruz, G.: Skewed binary search trees. In: Proceedings of the 14th

Annual European Symposium on Algorithms. Lecture Notes in Computer Science,

vol. 4168, pp. 708–719. Springer-Verlag (2006)

4. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press, 3rd edn. (2009)

5. Elmasry, A., Katajainen, J.: Lean programs, branch mispredictions, and sorting. In:

Proceedings of the 6th International Conference on Fun with Algorithms. Lecture

Notes in Computer Science, vol. 7288, pp. 119–130. Springer-Verlag (2012)

6. Hoare, C.A.R.: Quicksort. The Computer Journal 5(1), 10–16 (1962)

7. Intel Corporation: Intel® 64 and IA-32 Architectures Optimization Reference Manual, version 025 (1997–2011)

8. Kaligosi, K., Sanders, P.: How branch mispredictions aﬀect quicksort. In: Proceed-

ings of the 14th Annual European Symposium on Algorithms. Lecture Notes in

Computer Science, vol. 4168, pp. 780–791. Springer-Verlag (2006)

9. Katajainen, J., Pasanen, T., Teuhola, J.: Practical in-place mergesort. Nordic Jour-

nal of Computing 3(1), 27–40 (1996)

10. Katajainen, J., Träff, J.L.: A meticulous analysis of mergesort programs. In: Proceedings of the 3rd Italian Conference on Algorithms and Complexity. Lecture Notes in Computer Science, vol. 1203, pp. 217–228. Springer-Verlag (1997)

11. Kernighan, B.W., Ritchie, D.M.: The C Programming Language. Prentice Hall,

2nd edn. (1988)

12. Mortensen, S.: Reﬁning the pure-C cost model. Master’s Thesis, Department of

Computer Science, University of Copenhagen (2001)

13. Musser, D.R.: Introspective sorting and selection algorithms. Software—Practice

and Experience 27(8), 983–993 (1997)

14. Patterson, D.A., Hennessy, J.L.: Computer Organization and Design, The Hard-

ware/Software Interface. Morgan Kaufmann Publishers, 4th edn. (2009)

15. Sanders, P., Winkel, S.: Super scalar sample sort. In: Proceedings of the 12th

Annual European Symposium on Algorithms. Lecture Notes in Computer Science,

vol. 3221, pp. 784–796. Springer-Verlag (2004)

16. Sedgewick, R.: The analysis of Quicksort programs. Acta Informatica 7(4), 327–355

(1977)
