
A Fast and Simple Approach to Merge and Merge Sort using Wide Vector Instructions

Alex Watkins and Oded Green

Georgia Institute of Technology

{jwatkins45, ogreen}@gatech.edu

Abstract—Merging and sorting algorithms are the backbone of many modern computer applications. As such, efficient implementations are desired. Recent architectural advancements in CPUs (Central Processing Units), such as wider and more powerful vector instructions, allow for algorithmic improvements. This paper presents a new approach to merge sort using vector instructions. Traditional approaches to vectorized sorting typically utilize a bitonic sorting network (Batcher's Algorithm), which adds significant overhead. Our approach eliminates that overhead. We start with a branch-avoiding merge algorithm and then use the Merge Path algorithm to split the merging between the different SIMD lanes. Testing demonstrates that the algorithm not only surpasses its SIMD-based bitonic counterpart, but that it is over 2.94x faster than a traditional merge, merging over 300M keys per second in one thread and over 16B keys per second in parallel. Our new sort is over 5x faster than quicksort and 2x faster than Intel's IPP library sort, sorting over 5.3M keys per second for a single processor and over 500M keys per second in parallel, a speedup of over 2x over a traditional merge sort.

I. INTRODUCTION

Sorting algorithms play a crucial role in numerous computer applications [1]. Database systems use sorting algorithms to organize internal data and to present the data to users in a sorted format. Graph searching uses sorting to speed up lookups. A variety of high-speed sorting algorithms have been proposed, including quicksort [2], merge sort, radix sort, bitonic sort (Batcher's Algorithm) [3], and several others [1], [4], [5]. The time complexities of these algorithms vary, as does their parallel scalability. Quicksort works well in the general case but does not scale favorably as the number of threads increases. Radix sort has favorable efficiency but suffers from memory bandwidth overhead. As processor architectures change constantly, sorting algorithms also need to be re-evaluated. In this paper, we show a new vectorized sorting algorithm that utilizes Intel's AVX-512 instruction set. We show how the Merge Path [6] concept can be vectorized to efficiently speed up the parallel merging phases of the algorithm (the lower phases in the merging network) as well as how to implement a vectorized merging algorithm for the traditional phases of the algorithm (the upper phases in the merging network).

SIMD (Single Instruction Multiple Data) instructions can be found in many modern architectures such as x86, ARM, and Power processors. These instructions enable massive performance gains by applying a single instruction to multiple pieces of data, hence the term SIMD. Numerous sorting implementations have been designed to take advantage of the various SIMD instruction sets: [5], [7], [8], [9], [10], [11], [12]. Many of them introduce a significant algorithmic overhead, making them less than optimal.

Our new algorithm utilizes Intel's AVX-512 instruction set, taking advantage of new features: 1) scatter and gather instructions, 2) wider instructions, and 3) a large number of masked operations. These masked operations enable us to emulate advanced conditional instructions and implement our sorting algorithm using the branch-avoiding model discussed in [13] and [14]. Through the branch-avoiding model, we are able to increase the control flow of our implementation from one control flow per thread to W software control flows per thread. Altogether, we show that the scalability of our new sorting algorithm is W · P, where W is the width of the vector instruction and P is the number of processors, whereas a typical parallel sorting algorithm only scales to P threads. On the KNL (Intel Xeon Phi Knights Landing, described in Sec. V) processor used in our experiments, P = 272 and W = 16. Therefore, we can effectively execute 4352 concurrent software threads.

In this paper we show the performance of both parallel sorting and parallel merging. In summary, the contributions of our algorithm are as follows:
• Using the branch-avoiding model, we show a novel way to increase the control flow available in modern CPU processors.
• Our new vectorized merging algorithm is almost 3x faster than a scalar merging algorithm and is approximately 2.5x faster than the sorting network used in [7] for merging.
• Our parallel merging algorithm can merge up to 300M keys per second using a single thread and over 16B keys per second in parallel.
• Using a single thread, our new vectorized algorithm is 5x faster than the standard C-lib quicksort and 2x faster than a sort found in Intel's IPP library [15].
• For a single thread, our algorithm sorts 5.3M keys per second. Using the KNL system, with 272 hardware threads, our new sort algorithm can sort over 500M keys per second. In contrast to its scalar counterpart, our new algorithm is over 2x faster.

TABLE I
NUMBER OF VECTOR INSTRUCTIONS NEEDED IN DIFFERENT MERGE KERNELS FOR DIFFERENT VECTOR ISAS. VALUES ARE ESTIMATES GIVEN BASED ON OUR IMPLEMENTATIONS; ACTUAL COUNT MAY VARY.

             Bitonic                    Our Algorithm
Width    SSE      AVX-512 [17]          AVX-512
4        25       17                    10
8        50       23                    10
16       100      29                    10

II. RELATED WORK

CPU architecture changes rapidly, adding features such as increased memory bandwidth, wider SIMD, and higher thread counts, as well as lower-level hardware features such as out-of-order execution. Numerous algorithms have been designed to take advantage of these hardware features: [7], [4], [9], [16]. Yet, not all of these algorithms fully utilize all these hardware features.

a) Sorting Networks: Numerous algorithms can be described through a sorting network. In most cases, sorting networks involve a fixed number of comparisons for a fixed input size. This does not allow for flexible array size selection or scaling to large array sizes due to overhead costs. One commonly used sorting network is the bitonic sorting network [3]. Bitonic involves a multilevel network with multiple comparisons at each level; the network depth grows with the size of the network.

Chhugani et al. [7] produced a version of bitonic sort using Intel's SSE instructions. Specifically, Chhugani et al. [7] showed how to replace the merging phase of the algorithm with their 4x4 (8 input) network. Chhugani et al.'s [7] vectorized sort is about 3.3x faster than its scalar counterpart. It is therefore not surprising that this algorithm is a common baseline when measuring performance. We too use it as a baseline and show that our new algorithm is substantially faster than Chhugani et al.'s [7] (over 2.5x). Our algorithm avoids unnecessary comparisons required by the merging step. While their algorithm has the same time complexity as our new algorithm, the constant in their algorithm is dependent on the overhead of the bitonic network; our algorithm has a smaller constant. Further, their algorithm does not scale well for larger SIMD widths because of the overheads. For example, a 16-way network (16x2 input elements) is actually faster than a 32-way (32x2 input elements) network when working with 512-bit SIMD [17]. Table I demonstrates how a software-based bitonic network scales for wider instruction sets. Notice how for larger networks the instruction count continues to grow, partly because each new level adds additional overhead. In contrast, our algorithm is invariant to the vector width.

b) Other Vectorized Algorithms: Another approach to sorting using SIMD is the AA-Sort [5] algorithm, which is similar to comb sort [18] and merge sort. Its merging step uses an algorithm similar to the Intel SSE bitonic merge [7] and achieves a similar speedup of around 3.33x. Inoue et al. [11] also present an AVX-512 based sort, which is similar to AA-Sort but updated for AVX-512. Bramas [10] presents a modified quicksort variant using 512-bit SIMD on the KNL processor.

Algorithm 1: Branch-Avoiding Merge (assuming 32-bit data)

function Branch-Avoiding-Merge(A, B)
  Input: Two sorted sub-arrays: A and B
  Output: Sorted array C
  AIndex ← 0; BIndex ← 0; CIndex ← 0;
  while (AIndex < |A| and BIndex < |B|) do
    flag ← RightShift(A[AIndex] − B[BIndex], 31);
    C[CIndex++] ← flag · A[AIndex] + (1 − flag) · B[BIndex];
    AIndex ← AIndex + flag;
    BIndex ← BIndex + (1 − flag);
  // Copy remaining elements into C

Schlegel et al. [19] show a sorted-set SIMD intersection; however, their implementation is restricted to 8-bit keys, making it ineffective for sorting most values. The Merge Path concept has been used with other SIMD based sorting algorithms. For example, Xiaochen et al. [12] used Merge Path for partitioning at the thread granularity and then used a SIMD bitonic merge [7] to merge the values. In contrast, our new algorithm does the partitioning at both the thread level and at the SIMD instruction granularity.

c) Other Sorting Algorithms: Radix sort is a non-comparison based sorting algorithm. Radix sort suffers greatly from its high memory bandwidth demands [9]. Its irregular memory access patterns result in poor cache utilization [9]. This means that for larger data sets, any performance gain from radix sort is quickly dissipated despite its reduced computational complexity. A hybrid radix sort proposed in [20] alleviates this overhead.

d) Branch-Avoiding Merge: A traditional merging algorithm starts each iteration with a "check for bounds" followed by a simple IF-THEN-ELSE structure. At a high level, and ignoring the WHILE loop, there are two branch possibilities in the code: either the code block under the IF statement is executed or the code block under the ELSE statement is executed. The branching behavior in this case is purely dependent on the input data itself, whereas the WHILE loop is easily predicted. This makes it challenging for CPU branch predictors to work as intended when running these heavily data-dependent algorithms [14]. Branch misses can cost tens of clock cycles [21], which is costly in comparison to the merging process itself, which requires only a small number of instructions.

Green [14] presents a branch-avoiding merging algorithm which uses masks and conditional-like instructions to avoid the heavy cost of branch mis-prediction (Alg. 1). From a performance perspective, the branch-avoiding algorithm, in contrast to the traditional algorithm, is: 1) slower when the number of unique keys is small and the branch predictor is accurate, and 2) faster when the number of unique keys is large and the branch predictor is inaccurate. An added benefit of the branch-avoiding algorithm is that it removes speculation from the execution and makes the execution more deterministic. With the removal of the control flow dependency, we show in this paper how to 1) use each data lane in a vector as its own thread (essentially, each lane is a separate software thread) and 2) ensure that each of the W software threads executes the correct instructions given that there is only a single hardware thread.
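To make the branch-avoiding model concrete, the following is a minimal C sketch of Alg. 1, assuming 32-bit keys whose pairwise difference does not overflow (e.g., non-negative keys, as in our experiments); the function name and signature are illustrative only:

#include <stddef.h>
#include <stdint.h>

/* Branch-avoiding merge of two sorted arrays (cf. Alg. 1). The shift
 * extracts the sign bit of A[ai]-B[bi], so selecting the next output
 * element involves no data-dependent branch. */
static void branch_avoiding_merge(const int32_t *A, size_t lenA,
                                  const int32_t *B, size_t lenB,
                                  int32_t *C) {
    size_t ai = 0, bi = 0, ci = 0;
    while (ai < lenA && bi < lenB) {
        /* flag = 1 iff A[ai] < B[bi] (assumes the subtraction does
         * not overflow, e.g., keys fit in 31 bits). */
        int32_t flag = (int32_t)((uint32_t)(A[ai] - B[bi]) >> 31);
        C[ci++] = flag * A[ai] + (1 - flag) * B[bi];
        ai += (size_t)flag;
        bi += (size_t)(1 - flag);
    }
    while (ai < lenA) C[ci++] = A[ai++]; /* copy remainder of A */
    while (bi < lenB) C[ci++] = B[bi++]; /* copy remainder of B */
}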

TABLE II
DESCRIPTION OF IMPORTANT VARIABLES USED IN PSEUDOCODE

Variable                    Description
AIndex, BIndex, CIndex      Current index in A, B, and C, respectively.
AStop, BStop                Max index for A and B, respectively.
AEndMask, BEndMask          Mask marking whether a sub-array has elapsed in A and B, respectively.
maskA, maskB                Comparison mask for A and B, respectively.
cmp                         Comparison mask.
AElems, BElems, CElems      Vector values of the A, B, and C sub-arrays, respectively.
N, |A|, |B|, |C|            Array size.
P                           Number of threads.
W                           SIMD width.

TABLE III
SUBSET OF SIMD INSTRUCTIONS USED BY OUR MERGING AND SORTING ALGORITHMS. W DENOTES THE WIDTH OF THE VECTOR INSTRUCTION.

Instruction                               Description
Gather_W(src*, indices, mask)             Load data from multiple memory locations if the mask bit is set.
Scatter_W(dest*, elems, indices, mask)    Store data at multiple memory locations if the mask bit is set.
Load_W(src*)                              Load a sequential segment of memory.
Store_W(dest*, elems)                     Store a sequential segment of values to memory.
CompareLT_W(A, B)                         Pairwise SIMD vector comparison (less than).
CompareGT_W(A, B)                         Pairwise SIMD vector comparison (greater than).
Blend_W(A, B, mask)                       Selectively pull elements from vector B if the mask bit is set and from vector A otherwise.
Add_W(A, value)                           Add a value to each element in the given A vector.
Add_W(A, B)                               Add each lane in A to each lane in B.
MaskAdd_W(A, mask, value)                 Add a value to each element in A if the corresponding mask bit is set.
BitAnd_W(maskA, maskB)                    Bitwise and for masks.
BitOr_W(maskA, maskB)                     Bitwise or for masks.
BitNeg_W(mask)                            Flip the bits of a mask.
Set_W(value, index)                       Set the lane at the index to value.
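For readers who wish to connect Table III to real code, the sketch below shows one plausible mapping of a few of these abstract operations onto AVX-512F intrinsics for W = 16 lanes of 32-bit keys; the wrapper names are ours, and the actual implementation may differ:

#include <immintrin.h>
#include <stdint.h>

typedef __m512i vec16;     /* 16 x 32-bit lanes          */
typedef __mmask16 mask16;  /* one predicate bit per lane */

/* Gather_W: load src[idx[lane]] in lanes whose mask bit is set;
 * inactive lanes receive 0. Scale 4 = sizeof(int32_t). */
static inline vec16 gather_16(const int32_t *src, vec16 idx, mask16 m) {
    return _mm512_mask_i32gather_epi32(_mm512_setzero_si512(), m, idx, src, 4);
}
/* Scatter_W: store elems[lane] to dst[idx[lane]] where the mask is set. */
static inline void scatter_16(int32_t *dst, vec16 elems, vec16 idx, mask16 m) {
    _mm512_mask_i32scatter_epi32(dst, m, idx, elems, 4);
}
/* CompareLT_W: per-lane a < b, returned as a 16-bit mask. */
static inline mask16 cmplt_16(vec16 a, vec16 b) {
    return _mm512_cmplt_epi32_mask(a, b);
}
/* Blend_W: take the lane from b where the mask bit is set, else from a. */
static inline vec16 blend_16(vec16 a, vec16 b, mask16 m) {
    return _mm512_mask_blend_epi32(m, a, b);
}
/* MaskAdd_W: a[lane] += v only in lanes whose mask bit is set. */
static inline vec16 mask_add_16(vec16 a, mask16 m, int32_t v) {
    return _mm512_mask_add_epi32(a, m, a, _mm512_set1_epi32(v));
}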

III. MERGING ALGORITHMS

In this section we present our new vectorized merging algorithms. First we present Merge Path [22], [6], which provides a way to merge two sorted sub-arrays in parallel. The Merge Path concept is used in both our parallel merging and our parallel sorting. We note that Merge Path was originally developed for the final phases of a sort, where the number of threads is greater than the number of sorted arrays. The first algorithm we present in this section covers this extension. The later algorithms use vector units, based on the branch-avoiding methodology [13], [14], to execute different (yet concurrent) merges in each lane of the vector unit. These merging algorithms can also be used in the initial phases of the sort, where a single unsorted array is sorted by a single thread (sequentially), significantly increasing throughput.

a) Merge Path: One of the key challenges for any parallel algorithm is to utilize the entire system throughout the entire execution. This is also true for sorting. For a parallel merge sort algorithm, workload imbalance and system under-utilization become an issue when the number of sorted arrays is smaller than the number of processors available. Several PRAM algorithms with partitioning schemes have been developed; some of these give perfect balance while others do not [16], [8]. The Merge Path concept highlights a way to get perfect load-balancing while ensuring that the algorithm remains work efficient (under the assumption that P < N). Merge Path has been implemented successfully

[Fig. 1 depicts an 8x8 grid with A = {5, 6, 7, 11, 13, 14, 15, 16} on the vertical axis and B = {1, 2, 3, 5, 8, 9, 10, 12} on the horizontal axis; entry (i, j) is 1 when A[i] > B[j] and 0 otherwise. The merge path follows the boundary between the ones and the zeros and is intersected by the cross diagonals.]

Fig. 1. Merge Path matrix showing intersection lines and points.

for both the CPU [6] and the GPU [22]. While Merge Path is similar to [23], Merge Path offers a more visual and intuitive explanation of the parallel merging process.

The following is a high-level overview of Merge Path; the reader is referred to [6] and [22] for additional details. Given two sorted arrays, place one array vertically (A) and one array horizontally (B) so that a grid forms between the two arrays, as seen in Fig. 1. Note that the arrays do not have to be of equal length - we use equal-length arrays for simplicity.

Starting from the top-left corner, elements from each array are compared one by one. If the compared element from B is smaller than the element from A, that element is copied into the output array C and the path moves to the right. If the element in A is smaller than or equal to the element in B, then that element is copied into C and the path moves down by one position. The merging is finalized once the path reaches the bottom-right corner. Given the Merge Path, we can see the order in which a merging algorithm will copy elements from the two input arrays into the output array. While the exact path is not known until runtime, Merge Path offers a visual way to see that the sequence of events and the partitioning mechanism ensure that threads receive an equal amount of work.

Consider the cross diagonals drawn in Fig. 1. Notice that the distance, in blocks, from the top-left corner to any point on a given cross diagonal is the same. Now notice that the black dots, which are the intersections of the path with the cross diagonals, are also equidistant from one another. Finding the intersection between a cross diagonal and the path requires a binary search on the cross diagonal, comparing two adjacent elements on the cross diagonal at each step (additional details in [6]). And while we do not know the actual path between two of the dots, notice that the total number of blocks going down and to the right between them is fixed. Therefore, by finding these dots, it is possible to partition the merge across multiple threads that work on independent sections of the array. Finding these dots requires O(log(N)) time per processor while the merging itself requires O(N/P) time per processor; therefore, when P < O(N/log(N)), this is considered optimal.
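A sketch of this binary search is shown below in C, under the tie-breaking convention described above (the path moves down when the element of A is smaller than or equal to the element of B); the function name is ours:

#include <stddef.h>
#include <stdint.h>

/* Merge Path partition (cf. [6]): returns how many elements of A lie
 * before cross diagonal `diag` (0 <= diag <= lenA + lenB) on the path.
 * Thread t computes diag = t * (lenA + lenB) / P and then merges
 * A[lo..) with B[diag - lo..) until the next thread's starting point. */
static size_t merge_path_partition(const int32_t *A, size_t lenA,
                                   const int32_t *B, size_t lenB,
                                   size_t diag) {
    size_t lo = diag > lenB ? diag - lenB : 0; /* fewest A elements */
    size_t hi = diag < lenA ? diag : lenA;     /* most A elements   */
    while (lo < hi) {
        size_t a = lo + (hi - lo) / 2; /* candidate: a from A, b from B */
        size_t b = diag - a;
        if (A[a] <= B[b - 1])          /* path is still moving down     */
            lo = a + 1;
        else                           /* path has turned to the right  */
            hi = a;
    }
    return lo;
}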

b) Vectorized Merge Path: Algorithm 3 presents an outline of our new vectorized algorithm. Similar to the scalar version of Merge Path, our inputs are two different arrays A and B. Unlike the scalar version, which uses one starting

[Fig. 2(a): each lane performs Load(A) and Load(B) from sequential memory addresses, Compare(A, B), Blend(A, B, mask), and then Store(A)/Store(B). Fig. 2(b): the same flow using Gather(A)/Gather(B) and Scatter(A)/Scatter(B) over non-contiguous memory addresses.]

Fig. 2. (Top) Vectorized merging for small arrays with load and store instructions instead of gather and scatter, W = 4. Great for merging arrays of size two or four elements. (Bottom) SIMD merge using gather and scatter. For both (Top) and (Bottom) the C array is not sorted. An additional variant of the bottom figure also exists such that the output is sorted.

point per thread, our new algorithm requires W starting points per hardware thread - one for each lane within the vector. The starting points are found using the Merge Path algorithm. This pseudocode can be extended to P threads by adding an additional parallel for loop across threads and using Merge Path to partition the input across the multiple threads. As such, each vector lane is responsible for merging N/(W · P) elements. A similar two-tier approach, albeit more complex, is presented in [22]. The new algorithm uses gather and scatter instructions (used for non-sequential memory operations). While past SIMD instruction sets have supported gather instructions, scatter instructions have not been widely supported. The scatter instruction enables our algorithm to write values to W independent locations and enables writing the output in a sorted manner without requiring additional data movement. The algorithm of Chhugani et al. [7] does not benefit from scatter instructions as it writes to consecutive locations.

Note that the WHILE loop now operates on W lanes. The condition checked by the WHILE loop is now a vector of W bits. These W-bit arrays are used to represent the status of each lane in the vector operation - these bit arrays are known as masks¹. It is through these masks that we increase the control flow and enable lane-level control granularity. Notice that masks allow us to specify whether an instruction in a specific lane should be executed, for example whether an index should be increased. Unlike the scalar algorithm, where a single index is incremented, in the new algorithm both indices are updated using the mask information (similar to Alg. 1). This doubles the number of comparisons, but is still fewer than Chhugani et al. [7].

c) Local Vectorized Merging for Small Arrays: This

algorithm uses faster sequential loads and stores coupled with

¹ Masks can be manipulated either using special vector instructions or through arithmetic operations.

Algorithm 2: Vector merging algorithm for small arrays. Gather and scatter instructions are replaced with loads and stores - this can be done so long as the input is local in memory.

function Merge(array)
  Input: Partially sorted array: array
  Output: Partially sorted array: array
  for ind ← 0; ind < subSize; ind ← ind + 1 do
    AElems[ind] ← Load_16(array + ind · 16);
    BElems[ind] ← Load_16(array + ind · 16 + subSize);
  ACount, BCount ← 0;
  for ind ← 0; ind < 2 · subSize; ind ← ind + 1 do
    cmp ← CompareLT_16(AElems[0], BElems[0]);
    cmp ← BitAnd_16(cmp, CompareLT_16(ACount, subSize));
    cmp ← BitOr_16(cmp, BitNeg_16(CompareLT_16(BCount, subSize)));
    CElems ← Blend_16(BElems[0], AElems[0], cmp);
    Store_16(array + ind · 16, CElems);
    ACount ← MaskAdd_16(ACount, cmp, 1);
    BCount ← MaskAdd_16(BCount, ¬cmp, 1);
    for j ← 0; j < subSize − 1; j ← j + 1 do
      AElems[j] ← Blend_16(AElems[j], AElems[j + 1], cmp);
      BElems[j] ← Blend_16(BElems[j + 1], BElems[j], cmp);
  // Finish merging with another algorithm

Algorithm 3: Branch-avoiding SIMD based merging algorithm using gather and scatter

function Merge(array, AInd, AStop, BInd, BStop)
  Input: array: array, vector indexes: AInd, AStop, BInd, BStop
  Output: Merged array: C
  CIndex ← Add_W(AInd, BInd);
  AEndMask, BEndMask ← 0xFFFF;  // One bit set to 1 for each lane
  while BitOr_W(AEndMask, BEndMask) ≠ 0 do
    // Pull the elements from memory
    AElems ← Gather_W(array, AInd, AEndMask);
    BElems ← Gather_W(array, BInd, BEndMask);
    // Compare the elements
    cmp ← CompareLT_W(AElems, BElems);
    cmp ← BitAnd_W(cmp, AEndMask);
    tmp ← BitNeg_W(BEndMask);
    cmp ← BitOr_W(cmp, tmp);
    CElems ← Blend_W(BElems, AElems, cmp);
    // Store output to memory
    tmp ← BitOr_W(AEndMask, BEndMask);
    Scatter_W(C, CElems, CIndex, tmp);
    // Increment indices
    AInd ← MaskAdd_W(AInd, cmp, 1);
    cmp ← BitNeg_W(cmp);
    BInd ← MaskAdd_W(BInd, cmp, 1);
    AEndMask ← CompareGT_W(AStop, AInd);
    BEndMask ← CompareGT_W(BStop, BInd);
    CIndex ← Add_W(CIndex, 1);

non-sequentially stored sub-arrays. This makes this particular approach more beneficial for small array sizes. Consider Fig. 2 (top) and the first round of a merge sort for an unsorted array (Alg. 2). Using a sequential load, two SIMD vectors are loaded from memory. Each vector's elements are compared pairwise with the other vector's elements. The smaller elements are stored sequentially first, followed by the larger elements. This produces sub-arrays of size 2 stored in a non-sequential fashion, where each sub-array is stored with its elements offset from each other by W. Therefore, these elements eventually need to be re-stored in sequential order. This adds some overhead and makes this approach less ideal for large arrays. In addition, this algorithm suffers from the disadvantage that at sub-arrays of size N/(W · P) and above it can no longer perform at full system utilization.
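As an illustration, the first round of Fig. 2 (top) can be written with a single min/max pair per vector, which is equivalent to the compare-and-blend formulation of Alg. 2 for two-element sub-arrays; this sketch assumes W = 16 and 32-bit keys:

#include <immintrin.h>
#include <stdint.h>

/* Sixteen independent 2-element merges in one step (cf. Fig. 2, top).
 * Keys data[i] and data[i+16] are compared pairwise; the smaller of
 * each pair lands in lane i and the larger in lane i+16, so sorted
 * pair i is stored with its two elements offset by W = 16. */
static void merge_pairs_16(int32_t *data) {
    __m512i a  = _mm512_loadu_si512(data);      /* keys  0..15 */
    __m512i b  = _mm512_loadu_si512(data + 16); /* keys 16..31 */
    __m512i lo = _mm512_min_epi32(a, b);        /* per-lane smaller */
    __m512i hi = _mm512_max_epi32(a, b);        /* per-lane larger  */
    _mm512_storeu_si512(data, lo);
    _mm512_storeu_si512(data + 16, hi);
}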

d) SIMD Merge Using Scatter & Gather: As input to the algorithm, we provide splitters (indices into the input array) that are each offset by the initial sub-array size. For example, when merging sub-arrays of size 4 in the input, the A splitters will be 0, 8, 16, 24, and so on, and the B splitters will be 4, 12, 20, 28, and so on. Because the sub-array sizes are all perfectly equal, the vector lanes are fully utilized and load-balanced. Fig. 2 (bottom) shows the merging process for this

[Fig. 3 shows the flow from unsorted data to sorted data over log(n) levels: Algorithm 2 merges arrays of size 1 and 2; Algorithm 3 merges arrays of size 4 up to size N/(W · P); and Algorithm 3 with Merge Path merges the remaining levels up to the fully sorted array of size n.]

Fig. 3. Optimized merge sort algorithm using two different merging

algorithms in three different phases.

Algorithm 4: Single core merge sort algorithm

function Sort(array)
  Input: Unsorted array: array
  Output: Sorted array: array
  for subSize ← 1; subSize < 8; subSize ← 2 · subSize do
    for i ← 0; i < N; i ← i + W · 2 · subSize do
      Algorithm2(array + i);
    // Swap pointers for C and array
  // Re-arrange sub-arrays so that they are stored sequentially
  for ; subSize < N/W; subSize ← 2 · subSize do
    for i ← 0; i < N; i ← i + W · 2 · subSize do
      x ← i;
      for ind ← 0; ind < 16; ind ← ind + 1 do
        AInd ← Set(x, ind);
        x ← x + subSize;
        AStop ← Set(x, ind);
        BInd ← Set(x, ind);
        x ← x + subSize;
        BStop ← Set(x, ind);
      Algorithm3(array + i, AInd, AStop, BInd, BStop);
    // Swap pointers for C and array
  for ; subSize < N; subSize ← 2 · subSize do
    for offset ← 0; offset < N; offset ← offset + 32 · subSize do
      AInd, AStop, BInd, BStop ← MergePath();
      Algorithm3(array, AInd, AStop, BInd, BStop);
    // Swap pointers for C and array

algorithm (Alg. 3). Since we use gather and scatter, there is no need to store more than two values (one per array) at any given time - this reduces the memory footprint. Our algorithm is also responsible for tracking the indices into A, B, and C which are used for the scatter and gather instructions. This algorithm, by default, suffers from the same disadvantage as the previous one: at sub-arrays of size N/(W · P) and above, it can no longer perform at full system utilization. However, this can be overcome by using the Merge Path algorithm to partition the input array across the multiple vector lanes (essentially replacing the roles of the threads with the software-controlled data lanes).
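The loop below is a simplified, hypothetical rendering of Alg. 3 with AVX-512 intrinsics: sixteen merges advance in lock-step, one key per lane per iteration, with the lane-liveness masks playing the role of AEndMask and BEndMask. It assumes both inputs live in one array `in` (as in Alg. 3) and 32-bit keys; the names are ours:

#include <immintrin.h>
#include <stdint.h>

/* One vectorized merge pass (cf. Alg. 3). aIdx/bIdx hold each lane's
 * current index into `in`; aStop/bStop bound the lane's sub-arrays. */
static void simd_merge_16(const int32_t *in, int32_t *out,
                          __m512i aIdx, __m512i aStop,
                          __m512i bIdx, __m512i bStop) {
    const __m512i one = _mm512_set1_epi32(1);
    __m512i cIdx = _mm512_add_epi32(aIdx, bIdx);
    __mmask16 aLive = _mm512_cmpgt_epi32_mask(aStop, aIdx);
    __mmask16 bLive = _mm512_cmpgt_epi32_mask(bStop, bIdx);
    while (_mm512_kor(aLive, bLive)) {
        __m512i aV = _mm512_mask_i32gather_epi32(
            _mm512_setzero_si512(), aLive, aIdx, in, 4);
        __m512i bV = _mm512_mask_i32gather_epi32(
            _mm512_setzero_si512(), bLive, bIdx, in, 4);
        /* Take from A when A < B and A is live, or when B is exhausted. */
        __mmask16 takeA = _mm512_cmplt_epi32_mask(aV, bV);
        takeA = _mm512_kor(_mm512_kand(takeA, aLive), _mm512_knot(bLive));
        __m512i cV = _mm512_mask_blend_epi32(takeA, bV, aV);
        _mm512_mask_i32scatter_epi32(out, _mm512_kor(aLive, bLive),
                                     cIdx, cV, 4);
        /* Advance exactly one input index per lane, selected by takeA. */
        aIdx = _mm512_mask_add_epi32(aIdx, takeA, aIdx, one);
        bIdx = _mm512_mask_add_epi32(bIdx, _mm512_knot(takeA), bIdx, one);
        aLive = _mm512_cmpgt_epi32_mask(aStop, aIdx);
        bLive = _mm512_cmpgt_epi32_mask(bStop, bIdx);
        cIdx = _mm512_add_epi32(cIdx, one);
    }
}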

IV. SORTING ALGORITHM

A. Detailed discussion of Parallel Merge Sort

In the first phase of the parallel algorithm, the input data is entirely unsorted and no known ordering of the data exists. As such, we start with our vectorized sequential load and store merge, Alg. 2. In Sub-Sec. IV-B we discuss additional motivation for how this phase was designed from a practical perspective. This algorithm is used to sort sub-arrays of up to four elements. As our algorithm uses vectors that are W wide, this means that we are dealing with 2 · W and 4 · W elements in these phases.

We determined that using sub-arrays larger than 4 elements, for example 8 elements, proved to be ineffective due to caching and

TABLE IV
MERGE ALGORITHM IMPLEMENTATIONS USED IN ANALYSIS.

Algorithm     Description
Traditional   Standard merge algorithm.
Bitonic       Implementation of [7] using SSE instructions.
AVX-512-BA    Implementation of the branch-avoiding algorithm (Alg. 3) using Merge Path and the AVX-512 instruction set.

TABLE V
SORTING ALGORITHM IMPLEMENTATIONS USED IN ANALYSIS.

Algorithm     Description
Traditional   Iterative merge sort based on the traditional merge.
Bitonic       Iterative merge sort based on a bitonic merge [7] using SSE instructions.
AVX-512-BA    Implementation of Alg. 4 using AVX-512.
IPP           SortAscend from the 2018 Intel IPP library [15].
Quicksort     qsort from the C library (Intel Parallel Studio 2018 Cluster Edition).

other overheads associated with vector instructions - including reorganizing the sorted data, which is spread out in a lock-step manner. The data reorganization can be done using scatter instructions.

In the next phase of the vector algorithm, Alg. 3, we merge the sub-arrays of size 4 up to sub-arrays of size N/(W · P). Note that even in this phase, the threads work in an entirely independent manner. In the first round of this algorithm, each sub-array consists of 4 elements. This is then doubled to 8 elements, 16 elements, and so forth. Further, in this phase of the algorithm we have enough sub-arrays to ensure that all lanes are always used. As the arrays are of equal length, we can ensure that each data lane receives an equal number of elements, ensuring good load-balancing.

Lastly, we finish again with Alg. 3 (red boxes in Fig. 3) using a vectorized merge. Unlike the previous phase, which used a single thread per array, the merging process here uses multiple threads for a single array. The overhead of adding the threads is relatively small as the arrays are fairly large - this was shown to be inexpensive by the Merge Path algorithm. The Merge Path splitters determine the starting points for each vector lane's merge. This algorithm is used in an iterative manner, starting with array sizes of N/(W · P) and continuing up to a fully sorted array. This entire process is summed up in Alg. 4.
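The final phase can be sketched as follows; this hypothetical driver uses OpenMP (the paper does not prescribe a threading runtime) together with the merge_path_partition() search sketched earlier, and stands in the scalar branch_avoiding_merge() for the vectorized kernel, which would further split each thread's slice across the W lanes:

#include <omp.h>
#include <stddef.h>
#include <stdint.h>

/* P threads cooperate on one large merge of A and B into C.
 * merge_path_partition() and branch_avoiding_merge() refer to the
 * earlier sketches; in the real algorithm the per-thread merge would
 * be Alg. 3 with per-lane Merge Path splitters. */
static void parallel_merge(const int32_t *A, size_t lenA,
                           const int32_t *B, size_t lenB, int32_t *C) {
    #pragma omp parallel
    {
        size_t P = (size_t)omp_get_num_threads();
        size_t t = (size_t)omp_get_thread_num();
        size_t total = lenA + lenB;
        size_t d0 = t * total / P;       /* this thread's diagonal */
        size_t d1 = (t + 1) * total / P; /* next thread's diagonal */
        size_t a0 = merge_path_partition(A, lenA, B, lenB, d0);
        size_t a1 = merge_path_partition(A, lenA, B, lenB, d1);
        /* Thread t writes exactly C[d0..d1), independently of others. */
        branch_avoiding_merge(A + a0, a1 - a0,
                              B + (d0 - a0), (d1 - a1) - (d0 - a0),
                              C + d0);
    }
}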

B. Failed Parallel Vectorization Attempts

In our initial implementation, we started off by sorting small sub-arrays of 64 elements using the non-vectorized quicksort algorithm. This was then followed by using the vectorized merge algorithm (Alg. 3 with Merge Path) for the larger arrays. Our assumption was that this phase of the algorithm was fairly small and as such would not account for a large amount of the execution time. However, profiling of the execution showed us otherwise - the time spent sorting the arrays into partitions of 64 elements was roughly 50% of the total run time. This in fact motivated the development of Alg. 2, which better utilizes the vector units and has better locality. This greatly improved the overall speedup of our new algorithm.

V. EXPERIMENTAL SETUP

In the following section we compare both our merging and sorting algorithms with several different algorithms. For the sake of simplicity, in the context of merging we focus on arrays A and B such that |A| = |B| and |C| = |A| + |B|; however, in practice this is not a constraint when merging arrays of unequal length. Table IV lists the various merging algorithms tested in this paper. Our parallel merge sort algorithm utilizes the various merging algorithms discussed in Sec. III. Table V lists the sorting algorithms that we use in our paper. We include comparisons against some of Intel's implementations, including a sort from Intel's IPP (Intel Integrated Performance Primitives [15]). We also compare against an implementation of [7]. Our analysis covers a mix of single-threaded and multi-threaded executions with scalar and vectorized implementations.

System Configuration: The experiments presented in this paper are executed on an Intel Xeon Phi 7250 processor with 96GB of DRAM memory (102.4 GB/s peak bandwidth). This processor is part of the Xeon Phi Knights Landing series of processors. In addition to the main memory, the processor has an additional 16GB of MCDRAM high-bandwidth memory (400 GB/s peak bandwidth). This MCDRAM is used as our primary memory, and so long as the application's memory fits into MCDRAM, the lower-latency DRAM memory is not utilized. The Intel Xeon Phi 7250 has 68 cores with 272 threads (4-way SMT). These cores run at a 1.4 GHz clock and share a 32MB L2 cache. All code is compiled with the Intel Compiler (icc, Intel Parallel Studio 2018 Cluster Edition).

Vector Instruction Set: Our algorithm is implemented using Intel's AVX-512 instruction set, which supports vectors of up to 512 bits. Specifically, our algorithm is implemented for 32-bit integers², meaning that our vector width is W = 16. Given the system parameters, P = 272, we are able to execute up to 4352 merges concurrently. Our algorithm can also be implemented for 64-bit values - this would result in a vector width of W = 8 and a reduction in the number of software threads by half. The algorithm of Chhugani et al. [7] uses Intel's SSE instruction set. SSE uses 128-bit vector units with W = 4. Recall that in Sec. II an analysis of the algorithm of [7] is given for wider vectors and that their algorithm does not scale to larger vectors.

Random Key Generation: All experiments in this paper use randomly generated numbers taken from a uniform random distribution. The range of the keys was shown to play a critical role in the performance of merging algorithms [14]. Specifically, as the number of unique keys increases, it becomes more likely that the Merge Path will be very close to the main diagonal, and the branch predictor becomes less accurate, leading to lower performance. The range of the keys is from 0 to 2^i where i ∈ {4, ..., 28}. Unless mentioned otherwise, the maximal i = 28 is used for specifying the range.

² The algorithm can easily be changed to 64 bits using similar instructions designed for large words.
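As an illustration of this setup, a key array for range parameter i can be generated as follows; rand() stands in for whichever uniform generator is actually used:

#include <stdint.h>
#include <stdlib.h>

/* Fill `keys` with n values drawn (approximately) uniformly from
 * [0, 2^i), i in {4, ..., 28}; two rand() calls supply enough bits. */
static void fill_random_keys(int32_t *keys, size_t n, unsigned i) {
    uint32_t mask = (1u << i) - 1u; /* keep the low i bits, i < 32 */
    for (size_t k = 0; k < n; k++) {
        uint32_t r = ((uint32_t)rand() << 15) ^ (uint32_t)rand();
        keys[k] = (int32_t)(r & mask);
    }
}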

Fig. 4. Performance of single core merging (Sequential Merging). (Top) shows the results in keys merged per second (higher is better) as a function of the array size; the maximal number of keys was held at 2^28. (Bottom) shows the results in keys merged per second (higher is better) as a function of the key range. Input arrays have 1M elements each (0.5M for A and 0.5M for B).

To test the performance of the merge algorithms, two arrays are filled with randomly generated numbers. These are then sorted separately. Lastly, they are merged together; only the merging phase is timed. To test the sort algorithms, a single array with randomly generated keys is created. This is followed by the sorting of the array, which is timed.

VI. PERFORMANCE ANALYSIS

Single core merging as a function of array size: Fig. 4 (top) depicts the number of keys merged per second as a function of the array size. Our new algorithm outperforms the other merging algorithms by almost 3x. The measured execution time is very small for array sizes below 10^3, leading to high relative overhead; for larger arrays these overheads disappear.

Single core merging as a function of the key range: Fig. 4 (bottom) depicts the performance of several merge algorithms using a single core as a function of the key range. The abscissa is the key range and the ordinate is the number of keys merged per second - higher is better. The input size is 1M keys. Our new algorithm outperforms the standard scalar merge and the SIMD algorithm of Chhugani et al. [7] by factors of 2.8x and 2.2x, respectively. Note that for both the standard scalar and the SSE [7] implementations, the performance decreases as the key range increases, due to branch mis-predictions. This matches the results of [14].

Parallel merging as a function of array size: Fig. 5 (top) depicts the number of elements merged per second as a function of the array size. The number of threads used was 64 and the maximal number of keys was held at 2^28. Once again, our new algorithm outperforms the other merging algorithms. Merging the smaller arrays has some overheads that disappear as the array sizes become substantial. Our new algorithm, at its peak, can merge up to 16 billion keys per second.

Parallel merging as a function of threads: Fig. 5 (bottom) depicts the number of elements merged per second as a function of the number of threads. The array size is 1B

Fig. 5. Performance of parallel merging (Parallel Merging). (Top) shows the results in elements merged per second (higher is better) as a function of the array size; the maximal number of keys was held at 2^28 and 64 threads were used. (Bottom) shows the results in keys merged per second (higher is better) as a function of the number of threads; input arrays have 1B elements each and the maximal number of keys was held at 2^28. Hyperthreading is used when the number of threads is higher than 64.

elements and the maximal number of unique keys was held at 2^28. Hyperthreading is used when the number of threads is higher than 64. Once again, our new algorithm outperforms the other merging algorithms (except at 128 and 256 threads, where the performance of our algorithm drops off). The reason for the drop-off at those thread counts is that our processor has 68 cores with 4-way SMT, so for 128 and 256 threads hyperthreading is used (this is a known side effect [24]). The hardware threads on a single core share vector processing units, memory queues for outstanding memory requests, and caches (a 1MB L2 cache is shared amongst 2 cores - up to 8 threads). Our branch-avoiding algorithm uses two gather instructions per thread. That means at 272 threads we attempt to load up to 272 · 2 · 16 = 8704 different addresses. This is a memory-intensive operation which can lead to various memory problems. Furthermore, if too many memory requests are generated concurrently, then the core might stall - a core has a maximal number of memory requests that it can dispatch before it is throttled back. From the cache's perspective, our algorithm uses 16x more cache lines. This can lead to contention, especially at the L1 level where the cache is fairly small, when more threads are used on a single core. Lastly, the slowdown seems to be consistent with other KNL research and is actually noted in the system's user guide [25].

Single core sorting as a function of the array size: Fig. 6 (top) depicts the performance of the sorting algorithms as a function of the array size. The maximal number of keys was held at 2^28. Our algorithm is faster by 2x-3x. All of the tested algorithms decay by around 50% from an array size of 2^16 to 2^24. Our algorithm seems to have a slightly faster decay in comparison to the other algorithms, most likely due to oversaturation of the memory requests at the core granularity; KNL was not designed for this level of memory request intensity.

Fig. 6. Performance of single core sorting (Sequential Sorting). (Top) shows the results in keys sorted per second (higher is better) as a function of the array size; the maximal number of keys was held at 2^28. (Bottom) shows the results in keys sorted per second (higher is better) as a function of the key range. Input arrays have 2^24 elements each.

Fig. 7. Performance of parallel sorting (Parallel Sorting). (Top) shows the results in elements sorted per second (higher is better) as a function of the array size; the maximal number of keys was held at 2^28 and 64 threads were used. (Bottom) shows the results in keys sorted per second (higher is better) as a function of the number of threads; input arrays have 2^20 elements each and the maximal number of keys was held at 2^28. Hyperthreading is used when the number of threads is higher than 64.

Single core sorting as a function of the key range: Fig. 6 (bottom) depicts the performance of the sorting algorithms as a function of the key range. The size of the input array is 2^24 elements. For all but Intel's IPP, the performance seems mostly independent of the key range, although performance did decrease by about 10% across the algorithms. This is in contrast to the merging algorithms, where there was a clear reduction in performance. The algorithm that stands out most is Intel's IPP, whose performance decreases by roughly 6x-7x. Beyond a range of 1024 unique keys (which is fairly typical for large applications), our new algorithm outperforms Intel's IPP algorithm by roughly 2.2x.

Parallel sorting as a function of array size: Fig. 7 (top) depicts the number of elements sorted per second as a function of the array size. The number of threads used was 64 and the maximal number of keys was held at 2^28. This time our algorithm outperforms the others for small array sizes, but falls off for larger sizes. This is an artifact of the memory subsystem of KNL and seems to affect algorithms with certain types of memory access patterns. This issue does not seem to be an artifact of our algorithm itself, because we do not see it on the AVX-512 Intel Skylake system we have tested on (those results have been omitted for brevity).

Parallel sorting as a function of threads: Fig. 7 (bottom) depicts the number of elements sorted per second as a function of the number of threads. The array size used was 2^20 and the maximal number of keys was held at 2^28. Hyperthreading is used when the number of threads is higher than 64. Our new algorithm outperforms the other algorithms at larger thread counts.

VII. CONCLUSIONS

In this paper we presented a new way of approaching vectorized merging and sorting. We showed how to increase the scalability and control flow of these algorithms from P threads to P · W software threads executed on P physical hardware threads. As part of this process we showed how to greatly increase the control flow available in a single hardware thread. This was in part due to the introduction of new gather and scatter instructions, which enable loading and storing data in an efficient manner to various locations, as well as the introduction of new mask instructions. Through these masked instructions we showed how each data lane can be controlled at a finer grain than was previously possible.

From a performance perspective, our new vector-based algorithm outperforms many state-of-the-art implementations. This is thanks to the low overhead, efficiency, and balanced threading of the algorithm. Testing demonstrates that the algorithm not only surpasses its SIMD-based bitonic counterpart, but that it is over 2.94x faster than a traditional merge, merging over 300M keys per second in one thread and over 16B keys per second in parallel. A full sort is over 5x faster than quicksort and 2x faster than Intel's IPP library sort, sorting over 5.3M keys per second for a single processor and over 500M keys per second in parallel, a speedup of over 2x over a traditional merge sort.

REFERENCES

[1] W. A. Martin, “Sorting,” ACM Comput. Surv., vol. 3, no. 4, pp. 147–174,

1971.

[2] C. A. R. Hoare, “Algorithm 64: Quicksort,” Commun. ACM, vol. 4, no. 7,

p. 321, 1961.

[3] K. E. Batcher, “Sorting networks and their applications,” in Proceedings

of the April 30–May 2, 1968, spring joint computer conference, 1968,

Conference Paper, pp. 307–314.

[4] N. Amato, R. Iyer, S. Sundaresan, and Y. Wu, “A comparison of parallel

sorting algorithms on different architectures,” Texas A & M University,

Report, 1998.

[5] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “Aa-sort: A

new parallel sorting algorithm for multi-core simd processors,” in

16th International Conference on Parallel Architecture and Compilation

Techniques (PACT 2007), 2007, Conference Proceedings, pp. 189–198.

[6] S. Odeh, O. Green, Z. Mwassi, O. Shmueli, and Y. Birk, “Merge Path-

Parallel Merging Made Simple,” in IEEE 26th International Parallel and

Distributed Processing Symposium Workshops & PhD Forum (IPDPSW),

2012, pp. 1611–1618.

[7] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K.

Chen, A. Baransi, S. Kumar, and P. Dubey, “Efﬁcient implementation

of sorting on multi-core simd cpu architecture,” Proc. VLDB Endow.,

vol. 1, no. 2, pp. 1313–1324, 2008.

[8] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “A high-

performance sorting algorithm for multicore single-instruction multiple-

data processors,” Software: Practice and Experience, vol. 42, no. 6, pp.

753–777, 2012.

[9] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and

P. Dubey, “Fast sort on cpus and gpus: a case for bandwidth oblivious

simd sort,” in Proceedings of the 2010 ACM SIGMOD International

Conference on Management of data, 2010, Conference Paper, pp. 351–

362.

[10] B. Bramas, “Fast sorting algorithms using avx-512 on intel knights

landing,” CoRR, vol. abs/1704.08579, 2017.

[11] H. Inoue and K. Taura, “Simd- and cache-friendly algorithm for sorting

an array of structures,” Proc. VLDB Endow., vol. 8, no. 11, pp. 1274–

1285, 2015.

[12] T. Xiaochen, K. Rocki, and R. Suda, “Register level sort algorithm on

multi-core simd processors,” in Proceedings of the 3rd Workshop on

Irregular Applications: Architectures and Algorithms, 2013, Conference

Paper, pp. 1–8.

[13] O. Green, M. Dukhan, and R. Vuduc, “Branch-Avoiding Graph Algo-

rithms,” in 27th ACM on Symposium on Parallelism in Algorithms and

Architectures, 2015, pp. 212–223.

[14] O. Green, “When Merging and Branch Predictors Collide,” in IEEE

Fourth Workshop on Irregular Applications: Architectures and Algo-

rithms, 2014, pp. 33–40.

[15] Intel Corporation, "Intel® Integrated Performance Primitives," 2018, version 2018.1.163. [Online]. Available: https://software.intel.com/en-us/intel-ipp

[16] S. G. Akl and N. Santoro, “Optimal parallel merging and sorting without

memory conﬂicts,” IEEE Transactions on Computers, vol. C-36, no. 11,

pp. 1367–1369, 1987.

[17] Y. Liu, T. Pan, O. Green, and S. Aluru, “Parallelized kendall’s tau

coefﬁcient computation via simd vectorized sorting on many-integrated-

core processors,” arXiv:1704.03767, 2017.

[18] S. Lacey and R. Box, “A fast easy sort,” Byte Magazine, pp. 315–320,

April 1991.

[19] B. Schlegel, T. Willhalm, and W. Lehner, “Fast sorted-set intersection

using simd instructions,” in ADMS@ VLDB, 2011, Conference Proceed-

ings, pp. 1–8.

[20] E. Stehle and H.-A. Jacobsen, “A memory bandwidth-efﬁcient hybrid

radix sort on gpus,” CoRR, vol. abs/1611.01137, 2016.

[21] A. Fog, "The microarchitecture of Intel, AMD and VIA CPUs," An optimization guide for assembly programmers and compiler makers, Copenhagen University College of Engineering, 2011.

[22] O. Green, R. McColl, and D. Bader., “GPU Merge Path: A GPU Merging

Algorithm,” in 26th ACM International Conference on Supercomputing,

2012, pp. 331–340.

[23] N. Deo and D. Sarkar, “Parallel algorithms for merging and sorting,”

Information Sciences, vol. 56, no. 1, pp. 151 – 161, 1991.

[24] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, "Knights Landing: Second-generation Intel Xeon Phi product," IEEE Micro, vol. 36, no. 2, pp. 34–46, 2016.

[25] TACC, "Stampede user guide," 2018. [Online]. Available: https://portal.tacc.utexas.edu/user-guides/stampede2#best-known-practices-and-preliminary-observations-knl