A Fast and Simple Approach to Merge and Merge
Sort using Wide Vector Instructions
Alex Watkins and Oded Green
Georgia Institute of Technology
{jwatkins45, ogreen}@gatech.edu
Abstract—Merging and sorting algorithms are the backbone of many modern computer applications. As such, efficient implementations are desired. Recent architectural advancements in CPUs (Central Processing Units), such as wider and more powerful vector instructions, allow for algorithmic improvements. This paper presents a new approach to merge sort using vector instructions. Traditional approaches to vectorized sorting typically utilize a bitonic sorting network (Batcher's Algorithm), which adds significant overhead. Our approach eliminates that overhead. We start with a branch-avoiding merge algorithm and then use the Merge Path algorithm to split up merging between the different SIMD lanes. Testing demonstrates that the algorithm not only surpasses the SIMD-based bitonic counterpart, but that it is over 2.94x faster than a traditional merge, merging over 300M keys per second in one thread and over 16B keys per second in parallel. Our new sort is over 5x faster than quicksort and 2x faster than Intel's IPP library sort, sorting over 5.3M keys per second for a single processor and over 500M keys per second in parallel, a speedup of over 2x from a traditional merge sort.
I. INTRODUCTION
Sorting algorithms play a crucial role in numerous computer
applications [1]. Database systems use sorting algorithms to
organize internal data and to present the data to the users
in a sorted format. Graph searching uses sorting to speed up lookups. A variety of high-speed sorting algorithms have been proposed, including quicksort [2], merge sort, radix sort, bitonic sort (Batcher's Algorithm) [3], and several others [1], [4], [5]. The time complexities of these algorithms vary, as does their parallel scalability. Quicksort works well in the general case but does not scale favorably as the number of threads increases. Radix sort has favorable efficiency but suffers
from memory bandwidth overhead. As processor architecture
changes constantly, sorting algorithms also need to be re-
evaluated. In this paper, we show a new vectorized sorting
algorithm that utilizes Intel’s AVX-512 instruction set. We
show how the Merge Path [6] concept can be vectorized
to efficiently speed up the parallel merging phases of the
algorithm (the lower phases in the merging network) as well
as how to implement a vectorized merging algorithm for the
traditional phases of the algorithm (the upper phases in the
merging network).
SIMD (Single Instruction Multiple Data) instructions can be found in many modern architectures such as x86, ARM, and Power processors. These instructions enable massive performance gains by applying a single instruction to multiple pieces of data, hence the term SIMD. Numerous sorting implementations have been designed to take advantage of the various SIMD instruction sets: [5], [7], [8], [9], [10], [11], [12]. Many of them introduce a significant algorithmic overhead, making them less than optimal.
Our new algorithm utilizes Intel's AVX-512 instruction set, taking advantage of new features: 1) scatter and gather instructions, 2) wider instructions, and 3) a large number of masked operations. These masked operations enable us to emulate advanced conditional instructions and implement our sorting algorithm using the branch-avoiding model discussed in [13] and [14]. Through the branch-avoiding model, we are able to increase the control flow of our implementation from one control flow per thread to W software control flows per thread. Altogether, we show that the scalability of our new sorting algorithm is W · P, where W is the width of the vector instruction and P is the number of processors, whereas a typical parallel sorting algorithm only scales to P threads. On the KNL (Intel Xeon Phi Knights Landing, described in Sec. V) processor used in our experiments, P = 272 and W = 16. Therefore, we can effectively execute 4352 concurrent software threads.
In this paper we show the performance for both parallel sorting as well as parallel merging. In summary, the contributions of our algorithm are as follows:
• Using the branch-avoiding model, we show a novel way to increase the control flow available in modern CPU processors.
• Our new vectorized merging algorithm is almost 3x faster than a scalar merging algorithm and is approximately 2.5x faster than the sorting network used in [7] for merging.
• Our parallel merging algorithm can merge up to 300M keys per second using a single thread and over 16B keys per second in parallel.
• Using a single thread, our new vectorized algorithm is 5x faster than the standard C-lib quicksort and 2x faster than a sort found in Intel's IPP library [15].
• For a single thread, our algorithm sorts 5.3M keys per second. Using the KNL system, with 272 hardware threads, our new sort algorithm can sort over 500M keys per second. In contrast to its scalar counterpart, our new algorithm is over 2x faster.
TABLE I
Number of vector instructions needed in different merge kernels for different vector ISAs. Values are estimates based on our implementations; actual counts may vary.

                Bitonic                     Our Algorithm
    Width       SSE     AVX-512 [17]        AVX-512
    4           25      17                  10
    8           50      23                  10
    16          100     29                  10
II. RELATED WORK
CPU architectures change rapidly, adding features such as increased memory bandwidth, SIMD width, and thread counts, as well as other lower-level hardware features such as out-of-order execution. Numerous algorithms have been designed to take advantage of these hardware features: [7], [4], [9], [16]. Yet, not all of these algorithms fully utilize all of these hardware features.
a) Sorting Networks: Numerous algorithms can be described through a sorting network. In most cases, sorting networks involve a fixed number of comparisons for a fixed input size. This does not allow for flexible array size selection or scaling to large array sizes due to overhead costs. One commonly used sorting network is the bitonic sorting network [3]. Bitonic sorting involves a multilevel network with multiple comparisons at each level. The network depth is proportional to the size of the network.
Chhugani et al. [7] produced a version of bitonic sort using Intel's SSE instructions. Specifically, Chhugani et al. [7] showed how to replace the merging phase of the algorithm with their 4x4 (8 input) network. Chhugani et al.'s [7] vectorized sort is about 3.3x faster than its scalar counterpart. It is therefore not surprising that this algorithm is a common baseline when measuring performance. We too use it as a baseline and show that our new algorithm is substantially faster than Chhugani et al.'s [7] (over 2.5x). Our algorithm avoids unnecessary comparisons required by their merging step. While their algorithm has the same time complexity as our new algorithm, the constant in their algorithm depends on the overhead of the bitonic network; our algorithm has a smaller constant. Further, their algorithm does not scale well to larger SIMD widths because of these overheads. For example, a 16-way network (16x2 input elements) is actually faster than a 32-way network (32x2 input elements) when working with 512-bit SIMD [17]. Table I demonstrates how a software-based bitonic network scales for wider instruction sets. Notice how for larger networks the instruction count continues to grow, partly because each new level adds additional overhead. In contrast, our algorithm's instruction count is invariant to the vector width.
b) Other Vectorized Algorithms: Another approach to sorting using SIMD is the AA-Sort [5] algorithm, which is similar to comb sort [18] and merge sort. Its merging step uses an algorithm similar to the Intel SSE bitonic merge [7] and has a similar speedup of around 3.33x. Inoue et al. [11] also present an AVX-512 based sort which is similar to AA-Sort but updated for AVX-512. Bramas [10] presents a modified quicksort variant using 512-bit SIMD on the KNL processor. Schlegel et al. [19] show a sorted-set SIMD intersection; however, this implementation is restricted to 8-bit keys, making it ineffective for sorting most values. The Merge Path concept has been used with other SIMD-based sorting algorithms. For example, Xiaochen et al. [12] used Merge Path for partitioning at the thread granularity and then used a SIMD bitonic merge [7] to merge the values. In contrast, our new algorithm does the partitioning at both the thread level and at the SIMD instruction granularity.

Algorithm 1: Branch-Avoiding Merge (assuming 32-bit data)
function Branch-Avoiding-Merge(A, B)
    Input:  Two sorted sub-arrays: A and B
    Output: Sorted array C
    AIndex ← 0; BIndex ← 0; CIndex ← 0;
    while (AIndex < |A| and BIndex < |B|) do
        flag ← RightShift(A[AIndex] − B[BIndex], 31);
        C[CIndex++] ← flag · A[AIndex] + (1 − flag) · B[BIndex];
        AIndex ← AIndex + flag;
        BIndex ← BIndex + (1 − flag);
    // Copy remaining elements into C
c) Other Sorting Algorithms: Radix sort is a non-comparison based sorting algorithm. Radix sort suffers greatly from its memory bandwidth demands [9]. Its irregular memory access patterns result in poor cache utilization [9]. This means that for larger data sets, any performance gain from radix sort is quickly dissipated despite its reduced computational complexity. A hybrid radix sort proposed in [20] alleviates this overhead.
d) Branch-Avoiding Merge: A traditional merging algorithm starts each iteration with a "check for bounds" followed by a simple IF-THEN-ELSE structure. At a high level, and ignoring the WHILE loop, there are two branch possibilities in the code: either the code block under the IF statement is executed or the code block under the ELSE statement is executed. The branching behavior in this case is purely dependent on the input data itself, whereas the WHILE loop is easily predicted. This creates a challenge for CPU branch predictors to work as intended when running these heavily data-dependent algorithms [14]. Branch misses can cost tens of clock cycles [21], which is costly in comparison to the merging process itself, which requires only a small number of instructions.
Green [14] presents a branch-avoiding merging algorithm which uses masks and conditional-like instructions to avoid the heavy cost of branch mis-prediction (Alg. 1). From a performance perspective, the branch-avoiding algorithm, in contrast to the traditional algorithm, is: 1) slower when the number of unique keys is small and the branch predictor is accurate, and 2) faster when the number of unique keys is large and the branch predictor is inaccurate. An added benefit of the branch-avoiding algorithm is that it removes speculation from the execution and makes the execution more deterministic. With the removal of the control flow dependency, we show in this paper how to 1) use each data lane in a vector as its own thread (essentially, each lane is a separate software thread) and 2) ensure that each of the W software threads executes the correct instructions given that there is only a single hardware thread.
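For concreteness, the following C sketch shows one way Alg. 1 can be expressed for 32-bit keys. It is an illustrative sketch rather than the implementation evaluated in this paper: the function name and the tail-copy loops are ours, and it assumes the subtraction A[ai] − B[bi] does not overflow (as is the case for the key ranges used in our experiments).

    #include <stddef.h>
    #include <stdint.h>

    /* Branch-avoiding merge of two sorted arrays of 32-bit keys (cf. Alg. 1).
     * The sign bit of (A[ai] - B[bi]) selects which array advances, so the
     * loop body contains no data-dependent branch. */
    static void merge_branch_avoiding(const int32_t *A, size_t lenA,
                                      const int32_t *B, size_t lenB, int32_t *C)
    {
        size_t ai = 0, bi = 0, ci = 0;
        while (ai < lenA && bi < lenB) {
            /* flag = 1 when A[ai] < B[bi], otherwise 0 */
            int32_t flag = (int32_t)(((uint32_t)(A[ai] - B[bi])) >> 31);
            C[ci++] = flag * A[ai] + (1 - flag) * B[bi];
            ai += (size_t)flag;
            bi += (size_t)(1 - flag);
        }
        while (ai < lenA) C[ci++] = A[ai++];   /* copy remaining elements into C */
        while (bi < lenB) C[ci++] = B[bi++];
    }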
TABLE II
Description of important variables used in pseudocode

    Variable                        Description
    AIndex, BIndex, CIndex          Current index in A, B, and C, respectively.
    AStop, BStop                    Max index for A and B, respectively.
    AEndMask, BEndMask              Mask marking whether a sub-array has elapsed in A and B, respectively.
    maskA, maskB                    Comparison mask for A and B, respectively.
    cmp                             Comparison mask.
    AElems, BElems, CElems          Vector values of the A, B, and C sub-arrays, respectively.
    N, |A|, |B|, |C|                Array size.
    P                               Number of threads.
    W                               SIMD width.
TABLE III
Subset of SIMD instructions used by our merging and sorting algorithms. W denotes the width of the vector instruction.

    Instruction                                 Description
    Gather_W(src*, indices, mask)               Load data from multiple memory locations if the mask bit is set.
    Scatter_W(dest*, elems, indices, mask)      Store data at multiple memory locations if the mask bit is set.
    Load_W(src*)                                Load a sequential segment of memory.
    Store_W(dest*, elems)                       Store a sequential segment of values to memory.
    CompareLT_W(A, B)                           Pairwise SIMD vector comparison (less than).
    CompareGT_W(A, B)                           Pairwise SIMD vector comparison (greater than).
    Blend_W(A, B, mask)                         Selectively pull elements from vector B if the mask bit is set and from vector A otherwise.
    Add_W(A, value)                             Add a value to each element in the given A vector.
    Add_W(A, B)                                 Add each lane in A to each lane in B.
    MaskAdd_W(A, mask, value)                   Add a value to each element in A if the corresponding mask bit is set.
    BitAnd_W(maskA, maskB)                      Bitwise AND for masks.
    BitOr_W(maskA, maskB)                       Bitwise OR for masks.
    BitNeg_W(mask)                              Flip the bits of a mask.
    Set_W(value, index)                         Set the lane at the given index to value.
III. MERGING ALGORITHMS
In this section we present our new vectorized merging algorithms. First we present Merge Path [22], [6], which provides a way to merge two sorted sub-arrays in parallel. The Merge Path concept is used in both our parallel merging and our parallel sorting. We note that Merge Path was developed originally for the final phases of a sort, where the number of threads is greater than the number of sorted arrays. The first algorithm we present in this section covers this case. The later algorithms use vector units based on the branch-avoiding methodology [13], [14] to execute different (yet concurrent) merges in each lane of the vector unit. These merging algorithms can also be used in the initial phases of the sort, where a single unsorted array is sorted by a single thread (sequentially), significantly increasing throughput.
a) Merge Path: One of the key challenges for any
parallel algorithm is to utilize the entire system throughout the
entire execution. This is also true for sorting. For a parallel
merge sort algorithm, the problem of workload imbalance
and system utilization becomes an issue when the number
of sorted arrays is smaller than the number of processors
available. Several PRAM algorithms with partitioning schemes
have been developed. Some of these gave perfect balance while
others did not [16], [8]. The Merge Path concept highlights
a way to get perfect load-balancing while ensuring that the
algorithm is considered work efficient (under the assumption
that P < N). Merge Path has been implemented successfully
Fig. 1. Merge Path matrix for A = {5, 6, 7, 11, 13, 14, 15, 16} and B = {1, 2, 3, 5, 8, 9, 10, 12}, showing the intersection lines and points.
for both the CPU [6] and the GPU [22]. While Merge Path is similar to [23], Merge Path offers a more visual and intuitive explanation of the parallel merging process.
The following is a high-level overview of Merge Path; the reader is referred to [6] and [22] for additional details. Given two sorted arrays, place one array vertically (A) and one array horizontally (B) so that a grid forms between the two arrays, as seen in Fig. 1. Note that the arrays do not have to be of equal length; we use equal-length arrays for simplicity.
Starting from the top-left corner, elements from each array are compared one by one. If the compared element from B is smaller than the element from A, that element is copied into the output array C and the path moves to the right. If the element in A is smaller than or equal to the element in B, then that element is copied into C and the path moves down by one position. The merging is finalized once the path reaches the bottom-right corner. Given the Merge Path, we can see the order in which a merging algorithm will copy elements from the two input arrays into the output array. While the exact path is not known until runtime, Merge Path offers a visual way to see that the sequence of events and the partitioning mechanism ensure that threads receive an equal amount of work.
Consider the cross diagonals drawn in Fig. 1. Notice that the distance, in blocks, from the top-left corner to any point on a cross diagonal is the same. Now notice that the black dots, which are the intersections of the path with the cross diagonals, are also equally distant from one another. Finding the intersection between a cross diagonal and the path requires a binary search on the cross diagonal, comparing two adjacent elements on the cross diagonal (additional details in [6]). And while we do not know the actual path between two of the dots, the total number of blocks going down and to the right between them is fixed. Therefore, by finding these dots, it is possible to partition the merge across multiple threads that work on independent sections of the arrays. Finding these dots requires O(log(N)) time per processor while the merging itself requires O(N/P) time per processor; therefore, when P < O(N/log(N)) this is considered optimal.
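To make the partitioning step concrete, the following C sketch shows one possible implementation of this cross-diagonal binary search. It follows the description in [6]; the function and parameter names are ours, and the code is a sketch rather than the implementation used in our experiments.

    #include <stddef.h>
    #include <stdint.h>

    /* Find where the merge path crosses cross diagonal `diag` (diag = number
     * of output elements preceding the split). Returns ai such that the first
     * diag outputs are exactly A[0..ai) merged with B[0..diag-ai). */
    static size_t merge_path_split(const int32_t *A, size_t lenA,
                                   const int32_t *B, size_t lenB, size_t diag)
    {
        size_t lo = diag > lenB ? diag - lenB : 0;   /* smallest feasible ai */
        size_t hi = diag < lenA ? diag : lenA;       /* largest feasible ai  */
        while (lo < hi) {
            size_t ai = lo + (hi - lo) / 2;
            /* Compare the two elements adjacent to the diagonal crossing. */
            if (A[ai] < B[diag - ai - 1])
                lo = ai + 1;   /* path crosses lower: take more of A  */
            else
                hi = ai;       /* path crosses higher: take less of A */
        }
        return lo;
    }

Each thread (or, later, each vector lane) performs this search once for its own diagonal, for example diag = k · (|A| + |B|) / P for thread k, and then merges independently from the returned coordinates.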
Fig. 2. (Top) Vectorized merging for small arrays with load and store instructions instead of gather and scatter, W = 4; well suited to merging arrays of two or four elements. (Bottom) SIMD merge using gather and scatter. For both (Top) and (Bottom) the C array is not sorted. An additional variant of the bottom figure also exists such that the output is sorted.

b) Vectorized Merge Path: Algorithm 3 presents an outline of our new vectorized algorithm. Similar to the scalar version of Merge Path, our inputs are two different arrays A and B. Unlike the scalar version, which uses one starting
point per thread, our new algorithm requires W starting points per hardware thread, one for each lane within the vector. The starting points are found using the Merge Path algorithm. This pseudocode can be extended to P threads by adding an additional parallel for loop across threads and using Merge Path to partition the input across the multiple threads. As such, each vector is responsible for merging N/(W·P) elements. A similar two-tier approach, albeit more complex, is presented in [22].
The new algorithm uses gather and scatter instructions (used for non-sequential memory operations). While past SIMD instruction sets have supported gather instructions, scatter instructions have not been widely supported. The scatter instruction enables our algorithm to write values to W independent locations and enables writing the output in a sorted manner without requiring additional data movement. The algorithm by Chhugani [7] does not benefit from scatter instructions as it writes to consecutive locations.
Note that the WHILE loop now operates on W lanes. The parameter checked in the WHILE condition is now a vector of W bits. These W-bit arrays are used to represent the status of each lane in the vector operation; these bit arrays are known as masks¹. It is through these masks that we increase the control flow and enable lane-level control granularity. Notice that masks allow us to specify whether an instruction in a specific lane should be executed, for example whether an index should be increased. Unlike the scalar algorithm, where a single index is incremented, in the new algorithm both indices are updated using the mask information (similar to Alg. 1). This doubles the number of comparisons, but is still fewer than Chhugani et al. [7].
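To illustrate how these masks map onto real instructions, the following C sketch shows one iteration of the Alg. 3 loop body written with AVX-512 intrinsics for W = 16 lanes of 32-bit keys. Variable names mirror Table II, but the code is a simplified sketch of ours (the bounds-mask updates and the surrounding loop are omitted), not the exact implementation evaluated later.

    #include <immintrin.h>
    #include <stdint.h>

    /* One iteration of the vectorized merge (cf. Alg. 3, W = 16). aInd, bInd,
     * and cInd hold per-lane indices; aEnd/bEnd flag lanes whose A/B run is
     * not yet exhausted. Both runs live in the same input array `in`. */
    static inline void merge_step_avx512(const int32_t *in, int32_t *out,
                                         __m512i *aInd, __m512i *bInd, __m512i *cInd,
                                         __mmask16 aEnd, __mmask16 bEnd)
    {
        const __m512i one  = _mm512_set1_epi32(1);
        const __m512i zero = _mm512_setzero_si512();

        /* Gather the current A and B element for every active lane. */
        __m512i aElems = _mm512_mask_i32gather_epi32(zero, aEnd, *aInd, in, 4);
        __m512i bElems = _mm512_mask_i32gather_epi32(zero, bEnd, *bInd, in, 4);

        /* cmp bit set => the lane consumes its A element this iteration. */
        __mmask16 cmp = _mm512_cmplt_epi32_mask(aElems, bElems);
        cmp = _mm512_kand(cmp, aEnd);               /* A must not be exhausted */
        cmp = _mm512_kor(cmp, _mm512_knot(bEnd));   /* if B is exhausted, take A */

        /* Select the winning element per lane and scatter it into C. */
        __m512i cElems = _mm512_mask_blend_epi32(cmp, bElems, aElems);
        _mm512_mask_i32scatter_epi32(out, _mm512_kor(aEnd, bEnd), *cInd, cElems, 4);

        /* Advance exactly one of the two input indices per lane, plus C's index. */
        *aInd = _mm512_mask_add_epi32(*aInd, cmp, *aInd, one);
        *bInd = _mm512_mask_add_epi32(*bInd, _mm512_knot(cmp), *bInd, one);
        *cInd = _mm512_add_epi32(*cInd, one);
    }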
c) Local Vectorized Merging for Small Arrays: This
algorithm uses faster sequential loads and stores coupled with
¹Masks can be manipulated either using special vector instructions or through arithmetic operations.
Algorithm 2: Vector merging algorithm for small arrays. Gather and scatter instructions are replaced with loads and stores; this can be done so long as the input is local in memory.
function Merge(array)
    Input:  Partially sorted array: array
    Output: Partially sorted array: array
    for ind ← 0; ind < subSize; ind ← ind + 1 do
        AElems[ind] ← Load_16(array + ind · 16);
        BElems[ind] ← Load_16(array + ind · 16 + subSize);
    ACount, BCount ← 0;
    for ind ← 0; ind < 2 · subSize; ind ← ind + 1 do
        cmp ← CompareLT_16(AElems[0], BElems[0]);
        cmp ← BitAnd_16(cmp, CompareLT_16(ACount, subSize));
        cmp ← BitOr_16(cmp, BitNeg_16(CompareLT_16(BCount, subSize)));
        CElems ← Blend_16(BElems[0], AElems[0], cmp);
        Store_16(array + ind · 16, CElems);
        ACount ← MaskAdd_16(ACount, cmp, 1);
        BCount ← MaskAdd_16(BCount, ¬cmp, 1);
        for ind2 ← 0; ind2 < subSize − 1; ind2 ← ind2 + 1 do
            AElems[ind2] ← Blend_16(AElems[ind2], AElems[ind2 + 1], cmp);
            BElems[ind2] ← Blend_16(BElems[ind2 + 1], BElems[ind2], cmp);
    // Finish merging with another algorithm
Algorithm 3: Branch-avoiding SIMD based merging algorithm using gather and scatter
function Merge(array, AInd, AStop, BInd, BStop)
    Input:  array: array; vectors of indices: AInd, AStop, BInd, BStop
    Output: Merged array: C
    CIndex ← Add_W(AInd, BInd);
    AEndMask, BEndMask ← 0xFFFF;                 // One bit set to 1 for each lane
    while BitOr_W(AEndMask, BEndMask) ≠ 0 do
        // Pull the elements from memory
        AElems ← Gather_W(array, AInd, AEndMask);
        BElems ← Gather_W(array, BInd, BEndMask);
        // Compare the elements
        cmp ← CompareLT_W(AElems, BElems);
        cmp ← BitAnd_W(cmp, AEndMask);
        tmp ← BitNeg_W(BEndMask);
        cmp ← BitOr_W(cmp, tmp);
        CElems ← Blend_W(BElems, AElems, cmp);
        // Store output to memory
        tmp ← BitOr_W(AEndMask, BEndMask);
        Scatter_W(C, CElems, CIndex, tmp);
        // Increment indices
        AInd ← MaskAdd_W(AInd, cmp, 1);
        cmp ← BitNeg_W(cmp);
        BInd ← MaskAdd_W(BInd, cmp, 1);
        AEndMask ← CompareGT_W(AStop, AInd);
        BEndMask ← CompareGT_W(BStop, BInd);
        CIndex ← Add_W(CIndex, 1);
non-sequentially stored sub-arrays. This makes this particular approach more beneficial for small array sizes. Consider Fig. 2 (top) and the first round of a merge sort for an unsorted array (Alg. 2). Using a sequential load, two SIMD vectors are loaded from memory. Each vector's elements are compared pairwise with the other vector's elements. The smaller elements are stored sequentially first, followed by the larger elements. This produces sub-arrays of size 2 stored in a non-sequential fashion, where each sub-array is stored with its elements offset from each other by W. Therefore, these elements eventually need to be rearranged into sequential order. This adds some overhead and makes this approach less ideal for large arrays. In addition, this algorithm suffers from the disadvantage that at sub-arrays of size N/(W·P) and above it can no longer perform at full system utilization.
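The following C sketch (ours, for illustration) shows the first round of this phase for W = 16 lanes of 32-bit keys: two vectors are loaded sequentially, compared pairwise, and written back so that each lane's smaller element lands in the first vector and its larger element in the second, producing 16 sorted pairs whose elements are W positions apart, exactly as described above.

    #include <immintrin.h>
    #include <stdint.h>

    /* First round of the small-array merge phase (cf. Fig. 2, top), W = 16.
     * Sorts 32 keys into 16 sorted pairs; pair i is stored at data[i] and
     * data[i + 16], i.e., its two elements are W apart. */
    static inline void sort_pairs_avx512(int32_t *data)
    {
        __m512i a = _mm512_loadu_si512((const void *)data);
        __m512i b = _mm512_loadu_si512((const void *)(data + 16));

        __mmask16 aLess = _mm512_cmplt_epi32_mask(a, b);
        __m512i lo = _mm512_mask_blend_epi32(aLess, b, a);  /* per-lane minimum */
        __m512i hi = _mm512_mask_blend_epi32(aLess, a, b);  /* per-lane maximum */

        _mm512_storeu_si512((void *)data, lo);
        _mm512_storeu_si512((void *)(data + 16), hi);
    }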
d) SIMD Merge Using Scatter & Gather: As an input to
the algorithm, we provide splitter (indices in the input array)
values that are each offset by the initial sub-array size. For
example, when merging sub-arrays of size 4 in the input, the
A splitters will be 0,8,16,24, and so on. The B splitters will
be 4,12,20,28, and so on. Because the sub arrays sizes are
all perfectly equal,the vector lanes are fully utilized and load-
balanced. Fig. 2 (bottom) shows the merging process for this
Fig. 3. Optimized merge sort algorithm using two different merging algorithms in three different phases: Algorithm 2 merges sub-arrays of size 1 up to size 4, Algorithm 3 merges sub-arrays of size 4 up to size N/(W·P), and Algorithm 3 with Merge Path handles the remaining log(n) levels up to the fully sorted array.
Algorithm 4: Single core merge sort algorithm
function Sort(array)
    Input:  Unsorted array: array
    Output: Sorted array: array
    for subSize ← 1; subSize < 8; subSize ← 2 · subSize do
        for i ← 0; i < N; i ← i + W · 2 · subSize do
            Algorithm2(array + i);
        // Swap pointers for C and array
    // Re-arrange sub-arrays so that they are stored sequentially
    for ; subSize < N/W; subSize ← 2 · subSize do
        for i ← 0; i < N; i ← i + W · 2 · subSize do
            x ← i;
            for ind ← 0; ind < 16; ind ← ind + 1 do
                AInd ← Set(x, ind);
                x ← x + subSize;
                AStop ← Set(x, ind);
                BInd ← Set(x, ind);
                x ← x + subSize;
                BStop ← Set(x, ind);
            Algorithm3(array + i, AInd, AStop, BInd, BStop);
        // Swap pointers for C and array
    for ; subSize < N; subSize ← 2 · subSize do
        for offset ← 0; offset < N; offset ← offset + 32 · subSize do
            AInd, AStop, BInd, BStop ← MergePath();
            Algorithm3(array, AInd, AStop, BInd, BStop);
        // Swap pointers for C and array
algorithm (Alg. 3). Since we use gather and scatter, there is no need to store more than two values (one per array) at any given time, which reduces the memory footprint. Our algorithm is also responsible for tracking the indices in A, B, and C which are used for the scatter and gather instructions. This algorithm, by default, suffers from the same disadvantage as the previous one: at sub-arrays of size N/(W·P) and above, it can no longer perform at full system utilization. However, this can be overcome by using the Merge Path algorithm to partition the input array across the multiple vector lanes (essentially replacing the role of the threads with the software-controlled data lanes).
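As an illustration of how the per-lane starting points are produced in this phase, the following C sketch (our own; the helper name is hypothetical) builds the AInd, AStop, BInd, and BStop index vectors for 16 equal-sized sub-array pairs. With base = 0 and subSize = 4 it reproduces the A splitters 0, 8, 16, 24, ... and the B splitters 4, 12, 20, 28, ... given above.

    #include <immintrin.h>
    #include <stdint.h>

    /* Build per-lane splitter vectors for 16 pairs of equal-sized sub-arrays
     * starting at element offset `base`. Lane i merges A = [aInd[i], aStop[i])
     * with B = [bInd[i], bStop[i]). */
    static inline void make_splitters(int32_t base, int32_t subSize,
                                      __m512i *aInd, __m512i *aStop,
                                      __m512i *bInd, __m512i *bStop)
    {
        const __m512i lane = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                               8, 9, 10, 11, 12, 13, 14, 15);
        /* Each lane owns a block of 2*subSize consecutive elements. */
        __m512i blockStart = _mm512_add_epi32(
            _mm512_set1_epi32(base),
            _mm512_mullo_epi32(lane, _mm512_set1_epi32(2 * subSize)));

        *aInd  = blockStart;
        *aStop = _mm512_add_epi32(blockStart, _mm512_set1_epi32(subSize));
        *bInd  = *aStop;
        *bStop = _mm512_add_epi32(blockStart, _mm512_set1_epi32(2 * subSize));
    }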
IV. SORTING ALGORITHM
A. Detailed Discussion of Parallel Merge Sort
In the first phase of the parallel algorithm, the input data is entirely unsorted and no known ordering of the data exists. As such, we start with our vectorized sequential load and store merge, Alg. 2. In Sub. Sec. IV-B we discuss additional motivation for how this phase was designed from a practical perspective. This algorithm is used to sort sub-arrays of up to four elements. As our algorithm uses vectors that are W wide, this means that we are dealing with 2·W and 4·W elements in these phases.
We determined that using sub-arrays larger than 4 elements, for example 8 elements, proved to be ineffective due to caching and other overheads associated with vector instructions, including reorganizing the sorted data, which is spread out in a lock-step manner. This data reorganization can be done using scatter instructions.

TABLE IV
Merge algorithm implementations used in analysis.

    Algorithm       Description
    Traditional     Standard merge algorithm.
    Bitonic         Implementation of [7] using SSE instructions.
    AVX-512-BA      Implementation of the branch-avoiding algorithm (Alg. 3) using Merge Path and the AVX-512 instruction set.

TABLE V
Sorting algorithm implementations used in analysis.

    Algorithm       Description
    Traditional     Iterative merge sort based on the traditional merge.
    Bitonic         Iterative merge sort based on a bitonic merge [7] using SSE instructions.
    AVX-512-BA      Implementation of Alg. 4 using AVX-512.
    IPP             SortAscend from the 2018 Intel IPP library [15].
    Quicksort       qsort from the C library (Intel Parallel Studio 2018 Cluster Edition).
In the next phase of the vectorized algorithm, Alg. 3, we merge the sub-arrays of size 4 up to sub-arrays of size N/(W·P). Note that even in this phase, the threads work entirely in an independent manner. In the first round of this algorithm, each sub-array consists of 4 elements. This is then doubled to 8 elements, 16 elements, and so forth. Further, in this phase of the algorithm we have enough sub-arrays to ensure that all lanes are always used. As the arrays are of equal length, we can ensure that each data lane receives an equal number of elements, ensuring good load-balancing.
Lastly, we finish again with Alg. 3 (red boxes in Fig. 3) using a vectorized merge. Unlike the previous phase, which used a single thread for each array, the merging process here uses multiple threads for a single array. The overhead of adding the threads is relatively small as the arrays are fairly large; this was shown to be inexpensive by the Merge Path algorithm. The Merge Path splitters determine the starting points for each vector lane's merge. This algorithm is used in an iterative manner starting with array sizes of N/(W·P) up to a fully sorted array. The entire process is summarized in Alg. 4.
B. Failed Parallel Vectorization Attempts
In our initial implementation we started off by sorting small sub-arrays of 64 elements using the non-vectorized quicksort algorithm. This was then followed by using the vectorized merge algorithm (Alg. 3 with Merge Path) for the larger arrays. Our assumption was that this phase of the algorithm was fairly small and as such would not account for a large amount of the execution time. However, profiling of the execution showed us otherwise: the time spent sorting the arrays into partitions of 64 elements was roughly 50% of the total run time. This in fact motivated the development of Alg. 2, which better utilizes the vector units and has better locality. This greatly improved the overall speedup of our new algorithm.
V. EXPERIMENTAL SETUP
In the following section we compare both our merging and sorting algorithms with several different algorithms. For the sake of simplicity, in the context of merging we focus on arrays A and B such that |A| = |B| and |C| = |A| + |B|; however, in practice this is not a constraint when merging arrays of unequal length. Table IV lists the various merging algorithms tested in this paper. Our parallel merge sort algorithm utilizes the various merging algorithms discussed in Sec. III. Table V lists the sorting algorithms that we use in our paper. We include comparisons against some of Intel's implementations, including a sort from Intel's IPP (Intel Integrated Performance Primitives [15]). We also compare against an implementation of [7]. Our analysis covers a mix of single-threaded and multi-threaded executions with scalar and vectorized implementations.
System Configuration: The experiments presented in this paper are executed on an Intel Xeon Phi 7250 processor with 96GB of DRAM memory (102.4 GB/s peak bandwidth). This processor is part of the Xeon Phi Knights Landing series of processors. In addition to the main memory, the processor has an additional 16GB of MCDRAM high bandwidth memory (400 GB/s peak bandwidth). This MCDRAM is used as our primary memory; so long as the application's memory fits into MCDRAM, the lower-latency DRAM memory is not utilized. The Intel Xeon Phi 7250 has 68 cores with 272 threads (4-way SMT). These cores run at a 1.4 GHz clock and share a 32MB L2 cache. All code is compiled with the Intel Compiler (icc, Intel Parallel Studio 2018 Cluster Edition).
Vector Instruction Set: Our algorithm is implemented using Intel's AVX-512 instruction set, which supports vectors of up to 512 bits. Specifically, our algorithm is implemented for 32-bit integers², meaning that our vector width is W = 16. Given the system parameters, P = 272, we are able to execute up to 4352 merges concurrently. Our algorithm can also be implemented for 64-bit values; this would result in a vector width of W = 8 and a reduction in the number of software threads by half. The algorithm of Chhugani et al. [7] uses Intel's SSE instruction set. SSE uses 128-bit vector units with W = 4. Recall that in Sec. II an analysis of the algorithm of [7] is given for wider vectors and that this algorithm does not scale to larger vectors.
Random Key Generation: All experiments in this paper use randomly generated numbers taken from a uniform random distribution. The range of the keys was shown to play a critical role in the performance of merging algorithms [14]. Specifically, as the number of unique keys increases, it becomes more likely that the Merge Path will be very close to the main diagonal and the branch predictor becomes less accurate, leading to lower performance. The range of the keys is from 0 to 2^i, where i ∈ {4, ..., 28}. Unless mentioned otherwise, the maximal i = 28 is used for the range.
²The algorithm can easily be changed to 64-bit keys using similar instructions designed for larger words.
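For completeness, a small C sketch of this key generation is shown below; it is illustrative only (rand() stands in for whichever uniform generator is preferred) and simply masks random values down to i bits.

    #include <stdint.h>
    #include <stdlib.h>

    /* Fill n keys uniformly at random from [0, 2^i), with i <= 28 as in our
     * experiments. Two rand() calls are combined because RAND_MAX may be as
     * small as 2^15 - 1. */
    static void fill_random_keys(int32_t *keys, size_t n, unsigned i)
    {
        uint32_t mask = (1u << i) - 1u;
        for (size_t j = 0; j < n; ++j)
            keys[j] = (int32_t)((((uint32_t)rand() << 15) ^ (uint32_t)rand()) & mask);
    }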
Fig. 4. Performance of single-core (sequential) merging. (Top) Keys merged per second (higher is better) as a function of the array size; the maximal key value was held at 2^28. (Bottom) Keys merged per second (higher is better) as a function of the key range; input arrays have 1M elements each (0.5M for A and 0.5M for B).
To test the performance of the merge algorithms, two
arrays are filled with randomly generated numbers. These are
then sorted separately. Lastly, they are merged together. Only
the merging phase is timed. To test the sort algorithms, a
single array with randomly generated keys is created. This
is followed by the sorting of the array, which is timed.
VI. PERFORMANCE ANALYSIS
Single core merging as a function of array size: Fig. 4 (top) depicts the number of keys merged per second as a function of the array size. Our new algorithm outperforms the other merging algorithms by almost 3x. The measured execution time is very small, leading to high relative overhead for array sizes below 10^3. For larger arrays these overheads disappear.
Single core merging as a function of the key range: Fig. 4 (bottom) depicts the performance of several merge algorithms using a single core, measured in keys merged per second versus the maximal key value. The abscissa is the key range and the ordinate is the number of keys merged per second (higher is better). The input size is 1M keys. Our new algorithm outperforms the standard scalar merge and the SIMD algorithm of Chhugani et al. [7] by factors of 2.8x and 2.2x, respectively. Note that for both the standard scalar and SSE [7] implementations, the performance decreases as the key range increases, due to branch mis-predictions. This matches the results of [14].
Parallel merging as a function of array size: Fig. 5 (top) depicts the number of keys merged per second as a function of the array size. The number of threads used was 64 and the maximal key value was held at 2^28. Once again, our new algorithm outperforms the other merging algorithms. Merging the smaller arrays incurs some overheads that disappear as the array sizes become substantial. Our new algorithm, at its peak, can merge up to 16 billion keys per second.
Fig. 5. Performance of parallel merging. (Top) Keys merged per second (higher is better) as a function of the array size; the maximal key value was held at 2^28 and 64 threads were used. (Bottom) Keys merged per second (higher is better) as a function of the number of threads; input arrays have 1B elements each and the maximal key value was held at 2^28. Hyperthreading is used when the number of threads is higher than 64.

Parallel merging as a function of threads: Fig. 5 (bottom) depicts the number of keys merged per second as a function of the number of threads. The array size is 1B
elements and the maximal number of unique keys was held at 2^28. Hyperthreading is used when the number of threads is higher than 64. Once again, our new algorithm outperforms the other merging algorithms (except at 128 and 256 threads, where the performance of our algorithm drops off). The reason for the drop-off at those thread counts is that our processor has 68 cores with 4-way SMT, so for 128 and 256 threads hyperthreading is used (this is a known side effect [24]). The hardware threads on a single core share the vector processing units, the memory queues for outstanding memory requests, and the caches (a 1MB L2 cache is shared between 2 cores, i.e., up to 8 threads). Our branch-avoiding algorithm uses two gather instructions per thread. That means that at 272 threads we attempt to load up to 272*2*16 = 8704 different addresses. This is a memory intensive operation which can lead to various memory problems. Furthermore, if too many memory requests are generated concurrently then the core might stall; a core has a maximal number of memory requests that it can dispatch before it is throttled back. From the cache's perspective, our algorithm uses 16x more cache lines. This can lead to contention, especially at the L1 level where the cache is fairly small, when more threads are used on a single core. Lastly, the slowdown seems to be consistent with other KNL research and is actually noted in the system's user guide [25].
Single core sorting as a function of the array size: Fig. 6 (top) depicts the performance of the sorting algorithms as a function of the array size. The maximal key value was held at 2^28. Our algorithm is faster by 2x to 3x. All of the tested algorithms decay by around 50% from an array size of 2^16 to 2^24. It seems that our algorithm has a slightly faster decay in comparison to the other algorithms, most likely due to oversaturation of the memory requests at the core granularity; KNL was not designed for this level of memory request intensity.
Fig. 6. Performance of single-core (sequential) sorting. (Top) Keys sorted per second (higher is better) as a function of the array size; the maximal key value was held at 2^28. (Bottom) Keys sorted per second (higher is better) as a function of the key range; input arrays have 2^24 elements each.
Fig. 7. Performance of parallel sorting. (Top) Keys sorted per second (higher is better) as a function of the array size; the maximal key value was held at 2^28 and 64 threads were used. (Bottom) Keys sorted per second (higher is better) as a function of the number of threads; input arrays have 2^20 elements each and the maximal key value was held at 2^28. Hyperthreading is used when the number of threads is higher than 64.
Single core sorting as a function of the key range: Fig. 6 (bottom) depicts the performance of the sorting algorithms as a function of the key range. The size of the input array is 2^24 elements. For all but Intel's IPP, the performance seems mostly independent of the key range. This is in contrast to the merging algorithms, where there was a clear reduction in performance. However, for all the algorithms, performance did decrease by about 10%. The algorithm that stands out most is Intel's IPP algorithm, which shows a decrease in performance of roughly 6x to 7x. Beyond a range of 1024 unique keys (which is fairly typical for large applications), our new algorithm outperforms Intel's IPP algorithm by roughly 2.2x.
Parallel sorting as a function of array size: Fig. 7 (top) depicts the number of elements sorted per second as a function of the array size. The number of threads used was 64 and the maximal key value was held at 2^28. This time our algorithm outperforms the others for small array sizes, but falls off for larger sizes. This is an artifact of the memory subsystem of KNL and seems to affect algorithms with certain types of memory access patterns. This issue does not seem to be an artifact of our algorithm itself, because we do not see it on the AVX-512 Intel Skylake system we have tested on (those results have been omitted for brevity).
Parallel sorting as a function of threads: Fig. 7 (bottom) depicts the number of elements sorted per second as a function of the number of threads. The array size used was 2^20 and the maximal key value was held at 2^28. Hyperthreading is used when the number of threads is higher than 64. Our new algorithm outperforms the other algorithms at larger thread counts.
VII. CONCLUSIONS
In this paper we presented a new way of approaching vectorized merging and sorting. We showed how to increase the scalability and control flow of these algorithms from P threads to P·W software threads executed on P physical hardware threads. As part of this process we showed how to greatly increase the control flow available in a single hardware thread. This was in part due to the introduction of new gather and scatter instructions, which enable loading and storing data in an efficient manner to various locations, as well as the introduction of new mask instructions. Through these masked instructions we showed how each data lane can be controlled at a finer grain than was previously possible.
From a performance perspective, our new vector based algorithm outperforms many state of the art implementations. This is thanks to the low overhead, efficiency, and balanced threading of the algorithm. Testing demonstrates that the algorithm not only surpasses the SIMD based bitonic counterpart, but that it is over 2.94x faster than a traditional merge, merging over 300M keys per second in one thread and over 16B keys per second in parallel. A full sort is over 5x faster than quicksort and 2x faster than Intel's IPP library sort, sorting over 5.3M keys per second for a single processor and over 500M keys per second in parallel, a speedup of over 2x from a traditional merge sort.
REFERENCES
[1] W. A. Martin, “Sorting,” ACM Comput. Surv., vol. 3, no. 4, pp. 147–174,
1971.
[2] C. A. R. Hoare, “Algorithm 64: Quicksort,” Commun. ACM, vol. 4, no. 7,
p. 321, 1961.
[3] K. E. Batcher, “Sorting networks and their applications,” in Proceedings
of the April 30–May 2, 1968, spring joint computer conference, 1968,
Conference Paper, pp. 307–314.
[4] N. Amato, R. Iyer, S. Sundaresan, and Y. Wu, “A comparison of parallel
sorting algorithms on different architectures,” Texas A & M University,
Report, 1998.
[5] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “Aa-sort: A
new parallel sorting algorithm for multi-core simd processors,” in
16th International Conference on Parallel Architecture and Compilation
Techniques (PACT 2007), 2007, Conference Proceedings, pp. 189–198.
[6] S. Odeh, O. Green, Z. Mwassi, O. Shmueli, and Y. Birk, “Merge Path-
Parallel Merging Made Simple,” in IEEE 26th International Parallel and
Distributed Processing Symposium Workshops & PhD Forum (IPDPSW),
2012, pp. 1611–1618.
[7] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K.
Chen, A. Baransi, S. Kumar, and P. Dubey, “Efficient implementation
of sorting on multi-core simd cpu architecture,” Proc. VLDB Endow.,
vol. 1, no. 2, pp. 1313–1324, 2008.
[8] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “A high-
performance sorting algorithm for multicore single-instruction multiple-
data processors,” Software: Practice and Experience, vol. 42, no. 6, pp.
753–777, 2012.
[9] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and
P. Dubey, “Fast sort on cpus and gpus: a case for bandwidth oblivious
simd sort,” in Proceedings of the 2010 ACM SIGMOD International
Conference on Management of data, 2010, Conference Paper, pp. 351–
362.
[10] B. Bramas, “Fast sorting algorithms using avx-512 on intel knights
landing,” CoRR, vol. abs/1704.08579, 2017.
[11] H. Inoue and K. Taura, “Simd- and cache-friendly algorithm for sorting
an array of structures,” Proc. VLDB Endow., vol. 8, no. 11, pp. 1274–
1285, 2015.
[12] T. Xiaochen, K. Rocki, and R. Suda, “Register level sort algorithm on
multi-core simd processors,” in Proceedings of the 3rd Workshop on
Irregular Applications: Architectures and Algorithms, 2013, Conference
Paper, pp. 1–8.
[13] O. Green, M. Dukhan, and R. Vuduc, “Branch-Avoiding Graph Algo-
rithms,” in 27th ACM on Symposium on Parallelism in Algorithms and
Architectures, 2015, pp. 212–223.
[14] O. Green, “When Merging and Branch Predictors Collide,” in IEEE
Fourth Workshop on Irregular Applications: Architectures and Algo-
rithms, 2014, pp. 33–40.
[15] Intel Corporation, “Intel® Integrated Performance Primitives,” 2018, version 2018.1.163. [Online]. Available: https://software.intel.com/en-us/intel-ipp
[16] S. G. Akl and N. Santoro, “Optimal parallel merging and sorting without
memory conflicts,” IEEE Transactions on Computers, vol. C-36, no. 11,
pp. 1367–1369, 1987.
[17] Y. Liu, T. Pan, O. Green, and S. Aluru, “Parallelized kendall’s tau
coefficient computation via simd vectorized sorting on many-integrated-
core processors,” arXiv:1704.03767, 2017.
[18] S. Lacey and R. Box, “A fast easy sort,” Byte Magazine, pp. 315–320,
April 1991.
[19] B. Schlegel, T. Willhalm, and W. Lehner, “Fast sorted-set intersection
using simd instructions,” in ADMS@ VLDB, 2011, Conference Proceed-
ings, pp. 1–8.
[20] E. Stehle and H.-A. Jacobsen, “A memory bandwidth-efficient hybrid
radix sort on gpus,” CoRR, vol. abs/1611.01137, 2016.
[21] A. Fog, “The microarchitecture of intel, amd and via cpus,” An optimiza-
tion guide for assembly programmers and compiler makers. Copenhagen
University College of Engineering, 2011.
[22] O. Green, R. McColl, and D. Bader., “GPU Merge Path: A GPU Merging
Algorithm,” in 26th ACM International Conference on Supercomputing,
2012, pp. 331–340.
[23] N. Deo and D. Sarkar, “Parallel algorithms for merging and sorting,” Information Sciences, vol. 56, no. 1, pp. 151–161, 1991.
[24] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights Landing: Second-generation Intel Xeon Phi product,” IEEE Micro, vol. 36, no. 2, pp. 34–46, 2016.
[25] TACC, “Stampede user guide,” 2018. [Online]. Available: https://portal.tacc.utexas.edu/user-guides/stampede2#best-known-practices-and-preliminary-observations-knl