
A Fast and Simple Approach to Merge and Merge Sort Using Wide Vector Instructions

Alex Watkins and Oded Green
Georgia Institute of Technology
{jwatkins45, ogreen}@gatech.edu
Abstract—Merging and sorting algorithms are the backbone
of many modern computer applications. As such, efficient im-
plementations are desired. Recent architectural advancements
in CPUs (Central Processing Units), such as wider and more
powerful vector instructions, allow for algorithmic improvements.
This paper presents a new approach to merge sort using
vector instructions. Traditional approaches to vectorized sorting
typically utilize a bitonic sorting network (Batcher’s Algorithm)
which adds significant overhead. Our approach eliminates the
overhead from this approach. We start with a branch-avoiding
merge algorithm and then use the Merge Path algorithm to
split up merging between the different SIMD lanes. Testing
demonstrates that the algorithm not only surpasses the SIMD
based bitonic counterpart, but that it is over 2.94x faster than
a traditional merge, merging over 300M keys per second in one
thread and over 16B keys per second in parallel. Our new sort
is over 5x faster than quicksort and 2x faster than Intel’s
IPP library sort, sorting over 5.3M keys per second for a single
processor and over 500M keys per second in parallel, a
speedup of over 2x over a traditional merge sort.
I. INTRODUCTION
Sorting algorithms play a crucial role in numerous computer
applications [1]. Database systems use sorting algorithms to
organize internal data and to present the data to the users
in a sorted format. Graph searching uses sorting to speed up
lookups. A variety of high-speed sorting algorithms have
been proposed including quicksort [2], merge sort, radix sort,
bitonic (Batcher’s Algorithm) [3], and several others [1], [4],
[5]. The time complexities of these algorithms vary, as does
their parallel scalability. Quicksort works well in the general
case but does not scale favorably as the number of threads
increases. Radix sort has a favorable efficiency but suffers
from memory bandwidth overhead. As processor architecture
changes constantly, sorting algorithms also need to be re-
evaluated. In this paper, we show a new vectorized sorting
algorithm that utilizes Intel’s AVX-512 instruction set. We
show how the Merge Path [6] concept can be vectorized
to efficiently speed up the parallel merging phases of the
algorithm (the lower phases in the merging network) as well
as how to implement a vectorized merging algorithm for the
traditional phases of the algorithm (the upper phases in the
merging network).
SIMD (Single Instruction Multiple Data) instructions can
be found in many modern architectures such as x86, ARM,
and Power processors. These instructions enable massive performance
gains by applying a single instruction to multiple
pieces of data, hence the term SIMD. Numerous sorting
implementations have been designed to take advantage of the
various SIMD instruction sets: [5], [7], [8], [9], [10], [11], [12].
Many of them introduce a significant algorithmic overhead,
making them less than optimal.
Our new algorithm utilizes Intel’s AVX-512 instruction
set taking advantage of new features: 1) scatter and gather
instructions, 2) wider instructions, and 3) a large number of
masked operations. These masked operations enable us to
emulate advanced conditional instructions and implement our
sorting algorithm using the branch-avoiding model discussed
in [13] and [14]. Through the branch-avoiding model, we are
able to increase the number of control flows in our implementation from
one control flow per thread to W software control flows per
thread. Altogether, we show that the scalability of our new
sorting algorithm is W·P, where W is the width of the
vector instruction and P is the number of processors, whereas
a typical parallel sorting algorithm only scales to P threads.
On the KNL (Intel Xeon Phi Knights Landing, described
in Sec. V) processor used in our experiments, P = 272
and W = 16. Therefore, we can effectively execute 4352
concurrent software threads.
In this paper we show the performance for both parallel sorting
as well as parallel merging. In summary, the contributions
of our algorithm are as follows:
• Using the branch-avoiding model, we show a novel way to
increase the control flow available in modern CPU processors.
• Our new vectorized merging algorithm is almost 3x faster
than a scalar merging algorithm and is approximately 2.5x
faster than the sorting network used in [7] for merging.
• Our parallel merging algorithm can merge up to 300M keys
per second using a single thread and over 16B keys per second
in parallel.
• Using a single thread, our new vectorized algorithm is 5x
faster than the standard C-lib quicksort and 2x faster than a
sort found in Intel’s IPP library [15].
• For a single thread, our algorithm sorts 5.3M keys per
second. Using the KNL system, with 272 hardware threads,
our new sort algorithm can sort over 500M keys per second.
In contrast to its scalar counterpart, our new algorithm is over
2x faster.
TABLE I
NUMBER OF VECTOR INSTRUCTIONS NEEDED IN DIFFERENT MERGE KERNELS FOR DIFFERENT VECTOR ISAS. VALUES ARE ESTIMATES GIVEN BASED ON OUR IMPLEMENTATIONS; ACTUAL COUNT MAY VARY.
Width   Bitonic (SSE)   Bitonic (AVX-512 [17])   Our Algorithm (AVX-512)
4       25              17                       10
8       50              23                       10
16      100             29                       10
II. RELATED WORK
CPU architecture changes rapidly, adding features such as increased
memory bandwidth, SIMD width, and thread counts,
as well as other lower-level hardware features such as out-of-order
execution. Numerous algorithms have been designed to
take advantage of these hardware features: [7], [4], [9], [16].
Yet, not all of these algorithms fully utilize all these hardware
features.
a) Sorting Networks: Numerous algorithms can be de-
scribed through a sorting network. In most cases, sorting
networks involve a fixed number of comparisons for a fixed
input size. This does not allow for flexible array size selection
or scaling to large array sizes due to overhead costs. One
commonly used sorting network is the bitonic sorting network
[3]. Bitonic involves a multilevel network with multiple com-
parisons at each level. The network depth is proportional to
the size of the network.
Chhugani et al. [7] produced a version of bitonic using In-
tel’s SSE instructions. Specifically, Chhugani et al. [7] showed
how to replace the merging phase of the algorithm with their
4x4 (8 input) network. Chhugani et al.’s [7] vectorized sort
is about 3.3x faster than its scalar counterpart. It is therefore
not surprising that this algorithm is a common baseline when
measuring performance. We too use it as a common baseline
and show that our new algorithm is substantially faster than
Chhugani et al.’s [7] (over 2.5x). Our algorithm avoids unnec-
essary comparisons required by the merging step. While their
algorithm has the same time complexity as our new algorithm,
the constant in their algorithm is dependent on the overhead
of the bitonic network. Our algorithm has a smaller constant.
Further, their algorithm also does not scale well for larger
SIMD widths because of the overheads. For example, a 16-
way network (16x2 input elements) is actually faster than a
32-way (32x2 input elements) network when working with
512-bit SIMD [17]. Table I demonstrates how a software based
bitonic network scales for wider instruction sets. Notice how
for larger networks the instruction count continues to grow;
partly because each new level adds additional overhead. In
contrast, our algorithm is invariant to the vector width.
b) Other Vectorized Algorithms: Another approach to
sorting using SIMD is using the AA-Sort [5] algorithm which
is similar to comb sort [18] and merge sort. The merging step
uses a similar algorithm to the Intel SSE bitonic merge [7]
and has a similar speedup of around 3.33x. Inoue et al. [11]
also present an AVX-512 based sort which is similar to
AA-Sort but updated for AVX-512. Bramas [10] presents a
modified quicksort variant using 512-bit SIMD on the KNL
Algorithm 1: Branch-Avoiding Merge - assuming 32-bit data
function Branch-Avoiding Merge(A, B)
Input: Two sorted sub-arrays: A and B
Output: Sorted array C
AIndex ← 0; BIndex ← 0; CIndex ← 0;
while (AIndex < |A| and BIndex < |B|) do
    flag ← RightShift(A[AIndex] − B[BIndex], 31);
    C[CIndex++] ← (flag)·A[AIndex] + (1 − flag)·B[BIndex];
    AIndex ← AIndex + flag;
    BIndex ← BIndex + (1 − flag);
// Copy remaining elements into C
processor. Schlegel et al. [19] show a sorted-set SIMD intersection;
however, this implementation is restricted to 8-bit
keys, making it ineffective for sorting most values. The
Merge Path concept has been used with other SIMD-based
sorting algorithms. For example, Xiaochen et al. [12] used
Merge Path for partitioning at the thread granularity and then
used a SIMD bitonic network [7] to merge the values. In contrast, our
new algorithm does the partitioning at both the thread level
and at the SIMD instruction granularity.
c) Other Sorting Algorithms: Radix sort is a non-comparison
based sorting algorithm. Radix sort suffers greatly
from high memory bandwidth demands [9]. Its irregular memory access
patterns result in poor cache utilization [9]. This means for
larger data sets, any performance gain from radix sorts is
quickly dissipated despite radix sort’s reduced computational
complexity. A hybrid radix sort proposed in [20] alleviates this
overhead.
d) Branch-Avoiding Merge: A traditional merging algo-
rithm starts each iteration with a “check for bounds” followed
by a simple IF-THEN-ELSE structure. At a high
level and ignoring the WHILE loop there are two branch
possibilities in the code: either the code block under the IF
statement is executed or the code block under the ELSE
statement is executed. The branching behavior in this case
is purely dependent on the input data itself, whereas the
WHILE loop is easily predicted. This creates a challenge for
CPU branch predictors to efficiently work as intended when
running these heavily data dependent algorithms [14]. Branch
misses can cost tens of clock cycles [21], which is costly in
comparison to the merging process which requires a small
number of instructions.
Green [14] presents a branch-avoiding merging algorithm
which uses masks and conditional-like instructions to avoid
the heavy cost of branch mis-prediction (Alg. 1). From a
performance perspective, the branch-avoiding algorithm, in
contrast to the traditional algorithm, is 1) slower when the
number of unique keys is small and the branch predictor
is accurate and 2) faster when the number of unique keys
is large and the branch predictor is inaccurate. An added
benefit of the branch-avoiding algorithm is that it removes
the speculation from the execution and makes the execution
more deterministic. With the removal of the control flow
dependency, we show in this paper 1) how to use each data
lane in a vector as its own thread (essentially each lane is
a separate software thread) and 2) how to ensure that each of
the W software threads executes the correct instructions given
that there is only a single hardware thread.
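The scalar branch-avoiding merge of Alg. 1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function name `branch_avoiding_merge` is hypothetical, the 32-bit logical right shift is emulated with a mask because Python integers are unbounded, and the sign-bit trick assumes that the difference of any two keys fits in a signed 32-bit integer.

```python
def branch_avoiding_merge(a, b):
    """Merge two sorted lists of 32-bit integers without a data-dependent IF/ELSE."""
    c = [0] * (len(a) + len(b))
    ai = bi = ci = 0
    while ai < len(a) and bi < len(b):
        # Emulate a 32-bit logical right shift by 31: flag is the sign bit of
        # a[ai] - b[bi], i.e. 1 when a[ai] < b[bi] and 0 otherwise.
        flag = ((a[ai] - b[bi]) & 0xFFFFFFFF) >> 31
        # Select the output arithmetically instead of branching on the data.
        c[ci] = flag * a[ai] + (1 - flag) * b[bi]
        ai += flag
        bi += 1 - flag
        ci += 1
    # Copy the remaining elements (at most one of the two tails is non-empty).
    c[ci:] = a[ai:] + b[bi:]
    return c
```

Only the loop bound is tested with a branch; the choice between A and B is folded into the arithmetic on `flag`, which is the property the vectorized algorithms rely on.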
TABLE II
DESCRIPTION OF IMPORTANT VARIABLES USED IN PSEUDOCODE
Variable                  Description
AIndex, BIndex, CIndex    Current index in A, B, and C, respectively.
AStop, BStop              Max index for A and B, respectively.
AEndMask, BEndMask        Mask marking whether a sub-array has elapsed in A and B, respectively.
maskA, maskB              Comparison mask for A and B, respectively.
cmp                       Comparison mask.
AElems, BElems, CElems    Vector values of the A, B, and C sub-arrays, respectively.
N, |A|, |B|, |C|          Array size.
P                         Number of threads.
W                         SIMD width.
TABLE III
SUBSET OF SIMD INSTRUCTIONS USED BY OUR MERGING AND SORTING ALGORITHMS. W DENOTES THE WIDTH OF THE VECTOR INSTRUCTION.
Instruction                              Description
Gather_W(src*, indices, mask)            Loads data from multiple memory locations if mask bit is set.
Scatter_W(dest*, elems, indices, mask)   Stores data at multiple memory locations if mask bit is set.
Load_W(src*)                             Loads a sequential segment of memory.
Store_W(dest*, elems)                    Stores a sequential segment of values to memory.
CompareLT_W(A, B)                        Pairwise SIMD vector comparison (less than).
CompareGT_W(A, B)                        Pairwise SIMD vector comparison (greater than).
Blend_W(A, B, mask)                      Selectively pulls elements from vector B if the mask bit is set and from vector A otherwise.
Add_W(A, value)                          Adds a value to each element in the given A vector.
Add_W(A, B)                              Adds each lane in A to each lane in B.
MaskAdd_W(A, mask, value)                Adds a value to each element in A if the corresponding mask bit is set.
BitAnd_W(maskA, maskB)                   Bitwise and for masks.
BitOr_W(maskA, maskB)                    Bitwise or for masks.
BitNeg_W(mask)                           Flips bits on mask.
Set_W(value, index)                      Sets the lane at the index to value.
III. MERGING ALGORITHMS
In this section we present our new vectorized merging
algorithms. First we present Merge Path [22], [6], which
provides a way of parallel merging of two sorted sub-arrays.
The Merge Path concept is used in both our parallel merging
and parallel sorting. We note, Merge Path was developed
originally for the final phases of a sort, where the number of
threads is greater than the number of sorted arrays. The first
algorithm we present in this section will cover this extension.
The later algorithms will use vector units based on the branch-
avoiding methodology [13], [14] to execute different (yet
concurrent) merges in each lane of the vector unit. These
merging algorithms can also be used in the initial phases of
the sorting where a single unsorted array is sorted by a single
thread (sequentially), significantly increasing throughput.
a) Merge Path: One of the key challenges for any
parallel algorithm is to utilize the entire system throughout the
entire execution. This is also true for sorting. For a parallel
merge sort algorithm, the problem of workload imbalance
and system utilization becomes an issue when the number
of sorted arrays is smaller than the number of processors
available. Several PRAM algorithms with partitioning schemes
have been developed. Some of these gave perfect balance while
others did not [16], [8]. The Merge Path concept highlights
a way to get perfect load-balancing while ensuring that the
algorithm is considered work efficient (under the assumption
that P < N). Merge Path has been implemented successfully
           B[1]  B[2]  B[3]  B[4]  B[5]  B[6]  B[7]  B[8]
              1     2     3     5     8     9    10    12
A[1]   5      1     1     1     0     0     0     0     0
A[2]   6      1     1     1     1     0     0     0     0
A[3]   7      1     1     1     1     0     0     0     0
A[4]  11      1     1     1     1     1     1     1     0
A[5]  13      1     1     1     1     1     1     1     1
A[6]  14      1     1     1     1     1     1     1     1
A[7]  15      1     1     1     1     1     1     1     1
A[8]  16      1     1     1     1     1     1     1     1
Fig. 1. Merge Path matrix showing intersection lines and points
for both CPU [6] and for the GPU [22]. While Merge Path is
similar to [23], Merge Path offers a more visual and intuitive
explanation to the parallel merging process.
The following is a high-level overview of Merge Path. The
reader is referred to [6] and [22] for additional details. Given
two sorted arrays, place one array vertically (A) and one array
horizontally (B) so that a grid forms in between the two arrays
as seen in Fig. 1. Note that the arrays do not have to be of equal
length; we use equal length arrays for simplicity.
Starting from the top-left corner, elements from each array
are compared one by one. If the compared element from B
is smaller than the element from A, that element is copied into
the output array C, and the path moves to the right. If the
element in A is smaller than or equal to the element in B, then
that element is copied into C and the path moves down by
one position. The merging is finalized once the path reaches
the bottom-right corner. Given the Merge Path, we can see the
order in which a merging algorithm will copy elements from
the two input arrays into the output array. While the exact path
is not known until runtime, Merge Path offers a visual way
to see the sequence of events and shows that the partitioning mechanism
ensures threads receive an equal amount of work.
Consider the cross diagonals drawn in Fig. 1. Notice that
the distance, in blocks, from the top-left corner to any
point on a given cross diagonal is the same. Now notice that
the black dots, which are the intersections
of the path with the cross diagonals, are also equally spaced
along the path. Finding the intersection between a cross
diagonal and the path requires a binary search on the
cross diagonal, comparing two adjacent elements on the
cross diagonal at each step (additional details in [6]). And while we do
not know the actual path between two of the dots, notice that
the total number of blocks traversed going down and to the right
between consecutive dots is equal.
Therefore, by finding these dots, it is possible to partition the
merge across multiple threads that work on independent sections
of the array. Finding these dots requires O(log(N)) time per
processor while the merging itself requires O(N/P) time per
processor; therefore, when P < O(N/log(N)) this is considered
optimal.
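The cross-diagonal search can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the implementation from [6]: the names `merge_path_split`, `partition`, and `partitioned_merge` are hypothetical, ties are resolved by taking the element of A first (as in the path description above), and the standard-library `heapq.merge` stands in for whichever merge kernel each thread or lane would actually run.

```python
import heapq


def merge_path_split(a, b, diag):
    """Return (i, j) with i + j == diag: the point where the merge path
    crosses cross diagonal `diag`. i elements of A and j elements of B
    precede that point in the merged output."""
    lo = max(0, diag - len(b))
    hi = min(diag, len(a))
    while lo < hi:  # binary search along the cross diagonal
        i = (lo + hi) // 2
        if a[i] <= b[diag - i - 1]:
            lo = i + 1  # a[i] is output before b[diag - i - 1]: take more of A
        else:
            hi = i
    return lo, diag - lo


def partition(a, b, parts):
    """Split points for `parts` equally sized, independent merge segments."""
    total = len(a) + len(b)
    return [merge_path_split(a, b, k * total // parts) for k in range(parts + 1)]


def partitioned_merge(a, b, parts):
    """Merge each segment independently and concatenate the results."""
    cuts = partition(a, b, parts)
    out = []
    for (a0, b0), (a1, b1) in zip(cuts, cuts[1:]):
        out.extend(heapq.merge(a[a0:a1], b[b0:b1]))
    return out
```

For the arrays of Fig. 1 and four partitions, `partition` returns the split points (0, 0), (1, 3), (3, 5), (4, 8), (8, 8), so every segment produces exactly four output elements.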
b) Vectorized Merge Path: Algorithm 3 presents an outline
of our new vectorized algorithm. Similar to the scalar
version of Merge Path, our inputs are two different arrays
A and B. Unlike the scalar version which uses one starting
point per thread, our new algorithm requires W starting points
per hardware thread, one for each lane within the vector.
The starting points are found using the Merge Path algorithm.
This pseudocode can be extended for P threads by adding
an additional parallel for loop across threads and Merge Path
to partition the input across multiple threads. As such, each
vector lane is responsible for merging N/(W·P) elements. A similar
two tier approach, albeit more complex, is presented in [22].
[Figure 2, panel (a): memory addresses A0[0] A1[0] A2[0] A3[0] B0[0] B1[0] B2[0] B3[0] are read into two SIMD vectors with Load(A) and Load(B), followed by Compare(A,B), Blend(A,B, mask), Store(A), and Store(B); the output memory addresses hold C0[0] C1[0] C2[0] C3[0] C0[1] C1[1] C2[1] C3[1].]
[Figure 2, panel (b): memory addresses A0[0] B0[0] A1[0] B1[0] A2[0] B2[0] A3[0] B3[0] are read into two SIMD vectors with Gather(A) and Gather(B), followed by Compare(A,B), Blend(A,B, mask), Scatter(A), and Scatter(B); the output memory addresses hold C0[0] C1[0] C2[0] C3[0] C0[1] C1[1] C2[1] C3[1].]
Fig. 2. (Top) Vectorized merging for small arrays with load and store
instructions instead of gather and scatter, W = 4. Great for merging arrays
of size two or four elements. (Bottom) SIMD merge using gather and scatter.
For both (Top) and (Bottom) the C array is not sorted. An additional variant
for the bottom figure also exists such that the output is sorted.
The new algorithm uses gather and scatter instructions (used
for non-sequential memory operations). While past SIMD
instruction sets have supported gather instructions, scatter
instructions have not been widely supported. The scatter
instruction enables our algorithm to write values to W independent
locations and enables writing the output in a sorted manner
without requiring additional data movement. The algorithm
by Chhugani [7] does not benefit from scatter instructions as
it writes to consecutive locations.
Note that the WHILE loop now operates on W lanes. The
parameter checked in the WHILE condition is now a vector
of W bits. These W-bit arrays are used to represent the status
of each lane in the vector operation; these bit arrays are known
as masks¹. It is through these masks that we increase the
number of control flows and enable lane-level control granularity. Notice
that masks allow us to specify if an instruction in a specific
lane should be executed, for example if an index should be
increased. Unlike the scalar algorithm where a single index is
incremented, in the new algorithm both indices are updated
using the mask info (similar to Alg. 1). This doubles the
number of comparisons, but is still fewer than Chhugani et al.
[7].
c) Local Vectorized Merging for Small Arrays: This
algorithm uses faster sequential loads and stores coupled with
¹ Masks can be manipulated either using special vector instructions or
through arithmetic operations.
Algorithm 2: Vector merging algorithm for small arrays. Gather and scatter instructions are replaced with loads and stores - this can be done so long as the input is local in the memory.
function Merge(array)
Input: Partially sorted array: array
Output: Partially sorted array: array
for ind ← 0; ind < subSize; ind ← ind + 1 do
    AElems[ind] ← Load_16(array + ind · 16);
    BElems[ind] ← Load_16(array + ind · 16 + subSize);
ACount, BCount ← 0;
for ind ← 0; ind < 2·subSize; ind ← ind + 1 do
    cmp ← CompareLT(AElems[0], BElems[0]);
    cmp ← BitAnd_16(cmp, CompareLT(ACount, subSize));
    cmp ← BitOr_16(cmp, CompareLT(BCount, subSize));
    CElems ← Blend_16(BElems[0], AElems[0], cmp);
    Store_16(array + ind · 16, CElems);
    ACount ← MaskAdd_16(ACount, cmp, 1);
    BCount ← MaskAdd_16(BCount, ¬cmp, 1);
    for ind ← 0; ind < subArraySize − 1; ind++ do
        AElems[ind] ← Blend_16(AElems[ind], AElems[ind + 1], cmp);
        BElems[ind] ← Blend_16(BElems[ind + 1], BElems[ind], cmp);
// Finish merging with another algorithm
Algorithm 3: Branch-avoiding SIMD based merging algorithm using gather and scatter
function Merge(array, AInd, AStop, BInd, BStop)
Input: array: array, vector indexes: AInd, AStop, BInd, BStop
Output: Merged array: C
CIndex ← Add_W(AInd, BInd);
AEndMask, BEndMask ← 0xFFFF; ▷ One bit set to 1 for each lane
while BitOr_W(AEndMask, BEndMask) ≠ 0 do
    ▷ Pull the elements from memory
    AElems ← Gather_W(array, AInd, AEndMask);
    BElems ← Gather_W(array, BInd, BEndMask);
    ▷ Compare the elements
    cmp ← CompareLT_W(AElems, BElems);
    cmp ← BitAnd_W(cmp, AEndMask);
    tmp ← BitNeg(BEndMask);
    cmp ← BitOr_W(cmp, tmp);
    CElems ← Blend_W(BElems, AElems, cmp);
    ▷ Store output to memory
    tmp ← BitOr_W(AEndMask, BEndMask);
    Scatter_W(C, CElems, CIndex, tmp);
    ▷ Increment indices
    AInd ← MaskAdd_W(AInd, 1, cmp);
    cmp ← BitNeg(cmp);
    BInd ← MaskAdd_W(BInd, 1, cmp);
    AEndMask ← CompareGT_W(AStop, AInd);
    BEndMask ← CompareGT_W(BStop, BInd);
    CIndex ← Add_W(CIndex, 1);
non-sequentially stored sub-arrays. This makes this particular
approach more beneficial for small array sizes. Consider Fig. 2
(top) and the first round of a merge sort for an unsorted array
(Alg. 2). Using a sequential load, two SIMD vectors are loaded
from memory. Each vector’s elements are compared pairwise
with the other vector’s elements. The smaller elements are
stored sequentially first, followed by the larger elements.
This produces sub-arrays of size 2 stored in a non-sequential
fashion where each sub-array is stored with its elements offset
from each other by W. Therefore, these elements need to be
rearranged so that each sub-array is stored sequentially. This
adds some overhead and makes this approach
less ideal for large arrays. In addition, this algorithm suffers
from the disadvantage that at sub-arrays of size N/(W·P) and
above it can no longer perform at full system utilization.
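The first round described above, and the strided layout it produces, can be emulated lane by lane in Python. This sketch is only an illustration under stated assumptions: Python lists stand in for SIMD vectors, list comprehensions stand in for the pairwise compare and blend instructions, and the helper names `first_round` and `to_sequential` are hypothetical.

```python
def first_round(block, w):
    """Emulate one load/compare/blend/store round on 2*w unsorted keys."""
    a, b = block[:w], block[w:2 * w]           # two sequential "vector loads"
    cmp = [x < y for x, y in zip(a, b)]        # pairwise less-than mask
    smaller = [x if m else y for x, y, m in zip(a, b, cmp)]  # blend: minima
    larger = [y if m else x for x, y, m in zip(a, b, cmp)]   # blend: maxima
    # Minima are stored first, then maxima: lane k's sorted pair now sits
    # at positions k and k + w, i.e. its elements are offset by w.
    return smaller + larger


def to_sequential(strided, w, sub_size):
    """Rearrange sub-arrays whose elements are w apart into contiguous runs."""
    return [strided[lane + k * w] for lane in range(w) for k in range(sub_size)]
```

With W = 4 and the input 7, 2, 9, 4, 1, 8, 3, 6, the round stores 1, 2, 3, 4, 7, 8, 9, 6, and rearranging yields the contiguous sorted pairs (1, 7), (2, 8), (3, 9), (4, 6).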
d) SIMD Merge Using Scatter & Gather: As an input to
the algorithm, we provide splitter (indices in the input array)
values that are each offset by the initial sub-array size. For
example, when merging sub-arrays of size 4 in the input, the
A splitters will be 0,8,16,24, and so on. The B splitters will
be 4,12,20,28, and so on. Because the sub-array sizes are
all equal, the vector lanes are fully utilized and load-balanced.
Fig. 2 (bottom) shows the merging process for this
[Figure 3: merge tree with log(n) levels, running from unsorted data (array of size n, split into arrays of size 1) to sorted data. The levels are annotated with the merge kernel used: Algorithm 2 for the smallest sub-arrays, Algorithm 3 for the intermediate levels, and Algorithm 3 with Merge Path for the final levels.]
Fig. 3. Optimized merge sort algorithm using two different merging
algorithms in three different phases.
Algorithm 4: Single core merge sort algorithm
function Sort(array)
Input: Unsorted array: array
Output: Sorted array: array
for subSize ← 1; subSize < 8; subSize ← 2·subSize do
    for i ← 0; i < N; i ← i + W·2·subSize do
        Algorithm2(array + i);
    // Swap pointers for C and array
// Re-arrange sub-arrays so that they are stored sequentially
for ; subSize < N/W; subSize ← 2·subSize do
    for i ← 0; i < n; i ← i + W·2·subSize do
        for ind ← 0; ind < 16; ind ← ind + 1 do
            AInd ← Set(x, ind);
            x ← x + subSize;
            AStop ← Set(x, ind);
            BInd ← Set(x, ind);
            x ← x + subSize;
            BStop ← Set(x, ind);
        Algorithm3(array + i, AInd, AStop, BInd, BStop);
    // Swap pointers for C and array
for ; subSize < N; subSize ← 2·subSize do
    for offset ← 0; offset < n; offset ← offset + 32·subSize do
        AInd, BInd, AStop, BStop ← MergePath();
        Algorithm3(array, AInd, BInd, AStop, BStop);
    // Swap pointers for C and array
algorithm (Alg. 3). Since we use gather and scatter, there is
no need to store more than two values (one per array) at any
given time, which reduces the memory footprint. Our algorithm
is also responsible for tracking the indices in A, B, and C
which are used for the scatter and gather instructions. This
algorithm, by default, suffers from the same disadvantage as
the previous one: at sub-arrays of size N/(W·P) and above,
it can no longer perform at full system utilization. However,
this can be overcome by using the Merge Path algorithm
for partitioning the input array to the multiple vector lanes
(essentially replacing the roles of the threads with the software
controlled data lanes).
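The masked gather/compare/blend/scatter loop of Alg. 3 can be emulated in Python to make the role of the masks concrete. This is a sketch under several assumptions, not the authors' AVX-512 code: lists of booleans stand in for mask registers, the helpers `gather`, `scatter`, `simd_merge`, and `merge_pass` are hypothetical names, each lane's output is assumed to start where its A run starts (the A and B runs of a lane are adjacent), and the input length is assumed to be a multiple of 2·W·sub_size.

```python
def gather(src, idx, mask):
    """Masked gather: read src[idx[l]] only in lanes whose mask bit is set."""
    return [src[i] if m else 0 for i, m in zip(idx, mask)]


def scatter(dst, vals, idx, mask):
    """Masked scatter: write vals[l] to dst[idx[l]] only in active lanes."""
    for v, i, m in zip(vals, idx, mask):
        if m:
            dst[i] = v


def simd_merge(src, dst, a_ind, a_stop, b_ind, b_stop):
    """Each lane merges src[a_ind:a_stop] with src[b_ind:b_stop] into dst."""
    a_ind, b_ind = list(a_ind), list(b_ind)
    c_ind = list(a_ind)  # assumption: output run begins at the lane's A run
    a_mask = [s > i for s, i in zip(a_stop, a_ind)]
    b_mask = [s > i for s, i in zip(b_stop, b_ind)]
    while any(a or b for a, b in zip(a_mask, b_mask)):
        a_el = gather(src, a_ind, a_mask)
        b_el = gather(src, b_ind, b_mask)
        cmp = [x < y for x, y in zip(a_el, b_el)]
        cmp = [c and a for c, a in zip(cmp, a_mask)]     # never take an exhausted A
        cmp = [c or not b for c, b in zip(cmp, b_mask)]  # take A once B is exhausted
        c_el = [x if c else y for x, y, c in zip(a_el, b_el, cmp)]
        live = [a or b for a, b in zip(a_mask, b_mask)]
        scatter(dst, c_el, c_ind, live)
        # Both index vectors are updated on every iteration using the mask.
        a_ind = [i + 1 if c else i for i, c in zip(a_ind, cmp)]
        b_ind = [i if c else i + 1 for i, c in zip(b_ind, cmp)]
        a_mask = [s > i for s, i in zip(a_stop, a_ind)]
        b_mask = [s > i for s, i in zip(b_stop, b_ind)]
        c_ind = [i + 1 for i in c_ind]


def merge_pass(src, sub_size, w):
    """One merge level: w lanes at a time, splitters offset by the sub-array size."""
    dst = list(src)
    step = 2 * sub_size
    for base in range(0, len(src), w * step):
        a_ind = [base + lane * step for lane in range(w)]
        b_ind = [a + sub_size for a in a_ind]
        b_stop = [b + sub_size for b in b_ind]
        simd_merge(src, dst, a_ind, b_ind, b_ind, b_stop)
    return dst
```

For sub-arrays of size 4 and W = 4, `merge_pass` generates the A splitters 0, 8, 16, 24 and the B splitters 4, 12, 20, 28, matching the example in the text.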
IV. SORTING ALGORITHM
A. Detailed discussion of Parallel Merge Sort
In the first phase of the parallel algorithm, the input data is
entirely unsorted and no known ordering of the data exists.
As such we start with our vectorized sequential load and
store merge, Alg. 2. In Sub. Sec. IV-B we discuss additional
motivation of how this phase was designed from a practical
perspective. This algorithm is used to sort up to sub-arrays of
size four elements. As our algorithm uses vectors that are W
wide, this means that we are dealing with 2·W and 4·W
elements in these phases.
We determined that using arrays larger than 4, for example
8 elements, proved to be ineffective due to caching and
TABLE IV
MERGE ALGORITHMS IMPLEMENTATIONS USED IN ANALYSIS.
Algorithm     Description
Traditional   Standard merge algorithm.
Bitonic       Implementation of [7] using SSE instructions.
AVX-512-BA    Implementation of branch avoiding algorithm (Alg. 3) using merge path and the AVX-512 instruction set.
TABLE V
SORTING ALGORITHM IMPLEMENTATIONS USED IN ANALYSIS.
Algorithm     Description
Traditional   Iterative merge sort based on traditional merge.
Bitonic       Iterative merge sort based on a bitonic merge [7] using SSE instructions.
AVX-512-BA    Implementation of Alg. 4 using AVX-512.
IPP           SortAscend from the 2018 Intel IPP library [15].
Quicksort     qsort from the C library (Intel Parallel Studio 2018 Cluster Edition).
other overheads associated with vector instructions - including
reorganizing the sorted data which is spread out in a lock-step
manner. The data reorganization can be done using scatter
instructions.
In the next phase of the vector algorithm, Alg. 3, we merge
the sub-arrays of size 4 up to sub-arrays of size N/(W·P).
Note that even in this phase, the threads work entirely in an
independent manner. In the first round of this algorithm, each
sub-array will consist of 4 elements. This will then be doubled
to 8 elements, 16 elements, and so forth. Further, in this phase
of the algorithm we have enough sub-arrays to also ensure that
all lanes are always used. As the arrays are of equal length
we can ensure that each data lane receives an equal number
of elements, ensuring good load-balancing.
Lastly, we finish again with Alg. 3 (red boxes in Fig. 3)
using a vectorized merge. Unlike the previous phase which
used a single thread for each array, the merging process here
will use multiple threads for a single array. The overhead of
adding the threads is relatively small as the arrays are fairly
large; this was shown to be inexpensive by the Merge Path
algorithm. The Merge Path splitters will determine the starting
points for each vector lane’s merge. This algorithm will be
used in an iterative manner starting with array sizes of N/(W·P)
up to a fully sorted array. This entire process is summed up
in Alg. 4.
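The three phases can be summarized as a schedule that assigns a merge kernel to each level of the merge tree. The sketch below follows the thresholds stated in the prose of this section (Alg. 2 until sub-arrays reach size 4, Alg. 3 until they reach N/(W·P), then Alg. 3 with Merge Path); the function `merge_schedule` and its labels are hypothetical, and it assumes that N is a power of two. Note that the loop bound printed in Alg. 4 for the first phase differs slightly from the prose, so the exact switch-over point should be treated as an assumption.

```python
def merge_schedule(n, w, p):
    """Return (input sub-array size, kernel label) for every merge level."""
    levels = []
    sub = 1
    while sub < n:
        if sub < 4:
            kernel = "Alg. 2 (load/store)"        # builds sub-arrays of size 2 and 4
        elif sub < n // (w * p):
            kernel = "Alg. 3 (gather/scatter)"    # one merge per vector lane
        else:
            kernel = "Alg. 3 with Merge Path"     # lanes share a single merge
        levels.append((sub, kernel))
        sub *= 2
    return levels
```

For example, with n = 2^20 keys, W = 16, and a single thread, the schedule has 20 levels: 2 handled by Alg. 2, 14 by Alg. 3, and the last 4 by Alg. 3 with Merge Path.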
B. Failed Parallel Vectorization Attempts
In our initial implementation we started off by sorting small
sub-arrays of size 64 elements using the non-vectorized quicksort
algorithm. This was then followed by using the vectorized
merge algorithm (Alg. 3 with MP) for the larger arrays. Our
assumption was that this phase of the algorithm was fairly
small and as such would not account for a large amount of
the execution time. However, profiling of the execution showed
us otherwise: the time spent sorting the arrays into partitions
of 64 elements was roughly 50% of the total run time. This in
fact motivated the development of Alg. 2, which better utilizes
the vector units and has better locality. This greatly improved
the overall speedup of our new algorithm.
V. EXPERIMENTAL SETUP
In the following section we compare both our merging and
sorting algorithms with several different algorithms. For the
sake of simplicity, in the context of merging we focus on
arrays A and B such that |A| = |B| and |C| = |A| + |B|;
however, in practice this is not a constraint, and arrays
of unequal length can also be merged. Table IV lists the various
merging algorithms tested in this paper. Our parallel merge sort
algorithm utilizes the various merging algorithms discussed
in Sec. III. Table V lists the sorting algorithms that we
use in our paper. We include comparisons against some of
Intel’s implementations, including a sort from Intel’s IPP (Intel
Integrated Performance Primitives [15]). We also compare
against an implementation of [7]. Our analysis covers a mix
of single threaded and multi-threaded executions with scalar
and vectorized implementations.
System Configuration: The experiments presented in
this paper are executed on an Intel Xeon Phi 7250 processor
with 96GB of DRAM memory (102.4 GB/s peak bandwidth).
This processor is part of the Xeon Phi Knights Landing series
of processors. In addition to the main memory, the processor
has an additional 16GB of MCDRAM high bandwidth memory
(400 GB/s peak bandwidth). This MCDRAM is used as
our primary memory and so long as the application’s memory
fits into MCDRAM the lower latency DRAM memory is not
utilized. The Intel Xeon Phi 7250 has 68 cores with 272
threads (4-way SMT). These cores run at a 1.4 GHz clock and
share a 32MB L2 cache. All code is compiled with the Intel
Compiler (icc, Intel Parallel Studio 2018 Cluster Edition).
Vector Instruction Set: Our algorithm is implemented
using Intel’s AVX-512 instruction set which supports vectors
of up to 512 bits. Specifically, our algorithm is implemented
for 32-bit integers², meaning that our vector width is W = 16.
Given the system parameters, P = 272, we are able to execute
up to 4352 merges concurrently. Our algorithm can also be
implemented for 64-bit values; this would result in a vector
width of W = 8 and a reduction in the number of software
threads by half. The algorithm of Chhugani et al. [7] uses
Intel’s SSE instruction set. SSE uses 128-bit vector units with
W = 4. Recall that in Sec. II an analysis of the algorithm of
[7] is given for wider vectors and that this algorithm does not
scale to larger vectors.
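The lane counts quoted above follow from dividing the vector width by the key width. The short sketch below only reproduces that arithmetic; the function name `software_threads` is a hypothetical helper introduced for illustration and is not part of the paper's implementation.

```python
def software_threads(hw_threads, vector_bits, key_bits):
    """Return (W, P * W): lanes per vector and concurrent lane-level merges.

    hw_threads  -- P, the number of hardware threads
    vector_bits -- width of one vector register in bits
    key_bits    -- width of one key in bits
    """
    lanes = vector_bits // key_bits
    return lanes, hw_threads * lanes


# AVX-512 with 32-bit keys on 272 hardware threads, as in the setup above.
print(software_threads(272, 512, 32))  # (16, 4352)
# The same vectors with 64-bit keys: half as many lanes and software threads.
print(software_threads(272, 512, 64))  # (8, 2176)
```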
Random Key Generation: All experiments in this paper
use randomly generated numbers taken from a uniform
random distribution. The range of the keys was shown to play
a critical role in the performance of merging algorithms [14].
Specifically, as the number of unique keys increases it becomes
more likely that the Merge Path will be very close to the
main diagonal and the branch predictor becomes less accurate
leading to lower performance. The range of the keys is from 0
to $2^i$ where $i \in \{4, \ldots, 28\}$. Unless mentioned otherwise, the
maximal $i = 28$ is used for specifying the range.
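The relationship between the key range and the Merge Path can be observed with a small experiment. The sketch below is an illustration under stated assumptions, not the paper's benchmark code: keys are drawn from the half-open range [0, 2^bits), ties are resolved in favor of the first array, and the helper names `sorted_keys` and `max_diagonal_deviation` are hypothetical. It walks a scalar merge of two equal-length sorted arrays and reports how far the path strays from the main diagonal.

```python
import random


def sorted_keys(n, bits, rng):
    """Draw n keys uniformly from [0, 2**bits) and return them sorted."""
    return sorted(rng.randrange(1 << bits) for _ in range(n))


def max_diagonal_deviation(a, b):
    """Walk the merge of sorted lists a and b; return the largest |i - j|.

    (i, j) counts how many elements of a and b have been consumed so far,
    so |i - j| measures the distance of the Merge Path from the main diagonal.
    """
    i = j = 0
    worst = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:      # ties go to a (an assumption of this sketch)
            i += 1
        else:
            j += 1
        worst = max(worst, abs(i - j))
    return worst


rng = random.Random(0)
n = 1 << 14
for bits in (4, 16, 28):
    a, b = sorted_keys(n, bits, rng), sorted_keys(n, bits, rng)
    print(f"key range 2^{bits:<2d}: max |i - j| = {max_diagonal_deviation(a, b)}")
```

With a narrow key range the inputs contain long runs of equal keys, so the path moves along one array for many steps before switching; with a wide range the path alternates between the arrays far more often, which is the case that is hard on a branch predictor.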
² The algorithm can easily be changed to 64 bits using similar instructions
designed for large words.
[Figure: Sequential Merging]
Fig. 4. Performance of single core merging. (top) shows the results in keys
merged per second (higher is better) as a function of the array size. Maximal
number of keys was held at $2^{28}$. (bottom) shows the results in keys merged
per second (higher is better) as a function of the key range. The input has
1M elements in total (0.5M for A and 0.5M for B).
To test the performance of the merge algorithms, two
arrays are filled with randomly generated numbers. These are
then sorted separately. Lastly, they are merged together. Only
the merging phase is timed. To test the sort algorithms, a
single array with randomly generated keys is created. The
array is then sorted, and only the sorting phase is timed.
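The measurement procedure described above can be sketched as follows. This is a minimal illustration of the methodology only: it uses Python's `heapq.merge` and `list.sort` as stand-ins for the algorithms under test, and the function names, default key width, and seeds are assumptions made for the example.

```python
import heapq
import random
import time


def timed_merge(n, key_bits=28, seed=0):
    """Build two sorted arrays of n random keys; time only the merge."""
    rng = random.Random(seed)
    a = sorted(rng.randrange(1 << key_bits) for _ in range(n))  # untimed setup
    b = sorted(rng.randrange(1 << key_bits) for _ in range(n))  # untimed setup
    start = time.perf_counter()
    c = list(heapq.merge(a, b))                                 # timed region
    return c, time.perf_counter() - start


def timed_sort(n, key_bits=28, seed=0):
    """Build one array of n random keys; time only the sort."""
    rng = random.Random(seed)
    data = [rng.randrange(1 << key_bits) for _ in range(n)]     # untimed setup
    start = time.perf_counter()
    data.sort()                                                 # timed region
    return data, time.perf_counter() - start


merged, t_merge = timed_merge(1 << 14)
ordered, t_sort = timed_sort(1 << 14)
print(f"merge: {len(merged) / t_merge:.3e} keys/s")
print(f"sort:  {len(ordered) / t_sort:.3e} keys/s")
```

Keeping key generation and the initial sorts outside the timed region matters most for small inputs, where setup cost would otherwise dominate the reported rate.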
VI. PERFORMANCE ANALYSIS
Single core merging as a function of array size: Fig.
4 (top) depicts the number of keys merged per second as a
function of the array size. Our new algorithm outperforms
the other merging algorithms by almost 3x. For array sizes
below $10^3$ the measured execution time is very small, so
fixed overheads dominate; for larger arrays these overheads
disappear.
Single core merging as a function of the key range:
Fig. 4 (bottom) depicts the performance of several merge
algorithms using a single core, in keys merged per second,
as a function of the key range. The abscissa is the key
range and the ordinate is the number of keys merged
per second (higher is better). The input size is 1M keys.
Our new algorithm outperforms the standard scalar merge and
the SIMD algorithm of Chhugani et al. [7] by factors of
2.8x and 2.2x, respectively. Note that for both the standard scalar
and SSE [7] implementations, the performance decreases as
the key range increases, due to branch mispredictions. This
matches the results of [14].
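To make the role of the data-dependent branch concrete, the following scalar sketch replaces the `if`/`else` of a standard merge with arithmetic on the comparison result. It is an illustration in the spirit of a branch-avoiding merge and is not the paper's AVX-512 implementation; Python does not expose branch prediction, so the code shows only the data flow, and the name `branch_avoiding_merge` is chosen for this example.

```python
def branch_avoiding_merge(a, b):
    """Merge two sorted lists of integers without a data-dependent if/else.

    The comparison result is used as a 0/1 value that selects the output
    element and advances exactly one of the two read indices.
    """
    n, m = len(a), len(b)
    out = [0] * (n + m)
    i = j = k = 0
    while i < n and j < m:                 # loop bound only; highly predictable
        take_a = int(a[i] <= b[j])         # 1 if the next output comes from a
        out[k] = take_a * a[i] + (1 - take_a) * b[j]
        i += take_a                        # advance a only when it was taken
        j += 1 - take_a                    # otherwise advance b
        k += 1
    out[k:] = a[i:] + b[j:]                # copy whichever tail remains
    return out


print(branch_avoiding_merge([1, 3, 5], [2, 4, 6]))  # [1, 2, 3, 4, 5, 6]
```

Because every iteration executes the same instructions regardless of the comparison outcome, the cost per output element no longer depends on how predictable the key order is, which is consistent with the flat curve reported for the new algorithm.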
Parallel merging as a function of array size: Fig. 5
(top) depicts the number of elements merged per second as a
function of the array size. The number of threads used was 64
and the maximal number of keys was held at $2^{28}$. Once again,
our new algorithm outperforms the other merging algorithms.
Merging the smaller arrays has some overheads that disappear
as the array sizes become substantial. At its peak, our new
algorithm can merge up to 16 billion keys per second.
[Figure: Parallel Merging]
Fig. 5. Performance of parallel merging. (top) shows the results in elements
merged per second (higher is better) as a function of the array size. Maximal
number of keys was held at $2^{28}$ and 64 threads were used. (bottom) shows
the results in keys merged per second (higher is better) as a function of the
number of threads. Input arrays have 1B elements each and maximal number
of keys was held at $2^{28}$. Hyperthreading is used when the number of threads
is higher than 64.
Parallel merging as a function of threads: Fig. 5
(bottom) depicts the number of elements merged per second
as a function of the number of threads. The array size is 1B
elements and the maximal number of unique keys was held
at $2^{28}$. Hyperthreading is used when the number of threads is
higher than 64. Once again, our new algorithm outperforms
the other merging algorithms, except at 128 and 256 threads,
where the performance of our algorithm drops off. The reason
for the drop-off at those thread counts is that our processor
has 68 cores with 4-way SMT, so for 128 and 256 threads
hyperthreading is used (this is a known side effect
[24]). The hardware threads on a single core share vector
processing units, memory queues for outstanding memory
requests, and caches (a 1MB L2 cache is shared amongst 2 cores,
i.e., up to 8 threads). Our branch-avoiding algorithm uses two
gather instructions per thread. That means that at 272 threads we
attempt to load up to 272 · 2 · 16 = 8704 different addresses. This
is a memory-intensive operation which can lead to various
memory problems. Furthermore, if too many memory requests
are generated concurrently then the core might stall: a core has
a maximal number of memory requests that it can dispatch
before it is throttled back. From the cache’s perspective,
our algorithm uses 16x more cache lines. This can lead to
contention, especially at the L1 level where the cache is fairly
small, when more threads are used on a single core. Lastly,
the slowdown seems to be consistent with other KNL research
and is actually noted in the system’s user guide [25].
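The address count above is a product of three factors, and the same product can be evaluated per core to see how hyperthreading raises the pressure on shared resources. The sketch below only restates that arithmetic; `gather_addresses` is a hypothetical helper, and the per-core figures are derived from the numbers in the text rather than measured.

```python
def gather_addresses(threads, gathers_per_thread=2, lanes=16):
    """Upper bound on distinct addresses requested in one merge step.

    Each thread issues `gathers_per_thread` gather instructions and each
    gather can touch up to `lanes` different addresses.
    """
    return threads * gathers_per_thread * lanes


# Whole chip at full occupancy, as quoted in the text.
print(gather_addresses(272))               # 8704

# One core with 1, 2 and 4 hardware threads sharing its L1 and memory queues.
for threads_per_core in (1, 2, 4):
    print(threads_per_core, gather_addresses(threads_per_core))  # 32, 64, 128
```

Going from one to four threads per core quadruples the number of addresses a single core may have in flight, while the L1 cache and the request queues stay the same size.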
Single core sorting as a function of the array size:
Fig. 6 (top) depicts the performance of the sorting algorithms
as a function of the array size. The maximal number of keys
was held at $2^{28}$. Our algorithm is faster by 2x to 3x. The
performance of all the tested algorithms decays by around 50%
from an array size of $2^{16}$ to $2^{24}$. Our algorithm seems to
decay slightly faster than the other algorithms, most likely
due to oversaturation of the memory requests at the core
granularity. KNL was not designed for this level of memory
request intensity.
[Figure: Sequential Sorting]
Fig. 6. Performance of single core sorting. (top) shows the results in keys
sorted per second (higher is better) as a function of the array size. Maximal
number of keys was held at $2^{28}$. (bottom) shows the results in keys sorted
per second (higher is better) as a function of the key range. Input arrays have
$2^{24}$ elements each.
[Figure: Parallel Sorting]
Fig. 7. Performance of parallel sorting. (top) shows the results in elements
sorted per second (higher is better) as a function of the array size. Maximal
number of keys was held at $2^{28}$ and 64 threads were used. (bottom) shows
the results in keys sorted per second (higher is better) as a function of the
number of threads. Input arrays have $2^{20}$ elements each and maximal number
of keys was held at $2^{28}$. Hyperthreading is used when the number of threads
is higher than 64.
Single core sorting as a function of the key range: Fig.
6 (bottom) depicts the performance of the sorting algorithms
as a function of the key range. The size of the input array is
$2^{24}$ elements. For all but Intel’s IPP the performance seems
mostly independent of the key range. This is in contrast to the
merging algorithms where there was a clear reduction in the
performance. However, for all the algorithms, performance did
decrease by about 10%. The algorithm that stands out most is
Intel’s IPP algorithm, whose performance decreases by
roughly 6x to 7x. Beyond the range of 1024 unique keys (which
is fairly typical for large applications), our new algorithm
outperforms Intel’s IPP algorithm by roughly 2.2x.
Parallel sorting as a function of array size: Fig. 7
(top) depicts the number of elements sorted per second as a
function of the array size. The number of threads used was 64
and the maximal number of keys was held at $2^{28}$. This time
our algorithm outperforms the others for small array sizes,
but falls off for larger sizes. This is an artifact of the memory
subsystem of KNL and seems to affect algorithms with certain
types of memory access patterns. This issue does not seem to
be an artifact of our algorithm itself because we do not see this
issue on the AVX-512 Intel Skylake system we have tested on
(those results have been omitted for brevity).
Parallel sorting as a function of threads: Fig. 7
(bottom) depicts the number of elements sorted per second
as a function of the number of threads. The array size used
was $2^{20}$ and the maximal number of keys was held at $2^{28}$.
Hyperthreading is used when the number of threads is higher
than 64. Our new algorithm outperforms the other algorithms
at larger thread counts.
VII. CONCLUSIONS
In this paper we presented a new way of approaching
vectorized merging and sorting. We showed how to increase
the scalability and control flow of these algorithms from $P$
threads to $P \cdot W$ software threads executed on $P$ physical
hardware threads. As part of this process we showed how to
greatly increase the control flow available in a single hardware
thread. This was in part due to the introduction of new gather
and scatter instructions, which enable loading and storing
data efficiently at various locations, as well as the
introduction of new mask instructions. Through these masked
instructions we showed how each data lane could be controlled
at a finer grain than was previously possible.
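The idea of running $W$ independent control flows inside one hardware thread can be emulated in scalar code. The sketch below is a simplified model under stated assumptions, not the paper's AVX-512 kernel: Python lists stand in for vector registers, list indexing stands in for gather and scatter, boolean lists stand in for mask registers, and the names `merge_path_split` and `lockstep_merge` are chosen for this example. A Merge Path binary search gives each lane an equal share of the output, and all lanes then advance in lockstep.

```python
def merge_path_split(a, b, d):
    """Return (i, j) with i + j == d: the first d merged outputs are a[:i], b[:j]."""
    lo, hi = max(0, d - len(b)), min(d, len(a))
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] <= b[d - mid - 1]:   # a[mid] precedes b[d-mid-1]: take more of a
            lo = mid + 1
        else:
            hi = mid
    return lo, d - lo


def lockstep_merge(a, b, lanes=16):
    """Merge sorted lists a and b with `lanes` emulated SIMD lanes."""
    total = len(a) + len(b)
    # One cut per lane boundary; lane l owns outputs [cut_l, cut_{l+1}).
    cuts = [merge_path_split(a, b, total * t // lanes) for t in range(lanes + 1)]
    ia = [c[0] for c in cuts[:-1]]     # per-lane read index into a
    ib = [c[1] for c in cuts[:-1]]     # per-lane read index into b
    ea = [c[0] for c in cuts[1:]]      # per-lane end of the a segment
    eb = [c[1] for c in cuts[1:]]      # per-lane end of the b segment
    out = [None] * total
    steps = -(-total // lanes)         # ceil: length of the longest lane segment
    for _ in range(steps):
        # Mask of lanes that still have work left.
        active = [ia[l] < ea[l] or ib[l] < eb[l] for l in range(lanes)]
        # Mask of lanes whose next output comes from a ("gather + compare").
        take_a = [active[l] and (ib[l] >= eb[l] or
                                 (ia[l] < ea[l] and a[ia[l]] <= b[ib[l]]))
                  for l in range(lanes)]
        for l in range(lanes):         # emulates one masked scatter + index update
            if active[l]:
                out[ia[l] + ib[l]] = a[ia[l]] if take_a[l] else b[ib[l]]
                ia[l] += take_a[l]
                ib[l] += not take_a[l]
    return out


print(lockstep_merge([1, 3, 5, 7], [2, 4, 6, 8], lanes=4))
# [1, 2, 3, 4, 5, 6, 7, 8]
```

Since the cuts differ by at most one output element in length, every lane finishes within the same number of steps, which is the balanced-work property that lets the lanes stay in lockstep without idle time.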
From a performance perspective, our new vector-based
algorithm outperforms many state-of-the-art implementations.
This is thanks to the low overhead, efficiency, and balanced
threading of the algorithm. Testing demonstrates that the algorithm
not only surpasses the SIMD-based bitonic counterpart,
but that it is over 2.94x faster than a traditional merge, merging
over 300M keys per second in one thread and over 16B keys
per second in parallel. A full sort is over 5x faster
than quicksort and 2x faster than Intel’s IPP library sort,
sorting over 5.3M keys per second on a single processor and
over 500M keys per second in parallel, a speedup of over
2x over a traditional merge sort.
REFERENCES
[1] W. A. Martin, “Sorting,” ACM Comput. Surv., vol. 3, no. 4, pp. 147–174, 1971.
[2] C. A. R. Hoare, “Algorithm 64: Quicksort,” Commun. ACM, vol. 4, no. 7, p. 321, 1961.
[3] K. E. Batcher, “Sorting networks and their applications,” in Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, 1968, pp. 307–314.
[4] N. Amato, R. Iyer, S. Sundaresan, and Y. Wu, “A comparison of parallel sorting algorithms on different architectures,” Texas A&M University, Report, 1998.
[5] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “AA-sort: A new parallel sorting algorithm for multi-core SIMD processors,” in 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), 2007, pp. 189–198.
[6] S. Odeh, O. Green, Z. Mwassi, O. Shmueli, and Y. Birk, “Merge Path - Parallel Merging Made Simple,” in IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012, pp. 1611–1618.
[7] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey, “Efficient implementation of sorting on multi-core SIMD CPU architecture,” Proc. VLDB Endow., vol. 1, no. 2, pp. 1313–1324, 2008.
[8] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “A high-performance sorting algorithm for multicore single-instruction multiple-data processors,” Software: Practice and Experience, vol. 42, no. 6, pp. 753–777, 2012.
[9] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey, “Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010, pp. 351–362.
[10] B. Bramas, “Fast sorting algorithms using AVX-512 on Intel Knights Landing,” CoRR, vol. abs/1704.08579, 2017.
[11] H. Inoue and K. Taura, “SIMD- and cache-friendly algorithm for sorting an array of structures,” Proc. VLDB Endow., vol. 8, no. 11, pp. 1274–1285, 2015.
[12] T. Xiaochen, K. Rocki, and R. Suda, “Register level sort algorithm on multi-core SIMD processors,” in Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms, 2013, pp. 1–8.
[13] O. Green, M. Dukhan, and R. Vuduc, “Branch-Avoiding Graph Algorithms,” in 27th ACM on Symposium on Parallelism in Algorithms and Architectures, 2015, pp. 212–223.
[14] O. Green, “When Merging and Branch Predictors Collide,” in IEEE Fourth Workshop on Irregular Applications: Architectures and Algorithms, 2014, pp. 33–40.
[15] Intel Corporation, “Intel® Integrated Performance Primitives,” 2018, version 2018.1.163. [Online]. Available: https://software.intel.com/en-us/intel-ipp
[16] S. G. Akl and N. Santoro, “Optimal parallel merging and sorting without memory conflicts,” IEEE Transactions on Computers, vol. C-36, no. 11, pp. 1367–1369, 1987.
[17] Y. Liu, T. Pan, O. Green, and S. Aluru, “Parallelized Kendall’s tau coefficient computation via SIMD vectorized sorting on many-integrated-core processors,” arXiv:1704.03767, 2017.
[18] S. Lacey and R. Box, “A fast easy sort,” Byte Magazine, pp. 315–320, April 1991.
[19] B. Schlegel, T. Willhalm, and W. Lehner, “Fast sorted-set intersection using SIMD instructions,” in ADMS@VLDB, 2011, pp. 1–8.
[20] E. Stehle and H.-A. Jacobsen, “A memory bandwidth-efficient hybrid radix sort on GPUs,” CoRR, vol. abs/1611.01137, 2016.
[21] A. Fog, “The microarchitecture of Intel, AMD and VIA CPUs,” An optimization guide for assembly programmers and compiler makers. Copenhagen University College of Engineering, 2011.
[22] O. Green, R. McColl, and D. Bader, “GPU Merge Path: A GPU Merging Algorithm,” in 26th ACM International Conference on Supercomputing, 2012, pp. 331–340.
[23] N. Deo and D. Sarkar, “Parallel algorithms for merging and sorting,” Information Sciences, vol. 56, no. 1, pp. 151–161, 1991.
[24] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights Landing: Second-generation Intel Xeon Phi product,” IEEE Micro, vol. 36, no. 2, pp. 34–46, 2016.
[25] TACC, “Stampede user guide,” 2018. [Online]. Available: https://portal.tacc.utexas.edu/user-guides/stampede2#best-known-practices-and-preliminary-observations-knl
... Many mergesort variants have been proposed in the last two decades with the goal to maximize data/thread-level parallelism [6], [14], [15], [17], [18], [26], [27], [30], [32], [33]; however, they leave room for improvements in terms of speed and usability. First, none of the papers examine how to optimize each individual phase of the sort pipeline. ...
... Second, the existing frameworks do not offer a unifying mergesort solution that is simultaneously optimized for scalar, SSE, AVX2, and AVX-512 architectures. In fact, some of them [30], [32], [33] inherently work only in the extended instruction set of AVX-512, with back-porting either impossible or requiring a expensive set of substitute instructions. Depending on CPU availability and user preferences (e.g., lower power consumption), it may be desirable to have access to the fastest sort in each category rather than the fastest overall. ...
... A sequence of log 2 (C/ ) binary merges, which comprises phase 2 , then sorts all C items in the block. While many efforts exist for moving pointers along two sorted arrays during merging (e.g., branching [6], [14], [15], [17], [18], [26], [27], branchless [30], partially branchless [16], SIMD-aided [33]), the issue of how to further increase performance of this step has remained open for many years. To this end, we develop a new merge technique that is not only faster than all prior solutions, but also applicable to both scalar and vectorized architectures. ...
Conference Paper
Full-text available
Mergesort is a popular algorithm for sorting real-world workloads as it is immune to data skewness, suitable for parallelization using vectorized intrinsics, and relatively simple to multi-thread. In this paper, we introduce Origami, an in-memory merge-sort framework that is optimized for scalar, as well as all current SIMD (single-instruction multiple-data) CPU architectures. For each vector-extension set (e.g., SSE, AVX2, AVX-512), we present an in-register sorter for small sequences that is up to 8× faster than prior methods and a branchless streaming merger that achieves up to a 1.5× speed-up over the naive merge. In addition, we introduce a cache-residing quad-merge tree to avoid bottlenecking on memory bandwidth and a parallel partitioning scheme to maximize thread-level concurrency. We develop an end-to-end sort with these components and produce a highly utilized mergesort pipeline by reducing the synchronization overhead between threads. Single-threaded Origami performs up to 2× faster than the closest competitor and achieves a nearly perfect speed-up in multi-core environments.
... The 4 threads of each core share 2 vector processing units supporting AVX-512 instructions. To utilize these instructions for irregular applications, we use the vectorized branch avoiding model introduced in [13], [23] (which extend the branch avoiding model introduced in [12]). In the vectorized branch avoiding model, each data lane in the AVX-512 unit executes a different data control flow. ...
... We look to improve the performance of PageRank by using the vectorized branch avoiding model. Specifically, we target Intel's AVX-512 ISA and show that we can parallelize the PageRank computation across the vector unit's multiple data lanes (as was done in [13], [23]). We present our vectorized PageRank in Alg. 2. At a high level, each thread computes the next PageRank values for 16 vertices at a time, by processing each vertice's in-neighbor list in lockstep via AVX-512 instructions. ...
... In the branch-avoiding model[13],[23] these are referred to as control flows and each data lane in a vector unit executes a different thread. In SIMT (GPU), these are the threads within a thread-block. ...
Conference Paper
Full-text available
Effective scheduling and load balancing of applications on massively multi-threading systems remains challenging despite decades of research, especially for irregular and data dependent problems where the execution control path is unknown until run-time. One of the most widely used load-balancing schemes used for data dependent problems is a parallel prefix sum (PPS) array over the expected amount of work per task, followed by a partitioning of tasks to threads. While sufficient for many systems, it is not ideal for massively multi-threaded systems with SIMD/SIMT execution, such as GPUs. More fine-grained load-balancing is needed to effectively utilize SIMD/SIMT units. In this paper we introduce Logarithmic Radix Binning (LRB) as a more suitable alternative to parallel prefix summation for load-balancing on such systems. We show that LRB has better scalability than PPS for high thread counts on Intel's Knight's Landing processor and comparable scalability on NVIDIA Volta GPUs. On the application side, we show how LRB improves the performance of PageRank up to 1.75X using the branch-avoiding model. We also show how to better load-balance segmented sort and improve performance on the GPU.
... Watkins et al. [30] provide an alternative approach to sort based on the merging of multiple vectors. Their method is 2 times faster than the Intel IPP library and 5 times faster than the C-lib qsort. ...
Preprint
Full-text available
The way developers implement their algorithms and how these implementations behave on modern CPUs are governed by the design and organization of these. The vectorization units (SIMD) are among the few CPUs' parts that can and must be explicitly controlled. In the HPC community, the x86 CPUs and their vectorization instruction sets were de-facto the standard for decades. Each new release of an instruction set was usually a doubling of the vector length coupled with new operations. Each generation was pushing for adapting and improving previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the next years. Second, SVE's interface is different in several aspects from the x86 extensions as it provides different instructions, uses a predicate to control most operations, and has a vector size that is only known at execution time. Therefore, using SVE opens new challenges on how to adapt algorithms including the ones that are already well-optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates and how we manage the non-static vector size. We explain how we efficiently implement the sorting kernels. Our approach only needs an array of O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our approach is faster than the GNU C++ sort algorithm by a speedup factor of 4 on average.
Article
Full-text available
The way developers implement their algorithms and how these implementations behave on modern CPUs are governed by the design and organization of these. The vectorization units (SIMD) are among the few CPUs’ parts that can and must be explicitly controlled. In the HPC community, the x86 CPUs and their vectorization instruction sets were de-facto the standard for decades. Each new release of an instruction set was usually a doubling of the vector length coupled with new operations. Each generation was pushing for adapting and improving previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the next years. Second, SVE’s interface is different in several aspects from the x86 extensions as it provides different instructions, uses a predicate to control most operations, and has a vector size that is only known at execution time. Therefore, using SVE opens new challenges on how to adapt algorithms including the ones that are already well-optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates and how we manage the non-static vector size. We also explain how we efficiently implement the sorting kernels. Our approach only needs an array of O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results show that our approach is faster than the GNU C++ sort algorithm by a speedup factor of 4 on average.
Technical Report
Full-text available
This paper describes fast sorting techniques using the recent AVX-512 instruction set. Our implementations benefit from the latest possibilities offered by AVX-512 to vectorize a two-parts hybrid algorithm: we sort the small arrays using a branch-free Bitonic variant, and we provide a vectorized partitioning kernel which is the main component of the well-known Quicksort. Our algorithm sorts in-place and is straightforward to implement thanks to the new instructions. Meanwhile, we also show how an algorithm can be adapted and implemented with AVX-512. We report a performance study on the Intel KNL where our approach is faster than the GNU C++ sort algorithm for any size in both integer and double floating-point arithmetics by a factor of 4 in average.
Article
Full-text available
Pairwise association measure is an important operation in data analytics. Kendall's tau coefficient is one widely used correlation coefficient identifying non-linear relationships between ordinal variables. In this paper, we investigated a parallel algorithm accelerating all-pairs Kendall's tau coefficient computation via single instruction multiple data (SIMD) vectorized sorting on Intel Xeon Phis by taking advantage of many processing cores and 512-bit SIMD vector instructions. To facilitate workload balancing and overcome on-chip memory limitation, we proposed a generic framework for symmetric all-pairs computation by building provable bijective functions between job identifier and coordinate space. Performance evaluation demonstrated that our algorithm on one 5110P Phi achieves two orders-of-magnitude speedups over 16-threaded MATLAB and three orders-of-magnitude speedups over sequential R, both running on high-end CPUs. Besides, our algorithm exhibited rather good distributed computing scalability with respect to number of Phis. Source code and datasets are publicly available at http://lightpcc.sourceforge.net.
Conference Paper
Full-text available
Merging is a building block for many computational domains. In this work we consider the relationship between merging, branch predictors, and input data dependency. Branch predictors are ubiquitous in modern processors as they are useful for many high performance computing applications. While it is well known that the performance and the branch prediction accuracy go hand-in-hand, these have not been studied in the context of merging. We thoroughly test merging using multiple input array sizes and values using the same code and compile optimizations. As the number of possible keys increase, so the do the number of branch mis-predictions -resulting in reduced performance. The reduction in performance can be as much as 5X. We explain this phenomenon using a visualization technique called Merge Path that intuitively shows this. We support this visualization approach with modeling, thorough testing, and analysis on multiple systems.
Conference Paper
Full-text available
This paper quantifies the impact of branches and branch mispredictions on the single-core performance of certain graph problems, specifically for computing connected components. We show that branch mispredictions are costly and can reduce performance by as much as 30%-50%. This insight suggests that one should seek graph algorithms and implementations that avoid branches. As a proof-of-concept, we devise such branch-avoiding implementations of the Shiloach-Vishkin algorithm for computing connected components. We evaluate these implementations on current x86 and ARM-based processors to show the efficacy of the approach. Our results suggest how both compiler writers and architects might exploit this insight to improve graph processing systems more broadly and create better systems for such problems.
Article
Full-text available
In this paper, we focus on sorted-set intersection which is an important part in many algorithms, e.g., RID-list inter-section, inverted indexes, and others. In contrast to tradi-tional scalar sorted-set intersection algorithms that try to reduce the number of comparisons, we propose a parallel algorithm that relies on speculative execution of compar-isons. In general, our algorithm requires more comparisons but less instructions than scalar algorithms that translates into a better overall speed. We achieve this by utilizing ef-ficient single-instruction-multiple-data (SIMD) instructions that are available in many processors. We provide different sorted-set intersection algorithms for different integer data types. We propose versions that use uncompressed integer values as input and output as well as a version that uses a tailor-made data layout for even faster intersections. In our experiments, we achieve speedups up to 5.3x compared to popular fast scalar algorithms.
Conference Paper
Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.
Article
Sorting is at the core of many database operations, such as index creation, sort-merge joins and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.
Article
This article describes the architecture of Knights Landing, the second-generation Intel Xeon Phi product family, which targets high-performance computing and other highly parallel workloads. It provides a significant increase in scalar and vector performance and a big boost in memory bandwidth compared to the prior generation, called Knights Corner. Knights Landing is a self-booting, standard CPU that is completely binary compatible with prior Intel Xeon processors and is capable of running all legacy workloads unmodified. Its innovations include a core optimized for power efficiency, a 512-bit vector instruction set, a memory architecture comprising two types of memory for high bandwidth and large capacity, a high-bandwidth on-die interconnect, and an integrated on-package network fabric. These features enable the Knights Landing processor to provide significant performance improvement for computationally intensive and bandwidth-bound workloads while still providing good performance on unoptimized legacy workloads, without requiring any special way of programming other than the standard CPU programming model.
Article
This paper describes our new algorithm for sorting an array of structures by efficiently exploiting the SIMD instructions and cache memory of today's processors. Recently, multiway mergesort implemented with SIMD instructions has been used as a high-performance in-memory sorting algorithm for sorting integer values. For sorting an array of structures with SIMD instructions, a frequently used approach is to first pack the key and index for each record into an integer value, sort the key-index pairs using SIMD instructions, then rearrange the records based on the sorted key-index pairs. This approach can efficiently exploit SIMD instructions because it sorts the key-index pairs while packed into integer values; hence, it can use existing high-performance sorting implementations of the SIMD-based multiway mergesort for integers. However, this approach has frequent cache misses in the final rearranging phase due to its random and scattered memory accesses, so this phase limits both single-thread performance and scalability with multiple cores. Our approach is also based on multiway mergesort, but it can avoid costly random accesses for rearranging the records while still efficiently exploiting the SIMD instructions. Our results showed that our approach exhibited up to 2.1x better single-thread performance than the key-index approach implemented with SIMD instructions when sorting 512M 16-byte records on one core. Our approach also yielded better performance when we used multiple cores. Compared to an optimized radix sort, our vectorized multiway mergesort achieved better performance when each record is large. Our vectorized multiway mergesort also yielded higher scalability with multiple cores than the radix sort.
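The key-index approach described in this abstract can be illustrated with a minimal scalar sketch. It is not the authors' implementation: the function name `sort_records_key_index`, the `key_of` callback, and the 32-bit index width are assumptions made for illustration, and Python's built-in sort stands in for the SIMD-based multiway mergesort. Keys are assumed to be non-negative integers.

```python
def sort_records_key_index(records, key_of, index_bits=32):
    """Sort records by a non-negative integer key using packed key-index pairs.

    Hypothetical helper for illustration only; list.sort() stands in for a
    SIMD multiway mergesort over plain integers.
    """
    if len(records) >= (1 << index_bits):
        raise ValueError("too many records for the chosen index width")
    index_mask = (1 << index_bits) - 1

    # Step 1: pack each (key, index) pair into one integer.
    # The key occupies the high bits, so integer order follows key order.
    packed = [(key_of(rec) << index_bits) | i for i, rec in enumerate(records)]

    # Step 2: sort the packed integers (the part a SIMD sort would accelerate).
    packed.sort()

    # Step 3: rearrange the records by unpacking the index from the low bits.
    # These reads are scattered across the input, which is the phase the
    # abstract identifies as the source of cache misses.
    return [records[p & index_mask] for p in packed]


records = [(30, "c"), (10, "a"), (20, "b"), (10, "d")]
result = sort_records_key_index(records, key_of=lambda rec: rec[0])
```

Because the original index sits in the low bits, records with equal keys keep their input order in this sketch.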
Article
State-of-the-art hardware increasingly utilizes SIMD parallelism, where multiple processing elements execute the same instruction on multiple data points simultaneously. However, irregular and data-intensive algorithms are not well suited for such architectures. Due to their importance, it is crucial to obtain efficient implementations. One example of such a task is sorting, a fundamental problem in computer science. In this paper we analyze distinct memory accessing models and propose two methods to employ highly efficient bitonic merge sort using SIMD instructions as a register-level sort. We achieve nearly 270x speedup (525M integers/s) on a 4M integer set using a Xeon Phi coprocessor, where SIMD-level parallelism accelerates the algorithm over 3 times. Our method can be applied to any device supporting similar SIMD instructions.