
SMASH: Sparse Matrix Atomic Scratchpad Hashing

A Thesis Presented

by

Kaustubh Shivdikar

to

The Department of Electrical and Computer Engineering

in partial fulﬁllment of the requirements

for the degree of

Master of Science

in

Electrical and Computer Engineering

Northeastern University

Boston, Massachusetts

April 2021

arXiv:2105.14156v1 [cs.DC] 29 May 2021

To the people trying to make the architectures of today, history tomorrow.


Contents

List of Figures

List of Tables

List of Acronyms

Acknowledgments

Abstract of the Thesis

1 Introduction
1.1 The Development of Matrix-based Methods
1.2 High Performance Computing and Matrices
1.3 Applications
1.4 Motivation
1.5 Dataflow in SpGEMM
1.6 Introduction of SMASH
1.7 Contributions

2 Background
2.1 CPU
2.2 GPU
2.3 Branch and Control Flow Logic
2.4 Computational Performance
2.5 The Properties of Spatial Locality
2.6 Sparse Matrix Storage Formats

3 Related Work
3.1 SpGEMM on CPU
3.2 SpGEMM on GPU
3.3 SpGEMM Accelerators

4 PIUMA Architecture and Simulator
4.1 PIUMA Architecture
4.1.1 PIUMA Cores
4.1.2 Offload Engines
4.1.3 Global Address Space
4.1.4 Network
4.2 Simulation Methodology
4.2.1 Simulator Classification
4.2.2 Sniper Simulator

5 SMASH Kernels
5.1 SMASH Version 1: Atomic Hashing
5.1.1 Window Distribution Phase
5.1.2 Hashing Phase
5.1.3 Write-back Phase
5.2 SMASH Version 2: Tokenization
5.3 SMASH Version 3: Fragmenting Memory

6 Results
6.1 Experimental Methodology
6.2 Dataset Arithmetic Intensity
6.3 DRAM Performance
6.4 Cache Performance
6.5 Workload Distribution
6.6 Instruction Throughput
6.7 Application Speedup
6.8 Summary of Results

7 Conclusions and Future Work
7.1 Contributions of this Thesis
7.2 Future Work

Bibliography


List of Figures

1.1 Deep learning workload characterization
1.2 Matrix Multiplication Approaches
1.3 SMASH Overview
2.1 CPU and GPU Architectural Overview
4.1 PIUMA Core
4.2 PIUMA Die
4.3 PIUMA System
4.4 PIUMA System
5.1 Window Distribution Algorithm
5.2 Collision Resolution
5.3 Tag-Data Hashtable
5.4 SMASH Algorithm
5.5 Hashing on low-order bits
5.6 Tag-Offset Hashtable
5.7 Window Distribution Algorithm
6.1 SMASH V1: Thread utilization plots for unbalanced workload
6.2 SMASH V2: Thread utilization plots for balanced workload
6.3 Average thread utilization
6.4 Thread utilization histogram comparison between balanced and unbalanced workloads


List of Tables

1.1 Sparse Graph datasets
1.2 Matrix Multiplication Methods
3.1 SpGEMM Accelerator Comparison
4.1 Simulator host machine specifications
4.2 Simulator target configuration for PIUMA architecture
6.1 Input and output data characteristics used in this thesis
6.2 CSR matrix arrays for input matrices A and B
6.3 CSR matrix arrays for the output matrix C
6.4 Aggregated DRAM bandwidth demands
6.5 Cache performance of our 3 SMASH implementations
6.6 Aggregate IPC Comparisons
6.7 Runtime for an entire SpGEMM workload on 64 PIUMA threads


List of Acronyms

ABM Advanced Bit Manipulation

ADX Multiprecision Add Carry

AESI Advanced Encryption Instructions

AMD Advanced Micro Devices

ATT Address Translation Tables

AVX2 Advanced Vector Instructions 2

BFS Breadth First Search

BLAS Basic Linear Algebra Subprograms

BMI2 Bit Manipulation Instruction Set 2

C-RAM Computational RAM

CLMUL Carry-less Multiplication Extension

CPU Central Processing Unit

CSC Compressed Sparse Column

CSR Compressed Sparse Row

DARPA Defense Advanced Research Projects Agency

DFFT Dense Fast Fourier Transform

DGAS Distributed Global Address Space

DMA Direct Memory Access

DP Double Precision

DRAM Dynamic Random Access Memory

ELL ELLPACK

EMMX Extended MMX Extension

F16C 16-bit Floating Point Conversion

FMA Floating-point Multiply and Add

FMA3 3-Operand Fused-Multiply-Add

GCN Graph Convolutional Network

GEMM General Matrix Multiplication

GNN Graph Neural Network

GPGPU General-Purpose Graphics Processing Unit

GPU Graphics Processing Unit

HBM High Bandwidth Memory

HIVE Hierarchical Identify Verify Exploit

HP Half Precision

IPC Instructions per Cycle

ISA Instruction Set Architecture

MKL Math Kernel Library

MLP Multi-Layer Perceptron

MTC Multi-threaded Core

OMP OpenMP

OS Operating System

PIM Processor in Memory

PIUMA Programmable Integrated Unified Memory Architecture

QP Quadruple Precision

RF Register File

RMAT Recursive Matrix

SP Single Precision


Acknowledgments

I would like to start by thanking my Ph.D. advisor, Prof. David Kaeli, for his dedicated support and motivation throughout this thesis as well as the project. I would also like to thank Dr. Fabrizio Petrini for providing me with this extraordinary opportunity to experiment with the latest and greatest tools, as well as for his guidance on SpGEMM. In addition, I would like to thank Dr. Fabio Checconi for his valuable feedback and inspiration for the SMASH kernels. Thank you to Intel for exposing me to some of the brightest minds in the valley. Thank you to all members of NUCAR, Dr. Norm Rubin, Dr. Yifan Sun, Elmira Karimi, Malith Jayaweera, Zlatan Feric, Derek Rodriguez, Julian Gutierrez, Trinayan Baruah, and Yuhui Bao, for your guidance and support. A special thanks to Nicolas Agostini, a source of inspiration and motivation throughout this project.

With the world coming to a grinding halt amid an epidemic that showed no signs of an end, I would like to thank my parents, Chandrakant and Anagha, and my brother Saumil for their relentless support in all aspects over these last years. Last but not least, I would like to thank all my friends in the US for their constant motivation; they succeeded in making this journey memorable.


Abstract of the Thesis

SMASH: Sparse Matrix Atomic Scratchpad Hashing

by

Kaustubh Shivdikar

Master of Science in Electrical and Computer Engineering

Northeastern University, April 2021

Dr. David Kaeli, Adviser

Abstract: Sparse matrices, more speciﬁcally Sparse Matrix-Matrix Multiply (SpGEMM) kernels,

are commonly found in a wide range of applications, spanning graph-based path-ﬁnding to machine

learning algorithms (e.g., neural networks). A particular challenge in implementing SpGEMM

kernels has been the pressure placed on DRAM memory. One approach to tackle this problem is

to use an inner product method for the SpGEMM kernel implementation. While the inner product

produces fewer intermediate results, it can end up saturating the memory bandwidth, given the high

number of redundant fetches of the input matrix elements. Using an outer product-based SpGEMM

kernel can reduce redundant fetches, but at the cost of increased overhead due to extra computation

and memory accesses for producing/managing partial products.

In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-

wise product approach. We leverage atomic instructions to merge intermediate partial products as

they are generated. The use of atomic instructions eliminates the need to create partial product

matrices, thus eliminating redundant DRAM fetches.

To evaluate our row-wise product approach, we map an optimized SpGEMM kernel to a

custom accelerator designed to accelerate graph-based applications. The targeted accelerator is an

experimental system named PIUMA, being developed by Intel. PIUMA provides several attractive

features, including fast context switching, user-conﬁgurable caches, globally addressable memory,

non-coherent caches, and asynchronous pipelines. We tailor our SpGEMM kernel to exploit many

of the features of the PIUMA fabric.

This thesis compares our SpGEMM implementation against prior solutions, all mapped

to the PIUMA framework. We brieﬂy describe some of the PIUMA architecture features and then

delve into the details of our optimized SpGEMM kernel. Our SpGEMM kernel can achieve 9.4×

speedup as compared to competing approaches.


Chapter 1

Introduction

1.1 The Development of Matrix-based Methods

In 1812, a French mathematician named Jacques Philippe Marie Binet pointed out that several important computations involve the multiplication of two matrices [53]. On November 30 of the same year, he delivered a lecture on his observation and further extended his work, leading to the Cauchy-Binet formula [54]. This is one of the oldest known accounts of the discovery of matrix multiplication, which was described as a method of multiplying data arranged in rows. Later, in the year 1850, Arthur Cayley applied matrix multiplication to solve a system of linear equations [15], demonstrating the applicability of this idea to an important class of mathematical problems.

1.2 High Performance Computing and Matrices

The 20th century witnessed developments in computer technology. Computers, which

were initially developed to crunch numbers for tabulating the United States census [117], were soon

being used to perform calculations for a variety of physics and mathematics problems, many of which involved matrix multiplication [117].

Use-cases for matrix multiplication applications were so widespread that they demanded

standardization [49]. In 1979, the BLAS Technical forum published a report on standardization of a

few of the common linear algebra operations (also known as subroutines), which they referred to as

Level-1 Basic Linear Algebra Subroutines (BLAS) [57]. Later, in 1986 and 1988, BLAS was further

augmented with Level-2 and Level-3 subroutines, respectively. The Level-3 subroutines included


the matrix multiplication subroutine (also known as GEMM kernel). The General Matrix Multipli-

cation (GEMM) kernel implementations were designed to work with dense matrices (matrices with

mostly non-zero elements).

Matrix multiplication is a key operation in many scientiﬁc computations. One such com-

putation is graph analysis. Graph analysis commonly represents graphs using an adjacency matrix

and then performs matrix multiplication operations on these matrices. The associated adjacency

matrices are inherently sparse [27] (with very few non-zeros) due to many graphs’ structures. Early

library implementations of GEMM performed poorly on such sparse matrices. By 2002, the BLAS

Technical forum adopted a new standard for such sparse data [29]. They presented the Sparse Basic

Linear Algebra Subprograms (Sparse BLAS), which included subroutines that included SpGEMM

(also known as the SpGEMM kernel), and focused on optimizations required for sparse matrix

multiplication [29].

1.3 Applications

Today, GEMM and SpGEMM kernels have found their way into many important applica-

tions. Some of these applications include:

1. Data encryption: AES, SHA1, SHA2, Twoﬁsh [16, 46, 24, 67, 25, 101, 58, 98];

2. Data compression and Decompression: zip ﬁles, JPEG and PNG image compression [80, 59];

3. Image processing: ﬁlters for real-time image processing, such as Sobel ﬁltering, image sharp-

ening, image blurring [82, 71, 88, 91];

4. Pathﬁnding: Breadth First Search (BFS) and Dijkstra’s Algorithm [118];

5. Signal processing: Dense Fast Fourier Transform (DFFT), Sparse Fast Fourier Transform

(SFFT) [44, 86, 81];

6. Simulations: N-body, raytracing, and Monte-Carlo [4, 97, 112, 107]; and

7. Machine learning: various supervised and unsupervised learning algorithms are implemented using GEMM kernels, and deep learning utilizes GEMM kernels for convolution layers [78, 66, 22, 75, 102, 8, 89, 103, 108].


This thesis’s motivation lies in improving the performance of SpGEMM kernels, which

will have a signiﬁcant impact on many important applications.

Recently, we have seen growth in graph-based applications in the industry.

Facebook [11], Google [73], Twitter [41], Amazon [93], Netﬂix [13], Cora [52], Cite-

seer [35] and many other large companies use graphs to analyze social networks, citation networks,

and even recommend products.

The currently available computational frameworks have not kept up with the ever-increasing

demands of graph-based workloads [106]. One of the key components of such applications is the

sparse matrix multiplication kernel (SpGEMM kernel). As compared to their dense counterparts,

SpGEMM kernels are complex and harder to optimize. Traditional multi-core CPU and many-core GPU architectures provide limited performance on SpGEMM kernels, mainly due to their

irregular memory access patterns and unbalanced work distribution.

1.4 Motivation

One graph-analysis application that is growing in popularity is Graph Neural Networks (GNNs). GNNs represent the features of each node in a graph with a vector. These vectors are then recursively aggregated and transformed, based on the features of the neighboring nodes [110]. These features can then be used to classify nodes or perform inference on datasets. Unlike traditional neural networks that work with dense data structures, such as Multi-Layer Perceptrons (MLPs), GNNs operate on sparse graph structures [12]. They have become increasingly popular, given their

high accuracy for node classification on graph-structured datasets [115, 116].

Figure 1.1: Graph Convolutional Network (GCN) kernel execution time breakdown.

From a computational perspective, GNNs are comprised of a mix of kernels, including element-wise operations, transpose

operations, dense matrix multiplication, sparse matrix multiplication, index selection, reduction,

and batch normalization. One example of a GNN is a Graph Convolutional Network (GCN). A GCN is similar to a Convolutional Neural Network (CNN), except that it performs convolution operations on a graph instead of image-based operations on pixels [21].

Figure 1.1 shows a breakdown of time spent in each of these operations in a GCN application.
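For reference, a widely used GCN layer formulation (due to Kipf and Welling, stated here for context rather than drawn from this thesis) propagates node features as

H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right)

where \tilde{A} is the graph adjacency matrix with added self-loops, \tilde{D} is its degree matrix, H^{(l)} is the matrix of node features at layer l, and W^{(l)} is a learned weight matrix. The product involving the sparse matrix \tilde{A} is the sparse matrix multiplication whose cost Figure 1.1 highlights.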

The time spent in computing SpGEMM kernels is a function of two factors: 1) the sparsity

of the dataset and 2) the sparsity pattern. Although it is difﬁcult to compare sparsity patterns, Ta-

ble 1.1 presents the degree of sparsity for various graph datasets. Many workloads, including graph

convolution [23], node classiﬁcation [5], path planning [62], use the datasets shown in Table 1.1 to

analyze graphs and derive useful information. SpGEMM remains an integral kernel used in such

workloads, processing highly sparse datasets, thus providing us with an opportunity to exploit this

sparsity using an optimized kernel.

Dataset                          Vertices      Edges            Degree of Sparsity (%)
Citeseer                         3,327         9,464            99.914
Cora                             2,708         10,858           99.851
Pubmed                           19,717        88,676           99.977
Wikipedia RfA                    11,380        188,077          99.854
Epinions                         75,888        508,837          99.991
Slashdot                         82,144        549,202          99.991
Astro Physics Collaborations     18,772        792,320          99.775
NotreDame                        325,729       1,497,134        99.998
Amazon                           334,863       1,851,744        99.998
Google Page Hyperlinks           916,428       5,105,039        99.999
Youtube                          1,134,890     11,950,496       99.999
Patent Citations                 3,774,768     16,518,948       99.999
Stack Overflow                   2,601,977     36,233,450       99.999
Orkut                            3,072,441     486,740,332      99.994
Twitter Follower Network         41,652,230    1,468,365,182    99.999

Table 1.1: Sparse Graph datasets

Given the dominance of SpGEMM execution in GNN applications, we focus on accel-

erating SpGEMM kernels to reduce their execution time. In this thesis, we focus on identifying

the underlying bottlenecks of SpGEMM kernels and improving these workloads’ performance with

datasets with varying degrees of sparsity. We target the Intel PIUMA parallel accelerator, providing

us with a state-of-the-art target to demonstrate our approach’s utility.


1.5 Dataﬂow in SpGEMM

There are four common approaches used to multiply two matrices (as shown in Fig-

ure 1.2) [96]:

1. Inner product approach: Row(A) × Col(B) = Element(C)

2. Outer product approach: Col(A) × Row(B) = Partial Products of Matrix(C)

3. Row-wise product approach: Row(A) × Corresponding Rows(B) = Row(C)

4. Column-wise product approach: Corresponding Cols(A) × Col(B) = Col(C)

Figure 1.2: Matrix multiplication approaches: (a) inner product approach, (b) outer product approach, (c) row-wise approach, (d) column-wise approach.


The most widely used approach for matrix multiplication is the inner product approach.

Inner-product methods are based on computing a dot product of a row of Matrix A with a column of Matrix B, generating a single output element of Matrix C (Figure 1.2 (a)).

This leads to multiple reads of the input matrices but results in a single write of the output

matrix elements. Thus, inner-product based methods exhibit poor input reuse, but good output

reuse. Equation 1.1 denotes the operations of the inner product method:

c_{i,j} = \sum_{k=0}^{N-1} a_{i,k} \times b_{k,j}    (1.1)

where A and B are the input matrices, c_{i,j} is the element in the i-th row and j-th column of the output matrix C, N is the number of columns in the A matrix, and a_{i,k} and b_{k,j} are elements of the corresponding rows and columns of matrices A and B, respectively.
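To make these reuse properties concrete, the following short C++ sketch (illustrative only, not code from this thesis) computes a small dense matrix product with the inner-product loop ordering; the comments mark where the inputs are re-read for every output element and where each output element is written exactly once.

#include <vector>
#include <cstdio>

int main() {
    const int N = 3;  // square matrices for simplicity
    // Row-major dense matrices A, B, and output C.
    std::vector<double> A(N * N, 1.0), B(N * N, 2.0), C(N * N, 0.0);

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            double dot = 0.0;
            // Inner product: row i of A and column j of B are streamed in
            // again for every (i, j) pair -> poor input reuse.
            for (int k = 0; k < N; ++k)
                dot += A[i * N + k] * B[k * N + j];
            // Each output element is written exactly once -> good output reuse.
            C[i * N + j] = dot;
        }
    }
    std::printf("C[0][0] = %f\n", C[0]);
    return 0;
}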

Method          Input Reuse   Output Reuse   Intermediate Size   Disadvantage
Inner Product   Poor          Good           Small               Redundant input reads
Outer Product   Good          Poor           Large               Large intermediate size
Row-wise        Poor          Good           Small               Load imbalance
Col-wise        Poor          Good           Small               Load imbalance

Table 1.2: Matrix Multiplication Methods

In contrast, the outer-product method multiplies a single column of Matrix A with the corresponding row of Matrix B to produce a partial product matrix (Figure 1.2 (b)). These partial products are stored in intermediate matrices and are later merged to form the output matrix [61, 85]. This leads to a single read of the input matrices but multiple writes of the partial product output matrices. Thus, in contrast to the inner-product method, the outer-product method exhibits good input reuse but poor output reuse. Computation of the output matrix using an outer product approach is expressed in Equation 1.2:

C = \sum_{n=0}^{N-1} C_n, \qquad C_n = a_n b_n    (1.2)

where C_n is a partial product matrix of the output matrix C, A and B represent the input matrices, N is the number of columns in matrix A, and a_n b_n is the cross product of the n-th column of A and the n-th row of B.

The row-wise approach consists of a scalar product of every element of a row of matrix A with the corresponding rows of matrix B (see Figure 1.2 (c) and Equation 1.3). The column-wise approach is similar, where a single column of matrix B is multiplied by the corresponding columns of matrix A (as seen in Figure 1.2 (d) and Equation 1.4). Both row-wise and column-wise products have similar dataflow properties. They both exhibit poor input reuse due to redundant accesses made to one of the two input matrices. As opposed to an outer product approach, they do not generate a large number of intermediate products, because partial products are merged immediately after generation. Thus, both of these approaches provide high output reuse. In addition, the inner and outer product approaches require the input matrices to be stored in opposite storage formats (matrix A in row-major and matrix B in column-major for the inner product, and vice versa for the outer product), whereas the row-wise and column-wise approaches require both input matrices to be stored in the same storage format. A significant disadvantage of the row-wise and column-wise approaches is that a skewed matrix (a matrix with unevenly distributed non-zeros) can cause load imbalance in the computation. This problem of load imbalance, and a solution to it, are further described in Section 5.2 of this thesis.

C[i, :] = \sum_{k=0}^{N-1} A[i, k] * B[k, :]    (1.3)

C[:, j] = \sum_{k=0}^{N-1} A[:, k] * B[k, j]    (1.4)

where C[i, :] and C[:, j] represent the i-th row and the j-th column, respectively, of the output matrix C, A and B are the two input matrices, and N is the shared inner dimension (the number of columns of matrix A, which equals the number of rows of matrix B) in both Equation 1.3 and Equation 1.4.
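As a sketch of how Equation 1.3 is evaluated on CSR inputs (an illustration for exposition, not the kernel developed in this thesis), the following C++ fragment computes one row of C with a dense accumulator; the SMASH kernels described in Chapter 5 replace this dense accumulator with a hashtable.

#include <cstdio>
#include <vector>

// Minimal CSR container: row_ptr has num_rows+1 entries; col_idx/vals hold the non-zeros.
struct Csr {
    int num_rows, num_cols;
    std::vector<int> row_ptr, col_idx;
    std::vector<double> vals;
};

// Row-wise product (Equation 1.3): row i of C = sum_k A[i,k] * B[k,:].
// A dense accumulator of width B.num_cols merges partial products as they are produced.
std::vector<double> rowwise_row(const Csr& A, const Csr& B, int i) {
    std::vector<double> acc(B.num_cols, 0.0);
    for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
        int k = A.col_idx[p];          // non-zero A[i,k] selects row k of B
        double a_ik = A.vals[p];
        for (int q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q)
            acc[B.col_idx[q]] += a_ik * B.vals[q];   // merge partial product immediately
    }
    return acc;
}

int main() {
    // A = [[1 0],[0 2]] and B = [[3 0],[0 4]] in CSR form.
    Csr A{2, 2, {0, 1, 2}, {0, 1}, {1.0, 2.0}};
    Csr B{2, 2, {0, 1, 2}, {0, 1}, {3.0, 4.0}};
    std::vector<double> row0 = rowwise_row(A, B, 0);
    std::printf("C[0,:] = [%g, %g]\n", row0[0], row0[1]);  // expect [3, 0]
    return 0;
}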

1.6 Introduction of SMASH

Figure 1.3: SMASH Overview (flowchart: read the input matrices A and B, compute the required memory, distribute the workload across cores; each core then hashes its data and writes back its results).


In this thesis, we present Sparse Matrix Atomic Scratchpad Hashing or SMASH, a row-

wise product method that uses the Compressed Sparse Row (CSR) format [92]. An overview of our

algorithm is shown in Figure 1.3. The SMASH algorithm is designed to adapt to varying sparsity

patterns in the input matrices. We address issues related to the row-wise product approach by

designing our kernel to merge partial products using a custom implementation of a hashtable.
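The sketch below illustrates the idea of hash-based merging (using std::unordered_map purely as a stand-in for the custom scratchpad hashtable developed in Chapter 5): each partial product is folded into the accumulator for its output row as soon as it is produced, so no partial-product matrices are ever materialized.

#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

// Accumulate partial products for one output row of C.
// Each partial product (column j, value v) is merged into the row's hashtable
// as soon as it is generated, so no intermediate partial-product matrix is kept.
void merge_partial(std::unordered_map<int, double>& row_acc, int j, double v) {
    row_acc[j] += v;  // insert-or-accumulate
}

int main() {
    std::unordered_map<int, double> row_acc;  // key: column index, value: accumulated sum
    // Partial products that a row-wise traversal might emit for one row of C.
    std::vector<std::pair<int, double>> partials = {{4, 1.5}, {9, 2.0}, {4, 0.5}};
    for (const auto& p : partials)
        merge_partial(row_acc, p.first, p.second);
    for (const auto& kv : row_acc)
        std::printf("C[row, %d] = %g\n", kv.first, kv.second);  // column 4 -> 2.0, column 9 -> 2.0
    return 0;
}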

This thesis also discusses the unique features of a novel architecture from Intel called

PIUMA, designed to speed up graph-based workloads. We further describe our implementation of

SpGEMM that exploits these features of this new architecture. Then we describe several improve-

ments to our SpGEMM algorithm. We discuss design decisions and their impact on the resulting

algorithm. Finally, we compare our optimized SpGEMM kernel performance on the Intel PIUMA

accelerator architecture and provide an in-depth analysis of the results.

1.7 Contributions

The contributions of this thesis include:

• Analysis of the problems exhibited by sparse matrix multiplication kernels.

• A comparison of architectures that support sparse matrix multiplications.

• A comparison of previous implementations of SpGEMM kernels.

• An architectural overview of Intel's novel PIUMA graph accelerator.

• A novel SpGEMM kernel implementation that makes the best use of the PIUMA accelerator.


Chapter 2

Background

This chapter reviews the background information on CPU and GPU architectures required

to place this thesis in context. We also cover common approaches on improving the performance of

SpGEMM workloads. We include discussion on hardware designs of domain-speciﬁc accelerators

for such workloads. Finally, we describe related work on SpGEMM kernels and their implementa-

tions.

2.1 CPU

The history of the Central Processing Unit (CPU) dates back as far as 1971 when Intel

introduced the ﬁrst microprocessor in the market, the Intel 4004 [10], capable of performing 60,000

operations per second. Since then, there have been rapid advances in this ﬁeld regarding clock

speed and transistor technology, enabling today's AMD Ryzen CPUs to execute 2.3 teraoperations per second (a teraoperation is 10^12 operations per second).

A CPU can be deﬁned as a computational device that, fundamentally, reads instructions

from program memory and performs calculations. These instructions are fetched from the main

memory of the computer (typically Dynamic Random Access Memory (DRAM)) and undergo three stages of computation:

• Fetch: Retrieve the instructions from memory. The control unit usually sends a signal through the address bus to retrieve instructions.

• Decode: The control unit splits the instruction into two parts, the opcode and the operands.

• Execute: The command represented by the opcode is executed on the operand in the execute stage.

As far as CPU architecture is concerned, a large portion of the chip area is dedicated to control logic. With fewer cores and a large control-logic chip area, the chip real estate dedicated to control logic per core is high. In addition, each CPU core is clocked faster than a GPU core. These factors allow the CPU to excel at certain workloads compared to the GPU. The CPU is designed to handle a wide range of tasks efficiently, but is heavily limited when running many tasks concurrently. The larger control-logic area and faster cores give CPUs an advantage over GPUs for control-dominated, general-purpose workloads containing multiple conditional branches. We compare the CPU chip area with that of the GPU in Figure 2.1.

Given that a CPU devotes more logic to control (i.e., branch handling) and is clocked

faster than the GPU, it allows the CPU to excel at executing workloads with complex single-threaded

tasks, such as operating system services and database engine operations.

Figure 2.1: CPU and GPU Architectural Overview. The CPU devotes a large fraction of its die to control logic and cache serving a few ALUs, while the GPU devotes most of its die to many ALUs with comparatively small per-core control logic and cache.

2.2 GPU

Graphics Processing Unit (GPU)s were traditionally designed to render graphics and

videos to displays. They were used in appliances that need a display, like the personal computer, mo-


biles, and embedded systems. Modern GPUs are now capable of accelerating a variety of tasks that

were previously executed on CPUs. These devices led to the birth of a new ﬁeld called GPU Com-

pute, where the GPU is equipped with programmable shaders. Today, GPUs commonly run a range

of compute-oriented workloads, including encryption, decryption, physics simulations, pathﬁnding,

and machine learning.

Figure 2.1 compares the chip area distribution between a CPU and a GPU. The high

number of cores on a GPU allows this device to perform parallel tasks efﬁciently. A single core of

a GPU, compared to a CPU, is much slower in terms of clock rate. The GPU amortizes this slower

clock speed by running thousands of tasks in parallel. Hence, workloads that possess a high degree

of parallelism are better suited to run on a GPU.

A GPU can outperform a CPU in many workloads that express such parallelism. Dense matrix multiplication is one such application, since the high number of cores present on a single GPU allows many of its component tasks to run in parallel.

Although GPUs excel at accelerating data-parallel tasks by utilizing high levels of con-

currency, they do not perform well on workloads that involve control ﬂow. Control ﬂow statements,

such as “if-else” clauses, require control-ﬂow logic in hardware to boost performance. With a large

number of cores on a GPU, the amount of chip area per core dedicated to control logic is limited.

Hence, workloads that contain a signiﬁcant number of conditional branches tend to perform poorly

on GPUs [51].

2.3 Branch and Control Flow Logic

A computer program consists of a set of instructions. These instructions are executed in

a speciﬁc order (commonly referred to as the control ﬂow of the program). Control ﬂow statements

(e.g., if-else, for, while) allow programmers to create algorithms with divergent execu-

tion sequences/paths. The choice of path is evaluated based on some condition. The

“branch” instruction is a special instruction that can change the execution path by altering the pro-

gram counter (hence called branching). Modern-day CPUs and GPUs overlap instruction execution

in a single stream to gain speedup. This overlap of instructions is called pipelining. Branching

makes it difﬁcult to overlap instructions because the next instruction to be executed might depend

upon the previous instruction’s output.

The CPU uses techniques such as branch prediction, where if the branch is predicted correctly, it incurs little to no execution penalty, but if the branch is mispredicted, the CPU is


forced to squash the contents of the pipeline and continue execution on the correct branch. In

addition, CPUs have a deeper pipeline (with more stages) as compared to GPUs. Hence the penalty

for mispredicting a branch is higher in a CPU, as more instructions will be squashed. The large

control logic area enables the CPU to reduce this penalty of mispredicting branches.

A GPU executes work in warps (the basic unit of execution) [1]. A warp is a col-

lection of GPU threads (NVIDIA GPUs contain 32 threads in a warp). GPUs do not have the

resources to have each of their threads execute divergent branches simultaneously. When execut-

ing a conditional branch instruction, a single warp in the GPU computes both sides of the branch

(sequentially) and discards one of them based on the correct branch path. This is referred to as pred-

ication. Predication works for cases when each branch’s size is considerably small but incurs a large

penalty otherwise. In summary, the CPU is capable of handling workloads with a large number of

conditional statements, while the GPU will encounter signiﬁcant slowdown for such workloads.

2.4 Computational Performance

Floating Point Operations per Second (FLOPS) is a key metric for comparing the perfor-

mance of different hardware designs. This metric captures the number of ﬂoating-point operations

that a device can complete in one second. Floating-point numbers can be stored in memory in dif-

ferent formats based on the precision required. They can be stored in Half Precision (HP) format

(which occupies 16 bits in memory), Single Precision (SP) format (which occupies 32 bits), Double

Precision (DP) format (which occupies 64 bits), or Quadruple Precision (QP) format (which oc-

cupies 128 bits).

A modern CPU, such as Intel’s Xeon 8180 Platinum Processor (from the Skylake mi-

croarchitecture family), has many cores (the Xeon 8180 has 24 cores running at a maximum turbo

frequency of 2.3 GHz). If all cores execute AVX-512 instructions, the Xeon 8180 can reach a peak of 4.12 TFLOPS [104] of single-precision performance, and about 2.06 TFLOPS of double-precision performance [104] (1 TFLOPS = 10^12 FLOPS).

A modern GPU, such as NVIDIA’s A100 GPU (based on NVIDIA Ampere architecture),

has many single and double-precision cores (the A100 has 6912 single-precision cores and 3456

double-precision cores). The A100 can reach peak performance values of 19.5 TFLOPS [74] single-precision and 9.7 TFLOPS [74] double-precision.


2.5 The Properties of Spatial Locality

Spatial locality is the property that instructions and data entities tend to be stored relatively

close together in an address space. Workloads exhibit high spatial locality when they request data

from neighboring memory locations frequently. A sparse matrix multiplication kernel (SpGEMM

kernel) consists of accessing elements in random rows and columns of the input matrix (most of the

rows and columns contain few non-zero elements), which results in a somewhat random memory

access pattern. A large number of non-zero elements in every row can lead to high spatial locality

in matrix multiplication kernels, but sparse matrices tend to have fewer non-zero elements per row,

thus the resulting access-pattern of rows leads to accessing few non-zero elements scattered across

the input matrix. Hence SpGEMM kernels exhibit low spatial locality.

This makes the SpGEMM kernel more difﬁcult to optimize. In an ideal case, for efﬁcient

SpGEMM computations, we would require a large control logic area per core (present on the CPU),

along with a large number of CPU cores, to achieve the high computational performance possible

on a single GPU.

2.6 Sparse Matrix Storage Formats

Matrices are generally stored in a one-dimensional linear array of contiguous memory,

organized in either a row-major or column-major format. Row-major format stores subsequent ele-

ments of a row sequentially in the address space, whereas column-major stores subsequent elements

of a column sequentially [99]. While these storage formats work efﬁciently when working with

dense matrices, they encounter performance issues when working with sparse matrices [28]. The

large number of zeros present in sparse matrices cause the row-major and column-major formats to

store mostly zeros in memory, thus leading to inefﬁcient usage of valuable memory space.

The Compressed Sparse Row (CSR) storage format stores only the non-zeros of a sparse

matrix, recording the index of only the non-zero elements in each row. CSR packs the non-zeros

of each row in a single linear array (data array) and the indices corresponding to each element in

another linear array (column-index array). A third array is used to track the number of elements in

each row (i.e., the row-pointer array). Hence, in order to access subsequent non-zeros in a sparse

matrix, one can iterate over these dense arrays of non-zeros (i.e., the data array and column-index

array). The resulting CSR format is efﬁcient in terms of storage, as well as in terms of computation,

as compared to using dense storage formats. Similar concepts of compressed sparse storage underlie other formats, such as Compressed Sparse Column (CSC) and ELLPACK (ELL).
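As a concrete illustration of the three CSR arrays described above (a small hand-built example, not a dataset used in this thesis), the sketch below stores a 3×4 matrix with four non-zeros and iterates over its non-zeros row by row.

#include <cstdio>
#include <vector>

// CSR layout for the 3x4 sparse matrix:
//     [ 5 0 0 2 ]
//     [ 0 0 3 0 ]
//     [ 0 1 0 0 ]
int main() {
    std::vector<double> data    = {5.0, 2.0, 3.0, 1.0}; // non-zero values, row by row
    std::vector<int>    col_idx = {0,   3,   2,   1};   // column index of each non-zero
    std::vector<int>    row_ptr = {0, 2, 3, 4};         // row i occupies [row_ptr[i], row_ptr[i+1])

    // Iterate over the non-zeros of each row using only the dense CSR arrays.
    for (int i = 0; i + 1 < static_cast<int>(row_ptr.size()); ++i)
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
            std::printf("A[%d][%d] = %g\n", i, col_idx[p], data[p]);
    return 0;
}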


Chapter 3

Related Work

Equipped with a brief introduction to the architecture of the CPU and GPU, and informed by the discussion of spatial locality above, in this chapter we discuss the

class of computations that are the target of this thesis. We begin by discussing libraries that are

commonly used in high-performance computing and take a deeper look into prior implementations

of SpGEMM workloads.

3.1 SpGEMM on CPU

The Basic Linear Algebra Subprograms (BLAS) [72] is a standard set of libraries that

provide high-performance application programming interfaces (APIs) to perform linear algebra op-

erations. Many hardware vendors provide their own performance-tuned implementations for BLAS,

providing advantages for their own architecture. For example, Intel provides Intel Math Kernel Li-

brary (MKL) [105] for their x86 processors. Intel’s implementation of these APIs exploits unique

architectural features (i.e., hardware extensions) present on their CPUs to boost their performance.

Some of these extensions include:

• Streaming SIMD Extensions 4.2 (SSE4.2)

• Advanced Vector Instructions 2 (AVX2)

• Advanced Bit Manipulation (ABM)

• Bit Manipulation Instruction Set 2 (BMI2)

• 3-Operand Fused-Multiply-Add (FMA3)

• Advanced Encryption Instructions (AESI)

• Multiprecision Add Carry (ADX)

• Carry-less Multiplication Extension (CLMUL)

• 16-bit Floating Point Conversion (F16C)

Other manufacturers provide similar extensions. For the Zen architecture, Advanced Micro Devices (AMD) provides the BLIS library, an optimized software implementation of the BLAS subroutines.

In this thesis, we focus on the SpGEMM APIs from the BLAS library in our analysis. In

prior work, Xie et al. [109] described an optimized SpGEMM kernel, evaluated on both a CPU and

a GPU. They used deep learning to train their model (called MatNet) to learn the data distribution

patterns of a matrix. Their algorithm chooses the best format to represent the input data based on

the MatNet model's decisions. By performing input data transformations with MatNet, they were able to accelerate a SpGEMM kernel by 3.27× over Intel's MKL platform and by 13.17× over AMD's platform.

Nagasaka et al. [68] compare the performance of most publicly available implementations

of SpGEMM kernels and propose their own implementation based on hashing and heap-based algo-

rithms. They concluded that speciﬁc implementations work better based on the pattern of non-zeros

in the input data set.

3.2 SpGEMM on GPU

One of the largest GPU chip manufacturers is Nvidia. Similar to the CPU vendors, Nvidia

provides its own library for linear algebra kernels. In this thesis, we focus on efﬁcient SpGEMM

kernels. Nvidia has developed its own libraries that provide SpGEMM kernels, released as part of

cuSparse and CUSP packages. cuSparse is Nvidia’s BLAS implementation to support all sparse

operations. CUSP is an open-source C++ library of generic parallel algorithms used for sparse

linear algebra and graph computations on CUDA architecture GPUs.

Many other attempts have been made to optimize SpGEMM kernels on a GPU. Nagasaka

et al. [68] present an algorithm for efﬁcient sparse matrix multiplication on a Pascal GPU. Their

approach uses a row-counting method (i.e., counting the number of intermediate partial-products

and then grouping rows based on the number of partial-products in each row) [68]. In addition to


accelerating the kernel, they also try to minimize the total memory required for this operation. They achieved a 4.3× speedup on single-precision and a 4.4× speedup on double-precision compared to existing SpGEMM libraries. They also reduced memory usage by 14.7% for single-precision and

10.9% for double-precision, on average.

In general, it is difﬁcult to optimize SpGEMM workloads on GPUs due to the random

data access patterns and the large memory footprint of the intermediate data generated. We look at

accelerators designed speciﬁcally for SpGEMM in the next section.

3.3 SpGEMM Accelerators

Next, we focus on previous attempts made in designing domain-speciﬁc accelerators for

SpGEMM workloads. Table 3.1 provides a comparison between the various accelerators and kernel

implementations developed speciﬁcally for SpGEMM workloads.

Zhang et al. propose the SpArch accelerator [114], an accelerator designed to speed up

SpGEMM kernels. They designed a kernel using the outer-product method for matrix multiplica-

tion. The issue with outer product multiplication is the large number of intermediate partial products

produced by this approach. Their work’s key contribution addresses the partial product generation

problem by designing a streaming-based merger into the processing pipeline, combining the mul-

tiplies with the merge stage of the partial products. This allows the partial products to be merged

on-chip immediately after they are produced. In addition to this optimization, they also proposed

a condensed matrix representation and a Huffman tree scheduler to gain further speedup. They

report an average speedup of 18× over the Intel MKL, cuSparse, and CUSP libraries. Although their

implementation provides considerable speedup compared to other libraries, the merge-tree imple-

mentation occupies a large portion of the overall chip area. Approximately 60% of the chip area is

dedicated to the merge tree implementation, while just 1.6% of the chip area is devoted to a multi-

plication array. In terms of energy, 55% of the energy is spent on merging partial products on-chip.

The extra hardware area and energy requirements devoted to partial product merging leave room for

further optimization, both in terms of accelerator design and the SpGEMM algorithm.

Qin et al. [83] propose SIGMA, an accelerator that tackles irregular memory accesses in

SpGEMM kernels. The fundamental block of their architecture, known as the Flexible Dot Product

Engine (Flex-DPE), consists of switchable interconnects that allow them to build a ﬂexible network

topology. Leveraging a ﬂexible and scalable network topology allows them to keep the utiliza-

tion of their processing elements high. They reported that SIGMA could obtain approximately


a 3× speedup compared to other state-of-the-art accelerators, including a TPU [36], EIE [40], SCNN [77], OuterSPACE [76], Eyeriss v2 [19], Packed Systolic [56], and Cambricon-X [113].

Research Paper               SpGEMM Kernel               Accelerator          Features
SpGEMM on GPU [68]           Outer Product               NVIDIA Pascal GPU    On-chip shared memory merging; hashtable for partial products
OuterSPACE [76]              Outer Product               OuterSPACE           Algorithm-hardware co-design
ExTensor [42]                Inner Product               ExTensor             Hierarchical elimination of computation in the presence of sparsity
MatRaptor [96]               Row-wise Product            MatRaptor            New sparse storage format (C2SR); hardware sorting
Sunway TaihuLight [20]       Partitioned Outer Product   Sunway               Novel partitioning method
SpArch [114]                 Outer Product               SpArch               Streaming-based merger; condensed matrix representation; Huffman tree scheduler
ALRESCHA [9]                 Inner Product               Alrescha             Data-dependent task reordering; locally-dense storage format
Synergistic CPU-FPGA [94]    Row-wise Product            CPU-FPGA             Cooperative CPU-FPGA platform; new intermediate representation based on communication packets
SIGMA [83]                   Row-wise Product            SIGMA                Novel reduction tree microarchitecture; flexible interconnects
SMASH (our approach)         Row-wise Product            PIUMA                Hashtable-based on-chip merge; dynamic load balancing; in-memory computation using PIM modules

Table 3.1: SpGEMM Accelerator Comparison

In 2018, Pal et al. introduced their SpGEMM accelerator called OuterSPACE [76]. They

took a two-phase approach to implement their SpGEMM kernel. In their ﬁrst phase, called the

multiply phase, they perform an outer product of the two input matrices to produce partial products.

In the subsequent phase, called the merge phase, they merge these partial products to form the output

matrix. Although this approach is not new, their work’s novelty lies in their mapping of these phases

to the OuterSPACE architecture. The computation of an outer product when using sparse matrices

causes poor data reuse and unbalanced workload distribution. The OuterSPACE architecture is

designed, keeping in mind these problems associated with SpGEMM. With asynchronous Single

Program, Multiple Data (SPMD) style worker cores coupled with memory hierarchies and shared


reconfigurable caches, OuterSPACE delivered an average 7.9× speedup over Intel's Math Kernel Library, 13.0× over cuSPARSE, and 14.0× over CUSP.

Liu et al. [61] introduce another accelerator that focuses on optimizing SpGEMM kernels

for mobile CNNs, using systolic arrays of Tensor Processing Elements. ALRESCHA [9] is another

accelerator that differentiates itself by having two parts to its architecture; a ﬁxed compute unit and

a light-weight re-conﬁgurable engine. This allows them to adapt to input data sparsity patterns.

Many prior studies on SpGEMM accelerator development have proposed their own sparse

matrix storage format, including a condensed matrix representation in SpArch [114], the REAP

intermediate representation for the CPU-FPGA accelerator [94], C2SR for MatRaptor [96], and

Locally-Dense storage format for ALRESCHA [9]. A common issue with these schemes is that the

degree of reformatting or data rearrangement needed in a sparse matrix, depending on the underlying

architecture, can impact the speedup obtained by these accelerators. Memory access latency is a

major bottleneck for SpGEMM accelerators. Dividing, rearranging, and grouping data based on the

sparsity patterns and accelerator architecture reduces redundant accesses and increases data locality.

Though these novel formats boost performance, they incur associated conversion costs since the

sparse input matrices are typically stored in CSR, CSC, or ELLPACK (ELL) storage formats. These

costs are either in terms of additional hardware requirements or performance degradation or both.

The design of any storage format should consider whether the beneﬁts provided can amortize the

data transformation cost. Considering previous efforts to speed up sparse workloads and considering

the problems faced by these accelerators, we propose our own SpGEMM kernel implementation

developed on a general hardware accelerator architecture.


Chapter 4

PIUMA Architecture and Simulator

The Programmable Integrated Uniﬁed Memory Architecture (PIUMA) is being developed

by Intel [3], as part of DARPA’s Hierarchical, Identify, Verify, Exploit (HIVE) program [47]. The

Hierarchical Identify Verify Exploit (HIVE) project recognizes the challenges involved with graph

analytics and aims to achieve a 1,000×performance/Watt improvement over the previous state-of-

the-art system [64]. The PIUMA system is a scalable systems architecture designed to accelerate

graph-based applications. The system is designed to handle sparsity, bearing in mind the highly-

random data access patterns present in graph workloads.

4.1 PIUMA Architecture

This section provides a detailed description of Intel’s PIUMA machine, exploring some

of the key features of this architecture that are exploited in our implementation of the SpGEMM

kernel. We highlight some of the problems faced in SpGEMM and focus on PIUMA’s components

that help tackle these challenges. Primarily, we focus on the following architectural features of

PIUMA [3]:

1. PIUMA Cores

(a) Multi-Threaded Core (MTC) and

(b) Single-Threaded Core (STC)

2. Ofﬂoad Engines (OE)

(a) DMA engine and


(b) Collective engine

3. Global Address Space

4. Network

4.1.1 PIUMA Cores

The PIUMA cores form the fundamental computational unit of this architecture. They

are designed to exploit the inherent parallelism exhibited by graph-based workloads. In general,

graph workloads are more memory-intensive than compute-intensive [31]. A key principle

in PIUMA is to provide a high degree of parallelism to hide memory latency. PIUMA’s large

number of threads are capable of keeping many memory requests in ﬂight. The PIUMA cores can

be classiﬁed into two types:

1. The Multi-threaded Core (MTC) and

2. The Single-threaded Core (STC)

We take a closer look into the architectural layout of each of these cores.

4.1.1.1 The PIUMA Multi-threaded Core

Each multi-threaded core consists of a single pipeline [3]. The MTC can issue at most one instruction per cycle, providing for an energy-efficient design [3]. In addition, the MTC has an associated Register File (RF). Each RF stores the context of up to 16 threads simultaneously, allowing a single MTC to be shared by up to 16 threads. Each thread represents a single stream of execution. The MTC resources are shared across these 16 threads, utilizing a

round-robin resource allocation routine. When a stage of the MTC’s pipeline stalls, a new thread is

swapped to execute, hiding latency and keeping the pipeline stages full.

4.1.1.2 Single-Threaded Core

Unlike the MTC, the STC comprises a single thread of execution (executing a single

stream of instructions). Since a single instruction stream is being executed every cycle (as opposed

to the round-robin scheduling of MTCs), a higher priority is given to this single thread, making it

capable of handling performance-sensitive tasks [3]. The STCs are in-order cores that block on misses, designed for lower power consumption compared to out-of-order pipelines [3]. The primary purpose of STCs is to perform memory and thread management tasks.

Both the single-threaded and multi-threaded cores are equipped with L1 instruction caches

and L1 data caches. Graph applications are known for their irregular data access patterns [31]. The

resulting irregular memory access patterns lead to low L1, L2, and L3 cache utilization due to low

spatial locality [50]. As graph workloads exhibit poor locality, no higher cache levels are included in the PIUMA architecture, saving on power consumption [3]. All caches throughout the PIUMA

system are non-coherent. Although cache coherency is helpful to maintain data uniformity, it is

associated with large overheads. Caches can be made coherent using protocols such as MSI, MESI, MOSI, and MOESI (Modified, Owned, Exclusive, Shared, Invalid) [39]. These proto-

cols can cause the executing pipeline to stall for thousands of cycles to fetch the new data value

from the main memory or neighboring caches. The PIUMA architecture does not provide cache co-

herency. It thus becomes the responsibility of the programmer to avoid modifying shared data and

ﬂush caches as required [95]. In addition, prefetching is disabled to limit power consumption [3].

Figure 4.1: A single PIUMA block, containing several single-threaded cores (one thread each) and multi-threaded cores (16 threads each), each with its own L1 instruction and data caches, along with a shared scratchpad, a memory port to DRAM, and a port to other PIUMA cores.


Multiple MTCs and STCs are grouped in a block (see Figure 4.1). Each block includes low-latency, user-accessible scratchpad (SPAD) memory. Programmers can use this shared storage for data with high temporal locality [2].

4.1.2 Ofﬂoad Engines

Traditionally, following the Von Neumann architecture [90], programs and data are com-

monly stored in memory and fetched by the processor for execution. This architecture will be

limited by the throughput of the memory channels, commonly known as the Von Neumann bot-

tleneck [70]. In some novel architectures [60, 69, 79], this limitation was overcome by introduc-

ing Computational RAM (C-RAM) technology (i.e., processors in memory). C-RAM is similar to

DRAM but with a vector processing element embedded on the same chip as the memory. This en-

abled C-RAM chips to service instructions other than a simple load or store and perform operations

such as scatter and gather.

The PIUMA architecture provides Ofﬂoad Engines (OE) to support the PIUMA cores in

memory-related operations. This allows selected Single Instruction Multiple Data (SIMD) instruc-

tions to be executed in memory. Some examples of such SIMD instructions supported by PIUMA

include:

1. Copy: Copy a chunk of data from one section of memory to another.

2. Strided Copy: Similar to a copy, but every nth element is copied.

3. Gather: Read an array of data and compute its sum.

4. Scatter: Broadcast a single value to multiple locations in memory.

We discuss two ofﬂoad engines that we use to speed up our SpGEMM algorithm.

4.1.2.1 DMA Engine

Data movement forms an integral part of graph algorithms. One of the major bottlenecks

in graph applications is memory throughput [31]. The Direct Memory Access (DMA) Engine re-

duces the workload on PIUMA cores by executing the memory operations (i.e., loads and stores).

A single copy instruction or gather/scatter instruction can replace thousands of load and store


instructions issued by the cores. The DMA engine carries out the underlying task of copying multi-

ple bytes of data or broadcasting data to multiple memory locations. All instructions issued to the

DMA engine run in the background (non-blocking), freeing the core to execute other instructions.
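The exact PIUMA DMA interface is not spelled out here, but the following C++ analogy (standard-library primitives only; the helper name offload_copy is hypothetical) captures the intent: a single non-blocking bulk copy is issued in place of thousands of per-element loads and stores, and the issuing thread overlaps other work until it synchronizes on the copy.

#include <cstring>
#include <future>
#include <numeric>
#include <vector>
#include <cstdio>

// Analogy for a DMA-style offload: the "engine" performs the bulk copy while the
// issuing thread keeps computing, then synchronizes when the data is needed.
std::future<void> offload_copy(double* dst, const double* src, std::size_t n) {
    return std::async(std::launch::async,
                      [=] { std::memcpy(dst, src, n * sizeof(double)); });
}

int main() {
    std::vector<double> src(1 << 20, 1.0), dst(1 << 20, 0.0);
    std::future<void> copy_done = offload_copy(dst.data(), src.data(), src.size());

    // The "core" overlaps independent work with the in-flight copy.
    double unrelated = std::accumulate(src.begin(), src.begin() + 1000, 0.0);

    copy_done.wait();  // synchronize before consuming the copied data
    std::printf("unrelated=%g, dst[0]=%g\n", unrelated, dst[0]);
    return 0;
}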

4.1.2.2 Collective Engine

One important element of many parallel algorithms is the required synchronization. The

PIUMA architecture is equipped with a collective engine that provides system-wide barriers and

reduction operations [3].

One of the key features presented by the PIUMA architecture is the ability to ofﬂoad

instructions over the fabric to remote cores. Threads can wrap their instructions in network packets

and forward them to remote threads for execution. These instructions are called remote or network

instructions.

When a thread intends to update data present in remote memory (memory physically

connected to neighboring cores), it can send remote instructions to be executed by another thread,

local to that memory. Thus, instead of streaming data over the network, we stream instruction

packets to the core that is physically connected to that memory chunk. Network instructions can be

helpful in two ways:

1. forwarding instructions to threads that have low latency access to data can improve perfor-

mance, and

2. distributing the workload among threads can improve workload balance.

A remote atomic is one such network instruction; it allows atomic operations to be executed on memory that is remote to the issuing core. We make use of remote atomics in our algorithm to update the partial products in our hash table.
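As a shared-memory analogy for what a remote atomic accomplishes (this is not the PIUMA ISA; on PIUMA the update is shipped as a network instruction to the core closest to the data), the C++ sketch below lets several threads merge partial products into the same hashtable slot with an atomic compare-and-swap loop, so no update is lost.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Several threads accumulate into the same hashtable slot without locks:
// a compare-and-swap loop retries until the addition is applied atomically.
void atomic_add(std::atomic<double>& slot, double value) {
    double expected = slot.load();
    while (!slot.compare_exchange_weak(expected, expected + value)) {
        // expected was refreshed with the current value; retry.
    }
}

int main() {
    std::atomic<double> slot{0.0};  // one hashtable entry for some output column
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&] { for (int i = 0; i < 1000; ++i) atomic_add(slot, 1.0); });
    for (std::thread& w : workers) w.join();
    std::printf("accumulated = %g (expect 4000)\n", slot.load());
    return 0;
}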

4.1.3 Global Address Space

Distributed Global Address Space (DGAS) is a model in which the memory address space

is logically divided, and sections of this address space are local to each computing thread. DGAS

allows using an SPMD style of programming [111], while supporting data addressing semantics

similar to a shared memory system.

The PIUMA architecture is built using DGAS, allowing data present on any core to be accessible by any other PIUMA core/thread.

Figure 4.2: A single PIUMA die, composed of multiple PIUMA blocks (each holding STCs and MTCs) with access to DRAM.

This allows the programmer to worry less about the scope of memory access pointers and focus instead on parallelizing the workload's execution. Each

PIUMA thread has an afﬁnity to speciﬁc sections of the DGAS. Despite each thread having access to

the entire address space, local memory accesses will experience lower latency than remote accesses.

Thus, data stored in the DGAS partition belonging to a thread is said to have an afﬁnity to that

thread. We use DGAS when designing our sparse matrix algorithm to broadcast sections of the

input matrix from the ﬁrst core to all other cores involved in computing the workload. This will

transfer the elements to each thread's local memory, as described in detail in Chapter 5.

The PIUMA system consists of Address Translation Tables (ATT). ATT contains recon-

ﬁgurable rules to translate application memory addresses to physical memory locations, enabling us

to rearrange the address space as needed by the application [3].

In addition, the PIUMA memory controllers are redesigned to support native 8-byte ac-

cesses [3]. Instead of a full cache line fetch, the memory controllers can selectively fetch 8-byte

words, reducing redundant memory fetches.

4.1.4 Network

The PIUMA network connects blocks (groups of MTCs and STCs) together and forwards

memory requests to remote memory controllers. The PIUMA system is conﬁgured in a HyperX

topology to achieve high bandwidth and low latency [3, 6]. This allows the network to have a high

radix and a low diameter. The higher-level links are optical to sustain high-bandwidth at low power

consumption levels.

A single PIUMA block consists of both the STCs and MTCs. Each STC and MTC is

accompanied by an L1 instruction cache and an L1 data cache. Each block is accompanied by local

high-bandwidth Scratchpad memory (see Figure 4.1). Multiple PIUMA blocks are laid out together

to form a die, as shown in Figure 4.2). Figure 4.3 presents the layout of an entire PIUMA system at

node, subnode, and die level.

The PIUMA system is a novel approach to tackle the problems present in graph-based

workloads. The simple, in-order, multi-threaded cores and non-coherent caches aid in hiding latency

of random memory accesses present in graph applications. The ofﬂoad engines, including the DMA

engine and collective engine, work alongside the PIUMA cores to help with memory operations and

synchronization. The distributed shared global address space makes it convenient for programmers

to implement distributed kernels. Finally, the HyperX topology of the PIUMA network with optical

interconnects delivers a design that can scale out to multiple nodes, allowing computation of graphs


Figure 4.3: The PIUMA system, organized into nodes, subnodes, and dies.


with trillions of vertices [3].

4.2 Simulation Methodology

New computer architecture features are commonly evaluated pre-silicon
using a simulator [7]. Simulators are software models used to simulate the behavior of various

design features. In contrast to simulation models, analytical models are statistical models that can

mathematically evaluate elements of a computer architecture. Owing to the complexity of today’s

computer architectures and the large number of conﬁgurable parameters, analytical models tend to

produce inaccurate results and hence are not suitable for evaluating computer architectures [7].

Figure 4.4: Classification of simulators based on the detail of simulation, the scope of the target, and the input to the simulator.

The design cycle and silicon fabrication process required to produce physical hardware are


both time-consuming and costly. Simulators allow architects to make design decisions
pre-silicon, before committing to hardware, thus lowering development costs [7], making them an

integral part of today’s hardware design process.

4.2.1 Simulator Classiﬁcation

For our simulation of PIUMA architecture, we use the Sniper simulator [43]. To begin,

we ﬁrst review different classes of simulators to better understand the advantages of the Sniper

simulator over other choices. Simulators can be grouped into various classes based on a range of

factors, including the level of detail of the simulation, the scope of the target (the system that is

being simulated), and input that drives the simulator. Figure 4.4 provides an overview of the various

classes of simulator [7]. We brieﬂy describe each class and discuss the advantages/disadvantages

provided by each.

Our classification begins by considering the level of detail of the simulation:

•Functional Simulators: Functional simulators focus on the functionality of the modeled

architecture. They provide for emulation of the target ISA, but they do not implement the

underlying microarchitecture of the target system, making them faster than other simulators.

•Cycle-level Simulators: Cycle-level simulators model the operations of a processor cycle-

by-cycle. Cycle-level simulators are relatively slower than functional simulators and consume

a signiﬁcant amount of memory resources.

•Event-driven Simulators: Event-driven simulators target events instead of cycles. The simu-
lation steps through time, jumping to events when they are scheduled [63], thus saving time by
not simulating cycles in which no event is scheduled.

•Interval Simulators: While cycle-level simulators are accurate, they are very slow. Event-

driven simulators are fast, but compromise accuracy. Interval simulators strike a balance

between speed and accuracy [33]. Interval simulators build on the observation that the flow of instructions
through a pipeline can be broken down into sets of intervals, delimited by miss events (miss

events include branch mispredictions and cache misses). A distinct branch misprediction

simulator and cache miss simulator can then be used to accurately evaluate every interval’s

performance.

•Integrated Timing-directed Simulators: Functional simulators are often integrated with

timing simulators. In a timing-directed simulator, the functional-simulator records the archi-


tectural state (register and memory) of the processor and forwards it to the timing simulator.

The timing simulator then uses these values to perform corresponding computations. There

exists heavy communication between the functional simulator and the timing simulator.

•Integrated Functional-ﬁrst Simulators: In a functional-ﬁrst simulator, a functional-simulator

leads the timing simulator. The functional simulator generates an instruction trace and for-

wards it to the timing simulator. Functional-ﬁrst simulators are similar to trace-driven simu-

lators, the distinguishing factor being that the traces are generated by the functional simulator
and forwarded to the timing simulator immediately, whereas trace-driven simulators store traces
to a file after generation.

•Integrated Timing-ﬁrst Simulators: In a timing-ﬁrst simulator, the timing simulator leads

the functional simulator. The timing simulator simulates the microarchitecture at the cycle level
and then uses a functional simulator for verification purposes. In cases where the timing sim-

ulator results do not match the functional simulator, the timing simulator ﬂushes its pipeline

and restarts from the fetch cycle for that instruction.

Classiﬁcation based on the scope of the target architecture is as follows:

•Full System Simulators: Full system simulators support booting the entire operating system

(OS) and run target applications in that OS. Full system simulators are signiﬁcantly slower

and are generally used to simulate I/O devices.

•Application-level Simulators: As opposed to full system simulators, application-level sim-

ulators only simulate target applications. Since they do not incur the high overhead of simu-
lating the operating system or system calls, this class of simulator is significantly faster.

Classiﬁcation based on input to the simulator is as follows:

•Trace-driven Simulators: Trace-driven simulators use trace ﬁles as input. Trace ﬁles are

pre-recorded streams of instructions from a previous run of the application. Trace ﬁles are

usually stored in a ﬁle system and occupy a large amount of space, which in some cases,

becomes a bottleneck for simulation [48][30].

•Execution-driven Simulators: In contrast to trace-driven simulators, execution-driven sim-

ulators use application binaries as input instead of trace-ﬁles. These application binaries are

signiﬁcantly smaller in size as compared to trace ﬁles. Execution-driven simulators can sim-

ulate misspeculated instructions, unlike trace-driven simulators.


4.2.2 Sniper Simulator

Sniper is a parallel, multi-core x86 architecture simulator [43]. For functional simulation,

Sniper uses the Graphite simulator, which is based on the Pin tool [65]. Sniper is classiﬁed as an

Application-level simulator, making it faster than full system simulators.

Sniper is an interval-based simulator. Instead of simulating each instruction, Sniper breaks

down the stream of instructions into discrete sets called intervals. These intervals are based on

events, such as cache misses and branch mispredictions. Hence, the Sniper simulator is significantly
faster owing to its ability to 'jump' between miss events [43]. A special branch predictor simulator

coupled with a memory system simulator can then be used to evaluate miss events. The metrics of

these simulators are then compiled with the analytical model’s ﬁndings to estimate the duration of

every interval [7].

In this thesis, to evaluate performance of our proposed scheme, we use a modiﬁed imple-

mentation of the Sniper simulator. An interval-based simulator, like Sniper, simulates processors

at a higher level of abstraction. Using such a simulator allows us to simulate multi-core proces-

sors efﬁciently (several million instructions per second), as compared to a detailed cycle-accurate

simulator. Although cycle-level simulators, such as Gem5 [14], are more accurate than high-level

simulators, they tend to be signiﬁcantly slower, limiting our options to simulate a range of hardware

conﬁgurations [17]. To record these observations, we use a host machine based on the machine

conﬁguration described in Table 4.1. Our target architecture simulator conﬁguration is as shown in

Table 4.2.

Item | Description
Manufacturer | Intel Corporation
System Details | Intel Server Board S2600TP Family
CPU | Intel Xeon CPU E5-2699A
Threads per core | 2
Total cores per CPU | 18
Number of sockets | 2
Total threads in node | 72
Max CPU frequency | 3.4 GHz
RAM | 256 GB
OS | CentOS Linux release 7.7.1908 (Core)

Table 4.1: Simulator host machine specifications.


Scope | Configuration | Value | Description
Machine Global | Rack Count | 1 | Number of racks in the system
Machine Global | Board Count | 1 | Number of socket units per rack
Machine Global | Socket Count | 1 | Number of sockets
Machine Global | Die Count | 1 | Number of dies
Machine Global | Core Count | Varying (1 to 8) | Number of cores per die
Socket Global | DRAM Count | 1 | Number of MCs to external DRAM banks per socket
Socket Global | DRAM Size | unlimited | Size in MB per MC for external DRAM
Core Global | STC Count | 2 | Number of STCs per core
Core Global | MTC Count | 4 | Number of MTCs per core
Core Global | Core SPAD Count | 1 | Total logical scratchpad entries per block
Core Global | Core SPAD Size | 4,096 | Size of each scratchpad in KB
Core Global | Cache Size | 16 | Size in KB of the cache module
Core Global | Cache Assoc. | 4 | Associativity of the cache
Core Global | Cache Line Size | 64 | Line size in bytes of the cache module
Core Global | Cache Policy | wb-wa | Replacement policies of the cache (write-back, write-allocate)

Table 4.2: Simulator target configuration for the PIUMA architecture.


Chapter 5

SMASH Kernels

We have reviewed the problems present in prior sparse matrix multiplication algorithms.

We have also described the PIUMA architecture that we will target in this work. In this chapter, we

will present a new SpGEMM kernel that can fully exploit the features of the PIUMA machine.

One of the key design points for our SpGEMM kernel implementation will be to choose

between the four general matrix-multiplication approaches (presented in Figure 1.2). The inner

product approach for sparse matrix multiplication faces issues due to the slow index-matching pro-

cess, in addition to exhibiting poor input data reuse [76]. The outer-product approach generates a

large number of intermediate partial products. These partial products have to be buffered some-

where for later merging. The high on-chip memory requirements of the outer-product approach

make it unsuitable for multiplying extremely sparse matrices.

We introduce our novel implementation of the SpGEMM kernel based on a row-wise

product method called Sparse Matrix Atomic Scratchpad Hashing or (SMASH). The row-wise prod-

uct method is beneﬁcial given its high input reuse behavior [83]. It gives us the ability to perform a

minimum number of input matrix reads while maintaining low on-chip memory usage.

In this thesis, we present 3 different versions of our SMASH kernel. Each version im-

proves the efﬁciency of a different section of our Sparse Matrix Atomic Scratchpad Hashing (SMASH)

implementation.

We adopt an iterative improvement approach to identify bottlenecks in the current version

and modify our algorithm to mitigate them in the next version. The following sections will describe

in detail the 3 different versions of the SMASH kernel.


5.1 SMASH Version 1: Atomic Hashing

A row-wise product method multiplies each element of the first input matrix with an entire

row of the second input matrix to generate a partial product for the output matrix. These partial

products are then merged to form output matrix elements.

Each row of the first input matrix, when multiplied by its corresponding rows from the

second input matrix, will generate a series of partial product matrices, as seen from Equation 1.3.

This is one of the disadvantages of using a row-wise product method. The intermediate results (par-

tial products) generated need to be stored in the main memory and refetched to be merged back into

the output matrix. We overcome this obstacle with our ﬁrst implementation of the SMASH kernel

by using atomic hashing. Instead of writing the partial products back to memory, we implement a

streaming mechanism to merge them on-chip, thus avoiding redundant writes to DRAM.
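For reference, a plain single-threaded row-wise (Gustavson-style) product over CSR inputs can be sketched as below. It uses a dense accumulator per output row, which is precisely the structure that SMASH replaces with the on-chip atomic hash table described in the following phases; the Csr struct and names are illustrative only, not the thesis code.

#include <cstddef>
#include <vector>

// CSR storage: row_ptr has n_rows+1 entries; col_idx/vals hold the nonzeros.
struct Csr {
    std::size_t n_rows = 0, n_cols = 0;
    std::vector<std::size_t> row_ptr;
    std::vector<std::size_t> col_idx;
    std::vector<double> vals;
};

// Row-wise product with a dense accumulator per output row.
Csr rowwise_spgemm(const Csr& A, const Csr& B) {
    Csr C;
    C.n_rows = A.n_rows;
    C.n_cols = B.n_cols;
    C.row_ptr.push_back(0);
    std::vector<double> acc(B.n_cols, 0.0);
    std::vector<bool> used(B.n_cols, false);

    for (std::size_t i = 0; i < A.n_rows; ++i) {
        // Every nonzero A(i,k) scales row k of B; partial products with the
        // same column index j merge into the same accumulator slot.
        for (std::size_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            std::size_t k = A.col_idx[p];
            double a = A.vals[p];
            for (std::size_t q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q) {
                std::size_t j = B.col_idx[q];
                acc[j] += a * B.vals[q];
                used[j] = true;
            }
        }
        // Emit row i of C and reset the accumulator.
        for (std::size_t j = 0; j < B.n_cols; ++j) {
            if (used[j]) {
                C.col_idx.push_back(j);
                C.vals.push_back(acc[j]);
                acc[j] = 0.0;
                used[j] = false;
            }
        }
        C.row_ptr.push_back(C.col_idx.size());
    }
    return C;
}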

The SMASH implementation can be divided into three distinct phases:

1. the window distribution phase,

2. the hashing phase and

3. the write-back phase.

Each phase’s completion is accompanied by a synchronization barrier that spans the entire

PIUMA system.

5.1.1 Window Distribution Phase

Our window distribution phase begins with reading both input matrices. We store the

input data matrices in the CSR format. This helps us in two key respects:

1. The row pointer array of the CSR format allows us to compute the amount of memory

required to allocate the output matrix. This computation is inspired by Gustavson's algorithm
for SpGEMM [38, 32, 34, 55, 87].

2. The element access pattern of the row-wise product method involves obtaining rows of input

matrices. Thus, the data arrangement in the CSR storage format improves the spatial locality

pattern of our solution.

After reading the input matrix arrays in CSR format, we compute the required amount

of memory needed to store the output matrix by counting the total Floating-point Multiply and


Figure 5.1: Window Distribution Algorithm.


Add (FMA) operations per row. To accomplish this, we use Gustavson's two-step algorithm [38].
The computation of the total number of FMAs per row has a computational complexity of O(n),
where n is one of the dimensions of the input matrix.
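A minimal sketch of this counting step is shown below: for every row i of A, the FMA count is the sum of the lengths of the B rows selected by the column indices of A's nonzeros in row i. Both matrices are assumed to be in CSR form, as in the text above; the function and array names are illustrative, not the exact thesis code.

#include <cstddef>
#include <vector>

// First step of Gustavson's two-step scheme: the per-row FMA counts size the
// output allocation and drive the dense/sparse classification and windowing.
std::vector<std::size_t> count_fma_per_row(
        const std::vector<std::size_t>& A_row_ptr,
        const std::vector<std::size_t>& A_col_idx,
        const std::vector<std::size_t>& B_row_ptr) {
    std::size_t n_rows = A_row_ptr.size() - 1;
    std::vector<std::size_t> fma(n_rows, 0);
    for (std::size_t i = 0; i < n_rows; ++i)
        for (std::size_t p = A_row_ptr[i]; p < A_row_ptr[i + 1]; ++p)
            fma[i] += B_row_ptr[A_col_idx[p] + 1] - B_row_ptr[A_col_idx[p]];
    return fma;
}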

Once an array of FMAs is generated, we decide, on a row-by-row basis, if each row

should be computed as a dense row or a sparse row. The decision is made using a threshold value
specifying the maximum number of non-zero elements that may be present in a sparse row.

Next, we group multiple rows into a single window to be computed by one PIUMA block.

The size of a window is a function of the Scratchpad (SPAD) size. Sections of input matrices are

then packaged and shipped to individual blocks in network packets using PIUMA’s global address

space feature. This data is then stored in the block’s DRAM, ready to be processed.
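The grouping step can be sketched as below, where consecutive rows are packed into a window until the estimated SPAD demand (approximated here by the per-row FMA counts) would be exceeded. This is a simplification under stated assumptions: the actual window size is computed dynamically per window and also accounts for the dense/sparse classification.

#include <cstddef>
#include <utility>
#include <vector>

// Group consecutive rows into windows so that the estimated number of hash
// entries fits in one block's SPAD. spad_capacity is expressed in entries.
std::vector<std::pair<std::size_t, std::size_t>>
build_windows(const std::vector<std::size_t>& fma_per_row,
              std::size_t spad_capacity) {
    std::vector<std::pair<std::size_t, std::size_t>> windows;  // [begin, end)
    std::size_t begin = 0, load = 0;
    for (std::size_t i = 0; i < fma_per_row.size(); ++i) {
        if (load + fma_per_row[i] > spad_capacity && i > begin) {
            windows.emplace_back(begin, i);  // close the current window
            begin = i;
            load = 0;
        }
        load += fma_per_row[i];
    }
    if (begin < fma_per_row.size())
        windows.emplace_back(begin, fma_per_row.size());
    return windows;
}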

Every individual PIUMA block processes its own window independently, regardless of

the status of other windows. This allows us to schedule windows to blocks in random order and

oversubscribe windows to blocks (Blocks with windows containing largely sparse rows can be over-

subscribed, as they will end up completing before other windows). Details of the window setup and
distribution can be seen in Algorithm 1.

Algorithm 1: SMASH SETUP
Input: Matrix to be multiplied: mat_A (stored in DRAM)
Input: Matrix to be multiplied: mat_B (stored in DRAM)
Output: Final output matrix: mat_C = mat_A × mat_B (stored in DRAM)

// Read mat_A in CSC format and mat_B in CSR format
A_col_ptr        ← array of column pointers of matrix A in CSC format
A_row_idx        ← array of row indices of matrix A in CSC format
A_data           ← array of data values of matrix A in CSC format
B_row_ptr        ← array of row pointers of matrix B in CSR format
B_col_idx        ← array of column indices of matrix B in CSR format
B_data           ← array of data values of matrix B in CSR format
A_col_ptr_copy_1 ← first copy of the column pointer array of A
A_col_ptr_copy_2 ← second copy of the column pointer array of A
hash_size        ← SPAD_SIZE
matrix_size      ← 2^17            // number of rows or columns in matrix A
window_size                        // computed dynamically for each window
element_size     ← size of one element
Launch tasks on all threads
tid              ← unique thread ID for every thread
hash_shift       ← log2(Total_bins_in_Window / Total_bins_in_SPAD)
EMPTY            ← −1              // a unique flag
for w ← 1 to Total_Windows do
    // HASHING PHASE (Algorithm 2)
    barrier
    // WRITE-BACK PHASE (Algorithm 5)
    barrier
end


5.1.2 Hashing Phase

In the hashing phase, a global hashtable is created in the SPAD. A single row is allocated

to one thread of each MTC in a round-robin fashion. Each element of the row from the ﬁrst matrix

is multiplied with an entire corresponding row of the second matrix. This leads to the creation of

partial products. These partial products are hashed into the SPAD using bit-shift hashing.


Figure 5.2: Collision Resolution.

To hash using bit-shifts, we ignore the lower n bits and store the elements based on the
upper n−1 bits. The hashing is performed following Equation (5.1),

\[ H(x) = \frac{x}{2^{n}} \]  (5.1)

where n is the number of bits shifted.

In the case of a collision, we resolve the conflict by adding 1 to the position tag, thus offsetting
the storage location by 1 towards the right. We repeat this collision resolution until an empty space
is found in the hashtable (a hashtable walk). To prevent data races, we use atomic compare-and-
exchange instructions to test for empty locations in the hashtable. This collision resolution is shown


in Figure 5.2.

The use of the upper bits (the high-order bits) for hashing preserves the partial products’

sorted order in the hashtable. Whenever collisions occur, the hashtable walk disrupts this order,

making the table semi-sorted (most elements will be in sorted order, with only a few outliers). To

merge the partial products, we employ a simple atomic fetch and add instruction to add partial

products together. Our hashtable, with its tag and data pairs, is shown in Figure 5.3. This hashtable
is stored in SPAD memory for quick updates by the atomic instructions.

Figure 5.3: Tag-Data Hashtable.

The pseudo-code for the entire hashing phase is shown in Algorithm 2.
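The core insert-or-accumulate step of this phase can also be summarized with a compact sketch that uses standard C++ atomics in place of PIUMA's scratchpad atomics. The array names and the EMPTY flag follow the pseudocode above, but this is an illustrative sketch rather than the exact kernel.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::int64_t kEmpty = -1;  // unique flag marking an unused slot

struct HashPad {
    std::vector<std::atomic<std::int64_t>> tags;  // hashed column tags
    std::vector<std::atomic<double>> vals;        // accumulated partial products
    explicit HashPad(std::size_t n) : tags(n), vals(n) {
        for (auto& t : tags) t.store(kEmpty, std::memory_order_relaxed);
        for (auto& v : vals) v.store(0.0, std::memory_order_relaxed);
    }
};

// Insert-or-accumulate one partial product. The slot is claimed with an atomic
// compare-and-exchange; on a collision the probe walks right (the hashtable
// walk) until an empty slot or a matching tag is found, and the value is then
// merged with an atomic read-modify-write.
inline void hash_insert(HashPad& pad, std::int64_t key, double value,
                        unsigned hash_shift) {
    std::size_t pos =
        (static_cast<std::size_t>(key) >> hash_shift) % pad.tags.size();
    for (;; pos = (pos + 1) % pad.tags.size()) {
        std::int64_t expected = kEmpty;
        bool claimed = pad.tags[pos].compare_exchange_strong(expected, key);
        if (claimed || expected == key) {
            double old = pad.vals[pos].load(std::memory_order_relaxed);
            while (!pad.vals[pos].compare_exchange_weak(old, old + value)) {
            }
            return;
        }
    }
}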

5.1.3 Write-back Phase

The write-back phase moves the partial products from the hashtable to their ﬁnal output

matrix, stored in DRAM in the CSR format. In the write-back phase, ﬁrst, we employ a sorting

mechanism to sort the partially sorted hashtable. We take advantage of the partially sorted hashtags

for our implementation and sort them using a variation of insertion sort. Using insertion sort also

helps us merge the remaining partial products by matching their tags. This enables us to stream

elements from the hashtable, in ascending order, directly to the output matrix stored in DRAM
in the CSR format. Pseudocode for this phase is shown in Algorithm 5.

An overview of the entire SMASH algorithm is presented in Figure 5.1 and Figure 5.4.


Algorithm 2: SMASH HASHING
// READ PHASE
while the end of the window has not been reached do
    // Atomically distribute work to each thread
    token_id ← each thread receives one unique token
    if token_id % 2 = 0 then
        row_begin ← A_col_ptr_copy_1[token_id / 2]
    else
        row_begin ← A_col_ptr_copy_2[token_id / 2]
    end
    row_end ← A_col_ptr[token_id / 2 + 1]
    for i ← row_begin to row_end do
        if i is within the assigned window then
            col_begin ← B_row_ptr[token_id / 2]
            col_end   ← B_row_ptr[token_id / 2 + 1]
            if token_id % 2 = 0 then
                // Hash EVEN section (Algorithm 3)
            else
                // Hash ODD section (Algorithm 4)
            end
        end
    end
end
A_col_ptr_copy_1 and A_col_ptr_copy_2 now reflect the new positions

Algorithm 3: SMASH HASHING, Even Section
for k ← col_begin to (col_end − col_begin) / 2 do
    // Multiply an element from mat_A with one from mat_B and store its tag and value
    tag ← X coordinate of the mat_A element and Y coordinate of the mat_B element
    // Hash the tag
    tag ← tag >> hash_shift
    if SPAD_tag[tag] = EMPTY then
        SPAD_tag[tag] ← tag          // store tag on the scratchpad
        SPAD_val[tag] ← value        // store value on the scratchpad
    else
        if SPAD_tag[tag] = tag then
            SPAD_val[tag] += value   // accumulate value
        else
            // probe for an empty slot on the scratchpad
        end
    end
end


Algorithm 4: SMASH HASHING, Odd Section
for k ← col_end down to (col_end − col_begin) / 2 do
    // Multiply an element from mat_A with one from mat_B and store its tag and value
    tag ← X coordinate of the mat_A element and Y coordinate of the mat_B element
    tag ← tag >> hash_shift
    if SPAD_tag[tag] = EMPTY then
        SPAD_tag[tag] ← tag          // store tag on the scratchpad
        SPAD_val[tag] ← value        // store value on the scratchpad
    else
        if SPAD_tag[tag] = tag then
            SPAD_val[tag] += value   // accumulate value
        else
            // probe for an empty slot on the scratchpad
        end
    end
end

Algorithm 5: SMASH WRITEBACK
// WRITE PHASE
// Divide the SPAD into 64 equal sections
scan_start ← tid × Total_Bins_on_SPAD / 64
// Add an offset to scan_start to account for overflow
scan_start ← scan_start − OFFSET_THRESHOLD
scan_end   ← (tid + 1) × Total_Bins_on_SPAD / 64
index ← 0                            // counter tracking the last written element of matrix C
for i ← scan_start to scan_end do
    if SPAD_tag[i] ≠ EMPTY then
        // Match value on the minimum heap tree
        mat_C_tag[tid][index] ← SPAD_tag[i]   // copy tag from scratchpad to DRAM
        mat_C_val[tid][index] ← SPAD_val[i]   // copy value from scratchpad to DRAM
        index ← index + 1
    end
end


Figure 5.4: SMASH Algorithm (hashing phase followed by the write-back phase).


5.2 SMASH Version 2: Tokenization

The previous version allows one row of the input matrix to be assigned to one thread on

the PIUMA block. This leads to a considerable imbalance in work across all threads in a block. The

SpGEMM datasets are notorious for encountering workload imbalance during kernel execution.

We tackle the issue of workload imbalance by adding a dynamic work scheduler layer

in our Hashing phase. Instead of statically allocating rows to threads in a round-robin fashion,

we adopt the producer-consumer model for row allocation. The dynamic row allocation works as

follows:

1. Generate two tokens for every single row present in the window.

2. Each PIUMA thread polls for a single token. Thus, every row is allocated 2 PIUMA threads.

3. These 2 PIUMA threads start hashing the row. The ﬁrst thread starts from the beginning of

the row and hashes the ﬁrst half of the row (i.e., the even section). The second thread does

the same over the second half of the row (i.e., the odd section).

4. Partial products from both threads are hashed into a common hashtable, stored in the SPAD

memory.

5. When all of the tokens have been polled, the window execution is completed.

The split of workload between even and odd sections can be seen in Algorithms 2, 3, and 4.
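A stripped-down sketch of this token-based scheme is shown below, using a shared atomic counter as the token source and standard C++ threads in place of PIUMA MTC threads; hash_row_half is a hypothetical placeholder for the hashing work itself and is not part of the actual kernel.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Dynamic work distribution via tokens: two tokens are generated per row, and
// every thread polls the shared counter for its next token. token / 2 selects
// the row, token % 2 selects the even or odd half of that row.
void run_window(std::size_t rows_in_window, unsigned num_threads) {
    std::atomic<std::size_t> next_token{0};
    const std::size_t total_tokens = 2 * rows_in_window;

    auto worker = [&] {
        for (;;) {
            std::size_t token = next_token.fetch_add(1);
            if (token >= total_tokens) break;   // window complete
            std::size_t row = token / 2;
            bool even_half  = (token % 2 == 0);
            // hash_row_half(row, even_half);   // hypothetical hashing call
            (void)row; (void)even_half;
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}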

Despite the overhead of polling tokens, tokenization yields a significant speedup over static

allocation, as it achieves a near-perfect distribution of workload across threads. More details of the

performance are presented in the following chapter.

Another optimization carried out in SMASH Version 2 is the change of hashing bits. In

SMASH Version 1, we used the high-order bits for hashing in the hashtable. The downside of using

high-order bits is that if two elements with their tag values close to each other need to be hashed,

they will end up hashing to the same position, thus following the collision routine. Hashing on high-

order bits groups clusters of adjacent elements together. Instead, in this version, we chose to hash on

low-order bits by setting the top n high-order bits to zero. Using low-order bits evenly distributes
a cluster of elements over the entire hashtable, thus sharply reducing the number of collisions. This
can be seen in Figure 5.5.

The disadvantage of using low-order bits is that the order of hashing is no longer pre-

served. The hashtable is no longer partially sorted, as in the case of the previous version. We


overcome this problem by merging all partial products of the same tag before writing them to the

hashtable. Even though the order is not preserved and the output matrix in CSR format is not sorted,

the correctness of the solution is maintained, as all partial products are properly merged.
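The two hashing choices can be contrasted in a few lines; the helper names below are illustrative only.

#include <cstdint>

// SMASH V1: keep the high-order bits (Equation 5.1). This preserves the
// sorted order of tags but clusters nearby tags into the same bin.
inline std::uint64_t hash_high(std::uint64_t tag, unsigned shift) {
    return tag >> shift;
}

// SMASH V2: keep the low-order bits by zeroing the upper bits. Neighboring
// tags spread across the whole table, at the cost of losing sorted order.
inline std::uint64_t hash_low(std::uint64_t tag, unsigned bits) {
    return tag & ((std::uint64_t{1} << bits) - 1);
}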


Figure 5.5: Hashing on low-order bits.

5.3 SMASH Version 3: Fragmenting Memory

Previous sections describe how atomic hashing removes redundant accesses to the partial
product matrices and how tokenization balances the workload across PIUMA threads.

This section describes the integration of the DMA engine in the SMASH algorithm by

fragmenting the SPAD memory. The DMA engine provides the PIUMA system the capability to

move data independently within its global address space while the PIUMA blocks work on other

segments of the algorithm.

We incorporate a copy instruction to move data from the SPAD to DRAM, as well as a

scatter instruction to prepare the DRAM for the next window. To use the DMA engine efﬁciently,

we make modifications to our previous version of SMASH. These modifications include:

1. In addition to hashing on a common global hashtable, the PIUMA threads also maintain a

private local array that stores the tag values in a dense array.

2. Instead of storing the hashtable in the SPAD, we store the hashtable in DRAM, a memory

that has lower bandwidth but more available space.

3. As every element gets hashed on the global hashtable, we store its position in the hashtable

in a separate array in the SPAD, called the offset array.


Figure 5.6: Tag-Offset Hashtable in DRAM.

The layout of the SPAD in SMASH V3 is best described by Figure 5.7, and the layout of
DRAM is best described by Figure 5.6.

SMASH Version 1 and SMASH Version 2 store tag and data values in a hashtable, scat-

tered across a large array. SMASH Version 3, on the other hand, stores the tag values and data

values in dense arrays. These dense arrays can then be easily transferred to the DRAM by simple

copy instructions. Also, since the DMA engine runs in parallel with the MTCs, the PIUMA threads

will not have to spend cycles moving these dense arrays from the SPAD to DRAM.
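The per-thread SPAD state in this version can be pictured with the sketch below: partial products are appended to dense tag and value arrays while an offset array records each entry's slot in the DRAM-resident hashtable, and at the end of a window the dense arrays are handed to the DMA engine in one bulk copy. The structure, field names, and the commented dma_copy call are illustrative assumptions, not the exact PIUMA interfaces.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative per-thread SPAD-resident state in SMASH V3.
struct ThreadSpad {
    std::vector<std::int64_t> tags;     // dense tag array
    std::vector<double>       values;   // dense value array
    std::vector<std::size_t>  offsets;  // slot of each entry in the DRAM hashtable

    void append(std::int64_t tag, double value, std::size_t dram_slot) {
        tags.push_back(tag);
        values.push_back(value);
        offsets.push_back(dram_slot);
    }

    // At the end of a window, the dense arrays are handed off as bulk copies,
    // so the MTC threads never spend cycles on the transfer themselves.
    void flush_to_dram(double* dram_values_dst) const {
        // dma_copy(dram_values_dst, values.data(),
        //          values.size() * sizeof(double));  // hypothetical DMA offload
        (void)dram_values_dst;  // placeholder so the sketch compiles
    }
};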

In the next chapter, we will look deeper into the performance results of each SMASH

implementation and the advantages gained by optimizing each version.


Figure 5.7: SPAD array layout (per-thread C_index arrays, offset arrays, data arrays, and a common counter array).


Chapter 6

Results

This chapter begins by providing more details on the evaluation methodology used in

our experiments. Then, following this methodology, we compare the performance of different ver-

sions of our SMASH algorithm, thus providing insights on the improvements offered by, and the

challenges remaining in, each implementation.

6.1 Experimental Methodology

For our experiments, we chose the R-MAT synthetic sparse matrix dataset [18, 45]. In

prior work [18], Chakrabarti et al. propose a synthetic graph generator that closely represents real-
world graphs in multiple disciplines. In addition to their graph generator being fast, multithreaded,
and robust, they successfully simulate the famous Erdős–Rényi model [26], providing a measure of
the uniform probability distribution of each possible independent edge of a graph. The R-MAT sparse

matrices exhibit irregular sparsity patterns with a power-law distribution of non-zeros, making them

notoriously difﬁcult to balance between threads. A synthetic data generator also allows us to create

graphs of speciﬁc required dimensions and varying sparsity patterns for analysis.

We generate two 16K × 16K matrices using the R-MAT generator and multiply them

with each other using the row-wise multiplication method. We evaluate the performance of all 3

of our kernel implementations on the same input matrices using our simulator and report various

performance metrics in the next section.


6.2 Dataset Arithmetic Intensity

Before we dive deeper into the analytics of our SpGEMM implementation, it is worth-

while to explore the arithmetic intensity of multiplying two sparse matrices. The characteristics of

the matrices used in this thesis are shown in Table 6.1.

Matrix | Dimensions | Total Non-zeros | Sparsity
Input Matrix A | 16,384 × 16,384 | 254,211 | 99.9%
Input Matrix B | 16,384 × 16,384 | 254,211 | 99.9%
Output Matrix C | 16,384 × 16,384 | 5,174,841 | 98.1%

Table 6.1: Input and output data characteristics used in this thesis.

The arithmetic intensity (AI) of SpGEMM is computed as the ratio of the total number of
floating-point operations to the total amount of data moved (reported in bytes) [37].
An AI value of 0.09, or 9/100, means that for every 9 floating-point operations, at least 100 bytes of data need
to be moved. The arithmetic intensity for multiplying sparse matrix A with sparse matrix B to
produce output matrix C is bounded as in Equation 6.1:

\[ AI \le \frac{nnz(C) \cdot cf}{\left[\, nnz(A) + nnz(B) + nnz(C) \,\right] \cdot b} \le \frac{cf}{b} \]  (6.1)

where nnz is the total number of non-zeros in a matrix, b is the total number of bytes
required to store one element of the input matrix, and cf is the compression factor, computed as the ratio
of FLOPs to non-zeros in the output matrix, as seen in Equation 6.2:

\[ cf = \frac{flop}{nnz(C)} \]  (6.2)

Matrix Parameters | Data Type | Elements | Size (Bytes) | Size (KB)
Row Pointer | INT (4 Bytes) | 16,385 | 65,540 | 64 KB
Column Index | INT (4 Bytes) | 254,211 | 1,016,844 | 993 KB
Data Array | Double (8 Bytes) | 254,211 | 2,033,688 | 1,986 KB
Total | - | 5,683,263 | 3,116,072 | 3,043 KB

Table 6.2: CSR matrix arrays for input matrices A and B.


Matrix Parameters | Data Type | Elements | Size (Bytes) | Size (KB)
Row Pointer | INT (4 Bytes) | 16,385 | 65,540 | 64 KB
Column Index | INT (4 Bytes) | 5,174,841 | 20,699,364 | 20,214 KB
Data Array | Double (8 Bytes) | 5,174,841 | 41,398,728 | 40,428 KB
Total | - | 10,366,067 | 62,163,632 | 60,706 KB

Table 6.3: CSR matrix arrays for the output matrix C.

In our case, we consider one particular example to compute cf and AI using data metrics,

as shown in Table 6.1, Table 6.2, and Table 6.3.

For our implementation, we compute the compression factor, cf = 1.23. We further

compute our arithmetic intensity AI using this cf. The arithmetic intensity of our SMASH Version

3 implementation is AI = 0.09.
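As a rough sanity check, plugging the values from Tables 6.1–6.3 into Equation 6.1, and assuming b = 12 bytes per stored non-zero (a 4-byte column index plus an 8-byte double value, as in the CSR arrays above), yields approximately the same figure:

\[ AI \le \frac{5{,}174{,}841 \times 1.23}{(254{,}211 + 254{,}211 + 5{,}174{,}841) \times 12\ \text{bytes}} \approx 0.09 \]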

6.3 DRAM Performance

Both the input matrices and the output matrix are stored in DRAM. The DRAM band-

width is a measure of the rate at which the input matrices are read and the output matrix is writ-
ten [84]. The DRAM bandwidth is considered a bottleneck for SpGEMM implementations [100].

We compare the DRAM bandwidth utilization for all our SMASH implementations. DRAM band-

width helps us decide if our SpGEMM kernel is memory-bound or compute-bound. Changes in

bandwidth utilization over time also help us narrow down algorithm phases that produce a bottle-

neck. Considering we are using a row-wise product approach, the DRAM is utilized to read the row

pointers, column indices, and data values for input matrices Aand B. As version 3 of our SpGEMM

implementation stores the hashtable in DRAM, this hashtable contributes to the DRAM bandwidth

demands as well. The DRAM bandwidth demands are compared in Table 6.4.

SMASH Versions | DRAM Bandwidth
Version 1 | 55.2% (3.03 GB/s)
Version 2 | 73.9% (4.06 GB/s)
Version 3 | 95.9% (5.26 GB/s)

Table 6.4: Aggregated DRAM bandwidth demands.


6.4 Cache Performance

Cache performance and utilization play a crucial role in the speed achieved by our SpGEMM

kernels. We maintain temporal locality across the ﬁrst input matrix elements and spatial locality

across the second input matrix. This reuse of elements from the ﬁrst matrix, combined with the ac-

cess to neighboring elements from the second matrix, allows us to achieve high data-cache hit rates.

The L1 data-cache hit rates for all 3 versions of our SMASH algorithm are presented in Table 6.5.

SMASH Versions | L1 Data Cache Hit Rate
Version 1 | 88.7%
Version 2 | 92.2%
Version 3 | 94.1%

Table 6.5: Cache performance of our 3 SMASH implementations.

6.5 Workload Distribution

We achieved a signiﬁcant performance improvement when leveraging tokenization in

SMASH Version 2. This led to a near-perfect balance of workload across PIUMA threads and
avoided idle PIUMA cores waiting for other cores to finish. This, in turn,

signiﬁcantly boosted the Instructions per Cycle (IPC), as shown in the following section.

We analyze the performance of SMASH Version 1 and SMASH Version 2 on a single

window, using a single PIUMA block. Results are shown in Figures 6.1, 6.2, 6.3 and 6.4.

Figures 6.1 and 6.2 provide information on the utilization of each thread in a block, mea-

sured over time. These ﬁgures represent the thread utilization, with the x-axis plotting the time in

milliseconds and the y-axis reporting the associated thread utilization.

Figure 6.1 shows the thread utilization of the SMASH V1 kernel. As observed in this figure,

some of the threads do not achieve high thread utilization, indicating that the multi-threaded core is

underutilized as some of the threads stall during execution (they stall on barriers, waiting for other

threads to complete).

Figure 6.2 shows the same workload on the SMASH V2 kernel. All threads in this ﬁgure

achieve close to 100% thread utilization. This shows that our later implementation has mitigated

the cause of under-utilization of the multi-threaded cores.


Figure 6.1: SMASH V1: Thread utilization plots for unbalanced workload.

Figure 6.2: SMASH V2: Thread utilization plots for balanced workload.


Figure 6.3: Average thread utilization.

Figure 6.3 reports the average thread utilization for each workload. Using dynamic allo-

cation prevents threads from stalling while waiting for other threads to complete, thus maintaining

a higher IPC value.

In Figure 6.4 we provide a normalized histogram to report on thread utilization, showing

the performance improvement in SMASH V2 as compared to SMASH V1. The balanced work-

load shown on the right exhibits more threads achieving nearly 100% utilization, as opposed to the
unbalanced workload, where multiple threads are idling.

With the use of tokenization, we not only distributed the workload evenly across
all threads, but also reduced the overall execution time required for this window from
14.15 ms to 4.09 ms, despite the overhead introduced by creating and polling tokens.

6.6 Instruction Throughput

IPC describes the total number of instructions being executed in the system for every

cycle. We compare our SMASH implementations by comparing aggregate IPC over the entire

execution of the workload while considering all PIUMA threads. Ideally, the max value of the IPC

that can be achieved is equal to the number of MTCs present in the block.


Figure 6.4: Thread utilization histogram comparison between balanced and unbalanced workloads.

Aggregate IPC can be computed using Equation 6.3:

\[ \text{Aggregate IPC} = \frac{\text{Total Instructions Executed}}{\text{Total Cycles}} \]  (6.3)

SMASH Versions Aggregate IPC

Version 1 0.9 IPC

Version 2 1.7 IPC

Version 3 2.3 IPC

Table 6.6: Aggregate IPC Comparisons

We provide the aggregate IPC values for each of our 3 SMASH implementations in Ta-

ble 6.6.

6.7 Application Speedup

We simulate the total time required by each of our SMASH versions on our interval sim-

ulator, simulating the PIUMA hardware. We consider the time required to run the entire SpGEMM


workload on a single PIUMA block. The runtime comparison for all 3 versions of SMASH are

presented in Table 6.7.

SMASH Versions Runtimes Speedup over Version 1

Version 1 986.7 ms 1.0×

Version 2 432.5 ms 2.3×

Version 3 105.4 ms 9.4×

Table 6.7: Runtime for an entire SpGEMM workload on 64 PIUMA threads.

6.8 Summary of Results

The SMASH V3 kernel is a state-of-the-art SpGEMM implementation built on the PI-

UMA architecture. It employs various optimizations over previous SpGEMM kernel implemen-

tations, as well as previous SMASH versions. The new kernel is capable of utilizing 95.9% of

DRAM bandwidth, nearly saturating the available bandwidth. With the deployment of the producer-

consumer model, the SMASH V3 kernel is able to deliver almost 100% multi-threaded core utiliza-

tion. This optimization translates to a 9.4×speedup, as well as a 155% increase in instruction

throughput, as seen in Tables 6.6 and 6.7.


Chapter 7

Conclusions and Future Work

SpGEMM workloads are well known to test the limits of both hardware and software.

The software kernel implementation can play a key role in the resulting irregular memory access

pattern. For this reason, most general-purpose architectures, including CPUs and GPUs, typically

fail to achieve high speedups when executing SpGEMM-based applications.

In this work, we described the many uses of SpGEMM kernels, helping to motivate the

need for domain-speciﬁc architectures for such workloads. We identiﬁed key issues that need to

be addressed when designing SpGEMM kernels. We further investigated prior research that has

pursued these same issues. This helped shape the design of our approach pursued in this thesis

while optimizing the mapping of the SpGEMM kernel to the underlying PIUMA architecture. We

utilized a row-wise product approach in each of our three implementations.

We explored some of the novel features present in the PIUMA architecture, designed

to tackle sparse graph and sparse matrix applications. We designed 3 different SpGEMM kernel

implementations called SMASH for the PIUMA system, focusing on the key features available on

this accelerator, including DGAS, networked instructions, DMA Engines, and multi-threaded cores.

Our set of optimizations focused primarily on improving DRAM bandwidth utilization of

the SpGEMM kernel. But an increase in DRAM bandwidth utilization by itself is not an indication

of improved performance, as multiple factors can impact the resulting performance. Redundant
reads to memory, poor reuse of input matrices, and increases in metadata size can all offset

bandwidth utilization improvements. Our SMASH V3 kernel implementation stores the hashtable

in memory instead of using an on-chip Scratchpad. Thus, in addition to reading input matrices,

the kernel also has to read intermediate partial products from the hashtable stored in DRAM, so
the DRAM bandwidth is shared between the input data reads and the partial-product reads. But in


addition to an increase in DRAM bandwidth utilization, we also observed a signiﬁcant speedup of

Version 3 over previous versions. We were able to achieve a speedup of 9.4× over Version 1 through
iterative improvements, namely tokenization and memory fragmentation.

To summarize, the SMASH kernel improvements are as follows.

•We successfully built an implementation using atomic hashing that eliminated the need for

partial product matrices in row-wise product methods, thus preventing redundant accesses to

DRAM memory.

•We improved workload balance by adding a layer of dynamic work allocation, leveraging a

producer-consumer model.

•Finally, we leveraged PIUMA’s DMA engine, which enabled us to move data from the SPAD

to DRAM without wasting precious cycles of the MTCs.

7.1 Contributions of this Thesis

The main contributions of this thesis include:

•An in-depth analysis of the inherent problems exhibited by sparse matrix multiplication ker-

nels.

•A comparison study of prior architectures that support SpGEMM workloads.

•A comparative study on previous implementations of SpGEMM kernels.

•An architectural overview of Intel’s novel PIUMA graph accelerator.

•A state-of-the-art SpGEMM kernel implementation that uses the features present in the PI-

UMA accelerator architecture to speedup sparse matrix operations.

7.2 Future Work

Sparse matrix-matrix multiplication kernel optimizations will continue to be an active

research area. A key problem to deal with in any SpGEMM kernel implementation is the resulting

workload imbalance. In our implementation, we explored applying a uniform work distribution by

estimating ﬂoating-point operations based on the number of non-zeros in each row. Although this


method improves the algorithm’s performance for our dataset, it leaves room for optimization for

other sparsity patterns.

In our work, to store and merge partial products, we employed an in-memory hashtable.

Such a hashtable allows us to ensure that the partial products are merged immediately as they are

produced. One of the drawbacks of using hashtables is that they can cause memory hotspots. Based

on the hashing mechanism in our implementation, we used either the high-order bits or low-bits

bits for hashing. This resulted in some sparsity patterns to generate hotspots (multiple elements

getting hashed to the same hash class) in our hashtable. Such patterns will cause our algorithm to

run the collision resolution subroutine, leading to degraded performance. In our next iteration, we

plan to avoid collisions by incorporating a better hashing algorithm, one that is not solely based on

restricting the bits selected. We will consider a dynamic hashing algorithm, developing one that can

adapt to different sparsity patterns.

The PIUMA architecture provides a rich platform where we can further explore the ac-

celeration of linear algebra operations. We want to extend this work well beyond the present focus

on the SpGEMM kernel. We intend to explore other linear algebra subroutines (GraphBLAS) and

consider how to optimize performance given the unique features of this architecture.


Bibliography

[1] Tor M Aamodt, Wilson Wai Lun Fung, and Timothy G Rogers. General-purpose graphics

processor architectures. Synthesis Lectures on Computer Architecture, 13(2):1–140, 2018.

[2] S. Aananthakrishnan, R. Pawlowski, J. Fryman, and I. Hur. Efﬁcient sparse matrix-vector

multiplication on intel piuma architecture. In 2020 IEEE High Performance Extreme Com-

puting Conference (HPEC), pages 1–2, 2020.

[3] Sriram Aananthakrishnan, Nesreen K Ahmed, Vincent Cave, Marcelo Cintra, Yigit Demir,

Kristof Du Bois, Stijn Eyerman, Joshua B Fryman, Ivan Ganev, Wim Heirman, et al. Piuma:

Programmable integrated uniﬁed memory architecture. arXiv preprint arXiv:2010.06277,

2020.

[4] Sverre J Aarseth. Gravitational N-body simulations: tools and algorithms. Cambridge Uni-

versity Press, 2003.

[5] Sami Abu-El-Haija, Amol Kapoor, Bryan Perozzi, and Joonseok Lee. N-gcn: Multi-scale

graph convolution for semi-supervised node classiﬁcation. In Uncertainty in Artiﬁcial Intel-

ligence, pages 841–851. PMLR, 2020.

[6] Jung Ho Ahn, Nathan Binkert, Al Davis, Moray McLaren, and Robert S Schreiber. Hyperx:

topology, routing, and packaging of efﬁcient large-scale networks. In Proceedings of the

Conference on High Performance Computing Networking, Storage and Analysis, pages 1–

11, 2009.

[7] Ayaz Akram and Lina Sawalha. A survey of computer architecture simulation techniques

and tools. Ieee Access, 7:78120–78145, 2019.

[8] Moustafa Alzantot, Yingnan Wang, Zhengshuang Ren, and Mani B Srivastava. Rstensorﬂow:

Gpu enabled tensorﬂow for deep learning on commodity android devices. In Proceedings


of the 1st International Workshop on Deep Learning for Mobile Systems and Applications,

pages 7–12, 2017.

[9] Bahar Asgari, Ramyad Hadidi, Tushar Krishna, Hyesoon Kim, and Sudhakar Yalamanchili.

Alrescha: A lightweight reconﬁgurable sparse-computation accelerator. In 2020 IEEE Inter-

national Symposium on High Performance Computer Architecture (HPCA), pages 249–260.

IEEE, 2020.

[10] William Aspray. The intel 4004 microprocessor: What constituted invention? IEEE Annals

of the History of Computing, 19(3):4–15, 1997.

[11] Michael Bailey, Rachel Cao, Theresa Kuchler, Johannes Stroebel, and Arlene Wong. Social

connectedness: Measurement, determinants, and effects. Journal of Economic Perspectives,

32(3):259–80, 2018.

[12] Trinayan Baruah, Kaustubh Shivdikar, Shi Dong, Yifan Sun, Saiful Mojumder, Kihoon Jung,

José L. Abellán, Yash Ukidave, Ajay Joshi, John Kim, and David Kaeli. Gnnmark: A bench-

mark suite to characterize graph neural network training on gpus. 2021.

[13] James Bennett, Stan Lanning, et al. The netﬂix prize. In Proceedings of KDD cup and

workshop, volume 2007, page 35. New York, 2007.

[14] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali Saidi,

Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh Sardashti, et al.

The gem5 simulator. ACM SIGARCH computer architecture news, 39(2):1–7, 2011.

[15] Encyclopaedia Britannica. Matrix.

[16] David Canright and Lejla Batina. A very compact “perfectly masked” s-box for aes. In

International Conference on Applied Cryptography and Network Security, pages 446–459.

Springer, 2008.

[17] Trevor E. Carlson, Wim Heirman, Stijn Eyerman, Ibrahim Hur, and Lieven Eeckhout. An

evaluation of high-level mechanistic core models. ACM Transactions on Architecture and

Code Optimization (TACO), 2014.

[18] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-mat: A recursive model for

graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining,

pages 442–446. SIAM, 2004.


[19] Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. Eyeriss v2: A ﬂexible acceler-

ator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and

Selected Topics in Circuits and Systems, 9(2):292–308, 2019.

[20] Yuedan Chen, Guoqing Xiao, and Wangdong Yang. Optimizing partitioned csr-based

spgemm on the sunway taihulight. Neural Computing and Applications, 32(10):5571–5582,

2020.

[21] Zhaodong Chen, Mingyu Yan, Maohua Zhu, Lei Deng, Guoqi Li, Shuangchen Li, and Yuan

Xie. fusegnn: Accelerating graph convolutional neural network training on gpgpu. In

2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pages 1–9.

IEEE, 2020.

[22] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan

Catanzaro, and Evan Shelhamer. cudnn: Efﬁcient primitives for deep learning. arXiv preprint

arXiv:1410.0759, 2014.

[23] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-

gcn: An efﬁcient algorithm for training deep and large graph convolutional networks. In

Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery

& Data Mining, pages 257–266, 2019.

[24] Joan Daemen and Vincent Rijmen. The rijndael block cipher: Aes proposal. In First candi-

date conference (AeS1), pages 343–348, 1999.

[25] Joan Daemen and Vincent Rijmen. The design of Rijndael: AES — the Advanced Encryption

Standard. Springer-Verlag, 2002.

[26] J-J Daudin, Franck Picard, and Stéphane Robin. A mixture model for random graphs. Statis-

tics and computing, 18(2):173–183, 2008.

[27] Michael DeLorimier, Nachiket Kapre, Nikil Mehta, Dominic Rizzo, Ian Eslick, Raphael

Rubin, Tomas E Uribe, F Thomas Jr, Andre DeHon, et al. Graphstep: A system architecture

for sparse-graph algorithms. In 2006 14th Annual IEEE Symposium on Field-Programmable

Custom Computing Machines, pages 143–151. IEEE, 2006.

[28] Jack Dongarra. Sparse matrix storage formats. Templates for the Solution of Algebraic

Eigenvalue Problems: a Practical Guide. SIAM, 11:445–448, 2000.


[29] Iain S Duff, Michael A Heroux, and Roldan Pozo. An overview of the sparse basic linear

algebra subprograms: The new standard from the blas technical forum. ACM Transactions

on Mathematical Software (TOMS), 28(2):239–267, 2002.

[30] Lieven Eeckhout. Computer architecture performance evaluation methods. Synthesis Lec-

tures on Computer Architecture, 5(1):1–145, 2010.

[31] Stijn Eyerman, Wim Heirman, Kristof Du Bois, Joshua B Fryman, and Ibrahim Hur. Many-

core graph workload analysis. In SC18: International Conference for High Performance

Computing, Networking, Storage and Analysis, pages 282–292. IEEE, 2018.

[32] Dimitar Filev, Olga Georgieva, P Angelov, and A Kasabov. An extended version of the

gustafson-kessel algorithm for evolving data stream clustering. Evolving intelligent systems:

Methodology and applications, pages 273–300, 2010.

[33] Davy Genbrugge, Stijn Eyerman, and Lieven Eeckhout. Interval simulation: Raising the

level of abstraction in architectural simulation. In HPCA-16 2010 The Sixteenth International

Symposium on High-Performance Computer Architecture, pages 1–12. IEEE, 2010.

[34] Olga Georgieva and Dimitar Filev. Gustafson-kessel algorithm for evolving data stream clus-

tering. In Proceedings of the International Conference on Computer Systems and Technolo-

gies and Workshop for PhD Students in Computing, pages 1–6, 2009.

[35] C Lee Giles, Kurt D Bollacker, and Steve Lawrence. Citeseer: An automatic citation indexing

system. In Proceedings of the third ACM conference on Digital libraries, pages 89–98, 1998.

[36] Google. Cloud tpu. https://cloud.google.com/tpu, 2019.

[37] Zhixiang Gu, Jose Moreira, David Edelsohn, and Ariful Azad. Bandwidth-optimized par-

allel algorithms for sparse matrix-matrix multiplication using propagation blocking. arXiv

preprint arXiv:2002.11302, 2020.

[38] Fred G Gustavson. Two fast algorithms for sparse matrices: Multiplication and permuted

transposition. ACM Transactions on Mathematical Software (TOMS), 4(3):250–269, 1978.

[39] Daniel Hackenberg, Daniel Molka, and Wolfgang E Nagel. Comparing cache architectures

and coherency protocols on x86-64 multicore smp systems. In Proceedings of the 42Nd

Annual IEEE/ACM International Symposium on microarchitecture, pages 413–422, 2009.


[40] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and

William J Dally. Eie: efﬁcient inference engine on compressed deep neural network. ACM

SIGARCH Computer Architecture News, 44(3):243–254, 2016.

[41] Carl Lee Hanson, Ben Cannon, Scott Burton, and Christophe Giraud-Carrier. An exploration

of social circles and prescription drug abuse through twitter. J Med Internet Res, 15(9):e189,

Sep 2013.

[42] Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel,

Edgar Solomonik, Joel Emer, and Christopher W Fletcher. Extensor: An accelerator for

sparse tensor algebra. In Proceedings of the 52nd Annual IEEE/ACM International Sympo-

sium on Microarchitecture, pages 319–333, 2019.

[43] Wim Heirman, Trevor Carlson, and Lieven Eeckhout. Sniper: Scalable and accurate parallel

multi-core simulation. In 8th International Summer School on Advanced Computer Architec-

ture and Compilation for High-Performance and Embedded Systems (ACACES-2012), pages

91–94. High-Performance and Embedded Architecture and Compilation Network of . . . ,

2012.

[44] Sung-Hsien Hsieh, Chun-Shien Lu, and Soo-Chang Pei. Sparse fast fourier transform by

downsampling. In 2013 IEEE International Conference on Acoustics, Speech and Signal

Processing, pages 5637–5641. IEEE, 2013.

[45] Lorenz Hübschle-Schneider and Peter Sanders. Linear work generation of r-mat graphs.

Network Science, 8(4):543–550, 2020.

[46] Ken Ivanov. Autonomous collision attack on ocsp services, 2016.

[47] Bryan Jacobs. Hierarchical identify verify exploit (hive).

[48] Lizy Kurian John. 8.2 performance evaluation: Techniques, tools, and benchmarks. The

Computer Engineering Handbook, 8:21, 2002.

[49] Bo Kågström, Per Ling, and Charles Van Loan. Gemm-based level 3 blas: high-performance

model implementations and performance evaluation benchmark. ACM Transactions on Math-

ematical Software (TOMS), 24(3):268–302, 1998.

[50] Mike Kelly. Gyrfalcon starts shipping ai chip. electronics-labs.com, Oct 2018.


[51] Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling gpu-cpu workloads

and systems. In Proceedings of the 3rd workshop on general-purpose computation on graph-

ics processing units, pages 31–42, 2010.

[52] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint

arXiv:1611.07308, 2016.

[53] Oliver Knill. When was matrix multiplication invented? Harvard Mathematics Department,

Jun 2009. Available at http://people.math.harvard.edu/~knill/history/matrix/.

[54] Oliver Knill. Cauchy–binet for pseudo-determinants. Linear Algebra and its Applications,

459:522–547, 2014.

[55] Raghu Krishnapuram and Jongwoo Kim. A note on the gustafson-kessel and adaptive fuzzy

clustering algorithms. IEEE Transactions on Fuzzy systems, 7(4):453–461, 1999.

[56] HT Kung, Bradley McDanel, and Sai Qian Zhang. Packing sparse convolutional neural net-

works for efﬁcient systolic array implementations: Column combining under joint optimiza-

tion. In Proceedings of the Twenty-Fourth International Conference on Architectural Support

for Programming Languages and Operating Systems, pages 821–834, 2019.

[57] Chuck L Lawson, Richard J. Hanson, David R Kincaid, and Fred T. Krogh. Basic linear al-

gebra subprograms for fortran usage. ACM Transactions on Mathematical Software (TOMS),

5(3):308–323, 1979.

[58] Hua Li and Zachary Friggstad. An efﬁcient architecture for the aes mix columns operation.

In 2005 IEEE International Symposium on Circuits and Systems, pages 4637–4640. IEEE,

2005.

[59] Chun-Yuan Lin, Yeh-Ching Chung, and Jen-Shiuh Liu. Efﬁcient data compression methods

for multidimensional sparse array operations based on the ekmr scheme. IEEE Transactions

on Computers, 52(12):1640–1646, 2003.

[60] Rong Lin and Martin Margala. Multiplier-based processor-in-memory architectures for im-

age and graphics processing, January 23 2007. US Patent 7,167,890.


[61] Zhi-Gang Liu, Paul N Whatmough, and Matthew Mattina. Systolic tensor array: An efﬁcient

structured-sparse gemm accelerator for mobile cnn inference. IEEE Computer Architecture

Letters, 19(1):34–37, 2020.

[62] Adam Lugowski, David Alber, Aydın Buluç, John R Gilbert, Steve Reinhardt, Yun Teng, and
Andrew Waranis. A flexible open-source toolbox for scalable complex graph analysis. In

Proceedings of the 2012 SIAM International Conference on Data Mining, pages 930–941.