Path Forward Beyond Simulators: Fast and Accurate GPU
Execution Time Prediction for DNN Workloads
Ying Li
William & Mary
Williamsburg, VA, USA
yli81@wm.edu
Yifan Sun
William & Mary
Williamsburg, VA, USA
ysun25@wm.edu
Adwait Jog
University of Virginia
Charlottesville, VA, USA
ajog@virginia.edu
ABSTRACT
Today, DNNs' high computational complexity and sub-optimal device utilization present a major roadblock to democratizing DNNs. To reduce the execution time and improve device utilization, researchers have been proposing new system design solutions, which require performance models (especially GPU models) to help them with pre-product concept validation. Currently, researchers have been utilizing simulators to predict execution time, which provides high flexibility and acceptable accuracy, but at the cost of a long simulation time. Simulators are becoming increasingly impractical to model today's large-scale systems and DNNs, urging us to find alternative lightweight solutions.
To solve this problem, we propose using a data-driven method for modeling DNNs system performance. We first build a dataset that includes the execution time of numerous networks/layers/kernels. After identifying the relationships of directly known information (e.g., network structure, hardware theoretical computing capabilities), we discuss how to build a simple, yet accurate, performance model for DNNs execution time. Our observations on the dataset demonstrate prevalent linear relationships between the GPU kernel execution times, operation counts, and input/output parameters of DNNs layers. Guided by our observations, we develop a fast, linear-regression-based DNNs execution time predictor. Our evaluation using various image classification models suggests our method can predict new DNNs performance with a 7% error and new GPU performance with a 15.2% error. Our case studies also demonstrate how the performance model can facilitate future DNNs system research.
CCS CONCEPTS
• Computing methodologies → Modeling methodologies.
KEYWORDS
Deep Neural Networks; Graphics Processing Units; Performance
Model
ACM Reference Format:
Ying Li, Yifan Sun, and Adwait Jog. 2023. Path Forward Beyond Simulators:
Fast and Accurate GPU Execution Time Prediction for DNN Workloads.
In 56th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO ’23), October 28–November 01, 2023, Toronto, ON, Canada. ACM,
New York, NY, USA, 15 pages. https://doi.org/10.1145/3613424.3614277
This work is licensed under a Creative Commons Attribution-NoDerivs International
4.0 License.
MICRO ’23, October 28–November 01, 2023, Toronto, ON, Canada
©2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0329-4/23/10.
https://doi.org/10.1145/3613424.3614277
1 INTRODUCTION
Deep neural networks (DNNs) are becoming increasingly popular because they have the extraordinary capability of performing tasks that typically require significant human involvement (e.g., recognizing objects in images [26, 61, 63, 73], processing natural languages [14, 52, 71]). DNNs' power leads to the proliferation of DNNs, as demonstrated by the large number of DNNs available on the HuggingFace [70] platform designed to solve various problems. Today, most DNNs consume a huge amount of computing power [62], preventing practitioners from lowering DNNs deployment costs [13, 16, 75] and broadening the user base [10].
Improving the efficiency of DNNs requires better system (broadly defined as the collection of software, run-time library, operating system, and hardware) designs. As a critical process of developing new solutions, researchers typically need to evaluate the performance [24] of the improved systems and compare them with a baseline design. Since building new systems (making changes to the hardware, operating system, or machine learning software) is costly [13, 45], researchers commonly use performance modeling methods to predict the performance and validate their design ideas. Moreover, since DNN workloads are mainly executed on Graphics Processing Units (GPUs), and GPUs are likely to be a performance bottleneck [29], it is critical to provide a high-performance, high-flexibility, and high-accuracy performance model for DNNs running on GPUs.
Researchers have been developing simulators to predict DNNs performance [5, 21, 72]. While GPU simulators provide great flexibility and a reasonable error of around 10% to 20% [7–9, 21, 42, 64, 69], the long simulation time hinders researchers from evaluating large-scale and complex systems executing long-running applications. For example, it is reported that GPGPU-Sim [7] may need years to centuries [5] to simulate end-to-end machine learning workloads. Modern system-level research inevitably involves large DNNs (e.g., GPT [56]) that execute on large systems (e.g., multi-GPU systems), pushing researchers to stay away from simulators and rely on real systems. However, using real systems has disadvantages, including 1) high acquisition costs and 2) not being able to evaluate non-existing hardware. Therefore, the community urgently demands a new set of solutions that accurately model DNNs system performance, which requires approaching the problem from a new direction.
To address the aforementioned problem, we take a data-driven approach that is complementary to simulator development. Our goal is to develop a faster performance model that can provide a similar or even more accurate estimation of the DNNs performance on GPUs compared to simulators. To achieve this goal, we first collect a large number (646) of models from commonly used model zoos (e.g., HuggingFace [70], TorchVision [43]) and measure the
execution time of DNNs, layers, and kernels on various GPUs. Our analysis of the dataset suggests that the execution time on each GPU is linearly correlated with the amount of work (represented by floating-point operations, FLOPs¹). Especially, after careful classification, kernel execution times demonstrate an almost perfect linear relationship to either the layer's input dimension, FLOPs, or output dimension. We discover that using complex statistical approaches, such as Principal Component Analysis (PCA) and Neural Networks, is not necessary for highly regular DNN workloads running on GPUs.
Using FLOPs as the main independent variable does not mean we only consider compute-intensive workloads. It might be intuitive to use factors such as memory bandwidth to predict the performance of memory-intensive workloads and use FLOPS for compute-intensive workloads. Rather than incorporating more hardware parameters in a single model that can predict the performance of all the workloads on all devices, building more models with fewer parameters can be more practical and accurate for DNNs system evaluation. Following this path, we reveal how FLOPs can be used for both memory-intensive and compute-intensive workloads and achieve superior accuracy. Indeed, we find that FLOPs are more suitable for inter-workload models, while memory bandwidth is more suitable for inter-device models.
Based on our detailed investigation of the performance measurements, we build several linear-regression-based performance models with different levels of complexity and prediction accuracy. Our performance models include an End-to-End (E2E) Model, a Layer-Wise (LW) Model, a Kernel-Wise (KW) Model, and an Inter-GPU Kernel-Wise (IGKW) Model. Our performance models demonstrate that lightweight analytical performance models can be highly accurate in predicting DNN workloads performance. For example, our Kernel-Wise model, which only uses the DNNs structure as input, can predict end-to-end DNN workloads execution time with an error as low as 7%.
Overall, this paper makes the following contributions.
• We initiate an open DNNs performance database that includes the DNNs structure, hardware specification, and performance measurements on multiple GPUs.
• We examine the prevalent linear relationship between execution time and the DNNs FLOPs. Moreover, we propose a kernel classification method that can identify the parameter with the strongest linear correlation with the execution time.
• We evaluate three linear-regression-based performance models for DNN workloads on single GPUs. Our findings suggest that simple models can deliver satisfying accuracy in predicting DNN workloads performance. Our End-to-End, Layer-Wise, and Kernel-Wise models can reduce the performance prediction error to as low as 35%, 28%, and 7% on an NVIDIA A100 GPU, respectively.
• We develop a mechanism that tunes the linear regression models for different GPUs. Applying this mechanism, our Inter-GPU Kernel-Wise model can predict the performance of the TITAN RTX GPU using the measurements collected on three other GPUs, with an error of 15.2%.
¹ In this paper, we use FLOPs to represent floating-point operations; and FLOPS to represent floating-point operations per second.
Figure 1: Convolutions in CNNs.
2 BACKGROUND
In this section, we provide a brief introduction to convolutional neural networks and their layers, followed by GPU programming models and performance metrics.
2.1 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are among the most common DNNs that employ convolutional layers (to be introduced in Section 2.2) to extract features from signals or images [19]. Although many convolutional neural networks are being used, they are all composed of a few types of layers [51, 66]. These layers include convolutional layers (CONV), pooling layers (Pooling), activation layers, normalization layers (NORM), and fully connected layers (FC). As different layers perform different operations, one model may not be able to predict the performance of all types of layers. This paper aims to evaluate such possibilities of using a single performance model for all the layers.
At a high level, we can separate the implementation of DNNs into three levels: the layer, operator, and kernel level. Level 1. Modeling libraries (e.g., HuggingFace [70], Keras [17]) allow programmers to define the layers according to their needs. Level 2. DNN frameworks (e.g., PyTorch [54], TensorFlow [2]) are responsible for connecting layers with operators (e.g., matrix transpose, element-wise summation) that implement the layer-required operations. Level 3. Finally, device vendors (e.g., NVIDIA, AMD) typically provide libraries (e.g., cuDNN [11], MIOpen [35]) that define how calculations take place on devices. As performance models can be implemented at all three levels (or even lower, down to the GPU instruction level, as done by GPU simulators), it is critical to examine the pros and cons of developing performance models at different levels.
The detailed implementation has a significant implication on DNN workloads performance. Here, we use the convolutional layer as an example. Libraries typically use four different algorithms to implement convolutional layers: the direct method [39, 51], the matrix multiplication-based method [48], the Fast Fourier Transform (FFT)-based method [47], and the Winograd method [37]. Interested readers may refer to the associated citations for more details. Additionally, even if the same method is used, we observe that the GPU libraries might use different implementations according to the layer size and data layout. It is necessary and challenging to keep the performance model sufficiently generic to include all kinds of methods and implementations.
The way that DNN workloads are executed also plays a role in determining performance. For example, since a single image cannot fully utilize a whole GPU, users commonly group multiple images into a batch and feed the whole batch of images to a GPU at once. The number of images in a batch (batch size, or BS) critically determines the utilization of GPUs. In general, we consider cases where the BS is large enough to fully utilize the GPU. However, performance models should have the capability of predicting performance when the BS is different.
2.2 Convolutional Layer
Convolutional layers (CONV) are typically the most time-consuming layers in CNNs [65]. They are also challenging for performance prediction, as libraries like cuDNN provide many different implementations of convolutional layers. Therefore, in this paper, we pay special attention to convolutional layers.
CONV layers are mainly composed of high-dimensional convolutions, as shown in Figure 1 [66]. The input is a tensor (i.e., feature maps) of size $C_{in} \times H \times W$. Here, $H$ and $W$ denote the height and width of the feature maps [41], and $C_{in}$ represents the number of channels (each channel includes $H \times W$ pixels) of the input tensor. A CONV layer includes $C_{out}$ convolution filters, each of size $C_{in} \times K_h \times K_w$. Each channel of the input image performs a convolution operation with a 2-D filter ($K_h \times K_w$). The convolution results are then summed across all the channels, forming one channel of the output. The operation is repeated $C_{out}$ times, and hence, the output feature map has $C_{out}$ channels. Additionally, since all these operations are executed in batches of images, the BS ($N$) multiplies the amount of work by $N$ times.
FLOPs is an important parameter to measure the amount of work that needs to be completed. For convolutional layers, if we only consider multiplications, FLOPs can be calculated as $C_{out} \cdot H' \cdot W' \cdot C_{in} \cdot K_h \cdot K_w$, where $H'$ and $W'$ represent the output height and width, respectively. Note that the equation above only calculates theoretical FLOPs; smart algorithms may reduce actual FLOPs, while compiler/hardware limitations may increase actual FLOPs. There are many popular tools to calculate the theoretical FLOPs of networks according to the network structure. In this work, we use PyTorch-OpCounter (thop) [79].
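To make the formula concrete, the following is a minimal sketch that computes the theoretical multiply count for one CONV layer and indicates where thop could be used for whole networks; the layer sizes here are illustrative assumptions, not values from our dataset.
```python
# Sketch: theoretical multiply count (FLOPs as used in this paper) for one CONV layer.
import torch
import torch.nn as nn

def conv_flops(c_in, c_out, h_out, w_out, k_h, k_w, batch=1):
    # Multiplications only: C_out * H' * W' * C_in * K_h * K_w, scaled by batch size N.
    return batch * c_out * h_out * w_out * c_in * k_h * k_w

layer = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
x = torch.randn(1, 64, 56, 56)
y = layer(x)                                    # output feature map is 1 x 128 x 56 x 56
print(conv_flops(64, 128, y.shape[2], y.shape[3], 3, 3))

# Cross-check with PyTorch-OpCounter (thop), which reports multiply-accumulate counts:
# from thop import profile
# macs, params = profile(nn.Sequential(layer), inputs=(x,))
```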
2.3 GPU Programming Model
GPUs work under the close supervision of the CPUs. When running DNN workloads, the CPUs are responsible for scheduling the tasks on GPUs and moving the data across devices (CPUs and GPUs). These steps are usually completed by calling vendor-provided GPU-controlling APIs (e.g., the CUDA Runtime API [46], the OpenCL API [68]). To start a task on a GPU, a CPU needs to call the kernel launching API. Here, a kernel is a collection of a large number of parallel-executing threads running on a GPU. These threads execute on GPU cores (i.e., Streaming Multiprocessors, or SMs). Given that the capacity of an SM is limited, only a certain number of threads can execute on an SM at one time. If a kernel has more threads than all the SMs in a GPU can handle, some threads must wait until other threads complete execution.
2.4 Performance Metrics
While the paragraphs above discussed the changing factors and implementations, a performance model should focus on the invariant factors.
Table 1: GPUs used in the experiments.
GPU            Bandwidth (GB/s)   Memory (GB)   TFLOPS (FP32)   Tensor Cores
A100           1555               40            19.5            432
A40            696                48            37.4            336
GTX 1080 Ti    484                11            11.3            0
Quadro P620    80                 2             1.4             0
RTX A5000      768                24            27.8            256
TITAN RTX      672                24            16.3            576
V100           900                16            14.1            640
For example, we believe a GPU has two critical parameters
that limit the performance, including the compute throughput and
memory bandwidth. If any parameter (e.g., GPU’s compute through-
put) is a performance bottleneck, doubling the workload (e.g., the
computation required) will simply double the execution time. This
analysis suggests that the performance is likely to maintain a lin-
ear relationship against the metrics that represent the “amount of
work”. Therefore, we focus on using linear models for performance
prediction, as linear models have advantages in simplicity, speed,
and explainability.
3 METHODOLOGY
Since we employ a data-driven method, we first introduce our dataset and our evaluation methods. We focus on how we collect DNN workloads, select hardware platforms, measure execution times, and build our analytical model.
DNN workloads. Our experiments are based on the PyTorch framework [25]. Moreover, since DNN frameworks commonly use GPU-vendor-provided libraries (e.g., cuDNN [12], MIOpen [34]) and the GPU execution is likely to be the performance bottleneck of DNNs [44], selecting the PyTorch implementation simplifies our analysis without losing generality.
We collect DNNs from common DNN model zoos, such as Torchvision [43] and HuggingFace [70]. We select these model zoos because they include a large number of highly diverse and representative networks.
In this work, we focus on image classification DNNs as they have similar building blocks (i.e., layers) to DNNs that are designed to complete other types of tasks (e.g., natural language processing). We also focus only on inference tasks, as these tasks are more sensitive to latency. We use the ILSVRC2012 dataset [58], which consists of 50,000 images labeled with 1,000 object categories, to eliminate the impact of input data.
GPU hardware. We use several computers (see Table 1) with different GPUs throughout our experiments. All the computers use the same software stack, including 1) Ubuntu 20.04.5 LTS, 2) Python 3.8.10, 3) PyTorch 1.13.1 + cu116, 4) TorchVision 0.14.1 + cu116, 5) CUDA 11.6, and 6) cuDNN 8.3.2. The selected GPUs represent a wide range of NVIDIA GPU products that span across different markets (e.g., high-performance computing, professional graphics, gaming). The GPUs also use different architectures. The diversity of the GPUs helps us build more robust performance models and demonstrate the generality of our proposed solution.
Figure 2: Trace generated by PyTorch Profiler shows layers in CPU and kernels in GPU.
Performance measurement. We warm up all the execution by first running 20 batches (batch size may vary) of the DNNs inference task. We then measure the execution time from the 21st to the 50th batch, as the execution time becomes stable after the initial period. We take the average across the batches (either at the kernel, layer, or network level) to reduce error.
We measure end-to-end execution time using the torch.cuda.Event API provided by PyTorch. It takes timestamps before and after the execution of each batch.
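As a minimal sketch (not our exact measurement script), the batch-level timing described above could look like the following; the model, input shape, and the single timed batch are illustrative placeholders.
```python
# Sketch of batch-level timing with torch.cuda.Event.
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()
x = torch.randn(32, 3, 224, 224, device="cuda")   # placeholder batch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    for _ in range(20):                 # warm-up batches, as described in the text
        model(x)
    torch.cuda.synchronize()
    start.record()                      # timestamp before the batch
    model(x)
    end.record()                        # timestamp after the batch
    torch.cuda.synchronize()            # wait until both events have completed
print(start.elapsed_time(end), "ms")    # elapsed_time returns milliseconds
```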
We obtain kernel execution times using the PyTorch Profiler [54]. The PyTorch Profiler combines and links network-level metrics (e.g., layer input/output shapes), framework-level metrics (e.g., layer execution start/end time), and hardware-level traces (e.g., kernel start time), which provides more information than existing GPU profilers [23, 77, 78]. We rely on the traces generated by the PyTorch Profiler (also visualized in the style shown in Figure 2) to create a mapping between the layers and the GPU kernels. We calculate layer execution times from the start and end execution times of all the kernels launched for each layer.
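A minimal sketch of collecting such a trace with the PyTorch Profiler follows; the model and the exported trace file name are placeholders, and the actual layer-to-kernel mapping in our dataset is built by parsing the exported trace.
```python
# Sketch: collecting CPU-op and GPU-kernel traces with the PyTorch Profiler.
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

model = torchvision.models.resnet18().cuda().eval()
x = torch.randn(8, 3, 224, 224, device="cuda")

with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,                 # keeps layer input/output shapes
) as prof:
    model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")  # trace can be visualized as in Figure 2
```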
FLOPs calculation. Since the goal of our performance models
is to predict execution time without execution, we need to infer the
compute requirement of each workload from the DNN structure.
To this end, we use FLOPs to represent the required computational
power of each layer and each DNN network. In our dataset, we
use the input and output dimensions of all the layers to calculate
FLOPs. All the FLOPs are calculated by PyTorch-OpCounter [79].
Data management. We prepare our dataset as CSV files, with columns including network structure (e.g., input/output shapes), batch size, layer FLOPs, hardware information (e.g., GPU name, bandwidth), kernel-by-kernel execution times, layer-to-kernel mappings, and end-to-end execution times. We clean the dataset by removing duplicates and failed experiments (e.g., out-of-memory errors). In total, our dataset records 646 networks and about 182 kernels (roughly 240,000 kernel executions) per GPU.
To train and validate our performance models, we partition the dataset into a training set and a test set. The test set is a randomly selected 15% of the executions from the dataset, while the rest forms the training set.
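For illustration, a sketch of the cleaning and the 85%/15% split is shown below; the file name and column names are assumptions about the CSV layout, not the released dataset schema.
```python
# Sketch of cleaning and splitting the performance dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dnn_perf_dataset.csv")                    # assumed file name
df = df.drop_duplicates().dropna(subset=["exec_time_ms"])   # basic cleaning, as in the text
train_df, test_df = train_test_split(df, test_size=0.15, random_state=0)
print(len(train_df), "training rows,", len(test_df), "test rows")
```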
4 MOTIVATION AND KEY OBSERVATIONS
We make the following observations in our dataset and use the
observations to guide performance model development.
O1: The execution times of DNNs are generally linearly
correlated to FLOPs. As demonstrated in Figure 3, the more work
that needs to be accomplished (represented by the theoretical FLOPs
of all the layers in a network), the longer time it takes for the GPUs to execute. Our findings in Figure 3 suggest that a lightweight linear regression model is promising to serve as a reasonably accurate timing model for the end-to-end DNNs execution time.
Figure 3: The execution times of all the networks in our dataset against their FLOPs, when the batch size is 4 or higher. The execution times of DNN networks are generally linearly correlated to FLOPs, with exceptions when the operation count is small.
Figure 4: Execution time for ResNet and VGG networks when the batch size is 512. Networks with different structures fall on different lines.
Meanwhile, we do observe two effects that prevent the execution time from forming a perfect line in Figure 3. First, the linear trend does not hold when the workloads have fewer FLOPs. With fewer FLOPs, the GPU will not be fully utilized. The performance is mainly bounded by the scheduling overhead on the CPU and the communication between the CPU and the GPU. Second, the band of the data is about one order of magnitude wide (when GFLOPs is $10^2$, the execution time ranges from $10^1$ to $10^2$ ms), suggesting that some networks are more efficient than others. Accurate performance models need to take into account the factors other than FLOPs that cause this efficiency difference.
O2: Networks with different structures fall on different lines. To understand why the execution times and FLOPs do not form a perfect linear relationship, we need to look into the problem at a higher resolution (investigating the structure of the networks).
Figure 5: The DNNs execution time is linearly correlated with the batch size. However, the slope differs from network to network.
Figure 6: GPUs can achieve a steady computing throughput when the batch size is larger than a certain value. When the batch size is small, the achieved FLOPS are lower.
As a motivating example, we compare the performance of two popular DNN workloads, ResNet and VGG. ResNet and VGG allow variations by adding/removing blocks (a group of pre-defined layers) to/from the standard design. Exploiting this feature, we build a few non-standard ResNet and VGG networks. Together with the standard ResNet and VGG networks, we plot the execution time against their end-to-end FLOPs, as shown in Figure 4. We observe that these two series of networks follow different linear relationships between FLOPs and execution time, suggesting that the GPU is more efficient on VGG due to network structure differences.
O3: Execution time is mostly linear to batch size, as batch size directly translates to FLOPs. For every DNN workload, we collect multiple execution times as we vary the batch size. In theory, the batch size is a multiplicative factor in the FLOPs calculation: BS = 256 simply repeats the operation of BS = 1 for 256 times. Thus, the FLOPs are linear to the batch size for the same model.
As the previous observation demonstrated that the execution time is linear to FLOPs, we hypothesize that the execution time should also be linear to the batch size. Our data confirm that there is a linear relationship between the execution times and batch sizes (see Figure 5).
The reason we observe this linear relationship is that the GPU is fully utilized by the network execution at large batch sizes (see Figure 6).
Figure 7: Different types of DNN layers fall on different linear trend lines.
O4: Different types of DNN layers have different linear trends. O2 suggests that the GPU may have different processing efficiency for different layers and layer parameters. Therefore, we need to investigate whether layer-wise execution times demonstrate a clearer relationship with the FLOPs.
Plotting the layer execution time against the layers' theoretical FLOPs (see Figure 7), we observe that each layer type demonstrates a linear relationship. GPUs are less efficient in Pooling and batch normalization (BN) layers, as they are towards the top-left of the plot. Both Pooling and BN layers demonstrate near-perfect linear relationships.
The GPU is more efficient in FC and CONV layers. FC layers also demonstrate a near-perfect linear relationship, except when the layer is small. When FLOPs are small, the execution time is almost constant, suggesting that the execution is not efficient and is dominated by overhead. Meanwhile, CONV layers typically have high FLOPs, and their execution time is not perfectly linear. This is because cuDNN may use different algorithms and different kernels. The imperfect linear relationship in CONV layers is critical to performance models, as CONV layers take the majority of the execution time. Therefore, to achieve better prediction accuracy, we likely need to dive deeper into the kernel level.
O5: Kernel execution times are also linear to FLOPs, but it is not easy to acquire the kernel FLOPs. Following our previous analysis, we claim that kernel execution times are also linear to kernel FLOPs. Our profiling results based on NVIDIA Nsight Compute [1] support this claim. However, the number of FLOPs may not be a good indicator, as profilers are required to access kernel-level FLOPs. As using profilers significantly slows down the execution, we need to estimate the kernel FLOPs from the DNNs' structural information.
Our analysis of the dataset indicates that there is no single parameter that has a perfect linear correlation with the execution time of all the kernels. As a result, we need to classify the kernels based on the parameter that is linearly correlated with the execution time. After examining the cuDNN execution, we have observed a common pattern, regardless of the algorithms or implementations.
Figure 8: We classify kernels into three categories: (top) input-driven kernels, (mid) operation-driven kernels, and (bottom) output-driven kernels. We show the linear relationship between different factors and execution time. We demonstrate that the classification can amplify the linear relationship.
The cuDNN library typically follows a process of 1) pre-processing the input data, 2) performing the calculation, and 3) post-processing
the output data. Since the pre-processing kernels work on the input data, we find that the input dimension is critical to their execution time. The main kernels in the layers (e.g., matrix multiplication in CONV layers) perform the layer operations, and hence, their execution times are correlated with the number of operations of the kernel. Finally, the execution times of post-processing kernels are highly correlated with the layer output size, as they work on the output data.
This finding has led us to classify the kernels into three groups: 1) input-driven, 2) operation-driven, and 3) output-driven kernels. For input- and output-driven kernels, we use the product of the batch size ($N$), channel count ($C$), feature map height ($H$), and feature map width ($W$) as the indicator of the input/output size. For operation-driven kernels, we use the layer FLOPs as the independent variable. As shown in Figure 8, our classification is highly successful in separating the kernels into three groups and identifying an independent variable that is linearly correlated with the kernel execution time. The classification can also be automated. For each kernel, our algorithm builds a linear regression for each of the three groups and compares the quality of the linear regressions (the $R^2$ value). The kernel is automatically classified into the group with the highest $R^2$ value.
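A sketch of this automated classification is shown below, under the assumption that, for each kernel, we have arrays of the candidate drivers and the measured times; the key names are illustrative.
```python
# Sketch: classify a kernel by the driver (input size, FLOPs, or output size)
# whose linear regression yields the highest R^2 against execution time.
import numpy as np
from sklearn.linear_model import LinearRegression

def classify_kernel(samples):
    """samples: dict with 1-D arrays 'input_nchw', 'flops', 'output_nchw', 'time_ms'."""
    y = np.asarray(samples["time_ms"])
    best_group, best_r2, best_model = None, -np.inf, None
    for group in ("input_nchw", "flops", "output_nchw"):
        X = np.asarray(samples[group], dtype=float).reshape(-1, 1)
        model = LinearRegression().fit(X, y)
        r2 = model.score(X, y)            # R^2 of this candidate driver
        if r2 > best_r2:
            best_group, best_r2, best_model = group, r2, model
    return best_group, best_r2, best_model
```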
O6: Bandwidth is an important factor that affects the performance of DNNs on different GPUs. Figure 8 reveals that the reciprocal of the slope accurately represents the FLOPS achieved by different groups of kernels. Our measurements on other GPUs also suggest that similar linear relationships exist, but with different parameters. To allow our performance model to work across different GPUs, a model that adapts the linear regression parameters according to the hardware performance metrics is necessary. The key to building such a model is understanding the determining factor that dictates the performance.
Considering that most workloads are typically divided into memory- or compute-intensive workloads, we closely examine the impact of GPU FLOPS and GPU memory bandwidth. Our method aims to find the link between the theoretical values (e.g., DNNs structures, GPU theoretical performance metrics) and the achieved values. Therefore, we perform a study that compares the achieved FLOPS with the theoretical FLOPS of the GPUs (compute efficiency), as well as the achieved memory bandwidth with the theoretical memory bandwidth (bandwidth efficiency).
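A rough sketch of how these two efficiency metrics can be computed, under the assumption stated in the text that bytes and FLOPs are estimated from layer shapes (FP32 tensors here) rather than measured from hardware counters; the helper names are illustrative.
```python
# Sketch of the compute/bandwidth efficiency metrics behind Figure 9.
def conv_bytes_estimate(n, c_in, h, w, c_out, h_out, w_out, k_h, k_w, dtype_bytes=4):
    # Input + output feature maps + filter weights, each assumed read/written once.
    return dtype_bytes * (n * c_in * h * w
                          + n * c_out * h_out * w_out
                          + c_out * c_in * k_h * k_w)

def bandwidth_efficiency(bytes_moved, exec_time_s, peak_bw_gbs):
    achieved_gbs = bytes_moved / exec_time_s / 1e9
    return achieved_gbs / peak_bw_gbs

def compute_efficiency(flops, exec_time_s, peak_tflops):
    achieved_tflops = flops / exec_time_s / 1e12
    return achieved_tflops / peak_tflops
```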
Figure 9: The efficiency of memory bandwidth and compute of ResNet-18 on different GPUs. The bandwidth efficiency is relatively stable across GPUs, but not the compute efficiency.
Figure 10: An overview of the workflow of using the DNNs analytical performance model.
Our results (see Figure 9) suggest that the theoretical bandwidth of GPUs can potentially serve as a practical metric for predicting execution time. In Figure 9, we demonstrate the bandwidth efficiency (achieved value divided by the theoretical value) of ResNet-18 (other networks demonstrate similar results) across different GPUs, including the A40, A100, GeForce GTX 1080 Ti, TITAN RTX, and RTX A5000. The bandwidth efficiency stays around 10%. In contrast, the achieved TFLOPS does not exhibit a consistent ratio to the theoretical TFLOPS. Note that the low efficiency numbers do not indicate the actual hardware utilization. Here, we use the layer shape information to estimate the number of bytes to read/write and the required FLOPs, while the actual GPU may read/write many more bytes and perform much more computation. Still, the consistent bandwidth efficiency suggests our estimation is helpful.
5 PERFORMANCE MODELS
5.1 System Architecture
A trained performance analytical model provides an accurate ex-
ecution time prediction for DNN workloads that are not in the
training set. Therefore, as seen in Figure 10, we separate the task
of using the analytical performance model into the training part
and the prediction part. The training part takes a training dataset
as input, which includes the DNNs structure, execution times, and
a kernel mapping table (to be introduced in subsection 5.4), and
generates the performance analytical model (including parameters)
as the output. The performance analytical model and its parameters can be distributed to users. To predict the performance of a new network, users can feed the network structure to the analytical model to produce the predicted execution time.
Figure 11: The End-to-End Model's predicted execution time, normalized to real-device execution time on A100 and sorted in ascending order. The average error is 0.35. The X-axis is the percentage of networks in the test set.
Overall, the workflow defines a simple interface for the performance analytical model, allowing easy replacement of the performance model and the prediction algorithm. Using this workflow, in the rest of the section, we introduce three different models (using the A100 as an example), ordered by increasing model complexity and prediction accuracy. At the end of this section, we also introduce a method that generalizes the models to predict the performance of GPUs that are not included in the training set.
5.2 End-to-End (E2E) Model
We rst introduce a simple linear regression model (based on O1)
that is commonly used in DNNs system design projects [
16
,
31
]. This
method takes the total theoretical FLOPs (sum of the theoretical
FLOPs of every layer) as the input to generate the output—the
predicted end-to-end execution time.
We use the FLOPs and execution time captured when GPUs
are fully utilized (BS =512) to train the linear regression model
(and other models to be introduced later). O3 (execution time is
linear to batch size) allows us to train with data of a single batch
size, and our predictions will still be reasonably accurate for other
batch sizes. This design reduces the data to collect and makes our
solutions more suitable for online learning (updating the model in
the deployed environment in real-time).
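A minimal sketch of the E2E model as described, a single linear regression from total theoretical FLOPs to end-to-end time trained on BS = 512 measurements, follows; the file and column names are assumptions about the dataset layout.
```python
# Sketch of the End-to-End (E2E) model: one linear regression over the whole network.
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv("train.csv")                   # assumed file name
train = train[train["batch_size"] == 512]          # train only on fully utilized runs
e2e = LinearRegression().fit(train[["total_gflops"]], train["exec_time_ms"])

def predict_e2e(total_gflops):
    return e2e.predict([[total_gflops]])[0]        # predicted end-to-end time in ms
```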
The accuracy (35% average error on A100, see Figure 11) of such a simple linear regression model is promising. We notice a few outliers: we under-estimate or over-estimate by about three times in extreme cases, which matches the analysis in Figure 3 that the execution time difference of networks with the same number of FLOPs tends to be 10×. However, most networks have execution times close to the central line, leading to a 35% error.
In conclusion, when a higher level of accuracy is required, a more detailed model is desired. The E2E model lacks the capability of analyzing at a lower granularity (e.g., scheduling at the DNNs layer level). Therefore, in the next section, we discuss kernel- and layer-level models.
Figure 12: The Layer-Wise Model's predicted execution time, normalized to real-device execution time on A100 and sorted in ascending order. The average error is 0.28. The X-axis is the percentage of networks in the test set.
5.3 Layer-Wise (LW) Model
The E2E model's accuracy cannot be further improved mainly because it lacks the capability of differentiating the layers/kernels with different computing efficiency (demonstrated in Figure 3 and Figure 4). To achieve higher accuracy, we need to dive one level deeper.
Guided by O4, we evaluate a layer-wise linear regression model. This model is a combination of multiple linear regression models, with one model per type of layer (e.g., CONV, ReLU, Pooling). The training process finds the best linear regression parameters for each type of layer independently. We take the theoretical FLOPs of each layer as input and estimate the layer-by-layer execution time. The predicted total execution time is the sum of the predicted execution times of all the layers.
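A sketch of the LW model under the same assumptions about column names: one regression per layer type, with the network prediction being the sum over its layers.
```python
# Sketch of the Layer-Wise (LW) model: one linear regression per layer type.
import pandas as pd
from sklearn.linear_model import LinearRegression

layers = pd.read_csv("layer_measurements.csv")     # assumed file name
models = {
    layer_type: LinearRegression().fit(group[["layer_gflops"]], group["layer_time_ms"])
    for layer_type, group in layers.groupby("layer_type")
}

def predict_network(layer_list):
    """layer_list: iterable of (layer_type, gflops) tuples describing one network."""
    return sum(models[t].predict([[gf]])[0] for t, gf in layer_list)
```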
We observe a slight improvement in prediction accuracy when comparing the LW model with the E2E model. The prediction error (see Figure 12) is reduced from 35% to 28% on A100. Moreover, the LW model provides more flexibility if users' designs involve layer-wise scheduling, which the E2E model is not suitable for. However, we believe the layer-wise model is not the best option for most cases. The extra layer-wise information adds a burden on researchers, and the reward in accuracy is limited. This evaluation suggests that if we want to improve accuracy further, dissecting layers into operators or kernels is needed.
5.4 Kernel-Wise (KW) Model
Guided by O5, we evaluate creating a performance model with linear regression models that work at the kernel level.
Since the cuDNN library decides which kernels to use according to the problem sizes, we create a look-up table (the left-most block in Figure 10) that maps the layer type and input/output size to the kernel list. We provide the look-up table for all the kernels we encounter in our dataset.
As discussed in O5, we can neither use the layer FLOPs as the input to all the kernels nor profile the achieved FLOPs. Therefore, we rely on the classification method mentioned in O5 to separate kernels into input-, operation-, or output-driven kernels. For different types of kernels, the linear regression model takes different input values.
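A sketch of the KW prediction path, assuming the look-up table and the per-kernel regressions (built with the classifier sketched in Section 4) are available as plain dictionaries; the data structures and names are illustrative, not the released implementation.
```python
# Sketch of Kernel-Wise (KW) prediction for one layer.
def predict_layer_kw(layer_type, shape_key, layer_features,
                     kernel_table, kernel_models):
    """kernel_table[(layer_type, shape_key)] -> list of kernel names;
    kernel_models[name] -> (driver, fitted LinearRegression);
    layer_features[driver] -> scalar (input NCHW product, FLOPs, or output NCHW product)."""
    total_ms = 0.0
    for kernel in kernel_table[(layer_type, shape_key)]:
        driver, model = kernel_models[kernel]
        total_ms += model.predict([[layer_features[driver]]])[0]
    return total_ms
```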
Figure 13: The Kernel-Wise Model's predicted execution time, normalized to real-device execution time on A100 and sorted in ascending order. The average error is 0.07. The X-axis is the percentage of networks in the test set.
Additionally, to avoid creating a linear regression model for every kernel, we combine kernels that demonstrate similar linear relationships and build only one model for these kernels. In total, on A100, for the 182 kernels recorded, we built 83 linear regression models. Considering we have 242,394 kernel executions measured on A100, each linear regression model has 2,920 data points on average for training and testing purposes, which is considered sufficient for linear regression models.
The accuracy of the KW model (see Figure 13) is much higher than that of the E2E and LW models. The error rate is only 7% on A100, which is more accurate than many architecture-level simulators. Moreover, the S-curve is highly asymmetrical (different from Figure 11), suggesting that we almost never underestimate the execution time.
For a few networks, the KW model overestimates with an error from 15% to 100%. These networks do not have sufficient work to keep the GPU busy for long periods, rendering the CPU-GPU communication the global bottleneck. These types of networks are not the main target of our model, since we mainly focus on the GPU execution time.
To validate the effectiveness of our KW model in predicting DNNs execution time on other GPUs, we train our model on additional GPUs, including the A100, A40, GeForce GTX 1080 Ti, TITAN RTX, and V100. Our analysis reveals that the KW model produces accurate predictions on these GPUs, with an error rate of 6% for the A40, 7% for the A100, 7.8% for the 1080 Ti, 9.2% for the TITAN RTX, and 9.4% for the V100. Specifically, the average error rate for all DNNs in the test set ranges from 6% to 9%, indicating that our model can effectively predict execution time for a variety of GPUs.
Comparing the KW model with Principal Kernel Analysis [5]. We compare our KW model with two existing approaches, Principal Kernel Analysis (PKA) and Principal Kernel Selection (PKS) [5]. PKA and PKS are both based on Accel-Sim [33] and accelerate simulation for large-scale workloads.
We compare predicting ResNet-50 inference performance on a V100 GPU with batch sizes of 64, 128, and 256 (see Table 2; PKS/PKA results are from the original PKA paper [5]). While achieving higher accuracy, the KW model only takes seconds rather than hours.
Table 2: Modeling ResNet-50 performance: the KW model achieves higher accuracy than PKS/PKA within seconds. While we only compare ResNet-50 in this experiment, the KW model is expected to demonstrate even greater speed advantages over PKA/PKS for complex networks such as GPT-4.
Batch Size   KW Model Error (%)   PKS Error (%)   PKA Error (%)   PKS Time (Hours)   PKA Time (Hours)
64           2.6                  6.4             18              10                 1.3
128          0.4                  3.5             12              8                  1.5
256          0.8                  2.2             24              18                 1.6
In fact, the KW model is expected to demonstrate even more speed advantages over PKA/PKS for complex networks such as GPT-4 [50].
Note that, given the dissimilarities in the level of modeling, our KW model and PKA/PKS are not directly comparable. The KW model is not a replacement for PKA/PKS, and PKA/PKS is much more flexible in terms of hardware modeling and configuration. Nonetheless, the purpose of this comparison is to impart a clear understanding of the potential and capabilities of our proposed KW model.
KW model extension for Transformers. Our current dataset and performance models primarily concentrate on CNNs. However, we anticipate that other types of DNN workloads, such as transformers, exhibit similar characteristics, making our models more generic. To validate this, we expanded our dataset and performance model to include text classification networks from the HuggingFace library, which are representative transformer networks. Our KW model achieved an average prediction error of approximately 4.76% on A100 when applied to transformers. This result demonstrates the generality and versatility of our models for different network types.
Overall, the KW model achieves very high accuracy without introducing too much complexity. If higher accuracy is required, the kernel-wise model can fulfill most of the requirements. It also provides the highest level of flexibility and can support validating system designs that require kernel-level scheduling. Finally, we believe that sticking with linear regression models maintains the best explainability and interpretability.
5.5 Inter-GPU Kernel-Wise (IGKW) Model
Our results of the KW model suggest that the performance model of the same kernel may have different linear regression parameters on different GPUs. To support predicting the performance of a new GPU, we need to build a new model that can adapt the linear regression parameters trained for some GPUs to fit the new GPU's capability.
According to O6, we can use the GPUs' theoretical bandwidth to predict the linear regression parameters of our KW models. One of the most critical parameters is the slope of the linear relationship, which represents the achieved FLOPS. Here, we establish another linear relationship between the achieved FLOPS (the slope of the KW models) and the GPU's theoretical memory bandwidth. By measuring the execution time of kernels on a few different GPUs, we can estimate how the FLOPS changes with the GPU memory bandwidth. Thanks to the simplicity of linear regression models, a few
measurements on a small number of GPUs are sufficient to learn the relationship, as long as the GPUs are diverse.
Figure 14: The Inter-GPU Kernel-Wise Model's predicted execution time, normalized to real-device execution time on TITAN RTX and sorted in ascending order. In this figure, the training set includes measurements on the A100, A40, and GTX 1080 Ti, but does not include measurements on the TITAN RTX. The average error is 0.152. The X-axis is the percentage of networks in the test set.
Next, we evaluate the Inter-GPU Kernel-Wise model. We train
the model using time measurements from A100, A40, and GeForce
GTX 1080 Ti and predict all the network execution times on TITAN
RTX. The results (see Figure 14) suggest we can still predict about
half of the models with an error of less than 10% and an average
error of 15.2%.
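A sketch of the inter-GPU adaptation described above: regress the achieved FLOPS of a kernel group (the reciprocal of its KW slope) against theoretical memory bandwidth on the training GPUs, then rebuild the slope for an unseen GPU. The achieved-FLOPS numbers below are made-up placeholders, not measurements from our dataset.
```python
# Sketch of the Inter-GPU Kernel-Wise (IGKW) slope adaptation.
import numpy as np
from sklearn.linear_model import LinearRegression

# Training GPUs: theoretical bandwidth (GB/s) and the achieved FLOPS implied by
# the KW slope of one kernel group (hypothetical values for illustration).
bw = np.array([[1555.0], [696.0], [484.0]])            # A100, A40, GTX 1080 Ti
achieved_flops = np.array([3.2e12, 1.5e12, 1.0e12])    # placeholder measurements

bw_model = LinearRegression().fit(bw, achieved_flops)

def slope_for_new_gpu(bandwidth_gbs):
    flops = bw_model.predict([[bandwidth_gbs]])[0]
    return 1.0 / flops      # seconds per floating-point operation, i.e., the KW slope

print(slope_for_new_gpu(672.0))   # e.g., TITAN RTX, whose bandwidth is 672 GB/s
```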
6 CASE STUDIES
The models are intended for early evaluation of the performance of ML workloads on GPUs. They could benefit researchers who work in domains such as multi-GPU training architecture, memory swapping, disaggregated memory systems, and in-network computing, where real hardware may not be flexible or available. Having demonstrated reasonable accuracy, we show how the models can be used in three case studies. While these case studies are not intended to be novel, they serve as demonstrations of how our models can be used for architectural-, system-, and software-level research.
Case Study 1: Exploring the hardware configuration of a GPU for specific DNNs. There is a growing trend of designing GPUs tailored to meet specific customer requirements [15, 22]. This approach allows for more optimized and efficient hardware configurations to address diverse application needs. OpenAI [49] may require vendors to produce GPUs with specific configurations, but how can OpenAI know the optimal memory bandwidth if the number of cores and the frequency of the GPUs are kept unchanged?
Here, we demonstrate a design space exploration for running ResNet-50 on a modified TITAN RTX: we plot the predicted execution time against various bandwidth values, as shown in Figure 15. As expected, we see that the performance improves as the memory bandwidth increases. Our analysis reveals that the ideal bandwidth range is between 600 GB/s and 800 GB/s. The TITAN RTX's bandwidth falls in this region.
Figure 15: The predicted execution time of ResNet-50 on TITAN RTX with modified bandwidth. The red line represents the bandwidth of the TITAN RTX, 672 GB/s.
Figure 16: The predicted execution time of DenseNet-169 on TITAN RTX with modified memory bandwidth. The red line represents the bandwidth of the TITAN RTX, 672 GB/s.
Similarly, when running DenseNet-169 on TITAN RTX, we plot the predicted execution time against different bandwidth values, as shown in Figure 16. Our results indicate that DenseNet-169 is less sensitive to high memory bandwidth. The optimal bandwidth range for this network-GPU configuration is between 500 GB/s and 700 GB/s. The native TITAN RTX is on the high end. If a company wants to order customized GPUs for DenseNet workloads, the memory bandwidth can be reduced to save money, as reducing the memory bandwidth to 500 GB/s will not significantly reduce performance.
Case Study 2: Identifying the required network bandwidth for disaggregated memory systems. Disaggregated systems [36] are composed of GPUs with small local memories and a huge network-attached memory pool. Disaggregated systems are promising as they provide more flexibility for cloud computing users: users can purchase the computing resources and memory that fit their needs and reduce costs. However, the network bandwidth is a limiting factor, as data needs to be moved back and forth between the GPU's local memory and the remote memory pool. Here, we use a case study to determine the minimum required network bandwidth for each GPU.
To evaluate a disaggregated system, we connect our model with a simple network model from MGPUSim [64]. We use MGPUSim because it is a pure event-driven simulator, allowing us to fast-forward to the end of each kernel without simulating cycle-by-cycle details. The GPU runs a prefetcher that keeps fetching the layer
parameters required for future layer computing while the GPU calculates the layer output.
Figure 17: Speedup over a 16 GB/s network for different networks running on memory-disaggregated GPU systems.
As the results suggest (see Figure 17), different networks have different network bandwidth requirements. ResNet requires a 128 GB/s network, while DenseNet-121 requires a 256 GB/s network to keep the GPU fully utilized. Additionally, the whole experiment (including four extra DNNs, as well as network bandwidths of 8 GB/s, 1 TB/s, 4 TB/s, and 16 TB/s, not shown in the figure due to similar insights) takes less than 5 seconds to run on the author's laptop.
Case Study 3: Real-time task scheduling across different GPUs. Consider a machine-learning-as-a-service vendor with different GPUs available in their cloud. As customers want to execute different networks, how to schedule the networks to optimize the overall throughput is a practical problem [28, 67]. These problems typically require performance models that do not incur major performance overhead.
With our method of predicting the execution time of DNNs on different GPUs, we can make informed decisions about which GPU to use for running specific tasks. In this experiment, we selected a subset of networks from our dataset, including ResNet-44, ResNet-50, ResNet-62, ResNet-77, DenseNet-121, DenseNet-161, DenseNet-169, DenseNet-201, and ShuffleNet v1. We also assume that two GPUs, an A40 and a TITAN RTX, are available. We complete two tasks with our model: 1) considering the networks as individual workloads, deciding which GPU runs each network faster; and 2) given a queue of tasks, scheduling the tasks on both GPUs to minimize the overall execution time.
Our execution results suggest that our performance model cor-
rectly selects the GPU that runs faster for all the DNNs (see Fig-
ure 18). Moreover, when considering the all-job scheduling problem,
our model gives a near-perfect workload-balancing solution (see Fig-
ure 19). Thanks to the extremely fast execution, we can easily run a
brute force design space search. The dispatching scheme is identical
to the oracle execution solution.
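A sketch of the brute-force schedule search mentioned above follows; the predicted times come from the performance model, and the GPU names and dictionary layout are illustrative assumptions.
```python
# Sketch: exhaustively try every GPU assignment and keep the smallest makespan.
from itertools import product

def best_schedule(predicted_ms):
    """predicted_ms: {network_name: {'A40': time_ms, 'TITAN': time_ms}}."""
    nets = list(predicted_ms)
    best = None
    for assignment in product(("A40", "TITAN"), repeat=len(nets)):
        load = {"A40": 0.0, "TITAN": 0.0}
        for net, gpu in zip(nets, assignment):
            load[gpu] += predicted_ms[net][gpu]
        makespan = max(load.values())              # overall finishing time
        if best is None or makespan < best[0]:
            best = (makespan, dict(zip(nets, assignment)))
    return best                                    # (makespan_ms, {network: gpu})
```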
7 DISCUSSION
How can researchers use the model? Our performance model
can be used to facilitate the design of next-generation machine learn-
ing systems. Machine learning system researchers usually need to
use performance models to predict the performance of their new
designs. A lightweight performance model can give researchers an early performance indicator for their designs.
Figure 18: Actual execution time and predicted time for different networks on A40 and TITAN RTX. The yellow cross indicates the GPU with the shorter time, which we should choose.
Figure 19: Scheduling a queue of networks using predicted times on both GPUs to minimize the overall execution time.
We recommend using
the kernel-wise model for most cases as its accuracy is satisfying.
When extremely high accuracy is not required, or the knowledge
of the networks is limited, the E2E model can also be used.
Our models do not strictly rely on specific hardware configurations. First, our inter-GPU model allows users to evaluate hypothetical GPUs by providing memory bandwidth and FLOPS. Second, the simplicity of our models enables us to train the models with a limited amount of data. Therefore, our models can be combined with architectural simulators: simulators can measure the performance of small workloads to train our models, and our models can evaluate large-scale applications.
Why FLOPs for the inter-DNN model? Why not memory bandwidth? Establishing a performance model typically requires a feature selection process [18, 30–32]. Researchers typically pick many features initially, including both network features and hardware features. Then, the most influential features are either manually selected or selected with algorithms (e.g., Principal Component Analysis (PCA)).
We believe only simple and reliable solutions can benefit the community the most. Therefore, our goal is to build an extremely lightweight, explainable, and hardware-agnostic model for DNN workloads performance. To achieve this goal, we only use a single FLOPs parameter as the measure of "the amount of work that needs to be completed."
Other than FLOPs, one may argue that some workloads are memory-bound, so why do we not consider memory bandwidth for single-GPU models? The reason is that we have been classifying the kernels down to small clusters with similar arithmetic intensity (AI, defined as the number of operations per byte read from memory). Due to this classification, we are able to use FLOPs without explicitly considering factors such as memory reads/writes. Therefore, regardless of the input or network, we use FLOPs as a proxy for performance, while the sustained memory bandwidth value is roughly constant for a given DNN. Indeed, the selection of bandwidth as the main parameter for the inter-GPU model suggests that most of the evaluated workloads are actually memory intensive.
Limitations. Currently, our models focus on cases where GPUs are fully utilized. When the batch size is small, we observe some data points where our models suffer from larger errors. When the batch size or the network is small and the GPU cannot be fully utilized, we find that the CPU and the CPU-GPU communication can be the major performance bottleneck. Although we believe this should not be the main focus of a GPU performance model, in the future, we plan to include a CPU model and a communication model so that we can also accurately predict performance for small workloads.
Another limitation of our inter-GPU model is that it requires the same kernels to be used on multiple GPUs. If one GPU uses a very different kernel from all other GPUs used in the training set, we cannot predict the performance reliably at the kernel level. A viable solution is to fall back to the layer-wise model, although the error may be higher.
Finally, selecting bandwidth as the main metric for the inter-GPU model hints that bandwidth is a critical performance metric. This is true for all the GPUs we tested. When designing products, it is reasonable to assume that NVIDIA designers try to balance the memory bandwidth and computing capability, mainly targeting matrix multiplication (used in fully connected and convolutional layers) workloads. However, our inter-GPU model cannot predict corner-case GPUs with imbalanced memory bandwidth and computing capability. We consider this limitation acceptable, as we envision this tool to be more useful for system researchers who may not often alter low-level hardware design.
8 RELATED WORK
Many analytical performance models have been proposed to predict the performance of DNNs. We can classify these approaches as regression-based [40], graph-based [6, 16], network-structure-based [3, 30], and feature-selection-based [30, 55, 74]. These approaches can also be classified into network-level [3, 20], layer-level [16, 30, 40, 55, 74], and operator/kernel-level [6, 76] models.
Most of these works predict the execution time at the network level. For example, Adolf et al. [3] built an analysis framework based on the assumption that similar structures have similar performance. Their model predicts the execution time of a network by measuring the similarity between DNNs. Gujarati et al. [20] proposed Clockwork, a fully distributed model serving system that predicts end-to-end execution time by representing DNNs inference as a deterministic sequence of mathematical operations.
At the layer level, Qi et al. [55] developed Paleo, a performance framework that models DNNs performance by extracting computational requirements (FLOPs, network structure, operation type) from the DNNs structure and device features. Paleo profiles layer execution time to predict the overall execution time. Justus et al. [30] developed an approach that trains a neural-network model to predict the execution time by treating individual layers as atomic operations and collecting a large set of features that could influence the prediction of execution times, including layer features, layer-specific features, implementation features, and hardware features. Additionally, Liao et al. [40] use multi-layer regression and polynomial regression to predict the inference time at the layer and network levels. While layer-wise models dominate existing DNNs performance modeling work, our finding suggests that layer-level models may not be the best option; network- or kernel-level models have advantages over layer-level models from simplicity and accuracy perspectives.
Finally, operator-level or kernel-level performance models are among the most complex and accurate solutions. In particular, nn-Meter [76] is a kernel-based prediction system for predicting the execution time of DNNs on diverse edge devices. The nn-Meter framework focuses on discovering operator-fusion behaviors on edge devices and provides a robust scheme for predicting non-standard fused kernels. Baghsorkhi et al. [6] propose an analytical model to predict the execution time of kernels based on a workflow graph, which is an abstract interpretation of a GPU kernel. Overall, there is limited literature on predicting performance at the operator or kernel level, partially due to the high complexity.
Inspired by previous works, we adopt linear regression as the main method to predict execution time. We also follow existing work by providing performance models at the network, layer, and kernel levels. However, our work distinguishes itself from the literature because our prediction models explore and avoid complex modeling solutions while maintaining high accuracy. Our kernel-wise model demonstrates that a highly accurate kernel-level model does not have to be very complex. Our work also discusses the tradeoffs between models working at different levels, which is a critical, yet understudied, research question.
Additionally, researchers have been developing models for DNN accelerators and Systems-on-Chip (SoCs). These performance models include Timeloop [53], Gables [27], ASTRA-SIM [57], SCALE-SIM [59, 60], and Garnet [4]. In contrast, our paper focuses on the performance of GPUs instead of domain-specific DNN accelerators.
9 CONCLUSIONS
In this paper, we show that accurate prediction of DNNs execution time on GPUs with lightweight performance models is possible. Based on extensive data analysis of DNN workloads and their execution on GPUs, we find several hidden relationships between latency and other metrics that have not been reported before. Specifically, we find that there exists a strong linear relationship between DNNs execution times, operation counts, and the input and output shapes of the network layers. Based on the presented data and analysis, we show that it is possible to predict DNNs execution time without actually running them on GPU hardware. To make our models reproducible and easy to use, we take advantage of the fact that FLOPs and input/output details can be readily obtained by static DNNs analysis, without pre-running or sampling them on any hardware. Our kernel-wise analysis shows that execution time can be predicted with an error as low as 7%. We also demonstrate an inter-GPU model that predicts the performance of a GPU that is not in the training set with an error of 15.2%. Our future work will focus on extending our models to more diverse workloads (e.g., training) and emerging GPU hardware (e.g., multi-instance GPUs).
ACKNOWLEDGMENTS
A part of this work was performed and supported by the Google
Research Scholar Award while Jog was with William & Mary. This
work was performed using the computing facilities at William &
Mary and Google Cloud. An earlier version of this work appeared
as a poster at the ISPASS 2023 conference [38].
REFERENCES
[1] Accessed 2023. NVIDIA Nsight Compute. https://developer.nvidia.com/nsight-compute.
[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/. Software available from tensorflow.org.
[3] Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2016. Fathom: Reference workloads for modern deep learning methods. In 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 1–10.
[4] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K Jha. 2009. GARNET: A detailed on-chip network model inside a full-system simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, 33–42.
[5] Cesar Avalos Baddouh, Mahmoud Khairy, Roland N Green, Mathias Payer, and Timothy G Rogers. 2021. Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. 724–737.
[6] Sara S Baghsorkhi, Matthieu Delahaye, Sanjay J Patel, William D Gropp, and Wen-mei W Hwu. 2010. An adaptive performance modeling tool for GPU architectures. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 105–114.
[7] Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, 163–174.
[8] Yuhui Bao, Yifan Sun, Zlatan Feric, Michael Tian Shen, Micah Weston, José L Abellán, Trinayan Baruah, John Kim, Ajay Joshi, and David Kaeli. 2022. NaviSim: A Highly Accurate GPU Simulator for AMD RDNA GPUs. In The 31st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[9] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh Sardashti, et al. 2011. The gem5 simulator. ACM SIGARCH Computer Architecture News 39, 2 (2011), 1–7.
[10] Xing Chen, Ming Li, Hao Zhong, Yun Ma, and Ching-Hsien Hsu. 2021. DNNOff: Offloading DNN-based intelligent IoT applications in mobile edge computing. IEEE Transactions on Industrial Informatics 18, 4 (2021), 2820–2829.
[11] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[12] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[13] Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Tiezhen Wang, et al. 2021. TensorFlow Lite Micro: Embedded machine learning for TinyML systems. Proceedings of Machine Learning and Systems 3 (2021), 800–811.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[15] Ashraf Eassa. 2018. Why NVIDIA Corp. Will Design a Fully Custom Chip for Future AI Workloads. https://www.fool.com/investing/2018/04/26/why-nvidia-corp-will-design-a-fully-custom-chip-fo.aspx.
[16] Yanjie Gao, Xianyu Gu, Hongyu Zhang, Haoxiang Lin, and Mao Yang. 2021. Runtime Performance Prediction for Deep Learning Models with Graph Neural Network. Technical Report MSR-TR-2021-3. Microsoft.
[17] Google. 2022. Keras. https://keras.io/guides/ [Accessed October 6, 2022].
[18] Nilanjan Goswami, Ramkumar Shankar, Madhura Joshi, and Tao Li. 2010. Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications. In IEEE International Symposium on Workload Characterization (IISWC'10). IEEE, 1–10.
[19] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. 2018. Recent advances in convolutional neural networks. Pattern Recognition 77 (2018), 354–377.
[20] Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 443–462.
[21] Anthony Gutierrez, Bradford M Beckmann, Alexandru Dutu, Joseph Gross, Michael LeBeane, John Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, et al. 2018. Lost in abstraction: Pitfalls of analyzing GPUs at the intermediate language level. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 608–619.
[22] Gareth Halfacree. 2017. AMD Signs Semi-Custom AI Chip Deal with Tesla, Source Claims. https://bit-tech.net/news/tech/cpus/amd-signs-semi-custom-ai-chip-deal-with-tesla-source-claims/1/.
[23] Yueming Hao, Nikhil Jain, Rob Van der Wijngaart, Nirmal Saxena, Yuanbo Fan, and Xu Liu. 2023. DrGPU: A Top-Down Profiler for GPU Applications. In Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering. 43–53.
[24] Yueming Hao, Xu Zhao, Bin Bao, David Berard, Will Constable, Adnan Aziz, and Xu Liu. 2023. TorchBench: Benchmarking PyTorch with High API Surface Coverage. arXiv:2304.14226 [cs.LG]
[25] Horace He. 2019. The State of Machine Learning Frameworks in 2019. https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/. The Gradient (2019).
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[27] Mark Hill and Vijay Janapa Reddi. 2019. Gables: A roofline model for mobile SoCs. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 317–330.
[28] Myeongjae Jeon and Shivaram Venkataraman. [n.d.]. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads.
[29] Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 463–479.
[30] Daniel Justus, John Brennan, Stephen Bonner, and Andrew Stephen McGough. 2018. Predicting the computational cost of deep learning models. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 3873–3882.
[31] Ali Karami, Sayyed Ali Mirsoleimani, and Farshad Khunjush. 2013. A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs. In The 17th CSI International Symposium on Computer Architecture & Digital Systems (CADS 2013). IEEE, 15–22.
[32] Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. 2010. Modeling GPU-CPU workloads and systems. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. 31–42.
[33] Mahmoud Khairy, Zhesheng Shen, Tor M Aamodt, and Timothy G Rogers. 2020. Accel-Sim: An extensible simulation framework for validated GPU modeling. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 473–486.
[34] Jehandad Khan, Paul Fultz, Artem Tamazov, Daniel Lowell, Chao Liu, Michael Melesse, Murali Nandhimandalam, Kamil Nasyrov, Ilya Perminov, Tejash Shah, et al. 2019. MIOpen: An open source library for deep learning primitives. arXiv preprint arXiv:1910.00078 (2019).
[35] Jehandad Khan, Paul Fultz, Artem Tamazov, Daniel Lowell, Chao Liu, Michael Melesse, Murali Nandhimandalam, Kamil Nasyrov, Ilya Perminov, Tejash Shah, Vasilii Filippov, Jing Zhang, Jing Zhou, Bragadeesh Natarajan, and Mayank Daga. 2019. MIOpen: An Open Source Library For Deep Learning Primitives. arXiv:1910.00078 [cs.LG]
[36] Youngeun Kwon and Minsoo Rhu. 2019. A disaggregated memory system for deep learning. IEEE Micro 39, 5 (2019), 82–90.
[37] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
[38] Ying Li, Yifan Sun, and Adwait Jog. 2023. A Regression-based Model for End-to-End Latency Prediction for DNN Execution on GPUs. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), Raleigh, NC. 343–345.
[39] Zewen Li, Fan Liu, Wenjie Yang, Shouheng Peng, and Jun Zhou. 2021. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems (2021).
[40] Ying-Chiao Liao, Chuan-Chi Wang, Chia-Heng Tu, Ming-Chang Kao, Wen-Yew Liang, and Shih-Hao Hung. 2020. PerfNetRT: Platform-Aware Performance Modeling for Optimized Deep Neural Networks. In 2020 International Computer Symposium (ICS). IEEE, 153–158.
[41] Lingqiao Liu, Chunhua Shen, and Anton Van den Hengel. 2015. The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4749–4757.
[42] Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger, Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Brad Beckmann, Srikant Bharadwaj, et al. 2020. The gem5 simulator: Version 20.0+. arXiv preprint arXiv:2007.03152 (2020).
[43] Sébastien Marcel and Yann Rodriguez. 2010. Torchvision: The machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia. 1485–1488.
[44] Saiful A Mojumder, Marcia S Louis, Yifan Sun, Amir Kavyan Ziabari, José L Abellán, John Kim, David Kaeli, and Ajay Joshi. 2018. Profiling DNN workloads on a Volta-based DGX-1 system. In 2018 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 122–133.
[45] MG Sarwar Murshed, Christopher Murphy, Daqing Hou, Nazar Khan, Ganesh Ananthanarayanan, and Faraz Hussain. 2021. Machine learning at the network edge: A survey. ACM Computing Surveys (CSUR) 54, 8 (2021), 1–37.
[46] NVIDIA. 2022. CUDA. https://developer.nvidia.com/cuda-toolkit [Accessed October 6, 2022].
[47] NVIDIA. 2022. cuFFT. https://developer.nvidia.com/cufft [Accessed October 6, 2022].
[48] NVIDIA. 2022. Matrix Multiplication Background. https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html [Accessed October 6, 2022].
[49] OpenAI. 2022. OpenAI and Microsoft Extend Partnership. https://openai.com/blog/openai-and-microsoft-extend-partnership. Accessed on: April 28, 2023.
[50] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[51] Keiron O'Shea and Ryan Nash. 2015. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458 (2015).
[52] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 (2019).
[53] Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W Keckler, and Joel Emer. 2019. Timeloop: A systematic approach to DNN accelerator evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 304–315.
[54] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019).
[55] Hang Qi, Evan R Sparks, and Ameet Talwalkar. 2016. Paleo: A performance model for deep neural networks. (2016).
[56] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[57] Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. 2020. ASTRA-sim: Enabling SW/HW co-design exploration for distributed DL training platforms. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 81–92.
[58] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[59] Ananda Samajdar, Jan Moritz Joseph, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2020. A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 58–68.
[60] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN accelerator simulator. arXiv preprint arXiv:1811.02883 (2018).
[61] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[62] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM 63, 12 (2020), 54–63.
[63] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[64] Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, Shi Dong, Xiang Gong, Shane Treadway, Yuhui Bao, Spencer Hance, Carter McCardwell, Vincent Zhao, et al. 2019. MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization. In 46th International Symposium on Computer Architecture.
[65] Yifan Sun, Saoni Mukherjee, Trinayan Baruah, Shi Dong, Julian Gutierrez, Prannory Mohan, and David Kaeli. 2018. Evaluating Performance Tradeoffs on the Radeon Open Compute Platform. In IEEE International Symposium on Performance Analysis of Systems and Software.
[66] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105, 12 (2017), 2295–2329.
[67] Cheng Tan, Zhichao Li, Jian Zhang, Yu Cao, Sikai Qi, Zherui Liu, Yibo Zhu, and Chuanxiong Guo. 2021. Serving DNN models with multi-instance GPUs: A case of the reconfigurable machine scheduling problem. arXiv preprint arXiv:2109.11067 (2021).
[68] The Khronos OpenCL Working Group. 2022. OpenCL. http://www.khronos.org/opencl/ [Accessed October 6, 2022].
[69] Oreste Villa, Daniel Lustig, Zi Yan, Evgeny Bolotin, Yaosheng Fu, Niladrish Chatterjee, Nan Jiang, and David Nellans. 2021. Need for speed: Experiences building a trustworthy system-level GPU simulator. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 868–880.
[70] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).
[71] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38–45.
[72] Sam Xi, Yuan Yao, Kshitij Bhardwaj, Paul Whatmough, Gu-Yeon Wei, and David Brooks. 2020. SMAUG: End-to-end full-stack simulation infrastructure for deep learning workloads. ACM Transactions on Architecture and Code Optimization (TACO) 17, 4 (2020), 1–26.
[73] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2016. Aggregated Residual Transformations for Deep Neural Networks. https://doi.org/10.48550/ARXIV.1611.05431
[74] Gingfung Yeung, Damian Borowiec, Adrian Friday, Richard Harper, and Peter Garraghan. 2020. Towards GPU utilization prediction for cloud deep learning. In 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 20).
[75] Jie You, Jae-Won Chung, and Mosharaf Chowdhury. 2022. Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training. arXiv preprint arXiv:2208.06102 (2022).
[76] Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. nn-Meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services. 81–93.
[77] Keren Zhou, Yueming Hao, John Mellor-Crummey, Xiaozhu Meng, and Xu Liu. 2020. GVProf: A value profiler for GPU-based clusters. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16.
[78] Keren Zhou, Yueming Hao, John Mellor-Crummey, Xiaozhu Meng, and Xu Liu. 2022. ValueExpert: Exploring value patterns in GPU-accelerated applications. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 171–185.
[79] Ligeng Zhu. 2022. THOP: PyTorch-OpCounter.
A ARTIFACT APPENDIX
A.1 Abstract
The artifact includes source codes, shell scripts, and prediction datasets required to reproduce the error rates of different models on various GPUs, as described in the Performance Models section (Section 5). Additionally, the artifact encompasses all the figures generated from the experimental data.
A.2 Artifact check-list (meta-information)
Algorithm: Linear-regression-based DNNs time predictor.
Program: PyTorch, Python.
Model: Image classification DNNs and text classification transformers from Torchvision and HuggingFace.
Data set: ILSVRC2012 dataset. A subset of the dataset is included in the artifact.
Run-time environment: 1) Ubuntu 20.04.5 LTS, 2) Python 3.8.10, 3) PyTorch 1.13.1 + cu116, 4) TorchVision 0.14.1 + cu116, 5) CUDA 11.6, and 6) cuDNN 8.3.2.
Hardware: GPUs: A100, A40, TITAN RTX, V100, GTX 1080 Ti.
Execution: Profiling with the PyTorch Profiler is required when running experiments on a GPU not included in the prediction dataset.
Metrics: Error rate.
Output: Error rates of different performance models on GPUs and figures generated from the experimental data in the paper.
How much disk space required (approximately)?: 16 GB.
How much time is needed to prepare workflow (approximately)?: 1 hour.
How much time is needed to complete experiments (approximately)?: If collecting the prediction dataset for a GPU from scratch, the time depends on the specific GPU used; it may take around 20 hours for V100. Otherwise, it takes approximately 10 minutes if the included dataset is used.
Publicly available?: Yes.
A.3 Description
A.3.1 How to access. The scripts and the codes are available on Zenodo and can be accessed at: https://doi.org/10.5281/zenodo.8365078
A.3.2 Hardware dependencies. To predict DNNs’ time on a new GPU and collect a new prediction dataset, a GPU is necessary. Our experiments involve A100, A40, TITAN RTX, V100, and GTX 1080 Ti GPUs. When predicting DNNs’ time using the existing dataset we collected, GPUs are not required.
A.3.3 Software dependencies. A virtual environment, Python pip, relevant pip packages, torch, and torchvision.
A.3.4 Data sets. We collect DNNs from common DNNs model
zoos and use the ILSVRC2012 dataset.
At the same time, we organize the experimental data into a
prediction dataset and save it as CSV files, including essential information such as network structure, layer FLOPs, kernel-by-kernel execution times, and layer-to-kernel mapping.
The prediction dataset is then randomly split into two parts
during the experiments: 1) training dataset, used to generate the
performance analytical model. 2) test set, which enables users to
input the network structure into the analytical model to obtain
predicted execution times and accuracy.
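As a rough illustration of this layout (not the artifact's actual scripts; column names and file names below are hypothetical placeholders), the prediction dataset could be loaded and randomly split as follows, with layer FLOPs obtainable without GPU execution via THOP [79]:

```python
# Illustrative only -- not the artifact's actual code.  "prediction_dataset.csv"
# and its columns are hypothetical stand-ins for the information described above
# (network structure, layer FLOPs, kernel execution times, layer-to-kernel mapping).
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("prediction_dataset.csv")
train_df, test_df = train_test_split(dataset, test_size=0.2)   # random split, as in the experiments

# Layer FLOPs can be counted without GPU execution using THOP [79], e.g.:
# import torch, torchvision
# from thop import profile
# model = torchvision.models.resnet50()
# macs, params = profile(model, inputs=(torch.randn(1, 3, 224, 224),))
```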
A.4 Installation
The artifact provides detailed installation steps in the Readme file. To predict DNNs’ time based on existing GPUs from the collected dataset, one needs to set up a virtual environment and install Python and the required pip packages. After that, simply run the provided scripts.
For predicting DNNs’ time on a new GPU and collecting a new
dataset, one must install torch and torchvision in addition to the
other dependencies mentioned above. Then run the scripts for the
prediction process.
A.5 Experiment workow
The experiment consists of two cases: 1) predicting DNNs time
based on existing GPUs in the dataset collected. 2) predicting DNNs
time on a new GPU and collecting a new dataset.
We recommend following the steps outlined in the Readme file.
1) run install.sh: start by executing the install.sh script to set up
the required environment and install all necessary dependencies.
2) follow run.sh: after successfully setting up the environment,
proceed to follow the steps outlined in the run.sh script to conduct
the experiments and collect the data.
A.6 Evaluation and expected results
Upon running run.sh, the following outcomes are expected: 1) the results of Table 2, which are the error rates when modeling ResNet-50 performance with the KW model; 2) the error rates of different models (E2E model, LW model, KW model, IGKW model) on GPUs (A100, A40, TITAN RTX, V100, GTX 1080 Ti), as shown in the Performance Models section; and 3) figures generated from the experimental data.
Executing the runNewGPU.sh script calculates the error rate for a specific network or for all networks in the random test set.
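The error rate reported by these scripts is a percentage error between predicted and measured execution times; a minimal sketch of such a metric (our wording, not necessarily the artifact's exact definition) is shown below.

```python
# A minimal sketch of a percentage-error metric between predicted and measured
# execution times; the artifact's exact definition may differ.
import numpy as np

def error_rate(predicted_ms: np.ndarray, measured_ms: np.ndarray) -> float:
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs(predicted_ms - measured_ms) / measured_ms) * 100.0)

print(error_rate(np.array([10.2, 4.1]), np.array([9.5, 4.4])))  # prints roughly 7.1
```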
Please be aware that the error rates and results shown in Figure 11, Figure 12, Figure 13, and Figure 14 may vary between runs due to the random selection of networks in the test set.