Deep Learning Training on Multi-Instance GPUs
Anders Friis Kaas
IT University of Copenhagen
anfk@itu.dk
Stilyan Petrov Paleykov
IT University of Copenhagen
stil@itu.dk
Ties Robroek
IT University of Copenhagen
titr@itu.dk
Pınar Tözün
IT University of Copenhagen
pito@itu.dk
ABSTRACT
Deep learning training is an expensive process that extensively uses GPUs, but not all model training saturates the modern powerful GPUs. Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads that don't require all the memory and compute resources of a full GPU. In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads of three sizes focusing on image recognition training with ResNet models. We investigate the behavior of these workloads when running in isolation on a variety of MIG instances allowed by the GPU in addition to running them in parallel on homogeneous instances co-located on the same GPU. Our results demonstrate that employing MIG can significantly improve the utilization of the GPU when the workload is too small to utilize the whole GPU in isolation. By training multiple small models in parallel, more work can be performed by the GPU per unit of time, despite the increase in time-per-epoch, leading to ∼3 times the throughput. In contrast, for medium and large-sized workloads, which already utilize the whole GPU well on their own, MIG only provides marginal performance improvements. Nevertheless, we observe that training models in parallel using separate MIG partitions does not exhibit interference, underlining the value of having a functionality like MIG on modern GPUs.
1 INTRODUCTION
Deep learning models have defeated the world champion of Go [27], can write coherent news articles [6], and have surpassed human abilities in image recognition [12]. Although these types of models come at high computational costs, there are also deep neural networks that demand far less from the hardware [5, 13, 32]. Today, most deep neural networks are efficiently trained on Graphics Processing Units (GPUs) thanks to the embarrassingly parallel nature of their operations. However, if the training process does not fully saturate the resources provided by a GPU, the remaining computational power of the GPU goes to waste since the training process is given exclusive access to the GPU resources.
Workload co-location is a technique to increase hardware utilization when a workload does not require the entire compute or memory resources of a device. It refers to running multiple workloads simultaneously on the same device so that these workloads share the resources of that device. While workload co-location is heavily studied for CPUs [9, 11, 17], its opportunities and challenges have been largely unexplored for new generation GPUs.
Both authors contributed equally to this research.
In this paper, we examine the workload co-location possibilities on GPUs enabled by a new technology from NVIDIA called Multi-Instance GPU (MIG). Using MIG, a user is able to partition specific NVIDIA GPUs into several logical GPUs (GPU instances) [20]. More specifically, we devise a performance characterization study for a MIG-enabled A100 GPU using deep learning workloads of three sizes focusing on image recognition training. The three sizes represent different complexities and hardware resource needs of model training: (1) small with ResNet26 on CIFAR, (2) medium with ResNet50 on ImageNet64x64, and (3) large with ResNet152 on ImageNet. We investigate the behavior of these workloads when running in isolation on a variety of MIG instances with different compute and memory resources allowed by the A100 GPU in addition to running them in parallel on homogeneous MIG instances co-located on the same GPU. Our results demonstrate that:
• When model training is unable to utilize the full GPU on its own, i.e., the small case, training multiple models in parallel using several MIG instances has significant benefits. One can train seven ResNet26 models in parallel with a latency penalty of 2.5X over training the same model in isolation on the whole GPU, leading to nearly 3 times the throughput. This can especially be beneficial for hyper-parameter tuning.
• In the medium and large cases, the models are sufficiently large to saturate the GPU and co-location has marginal to no benefit. In addition, these models cannot run on the smallest GPU instances as the memory needs of the models exceed the memory on those instances (5GB).
• Co-located instances run in parallel without any interference as long as the available memory per instance is enough for the runs, and co-located and isolated runs over the same type of instances perform similarly.
The rest of the paper is organized as follows. Firstly, Section 2 gives background on MIG and surveys related work. Section 3 follows this up with our experimental methodology and setup. Finally, Section 4 discusses the results, and Sections 5 and 6 draw conclusions from our findings.
2 BACKGROUND AND RELATED WORK
This section rst provides some background on MIG and how to cre-
ate MIG partitions. Then, we survey related work on benchmarking
deep learning and MIG.
2.1 Multi-Instance GPU (MIG)
Multi-Instance GPU (MIG) is a recent technology bundled with
NVIDIA’s Ampere GPUs. It allows these GPUs to be split into
Figure 1: Possible partitioning schemes on a NVIDIA A100-40GB GPU. Horizontals can overlap (co-location) but verticals cannot. For example, having a 3g.20gb instance is not compatible with 5x 1g.5gb instances (figure from [19]).
smaller GPU instances of varying sizes that can be used to run different workloads in parallel on the same GPU. On the hardware side, MIG-capable GPUs are divided up into multiple slices. These can be combined into GPU instances, providing a partitioning of the GPU. The memory of the GPU is split into 8 memory slices and the compute side is split into 7 compute slices, plus one reduced slice for overhead.
One consideration when enabling MIG is that it does not allow one model to be trained on multiple GPUs [19].
The number and types of partition combinations vary between the A30 and A100, with the latter supporting more profiles than its lower-spec counterpart. In this paper we have investigated the MIG capabilities of the A100. NVIDIA provides five readily available profiles (see Figure 1).
The smallest possible GPU instance is one with just one memory slice and one compute slice, 1g.5gb. Next, a 2g.10gb profile consists of two compute slices and 10 GB of memory, or two memory slices. The slices can also be referred to as fractions of the total resource. The other available profiles are 3g.20gb, 4g.20gb, and 7g.40gb. The last profile consists of almost all of the GPU resources. However, using the GPU without MIG mode is not analogous to running this large profile, as the compute capability of the GPU is hampered slightly due to MIG overhead. Different GPU instance sizes may be more efficient from a utilization perspective based on the size of the workload.
Many dierent partitions are possible as long as the maximum
resource capacity is not exceeded. For example, splitting the GPU
into a
4g.20gb
and
1g.5gb
instance is possible but two
4g.20gb
instances would exceed the compute resources of the device. There
is, however, a notable exception. While a split of one
4g.20gb
,
2g.10gb
, and
1g.5gb
instance is possible, one cannot proceed with
a split of
4g.20gb
and
3g.20gb
instances, despite the values sum-
ming up to the maximum resources of the device.
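As an illustration of this workflow, the sketch below scripts the standard nvidia-smi MIG commands from Python. The GPU index and the profile names passed to -cgi are assumptions that should be checked against the output of nvidia-smi mig -lgip on the target system.

```python
import subprocess

def run(cmd: str) -> None:
    """Echo and execute a shell command, raising on failure."""
    print("$", cmd)
    subprocess.run(cmd, shell=True, check=True)

# Enable MIG mode on GPU 0 (requires administrator rights and an idle GPU).
run("nvidia-smi -i 0 -mig 1")

# List the GPU instance profiles the device supports, e.g. 1g.5gb .. 7g.40gb.
run("nvidia-smi mig -i 0 -lgip")

# Create a 4g.20gb + 2g.10gb + 1g.5gb split, which the rules above allow,
# and create the matching compute instances in the same step (-C).
run("nvidia-smi mig -i 0 -cgi 4g.20gb,2g.10gb,1g.5gb -C")

# Verify the resulting GPU instances.
run("nvidia-smi mig -i 0 -lgi")
```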
2.2 Related work
2.2.1 Benchmarking with deep learning. There have been several efforts towards creating a standardized deep learning benchmark. In 2012, BenchNN [7] was proposed as an alternative benchmark for neural network performance analysis on hardware. Fathom [2] was released in 2016 and contains eight workloads, each of which involves the training of a distinct (at the time) state-of-the-art deep learning model. 2017 saw the introduction of BenchIP [29], which contains a combination of microbenchmarks and macrobenchmarks along with a set of hardware evaluation metrics. Similar to Fathom, TBD [33] both compiled and analyzed a suite of deep learning workloads targeting the state of the art in 2018. There have also been efforts towards providing GPU benchmarks with deep learning workloads in commercial settings, e.g., Lambda provides benchmarks of training and inference time of various popular deep learning models for many different consumer-grade and data-center-grade GPUs [15], and Baidu Research has released a micro-benchmark software package called DeepBench [4].
On the other hand, the first official standardization effort for deep learning was the MLPerf [24] initiative in 2018, now known as MLCommons, which provides training and inference benchmarks ranging across use cases of image recognition, speech recognition, etc. Following this initiative, the Transaction Processing Performance Council released a standardized benchmark focusing on end-to-end machine learning, not just deep learning, called TPCx-AI [31].
Despite these efforts, doing a performance analysis for a hardware device using deep learning benchmarks is still a challenge. Since deep learning is a rapidly evolving field, standard reference models are often only valid for a few years before they are superseded by better or more efficient models. In addition, training times are especially long and may require expensive and powerful hardware, which makes deployment of these benchmarks sometimes difficult in academic settings. Finally, other than TPCx-AI, there is not a standardized benchmark where one can easily scale the benchmark up and down to stress test hardware resources of different strengths. Most TPCx-AI use cases, on the other hand, do not sufficiently stress the GPU hardware. Therefore, in this work, we design a custom benchmark (see Section 3.3) to both stress the GPUs under test and scale the workload up and down.
2.2.2 MIG. MIG is a relatively new technology and there have not been many works that thoroughly explore its possibilities. Wang et al. [32] compare the effectiveness of their hardware utilization squeezer to MIG when it comes to improving both GPU utilization and deep learning training times. Tan et al. [28] build a system that, given a set of deep learning inference tasks and service-level objective constraints, is capable of automatically and seamlessly reconfiguring MIG-enabled GPUs on Amazon Web Services (AWS) to the most efficient MIG profile configuration. Our work is complementary to these works since we focus on devising an experimental methodology to investigate the strengths and limitations of MIG.
3 METHODOLOGY
This section provides a detailed overview of our methodology. First,
Section 3.1 describes the hardware used for conducting our experiments. Next, Section 3.2 defines our metrics, their relevance, and how we measure them, followed by the workloads we designed for this study in Section 3.3. Finally, Section 3.4 details the experiments we ran.
3.1 System
For characterizing the performance of MIG, we use a DGX Station
A100, which is composed of an AMD EPYC 7742 CPU and four
A100 GPUs.
The CPU consists of 64 cores, amounting to 128 logical cores (threads) operating at a base clock of 2.25 GHz and capable of reaching a maximum boosted clock of up to 3.4 GHz [3]. The L3 cache is 256 MB and DRAM is 512 GB. The A100 GPUs have 40 GB of high bandwidth memory, and the graphics processors are based on the SXM form factor [22]. They support a maximum of 7 MIG instances at 5 GB of memory per instance (see Section 2.1). The DGX Station A100 setup represents a pre-packaged solution provided by NVIDIA, and the operating system is DGX OS, which is a variant of Ubuntu 20.04.4 LTS with hardware-specific optimizations.
3.2 Metrics
The metrics we used to reason about the performance of MIG
can be classied into three categories: application-level metrics
(Section 3.2.1), GPU metrics (Section 3.2.2) and CPU metrics (Sec-
tion 3.2.3).
3.2.1 Application-level metrics. Application-level metrics are the
metrics related to model training.
Time per epoch is the time it takes to finish a single epoch of training for a particular model. The reason GPUs are used in deep learning is to reduce training time by exploiting the embarrassingly parallel nature of most deep learning computations. Therefore, time per epoch is arguably the most fundamental metric to look at and optimize. We obtain the time per epoch by dividing training time by the total number of epochs we use to train the models.
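As a minimal sketch of how this metric can be collected in TensorFlow (not the paper's exact instrumentation), the hypothetical callback below records the wall-clock time of every epoch during model.fit.

```python
import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Records the wall-clock duration of every training epoch."""

    def __init__(self):
        super().__init__()
        self.epoch_times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.perf_counter() - self._start)

# Usage: pass an instance to model.fit(..., callbacks=[timer]) and report
# sum(timer.epoch_times) / len(timer.epoch_times) as the time per epoch.
```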
Accuracy indicates the predictive power of a machine learning model. For a set of categorical predictions, it is defined as the share of predictions that are correct. While accuracy does not determine how well hardware resources are utilized, and is hence not the primary metric in this work, one should never ignore it when focusing on machine learning. We record the training and validation accuracy to ensure the models are training correctly.
3.2.2 GPU metrics. Throughout our experiments, the GPU metrics are reported in the context of both the full GPU and its respective partitions. While instances are separate units that execute each workload with the provided number of resources, they are also part of the full GPU. In addition, the full GPU has an extra compute unit that cannot be included as part of the instances ([23] and Section 2.1). Therefore, it is essential to track the metrics described below both at the instance level and at the level of the whole GPU.
Except for the GPU memory metric, which is collected using the NVIDIA System Management Interface (nvidia-smi), we use the Data Center GPU Manager (dcgm) to collect all the GPU metrics. nvidia-smi does not provide measurements with MIG instances and dcgm does not measure GPU memory used. Therefore, we need both of these sources to fetch all the required information.
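A minimal collection loop along these lines is sketched below. The DCGM profiling field IDs used for GRACT, SMACT, SMOCC, and DRAMA (1001, 1002, 1003, and 1005) and the one-second sampling interval are assumptions that should be verified locally (e.g. with dcgmi dmon -l); the nvidia-smi memory query mirrors the split between the two tools described above.

```python
import subprocess

# DCGM profiling field IDs (verify locally with `dcgmi dmon -l`):
# 1001 = GRACT, 1002 = SMACT, 1003 = SMOCC, 1005 = DRAMA.
FIELDS = "1001,1002,1003,1005"

# Sample the profiling metrics once per second; `dcgmi dmon` reports one
# row per GPU / MIG entity it can see.
dcgm = subprocess.Popen(
    ["dcgmi", "dmon", "-e", FIELDS, "-d", "1000"],
    stdout=open("dcgm_metrics.log", "w"),
)

# GPU memory is not exposed through dcgm in our setup, so it is polled
# separately through nvidia-smi.
smi = subprocess.Popen(
    ["nvidia-smi", "--query-gpu=timestamp,memory.used",
     "--format=csv", "-l", "1"],
    stdout=open("gpu_memory.log", "w"),
)

# ... run the training workload here ...

for proc in (dcgm, smi):
    proc.terminate()
```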
GRACT, Graphics Engine Activity, shows the fraction of time any portion of the graphics or compute engines was active. We are particularly interested in observing how active (and utilized) the whole GPU and its respective instances are depending on the workload being executed.
SMACT, Streaming Multiprocessor (SM) Activity, is the fraction of time at least one warp, the GPU unit of work allocation, is active on a multiprocessor, averaged over all SMs.
Note that active does not necessarily mean a warp is actively computing. For instance, warps waiting on memory requests are also considered active. The official documentation [21] mentions that while a SMACT value of 0.8 or larger may indicate effective use of the GPU, a value of 0.5 or smaller likely indicates ineffective usage of the GPU. Moreover, we deem the values that range between these two boundaries as neither effective nor ineffective, i.e., neutral. SMACT is complementary to GRACT. While GRACT is more focused on how busy the device is, SMACT shows us the actual activity of the multiprocessors.
SMOCC, SM Occupancy, is the fraction of resident warps on a multiprocessor, relative to the maximum number of concurrent warps supported on a multiprocessor. It is a complementary metric to GRACT and SMACT. High SMOCC indicates more effective GPU usage for workloads that stress the memory or memory bandwidth. However, if the workload is compute-bound, high SMOCC may not necessarily indicate more effective GPU usage.
DRAMA, Memory Bandwidth Utilization, is the fraction of cycles where data was sent to or received from device memory. Higher DRAMA shows higher memory utilization of the device, and it serves as a complementary metric to the three metrics above that focus on the utilization of a GPU's compute resources.
GPU memory usage for models being trained is essential for deciding the ideal setup for MIG partitioning, as the partitions can have varying amounts of memory. By default, a deep learning framework like TensorFlow allocates all available GPU memory from the moment training starts, which is done to reduce memory fragmentation [30]. However, this prevents us from monitoring the actual GPU memory consumption characteristics of our training workloads. Therefore, we disable such behavior during our experiments to quantify the GPU memory actually required by the workload.
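Concretely, disabling this up-front allocation in TensorFlow amounts to enabling memory growth on every visible GPU before any GPU operation runs; the snippet below is a sketch of that configuration rather than the paper's exact code.

```python
import tensorflow as tf

# Ask TensorFlow to grow its GPU memory pool on demand instead of grabbing
# all available memory at startup, so that the measured usage reflects what
# the training workload actually needs. Must run before any GPU op executes.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```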
3.2.3 CPU metrics. In addition to tracking the GPU compute and
memory utilization, we also monitor how MIG impacts the CPU
usage characteristics.
CPU utilization. We monitor CPU utilization on the process level as an aggregate over the threads of the training process using the tool top. Since the DGX Station has 128 logical cores, the maximum utilization of the CPU would be 128 × 100% = 12,800%. 100% CPU utilization could therefore mean a few different things. It could mean that one of the logical cores is working at maximum capacity. Alternatively, it could mean that n cores are working with an average of 1/n utilization per core.
Main memory usage on CPU is also an important metric to monitor. Some operations, such as model initialization and data management, depend on the CPU memory. We monitor total memory allocation to training processes using top in order to observe the memory requirements of running a single model or multiple models. More specifically, we report RES, resident memory, which is the total physical memory allocated to a process.
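We obtain these numbers from top; a roughly equivalent programmatic probe is sketched below using the psutil package, which is our own assumption and not part of the paper's tooling.

```python
import time
import psutil

def sample(pids, interval=1.0):
    """Return aggregate CPU utilization (%) and resident memory (bytes)
    for the given process IDs, measured over `interval` seconds."""
    procs = [psutil.Process(pid) for pid in pids]
    for p in procs:
        p.cpu_percent(None)  # prime the per-process CPU counters
    time.sleep(interval)
    cpu = sum(p.cpu_percent(None) for p in procs)  # like top's %CPU, can exceed 100
    rss = sum(p.memory_info().rss for p in procs)  # like top's RES, in bytes
    return cpu, rss

# Example: cpu, rss = sample([12345, 12346]) for two co-located trainers.
```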
3.3 Workloads
We design three workloads of different sizes (small, medium, and large) to assess the performance of MIG under different loads. All three workloads consist of training a ResNet model on an image dataset. The complexity of the dataset and model vary between workload sizes. Here, we first go over the selected datasets, after which we discuss the models. All the workloads are implemented in TensorFlow [1].
3.3.1 Data and preprocessing. The three workloads have different dataset sizes.
The small workload, with resnet_small, is trained on CIFAR-10 [14], which is a relatively small dataset containing 60,000 labeled 32×32 pixel images divided over 10 classes. The dataset is split into 50,000 training images and 10,000 test images. The entire CIFAR-10 dataset uses approximately 32 × 32 pixels × 3 channels × 8 bytes × 60,000 images ≈ 1.5 GB of memory. It is therefore feasible to load the entire dataset into memory at runtime instead of dynamically streaming the data into memory from disk. In terms of preprocessing, we normalize the dataset by subtracting the mean image from every image in the dataset, as suggested in [10]. We train resnet_small on 90% of the training set and use the remaining 10% as a validation set.
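A sketch of this preparation step in TensorFlow/Keras is shown below; the cast to float and the exact slicing used for the split are our own illustrative choices, while the 90/10 proportion follows the text above.

```python
import tensorflow as tf

# CIFAR-10 fits comfortably in main memory, so it is loaded up front.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")

# Normalize by subtracting the mean image computed over the training set.
mean_image = x_train.mean(axis=0)
x_train -= mean_image
x_test -= mean_image

# Hold out the last 10% of the training images as a validation set.
split = int(0.9 * len(x_train))
x_tr, y_tr = x_train[:split], y_train[:split]
x_val, y_val = x_train[split:], y_train[split:]
```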
The medium workload, with resnet_medium, uses a downsampled version of the ImageNet2012 dataset called ImageNet64×64 [8]. ImageNet2012 is a collection of 1,431,167 labeled images from 1,000 different classes. The dataset is split into 1,281,167 training images, 50,000 validation images, and 100,000 test images. Unlike CIFAR-10, the dataset is not balanced, and the images of ImageNet are furthermore not all uniform in size.
While loading the whole scaled-down dataset into memory would in theory be possible, as it only demands 64 × 64 pixels × 3 channels × 8 bytes × 1,431,167 images ≈ 17.5 GB, we decided against doing this. The main reason is that we wanted to make the medium-sized workload comparable to the large workload, so that the only differences between these two experiments would be the size of the dataset and the size of the model. Secondly, some of our experiments involve training multiple models in parallel, which would require loading up to seven versions of the dataset into memory, which could present an issue. Instead of loading all of the images into memory, we use the data generator ImageDataGenerator from TensorFlow to dynamically stream training data from disk. The images are also preprocessed using the imagenet_utils.preprocess_input function, and we empirically determine the smallest optimal values of workers and max_queue_size to be 1 and 10, respectively. Setting the number of workers to 1 means that TensorFlow will create and use 1 CPU thread to fetch training data, and setting the maximum queue size to 10 means that a maximum of 10 preprocessed batches of training data will be stored in RAM at any point in time. The purpose of storing more than one batch of training data in RAM at the same time is to minimize the amount of time that the GPU spends waiting for training data. To determine the optimal values for workers and max_queue_size, we used TensorBoard, which shows the amount of time spent on waiting for input to be fed to the model. The values of workers and max_queue_size were then gradually increased until the time spent on input was close to 0.
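A sketch of this streaming input pipeline is shown below. The ImageDataGenerator, preprocess_input, workers, and max_queue_size usage follows the description above, while the directory path and class_mode are hypothetical assumptions about how the data is laid out on disk; the fit call is only indicated in a comment because the model is constructed separately (Section 3.3.2).

```python
from tensorflow.keras.applications import imagenet_utils
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stream batches from disk instead of loading the full dataset into RAM.
datagen = ImageDataGenerator(
    preprocessing_function=imagenet_utils.preprocess_input)

train_flow = datagen.flow_from_directory(
    "imagenet64/train",      # hypothetical path with one subfolder per class
    target_size=(64, 64),
    batch_size=32,
    class_mode="sparse")

# The generator is later consumed by the model with one producer thread and
# at most 10 preprocessed batches buffered in RAM:
#   model.fit(train_flow, epochs=5, workers=1, max_queue_size=10)
```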
The large workload, with resnet_large, is trained on ImageNet2012 [25]. Every picture is resized to 224×224 using the nearest pixel interpolation method in order to conform with the size of images used in the original ResNet specification [10]. Loading all of ImageNet into memory is impossible due to its exceedingly large size (greater than 100 GB). Instead, we dynamically load batches into memory, once again using the data generator ImageDataGenerator that is available in TensorFlow. This generator automatically fetches one batch of training data at a time from disk, preprocesses it and stores it in RAM, and then transfers it to GPU memory where it can be consumed by the ResNet model. Additionally, we experimentally determine workers=16 and max_queue_size=20, using the same methodology as in the medium case for these parameters.
3.3.2 Deep learning models. All three workloads feature a ResNet convolutional network as their model [10]. ResNets are a very popular choice of model for image classification and image segmentation [16, 26] and are easy to scale up and down in size. This aligns well with our use case, where we want to test the performance of GPU MIG instances of different sizes. We train ResNet26V2, ResNet50V2, and ResNet152V2 models for the small, medium, and large workload cases, respectively. The larger ResNet models have more layers and parameters. The medium model has about twice the number of parameters of the small one, and the large model has about twice the number of parameters of the medium model.
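As a concrete illustration, the sketch below shows how the medium model might be instantiated in Keras. ResNet50V2 is available in tf.keras.applications, but the custom classification head, the use of include_top=False for the 64×64 inputs, and the optimizer choice are our own assumptions rather than the paper's exact configuration (the small model, ResNet26V2, is not a stock Keras application and would need a custom definition).

```python
import tensorflow as tf

def build_medium_model(num_classes=1000, input_shape=(64, 64, 3)):
    """ResNet50V2 backbone with a fresh classification head (no pretraining)."""
    backbone = tf.keras.applications.ResNet50V2(
        include_top=False, weights=None,
        input_shape=input_shape, pooling="avg")
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(
        backbone.output)
    model = tf.keras.Model(backbone.input, outputs)
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_medium_model()
print(model.count_params())  # rough check of the model-size relationship above
```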
3.4 Experiments
This subsection provides an overview of our experimental runs.
Each experiment was run twice (i.e., replicated) to ensure reliable
results.
We use a batch size of 32 for all the models to strike a balance
between statistical accuracy, memory requirements, and time per
epoch. All experiments with the small model train for 30 epochs,
while all medium and large model runs are for 5 epochs. This was
done to strike a balance between training results and time spent on
a run, since there is a dramatic increase in complexity and time to
completion for medium and large model runs.
The number of concurrently trained models depends on the MIG profile used, which determines the allocation of GPU SM and memory slices. Our experiments cover workload executions on a non-MIG instance and the five MIG profiles provided by NVIDIA (Section 2.1). For each profile and each dataset size, we perform two types of runs. The first runs one training in isolation on a single instance of that profile. The second runs several homogeneous MIG instances in parallel, each training a model at the same time, using the maximum number of instances that can be configured for the given profile. For example, for the 1g.5gb profile there is one run on a single instance of the profile and a second run with the maximum number of instances that can be created for that profile, which is 7, so that 7 models are trained in parallel. Furthermore, we decided to focus on homogeneous instances for the parallel runs in this study to scope down the total number of experiments. Investigation of the impact of different instance combinations is left for future work.
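The sketch below illustrates one way such parallel runs can be launched: each training process is pinned to a MIG instance by setting CUDA_VISIBLE_DEVICES to that instance's UUID as reported by nvidia-smi -L. The train.py script and its arguments are hypothetical placeholders for the training code.

```python
import os
import re
import subprocess

# Parse MIG device UUIDs (entries of the form "MIG-<uuid>") from nvidia-smi -L.
listing = subprocess.run(["nvidia-smi", "-L"],
                         capture_output=True, text=True).stdout
mig_uuids = re.findall(r"\(UUID:\s*(MIG-[^)]+)\)", listing)

# Launch one isolated training process per MIG instance.
procs = []
for uuid in mig_uuids:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)
    procs.append(subprocess.Popen(
        ["python", "train.py", "--workload", "small"],  # hypothetical script
        env=env))

for p in procs:
    p.wait()
```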
There are two exceptions to the parallel runs using MIG profiles: the 7g.40gb and 4g.20gb profiles cannot be instantiated more than once given the available resources. Furthermore, we do not report GPU metrics derived from DCGM for 4g.20gb because DCGM does not report anything for this profile; Section 5 elaborates on this issue. However, for single-instance runs, we deem an experiment with the 3g.20gb profile comparable to 4g.20gb. The experiment with the full MIG profile, 7g.40gb, serves to explore possible changes in performance when MIG mode is enabled on a GPU, in comparison to when it is not.
Lastly, NVIDIA DGX systems and the surrounding ecosystem of tools supporting them can be considered state-of-the-art technologies and as such are constantly evolving. Before running our experiments, we made sure to acquire the most up-to-date software and drivers. The version of nvidia-smi and of the GPU drivers for the A100 GPUs is 510.47.03. The CUDA version used is 11.6. DCGM is at version 2.3.2. In addition, we use Python version 3.9.7, while TensorFlow is at version 2.7.0.
4 RESULTS
We now discuss and analyze our results. In total, a full run of our experiments took approximately 135 hours, or about five and a half days. Running the medium and large workloads on the smallest GPU instance, 1g.5gb, resulted in an out-of-memory error, and we therefore only have full run results to show for resnet_small in the 1g.5gb experiments.
During DCGM data collection, the tool terminated unexpectedly on two occasions, leaving only partially complete data for our initial experiments. The affected setups were the non-MIG-enabled GPU and the 3g.20gb one profile, and only our large workload executions were impacted. While replicating the experiments, we found that the replication data could supplement the original analysis. The replication runs also experienced two such interruptions, but these affected different instances, so the original data was not impacted there. Since the replications used an identical setup for all workload executions, the complete data for the large workload on the 3g.20gb one and non-MIG setups has been taken from them and included in the visualizations we present in the upcoming section. The incomplete data was highly similar to the complete results from the replications; we discuss these challenges in Section 5. Furthermore, metric reporting for the 4g.20gb instance is not viable due to challenges with querying metrics from DCGM for this instance size.
Overall, the results indicate that smaller GPU instances result in longer training times but increased utilization. Additionally, we found no penalty in training several models in parallel in different GPU instances compared to training a single model at a time. Lastly, for the medium and large workloads, we do observe a significant difference between using the A100 in non-MIG mode and using it in MIG mode as a 7g.40gb GPU instance.
4.1 Time per epoch
Figure 2 shows the epoch times for resnet_small. Each bar represents the total amount of time needed to complete one epoch of training for resnet_small, averaged over 30 epochs of training. For the experiments where we trained multiple models in parallel, the bars have been given the same color, e.g. the brown bars with a star hatch pattern represent seven models trained in parallel.
Figure 2: Time per epoch for resnet_small per experiment. Although smaller instance sizes result in longer training times, it can be observed that there is not a 1:1 relationship between instance size and training time, e.g. a 1g.5gb run does not take seven times as long as a 7g.40gb run.
(a) resnet_medium (b) resnet_large
Figure 3: Time per epoch for resnet_medium and resnet_large. Note the different y-axes. The medium and large sized workloads do not benefit as much from MIG as the small workload.
From the chart it can be seen that, in general, using smaller instances results in longer training times. However, the relationship between the number of compute slices and training time is not 1:1. For instance, training an instance of resnet_small on a 1g.5gb
instance takes approximately 39.8/16.1 ≈ 2.47 times longer than training it on the 7g.40gb instance despite having 1/7 as much compute power and 1/8 as much memory.
We also observe that training multiple models in parallel on separate instances has no significant effect on the training time when compared to training a single model on an instance of the same size. For example, training a single instance of resnet_small on a 2g.10gb instance for one epoch took 25.7 seconds, whereas training three models in parallel on 2g.10gb instances took between 25.6 and 26.0 seconds for one epoch. This supports the claim by NVIDIA that separate MIG GPU instances are completely isolated from each other.
The fact that there is not a 1:1 relationship between the size of the GPU instance and the training time, combined with the fact that there is no training time penalty associated with training multiple models in parallel, presents a unique opportunity with MIG. When performing hyperparameter optimization of a machine learning model, one could run seven models in parallel with different hyperparameter settings on seven different instances of 1g.5gb. This would be significantly faster than sequentially running the model seven times on a 7g.40gb instance. As an example, for resnet_small, it would take (7 × 16.1)/39.8 ≈ 2.83 times as long to train seven models sequentially on a 7g.40gb instance as in parallel on seven 1g.5gb instances.
Charts showing the time per epoch for the large and medium models can be seen in fig. 3. The processes running the medium and large workloads crashed immediately when running on 1g.5gb. For the experiments that ran correctly, we saw a much larger penalty associated with running the workload on smaller instances. For instance, running one epoch of resnet_medium in a 7g.40gb instance took 35.4 minutes, whereas running three workloads in parallel in 2g.10gb instances took 106.8 minutes per epoch. Running three medium workloads sequentially in a 7g.40gb instance thus takes almost exactly the same time as running three medium workloads in parallel in 2g.10gb instances ((35.4 × 3)/106.8 ≈ 0.99).
For resnet_medium we saw marginal improvements in running two parallel workloads in 3g.20gb instances compared to running them sequentially in a single 7g.40gb instance. For resnet_large we saw very similar results as for resnet_medium; running three workloads in parallel in 2g.10gb instances took exactly as long as it would have taken to run three large workloads sequentially in a 7g.40gb instance.
Regarding the non-MIG experiments, we see slightly faster executions in the non-MIG runs compared to the 7g.40gb runs for all of resnet_small, resnet_medium, and resnet_large. The improvement is smallest for resnet_small, where the non-MIG time per epoch is only 0.7% faster than the 7g.40gb time per epoch. For resnet_medium and resnet_large, however, we clearly see significant improvements when disabling MIG. The time per epoch for a resnet_medium run in a non-MIG GPU is 2.8% faster than in 7g.40gb. For resnet_large the improvement is even larger at 2.9%. A likely reason why the non-MIG runs were slightly faster is that a compute unit gets disabled when MIG is enabled. The small workload might not have benefitted as much from disabling MIG as it already did not fully utilize all of the compute power available in 7g.40gb. Providing it with even more SMs would therefore not be beneficial.
4.2 GPU utilization
We proceed by going over the DCGM and GPU memory metrics.
4.2.1 DCGM. We provide graphical representations of the performance of four metrics obtained from DCGM. Our focus is on Graphics Engine Activity (GRACT), Streaming Multiprocessor Activity (SMACT), and Occupancy (SMOCC). In addition, we provide a brief analysis of the performance of Memory Bandwidth Utilization (DRAMA). For each of the three workloads, we created two types of graphs per metric. The first type focuses on the metric performance for the full GPU, while in the second we display information for the individual instances to better understand their impact. Separately, and as a baseline, we also provide DCGM metrics for a full GPU with MIG mode disabled. Our expectation was that the results reported from these model runs would be similar, or nearly identical, to the ones reported by the 7g.40gb instance.
On the y-axes we display the median of the reported average-over-time metric values in percentages. Our x-axes show the different device groups corresponding to the aforementioned MIG profiles. In a 3g.20gb parallel example, consisting of two parallel 3g.20gb instances as part of the whole GPU, the first graph shows the median value of a given metric for the full GPU, while the second one displays the median of that metric for each of the two instances. It is worth remembering that our device allocations are homogeneous, and this is why we omit the otherwise possible 1g.5gb instance in this allocation. In terms of non-MIG reporting, we include the same device-level values in both device- and instance-level visualizations. This is a decision taken to allow for comparison with a baseline, both for the full device but also across instances.
In the following analysis, we will look at specific device groups' performance and their comparison to each other, for each metric. Our explicit focus is on the MIG-enabled device groups. Towards the end of each metric analysis, we specifically focus on a comparison with the case for a non-MIG device.
GRACT. The highest reported Graphics Engine Activity for the small workloads results from the 1g.5gb parallel device group. The reported activity across its seven instances ranges from 90.2% to 90.5%, which amounts to an average of 90.2% utilization of the full GPU. A similar result is produced by the device group covering an individual instance of the 1g.5gb profile, where the overall device activity is dramatically lower due to the small fraction of the overall resource utilized (see fig. 4a and fig. 4d).
Although the 2g.10gb parallel device group reports ∼84% activity throughout its instances, the overall device activity is at 71.8% for the small workload. The 2g.10gb profile is the second-highest in terms of GRACT both for individual and parallel instance allocations. The lower overall device utilization, in comparison to the 1g.5gb parallel group, can be explained by the omitted compute slice. This is due to the homogeneous nature of the device groups used in our experiments. If an additional 1g.5gb allocation were utilized in parallel, which is supported by NVIDIA, a higher value would be expected.
The case is similar with 3g.20gb parallel, where one compute slice remains unused despite the complete allocation of available memory. 3g.20gb one, however, is ∼10 percentage points more utilized than 2g.10gb one when it comes to the entire device, despite the instance-level utilization being 5.2 percentage points lower. A possible explanation is the previously mentioned consideration of allocated resources per instance with respect to the total available for the device. More allocation does not necessarily mean better utilization in terms of instance-level activity.
An interesting detail is the activity of the partition covering the entirety of the GPU resources as part of the 7g.40gb profile. Using that profile, the reported utilization is 71.6% for the small model, the lowest of all reported values. Such a result suggests that, for workloads of that size, a GPU with these specifications is better utilized if smaller portions of it are used instead. This comes at the cost of extra time to completion, which we discussed separately. However, once the workloads grow in scale, the differences between device groups' GRACT performance diminish.
In the medium workload (see fig. 4b and fig. 4e), the 2g.10gb one group reported the highest instance-level activity at 96.3%. The 2g.10gb parallel group reported highly similar values across its instances at 96.1% and the second-highest utilization of the full device, 82.4%. The highest device utilization was achieved by the 7g.40gb one profile at 88.6%. While its respective instance, also at 88.6%, is the lowest utilized compared to the other profiles' instances, the difference from the highest-utilized one, 2g.10gb one, is 7.7 percentage points.
(a) Small workload on full device (b) Medium workload on full device (c) Large workload on full device
(d) Small workload per instance (e) Medium workload per instance (f) Large workload per instance
Figure 4: Median GRACT in percentages across different device groups.
In the large workload (see fig. 4c and fig. 4f), the highest utilized instances belonged to the 2g.10gb profile, reporting identical values of 96.9%. The difference between the highest- and lowest-performing profiles in the large run is 6.1 percentage points. This illustrates how the utilization differences between setups shrink as the workloads scale.
There was an interesting phenomenon when it comes to full device utilization. In the small run, the 7g.40gb one profile was utilized at 71.6% while 1g.5gb parallel was the most utilized, reporting 90.2%, a difference of ∼19 percentage points. In the medium workload, the 7g.40gb profile, reporting 88.6% utilization, exceeded all other profiles with a device-level difference of 6.2 percentage points compared to the second highest-utilized device, the 2g.10gb parallel. Similar behavior was observed in the large workload, where the difference between the two highest-utilized devices was 7.7 percentage points.
However, as we acknowledged, the context of full device metric calculations can turn out to be misleading and, in our case, incomplete. We are only using homogeneous instances, sometimes resulting in unutilized resources. In the case of 2g.10gb parallel specifically, an extra 1g.5gb instance might have lowered the difference between the two device groups, or possibly changed the order of ranking.
In terms of non-MIG device data, the reported GRACT for all workloads was highly similar to the values for 7g.40gb. In all cases, the 7g.40gb profile reported slightly higher activity, not surpassing 0.2 percentage points.
(a) Full device, small (b) Per instance, small (c) Full device, medium (d) Per instance, medium (e) Full device, large (f) Per instance, large
Figure 5: Median SMACT for all experiments.
SMACT. When it comes to the activity of streaming multiprocessors, the highest instance-level metric values for the small workload were reported by the two 1g.5gb device groups. The performance of the instances in the parallel device group ranged from 75.2% to 75.4%, which resulted in 75.1% SM Activity for the full GPU, the highest in the context of the full device. Similar to the Graphics Engine Activity, the second-highest SM Activity results from the instances of the 2g.10gb profile. The parallel device group reported values between 60.7% and 61% across its instances, and 52.1% for the full GPU (see fig. 5a and fig. 5b). This shows that increasing the resources by an extra compute and memory slice reduces the
overall instance SM Activity by ∼14 percentage points. However, as we discussed earlier, such increases may benefit the time to completion. The lowest-utilized instance was from the 7g.40gb one profile, with 40% reported SM Activity.
Drawing on these results, it can be argued that the 40GB version of the A100 is not well utilized when it comes to our small workload experiment, especially if the full device is used. If 7 instances are used, the total amount of work completed is 7 times more compared to the case of a single instance. This can be beneficial depending on the context and use case, despite the reported device-level SM Activity of 75.1% being less than the suggested 80% as an indicator of effective utilization. As we covered earlier in section 3.2, all values less than 50% can be indicative of ineffective usage, while a categorization as effective usage would require more than 80%. Three of our profiles, 3g.20gb, 2g.10gb, and 1g.5gb, and their instance device groups reported values higher than 50% but less than 80%, which we argue to be in the neutral range. However, the reported ∼75% SMACT for 1g.5gb is close to the suggested value indicative of effective utilization. The 7g.40gb profile and its instance device group can be categorized as an ineffective choice.
It is, however, in the medium and large workloads (fig. 5d and fig. 5f, respectively) that the results for SM Activity become more interesting. For both of these workloads, the reported SM Activity values across device groups not only follow the same pattern but are almost the same between the two workloads, with insignificant differences. This is seen in the metric reporting for the instances as well as for the full device (see fig. 5c and fig. 5e).
In both medium and large workload executions, the least SM Activity per instance is reported by the 7g.40gb one device group. However, this time the respective values are 73.4% and 74.4%. This represents a growth of more than 30 percentage points compared to the performance of the same device group in the small workload execution. 73.4% and 74.4% are both values that are close to the suggested 80%+ for a classification of a device as effective. An explanation for the higher activity can be the nature of the workloads, expressed in their larger scale, which has an impact on both the utilization of compute units and memory. We discuss memory in more detail in the upcoming analysis of the SM Occupancy results, as well as in a dedicated subsection as part of this analysis.
The highest value for SM Activity, also in both medium and large workloads, results from the 2g.10gb instances. It is within the range of 91.3% to 91.8% across the two workloads, which represents a minor difference. In the medium workload, the 2g.10gb one device group slightly outperforms its parallel analogue at the instance level by ∼0.2 percentage points. In the large workload, one of the parallel instances matches the SM Activity value of the individual profile's instance, while the rest report highly similar results. The range, therefore, between the lowest- and highest-reported SM Activity for medium and large workload runs is ∼18 percentage points. However, if we reduce the context for calculating the ranges to only include partitioned device groups (i.e., excluding the 7g.40gb one profile), that range decreases to 6.5 percentage points.
Overall, for both types of workload runs, 4 out of 5 device groups, respectively belonging to 2 distinct profiles, can be classified as effectively utilized when it comes to SM Activity per instance.
In the context of the full device SM Activity reporting, 7g.40gb one slightly outperforms 3g.20gb parallel in both the medium and large workload runs. The 2g.10gb parallel device group is still the most effective, with ∼78.5% SM Activity. None of the device groups reported values higher than 80% for the full device. However, it is once again worth acknowledging the fact that the device groups we use for all experiments are homogeneous and do not sum up to the entirety of the device resources. This is also reflected in the device-level calculations.
In consideration of the non-MIG-enabled device, all reported SMACT values were once again highly similar to the 7g.40gb profile, with minor differences in favor of 7g.40gb as part of the small workload execution.
(a) Full device, small (b) Per instance, small
(c) Full device, medium (d) Per instance, medium
(e) Full device, large (f) Per instance, large
Figure 6: Median SMOCC for all experiments.
SMOCC. The small workload (fig. 6a and fig. 6b) reports the lowest SM Occupancy values of all 3 workload types. Across the instance device groups, the 1g.5gb profile reports the highest values, ranging between 34.9% and 35.4%, while the lowest value once again results from the device group for the 7g.40gb one instance with 20.3%. The device-level values are highest for the 1g.5gb parallel device group (35%), while 7g.40gb one reports the same value as its instance, 20.3%.
Considered together with the lower SMACT values reported for this workload, the low SMOCC values further support the hypothesis that the GPU can be deemed underutilized for this type of workload. Another relevant aspect is the reported Graphics Engine Activity. While it is relatively high for the small workload across most of the instances, there are still significant differences in comparison to the GRACT for the medium and large workloads. These differences often amount to ∼15 percentage points or more in favor of the medium and large workloads' reported values.
For the medium and large workload executions (as seen in fig. 6c, fig. 6d, fig. 6e, and fig. 6f), the distribution of values for SM Occupancy follows a similar trend to the one for SM Activity. In addition, each of the device groups in the medium workload execution shares almost identical SMOCC values with its respective analogue in the large workload. This is observable both across the metric reporting for the full device, as well as for the individual instances. For both workload types, the highest occupancy across instances results from the 2g.10gb profile and its respective device groups, while the lowest is part of the 7g.40gb one partition. The difference between the lowest- and highest-reported occupancy per instance is 17.7 percentage points for both workloads.
Drawing on these results, it is also worth considering the effect of memory bandwidth limitations [21]. Our workloads are not strictly bandwidth-limited when it comes to memory. However, the medium and large workloads can be considered both compute- and memory-limited due to their larger scale. Referring back to the Graphics Engine Activity, we see values in the ∼90% range for most of the instances used across these workloads, while for SM Activity most values reported are near the ∼85% range. This can be considered high utilization with relatively high (mostly 50%-60%) SM Occupancy.
We acknowledge the complex nature of calculating occupancy and the implications of categorizing reported values for that metric as effective or ineffective. As reported in [18], "Low occupancy results in poor instruction issue efficiency, because there are not enough eligible warps to hide latency between dependent instructions". Conversely, "When occupancy is at a sufficient level to hide latency, increasing it further may degrade performance due to the reduction in resources per thread" [18]. We acknowledged the levels of achieved occupancy across our experiments and analyzed them in a context supported by the workload memory requirements, the limitations of some of the profiles, as well as the reported GRACT and SMACT. However, our analysis is not comprehensive when it comes to a detailed examination of the occupancy for the type of workloads we execute. This is due to the complex nature of this metric and the many additional factors that need to be taken into consideration for a comprehensive study. We consider that further, specialized efforts in this area could be more beneficial. In terms of graphics memory, our attempts to execute the medium and large workloads on a 1g.5gb instance failed due to the memory limitations of that profile. When it comes to compute limitations, the main trade-off we identified was time to completion against the amount of completed work. While we recognize the possibility of compute slices being a bottleneck in a case where more memory is allocated, examining such instances was also not at the core of this project's scope.
In all cases, the non-MIG GPU also shares highly similar SMOCC values with the 7g.40gb profile. The highest difference is reported in the small workload, 1.4 percentage points. For the medium and large workloads, the differences are less than 0.5 percentage points. This could indicate that, typically, the non-MIG GPU performs in a very similar manner to a MIG instance utilizing the full device's resources.
DRAMA. When it comes to the memory bandwidth utilization (DRAMA), the instance-level value for each device group was highly similar and almost identical to its respective analogue across all 3 workloads. The highest-reported values resulted from the 2g.10gb profile, followed by 3g.20gb and 7g.40gb (see fig. 7).
The values reported at the device level were lowest for the small workload, ranging between 3.5% and 24.8%, respectively for the 1g.5gb one and 1g.5gb parallel device groups. Therefore, for this workload, the highest device-level memory bandwidth utilization was reported by the 1g.5gb parallel device group. While the lower reported value for 1g.5gb one can be explained by the context that is used for device-level calculations, 7g.40gb one reported a similarly low value of 6.1%.
When it comes to the performance in the medium and large workloads, we observed a common trend between the two, expressed in similar device-level values across the device groups' respective analogues in the two workloads. Another difference was the overall higher reported values in comparison to the small workload. In both medium and large workloads, the highest reported device-level value was for the 3g.20gb parallel instances (∼52%), followed by 2g.10gb parallel (∼49%) and 7g.40gb one (∼44%). As expected, the 2g.10gb and 3g.20gb profiles covering a single instance reported much lower device-level values. This can be explained by the fact that only a single instance is used.
At the device level, the reported values for the non-MIG-enabled device were highly similar to the ones for 7g.40gb.
(a) Full device, small (b) Per instance, small
(c) Full device, medium (d) Per instance, medium
(e) Full device, large (f) Per instance, large
Figure 7: Median DRAMA for all experiments.
4.2.2 GPU memory. Figure 8a shows the maximum amount of allocated GPU memory per experiment. The maximum amount may be misleading, since these amounts were allocated at the very beginning of each experiment and did not fluctuate during the whole run.
(a) Maximum amount of allocated GPU memory. (b) Maximum amount of aggregate allocated CPU memory.
Figure 8: Memory allocation across all of our experiments.
(a) Aggregate CPU memory allocated to all processes running resnet_large in parallel over time. The other workloads are not depicted but show similar results. (b) Average aggregate CPU utilization for each experiment. Smaller GPU instances generally result in lower CPU utilization.
Figure 9: Host system memory allocation and CPU utilization. Running workloads in parallel results in additional resource utilization on the host system.
The first thing to notice in the graph is that there is no difference
in the amount of allocated memory between using the GPU in non-MIG mode and using a 7g.40gb instance. Furthermore, it can be seen that given optimal conditions (i.e., non-MIG mode or 7g.40gb, where there is 40 GB of available memory), resnet_small uses 9.5 GB, resnet_medium uses 10.4 GB and resnet_large uses 19.0 GB. This is also the case in both the 4g.20gb one and the 3g.20gb one experiments. These amounts of memory seem to be what TensorFlow considers the optimal amounts of memory to use given the models and the training data.
However, it can be seen from the chart that the models are still able to train when less memory than they ‘prefer’ is available. For instance, in the 2g.10gb one experiment, resnet_large uses only 9.9 GB of memory, which is about half of its memory consumption in the 7g.40gb one experiment, and in the 1g.5gb one experiment, resnet_small is able to train using only 4.7 GB of memory. This indicates that TensorFlow is aware of its hardware environment and is able to adapt the training process to the amount of available memory.
Lastly, it can be observed that within a particular MIG profile, training n models in parallel simply uses n times as much GPU memory as training a single model. For example, training two resnet_medium models in the 3g.20gb parallel experiment uses two times as much memory as training a single resnet_medium model in the 3g.20gb one experiment.
4.3 CPU and main memory
In addition to the GPU-related metrics, we also monitored main memory consumption and CPU utilization. Here we present the main memory results (section 4.3.1) and the CPU utilization results (section 4.3.2). Overall, our results indicate that both main memory consumption and CPU utilization are proportional to the number of models training in parallel on the GPU.
4.3.1 Main memory consumption. Figure 8b shows the maximum aggregate amount of physical RAM allocated to the process(es) running our workloads for each experiment, as calculated using the procedure outlined in section 3.2.3.
The bars in fig. 8b represent the amount of RAM needed to be able to run each experiment. We see that running a single resnet_small workload requires approximately 7.1 GB of main memory and running a single resnet_medium workload demands only 5.4 GB. The reason why resnet_small has a larger memory footprint than resnet_medium is likely that resnet_small stores all of its training data in memory. On the other hand, resnet_medium streams the training data from disk dynamically, thereby requiring a smaller working set. Lastly, we see that resnet_large requires at most 12.6 GB of memory.
(a) resnet_small (b) resnet_medium (c) resnet_large
Figure 10: Training and validation accuracy for our three models. It can be seen that although training time is affected by instance size, accuracy is not.
From the figure it can be observed that the GPU instance size does not significantly impact the maximum main memory requirements, e.g. resnet_small takes up about 7 GB of memory in all of the non-parallel experiments. One exception to this is the 4g.20gb one run, where resnet_large has a maximum resident memory requirement of 12.6 GB, which is more than 2 GB more than in the other non-parallel experiments. We also saw this difference in our experiment replication results, and we are not sure what causes it.
Lastly, the graph shows that training n models in parallel requires about n times as much memory as training a single model. This implies that a user must also have a fairly large amount of available main memory in order to reap all the benefits of MIG. For example, running a single resnet_small workload uses 7.1 GB of memory, but running seven in parallel on 1g.5gb instances uses 48.7 GB of memory, which may not be feasible for every user.
Figure 9a shows the aggregate amount of memory allocated to the processes running the workloads over time when training resnet_large. We see that the allocated memory increases over time for all of the experiments. Specifically, at each epoch start, between one and two additional gigabytes of memory are allocated per model running. We saw similar memory allocation characteristics for resnet_small and resnet_medium.
4.3.2 CPU utilization. The CPU utilization was measured at the process level. In fig. 9b we see the average aggregate CPU utilization in percentages across all of our experiments, calculated using the procedure explained in section 3.2.3.
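Analogous to the memory sampling sketch above, per-process CPU utilization can be aggregated with psutil; the sketch below is illustrative only (the process filter and averaging window are assumptions) and reports values in the same >100% convention as fig. 9b, where each fully busy core contributes 100%.

    import time
    import psutil

    def average_aggregate_cpu_percent(match="python", interval_s=1.0, samples=60):
        """Average the summed per-process CPU% (100% == one fully used core)."""
        procs = [p for p in psutil.process_iter(["name"])
                 if match in (p.info["name"] or "")]
        for p in procs:
            try:
                p.cpu_percent(None)  # prime the counters; the first reading is meaningless
            except psutil.NoSuchProcess:
                continue
        readings = []
        for _ in range(samples):
            time.sleep(interval_s)
            total = 0.0
            for p in procs:
                try:
                    total += p.cpu_percent(None)
                except psutil.NoSuchProcess:
                    continue
            readings.append(total)
        return sum(readings) / len(readings)

    if __name__ == "__main__":
        print(f"average aggregate CPU: {average_aggregate_cpu_percent():.0f}%")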
The main activities requiring CPU resources are reading training data from disk (in the case of resnet_medium and resnet_large), preprocessing the training data, transferring the training data to the GPU, and keeping track of gradient information. From the chart we see that for the 7g.40gb one experiment, which represents the least constrained GPU computing environment, resnet_large requires significantly more CPU resources than resnet_small and resnet_medium. This is likely because resnet_large uses much larger images to train on, which in turn also demand more preprocessing time. For the non-parallel experiments, we see that smaller GPU instances result in lower CPU utilization, e.g., resnet_large uses 198% CPU in 7g.40gb one whereas it only uses 119% CPU in 2g.10gb one. This makes sense since the smaller instances take longer to process each batch, which means that fewer images need to be read from disk and preprocessed per second.
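The CPU-side work described above maps naturally onto a tf.data input pipeline. The sketch below is not the paper's actual pipeline; it is a hedged illustration (directory path, image size, and batch size are placeholders, and labels are omitted for brevity) of how reading, decoding, and resizing on the CPU feed the GPU, which is why CPU load scales with the number of batches consumed per second.

    import tensorflow as tf

    IMAGE_SIZE = (224, 224)   # placeholder; the paper's models use different sizes
    BATCH_SIZE = 32           # placeholder batch size
    DATA_DIR = "/data/train"  # hypothetical dataset location

    def load_example(path):
        # CPU work: read the file from disk, decode the JPEG, and resize it.
        image = tf.io.read_file(path)
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, IMAGE_SIZE) / 255.0
        return image

    dataset = (tf.data.Dataset.list_files(DATA_DIR + "/*/*.jpg")
               .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
               .batch(BATCH_SIZE)
               .prefetch(tf.data.AUTOTUNE))  # overlap CPU preprocessing with GPU training

A slower GPU instance drains such a pipeline more slowly, so fewer map calls run per second and the measured CPU utilization drops, matching the trend in fig. 9b.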
For a parallel experiment with n concurrent workloads, we see that it uses approximately n times as much CPU processing power as the non-parallel version. For resnet_medium and resnet_large this relationship is almost exact, e.g., resnet_medium uses on average 85% CPU in 2g.10gb one and 257% CPU in 2g.10gb parallel, which is almost exactly 3 × 85%. For the resnet_small experiments, the relationship between the average CPU utilization of the non-parallel and parallel workloads is not as exact as for the larger workloads. Instead, the parallel workloads utilize the CPU less than n times as much as the non-parallel workloads. We are unsure what causes this.
An interesting thing to note is that in order to efficiently train seven concurrent machine learning models in parallel on 1g.5gb instances, a significant amount of CPU processing power is also required (in our case 630% with a very powerful CPU). This means that in addition to the A100 GPU and a substantial amount of RAM (see section 4.3.1), the system must also have a fairly powerful CPU in order to get all the benefits of MIG.
4.4 Accuracy
We recorded the training and validation accuracies after each epoch for all of our experiments. Figure 10 provides an overview of the achieved accuracies. The charts show training and validation accuracy over time during a training run in a 7g.40gb instance and a smaller instance. In fig. 10a, the smaller instance is a 1g.5gb instance, and in figs. 10b and 10c it is a 2g.10gb instance. It can be seen that resnet_small reaches a validation accuracy plateau of approximately 0.76 after about 1/5 of its total training time, which is about five epochs. The resnet_medium and resnet_large models do not seem to reach their maximum validation accuracy within the five epochs for which they were trained, but the highest validation accuracies they achieved were about 0.50 and 0.56, respectively. The figure shows that the size of the instance only impacts the total training time and not the achieved accuracy.
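Per-epoch training and validation accuracy of the kind plotted in fig. 10 can be captured directly from a Keras training loop; the snippet below is a generic sketch (model, data, and epoch count are placeholders rather than our exact setup).

    import tensorflow as tf

    model = tf.keras.applications.ResNet50(weights=None, classes=10)  # placeholder model
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Placeholder tensors standing in for the real training and validation sets.
    x_train = tf.random.uniform((512, 224, 224, 3))
    y_train = tf.random.uniform((512,), maxval=10, dtype=tf.int32)
    x_val = tf.random.uniform((128, 224, 224, 3))
    y_val = tf.random.uniform((128,), maxval=10, dtype=tf.int32)

    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=5, batch_size=32)

    # One accuracy value per epoch, the raw material for curves like those in fig. 10.
    print(history.history["accuracy"])
    print(history.history["val_accuracy"])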
5 DISCUSSION
We begin this section by looking at some general trends across the results presented in Section 4. Then, we discuss the results from the replication of our experiments, in addition to some of the data from these replications that was used in Section 4. Finally, we detail the challenges we experienced with the DCGM tool, covering both the 4g.20gb profile case and the differences that we observed in some of the data.
5.1 General trends across results
Overall, the results demonstrate that parallel workload executions do not cause interference across workloads and, depending on the use case and needs, a smaller portion of the device can be a more efficient choice. Furthermore, instances with fewer allocated resources always report higher values for the hardware metrics than those with more resources. For example, in the case of hyperparameter optimization, a 1g.5gb parallel setup would be able to both execute more workloads (in this case with 3X throughput compared to the full device) and run with higher device utilization. However, we found that using the smallest instance is not always possible. This was the case for our medium and large workloads. For such cases, we consider the 2g.10gb and 3g.20gb profiles as alternatives. On the other hand, we do not observe any throughput increase associated with the parallel runs over the isolated run for the medium and large cases.
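To make the hyperparameter-optimization scenario concrete, the sketch below shows one possible way to pin one training process per 1g.5gb instance by setting CUDA_VISIBLE_DEVICES to each MIG device UUID (as listed by nvidia-smi -L); the UUIDs and the train.py script are hypothetical placeholders, not artifacts from our experiments.

    import os
    import subprocess

    # Hypothetical MIG device UUIDs, e.g. obtained from `nvidia-smi -L`.
    MIG_UUIDS = [
        "MIG-11111111-1111-1111-1111-111111111111",
        "MIG-22222222-2222-2222-2222-222222222222",
        # ... one entry per 1g.5gb instance, up to seven on an A100
    ]

    procs = []
    for i, uuid in enumerate(MIG_UUIDS):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)  # each process sees one instance
        procs.append(subprocess.Popen(
            ["python", "train.py", "--trial", str(i)],  # hypothetical training script
            env=env))

    for p in procs:
        p.wait()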
Throughout the small workload experiments, the differences across the MIG instances' performance metrics are more pronounced than for the workloads operating at larger scale, which follows our expectations. When a workload dramatically grows in scale, the availability of more resources becomes beneficial. In our case, we use the same batch sizes. However, a bigger model requires more computation per batch iteration or data item. This manifests as increased computation per unit of time, which increases the utilization of the GPU. We suggest that this higher utilization reduces the variation in performance across instances. For that reason, we examined the small workload executions in more detail than the medium and large ones. Furthermore, we recognize the implications of the increased time to completion resulting from profiles with fewer resources. A consideration in that area was presented separately in the analysis of time to completion (see Section 4.1).
5.2 Results of replicated experimental runs
As mentioned in Section 3.4, we replicated each experiment even though the results reported in Section 4 are from a single run. These replicated runs show very similar or nearly identical results to the initial ones.
5.3 Challenges in metric collection
Even though the 4g.20gb profile is a valid profile, we were limited in the information we could collect for it using the DCGM tool. Some metrics could not be properly read with this profile. Furthermore, for some other runs, the last few seconds of a workload execution reported zero values for GRACT, SMACT, and SMOCC. The DRAMA metric showed a similar anomaly, presenting zero or near-zero values for the last few seconds of each run in some other cases. Since we did not further investigate the reasons for zero values toward workload completion, we considered the median values to be a more accurate representation of these metrics, which also helped us deal with these reporting challenges.
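As an illustration of why the median is robust to these trailing zero readings, the snippet below compares mean and median over a hypothetical utilization trace whose last few samples drop to zero; the numbers are made up and do not come from our measurements.

    import statistics

    # Hypothetical GRACT-like trace (fraction of time the graphics engine is active),
    # sampled once per second; the last three samples show the zero-reporting anomaly.
    trace = [0.92, 0.93, 0.91, 0.94, 0.92, 0.93, 0.0, 0.0, 0.0]

    print("mean  :", round(statistics.mean(trace), 3))    # dragged down by the zeros
    print("median:", round(statistics.median(trace), 3))  # still reflects steady-state utilization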
6 CONCLUSION
In this paper, we performed a performance characterization of a modern GPU device that has hardware support, called Multi-Instance GPU (MIG), to split the GPU into multiple logical instances. Our results demonstrate that MIG is most useful for smaller workloads that cannot fully saturate the whole GPU. Executing the small workload on the smallest GPU instance, 1g.5gb, resulted in significant increases in GRACT, SMACT, and SMOCC. Although training is overall slower, more work can be done per unit of time by executing workloads in parallel on multiple GPU instances. We find no performance impact associated with co-location of workloads in separate GPU instances. Across all of our instance-level metrics, we see no difference between running one workload at a time and running multiple workloads in parallel. This highlights that MIG, even though still maturing, is a promising technology for workload co-location on GPUs.
In this work, we scoped our analysis to homogeneous instances and workloads when testing MIG. As future work, an investigation of more asymmetrical / heterogeneous instances and workloads would be important. In addition, we limited our focus to training using one GPU, since MIG does not allow distributed training. Observing MIG while running other workloads on other GPUs of the same server may also be promising, as in a data center setting, many workloads can be co-located not only on the same GPU but also on the same server. Furthermore, the MIG tool-chain is still young and maturing. One should keep their tool-chain as up-to-date as possible when doing performance characterization studies such as this one, in order to get the latest functionality.