ResBench: A Comprehensive Analysis of NVIDIA Optimized Deep Learning Frameworks for Image Classification

Authors: Sanjeev Chauhan

Abstract

Driven by factors including inexpensive GPUs and large-scale parallel processing, deep learning has grown rapidly over the past decade, accomplishing tasks that would have been unimaginable ten years ago. In addition to more capable hardware, a variety of software advances have pushed the field forward, including many open-source deep learning frameworks and hundreds of successful deep learning models (architectures). Deep learning frameworks are of particular note because they can substantially change both training performance and the final result, even though the mathematics performed while training a model is theoretically identical. This study extends the work of DLBench by testing only the ResNet-50 v1.5 architecture with three frameworks: TensorFlow, PyTorch, and MxNet. This reduces variation between framework tests: DLBench used different architectures and different datasets, leaving room for variation not caused by framework differences. We ran our tests on two NVIDIA Tesla V100 GPUs and evaluated each framework on two datasets, CIFAR-100 and ImageNet.
Introduction:
In recent years, a significant increase in available computing power (driven by cheaper graphics processing units (GPUs) and large-scale parallelization) has led to renewed interest in AI and machine learning. Artificial intelligence (AI) is a general term for a program that can sense and adapt based on data. Machine learning is a subset of AI consisting of algorithms whose performance at pattern recognition improves as more data is presented. Deep learning is a subset of machine learning in which multilayered networks learn from massive datasets. Deep learning in particular has risen greatly in popularity in fields such as medicine during the past decade (citation). What distinguishes deep learning is its ability to learn directly from input data: the model extracts features from the dataset without human intervention. This allows models to train without close human supervision and to attempt larger tasks. Deep learning can also allow a model to extract features from an image that may not be apparent to a human.
Our research focused on using deep learning to classify images. Image classification is the task in which an AI model identifies the objects in an image and assigns the image a category. It is an extremely important application of AI: it is the foundation for AI-powered cameras that can recognize the objects around them, and it is useful for search engines that organize images and for organizations with large datasets that need sorting and indexing. High efficiency in image classification matters because it allows more images to be classified per second, and it is especially important in real-time scenarios such as object detection by autonomous vehicles, where speed and accuracy are both critical for safe operation. High accuracy matters because it means more objects are detected and more images are classified correctly.
Machine learning tasks are computationally intensive because training a model requires processing large amounts of data, performing many operations, and repeatedly updating numerous weights. Once a model has been trained, using it is much less computationally intensive and can be done on everyday devices such as smartphones and home computers, because the model now processes data with fixed, already-learned weights instead of extracting features and reconfiguring weights. Alternative approaches to image classification that do not use machine learning rely on classical computer vision techniques. In contrast to machine learning, these techniques use simple pattern matching that is easily confused by variability. Classical computer vision is less computationally intensive than machine learning, but it is limited in the features it can extract and works well only when the images in a dataset closely resemble one another. Machine learning was previously limited by computational restrictions, but recent technological advances have led to cheaper GPUs and easier access to high performance computing. This has contributed to significant growth in interest in artificial intelligence and its applications, and that interest has created a race to find the combination of frameworks and architectures that achieves the most accurate results with the most efficient training process.
A deep learning framework (DLF) is the software foundation for the math performed during deep learning. A DLF loads and preprocesses data into the workspace, executes operations, and monitors accuracy and elapsed time. A DLF is similar to a computer operating system (e.g., Windows or macOS): each DLF is different, but all of them can run the same applications and accomplish the same tasks. Each DLF can run a variety of models and process a variety of datasets.
A deep learning model is the set of operations used to perform a task (e.g., image classification, image segmentation, or text generation). The model determines how many operations are applied to each piece of data fed into it. Adding operations adds layers to the model and thus creates a deeper model.
Theoretically, the same deep learning model should produce the same results regardless of which DLF is used to run it, because the same mathematical operations are applied. In practice, however, different DLFs do not produce the same results, and this paper analyzes those differences when the same model is used. One possible reason is that each DLF's preprocessing pipeline augments the dataset in a different way.
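To illustrate how preprocessing alone can make identical models see different training data, the sketch below applies two hypothetical pipelines to the same tiny 4x4 "image": one takes a random 2x2 crop and randomly flips it horizontally (a typical training-time augmentation), and the other always takes the center crop. Neither pipeline is taken from TensorFlow, PyTorch, or MxNet; the functions `pipeline_random` and `pipeline_center` are illustrative names, and the point is only that differently designed pipelines feed a model different inputs from the same dataset.

```python
import random

# Hypothetical 4x4 single-channel "image" with pixel values 0..15 in row-major order.
IMAGE = [[4 * r + c for c in range(4)] for r in range(4)]


def crop(img, top, left, size):
    """Return the size x size window whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]


def pipeline_random(img, size, rng):
    """Hypothetical training-style pipeline: random crop, then random horizontal flip."""
    top = rng.randint(0, len(img) - size)
    left = rng.randint(0, len(img[0]) - size)
    out = crop(img, top, left, size)
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    return out


def pipeline_center(img, size):
    """Hypothetical alternative pipeline: deterministic center crop, no flip."""
    top = (len(img) - size) // 2
    left = (len(img[0]) - size) // 2
    return crop(img, top, left, size)


def as_key(img):
    """Convert a nested list to a hashable tuple so distinct views can be counted."""
    return tuple(tuple(row) for row in img)


# Simulate 50 epochs in which the model sees this one image once per epoch.
rng = random.Random(0)
random_views = {as_key(pipeline_random(IMAGE, 2, rng)) for _ in range(50)}
center_views = {as_key(pipeline_center(IMAGE, 2)) for _ in range(50)}
print(len(random_views), len(center_views))
print(pipeline_center(IMAGE, 2))  # [[5, 6], [9, 10]]
```

With the random pipeline the model sees many different variants of the image over training, while the center-crop pipeline shows it exactly one, so the two setups train on effectively different data even though the dataset and model are identical.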
This paper builds on the work of “DLBench: a comprehensive experimental evaluation of deep learning frameworks” (Elshawi, R., Wahab, A., Barnawi, A. et al.). DLBench compared six popular DLFs (TensorFlow, MxNet, PyTorch, Theano, Chainer, and Keras), but it also varied the architecture between Convolutional Neural Networks (CNN), Faster Region-based Convolutional Neural Networks (Faster R-CNN), and Long Short-Term Memory (LSTM) networks, and it changed datasets when benchmarking. For each comparison, our study holds the architecture, dataset, and computational environment fixed and changes only the DLF. We focused on comparing PyTorch, MxNet, and TensorFlow.
To benchmark these DLFs, we used ResNet-50 v1.5 as the model. In traditional convolutional neural networks, accuracy has been shown to degrade beyond a certain depth: adding layers decreases the network's accuracy because of vanishing and exploding gradients. ResNet addresses this problem by adding "skip connections," which copy the input at one point and add it back into the network later, forming a residual block. Because gradient descent can easily shrink a residual block's contribution to nothing (so that the block simply passes its input through), adding residual blocks does not hurt accuracy, and deeper residual networks can reach higher accuracy. ResNet-50 v1.5 has 50 layers and is a slightly modified version of the original ResNet-50 model: in the bottleneck blocks that downsample, it applies a stride of two in the 3x3 convolution instead of in the first 1x1 convolution. This change slightly increases accuracy and slightly decreases throughput (images processed per second).
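The sketch below, in plain Python so that it stays self-contained, illustrates both ideas from this paragraph. Part 1 models a residual block as relu(F(x) + x), with a small dense layer standing in (hypothetically) for the block's convolutions; when F's weights are zero, the block passes non-negative inputs through unchanged. Part 2 counts which input positions a stride-2 convolution actually reads along one spatial axis, assuming a 56x56 input to the first downsampling bottleneck block (as in ResNet-50 with 224x224 images). One plausible explanation for v1.5's small accuracy gain is that v1's stride-2 1x1 convolution never reads three quarters of its input positions, whereas the stride-2 3x3 convolution in v1.5 reads all of them.

```python
# Part 1: a residual block computes relu(F(x) + x).
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, x):
    """Hypothetical stand-in for the block's convolutions: a dense layer W @ x."""
    return [sum(w * a for w, a in zip(row, x)) for row in weights]


def residual_block(weights, x):
    fx = linear(weights, x)                       # residual branch F(x)
    return relu([f + a for f, a in zip(fx, x)])   # skip connection adds x back


# Part 2: which input positions a strided convolution reads along one axis.
def conv_out_len(n, kernel, stride, pad):
    """Standard output-length formula for a convolution."""
    return (n + 2 * pad - kernel) // stride + 1


def positions_read(n, kernel, stride, pad):
    read = set()
    for o in range(conv_out_len(n, kernel, stride, pad)):
        start = o * stride - pad
        read.update(i for i in range(start, start + kernel) if 0 <= i < n)
    return read


N = 56  # assumed spatial size entering the first downsampling bottleneck block
v1 = positions_read(N, kernel=1, stride=2, pad=0)   # v1: stride 2 on the 1x1 conv
v15 = positions_read(N, kernel=3, stride=2, pad=1)  # v1.5: stride 2 on the 3x3 conv

print(residual_block([[0.0] * 3] * 3, [1.0, 2.0, 0.5]))    # [1.0, 2.0, 0.5]
print(conv_out_len(N, 1, 2, 0), conv_out_len(N, 3, 2, 1))  # 28 28
# Fraction of 2-D input positions read (same along both axes): 0.25 vs 1.0
print((len(v1) / N) ** 2, (len(v15) / N) ** 2)             # 0.25 1.0
```

Both versions produce the same 28x28 output, so the change affects only which inputs contribute; the extra work of a strided 3x3 convolution over the full-resolution 1x1 output is consistent with v1.5's slightly lower throughput.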
In deep learning, models often need to train on the same dataset many times to increase their accuracy, so most deep learning studies report accuracy as a function of the number of times the full dataset has been fed into the model. One complete pass of the entire dataset through the model is called an epoch. Because deep learning datasets are massive, a computer cannot process them all at once; instead, the data is broken into smaller chunks called batches, and a batch is the subset of the dataset that the machine processes in one step. Within one epoch, batch after batch of images is run through the model until the whole dataset has been processed and the epoch is complete. An iteration is the processing of one batch, so the number of iterations per epoch is the number of batches needed to cover the dataset. Hyperparameters must be specified when training a model. Hyperparameters are the configuration settings chosen before training begins, such as the learning rate of the model, the specifications of the dataset (number of classes, number of images, image size, etc.), the number of GPUs to train on, and more.
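The arithmetic relating dataset size, batch size, and iterations can be sketched as follows. The batch size of 192 and the 90 epochs come from this study's hyperparameter tables, the ImageNet figure is the rounded 1.2 million training images quoted for ILSVRC2012, and the 50,000 CIFAR-100 training images follow from 500 training images in each of 100 classes. Whether 192 is the batch size per GPU or across both GPUs is not stated; the sketch assumes it is the total (global) batch size.

```python
def iterations_per_epoch(num_images, batch_size, drop_last=False):
    """Number of batches (iterations) needed to pass over the dataset once."""
    if drop_last:
        return num_images // batch_size      # discard a final partial batch
    return -(-num_images // batch_size)      # ceiling division keeps it


def batch_sizes(num_images, batch_size):
    """Sizes of the successive batches in one epoch (the last may be smaller)."""
    return [min(batch_size, num_images - start)
            for start in range(0, num_images, batch_size)]


BATCH = 192   # batch size from this study; assumed to be the global batch size
EPOCHS = 90   # epochs from this study's hyperparameter tables

for name, n in [("ImageNet (about 1.2 million images)", 1_200_000),
                ("CIFAR-100 training set", 50_000)]:
    per_epoch = iterations_per_epoch(n, BATCH)
    print(f"{name}: {per_epoch} iterations/epoch, {per_epoch * EPOCHS} total")
# ImageNet (about 1.2 million images): 6250 iterations/epoch, 562500 total
# CIFAR-100 training set: 261 iterations/epoch, 23490 total
```

Whether the final partial batch is kept or discarded changes the CIFAR-100 count from 261 to 260 iterations per epoch (the last batch holds only 80 images); PyTorch's `DataLoader` exposes this choice through its `drop_last` argument.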
Our approach to comparing DLFs was to run each DLF on the same dataset in the same computational environment, and then to run each DLF again on a different dataset in the same environment. We tracked accuracy against epochs, along with other parameters such as the learning rate.
Materials & Methods
Dataset Preprocessing
We acquired two public image classification datasets, ImageNet ILSVRC (Large Scale Visual Recognition Challenge) 2012 and CIFAR-100, and preprocessed them as required by each DLF. For example, for MxNet the images were converted into RecordIO (.rec) files, and for TensorFlow the images were converted into tensors.
ImageNet Large Scale Visual Recognition Challenge
The Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) dataset is a subset of the ImageNet dataset. ImageNet consists of 10,000,000 hand-labeled images in more than 10,000 categories. The ILSVRC2012 training data consists of 1.2 million images in 1,000 categories, and the validation set consists of 50,000 randomly selected labeled images.
CIFAR-100
The CIFAR-100 dataset is a subset of the 80 million tiny images dataset. CIFAR-100 consists of 60,000 32x32 color images in 100 classes. Each class contains 500 training images and 100 test images. The 100 classes are grouped into 20 superclasses. Superclasses are more general categories, such as “fruit and vegetables,” while classes are more specific, such as “apples,” “oranges,” and “pears.”
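According to the CIFAR website listed under Resources, the Python version of CIFAR-100 stores each image as a row of 3,072 values: 1,024 red values, then 1,024 green, then 1,024 blue, with each channel in row-major order, and it stores the labels under the keys `fine_labels` (100 classes) and `coarse_labels` (20 superclasses). A minimal sketch of converting one such row into a 32x32 grid of (R, G, B) pixels is shown below, assuming the row has already been loaded as a plain Python list. It is a generic illustration, not the preprocessing code that any of the three DLFs actually uses.

```python
SIDE = 32
PLANE = SIDE * SIDE  # 1,024 values per color channel


def row_to_pixels(row):
    """Convert one channel-planar CIFAR row into a SIDE x SIDE grid of (r, g, b)."""
    if len(row) != 3 * PLANE:
        raise ValueError(f"expected {3 * PLANE} values, got {len(row)}")
    grid = []
    for y in range(SIDE):
        line = []
        for x in range(SIDE):
            i = y * SIDE + x  # row-major index within one color plane
            line.append((row[i], row[PLANE + i], row[2 * PLANE + i]))
        grid.append(line)
    return grid


# Synthetic row (values 0..3071) so the mapping is easy to check by eye.
pixels = row_to_pixels(list(range(3 * PLANE)))
print(pixels[0][0], pixels[1][0], pixels[31][31])
# (0, 1024, 2048) (32, 1056, 2080) (1023, 2047, 3071)
```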
PyTorch
PyTorch is an open-source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. It was developed by Facebook's AI Research lab and first introduced in 2016 (Berga & Coelho, n.d.). PyTorch has a Python interface, and its functionality is organized as Python classes. PyTorch is one of the most popular DLFs.
MxNet
MxNet is an open-source DLF developed under the Apache Software Foundation. Amazon has chosen MxNet as its DLF on AWS. MxNet supports multiple programming languages, such as C++, Python, R, Julia, and Perl (Apache, n.d.), which eliminates the need to learn a new programming language to use it. MxNet models can fit in very small amounts of memory, making them extremely portable.
TensorFlow
TensorFlow is an open-source machine learning library created by the Google Brain team in 2015. It has many uses but focuses mainly on building deep neural networks. TensorFlow works with TensorBoard, a suite of tools for visualizing models and training results. TensorFlow offers high-level options for model development and supports mobile devices.
Hyperparameters
All hyperparameters were the defaults shown in [citation], except for the batch size, which was set to 192.
ImageNet:
Framework   Batch size  Epochs  Learning rate  Momentum  Decay  Optimizer                    Data loader
PyTorch     192         90      0.1            0.875     -      Stochastic Gradient Descent  synthetic
MxNet       192         90      0.1            0.875     -      Stochastic Gradient Descent  synthetic
TensorFlow  192         90      0.1            0.875     -      Stochastic Gradient Descent  synthetic
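The table lists stochastic gradient descent with a learning rate of 0.1 and a momentum of 0.875. A minimal sketch of SGD with momentum is shown below, written in the form PyTorch's `torch.optim.SGD` uses when dampening is zero and Nesterov momentum is off; MxNet and TensorFlow write the update in closely related forms (for example, folding the learning rate into the velocity). The toy loss f(w) = (w - 3)^2 is a hypothetical stand-in for a network's loss, and weight decay is omitted because the Decay column gives no value.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, momentum=0.875):
    """One SGD-with-momentum update (PyTorch-style form, dampening 0)."""
    v = momentum * v + grad   # velocity accumulates past gradients
    w = w - lr * v            # step against the accumulated direction
    return w, v


def grad_f(w):
    """Gradient of the hypothetical toy loss f(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


w, v = 0.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad_f(w))
print(w)  # close to the minimum at 3.0
```

With a momentum of 0.875 each update carries most of the previous step's direction forward, so the parameter overshoots and oscillates around the minimum before settling; this is one reason accuracy curves under momentum SGD can look jagged from epoch to epoch.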
CIFAR-100:
Framework   Batch size  Epochs  Learning rate  Momentum  Decay  Optimizer                    Data loader  Precision
PyTorch     192         90      0.1            0.875     -      Stochastic Gradient Descent  synthetic    Automatic Mixed Precision
MxNet       192         90      0.1            0.875     -      Stochastic Gradient Descent  synthetic    Automatic Mixed Precision
TensorFlow  192         90      0.1            0.875     -      Stochastic Gradient Descent  synthetic    Automatic Mixed Precision
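The CIFAR-100 runs used automatic mixed precision (AMP), in which much of the arithmetic is done in 16-bit floating point (FP16). A minimal sketch of why AMP implementations scale the loss is shown below, using only Python's `struct` module, whose `e` format code packs IEEE 754 half-precision values: a gradient of 1e-8 rounds to zero in FP16, but multiplying it by a loss scale before the FP16 conversion and dividing afterward preserves it. The fixed scale of 2**16 is an assumption for illustration; practical AMP implementations, such as PyTorch's `GradScaler`, adjust the scale dynamically during training.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def unscaled_fp16_gradient(grad: float, loss_scale: float) -> float:
    """Scale a gradient, store it in FP16, then unscale it in full precision."""
    return to_fp16(grad * loss_scale) / loss_scale


tiny_grad = 1e-8
print(to_fp16(tiny_grad))                          # 0.0 (underflows in FP16)
print(unscaled_fp16_gradient(tiny_grad, 2 ** 16))  # close to 1e-8 again
```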
High Performance Computing Cluster
Training was done on Hyperion GPU nodes, part of the University of South Carolina’s flagship high performance computing cluster, using a configuration of one Intel Xeon Platinum 8260 CPU, 128 GB of RAM, and two Tesla V100 GPUs with 32 GB of video memory each.
Results:
These graphs show accuracy versus epoch for each DLF. An epoch is one complete pass of the entire dataset through the model. Accuracy is the number of correct predictions the model made divided by the total number of predictions.
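Accuracy as defined above can be computed as follows. The sketch also generalizes it to top-k accuracy, the fraction of images whose true label is among the model's k highest-scoring classes, since ImageNet results are commonly reported as top-1 and top-5 accuracy; top-1 accuracy is exactly the definition given above. The scores and labels are made-up values for illustration, and the document does not state whether its reported accuracies are top-1 or top-5.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    correct = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
        correct += label in top_k  # True counts as 1
    return correct / len(labels)


# Hypothetical class scores for 4 images over 3 classes.
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5],
          [0.6, 0.3, 0.1]]
labels = [1, 1, 2, 2]
print(top_k_accuracy(scores, labels, k=1))  # 0.5
print(top_k_accuracy(scores, labels, k=2))  # 0.75
```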
ImageNet Large Scale Visual Recognition Challenge 2012:
On ILSVRC2012, MxNet's accuracy increases quickly at first (epochs 0-10) and then continues to improve over time, rising gradually along a jagged curve after epoch 10. MxNet reached a final accuracy of 77.1%.
On ILSVRC2012, PyTorch quickly exceeds 50% accuracy in under 10 epochs and then continues learning along a jagged, rising curve. At epoch 30 it jumps sharply to around 71% accuracy, plateaus until epoch 60, makes another small jump (about 3 percentage points), and then plateaus again. PyTorch's final accuracy is 76.05%.
On ILSVRC2012, TensorFlow also quickly exceeds 50% accuracy in under 10 epochs. It makes a major jump around epoch 30, then plateaus and even decreases slightly in accuracy until another, smaller jump at epoch 60. It reaches a final accuracy of 75.62%.
CIFAR-100:
Discussion:
Our results suggest that the choice of DLF affects the results of training a model, and that each framework shows characteristic performance patterns when the same model is used across different datasets.
Convergence Patterns
All three DLFs quickly increased their accuracy to around 50% during the first 10 epochs. We found that MxNet has smoother overall convergence than PyTorch and TensorFlow: unlike the others, it has no sudden jumps, and it continues to improve throughout the 90 epochs. Both PyTorch and TensorFlow had sudden jumps around epochs 30 and 60. One possible explanation is that the model found a feature that greatly helped it at that point; however, because the jumps recur at the same epochs across multiple runs, a scheduled change in training, such as a step learning-rate decay at epochs 30 and 60, may be responsible.
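The shape of these curves can be related to the learning-rate schedule. The sketch below compares a step schedule with a smooth cosine schedule. The base learning rate of 0.1 and the 90 epochs come from the hyperparameter tables; the milestones (epochs 30 and 60) and the decay factor of 0.1 are assumptions chosen to mirror the observed jumps, not the documented settings of any of the three DLFs. In PyTorch, schedules of these shapes are available as `torch.optim.lr_scheduler.MultiStepLR` and `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math


def step_lr(epoch, base_lr=0.1, milestones=(30, 60), gamma=0.1):
    """Step decay: multiply the rate by gamma at each milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


def cosine_lr(epoch, base_lr=0.1, total_epochs=90):
    """Cosine decay from base_lr at epoch 0 to zero at total_epochs."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))


def largest_epoch_to_epoch_drop(schedule, total_epochs=90):
    """Biggest decrease in learning rate between two consecutive epochs."""
    return max(schedule(e) - schedule(e + 1) for e in range(total_epochs - 1))


print([round(step_lr(e), 4) for e in (0, 29, 30, 59, 60, 89)])
# [0.1, 0.1, 0.01, 0.01, 0.001, 0.001]
print(round(largest_epoch_to_epoch_drop(step_lr), 4))  # 0.09 (at epoch 30)
print(round(largest_epoch_to_epoch_drop(cosine_lr), 4))
```

A step schedule cuts the learning rate tenfold within a single epoch, which commonly produces a sudden rise in accuracy, while the cosine schedule's largest drop between consecutive epochs is under 0.002, which would be consistent with a smoother curve such as MxNet's.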
Final Accuracy
MxNet achieved a final accuracy of 77.11%, PyTorch 76.05%, and TensorFlow 75.62%. These differences show that the choice of DLF affects the result of training on a dataset. Although the differences are small (1.06 percentage points between MxNet and PyTorch and 1.49 between MxNet and TensorFlow), they can still matter for massive datasets: on a dataset of 1 million images, a difference of one percentage point corresponds to roughly 10,000 additional misclassified images.
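The arithmetic behind this estimate can be made explicit. The sketch below assumes, hypothetically, that each DLF's final accuracy carries over unchanged to a new set of 1,000,000 images, and it stores the accuracies in hundredths of a percent so that the integer arithmetic is exact.

```python
# Final accuracies reported above, in hundredths of a percent (77.11% -> 7711).
FINAL_ACCURACY_BP = {"MxNet": 7711, "PyTorch": 7605, "TensorFlow": 7562}


def expected_misclassified(num_images, accuracy_bp):
    """Images expected to be misclassified if the accuracy carries over unchanged."""
    return num_images * (10_000 - accuracy_bp) // 10_000


N = 1_000_000
errors = {name: expected_misclassified(N, bp) for name, bp in FINAL_ACCURACY_BP.items()}
best = min(errors.values())
for name, count in errors.items():
    print(f"{name}: {count:,} misclassified, {count - best:,} more than the best DLF")
# MxNet: 228,900 misclassified, 0 more than the best DLF
# PyTorch: 239,500 misclassified, 10,600 more than the best DLF
# TensorFlow: 243,800 misclassified, 14,900 more than the best DLF
```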
Tradeoffs
Convergence patterns determine how quickly a model reaches useful accuracy during training, whereas final accuracy is achieved only once the entire training process is complete. There are tradeoffs between these two factors. In some cases, an organization wants a DLF that converges quickly: for example, an organization that regularly modifies its dataset may value faster training so that it can see the results of training sooner. On the other hand, an organization working with a stable dataset may value final accuracy over speed. This study has shown that different DLFs affect both final accuracy and convergence rate, even when run with exactly the same model and dataset.
Future work
We plan to continue this work by pursuing object detection as our next step: we plan to implement Mask R-CNN and benchmark its object detection performance. After that, we plan to perform a similar comparison for natural language processing with a recurrent neural network (RNN).
Acknowledgements
We would like to thank the SPRI program at the University of South Carolina for coordinating summer research and Dr. Edward Gatzke for coordinating SPRI. At the UofSC Research Computing Group, we would like to thank Jun Zhou and Nathan Elger for mentoring us and providing HPC access. At the South Carolina Governor’s School for Science and Math, we would like to thank Dr. Joshua Witten for organizing research and Ms. Elizabeth Bunn for mentoring us.
Resources
https://www.image-net.org/challenges/LSVRC/2012/index#introduction
https://www.image-net.org/challenges/LSVRC/2012/index
https://www.cs.toronto.edu/~kriz/CIFAR.html
https://iq.opengenus.org/resnet50-v1-5/
https://www.tandfonline.com/doi/full/10.1080/01431160600746456
https://link.springer.com/article/10.1007/s10586-021-03240-4#Sec1
https://link.springer.com/article/10.1007/s12194-017-0406-5
https://www.imaginarycloud.com/blog/pytorch-vs-tensorflow/#PyTorch
https://ngc.nvidia.com/catalog/resources/nvidia:resnet_50_v1_5_for_pytorch
https://ngc.nvidia.com/catalog/resources/nvidia:resnet_50_v1_5_for_mxnet
https://ngc.nvidia.com/catalog/resources/nvidia:resnet_50_v1_5_for_tensorflow