High-Performance Data Loader for Large-Scale Data Processing
Edgar Josafat Martinez-Noriega¹, Chen Peng¹, and Rio Yokota²; ¹National Institute of Advanced Industrial Science and Technology and ²Tokyo Institute of Technology; Japan
Abstract
The utilization of supercomputers and large clusters for big-
data processing has recently gained immense popularity, primar-
ily due to the widespread adoption of Graphics Processing Units
(GPUs) to execute iterative algorithms, such as Deep Learning
and 2D/3D imaging applications. This trend is especially promi-
nent in the context of large-scale datasets, which can range from
hundreds of gigabytes to several terabytes in size. Similar to the
field of Deep Learning, which deals with datasets of compara-
ble or even greater sizes (e.g., LAION-5B), these efforts encounter
complex challenges related to data storage, retrieval, and effi-
cient GPU utilization. In this work, we benchmarked a collection
of high-performance general dataloaders used in Deep Learn-
ing with a dual focus on user-friendliness (Pythonic) and high-
performance execution. These dataloaders have become crucial tools in this setting. Notably, advanced dataloading solutions such as Web-
datasets, FFCV, and DALI have demonstrated significantly supe-
rior performance when compared to traditional PyTorch general
data loaders. This work provides a comprehensive benchmark-
ing analysis of high-performance general dataloaders tailored for
handling extensive datasets within supercomputer environments.
Our findings indicate that DALI surpasses our baseline PyTorch
dataloader by up to 3.4x in loading times for datasets comprising
one million images.
Introduction
The employment of large clusters and supercomputers[1] for
the processing of big data has garnered significant attention, pri-
marily driven by the widespread integration of Graphics Process-
ing Units (GPUs) for the execution of iterative algorithms, in-
cluding Deep Learning and 2D/3D imaging applications. This
phenomenon is particularly pronounced in the realm of large-
scale datasets, spanning from hundreds of gigabytes to several
terabytes in magnitude. Analogous to the domain of Deep Learn-
ing, which contends with datasets of comparable or larger scales
(e.g., JFT-300M [2] and LAION-5B [3]), these endeavors confront
intricate challenges pertaining to data storage, retrieval, and the
effective utilization of GPUs. Moreover, state-of-the-art architectures such as the Vision Transformer are data-hungry; the principal challenge this architectural framework encounters is the need for a substantial volume of data. To attain cutting-edge performance through transfer learning, such models conventionally require in excess of 100 million images [4].
Dataloaders are integral components in the deep
learning workflow, facilitating data ingestion, pre-processing, and
augmentation. The primary objective of a dataloader lies in exe-
cuting the operations essential for transferring data instances from
a storage repository to the memory space situated alongside the
processing units, facilitating their utilization during training to
construct a batch of samples intended for input. The execution
of these operations is constrained by the bandwidth of the storage system, particularly its I/O throughput. Consequently, depending on the hardware specifications of the system, the filesystem supporting it, and the data transfer rate of the connection to the computational units, data loading can exert a significant impact on the overall duration required to finalize training. As the size of
datasets continues to grow exponentially, the efficient loading of
data onto GPUs becomes a paramount concern.
In this study, we present a benchmarking analysis of high-
performance general dataloaders tailored for managing extensive
datasets within supercomputer environments. Our investigation
focuses on two critical dimensions: user-friendliness, particularly
within the Python programming context, and high-performance
execution. We conduct benchmarking evaluations across a spec-
trum of general dataloader solutions, encompassing established
options from the PyTorch[5] library and advanced alternatives
such as Webdatasets[6], FFCV [7], and DALI [8]. The objec-
tive is to evaluate their efficacy in terms of data loading speed,
especially when confronted with datasets comprising millions of
images or more. Through comparative analysis, we elucidate
the strengths and limitations of these dataloaders in supercom-
puting environments. Initial findings from our benchmarking ex-
periments demonstrate that DALI, an open-source library devel-
oped by NVIDIA, outperforms our baseline PyTorch dataloader
by up to 3.4x in loading large datasets. This substantial enhance-
ment underscores the potential of DALI in expediting the data
ingestion phase within deep learning workflows on supercomput-
ers. Additionally, we explore the user-friendliness aspect of these
dataloaders, emphasizing the importance of seamless integration
with popular deep learning frameworks like PyTorch. Moreover,
we discuss the implications of our findings in the context of con-
temporary deep learning applications, underscoring the burgeon-
ing need for handling massive datasets across domains such as
computer vision. Efficient data loading and pre-processing are
paramount for minimizing training duration and maximizing the
efficient utilization of supercomputing resources.
Related Work
The emergence of deep learning has engendered an escalat-
ing demand for training larger models with expansive datasets,
prompting extensive investigations into the bottlenecks encoun-
tered during the training process. For instance, Mohan et al. [9]
conducted a comprehensive examination of data stalls within var-
ious models, revealing that these stalls could account for up to
65% of the total training time in certain scenarios. The insights
gleaned from their research informed the development of a coor-
dinated caching and pre-processing library named CoorDL, which
has demonstrated the capacity to accelerate training by up to 15-
fold in distributed training scenarios across two servers. Notably,
this study introduced a sizable dataset for Audio Classification,
denoted as M5 [10], with a size of approximately 950 GB, along
with performance evaluations on ImageNet-21k, boasting a size
of 1.3 TB [11]. Additionally, Mattson et al. [12] presented an
extensive investigation into machine learning workloads, focus-
ing on both training and inference tasks across various AI bench-
marks. Their study primarily aimed to assess the time required to
attain specific accuracy thresholds, a metric demanding incremen-
tal resource allocation and unsuitable for evaluating data loaders
and associated parameters. Moreover, Wu et al. [13] conducted a
study on memory and CPU utilization, scrutinizing access and us-
age patterns through an array of parallel libraries. Their analysis
included investigations into the impact of batch sizes on training
efficiency and accuracy, providing valuable insights into optimiz-
ing computational resources for deep learning tasks.
On the other hand, accessing large datasets such as JFT-300M and LAION-5B presents several challenges. In the first place, JFT and its variants are proprietary and not open to the public. LAION-5B, by contrast, is open to the public but consists of 5B links: even with high-end CPUs, downloading the whole dataset from its servers over the internet would take months, along with considerable computing resources to check consistency. Moreover, the use of these large datasets is not exempt from ethical concerns, such as societal biases, privacy issues, and copyright violations. Thus, synthetic datasets have gained traction to
perform transfer learning. For instance, Baradad et al. [14] pre-
sented a synthetic dataset consisting of 21,000 programs, each
tasked with generating a varied collection of synthetic images.
These images are generated using OpenGL Shading Language
(GLSL) programs, recognized for their versatility and adjustabil-
ity. These scripts are executed on the GPU utilizing the OpenGL
API. Furthermore, Formula-Driven Supervised Learning (FDSL)
was originally introduced by Kataoka et al. [15], where synthetic
images and their corresponding labels are generated using math-
ematical formulas. These unique fractal images offer a distinc-
tive approach to constructing extensive, diverse, and dynamically
evolving datasets suitable for training purposes. Numerous en-
deavors have been proposed in the FDSL domain [16, 17], en-
compassing the development of innovative datasets such as Visual
Atom [18] and MV-Fractal [19]. In this study, we utilize a modi-
fied version of the FractalDB, which facilitates the generation of
large datasets for our experiments.
Method
This section furnishes a comprehensive overview of the dat-
aloader and the challenges it encounters when processing exten-
sive datasets within distributed computing environments. Addi-
tionally, we elucidate the assortment of dataloaders subjected to
benchmarking, along with their impact on the dataset characteris-
tics both prior to and subsequent to their utilization.
The Deep Learning Dataloader
As previously mentioned, the dataloader assumes a piv-
otal role in deep learning workflows. This indispensable tool
is entrusted with the responsibility of loading and preprocessing
datasets, thereby rendering them accessible to the model during
both training and inference phases. Serving as a conduit between
the raw data and the deep learning model, it facilitates the seam-
less ingestion and manipulation of data.

[Figure 1. High traffic on a distributed system when datasets are retrieved from the network file system: CPU threads on each of n nodes (local SSD ~2 TB) pull data over the network from a shared NFS of ~100 PB.]

Furthermore, the dataloader retrieves samples from the dataset, encompassing various
data modalities such as images, text, audio, and more, and effi-
ciently loads them into memory, subsequently transferring them
to the GPU memory. In addition to this primary function, it un-
dertakes ancillary tasks such as image resizing, pixel value nor-
malization, text tokenization, and data conversion into numerical
format. The dataloader orchestrates the organization of data into
batches, wherein multiple samples are amalgamated, thus func-
tioning as the primary scheduler for batch processing. This batch-
ing mechanism enhances computational efficiency by enabling
the model to concurrently process multiple samples, thereby har-
nessing the power of parallelism. Lastly, the dataloader introduces
randomness by shuffling the data, thereby thwarting any potential
memorization of sample order by the model during training.
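To make these responsibilities concrete, the minimal sketch below shows where batching, shuffling, parallel loading, and the GPU transfer are expressed with the stock PyTorch DataLoader; the dataset path, image size, and worker count are illustrative assumptions rather than the exact settings used in this study.

```python
# Minimal sketch of the dataloader responsibilities described above,
# using the stock PyTorch DataLoader; paths and settings are assumptions.
import torch
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize((362, 362)),              # image resizing
    transforms.ToTensor(),                      # conversion to numerical format
    transforms.Normalize(mean=[0.5, 0.5, 0.5],  # pixel value normalization
                         std=[0.5, 0.5, 0.5]),
])
dataset = datasets.ImageFolder("data/fractaldb", transform=preprocess)
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=500,   # amalgamate multiple samples into one batch
    shuffle=True,     # thwart memorization of sample order
    num_workers=8,    # CPU workers retrieving samples in parallel
    pin_memory=True,  # page-locked memory for faster host-to-GPU copies
)
images, labels = next(iter(loader))
images = images.to("cuda", non_blocking=True)   # transfer to GPU memory
```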
Moreover, within the domain of deep learning, various data
types with distinct characteristics necessitate handling, includ-
ing Image, Text, Numerical, Audio, and Video data. These
data types serve specific purposes, such as time series forecast-
ing, image pattern recognition, or video description tasks. Con-
sequently, acquiring and pre-processing such diverse data pose
unique challenges for a single dataloader to provide an optimal
solution. While a general-purpose dataloader is available in Py-
Torch [5], purportedly supporting all data types, research by Ham-
bardzumyan et al. [20] underscores that handling numerous small
datasets of SQL datatypes presents disparate challenges compared
to processing an equivalent volume of items in images or videos.
Additionally, as depicted in Figure 1, the scalability of distributed
training across nodes amplifies the network load required to ac-
cess datasets for multiple CPU threads on each node. Given the
focus of this study on large-scale vision transformers or image
processing, we opt to benchmark dataloaders that prioritize com-
pression or exploit alternate hardware resources, such as GPUs,
for pre-processing. This strategy aims to mitigate the computa-
tional bottlenecks during distributed training.
Webdataset
The Webdatasets data loader was introduced [6] to stream-
line the acquisition of datasets directly from online repositories
for integration into machine learning pipelines. It compresses the whole dataset into POSIX tar archives.

[Figure 2. Conversion process when a normal dataset is processed by Webdataset: the original dataset of 1 million 362x362 gray-scale images (18 GB) is packed into tar shards Dataset_0000.tar through Dataset_0099.tar, ~20 GB in total, after ~1.5 hrs of pre-processing.]

Furthermore, it offers a flex-
ible and adaptable interface for accessing diverse datasets avail-
able on the web, eliminating the need for physical presence within
the local network and circumventing the manual downloading and
pre-processing steps. One of its notable features is dynamic dataset retrieval, empowering users to fetch datasets from
online repositories dynamically, encompassing public datasets
hosted on websites, cloud storage platforms, or data APIs. Addi-
tionally, this dataloader boasts on-the-fly pre-processing capabil-
ities, enabling users to apply transformations and manipulations
to the data during the loading process. Furthermore, Webdatasets
facilitates streaming data loading, enabling efficient processing
of large datasets without requiring the entire dataset to be loaded
into memory simultaneously. This streaming functionality is par-
ticularly advantageous for handling datasets exceeding available
memory resources. Moreover, the library offers support for par-
allel data loading, capitalizing on multi-core processors or dis-
tributed computing environments to expedite the loading process.
Furthermore, Webdatasets seamlessly integrates with prominent
deep learning frameworks such as PyTorch and TensorFlow, enabling users to incorporate web-based datasets directly into
their machine learning models and experiments.
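As a rough sketch of this workflow, the example below packs image/label pairs into POSIX tar shards with the webdataset library and streams them back; the shard pattern, shard size, and the samples iterable are assumptions for illustration, not the exact conversion script used here.

```python
# Hedged sketch of Webdataset conversion and streaming; shard naming,
# shard size, and the `samples` iterable are illustrative assumptions.
import webdataset as wds

def convert_to_shards(samples, pattern="Dataset_%04d.tar", maxcount=10_000):
    """samples: iterable of (jpeg_bytes, class_id) pairs from the source dataset."""
    with wds.ShardWriter(pattern, maxcount=maxcount) as sink:
        for idx, (jpeg_bytes, class_id) in enumerate(samples):
            sink.write({
                "__key__": f"sample{idx:07d}",
                "jpg": jpeg_bytes,   # raw encoded image bytes
                "cls": class_id,     # the corresponding Class ID
            })

# Streaming: read the shards (locally or over HTTP) without unpacking them.
dataset = (
    wds.WebDataset("Dataset_{0000..0099}.tar", shardshuffle=True)
    .decode("pil")               # decode images on the fly
    .to_tuple("jpg", "cls")
)
loader = wds.WebLoader(dataset, batch_size=500, num_workers=8)
```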
As depicted in Figure 2, the left side illustrates the conven-
tional structure of datasets loaded by PyTorch, whereas the right
side showcases the resulting format subsequent to pre-processing
with Webdataset, culminating in the conversion of the dataset
into POSIX tar archives. It is worth noting that converting the original dataset to the Webdataset format requires certain computational resources. While this operation
imposes minimal burden for smaller datasets such as CIFAR100
[21] or Stanford Cars [22], it may become considerable for larger
datasets.
FFCV
The FFCV dataloader was proposed [7] to efficiently man-
age image datasets, emphasizing the optimization of data loading
performance through methodologies such as caching and parallel
processing. Introducing a proprietary dataset format termed the "beton" extension, this dataloader encapsulates data in a binary-
like representation, optionally compressed in JPEG format for
network transmission, thus maximizing throughput and perfor-
mance. A notable feature of the FFCV dataloader is its file
caching mechanism, which preserves pre-processed image data
in memory or on disk to mitigate the overhead associated with
recurrent data loading and pre-processing tasks. This approach
significantly augments data loading speed and overall training efficiency, particularly in scenarios involving numerous augmentation techniques.

[Figure 3. Conversion process when a normal dataset is processed by FFCV: the original dataset of 1 million 362x362 gray-scale images (18 GB) becomes a single Dataset.beton file of ~350 GB after ~8 hrs of pre-processing.]

Moreover, FFCV employs parallel processing
strategies to concurrently load and pre-process image data, lever-
aging multi-core processors to expedite the loading process. By
distributing computational tasks across multiple cores, FFCV en-
hances throughput and reduces loading times, particularly when
handling extensive image datasets. Furthermore, FFCV facilitates
efficient image augmentation capabilities, enabling the applica-
tion of diverse transformations and augmentations to image data
during the loading phase. These augmentations bolster the diver-
sity and resilience of the training data, thereby enhancing model
generalization and performance.
Figure 3 illustrates the conventional structure of a dataset on
the left, contrasted with the converted output produced by the
FFCV conversion library on the right. Notably, the converted
output consists of a single file encapsulating the entire dataset,
enhancing bandwidth utilization when accessed by CPUs on the
nodes due to its binary-like data format representation. However,
akin to Webdataset, the creation of the "beton" file by FFCV ne-
cessitates additional computational resources. While this may not
pose a significant challenge for smaller datasets, it can present
considerable difficulties for larger datasets, particularly those in-
corporating numerous augmentation techniques inside the im-
ages.
NVIDIA Data Loading Library
The NVIDIA Data Loading Library (DALI) was proposed
[8] to efficiently manage large-scale datasets for deep learning,
particularly for computer vision applications. Its primary objec-
tive is to optimize the data loading process, thereby maximiz-
ing GPU and CPU utilization to accelerate training and inference
workflows. DALI incorporates GPU-accelerated data loading,
leveraging the GPU to minimize CPU-GPU data transfer over-
head and enhance overall performance. It supports parallel data
loading and augmentation, allowing multiple CPU threads to con-
currently process data, thus enhancing throughput. Additionally,
DALI offers a comprehensive suite of image pre-processing oper-
ations, including resizing, cropping, rotation, and color augmen-
tation, all performed on-the-fly during data loading to mitigate
memory overhead. Furthermore, DALI seamlessly integrates with
popular deep learning frameworks such as TensorFlow and Py-
Torch. Its GPU acceleration, parallel execution capabilities, and
deep learning framework compatibility collectively contribute to
enhanced training efficiency and model performance.
In contrast to the Webdataset and FFCV dataloaders examined in this study, DALI does not require transforming the original dataset format or structure. DALI optimizes performance by leveraging the GPU during data retrieval, encompassing tasks such as JPEG decoding and augmentations that involve image normalization, resizing, or other operations. As illustrated in Figure 4, the dataset remains unaltered, preserving its original size and format.

[Figure 4. Processing of a normal dataset by DALI: the 1 million-image, 18 GB dataset keeps its original structure and size, with pre-processing shared between CPU and GPU.]
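A minimal sketch of such a pipeline, covering the operations discussed here (GPU-side JPEG decoding, resizing, normalization), might look as follows; the data path, batch size, and thread count are assumptions.

```python
# Hedged sketch of a DALI pipeline with GPU-side decoding and
# pre-processing; paths and sizes are illustrative assumptions.
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def(batch_size=500, num_threads=8, device_id=0)
def fractal_pipeline(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir,
                                    random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")   # JPEG decoding on the GPU
    images = fn.resize(images, resize_x=362, resize_y=362)
    images = fn.crop_mirror_normalize(                  # normalize + NCHW layout
        images, dtype=types.FLOAT,
        mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5])
    return images, labels                               # already in GPU memory

pipe = fractal_pipeline("/local/ssd/fractaldb")
pipe.build()
loader = DALIGenericIterator(pipe, ["images", "labels"], reader_name="Reader")
```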
Evaluation
In this section, we present the outcomes of our experiments
evaluating three distinct dataloaders. We outline the experimental
setup and the environment in which the experiments were con-
ducted. We provide a brief description of the FractalDB [15] used in our experiments. Our analysis primarily focuses on measuring
image retrieval time without considering the whole training. For
the main tests presented herein, we offer various configurations
for the dataloaders capable of utilizing compression techniques.
Moreover, we compare these findings against those obtained using
the PyTorch dataloader, which serves as our baseline reference.
Experimental Environment
We leveraged the AI Bridging Cloud Infrastructure (ABCI)
supercomputer [23], renowned for its specialization in AI com-
puting tasks. This supercomputer comprises two distinct node
configurations: those featuring A100 GPUs and those housing
V100 GPUs. The Volta partition comprises 1,088 compute nodes, each integrating 2 Intel Xeon Gold 6148 CPUs (a total of 40 cores), 384 GiB of DRAM, 4 NVIDIA V100 GPUs, and InfiniBand EDR NICs. Additionally, each node includes 1.6TB of local storage and shares access to a 35PB Lustre paral-
lel filesystem. On the other hand, the Ampere nodes encompass
120 compute nodes, each with 2 Intel Xeon Platinum 8360Y Pro-
cessors (a total of 72 cores), 512 GiB of DRAM, 8 NVIDIA A100 GPUs, and InfiniBand HDR connectivity. Furthermore, every node includes 2.0TB of local storage and is linked to the shared Lustre parallel filesystem.
The Fractal Dataset
Fractals are complex geometric shapes that exhibit self-similarity at different scales. First introduced by Kataoka et al. [15], FractalDB is a collection of fractal images generated by a rendering method based on the Iterated Function System (IFS). In this sense, we can search for and generate an arbitrary number of fractals and their corresponding labels, and thus form a dataset of any desired size. Figure 5 shows an example of the different patterns
generated on each class.

[Figure 5. Sample fractal images from FractalDB (Fractal Class 0, 1, 2), one pattern per class.]

More specifically, each class $C$ has a hyperparameter set $\theta_y$ used to generate fractals as $F_y(s) = R(\theta_y, s)$, where $R$ is the rendering routine and $s$ is a random seed that creates variations within the class $C$. The hyperparameter $\theta_y = \{(w_i, p_i)\}_{i=1}^{n}$ consists of functions $w_i: \mathbb{R}^2 \to \mathbb{R}^2$ with probability mass functions $p_i$. The original FractalDB comprises one million images divided into $C = 1\mathrm{k}$ classes, with 1k instances per class; in this study we utilize this dataset.
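For intuition, the sketch below renders a fractal with the chaos-game algorithm implied by this formulation: it repeatedly samples an affine map $w_i$ according to $p_i$ and plots the visited points. It is not the FractalDB renderer, and the affine-map parameters are hypothetical.

```python
# Minimal chaos-game sketch of IFS fractal rendering; this is NOT the
# FractalDB implementation, and the class parameters are hypothetical.
import numpy as np

def render_fractal(theta, seed, size=362, n_points=100_000):
    """theta: list of ((A, b), p) pairs, where w(x) = A @ x + b is an
    affine map selected with probability p; seed creates intra-class variation."""
    rng = np.random.default_rng(seed)
    maps, probs = zip(*theta)
    probs = np.asarray(probs, dtype=float)
    probs /= probs.sum()
    img = np.zeros((size, size), dtype=np.uint8)
    x = np.zeros(2)
    for _ in range(n_points):
        A, b = maps[rng.choice(len(maps), p=probs)]
        x = A @ x + b                                      # apply the sampled map
        u, v = ((x + 1.0) / 2.0 * (size - 1)).astype(int)  # assume points in [-1, 1]^2
        if 0 <= u < size and 0 <= v < size:
            img[v, u] = 255                                # plot the visited point
    return img

# One hypothetical class: two affine maps with equal probability.
theta_y = [
    ((np.array([[0.5, 0.0], [0.0, 0.5]]), np.array([0.1, 0.2])), 0.5),
    ((np.array([[0.4, -0.3], [0.3, 0.4]]), np.array([-0.2, 0.1])), 0.5),
]
image = render_fractal(theta_y, seed=0)
```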
Performance per Batch Step
In this experiment, we assessed the performance of all dat-
aloaders exclusively during a large batch step, which is a com-
mon practice in Deep Learning for training data delivery. The
batch size was set to 500 for consistency. We quantified the to-
tal time elapsed from file retrieval to its aggregation into batches
and the application of basic transformations (Resize(), ToTensor(),
and Normalize()) until it is transferred to the GPU using the op-
eration ToDevice(). To emphasize raw dataloader performance,
only a subset of essential transformations was employed for this
experiment. The experiments were conducted on Volta nodes, and
the dataloader configurations were as follows:
• For FFCV, all three transformations were included during beton file creation, with three different configurations (10%, 50%, and 90%) used for compressing images in JPEG format.
• For Webdataset, shards were divided into 100 and 1,000, with a file pair inside each shard comprising the image and its corresponding Class ID.
• DALI incorporated identical transformation routines, executed from both CPU and GPU.
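For reference, the batch-step measurement described above can be sketched as follows for the PyTorch baseline; the helper assumes a loader configured as in the earlier DataLoader sketch and is illustrative rather than the exact harness used for these experiments.

```python
# Hedged sketch of the batch-step timing: file retrieval, transformation,
# batching, and the ToDevice() transfer are all inside the measured window.
import time
import torch

def time_one_batch(loader, device="cuda"):
    start = time.perf_counter()
    images, labels = next(iter(loader))   # retrieve and transform one batch
    images = images.to(device)            # the ToDevice() step
    torch.cuda.synchronize()              # make sure the copy has finished
    return time.perf_counter() - start

# print(f"one batch step: {time_one_batch(loader):.1f} s")
```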
Figure 6 illustrates the performance comparison among all
dataloaders. It is evident that the baseline approach using Py-
Torch alone to retrieve 500 images via the file system exhibits
the poorest performance, even exceeding the time required by
Webdatasets, FFCV, or DALI by a considerable margin. As for
the FFCV loader, the compression ratio significantly influences
loading times, as it introduces computational overhead during de-
coding. However, FFCV demonstrates improved performance
when the compression ratio is set to 10%. Webdatasets exhibit
the shortest loading times for both configurations, with the opti-
mal performance observed when shards are set to 1,000. Further-
more, we observed performance variations based on the number
of shards used to partition the original dataset. DALI outperforms
FFCV and achieves comparable results to Webdatasets. The ob-
served variations in loading times suggest that network traffic in-
fluences performance when accessing images. Consequently, we
conducted additional measurements by relocating the dataset to the node's local SSD, rendering the impact of network IO effectively negligible.
[Figure 6. Time to load one batch step (500 images); the dataset resides on the NFS file system. Lower is better.]

[Figure 7. Time to load one batch step (500 images); the dataset resides on the local SSD. Lower is better.]

Figure 7 depicts a comparable experiment to the previously described one. In this instance, the dataset was allocated on the SSD. The considerable reduction in loading time observed in the
baseline, or when using the PyTorch dataloader, is noteworthy, de-
creasing from 68 seconds to 30 seconds. This emphasizes the sub-
stantial burden imposed by IO during training, particularly when
employing multiple nodes. Interestingly, the Webdatasets loader exhibits the poorest performance of all the dataloaders evaluated,
even with the 1,000 shards configuration. FFCV follows a similar
trend as in the preceding experiment, with the 10% compression
ratio yielding the best loading time. Notably, DALI demonstrates
the most remarkable improvement in loading time, decreasing
from 40 seconds to 6 seconds. This represents an order of mag-
nitude enhancement in performance, attributed to the concurrent
transformations performed by both CPU and GPU. Directly com-
paring both experiments reveals a reduction of more than half the
time for all dataloaders, except for Webdatasets, which performs
similarly to when the dataset was on the NFS.
Performance per Epoch
In this experiment, we assessed the performance of all dat-
aloaders across a full epoch, encompassing the entire dataset. This
evaluation extends beyond previous results, involving the mea-
surement of the complete time required to load 1 million images.
Additionally, we introduced a more intricate set of transforma-
tions, including AutoAugment [24], which imposes greater CPU
load during batch formation. It is noteworthy that integrating Au-
toAugment into FFCV necessitated implementing this algorithm
within their custom transformations, which proved to be a non-
trivial task. Consequently, for FFCV, AutoAugment was not in-
cluded when creating the beton file. In contrast, DALI, which
also offers custom transformations, supports algorithms such as
AutoAugment. The complete list of transformations includes Au-
toAugment(), ToTensor(), and Normalize(), with the time taken to reach the GPU measured using ToDevice(). To provide better insight into how long it takes to load 1 million images as fast as possible, we conducted the experiment with the whole dataset allocated on the SSD. The experiments were conducted on Volta nodes on ABCI.

[Figure 8. Time to load one full epoch; the dataset resides on the local SSD. Lower is better.]
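Because DALI ships AutoAugment as a built-in policy (unlike FFCV, where it would have to be re-implemented as a custom transformation), the augmented pipeline can be sketched as follows; argument values are assumptions.

```python
# Hedged sketch of DALI's built-in AutoAugment inside a pipeline;
# requires enable_conditionals=True. Settings are assumptions.
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.auto_aug import auto_augment

@pipeline_def(batch_size=500, num_threads=8, device_id=0,
              enable_conditionals=True)
def epoch_pipeline(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=362, resize_y=362)
    images = auto_augment.auto_augment(images, shape=[362, 362])  # AutoAugment()
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT,  # Normalize()
                                      mean=[127.5] * 3, std=[127.5] * 3)
    return images, labels
```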
Figure 8 presents the performance of the dataloaders when
loading the entire 1 million-image dataset. The results show a similar trend to those observed during the batch step, as the pri-
mary challenge in this experiment lies in handling the larger vol-
ume of images. The baseline PyTorch loader required 5 minutes
to load the entire dataset, a performance closely matched by Web-
dataset with the 1,000 shards configuration. However, reducing
the number of shards in Webdataset increased the loading time
by one minute. FFCV demonstrated the best performance with
the 10% compression ratio, loading the entire dataset in under
three minutes, a little more than half the time required by the baseline. DALI excelled in loading time, completing the task in close to 1 minute and 7 seconds, outperforming the baseline by up to 3.4x. This makes DALI the most efficient option for loading large images, particularly when combined with relocating the dataset to a local SSD. Additionally, we con-
ducted an additional measurement to assess the time required for
a dataset to be loaded directly into RAM memory when created
on-the-fly. We observed that it was faster than even DALI, at under 1 minute and 2 seconds. This measurement is part of our future
work, focusing on enhancing dataset access and creation speed.
Conclusion
We have presented a benchmark that provides a valuable ref-
erence for researchers and practitioners involved in deep learn-
ing tasks on supercomputing platforms. Our benchmark offers
insights and evaluations of high-performance general dataload-
ers, with a primary focus on their loading speed. Our findings
underscore DALI as a promising solution for accelerating data
loading on supercomputers. Additionally, we have observed that
a straightforward solution to enhance loading times is to relo-
cate datasets to local storage. In our experiments, leveraging the
SSD within each node significantly reduced loading times, halv-
ing them compared to using the file system. Furthermore, we
noted variations among dataloaders, with Webdatasets maintain-
ing consistent performance regardless of whether the dataset is
stored on the SSD or over NFS. FFCV may be a suitable op-
tion, particularly when extensive augmentation or preprocessing
is not required. Moreover, DALI demonstrates superior loading
times, making it the optimal choice for handling large datasets.
As datasets continue to expand in size and complexity, efficient data management becomes an increasingly critical task.
Acknowledgements
This paper is based on results obtained from a project,
JPNP20006, subsidized by the New Energy and Industrial Tech-
nology Development Organization (NEDO).
References
[1] TOP500, The List, https://www.top500.org/, 2023 [accessed November 2023].
[2] Sun. C, Shrivastava. A, Singh. S, Gupta. A. Revisiting unreasonable
effectiveness of data in deep learning era. In Proceedings of the IEEE
international conference on computer vision 2017 (pp. 843-852).
[3] Schuhmann. C, Beaumont. R, Vencu. R, Gordon. CW, Wightman.
R, Cherti. M, Coombes. T, Katta. A, Mullis. C, Wortsman. M,
Schramowski. P. LAION-5B: An open large-scale dataset for train-
ing next generation image-text models. In Thirty-sixth Conference
on Neural Information Processing Systems Datasets and Benchmarks
Track.
[4] Kolesnikov. A, Beyer. L, Zhai. X, Puigcerver. J, Yung. J, Gelly. S,
Houlsby. N. Big transfer (bit): General visual representation learning.
In Computer Vision – ECCV 2020: 16th European Conference, Glas-
gow, UK, August 23–28, 2020, Proceedings, Part V 16 2020 (pp.
491-507). Springer International Publishing.
[5] PyTorch Core Team, PyTorch Vision Docs,
https://pytorch.org/vision/stable/datasets.html, [January 2024].
[6] Aizman. A, Maltby. G, Breuel. T. High performance I/O for large
scale deep learning. In 2019 IEEE International Conference on Big
Data (Big Data) 2019 Dec 9 (pp. 5965-5967). IEEE.
[7] Leclerc. G, Ilyas. A, Engstrom. L, Park. SM, Salman. H, Madry. A.
FFCV: Accelerating training by removing data bottlenecks. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition 2023 (pp. 12011-12020).
[8] Nvidia Data Loading Library - DALI.
https://developer.nvidia.com/dali, 2024, [January 2024].
[9] Mohan. J, Phanishayee. A, Raniwala. A, Chidambaram. V. Analyz-
ing and mitigating data stalls in DNN training. In Proceedings of the
VLDB Endowment, Volume 14, Issue 5, pp 771–784.
[10] Defferrard. M, Benzi. K, Vandergheynst. P, Bresson. X. FMA: A
dataset for music analysis. 18th International Society for Music In-
formation Retrieval Conference (ISMIR), 2017.
[11] ImageNet Dataset, https://www.image-net.org/index.php, 2024,
[January 2024].
[12] Mattson. P, Cheng. C, Diamos. G, Coleman. C, Micikevicius. P, Pat-
terson. D, Tang. H, et al. Mlperf training benchmark. Proceedings of
Machine Learning and Systems, 2:336–349, 2020.
[13] Wu. Y, Liu. L, Pu. C, Cao. W, Sahin. S, Wei. W, Zhang. Q. A com-
parative measurement study of deep learning as a service framework.
IEEE Transactions on Services Computing. 2019 Jul 18;15(1):551-
66.
[14] Baradad. M, Chen. CF, Wulff. J, Wang. T, Feris. R, Torralba. A,
Isola. P. Procedural Image Programs for Representation Learning. In
Advances in Neural Information Processing Systems 2022 Nov 26.
[15] Kataoka. H, Okayasu. K, Matsumoto. A, Yamagata. E, Yamada. R,
Inoue. N, Nakamura. A, Satoh. Y. Pre-training without natural im-
ages. In Proceedings of the Asian Conference on Computer Vision
2020.
[16] Kataoka. H, Hayamizu. R, Yamada. R, Nakashima. K, Takashima.
S, Zhang. X, Martinez-Noriega. EJ, Inoue. N, Yokota. R. Replacing
Labeled Real-image Datasets with Auto-generated Contours. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition 2022 (pp. 21232-21241).
[17] Nakashima. K, Kataoka. H, Matsumoto. A, Iwata. K, Inoue. N,
Satoh. Y. Can vision transformers learn without natural images? In
Proceedings of the AAAI Conference on Artificial Intelligence 2022
Jun 28 (Vol. 36, No. 2, pp. 1990-1998).
[18] Takashima. S, Hayamizu. R, Inoue. N, Kataoka. H, Yokota. R. Vi-
sual atoms: Pre-training vision transformers with sinusoidal waves. In
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition 2023 (pp. 18579-18588).
[19] Yamada. R, Takahashi. R, Suzuki. R, Nakamura. A, Yoshiyasu. Y,
Sagawa. R, Kataoka. H. MV-FractalDB: formula-driven supervised
learning for multi-view image recognition. In 2021 IEEE/RSJ Inter-
national Conference on Intelligent Robots and Systems (IROS) 2021
Sep 27 (pp. 2076-2083). IEEE.
[20] Hambardzumyan. S, Tuli. A, Ghukasyan. L, Rahman. F, Topchyan.
H, Isayan. D, McQuade. M, Harutyunyan. M, Hakobyan. T, Stranic.
I, Buniatyan. D. Deep lake: A lakehouse for deep learning. arXiv
preprint arXiv:2209.10785. 2022 Sep 22.
[21] Krizhevsky A, Hinton G. Learning multiple layers of features from
tiny images. 2009.
[22] Krause. J, Stark. M, Deng. J, Fei-Fei. L. 3d object representations for
fine-grained categorization. In Proceedings of the IEEE international
conference on computer vision workshops 2013 (pp. 554-561).
[23] National Institute of Advanced Industrial Science and Technology,
ABCI Supercomputer, https://abci.ai, 2023, [January 2023].
[24] Cubuk. ED, Zoph. B, Mane. D, Vasudevan. V, Le. QV. Autoaug-
ment: Learning augmentation policies from data. arXiv preprint
arXiv:1805.09501. 2018 May 24.
Author Biography
Edgar Josafat Martinez-Noriega obtained his Doctorate in Com-
puter Science from the University of Electro-Communications, Tokyo in
2022. Following this, he has been employed as a Post-Doctoral Re-
searcher at the National Institute of Advanced Industrial Science and
Technology (AIST), working on the application of synthetic datasets for
large-scale deep learning. His research focuses on parallel computing,
computer graphics, and deep learning.
Peng Chen is a researcher at the National Institute of Advanced Indus-
trial Science and Technology (AIST). Also, he is working as a visiting sci-
entist at RIKEN Center for Computational Science (RIKEN-CCS), Japan.
He received the B.E. degree in navigation from Dalian Maritime Univer-
sity, China, in 2005; the M.E. degree in traffic information engineering
and control from Shanghai Maritime University, China, in 2007; the Ph.D.
from Tokyo Institute of Technology, Japan, in 2020. His research interests
include parallel computing, image processing, and machine learning.
Rio Yokota is a professor at the Global Scientific Information and
Computing Center, Tokyo Institute of Technology. His research focuses
on high performance computing, linear algebra, and machine learning.
He has developed several libraries, including ExaFMM for fast multipole
methods, and Hatrix for hierarchical low-rank algorithms. He received the Gordon Bell Prize in 2009 for work using the first GPU supercomputer.
Rio is a member of ACM, IEEE, and SIAM.
Kolesnikov. A, Beyer. L, Zhai. X, Puigcerver. J, Yung. J, Gelly. S, Houlsby. N. Big transfer (bit): General visual representation learning. InComputer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part V 16 2020 (pp. 491-507). Springer International Publishing.