Figure 2 - available via license: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International


Source publication

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data parallel training grows, so does the communica...

## Contexts in source publication

**Context 1**

... Parallel Training: To accelerate training using DP, a full set of model parameters (i.e., weights) is replicated across multiple devices/workers. As Figure 2a shows, each worker first performs a forward and backward pass independently on a different batch of inputs. Gradients are then communicated across workers and averaged, after which each worker applies the same set of gradient values to the model weights. ...
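The synchronized step described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: it assumes a toy one-parameter model y ≈ w·x with squared loss, and the worker count, learning rate, and data are all made up.

```python
import random

# Minimal sketch of one synchronous data-parallel step, assuming a toy
# one-parameter model y ≈ w * x with squared loss. All names/values here
# are illustrative, not from the source publication.
random.seed(0)
num_workers, lr = 4, 0.01
w = 0.5                                   # parameter replicated on every worker
replicas = [w] * num_workers

def local_gradient(w, batch):
    """Forward + backward pass on one worker's batch: d/dw of (w*x - y)^2."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

# Each worker draws a different batch (here: noisy samples of y = 3x).
batches = [[(x := random.random(), 3 * x + random.gauss(0, 0.1))
            for _ in range(8)] for _ in range(num_workers)]
grads = [local_gradient(wr, b) for wr, b in zip(replicas, batches)]

# "All-reduce": average the gradients, then every worker applies the
# identical update, so the replicas stay bit-for-bit in sync.
avg_grad = sum(grads) / num_workers
replicas = [wr - lr * avg_grad for wr in replicas]
assert len(set(replicas)) == 1            # all copies remain identical
```

In a real system the averaging step is a collective all-reduce over the network rather than a local mean, but the invariant is the same: every replica sees the identical averaged gradient and therefore stays synchronized.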

**Context 2**

... approach has traditionally been used for models whose parameters will not fit into a single device's memory (Wu et al., 2016; Krizhevsky et al., 2017). However, MP can provide a per-step training speedup (Mirhoseini et al., 2018; Dean et al., 2012) even when the entire model fits on one device, by executing independent operations concurrently on separate devices, as shown in Figure 2b. Splitting a DFG among multiple devices is non-trivial for many networks. ...
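The concurrency opportunity described above can be sketched as two independent branches of a dataflow graph running at the same time. This is only an illustration of the idea; the branch functions below are hypothetical stand-ins for subgraphs placed on separate devices, and threads stand in for the devices themselves.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: two branches of a dataflow graph with no data
# dependency between them may execute concurrently, each stand-in
# function representing a subgraph placed on a different device.
def branch_a(x):
    return [v * 2 for v in x]       # hypothetical op on device 0

def branch_b(x):
    return [v + 10 for v in x]      # hypothetical op on device 1

x = [1.0, 2.0, 3.0]
with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(branch_a, x)   # the two submissions can overlap
    fb = pool.submit(branch_b, x)   # because neither needs the other's output
    merged = [a + b for a, b in zip(fa.result(), fb.result())]  # join point
# merged == [13.0, 16.0, 19.0]
```

The join point is where the non-triviality arises: an MP placement is only profitable when the concurrent branches are long enough to hide the cost of moving their outputs back together, which is why splitting a DFG well is hard for many networks.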

**Context 3**

... data parallel training, the network parameters (weights) are replicated across multiple worker devices, and each worker performs a forward and a backward pass individually on a distinct batch of inputs (shown in Figure 2a). In this work, we focus on synchronous stochastic gradient descent (sync-SGD) for weight updates. ...

**Context 4**

... T_1 is the average training time per step when only one device is used for training, while T_N is the time per step when N data parallel devices (with a constant mini-batch size per device) are used. T_N is always larger than T_1 because in DP, after each device has performed a forward and backward pass, the gradients must be exchanged between the devices using all-reduce communication (see Figure 2a). Due to this communication overhead, the ratio T_1/T_N will never be larger than one and is typically less than one. ...
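The scaling ratio can be made concrete with a small numeric example. The timings below are invented for illustration, not measurements from the paper.

```python
# Hypothetical per-step timings (seconds) to illustrate the ratio T_1 / T_N;
# the numbers are made up, not measured values from the source publication.
T_1 = 0.80            # one device: forward + backward pass only
T_N = 1.00            # N devices: same compute plus all-reduce overhead

efficiency = T_1 / T_N   # bounded above by 1, since T_N >= T_1
assert efficiency <= 1
# Here efficiency == 0.8: roughly 20% of ideal per-step throughput
# is lost to gradient communication.
```

Because the per-device mini-batch size is held constant, the compute portion of T_N matches T_1, so the gap between the two times is attributable to the gradient exchange.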

## Similar publications

This work presents a parallel implementation of a hybrid model for studying plasma dynamics in axially symmetric open magnetic traps. The model is based on the MHD approximation for the electron component of the plasma and on a kinetic approach for the ion component. The model uses the particle-in-cell (PIC) method with explicit numerical schemes on staggere...