Thesis

Hardware-Aware Co-Optimization of Deep Convolutional Neural Networks


Abstract

The unprecedented success of deep neural networks (DNNs), especially convolutional neural networks (CNNs), stems from their high representational power and their capability to model complex functions. This representational power comes from the complex structure of DNNs, which increases both the computational complexity and the (memory) size of models; thus, high memory capacity and high compute power are required to process DNNs. However, embedded devices and mobile platforms have very limited (on-chip) memory and compute capacity, which prohibits the wide deployment of DNNs. To overcome these challenges, compact DNNs with low computational complexity and small model size have been proposed. However, data-reuse-unaware compute reduction leads to higher data movement, which consumes orders of magnitude more energy than an arithmetic operation and renders compact models energy-inefficient. Moreover, on systolic-array-based accelerators, the low data reuse in compact DNNs causes PE (processing element) underutilization, which results in higher (inference) latency. Numerous co-design (DNN accelerator and algorithm) techniques have been proposed to address these inefficiencies of compact DNNs (low energy efficiency and sub-optimal latency). However, their generalizability is quite limited and, more importantly, these co-design techniques are oblivious to the predictive performance of the model, which leads to sub-optimal inference accuracy. In this thesis, we first investigate the performance and security implications of designing compact DNNs. We find that contemporary methods of reducing the number of parameters and computations increase the total number of activations, which in turn increases the memory footprint and data movement, and thus lowers energy efficiency. We also demonstrate that the distinctive characteristics of (compact) DNNs can easily be exploited to decipher the architecture of their building blocks through side-channel attacks, and we propose security-aware design methodologies that are robust against such attacks. Further, we propose a data-reuse-aware co-design that balances computational complexity with data reuse and reaches a sweet spot of energy efficiency and latency on both GPUs and systolic accelerators. Moreover, unlike previous co-design methods, our approach enables a trade-off between the representational power of DNNs and their generalization capability, thus maximizing predictive performance (accuracy on image classification tasks). Furthermore, we propose a subspace self-attention mechanism that improves computational efficiency and boosts the representational power of DNNs; this attention mechanism incurs negligible parameter overhead and is hence suitable for deployment in compact DNNs. Finally, we employ knowledge distillation as a learning paradigm to gain predictive performance without changing the architecture of DNNs, and investigate its efficacy as a substitute for residual connections in residual networks. We find that knowledge distillation serves as a good (weight) initializer that regularizes the gradient flow in the student network. In effect, training DNNs with knowledge distillation avoids gradient flow through chaotic regions of the loss surface and enables convergence in well-behaved (convex) regions of the error surface.
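The claim that parameter and MAC reduction can inflate the activation count is easy to see with a back-of-the-envelope calculation. The sketch below is illustrative only; the layer shapes are assumed, not taken from the thesis. It compares a standard 3x3 convolution with a depthwise-separable replacement and counts parameters, MACs, and output activations.

```python
# Illustrative sketch: parameter/MAC reduction vs. activation growth.
# Layer shapes are hypothetical examples, not figures from the thesis.

def standard_conv(h, w, cin, cout, k=3):
    params = k * k * cin * cout
    macs = h * w * params           # one k*k*cin dot product per output pixel per filter
    activations = h * w * cout      # a single output feature map is materialized
    return params, macs, activations

def depthwise_separable_conv(h, w, cin, cout, k=3):
    dw_params = k * k * cin                  # depthwise 3x3
    pw_params = cin * cout                   # pointwise 1x1
    macs = h * w * (dw_params + pw_params)
    # Two feature maps are materialized: the depthwise output and the pointwise output.
    activations = h * w * cin + h * w * cout
    return dw_params + pw_params, macs, activations

h = w = 56
cin = cout = 128
for name, fn in [("standard", standard_conv), ("separable", depthwise_separable_conv)]:
    p, m, a = fn(h, w, cin, cout)
    print(f"{name:10s} params={p:>9,} MACs={m:>13,} activations={a:>9,}")
```

On these assumed shapes the separable block cuts parameters and MACs by roughly 8x while roughly doubling the activations that must be moved to and from memory, which is the effect the abstract refers to.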
... We use this framework to analyze the performance bounds of each algorithm in terms of energy. To further analyze energy efficiency, we use the methods introduced by Jha et al. [23], which provide another means of estimating the energy efficiency of a given convolution-based image upsampling algorithm by its data requirements. We discuss each of these quantitative models further in Section VII. ...
... Algorithms with low data reuse put more strain on a system's memory bandwidth, as each compute operation requires more off-chip memory accesses. While the value of this ratio implies the scalability and locality of an algorithm, it fails to reliably estimate the energy efficiency of convolution-based deep learning algorithms [23]. By separately considering weight and activation reuse, Jha et al. [23] show that, for convolution-based deep learning algorithms, the variation in arithmetic intensity is attributed to the variation in activation reuse, which in turn is highly correlated with variations in energy efficiency. As such, we use the compute and memory requirements discussed in Section VII-A to estimate the energy efficiency of convolution-based upsampling algorithms by activation reuse. ...
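To make the distinction between aggregate arithmetic intensity and per-data-type reuse concrete, the sketch below computes the classic MACs-per-datum ratio alongside separate weight-reuse and activation-reuse ratios for a convolutional and a fully connected layer. The layer shapes are assumed for illustration, and the simple ratios here are not the exact formulation of [23].

```python
# Minimal sketch (assumed shapes): arithmetic intensity vs. separate
# weight reuse and activation reuse for two very different layers.

def conv_stats(h, w, cin, cout, k):
    weights = k * k * cin * cout
    activations = h * w * cin + h * w * cout        # input + output feature maps
    macs = h * w * k * k * cin * cout
    return weights, activations, macs

def fc_stats(nin, nout):
    weights = nin * nout
    activations = nin + nout
    macs = nin * nout
    return weights, activations, macs

layers = {
    "conv 3x3, 56x56, 128->128": conv_stats(56, 56, 128, 128, 3),
    "fully connected 4096->4096": fc_stats(4096, 4096),
}
for name, (w_, a_, m_) in layers.items():
    ai = m_ / (w_ + a_)          # classic arithmetic intensity: MACs per datum moved
    print(f"{name:28s} AI={ai:8.1f}  "
          f"weight reuse={m_ / w_:8.1f}  activation reuse={m_ / a_:8.1f}")
```

The convolution and the fully connected layer have very different weight-reuse and activation-reuse profiles even though a single aggregate intensity number hides that difference, which is why the two are considered separately in the cited analysis.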
Article
Full-text available
State-of-the-art deep learning solutions for image upsampling are currently trained using either resize or sub-pixel convolution to learn kernels that generate high fidelity images with minimal artifacts. However, performing inference with these learned convolution kernels requires memory-intensive feature map transformations that dominate time and energy costs in real-time applications. To alleviate this pressure on memory bandwidth, we propose a novel energy-efficient edge computing paradigm that confines the use of resize or sub-pixel convolution to training in the cloud by transforming learned convolution kernels to deconvolution kernels before deploying them for inference as a functionally equivalent deconvolution. These kernel transformations, intended as a one-time cost when shifting from training to inference, enable a systems designer to use each algorithm in their optimal context by preserving the image fidelity learned when training in the cloud while minimizing data transfer penalties during inference at the edge. We compare the inference properties of these convolution-based image upsampling algorithms and introduce a novel deconvolution inference algorithm, which we refer to as REVD2. To demonstrate the benefits of our approach, we upsample images selected from the BSD300 dataset using a pre-trained single-image super resolution network provided by the PyTorch model zoo. Using quantitative models of incurred time and energy costs to analyze this deep neural network, we estimate that using REVD2 for inference at the edge improves system latency by 2.1x or 2.8x and energy efficiency by 2.1x or 2.7x when respectively compared to sub-pixel or resize convolution counterparts.
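For readers unfamiliar with the two upsampling paths being compared, the sketch below shows a sub-pixel convolution (convolution followed by pixel shuffle) and a transposed convolution producing the same output shape in PyTorch. The channel counts and the kernel-size choice for the transposed convolution are assumptions for illustration; the kernel transformation that makes a trained sub-pixel convolution functionally equivalent to a deconvolution is the cited work's contribution and is not reproduced here.

```python
# Sketch of the two upsampling building blocks discussed above (PyTorch).
import torch
import torch.nn as nn

r, cin, cout = 2, 64, 3                      # upsampling factor and channel counts (assumed)

# Sub-pixel convolution: convolve at low resolution, then rearrange
# r*r*cout channels into an r-times larger spatial grid.
subpixel = nn.Sequential(
    nn.Conv2d(cin, cout * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),
)

# Deconvolution (transposed convolution) producing the same output shape directly.
deconv = nn.ConvTranspose2d(cin, cout, kernel_size=2 * r, stride=r, padding=r // 2)

x = torch.randn(1, cin, 64, 64)
print(subpixel(x).shape)   # torch.Size([1, 3, 128, 128])
print(deconv(x).shape)     # torch.Size([1, 3, 128, 128])
```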
... Algorithms with low data reuse put more strain on a system's memory bandwidth, as each compute operation requires more off-chip memory accesses. While the value of this ratio implies the scalability and locality of an algorithm, it fails to properly estimate the energy efficiency of convolution-based deep learning algorithms when the ratio of activations (A) to weights (W) drastically deviates from 1 [15]. By separately considering weight and activation reuse, Jha et al. [15] show that, for convolution-based deep learning algorithms, the variation in arithmetic intensity is attributed to the variation in activation reuse and is highly correlated with variations in energy efficiency. Following this work, we use the compute and memory requirements discussed in Section 5.1 to estimate the energy efficiencies of convolution-based upsampling algorithms using activation reuse. ...
... First, we use energy per pixel to evaluate the energy efficiency of each algorithm, using the expressions in Tables 1 and 2 to calculate the activation reuse for each convolution-based image upsampling algorithm. Following the work of Jha et al. [15], we use this metric to estimate the energy efficiency of each algorithm as a function of the upsampling factor r. Each experiment assumes a square 1K RGB input image upsampled using a standard 3 × 3 kernel. ...
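As a rough illustration of how activation reuse varies with the upsampling factor r, the sketch below compares resize convolution and sub-pixel convolution for a single output layer on a 1024x1024 input. The feature-channel count and the decision to count only input and output feature maps as activations are simplifying assumptions, not the cited paper's exact model from its Tables 1 and 2.

```python
# Rough activation-reuse comparison (MACs per activation) for two
# convolution-based upsampling algorithms as a function of the factor r.
# Assumed final layer: cin feature channels in, 3 RGB channels out, 3x3 kernel.

def resize_conv_reuse(h, w, cin, r, k=3, cout=3):
    macs = (h * r) * (w * r) * k * k * cin * cout          # convolution at high resolution
    acts = (h * r) * (w * r) * cin + (h * r) * (w * r) * cout
    return macs / acts

def subpixel_conv_reuse(h, w, cin, r, k=3, cout=3):
    macs = h * w * k * k * cin * (cout * r * r)            # convolution at low resolution
    acts = h * w * cin + h * w * cout * r * r
    return macs / acts

h = w = 1024
cin = 64
for r in (2, 3, 4):
    print(f"r={r}: resize {resize_conv_reuse(h, w, cin, r):6.1f}  "
          f"sub-pixel {subpixel_conv_reuse(h, w, cin, r):6.1f}")
```

Under these simplifications the sub-pixel path performs the same MACs over far fewer activations, so its activation reuse grows with r, which is the kind of trend the energy-per-pixel analysis quantifies more carefully.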
Preprint
A novel energy-efficient edge computing paradigm is proposed for real-time deep learning-based image upsampling applications. State-of-the-art deep learning solutions for image upsampling are currently trained using either resize or sub-pixel convolution to learn kernels that generate high fidelity images with minimal artifacts. However, performing inference with these learned convolution kernels requires memory-intensive feature map transformations that dominate time and energy costs in real-time applications. To alleviate this pressure on memory bandwidth, we confine the use of resize or sub-pixel convolution to training in the cloud by transforming learned convolution kernels to deconvolution kernels before deploying them for inference as a functionally equivalent deconvolution. These kernel transformations, intended as a one-time cost when shifting from training to inference, enable a systems designer to use each algorithm in their optimal context by preserving the image fidelity learned when training in the cloud while minimizing data transfer penalties during inference at the edge. We also explore existing variants of deconvolution inference algorithms and introduce a novel variant for consideration. We analyze and compare the inference properties of convolution-based upsampling algorithms using a quantitative model of incurred time and energy costs and show that using deconvolution for inference at the edge improves both system latency and energy efficiency when compared to their sub-pixel or resize convolution counterparts.
Article
Full-text available
The remarkable predictive performance of deep neural networks (DNNs) has led to their adoption in service domains of unprecedented scale and scope. However, the widespread adoption and growing commercialization of DNNs have underscored the importance of intellectual property (IP) protection. Devising techniques to ensure IP protection has become necessary due to the increasing trend of outsourcing DNN computations to untrusted accelerators in cloud-based services. The design methodologies and hyper-parameters of DNNs are crucial information, and leaking them may cause massive economic loss to the organization. Furthermore, knowledge of a DNN's architecture can increase the success probability of an adversarial attack, in which an adversary perturbs the inputs and alters the prediction. In this work, we devise a two-stage attack methodology, "DeepPeep," which exploits the distinctive characteristics of design methodologies to reverse-engineer the architecture of building blocks in compact DNNs. We show the efficacy of DeepPeep on P100 and P4000 GPUs. Additionally, we propose intelligent design-maneuvering strategies for thwarting IP theft through the DeepPeep attack and propose "Secure MobileNet-V1." Interestingly, compared to vanilla MobileNet-V1, Secure MobileNet-V1 provides a significant reduction in inference latency (≈60%) and an improvement in predictive performance (≈2%) with very low memory and computation overheads.
Article
Full-text available
In recent years, researchers have focused on reducing the model size and the number of computations (measured as "multiply-accumulate" or MAC operations) of DNNs. The energy consumption of a DNN depends on both the number of MAC operations and the energy efficiency of each MAC operation. The former can be estimated at design time; however, the latter depends on the intricate data-reuse patterns and the underlying hardware architecture, so estimating it at design time is challenging. This work shows that the conventional approach to estimating data reuse, viz. arithmetic intensity, does not always correctly estimate the degree of data reuse in DNNs, since it gives equal importance to all data types. We propose a novel model, termed "data type aware weighted arithmetic intensity" (DI), which accounts for the unequal importance of different data types in DNNs. We evaluate our model on 25 state-of-the-art DNNs on two GPUs. We show that our model accurately models data reuse for all possible reuse patterns across different types of convolutions and layers, and that it is a better indicator of the energy efficiency of DNNs. We also show its generality using the central limit theorem.
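The abstract does not spell out the DI formula, so the sketch below only contrasts conventional arithmetic intensity, which weights all data types equally, with a generic weighted form to make the idea concrete. The weighting scheme and the alpha value here are purely illustrative assumptions, not the paper's definition, and the layer statistics are hypothetical.

```python
# Conventional arithmetic intensity vs. a data-type-aware weighted variant.
# The weighting below is an assumed illustration of the idea; the actual
# definition of DI is given in the cited paper, not in this abstract.

def arithmetic_intensity(macs, weights, activations):
    return macs / (weights + activations)              # treats all data types equally

def weighted_intensity(macs, weights, activations, alpha):
    # alpha re-weights the two data types relative to each other;
    # its value here (and the functional form) is purely illustrative.
    return macs / (alpha * weights + (1 - alpha) * activations)

layers = {
    "activation-heavy conv": dict(macs=462_422_016, weights=147_456, activations=802_816),
    "weight-heavy fc":       dict(macs=16_777_216, weights=16_777_216, activations=8_192),
}
for name, s in layers.items():
    ai = arithmetic_intensity(**s)
    di = weighted_intensity(**s, alpha=0.2)
    print(f"{name:22s} AI={ai:8.1f}  weighted={di:8.1f}")
```

The point of the exercise is that two layers with similar aggregate intensity can behave very differently once weights and activations are treated as unequally expensive to move, which is the motivation the abstract describes.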