Article

EPU: An Energy-Efficient Explainable AI Accelerator With Sparsity-Free Computation and Heat Map Compression/Pruning


Abstract

Deep neural networks (DNNs) have recently gained significant prominence in real-world applications such as image recognition, natural language processing, and autonomous vehicles. However, due to their black-box nature, the mechanisms behind a DNN's inference results remain opaque to users. To address this challenge, researchers have focused on developing explainable artificial intelligence (AI) algorithms. Explainable AI aims to provide a clear, human-understandable explanation of the model's decision, thereby building more reliable systems. However, the explanation task differs from the well-known inference and training processes because it involves interaction with the user. Consequently, existing inference and training accelerators are inefficient when processing explainable AI on edge devices. This article introduces the explainable processing unit (EPU), the first hardware accelerator designed for explainable AI workloads. The EPU uses a novel data compression format for output heat maps and intermediate gradients to improve overall system performance by reducing both memory footprint and external memory access. Its sparsity-free computing core handles input sparsity with negligible control overhead, yielding a throughput boost of up to 9.48×. The EPU also adopts dynamic workload scheduling with a customized on-chip network for the distinct inference and explanation tasks, maximizing internal data reuse and reducing external memory access by 63.7%. Furthermore, it incorporates point-wise gradient pruning (PGP), which reduces the size of heat maps by a factor of 7.01× when combined with the proposed compression format. Finally, the EPU chip, fabricated in a 28 nm CMOS process, achieves a heat map generation rate of 367 frames/s for ResNet-34 while maintaining state-of-the-art area and energy efficiency of 112.3 GOPS/mm² and 26.55 TOPS/W, respectively.
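As a rough illustration of the two ideas named in the abstract, the sketch below prunes a heat map by keeping only its largest-magnitude entries and then stores the survivors in a simple sparse format. The magnitude criterion, the keep ratio, and the (index, value) encoding are stand-in assumptions for illustration; the EPU's actual PGP rule and compression format are defined in the paper itself.

import numpy as np

# Illustrative stand-ins only: a magnitude threshold approximates point-wise
# gradient pruning, and (index, value) pairs approximate a compressed heat map.
def pointwise_prune(heat_map, keep_ratio=0.15):
    """Zero out all but the largest-magnitude fraction of heat-map entries."""
    flat = np.abs(heat_map).ravel()
    k = max(1, int(keep_ratio * flat.size))
    threshold = np.partition(flat, -k)[-k]  # k-th largest magnitude
    return np.where(np.abs(heat_map) >= threshold, heat_map, 0.0)

def compress(sparse_map):
    """Keep only non-zero entries as (flat index, value) pairs."""
    idx = np.flatnonzero(sparse_map)
    return idx.astype(np.uint32), sparse_map.ravel()[idx].astype(np.float16)

heat_map = np.random.randn(56, 56).astype(np.float32)
indices, values = compress(pointwise_prune(heat_map))
print("compression ratio:", heat_map.nbytes / (indices.nbytes + values.nbytes))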


... In highly dynamic scenarios such as real-time network slicing and vehicular network slicing, the time and energy efficiency of XAI methods is critical. Lightweight or simplified interpretability models represent one of the future research directions, ensuring efficient operation on resource-constrained edge nodes [232], [233]. Techniques such as model compression, quantization, and the use of specialized hardware accelerators can significantly reduce the energy consumption of XAI methods [234], [235]. Furthermore, exploring decentralized or distributed XAI approaches could further enhance energy efficiency and reduce latency [236]. ...
Article
Full-text available
The unprecedented advancement of Artificial Intelligence (AI) has positioned Explainable AI (XAI) as a critical enabler in addressing the complexities of next-generation wireless communications. With the evolution of 6G networks, characterized by ultra-low latency, massive data rates, and intricate network structures, the need for transparency, interpretability, and fairness in AI-driven decision-making has become more urgent than ever. This survey provides a comprehensive review of the current state and future potential of XAI in communications, with a focus on network slicing, a fundamental technology for resource management in 6G. By systematically categorizing XAI methodologies, ranging from model-agnostic to model-specific approaches and from pre-model to post-model strategies, this paper identifies their unique advantages, limitations, and applications in wireless communications. Moreover, the survey emphasizes the role of XAI in network slicing for vehicular networks, highlighting its ability to enhance transparency and reliability in scenarios requiring real-time decision-making and high-stakes operational environments. Real-world use cases are examined to illustrate how XAI-driven systems can improve resource allocation, facilitate fault diagnosis, and meet regulatory requirements for ethical AI deployment. By addressing the inherent challenges of applying XAI in complex, dynamic networks, this survey offers critical insights into the convergence of XAI and 6G technologies. Future research directions, including scalability, real-time applicability, and interdisciplinary integration, are discussed, establishing a foundation for advancing transparent and trustworthy AI in 6G communications systems.
Article
Full-text available
With advances in NanoSats (CubeSats) and high-resolution sensors, the amount of raw data to be analyzed by human supervisors has been increasing explosively in satellite image analysis. To reduce the raw data, onboard AI processing with low-power COTS (Commercial Off-The-Shelf) hardware has emerged from a real satellite mission. It filters out useless data (e.g., cloudy images) that is worthless to supervisors, achieving efficient satellite-ground station communication. In complex object-recognition applications, however, additional explanation is required for the reliability of the AI prediction due to its low performance. Although various eXplainable AI (XAI) methods for providing human-interpretable explanations have been studied, the pyramid architecture in a deep network leads to the background bias problem, in which the visual explanation focuses only on the background context around the object. Missing small objects in a tiny region leads to poor explainability even when the AI model predicts the correct object class. To resolve these problems, we propose a novel federated onboard-ground station (FOGS) computing scheme with a Cascading Pyramid Attention Network (CPANet) for reliable onboard XAI in object recognition. We present an XAI architecture with a cascading attention mechanism that mitigates the background bias for onboard processing. By exploiting the localization ability of pyramid feature blocks, we can extract high-quality visual explanations covering both the semantic and small-scale contexts of an object. For enhancing the visual explainability of complex satellite images, we also describe a novel computing federation with the ground station and supervisors. In the ground station, active-learning-based sample selection and an attention refinement scheme with a simple feedback method are used to achieve robust explanations and a low supervisor annotation cost simultaneously. Experiments on various datasets show that the proposed system improves object-recognition accuracy and produces accurate visual explanations that detect the small contexts of objects even in peripheral regions. Our attention refinement mechanism also demonstrates that inconsistent explanations can be efficiently resolved with only very simple selection-based feedback.
Article
Full-text available
The class activation maps are generated from the final convolutional layer of CNN. They can highlight discriminative object regions for the class of interest. These discovered object regions have been widely used for weakly-supervised tasks. However, due to the small spatial resolution of the final convolutional layer, such class activation maps often locate coarse regions of the target objects, limiting the performance of weakly-supervised tasks that need pixel-accurate object locations. Thus, we aim to generate more fine-grained object localization information from the class activation maps to locate the target objects more accurately. In this paper, by rethinking the relationships between the feature maps and their corresponding gradients, we propose a simple yet effective method, called LayerCAM. It can produce reliable class activation maps for different layers of CNN. This property enables us to collect object localization information from coarse (rough spatial localization) to fine (precise fine-grained details) levels. We further integrate them into a high-quality class activation map, where the object-related pixels can be better highlighted. To evaluate the quality of the class activation maps produced by LayerCAM, we apply them to weakly-supervised object localization and semantic segmentation. Experiments demonstrate that the class activation maps generated by our method are more effective and reliable than those by the existing attention methods. The code will be made publicly available.
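A minimal sketch of the element-wise weighting LayerCAM describes, run on synthetic tensors: in practice the activations and gradients would come from a chosen CNN layer's forward and backward passes for the target class, which are omitted here.

import numpy as np

def layer_cam(activations, gradients):
    """activations, gradients: (C, H, W) from one layer for one class score."""
    weights = np.maximum(gradients, 0.0)  # element-wise ReLU on the gradients
    cam = np.maximum((weights * activations).sum(axis=0), 0.0)
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)  # normalize to [0, 1]

acts = np.random.rand(64, 28, 28)    # placeholder for real layer activations
grads = np.random.randn(64, 28, 28)  # placeholder for real class-score gradients
print(layer_cam(acts, grads).shape)  # (28, 28) heat map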
Article
Full-text available
Explainability is essential for users to effectively understand, trust, and manage powerful artificial intelligence applications.
Article
Full-text available
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in a captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model families: (1) CNNs with fully-connected layers (e.g., VGG), (2) CNNs used for structured outputs (e.g., captioning), (3) CNNs used in tasks with multi-modal inputs (e.g., visual question answering) or reinforcement learning, all without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention-based models learn to localize discriminative regions of the input image. We devise a way to identify important neurons through Grad-CAM and combine it with neuron names (Bau et al. in Computer vision and pattern recognition, 2017) to provide textual explanations for model decisions. Finally, we design and conduct human studies to measure whether Grad-CAM explanations help users establish appropriate trust in predictions from deep networks, and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions. Our code is available at https://github.com/ramprs/grad-cam/, along with a demo on CloudCV (Agrawal et al., in: Mobile cloud visual media computing, pp 265–290. Springer, 2015) (http://gradcam.cloudcv.org) and a video at http://youtu.be/COjUB9Izk6E.
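The weighting scheme the abstract describes can be sketched in a few lines; the synthetic tensors below stand in for the last convolutional layer's activations and the gradients of the target class score, which a real pipeline would obtain from a forward and backward pass.

import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (C, H, W) at the final convolutional layer."""
    alphas = gradients.mean(axis=(1, 2))  # global-average-pooled gradients per channel
    cam = np.maximum(np.tensordot(alphas, activations, axes=1), 0.0)  # ReLU(sum_k alpha_k * A_k)
    return cam / (cam.max() + 1e-8)  # normalize to [0, 1]

acts = np.random.rand(512, 14, 14)    # placeholder activations
grads = np.random.randn(512, 14, 14)  # placeholder gradients of the class score
print(grad_cam(acts, grads).shape)    # (14, 14) coarse localization map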
Conference Paper
Full-text available
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X–30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X–80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
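For readers unfamiliar with the arithmetic the TPU's MAC array performs, the snippet below shows the same core operation in miniature: an 8-bit matrix multiply accumulated at 32-bit precision. The dimensions are toy values rather than the 256x256 systolic array, and nothing about the TPU's dataflow is modeled.

import numpy as np

def int8_matmul(a, b):
    """a: (M, K) int8 activations, b: (K, N) int8 weights -> int32 accumulators."""
    return a.astype(np.int32) @ b.astype(np.int32)  # widen before accumulating

acts = np.random.randint(-128, 128, size=(4, 256), dtype=np.int8)
wts = np.random.randint(-128, 128, size=(256, 8), dtype=np.int8)
out = int8_matmul(acts, wts)
print(out.shape, out.dtype)  # (4, 8) int32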
Article
Full-text available
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
In this paper, we propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
Conference Paper
Full-text available
In order to achieve autonomous operation of a vehicle in urban situations with unpredictable traffic, several realtime systems must interoperate, including environment perception, localization, planning, and control. In addition, a robust vehicle platform with appropriate sensors, computational hardware, networking, and software infrastructure is essential. We previously published an overview of Junior, Stanford's entry in the 2007 DARPA Urban Challenge. This race was a closed-course competition which, while historic and inciting much progress in the field, was not fully representative of the situations that exist in the real world. In this paper, we present a summary of our recent research towards the goal of enabling safe and robust autonomous operation in more realistic situations. First, a trio of unsupervised algorithms automatically calibrates our 64-beam rotating LIDAR with accuracy superior to tedious hand measurements. We then generate high-resolution maps of the environment which are subsequently used for online localization with centimeter accuracy. Improved perception and recognition algorithms now enable Junior to track and classify obstacles as cyclists, pedestrians, and vehicles; traffic lights are detected as well. A new planning system uses this incoming data to generate thousands of candidate trajectories per second, choosing the optimal path dynamically. The improved controller continuously selects throttle, brake, and steering actuations that maximize comfort and minimize trajectory error. All of these algorithms work in sun or rain and during the day or night. With these systems operating together, Junior has successfully logged hundreds of miles of autonomous operation in a variety of real-life conditions.
Article
Recently, on-device training has become crucial for the success of edge intelligence. However, frequent data movement between computing units and memory during training has been a major problem for battery-powered edge devices. Processing-in-memory (PIM) is a novel computing paradigm that merges computing logic into memory and can address the data-movement problem with excellent power efficiency. However, previous PIM accelerators cannot support the entire training process on chip due to its computing complexity. This article presents a PIM accelerator for end-to-end on-device training (T-PIM), the first PIM realization that enables end-to-end on-device training as well as high-speed inference. Its full-custom PIM macro contains 8T-SRAM cells to perform the energy-efficient in-cell AND operation, and its bit-serial computation logic enables fully variable bit precision for input data. The macro supports various data mapping methods and computational paths for both fully connected and convolutional layers in order to handle the complex training process. An efficient tiling scheme is also proposed to enable T-PIM to compute any size of deep neural network with the implemented hardware. In addition, configurable arithmetic units in the forward propagation path let T-PIM handle power-of-two bit precision for weight data, enabling a significant performance boost during inference. Moreover, T-PIM efficiently handles sparsity in both operands by skipping the computation of zeros in the input data and by gating off computing units when the weight data are zero. Finally, we fabricate the T-PIM chip in 28-nm CMOS technology, occupying a die area of 5.04 mm², including five T-PIM cores. It dissipates 5.25–51.23 mW at 50–280 MHz operating frequency with a 0.75–1.05-V supply voltage. We successfully demonstrate that T-PIM can run the end-to-end training of the VGG16 model on the CIFAR10 and CIFAR100 datasets, achieving 0.13–161.08 and 0.25–7.59 TOPS/W power efficiency during inference and training, respectively. The result shows that T-PIM is 2.02× more energy-efficient than the state-of-the-art PIM chip that supports only backward propagation rather than whole training. Furthermore, we conduct an architectural experiment using a cycle-level simulator based on actual measurement results, which suggests that the T-PIM architecture is scalable and that its scaled-up version provides up to 203.26× higher power efficiency than a comparable GPU.
Article
FPGA is a promising platform for designing hardware due to its design flexibility and fast development cycle, despite the device's limited hardware resources. To address this, the latest FPGAs have adopted a multi-die architecture that employs multiple dies in a single device to provide abundant hardware resources. However, the multi-die architecture causes critical timing issues when signal paths cross die-to-die boundaries, adding another design challenge in using FPGAs. We propose OpenMDS, an open-source shell generation framework for high-performance design on Xilinx multi-die FPGAs. Based on the user's design requirements, it generates an optimized shell for the target FPGA via die-level kernel encapsulation, automated bus pipelining, and customized floorplanning. To evaluate our shell generation, we compare its implementation results against Xilinx's Vitis framework. As a result, OpenMDS uses on average 20% fewer logic resources than Vitis for the same shell functionality. To show its practicality, we use OpenMDS to design a machine learning accelerator that contains multiple systolic-array processors. OpenMDS achieves 247 MHz and 235 MHz kernel frequencies and 400 MHz and 429 MHz memory bus frequencies for U50 and U280, respectively, for the accelerator design at over 90% logic utilization, up to 12.27% and 22.92% higher kernel and memory bus frequencies than Vitis.
Article
This article presents HNPU, an energy-efficient deep neural network (DNN) training processor designed through algorithm-hardware co-design. The HNPU supports stochastic dynamic fixed-point representation and a layer-wise adaptive precision searching unit for low-bit-precision training. It additionally utilizes slice-level reconfigurability and sparsity to maximize its efficiency in both DNN inference and training. An adaptive-bandwidth reconfigurable accumulation network enables reconfigurable DNN allocation and maintains high core utilization even under various bit-precision conditions. Fabricated in a 28-nm process, the HNPU achieves at least 5.9× higher energy efficiency and 2.5× higher area efficiency in actual DNN training compared with previous state-of-the-art on-chip learning processors.
Chapter
Weakly supervised object localization (WSOL) is the task of localizing an object in an image using only image-level labels. To tackle the WSOL problem, most previous studies have followed the conventional class activation mapping (CAM) pipeline: (i) training CNNs for a classification objective, (ii) generating a class activation map via global average pooling (GAP) on feature maps, and (iii) extracting bounding boxes by thresholding based on the maximum value of the class activation map. In this work, we reveal that the current CAM approach suffers from three fundamental issues: (i) the bias of GAP, which assigns a higher weight to a channel with a small activation area, (ii) negatively weighted activations inside the object regions, and (iii) instability from using the maximum value of a class activation map as a thresholding reference. Together, they cause the localization to be highly limited to small regions of an object. We propose three simple but robust techniques that alleviate these problems: thresholded average pooling, negative weight clamping, and a percentile as the standard for thresholding. Our solutions are universally applicable to any WSOL method using CAM and improve performance drastically. As a result, we achieve new state-of-the-art performance on three benchmark datasets: CUB-200-2011, ImageNet-1K, and OpenImages30K.
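Of the three fixes, the percentile-based thresholding is easy to illustrate in isolation; the sketch below binarizes a class activation map at an assumed percentile and returns the box enclosing the activated pixels. The other two fixes (thresholded average pooling and negative weight clamping) act inside the network and are not shown.

import numpy as np

def localize(cam, percentile=70.0):
    """Binarize a CAM at a percentile of its values (not a fraction of its max)
    and return the bounding box enclosing all activated pixels."""
    mask = cam >= np.percentile(cam, percentile)
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())  # (x1, y1, x2, y2)

cam = np.random.rand(14, 14)  # placeholder class activation map
print(localize(cam))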
Chapter
We propose Deep Feature Factorization (DFF), a method capable of localizing similar semantic concepts within an image or a set of images. We use DFF to gain insight into a deep convolutional neural network’s learned features, where we detect hierarchical cluster structures in feature space. This is visualized as heat maps, which highlight semantically matching regions across a set of images, revealing what the network ‘perceives’ as similar. DFF can also be used to perform co-segmentation and co-localization, and we report state-of-the-art results on these tasks.
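DFF factorizes non-negative CNN activations, so a small non-negative matrix factorization over synthetic ReLU features gives the flavor of the method; the rank, iteration count, and random features here are assumptions for illustration, and a real pipeline would factorize activations from an actual network.

import numpy as np

def nmf(A, k=4, iters=200, eps=1e-8):
    """A: (C, H*W) non-negative features -> W: (C, k) concepts, H: (k, H*W) heat maps."""
    rng = np.random.default_rng(0)
    W = rng.random((A.shape[0], k))
    H = rng.random((k, A.shape[1]))
    for _ in range(iters):  # standard multiplicative updates for Frobenius NMF
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

feats = np.maximum(np.random.randn(256, 14 * 14), 0)  # ReLU features are non-negative
W, H = nmf(feats, k=4)
print(H.reshape(4, 14, 14).shape)  # one heat map per discovered concept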
Conference Paper
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce a three-stage pipeline: pruning, quantization, and Huffman encoding, which work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman encoding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache, which has 180x less access energy, rather than off-chip DRAM memory.
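A toy version of the first two pipeline stages is sketched below: magnitude pruning followed by k-means weight sharing, with Huffman coding of the resulting indices omitted. The threshold, cluster count, and plain-NumPy k-means are illustrative choices, not the paper's training-in-the-loop procedure.

import numpy as np

def prune(weights, threshold=0.05):
    """Magnitude pruning: zero out small-magnitude connections."""
    return np.where(np.abs(weights) > threshold, weights, 0.0)

def kmeans_quantize(weights, n_clusters=32, iters=20):
    """Cluster the surviving weights so each stores only a small codebook index."""
    nz = weights[weights != 0]
    centroids = np.linspace(nz.min(), nz.max(), n_clusters)
    for _ in range(iters):
        assign = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centroids[c] = nz[assign == c].mean()
    return centroids, assign  # codebook plus per-weight cluster indices

w = np.random.randn(1000) * 0.1
codebook, codes = kmeans_quantize(prune(w))
print(codebook.size, codes.size)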
Conference Paper
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
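The map itself is just a weighted sum of the last convolutional layer's feature maps, using the classifier weights that follow global average pooling; the sketch below shows that step on synthetic tensors, with the network and the pooling omitted.

import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (C, H, W); fc_weights: (num_classes, C) after GAP."""
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)

feats = np.random.rand(512, 14, 14)  # placeholder last-layer feature maps
w_fc = np.random.randn(1000, 512)    # placeholder classifier weights
print(class_activation_map(feats, w_fc, class_idx=207).shape)  # (14, 14)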
Article
Neural networks are a family of powerful machine learning models. This book focuses on the application of neural network models to natural language data. The first half of the book (Parts I and II) covers the basics of supervised machine learning and feed-forward neural networks, the basics of working with machine learning over language data, and the use of vector-based rather than symbolic representations for words. It also covers the computation-graph abstraction, which allows one to easily define and train arbitrary neural networks, and is the basis behind the design of contemporary neural network software libraries. The second part of the book (Parts III and IV) introduces more specialized neural network architectures, including 1D convolutional neural networks, recurrent neural networks, conditioned-generation models, and attention-based models. These architectures and techniques are the driving force behind state-of-the-art algorithms for machine translation, syntactic parsing, and many other a...
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
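The residual reformulation is compact enough to show directly: the block below computes y = F(x) + x with a toy two-layer transform standing in for the paper's convolution and batch-normalization stack.

import numpy as np

def residual_block(x, w1, w2):
    """Learn F(x) relative to the identity: the output is ReLU(F(x) + x)."""
    f = np.maximum(x @ w1, 0.0) @ w2  # F(x): two toy linear layers with a ReLU
    return np.maximum(f + x, 0.0)     # skip connection adds the input back

x = np.random.randn(8, 64)
w1 = np.random.randn(64, 64) * 0.1
w2 = np.random.randn(64, 64) * 0.1
print(residual_block(x, w1, w2).shape)  # (8, 64)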
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
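The per-mini-batch transform is short enough to write out; gamma and beta below are the learned scale and shift, and the running statistics used at inference time are omitted from this sketch.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # learned scale and shift restore capacity

x = np.random.randn(32, 10) * 3.0 + 1.0
out = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature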
Article
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
A consistent and efficient evaluation strategy for attribution methods
  • Y. Rong, T. Leemann, V. Borisov, G. Kasneci, E. Kasneci
Axiom-based Grad-CAM: Towards accurate visualization and explanation of CNNs
  • R. Fu, Q. Hu, X. Dong, Y. Guo, Y. Gao, B. Li