Conference Paper

ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers

... Instead, they compute the denominator online using a temporary maximum value, which is updated dynamically whenever a new maximum is encountered. A more general method for approximating any nonlinearity within Transformer networks, introduced by Yu et al. [38] and applied in ViTA [39], involves training a two-layer fully-connected neural network with ReLU activation to replicate the nonlinear functions. This network is then replaced by a look-up table, enabling the approximation of these functions with a single look-up and one multiply-accumulate operation. ...
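As a concrete illustration of the look-up-plus-MAC scheme described above, the sketch below fits a piecewise-linear table to GELU offline and then evaluates it with one table read and one multiply-accumulate per input. The table size, input range, and the choice of GELU are assumptions made for the example, and for brevity the table is fitted directly to the target function rather than to a trained two-layer ReLU network as in [38].

```python
import numpy as np

def gelu(x):
    # Reference nonlinearity to approximate (tanh form of GELU).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Offline: build a piecewise-linear look-up table over an assumed input range.
LO, HI, ENTRIES = -8.0, 8.0, 256
xs    = np.linspace(LO, HI, ENTRIES + 1)
BASE  = gelu(xs[:-1])                                         # value at each segment's left edge
SLOPE = (gelu(xs[1:]) - gelu(xs[:-1])) / (xs[1:] - xs[:-1])   # slope within each segment
STEP  = (HI - LO) / ENTRIES

def gelu_lut(x):
    # Online: one table look-up plus one multiply-accumulate per element.
    idx  = np.clip(((x - LO) / STEP).astype(int), 0, ENTRIES - 1)
    frac = x - (LO + idx * STEP)              # offset within the selected segment
    return BASE[idx] + SLOPE[idx] * frac      # look-up + MAC

x = np.random.randn(1000) * 3
print(np.max(np.abs(gelu_lut(x) - gelu(x))))  # worst-case approximation error
```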
... Transformer inference accelerators commonly target integer formats [20], [21], [39], [40], yielding higher power and area efficiency than floating-point designs, but at the cost of requiring model quantization (of both weights and activations), which is not always possible, as models may have been trained on large, non-public datasets or with human feedback. Other processors, such as the one developed by Tambe et al. [36], combine several techniques: very low-bitwidth floating-point formats (down to 8 or 4 bits), efficient handling of sparse matrices, and early-exit algorithms that skip the computation of superfluous layers altogether. ...
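The weight-and-activation quantization mentioned above typically maps tensors to low-bitwidth integers through a scaling factor; the sketch below shows a minimal symmetric per-tensor int8 scheme as an assumed example, not the calibration flow of any of the cited accelerators.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: map the float range to signed 8-bit integers.
    scale = max(np.max(np.abs(x)), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)   # e.g. a weight tile
q, s = quantize_int8(w)
print(np.mean(np.abs(dequantize(q, s) - w)))     # mean absolute quantization error
```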
... However, the applicability of all of these techniques is very limited when targeting Transformer networks that have not been retrained for this specific purpose. ViTA [39] achieves major area and efficiency gains by optimizing its architecture specifically for the Vision Transformer, significantly reducing memory usage and data movement. Similarly, Dumoulin et al. [40] focus specifically on hybrid CNN-Transformer vision models, optimizing the execution of such architectures by exploiting operation reordering and layer fusion, but limiting the accelerator to a single class of models. ...
Preprint
Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need for sophisticated non-linearities such as softmax and GELU. Even though Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256 KiB of shared SRAM, 8 general-purpose RISC-V cores, a 24x8 systolic array MatMul accelerator, and a novel accelerator for Transformer softmax and GELU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (121x speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). In 12nm technology, SoftEx occupies 0.039 mm², only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to 10.8x and 5.11x, respectively, while reducing their energy consumption by up to 10.8x and 5.29x. These enhancements translate into a 1.58x increase in throughput (310 GOPS at 0.8 V) and a 1.42x improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.
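For reference, the two non-linearities that SoftEx targets are written out below as plain NumPy functions; this is only the mathematical baseline the accelerator approximates, not SoftEx's exponentiation algorithm, whose details the abstract does not give.

```python
import numpy as np
from scipy.special import erf

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exponentiating.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def gelu(x):
    # Exact GELU, defined through the Gaussian CDF.
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

scores = np.random.randn(4, 16)          # e.g. one row of attention scores per query
print(softmax(scores).sum(axis=-1))      # each row sums to 1
print(gelu(np.array([-2.0, 0.0, 2.0])))
```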
... Overall, these studies collectively demonstrate the growing interest and advances in the use of CNN models for IDC detection and classification in breast cancer histopathological images. Recent studies have also explored the use of advanced architectures in other domains, such as ViTAE-SL [19], a vision transformer-based autoencoder for spatial field reconstruction, and deep learning surrogate models for global wildfire prediction [20,21]. These approaches highlight innovative modeling strategies that could be adapted for medical imaging tasks in future research. ...
Article
Full-text available
Invasive ductal carcinoma (IDC) is the most common type of breast cancer, accounting for approximately 80% of cases. Accurate and early diagnosis of IDC is critical for effective treatment and improved patient survival rates. This study explores the use of convolutional neural networks (CNN) for the classification of IDC in histological breast tissue images, aiming to develop a computer-aided diagnostic (CAD) system that can support pathologists in identifying cancerous tissues. Using a public dataset of 5,547 labeled images, resized to 50x50 pixels to balance computational efficiency and the retention of diagnostically relevant features, we trained a CNN model optimized for binary classification (IDC vs. non-IDC). The preprocessing steps included image normalization and class balancing, with training and validation sets split in an 80:20 ratio. The CNN architecture utilized three convolutional layers with batch normalization and max-pooling, a dense layer with ReLU activation, and a final sigmoid-activated output layer. The model achieved an accuracy of 78%, with precision, recall, and F1-scores all at 0.78, and an area under the ROC curve (AUC) of 0.84, indicating effective discrimination between classes. These results suggest that CNN-based models hold promise for aiding in IDC diagnosis, although further research is needed to improve model performance. Future work will focus on exploring advanced architectures, data augmentation, and transfer learning to improve sensitivity and clinical applicability.
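A minimal sketch of the kind of model the abstract describes: three convolutional blocks with batch normalization and max-pooling, a dense ReLU layer, and a sigmoid output over 50x50 inputs. Filter counts, kernel sizes, and the optimizer are assumptions, since the abstract does not report them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_idc_cnn(input_shape=(50, 50, 3)):
    # Three conv blocks with batch normalization and max-pooling, as in the abstract;
    # filter counts and kernel sizes are illustrative assumptions.
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (32, 64, 128):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))    # dense ReLU layer
    model.add(layers.Dense(1, activation="sigmoid"))   # binary IDC vs. non-IDC output
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC()])
    return model

model = build_idc_cnn()
model.summary()
```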
Article
Full-text available
The systolic array is frequently used in accelerators for neural networks, including Transformer models that have recently achieved remarkable progress in natural language processing (NLP) and machine translation. Due to the constraints of FPGA EDA (Field Programmable Gate Array Electronic Design Automation) tools and the limitations of design methodology, existing systolic array accelerators for FPGA deployment often cannot achieve high frequency. In this work, we propose a well-designed high-frequency systolic array for an FPGA-based Transformer accelerator, which is capable of performing the Multi-Head Attention (MHA) block and the position-wise Feed-Forward Network (FFN) block, reaching 588 MHz and 474 MHz for different array sizes and achieving frequency improvements of 1.8× and 1.5× on a Xilinx ZCU102 board, while drastically reducing resource usage compared to similar recent works and pushing each DSP slice to a higher level of utilization. We also propose a semi-automatic design flow with constraint-generating tools as a general solution for FPGA-based high-frequency systolic array deployment.
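As background on the dataflow being accelerated, the sketch below is a cycle-by-cycle behavioral model of an output-stationary systolic array computing a matrix product with skewed operand feeding; it illustrates the general principle only and is not the paper's FPGA design.

```python
import numpy as np

def systolic_matmul(A, B):
    # Behavioral model of an output-stationary systolic array: C[i, j] accumulates
    # inside PE(i, j); A enters from the left and B from the top, each skewed so that
    # operand pair (A[i, k], B[k, j]) meets in PE(i, j) at cycle t = i + j + k.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for t in range(M + N + K - 2):            # cycles until the skewed inputs drain
        for i in range(M):
            for j in range(N):
                k = t - i - j                 # which operand pair reaches PE(i, j) now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.randn(6, 4)
B = np.random.randn(4, 5)
print(np.allclose(systolic_matmul(A, B), A @ B))   # True
```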
Article
Full-text available
In this work, a novel architecture, named pseudo-softmax, for computing an approximated form of the softmax function is presented. This architecture can be fruitfully used in the last layer of Neural Networks and Convolutional Neural Networks for classification tasks, and in Reinforcement Learning hardware accelerators to compute the Boltzmann action-selection policy. The proposed pseudo-softmax design, intended for efficient hardware implementation, exploits the typical integer quantization of hardware-based Neural Networks to obtain an accurate approximation of the result. In the paper, a detailed description of the architecture is given and an extensive analysis of the approximation error is performed, using both custom stimuli and real-world Convolutional Neural Network inputs. The implementation results, based on CMOS standard-cell technology, show reduced approximation errors compared to state-of-the-art architectures.
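The paper's exact pseudo-softmax datapath is not reproduced here; as a generic illustration of an integer-friendly approximation in the same spirit, the sketch below replaces e^z with 2^z, which preserves the ordering of the outputs and maps naturally onto shift-based hardware when the inputs are integer-quantized.

```python
import numpy as np

def pow2_softmax(x):
    # Base-2 variant of softmax: 2^z instead of e^z. The function is monotonic, so the
    # argmax and the ordering of the probabilities are preserved, while the exponential
    # reduces to shifts when z is an integer-quantized value in hardware.
    z = x - np.max(x)              # max-subtraction keeps 2^z in (0, 1]
    p = np.exp2(z)
    return p / p.sum()

logits = np.array([2.0, 0.5, -1.0, 3.5])
print(pow2_softmax(logits), pow2_softmax(logits).sum())
```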
Article
Transformers are the mainstream of NLP applications and are becoming increasingly popular in other domains such as Computer Vision. Despite the improvements in model quality, the enormous computation costs make Transformers difficult to deploy, especially when the sequence length is large in emerging applications. The attention mechanism, the essential component of the Transformer, is the execution bottleneck due to its quadratic complexity. Prior art explores sparse attention patterns to support long-sequence modeling, but those works rely on static or fixed patterns. We demonstrate that the sparse patterns are dynamic, depending on the input sequences. Thus, we propose Dynamic Sparse Attention (DSA), which can efficiently exploit dynamic sparse patterns in attention. Compared with other methods, our approach achieves better trade-offs between accuracy and model complexity. Moving forward, we identify challenges and provide solutions to implement DSA on existing hardware (GPUs) and specialized hardware in order to achieve practical speedup and efficiency improvements for Transformer execution.
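A minimal NumPy sketch of the idea, assuming a simple per-query top-k criterion as the dynamic, input-dependent sparsity pattern (the paper's actual pattern-prediction mechanism may differ): only the k highest-scoring keys of each query survive the softmax. The sketch still computes the dense score matrix for clarity, so it shows the semantics rather than the speedup.

```python
import numpy as np

def dynamic_topk_attention(Q, K, V, k=8):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) attention scores
    # Input-dependent sparsity: keep the k largest scores in each row.
    kth = np.sort(scores, axis=-1)[:, -k][:, None]  # k-th largest score per query
    masked = np.where(scores >= kth, scores, -np.inf)
    # Softmax over the surviving entries only; exp(-inf) = 0 drops the masked keys.
    z = masked - masked.max(axis=-1, keepdims=True)
    w = np.exp(z)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

n, d = 64, 32
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(dynamic_topk_attention(Q, K, V, k=8).shape)   # (64, 32)
```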
Chapter
There has been a rapid development of custom hardware for accelerating the inference speed of deep neural networks (DNNs) by explicitly incorporating hardware metrics (e.g., area and energy) as additional constraints alongside application accuracy. Recent efforts have mainly focused on linear functions (matrix multiplication) in convolutional (Conv) or fully connected (FC) layers, while there is no publicly available study on optimizing the inference of non-linear functions in DNNs under hardware constraints.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Conference Paper
Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy. In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.
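To make the row-stationary mapping concrete, the sketch below decomposes a 2-D convolution into the 1-D row convolutions that each PE computes in the RS dataflow and accumulates them across filter rows, as the PE array would accumulate partial sums spatially; the array geometry and scheduling of the actual design are omitted.

```python
import numpy as np

def conv1d_valid(row, frow):
    # The 1-D primitive one PE computes: a filter row slid over an input-feature-map row.
    W, R = len(row), len(frow)
    return np.array([np.dot(row[j:j + R], frow) for j in range(W - R + 1)])

def conv2d_row_stationary(ifmap, filt):
    # Row-stationary decomposition: output row j is the sum, over filter rows i, of the
    # 1-D convolution of filter row i with input row i + j. In the RS dataflow these
    # partial sums are accumulated spatially across PEs rather than in a software loop.
    H, W = ifmap.shape
    R, S = filt.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for j in range(H - R + 1):                # output row (one PE column in the array)
        for i in range(R):                    # filter row (one PE in that column)
            out[j] += conv1d_valid(ifmap[i + j], filt[i])
    return out

ifmap = np.random.randn(8, 8)
filt  = np.random.randn(3, 3)
# Sanity check against a direct 2-D correlation with 'valid' padding.
ref = np.array([[np.sum(ifmap[y:y+3, x:x+3] * filt) for x in range(6)] for y in range(6)])
print(np.allclose(conv2d_row_stationary(ifmap, filt), ref))   # True
```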
Article
In this paper we introduce Chisel, a new hardware construction language that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages. By embedding Chisel in the Scala programming language, we raise the level of hardware design abstraction by providing concepts including object orientation, functional programming, parameterized types, and type inference. Chisel can generate a high-speed C++-based cycle-accurate software simulator, or low-level Verilog designed to map to either FPGAs or to a standard ASIC flow for synthesis. This paper presents Chisel, its embedding in Scala, hardware examples, and results for C++ simulation, Verilog emulation and ASIC synthesis.
Training data-efficient image transformers & distillation through attention
  • H Touvron
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
  • T Dao
  • D Fu
Verilator and SystemPerl
  • W Snyder
An image is worth 16×16 words: Transformers for image recognition at scale
  • A Dosovitskiy
I-BERT: Integer-only BERT quantization
  • S Kim
  • A Gholami
  • Z Yao
  • M W Mahoney
  • K Keutzer