Figure 4 - uploaded by Xiaofan Zhang
Example of a pair of sentences showing how the words in the source sentence map to the target sentence. The darker the line, the more related the pair of words linked by that line [15].
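To make the figure concrete: in attention-based NMT, each target word carries a normalized weight over all source words, and the darkness of each line corresponds to that weight. The Python sketch below computes such an alignment matrix; it assumes plain dot-product attention, and all array names and sizes are illustrative rather than taken from [15].

```python
# Minimal sketch of how soft-attention weights induce the word alignments
# drawn in the figure. Assumes simple dot-product attention; all names and
# shapes here are illustrative, not taken from the cited implementation.
import numpy as np

def alignment_matrix(encoder_states, decoder_states):
    """encoder_states: (src_len, d), decoder_states: (tgt_len, d).
    Returns (tgt_len, src_len) attention weights; larger weight = darker line."""
    scores = decoder_states @ encoder_states.T           # (tgt_len, src_len)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax

# Toy usage: 5 source words, 4 target words, hidden size 8
rng = np.random.default_rng(0)
A = alignment_matrix(rng.normal(size=(5, 8)), rng.normal(size=(4, 8)))
print(A.argmax(axis=1))  # most related source word for each target word
```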


Source publication
Conference Paper
Full-text available
Neural machine translation (NMT) is a popular topic in Natural Language Processing which uses deep neural networks (DNNs) for translation from source to target languages. With emerging technologies, such as bidirectional Gated Recurrent Units (GRU), attention mechanisms, and beam-search algorithms, NMT can deliver improved translation quality...
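As a rough illustration of the beam-search component mentioned in this abstract, the sketch below keeps the top-k partial translations at each decoding step. The `step_fn` callback, token ids, and toy scorer are assumptions made for the example, not the paper's implementation.

```python
# Hedged sketch of a beam-search decoding loop. `step_fn` stands in for one
# decoder step (the bidirectional-GRU encoder states and attention are folded
# into it); the toy scorer and token ids below are assumptions.
import math

def beam_search(step_fn, bos, eos, beam_size=4, max_len=50):
    beams = [([bos], 0.0)]                      # (token sequence, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                finished.append((seq, score))   # hypothesis already complete
                continue
            for tok, logp in step_fn(seq):      # token -> log-probability
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    finished.extend(beams)
    # Return the length-normalized best hypothesis
    return max(finished, key=lambda x: x[1] / len(x[0]))

# Toy usage: a fake decoder step that always prefers token 7, then eos (=1)
toy_step = lambda seq: [(7, math.log(0.6)), (1, math.log(0.3)), (3, math.log(0.1))]
print(beam_search(toy_step, bos=0, eos=1, beam_size=2, max_len=5)[0])
```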

Similar publications

Article
Full-text available
Neural machine translation (NMT) is one of the most critical applications in natural language processing (NLP) with the main idea of converting text from one language to another using deep neural networks. In recent years, we have seen continuous development of NMT by integrating more emerging technologies, such as bidirectional gated recurrent uni...
Article
Full-text available
The transformer model has recently been a milestone in artificial intelligence. The algorithm has enhanced the performance of tasks such as Machine Translation and Computer Vision to a level previously unattainable. However, while the transformer model delivers strong performance, it also requires a high amount of memory overhead and enormous computing powe...

Citations

... This is a computational challenge for the central processing unit (CPU) as it consumes excessive power. Instead, hardware accelerators such as a graphics processing unit (GPU), field-programmable gate array (FPGA), and application-specific integrated circuit (ASIC) have been used to increase the throughput of CNNs [2,3]. When CNNs are integrated through hardware, latency is improved, and the energy consumption is reduced. ...
Article
Full-text available
Owing to their high accuracy, deep convolutional neural networks (CNNs) are extensively used. However, they are characterized by high complexity. Real-time performance and acceleration are required in current CNN systems. A graphics processing unit (GPU) is one possible solution to improve real-time performance; however, its power consumption ratio is poor owing to high power consumption. By contrast, field-programmable gate arrays (FPGAs) have lower power consumption and flexible architecture, making them more suitable for CNN implementation. In this study, we propose a method that offers both the speed of CNNs and the power and parallelism of FPGAs. This solution relies on two primary acceleration techniques—parallel processing of layer resources and pipelining within specific layers. Moreover, a new method is introduced for exchanging domain requirements for speed and design time by implementing an automatic parallel hardware–software co-design CNN using the software-defined system-on-chip tool. We evaluated the proposed method using five networks—MobileNetV1, ShuffleNetV2, SqueezeNet, ResNet-50, and VGG-16—on the ZCU102 FPGA. We experimentally demonstrated that our design has a higher speed-up than the conventional implementation method. The proposed method achieves 2.47×, 1.93×, and 2.16× speed-up on the ZCU102 for MobileNetV1, ShuffleNetV2, and SqueezeNet, respectively.
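For intuition on why pipelining across layers pays off, a back-of-the-envelope Python model is sketched below; the per-layer latencies and the assumption of a perfectly balanced pipeline are illustrative, not measurements from the paper.

```python
# Rough analytical sketch of why layer pipelining helps, under the (assumed)
# simplification that per-layer latencies are fixed and the pipeline throughput
# is set by its slowest stage.
layer_latency_ms = [4.0, 7.5, 7.5, 3.0, 2.0]   # illustrative per-layer numbers
n_frames = 100

sequential = n_frames * sum(layer_latency_ms)
# Pipelined: fill the pipeline once, then one result per slowest-stage interval
pipelined = sum(layer_latency_ms) + (n_frames - 1) * max(layer_latency_ms)

print(f"sequential: {sequential:.0f} ms, pipelined: {pipelined:.0f} ms, "
      f"speed-up: {sequential / pipelined:.2f}x")
```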
... Recent studies have shown that field-programmable gate arrays (FPGA) are promising candidates for deep neural network (DNN) implementation [1][2][3][4][5]. A DNN can be integrated via hardware, rather than via an existing central processing unit (CPU) or graphics processing unit (GPU), thus improving latency and reducing energy consumption. ...
Article
Full-text available
With the increasing use of multi-purpose artificial intelligence of things (AIOT) devices, embedded field-programmable gate arrays (FPGA) represent excellent platforms for deep neural network (DNN) acceleration on edge devices. FPGAs possess the advantages of low latency and high energy efficiency, but the scarcity of FPGA development resources challenges the deployment of DNN-based edge devices. Register-transfer level programming, hardware verification, and precise resource allocation are needed to build a high-performance FPGA accelerator for DNNs. These tasks present a challenge and are time consuming for even experienced hardware developers. Therefore, we propose an automated, collaborative design process employing an automatic design space exploration tool; an automatic DNN engine enables the tool to reshape and parse a DNN model from software to hardware. We also introduce a long short-term memory (LSTM)-based model to predict performance and generate a DNN model that suits the developer requirements automatically. We demonstrate our design scheme with three FPGAs: a zcu104, a zcu102, and a Cyclone V SoC (system on chip). The results show that our hardware-based edge accelerator exhibits superior throughput compared with the most advanced edge graphics processing unit.
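The design-space-exploration idea described above can be pictured as a loop that scores candidate configurations with a performance model and keeps the best one within the resource budget. The sketch below uses a toy analytical predictor as a stand-in for the paper's LSTM-based model; all factors, budgets, and formulas are assumptions for illustration.

```python
# Hedged sketch of a design-space-exploration loop. The analytical `predict`
# function stands in for a learned performance model; every factor, budget,
# and formula here is an illustrative assumption.
from itertools import product

DSP_BUDGET, BRAM_BUDGET = 2520, 912             # rough ZCU102-class limits

def predict(pe_rows, pe_cols, buffer_kb):
    dsp = pe_rows * pe_cols                     # one MAC per DSP (assumption)
    bram = buffer_kb // 4                       # ~4 KB per 36Kb BRAM (rough)
    gops = 0.2 * pe_rows * pe_cols * min(1.0, buffer_kb / 512)  # toy model
    return gops, dsp, bram

best = None
for r, c, b in product([8, 16, 32], [8, 16, 32, 64], [128, 256, 512, 1024]):
    gops, dsp, bram = predict(r, c, b)
    if dsp <= DSP_BUDGET and bram <= BRAM_BUDGET:
        if best is None or gops > best[0]:
            best = (gops, r, c, b)

print("best config (GOPS, rows, cols, buffer KB):", best)
```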
... Another research direction for FPGA accelerators of NNs is based on high-level synthesis (HLS) tools like Xilinx Vivado [12][13][14]. HLS allows users to build a network using a high-level language like C or C++, and then convert the design into register-transfer level (RTL). Compared with early generations of HLS, the latest HLS tools have made significant progress for design on FPGAs. ...
Article
Full-text available
This paper proposes field-programmable gate array (FPGA) acceleration of a scalable multi-layer perceptron (MLP) neural network for classifying handwritten digits. First, an investigation into the network architectures is conducted to find the optimal FPGA design corresponding to different classification rates. Then, as a case study, a specific single-hidden-layer MLP network is implemented with an eight-stage pipelined structure on a Xilinx Ultrascale FPGA. It mainly contains a timing controller designed in Verilog Hardware Description Language (HDL) and sigmoid neurons integrated from Xilinx IPs. Finally, experimental results show a greater than 10× speedup compared with prior implementations. The proposed FPGA architecture is expandable to other specifications with different accuracy (up to 95.82%) and hardware cost.
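A plain software reference of such a single-hidden-layer sigmoid MLP (the function the FPGA pipeline implements) can be sketched in a few lines; the layer sizes and random weights below are placeholders, not the paper's trained parameters.

```python
# Software reference sketch of a single-hidden-layer sigmoid MLP for digit
# classification. Sizes and weights here are illustrative placeholders, not
# the parameters of the accelerated design.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, w1, b1, w2, b2):
    """x: (784,) flattened 28x28 digit image -> (10,) class scores."""
    hidden = sigmoid(w1 @ x + b1)
    return sigmoid(w2 @ hidden + b2)

rng = np.random.default_rng(1)
w1, b1 = rng.normal(size=(64, 784)) * 0.01, np.zeros(64)
w2, b2 = rng.normal(size=(10, 64)) * 0.01, np.zeros(10)
print(int(np.argmax(mlp_forward(rng.random(784), w1, b1, w2, b2))))
```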
... DL is considered as one of the highly suggested techniques in machine learning for various NLP challenges such as SA (Kwaik et al., 2019;Luo, 2019), machine translation (Ameur et al., 2017;Li et al., 2019), named entity recognition (Khalifa and Shaalan, 2019), and speech recognition (Zerari et al., 2019;Algihab et al., 2019). The strength of DL is that, aside from its great performance, it does not rely on handcrafted features or external resources. ...
... To deal with this task (Li et al., 2019) explored BERT embeddings (Devlin et al., 2018) with various simple neural networks such as Linear, GRU, conditional random field, and self-attention layers. The experimental results showed BERT-based neural networks achieved higher results compared to non-BERT complex models. ...
Article
Full-text available
Aspect-based Sentiment analysis (ABSA) accomplishes a fine-grained analysis that defines the aspects of a given document or sentence and the sentiments conveyed regarding each aspect. This level of analysis is the most detailed version that is capable of exploring the nuanced viewpoints of the reviews. The bulk of research in ABSA focuses on English, with very little work available in Arabic. Most previous work in Arabic has been based on conventional machine learning methods that mainly depend on a group of rare resources and tools for analyzing and processing Arabic content, such as lexicons, but the lack of those resources presents another challenge. In order to address these challenges, Deep Learning (DL)-based methods are proposed using two models based on Gated Recurrent Unit (GRU) neural networks for ABSA. The first is a DL model that takes advantage of word and character representations by combining bidirectional GRU, Convolutional Neural Network (CNN), and Conditional Random Field (CRF), making up the BGRU-CNN-CRF model, to extract the main opinionated aspects (OTE). The second is an interactive attention network based on bidirectional GRU (IAN-BGRU) to identify sentiment polarity toward extracted aspects. We evaluated our models using the benchmarked Arabic hotel reviews dataset (https://github.com/msmadi/ABSA-Hotels) proposed by Pontiki et al. (SemEval-2016). The results indicate that the proposed methods are better than baseline research on both tasks, with a 39.7% enhancement in F1-score for opinion target extraction (T2) and 7.58% in accuracy for aspect-based sentiment polarity classification (T3), achieving an F1 score of 70.67% for T2 and an accuracy of 83.98% for T3.
... It thus helps reduce considerable amounts of lines of code compared to RTL programming languages (7× for a hardware design with one million logic gate [10]) and significantly improves the design efficiency [11]. With the HLS design flow, customized DNN accelerators have been developed to meet the needs of various AI applications, such as accelerating image classification [12]- [14], object detection [15]- [17], and language translation [18]- [20]. As most DNNs are developed by machine learning frameworks using Python, it allows an even higher design abstraction level and creates a greater gap between DNN designs on software and their hardware deployments. ...
Article
Full-text available
Deep neural network (DNN) based video analysis has become one of the most essential and challenging tasks to capture implicit information from video streams. Although DNNs significantly improve the analysis quality, they introduce intensive compute and memory demands and require dedicated hardware for efficient processing. The customized heterogeneous system is one of the promising solutions with general-purpose processors (CPUs) and specialized processors (DNN Accelerators). Among various heterogeneous systems, the combination of CPU and FPGA has been intensively studied for DNN inference with improved latency and energy consumption compared to CPU + GPU schemes and with increased flexibility and reduced time-to-market cost compared to CPU + ASIC designs. However, deploying DNN-based video analysis on CPU + FPGA systems still presents challenges from the tedious RTL programming, the intricate design verification, and the time-consuming design space exploration. To address these challenges, we present a novel framework, called EcoSys, to explore co-design and optimization opportunities on CPU-FPGA heterogeneous systems for accelerating video analysis. Novel technologies include 1) a coherent memory space shared by the host and the customized accelerator to enable efficient task partitioning and online DNN model refinement with reduced data transfer latency; 2) an end-to-end design flow that supports high-level design abstraction and allows rapid development of customized hardware accelerators from Python-based DNN descriptions; 3) a design space exploration (DSE) engine that determines the design space and explores the optimized solutions by considering the targeted heterogeneous system and user-specific constraints; and 4) a complete set of co-optimization solutions, including a layer-based pipeline, a feature map partition scheme, and an efficient memory hierarchical design for the accelerator and multi-threading programming for the CPU. In this paper, we demonstrate our design framework to accelerate the long-term recurrent convolution network (LRCN), which analyzes the input video and outputs one semantic caption for each frame. EcoSys can deliver 314.7 and 58.1 frames per second (FPS) by targeting the LRCN model with AlexNet and VGG-16 backbones, respectively. Compared to the multithreaded CPU and pure FPGA design, EcoSys achieves 20.6× and 5.3× higher throughput performance.
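One of the listed co-optimizations, the feature-map partition scheme, can be illustrated independently of the hardware: a large feature map is split into tiles small enough for on-chip buffers and processed tile by tile. The sketch below is a minimal software analogue; tile sizes and the per-tile operation are assumptions.

```python
# Minimal sketch of feature-map partitioning: split a large feature map into
# tiles that would fit in on-chip buffers and process them one at a time.
# Tile sizes and the per-tile "process" step are illustrative assumptions.
import numpy as np

def process_tile(tile):
    return tile * 2.0                      # stand-in for per-tile accelerator work

def tiled_process(fmap, tile_h=56, tile_w=56):
    out = np.empty_like(fmap)
    H, W = fmap.shape[-2:]
    for y in range(0, H, tile_h):
        for x in range(0, W, tile_w):
            out[..., y:y+tile_h, x:x+tile_w] = process_tile(
                fmap[..., y:y+tile_h, x:x+tile_w])
    return out

fmap = np.random.rand(64, 224, 224).astype(np.float32)  # C x H x W
print(tiled_process(fmap).shape)
```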
... In [23], a customized accelerator for image captioning is implemented on FPGA using high-level synthesis (HLS). Following the HLS-based design, more accelerators have been developed to meet the needs of various AI applications, such as accelerating image classification [25,39], face recognition [40,41], object detection [29,42], and language translation [43][44][45]. ...
Preprint
Full-text available
Customized hardware accelerators have been developed to provide improved performance and efficiency for DNN inference and training. However, the existing hardware accelerators may not always be suitable for handling various DNN models as their architecture paradigms and configuration tradeoffs are highly application-specific. It is important to benchmark the accelerator candidates in the earliest stage to gather comprehensive performance metrics and locate the potential bottlenecks. Further demands also emerge after benchmarking, which require adequate solutions to address the bottlenecks and improve the current designs for targeted workloads. To achieve these goals, in this paper, we leverage an automation tool called DNNExplorer for benchmarking customized DNN hardware accelerators and exploring novel accelerator designs with improved performance and efficiency. Key features include (1) direct support to popular machine learning frameworks for DNN workload analysis and accurate analytical models for fast accelerator benchmarking; (2) a novel accelerator design paradigm with high-dimensional design space support and fine-grained adjustability to overcome the existing design drawbacks; and (3) a design space exploration (DSE) engine to generate optimized accelerators by considering targeted AI workloads and available hardware resources. Results show that accelerators adopting the proposed novel paradigm can deliver up to 4.2X higher throughput (GOP/s) than the state-of-the-art pipeline design in DNNBuilder and up to 2.0X improved efficiency than the recently published generic design in HybridDNN given the same DNN model and resource budgets. With DNNExplorer's benchmarking and exploration features, we can be ahead at building and optimizing customized AI accelerators and enable more efficient AI applications.
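The "accurate analytical models for fast accelerator benchmarking" mentioned here typically follow a roofline-style argument: latency is bounded by whichever of compute or memory bandwidth is the bottleneck. The sketch below is a generic version of that reasoning with made-up hardware numbers, not DNNExplorer's actual model.

```python
# Hedged sketch of a roofline-style analytical model: estimated latency is
# bounded by either compute or memory bandwidth. All numbers and the model
# itself are illustrative assumptions, not the tool's published model.
def estimate_latency_ms(gflop, mem_gb, peak_gflops, bandwidth_gbs):
    compute_bound = gflop / peak_gflops          # seconds if compute-limited
    memory_bound = mem_gb / bandwidth_gbs        # seconds if bandwidth-limited
    return 1e3 * max(compute_bound, memory_bound)

# Toy workload: a 172 GFLOP model (the figure quoted in the NMT abstract) with
# an assumed 2 GB of off-chip traffic on a hypothetical 500 GFLOPS / 19 GB/s
# accelerator configuration.
print(f"{estimate_latency_ms(172, 2.0, 500, 19):.1f} ms")
```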
... Consequently, an increasingly higher abstraction level is achieved [9]. DL is considered one of the most highly recommended machine learning methods for dealing with many research challenges in NLP, such as named entity recognition [10], machine translation [11,12], speech recognition [13,14], and SA [15,16]. The advantage of DL lies in its independence from specialized knowledge and linguistic resources as well as in its superior performance. The collection of algorithms (i.e. ...
Preprint
Full-text available
Aspect-based Sentiment analysis (ABSA) accomplishes a fine-grained analysis that defines the aspects of a given document or sentence and the sentiments conveyed regarding each aspect. This level of analysis is the most detailed version that is capable of exploring the nuanced viewpoints of the reviews. Most of the research available in ABSA focuses on the English language, with very little work available on Arabic. Most previous work in Arabic has been based on conventional machine learning methods that mainly depend on a group of rare resources and tools for analyzing and processing Arabic content, such as lexicons, but the lack of those resources presents another challenge. To overcome these obstacles, Deep Learning (DL)-based methods are proposed using two models based on Gated Recurrent Unit (GRU) neural networks for ABSA. The first one is a DL model that takes advantage of representations of both words and characters via the combination of bidirectional GRU, Convolutional Neural Network (CNN), and Conditional Random Field (CRF), which makes up the BGRU-CNN-CRF model, to extract the main opinionated aspects (OTE). The second is an interactive attention network based on bidirectional GRU (IAN-BGRU) to identify sentiment polarity toward extracted aspects. We evaluated our models using the benchmarked Arabic hotel reviews dataset. The results indicate that the proposed methods are better than baseline research on both tasks, with a 38.5% enhancement in F1-score for opinion target extraction (T2) and 7.5% in accuracy for aspect-based sentiment polarity classification (T3), obtaining an F1 score of 69.44% for T2 and an accuracy of 83.98% for T3.
... We quantize the model to mixed-precision representation in which parameters and portions of calculation are in 16-bit half precision, and others remain as 32-bit floating-point. This paper is a continuation of our previous work [17], the first real-life NMT design on FPGAs using floating-point precision. The key improvements compared to [17] include a hybrid-precision NMT model design to achieve improved board-level performance and the same level of accuracy as the floating-point version, a heterogeneous decoder design to integrate dedicated decoders for layers with different compute-to-communication (CTC) ratios, and a refined attention module to optimize the computation order and reduce process latency. ...
... This paper is a continuation of our previous work [17], the first real-life NMT design on FPGAs using floating-point precision. The key improvements compared to [17] include a hybrid-precision NMT model design to achieve improved board-level performance and the same level of accuracy as the floating-point version, a heterogeneous decoder design to integrate dedicated decoders for layers with different compute-to-communication (CTC) ratios, and a refined attention module to optimize the computation order and reduce process latency. To sum up, the following are the contributions of this work: • We introduce a hardware-oriented profiler and a comprehensive task partitioning strategy for mapping NMT onto FPGAs. ...
... The utilization is presented as a set of percentages using the same FPGA (VCU118), and 50-word Translation means the time needed for translating a 50-word English sentence to a French sentence. We list four models, Float, Orig, Onchip, and Optimized, where Float represents the original float NMT model implemented in [17], Orig is a direct conversion from the original float model to the mixed-precision model without additional optimization, the Onchip implementation adds on-chip weight storage for the attention mechanism, and Optimized represents our current design with mixed-precision MVM kernels, on-chip attention weight storage, and a refined attention mechanism. ...
Article
Full-text available
Neural machine translation (NMT) is one of the most critical applications in natural language processing (NLP) with the main idea of converting text from one language to another using deep neural networks. In recent years, we have seen continuous development of NMT by integrating more emerging technologies, such as bidirectional gated recurrent units (GRU), attention mechanisms, and beam-search algorithms, for improved translation quality. However, the real-life NMT models have become much more complicated and challenging to implement on hardware for acceleration opportunities. In this paper, we aim to exploit the capability of FPGAs to deliver highly efficient implementations for real-life NMT applications. We map the inference of a large-scale NMT model with total computation of 172 GFLOP to a highly optimized high-level synthesis (HLS) IP and integrate the IP into Xilinx VCU118 FPGA platform. The model has widely used key features for NMTs, including the bidirectional GRU layer, attention mechanism, and beam search. We quantize the model to mixed-precision representation in which parameters and portions of calculations are in 16-bit half-precision, and others remain as 32-bit floating-point. Compared to the float NMT implementation on FPGA, we achieve 13.1x speedup with an end-to-end performance of 22.0 GFLOPS without any accuracy degradation.
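The mixed-precision scheme described in this abstract (16-bit storage with 32-bit floating-point kept for the sensitive parts) can be emulated in software to gauge its numerical effect. The sketch below does this with NumPy; the matrix sizes are illustrative and unrelated to the paper's NMT layers.

```python
# Minimal sketch of the mixed-precision idea: store operands in 16-bit half
# precision while accumulating in 32-bit floating point, then compare against
# a full-precision reference. Shapes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=(512, 512)).astype(np.float32)
x = rng.normal(size=512).astype(np.float32)

y_fp32 = w @ x                                          # full-precision reference
y_mixed = (w.astype(np.float16).astype(np.float32)      # fp16-stored weights
           @ x.astype(np.float16).astype(np.float32))   # fp16-stored input, fp32 matmul

print("max abs error vs fp32:", float(np.max(np.abs(y_fp32 - y_mixed))))
```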
... Another dimension of deep learning, which has recently emerged at the edge, is time series analysis and forecasting. Healthcare [5,6,7], device health monitoring [8,9,10], and machine translation [11,12] are some examples of deep learning use in time sequence analysis. ...
Preprint
This paper presents a scalable deep learning model called Agile Temporal Convolutional Network (ATCN) for highly accurate, fast classification and time series prediction in resource-constrained embedded systems. ATCN is primarily designed for mobile embedded systems with performance and memory constraints, such as wearable biomedical devices and real-time reliability monitoring systems. It makes fundamental improvements over mainstream temporal convolutional neural networks, including the incorporation of separable depth-wise convolution to reduce the computational complexity of the model, and residual connections acting as time attention machines to increase the network depth and accuracy. This configurability makes ATCN a family of compact networks with formalized hyper-parameters that allow the model architecture to be configured and adjusted based on application requirements. We demonstrate the capabilities of our proposed ATCN on the accuracy and performance trade-off of three embedded applications, including transistor reliability monitoring, heartbeat classification of ECG signals, and digit classification. Our comparison results against state-of-the-art approaches demonstrate much lower computation and memory demand for faster processing with better prediction and classification accuracy. The source code of the ATCN model is publicly available at https://github.com/TeCSAR-UNCC/ATCN.
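The two ingredients ATCN highlights, separable depth-wise convolution and residual connections, combine into a block like the following PyTorch sketch. The class name, channel count, and hyper-parameters are assumptions for illustration, not the released ATCN code.

```python
# Hedged sketch of a separable depth-wise temporal convolution block: a
# per-channel (depthwise) dilated convolution followed by a 1x1 pointwise
# convolution, with a residual connection. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SeparableTemporalBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2        # keep sequence length
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=channels)    # one filter per channel
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):                              # x: (batch, channels, time)
        return self.act(x + self.pointwise(self.depthwise(x)))  # residual add

x = torch.randn(8, 32, 128)
print(SeparableTemporalBlock(32, kernel_size=3, dilation=2)(x).shape)
```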
... Meanwhile, the optimization techniques for deploying AI algorithms on hardware platforms are also being intensively explored. These techniques include building accelerators by taking advantage of different hardware devices [4][5][6][7][8], exploring optimization schemes to reduce the DNNs' model complexity and increase their hardware efficiency [9,10], and designing Electronic Design Automation (EDA) tools to enable automatic end-to-end DNN optimization and deployment [11][12][13]. ...
Preprint
High quality AI solutions require joint optimization of AI algorithms, such as deep neural networks (DNNs), and their hardware accelerators. To improve the overall solution quality as well as to boost the design productivity, efficient algorithm and accelerator co-design methodologies are indispensable. In this paper, we first discuss the motivations and challenges for the Algorithm/Accelerator co-design problem and then provide several effective solutions. Especially, we highlight three leading works of effective co-design methodologies: 1) the first simultaneous DNN/FPGA co-design method; 2) a bi-directional lightweight DNN and accelerator co-design method; 3) a differentiable and efficient DNN and accelerator co-search method. We demonstrate the effectiveness of the proposed co-design approaches using extensive experiments on both FPGAs and GPUs, with comparisons to existing works. This paper emphasizes the importance and efficacy of algorithm-accelerator co-design and calls for more research breakthroughs in this interesting and demanding area.