## No full-text available

To read the full-text of this research,

you can request a copy directly from the authors.

Acceleration and compression on deep Convolutional Neural Networks (CNNs) have become a critical problem to develop intelligence on resource-constrained devices. Previous channel pruning can be easily deployed and accelerated without specialized hardware and software. However, weight-level redundancy is not well explored in channel pruning, which results in a relatively low compression ratio. In this work, we propose a Low-rank Approximated channel Pruning (LAP) framework to tackle this problem with two targeted steps. First, we utilize low-rank approximation to eliminate the redundancy within filter. This step achieves acceleration, especially in shallow layers, and also converts filters into smaller compact ones. Then, we apply channel pruning on the approximated network in a global way and obtain further benefits, especially in deep layers. In addition, we propose a spectral norm based indicator to coordinate these two steps better. Moreover, inspired by the integral idea adopted in video coding, we propose an evaluator based on Integral of Decay Curve (IDC) to judge the efficiency of various acceleration and compression algorithms. Ablation experiments and IDC evaluator prove that LAP can significantly improve channel pruning. To further demonstrate the hardware compatibility, the network produced by LAP obtains impressive speedup efficiency on the FPGA.

To read the full-text of this research,

you can request a copy directly from the authors.

... Consequently, often reduced precision is used for implementing these CNNs on a FPGA platform. For example, in [16,[28][29][30][31][32], researchers were using VGG-16 as the targeted CNN model for FPGA implementation which has 30.76 billion operations. Mei et al. reduce the precision of the data from single precision floating point to half precision point in [28]. ...

... Lian et al. use 8-bit block floating point (BFP) instead of regular floating point data types [29]. In other research such as [16,[30][31][32], authors take one step further by using fixed points data types. J. Wang et al. proposed an equal distance quantization method to adopt various bits depth up to 8-bit on different layers in [31], while Z. Chen et al. tested different quantization methods in [32]. ...

... In other research such as [16,[30][31][32], authors take one step further by using fixed points data types. J. Wang et al. proposed an equal distance quantization method to adopt various bits depth up to 8-bit on different layers in [31], while Z. Chen et al. tested different quantization methods in [32]. Unlike AlexNet or VGG-16, the Ultra-CNN has only 1.93 million operations. ...

Convolutional Neural Networks (CNN) and derivative architectures have been increasingly popular for image and signal processing applications such as detection and classification. Recently, a CNN architecture with a Wavelet Packet feature selector (Ultra-CNN) was introduced for Ultrasonic Non-Destructive Evaluation applications. This CNN based classifier manages to detect the presence of flaws with accuracy up to 92% using experimental data. In this study, an FPGA based Ultra-CNN design using high-level synthesis (HLS) is presented. Implementing the algorithm on a portable FPGA platform facilitates detection of ultrasonic flaws with high accuracy even when there is no access to high performance computation resources in the field. Unlike most other CNN designs used for pattern recognition in images, Ultra-CNN’s fully connected layers require more operations than its convolutional layers. In order to maximize the throughput, proposed design is optimized for both convolutional and fully connected layers. Therefore, we introduce a new design with two pipelined processors optimized for convolutional and fully connected layers, respectively. The results demonstrate highest utilization efficiency achieved compared to other CNN implementations and validate the low-cost, real-time operation of the design.

... A novel clustering based pruning was suggested in [17] which utilized latent space clustering for measuring the importance of the filters and pruned weak filters which was followed by re-initialization of redundant filters for protecting the diversity. Chen et al. adopted a low rank approximation for eliminating the redundancy among filters [4]. [19] employed a two-step feature map reconstruction procedure to remove redundant filters and channels. ...

Convolutional Neural Networks (CNN) have achieved excellent performance in the processing of high-resolution images. Most of these networks contain many deep layers in quest of greater segmentation performance. However, over-sized CNN models result in overwhelming memory usage and large inference costs. Earlier studies have revealed that over-sized deep neural models tend to deal with abundant redundant filters that are very similar and provide tiny or no contribution in accelerating the inference of the model. Therefore, we have proposed a novel optimal-score-based filter pruning (OSFP) approach to prune redundant filters according to their relative similarity in feature space. OSFP not only speeds up learning in the network but also eradicates redundant filters leading to improvement in the segmentation performance. We empirically demonstrate on widely used segmentation network models (TernausNet, classical U-Net and VGG16 U-Net) and benchmark datasets (Inria Aerial Image Labeling Dataset and Aerial Imagery for Roof Segmentation (AIRS)) that computation costs (in terms of Float Point Operations (FLOPs) and parameters) are reduced significantly, while maintaining or even improving accuracy.

... In the actual power system, with the new development of the internet of things (IoT) platform and the deployment of large number of edge devices, the need to collect data on the edge devices of IoT and make prediction at the edge becomes more urgent [16]. This requires us to develop methods with fewer parameters and a lower computational cost to apply to resource-constrained devices [17,18]. It is obvious that PQD signals exactly have time series characteristics, and as a special type of deep learning model, recurrent network contains a hidden layer to process time series data. ...

This paper presents a new concise deep learning–based sequence model to detect the power quality disturbances (PQD), which only uses original signals and does not require pre‐processing and complex artificial feature extraction process. A simple gated recurrent network (SGRN) with a new recurrent cell structure is developed, which consists of only two gates: forget gate and input gate, and two weight matrices. Compared with the standard Recurrent Neural Network (RNN) model, the training process of the proposed method is more stable and the prediction accuracy is higher. In addition, this special structure retains basic non‐linearity and long‐term memory, while enabling the simple gated recurrent network model to be superior to Long Short‐Term Memory (LSTM) Network and Gated Recurrent Unit (GRU) Network in terms of the number of parameters (i.e. memory cost) and detection speed. In the light of the experimental results, the simple gated recurrent network algorithm can achieve 99.07% detection accuracy, and contains only 18,959 parameters, which indicates that our proposed method is easier to deploy in resource‐constrained internet of things (IoT) micro‐controllers.

Previous AutoML pruning works utilized individual layer features to automatically prune filters. We analyze the correlation for two layers from the different blocks which have a short-cut structure. It shows that, in one block, the deeper layer has many redundant filters which can be represented by filters in the former layer. So, it is necessary to take information from other layers into consideration in pruning. In this paper, a novel pruning method, named GraphPruning, is proposed. Any series of the network is viewed as a graph. To automatically aggregate neighboring features for each node, a graph aggregator based on graph convolution networks (GCN) is designed. In the training stage, a PruningNet that is given aggregated node features generates reasonable weights for any size of the sub-network. Subsequently, the best configuration of the Pruned Network is searched by reinforcement learning. Different from previous work, we take the node features from a well-trained graph aggregator instead of the hand-craft features, as the states in reinforcement learning. Compared with other AutoML pruning works, our method has achieved the state-of-the-art under the same conditions on ImageNet-2012.

Pruning is widely regarded as an effective neural network compression and acceleration method, which can significantly reduce model parameters and speed up inference. This paper proposes a novel network Channel Pruning method based on Sequential Interval Estimation (SIECP). Our method mainly solves the problem that existing methods need to sample and evaluate a large number of sub-structures or introduce a large number of parameters, which leads to slow search. Specifically, we divide the entire channel number search process into multiple stages. In each stage, we divide the channel range that needs to be estimated into multiple channel intervals by grouping and use gradient descent to optimize the number of reserved channels. At the same time, the distribution of the number of channels is counted in each interval. At the end of each search stage, we select the channel interval with the highest frequency for further search to gradually narrow the search range until the final number of channels is determined. Then, the unpruned network is pruned according to the number of channels in each layer of the network to obtain the pruned network. Extensive experiments using ResNet and MobileNet V2 as backbones on CIFAR10, CIAFR100 and Tiny-ImageNet datasets are conducted to demonstrate the effectiveness of our method.

High spatial resolution remote sensing (HSRRS) classification is one of the most promising topics in the field of remote sensing. Recently, convolutional neural network (CNN), as a method with superior performance, has achieved amazing results in the HSRRS classification tasks. However, the previously proposed CNN models generally have large model sizes, and thus require a tremendous amount of computing and consume a lot of memory. Furthermore, wearable electronic devices with limited hardware resources in actual application scenarios fail to meet the storage and computing requirements of these complex CNNs. To solve these problems, lightweight processing is important and practical. This paper proposes a new method called DC-LW-CNN, which applies Dilated Convolution (DC) and neuron pruning methods to maintain the classification performance of CNN and reduce resource consumption. The simulation results indicate that different DC-LW-CNNs have better performance effects in HSRRS classification task. Not only that, they also achieve this remarkable performance with smaller model size, less memory, and faster feedforward computing speed.

Object detection using convolutional neural networks (CNNs) has garnered a lot of interest due to their high performance capability. Yet, the large number of operations and memory fetches to both on-chip and external memory needed for such CNNs result in high latency and power dissipation on resource constrained edge devices, hence impeding their real-time operation from a battery supply. In this paper, a resource and cost efficient hardware accelerator for CNN is implemented on an FPGA. Using an existing metric \(\mathrm{DSP}_\mathrm{efficiency}\) and a new metric \(\mathrm{Cost}_\mathrm{efficiency}\) as the primary optimization variables, exploration of algorithms and hardware using a design space exploration tool, called ZigZag, is undertaken. An optimized architecture is implemented on a Xilinx XC7Z035 FPGA and tiny-YOLOv2 is mapped to demonstrate the real-time object detection application. Compared to the state-of-the-art (SotA), the implementation results shows that the hardware achieves the best \(\mathrm{DSP}_\mathrm{efficiency}\) at 90% and \(\mathrm{Cost}_\mathrm{efficiency}\) at 0.146.

Edge computing brings artificial intelligence algorithms and graphics processing units closer to data sources, making autonomy and energy-efficient processing vital for their design. Approximate computing has emerged as a popular strategy for energy-efficient circuit design, where the challenge is to achieve the best tradeoff between design efficiency and accuracy. The essential operation in artificial intelligence algorithms is the general matrix multiplication (GEMM) operation comprised of matrix multiplication and accumulation. This paper presents an approximate general matrix multiplication (AGEMM) unit that employs approximate multipliers to perform matrix-matrix operations on four-by-four matrices given in sixteen-bit signed fixed-point format. The synthesis of the proposed AGEMM unit to the 45 nm Nangate Open Cell Library revealed that it consumed only up to 36% of the area and 25% of the energy required by the exact general matrix multiplication unit. The AGEMM unit is ideally suited to convolutional neural networks, which can adapt to the error induced in the computation. We evaluated the AGEMM units' usability for honeybee detection with the YOLOv4-tiny convolutional neural network. The results implied that we can deploy the AGEMM units in convolutional neural networks without noticeable performance degradation. Moreover, the AGEMM unit's employment can lead to more area-and energy-efficient convolutional neural network processing, which in turn could prolong sensors' and edge nodes' autonomy.

Multiplication is an essential image processing operation commonly implemented in hardware DSP cores. To improve DSP cores’ area, speed, or energy efficiency, we can approximate multiplication. We present an approximate multiplier that generates two partial products using hybrid radix-4 and logarithmic encoding of the input operands. It uses the exact radix-4 encoding to generate the partial product from the three most significant bits and the logarithmic approximation with mantissa trimming to approximate the partial product from the remaining least-significant bits. The proposed multiplier fills the gap between highly accurate approximate non-logarithmic multipliers with a complex design and less accurate approximate logarithmic multipliers with a more straightforward design. We evaluated the multiplier’s efficiency in terms of error, energy (power-delay-product) and area utilisation using NanGate 45 nm. The experimental results show that the proposed multiplier exhibits good area utilisation and energy consumption and behaves well in image processing applications.

Recently, the group convolutions are widely used in mobile convolutional neural networks (CNNs) to improve the model’s efficiency. However, the training process of these popular group-conv mobile models is usually time-consuming compared to the regular models. Since some practical applications usually need retraining when the new data are added, the training process should be also efficient. Therefore, this paper firstly explores the inefficiency of current popular group-conv mobile CNNs to propose the guidelines of constructing mobile CNNs. According to these guidelines, the fully shared convolutional neural networks (FSC-Nets) are proposed in this paper. The FSC-Nets can share the same filter in both the channel and layer extents, and it is the first method to share the filters from all the extents. Experimental results show that, compared with current popular regular networks and the latest state-of-the-art group-conv mobile networks, the FSC-Nets can perform better while effectively decreasing the model size and runtime in both the training and test processes.

Convolutional neural networks (CNN) have been proved to be an effective method in the field of artificial intelligence (AI), and large-scale deploying CNN to embedded devices, no doubt, will greatly promote the development and application of AI into the practical industry. However, mainly due to the space-time complexity of CNN, computing power, memory bandwidth and flexibility are performance bottlenecks. In this paper, a framework containing model compression and hardware acceleration is proposed to solve the above problems. This framework consists of a mixed pruning method, data storage optimization for efficient memory utilization and an accelerator for mapping CNN on field programmable gate array (FPGA). The mixed pruning method is used to compress the model, and data bit-width is reduced to 8-bit by data quantization. Accelerator based on FPGA makes it flexible, configurable and efficient for CNN implementation. The model compression is evaluated on NVIDIA RTX2080Ti, and the results illustrate that the VGG16 is compressed by
$30\times $
and the fully convolutional network (FCN) is compressed by
$11\times $
within 1% accuracy loss. The compressed model is deployed and accelerated on ZCU102, which is up to
$1.7\times $
and
$24.5\times $
better in energy efficiency compared with RTX2080Ti and Intel i7 7700.

The increase in the number of edge devices has led to the emergence of edge computing where the computations are performed on the device. In recent years, deep neural networks (DNNs) have become the state-of-the-art method in a broad range of applications, from image recognition, to cognitive tasks to control. However, neural network models are typically large and computationally expensive and therefore not deployable on power and memory constrained edge devices. Sparsification techniques have been proposed to reduce the memory foot-print of neural network models. However, they typically lead to substantial hardware and memory overhead. In this article, we propose a hardware-aware pruning method using linear feedback shift register (LFSRs) to generate the locations of non-zero weights in real-time during inference. We call this LFSR-generated pseudo-random sequence based sparsity (LGPS) technique. We explore two different architectures for our hardware-friendly LGPS technique, based on (1) row/column indexing with LFSRs and (2) column-wise indexing with nested LFSRs, respectively. Using the proposed method, we present a total saving of energy and area up to 37.47% and 49.93% respectively and speed up of
$1.53\times $
w.r.t the baseline pruning method, for the VGG-16 network on down-sampled ImageNet.

Acceleration and compression on Deep Neural Networks (DNNs) have become a critical problem to develop intelligence on resource-constrained hardware, especially on Internet of
Things (IoT) devices. Previous works based on channel pruning
can be easily deployed and accelerated without specialized
hardware and software. However, weight-level sparsity is not
well explored in channel pruning, which results in relatively
low compression rate. In this work, we propose a framework
that combines channel pruning with low-rank decomposition
to tackle this problem. First, the low-rank decomposition is
utilized to eliminate redundancy within filter, and achieves
acceleration in shallow layers. Then, we apply channel pruning
on the decomposed network in a global way, and obtains further
acceleration in deep layers. In addition, a spectral norm-based
indicator is proposed to balance low-rank approximation and
channel pruning. We conduct a series of ablation experiments
and prove that low-rank decomposition can effectively improve
channel pruning by generating small and compact filters. To
further demonstrate the hardware compatibility, we deploy the
pruned networks on the FPGA, and the networks produced by
our method have obviously low latency.

Due to recent advances in digital technologies, and availability of credible data, an area of artificial intelligence, deep learning, has emerged, and has demonstrated its ability and effectiveness in solving complex learning problems not possible before. In particular, convolution neural networks (CNNs) have demonstrated their effectiveness in image detection and recognition applications. However, they require intensive CPU operations and memory bandwidth that make general CPUs fail to achieve desired performance levels. Consequently, hardware accelerators that use application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), and graphic processing units (GPUs) have been employed to improve the throughput of CNNs. More precisely, FPGAs have been recently adopted for accelerating the implementation of deep learning networks due to their ability to maximize parallelism as well as due to their energy efficiency. In this paper, we review recent existing techniques for accelerating deep learning networks on FPGAs. We highlight the key features employed by the various techniques for improving the acceleration performance. In addition, we provide recommendations for enhancing the utilization of FPGAs for CNNs acceleration. The techniques investigated in this paper represent the recent trends in FPGA-based accelerators of deep learning networks. Thus, this review is expected to direct the future advances on efficient hardware accelerators and to be useful for deep learning researchers.

While bigger and deeper neural network architectures continue to advance the state-of-the-art for many computer vision tasks, real-world adoption of these networks is impeded by hardware and speed constraints. Conventional model compression methods attempt to address this problem by modifying the architecture manually or using pre-defined heuristics. Since the space of all reduced architectures is very large, modifying the architecture of a deep neural network in this way is a difficult task. In this paper, we tackle this issue by introducing a principled method for learning reduced network architectures in a data-driven way using reinforcement learning. Our approach takes a larger `teacher' network as input and outputs a compressed `student' network derived from the `teacher' network. In the first stage of our method, a recurrent policy network aggressively removes layers from the large `teacher' model. In the second stage, another recurrent policy network carefully reduces the size of each remaining layer. The resulting network is then evaluated to obtain a reward -- a score based on the accuracy and compression of the network. Our approach uses this reward signal with policy gradients to train the policies to find a locally optimal student network. Our experiments show that we can achieve compression rates of more than 10x for models such as ResNet-34 while maintaining similar performance to the input `teacher' network. We also present a valuable transfer learning result which shows that policies which are pre-trained on smaller `teacher' networks can be used to rapidly speed up training on larger `teacher' networks.

Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91 % on CIFAR-10).

Convolutional Neural Networks (CNNs) are extensively used in image and video recognition, natural language processing and other machine learning applications. The success of CNNs in these areas corresponds with a significant increase in the number of parameters and computation costs. Recent approaches towards reducing these overheads involve pruning and compressing the weights of various layers without hurting the overall CNN performance. However, using model compression to generate sparse CNNs mostly reduces parameters from the fully connected layers and may not significantly reduce the final computation costs. In this paper, we present a compression technique for CNNs, where we prune the filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole planes in the network, together with their connecting convolution kernels, the computational costs are reduced significantly. In contrast to other techniques proposed for pruning networks, this approach does not result in sparse connectivity patterns. Hence, our techniques do not need the support of sparse convolution libraries and can work with the most efficient BLAS operations for matrix multiplications. In our results, we show that even simple filter pruning techniques can reduce inference costs for VGG-16 by up to 34% and ResNet-110 by up to 38% while regaining close to the original accuracy by retraining the networks.

The rapid growth of data size and accessibility in recent years has instigated a shift of philosophy in algorithm design for artificial intelligence. Instead of engineering algorithms by hand, the ability to learn composable systems automatically from massive amounts of data has led to ground-breaking performance in important domains such as computer vision, speech recognition, and natural language processing. The most popular class of techniques used in these domains is called deep learning, and is seeing significant attention from industry. However, these models require incredible amounts of data and compute power to train, and are limited by the need for better hardware acceleration to accommodate scaling beyond current data and model sizes. While the current solution has been to use clusters of graphics processing units (GPU) as general purpose processors (GPGPU), the use of field programmable gate arrays (FPGA) provide an interesting alternative. Current trends in design tools for FPGAs have made them more compatible with the high-level software practices typically practiced in the deep learning community, making FPGAs more accessible to those who build and deploy models. Since FPGA architectures are flexible, this could also allow researchers the ability to explore model-level optimizations beyond what is possible on fixed architectures such as GPUs. As well, FPGAs tend to provide high performance per watt of power consumption, which is of particular importance for application scientists interested in large scale server-based deployment or resource-limited embedded applications. This review takes a look at deep learning and FPGAs from a hardware acceleration perspective, identifying trends and innovations that make these technologies a natural fit, and motivates a discussion on how FPGAs may best serve the needs of the deep learning community moving forward.

Real time application of deep learning algorithms is often hindered by high
computational complexity and frequent memory accesses. Network pruning is a
promising technique to solve this problem. However, pruning usually results in
irregular network connections that not only demand extra representation efforts
but also do not fit well on parallel computation. We introduce structured
sparsity at various scales for convolutional neural networks, which are channel
wise, kernel wise and intra kernel strided sparsity. This structured sparsity
is very advantageous for direct computational resource savings on embedded
computers, parallel computing environments and hardware based systems. To
decide the importance of network connections and paths, the proposed method
uses a particle filtering approach. The importance weight of each particle is
assigned by computing the misclassification rate with corresponding
connectivity pattern. The pruned network is re-trained to compensate for the
losses due to pruning. While implementing convolutions as matrix products, we
particularly show that intra kernel strided sparsity with a simple constraint
can significantly reduce the size of kernel and feature map matrices. The
pruned network is finally fixed point optimized with reduced word length
precision. This results in significant reduction in the total storage size
providing advantages for on-chip memory based implementations of deep neural
networks.

The Internet of Things (IoT) makes smart objects the ultimate building blocks in the development of cyber-physical smart pervasive frameworks. The IoT has a variety of application domains, including health care. The IoT revolution is redesigning modern health care with promising technological, economic, and social prospects. This paper surveys advances in IoT-based health care technologies and reviews the state-of-the-art network architectures/platforms, applications, and industrial trends in IoT-based health care solutions. In addition, this paper analyzes distinct IoT security and privacy features, including security requirements, threat models, and attack taxonomies from the health care perspective. Further, this paper proposes an intelligent collaborative security model to minimize security risk; discusses how different innovations such as big data, ambient intelligence, and wearables can be leveraged in a health care context; addresses various IoT and eHealth policies and regulations across the world to determine how they can facilitate economies and societies in terms of sustainable development; and provides some avenues for future research on IoT-based health care based on a set of open issues and challenges.

As deep nets are increasingly used in applications suited for mobile devices,
a fundamental dilemma becomes apparent: the trend in deep learning is to grow
models to absorb ever-increasing data set sizes; however mobile devices are
designed with very little memory and cannot store such large models. We present
a novel network architecture, HashedNets, that exploits inherent redundancy in
neural networks to achieve drastic reductions in model sizes. HashedNets uses a
low-cost hash function to randomly group connection weights into hash buckets,
and all connections within the same hash bucket share a single parameter value.
These parameters are tuned to adjust to the HashedNets weight sharing
architecture with standard backprop during training. Our hashing procedure
introduces no additional memory overhead, and we demonstrate on several
benchmark data sets that HashedNets shrink the storage requirements of neural
networks substantially while mostly preserving generalization performance.

Internet of Things (IoT) has provided a promising opportunity to build powerful industrial systems and applications by leveraging the growing ubiquity of radio-frequency identification (RFID), and wireless, mobile, and sensor devices. A wide range of industrial IoT applications have been developed and deployed in recent years. In an effort to understand the development of IoT in industries, this paper reviews the current research of IoT, key enabling technologies, major IoT applications in industries, and identifies research trends and challenges. A main contribution of this review paper is that it summarizes the current state-of-the-art IoT in industries systematically.

We propose a simple two-step approach for speeding up convolution layers
within large convolutional neural networks based on tensor decomposition and
discriminative fine-tuning. Given a layer, we use non-linear least squares to
compute a low-rank CP-decomposition of the 4D convolution kernel tensor into a
sum of a small number of rank-one tensors. At the second step, this
decomposition is used to replace the original convolutional layer with a
sequence of four convolutional layers with small kernels. After such
replacement, the entire network is fine-tuned on the training data using
standard backpropagation process.
We evaluate this approach on two CNNs and show that it yields larger CPU
speedups at the cost of lower accuracy drops compared to previous approaches.
For the 36-class character classification CNN, our approach obtains a 8.5x CPU
speedup of the whole network with only minor accuracy drop (1% from 91% to
90%). For the standard ImageNet architecture (AlexNet), the approach speeds up
the second convolution layer by a factor of 4x at the cost of 1% increase of
the overall top-5 classification error.

The ImageNet Large Scale Visual Recognition Challenge is a benchmark in
object category classification and detection on hundreds of object categories
and millions of images. The challenge has been run annually from 2010 to
present, attracting participation from more than fifty institutions.
This paper describes the creation of this benchmark dataset and the advances
in object recognition that have been possible as a result. We discuss the
challenges of collecting large-scale ground truth annotation, highlight key
breakthroughs in categorical object recognition, provide detailed a analysis of
the current state of the field of large-scale image classification and object
detection, and compare the state-of-the-art computer vision accuracy with human
accuracy. We conclude with lessons learned in the five years of the challenge,
and propose future directions and improvements.

We present techniques for speeding up the test-time evaluation of large
convolutional networks, designed for object recognition tasks. These models
deliver impressive accuracy but each image evaluation requires millions of
floating point operations, making their deployment on smartphones and
Internet-scale clusters problematic. The computation is dominated by the
convolution operations in the lower layers of the model. We exploit the linear
structure present within the convolutional filters to derive approximations
that significantly reduce the required computation. Using large
state-of-the-art models, we demonstrate speedups by a factor of 2x, while
keeping the accuracy within 1% of the original model.

Learning filters to produce sparse image representations in terms of over complete dictionaries has emerged as a powerful way to create image features for many different purposes. Unfortunately, these filters are usually both numerous and non-separable, making their use computationally expensive. In this paper, we show that such filters can be computed as linear combinations of a smaller number of separable ones, thus greatly reducing the computational complexity at no cost in terms of performance. This makes filter learning approaches practical even for large images or 3D volumes, and we show that we significantly outperform state-of-the-art methods on the linear structure extraction task, in terms of both accuracy and speed. Moreover, our approach is general and can be used on generic filter banks to reduce the complexity of the convolutions.

The combination of the Internet and emerging technologies such as nearfield communications, real-time localization, and embedded sensors lets us transform everyday objects into smart objects that can understand and react to their environment. Such objects are building blocks for the Internet of Things and enable novel computing applications. As a step toward design and architectural principles for smart objects, the authors introduce a hierarchy of architectures with increasing levels of real-world awareness and interactivity. In particular, they describe activity-, policy-, and process-aware smart objects and demonstrate how the respective architectural abstractions support increasingly complex application.

Multilayer neural networks trained with the back-propagation
algorithm constitute the best example of a successful gradient based
learning technique. Given an appropriate network architecture,
gradient-based learning algorithms can be used to synthesize a complex
decision surface that can classify high-dimensional patterns, such as
handwritten characters, with minimal preprocessing. This paper reviews
various methods applied to handwritten character recognition and
compares them on a standard handwritten digit recognition task.
Convolutional neural networks, which are specifically designed to deal
with the variability of 2D shapes, are shown to outperform all other
techniques. Real-life document recognition systems are composed of
multiple modules including field extraction, segmentation recognition,
and language modeling. A new learning paradigm, called graph transformer
networks (GTN), allows such multimodule systems to be trained globally
using gradient-based methods so as to minimize an overall performance
measure. Two systems for online handwriting recognition are described.
Experiments demonstrate the advantage of global training, and the
flexibility of graph transformer networks. A graph transformer network
for reading a bank cheque is also described. It uses convolutional
neural network character recognizers combined with global training
techniques to provide record accuracy on business and personal cheques.
It is deployed commercially and reads several million cheques per day

In recent years, convolutional neural networks (CNNs) have shown great performance in various fields such as image classification, pattern recognition, and multi-media compression. Two of the feature properties, local connectivity and weight sharing, can reduce the number of parameters and increase processing speed during training and inference. However, as the dimension of data becomes higher and the CNN architecture becomes more complicated, the end-to-end approach or the combined manner of CNN is computationally intensive, which becomes limitation to CNN's further implementation. Therefore, it is necessary and urgent to implement CNN in a faster way. In this paper, we first summarize the acceleration methods that contribute to but not limited to CNN by reviewing a broad variety of research papers. We propose a taxonomy in terms of three levels, i.e. structure level, algorithm level, and implementation level, for acceleration methods. We also analyze the acceleration methods in terms of CNN architecture compression, algorithm optimization, and hardware-based improvement. At last, we give a discussion on different perspectives of these acceleration and optimization methods within each level. The discussion shows that the methods in each level still have large exploration space. By incorporating such a wide range of disciplines, we expect to provide a comprehensive reference for researchers who are interested in CNN acceleration.

Breakthroughs from the field of deep learning are radically changing how sensor data are interpreted to extract important information to help advance healthcare, make our cities smarter, and innovate in smart home technology. Deep convolutional neural networks, which are at the heart of many emerging Internet-of-Things (IoT) applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity in convolutional layers, limiting their deployability. In this paper, we present an easy-to-implement acceleration scheme, named ADaPT, which can be applied to already available pre-trained networks. Our proposed technique exploits redundancy present in the convolutional layers to reduce computation and storage requirements. Additionally, we also decompose each convolution layer into two consecutive one-dimensional stages to make full use of the approximate model. This technique can easily be applied to existing low power processors, GPUs or new accelerators. We evaluated this technique using four diverse and widely used benchmarks, on hardware ranging from embedded CPUs to server GPUs. Our experiments show an average 3-5x speed-up in all deep models and a maximum 8-9x speed-up on many individual convolutional layers. We demonstrate that unlike iterative pruning based methodology, our approximation technique is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speed-ups solely from convolutional layers with no loss in baseline accuracy.

Deep learning is a promising approach for extracting accurate information from raw sensor data from IoT devices deployed in complex environments. Because of its multilayer structure, deep learning is also appropriate for the edge computing environment. Therefore, in this article, we first introduce deep learning for IoTs into the edge computing environment. Since existing edge nodes have limited processing capability, we also design a novel offloading strategy to optimize the performance of IoT deep learning applications with edge computing. In the performance evaluation, we test the performance of executing multiple deep learning tasks in an edge computing environment with our strategy. The evaluation results show that our method outperforms other optimization solutions on deep learning for IoT.

In recent years, deep neural networks (DNNs) have received increased attention, have been applied to different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of graphics processing units (GPUs) with very high computation capability plays a key role in their success. For example, Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully connected layers. Usually, it takes two to three days to train the whole model on the ImagetNet data set with an NVIDIA K40 machine. In another example, the top face-verification results from the Labeled Faces in the Wild (LFW) data set were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally connected, and fully connected layers [2], [3]. It is also very time-consuming to train such a model to obtain a reasonable performance. In architectures that only rely on fully connected layers, the number of parameters can grow to billions [4].

Along with the recent development of Convolutional Neural Network (CNN) and its multilayering, it is important to reduce the amount of computation and the amount of data associated with convolution processing. Some compression methods of convolutional filters using low-rank approximation have been studied. The common goal of these studies is to accelerate the computation wherever possible while maintaining the accuracy of image recognition. In this paper, we investigate the trade-off between the compression error by low-rank approximation and the computational complexity for the state-of-the-arts CNN model.

The state-of-the-art face recognition systems are built on deep convolutional neural networks (CNNs). However, these CNNs contain millions of parameters, leading to the deployment difficulties on mobile and embedded devices. One solution is to reduce the size of the trained CNNs by model compression. In this work, we propose an entropy-based prune metric to reduce the size of intermediate activations so as to accelerate and compress CNN models both in training and inference stages. First the importance of each filter in each layer is evaluated by our entropy-based method. Then some unimportant filters are removed according to a predefined compressing rate. Finally, we fine-tune the pruned model to improve its discrimination ability. Experiments conducted on LFW face dataset shows the effectiveness of our entropy-based method. We achieve \(1.92\times \) compression and \(1.88\times \) speed-up on VGG-16 model, \(2\times \) compression and \(1.74\times \) speed-up on WebFace model, both with only about 1% accuracy decrease evaluated on LFW.

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, We introduce a three stage pipeline: pruning, quantization and Huffman encoding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman encoding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning, reduces the number of connections by 9x to 13x; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory, which has 180x less access energy.

While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif- ferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implemen- tation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry

High demand for computation resources severely hinders deployment of large-scale Deep Neural Networks (DNN) in resource constrained devices. In this work, we propose a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNNs evaluation. Experimental results show that SSL achieves on average 5.1x and 3.1x speedups of convolutional layer computation of AlexNet against CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about twice speedups of non-structured sparsity; (3) regularize the DNN structure to improve classification accuracy. The results show that for CIFAR-10, regularization on layer depth can reduce 20 layers of a Deep Residual Network (ResNet) to 18 layers while improve the accuracy from 91.25% to 92.60%, which is still slightly higher than that of original ResNet with 32 layers. For AlexNet, structure regularization by SSL also reduces the error by around ~1%. Open source code is in https://github.com/wenwei202/caffe/tree/scnn.

We propose an efficient and unified framework, namely ThiNet, to simultaneously accelerate and compress CNN models in both training and inference stages. We focus on the filter level pruning, i.e., the whole filter would be discarded if it is less important. Our method does not change the original network structure, thus it can be perfectly supported by any off-the-shelf deep learning libraries. We formally establish filter pruning as an optimization problem, and reveal that we need to prune filters based on statistics information computed from its next layer, not the current layer, which differentiates ThiNet from existing methods. Experimental results demonstrate the effectiveness of this strategy, which has advanced the state-of-the-art. We also show the performance of ThiNet on ILSVRC-12 benchmark. ThiNet achieves 3.31$\times$ FLOPs reduction and 16.63$\times$ compression on VGG-16, with only 0.52$\%$ top-5 accuracy drop. Similar experiments with ResNet-50 reveal that even for a compact network, ThiNet can also reduce more than half of the parameters and FLOPs, at the cost of roughly 1$\%$ top-5 accuracy drop. Moreover, the original VGG-16 model can be further pruned into a very small model with only 5.05MB model size, preserving AlexNet level accuracy but showing much stronger generalization ability.

We propose a new framework for pruning convolutional kernels in neural networks to enable efficient inference, focusing on transfer learning where large and potentially unwieldy pretrained networks are adapted to specialized tasks. We interleave greedy criteria-based pruning with fine-tuning by backpropagation - a computationally efficient procedure that maintains good generalization in the pruned network. We propose a new criterion based on an efficient first-order Taylor expansion to approximate the absolute change in training cost induced by pruning a network component. After normalization, the proposed criterion scales appropriately across all layers of a deep CNN, eliminating the need for per-layer sensitivity analysis. The proposed criterion demonstrates superior performance compared to other criteria, such as the norm of kernel weights or average feature map activation.

We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32\(\times \) memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58\(\times \) faster convolutional operations (in terms of number of the high precision operations) and 32\(\times \) memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than \(16\,\%\) in top-1 accuracy. Our code is available at: http:// allenai. org/ plato/ xnornet.

State-of-the-art neural networks are getting deeper and wider. While their performance increases with the increasing number of layers and neurons, it is crucial to design an efficient deep architecture in order to reduce computational and memory costs. Designing an efficient neural network, however, is labor intensive requiring many experiments, and fine-tunings. In this paper, we introduce network trimming which iteratively optimizes the network by pruning unimportant neurons based on analysis of their outputs on a large dataset. Our algorithm is inspired by an observation that the outputs of a significant portion of neurons in a large network are mostly zero, regardless of what inputs the network received. These zero activation neurons are redundant, and can be removed without affecting the overall accuracy of the network. After pruning the zero activation neurons, we retrain the network using the weights before pruning as initialization. We alternate the pruning and retraining to further reduce zero activations in a network. Our experiments on the LeNet and VGG-16 show that we can achieve high compression ratio of parameters without losing or even achieving higher accuracy than the original network.

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power.
Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; Exploiting sparsity saves 10×; Weight sharing gives 8×; Skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.

Deep convolutional neural networks (DCNNs) have become a very powerful tool in visual perception. DCNNs have applications in autonomous robots, security systems, mobile phones, and automobiles, where high throughput of the feedforward evaluation phase and power efficiency are important. Because of this increased usage, many field-programmable gate array (FPGA)-based accelerators have been proposed. In this paper, we present an optimized streaming method for DCNNs' hardware accelerator on an embedded platform. The streaming method acts as a compiler, transforming a high-level representation of DCNNs into operation codes to execute applications in a hardware accelerator. The proposed method utilizes maximum computational resources available based on a novel-scheduled routing topology that combines data reuse and data concatenation. It is tested with a hardware accelerator implemented on the Xilinx Kintex-7 XC7K325T FPGA. The system fully explores weight-level and node-level parallelizations of DCNNs and achieves a peak performance of 247 G-ops while consuming less than 4 W of power. We test our system with applications on object classification and object detection in real-world scenarios. Our results indicate high-performance efficiency, outperforming all other presented platforms while running these applications.

Deeper neural networks are more difficult to train. We present a residual
learning framework to ease the training of networks that are substantially
deeper than those used previously. We explicitly reformulate the layers as
learning residual functions with reference to the layer inputs, instead of
learning unreferenced functions. We provide comprehensive empirical evidence
showing that these residual networks are easier to optimize, and can gain
accuracy from considerably increased depth. On the ImageNet dataset we evaluate
residual nets with a depth of up to 152 layers---8x deeper than VGG nets but
still having lower complexity. An ensemble of these residual nets achieves
3.57% error on the ImageNet test set. This result won the 1st place on the
ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100
and 1000 layers.
The depth of representations is of central importance for many visual
recognition tasks. Solely due to our extremely deep representations, we obtain
a 28% relative improvement on the COCO object detection dataset. Deep residual
nets are foundations of our submissions to ILSVRC & COCO 2015 competitions,
where we also won the 1st places on the tasks of ImageNet detection, ImageNet
localization, COCO detection, and COCO segmentation.

Although the latest high-end smartphone has powerful CPU and GPU, running
deeper convolutional neural networks (CNNs) for complex tasks such as ImageNet
classification on mobile devices is challenging. To deploy deep CNNs on mobile
devices, we present a simple and effective scheme to compress the entire CNN,
which we call one-shot whole network compression. The proposed scheme consists
of three steps: (1) rank selection with variational Bayesian matrix
factorization, (2) Tucker decomposition on kernel tensor, and (3) fine-tuning
to recover accumulated loss of accuracy, and each step can be easily
implemented using publicly available tools. We demonstrate the effectiveness of
the proposed scheme by testing the performance of various compressed CNNs
(AlexNet, VGGS, GoogLeNet, and VGG-16) on the smartphone. Significant
reductions in model size, runtime, and energy consumption are obtained, at the
cost of small loss in accuracy. In addition, we address the important
implementation level issue on 1?1 convolution, which is a key operation of
inception module of GoogLeNet as well as CNNs compressed by our proposed
scheme.

Large CNNs have delivered impressive performance in various computer vision
applications. But the storage and computation requirements make it problematic
for deploying these models on mobile devices. Recently, tensor decompositions
have been used for speeding up CNNs. In this paper, we further develop the
tensor decomposition technique. We propose a new algorithm for computing the
low-rank tensor decomposition for removing the redundancy in the convolution
kernels. The algorithm finds the exact global optimizer of the decomposition
and is more effective than iterative methods. Based on the decomposition, we
further propose a new method for training low-rank constrained CNNs from
scratch. Interestingly, while achieving a significant speedup, sometimes the
low-rank constrained CNNs delivers significantly better performance than their
non-constrained counterparts. On the CIFAR-10 dataset, the proposed low-rank
NIN model achieves $91.31\%$ accuracy, which also improves upon
state-of-the-art result. We evaluated the proposed method on CIFAR-10 and
ILSVRC12 datasets for a variety of modern CNNs, including AlexNet, NIN, VGG and
GoogleNet with success. For example, the forward time of VGG-16 is reduced by
half while the performance is still comparable. Empirical success suggests that
low-rank tensor decompositions can be a very useful tool for speeding up large
CNNs.

Neural networks are both computationally intensive and memory intensive,
making them difficult to deploy on embedded systems. Also, conventional
networks fix the architecture before training starts; as a result, training
cannot improve the architecture. To address these limitations, we describe a
method to reduce the storage and computation required by neural networks by an
order of magnitude without affecting their accuracy, by learning only the
important connections. Our method prunes redundant connections using a
three-step method. First, we train the network to learn which connections are
important. Next, we prune the unimportant connections. Finally, we retrain the
network to fine tune the weights of the remaining connections. On the ImageNet
dataset, our method reduced the number of parameters of AlexNet by a factor of
9x, from 61 million to 6.7 million, without incurring accuracy loss. Similar
experiments with VGG16 found that the network as a whole can be reduced 6.8x
just by pruning the fully-connected layers, again with no loss of accuracy.

State-of-the-art object detection networks depend on region proposal
algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN
have reduced the running time of these detection networks, exposing region
proposal computation as a bottleneck. In this work, we introduce a Region
Proposal Network (RPN) that shares full-image convolutional features with the
detection network, thus enabling nearly cost-free region proposals. An RPN is a
fully-convolutional network that simultaneously predicts object bounds and
objectness scores at each position. RPNs are trained end-to-end to generate
high-quality region proposals, which are used by Fast R-CNN for detection. With
a simple alternating optimization, RPN and Fast R-CNN can be trained to share
convolutional features. For the very deep VGG-16 model, our detection system
has a frame rate of 5fps (including all steps) on a GPU, while achieving
state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and
2012 (70.4% mAP) using 300 proposals per image. The code will be released.

This paper aims to accelerate the test-time computation of convolutional
neural networks (CNNs), especially very deep CNNs that have substantially
impacted the computer vision community. Unlike existing methods that are
designed for approximating linear filters or linear responses, our method takes
the nonlinear units into account. We develop an effective solution to the
resulting nonlinear optimization problem without the need of stochastic
gradient descent (SGD). More importantly, while current methods mainly focus on
optimizing one or two layers, our nonlinear method enables an asymmetric
reconstruction that reduces the rapidly accumulated error when multiple (e.g.,
>=10) layers are approximated. For the widely used very deep VGG-16 model, our
method achieves a whole-model speedup of 4x with merely a 0.3% increase of
top-5 error in ImageNet classification. Our 4x accelerated VGG-16 model also
shows a graceful accuracy degradation for object detection when plugged into
the latest Fast R-CNN detector.

A very simple way to improve the performance of almost any machine learning
algorithm is to train many different models on the same data and then to
average their predictions. Unfortunately, making predictions using a whole
ensemble of models is cumbersome and may be too computationally expensive to
allow deployment to a large number of users, especially if the individual
models are large neural nets. Caruana and his collaborators have shown that it
is possible to compress the knowledge in an ensemble into a single model which
is much easier to deploy and we develop this approach further using a different
compression technique. We achieve some surprising results on MNIST and we show
that we can significantly improve the acoustic model of a heavily used
commercial system by distilling the knowledge in an ensemble of models into a
single model. We also introduce a new type of ensemble composed of one or more
full models and many specialist models which learn to distinguish fine-grained
classes that the full models confuse. Unlike a mixture of experts, these
specialist models can be trained rapidly and in parallel.

In this work we investigate the effect of the convolutional network depth on
its accuracy in the large-scale image recognition setting. Our main
contribution is a thorough evaluation of networks of increasing depth, which
shows that a significant improvement on the prior-art configurations can be
achieved by pushing the depth to 16-19 weight layers. These findings were the
basis of our ImageNet Challenge 2014 submission, where our team secured the
first and the second places in the localisation and classification tracks
respectively.

The focus of this paper is speeding up the evaluation of convolutional neural
networks. While delivering impressive results across a range of computer vision
and machine learning tasks, these networks are computationally demanding,
limiting their deployability. Convolutional layers generally consume the bulk
of the processing time, and so in this work we present two simple schemes for
drastically speeding up these layers. This is achieved by exploiting
cross-channel or filter redundancy to construct a low rank basis of filters
that are rank-1 in the spatial domain. Our methods are architecture agnostic,
and can be easily applied to existing CPU and GPU convolutional frameworks for
tuneable speedup performance. We demonstrate this with a real world network
designed for scene text character recognition, showing a possible 2.5x speedup
with no loss in accuracy, and 4.5x speedup with less than 1% drop in accuracy,
still achieving state-of-the-art on standard benchmarks.

Treatment of the predictive aspect of statistical mechanics as a form of statistical inference is extended to the density-matrix formalism and applied to a discussion of the relation between irreversibility and information loss. A principle of "statistical complementarity" is pointed out, according to which the empirically verifiable probabilities of statistical mechanics necessarily correspond to incomplete predictions. A preliminary discussion is given of the second law of thermodynamics and of a certain class of irreversible processes, in an approximation equivalent to that of the semiclassical theory of radiation. It is shown that a density matrix does not in general contain all the information about a system that is relevant for predicting its behavior. In the case of a system perturbed by random fluctuating fields, the density matrix cannot satisfy any differential equation because rho˙(t) does not depend only on rho(t), but also on past conditions The rigorous theory involves stochastic equations in the type rho(t)=G(t, 0)rho(0), where the operator G is a functional of conditions during the entire interval (0-->t). Therefore a general theory of irreversible processes cannot be based on differential rate equations corresponding to time-proportional transition probabilities. However, such equations often represent useful approximations.

H.264/AVC is newest video coding standard of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. The main goals of the H.264/AVC standardization effort have been enhanced compression performance and provision of a "network-friendly" video representation addressing "conversational" (video telephony) and "nonconversational" (storage, broadcast, or streaming) applications. H.264/AVC has achieved a significant improvement in rate-distortion efficiency relative to existing standards. This article provides an overview of the technical features of H.264/AVC, describes profiles and applications for the standard, and outlines the history of the standardization process.