Conference Paper

Dynamic Recursive Neural Network

... Some researchers [4] found that not all data samples need deep structures and proposed using CNNs with input-adaptive depth. Nevertheless, whether the proposed strategy was based on early-exiting [5,6,7,8,9,10,11] or layer-skipping [12,13,14,15,16,17], previous depth-adaptive CNNs always used a method of "subtraction", i.e., training a very large model and executing only a part of it each time. Also, for different data samples, only the network structure was adaptive; the model parameters could not be adjusted accordingly. ...
... Built upon Adaptive Computation Time (ACT) [30], SACT-ResNet [14] used a halting score to keep only part of the layers within each residual block, while IAMNN [15] also enforced parameter sharing among layers within the block. Dynamic Recursive Neural Network (DRNN) [16] further extended the idea in [15] by replacing the original residual block with a recursive stack of the same residual layer. Our model also supports the use of a stack of residual layers within a block, but they are associated with slightly different parameters to make a block more expressive in capturing data features, and parameters between adjacent residual blocks are made related to better emulate the knowledge extraction process of a human brain. ...
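Since several of the snippets here revolve around this recursive-residual idea, a minimal PyTorch sketch may help. It is illustrative only (the channel count, iteration cap, and the omitted per-sample gate are assumptions), not DRNN's actual code:

import torch
import torch.nn as nn

class RecursiveResidualBlock(nn.Module):
    """One residual layer whose weights are reused across iterations."""
    def __init__(self, channels, max_iters=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.max_iters = max_iters

    def forward(self, x, num_iters=None):
        # A dynamic gate (omitted here) would choose num_iters per sample.
        for _ in range(num_iters or self.max_iters):
            x = x + self.body(x)  # the same parameters at every iteration
        return x

block = RecursiveResidualBlock(64)
out = block(torch.randn(2, 64, 32, 32), num_iters=2)  # shape (2, 64, 32, 32)

Compared with stacking three distinct residual layers, this stores a single layer's parameters while keeping the effective depth.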
... To illustrate the effectiveness of our Puppet-CNN, we perform experiments on CIFAR-10. We select and rerun several other adaptive models as baselines, including methods using adaptive parameters, such as DFN [21], WeightNet [24] and DDFN [25] (modified to sample-wise, denoted as DDFN-SW), as well as adaptive-depth methods such as BranchyNet [7], SkipNet [12] and DRNN [16]. Since our method focuses on sample-wise adaptation, we select only the most representative sample-wise adaptive methods in the experiments, not those that also use pixel-wise or location-wise adaptation. ...
Preprint
Convolutional Neural Network (CNN) has been applied to more and more scenarios due to its excellent performance in many machine learning tasks, especially with deep and complex structures. However, as the network goes deeper, more parameters need to be stored and optimized. Besides, almost all common CNN models adopt a "train-and-use" strategy in which the structure is pre-defined, the kernel parameters are fixed after training, and the same structure and set of parameters are used for all data without considering content complexity. In this paper, we propose a new CNN framework, named Puppet-CNN, which contains two modules: a puppet module and a puppeteer module. The puppet module is a CNN model used to actually process the input data, just as in other works, but its depth and kernels are generated by the puppeteer module (realized with an Ordinary Differential Equation (ODE)) based on the input complexity each time. By recurrently generating kernel parameters in the puppet module, we can take advantage of the dependence among kernels of different convolutional layers to significantly reduce the size of the CNN model, since only the parameters of the much smaller puppeteer ODE module need to be stored and trained. Through experiments on several datasets, our method has proven superior to traditional CNNs in both performance and efficiency. The model size can be reduced by more than 10 times.
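To make the puppet/puppeteer split concrete, here is a heavily simplified sketch under stated assumptions: the abstract describes an ODE-based puppeteer, but for brevity a small GRU cell stands in for it below, recurrently emitting the kernel of each puppet convolution. Every name and size is illustrative, not the authors' implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Puppeteer(nn.Module):
    """Recurrently generates conv kernels; a GRU stands in for the paper's ODE."""
    def __init__(self, hidden, channels, k=3):
        super().__init__()
        self.cell = nn.GRUCell(hidden, hidden)
        self.to_kernel = nn.Linear(hidden, channels * channels * k * k)
        self.channels, self.k = channels, k

    def step(self, h):
        h = self.cell(torch.zeros_like(h), h)  # evolve the hidden state
        w = self.to_kernel(h).view(self.channels, self.channels, self.k, self.k)
        return w, h

def puppet_forward(x, puppeteer, depth):
    # Depth and every kernel are produced on the fly, so only the small
    # puppeteer's parameters need to be stored and trained.
    h = torch.zeros(1, puppeteer.cell.hidden_size)
    for _ in range(depth):
        w, h = puppeteer.step(h)
        x = F.relu(F.conv2d(x, w, padding=1))
    return x

pup = Puppeteer(hidden=128, channels=16)
out = puppet_forward(torch.randn(1, 16, 32, 32), pup, depth=4)

In the paper the depth itself would be chosen from the input complexity; here it is passed in explicitly.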
... Interestingly, the idea is versatile and can be applied to different types of architectures. These networks can fit seamlessly into CNN [16] or Transformer architectures [17], thus making it possible to replace various blocks with repetitions of the same block. Furthermore, the dynamic use of recursion in specific layers can be achieved using techniques such as dynamic gates, which can reduce the number of executions in certain layers, not only reducing the model size but also improving inference speed [16]. ...
... In order to promote generalization and the reusability of recursive convolutional layers, we implement independent batch normalization in each recursive call. This approach has been shown to assist in the specialization of the convolutional kernels for each recursive call [16]. ...
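The per-call batch normalization just described is a small change in code. Below is a minimal sketch assuming a plain 3x3 convolution and a fixed number of recursive calls (both illustrative): the convolution weights are shared across calls, while each call owns its own BatchNorm statistics and affine parameters:

import torch
import torch.nn as nn

class RecursiveConvWithPerCallBN(nn.Module):
    def __init__(self, channels, num_recursions=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)        # shared
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(channels) for _ in range(num_recursions)]  # per call
        )

    def forward(self, x):
        for bn in self.bns:
            x = torch.relu(bn(self.conv(x)))  # same kernel, call-specific BN
        return x

out = RecursiveConvWithPerCallBN(32)(torch.randn(4, 32, 28, 28))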
Article
Full-text available
This paper addresses the challenge of deploying recognition models in specific scenarios in which memory size is relevant, such as in low-cost devices or browser-based applications. We specifically focus on developing memory-efficient approaches for Handwritten Text Recognition (HTR) by leveraging recursive networks. These networks reuse learned weights across successive layers, thus enabling the maintenance of depth, a critical factor associated with model accuracy, without an increase in memory footprint. We apply neural recursion techniques to models typically used in HTR that contain convolutional and recurrent layers. We additionally study the impact of kernel scaling, which allows the activations of these recursive layers to be modified for greater expressiveness with little cost to memory. Our experiments on various HTR benchmarks demonstrate that recursive networks are, indeed, a good alternative. It is noteworthy that these recursive networks not only preserve but in some instances also enhance accuracy, making them a promising solution for memory-efficient HTR applications. This research establishes the utility of recursive networks in addressing memory constraints in HTR models. Their ability to sustain or improve accuracy while being memory-efficient positions them as a promising solution for practical deployment, especially in contexts where memory size is a critical consideration, such as low-cost devices and browser-based applications.
... For instance, in [18], a gating module was proposed to control layer-wise skip connections and is trained using reinforcement learning methods. In [19], instead of skipping layers, the gating module controls the number of iterations for some layers, while in [20] the gating function is trained jointly. ...
... To that end, we can transform the separation model into an adaptive network which determines the amount of computation needed on a sample-by-sample basis. We achieve this by attaching to the separation network a learnable gating module (based on [19]) that decides, for each input, when to take an early exit. Joint training of the separation model and the gating module not only renders the network dynamic but also enables it to adapt to different inputs. ...
... The gating module used in the experiments follows a similar architecture as described in [19]. Specifically, ...
Preprint
Full-text available
Traditional source separation approaches train deep neural network models end-to-end with all the data available at once by minimizing the empirical risk on the whole training set. On the inference side, after training the model, the user fetches a static computation graph and runs the full model on some specified observed mixture signal to get the estimated source signals. Additionally, many of those models consist of several basic processing blocks which are applied sequentially. We argue that we can significantly increase resource efficiency during both training and inference stages by reformulating a model's training and inference procedures as iterative mappings of latent signal representations. First, we can apply the same processing block more than once on its output to refine the input signal and consequently improve parameter efficiency. During training, we can follow a block-wise procedure which enables a reduction in memory requirements. Thus, one can train a very complicated network structure using significantly less computation compared to end-to-end training. During inference, we can dynamically adjust how many processing blocks and iterations of a specific block an input signal needs using a gating module.
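The snippets and abstract above describe a gate attached to an iterative separation network. The following is a hedged sketch of that pattern, not the paper's code; the block, the gate architecture, and the 0.5 exit threshold are illustrative assumptions:

import torch
import torch.nn as nn

class GatedIterativeNet(nn.Module):
    """Applies one processing block repeatedly; a gate decides when to exit."""
    def __init__(self, block, gate, max_steps=4):
        super().__init__()
        self.block, self.gate, self.max_steps = block, gate, max_steps

    def forward(self, z):
        for _ in range(self.max_steps):
            z = self.block(z)                         # refine the latent signal
            if torch.sigmoid(self.gate(z).mean()) > 0.5:
                break                                 # take an early exit
        return z

block = nn.Sequential(nn.Conv1d(64, 64, 3, padding=1), nn.ReLU())
gate = nn.Conv1d(64, 1, 1)
out = GatedIterativeNet(block, gate)(torch.randn(2, 64, 1000))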
... Analog-based: dynamic structure [129][130][131], dynamic parameters [132][133][134][135]. Dynamic networks adapt their structures or parameters to different inputs (Section 5.2.1). ...
... Dynamic structure models can selectively activate network components conditioned on the input, such as sub-networks [129,324], layers [130,325], or channels [320,326]. For dynamic sub-networks, there are two classic approaches to performing inference with dynamic architectures on each sample: enabling early exiting by cascading multiple models, and skipping branches in a mixture-of-experts (MoE) in a parallel fashion. ...
... Work dimensions of DyNN. Dynamic networks can perform adaptive computation at three different work granularities, i.e., sample-wise [129,318], spatial-wise [334][335][336], and temporal-wise [66,337]. Sample-wise dynamic models process each sample with the abovementioned data-dependent dynamic structures or parameters. ...
Article
Full-text available
Visual recognition is currently one of the most important and active research areas in computer vision, pattern recognition, and even the general field of artificial intelligence. It has great fundamental importance and strong industrial needs; in particular, modern deep neural networks (DNNs) and some brain-inspired methodologies have largely boosted recognition performance on many concrete tasks, with the help of large amounts of training data and new powerful computation resources. Although recognition accuracy is usually the first concern for new progress, efficiency is actually rather important and sometimes critical for both academic research and industrial applications. Moreover, insightful views on the opportunities and challenges of efficiency are also highly required for the entire community. While general surveys on the efficiency issue have been done from various perspectives, as far as we are aware, scarcely any of them focused on visual recognition systematically, and thus it is unclear which advances are applicable to it and what else should be concerned. In this survey, we present a review of recent advances, with our suggestions on possible new directions towards improving the efficiency of DNN-related and brain-inspired visual recognition approaches, including efficient network compression and dynamic brain-inspired networks. We investigate not only from the model but also from the data point of view (which is not the case in existing surveys) and focus on four typical data types (images, video, points, and events). This survey attempts to provide a systematic summary that can serve as a valuable reference and inspire both researchers and practitioners working on visual recognition problems.
... In this paper, we extend these works by decomposing the sharing topology into com-... Dynamic Recurrence for Sharing Parameters: Several works [1,4,16,20] explore parameter sharing through the lens of dynamically repeating layers. However, each technique is applied to a different model architecture, and evaluated in different ways. ...
... Due to space limits, we report accuracy numbers in Appendix B. Furthermore, we discuss the differences between applying weight sharing methods to CNNs versus transformers. [4,11] can only achieve at most a 1.58x and 1.45x share rate while maintaining accuracy. Although [16] can achieve a 12x share rate, it results in an 8.8% accuracy drop. ...
... These results show that isotropic networks can achieve a high share rate while maintaining accuracy with simple weight sharing methods; the traditional pyramid-style networks, by contrast, use complicated sharing schemes [4,11,16]. Significant compression rates can be achieved without loss in accuracy across multiple isotropic ConvMixer models. We also generate a full family of weight sharing models by varying the share rate, which is the reduction factor in the number of unique layers for the weight-shared model compared to the original. ...
Preprint
Recent isotropic networks, such as ConvMixer and vision transformers, have found significant success across visual recognition tasks, matching or outperforming non-isotropic convolutional neural networks (CNNs). Isotropic architectures are particularly well-suited to cross-layer weight sharing, an effective neural network compression technique. In this paper, we perform an empirical evaluation on methods for sharing parameters in isotropic networks (SPIN). We present a framework to formalize major weight sharing design decisions and perform a comprehensive empirical evaluation of this design space. Guided by our experimental results, we propose a weight sharing strategy to generate a family of models with better overall efficiency, in terms of FLOPs and parameters versus accuracy, compared to traditional scaling methods alone, for example compressing ConvMixer by 1.9x while improving accuracy on ImageNet. Finally, we perform a qualitative study to further understand the behavior of weight sharing in isotropic architectures. The code is available at https://github.com/apple/ml-spin.
... The authors of [64] exploited recurrence for early image prediction, and found that an RCNN naturally learns taxonomic representations that are iteratively refined to finer-grain classes, improving explainability. Similar to this paper, some works have explicitly transformed ResNet into RCNN for constructing efficient models [65], [66], which can be taken a step further by using recurrent depthwise separable convolutions [67], [68]. ...
... Since complex samples that benefit from additional iterations are a minority [53], increasing the number of iterations for all samples generally has a globally negative effect. A possible way of exploiting this iterative generality is to adjust the number of iterations dynamically on a per-sample basis, as in [71] or [66], a mechanism that has also been observed in the primate visual cortex [50]. Table 2 compares the four methods when applied to the same ResNet-50. ...
Article
Full-text available
Accurate neural networks can be found just by pruning a randomly initialized overparameterized model, leaving out the need for any weight optimization. The resulting subnetworks are small, sparse, and ternary, making them excellent candidates for efficient hardware implementation. However, finding optimal connectivity patterns is an open challenge. Based on the evidence that residual networks may be approximating unrolled shallow recurrent neural networks, we conjecture that they contain better candidate subnetworks at inference time when explicitly transformed into recurrent architectures. This hypothesis is put to the test on image classification tasks, where we find subnetworks within the recurrent models that are more accurate and parameter-efficient than both the ones found within feedforward models and the full models with learned weights. Furthermore, random recurrent subnetworks are tiny: under a simple compression scheme, ResNet-50 is compressed to a 48.55× smaller memory size without a drastic loss in performance, fitting in under 2 megabytes. Code available at: https://github.com/Lopez-Angel/hidden-fold-networks.
... However, these approaches may include extra building blocks, resulting in a larger number of network parameters and, as a result, more GPU memory. It has been established that using recurrent convolution to repeatedly refine the features extracted at different stages is feasible and successful for many computer vision problems [39][40][41][42]. Guo et al. [39] advocated reusing residual blocks in ResNet to fully utilise the available parameters and greatly reduce model size. ...
... It has been established that using recurrent convolution to repeatedly refine the features extracted at different stages is feasible and successful for many computer vision problems [39][40][41][42]. Guo et al. [39] advocated reusing residual blocks in ResNet to fully utilise the available parameters and greatly reduce model size. Such a mechanism is also beneficial to the evolution of U-Net. ...
Article
Full-text available
In recent years, convolutional neural network architectures have become increasingly complex to achieve improved performance on well-known benchmark datasets. In this research, we have introduced G-Net light, a lightweight modified GoogleNet with an improved filter count per layer to reduce feature overlap and hence reduce complexity. Additionally, by limiting the number of pooling layers in the proposed architecture, we have exploited skip connections to minimize the loss of spatial information. The suggested architecture is analysed using three publicly available datasets for retinal vessel segmentation, namely the DRIVE, CHASE and STARE datasets. The proposed G-Net light achieves an average accuracy of 0.9686, 0.9726, 0.9730 and F1-scores of 0.8202, 0.8048, 0.8178 on the DRIVE, CHASE, and STARE datasets, respectively. The proposed G-Net light achieves state-of-the-art performance and outperforms other lightweight vessel segmentation architectures with a smaller number of trainable parameters.
... Specifically, early-exiting methods allow samples (easy to classify) to be predicted using the early outputs of cascaded DNNs [33] or networks with multiple intermediate classifiers [8]. Moreover, skipping methods selectively activate model components, e.g., layers [9], branches [23], or sub-networks [2], conditioned on the sample. Unlike DI, which dynamically allocates computation for each sample, we seek to boost the performance of a static model on those samples with low-confidence predictions. ...
... Compute the center ȳ for the q-th cluster O_q. Obtain the top-K classes of ȳ, namely top-K(ȳ). Construct D_aux based on top-K(ȳ). ...
Preprint
Conventional deep models predict a test sample with a single forward propagation, which, however, may not be sufficient for predicting hard-classified samples. On the contrary, we human beings may need to carefully check the sample many times before making a final decision. During the recheck process, one may refine/adjust the prediction by referring to related samples. Motivated by this, we propose to predict those hard-classified test samples in a looped manner to boost the model performance. However, this idea may pose a critical challenge: how to construct looped inference, so that the original erroneous predictions on these hard test samples can be corrected with little additional effort. To address this, we propose a general Closed-Loop Inference (CLI) method. Specifically, we first devise a filtering criterion to identify those hard-classified test samples that need additional inference loops. For each hard sample, we construct an additional auxiliary learning task based on its original top-K predictions to calibrate the model, and then use the calibrated model to obtain the final prediction. Promising results on ImageNet (in-distribution test samples) and ImageNet-C (out-of-distribution test samples) demonstrate the effectiveness of CLI in improving the performance of any pre-trained model.
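As a rough illustration of the loop this abstract describes (a filtering criterion plus a top-K auxiliary task), here is a sketch under stated assumptions; build_calibrated is a hypothetical hook standing in for the paper's calibration step, and the confidence threshold and K=5 are illustrative:

import torch

def closed_loop_predict(model, x, build_calibrated, threshold=0.8):
    # Flag low-confidence predictions, then re-infer only those samples with
    # a model calibrated on an auxiliary task built from their top-K classes.
    probs = torch.softmax(model(x), dim=-1)
    conf, pred = probs.max(dim=-1)
    hard = conf < threshold                              # filtering criterion
    if hard.any():
        topk = probs[hard].topk(5, dim=-1).indices       # basis of the aux task
        calibrated = build_calibrated(model, x[hard], topk)  # hypothetical hook
        pred[hard] = calibrated(x[hard]).argmax(dim=-1)
    return pred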
... Recurrent convolutional networks. Iterative refinement of features has been proved effective in many computer vision tasks (Han et al., 2018; Guo et al., 2019; Wang et al., 2019; Alom et al., 2018). Auto-context (Tu and Bai, 2009) implicitly extracts image features together with context information by learning a series of classifiers in a similar recurrent manner. ...
... However, their classifiers need to be built independently in a boosting style, whereas our method recurses semantic features inside a single encoder-decoder network. (Guo et al., 2019) reuses ResNet residual blocks (He et al., 2016) to fully utilize the limited parameters. With a similar recurrent strategy, (Wang et al., 2019) proposed R-U-Net, which connects multiple U-Net architectures head-to-tail with shared parameters to enhance segmentation performance. ...
Preprint
U-Net, as an encoder-decoder architecture with forward skip connections, has achieved promising results in various medical image analysis tasks. Many recent approaches have also extended U-Net with more complex building blocks, which typically increase the number of network parameters considerably. Such complexity makes the inference stage highly inefficient for clinical applications. Towards an effective yet economical segmentation network design, in this work, we propose backward skip connections that bring decoded features back to the encoder. Our design can be jointly adopted with forward skip connections in any encoder-decoder architecture, forming a recurrence structure without introducing extra parameters. With the backward skip connections, we propose a U-Net based network family, namely Bi-directional O-shape networks, which set new benchmarks on multiple public medical imaging segmentation datasets. On the other hand, with the plainest architecture (BiO-Net), network computation inevitably increases with the pre-set recurrence time. We have thus studied the deficiency bottleneck of such recurrent design and propose a novel two-phase Neural Architecture Search (NAS) algorithm, namely BiX-NAS, to search for the best multi-scale bi-directional skip connections. The ineffective skip connections are then discarded to reduce computational costs and speed up network inference. The finally searched BiX-Net yields the least network complexity and outperforms other state-of-the-art counterparts by large margins. We evaluate our methods on both 2D and 3D segmentation tasks in a total of six datasets. Extensive ablation studies have also been conducted to provide a comprehensive analysis for our proposed methods.
... Cross-layer parameter sharing has proven to be an effective method for achieving parameter efficiency in deep learning models such as RNNs (Graves, 2016b; Sherstinsky, 2018), CNNs (Eigen et al., 2014; Guo et al., 2019; Savarese and Maire, 2019), and the popular Transformer architecture. The Universal Transformer (Dehghani et al., 2019), a recurrent self-attentive model, demonstrated superior performance to non-recursive counterparts with significantly fewer parameters. ...
Preprint
Full-text available
Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as a form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines -- and can even recover most of the performance of the original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting. In a theoretical analysis, we show that this has the potential to lead to significant (2-3x) gains in inference throughput.
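The core "layer tying" idea is easy to state in code. Below is a minimal sketch, not the authors' implementation: one unique Transformer block is repeated in a loop to supply the full depth (the depth-wise LoRA relaxation and early exiting are omitted, and all sizes are illustrative):

import torch
import torch.nn as nn

shared_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)

def recursive_encoder(x, layer, num_loops=3):
    # A vanilla N-layer encoder becomes num_loops passes over one tied layer,
    # so the parameter count stays that of a single layer.
    for _ in range(num_loops):
        x = layer(x)
    return x

out = recursive_encoder(torch.randn(2, 16, 256), shared_layer)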
... The development of artificial neural networks (ANNs) has been tailored for specific tasks, such as image classification (Krizhevsky et al. [3]), computer vision tasks (Guo et al. [4]), and speech recognition tasks (Hochreiter and Schmidhuber [5]). A unique aspect of ANNs is that their topologies can be evolved alongside their weights and biases, allowing for rapid evolution. ...
Article
Full-text available
This paper presents a novel exploration of the use of an evolving neural network approach to generate dynamic content for video games, specifically for a tower defence game. The objective is to employ the NeuroEvolution of Augmenting Topologies (NEAT) technique to train a NEAT neural network as a wave manager to generate enemy waves that challenge the player’s defences. The approach is extended to incorporate NEAT-generated curriculums for tower deployments to gradually increase the difficulty for the generated enemy waves, allowing the neural network to learn incrementally. The approach dynamically adapts to changes in the player’s skill level, providing a more personalised and engaging gaming experience. The quality of the machine-generated waves is evaluated through a blind A/B test with the Games Experience Questionnaire (GEQ), and results are compared with manually designed human waves. The study finds no discernible difference in the reported player experience between AI and human-designed waves. The approach can significantly reduce the time and resources required to design game content while maintaining the quality of the player experience. The approach has the potential to be applied to a range of video game genres and within the design and development process, providing a more personalised and engaging gaming experience for players.
... the latter designs two fully connected layers in each block. Furthermore, the dynamic recursive network [126] also utilizes shared parameters to skip layers; it differs from IamNN in that it decides when to skip via gating modules. ...
Article
The dynamic neural network (DNN), in contrast to the static counterpart, offers numerous advantages, such as improved accuracy, efficiency, and interpretability. These benefits stem from the network's flexible structures and parameters, making it highly attractive and applicable across various domains. As the broad learning system (BLS) continues to evolve, DNNs have expanded beyond deep learning (DL), extending to a more comprehensive range of domains. Therefore, this comprehensive review article focuses on two prominent areas where DNN structures have rapidly developed: 1) DL and 2) broad learning. This article provides an in-depth exploration of the techniques related to dynamic construction and inference. Furthermore, it discusses the applications of DNNs in diverse domains while also addressing open issues and highlighting promising research directions. By offering a comprehensive understanding of DNNs, this article serves as a valuable resource for researchers, guiding them toward future investigations.
... Specifically, we choose YOLOv3 (Redmon & Farhadi, 2018), a technically sound and mature algorithm in the field of image recognition, to recognize relatively small targets in satellite images. The main advantages of YOLOv3 include its ability to effectively extract high-dimensional information from images using ResNet-53 (Guo et al., 2019) as a feature extraction network, and its low GPU occupancy and high detection speed due to its one-stage detection mode with lower computational complexity. The model is pre-trained with DOTA 2 and then transfer-trained with Los Angeles satellite imagery data, a training strategy proposed by Ganji et al. (2022). ...
... Unified Classification and Generation Models. Recent advanced approaches [18,46,64] that perform well in one task often exhibit poor performance in the other. Xie et al. [60] first drew the connection between the discriminative and generative power of a ConvNet with EBMs. ...
Conference Paper
Full-text available
Learning image classification and image generation using the same set of network parameters presents a formidable challenge. Recent advanced approaches that perform well in one task often exhibit poor performance in the other. This work introduces an energy-based classifier and generator, namely EGC, which can achieve superior performance in both tasks using a single neural network. Unlike conventional classifiers that produce a label given an image (i.e., a conditional distribution p(y|x)), the forward pass in EGC is a classification model that yields a joint distribution p(x, y), enabling a diffusion model in its backward pass by marginalizing out the label y to estimate the score function. Furthermore, EGC can be adapted for unsupervised learning by considering the label as latent variables. EGC achieves competitive generation results compared with state-of-the-art approaches on ImageNet-1k, CelebA-HQ and LSUN Church, while achieving superior classification accuracy and robustness against adversarial attacks on CIFAR-10. This work marks the inaugural success in mastering both domains using a unified network parameter set. We believe that EGC bridges the gap between discriminative and generative learning. Code will be released at https://github.com/GuoQiushan/EGC.
... In this study, we addressed these intricacies by using 45 indicators as input indices for the BP neural network, enabling it to uncover relationships with exercise performance. While neural networks such as ANN [61,62], CNN [63], and RNN [64] offer strong fitting abilities, their requirement for large sample sizes often restricts their application, primarily to image recognition. In contrast, BP neural networks, with their lesser data volume requirements, have been widely applied in similar studies, thus representing a fitting choice for our research. ...
Article
Full-text available
Carbohydrate-protein supplement (CPS) intake is a well-established strategy for enhancing athletic performance, promoting glycogen replenishment, maintaining a positive nitrogen balance, and minimizing muscle damage in endurance athletes. Current CPS intake recommendations often rely solely on weight, lacking personalization. This study aimed to develop a machine learning-based personalized CPS intake recommendation system for endurance sports enthusiasts. We recruited 171 participants and collected 45 indicators from 12 diverse aspects, including lifestyle, psychological state, sleep quality, demographics, anthropometrics and body composition, physical activity levels, exercise capacity, blood markers and central nervous system parameters, cardiovascular metrics, meal timings, and beverage composition. Additionally, we assessed each subject’s performance in the Jensen Kurt’s 60-minute rowing ergometer distance race. Utilizing back propagation (BP) neural networks with 5-fold cross-validation, we identified the relationship between the 45 indicators and the 1-hour rowing distance, and observed a well-fitted model. We further employed an enumeration method to tailor the CPS intake protocol for each individual. Our results demonstrate the feasibility and potential of using machine learning to deliver personalized CPS intake recommendations. Future work will focus on expanding the dataset’s dimensions to iterate, update, and enhance the model’s robustness.
... Each convolutional layer uses 64 filters with 3 × 3 kernels and is activated by PReLU. We use a recursive structure [25] to allow the model to build a deep structure with a small number of parameters. By iteratively operating on the results of the previous iteration, this light structure can achieve an effect similar to that of a deep network, so that deeper features that aid image restoration can be extracted from the image. ...
Article
Full-text available
Video compression algorithms are commonly used to reduce the number of bits required to represent a video with a high compression ratio. However, this can result in the loss of content details and visual artifacts that affect the overall quality of the video. We propose a learning-based restoration method to address this issue, which can handle varying degrees of compression artifacts with a single model by predicting the difference between the original and compressed video frames to restore video quality. To achieve this, we adopted a recursive neural network model with dilated convolution, which increases the receptive field of the model while keeping the number of parameters low, making it suitable for deployment on a variety of hardware devices. We also designed a temporal fusion module and integrated the color channels into the objective function. This enables the model to analyze temporal correlation and repair chromaticity artifacts. Despite handling color channels, and unlike other methods that have to train a different model for each quantization parameter (QP), the number of parameters in our lightweight model is kept to only about 269 k, requiring only about one-twelfth of the parameters used by other methods. Our model applied to the HEVC test model (HM) improves the compressed video quality by an average of 0.18 dB of BD-PSNR and −5.06% of BD-BR.
... Gate Mechanism. In many works [19], [23], [41], [42], [43], [44], [45], [46] with dynamic-depth or dynamic-width architectures, a gating network G_i(·) with a sigmoid activation function is often used to generate gating values for gate g_i as follows: ...
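The gating formula itself is truncated in the snippet above. For context, a common form of such a sigmoid gate in this literature (an assumption here, not necessarily the exact formulation of the cited works) is

g_i = \sigma\big(G_i(x)\big) = \frac{1}{1 + e^{-G_i(x)}}

where x is the input feature of the i-th block and the block is executed only when g_i crosses a threshold such as 0.5; during training, the hard decision is often relaxed (e.g., with Gumbel-Softmax) to keep the gate differentiable.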
Preprint
Dynamic neural networks can greatly reduce computation redundancy without compromising accuracy by adapting their structures based on the input. In this paper, we explore the robustness of dynamic neural networks against energy-oriented attacks targeted at reducing their efficiency. Specifically, we attack dynamic models with our novel algorithm GradMDM. GradMDM is a technique that adjusts the direction and the magnitude of the gradients to effectively find a small perturbation for each input, that will activate more computational units of dynamic models during inference. We evaluate GradMDM on multiple datasets and dynamic models, where it outperforms previous energy-oriented attack techniques, significantly increasing computation complexity while reducing the perceptibility of the perturbations.
... The work in [11] uses a multilayer perceptron (MLP) to model user-item interactions, which further addresses the problem of CF endogenous feedback and opens up new research avenues for deep learning recommendation algorithms. In addition, convolutional neural networks (CNN) [12], recurrent neural networks (RNN) [13], and deep reinforcement learning (DRL) [14] have been widely used in CF, making breakthroughs in alleviating the cold-start and data-sparsity problems of CF and improving algorithm performance. ...
Article
Full-text available
This paper proposes a novel graph neural network recommendation method to alleviate the user cold-start problem caused by too few relevant items in personalized recommendation collaborative filtering. A deep feedforward neural network is constructed to transform the bipartite graph of user–item interactions into the spectral domain, using a random walk method to discover potential correlation information between users and items. Then, a finite-order polynomial is used to optimize the convolution process and accelerate the convergence of the convolutional network, so that deep connections between users and items in the spectral domain can be discovered quickly. We conducted experiments on the classic dataset MovieLens-1M. The recall and precision were improved, and the results show that the method can improve the accuracy of recommendation results, tap the association information between users and items more effectively, and significantly alleviate the user cold-start problem.
... Recursive neural networks, first proposed in 1990, are considered a generalization of recurrent neural networks; they are artificial neural networks with a tree-like hierarchical structure whose nodes recursively respond to the input information in the order of their connections [57]. When each parent node of a recursive neural network is connected to only one child node, its structure is equivalent to that of a fully connected recurrent neural network [58]. Since recursive neural networks have variable topology and shared weights, they are used for machine learning tasks that contain structural relationships. ...
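A tiny sketch of the tree-structured computation described above, with illustrative sizes: the same composition function (shared weights) is applied bottom-up at every node, and a chain-shaped tree degenerates into an ordinary recurrent network:

import torch
import torch.nn as nn

compose = nn.Linear(2 * 8, 8)  # one shared weight matrix for every tree node

def encode(node):
    # A node is either a leaf embedding (tensor) or a (left, right) pair.
    if isinstance(node, torch.Tensor):
        return node
    left, right = node
    return torch.tanh(compose(torch.cat([encode(left), encode(right)], dim=-1)))

leaf = lambda: torch.randn(8)
tree = ((leaf(), leaf()), leaf())   # parse tree of, e.g., "(w1 w2) w3"
out = encode(tree)                  # a single 8-dim representation of the tree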
Article
Full-text available
Underwater target recognition is a research component that is crucial to realizing crewless underwater detection missions and has significant prospects in both civil and military applications. This paper provides a comprehensive description of the current stage of deep-learning methods with respect to raw hydroacoustic data classification, focusing mainly on the variety and recognition of vessels and environmental noise from raw hydroacoustic data. This work not only aims to describe the latest research progress in this field but also summarizes three main elements of the current stage of development: feature extraction in the time and frequency domains, data enhancement by neural networks, and feature classification based on deep learning. In this paper, we analyze and discuss the process of hydroacoustic signal processing; demonstrate that the method of feature fusion can be used in the pre-processing stage in classification and recognition algorithms based on raw hydroacoustic data, which can significantly improve target recognition accuracy; show that data enhancement algorithms can be used to improve the efficiency of recognition in complex environments in terms of deep learning network structure; and further discuss the field’s future development directions.
... The dynamic CNNs adaptive to each input sample can be divided into two categories (Han et al., 2021). The approaches in the first class have a conditional structure, but the parameters are stationary (Veit and Belongie, 2018; Guo et al., 2019). They conditionally skip the computation of some layers in the network according to each input sample to alter the network depth without placing extra classifiers. ...
Article
Medical image segmentation is a critical step in pathology assessment and monitoring. Many methods utilize a deep convolutional neural network for various medical segmentation tasks, such as polyp segmentation, skin lesion segmentation, etc. However, due to the inherent difficulty of medical images and tremendous data variations, they usually perform poorly in some intractable cases. In this paper, we propose an input-specific network called conditional-synergistic convolution and lesion decoupling network (CCLDNet) to solve these issues. First, in contrast to existing CNN-based methods with stationary convolutions, we propose the conditional synergistic convolution (CSConv) that aims to generate a specialist convolution kernel for each lesion. CSConv has the ability of dynamic modeling and could be leveraged as a basic block to construct other networks in a broad range of vision tasks. Second, we devise a lesion decoupling strategy (LDS) to decouple the original lesion segmentation map into two soft labels, i.e., a lesion center label and a lesion boundary label, to reduce the segmentation difficulty. Besides, we use a transformer network as the backbone, further erasing the fixed structure of the standard CNN and empowering the dynamic modeling capability of the whole framework. Our CCLDNet outperforms state-of-the-art approaches by a large margin on a variety of benchmarks, including polyp segmentation (89.22% dice score on EndoScene) and skin lesion segmentation (91.15% dice score on ISIC2018). Our code is available at https://github.com/QianChen98/CCLD-Net.
... However, (Greff et al., 2016b) further showed empirically that residual mappings in ResNets and Highway networks (Srivastava et al., 2015a; Zilly et al., 2017) in vision contexts do not learn new representations but instead tend to iteratively improve the representation under each residual block. This iterative refinement scheme was further confirmed as a key reason for the great performance of residual networks (Casanova et al., 2018; Guo et al., 2019; Jastrzebski et al., 2018; Zhang et al., 2019). In recurrent networks, unitary/orthogonal RNNs circumvented the problem of vanishing/exploding gradients by constraining the hidden-to-hidden transition matrix's eigenvalues to 1, using unitary/orthogonal mappings (Arjovsky et al., 2016a; Jing et al., 2017; Lezcano-Casado and Martínez-Rubio, 2019). ...
Preprint
Full-text available
Residual mappings have been shown to perform representation learning in the first layers and iterative feature refinement in higher layers. This interplay, combined with their stabilizing effect on the gradient norms, enables them to train very deep networks. In this paper, we take a step further and introduce entangled residual mappings to generalize the structure of the residual connections and evaluate their role in iterative learning representations. An entangled residual mapping replaces the identity skip connections with specialized entangled mappings such as orthogonal, sparse, and structural correlation matrices that share key attributes (eigenvalues, structure, and Jacobian norm) with identity mappings. We show that while entangled mappings can preserve the iterative refinement of features across various deep models, they influence the representation learning process in convolutional networks differently than attention-based models and recurrent neural networks. In general, we find that for CNNs and Vision Transformers entangled sparse mapping can help generalization while orthogonal mappings hurt performance. For recurrent networks, orthogonal residual mappings form an inductive bias for time-variant sequences, which degrades accuracy on time-invariant tasks.
... In RNNs, inputs that maintain a tree-style hierarchy are handled by a recursive neural network. An example of such a method is as follows: the parse tree of a sentence is explored by recursively retrieving the output of an operation completed on a small piece of text [10]. ...
Article
Full-text available
A distributed denial-of-service (DDoS) attack attempts to prevent people from accessing a server. A website may become inaccessible due to a DDoS attack because the server is inundated with fake requests and cannot handle real ones. A DDoS attack affects a large number of computers. Attackers employ a zombie network, which is a collection of infected machines on which the attacker has hidden the denial-of-service attacking application, to carry out a DDoS attack. The MATLAB 2018a simulator was used in this study for training. Additionally, during design, the knowledge discovery dataset (KDD) was cleaned and the values of attacks were incorporated. A neural network model was subsequently developed, and the KDD was trained using a recursive artificial neural network. This network was developed using five distinct training algorithms: 1) Fletcher–Powell conjugate gradient, 2) Polak–Ribière conjugate gradient, 3) resilient backpropagation, 4) conjugate gradient with Powell/Beale restarts, and 5) gradient descent with variable learning rate. The artificial neural network toolset in MATLAB was used to investigate the detection of DDoS attacks. The conjugate gradient with Powell/Beale restarts algorithm had a success rate of 99.9% and a training time of 00:53. This inquiry uses the KDD-CUP99 dataset and, according to the results, achieves a better level of accuracy.
... Special architectures. One way is to change the architecture of the model to support adaptive computations [4,14,15,18,25,27,30,37,42,51,54]. For example, models that represent a neural network as a fixed-point function can have the property of adaptive computation by default. ...
Preprint
We introduce A-ViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. A-ViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds. We reformulate Adaptive Computation Time (ACT) for this task, extending halting to discard redundant spatial tokens. The appealing architectural properties of vision transformers enable our adaptive token reduction mechanism to speed up inference without modifying the network architecture or inference hardware. We demonstrate that A-ViT requires no extra parameters or sub-network for halting, as we base the learning of adaptive halting on the original network parameters. We further introduce distributional prior regularization that stabilizes training compared to prior ACT approaches. On the image classification task (ImageNet1K), we show that our proposed A-ViT yields high efficacy in filtering informative spatial features and cutting down on the overall compute. The proposed method improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3% accuracy drop, outperforming prior art by a large margin.
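A rough sketch of ACT-style token halting in the spirit of the abstract above: each token accumulates a halting score across layers and is frozen once its cumulative score passes 1 - eps. Two deliberate simplifications are assumptions of this sketch, not A-ViT itself: a separate halting head is added for clarity (the paper derives halting from the original network parameters), and halted tokens are merely masked rather than actually removed from computation:

import torch
import torch.nn as nn

def act_token_halting(tokens, layers, halt_head, eps=0.01):
    # tokens: (batch, num_tokens, dim). Each token accumulates a halting
    # score; once it passes 1 - eps the token is frozen for later layers.
    cum = torch.zeros(tokens.shape[:2])
    active = torch.ones_like(cum, dtype=torch.bool)
    for layer in layers:
        updated = layer(tokens)
        tokens = torch.where(active.unsqueeze(-1), updated, tokens)
        cum = cum + torch.sigmoid(halt_head(tokens)).squeeze(-1) * active
        active = active & (cum < 1 - eps)
    return tokens

layers = [nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(6)]
halt_head = nn.Linear(64, 1)
out = act_token_halting(torch.randn(2, 50, 64), layers, halt_head)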
Article
Tiny Machine Learning (TinyML) is an emerging technology proposed by the scientific community for developing autonomous and secure devices that can gather, process, and provide results without transferring data to external entities. The technology aims to democratize AI by making it available to more sectors and contribute to the digital revolution of intelligent devices. In this work, a classification of the most common optimization techniques for Neural Network compression is conducted. Additionally, a review of the development boards and TinyML software is presented. Furthermore, the work provides educational resources, a classification of the technology applications, and future directions and concludes with the challenges and considerations.
Article
Adaptive optimization methods for deep learning adjust the inference task to the current circumstances at runtime to improve the resource footprint while maintaining the model’s performance. These methods are essential for the widespread adoption of deep learning, as they offer a way to reduce the resource footprint of the inference task while also having access to additional information about the current environment. This survey covers the state-of-the-art at-runtime optimization methods, provides guidance for readers to choose the best method for their specific use-case, and also highlights current research gaps in this field.
Article
Full-text available
While deep neural networks (DNNs) have brought revolutions in many intelligent services and systems, the deployment of high-performing models for real-world applications faces challenges posed by resource constraints and diverse operating environments. While existing methods such as model compression combined with inference accelerators have enhanced the efficiency of deep neural networks, they are not adaptable to dynamically changing resource conditions since they provide static accuracy-efficiency trade-offs. Further, since they are not aware of performance requirements, such as desired inference latency, they are not able to provide robust and effective performance under unpredictable workloads. This paper introduces a holistic solution to address this challenge, consisting of two key components: adaptive depth neural networks and a Quality of Service (QoS)-aware inference accelerator. The adaptive depth neural networks exhibit the ability to scale computation instantly with minimal impact on accuracy, utilizing a novel architectural pattern and training algorithm. Complementing this, the QoS-aware inference accelerator employs a feedback control loop, adapting network depth dynamically to meet the desired inference latency. Experimental results demonstrate that the proposed adaptive depth networks outperform non-adaptive counterparts, achieving up to 38% dynamic acceleration via depth adaptation, with a marginal accuracy loss of 1.5%. Furthermore, the QoS-aware inference accelerator successfully controls network depth at runtime, ensuring robust performance in unpredictable environments.
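As an illustration of the feedback control loop described above, here is a minimal sketch under stated assumptions; model_fn, the one-step controller, and the 0.9 headroom factor are all hypothetical, not the paper's accelerator:

import time

def qos_depth_step(model_fn, x, target_latency_s, depth, min_d=1, max_d=50):
    # One control iteration: run the adaptive-depth model, measure latency,
    # and nudge the executed depth toward the latency target.
    start = time.perf_counter()
    y = model_fn(x, depth)
    latency = time.perf_counter() - start
    if latency > target_latency_s and depth > min_d:
        depth -= 1          # too slow: scale computation down
    elif latency < 0.9 * target_latency_s and depth < max_d:
        depth += 1          # headroom available: restore accuracy
    return y, depth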
Article
Many recent image restoration methods use Transformer as the backbone network and redesign the Transformer blocks. In contrast, we explore the parameter-sharing mechanism over Transformer blocks and propose a dynamic recursive process to address the image super-resolution task efficiently. We first present a Recursive Image Super-resolution Transformer (RIST). By sharing the weights across different blocks, a plain forward process through the whole Transformer network can be folded into recursive iterations through a Transformer block. Such a parameter-sharing based recursive process can not only reduce the model size greatly, but also enable restoring images progressively. Features in the recursive process are modeled as a sequence and propagated with a temporal attention network. Besides, by analyzing the prediction variation across different iterations in RIST, we design a dynamic recursive process that can allocate adaptive computation costs to different samples. Specifically, a quality assessment network estimates the restoration quality and terminates the recursive process dynamically. We propose a relativistic learning strategy to simplify the objective from absolute image quality assessment to relativistic quality comparison. The proposed Recursive Image Super-resolution Transformer with Relativistic Assessment (RISTRA) reduces the model size greatly with the parameter-sharing mechanism, and achieves an instance-wise dynamic restoration process as well. Extensive experiments on several image super-resolution benchmarks show the superiority of our approach over state-of-the-art counterparts.
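A compact sketch of the dynamic recursive process this abstract describes, with illustrative components (the shared block, quality head, and stopping rule below are stand-ins, not the RISTRA implementation): one parameter-shared block is iterated, and the recursion terminates when the estimated quality stops improving, mirroring the relativistic quality comparison:

import torch
import torch.nn as nn

def dynamic_recursive_restore(x, shared_block, quality_head, max_iters=8):
    best_q = float("-inf")
    for _ in range(max_iters):
        x = shared_block(x)                   # one recursive refinement pass
        q = quality_head(x).mean().item()     # estimated restoration quality
        if q <= best_q:
            break       # stop: no improvement (a full implementation would
        best_q = q      # keep and return the best iterate instead)
    return x

shared_block = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.ReLU())
quality_head = nn.Conv2d(3, 1, 1)
out = dynamic_recursive_restore(torch.randn(1, 3, 64, 64), shared_block, quality_head)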
Chapter
This paper discusses FPGA implementation of a convolutional neural network (CNN) for surgical image segmentation, which is part of a project to develop an automatic endoscope manipulation robot for laparoscopic surgery. From a viewpoint of hardware design, the major challenge to be addressed is that simple parallel implementation with spatial expansion requires a huge amount of FPGA resources. To cope with this problem, we propose a highly efficient implementation approach focusing on the recursive structure of the proposed network. Experimental results showed that the dominant computing resources could be reduced by about half in exchange for a 6% increase in memory resources and a 0.01% increase in latency. It was also observed that the operations performed on the network itself did not change, keeping the same inference results and throughput.
Chapter
This study aims to develop a new concept resulting from a synthesis of knowledge management practices with the collective engagement process. Knowledge management is defined as managing knowledge effectively within an organization and treating knowledge as an organizational asset. The knowledge management field identifies two main types of knowledge, explicit and tacit, and includes four main sections: people, process, technology, and governance. Engagement relates to an employee's understanding of why and how to continuously provide an optimal contribution to knowledge production, and to its influence on both sharing and utilizing knowledge. This study uses an integrated, comprehensive literature review. It concludes that Knowledge Quality Engagement (KQE) is defined as the quality of acquiring and implementing knowledge, involving the cognitive and affective aspects of individual engagement. We also propose that KQE has four dimensions: (a) quality of knowledge acquisition, (b) quality of knowledge utilization, (c) cognitive involvement, and (d) affective engagement in seeking and sharing knowledge. In the future, we will empirically examine this new concept and propose that this new knowledge quality will improve human resource performance. Keywords: knowledge creation, knowledge implementation, affective engagement, cognitive engagement, knowledge quality
Chapter
This paper aims to develop a new concept of leadership style based on psychological work contracts and achievement motivation. We used an extensive and comprehensive literature review to create a theoretical synthesis. The result shows that Psychological Achievement Leadership can be defined as leadership that inspires and encourages members to always excel based on the psychological relational work contract that has been agreed upon. The ultimate goal of having the best work performance and achievement is psychological satisfaction for both leader and member. We also propose four dimensions to indicate psychological achievement leadership: Achievement Motivation, Inspirational Motivation, Affective Work Contract, and Cognitive Work Contract. In future research, we will empirically test the effectiveness of this concept in improving organizational performance. Keywords: transformational leadership, psychological contract, inspirational achievement, motivational inspiration
Chapter
This study aims to develop a conceptual framework that describes the essentials needed when an organization uses the metaverse as an alternative to virtual offices. Recent discussion in the existing literature widely concludes that hybrid working offers high productivity, wellbeing, and employee mental health. However, not all work types can be done through the metaverse, though certain parts of work might be finished there. Another line of discussion positions the metaverse as a place for leisure rather than work. Drawing from a review of the current literature and interviews with three senior leaders, we provide detailed insight ranging from essential needs to a strategy for maximizing the metaverse as a virtual office, an action list to be taken by the top management team, and modifications to organizational policies and practices that can be considered for implementation. The specific outcome targeted in this research is to understand the opportunities and challenges of the future of work and the workplace. Keywords: metaverse, future work, future workplace, mental health, wellbeing, virtual environment
Article
Dynamic neural networks can greatly reduce computation redundancy without compromising accuracy by adapting their structures based on the input. In this paper, we explore the robustness of dynamic neural networks against energy-oriented attacks targeted at reducing their efficiency. Specifically, we attack dynamic models with our novel algorithm GradMDM. GradMDM is a technique that adjusts the direction and the magnitude of the gradients to effectively find a small perturbation for each input, that will activate more computational units of dynamic models during inference. We evaluate GradMDM on multiple datasets and dynamic models, where it outperforms previous energy-oriented attack techniques, significantly increasing computation complexity while reducing the perceptibility of the perturbations.
Preprint
Full-text available
Reverse-mode differentiation is used for optimization, but it introduces references, which break the purity of the underlying programs, making them notoriously harder to optimize. We present a reverse-mode differentiation on a purely functional language with array operations. It is the first one to deliver a provably efficient, purely functional, and denotationally correct reverse-mode differentiation. We show that our transformation is semantically correct and verifies the cheap gradient principle. Inspired by PROPs and compilation to categories, we introduce a novel intermediate representation that we call 'unary form'. Our reverse-mode transformation is factored as a compilation scheme through this intermediate representation. We obtain provably efficient gradients by performing general partial evaluation optimizations after our reverse-mode transformation, as opposed to manually derived ones. For simple first-order programs, the obtained output programs resemble static-single-assignment (SSA) code. We emphasize the modularity of our approach and show how our language can easily be enriched with more optimized primitives, as required for some speed-ups in practice.
Chapter
We present a neat yet effective recursive operation on vision transformers that can improve parameter utilization without involving additional parameters. This is achieved by sharing weights across depth of transformer networks. The proposed method can obtain a substantial gain (~2%) simply using naïve recursive operation, requires no special or sophisticated knowledge for designing principles of networks, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by recursive operation while maintaining the superior accuracy, we propose an approximating method through multiple sliced group self-attentions across recursive layers which can reduce the cost consumption by 10–30% without sacrificing performance. We call our model Sliced Recursive Transformer (SReT), a novel and parameter-efficient vision transformer design that is compatible with a broad range of other designs for efficient ViT architectures. Our best model establishes significant improvement on ImageNet-1K over state-of-the-art methods while containing fewer parameters. The proposed weight sharing mechanism by sliced recursion structure allows us to build a transformer with more than 100 or even 1000 shared layers with ease while keeping a compact size (13–15 M), to avoid optimization difficulties when the model is too large. The flexible scalability has shown great potential for scaling up models and constructing extremely deep vision transformers. Code is available at https://github.com/szq0214/SReT.
Chapter
Dynamic inference networks improve computational efficiency by executing only a subset of network components, i.e., an executing path, conditioned on the input sample. Prevalent methods typically assign routers to computational blocks so that each block can be skipped or executed. However, such inference mechanisms are prone to instability during optimization. First, a dynamic inference network is more sensitive to its routers than to its computational blocks. Second, the components executed by the network vary across samples, resulting in unstable feature evolution throughout the network. To alleviate these problems, we propose SP-Nets to slow down this progression from two aspects. First, we design a dynamic auxiliary module that slows down the progression in routers by drawing on historical information. Moreover, we regularize the feature evolution directions across the network to smooth feature extraction from the perspective of information flow. We conduct extensive experiments on three widely used benchmarks and show that our proposed SP-Nets achieve state-of-the-art performance in terms of efficiency and accuracy. Keywords: dynamic inference, slowly progressing, executing path regularization, feature evolution regularization.
Chapter
Recent isotropic networks, such as ConvMixer and Vision Transformers, have found significant success across visual recognition tasks, matching or outperforming non-isotropic Convolutional Neural Networks. Isotropic architectures are particularly well-suited to cross-layer weight sharing, an effective neural network compression technique. In this paper, we perform an empirical evaluation on methods for sharing parameters in isotropic networks (SPIN). We present a framework to formalize major weight sharing design decisions and perform a comprehensive empirical evaluation of this design space. Guided by our experimental results, we propose a weight sharing strategy to generate a family of models with better overall efficiency, in terms of FLOPs and parameters versus accuracy, compared to traditional scaling methods alone, for example compressing ConvMixer by 1.9× while improving accuracy on ImageNet. Finally, we perform a qualitative study to further understand the behavior of weight sharing in isotropic architectures. The code is available at https://github.com/apple/ml-spin.
Chapter
Dynamic neural networks can adapt their structures or parameters based on different inputs. By reducing the computation redundancy for certain samples, they can greatly improve computational efficiency without compromising accuracy. In this paper, we investigate the robustness of dynamic neural networks against energy-oriented attacks. We present a novel algorithm, named GradAuto, to attack both dynamic depth and dynamic width models, where dynamic depth networks reduce redundant computation by skipping some intermediate layers while dynamic width networks adaptively activate a subset of neurons in each layer. Our GradAuto carefully adjusts the direction and the magnitude of the gradients to efficiently find an almost imperceptible perturbation for each input, which will activate more computation units during inference. In this way, GradAuto effectively boosts the computational cost of models with dynamic architectures. Compared to previous energy-oriented attack techniques, GradAuto obtains state-of-the-art results and on average recovers 100% of the FLOPs reduced by dynamic depth and dynamic width models. Furthermore, we demonstrate that GradAuto offers us great control over the attacking process and could serve as one of the keys to unlocking the potential of energy-oriented attacks. Please visit https://github.com/JianhongPan/GradAuto for code.
Article
Neural networks contain considerable redundant computation, which drags down inference efficiency and hinders deployment on resource-limited devices. In this paper, we study the sparsity in convolutional neural networks and propose a generic sparse mask mechanism to improve the inference efficiency of networks. Specifically, sparse masks are learned in both data and channel dimensions to dynamically localize and skip redundant computation at a fine-grained level. Based on our sparse mask mechanism, we develop SMPointSeg, SMSR, and SMStereo for point cloud semantic segmentation, single image super-resolution, and stereo matching tasks, respectively. It is demonstrated that our sparse masks are compatible with different model components and network architectures, accurately localizing redundant computation, with computational cost being significantly reduced for practical speedup. Extensive experiments show that our SMPointSeg, SMSR, and SMStereo achieve state-of-the-art performance on benchmark datasets in terms of both accuracy and efficiency.
Article
U-Net, as an encoder-decoder architecture with forward skip connections, has achieved promising results in various medical image analysis tasks. Many recent approaches have also extended U-Net with more complex building blocks, which typically increase the number of network parameters considerably. Such complexity makes the inference stage highly inefficient for clinical applications. Towards an effective yet economical segmentation network design, in this work, we propose backward skip connections that bring decoded features back to the encoder. Our design can be jointly adopted with forward skip connections in any encoder-decoder architecture, forming a recurrent structure without introducing extra parameters. With the backward skip connections, we propose a U-Net based network family, namely Bi-directional O-shape networks, which set new benchmarks on multiple public medical imaging segmentation datasets. On the other hand, with the plainest architecture (BiO-Net), network computation inevitably increases with the preset number of recurrences. We have thus studied the deficiency bottleneck of such a recurrent design and propose a novel two-phase Neural Architecture Search (NAS) algorithm, namely BiX-NAS, to search for the best multi-scale bi-directional skip connections. The ineffective skip connections are then discarded to reduce computational costs and speed up network inference. The finally searched BiX-Net yields the least network complexity and outperforms other state-of-the-art counterparts by large margins. We evaluate our methods on both 2D and 3D segmentation tasks in a total of six datasets. Extensive ablation studies have also been conducted to provide a comprehensive analysis of our proposed methods.
Article
Full-text available
Do convolutional networks really need a fixed feed-forward structure? What if, after identifying the high-level concept of an image, a network could move directly to a layer that can distinguish fine-grained differences? Currently, a network would first need to execute sometimes hundreds of intermediate layers that specialize in unrelated aspects. Ideally, the more a network already knows about an image, the better it should be at deciding which layer to compute next. In this work, we propose convolutional networks with adaptive inference graphs (ConvNet-AIG) that adaptively define their network topology conditioned on the input image. Following a high-level structure similar to residual networks (ResNets), ConvNet-AIG decides for each input image on the fly which layers are needed. In experiments on ImageNet we show that ConvNet-AIG learns distinct inference graphs for different categories. Both ConvNet-AIG with 50 and 101 layers outperform their ResNet counterparts, while using 20% and 38% fewer computations respectively. By grouping parameters into layers for related classes and only executing relevant layers, ConvNet-AIG improves both efficiency and overall classification quality. Lastly, we also study the effect of adaptive inference graphs on the susceptibility towards adversarial examples. We observe that ConvNet-AIG shows a higher robustness than ResNets, complementing other known defense mechanisms.
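A simplified gated residual layer in this spirit, with the gating network reduced to a global-average-pool plus linear layer and a straight-through Gumbel-softmax decision (details that may differ from ConvNet-AIG's actual design), could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidual(nn.Module):
    """Residual layer with a per-image execute/skip gate (sketch)."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
        self.gate = nn.Linear(channels, 2)   # logits: [skip, execute]

    def forward(self, x):
        ctx = x.mean(dim=(2, 3))                     # global average pool
        # Differentiable hard decision via straight-through Gumbel-softmax.
        decision = F.gumbel_softmax(self.gate(ctx), tau=1.0, hard=True)
        execute = decision[:, 1].view(-1, 1, 1, 1)   # 0 or 1 per image
        return x + execute * self.body(x)
```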
Article
Full-text available
Do convolutional networks really need a fixed feed-forward structure? Often, a neural network is already confident after a few layers about the high-level concept shown in the image. However, due to the fixed network structure, all remaining layers still need to be evaluated. What if the network could jump right to a layer that is specialized in fine-grained differences of the image's content? In this work, we propose Adanets, a family of convolutional networks with adaptive computation graphs. Following a high-level structure similar to residual networks (Resnets), the key difference is that for each layer a gating function determines whether to execute the layer or move on to the next one. In experiments on CIFAR-10 and ImageNet we demonstrate that Adanets efficiently allocate computational budget among layers and learn distinct layers specializing in similar categories. Adanet 50 achieves a top 5 error rate of 7.94% on ImageNet using 30% fewer computations than Resnet 34, which only achieves 8.58%. Lastly, we study the effect of adaptive computation graphs on the susceptibility towards adversarial examples. We observe that Adanets show a higher robustness towards adversarial attacks, complementing other defenses such as JPEG compression.
Article
Full-text available
Residual networks (Resnets) have become a prominent architecture in deep learning. However, a comprehensive understanding of Resnets is still a topic of ongoing research. A recent view argues that Resnets perform iterative refinement of features. We attempt to further expose properties of this aspect. To this end, we study Resnets both analytically and empirically. We formalize the notion of iterative refinement in Resnets by showing that residual architectures naturally encourage features to move along the negative gradient of loss during the feedforward phase. In addition, our empirical analysis suggests that Resnets are able to perform both representation learning and iterative refinement. In general, a Resnet block tends to concentrate representation learning behavior in the first few layers while higher layers perform iterative refinement of features. Finally, we observe that sharing residual layers naively leads to representation explosion and hurts generalization performance, and we show that simple existing strategies can help alleviate this problem.
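The iterative-refinement reading can be made concrete with a first-order argument (an informal sketch, not the paper's exact formalization): if training drives each residual toward the negative feature-space gradient of the loss, a Taylor expansion shows the loss decreases block by block:

```latex
x_{l+1} = x_l + F_l(x_l), \qquad
F_l(x_l) \approx -\eta \, \nabla_{x_l} L(x_l)
\;\Longrightarrow\;
L(x_{l+1}) \approx L(x_l) - \eta \, \lVert \nabla_{x_l} L(x_l) \rVert^2 \le L(x_l).
```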
Article
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the state-of-the-art on all five of them (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100 and 1.59% on SVHN).
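The connectivity pattern reduces to a few lines: each layer consumes the concatenation of all earlier feature maps and contributes a fixed number of new channels (the growth rate). A minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block: every layer sees the concatenation of all
    preceding feature maps and emits `growth` new channels."""
    def __init__(self, in_channels=16, growth=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth), nn.ReLU(),
                nn.Conv2d(in_channels + i * growth, growth, 3, padding=1)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all earlier maps
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 16, 32, 32)
print(DenseBlock()(x).shape)   # torch.Size([1, 64, 32, 32]), i.e. 16 + 4*12
```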
Article
Full-text available
We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build lightweight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right-sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes and large-scale geo-localization.
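The core building block factorizes a standard convolution into a per-channel (depthwise) 3x3 convolution followed by a 1x1 pointwise convolution, cutting multiply-adds to roughly 1/k^2 of a dense kxk layer for wide networks. A minimal sketch:

```python
import torch.nn as nn

def depthwise_separable(cin, cout, stride=1):
    """Depthwise separable convolution: a per-channel 3x3 depthwise conv
    (groups=cin) followed by a 1x1 pointwise conv that mixes channels."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin),
        nn.BatchNorm2d(cin), nn.ReLU(),
        nn.Conv2d(cin, cout, 1),            # channel mixing
        nn.BatchNorm2d(cout), nn.ReLU())

block = depthwise_separable(32, 64, stride=2)
```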
Article
Full-text available
Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.
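The estimator itself is a few lines: add Gumbel noise to the logits and take a temperature-controlled softmax (PyTorch also ships this as F.gumbel_softmax). A minimal sketch:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    """Differentiable 'soft one-hot' sample from a categorical
    distribution parameterized by `logits`; annealing tau toward 0
    sharpens the sample toward a true one-hot vector."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

logits = torch.tensor([1.0, 2.0, 0.5])
print(gumbel_softmax_sample(logits, tau=0.5))  # near one-hot as tau -> 0
```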
Conference Paper
Full-text available
Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91% on CIFAR-10).
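A sketch of the training rule: with probability 1 - survival a residual block is bypassed by the identity for the current mini-batch, and at test time its output is scaled by the survival rate (the per-block linear survival schedule is omitted here):

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly dropped during training."""
    def __init__(self, body: nn.Module, survival: float = 0.8):
        super().__init__()
        self.body, self.survival = body, survival

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival:
                return x + self.body(x)      # keep the block this mini-batch
            return x                          # bypass with the identity
        return x + self.survival * self.body(x)  # expectation at test time

block = StochasticDepthBlock(nn.Conv2d(8, 8, 3, padding=1), survival=0.8)
```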
Conference Paper
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at https://github.com/liuzhuang13/DenseNet.
Article
Full-text available
Very deep convolutional networks with hundreds or more layers have lead to significant reductions in error on competitive benchmarks like the ImageNet or COCO tasks. Although the unmatched expressiveness of the many deep layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes and the training time can be painfully slow even on modern computers. In this paper we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and obtain deep networks. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. The resulting networks are short (in expectation) during training and deep during testing. Training Residual Networks with stochastic depth is compellingly simple to implement, yet effective. We show that this approach successfully addresses the training difficulties of deep networks and complements the recent success of Residual and Highway Networks. It reduces training time substantially and improves the test errors on almost all data sets significantly (CIFAR-10, CIFAR-100, SVHN). Intriguingly, with stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91%) on CIFAR-10.
Article
Full-text available
Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here: https://github.com/DeepScale/SqueezeNet
Article
Full-text available
We discuss relations between Residual Networks (ResNet), Recurrent Neural Networks (RNNs) and the primate visual cortex. We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers. A direct implementation of such a RNN, although having orders of magnitude fewer parameters, leads to a performance similar to the corresponding ResNet. We propose 1) a generalization of both RNN and ResNet architectures and 2) the conjecture that a class of moderately deep RNNs is a biologically-plausible model of the ventral stream in visual cortex. We demonstrate the effectiveness of the architectures by testing them on the CIFAR-10 dataset.
Conference Paper
Full-text available
Published as a conference paper at ICLR 2016. Trained models available at http://dx.doi.org/10.5281/zenodo.53189
Article
Full-text available
We propose an image super-resolution (SR) method using a deeply-recursive convolutional network (DRCN). Our network has a very deep recursive layer (up to 16 recursions). Increasing recursion depth can improve performance without introducing new parameters for additional convolutions. Despite these advantages, learning a DRCN is very hard with a standard gradient descent method due to exploding/vanishing gradients. To ease the difficulty of training, we propose two extensions: recursive supervision and skip connections. Our method outperforms previous methods by a large margin.
Article
Full-text available
A major challenge in biometrics is performing the test at the client side, where hardware resources are often limited. Deep learning approaches pose a unique challenge: while such architectures dominate the field of face recognition with regard to accuracy, they require elaborate, multi-stage computations. Recently, there has been some work on compressing networks for the purpose of reducing run time and network size. However, it is not clear that these compression methods would work on deep face nets, which are, generally speaking, less redundant than object recognition networks, i.e., they are already relatively lean. We propose two novel methods for compression: one based on eliminating channels with low activity and the other on coupling pruning with repeated use of already computed elements. Pruning of entire channels is an appealing idea, since it leads to direct savings in run time in almost every reasonable architecture.
Article
Full-text available
Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher-order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state-of-the-art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs.
Article
Full-text available
We investigate multiple techniques to improve upon the current state-of-the-art deep convolutional neural network based image classification pipeline. The techniques include adding more image transformations to the training data, adding more transformations to generate additional predictions at test time, and using complementary models applied to higher-resolution images. This paper summarizes our entry in the ImageNet Large Scale Visual Recognition Challenge 2013. Our system achieved a top-5 classification error rate of 13.55% using no external data, which is over a 20% relative improvement on the previous year's winner.
Article
Full-text available
Stochastic neurons can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic neurons, i.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and present two novel families of solutions, applicable in different settings. In particular, it is demonstrated that a simple biologically plausible formula gives rise to an unbiased (but noisy) estimator of the gradient with respect to a binary stochastic neuron's firing probability. Unlike other estimators which view the noise as a small perturbation in order to estimate gradients by finite differences, this estimator is unbiased even without assuming that the stochastic perturbation is small. This estimator is also interesting because it can be applied in very general settings which do not allow gradient back-propagation, including the estimation of the gradient with respect to future rewards, as required in reinforcement learning setups. We also propose an approach to approximating this unbiased but high-variance estimator by learning to predict it using a biased estimator. The second approach we propose assumes that an estimator of the gradient can be back-propagated; it provides an unbiased estimator of the gradient but only works with non-linearities that, unlike the hard threshold but like the rectifier, are not flat over their entire range. This is similar to traditional sigmoidal units but has the advantage that for many inputs, a hard decision (e.g., a 0 output) can be produced, which is convenient for conditional computation and achieving sparse representations and sparse gradients.
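A related, commonly used heuristic from this line of work is the straight-through estimator: sample a hard 0/1 value forward and pass gradients through the sampling step unchanged (this is the biased pass-through variant, not the unbiased estimator the abstract derives). A minimal sketch:

```python
import torch

class BernoulliST(torch.autograd.Function):
    """Straight-through estimator for a binary stochastic neuron."""
    @staticmethod
    def forward(ctx, probs):
        return torch.bernoulli(probs)      # hard 0/1 sample

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                 # treat sampling as identity

z = torch.randn(4, requires_grad=True)
probs = torch.sigmoid(z)
BernoulliST.apply(probs).sum().backward()
print(z.grad)   # nonzero despite the non-differentiable sampling step
```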
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Full-text available
This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the Integral Image which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a cascade which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
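The integral image is a cumulative-sum table that turns any rectangular sum into four lookups, which is what makes the Haar-like features cheap to evaluate. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    """Cumulative-sum table: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] with exactly four table lookups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```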
Article
Increasing depth and complexity in convolutional neural networks has enabled significant progress in visual perception tasks. However, incremental improvements in accuracy are often accompanied by exponentially deeper models that push the computational limits of modern hardware. These incremental improvements in accuracy imply that only a small fraction of the inputs require the additional model complexity. As a consequence, for any given image it is possible to bypass multiple stages of computation to reduce the cost of forward inference without affecting accuracy. We exploit this simple observation by learning to dynamically route computation through a convolutional network. We introduce dynamically routed networks (SkipNets) by adding gating layers that route images through existing convolutional networks and formulate the routing problem in the context of sequential decision making. We propose a hybrid learning algorithm which combines supervised learning and reinforcement learning to address the challenges of inherently non-differentiable routing decisions. We show SkipNet reduces computation by 30–90% while preserving the accuracy of the original model on four benchmark datasets. We compare SkipNet with SACT and ACT to show SkipNet achieves better accuracy with lower computation.
Article
Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which further makes training easy and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10/100, and a 200-layer ResNet on ImageNet.
Conference Paper
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce a three-stage pipeline, pruning, quantization, and Huffman encoding, that works together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman encoding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory, which has 180x less access energy.
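The first two stages can be sketched on a flat weight array: threshold small-magnitude weights to zero, then cluster the survivors so that only 2^bits shared centroid values (plus indices) need storing; Huffman coding of the indices is omitted here:

```python
import numpy as np

def prune_and_quantize(weights, sparsity=0.9, bits=5, iters=10):
    """Magnitude pruning followed by k-means weight sharing (sketch)."""
    w = weights.copy()
    threshold = np.quantile(np.abs(w), sparsity)
    w[np.abs(w) <= threshold] = 0.0                 # prune small weights
    nz = w[w != 0]
    centroids = np.linspace(nz.min(), nz.max(), 2 ** bits)  # linear init
    for _ in range(iters):                          # plain k-means
        assign = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
        for k in range(len(centroids)):
            if (assign == k).any():
                centroids[k] = nz[assign == k].mean()
    w[w != 0] = centroids[assign]                   # shared centroid values
    return w

compressed = prune_and_quantize(np.random.randn(10_000))
```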
Article
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32x memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58x faster convolutional operations and 32x memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is only 2.9% less than the full-precision AlexNet (in top-1 measure). We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy.
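The Binary-Weight-Network approximation is compact enough to state directly: replace each filter W by alpha * sign(W), with the scale alpha equal to the filter's mean absolute value. A minimal sketch:

```python
import torch

def binarize_filter(w):
    """Binary-Weight-Network approximation: W ~ alpha * sign(W), with
    alpha the mean absolute value per output channel."""
    alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)
    return alpha * w.sign()

w = torch.randn(64, 3, 3, 3)            # conv filters: out, in, kH, kW
w_bin = binarize_filter(w)
print((w - w_bin).pow(2).mean())        # small reconstruction error
```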
Conference Paper
The reparameterization trick enables the optimization of large scale stochastic computation graphs via gradient descent. The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution. After refactoring, the gradients of the loss propagated by the chain rule through the graph are low variance unbiased estimators of the gradients of the expected loss. While many continuous random variables have such reparameterizations, discrete random variables lack continuous reparameterizations due to the discontinuous nature of discrete states. In this work we introduce concrete random variables -- continuous relaxations of discrete random variables. The concrete distribution is a new family of distributions with closed form densities and a simple reparameterization. Whenever a discrete stochastic node of a computation graph can be refactored into a one-hot bit representation that is treated continuously, concrete stochastic nodes can be used with automatic differentiation to produce low-variance biased gradients of objectives (including objectives that depend on the log-likelihood of latent stochastic nodes) on the corresponding discrete graph. We demonstrate their effectiveness on density estimation and structured prediction tasks using neural networks.
Conference Paper
Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.
Conference Paper
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
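The distillation objective commonly used with this approach softens both teacher and student logits with a temperature T and mixes the resulting KL term (scaled by T^2 so gradient magnitudes stay comparable) with the ordinary hard-label loss. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target KL term plus hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```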
Article
The reparameterization trick enables the optimization of large scale stochastic computation graphs via gradient descent. The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution. After refactoring, the gradients of the loss propagated by the chain rule through the graph are low variance unbiased estimators of the gradients of the expected loss. While many continuous random variables have such reparameterizations, discrete random variables lack continuous reparameterizations due to the discontinuous nature of discrete states. In this work we introduce concrete random variables -- continuous relaxations of discrete random variables. The concrete distribution is a new family of distributions with closed form densities and a simple reparameterization. Whenever a discrete stochastic node of a computation graph can be refactored into a one-hot bit representation that is treated continuously, concrete stochastic nodes can be used with automatic differentiation to produce low-variance biased gradients of objectives (including objectives that depend on the log-likelihood of latent stochastic nodes) on the corresponding discrete graph. We demonstrate their effectiveness on density estimation and structured prediction tasks using neural networks.
Conference Paper
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values, resulting in 32× memory savings. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of the number of high-precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.
Chapter
As discussed in the previous chapter, an important benefit of recurrent neural networks is their ability to use contextual information when mapping between input and output sequences. Unfortunately, for standard RNN architectures, the range of context that can be accessed in practice is quite limited. The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network's recurrent connections. This effect is often referred to in the literature as the vanishing gradient problem (Hochreiter, 1991; Hochreiter et al., 2001a; Bengio et al., 1994). The vanishing gradient problem is illustrated schematically in Figure 4.1.
Article
This paper introduces Adaptive Computation Time (ACT), an algorithm that allows recurrent neural networks to learn how many computational steps to take between receiving an input and emitting an output. ACT requires minimal changes to the network architecture, is deterministic and differentiable, and does not add any noise to the parameter gradients. Experimental results are provided for four synthetic problems: determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers. Overall, performance is dramatically improved by the use of ACT, which successfully adapts the number of computational steps to the requirements of the problem. We also present character-level language modelling results on the Hutter prize Wikipedia dataset. In this case ACT does not yield large gains in performance; however it does provide intriguing insight into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences. This suggests that ACT or other adaptive computation methods could provide a generic method for inferring segment boundaries in sequence data.
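A much-simplified version of the mechanism wraps an RNN cell in a pondering loop: keep stepping while the accumulated halting probability stays below 1 - eps, then return a halting-weighted combination of the intermediate states (the paper's exact per-step remainder bookkeeping is omitted):

```python
import torch
import torch.nn as nn

class ACTCell(nn.Module):
    """Simplified Adaptive-Computation-Time wrapper around a GRU cell."""
    def __init__(self, size=32, max_steps=10, eps=0.01):
        super().__init__()
        self.cell = nn.GRUCell(size, size)
        self.halt = nn.Linear(size, 1)
        self.max_steps, self.eps = max_steps, eps

    def forward(self, x, h):
        total, outputs, weights = 0.0, [], []
        for _ in range(self.max_steps):
            h = self.cell(x, h)
            p = torch.sigmoid(self.halt(h)).mean()   # scalar halting score
            outputs.append(h)
            weights.append(p)
            total += p.item()
            if total >= 1 - self.eps:                # pondering ends
                break
        w = torch.stack(weights)
        return (w[:, None, None] * torch.stack(outputs)).sum(0) / w.sum()

cell = ACTCell()
out = cell(torch.randn(4, 32), torch.zeros(4, 32))
```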
Article
We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used. This allows effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch using the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
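The residual reformulation fits in a dozen lines: the stacked layers learn F(x) and the block outputs F(x) + x through an identity shortcut. A minimal sketch of the basic two-layer block:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the layers model only the residual F(x),
    and the block returns F(x) + x via an identity shortcut."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # identity shortcut
```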
Article
We consider the task of building compact deep learning pipelines suitable for deployment on storage and power constrained mobile devices. We propose a unified framework to learn a broad family of structured parameter matrices that are characterized by the notion of low displacement rank. Our structured transforms admit fast function and gradient evaluation, and span a rich range of parameter sharing configurations whose statistical modeling capacity can be explicitly tuned along a continuum from structured to unstructured. Experimental results show that these transforms can significantly accelerate inference and forward/backward passes during training, and offer superior accuracy-compactness-speed tradeoffs in comparison to a number of existing techniques. In keyword spotting applications in mobile speech recognition, our methods are much more effective than standard linear low-rank bottleneck layers and nearly retain the performance of state of the art models, while providing more than 3.5-fold compression.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
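The training-time forward pass is short: standardize each channel with mini-batch statistics, then apply the learned scale and shift (the running averages used at inference are omitted in this sketch):

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch norm forward over a (N, C, H, W) batch: per-channel
    standardization with mini-batch statistics, then scale/shift."""
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)     # zero mean, unit variance
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(8, 16, 4, 4)
y = batch_norm_train(x, torch.ones(16), torch.zeros(16))
print(y.mean().item(), y.var(unbiased=False).item())  # ~0.0 and ~1.0
```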
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
Convolutional neural network models have recently been shown to achieve excellent performance on challenging recognition benchmarks. However, like many deep models, there is little guidance on how the architecture of the model should be selected. Important hyper-parameters such as the degree of parameter sharing, number of layers, units per layer, and overall number of parameters must be selected manually through trial and error. To address this, we introduce a novel type of recursive neural network that is convolutional in nature. Its similarity to standard convolutional models allows us to tease apart the important architectural factors that influence performance. We find that for a given parameter budget, deeper models are preferred over shallow ones, and models with more parameters are preferred to those with fewer. Surprisingly and perhaps counterintuitively, we find that performance is independent of the number of units, so long as the network depth and number of parameters are held constant. This suggests that, computational efficiency considerations aside, parameter sharing within deep networks may not be as beneficial as previously supposed.
Article
In this paper, we show how new training principles and optimization techniques for neural networks can be used for different network structures. In particular, we revisit the Recurrent Neural Network (RNN), which explicitly models the Markovian dynamics of a set of observations through a non-linear function with a much larger hidden state space than traditional sequence models such as an HMM. We apply pretraining principles used for Deep Neural Networks (DNNs) and second-order optimization techniques to train an RNN. Moreover, we explore its application in the Aurora2 speech recognition task under mismatched noise conditions using a Tandem approach. We observe top performance on clean speech, and under high noise conditions, compared to multi-layer perceptrons (MLPs) and DNNs, with the added benefit of being a "deeper" model than an MLP but more compact than a DNN.
Conference Paper
We describe a general method for building cascade classifiers from part-based deformable models such as pictorial structures. We focus primarily on the case of star-structured models and show how a simple algorithm based on partial hypothesis pruning can speed up object detection by more than one order of magnitude without sacrificing detection accuracy. In our algorithm, partial hypotheses are pruned with a sequence of thresholds. In analogy to probably approximately correct (PAC) learning, we introduce the notion of probably approximately admissible (PAA) thresholds. Such thresholds provide theoretical guarantees on the performance of the cascade method and can be computed from a small sample of positive examples. Finally, we outline a cascade detection algorithm for a general class of models defined by a grammar formalism. This class includes not only tree-structured pictorial structures but also richer models that can represent each part recursively as a mixture of other parts.