# Deep Residual Learning for Image Recognition

## Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won 1st place in the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
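
The residual reformulation described in the abstract can be illustrated with a minimal sketch: instead of learning a mapping directly, a block learns a residual function F(x) and outputs F(x) + x. The code below is an assumption-laden toy, not the authors' implementation: it uses plain Python lists, two bias-free square weight matrices with a ReLU between them, and hypothetical helper names (`relu`, `linear`, `residual_block`). Details such as convolutions, normalization, and any activation after the addition are omitted.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def linear(weights: Matrix, v: Vector) -> Vector:
    """Multiply a weight matrix by a vector (no bias, for brevity)."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Return F(x) + x, where F is two weight layers with a ReLU in between.

    The weight matrices must be square so that F(x) has the same
    length as x and the identity shortcut can be added element-wise.
    """
    f = linear(w2, relu(linear(w1, x)))        # residual function F(x)
    return [fi + xi for fi, xi in zip(f, x)]   # identity shortcut


if __name__ == "__main__":
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    # With all-zero weights, F(x) = 0 and the block passes x through unchanged.
    print(residual_block([1.0, -2.0], zeros, zeros))
```

With all weights at zero the block reduces to the identity mapping, which gives one intuition for why stacking many such blocks need not make optimization harder: a block only has to learn a deviation from passing its input through.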

## Citing works

... Three different state-of-the-art Keras pre-trained convolutional neural network architectures were considered for this work (Team, n.d.-d): VGG19 (Simonyan & Zisserman, 2015), ResNet50 (He et al., 2015), and EfficientNet (Tan & Le, 2020). Nevertheless, all these CNNs have important advantages and some disadvantages. ...
... ResNet-50 for Image classification tasks (He et al., 2015). ResNet-50 is a convolutional neural network that was also trained on the ImageNet database (Deng et al., 2009), which contains more than 1E6 images, including plants. ...
... ResNet-50 building block (He et al., 2015) ...
Article
Maize is the second most plentiful cereal grown for human consumption. It constitutes 36% of total grain production worldwide and is cultivated in about 160 countries on nearly 150 m ha. Maize faces fungal diseases causing an extraordinary reduction in grain yield. Fungi are responsible for many maize foliar diseases. Fungicides have hazardous effects on human health and also pollute soil and water. Near-infrared (NIR) images can disclose damage patterns not visible to the naked eye or depicted in RGB images. Unmanned aerial vehicles (UAV) are an inexpensive way to collect low altitude images. State-of-the-art Convolutional Neural Networks (CNN) have shown excellent results in image classification in computer vision. This study presents a novel Transfer Learning (TL) based CNN technique and states the hypothesis that NIR images acquired by UAVs contribute to a more precise classification of pathogens in maize. GPS coordinates of the infested areas are also provided for precision spraying with fungicide agents for specific targets, representing an economical means of yield protection with the least possible hazard to people and to the ecosystem. The proposed model was evaluated using different metrics, achieving an accuracy of 86.7%, precision of 98%, sensitivity of 86.9%, and F1 score of 92%. According to the state-of-the-art literature consulted, this is the first time that a validated deep learning-based approach has been applied to fungal disease classification using infrared images.
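
As a quick consistency check on the metrics reported in the maize abstract, the F1 score is the harmonic mean of precision and sensitivity (recall). The sketch below assumes the standard definition of F1 and uses a hypothetical helper name (`f1_score`). The abstract does not say how the metrics were averaged across classes, so this only confirms that the three reported numbers are mutually consistent.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values reported in the abstract: precision 98%, sensitivity 86.9%.
    f1 = f1_score(0.98, 0.869)
    print(round(f1, 2))  # 0.92, matching the reported F1 score of 92%
```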
... It learns the inner features of data and then implicitly expresses a complex nonlinear mapping relationship that we need by using a well-trained model [1], [30], [41], [42]. As one of the most representative techniques of deep learning, some classical frameworks of the convolutional neural network (CNN) have been proposed, such as AlexNet [43], VGGNet [44], residual neural network (Res-Net) [45], and feedforward denoising CNN (DnCNN) [46]. These proposed frameworks have shown excellent performance in the processing of natural images [45], [46]. ...
... As one of the most representative techniques of deep learning, some classical frameworks of the convolutional neural network (CNN) have been proposed, such as AlexNet [43], VGGNet [44], residual neural network (Res-Net) [45], and feedforward denoising CNN (DnCNN) [46]. These proposed frameworks have shown excellent performance in the processing of natural images [45], [46]. Also, CNNs have gradually been applied to some fields of seismic data processing, such as noise suppression [1], [47], [48], FWI [49], [50], fault detection [51], seismic lithology prediction [52], and interpolation [53]. ...
... In this article, we propose a new architecture of DnCNN, as shown in Fig. 1, called MSSA-Net. Unlike existing conventional CNNs, such as DnCNNs [1], [46], Res-Net [45], and the recurrent encoder-decoder deep neural network (RED-Net) [59], we further enhance the denoising performance of CNN from two aspects: multiscale strategy and attention mechanism. ...
Article
Seismic background noise often damages the desired signals, thereby resulting in some artifacts in the seismic imaging that follows. Since about 2016, some supervised-deep-learning methods have shown impressive performance in seismic data denoising, but they usually only consider single-scale features and neglect the multiscale strategy. To further reinforce their denoising performance, a novel multiscale convolutional neural network (CNN) combined with a spatial attention mechanism, called multiscale spatial attention denoising network (MSSA-Net), is proposed to tell weak reflected signals apart from strong seismic background noise. Unlike conventional single-scale CNNs, the proposed MSSA-Net can extract multiscale features, which is beneficial for the suppression of strong noise and the recovery of weak reflected signals. Specifically, MSSA-Net contains a principal denoising network and two auxiliary networks. The former utilizes a widened convolution composed of multiple parallel convolution layers with different kernel sizes to capture the informative multiscale features; the latter two leverage upsampling and downsampling to extract local fine and global coarse features, respectively. Furthermore, a spatial attention block is adopted to fuse these multiscale features, thereby distinguishing weak reflected signals from strong seismic background noise. Multiple experiments on synthetic and real seismic records demonstrate the effectiveness of MSSA-Net. In addition, compared with two classical single-scale CNNs, MSSA-Net performs better in signal recovery, indicating the positive effect of the multiscale strategy. Index Terms: Convolutional neural network (CNN), multi-scale strategy, seismic noise suppression, spatial attention, weak reflection.
... Classification Covid-ResNet [9,15] Covid-Transformer [7] -U-CSRNet [20] MTCSN [18] Segmentation 2D-UNet [16] LOD-Net [5] -Han-Net [14] CD-Net [13] 3D-UNet [29] UNETR [12] - Secondly, we conduct bench-marking comparisons of the SOTA algorithms with accuracy and performances. ...
... ResNet [15] was proposed in 2015 and became one of the most famous convolutional neural networks (CNNs) in deep learning. It uses a residual learning framework to ease the training of deeper networks than in previous work and shows promising performance. ...
Preprint
In this paper, we present OpenMedIA, an open-source toolbox library containing a rich set of deep learning methods for medical image analysis under heterogeneous Artificial Intelligence (AI) computing platforms. Various medical image analysis methods, including 2D/3D medical image classification, segmentation, localisation, and detection, have been included in the toolbox with PyTorch and/or MindSpore implementations under heterogeneous NVIDIA and Huawei Ascend computing systems. To our best knowledge, OpenMedIA is the first open-source algorithm library providing compared PyTorch and MindSp
... However, these three tasks are naturally contradictory to each other. For a CNN-embedded model, an input image of larger resolution requires more memory, parameters, and layers [1][2][3][4] to train. To relieve the memory consumption, traditional methods adopt preprocessing including image resizing, or cutting patches to fit the network capacity and computing time. ...
... There is a branch of works [6,8,9,15] recombining low-level features with high-level features through skip links or connections. Some others apply residual blocks [4,13,14] to alleviate the vanishing gradient problem through either channel-wise or element-wise addition between layers. Enlarged kernel: another process is driven by how to enlarge the receptive field given a fixed number of layers in the network. ...
Conference Paper
Semantic segmentation results in pixel-wise perception accompanied with GPU computation and expensive memory, which makes trained models hard to apply to small devices in testing. Assuming the availability of hardware in training CNN backbones, this work converts them to a linear architecture enabling the inference on edge devices. Keeping the same accuracy as patch-mode testing, we segment images using a scanning line with the minimum memory. Exploring periods of pyramid network shifting on image, we perform such sequential semantic segmentation (SE3) with a circular memory to avoid redundant computation and preserve the same receptive field as patches for spatial dependency. In the experiments on large drone images and panoramas, we examine this approach in terms of accuracy, parameter memory, and testing speed. Benchmark evaluations demonstrate that, with only one-line computation in linear time, our designed SE3 network consumes a small fraction of memory to maintain an equivalent accuracy as the image segmentation in patches. Considering semantic segmentation for high-resolution images, particularly for data streamed from sensors, this method is significant to the real-time applications of CNN based networks on light-weighted edge devices.
... Pipeline of the Model. As shown in Figure 2, after extracting features from video frames and input text with Resnet101 [13] and RoBERTa [34], there are mainly two steps involved in the end-to-end modeling of HERO. The algorithm flowchart is shown in Alg. 1. ...
Preprint
Video Object Grounding (VOG) is the problem of associating spatial object regions in the video to a descriptive natural language query. This is a challenging vision-language task that necessitates constructing the correct cross-modal correspondence and modeling the appropriate spatio-temporal context of the query video and caption, thereby localizing the specific objects accurately. In this paper, we tackle this task by a novel framework called HiErarchical spatio-tempoRal reasOning (HERO) with contrastive action correspondence. We study the VOG task at two aspects that prior works overlooked: (1) Contrastive Action Correspondence-aware Retrieval. Notice that the fine-grained video semantics (e.g., multiple actions) is not totally aligned with the annotated language query (e.g., single action), we first introduce the weakly-supervised contrastive learning that classifies the video as action-consistent and action-independent frames relying on the video-caption action semantic correspondence. Such a design can build the fine-grained cross-modal correspondence for more accurate subsequent VOG. (2) Hierarchical Spatio-temporal Modeling Improvement. While transformer-based VOG models present their potential in sequential modality (i.e., video and caption) modeling, existing evidence also indicates that the transformer suffers from the issue of the insensitive spatio-temporal locality. Motivated by that, we carefully design the hierarchical reasoning layers to decouple fully connected multi-head attention and remove the redundant interfering correlations. Furthermore, our proposed pyramid and shifted alignment mechanisms are effective to improve the cross-modal information utilization of neighborhood spatial regions and temporal frames. We conducted extensive experiments to show our HERO outperforms existing techniques by achieving significant improvement on two benchmark datasets.
... The use of alternative, yet similar network architectures (e.g. ResNet by He et al. (2016) or UNet by Ronneberger et al. (2015)) was only partially explored and for this reason not reported, preventing a more comprehensive analysis of the available network architectures. The use of skip connections, present in the aforementioned architectures, can potentially improve the network performance further, but this assessment is left for future work. ...
Preprint
Flow-control techniques are extensively studied in fluid mechanics, as a means to reduce energy losses related to friction, both in fully-developed and spatially-developing flows. These techniques typically rely on closed-loop control systems that require an accurate representation of the state of the flow to compute the actuation. Such representation is generally difficult to obtain without perturbing the flow. For this reason, in this work we propose a fully-convolutional neural-network (FCN) model trained on direct-numerical-simulation (DNS) data to predict the instantaneous state of the flow at different wall-normal locations using quantities measured at the wall. Our model can take as input the heat-flux field at the wall from a passive scalar with Prandtl number $Pr = \nu/\alpha = 6$ (where $\nu$ is the kinematic viscosity and $\alpha$ is the thermal diffusivity of the scalar quantity). The heat flux can be accurately measured also in experimental settings, paving the way for the implementation of a non-intrusive sensing of the flow in practical applications.
... In one stage, we have two conv-layers padded to the same size as the stage input followed by the ReLU activation. The last architecture adapted for the purpose of multi-modal brain tumor segmentation was the ResNet50 [14]. The ResNet50 is made up of 5 convolutional blocks on the 5 consecutive sizes of the encoding part. ...
Article
In this paper we propose to create an end-to-end brain tumor segmentation system that applies three variants of the well-known U-Net convolutional neural networks. In our results we obtain and analyse the detection performances of U-Net, VGG16-UNet and ResNet-UNet on the BraTS2020 training dataset. Further, we inspect the behavior of the ensemble model obtained as the weighted response of the three CNN models. We introduce essential preprocessing and post-processing steps so as to improve the detection performances. The original images were corrected and the different intensity ranges were transformed into the 8-bit grayscale domain to uniformize the tissue intensities, while preserving the original histogram shapes. For post-processing we apply region connectedness onto the whole tumor and conversion of background pixels into necrosis inside the whole tumor. As a result, we present the Dice scores of our system obtained for WT (whole tumor), TC (tumor core) and ET (enhanced tumor) on the BraTS2020 training dataset.
... In P2BNet, we use multi-scale (480, 576, 688, 864, 1000, 1200) as the short side to resize the image during training and single-scale (1200) during inference. We choose the classic Faster R-CNN FPN [30,22] (backbone is ResNet-50 [16]) as the detector with the default setting, and single-scale (800) images are used during training and inference. More details are included in the supplementary section. ...
Conference Paper
Object detection using single point supervision has received increasing attention over the years. However, the performance gap between point supervised object detection (PSOD) and bounding box supervised detection remains large. In this paper, we attribute such a large performance gap to the failure of generating high-quality proposal bags which are crucial for multiple instance learning (MIL). To address this problem, we introduce a lightweight alternative to the off-the-shelf proposal (OTSP) method and thereby create the Point-to-Box Network (P2BNet), which can construct an inter-objects balanced proposal bag by generating proposals in an anchor-like way. By fully investigating the accurate position information, P2BNet further constructs an instance-level bag, avoiding the mixture of multiple objects. Finally, a coarse-to-fine policy in a cascade fashion is utilized to improve the IoU between proposals and ground-truth (GT). Benefiting from these strategies, P2BNet is able to produce high-quality instance-level bags for object detection. P2BNet improves the mean average precision (AP) by more than 50% relative to the previous best PSOD method on the MS COCO dataset. It also demonstrates the great potential to bridge the performance gap between point supervised and bounding-box supervised detectors. The code will be released at github.com/ucas-vg/P2BNet.
... To further illustrate the superiority of the proposed health condition evaluation method, it is compared with the representative traditional learning-based classification algorithms and the deep learning models. Specifically, the traditional learning-based classification algorithms are selected as: Support Vector Machine (SVM) [35,36] and K-Nearest Neighbor algorithm (KNN) [37,38], and the deep learning models are selected as: AlexNet [39], ResNet-18 [40], and DarkNet-53 [41]. To be more detailed, Grid Searching (GS) technique is adopted to optimize the parameters (e.g., penalty coefficient and the kernel function parameter) of the SVM model, and the number ...
Article
In railway engineering, monitoring the health condition of rail track structures is crucial to prevent abnormal vibration issues of the wheel–rail system. To address the problem of low efficiency of traditional nondestructive testing methods, this work investigates the feasibility of the computer vision-aided health condition monitoring approach for track structures based on vibration signals. The proposed method eliminates the tedious and complicated data pre-processing including signal mapping and noise reduction, and can achieve robust signal description using numerous redundant features. First, the method converts the raw wheel–rail vibration signals directly into two-dimensional grayscale images, followed by image feature extraction using the FAST-Unoriented-SIFT algorithm. Subsequently, a Visual Bag-of-Words (VBoW) model is established based on the image features, where the optimal parameter selection analysis is implemented based on fourfold cross-validation by considering both recognition accuracy and stability. Finally, the Euclidean distance between word frequency vectors of the testing set and the codebook vectors of the training set is compared to recognize the health condition of track structures. For the three health conditions of track structures analyzed in this paper, the overall recognition rate could reach 96.7%. The results demonstrate that the proposed method achieves higher recognition accuracy and lower bias with strongly time-varying and random vibration signals, and has promising application prospects in early-stage structural defect detection.
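
The final recognition step in the railway abstract, comparing Euclidean distances between a test word-frequency vector and reference vectors from the training set, can be sketched as a nearest-reference classifier. This is a simplified illustration under stated assumptions, not the paper's pipeline: the feature extraction and visual-vocabulary construction are skipped, and the function names, condition labels, and vectors below are all hypothetical.

```python
import math
from typing import Dict, List


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(histogram: List[float], references: Dict[str, List[float]]) -> str:
    """Return the label whose reference vector is closest to the histogram."""
    return min(references, key=lambda label: euclidean(histogram, references[label]))


if __name__ == "__main__":
    # Hypothetical word-frequency vectors for three made-up health conditions.
    references = {
        "condition_a": [0.7, 0.2, 0.1],
        "condition_b": [0.2, 0.6, 0.2],
        "condition_c": [0.1, 0.2, 0.7],
    }
    print(classify([0.65, 0.25, 0.10], references))  # condition_a
```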
... The ResNet50 and ResNet101 model architectures utilize stacking of convolutional layers to learn the residuals of the provided input. ResNet50 is a 50 layer residual network, and ResNet101 is a 101 layer residual network (16). The Inception-v3 model increases the depth and width of a deep convolutional network, while keeping the computational budget constant using a sparsely connected architecture (17). ...
Article
Purpose: Deep learning (DL) is a technique explored within ophthalmology that requires large datasets to distinguish feature representations with high diagnostic performance. There is a need for developing DL approaches to predict therapeutic response, but completed clinical trial datasets are limited in size. Predicting treatment response is more complex than disease diagnosis, where hallmarks of treatment response are subtle. This study seeks to understand the utility of DL for clinical problems in ophthalmology such as predicting treatment response and where large sample sizes for model training are not available. Materials and Methods: Four DL architectures were trained using cross-validated transfer learning to classify ultra-widefield angiograms (UWFA) and fluid-compartmentalized optical coherence tomography (OCT) images from a completed clinical trial (PERMEATE) dataset (n=29) as tolerating or requiring extended interval Anti-VEGF dosing. UWFA images (n=217) from the Anti-VEGF study were divided into five increasingly larger subsets to evaluate the influence of dataset size on performance. Class activation maps (CAMs) were generated to identify regions of model attention. Results: The best performing DL model had a mean AUC of 0.507 ± 0.042 on UWFA images, and highest observed AUC of 0.503 for fluid-compartmentalized OCT images. DL had a best performing AUC of 0.634 when dataset size was incrementally increased. Resulting CAMs show inconsistent regions of interest. Conclusions: This study demonstrated the limitations of DL for predicting therapeutic response when large datasets were not available for model training. Our findings suggest the need for hand-crafted approaches for complex and data-scarce prediction problems in ophthalmology.
... Following the state-of-the-art retrieval methods, our framework is built upon the ResNet101 model [45]. SSL training part: Given a benchmark retrieval dataset, a general-purpose unsupervised object proposal generator is used to search for object regions from each image. ...
Preprint
Quality feature representation is key to instance image retrieval. To attain it, existing methods usually resort to a deep model pre-trained on benchmark datasets or even fine-tune the model with a task-dependent labelled auxiliary dataset. Although achieving promising results, this approach is restricted by two issues: 1) the domain gap between benchmark datasets and the dataset of a given retrieval task; 2) the required auxiliary dataset cannot be readily obtained. In light of this situation, this work looks into a different approach which has not been well investigated for instance image retrieval previously: can we learn feature representation specific to a given retrieval task in order to achieve excellent retrieval? Our finding is encouraging. By adding an object proposal generator to generate image regions for self-supervised learning, the investigated approach can successfully learn feature representation specific to a given dataset for retrieval. This representation can be made even more effective by boosting it with image similarity information mined from the dataset. As experimentally validated, such a simple "self-supervised learning + self-boosting" approach can well compete with the relevant state-of-the-art retrieval methods. Ablation study is conducted to show the appealing properties of this approach and its limitation on generalisation across datasets.
... While deep neural networks (DNNs) have achieved impressive performance on numerous vision tasks [1,2,3], recent studies [4,5] have revealed their vulnerability against adversarial examples, which are crafted by adding a maliciously designed perturbation to the image. Such adversarial attacks are categorized as either white-box or black-box depending on the knowledge of the model owned by the attacker, and recent works have focused on more challenging black-box attacks. ...
Preprint
Adversarial attacks with improved transferability - the ability of an adversarial example crafted on a known model to also fool unknown models - have recently received much attention due to their practicality. Nevertheless, existing transferable attacks craft perturbations in a deterministic manner and often fail to fully explore the loss surface, thus falling into a poor local optimum and suffering from low transferability. To solve this problem, we propose Attentive-Diversity Attack (ADA), which disrupts diverse salient features in a stochastic manner to improve transferability. Primarily, we perturb the image attention to disrupt universal features shared by different models. Then, to effectively avoid poor local optima, we disrupt these features in a stochastic manner and explore the search space of transferable perturbations more exhaustively. More specifically, we use a generator to produce adversarial perturbations that each disturbs features in different ways depending on an input latent code. Extensive experimental evaluations demonstrate the effectiveness of our method, outperforming the transferability of state-of-the-art methods. Codes are available at https://github.com/wkim97/ADA.
... The Standard 3D-UNet was expanded to include residual connections for each convolution block in a layer, with motivation from the "ResNet" architecture (17). Residual connections have the benefit of allowing a "flow" of loss from previous convolutions. ...
Article
Objectives: Colorectal cancer (CRC), the third most common cancer in the USA, is a leading cause of cancer-related death worldwide. Up to 60% of patients develop liver metastasis (CRLM). Treatments like radiation and ablation therapies require disease segmentation for planning and therapy delivery. For ablation, ablation-zone segmentation is required to evaluate disease coverage. We hypothesize that fully convolutional (FC) neural networks, trained using novel methods, will provide rapid and accurate identification and segmentation of CRLM and ablation zones. Methods: Four FC model styles were investigated: Standard 3D-UNet, Residual 3D-UNet, Dense 3D-UNet, and Hybrid-WNet. Models were trained on 92 patients from the liver tumor segmentation (LiTS) challenge. For the evaluation, we acquired 15 patients from the 3D-IRCADb database, 18 patients from our institution (CRLM = 24, ablation-zone = 19), and those submitted to the LiTS challenge (n = 70). Qualitative evaluations of our institutional data were performed by two board-certified radiologists (interventional and diagnostic) and a radiology-trained physician fellow, using a Likert scale of 1–5. Results: The most accurate model was the Hybrid-WNet. On a patient-by-patient basis in the 3D-IRCADb dataset, the median (min–max) Dice similarity coefficient (DSC) was 0.73 (0.41–0.88), the median surface distance was 1.75 mm (0.57–7.63 mm), and the number of false positives was 1 (0–4). In the LiTS challenge (n = 70), the global DSC was 0.810. The model sensitivity was 98% (47/48) for sites ≥15 mm in diameter. Qualitatively, 100% (24/24; minority vote) of the CRLM and 84% (16/19; majority vote) of the ablation zones had Likert scores ≥4. Conclusion: The Hybrid-WNet model provided fast (<30 s) and accurate segmentations of CRLM and ablation zones on contrast-enhanced CT scans, with positive physician reviews.
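
The Dice similarity coefficient (DSC) reported in the segmentation abstract measures overlap between a predicted mask and a reference mask: twice the size of the intersection divided by the sum of the two mask sizes. The sketch below assumes flat binary masks stored as Python lists and a hypothetical function name (`dice`). Returning 1.0 when both masks are empty is a common convention assumed here, not something the abstract specifies.

```python
from typing import Sequence


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient for two flat binary masks (values 0 or 1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    # One overlapping voxel, mask sizes 2 and 1, so DSC = 2 * 1 / 3.
    print(round(dice(predicted, reference), 3))  # 0.667
```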
... Base Model Architecture: For the learning tasks under CIFAR-10 and CIFAR-100, we use DenseNet-40 with about 0.19 million parameters [25] as the base model for FL training. For the IMDB learning task, we use a Transformer based classification model consisting of 5 encoder layers followed by a linear layer with 17 million parameters [26] for FL training. In addition, we load pretrained GloVe word embeddings [27] before training. ...
Preprint
Large-scale neural networks possess considerable expressive power. They are well-suited for complex learning tasks in industrial applications. However, large-scale models pose significant challenges for training under the current Federated Learning (FL) paradigm. Existing approaches for efficient FL training often leverage model parameter dropout. However, manipulating individual model parameters is not only inefficient in meaningfully reducing the communication overhead when training large-scale FL models, but may also be detrimental to the scaling efforts and model performance as shown by recent research. To address these issues, we propose the Federated Opportunistic Block Dropout (FedOBD) approach. The key novelty is that it decomposes large-scale models into semantic blocks so that FL participants can opportunistically upload quantized blocks, which are deemed to be significant towards training the model, to the FL server for aggregation. Extensive experiments evaluating FedOBD against five state-of-the-art approaches based on multiple real-world datasets show that it reduces the overall communication overhead by more than 70% compared to the best performing baseline approach, while achieving the highest test accuracy. To the best of our knowledge, FedOBD is the first approach to perform dropout on FL models at the block level rather than at the individual parameter level.
... Deep neural networks (DNNs) have received ubiquitous adoption in recent years across many data-driven application domains such as computer vision [20,38,65], natural language processing [21,57], personalized recommendation [32,39], and speech recognition [33]. To support such growth effectively, DNN models are predominantly trained in clusters of highly parallel and increasingly more powerful GPUs [15,70]. ...
Preprint
Training deep neural networks (DNNs) is becoming more and more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose an optimization framework, Zeus, to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3% to 75.8% for diverse workloads.
... The temporal convolution window L is set to 9 and the maximum graph sampling length D is set to 2. The batch size of training set and test set is 16. We train our model for 50 epochs and use a warmup strategy [34] at the first 5 epochs. We choose Cross-Entropy as the loss function, adopt an SGD optimizer with the Nesterov momentum of 0.9 and the weight decay of 0.0002. ...
Preprint
Graph Convolutional Network (GCN) outperforms previous methods in the skeleton-based human action recognition area, including human-human interaction recognition task. However, when dealing with interaction sequences, current GCN-based methods simply split the two-person skeleton into two discrete sequences and perform graph convolution separately in the manner of single-person action classification. Such operation ignores rich interactive information and hinders effective spatial relationship modeling for semantic pattern learning. To overcome the above shortcoming, we introduce a novel unified two-person graph representing spatial interaction correlations between joints. Also, a properly designed graph labeling strategy is proposed to let our GCN model learn discriminant spatial-temporal interactive features. Experiments show accuracy improvements in both interactions and individual actions when utilizing the proposed two-person graph topology. Finally, we propose a Two-person Graph Convolutional Network (2P-GCN). The proposed 2P-GCN achieves state-of-the-art results on four benchmarks of three interaction datasets, SBU, NTU-RGB+D, and NTU-RGB+D 120.
Article
Heavy-weight impact sounds caused by footsteps are a major factor affecting acoustic comfort in concrete residential buildings. An impact monitoring system that predicts sound from vibration could help change the behavior of the occupant causing excessive sound, and the stored data could be used by mediators in disputes to identify the source household and assess the disturbance. This study presents a method for predicting the actual impact sound, especially footsteps, in the rooms of buildings. A convolutional neural network (CNN) was used as the prediction model, with signals from vibration sensors placed in the floors and walls of the room as input data. We experimentally collected a dataset and compared the model's performance according to the location of the vibration sensors and the resolution of the short-time Fourier transform (STFT) feature, which represents footstep-induced vibrations. The highest accuracy was achieved when the vibration signals of both the wall and floor slab were used simultaneously in the CNN model, with an STFT frequency resolution of 10 Hz and a window frame offset of 50 ms. The equivalent continuous A-weighted sound pressure level for 2 s was predicted with a mean absolute error of 0.99 dB and a coefficient of determination of 0.95. Predictions of the sound pressure level in the 63 and 500 Hz frequency bands achieved mean absolute errors of 1.63–2.22 dB.
Article
Convolutional neural networks can achieve remarkable performance in semantic segmentation tasks. However, such approaches rely heavily on costly pixel-level annotation. Semi-supervised learning is a promising solution to this issue, but its performance still falls far behind its fully supervised counterpart. This work proposes a cross-teacher training framework with three modules that significantly improves traditional semi-supervised learning approaches. The core is a cross-teacher module, which simultaneously reduces the coupling among peer networks and the error accumulation between teacher and student networks. In addition, we propose two complementary contrastive learning modules. The high-level module transfers high-quality knowledge from labeled data to unlabeled data and promotes separation between classes in feature space. The low-level module encourages low-quality features to learn from the high-quality features among peer networks. In experiments, the cross-teacher module significantly improves the performance of traditional student-teacher approaches, and our framework outperforms state-of-the-art methods on benchmark datasets.
Article
Processing structured data has become an interesting topic in recent years. The development of graph-based semi-supervised learning models has attracted much attention from machine learning researchers. In this paper, we present a novel approach for graph-based semi-supervised learning. We provide an effective method for simultaneous label recovery and linear transformation estimation. The targeted linear transformation is to obtain a discriminant subspace. The most important factor in this work to improve the semi-supervised learning is to exploit the data structure and soft labels of the available unlabeled samples. In the iterative optimization scheme used, the prior estimation of the labels increases the monitoring information in an indirect way through an introduced label-graph, avoiding the use of confidence-based hard decisions as used in self-supervised methods. It also enforces label smoothing and projected data smoothing through the use of hybrid graphs. For each smoothing type, the hybrid graph is an adaptive fusion of the two graphs encoding the similarity of the data information and the similarity of the label information. The proposed method leads to an improved discriminant linear transformation. Several experimental results on real image datasets confirm the effectiveness of the proposed method. This work also shows superior performance compared to semi-supervised methods that use simultaneous embedding and inference of labels.
Article
Essential tremor (ET) is one of the most common movement disorders in adults, and its early assessment and diagnosis are crucial for disease management. Currently, the severity of tremors can only be diagnosed and evaluated by laboratory tests. However, traditional assessment by the naked eye of a neurologist involves subjective factors, which often lead to bias. This study proposes a novel automated quantitative assessment system for tremor severity based on multi-modal signals. Specifically, we develop a two-stage framework that performs posture pattern recognition on the raw data, then extracts kinematic parameters to build an individualized model for each task. In addition, we established a strict clinical paradigm, including 121 ET patients finely evaluated by a committee of neurologists, to build a high-quality database. The models' performance showed that most of the kinematic parameters designed in this study could effectively map the severity of the tremor. The F1 score for classification of the posture task based on deep learning networks was 99.02%, and the quantification of symptom scores based on machine learning models ranged from 94.77% to 99.00%. These results demonstrate that the proposed framework can automatically provide objective and accurate scores for ET symptom assessment.
Article
Because large-scale laser 3D point clouds contain massive and complex data, their automatic, intelligent processing and classification face great challenges. To address the problem that 3D point clouds in complex scenes are self-occluded or occluded, which can reduce object classification accuracy, we propose a multidimension feature optimal combination classification method for large-scale laser point clouds, named MFOC-CliqueNet and based on CliqueNet. The optimal combination matrix of multidimension features is constructed by extracting the three-dimensional features and multidirectional two-dimensional features of the 3D point cloud. This is the first time that multidimensional optimal combination features have been introduced into the cyclic convolutional network CliqueNet, which is important for large-scale 3D point cloud classification. The experimental results show that the MFOC-CliqueNet framework reaches state-of-the-art performance with fewer parameters. Experiments on the Large-Scale Scene Point Cloud Oakland dataset show that the classification accuracy of our method is 98.9%, which is better than that of the other classification algorithms mentioned in this paper.
Article
To ensure the operational reliability of power systems, it is important for wind speed signal forecasting systems of wind turbines to be efficient, accurate and stable. This paper proposes a two-phase deep learning structure with network augmentation and pruning. By introducing the cross-correlation and quasi-convex optimization, a fractional quadratic programming problem and related convex optimization models are constructed to generate the augmented data for this proposed internal network; by pruning weakly correlated convolution channels, the redundant features of its external network are reduced. Furthermore, the closed-form solution of the convex optimization model is derived, which reduces the computational complexity considerably from O(n*log(2N)) to O(n). The proposed approach has been extensively validated using the real data of the wind farm in China. The results of the numerical experiments demonstrate that the proposed method achieves the superior performance in the training flexibility, model accuracy, stability, and interpretability.
Article
Nonvolatile memory (NVM)-based convolutional neural networks (NvCNNs) have received widespread attention as a promising solution for hardware edge intelligence. However, there still exist many challenges in the resource-constrained conditions, such as the limitations of the hardware precision and cost and, especially, the large overhead of the analog-to-digital converters (ADCs). In this study, we systematically analyze the performance of NvCNNs and the hardware restrictions with quantization in both weight and activation and propose the corresponding requirements of NVM devices and peripheral circuits for multiply–accumulate (MAC) units. In addition, we put forward an in situ sparsity-aware processing method that exploits the sparsity of the network and the device array characteristics to further improve the energy efficiency of quantized NvCNNs. Our results suggest that the 4-bit-weight and 3-bit-activation (W4A3) design demonstrates the optimal compromise between the network performance and hardware overhead, achieving 98.82% accuracy for the Modified National Institute of Standards and Technology database (MNIST) classification task. Moreover, higher-precision designs will claim more restrictive requirements for hardware nonidealities including the variations of NVM devices and the nonlinearities of the converters. Moreover, the sparsity-aware processing method can obtain 79%/53% ADC energy reduction and 2.98×/1.15× energy efficiency improvement based on the W8A8/W4A3 quantization design with an array size of 128 × 128.
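To make the W4A3 notation above concrete, the following is a minimal sketch of uniform quantization to a fixed number of bits. The abstract does not describe the quantizer actually used, so the uniform scheme, the clipping ranges, and the function name `quantize_uniform` are assumptions for illustration.

```python
def quantize_uniform(x: float, n_bits: int, x_min: float, x_max: float) -> float:
    """Clip x to [x_min, x_max] and snap it to one of 2**n_bits evenly spaced levels."""
    n_steps = 2 ** n_bits - 1
    step = (x_max - x_min) / n_steps
    clipped = min(max(x, x_min), x_max)
    return x_min + round((clipped - x_min) / step) * step


# 4-bit weights on an assumed range [-1, 1] and 3-bit activations on an assumed range [0, 1].
weight_levels = sorted(
    {round(quantize_uniform(i / 500 - 1.0, 4, -1.0, 1.0), 9) for i in range(1001)}
)
activation_levels = sorted(
    {round(quantize_uniform(i / 1000, 3, 0.0, 1.0), 9) for i in range(1001)}
)
```

Under these assumptions a 4-bit weight can take 16 distinct values and a 3-bit activation 8, which is why lower-precision designs reduce the resolution required of the converters.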
Article
Underwater images are seriously degraded by the absorption and scattering of light. Existing sharpening methods cannot effectively solve all underwater image degradation problems, so a solution specific to the degradation problem is needed. To this end, this paper proposes the Multi-Color Convolutional and Attentional Stacking Network (MCCA-Net) for underwater image classification. First, an underwater image is converted to the HSV and Lab color spaces and fused to obtain a refined image. Then, the attention mechanism module is used to refine the extracted image features. Finally, the vertically stacked convolution module fully utilizes feature information at different levels, which fuses convolution with the attention mechanism, optimizes feature extraction, reduces parameters, and improves the classification performance of the MCCA-Net model. Extensive experiments on underwater degraded image classification show that our MCCA-Net model and method outperform other models and improve the accuracy of underwater degraded image classification. Our image fusion method can achieve 96.39% accuracy on other models, and the MCCA-Net model achieves 97.38% classification accuracy.
Article
Street view images (SVIs) have great potential for automatic land use classification. Previous studies have paid little attention to the spatial context of SVIs and land parcels, leaving room for improvement in classification accuracy and identification of parcels without SVIs. This study proposes a novel spatial context-aware method for land use classification that synthesizes SVI content and spatial context among SVIs and land parcels through a derived spatial context graph convolution network (SC-GCN). Specifically, the method characterizes the spatial context among SVIs and land parcels into a graph, which formalizes SVIs and land parcels as nodes. The spatial relationships among SVIs and land parcels are represented as graph edges. SC-GCN is designed to model the spatial context of relevant SVIs and land parcels by incorporating heterogeneous structural information into land use classification. Experimental results show that the proposed method outperforms the baseline methods of land use classification at the parcel level and can successfully identify land use types of land parcels without SVIs. Specifically, precision, recall and F1-score values of the proposed method are 72.22%, 64.22% and 68.13%, respectively, which are 2.38%, 12.40% and 13.56% higher than those of the Random Forest method. This work contributes to land use mapping with limited available data by exploring the modeling of complex geospatial relationships, and it serves as a methodological reference for the prediction and supplementation of missing geographic data.
Article
Deep learning methods for various disease prediction tasks have become very effective and can even surpass human experts. However, their lack of interpretability and medical expertise limits their clinical application. This paper combines knowledge representation learning and deep learning to construct a disease prediction model. The model first constructs a relationship graph between physical indicators and test values based on the normal ranges of human physical examination indices. The test values of the physical examination indices are then encoded by a knowledge representation learning model. Next, the patient's physical examination data are represented as a vector and input into a deep learning model built with a self-attention mechanism and a convolutional neural network to perform disease prediction. The experimental results show that the model, applied to diabetes prediction, yields an accuracy of 97.18% and a recall of 87.55%, outperforming other machine learning methods (e.g., lasso, ridge, support vector machine, random forest, and XGBoost). Compared with the best-performing random forest method, the recall is increased by 5.34%. Therefore, it can be concluded that introducing medical knowledge into deep learning through knowledge representation learning can be used in diabetes prediction for early detection and to assist diagnosis.
Article
Scene recognition plays an important role in many computer vision tasks. However, the recognition performance hardly meets the development of computer vision, since scene images show large variations in spatial position, illumination, and scale. To address this issue, a joint global metric learning and local manifold preservation (JGML-LMP) approach is proposed. First, we formulate a new global metric learning problem based on the cluster centers of each specific class, allowing to capture the global discriminative information with more informative samples. Second, in order to exploit the local manifold structure, we introduce an adaptive nearest neighbors constraint through which the local intrinsic relationships can be preserved in the new metric space instead of the Euclidean space. Third, through performing global metric learning and local manifold preservation jointly within a unified optimization framework, our approach can take advantage of both global and local information, and hence produces more discriminative and robust feature representations for scene recognition. Extensive experiments on four benchmark scene datasets demonstrate the superiority of the proposed method over state-of-the-art methods.
Article
In the last decade, deep neural networks have been widely applied to medical image segmentation, achieving good results in tasks such as computer-aided diagnosis. However, segmenting highly complex, low-contrast images of organs and tissues with high accuracy remains a great challenge. To address this challenge, this paper proposes a novel model, SWTRU (Star-shaped Window Transformer Reinforced U-Net), which combines the U-Net, which performs well in image segmentation, with the Transformer, which has a powerful ability to capture global context. Unlike previous methods that import the Transformer into U-Net, an improved Star-shaped Window Transformer is introduced into the decoder of SWTRU to enhance the decision-making capability of the whole method. SWTRU uses a redesigned multi-scale skip-connection model, which retains the inductive bias of the original FCN structure for images while obtaining fine-grained features and coarse-grained semantic information. Our method also presents the FFIM (Filtering Feature Integration Mechanism) to integrate the fused multi-layered features and reduce their dimensionality, which reduces computation. SWTRU yields 0.972 DICE on CHLISC for liver and tumor segmentation, 0.897 DICE on LGG for glioma segmentation, and 0.904 DICE on ISIC2018 for skin disease segmentation, achieving substantial improvements over the current SoTA across 9 different medical image segmentation methods. Because SWTRU can combine feature maps from different scales, high-level semantics, and global contextual relationships, the architecture is effective for medical image segmentation. The experimental findings indicate that SWTRU produces superior performance on medical image segmentation tasks.
Article
The generation of building footprints from laser scanning point clouds or remote sensing images involves three steps: segmentation, outline extraction, and boundary regularization/generalization. Existing approaches mainly focus on the first and third steps, while only a few studies have addressed the second. However, the result of building outline extraction directly determines the regularization performance, so delivering high-quality building outlines is important for regularization. Determining parameters, such as point distance and neighborhood radius, is the primary challenge in extracting building outlines. In this study, a parameter-free method is proposed that uses an improved generative adversarial network (GAN). It extracts building outlines from gridded binary images with default resolution and no other input parameters, which overcomes the parameter selection problem. The experimental results on segmented point cloud datasets of building roofs reveal that our method achieves a mean intersection over union of 93.52%, a Hausdorff distance of 0.640 m, and a PoLiS of 0.165 m. The comparison with the α-shape method shows that our method can improve the extraction performance for concave shapes and provide a more regularized outline result. The method reduces the difficulty and complexity of the subsequent regularization task and contributes to the accuracy of point cloud-based building footprint generation.
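The intersection over union reported above can be illustrated on gridded binary images. The sketch below is a generic definition of IoU for two binary masks, not the authors' evaluation code; the function name `binary_iou` and the toy grids are hypothetical.

```python
def binary_iou(pred, truth):
    """Intersection over union of two equally sized binary grids (lists of 0/1 rows)."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return intersection / union if union else 1.0


# Toy 2x2 example: one shared cell, three cells covered in total.
predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [1, 0]]
score = binary_iou(predicted, reference)
```

A mean IoU such as the one in the abstract would average this score over all evaluated building roofs.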
Article
In this paper, we propose an Outlined Attention U-network (OAU-net) with bypass branching strategy to solve biomedical image segmentation tasks, which is capable of sensing shallow and deep features. Unlike previous studies, we use residual convolution and res2convolution as encoders. In particular, the outline filter and attention module are embedded in the skip connection part, respectively. Shallow features will enhance the edge information after being processed by the outline filter. Meanwhile, in the depths of the network, to better realize feature fusion, our attention module will simultaneously emphasize the independence between feature map channels (channel attention module) and each position information (spatial attention module), that is, the hybrid domain attention module. Finally, we conducted ablation experiments and comparative experiments according to three public data sets (pulmonary CT lesions, Kaggle 2018 data science bowl, skin lesions), and analyzed them with classical evaluation indexes. Experimental results show that our proposed method improves segmentation accuracy effectively. Our code is public at https://github.com/YF-W/OAU-net.
Article
The International Roughness Index (IRI) is one of the most critical parameters in the field of pavement performance management. Traditional methods for the measurement of IRI rely on expensive instrumented vehicles and well-trained professionals. The equipment and labor costs of traditional measurement methods limit the timely updates of IRI on the pavements. In this article, a novel imaging-based Deep Neural Network (DNN) model, which can use pavement photos to directly identify the IRI values, is proposed. This model proved that it is possible to use 2-dimensional (2D) images to identify the IRI other than the typically used vertical accelerations or 3-dimensional (3D) images. Due to the fast growth in photography equipment, small and convenient sports action cameras such as the GoPro Hero series are able to capture smooth videos at a high framerate with built-in electronic image stabilization systems. These significant improvements make it not only more convenient to collect high-quality 2D images, but also easier to process them than vibrations or accelerations. In the proposed method, 15% of the imaging data were randomly selected for testing and had never been touched during the training steps. The testing results showed an averaged coefficient of determination (R square) of 0.6728 and an averaged root mean square error (RMSE) of 0.50.
Article
For NP-hard combinatorial optimization problems, it is usually challenging to find high-quality solutions in polynomial time. Designing either an exact or an approximate algorithm for these problems often requires highly specialized knowledge. Recently, deep learning methods have provided new directions for solving such problems. In this paper, an end-to-end deep reinforcement learning framework is proposed to solve this type of combinatorial optimization problem. The framework can be applied to different problems with only slight changes to the input, masks, and decoder context vectors. It aims to improve on the models in the literature in terms of both the neural network model and the training algorithm. The solution quality for the TSP and the CVRP with up to 100 nodes is significantly improved by our framework. Compared with the best results of state-of-the-art methods, the average optimality gap is reduced from 4.53% to 3.67% for TSP with 100 nodes and from 7.34% to 6.68% for CVRP with 100 nodes when using the greedy decoding strategy. In addition, the proposed framework can be used to solve a multi-depot CVRP case without any structural modification. Furthermore, our framework uses about 1/3∼3/4 of the training samples required by other existing learning methods while achieving better results. The results on randomly generated instances and on benchmark instances from TSPLIB and CVRPLIB confirm that our framework has a running time linear in the problem size (number of nodes) during the training and testing phases and generalizes well from training on random instances to testing on real-world instances.
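The roles of the mask, greedy decoding, and the optimality gap can be sketched for a small TSP instance. This is an illustration only: the learned policy is replaced here by a hypothetical stand-in score (negative Euclidean distance, i.e. nearest neighbor), and the names `greedy_decode`, `tour_length`, and `optimality_gap` are not from the paper.

```python
import math


def greedy_decode(coords, score):
    """Build a tour from node 0, always picking the best-scoring unvisited node."""
    n = len(coords)
    visited = [False] * n
    visited[0] = True
    tour = [0]
    while len(tour) < n:
        current = tour[-1]
        # Mask: visited nodes are excluded from the candidate set.
        candidates = [j for j in range(n) if not visited[j]]
        nxt = max(candidates, key=lambda j: score(coords[current], coords[j]))
        visited[nxt] = True
        tour.append(nxt)
    return tour


def tour_length(coords, tour):
    """Length of the closed tour, including the edge back to the start."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def optimality_gap(length, reference_length):
    """Relative excess over a reference (e.g. optimal) tour length, in percent."""
    return 100.0 * (length - reference_length) / reference_length


# Unit square; the stand-in score prefers closer nodes.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
tour = greedy_decode(square, score=lambda a, b: -math.dist(a, b))
gap = optimality_gap(tour_length(square, tour), reference_length=4.0)
```

In a learned framework the `score` function would come from the decoder's output for the current context, and changing the problem (for example to CVRP) would mainly change how the mask is computed.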
Article
The morphology of the cervical cell nucleus is the most important consideration for pathological cell identification. Precise segmentation of the cervical cell nucleus determines the performance of the final classification for most traditional algorithms and even some deep learning-based algorithms. Many deep learning-based methods can accurately segment cervical cell nuclei but are time-consuming, especially when dealing with a whole-slide image (WSI) containing tens of thousands of cells. To address this challenge, we propose a dual-supervised sampling network structure, in which a supervised down-sampling module uses compressed images instead of original images for cell nucleus segmentation, and a boundary detection network is introduced to supervise the up-sampling process of the decoding layer for accurate segmentation. This strategy dramatically reduces the convolution computation in image feature extraction while preserving segmentation accuracy. Experimental results on various cervical cell datasets demonstrate that, compared with UNet, the inference speed of the proposed network is increased by 5 times without losing segmentation accuracy. The code and datasets are available at https://github.com/ldrunning/DSSNet
Article
The use of automatic systems for medical image classification has revolutionized the diagnosis of a large number of diseases. These alternatives, which are usually based on artificial intelligence (AI), provide a helpful tool for clinicians, eliminating the inter- and intra-observer variability that the diagnostic process entails. Convolutional Neural Networks (CNNs) have proved to be an excellent option for this purpose, demonstrating strong performance in a wide range of contexts. However, it is also extremely important to quantify the reliability of the model's predictions in order to guarantee confidence in the classification. In this work, we propose a multi-level ensemble classification system based on a Bayesian Deep Learning approach in order to maximize performance while providing the uncertainty of each classification decision. This tool combines the information extracted from different architectures by weighting their results according to the uncertainty of their predictions. Performance is evaluated in two real scenarios. In the first, the aim is to differentiate between pulmonary pathologies: controls vs bacterial pneumonia vs viral pneumonia. A two-level decision tree is employed to divide the 3-class classification into two binary classifications, yielding an accuracy of 98.19%. In the second, performance is assessed for the diagnosis of Parkinson's disease, leading to an accuracy of 95.31%. The limited preprocessing needed to obtain this high performance, together with the information provided about the reliability of the predictions, demonstrates the applicability of the system as an aid for clinicians.
Article
Graph Convolutional Networks (GCNs) have recently emerged as a topic of strong interest for collaborative filtering. Existing work that applies GCNs to recommendation does not analyze all facets of the GCN, which was originally introduced for graph classification tasks. It is observed that two facets of GCNs, namely feature transformation and non-linear activation, have little influence on the effectiveness of collaborative filtering (CF). Furthermore, including these two facets increases the complexity of training and even decreases recommendation performance. In this paper, a novel approach named Improved Graph Convolutional Network (ImprovedGCN) is proposed, which uses only the important part of the GCN, neighborhood aggregation, for CF. The model can be implemented and trained, and it yields significant improvements over a similar approach, Neural Graph Collaborative Filtering (NGCF).
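Neighborhood aggregation without feature transformation or non-linear activation can be sketched as a single propagation step over a user-item graph. The abstract does not state the normalization used, so the symmetric degree normalization below, the function name `aggregate`, and the toy graph are assumptions for illustration.

```python
import math


def aggregate(embeddings, neighbors):
    """One propagation step: each node becomes a normalized sum of its neighbors.

    No weight matrix and no activation function are applied.
    """
    updated = {}
    for node, nbrs in neighbors.items():
        dim = len(embeddings[node])
        acc = [0.0] * dim
        for nb in nbrs:
            # Symmetric normalization by the degrees of both endpoints (assumed).
            norm = 1.0 / math.sqrt(len(neighbors[node]) * len(neighbors[nb]))
            for k in range(dim):
                acc[k] += norm * embeddings[nb][k]
        updated[node] = acc
    return updated


# Toy bipartite graph: one user who interacted with two items.
neighbors = {"u1": ["i1", "i2"], "i1": ["u1"], "i2": ["u1"]}
embeddings = {"u1": [1.0, 0.0], "i1": [0.0, 2.0], "i2": [2.0, 0.0]}
propagated = aggregate(embeddings, neighbors)
```

Because the step has no trainable parameters of its own, only the initial embeddings would be learned, which is consistent with the reduced training complexity the abstract describes.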
Article
We present CrossInfoMobileNet, a hand pose estimation convolutional neural network based on CrossInfoNet, specifically tuned to mobile phone processors through the optimization, modification, and replacement of computationally critical CrossInfoNet components. By introducing a state-of-the-art MobileNetV3 network as a feature extractor and refiner, and replacing ReLU activation with the better-performing H-Swish activation function, we have achieved a network that requires 2.37 times fewer multiply-add operations and 2.22 times fewer parameters than the CrossInfoNet network, while maintaining the same error on state-of-the-art datasets. This reduction in multiply-add operations resulted in an average 1.56 times faster real-world performance on both desktop and mobile devices, making the network more suitable for embedded applications. The full source code of CrossInfoMobileNet, including the sample dataset and its evaluation, is available online through Code Ocean.
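For reference, the H-Swish activation mentioned above is commonly defined as x · ReLU6(x + 3) / 6. The sketch below is a scalar version of that standard definition, not code from CrossInfoMobileNet; the function names are chosen here for illustration.

```python
def relu6(x: float) -> float:
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """Hard swish: a piecewise approximation of x * sigmoid(x)."""
    return x * relu6(x + 3.0) / 6.0


samples = [h_swish(v) for v in (-4.0, -3.0, 0.0, 1.0, 3.0, 5.0)]
```

The function is zero for inputs at or below -3 and equals the identity for inputs at or above 3, and it avoids the exponential needed by the sigmoid, which is the usual argument for using it on mobile processors.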
Article
Recent studies on one-class classification have achieved a remarkable performance by employing the self-supervised classifier that predicts the type of pre-defined geometric transformations applied on in-class images. However, they cannot correctly identify in-class images as in-class at all when the input images have various viewpoints (e.g., translated or rotated images), because their classification-based in-class scores assume that in-class images always have a fixed viewpoint. Pointing out that humans can easily recognize such images having various viewpoints as the same class, in this work, we aim to propose a one-class classifier robust to geometrically-transformed inputs, named as GROC. To this end, we remark that in-class images match better with the in-class transformations than out-of-class images do. We introduce a conformity score indicating how strongly an input image agrees with one of the predefined in-class transformations, then utilize the conformity score with our proposed agreement measures for one-class classification. Our extensive experiments demonstrate that GROC is able to accurately distinguish in-class images from out-of-class images regardless of whether the inputs are geometrically-transformed or not, whereas the existing methods fail.
Article
Background: Image-based cancer classifiers suffer from a variety of problems that negatively affect their performance. For example, variation in image brightness or the use of different cameras can be enough to diminish performance. Ensemble solutions, where multiple model predictions are combined into one, can mitigate these problems. However, ensembles are computationally intensive and less transparent to practitioners than single-model solutions. Constructing model soups, by averaging the weights of multiple models into a single model, could circumvent these limitations while still improving performance. Objective: To investigate the performance of model soups for a dermoscopic melanoma-nevus skin cancer classification task with respect to (1) generalisation to images from other clinics, (2) robustness against small image changes and (3) calibration such that the confidences correspond closely to the actual predictive uncertainties. Methods: We construct model soups by fine-tuning pre-trained models on seven different image resolutions and subsequently averaging their weights. Performance is evaluated on a multi-source dataset including holdout and external components. Results: We find that model soups improve generalisation and calibration on the external component while maintaining performance on the holdout component. For robustness, we observe performance improvements for perturbed test images, while the performance on corrupted test images remains on par. Conclusions: Overall, souping for skin cancer classifiers has a positive effect on generalisation, robustness and calibration. It is easy for practitioners to implement and, by combining multiple models into a single model, complexity is reduced. This could be an important factor in achieving clinical applicability, as less complexity generally means more transparency.
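Averaging the weights of several models into a single model can be sketched as follows. This is a minimal illustration assuming a uniform average over models that share the same parameter names and shapes; the function name `uniform_soup`, the flat-list parameter format, and the toy values are hypothetical.

```python
def uniform_soup(state_dicts):
    """Average several models' parameters element-wise into one parameter set.

    Each model is a dict mapping a parameter name to a flat list of floats.
    """
    n_models = len(state_dicts)
    soup = {}
    for name, reference in state_dicts[0].items():
        soup[name] = [
            sum(sd[name][i] for sd in state_dicts) / n_models
            for i in range(len(reference))
        ]
    return soup


# Two toy "fine-tuned" models with identical structure.
model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 4.0], "b": [1.0]}
soup = uniform_soup([model_a, model_b])
```

Unlike an ensemble, the result is a single set of weights, so inference costs the same as for one model.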
Article
Efficient image‐recognition algorithms to classify the pixels accurately are required for the computer‐vision‐based inspection of concrete defects. This study proposes a deep learning‐based model called sparse‐sensing and superpixel‐based segmentation (SSSeg) for accurate and efficient crack segmentation. The model employed a sparse‐sensing‐based encoder and a superpixel‐based decoder and was compared with six state‐of‐the‐art models. An input pipeline of 1231 diverse crack images was specially designed to train and evaluate the models. The results indicated that the SSSeg achieved a good balance between the recognition correctness and completeness and outperformed other models in both accuracy and efficiency. The SSSeg also exhibited good resistance to the interference of surface roughness, dirty stains, and moisture. The increased depth and receptive field of sparse‐sensing units guaranteed the representability; meanwhile, structured sparse characteristics protected the network from overfitting. The lightweight superpixel‐based decoder omitted skip connections, which greatly reduced the computation and memory footprint and enlarged the input size in the inference.
Article
Increasing voltage levels and realizing power-line communications are important parts of a smart grid. As a result, there is an urgent need for intelligent, digital, and multi-functional electronic sensors that can simultaneously perform high-voltage monitoring and carrier-signal demodulation in a power transmission system. Inspired by the operation mode of light-emitting diodes (LEDs) driven by triboelectric nanogenerators (TENGs), we propose an electrode-LED-electrode structure, namely LED-in-capacitors (LIC), for high-voltage monitoring and high-frequency signal demodulation. We demonstrate that the proposed LIC can sensitively extract the high-voltage amplitude and detect harmonic pollution on a power line because the LIC is highly sensitive to the rate of change of the electric potential. We build a one-dimensional convolutional neural network that successfully identifies the harmonic pollution with an accuracy as high as 94.53%. Additionally, by using the LIC, we are able to accurately demodulate the high-frequency carrier signals transferred in the high-voltage line, showing that the LIC has promise for applications in power-line communications. We believe the LIC, as a novel type of electronic device derived from TENG-related technology, can provide impetus for the development of next-generation high-voltage technology.
Article
Human-centered applications using wearable sensors in combination with machine learning have received a great deal of attention in the last couple of years. At the same time, wearable sensors have also evolved and are now able to accurately measure physiological signals and are, therefore, suitable for detecting body reactions to stress. The field of machine learning, or more precisely, deep learning, has been able to produce outstanding results. However, in order to produce these good results, large amounts of labeled data are needed, which, in the context of physiological data related to stress detection, are a great challenge to collect, as they usually require costly experiments or expert knowledge. This usually results in an imbalanced and small dataset, which makes it difficult to train a deep learning algorithm. In recent studies, this problem is tackled with data augmentation via a Generative Adversarial Network (GAN). Conditional GANs (cGAN) are particularly suitable for this as they provide the opportunity to feed auxiliary information such as a class label into the training process to generate labeled data. However, it has been found that during the training process of GANs, different problems usually occur, such as mode collapse or vanishing gradients. To tackle the problems mentioned above, we propose a Long Short-Term Memory (LSTM) network, combined with a Fully Convolutional Network (FCN) cGAN architecture, with an additional diversity term to generate synthetic physiological data, which are used to augment the training dataset to improve the performance of a binary classifier for stress detection. We evaluated the methodology on our collected physiological measurement dataset, and we were able to show that using the method, the performance of an LSTM and an FCN classifier could be improved. Further, we showed that the generated data could not be distinguished from the real data any longer.
Article
Full-text available
While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher network.
Article
Full-text available
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Article
Full-text available
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
Full-text available
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near state-of-the-art results for the detection and classification tasks. Finally, we release a feature extractor from our best model called OverFeat.
Article
Full-text available
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed remains finite: for a special class of initial conditions on the weights, very deep networks incur only a finite delay in learning speed relative to shallow networks. We further show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, thereby providing analytical insight into the success of unsupervised pretraining in deep supervised learning tasks.
Conference Paper
Full-text available
Recently, we proposed to transform the outputs of each hidden neuron in a multi-layer perceptron network to have zero output and zero slope on average, and use separate shortcut connections to model the linear dependencies instead. We continue the work by firstly introducing a third transformation to normalize the scale of the outputs of each hidden neuron, and secondly by analyzing the connections to second order optimization methods. We show that the transformations make a simple stochastic gradient behave closer to second-order optimization methods and thus speed up learning. This is shown both in theory and with experiments. The experiments on the third transformation show that while it further increases the speed of learning, it can also hurt performance by converging to a worse local optimum, where both the inputs and outputs of many hidden neurons are close to zero.
Conference Paper
Full-text available
We transform the outputs of each hidden neuron in a multi-layer perceptron network to be zero mean and zero slope, and use separate shortcut connections to model the linear dependencies instead. This transformation aims at separating the problems of learning the linear and nonlinear parts of the whole input-output mapping, which has many benefits. We study the theoretical properties of the transformations by noting that they make the Fisher information matrix closer to a diagonal matrix, and thus standard gradient closer to the natural gradient. We experimentally confirm the usefulness of the transformations by noting that they make basic stochastic gradient learning competitive with state-of-the-art learning algorithms in speed, and that they seem also to help find solutions that generalize better. The experiments include both classification of handwritten digits with a 3-layer network and learning a low-dimensional representation for images by using a 6-layer auto-encoder network. The transformations were beneficial in all cases, with and without regularization.
Article
Full-text available
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
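The random omission described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `dropout`, its parameters, and the test-time rescaling by the keep probability are assumptions introduced here (the abstract only states that half of the feature detectors are omitted on each training case).

```python
import random


def dropout(activations, drop_prob=0.5, rng=None, training=True):
    """Randomly zero activations during training (hypothetical helper).

    Assumption: at test time every unit is kept and its output is scaled by
    the keep probability, so the expected input to the next layer matches
    what it saw during training.
    """
    if not training:
        keep_prob = 1.0 - drop_prob
        return [a * keep_prob for a in activations]
    rng = rng or random.Random()
    # Each unit is dropped independently with probability drop_prob.
    return [0.0 if rng.random() < drop_prob else a for a in activations]


if __name__ == "__main__":
    hidden = [0.3, 1.2, 0.7, 2.1]
    print(dropout(hidden, rng=random.Random(0)))  # some entries zeroed
    print(dropout(hidden, training=False))        # all entries halved
```

Because a fresh mask is drawn for every training case, no unit can rely on a specific partner unit being present, which is the co-adaptation argument made in the abstract.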
Article
Full-text available
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
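The abstract does not spell out the proposed initialization scheme. A minimal sketch follows, assuming the commonly cited "normalized initialization" associated with this work, which draws weights uniformly from [-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]. The function name `glorot_uniform` and the list-of-lists weight layout are illustrative choices, not taken from the source.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng=None):
    """Return a fan_in x fan_out weight matrix as nested lists.

    Assumed formula: W ~ U[-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)),
    which gives Var(W) = 2 / (fan_in + fan_out).
    """
    rng = rng or random.Random()
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


if __name__ == "__main__":
    W = glorot_uniform(200, 300, random.Random(0))
    flat = [w for row in W for w in row]
    mean = sum(flat) / len(flat)
    var = sum((w - mean) ** 2 for w in flat) / len(flat)
    print("empirical variance:", var, "target:", 2.0 / (200 + 300))
```

The variance target balances the scale of forward activations and backward gradients, which relates to the abstract's concern about layer Jacobians whose singular values drift far from 1.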
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
It has long been known that neural networks can learn faster when their input and hidden unit activities are centered about zero; recently we have extended this approach to also encompass the centering of error signals [15]. Here we generalize this notion to all factors involved in the network's gradient, leading us to propose centering the slope of hidden unit activation functions as well. Slope centering removes the linear component of backpropagated error; this improves credit assignment in networks with shortcut connections. Benchmark results show that this can speed up learning significantly without adversely affecting the trained network's generalization ability.
Article
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.
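The gating idea can be illustrated with a single layer in plain Python. This is a sketch under stated assumptions: it uses the usual highway formulation y = T(x) * H(x) + (1 - T(x)) * x with a sigmoid transform gate T and a tanh transform H, neither of which is given in the abstract, and all names (`highway_layer`, `W_h`, `b_t`, and so on) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, W, b):
    """Affine map: one row of W per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def highway_layer(x, W_h, b_h, W_t, b_t):
    """Blend a nonlinear transform with the unchanged input via a learned gate.

    Assumed form: y = T(x) * H(x) + (1 - T(x)) * x, with input and output
    of the same width so the carry path needs no projection.
    """
    h = [math.tanh(v) for v in dense(x, W_h, b_h)]   # candidate transform H(x)
    t = [sigmoid(v) for v in dense(x, W_t, b_t)]     # transform gate T(x) in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


if __name__ == "__main__":
    x = [0.5, -1.0]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    # A strongly negative gate bias closes the gate, so the input is carried through.
    print(highway_layer(x, identity, [0.0, 0.0], zeros, [-50.0, -50.0]))
```

With the gate closed the layer behaves like an identity map, which is one way to read the "unimpeded information flow" mentioned in the abstract.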
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
We propose an object detection system that relies on a multi-region deep convolutional neural network (CNN) that also encodes semantic segmentation-aware features. The resulting CNN-based representation aims at capturing a diverse set of discriminative appearance factors and exhibits localization sensitivity that is essential for accurate object localization. We exploit the above properties of our recognition module by integrating it into an iterative localization mechanism that alternates between scoring a box proposal and refining its location with a deep CNN regression model. Thanks to the efficient use of our modules, we detect objects with very high localization accuracy. On the detection challenges of PASCAL VOC2007 and PASCAL VOC2012 we achieve mAP of 74.9% and 70.7% respectively, surpassing any other published work by a significant margin.
Article
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Article
Most object detectors contain two important components: a feature extractor and an object classifier. The feature extractor has rapidly evolved with significant research efforts leading to better deep ConvNet architectures. The object classifier, however, has not received much attention and most state-of-the-art systems (like R-CNN) use simple multi-layer perceptrons. This paper demonstrates that carefully designing deep networks for object classification is just as important. We take inspiration from traditional object classifiers, such as DPM, and experiment with deep networks that have part-like filters and reason over latent variables. We discover that on pre-trained convolutional feature maps, even randomly initialized deep classifiers produce excellent results, while the improvement due to fine-tuning is secondary; on HOG features, deep classifiers outperform DPMs and produce the best HOG-only results without external data. We believe these findings provide new insight for developing object detection systems. Our framework, called Networks on Convolutional feature maps (NoC), achieves outstanding results on the PASCAL VOC 2007 (73.3% mAP) and 2012 (68.8% mAP) benchmarks.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
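The per-mini-batch normalization can be sketched for the training-time forward pass as follows. The abstract gives no formula, so this assumes the standard form: each feature is standardized with the mini-batch mean and (biased) variance, then scaled and shifted by learned parameters, here called `gamma` and `beta`. The function name and the small constant `eps` are illustrative; the running statistics used at inference are not shown.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature column of a mini-batch (training-time sketch).

    batch: list of examples, each a list of d feature values.
    gamma, beta: per-feature scale and shift (assumed learned elsewhere).
    """
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(d)]
    return [[gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
             for j in range(d)]
            for row in batch]


if __name__ == "__main__":
    # Two features on very different scales end up with comparable statistics.
    x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    for row in batch_norm(x, gamma=[1.0, 1.0], beta=[0.0, 0.0]):
        print(row)
```

Keeping `gamma` and `beta` means the layer can still represent the identity transform if that is what training prefers, so normalization does not reduce what the network can express.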
Article
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
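Both contributions can be illustrated briefly. The sketch below assumes the usual definitions, which the abstract itself does not state: PReLU as f(y) = max(0, y) + a * min(0, y) with a learnable slope `a` (the default of 0.25 here is an assumed starting value), and a rectifier-aware initialization that draws weights from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in). The names `prelu`, `he_std`, and `he_normal` are hypothetical.

```python
import math
import random


def prelu(y, a=0.25):
    """Parametric ReLU: identity for y > 0, slope a for y <= 0.

    a = 0 recovers the plain ReLU; in a real network a is a learned parameter.
    """
    return max(0.0, y) + a * min(0.0, y)


def he_std(fan_in):
    """Assumed rectifier-aware weight standard deviation: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


def he_normal(fan_in, fan_out, rng=None):
    """Draw a fan_in x fan_out weight matrix from N(0, he_std(fan_in)^2)."""
    rng = rng or random.Random()
    std = he_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


if __name__ == "__main__":
    print([prelu(v) for v in (-2.0, 0.0, 3.0)])
    print(he_std(8))
    print(len(he_normal(8, 4, random.Random(0))))
```

The factor of 2 in the variance compensates for a rectifier zeroing roughly half of its inputs, which is what allows signal magnitudes to stay stable across many layers.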
Article
Though recent advanced convolutional neural networks (CNNs) have been improving the image recognition accuracy, the models are getting more complex and time-consuming. For real-world applications in industrial and commercial scenarios, engineers and developers are often faced with the requirement of a constrained time budget. In this paper, we investigate the accuracy of CNNs under constrained time cost. Under this constraint, the designs of the network architectures should exhibit trade-offs among factors like depth, numbers of filters, filter sizes, etc. With a series of controlled comparisons, we progressively modify a baseline model while preserving its time complexity. This is also helpful for understanding the importance of the factors in network designs. We present an architecture that achieves very competitive accuracy in the ImageNet dataset (11.8% top-5 error, 10-view test), yet is 20% faster than "AlexNet" (16.0% top-5 error, 10-view test).
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
April 8, 2009. Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the images. We show how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels, which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes. Using these labels, we show that object recognition is significantly
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
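The fixed-length property described above can be sketched for a single feature-map channel as follows. This is a minimal illustration, not the authors' implementation: the pyramid levels (1, 2, 4), the use of max pooling within each bin, the bin-boundary rule, and the name `spatial_pyramid_pool` are all assumptions made for the example.

```python
import math


def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into n x n bins for each pyramid level.

    The output length is sum(n * n for n in levels), independent of the
    height and width of `fmap` (given as a non-empty list of equal-length rows).
    """
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i: floor on the start, ceil on the end,
            # so every bin covers at least one row.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(
                    max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1))
                )
    return pooled
```

With the default levels, a 5×7 map and a 13×9 map both yield a vector of 1 + 4 + 16 = 21 values, which is what lets a fixed-size classifier follow feature maps of varying size.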
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Book
Ripley brings together two crucial ideas in pattern recognition: statistical methods and machine learning via neural networks. He brings unifying principles to the fore, and reviews the state of the subject. Ripley also includes many examples to illustrate real problems in pattern recognition and how to overcome them.
Conference Paper
Large Convolutional Neural Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Article
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
Within the field of pattern classification, the Fisher kernel is a powerful framework which combines the strengths of generative and discriminative approaches. The idea is to characterize a signal with a gradient vector derived from a generative probability model and to subsequently feed this representation to a discriminative classifier. We propose to apply this framework to image categorization where the input signals are images and where the underlying generative model is a visual vocabulary: a Gaussian mixture model which approximates the distribution of low-level features in images. We show that Fisher kernels can actually be understood as an extension of the popular bag-of-visterms. Our approach demonstrates excellent performance on two challenging databases: an in-house database of 19 object/scene categories and the recently released VOC 2006 database. It is also very practical: it has low computational needs both at training and test time and vocabularies trained on one set of categories can be applied to another set without any significant loss in performance.
Conference Paper
VLFeat is an open and portable library of computer vision algorithms. It aims at facilitating fast prototyping and reproducible research for computer vision scientists and students. It includes rigorous implementations of common building blocks such as feature detectors, feature extractors, (hierarchical) k-means clustering, randomized kd-tree matching, and super-pixelization. The source code and interfaces are fully documented. The library integrates directly with MATLAB, a popular language for computer vision research.
Conference Paper
Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
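The approximation mentioned in this abstract can be checked numerically. The sketch below assumes the copies have bias offsets of -0.5, -1.5, -2.5, and so on, which the abstract does not spell out, and compares their summed activation with the smooth function log(1 + e^x) and with a plain rectifier. The noise term of the "noisy" rectified unit is omitted, a finite number of copies stands in for the infinite set, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def stepped_sigmoid_sum(x, copies=100):
    """Total activation of `copies` sigmoid units sharing input x,
    with biases shifted by -0.5, -1.5, -2.5, ... (assumed offsets)."""
    return sum(sigmoid(x - i + 0.5) for i in range(1, copies + 1))


def softplus(x):
    """Smooth closed form that the stepped sum approaches: log(1 + e^x)."""
    return math.log1p(math.exp(x))


def relu(x):
    """Rectified linear unit, a cheaper stand-in for softplus."""
    return max(0.0, x)
```

For moderate inputs the stepped sum and `softplus` stay close, and for large positive inputs `softplus` and `relu` nearly coincide, which is the sense in which a rectified unit can replace the whole stack of copies.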
Article
This paper develops locally adapted hierarchical basis functions for effectively preconditioning large optimization problems that arise in computer graphics applications such as tone mapping, gradient-domain blending, colorization, and scattered data interpolation. By looking at the local structure of the coefficient matrix and performing a recursive set of variable eliminations, combined with a simplification of the resulting coarse level problems, we obtain bases better suited for problems with inhomogeneous (spatially varying) data, smoothness, and boundary constraints. Our approach removes the need to heuristically adjust the optimal number of preconditioning levels, significantly outperforms previously proposed approaches, and also maps cleanly onto data-parallel architectures such as modern GPUs.
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
Conference Paper
The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.