Conference Paper

Deep Residual Learning for Image Recognition

... Fine-tuning Pre-Trained Models (PTMs) on downstream tasks has shown remarkable improvements in various fields [28,20,59,34], making "pre-training → fine-tuning" the de-facto paradigm in many real-world applications. A model zoo contains PTMs that are diverse in their architectures and functionalities [1,11], but the helpfulness of a randomly selected PTM for a particular downstream task varies unpredictably [63,55,77]. ...
... We follow [83] and construct a model zoo with 10 PTMs pre-trained on ImageNet [64] across five architecture families, i.e. Inception [70], ResNet [28], DenseNet [31], MobileNet [66], and MNASNet [71]. We evaluate various methods on 9 downstream datasets, i.e. ...
... We follow [83] and construct a model zoo with 10 PTMs pre-trained on ImageNet [64] across 5 families of architectures available from PyTorch. Concretely, they are Inception V1 [70], Inception V3 [70], ResNet 50 [28], ResNet 101 [28], ResNet 152 [28], DenseNet 121 [31], DenseNet 169 [31], DenseNet 201 [31], MobileNet V2 [66], and NASNet-A Mobile [71]. The model zoo spans PTMs of multiple parameter quantities. ...
Preprint
Figuring out which Pre-Trained Model (PTM) from a model zoo fits the target task is essential to take advantage of plentiful model resources. With the availability of numerous heterogeneous PTMs from diverse fields, efficiently selecting the most suitable PTM is challenging due to the time-consuming costs of carrying out forward or backward passes over all PTMs. In this paper, we propose Model Spider, which tokenizes both PTMs and tasks by summarizing their characteristics into vectors to enable efficient PTM selection. By leveraging the approximated performance of PTMs on a separate set of training tasks, Model Spider learns to construct tokens and measure the fitness score between a model-task pair via their tokens. The ability to rank relevant PTMs higher than others generalizes to new tasks. With the top-ranked PTM candidates, we further learn to enrich task tokens with their PTM-specific semantics to re-rank the PTMs for better selection. Model Spider balances efficiency and selection ability, making PTM selection like a spider preying on a web. Model Spider demonstrates promising performance in various configurations of model zoos.
... Table fragment (CV models; columns: Params, TBS):
ResNet18 [14]: 11.7M, TBS 8192, 16384, 32768
ResNet50 [14]: 25.6M
ResNet152 [14]: 60.2M
WideResNet101_2 [37]: 126.9M
ConvNextLarge [22]: 197.8M ...
Preprint
Full-text available
Training deep learning models in the cloud or on dedicated hardware is expensive. A more cost-efficient option is hyperscale clouds offering spot instances, a cheap but ephemeral alternative to on-demand resources. As spot instance availability can change depending on the time of day, continent, and cloud provider, it could be more cost-efficient to distribute resources over the world. Still, it has not been investigated whether geo-distributed, data-parallel spot deep learning training could be a more cost-efficient alternative to centralized training. This paper aims to answer the question: Can deep learning models be cost-efficiently trained on a global market of spot VMs spanning different data centers and cloud providers? To provide guidance, we extensively evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV and NLP models. To expand the current training options further, we compare the scalability potential for hybrid-cloud scenarios by adding cloud resources to on-premise hardware to improve training throughput. Finally, we show how leveraging spot instance pricing enables a new cost-efficient way to train models with multiple cheap VMs, trumping both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
... Relying on massive personal images, the industry has shown promising capabilities for developing artificial intelligence (AI) for many computer vision tasks, e.g., image classification [1,2], face recognition [3,4], action recognition [5,6], etc. In this process, a major conflict has emerged for software engineers between developing better AI systems and keeping their distance from the sensitive training data. ...
... Residual connection. The residual connection [1] can be formulated as F(X) + X. If the ordering of the output of the nonlinear mapping F(·) is co-variant with that of X, then the residual connection F(X) + X is co-variant with the ordering of X as well. ...
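As a concrete illustration of the formulation above, here is a minimal PyTorch sketch of a residual block computing F(X) + X (a simplified version; the original ResNet blocks also handle downsampling and channel changes):

import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions with batch norm
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)  # residual connection: F(x) + x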
... To destroy the human-recognizable contents, we choose RS as the encryption strategy to encrypt images. The reason is two-fold: (1) The key space of an image encrypted by RS is big enough and (2) The drop in performance is insignificant. To learn on the images encrypted by RS, we design permutation-invariant ViT (PEViT), defined as follows, ...
Preprint
Full-text available
Massive human-related data is collected to train neural networks for computer vision tasks. A major conflict is exposed relating to software engineers between better developing AI systems and distancing from the sensitive training data. To reconcile this conflict, this paper proposes an efficient privacy-preserving learning paradigm, where images are first encrypted to become "human-imperceptible, machine-recognizable" via one of the two encryption strategies: (1) random shuffling to a set of equally-sized patches and (2) mixing-up sub-patches of the images. Then, minimal adaptations are made to vision transformer to enable it to learn on the encrypted images for vision tasks, including image classification and object detection. Extensive experiments on ImageNet and COCO show that the proposed paradigm achieves comparable accuracy with the competitive methods. Decrypting the encrypted images requires solving an NP-hard jigsaw puzzle or an ill-posed inverse problem, which is empirically shown intractable to be recovered by various attackers, including the powerful vision transformer-based attacker. We thus show that the proposed paradigm can ensure the encrypted images have become human-imperceptible while preserving machine-recognizable information. The code is available at https://github.com/FushengHao/PrivacyPreservingML.
... DNNs are complex multi-dimensional non-linear functions. An example of a DNN with around eleven million trainable parameters is ResNet-18 [40]. For a better explanation of our metrics, however, we will use a simplified function, which is linear and only has two parameters (or dimensions): f(x) = θ_1 · x + θ_2. ...
... To be comparable to other FL defenses, we chose settings similar to related works and focus mainly on image classification with CIFAR-10 [43], GTSRB [91], and MNIST [25]. For model architectures, we use ResNet-18 [40], SqueezeNet [41], and a CNN. Additionally, we investigate the text domain by training a DistilBERT [82] transformer model on the SST-2 [89] sentiment analysis dataset. ...
... Default Scenario. We train the CIFAR-10 [43] image classification task (ten classes) on a ResNet-18 [40] model with LR 0.01 (SGD optimizer, momentum 0.9, decay 0.005). The federation consists of N = 20 clients, all of which are selected each round. ...
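For context, a minimal sketch of the plain FedAvg-style aggregation that such poisoning attacks target (an illustrative assumption; MESAS filters suspicious client updates with its metric cascade before any aggregation of this kind):

import copy
import torch

def fedavg(global_model, client_state_dicts):
    # Element-wise mean of the clients' parameters.
    avg = copy.deepcopy(client_state_dicts[0])
    for key in avg:
        avg[key] = torch.stack(
            [sd[key].float() for sd in client_state_dicts]
        ).mean(dim=0)
    global_model.load_state_dict(avg)
    return global_model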
Preprint
Federated Learning (FL) trains machine learning models on data distributed across multiple devices, avoiding data transfer to a central location. This improves privacy, reduces communication costs, and enhances model performance. However, FL is prone to poisoning attacks, which can be untargeted, aiming to reduce the model performance, or targeted, so-called backdoors, which add adversarial behavior that can be triggered with appropriately crafted inputs. Striving for stealthiness, backdoor attacks are harder to deal with. Mitigation techniques against poisoning attacks rely on monitoring certain metrics and filtering malicious model updates. However, previous works did not consider real-world adversaries and data distributions. To support our statement, we define a new notion of strong adaptive adversaries that can simultaneously adapt to multiple objectives and demonstrate through extensive tests that existing defense methods can be circumvented in this adversary model. We also demonstrate that existing defenses have limited effectiveness when no assumptions are made about underlying data distributions. To address realistic scenarios and adversary models, we propose Metric-Cascades (MESAS), a new defense that leverages multiple detection metrics simultaneously for the filtering of poisoned model updates. This approach forces adaptive attackers into a heavy multi-objective optimization problem, and our evaluation with nine backdoors and three datasets shows that even our strong adaptive attacker cannot evade MESAS's detection. We show that MESAS outperforms existing defenses in distinguishing backdoors from distortions originating from different data distributions within and across the clients. Overall, MESAS is the first defense that is robust against strong adaptive adversaries and is effective in real-world data scenarios while introducing a low overhead of 24.37s on average.
... For the dynamic weight sub-network, we replace the whole two-layer basic block with our AW-Net block mentioned in Section 4.2 based on the ResNet-18 [15], and the dimension of the final fully-connected layer adapts to the x's dimension of different applied dataset (e.g. CIFAR and Tiny-ImageNet) by average pooling operation. ...
... To ensure fair comparisons, all the networks are trained with the same optimization algorithm. Specifically, for CIFAR-10 and CIFAR-100, the compared networks include ResNet-34 [15], ResNet-50 [15], WideResNet-34-8 [49], VGG-16-BN [32] and RepVGG-A2 [7]. All of them are trained by the same state-of-the-art adversarial distillation method [53] with multi-teachers. ...
... To ensure fair comparisons, all the networks are trained with the same optimization algorithm. Specifically, for CIFAR-10 and CIFAR-100, the compared networks include ResNet-34 [15], ResNet-50 [15], WideResNet-34-8 [49], VGG-16-BN [32] and RepVGG-A2 [7]. All of them are trained by the same state-of-the-art adversarial distillation method [53] with multi-teachers. ...
Preprint
Adversarial attacks have been proven to be potential threats to Deep Neural Networks (DNNs), and many methods have been proposed to defend against them. However, while enhancing robustness, the clean accuracy declines to a certain extent, implying that a trade-off exists between accuracy and robustness. In this paper, we first empirically find an obvious distinction between standard and robust models in the filters' weight distribution of the same architecture, and then theoretically explain this phenomenon in terms of gradient regularization, which shows that this difference is an intrinsic property of DNNs, and thus that a static network architecture can hardly improve accuracy and robustness at the same time. Secondly, based on this observation, we propose a sample-wise dynamic network architecture named Adversarial Weight-Varied Network (AW-Net), which focuses on dealing with clean and adversarial examples with a "divide and rule" weight strategy. AW-Net dynamically adjusts the network's weights based on regulation signals generated by an adversarial detector, which is directly influenced by the input sample. Benefiting from the dynamic network architecture, clean and adversarial examples can be processed with different network weights, which provides the potential to enhance accuracy and robustness simultaneously. A series of experiments demonstrate that our AW-Net is architecture-friendly for handling both clean and adversarial examples and can achieve a better trade-off than state-of-the-art robust models.
... Since these architectures are freely available, it seems reasonable to identify architectures as state-of-the-art and then use them for load safety assessment. CNN architectures such as VGG [11], AlexNet [12], GoogleNet [13], and ResNet [14] are commonly used. However, in many application cases, the size of the input image for the CNN architecture is often not optimally chosen. ...
... Three CNN architectures are used for the experiment, which can be assigned to two categories. On the one hand, two deep ANNs were used, InceptionV3 [27] and ResNet101 [14]. On the other hand, one shallow architecture, called LogisticNet and based on the AlexNet architecture [12], is used. ...
... In the standardized design, InceptionV3 and ResNet101 have input resolutions of 299x299 [13,27] and 224x224 [14] pixels, respectively. The resolution can significantly impact the model's performance, as a higher resolution can lead to better classification accuracy [28]; the resolution to be chosen can be further influenced by the receptive field of an ANN [15,16]. ...
Preprint
Full-text available
Load safety assessment and compliance is an essential step in the corporate process of every logistics service provider. In 2020, a total of 11,371 police checks of trucks were carried out, during which violations of the load safety regulations were detected in 9.6% (1,091) of cases. For a logistics service provider, every load safety violation results in high fines and damage to reputation. An assessment of load safety supported by artificial intelligence (AI) will reduce the risk of accidents caused by unsecured loads and of fines during safety assessments. This work shows how photos of the load, taken by the truck driver or the loadmaster after the loading process, can be used to assess load safety. A trained two-stage artificial neural network (ANN) classifies these photos into three different classes: I) cargo loaded safely, II) cargo loaded unsafely, and III) unusable image. By applying several architectures of convolutional neural networks (CNN), it can be shown that it is possible to distinguish between unusable and usable images for cargo safety assessment. This distinction is quite crucial since the truck driver and the loadmaster sometimes provide photos without the essential image features, like the case structure of the truck and the whole cargo. A human operator or another ANN will then assess the load safety within the second stage.
... (He et al., 2016) trained on CIFAR-10 (Krizhevsky et al., 2009) and ResNet50 / variants of ViT (Dosovitskiy et al., 2020) trained on ImageNet-1K with AutoAugment (Cubuk et al., 2018) for 300 epochs. ResNet18s are trained on one NVIDIA GeForce RTX 3090 GPU; ResNet50s and ViT variants are trained on four GPUs. ...
... Right: DeiT-S on ImageNet-1K. ... each side as in He et al. (2016). Cross-Entropy loss is used if not specified otherwise. For DeiT-T and DeiT-S (Touvron et al., 2021a), the two ViT variants used in our experiments, we use AdamW (Loshchilov & Hutter, 2017) with a cosine annealing scheduler as the optimizer. ...
Preprint
Recent studies empirically demonstrate the positive relationship between the transferability of neural networks and the within-class variation of the last layer features. The recently discovered Neural Collapse (NC) phenomenon provides a new perspective of understanding such last layer geometry of neural networks. In this paper, we propose a novel metric, named Variability Collapse Index (VCI), to quantify the variability collapse phenomenon in the NC paradigm. The VCI metric is well-motivated and intrinsically related to the linear probing loss on the last layer features. Moreover, it enjoys desired theoretical and empirical properties, including invariance under invertible linear transformations and numerical stability, that distinguish it from previous metrics. Our experiments verify that VCI is indicative of the variability collapse and the transferability of pretrained neural networks.
... Implementation Details: Our model is implemented using PyTorch. For the first training stage, the RGB encoder is initialized with a ResNet50 [108] backbone and the depth encoder is initialized with VGG16-Net [107]. Both of the two backbones ... [Fig. 5: prediction comparison of Pred_raw, Pred_crf and Pred_ours, where "Image", "GT" and "Depth" are the RGB image, the ground truth map, and the depth map.] ...
... Output: parameters for encoders E_vgg and E_res, decoders D_vgg and D_res, the fusion module Fu, and the Mutual Information Optimization module. 1: Initialize encoder E_vgg with VGG16-Net [107], encoder E_res with ResNet50 [108], and other parameters by default. 2: for t ← 1 to T do ...
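A minimal sketch of how the two asymmetric encoders could be instantiated with torchvision (variable names are hypothetical; the paper's fusion module Fu and decoders are omitted):

import torch.nn as nn
from torchvision import models

# E_vgg: the VGG16 convolutional trunk; E_res: ResNet50 without avgpool/fc
e_vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
e_res = nn.Sequential(*list(resnet.children())[:-2])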
Preprint
In this paper, we present a weakly-supervised RGB-D salient object detection model via scribble supervision. Specifically, as a multimodal learning task, we focus on effective multimodal representation learning via inter-modal mutual information regularization. In particular, following the principle of disentangled representation learning, we introduce a mutual information upper bound with a mutual information minimization regularizer to encourage the disentangled representation of each modality for salient object detection. Based on our multimodal representation learning framework, we introduce an asymmetric feature extractor for our multimodal data, which is proven more effective than the conventional symmetric backbone setting. We also introduce multimodal variational auto-encoder as stochastic prediction refinement techniques, which takes pseudo labels from the first training stage as supervision and generates refined prediction. Experimental results on benchmark RGB-D salient object detection datasets verify both effectiveness of our explicit multimodal disentangled representation learning method and the stochastic prediction refinement strategy, achieving comparable performance with the state-of-the-art fully supervised models. Our code and data are available at: https://github.com/baneitixiaomai/MIRV.
... We analyze local model divergence and observe that the entropy of the penultimate layer output shares a similar pattern with the root-mean-square error (RMSE) loss of the model and can serve as a good indicator of the confidence of each local model. We then propose a novel use of entropy to identify the most confident local model. We evaluate our proposed approach with comprehensive experiments on a real-world dataset (Udacity [11]) using popular deep neural networks (PilotNet [12] and ResNet-8 [13]). The experimental results show that our proposed approach outperforms state-of-the-art methods. ...
... setting, we consider the training data from each trip as the local training data to a vehicular-client. We evaluate two popular neural networks, PilotNet [12] and ResNet-8 [13]. PilotNet consists of five convolutional layers and four fully-connected layers with 559K parameters. ...
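A rough sketch of the entropy-based confidence score described above (an assumption about the exact formulation; here the penultimate-layer outputs are softmax-normalized before computing Shannon entropy, and the lowest-entropy local model is picked as the teacher):

import torch
import torch.nn.functional as F

def predictive_entropy(penultimate):
    # penultimate: (batch, dim) activations of one local model
    p = F.softmax(penultimate, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean()

def select_teacher(local_models, penultimate_outputs):
    scores = torch.stack([predictive_entropy(z) for z in penultimate_outputs])
    return local_models[int(scores.argmin())]  # most confident local model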
Preprint
A fundamental challenge of autonomous driving is maintaining the vehicle in the center of the lane by adjusting the steering angle. Recent advances leverage deep neural networks to predict steering decisions directly from images captured by the car cameras. Machine learning-based steering angle prediction needs to consider the vehicle's limitation in uploading large amounts of potentially private data for model training. Federated learning can address these constraints by enabling multiple vehicles to collaboratively train a global model without sharing their private data, but it is difficult to achieve good accuracy as the data distribution is often non-i.i.d. across the vehicles. This paper presents a new confidence-based federated distillation method to improve the performance of federated learning for steering angle prediction. Specifically, it proposes the novel use of entropy to determine the predictive confidence of each local model, and then selects the most confident local model as the teacher to guide the learning of the global model. A comprehensive evaluation of vision-based lane centering shows that the proposed approach can outperform FedAvg and FedDF by 11.3% and 9%, respectively.
... 2) Encoder-Decoder Structure: The output of the Manhattan normal module (i.e., a three-channel map n ∈ R^{H×W×3}) is concatenated with the one-channel raw depth image d_raw to form the input to an encoder-decoder. The encoder-decoder of MCN, as shown in Fig. 5, is based on ResNet-18 [72] and pre-trained on the ImageNet dataset [73]. Given this input, the encoding stage downsamples the feature size by 32 times and expands the feature dimension to 512. ...
... Besides the main GAN structure, to enhance the effects of texture information in generating depth maps, we form a structure of CycleGAN [59] with an auxiliary pair of generator G_r(·) and discriminator D_r(·), which generate RGB images from depth maps and distinguish generated RGB images from real RGB images, respectively. G_r(·) employs the ResNet-18 architecture [72], and D_r(·) follows the same architecture as D(·) except with no condition inputs. We adopt the objective functions of WGAN [74] and CycleGAN [59] for training RDFC-GAN. ...
Preprint
The raw depth image captured by indoor depth sensors usually has an extensive range of missing depth values due to inherent limitations such as the inability to perceive transparent objects and the limited distance range. The incomplete depth map with missing values burdens many downstream vision tasks, and a rising number of depth completion methods have been proposed to alleviate this issue. While most existing methods can generate accurate dense depth maps from sparse and uniformly sampled depth maps, they are not suitable for complementing large contiguous regions of missing depth values, which is common and critical in images captured in indoor environments. To overcome these challenges, we design a novel two-branch end-to-end fusion network named RDFC-GAN, which takes a pair of RGB and incomplete depth images as input to predict a dense and completed depth map. The first branch employs an encoder-decoder structure, by adhering to the Manhattan world assumption and utilizing normal maps from RGB-D information as guidance, to regress the local dense depth values from the raw depth map. In the other branch, we propose an RGB-depth fusion CycleGAN to transfer the RGB image to the fine-grained textured depth map. We adopt adaptive fusion modules named W-AdaIN to propagate the features across the two branches, and we append a confidence fusion head to fuse the two outputs of the branches for the final depth map. Extensive experiments on NYU-Depth V2 and SUN RGB-D demonstrate that our proposed method clearly improves the depth completion performance, especially in a more realistic setting of indoor environments, with the help of our proposed pseudo depth maps in training.
... Machine learning models have achieved remarkable performance on a wide range of tasks in computer vision (He et al., 2016) and natural language processing (Devlin et al., 2019). However, due to the lack of strict supervision in crowd- ... [Figure: comparison of the distributions using the discrepancy information (i.e., the output discrepancy of current and historical models).] ...
... The second phase is termed the victim phase, in which the adversaries begin to inject poison samples to attack the current model. As in (Pang et al., 2021), we train ResNet-18 (He et al., 2016) using the SGD optimizer with learning rate 0.1, momentum 0.9, and weight decay 0.0001. During the whole process, we keep the batch size of the data stream at 100. ...
Preprint
Adversarial poisoning attacks pose huge threats to various machine learning applications. Especially, the recent accumulative poisoning attacks show that it is possible to achieve irreparable harm on models via a sequence of imperceptible attacks followed by a trigger batch. Due to the limited data-level discrepancy in real-time data streaming, current defensive methods are indiscriminate in handling the poison and clean samples. In this paper, we dive into the perspective of model dynamics and propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information. By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples based on their distinct dynamics from the clean samples. We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks. Extensive experiments comprehensively characterized Memorization Discrepancy and verified its effectiveness. The code is publicly available at: https://github.com/tmlr-group/Memorization-Discrepancy.
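One plausible instantiation of the measure sketched in the abstract (an assumption for illustration; the paper defines Memorization Discrepancy precisely): compare the outputs of the current model with those of a stored historical checkpoint on the same batch, and flag samples whose discrepancy is unusually large.

import torch

@torch.no_grad()
def memorization_discrepancy(model_now, model_hist, x):
    # Per-sample L2 distance between current and historical model outputs.
    return (model_now(x) - model_hist(x)).flatten(1).norm(dim=1)

def flag_poison_candidates(model_now, model_hist, x, threshold):
    # Samples with distinct dynamics from clean data exceed the threshold.
    return memorization_discrepancy(model_now, model_hist, x) > threshold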
... Point-wise density weighting for each context image. For each input context image C_i^{inputs}, our geometry model first extracts semantic features using a ResNet50 [10] backbone and then reshapes the encoded feature into a four-dimensional volumetric representation V_i ∈ R^{c×d×h×w}, where h and w are the height and width of the feature volume, respectively, d is the depth resolution, and c is the feature dimension. We pixel-align the spatial dimensions of the volume to those of the original input image via bilinear upsampling. ...
... Feature Extraction Backbone. To encode a volumetric feature field, we input a context image through a ResNet50 [10] backbone and extract multi-scale feature maps from the first three blocks, having feature dimensions 256, 512, and 1024, respectively. To accommodate an additional depth dimension, the feature dimension is divided by the depth, which we set to 64, resulting in features of size 4, 8, and 16 for the three volumes, respectively. ...
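A sketch of the reshaping described above, assuming a (B, C, H, W) backbone feature map and a depth resolution of 64 (function and argument names are hypothetical):

import torch
import torch.nn.functional as F

def to_feature_volume(feat, depth=64, image_hw=None):
    b, c, h, w = feat.shape
    vol = feat.view(b, c // depth, depth, h, w)  # (B, C/depth, d, h, w)
    if image_hw is not None:
        # pixel-align the spatial dims to the input image; depth is kept fixed
        vol = F.interpolate(vol, size=(depth, *image_hw),
                            mode="trilinear", align_corners=False)
    return vol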
Preprint
Full-text available
Synthesizing novel view images from a few views is a challenging but practical problem. Existing methods often struggle with producing high-quality results or necessitate per-object optimization in such few-view settings due to the insufficient information provided. In this work, we explore leveraging the strong 2D priors in pre-trained diffusion models for synthesizing novel view images. 2D diffusion models, nevertheless, lack 3D awareness, leading to distorted image synthesis and compromising the identity. To address these problems, we propose DreamSparse, a framework that enables the frozen pre-trained diffusion model to generate geometry and identity-consistent novel view image. Specifically, DreamSparse incorporates a geometry module designed to capture 3D features from sparse views as a 3D prior. Subsequently, a spatial guidance model is introduced to convert these 3D feature maps into spatial information for the generative process. This information is then used to guide the pre-trained diffusion model, enabling it to generate geometrically consistent images without tuning it. Leveraging the strong image priors in the pre-trained diffusion models, DreamSparse is capable of synthesizing high-quality novel views for both object and scene-level images and generalising to open-set images. Experimental results demonstrate that our framework can effectively synthesize novel view images from sparse views and outperforms baselines in both trained and open-set category images. More results can be found on our project page: https://sites.google.com/view/dreamsparse-webpage.
... Results on Vision. We first adopt SimCLR [10] and MoCo v3 [11] as baseline methods, with ResNet-50 [21] as the backbone. We train the model for 800 epochs with a batch size of 2048 for SimCLR and 4096 for MoCo v3. ...
... For text representation learning, we evaluate the method on a one-million-sentence English Wikipedia dataset, which is used in SimCSE and can be downloaded from the HuggingFace repository. In the image domain, we apply SimCLR [10] and MoCo v3 [11] as the baseline methods, with ResNet-50 [21] as the encoder to learn image representations. The feature map generated by the ResNet-50 block is projected to a 128-D image embedding via a two-layer MLP (2048-D hidden layer with ReLU activation). ...
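The projection head described above, as a short PyTorch sketch (2048-d ResNet-50 features, a 2048-d hidden layer with ReLU, then a 128-d embedding):

import torch.nn as nn

projection_head = nn.Sequential(
    nn.Linear(2048, 2048),  # 2048-D hidden layer
    nn.ReLU(inplace=True),
    nn.Linear(2048, 128),   # 128-D image embedding
)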
Preprint
In-Batch contrastive learning is a state-of-the-art self-supervised method that brings semantically-similar instances close while pushing dissimilar instances apart within a mini-batch. Its key to success is the negative sharing strategy, in which every instance serves as a negative for the others within the mini-batch. Recent studies aim to improve performance by sampling hard negatives within the current mini-batch, whose quality is bounded by the mini-batch itself. In this work, we propose to improve contrastive learning by sampling mini-batches from the input data. We present BatchSampler (the code is available at https://github.com/THUDM/BatchSampler) to sample mini-batches of hard-to-distinguish (i.e., hard and true negatives to each other) instances. To make each mini-batch have fewer false negatives, we design the proximity graph of randomly-selected instances. To form the mini-batch, we leverage random walk with restart on the proximity graph to help sample hard-to-distinguish instances. BatchSampler is a simple and general technique that can be directly plugged into existing contrastive learning models in vision, language, and graphs. Extensive experiments on datasets of three modalities show that BatchSampler can consistently improve the performance of powerful contrastive models, as shown by significant improvements of SimCLR on ImageNet-100, SimCSE on STS (language), and GraphCL and MVGRL on graph datasets.
... In the experiments on simulated datasets, we use ResNet-18 [19] for Fashion-MNIST and Kuzushiji-MNIST, and ResNet-34 networks [19] for CIFAR10 and SVHN. The noise-transition networks are the same architecture as the classification network, but the last linear layer is modified according to the transition matrix shape. ...
Preprint
Learning from crowds describes the setting in which the annotations of training data are obtained from crowd-sourcing services. Multiple annotators each complete their own small part of the annotations, and labeling mistakes that depend on the annotator occur frequently. Modeling the label-noise generation process by the noise transition matrix is a powerful tool to tackle label noise. In real-world crowd-sourcing scenarios, noise transition matrices are both annotator- and instance-dependent. However, due to the high complexity of annotator- and instance-dependent transition matrices (AIDTM), annotation sparsity, which means each annotator only labels a small part of the instances, makes modeling AIDTM very challenging. Prior works simplify the problem by assuming the transition matrix is instance-independent or by using a simple parametric form, thereby losing modeling generality. Motivated by this, we target a more realistic problem: estimating general AIDTM in practice. Without losing modeling generality, we parameterize AIDTM with deep neural networks. To alleviate the modeling challenge, we suppose every annotator shares its noise pattern with similar annotators, and estimate AIDTM via knowledge transfer. We hence first model the mixture of noise patterns of all annotators, and then transfer this modeling to individual annotators. Furthermore, considering that the transfer from the mixture of noise patterns to individuals may cause two annotators with highly different noise generation to perturb each other, we employ knowledge transfer between identified neighboring annotators to calibrate the modeling. Experiments confirm the superiority of the proposed approach on synthetic and real-world crowd-sourcing data. Source codes will be released.
... In order to establish UCBS, we employed the widely-used ResNet model (34 layers) [41], pretrained on the ILSVRC2012 dataset (ImageNet) [42], and adapted it for the concept learning step. The target classes were randomly selected among the ImageNet classes (generally 100 classes to report the following measures). ...
... We selected the ResNet model [41] comprising 34 layers and pre-trained on the ImageNet dataset. This network has the classification head for a total of 1000 classes. ...
Preprint
Full-text available
Explainability of intelligent models has been garnering increasing attention in recent years. Of the various explainability approaches, concept-based techniques are notable for utilizing a set of human-meaningful concepts instead of focusing on individual pixels. However, there is a scarcity of methods that consistently provide both local and global explanations. Moreover, most of the methods offer no way to explain misclassification cases. To address these challenges, our study follows a straightforward yet effective approach. We propose a unified concept-based system, which inputs a number of super-pixelated images into the networks, allowing them to learn better representations of the target's objects as well as the target's concepts. This method automatically learns, scores, and extracts local and global concepts. Our experiments revealed that, in addition to enhancing performance, the models could provide deeper insights into predictions and elucidate false classifications.
... The development of deep learning (DL) [9,10] was soon applied to the field of image analysis [11-13]. Moreover, there has been much research work on the analysis of ultrasound [14,15] and echocardiography [16-19]. ...
... The overall model structure is depicted in Fig. 3. For each frame, we use a resNet18 [9] to obtain the feature representation and concatenate them to form an N × D general video representation, where N is the number of frames used for training and D is the dimension of the feature. Following this, we use attention pooling to generate N × 1 weights and multiply them by the general feature representation to form the final spatial representation D_spatial. ...
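A minimal sketch of the attention pooling step described above (a common formulation, assumed here: a learned linear scorer, softmax over the N frames, then a weighted sum):

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                        # feats: (N, D) frame features
        w = torch.softmax(self.score(feats), dim=0)  # (N, 1) frame weights
        return (w * feats).sum(dim=0)                # D_spatial: (D,)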
Preprint
Full-text available
Purpose: Congenital heart defect (CHD) is the most common birth defect. Transthoracic echocardiography (TTE) can provide sufficient cardiac structure information, evaluate hemodynamics and cardiac function, and is an effective method for atrial septal defect (ASD) examination. This paper aims to study a deep learning method based on cardiac ultrasound video to assist in ASD diagnosis. Materials and methods: We select two standard views, the atrial septum view (subAS) and the low parasternal four-chamber view (LPS4C), to identify ASD. We enlist data from 300 child patients as part of a double-blind experiment for five-fold cross-validation to verify the performance of our model. In addition, data from 30 child patients (15 positives and 15 negatives) are collected for clinician testing and compared to our model's test results (these 30 samples do not participate in model training). We propose an echocardiography video-based atrial septal defect diagnosis system. In our model, we present a block random selection, a maximal agreement decision, and a frame sampling strategy for training and testing, respectively; resNet18 and r3D networks are used to extract the frame features and aggregate them to build a rich video-level representation. Results: We validate our model on our private dataset by five-fold cross-validation. For ASD detection, we achieve 89.33 AUC, 84.95 accuracy, 85.70 sensitivity, 81.51 specificity, and 81.99 F1 score. Conclusion: The proposed model is a multiple-instance-learning-based deep learning model for video atrial septal defect detection which effectively improves ASD detection accuracy compared to the performance of previous networks and clinical doctors.
... With the MML2ta model, we aim to build a basic architecture to integrate visual and textual features of the issue reports for our specific classification task. We use the fastText embeddings (Bojanowski et al. 2017) to obtain the vector representations for the textual data and the ResNet image recognition model (He et al. 2016) for the visual representations. The vector outputs are passed through a linear layer separately to reduce their dimension. ...
... We have, furthermore, employed well-known and frequently-used pre-processing steps to analyze the text extracted from the issue reports and the screenshot attachments, including tokenization and removal of non-letter characters (Manning et al. 2008). Similarly, the architectures of the machine learning models we used to extract and analyze the visual features present in the screenshot attachments, namely VGG1a and MML2ta, were also published in the literature (Simonyan and Zisserman 2014; Joulin et al. 2017; Bojanowski et al. 2017; He et al. 2016). We (unless otherwise stated) used the machine learning models with their default configurations. ...
Preprint
Full-text available
In previous work, we deployed IssueTAG, which uses the texts present in the one-line summary and the description fields of the issue reports to automatically assign them to the stakeholders who are responsible for resolving the reported issues. Since its deployment on January 12, 2018 at Softtech, i.e., the software subsidiary of the largest private bank in Turkey, IssueTAG has made a total of 301,752 assignments (as of November 2021). One observation we make is that a large fraction of the issue reports submitted to Softtech has screenshot attachments and, in the presence of such attachments, the reports often convey less information in their one-line summary and the description fields, which tends to reduce the assignment accuracy. In this work, we use the screenshot attachments as an additional source of information to further improve the assignment accuracy, which (to the best of our knowledge) has not been studied before in this context. In particular, we develop a number of multi-source (using both the issue reports and the screenshot attachments) and single-source assignment models (using either the issue reports or the screenshot attachments) and empirically evaluate them on real issue reports. In the experiments, compared to the currently deployed single-source model in the field, the best multi-source model developed in this work significantly (both in the practical and statistical sense) improved the assignment accuracy for the issue reports with screenshot attachments from 0.843 to 0.858 at acceptable overhead costs, a result strongly supporting our basic hypothesis.
... Pixel-level module contains a backbone and a pixel decoder. Specifically, we first employ the backbone, such as ResNet [62] or Swin Transformer [63], to extract a deep low-resolution feature F_low ∈ R^{(H/64)×(W/64)×C}. Afterwards, we use a pixel decoder to progressively upsample the deep feature to generate pyramid features of various resolutions F_p^i, i = 1, 2, 3, 4. The first three successive pyramid features F_p^i, i = 1, 2, 3 are fed to successive transformer decoder layers to generate mask embeddings, and the last pyramid feature F_p^4 is used as the pixel-level embeddings F_pixel. ...
... We adopt the deep model ResNet50 [62] or Swin-T [63] pre-trained on ImageNet as the backbone. The strides of the feature maps F_p^{1,2,3,4} generated by the pixel decoder are 32, 16, 8, and 4, and the number of feature channels is 256. ...
Preprint
This paper introduces an approach, named DFormer, for universal image segmentation. The proposed DFormer views the universal image segmentation task as a denoising process using a diffusion model. DFormer first adds various levels of Gaussian noise to ground-truth masks, and then learns a model to predict denoising masks from corrupted masks. Specifically, we take deep pixel-level features along with the noisy masks as inputs to generate mask features and attention masks, employing a diffusion-based decoder to perform mask prediction gradually. At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly-generated masks. Extensive experiments reveal the merits of our proposed contributions on different image segmentation tasks: panoptic segmentation, instance segmentation, and semantic segmentation. Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on the MS COCO val2017 set. Further, DFormer achieves promising semantic segmentation performance, outperforming the recent diffusion-based method by 2.2% on the ADE20K val set. Our source code and models will be publicly available at https://github.com/cp3wan/DFormer
... To reliably identify unique cell clusters within each imaging frame despite the presence of background noise, debris, and out-of-focus organoids, we performed image segmentation using a U-Net architecture [72] with ResNet-34 [16,73] as the backbone. U-Net, a type of convolutional neural network (CNN), consists of an encoder that extracts rich feature maps from an input image and a decoder that expands the resolution of the feature maps back to the image's original size. ...
... The processed images were then segmented into individual cells or organoids using a convolutional neural network (U-Net architecture [72] with a ResNet-34 encoder [16,73]). The model was initialized with weights derived from a model pretrained on the ImageNet dataset [126]. ...
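One way to build such a segmenter with the segmentation_models_pytorch library (a sketch under the assumption of a single foreground mask; the authors' exact configuration may differ):

import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",     # ResNet-34 backbone
    encoder_weights="imagenet",  # initialized from ImageNet pretraining
    in_channels=3,
    classes=1,                   # binary cell/organoid vs. background mask (assumed)
)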
Article
Full-text available
High throughput drug screening is an established approach to investigate tumor biology and identify therapeutic leads. Traditional platforms use two-dimensional cultures which do not accurately reflect the biology of human tumors. More clinically relevant model systems such as three-dimensional tumor organoids can be difficult to scale and screen. Manually seeded organoids coupled to destructive endpoint assays allow for the characterization of treatment response, but do not capture transitory changes and intra-sample heterogeneity underlying clinically observed resistance to therapy. We present a pipeline to generate bioprinted tumor organoids linked to label-free, time-resolved imaging via high-speed live cell interferometry (HSLCI) and machine learning-based quantitation of individual organoids. Bioprinting cells gives rise to 3D structures with unaltered tumor histology and gene expression profiles. HSLCI imaging in tandem with machine learning-based segmentation and classification tools enables accurate, label-free parallel mass measurements for thousands of organoids. We demonstrate that this strategy identifies organoids transiently or persistently sensitive or resistant to specific therapies, information that could be used to guide rapid therapy selection.
... It increases the computational time of the network to store all the features in a one-dimensional layer. He et al. [20] discussed the importance of residual connections in convolutional neural networks, called residual networks (ResNets). The residual connections are used to prevent the loss of information, reduce the training error, and increase the accuracy of the network. ...
... The dimensions of the computed image must be the same at both terminals of a residual connection for identity mapping. In ResNets, identity mapping was performed by padding dimensions with zero entries and using a stride of 2 [20]. In the proposed approach, three groups of layers are designed such that each group consists of convolution, pooling, and unpooling layers. ...
Article
Full-text available
Smog is one of the air pollutants that make it difficult for drivers to see. Smog is a mixture of fog and smoke that produces black fumes and reduces drivers' visibility to within the range of one kilometre. The small size and high density of smog particles, in comparison to other air pollutants, impede drivers' vision on the road. To resolve these problems, researchers have designed a number of visibility restoration models. However, the development of an adequate desmogging technique remains a challenging issue. The aerial and sensing imaging of machine vision systems are modified by the desmogging model. In this paper, a residual regression network (RRNet) followed by morphological erosion is proposed to produce a transmission map. The atmospheric light is estimated by using a 2D order-statistic filter. The smoggy image is further reconstructed to obtain the clear scene radiance. Thus, the proposed model can remove smog from road images in an effective manner. The proposed model is evaluated on four well-known benchmark datasets and compared with five well-known desmogging techniques. The performance of the proposed desmogging model is evaluated in terms of color deviation, structural similarity index, and peak signal-to-noise ratio. It is found to be superior to the existing models in terms of various performance metrics, namely fog-aware density evaluation, naturalness image quality evaluator, perception-based image quality, blind/referenceless image spatial quality evaluator, and image entropy, by 2.2%, 1.17%, 8.05%, 2.64%, and 0.69%, respectively.
... Residual connections [14] assist deep networks in learning better representations by directly transferring a layer's input to a later layer's output so that only the residual transformation needs to be modelled. This makes the learning task easier. ...
... This can be viewed from another perspective of residual learning as well. Residual learning such as the popular ResNet model [14] works by learning easier, residual sub-problems within deep network layers instead of complete transformations. By incorporating a direct ultimate skip connection from input to output, we have applied residual learning to the original, end-to-end expression synthesis problem and made it easier to solve. ...
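A sketch of the ultimate skip idea described above (a hypothetical module; the encoder, residual block, and decoder making up the generator body are placeholders):

import torch.nn as nn

class UltimateSkipGenerator(nn.Module):
    def __init__(self, body: nn.Module):
        super().__init__()
        self.body = body  # e.g., encoder -> residual block -> decoder

    def forward(self, x):
        # The body only models the residual expression change; identity,
        # facial, and color details flow directly from input to output.
        return self.body(x) + x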
Article
Full-text available
We demonstrate the benefit of using an ultimate skip (US) connection for facial expression synthesis using generative adversarial networks (GANs). A direct connection transfers identity, facial, and color details from input to output while suppressing artifacts. The intermediate layers can therefore focus on expression generation only. This leads to a light-weight US-GAN model comprised of encoding layers, a single residual block, decoding layers, and an ultimate skip connection from input to output. US-GAN has 3× fewer parameters than state-of-the-art models and is trained on a dataset two orders of magnitude smaller. It yields a 7% increase in face verification score (FVS) and a 27% decrease in average content distance (ACD). Based on a randomized user study, US-GAN outperforms the state of the art by 25% in face realism, 43% in expression quality, and 58% in identity preservation.
... Part of the transferred convolutional neural network remained frozen, while the new layers were adapted on the relatively size-limited LoockMe dataset, improving the target model's accuracy and reducing the risk of overfitting. In this study, eight models with different architectures and layer types were used, including the most prominent: a) VGG [12], b) Inception [13], c) Xception [14], d) ResNet [15], e) NasNet [16], f) MobileNet [17], and g) DenseNet [18], available in the Keras [19] online repository. Hyper-parameters optimization. ...
Conference Paper
Full-text available
LoockMe is an artificial intelligence-powered location scouting platform that combines deep learning image analysis, cutting-edge machine learning natural language processing (NLP) for automated content annotation, and intelligent search. The platform's objective is to label input images of local landscapes, and/or any other assets that regional film offices want to expose to those interested in identifying potential locations for the film production industry. The deep learning-based image analysis achieved high classification performance with an AUC score of 99.4%. Moreover, the state-of-the-art machine learning NLP module enhances the platform's capabilities by analyzing text descriptions of the locations and thus allowing for automated annotation, while the intelligent search engine combines image analysis with NLP to extract relevant context from available data. The proposed artificial intelligence platform has the potential to substantially assist asset publishers and revolutionize the location scouting process for the film production industry in Greece.
... DNNs. We investigated the neural fits of 135 different DNNs representing a variety of approaches used in computer vision today: 62 convolutional neural networks (CNNs) trained on ImageNet [21-47], 23 DNNs trained on other datasets in addition to ImageNet (which we refer to as "DNN extra data") [32,36,48], 25 vision transformers (ViTs) [49-54], 10 DNNs trained with self-supervision [55,56], and 15 DNNs trained to be robust to noise or adversarial examples [57,58]. Each model was implemented in PyTorch with the TIMM toolbox (https://github.com/huggingface/ ...
Preprint
Full-text available
One of the most impactful findings in computational neuroscience over the past decade is that the object recognition accuracy of deep neural networks (DNNs) correlates with their ability to predict neural responses to natural images in the inferotemporal (IT) cortex. This discovery supported the long-held theory that object recognition is a core objective of the visual cortex, and suggested that more accurate DNNs would serve as better models of IT neuron responses to images. Since then, deep learning has undergone a revolution of scale: billion parameter-scale DNNs trained on billions of images are rivaling or outperforming humans at visual tasks including object recognition. Have today's DNNs become more accurate at predicting IT neuron responses to images as they have grown more accurate at object recognition? Surprisingly, across three independent experiments, we find this is not the case. DNNs have become progressively worse models of IT as their accuracy has increased on ImageNet. To understand why DNNs experience this trade-off and evaluate if they are still an appropriate paradigm for modeling the visual system, we turn to recordings of IT that capture spatially resolved maps of neuronal activity elicited by natural images. These neuronal activity maps reveal that DNNs trained on ImageNet learn to rely on different visual features than those encoded by IT and that this problem worsens as their accuracy increases. We successfully resolved this issue with the neural harmonizer, a plug-and-play training routine for DNNs that aligns their learned representations with humans. Our results suggest that harmonized DNNs break the trade-off between ImageNet accuracy and neural prediction accuracy that assails current DNNs and offer a path to more accurate models of biological vision.
... To extract features from V_L and I_F, we use an architecture similar to the one described in VisualVoice [37] (which is an audio-visual speech separation model). For I_F, the face embedding f ∈ R^{D_f} is computed using ResNet-18 [38] with the last two layers discarded. The lip video V_L is encoded using a lip-reading architecture [39]. ...
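The face encoder described above can be sketched with torchvision by dropping the final pooling and classification layers of ResNet-18 (a minimal sketch; the cited model's exact truncation point is taken from the snippet):

import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
face_encoder = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool, fc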
Preprint
Full-text available
Lip-to-speech involves generating a natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality speech with high levels of intelligibility for challenging and realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism where a pre-trained ASR serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test set and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the intelligibility of the produced speech, readily perceptible while listening, and is empirically reflected in the substantial reduction of the WER metric. We demonstrate the effectiveness of LipVoicer through human evaluation, which shows that it produces more natural and synchronized speech signals compared to competing methods. Finally, we created a demo showcasing LipVoicer's superiority in producing natural, synchronized, and intelligible speech, providing additional evidence of its effectiveness. Project page: https://lipvoicer.github.io
... For all Habitat experiments we used the same policy network as in (Wijmans et al., 2020), which includes a ResNet50 visual encoder (He et al., 2015) and a 2-layer LSTM (Hochreiter & Schmidhuber, 1997) policy. In addition to RGB and Depth images, the agent also receives GPS coordinates and compass orientation, represented by 3 scalars total, which are fed into the policy. ...
Preprint
Full-text available
Exploration in environments which differ across episodes has received increasing attention in recent years. Current methods use some combination of global novelty bonuses, computed using the agent's entire training experience, and episodic novelty bonuses, computed using only experience from the current episode. However, the use of these two types of bonuses has been ad-hoc and poorly understood. In this work, we shed light on the behavior of these two types of bonuses through controlled experiments on easily interpretable tasks as well as challenging pixel-based settings. We find that the two types of bonuses succeed in different settings, with episodic bonuses being most effective when there is little shared structure across episodes and global bonuses being effective when more structure is shared. We develop a conceptual framework which makes this notion of shared structure precise by considering the variance of the value function across contexts, and which provides a unifying explanation of our empirical results. We furthermore find that combining the two bonuses can lead to more robust performance across different degrees of shared structure, and investigate different algorithmic choices for defining and combining global and episodic bonuses based on function approximation. This results in an algorithm which sets a new state of the art across 16 tasks from the MiniHack suite used in prior work, and also performs robustly on Habitat and Montezuma's Revenge.
... In practice, the image encoder, depending on the recognizer, can be instantiated as a wide variety of model structures, usually including CNN-based, e.g., ResNet [16], and transformer-based [58] image encoders, along with extra modules, e.g., flexible rectification [47]. The recognition decoder receives the visual image embedding X̂ to produce a sequence of predicted characters Ŷ = {y_1, y_2, . . .
Preprint
Full-text available
Text recognition in the wild is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest vision and language processing are effective for scene text recognition. Yet, solving edit errors such as additions, deletions, or replacements is still the main challenge for existing approaches. In fact, the content of the text and its audio naturally correspond to each other, i.e., a single character error may result in a clearly different pronunciation. In this paper, we propose AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction to guide scene text recognition, which only participates in the training phase and brings no extra cost during the inference stage. The underlying principle of AudioOCR can be easily applied to the existing approaches. Experiments using 7 previous scene text recognition methods on 12 existing regular, irregular, and occluded benchmarks demonstrate that our proposed method brings consistent improvements. More importantly, through our experimentation, we show that AudioOCR possesses a generalizability that extends to more challenging scenarios, including recognizing non-English text, out-of-vocabulary words, and text with various accents. Code will be available at https://github.com/wenwenyu/AudioOCR.
... CLIP combines an image encoder (ResNet [80] or ViT [81]) and a text encoder (Transformer [82]), which independently map image and text representations to a shared embedding space. CLIP trains both encoders using a contrastive loss on a dataset of 400 million image-text pairs sourced from the Internet that contain diverse image classes and textual concepts. ...
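The contrastive objective can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. This is a simplified version: CLIP itself learns the temperature as a parameter, and 0.07 here is only an illustrative initial value.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # img_emb, txt_emb: (B, D); matching pairs share the same row index.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarities
        targets = torch.arange(img_emb.size(0), device=img_emb.device)
        # Cross-entropy in both directions: image-to-text and text-to-image.
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2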
Preprint
Full-text available
Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, the incorporation of CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models in enabling networks to comprehend 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks under extremely noisy pseudo labels, which are generated by CLIP and further exacerbated during the propagation from the 2D to the 3D domain. To tackle these challenges, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train 2D and 3D networks, then further impose the networks' latent space consistency using SAM's robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improving by 4.7% and 7.9%, respectively. On the nuScenes dataset, our performance is 26.8%, an improvement of 6%. Code will be released (https://github.com/runnanchen/Label-Free-Scene-Understanding).
... We evaluate the proposed method on three public video polyp detection benchmarks: SUN Colonoscopy Video Database [10,7] ... We use ResNet-50 [6] as our backbone and CenterNet [26] as our base detector. Following the same setting in CenterNet, we set λ_size = 0.1 and λ_off = 1. ...
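The weighted loss combination can be sketched as below. The function and argument names are hypothetical; only the weights λ_size = 0.1 and λ_off = 1 come from the text.

    def centernet_total_loss(l_heatmap, l_size, l_off,
                             lambda_size=0.1, lambda_off=1.0):
        # Weighted sum of the CenterNet heatmap, size, and offset losses.
        return l_heatmap + lambda_size * l_size + lambda_off * l_off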
Preprint
Accurate polyp detection is essential for assisting clinical rectal cancer diagnoses. Colonoscopy videos contain richer information than still images, making them a valuable resource for deep learning methods. Great efforts have been made to conduct video polyp detection through multi-frame temporal/spatial aggregation. However, unlike common fixed-camera video, the camera-moving scene in colonoscopy videos can cause rapid video jitters, leading to unstable training for existing video detection models. Additionally, the concealed nature of some polyps and the complex background environment further hinder the performance of existing video detectors. In this paper, we propose the YONA (You Only Need one Adjacent Reference-frame) method, an efficient end-to-end training framework for video polyp detection. YONA fully exploits the information of one previous adjacent frame and conducts polyp detection on the current frame without multi-frame collaborations. Specifically, for the foreground, YONA adaptively aligns the current frame's channel activation patterns with its adjacent reference frames according to their foreground similarity. For the background, YONA conducts background dynamic alignment guided by inter-frame difference to eliminate the invalid features produced by drastic spatial jitters. Moreover, YONA applies cross-frame contrastive learning during training, leveraging the ground truth bounding box to improve the model's perception of polyp and background. Quantitative and qualitative experiments on three public challenging benchmarks demonstrate that our proposed YONA outperforms previous state-of-the-art competitors by a large margin in both accuracy and speed.
... Experimental Settings We evaluate transfer flow and pseudo transfer flow on both CIFAR100 and our proposed ImageNet-based benchmark under different semantic similarity cases. We employ ResNet18 (He et al., 2016) as the backbone for both datasets, following Han et al. (2019) and Fini et al. (2021). Known-class data and unknown-class data are selected based on semantic similarity, as mentioned in Section 4. We first apply fully supervised learning to the labeled data for each dataset to obtain the pretrained model. ...
Preprint
Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset by leveraging prior knowledge of a labeled set comprising disjoint but related classes. Given that most existing literature focuses primarily on utilizing supervised knowledge from a labeled set at the methodology level, this paper considers the question: Is supervised knowledge always helpful at different levels of semantic relevance? To proceed, we first establish a novel metric, so-called transfer flow, to measure the semantic similarity between labeled/unlabeled datasets. To show the validity of the proposed metric, we build up a large-scale benchmark with various degrees of semantic similarities between labeled/unlabeled datasets on ImageNet by leveraging its hierarchical class structure. The results based on the proposed benchmark show that the proposed transfer flow is in line with the hierarchical class structure; and that NCD performance is consistent with the semantic similarities (measured by the proposed metric). Next, by using the proposed transfer flow, we conduct various empirical experiments with different levels of semantic similarity, yielding that supervised knowledge may hurt NCD performance. Specifically, using supervised information from a low-similarity labeled set may lead to a suboptimal result as compared to using pure self-supervised knowledge. These results reveal the inadequacy of the existing NCD literature which usually assumes that supervised knowledge is beneficial. Finally, we develop a pseudo-version of the transfer flow as a practical reference to decide if supervised knowledge should be used in NCD. Its effectiveness is supported by our empirical studies, which show that the pseudo transfer flow (with or without supervised knowledge) is consistent with the corresponding accuracy based on various datasets. Code is released at https://github.com/J-L-O/SK-Hurt-NCD
... The success of deep learning methods in different applications has motivated researchers to adapt them to time series data [18,19,47,3]. For instance, CNN-based models such as the Multi-Channel Deep Convolutional Neural Network (MC-DCNN) [49], the Residual Network (ResNet) [15], and Fully Convolutional Networks (FCN) [25] are used for time series classification tasks, and they show remarkable results [18]. Despite their impressive results in time series classification tasks, interpreting deep learning outputs remains a challenge. ...
Preprint
Full-text available
Conventional time series classification approaches based on bags of patterns or shapelets face significant challenges in dealing with a vast amount of feature candidates from high-dimensional multivariate data. In contrast, deep neural networks can learn low-dimensional features efficiently, and in particular, Convolutional Neural Networks (CNN) have shown promising results in classifying Multivariate Time Series (MTS) data. A key factor in the success of deep neural networks is their astonishing expressive power. However, this power comes at the cost of complex, black-boxed models, conflicting with the goals of building reliable and human-understandable models. An essential criterion in understanding such predictive deep models involves quantifying the contribution of time-varying input variables to the classification. Hence, in this work, we introduce a new framework for interpreting multivariate time series data by extracting and clustering the input representative patterns that highly activate CNN neurons. This way, we identify each signal's role and dependencies, considering all possible combinations of signals in the MTS input. Then, we construct a graph that captures the temporal relationship between the extracted patterns for each layer. An effective graph merging strategy finds the connection of each node to the previous layer's nodes. Finally, a graph embedding algorithm generates new representations of the created interpretable time-series features. To evaluate the performance of our proposed framework, we run extensive experiments on eight datasets of the UCR/UEA archive, along with the HAR and PAM datasets. The experiments indicate the benefit of our time-aware graph-based representation in MTS classification while enriching it with more interpretability.
... Past research efforts have formulated the stress prediction problem as image-to-image translation [17][18][19], where the input image is the design and the output image is the full-field stress contour. As the designs are typically 2D or 3D and contain important spatial information, convolutional neural networks (CNNs) [20] are widely used, such as the residual network [17,21], the U-Net [22], the residual U-Net (ResUNet) [3,23], and the generative adversarial network [19,24,25]. These network architectures typically consist of a convolutional encoder that extracts and encodes important spatial information in the input geometry and compresses it into a compact latent space. ...
Preprint
Full-text available
A novel deep operator network (DeepONet) with a residual U-Net (ResUNet) as the trunk network is devised to predict full-field highly nonlinear elastic-plastic stress response for complex geometries obtained from topology optimization under variable loads. The proposed DeepONet uses a ResUNet in the trunk to encode complex input geometries, and a fully-connected branch network encodes the parametric loads. Additional information fusion is introduced via an element-wise multiplication of the encoded latent space to improve prediction accuracy further. The performance of the proposed DeepONet was compared to two baseline models, a standalone ResUNet and a DeepONet with fully connected networks as the branch and trunk. The results show that ResUNet and the proposed DeepONet share comparable accuracy; both can predict the stress field and accurately identify stress concentration points. However, the novel DeepONet is more memory efficient and allows greater flexibility with framework architecture modifications. The DeepONet with fully connected networks suffers from high prediction error due to its inability to effectively encode the complex, varying geometry. Once trained, all three networks can predict the full stress distribution orders of magnitude faster than finite element simulations. The proposed network can quickly guide preliminary optimization, designs, sensitivity analysis, uncertainty quantification, and many other nonlinear analyses that require extensive forward evaluations with variable geometries, loads, and other parameters. This work marks the first time a ResUNet is used as the trunk network in the DeepONet architecture and the first time that DeepONet solves problems with complex, varying input geometries under parametric loads and elasto-plastic material behavior.
... In CA, only x'^l_cls is used as the query, as it contains the information of the l-branch and realizes linear computational complexity. Furthermore, the feed-forward network (FFN) in the original ViT [35] is replaced by a residual connection [36] to reduce the number of parameters and preserve information from x^l_cls. To this end, we get the class token x''^l_cls after acquiring the information from the small-scale branch, namely ...
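A rough PyTorch sketch of this cross-attention step, assuming hypothetical dimensions; the cross-branch projections and layer norms of the full CrossViT block are omitted.

    import torch.nn as nn

    class CrossAttentionBlock(nn.Module):
        # Only the l-branch class token queries the small-scale branch tokens;
        # a residual connection replaces the usual FFN, as described above.
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, cls_l, tokens_s):
            # cls_l: (B, 1, dim) class token (query)
            # tokens_s: (B, N, dim) small-branch tokens (key/value)
            out, _ = self.attn(cls_l, tokens_s, tokens_s)
            return cls_l + out  # residual connection instead of an FFN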
Preprint
Full-text available
Cytology test is effective, non-invasive, convenient, and inexpensive for clinical cancer screening. ThinPrep, a commonly used liquid-based specimen, can be scanned to generate digital whole slide images (WSIs) for cytology testing. However, WSIs classification with gigapixel resolutions is highly resource-intensive, posing significant challenges for automated medical image analysis. In order to circumvent this computational impasse, existing methods emphasize learning features at the cell or patch level, typically requiring labor-intensive and detailed manual annotations, such as labels at the cell or patch level. Here we propose a novel automated Label-Efficient WSI Screening method, dubbed LESS, for cytology-based diagnosis with only slide-level labels. Firstly, in order to achieve label efficiency, we suggest employing variational positive-unlabeled (VPU) learning, enhancing patch-level feature learning using WSI-level labels. Subsequently, guided by the clinical approach of scrutinizing WSIs at varying fields of view and scales, we employ a cross-attention vision transformer (CrossViT) to fuse multi-scale patch-level data and execute WSI-level classification. We validate the proposed label-efficient method on a urine cytology WSI dataset encompassing 130 samples (13,000 patches) and FNAC 2019 dataset with 212 samples (21,200 patches). The experiment shows that the proposed LESS reaches 84.79%, 85.43%, 91.79% and 78.30% on a urine cytology WSI dataset, and 96.53%, 96.37%, 99.31%, 94.95% on FNAC 2019 dataset in terms of accuracy, AUC, sensitivity and specificity. It outperforms state-of-the-art methods and realizes automatic cytology-based bladder cancer screening.
... As shown in Fig. 3a, for global channel dependency modeling at the frame level, a GAP is first applied over the spatial dimensions to obtain each frame's statistics s ∈ R^(C×1×1×1). Then, to effectively capture the dimension-wise non-linear global dependency f_global and evaluate the channel-wise importance, MTA utilizes an MLP activated by Leaky ReLU, which follows the bottleneck design [34,30] with a dimension reduction ratio r: ...
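A sketch of this bottleneck channel attention under stated assumptions: the sigmoid gating and the 5-D video layout are illustrative additions, since the text only specifies GAP followed by a Leaky ReLU MLP with reduction ratio r.

    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels, r=4):
            super().__init__()
            # Bottleneck MLP: C -> C/r -> C, activated by Leaky ReLU.
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // r),
                nn.LeakyReLU(inplace=True),
                nn.Linear(channels // r, channels),
                nn.Sigmoid(),
            )

        def forward(self, x):
            # x: (B, C, T, H, W) features; pool away everything but channels
            s = x.mean(dim=(2, 3, 4))             # (B, C) statistics via GAP
            w = self.mlp(s)                       # (B, C) channel importance
            return x * w.view(*w.shape, 1, 1, 1)  # reweight the channels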
Preprint
Gait recognition, which aims at identifying individuals by their walking patterns, has recently drawn increasing research attention. However, gait recognition still suffers from the conflicts between the limited binary visual clues of the silhouette and numerous covariates with diverse scales, which brings challenges to the model's adaptiveness. In this paper, we address this conflict by developing a novel MetaGait that learns to learn an omni sample adaptive representation. Towards this goal, MetaGait injects meta-knowledge, which could guide the model to perceive sample-specific properties, into the calibration network of the attention mechanism to improve the adaptiveness from the omni-scale, omni-dimension, and omni-process perspectives. Specifically, we leverage the meta-knowledge across the entire process, where Meta Triple Attention and Meta Temporal Pooling are presented respectively to adaptively capture omni-scale dependency from spatial/channel/temporal dimensions simultaneously and to adaptively aggregate temporal information through integrating the merits of three complementary temporal aggregation methods. Extensive experiments demonstrate the state-of-the-art performance of the proposed MetaGait. On CASIA-B, we achieve rank-1 accuracy of 98.7%, 96.0%, and 89.3% under three conditions, respectively. On OU-MVLP, we achieve rank-1 accuracy of 92.4%.
... However, the authors make rather strong assumptions on the objective function, such as Lipschitz continuity of the objective function and certain assumptions on the stochastic gradient error. An analogous technique is applied by Dun et al. (2021) for training ResNet (He et al., 2016a). It is important to note that the IST framework differs from RPT in several key components. ...
Preprint
Full-text available
We propose a Randomized Progressive Training algorithm (RPT) -- a stochastic proxy for the well-known Progressive Training method (PT) (Karras et al., 2017). Originally designed to train GANs (Goodfellow et al., 2014), PT was proposed as a heuristic, with no convergence analysis even for the simplest objective functions. On the contrary, to the best of our knowledge, RPT is the first PT-type algorithm with rigorous and sound theoretical guarantees for general smooth objective functions. We cast our method into the established framework of Randomized Coordinate Descent (RCD) (Nesterov, 2012; Richtárik & Takáč, 2014), for which (as a by-product of our investigations) we also propose a novel, simple and general convergence analysis encapsulating strongly-convex, convex and nonconvex objectives. We then use this framework to establish a convergence theory for RPT. Finally, we validate the effectiveness of our method through extensive computational experiments.
... Thus, we draw upon the GAIN framework for further research. To enable GAN to learn more abstract features from posture after dimensionality reduction, we appropriately deepen the neural network by introducing residual structures [63]. This approach allows certain layers of the neural network to bypass connections with the next layer of neurons and instead connect to deeper layers, thereby attenuating the strong correlations between each layer and preventing potential degradation issues. ...
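The residual structure being referred to is the standard identity shortcut; a minimal sketch for an MLP-style generator layer follows, with hypothetical dimensions.

    import torch.nn as nn

    class ResidualMLPBlock(nn.Module):
        # The block input bypasses the transform and is added to its output,
        # letting earlier layers connect directly to deeper ones.
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))

        def forward(self, x):
            return x + self.net(x)  # identity shortcut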
Preprint
To mitigate the challenges arising from partial occlusion in human pose keypoint based pedestrian detection methods, we present a novel pedestrian pose keypoint completion method called the separation and dimensionality reduction-based generative adversarial imputation networks (SDR-GAIN). Firstly, we utilize OpenPose to estimate pedestrian poses in images. Then, we isolate the head and torso keypoints of pedestrians with incomplete keypoints due to occlusion or other factors and perform dimensionality reduction to enhance features and further unify feature distribution. Finally, we introduce two generative models based on the generative adversarial networks (GAN) framework, which incorporate Huber loss, residual structure, and L1 regularization to generate missing parts of the incomplete head and torso pose keypoints of partially occluded pedestrians, resulting in pose completion. Our experiments on MS COCO and JAAD datasets demonstrate that SDR-GAIN outperforms the basic GAIN framework, the interpolation methods PCHIP and MAkima, and the machine learning methods k-NN and MissForest in terms of the pose completion task. In addition, the runtime of SDR-GAIN is approximately 0.4 ms, displaying high real-time performance and significant application value in the field of autonomous driving.
... Since terms that exist in the figures may be uncommon words, we also used FastText (Bojanowski et al. 2017b) to obtain word embeddings with subword information. For the visual modality, we used ResNet152 (He et al. 2016) and Faster R-CNN (Ren et al. 2015) to extract features from images and bounding boxes. ...
Preprint
Full-text available
In scholarly documents, figures provide a straightforward way of communicating scientific findings to readers. Automating figure caption generation helps move model understandings of scientific documents beyond text and will help authors write informative captions that facilitate communicating scientific findings. Unlike previous studies, we reframe scientific figure captioning as a knowledge-augmented image captioning task in which models need to utilize knowledge embedded across modalities for caption generation. To this end, we extended the large-scale SciCap dataset (Hsu et al., 2021) to SciCap+ which includes mention-paragraphs (paragraphs mentioning figures) and OCR tokens. Then, we conduct experiments with the M4C-Captioner (a multimodal transformer-based model with a pointer network) as a baseline for our study. Our results indicate that mention-paragraphs serve as additional context knowledge, which significantly boosts the automatic standard image caption evaluation scores compared to the figure-only baselines. Human evaluations further reveal the challenges of generating figure captions that are informative to readers. The code and SciCap+ dataset will be publicly available at https://github.com/ZhishenYang/scientific_figure_captioning_dataset
... The model uses two input streams to classify each epoch: a high-frequency 5-minute HR/BR window sampled at 1 Hz and a lower frequency 50-minute window sampled at 0.1 Hz, both centred around the target epoch for classification. Each window of data is passed to a 1D ResNet [8] model, which transforms the two-channel input time series into a feature vector. Our ResNet architecture broadly follows the design of the original paper, except that 2D convolutions are replaced with their 1D equivalents. ...
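A sketch of the 1D adaptation described above, assuming a basic (non-bottleneck) residual block with a fixed channel width; the stem and downsampling layers are omitted.

    import torch.nn as nn

    class BasicBlock1D(nn.Module):
        # The original 2D ResNet basic block with Conv2d/BatchNorm2d swapped
        # for their 1D equivalents, applied to a feature sequence over time.
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv1d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm1d(channels)
            self.conv2 = nn.Conv1d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm1d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            # x: (B, C, T) features derived from the two-channel HR/BR input
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)  # residual addition, then activation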
Preprint
Conventional sleep monitoring is time-consuming, expensive and uncomfortable, requiring a large number of contact sensors to be attached to the patient. Video data is commonly recorded as part of a sleep laboratory assessment. If accurate sleep staging could be achieved solely from video, this would overcome many of the problems of traditional methods. In this work we use heart rate, breathing rate and activity measures, all derived from a near-infrared video camera, to perform sleep stage classification. We use a deep transfer learning approach to overcome data scarcity, by using an existing contact-sensor dataset to learn effective representations from the heart and breathing rate time series. Using a dataset of 50 healthy volunteers, we achieve an accuracy of 73.4% and a Cohen's kappa of 0.61 in four-class sleep stage classification, establishing a new state-of-the-art for video-based sleep staging.
... Each baseline name starts with the corresponding architecture: ResNet18 and ResNet50: standard ResNet architectures [25]; ConvNeXt-B: the base architecture of ConvNeXt [41]; ViT-T and ViT-S: ViT architectures [18] of size tiny and small respectively; SwinV2-T: a SwinV2-tiny architecture [40]; Million-AID ResNet50: a ResNet architecture with weights [70] pre-trained on Million-AID [42], a remote sensing dataset with a size comparable to ImageNet (only RGB). ...
Preprint
Full-text available
Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprised of six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.
... While the ResNet-18 architecture used in our experiments may not achieve state-of-the-art results on CIFAR and ImageNet, it provides a suitable platform to evaluate the effectiveness of active learning strategies in a competitive environment, where these strategies have been shown to be beneficial. In the following experiments, we trained ResNet-18 [11] on CIFAR-10, CIFAR-100 [14] and ImageNet-50, a subset of ImageNet [5] containing 50 classes, as done in [27]. We use the same hyper-parameters as in [18], as detailed in Suppl. ...
Preprint
Full-text available
In Active Learning (AL), a learner actively chooses which unlabeled examples to query for labels from an oracle, under some budget constraints. Different AL query strategies are more suited to different problems and budgets. Therefore, in practice, knowing in advance which AL strategy is most suited for the problem at hand remains an open problem. To tackle this challenge, we propose a practical derivative-based method that dynamically identifies the best strategy for each budget. We provide theoretical analysis of a simplified case to motivate our approach and build intuition. We then introduce a method to dynamically select an AL strategy based on the specific problem and budget. Empirical results showcase the effectiveness of our approach across diverse budgets and computer vision tasks.
... The results from the different wavelet functions were then compared. A ResNet18 model [45] was used to detect four emotional states in an MCC experiment via a subject-biased approach in which subjects and trials were merged. Notably, the proposed method showed an accuracy improvement of approximately 20% (with accuracy equaling 77.66%) over the baseline, with the GGW activation function yielding the highest performance scores (accuracy = 99.57%). ...
Preprint
The integration of emotional intelligence in machines is an important step in advancing human-computer interaction. This demands the development of reliable end-to-end emotion recognition systems. However, the scarcity of public affective datasets presents a challenge. In this literature review, we emphasize the use of generative models to address this issue in neurophysiological signals, particularly Electroencephalogram (EEG) and Functional Near-Infrared Spectroscopy (fNIRS). We provide a comprehensive analysis of different generative models used in the field, examining their input formulation, deployment strategies, and methodologies for evaluating the quality of synthesized data. This review serves as a comprehensive overview, offering insights into the advantages, challenges, and promising future directions in the application of generative models in emotion recognition systems. Through this review, we aim to facilitate the progression of neurophysiological data augmentation, thereby supporting the development of more efficient and reliable emotion recognition systems.
... The baseline model consists of an encoder-decoder model based on a T5 text encoder [22] and a ResNet-50 [10] vision encoder. The two modalities are fused to form a joint feature space from which a smaller set of tokens is learned. ...
Preprint
Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering. In this work, we find that the representations must not only jointly capture features from both modalities but should also be diverse for better generalization performance. To this end, we propose joint vision-language representation learning by diversifying the tokenization learning process, enabling tokens that are sufficiently disentangled from each other to be learned from both modalities. We observe that our approach outperforms the baseline models in a majority of settings and is competitive with state-of-the-art methods.
... Deep RL for control: Recent breakthroughs in Machine Learning (ML) using Deep Neural Networks (DNNs) have led to superhuman performance in supervised learning problems like image classification [3,4]. These advances have inspired the use of deep learning approaches for decision-making problems like playing perfect-information games (e.g., Go) and simple video games. ...
Preprint
Deep Reinforcement Learning (RL) has been demonstrated to yield capable agents and control policies in several domains but is commonly plagued by prohibitively long training times. Additionally, in the case of continuous control problems, the applicability of learned policies on real-world embedded devices is limited due to the lack of real-time guarantees and portability of existing deep learning libraries. To address these challenges, we present BackpropTools, a dependency-free, header-only, pure C++ library for deep supervised and reinforcement learning. Leveraging the template meta-programming capabilities of recent C++ standards, we provide composable components that can be tightly integrated by the compiler. Its novel architecture allows BackpropTools to be used seamlessly on a heterogeneous set of platforms, from HPC clusters over workstations and laptops to smartphones, smartwatches, and microcontrollers. Specifically, due to the tight integration of the RL algorithms with simulation environments, BackpropTools can solve popular RL problems like the Pendulum-v1 swing-up about 7 to 15 times faster in terms of wall-clock training time compared to other popular RL frameworks when using TD3. We also provide a low-overhead and parallelized interface to the MuJoCo simulator, showing that our PPO implementation achieves state of the art returns in the Ant-v4 environment while achieving a 25 to 30 percent faster wall-clock training time. Finally, we also benchmark the policy inference on a diverse set of microcontrollers and show that in most cases our optimized inference implementation is much faster than even the manufacturer's DSP libraries. To the best of our knowledge, BackpropTools enables the first-ever demonstration of training a deep RL algorithm directly on a microcontroller, giving rise to the field of Tiny Reinforcement Learning (TinyRL). Project page: https://backprop.tools
... The AttentionUnet++ was trained with three neighboring frames, with a ResNet50 [12] backbone transferred from ImageNet pretraining. The Adam optimizer [14] is used with a learning rate of 0.0001. ...
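The corresponding setup can be sketched in a few lines of PyTorch; the detector head and data pipeline are omitted, and only the pretrained backbone and optimizer settings come from the text.

    import torch
    import torchvision.models as models

    # ImageNet-pretrained ResNet50 backbone, trained with Adam at lr 1e-4.
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)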
Article
Full-text available
Cell tracking is currently a powerful tool in a variety of biomedical research topics. Most cell tracking algorithms follow the tracking-by-detection paradigm. Detection is critical for subsequent tracking. Unfortunately, very accurate detection is not easy due to many factors, such as densely populated scenes, low contrast, and possible impurities. Tracking multiple cells across frames suffers from many difficulties, as cells may have similar appearances, they may change their shapes, and nearby cells may interact with each other. In this paper, we propose a unified tracking-by-detection framework that includes a powerful detector (AttentionUnet++), a multimodal extension of the Efficient Convolution Operators algorithm, and an effective data association algorithm. Experiments show that the proposed algorithm can outperform many existing cell tracking algorithms.
... For each batch, each of the 16 frames is first fed separately to ResNet152 [18] to extract the 2D spatial features. ResNet152 is a residual CNN for image classification tasks, which converges faster than other CNN models [56]. ...
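A minimal sketch of this per-frame feature extraction; the frame size and the lack of preprocessing are illustrative assumptions.

    import torch
    import torchvision.models as models

    model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    model.fc = torch.nn.Identity()  # drop the classifier, keep 2048-d features
    model.eval()

    clip = torch.randn(16, 3, 224, 224)  # 16 frames of one video clip
    with torch.no_grad():
        features = model(clip)            # (16, 2048) per-frame spatial features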
Article
Full-text available
Automatic video captioning aims to generate captions with textual descriptions to express video content in natural language by a machine. This is a difficult task as the videos contain dynamic challenges. Most of the available approaches for video captioning are often focused on providing a single descriptive sentence. Encoder-decoder is the most popular architecture developed for video captioning. The proposed method in this research aims to learn the distribution of captions to generate more relevant and diverse captions and increase generalizability. A novel architecture was developed based on conditional SeqGAN to learn the distribution for video captioning and increase the generalizability. This architecture consists of two modules: encoding and caption generation. The goal of encoding is to obtain encoded rich spatial-temporal features. The encoding vector is fed as the input of conditional SeqGAN to generate captions. The main novelty of this paper lies in the use of an adversarial approach to learn the distribution of captions and generate diverse captions that fit the characteristics of the video. Experimental results from two popular datasets, MSVD and MSRVTT, showed that the proposed approach achieved more relevant video captions than other state-of-the-art methods.
... Generally speaking, the deeper the network (ResNet-101 [14]) and the wider the network (CBNet-v1 [28], v2 [24]), the more powerful the backbone network is at extracting features. But at the same time, the inference time increases. ...
Article
Full-text available
The unmonitored jellyfish boom inevitably destroys coastal biodiversity, as jellyfish are a type of plankton with extremely high fecundity. It even seriously endangers people's economic and social activities, such as clogging the water intake systems of hydropower plants and hindering coastal tourism development. In the past, underwater video monitoring tended to be time-consuming and costly. This paper proposes JF-YOLO: an automatic jellyfish blooms detection model based on deep learning. We collected many jellyfish videos in real environments to form a dataset for model training. JF-YOLO uses the improved YOLO-V4 detection model to ensure detection accuracy and speed. The experimental results show that the detection accuracy of the JF-YOLO network is better than that of the YOLO-V4 network, with the average detection accuracy increasing from 85.35% to 92.67% and the recall rate increasing from 72.32% to 85.74%. As a promising solution, JF-YOLO can effectively monitor the number or density of jellyfish and provide early warning when they appear abnormal, bringing convenience to ocean governance.
... Following [5,40], for Animals-10N and WebVision, we use VGG-19 [38] (not pretrained on ImageNet) and Inception-ResNetV2 [42] (not pretrained on ImageNet) as the backbone. Following [29,57], for Food101N and Clothing1M, we use ResNet-50 [18] (pretrained on ImageNet) as the backbone. The training epochs for Animal-10N are 120, and the epochs are 100 ... [29], ELR+ [32], AugDesc [35] and CC [59], using the same random seed as our method. ...
Conference Paper
Full-text available
Existing studies indicate that deep neural networks (DNNs) can eventually memorize the label noise. We observe that the memorization strength of DNNs towards each instance is different and can be represented by the confidence value, which becomes larger and larger during the training process. Based on this, we propose a Dynamic Instance-specific Selection and Correction method (DISC) for learning from noisy labels (LNL). We first use a two-view-based backbone for image classification, obtaining confidence for each image from two views. Then we propose a dynamic threshold strategy for each instance, based on the momentum of each instance's memorization strength in previous epochs to select and correct noisy labeled data. Benefiting from the dynamic threshold strategy and two-view learning, we can effectively group each instance into one of the three subsets (i.e., clean, hard, and purified) based on the prediction consistency and discrepancy by two views at each epoch. Finally, we employ different regularization strategies to conquer subsets with different degrees of label noise, improving the whole network's robustness. Comprehensive evaluations on three controllable and four real-world LNL benchmarks show that our method outperforms the state-of-the-art (SOTA) methods to leverage useful information in noisy data while alleviating the pollution of label noise. Code is available at https://github.com/JackYFL/DISC.
... Meanwhile, the softmax classifier plugged at the end of the CNN classifies the features into different facial expressions. Though popular CNNs (ResNet [9], VGG [10], AlexNet [11], etc.) have surpassed humans in large-scale image classification tasks, training these neural networks on small-scale FER datasets is highly challenging. Therefore, in the past few years, researchers have developed several neural networks and deep learning techniques based on transfer learning for FER on small-scale datasets. ...
Article
Full-text available
In medical imaging, deep learning models can be a critical tool to shorten time-to-diagnosis and support specialized medical staff in clinical decision making. The successful training of deep learning models usually requires large amounts of quality data, which are often not available in many medical imaging tasks. In this work we train a deep learning model on university hospital chest X-ray data, containing 1082 images. The data was reviewed, differentiated into 4 causes for pneumonia, and annotated by an expert radiologist. To successfully train a model on this small amount of complex image data, we propose a special knowledge distillation process, which we call Human Knowledge Distillation. This process enables deep learning models to utilize annotated regions in the images during the training process. This form of guidance by a human expert improves model convergence and performance. We evaluate the proposed process on our study data for multiple types of models, all of which show improved results. The best model of this study, called PneuKnowNet, shows an improvement of +2.3 percentage points in overall accuracy compared to a baseline model and also leads to more meaningful decision regions. Utilizing this implicit data quality-quantity trade-off can be a promising approach for many scarce data domains beyond medical imaging.
Preprint
Full-text available
Deep neural networks (DNNs) are known to have a fundamental sensitivity to adversarial attacks, perturbations of the input that are imperceptible to humans yet powerful enough to change the visual decision of a model. Adversarial attacks have long been considered the "Achilles' heel" of deep learning, which may eventually force a shift in modeling paradigms. Nevertheless, the formidable capabilities of modern large-scale DNNs have somewhat eclipsed these early concerns. Do adversarial attacks continue to pose a threat to DNNs? Here, we investigate how the robustness of DNNs to adversarial attacks has evolved as their accuracy on ImageNet has continued to improve. We measure adversarial robustness in two different ways: First, we measure the smallest adversarial attack needed to cause a model to change its object categorization decision. Second, we measure how aligned successful attacks are with the features that humans find diagnostic for object recognition. We find that adversarial attacks are inducing bigger and more easily detectable changes to image pixels as DNNs grow better on ImageNet, but these attacks are also becoming less aligned with features that humans find diagnostic for recognition. To better understand the source of this trade-off, we turn to the neural harmonizer, a DNN training routine that encourages models to leverage the same features as humans to solve tasks. Harmonized DNNs achieve the best of both worlds and experience attacks that are detectable and affect features that humans find diagnostic for recognition, meaning that attacks on these models are more likely to be rendered ineffective by inducing similar effects on human perception. Our findings suggest that the sensitivity of DNNs to adversarial attacks can be mitigated by DNN scale, data scale, and training routines that align models with biological intelligence.
Preprint
The existing face image super-resolution (FSR) algorithms usually train a specific model for a specific low input resolution for optimal results. By contrast, we explore in this work a unified framework that is trained once and then used to super-resolve input face images of varied low resolutions. For that purpose, we propose a novel neural network architecture that is composed of three anchor auto-encoders, one feature weight regressor and a final image decoder. The three anchor auto-encoders are meant for optimal FSR for three pre-defined low input resolutions, or named anchor resolutions, respectively. An input face image of an arbitrary low resolution is firstly up-scaled to the target resolution by bi-cubic interpolation and then fed to the three auto-encoders in parallel. The three encoded anchor features are then fused with weights determined by the feature weight regressor. At last, the fused feature is sent to the final image decoder to derive the super-resolution result. As shown by experiments, the proposed algorithm achieves robust and state-of-the-art performance over a wide range of low input resolutions by a single framework. Code and models will be made available after the publication of this work.
Article
Full-text available
While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher network.
Article
Full-text available
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Article
Full-text available
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
Full-text available
Preface to the Second Edition Twelve years have passed since the publication of the first edition of A Multigrid Tutorial. During those years, the field of multigrid and multilevel methods has expanded at a tremendous rate, reflecting progress in the development and analysis of algorithms and in the evolution of computing environments. Because of these changes, the first edition of the book has become increasingly outdated and the need for a new edition has become quite apparent. With the overwhelming growth in the subject, an area in which I have never done serious research, I felt remarkably unqualified to attempt a new edition. Realizing that I needed some help, I recruited two experts to assist with the project. Steve McCormick (Department of Applied Mathematics, University of Colorado at Boulder) is one of the original researchers in the field of multigrid methods and the real instigator of the first edition. There could be no better collaborator on the subject. Van Emden Henson (Center for Applied Scientific Computing, Lawrence Livermore National Laboratory) has specialized in applications of multigrid methods, with a particular emphasis on algebraic multigrid methods. Our collaboration on a previous SIAM monograph made him an obvious choice as a co-author. With the team in place, we began deliberating on the content of the new edition. It was agreed that the first edition should remain largely intact with little more than some necessary updating. Our aim was to add a roughly equal amount of new material that reflects important core developments in the field. A topic that probably should have been in the first edition comprises Chapter 6: FAS (Full Approximation Scheme), which is used for nonlinear problems. Chapter 7 is a collection of methods for four special situations that arise frequently in solving boundary value problems: Neumann boundary conditions, anisotropic problems, variable-mesh problems, and variable-coefficient problems. One of the chief motivations for writing a second edition was the recent surge of interest in algebraic multigrid methods, which is the subject of Chapter 8. In Chapter 9, we attempt to explain the complex subject of adaptive grid methods, as it appears in the FAC (Fast Adaptive Composite) Grid Method. Finally, in Chapter 10, we depart from the predominantly finite difference approach of the book and show how finite element formulations arise. This chapter provides a natural closing because it ties a knot in the thread of variational principles that runs through much of the book. There is no question that the new material in the second half of this edition is more advanced than that presented in the first edition. However, we have tried to create a safe passage between the two halves, to present many motivating examples, and to maintain a tutorial spirit in much of the discourse. While the first half of the book remains highly sequential, the order of topics in the second half is largely arbitrary. The FAC examples in Chapter 9 were developed by Bobby Philip and Dan Quinlan, of the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory, using AMR++ within the Overture framework. Overture is a parallel object-oriented framework for the solution of PDEs in complex and moving geometries. More information on Overture can be found at http://www.llnl.gov/casc/Overture. We thank Irad Yavneh for a thorough reading of the book, for his technical insight, and for his suggestion that we enlarge Chapter 4.
We are also grateful to John Ruge who gave Chapter 8 a careful reading in light of his considerable knowledge of AMG. Their suggestions led to many improvements in the book. Deborah Poulson, Lisa Briggeman, Donna Witzleben, Mary Rose Muccie, Kelly Thomas, Lois Sellers, and Vickie Kearn of the editorial staff at SIAM deserve thanks for coaxing us to write a second edition and for supporting the project from beginning to end. Finally, I am grateful for the willingness of my co-authors to collaborate on this book. They should be credited with improvements in the book and held responsible for none of its shortcomings. Bill Briggs November 15, 1999 Boulder, Colorado
Article
Full-text available
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near state-of-the-art results for the detection and classification tasks. Finally, we release a feature extractor from our best model called OverFeat.
Article
Full-text available
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed remains finite: for a special class of initial conditions on the weights, very deep networks incur only a finite delay in learning speed relative to shallow networks. We further show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, thereby providing analytical insight into the success of unsupervised pretraining in deep supervised learning tasks.
Conference Paper
Full-text available
Recently, we proposed to transform the outputs of each hidden neuron in a multi-layer perceptron network to have zero output and zero slope on average, and use separate shortcut connections to model the linear dependencies instead. We continue the work by firstly introducing a third transformation to normalize the scale of the outputs of each hidden neuron, and secondly by analyzing the connections to second order optimization methods. We show that the transformations make a simple stochastic gradient behave closer to second-order optimization methods and thus speed up learning. This is shown both in theory and with experiments. The experiments on the third transformation show that while it further increases the speed of learning, it can also hurt performance by converging to a worse local optimum, where both the inputs and outputs of many hidden neurons are close to zero.
Article
Full-text available
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
Article
Full-text available
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
Article
Full-text available
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
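A minimal numpy sketch of both contributions, as an illustration rather than the authors' implementation: the PReLU activation with a learnable negative slope, and the rectifier-aware ("He") initialization with standard deviation sqrt(2/fan_in).

```python
import numpy as np

def prelu(x, a):
    # PReLU: identity for positive inputs, learned slope a for negative ones
    # (a fixed small a recovers Leaky ReLU; a = 0 recovers plain ReLU).
    return np.where(x > 0, x, a * x)

def he_normal(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Initialization derived for rectifiers: std = sqrt(2 / fan_in)
    # compensates for ReLU zeroing roughly half of the pre-activations.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

x = np.random.randn(4, 256)
h = prelu(x @ he_normal(256, 128), a=0.25)
```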
Conference Paper
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth-independent delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pretraining, enjoys depth-independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
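The random orthogonal initialization the authors advocate is straightforward to construct; a minimal numpy sketch, using the common QR-based recipe:

```python
import numpy as np

def orthogonal_init(n, rng=np.random.default_rng(0)):
    # QR-decompose a random Gaussian matrix and sign-correct the columns
    # so the result is a uniformly random orthogonal matrix.
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))

W = orthogonal_init(128)
print(np.allclose(W @ W.T, np.eye(128)))   # True: norms are exactly preserved
```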
Conference Paper
While depth tends to improve network performance, it also makes gradient-based training more difficult, since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network can imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student's intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher network.
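To make the hint mechanism concrete, here is a minimal PyTorch sketch under assumed shapes: a 1x1 convolutional regressor maps the thinner student hint layer into the teacher's space, and an L2 loss pulls the two representations together. The tensors here are random stand-ins for real network activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shapes: the student's hint layer is thinner (64 channels)
# than the teacher's guided layer (128 channels).
student_hint = torch.randn(32, 64, 8, 8, requires_grad=True)
teacher_hint = torch.randn(32, 128, 8, 8)

# The extra parameters from the abstract: a regressor mapping the
# student's hidden layer into the teacher's hidden-layer space.
regressor = nn.Conv2d(64, 128, kernel_size=1)

hint_loss = F.mse_loss(regressor(student_hint), teacher_hint)
hint_loss.backward()   # gradients flow into the student hint and the regressor
```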
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
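A minimal PyTorch sketch of the core idea, with an illustrative stand-in backbone: the classifier becomes a 1x1 convolution producing a coarse score map, and a learned transposed convolution upsamples it back to input resolution. The 21 classes assume PASCAL VOC plus background; all layer sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

num_classes = 21
backbone = nn.Sequential(                     # stand-in for a conv feature extractor
    nn.Conv2d(3, 64, 3, stride=8, padding=1), nn.ReLU())
score = nn.Conv2d(64, num_classes, kernel_size=1)   # "fc layer" as 1x1 conv
upsample = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=16,
                              stride=8, padding=4)  # learned 8x upsampling

x = torch.randn(1, 3, 224, 224)
out = upsample(score(backbone(x)))
print(out.shape)   # (1, 21, 224, 224): dense per-pixel class scores
```

Because every layer is convolutional, the same network accepts inputs of any size and produces correspondingly sized score maps.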
Chapter
It has long been known that neural networks can learn faster when their input and hidden unit activities are centered about zero; recently we have extended this approach to also encompass the centering of error signals [15]. Here we generalize this notion to all factors involved in the network's gradient, leading us to propose centering the slope of hidden unit activation functions as well. Slope centering removes the linear component of backpropagated error; this improves credit assignment in networks with shortcut connections. Benchmark results show that this can speed up learning significantly without adversely affecting the trained network's generalization ability.
Article
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.
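A minimal PyTorch sketch of one highway layer, assuming a plain linear transform H and a sigmoid transform gate T; the negative gate-bias initialization follows the paper's suggestion to start layers close to carry (identity) behavior.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    # y = T(x) * H(x) + (1 - T(x)) * x: the transform gate T adaptively
    # mixes a nonlinear transform of the input with the unchanged input.
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)
        self.T.bias.data.fill_(-2.0)   # bias the gate toward carrying x through

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * torch.relu(self.H(x)) + (1.0 - t) * x

deep = nn.Sequential(*[HighwayLayer(32) for _ in range(50)])
print(deep(torch.randn(8, 32)).shape)   # gradients flow via the carry path
```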
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
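A minimal PyTorch sketch of an RPN-style head under assumed sizes: a shared 3x3 convolution over the backbone features feeds two sibling 1x1 convolutions that emit per-anchor objectness scores and box offsets (k = 9 anchors per position is illustrative).

```python
import torch
import torch.nn as nn

k, C = 9, 256
shared = nn.Conv2d(C, 256, kernel_size=3, padding=1)
cls_head = nn.Conv2d(256, k * 2, kernel_size=1)   # object / not-object per anchor
reg_head = nn.Conv2d(256, k * 4, kernel_size=1)   # (dx, dy, dw, dh) per anchor

feat = torch.randn(1, C, 38, 50)                  # backbone feature map
h = torch.relu(shared(feat))
print(cls_head(h).shape, reg_head(h).shape)       # (1, 18, 38, 50) (1, 36, 38, 50)
```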
Article
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
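A minimal numpy sketch of training-mode batch normalization for fully connected activations; a real layer would also maintain running mean/variance estimates for use at inference.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Standardize each feature over the mini-batch, then restore
    # expressiveness with a learned scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(128, 64) * 5 + 3          # badly scaled activations
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.mean(), y.std())                      # ~0 and ~1
```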
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
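A minimal PyTorch sketch of the pooling strategy, approximating the pyramid levels with adaptive max pooling (an assumption, not necessarily the authors' exact binning): whatever the input's spatial size, the output vector length is fixed.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
    # Max-pool the feature map onto an l-by-l grid for each pyramid level
    # and concatenate, yielding a fixed-length vector for any input size.
    n = feat.shape[0]
    pooled = [F.adaptive_max_pool2d(feat, l).reshape(n, -1) for l in levels]
    return torch.cat(pooled, dim=1)           # length = channels * (1 + 4 + 16)

for size in [(13, 13), (20, 31)]:             # arbitrary input sizes
    feat = torch.randn(2, 256, *size)
    print(spatial_pyramid_pool(feat).shape)   # always (2, 5376)
```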
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Book
Ripley brings together two crucial ideas in pattern recognition: statistical methods and machine learning via neural networks. He brings unifying principles to the fore, and reviews the state of the subject. Ripley also includes many examples to illustrate real problems in pattern recognition and how to overcome them.
Conference Paper
Large Convolutional Neural Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al.). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Conference Paper
Within the field of pattern classification, the Fisher kernel is a powerful framework which combines the strengths of generative and discriminative approaches. The idea is to characterize a signal with a gradient vector derived from a generative probability model and to subsequently feed this representation to a discriminative classifier. We propose to apply this framework to image categorization where the input signals are images and where the underlying generative model is a visual vocabulary: a Gaussian mixture model which approximates the distribution of low-level features in images. We show that Fisher kernels can actually be understood as an extension of the popular bag-of-visterms. Our approach demonstrates excellent performance on two challenging databases: an in-house database of 19 object/scene categories and the recently released VOC 2006 database. It is also very practical: it has low computational needs both at training and test time and vocabularies trained on one set of categories can be applied to another set without any significant loss in performance.
Conference Paper
VLFeat is an open and portable library of computer vision algorithms. It aims at facilitating fast prototyping and reproducible research for computer vision scientists and students. It includes rigorous implementations of common building blocks such as feature detectors, feature extractors, (hierarchical) k-means clustering, randomized kd-tree matching, and super-pixelization. The source code and interfaces are fully documented. The library integrates directly with MATLAB, a popular language for computer vision research.
Conference Paper
Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
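A minimal numpy sketch of the noisy rectified linear unit as we read the abstract: Gaussian noise with variance sigmoid(x) is added before rectification when sampling hidden states (a deterministic max(0, x) would be used when no sampling is needed). Details are hedged from the abstract, not taken from the paper's code.

```python
import numpy as np

def noisy_relu(x, rng=np.random.default_rng(0)):
    # Approximate a stack of tied binary units: add Gaussian noise whose
    # variance is sigmoid(x), then rectify the result.
    sigma2 = 1.0 / (1.0 + np.exp(-x))          # sigmoid(x) as noise variance
    return np.maximum(0.0, x + np.sqrt(sigma2) * rng.normal(size=x.shape))
```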
Article
This paper develops locally adapted hierarchical basis functions for effectively preconditioning large optimization problems that arise in computer graphics applications such as tone mapping, gradient-domain blending, colorization, and scattered data interpolation. By looking at the local structure of the coefficient matrix and performing a recursive set of variable eliminations, combined with a simplification of the resulting coarse-level problems, we obtain bases better suited for problems with inhomogeneous (spatially varying) data, smoothness, and boundary constraints. Our approach removes the need to heuristically adjust the optimal number of preconditioning levels, significantly outperforms previously proposed approaches, and also maps cleanly onto data-parallel architectures such as modern GPUs.
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
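In the same spirit, a minimal PyTorch sketch of a small constrained network: shared-weight local receptive fields (convolutions) encode the prior that digit features are useful anywhere in the image. Layer sizes here are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2), nn.Tanh(),
    nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2), nn.Tanh(),
    nn.Flatten(),
    nn.Linear(12 * 7 * 7, 30), nn.Tanh(),
    nn.Linear(30, 10),                          # 10 digit classes
)
print(net(torch.randn(1, 1, 28, 28)).shape)     # (1, 10) class scores
```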