Conference Paper

Instance-Level Meta Normalization

... However, normalization [12][13][14][15][16][17] is an effective method that is easily overlooked. Normalization aims to standardize data to a unified and stable distribution regardless of the originating domain. ...
... Since the invention of batch normalization (BN) [13], various normalization techniques [12][13][14][15][16][17][48][49][50][51][52][53][54][55][56][57][58][59][60] have been proposed for different applications. Batch-instance normalization (BIN) [54] and switchable normalization (SN) [17,55] were proposed to generalize the fixed normalizations by combining their standardization statistics with learnable weights. ...
... However, their restrictive forms still do not allow enough adaptivity. More recently, instance-level meta (ILM) normalization [15] focused on improving rescaling performance by using neural networks to learn the rescaling statistics. Before our work, the majority of studies on normalization did not examine generalization ability under domain shift. ...
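To make the ILM idea in the excerpt above concrete, here is a minimal PyTorch sketch of instance-conditioned rescaling: per-instance channel statistics feed a small bottleneck network that predicts an adjustment to the usual learnable scale and shift. The module name, bottleneck width, and the additive way of combining predicted and learnable parameters are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MetaRescale(nn.Module):
    """Sketch: predict per-instance scale/shift from channel statistics,
    loosely following the instance-level meta (ILM) normalization idea of
    conditioning rescaling parameters on the input feature maps."""
    def __init__(self, channels: int, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Linear(channels, hidden)   # bottleneck encoder
        self.decoder = nn.Linear(hidden, channels)   # bottleneck decoder
        # Standard learnable affine parameters are kept as well.
        self.gamma = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); instance-level statistics per sample and channel.
        mu = x.mean(dim=(2, 3))                      # (N, C)
        var = x.var(dim=(2, 3), unbiased=False)      # (N, C)
        x_hat = (x - mu[..., None, None]) / torch.sqrt(var[..., None, None] + 1e-5)
        # Predict an instance-conditioned rescaling delta from the statistics.
        g = self.decoder(torch.relu(self.encoder(mu)))           # (N, C)
        return x_hat * (self.gamma + g)[..., None, None] + self.beta[None, :, None, None]

x = torch.randn(4, 32, 8, 8)
print(MetaRescale(32)(x).shape)   # torch.Size([4, 32, 8, 8])
```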
Article
Full-text available
Retinal vessel segmentation is an important computer vision task for eye retinopathy diagnosis. In real scenarios, most datasets of the source domain and target domain have distribution deviation, and the model often fails to generate accurate segmentation results due to the lack of data variation in the single‐source domain, which damages the generalization ability to unseen target domains and may mislead doctors or artificial intelligence models in subsequent disease diagnosis. Feature normalization is one feasible solution which can standardize data into a uniform and stable distribution without additional data. However, existing methods like batch normalization standardize the data with global parameters. This leads to insufficient representation of important semantic information in local regions. To address this problem, the authors propose the spectral‐spatial normalization (SS‐Norm) module to enhance the generalization ability of the model. More specifically, the authors perform a discrete cosine transform (DCT) to decompose the feature into multiple frequency components and to analyze the semantic contribution degree of each component. By learning a spectral vector, the authors reweight the frequency components of features and therefore normalize the distribution in the spectral domain. Extensive experiments on six datasets prove the effectiveness of the authors' methods.
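The spectral reweighting step that SS-Norm describes can be sketched with an off-the-shelf DCT. In this sketch the spectral vector is supplied by hand rather than learned, and all shapes are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.fft import dctn, idctn

def spectral_reweight(feat: np.ndarray, spectral_vec: np.ndarray) -> np.ndarray:
    """Sketch of the SS-Norm idea: move a (H, W) feature map into the DCT
    domain, reweight its frequency components, and transform back."""
    coeffs = dctn(feat, norm="ortho")      # (H, W) frequency components
    coeffs = coeffs * spectral_vec         # reweight each component
    return idctn(coeffs, norm="ortho")     # back to the spatial domain

feat = np.random.randn(8, 8)
weights = np.ones((8, 8))
weights[4:, 4:] = 0.5                      # damp high-frequency components
print(spectral_reweight(feat, weights).shape)   # (8, 8)
```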
... Typically, the default implementation of BN initializes the affine transformation scale (γ) and shift (β) parameters to 1 and 0, respectively, with the expectation that they can adapt to optimal values during training and can undo the preceding normalization if desired [11]. It is common to take this default initialization "as is", with extensions of BN typically adding additional parameters and computations [12,18,19,7] that increase the model complexity. ...
... Related work has extended the affine transformation component after the normalization layer. Rather than using only learnable scale and shift parameters (as in standard BN), the approach of [12] predicts scale and shift values using a small autoencoder network and combines them with the standard learnable versions. Similarly, [18] combines predicted and learned scale and shift parameters by utilizing a mixture of affine transformations and a feature attention mechanism to obtain the final transformation. ...
... The approaches of [12,18,19,7] all initialize the existing BN affine transformation parameters to the standard values of γ = 1 and β = 0, but employ different initializations for their additional parameters. In [12,18], their additional parameters are initialized using samples from a normal distribution, thus their initial scale values are not constant across the network and could have a magnitude >1. ...
Preprint
Full-text available
Batch normalization (BN) is comprised of a normalization component followed by an affine transformation and has become essential for training deep neural networks. Standard initialization of each BN in a network sets the affine transformation scale and shift to 1 and 0, respectively. However, after training we have observed that these parameters do not alter much from their initialization. Furthermore, we have noticed that the normalization process can still yield overly large values, which is undesirable for training. We revisit the BN formulation and present a new initialization method and update approach for BN to address the aforementioned issues. Experimental results using the proposed alterations to BN show statistically significant performance gains in a variety of scenarios. The approach can be used with existing implementations at no additional computational cost. We also present a new online BN-based input data normalization technique to alleviate the need for other offline or fixed methods. Source code is available at https://github.com/osu-cvl/revisiting-bn.
... Normalization in Neural Networks. Since the invention of BN [22], various normalization techniques [1,23,29,32,37,44,45,50,[54][55][56] have been proposed for different applications. Batch-instance normalization (BIN) [39] and SN [32,46] were proposed to generalize the fixed normalizations by combining their standardization statistics with the learnable weights. ...
... However, their restrictive forms still do not allow enough adaptivity. More recently, instance-level meta (ILM) normalization [23] focused on improving the rescaling performance by using neural networks to learn the rescaling statistics. Before our work, the majority of works on normalization did not study the generalization ability under domain shift. ...
... We make use of an encoder-decoder structure [19,23] to learn µ_stan and σ_stan from µ and σ respectively, where the encoder extracts global information by letting the information of all channels interact and the decoder learns to decompose the information for each channel. For the sake of efficiency, both the encoder and the decoder consist of one fully-connected layer, forming a bottleneck connected by a non-linear activation function, ReLU. ...
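A minimal sketch of the bottleneck described in this excerpt, under the assumption that a single shared encoder-decoder processes both statistics (the paper may use separate networks for µ and σ), and with an illustrative reduction ratio:

```python
import torch
import torch.nn as nn

class StatBottleneck(nn.Module):
    """One fully-connected encoder and one fully-connected decoder joined
    by a ReLU, mapping observed statistics to learned standardization
    statistics, as described in the excerpt."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.enc = nn.Linear(channels, hidden)   # mixes information across channels
        self.dec = nn.Linear(hidden, channels)   # decomposes it back per channel

    def forward(self, stat: torch.Tensor) -> torch.Tensor:
        return self.dec(torch.relu(self.enc(stat)))

mu = torch.randn(4, 64)      # per-instance channel means
sigma = torch.rand(4, 64)    # per-instance channel standard deviations
net = StatBottleneck(64)
mu_stan, sigma_stan = net(mu), net(sigma)
print(mu_stan.shape, sigma_stan.shape)   # torch.Size([4, 64]) twice
```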
Preprint
Single domain generalization aims to learn a model that performs well on many unseen domains with only one domain's data for training. Existing works focus on studying adversarial domain augmentation (ADA) to improve the model's generalization capability. The impact of the statistics of normalization layers on domain generalization is still under-investigated. In this paper, we propose a generic normalization approach, adaptive standardization and rescaling normalization (ASR-Norm), to complement the missing part in previous works. ASR-Norm learns both the standardization and rescaling statistics via neural networks. This new form of normalization can be viewed as a generic form of the traditional normalizations. When trained with ADA, the statistics in ASR-Norm are learned to be adaptive to the data coming from different domains, and hence improve the model's generalization performance across domains, especially on target domains with a large discrepancy from the source domain. The experimental results show that ASR-Norm brings consistent improvement to state-of-the-art ADA approaches, by 1.6%, 2.7%, and 6.3% on average on the Digits, CIFAR-10-C, and PACS benchmarks, respectively. As a generic tool, the improvement introduced by ASR-Norm is agnostic to the choice of ADA method.
... Meta learning has been used in normalization to fit instance-level normalization and address a learning-to-normalize problem. Jia et al. [36] proposed instance-level meta (ILM) normalization to associate the rescaling parameters with the input feature maps rather than deciding the parameters merely from back-propagation. Fan et al. [37] introduced the learning of standardization statistics to complement the missing part in [36], which can be viewed as a generic form of the traditional normalizations. Park et al. [38] proposed Meta-Variance Transfer (MVT), which learns to transfer factors of variation from one class to another and improves the overall classification performance. ...
Preprint
In light of the smoothness property brought by skip connections in ResNet, this paper proposes Skip Logit to introduce a skip connection mechanism that fits arbitrary DNN dimensions and embraces properties similar to ResNet. Meta Tanh Normalization (MTN) is designed to learn variance information and stabilize the training process. With these delicate designs, our Skip Meta Logit (SML) brings incremental boosts to the performance of extensive state-of-the-art CTR prediction models on two real-world datasets. In the meantime, we prove that the optimization landscape of arbitrarily deep skip logit networks has no spurious local optima. Finally, SML can be easily added to building blocks and has delivered offline accuracy and online business metric gains on app ads learning-to-rank systems at TikTok.
... Similar ideas are also used in adaptive instance normalization (AdaIN) [69] and adaptive layer-instance normalization (AdaLIN) [102] for unsupervised image-to-image translation, where the subnetworks are MLPs and the inputs of the subnetworks are the embedding features produced by one encoder. Jia et al. [103] proposed instance-level meta normalization (ILMN), which utilizes an encoder-decoder subnetwork to generate affine parameters, given the instance-level mean and variance as input. Besides, ILMN also combines the learnable affine parameters shown in Eqn. ...
Preprint
Normalization techniques are essential for accelerating the training and improving the generalization of deep neural networks (DNNs), and have successfully been used in various applications. This paper reviews and comments on the past, present and future of normalization methods in the context of DNN training. We provide a unified picture of the main motivation behind different approaches from the perspective of optimization, and present a taxonomy for understanding the similarities and differences between them. Specifically, we decompose the pipeline of the most representative normalizing activation methods into three components: the normalization area partitioning, normalization operation and normalization representation recovery. In doing so, we provide insight for designing new normalization technique. Finally, we discuss the current progress in understanding normalization methods, and provide a comprehensive review of the applications of normalization for particular tasks, in which it can effectively solve the key issues.
... Stochastic Normalization [28] provides regularization effects and performs better on vision tasks. [23] provides a meta-learning mechanism for instance-level normalization techniques. [26] generates the rescaling parameters from different speakers and environments for adaptive neural acoustic modeling via layer normalization. The proposed KL-Norm uses KL-regularized inference-based affine parameters to show the expressive power on low-resource downstream NLP tasks. ...
Preprint
Full-text available
Large pre-trained models, such as Bert, GPT, and Wav2Vec, have demonstrated great potential for learning representations that are transferable to a wide variety of downstream tasks. It is difficult to obtain a large quantity of supervised data due to the limited availability of resources and time. In light of this, a significant amount of research has been conducted in the area of adapting large pre-trained models for diverse downstream tasks via fine-tuning, linear probing, or prompt tuning in low-resource settings. Normalization techniques are essential for accelerating training and improving the generalization of deep neural networks and have been successfully used in a wide variety of applications. Many normalization techniques have been proposed, but the success of normalization in low-resource downstream NLP and speech tasks is limited. One of the reasons is the inability to capture expressiveness through the rescaling parameters of normalization. We propose Kullback-Leibler (KL) Regularized normalization (KL-Norm), which makes the normalized data well behaved and helps in better generalization: it reduces over-fitting, generalizes well to out-of-domain distributions, and removes irrelevant biases and features with a negligible increase in model parameters and memory overhead. Detailed experimental evaluation on multiple low-resource NLP and speech tasks demonstrates the superior performance of KL-Norm compared to other popular normalization and regularization techniques.
... Existing normalization-based methods [11,19] regularize the distribution by normalizing it to the standard normal distribution and then modulating it to another distribution with learned statistics. After such a $\gamma_1 \frac{x - \mu}{\sigma} + \beta_1$ operation, the distribution changes from the original $N(\mu, \sigma^2)$ to the learned $N(\beta_1, \gamma_1^2)$, where $N(\mu, \sigma^2)$ denotes a normal distribution with mean $\mu$ and standard deviation $\sigma$. ...
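A quick numerical check of the formula reconstructed above: standardizing the data and applying a learned scale and shift moves the distribution from $N(\mu, \sigma^2)$ to $N(\beta_1, \gamma_1^2)$. The values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)    # x ~ N(mu=3, sigma=2)

gamma1, beta1 = 0.5, -1.0
y = gamma1 * (x - x.mean()) / x.std() + beta1       # standardize, then rescale

print(round(y.mean(), 3), round(y.std(), 3))        # ~ -1.0 and ~0.5: y ~ N(beta1, gamma1^2)
```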
Preprint
Full-text available
Deep learning-based source dehazing methods trained on synthetic datasets have achieved remarkable performance but suffer from dramatic performance degradation on real hazy images due to domain shift. Although certain Domain Adaptation (DA) dehazing methods have been presented, they inevitably require access to the source dataset to reduce the gap between the source synthetic and target real domains. To address these issues, we present a novel Source-Free Unsupervised Domain Adaptation (SFUDA) image dehazing paradigm, in which only a well-trained source model and an unlabeled target real hazy dataset are available. Specifically, we devise the Domain Representation Normalization (DRN) module to make the representation of real hazy domain features match that of the synthetic domain to bridge the gaps. With our plug-and-play DRN module, unlabeled real hazy images can adapt existing well-trained source networks. Besides, the unsupervised losses are applied to guide the learning of the DRN module, which consists of frequency losses and physical prior losses. Frequency losses provide structure and style constraints, while the prior loss explores the inherent statistic property of haze-free images. Equipped with our DRN module and unsupervised loss, existing source dehazing models are able to dehaze unlabeled real hazy images. Extensive experiments on multiple baselines demonstrate the validity and superiority of our method visually and quantitatively.
... The ground-truth segmentation maps are denoted by y b . Instead of batch normalization, we adopt the instance normalization approach [25,26] for the meta-learning setting. Algorithm 1 shows the base version of the algorithm for our approach. ...
Preprint
Full-text available
The lack of sufficient annotated image data is a common issue in medical image segmentation. For some organs and densities, the annotation may be scarce, leading to poor model training convergence, while other organs have plenty of annotated data. In this work, we present MetaMedSeg, a gradient-based meta-learning algorithm that redefines the meta-learning task for volumetric medical data with the goal of capturing the variety between the slices. We also explore different weighting schemes for gradient aggregation, arguing that different tasks might have different complexity, and hence, contribute differently to the initialization. We propose an importance-aware weighting scheme to train our model. In the experiments, we present an evaluation of the medical decathlon dataset by extracting 2D slices from CT and MRI volumes of different organs and performing semantic segmentation. The results show that our proposed volumetric task definition leads to up to 30% improvement in terms of IoU compared to related baselines. The proposed update rule is also shown to improve the performance for complex scenarios where the data distribution of the target organ is very different from that of the source organs.
... In statistics, commonly used scaling methods include rescaling, normalization, and standardization. Rescaling is a linear transformation; normalization usually refers to scaling based on minimum/maximum values producing numbers ranging from zero to unity, and standardization regularizes datasets using their means and standard deviations (Jia et al., 2019). Standardization is used in this study to convert the predicted trend to actual permeability values. ...
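The distinction drawn in this excerpt fits in a few lines of NumPy; the data below is arbitrary.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max normalization: maps the data onto [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit standard deviation.
standardized = (x - x.mean()) / x.std()

print(normalized)    # [0.   0.25 0.5  1.  ]
print(standardized)  # mean ~0, std ~1
```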
Article
Permeability is arguably the most important parameter in gas pre-drainage and production practices in coal mining and coal seam gas industries. In fractured coal seams, permeability can significantly vary both laterally and vertically. Knowing the vertical variation of permeability in relatively thick coal seams is important in optimizing inseam pre-drainage design in underground coal mining and improving reservoir management and completion design efficiency in the coal seam gas industry. Existing empirical/statistical coal permeability models often yield poor results in estimating permeability profiles where there is significant variation in the characteristics of coal fracture systems. This study presents a novel workflow for evaluating the permeability profile of naturally fractured thick coal seams using downhole geophysical logging data. The workflow combines physics-based simulation, laboratory experiments, and a data-driven machine learning approach for estimating the permeability profile. As part of this workflow, several coal specimens from the study coal seam are first tested under different stresses to measure their permeability, density, and ultrasonic responses. Numerical simulation of the Navier-Stokes fluid flow and elastic wave propagation was then performed on a range of constructed coal block geometries with various fracture densities. In the next step, a physics-informed, neural network-based model (PINN) was trained using the data obtained from both laboratory experiments and numerical simulation. The model inputs included density, as well as compressional and shear wave velocities of the coal seam and its output is the trend of permeability variation across the coal seam interval. In-situ permeability measurement at a certain depth (e.g., from well pressure testing) was then used to convert the permeability trend to the entire permeability profile of the coal seam along its interval. The performance of the proposed workflow was finally evaluated using a suite of downhole geophysical logging data from a borehole intersecting two coal seams in an eastern Australian basin. This study demonstrates that the proposed workflow is relatively simple to implement and yet accurate for evaluating the permeability profile of a coal seam across its interval.
... Sparse SN used SparseMax to make the trainable parameters in SN sparse. In Instance-Level Meta Normalization (Jia, Chen, and Chen 2019), feature feed-forward and gradient back-propagation were used to learn the normalization parameters. ...
Preprint
Full-text available
Deep Convolutional Neural Networks (DCNNs) are hard and time-consuming to train. Normalization is one of the effective solutions. Among previous normalization methods, Batch Normalization (BN) performs well at medium and large batch sizes and generalizes well to multiple vision tasks, while its performance degrades significantly at small batch sizes. In this paper, we find that BN also saturates at extremely large batch sizes, i.e., 128 images per worker (GPU), and propose that the degradation/saturation of BN at small/extremely large batch sizes is caused by noisy/confused statistic calculation. Hence, without adding new trainable parameters, using multiple-layer or multi-iteration information, or introducing extra computation, Batch Group Normalization (BGN) is proposed to solve the noisy/confused statistic calculation of BN at small/extremely large batch sizes by introducing the channel, height, and width dimensions to compensate. The group technique of Group Normalization (GN) is used, and a hyper-parameter G controls the number of feature instances used for statistic calculation, hence offering statistics that are neither noisy nor confused across batch sizes. We empirically demonstrate that BGN consistently outperforms BN, Instance Normalization (IN), Layer Normalization (LN), GN, and Positional Normalization (PN) across a wide spectrum of vision tasks, including image classification, Neural Architecture Search (NAS), adversarial learning, Few Shot Learning (FSL), and Unsupervised Domain Adaptation (UDA), indicating its good performance, robust stability with respect to batch size, and wide generalizability. For example, for training ResNet-50 on ImageNet with a batch size of 2, BN achieves a Top-1 accuracy of 66.512% while BGN achieves 76.096%, a notable improvement.
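A sketch of the BGN statistic described above, assuming a plain group split; the affine parameters and running statistics a full implementation would carry are omitted here.

```python
import torch

def batch_group_norm(x: torch.Tensor, num_groups: int, eps: float = 1e-5) -> torch.Tensor:
    """Sketch of Batch Group Normalization: as in GN, channels are split
    into groups, but the statistics are also pooled over the batch
    dimension, so small batches borrow feature instances from the
    channel/height/width dimensions."""
    n, c, h, w = x.shape
    x = x.view(n, num_groups, c // num_groups, h, w)
    mean = x.mean(dim=(0, 2, 3, 4), keepdim=True)   # pooled over batch and within group
    var = x.var(dim=(0, 2, 3, 4), unbiased=False, keepdim=True)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.view(n, c, h, w)

x = torch.randn(2, 16, 8, 8)    # tiny batch
print(batch_group_norm(x, num_groups=4).shape)   # torch.Size([2, 16, 8, 8])
```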
... Future research can be done in three ways. First, we plan to use meta-learning [22,24,42] or few-shot learning [35,52,54] to alleviate the problem that deep learning models are prone to overfitting and to improve the predictive performance of the proposed model. We hope that these recent, popular state-of-the-art technologies can help deep learning produce surprising results on small-scale medical imaging datasets. ...
Article
Full-text available
The accurate identification of KRAS mutation status on medical images is critical for doctors to specify treatment options for patients with rectal cancer. Deep learning methods have recently been successfully introduced to medical diagnosis and treatment problems, although substantial challenges remain in computer-aided diagnosis (CAD) due to the lack of large training datasets. In this paper, we propose a multi-branch cross attention model (MBCAM) to separate KRAS mutation cases from wild type cases using limited T2-weighted MRI data. Our model is built on multiple different branches generated based on our existing MRI data, which can take full advantage of the information contained in small datasets. The cross attention block (CA block) is proposed to fuse formerly independent branches so that the model can learn as many common features as possible, preventing overfitting due to the limited dataset. The inter-branch loss is proposed to constrain the learning range of the model, ensuring that the model can learn more general features from multi-branch data. We tested our method on the collected dataset and compared it to four previous works and five popular deep learning models using transfer learning. Our result shows that the MBCAM achieved an accuracy of 88.92% for the prediction of KRAS mutations with an AUC of 95.75%. These results are a significant improvement over those of existing methods (p < 0.05).
Article
Large pretrained models like Bert, GPT, and Wav2Vec have demonstrated their ability to learn transferable representations for various downstream tasks. However, obtaining a substantial amount of supervised data remains a challenge due to resource and time limitations. As a solution, researchers have turned their attention to using large pretrained datasets via techniques like fine-tuning, linear probing, or prompt tuning in low resource settings. Normalization techniques play a crucial role in speeding up training and improving the generalization of deep neural networks, with successful applications in style transfer, object detection, and recurrent neural networks. Despite their success in various domains, their effectiveness in low resource NLP and speech tasks has been limited. A notable reason for this limitation is the difficulty in capturing expressiveness using the affine parameters of normalization. To address this issue, we propose a novel approach called Kullback-Leibler (KL) Regularized normalization, or KL-Norm. The main objective of KL-Norm is to ensure that normalized data is well-behaved and to improve generalization by reducing overfitting, achieved by including a regularization loss function in the training process. It promotes good performance on out-of-domain distributions and effectively retains relevant features while eliminating superficial features or biases present in the dataset or pretrained model. Remarkably, KL-Norm accomplishes these objectives with a minimal increase in model parameters and memory overheads. Through extensive experimental analysis, we showcase the improved accuracy and performance of KL-Norm in comparison to other normalization techniques on low resource downstream NLP tasks. These tasks encompass a wide range of applications, including sentiment classification, semantic relationship characterization, semantic textual similarity, textual entailment, and paraphrase detection. Additionally, KL-Norm exhibits superior results in downstream speech tasks, specifically in keyword detection and emotion classification.
Article
Batch normalization (BN) is used by default in many modern deep neural networks due to its effectiveness in accelerating training convergence and boosting inference performance. Recent studies suggest that the effectiveness of BN is due to the Lipschitzness of the loss and gradient, rather than the reduction of internal covariate shift. However, questions remain about whether Lipschitzness is sufficient to explain the effectiveness of BN and whether there is room for vanilla BN to be further improved. To answer these questions, we first prove that when stochastic gradient descent (SGD) is applied to optimize a general non-convex problem, three effects will help convergence to be faster and better: (i) reduction of the gradient Lipschitz constant, (ii) reduction of the expectation of the square of the stochastic gradient, and (iii) reduction of the variance of the stochastic gradient. We demonstrate that vanilla BN only with ReLU can induce the three effects above, rather than Lipschitzness, but vanilla BN with other nonlinearities like Sigmoid, Tanh, and SELU will result in degraded convergence performance. To improve vanilla BN, we propose a new normalization approach, dubbed complete batch normalization (CBN), which changes the placement position of normalization and modifies the structure of vanilla BN based on the theory. It is proven that CBN can elicit all the three effects above, regardless of the nonlinear activation used. Extensive experiments on benchmark datasets CIFAR10, CIFAR100, and ILSVRC2012 validate that CBN makes the training convergence faster, and the training loss converges to a smaller local minimum than vanilla BN. Moreover, CBN helps networks with multiple nonlinear activations (Sigmoid, Tanh, ReLU, SELU, and Swish) achieve higher test accuracy steadily. Specifically, benefitting from CBN, the classification accuracies for networks with Sigmoid, Tanh, and SELU are boosted by more than 15.0%, 4.5%, and 4.0% on average, respectively, which is even comparable to the performance for ReLU.
Chapter
Large pretrained models, such as Bert, GPT, and Wav2Vec, have demonstrated great potential for learning representations that are transferable to a wide variety of downstream tasks. It is difficult to obtain a large quantity of supervised data due to the limited availability of resources and time. In light of this, a significant amount of research has been conducted in the area of adapting large pretrained models for diverse downstream tasks via fine tuning, linear probing, or prompt tuning in low resource settings. Normalization techniques are essential for accelerating training and improving the generalization of deep neural networks and have been successfully used in a wide variety of applications. Many normalization techniques have been proposed, but the success of normalization in low resource downstream NLP and speech tasks is limited. One of the reasons is the inability to capture expressiveness through the re-scaling parameters of normalization. We propose Kullback-Leibler (KL) Regularized normalization (KL-Norm), which makes the normalized data well behaved and helps in better generalization: it reduces over-fitting, generalizes well to out-of-domain distributions, and removes irrelevant biases and features with a negligible increase in model parameters and memory overheads. Detailed experimental evaluation on multiple low resource NLP and speech tasks demonstrates the superior performance of KL-Norm compared to other popular normalization and regularization techniques.
Article
Normalization techniques are essential for accelerating the training and improving the generalization of deep neural networks (DNNs), and have successfully been used in various applications. This paper reviews and comments on the past, present and future of normalization methods in the context of DNN training. We provide a unified picture of the main motivation behind different approaches from the perspective of optimization, and present a taxonomy for understanding the similarities and differences between them. Specifically, we decompose the pipeline of the most representative normalizing activation methods into three components: the normalization area partitioning, normalization operation and normalization representation recovery. In doing so, we provide insight for designing new normalization technique. Finally, we discuss the current progress in understanding normalization methods, and provide a comprehensive review of the applications of normalization for particular tasks, in which it can effectively solve the key issues.
Chapter
Domain generalization aims to train a robust model on multiple source domains that generalizes well to unseen target domains. While considerable attention has focused on training domain generalizable models, a few recent studies have shifted the attention to test time, i.e., leveraging test samples for better target generalization. To this end, this paper proposes a novel test-time domain generalization method, Domain Conditioned Normalization (DCN), to infer the normalization statistics of the target domain from only a single test sample. In order to learn to predict the normalization statistics, DCN adopts a meta-learning framework and simulates the inference process of the normalization statistics at training. Extensive experimental results have shown that DCN brings consistent improvements to many state-of-the-art domain generalization methods on three widely adopted benchmarks.
Article
Convolutional neural networks (CNNs) have demonstrated excellent capability in various visual recognition tasks but impose an excessive computational burden. The latter problem is commonly solved by utilizing lightweight sparse networks. However, such networks have a limited receptive field in a few layers, and the majority of these networks face a severe information barrage due to their sparse structures. Spurred by these deficiencies, this work proposes a Squeeze Convolution block with Information Flow Enhancement (SCIFE), comprising a Divide-and-Squeeze Convolution and an Information Flow Enhancement scheme. The former module constructs a multi-layer structure through multiple squeeze operations to increase the receptive field and reduce computation. The latter replaces the affine transformation with the point convolution and dynamically adjusts the activation function’s threshold, enhancing information flow in both channels and layers. Moreover, we reveal that the original affine transformation may harm the network’s generalization capability. To overcome this issue, we utilize a point convolution with a zero initial mean. SCIFE can serve as a plug-and-play replacement for vanilla convolution blocks in mainstream CNNs, while extensive experimental results demonstrate that CNNs equipped with SCIFE compress benchmark structures without sacrificing performance, outperforming their competitors.
Chapter
Batch normalization (BN) is comprised of a normalization component followed by an affine transformation and has become essential for training deep neural networks. Standard initialization of each BN in a network sets the affine transformation scale and shift to 1 and 0, respectively. However, after training we have observed that these parameters do not alter much from their initialization. Furthermore, we have noticed that the normalization process can still yield overly large values, which is undesirable for training. We revisit the BN formulation and present a new initialization method and update approach for BN to address the aforementioned issues. Experiments are designed to emphasize and demonstrate the positive influence of proper BN scale initialization on performance, and use rigorous statistical significance tests for evaluation. The approach can be used with existing implementations at no additional computational cost. Source code is available at https://github.com/osu-cvl/revisiting-bn-init.
Chapter
We propose a framework to describe normalizing-activations-as-function methods in Algorithm 4.1. We divide the normalizing-activations-as-function framework into three abstract processes: normalization area partitioning (NAP), normalization operation (NOP), and normalization representation recovery (NRR). We consider the more general mini-batch (of size m) activations in a convolutional layer $\textsf{X} \in \mathbb{R}^{d \times m \times h \times w}$, where d, h and w are the channel number, height and width of the feature maps, respectively. NAP transforms the activations $\textsf{X}$ into $\boldsymbol{X} \in \mathbb{R}^{S_1 \times S_2}$, where $S_2$ indexes the set of samples used to compute the estimators. NOP denotes the specific normalization operation (see main operations in Sect. 1.1) on the transformed data $\boldsymbol{X}$. NRR is used to recover the possible reduced representation capacity.
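The three-process decomposition can be illustrated by instantiating it with batch normalization. The helper below follows the chapter's (d, m, h, w) shape convention, with the reshape as NAP, standardization as NOP, and the affine transform as NRR; this is a sketch of the framework, not the book's code.

```python
import torch

def normalize_via_framework(X: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """NAP / NOP / NRR decomposition, instantiated with batch normalization.
    X has shape (d, m, h, w): channels, batch, height, width."""
    d, m, h, w = X.shape
    # NAP: partition activations so each channel's samples line up (S1=d, S2=m*h*w).
    Xp = X.reshape(d, m * h * w)
    # NOP: the normalization operation itself (standardization here).
    mean = Xp.mean(dim=1, keepdim=True)
    std = Xp.std(dim=1, unbiased=False, keepdim=True)
    Xn = (Xp - mean) / (std + eps)
    # NRR: recover representation capacity with an affine transform.
    gamma = torch.ones(d, 1)
    beta = torch.zeros(d, 1)
    return (gamma * Xn + beta).reshape(d, m, h, w)

X = torch.randn(8, 4, 5, 5)
print(normalize_via_framework(X).shape)   # torch.Size([8, 4, 5, 5])
```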
Chapter
The lack of sufficient annotated image data is a common issue in medical image segmentation. For some organs and densities, the annotation may be scarce, leading to poor model training convergence, while other organs have plenty of annotated data. In this work, we present MetaMedSeg, a gradient-based meta-learning algorithm that redefines the meta-learning task for the volumetric medical data with the goal of capturing the variety between the slices. We also explore different weighting schemes for gradients aggregation, arguing that different tasks might have different complexity and hence, contribute differently to the initialization. We propose an importance-aware weighting scheme to train our model. In the experiments, we evaluate our method on the medical decathlon dataset by extracting 2D slices from CT and MRI volumes of different organs and performing semantic segmentation. The results show that our proposed volumetric task definition leads to up to 30% improvement in terms of IoU compared to related baselines. The proposed update rule is also shown to improve the performance for complex scenarios where the data distribution of the target organ is very different from the source organs. (Project page: http://metamedseg.github.io/)
Preprint
Full-text available
We present Sandwich Batch Normalization (SaBN), an embarrassingly easy improvement of Batch Normalization (BN) with only a few lines of code changes. SaBN is motivated by addressing the inherent feature distribution heterogeneity that can be identified in many tasks, which can arise from data heterogeneity (multiple input domains) or model heterogeneity (dynamic architectures, model conditioning, etc.). Our SaBN factorizes the BN affine layer into one shared sandwich affine layer, cascaded by several parallel independent affine layers. Concrete analysis reveals that, during optimization, SaBN promotes balanced gradient norms while still preserving diverse gradient directions: a property that many application tasks seem to favor. We demonstrate the prevailing effectiveness of SaBN as a drop-in replacement in four tasks: conditional image generation, neural architecture search (NAS), adversarial training, and arbitrary style transfer. Leveraging SaBN immediately achieves better Inception Score and FID on CIFAR-10 and ImageNet conditional image generation with three state-of-the-art GANs; boosts the performance of a state-of-the-art weight-sharing NAS algorithm significantly on NAS-Bench-201; substantially improves the robust and standard accuracies for adversarial defense; and produces superior arbitrary stylized results. We also provide visualizations and analysis to help understand why SaBN works. Codes are available at https://github.com/VITA-Group/Sandwich-Batch-Normalization.
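A minimal sketch of the SaBN factorization described above: a shared sandwich affine layer followed by condition-selected independent affine layers. The conditioning interface (an integer index per sample) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SandwichBN(nn.Module):
    """Sketch of SaBN: normalization, then one shared affine layer,
    cascaded with several parallel independent affine layers selected
    by a condition index (e.g., class or domain)."""
    def __init__(self, channels: int, num_conditions: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.shared_gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.shared_beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.gammas = nn.Parameter(torch.ones(num_conditions, channels))
        self.betas = nn.Parameter(torch.zeros(num_conditions, channels))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        x = self.bn(x) * self.shared_gamma + self.shared_beta   # shared sandwich affine
        g = self.gammas[cond][..., None, None]                  # (N, C, 1, 1)
        b = self.betas[cond][..., None, None]
        return x * g + b                                        # independent affine

x = torch.randn(4, 32, 8, 8)
cond = torch.tensor([0, 1, 0, 2])
print(SandwichBN(32, num_conditions=3)(x, cond).shape)   # torch.Size([4, 32, 8, 8])
```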
Chapter
In state-of-the-art deep neural networks, both feature normalization and feature attention have become ubiquitous. They are usually studied as separate modules, however. In this paper, we propose a light-weight integration between the two schema and present Attentive Normalization (AN). Instead of learning a single affine transformation, AN learns a mixture of affine transformations and utilizes their weighted-sum as the final affine transformation applied to re-calibrate features in an instance-specific way. The weights are learned by leveraging channel-wise feature attention. In experiments, we test the proposed AN using four representative neural architectures in the ImageNet-1000 classification benchmark and the MS-COCO 2017 object detection and instance segmentation benchmark. AN obtains consistent performance improvement for different neural architectures in both benchmarks with absolute increase of top-1 accuracy in ImageNet-1000 between 0.5% and 2.7%, and absolute increase up to 1.8% and 2.2% for bounding box and mask AP in MS-COCO respectively. We observe that the proposed AN provides a strong alternative to the widely used Squeeze-and-Excitation (SE) module. The source codes are publicly available at the ImageNet Classification Repo and the MS-COCO Detection and Segmentation Repo.
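The mixture-of-affine-transformations idea in AN can be sketched as follows; the tiny linear attention head over pooled channel statistics is an illustrative assumption, not the paper's exact attention module.

```python
import torch
import torch.nn as nn

class AttentiveNorm2d(nn.Module):
    """Sketch of Attentive Normalization: K candidate affine transforms
    whose weighted sum (weights from channel-wise attention) re-calibrates
    the normalized features per instance."""
    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.gammas = nn.Parameter(torch.ones(k, channels))
        self.betas = nn.Parameter(torch.zeros(k, channels))
        self.attn = nn.Linear(channels, k)   # channel statistics -> mixture weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn(x)
        w = torch.softmax(self.attn(x.mean(dim=(2, 3))), dim=1)   # (N, K)
        gamma = w @ self.gammas                                   # (N, C)
        beta = w @ self.betas
        return x * gamma[..., None, None] + beta[..., None, None]

x = torch.randn(4, 32, 8, 8)
print(AttentiveNorm2d(32)(x).shape)   # torch.Size([4, 32, 8, 8])
```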
Article
Full-text available
Batch Normalization (BN) (Ioffe and Szegedy 2015) normalizes the features of an input image via statistics of a batch of images, and hence BN brings noise to the gradient of the training loss. Previous works indicate that the noise is important for the optimization and generalization of deep neural networks, but too much noise will harm the performance of networks. In our paper, we offer a new point of view: the self-attention mechanism can help to regulate the noise by enhancing instance-specific information to obtain a better regularization effect. Therefore, we propose an attention-based BN called Instance Enhancement Batch Normalization (IEBN) that recalibrates the information of each channel by a simple linear transformation. IEBN has a good capacity for regulating the batch noise and stabilizing network training to improve generalization, even in the presence of two kinds of noise attacks during training. Finally, IEBN outperforms BN with only a light parameter increment in image classification tasks under different network structures and benchmark datasets.
Article
Full-text available
Batch Normalization (BN) is capable of accelerating the training of deep models by centering and scaling activations within mini-batches. In this work, we propose Decorrelated Batch Normalization (DBN), which not just centers and scales activations but whitens them. We explore multiple whitening techniques, and find that PCA whitening causes a problem we call stochastic axis swapping, which is detrimental to learning. We show that ZCA whitening does not suffer from this problem, permitting successful learning. DBN retains the desirable qualities of BN and further improves BN's optimization efficiency and generalization ability. We design comprehensive experiments to show that DBN can improve the performance of BN on multilayer perceptrons and convolutional neural networks. Furthermore, we consistently improve the accuracy of residual networks on CIFAR-10, CIFAR-100, and ImageNet.
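The ZCA whitening at the heart of DBN can be sketched directly. Rotating back by U (the ZCA step) keeps each whitened dimension aligned with the original axes, which is what avoids the stochastic axis swapping the abstract describes for PCA whitening; the data below is arbitrary.

```python
import numpy as np

def zca_whiten(X: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Center the data, then multiply by Sigma^{-1/2} computed via an
    eigendecomposition (ZCA whitening)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    eigvals, U = np.linalg.eigh(cov)
    W = U @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ U.T   # ZCA transform
    return Xc @ W

X = np.random.randn(1024, 8) @ np.random.randn(8, 8)      # correlated data
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))              # ~ identity matrix
```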
Article
Full-text available
Orthogonal matrices have shown advantages in training Recurrent Neural Networks (RNNs), but such matrices are limited to being square for the hidden-to-hidden transformation in RNNs. In this paper, we generalize the square orthogonal matrix to an orthogonal rectangular matrix and formulate this problem in feed-forward Neural Networks (FNNs) as Optimization over Multiple Dependent Stiefel Manifolds (OMDSM). We show that the rectangular orthogonal matrix can stabilize the distribution of network activations and regularize FNNs. We also propose a novel orthogonal weight normalization method to solve OMDSM. Particularly, it constructs orthogonal transformations over proxy parameters to ensure the weight matrix is orthogonal and back-propagates gradient information through the transformation during training. To guarantee stability, we minimize the distortions between proxy parameters and canonical weights over all tractable orthogonal transformations. In addition, we design an orthogonal linear module (OLM) to learn orthogonal filter banks in practice, which can be used as an alternative to the standard linear module. Extensive experiments demonstrate that by simply substituting OLM for the standard linear module without revising any experimental protocols, our method largely improves the performance of state-of-the-art networks, including Inception and residual networks, on the CIFAR and ImageNet datasets. In particular, we have reduced the test error of the wide residual network on CIFAR-100 from 20.04% to 18.61% with this simple substitution. Our code is available online for result reproduction.
Article
Full-text available
We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.
Article
Full-text available
Normalization techniques have only recently begun to be exploited in supervised learning tasks. Batch normalization exploits mini-batch statistics to normalize the activations. This was shown to speed up training and result in better models. However its success has been very limited when dealing with recurrent neural networks. On the other hand, layer normalization normalizes the activations across all activities within a layer. This was shown to work well in the recurrent setting. In this paper we propose a unified view of normalization techniques, as forms of divisive normalization, which includes layer and batch normalization as special cases. Our second contribution is the finding that a small modification to these normalization schemes, in conjunction with a sparse regularizer on the activations, leads to significant benefits over standard normalization techniques. We demonstrate the effectiveness of our unified divisive normalization framework in the context of convolutional neural nets and recurrent neural networks, showing improvements over baselines in image classification, language modeling as well as super-resolution.
Article
Full-text available
In this paper we revisit the fast stylization method introduced in Ulyanov et al. (2016). We show how a small change in the stylization architecture results in a significant qualitative improvement in the generated images. The change is limited to swapping batch normalization with instance normalization, and applying the latter both at training and testing time. The resulting method can be used to train high-performance architectures for real-time image generation. The code will be made available at https://github.com/DmitryUlyanov/texture_nets.
Article
Full-text available
While the authors of Batch Normalization (BN) identify and address an important problem involved in training deep networks -- Internal Covariate Shift -- the current solution has multiple drawbacks. For instance, BN depends on batch statistics for layerwise input normalization during training, which makes the estimates of the mean and standard deviation of the input (distribution) to hidden layers inaccurate due to shifting parameter values (especially during initial training epochs). Another fundamental problem with BN is that it cannot be used with batch size 1 during training. We address these (and other) drawbacks of BN by proposing a non-adaptive normalization technique for removing covariate shift, which we call Normalization Propagation. Our approach does not depend on batch statistics, but rather uses a data-independent parametric estimate of the mean and standard deviation in every layer, and is thus faster than BN. We exploit the observation that the pre-activations before Rectified Linear Units follow a Gaussian distribution in deep networks, and that once the first and second order statistics of any given dataset are normalized, we can forward-propagate this normalization without the need for recalculating the approximate statistics (using data) for any of the hidden layers.
Conference Paper
Full-text available
We propose a method for semantic parsing of images with regular structure. The structured objects are modeled in a densely connected CRF. The paper describes how to embody specific spatial relations in a representation called Spatial Pattern Templates (SPT), which allows us to capture regularity constraints of alignment and equal spacing in pairwise and ternary potentials. Assuming the input image is pre-segmented to salient regions the SPT describe which segments could interact in the structured graphical model. The model parameters are learnt to describe the formal language of semantic labelings. Given an input image, a consistent labeling over its segments linked in the CRF is recognized as a word from this language. The CRF framework allows us to apply efficient algorithms for both recognition and learning. We demonstrate the approach on the problem of facade image parsing and show that results comparable with state of the art methods are achieved without introducing additional manually designed detectors for specific terminal objects.
Article
Full-text available
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near state of the art results for the detection and classifications tasks. Finally, we release a feature extractor from our best model called OverFeat.
Conference Paper
Full-text available
In this paper, we describe a nonlinear image representation based on divisive normalization that is designed to match the statistical properties of photographic images, as well as the perceptual sensitivity of biological visual systems. We decompose an image using a multi-scale oriented representation, and use Student's t as a model of the dependencies within local clusters of coefficients. We then show that normalization of each coefficient by the square root of a linear combination of the amplitudes of the coefficients in the cluster reduces statistical dependencies. We further show that the resulting divisive normalization transform is invertible and provide an efficient iterative inversion algorithm. Finally, we probe the statistical and perceptual advantages of this image representation by examining its robustness to added noise, and using it to enhance image contrast.
Article
Full-text available
Objective methods for assessing perceptual image quality traditionally attempted to quantify the visibility of errors (differences) between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative complementary framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a structural similarity index and demonstrate its promise through a set of intuitive examples, as well as comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000. A MATLAB implementation of the proposed algorithm is available online at http://www.cns.nyu.edu/~lcv/ssim/.
Article
While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions, and fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on the ImageNet classification task has been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new Full Reference Image Quality Assessment (FR-IQA) dataset of perceptual human judgments, orders of magnitude larger than previous datasets. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by huge margins. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. This system enables us to train visual recognition models on internet-scale data with high efficiency.
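The two ingredients named in this abstract, the linear scaling rule and gradual warmup, fit in a few lines; the step counts below are illustrative assumptions.

```python
def learning_rate(step: int, base_lr: float = 0.1, base_batch: int = 256,
                  batch: int = 8192, warmup_steps: int = 500) -> float:
    """Linear scaling rule with gradual warmup: scale the reference
    learning rate by batch/base_batch, ramping up linearly from zero
    over the first warmup_steps steps."""
    target = base_lr * batch / base_batch            # linear scaling rule
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps    # gradual warmup
    return target

print(learning_rate(0))      # small warmup value
print(learning_rate(1000))   # 3.2 = 0.1 * 8192 / 256
```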
Article
Gatys et al. recently introduced a neural algorithm that renders a content image in the style of another image, achieving so-called style transfer. However, their framework requires a slow iterative optimization process, which limits its practical application. Fast approximations with feed-forward neural networks have been proposed to speed up neural style transfer. Unfortunately, the speed improvement comes at a cost: the network is usually tied to a fixed set of styles and cannot adapt to arbitrary new styles. In this paper, we present a simple yet effective approach that for the first time enables arbitrary style transfer in real-time. At the heart of our method is a novel adaptive instance normalization (AdaIN) layer that aligns the mean and variance of the content features with those of the style features. Our method achieves speed comparable to the fastest existing approach, without the restriction to a pre-defined set of styles. In addition, our approach allows flexible user controls such as content-style trade-off, style interpolation, color & spatial controls, all using a single feed-forward neural network.
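The AdaIN layer at the heart of this method is compact enough to state directly: align the per-channel mean and variance of the content features with those of the style features. A minimal sketch:

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization: shift/scale the content features
    so their per-channel statistics match those of the style features."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True)
    return s_std * (content - c_mean) / c_std + s_mean

content, style = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(adain(content, style).shape)   # torch.Size([1, 64, 32, 32])
```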
Article
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.
Article
Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d. minibatches. At the same time, Batch Renormalization retains the benefits of batchnorm such as insensitivity to initialization and training efficiency.
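A minimal NumPy sketch of the training-time renormalization step follows. The clipping bounds r_max and d_max, and the omission of the learnable affine parameters, are simplifications; the paper gradually relaxes the bounds over the course of training.

```python
import numpy as np

def batch_renorm_train(x, running_mean, running_var,
                       r_max=3.0, d_max=5.0, eps=1e-5):
    # Minibatch statistics, per feature (x has shape (N, D)).
    mu = x.mean(axis=0)
    sigma = x.std(axis=0) + eps
    moving_sigma = np.sqrt(running_var) + eps
    # Correction factors tying minibatch statistics to the moving ones;
    # in a real framework r and d are treated as constants (no gradient).
    r = np.clip(sigma / moving_sigma, 1.0 / r_max, r_max)
    d = np.clip((mu - running_mean) / moving_sigma, -d_max, d_max)
    # The normalized output now depends on individual examples in the
    # same way at training and inference time.
    return (x - mu) / sigma * r + d
```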
Article
Objective methods for assessing perceptual image quality have traditionally attempted to quantify the visibility of errors between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a Structural Similarity Index and demonstrate its promise through a set of intuitive examples, as well as a comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000. A MATLAB implementation of the proposed algorithm is available online at http://www.cns.nyu.edu/~lcv/ssim/.
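The index's form is compact enough to state directly. A single-window NumPy sketch: the full method evaluates the index over local windows and averages, and the constants follow the common K1 = 0.01, K2 = 0.03 convention, which is an assumption here.

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    # Stabilizing constants (K1=0.01, K2=0.03 are conventional choices).
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    # Luminance, contrast, and structure terms combined in one expression.
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
```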
Article
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
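A minimal NumPy sketch of the per-sample computation, with the adaptive gain and bias applied after normalization; because the statistics depend only on the single case, the same code serves both training and inference.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # x: (N, D). Mean and variance are taken over the feature
    # dimension of each sample, not over the minibatch.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    # Per-neuron adaptive gain and bias, shape (D,).
    return gain * (x - mu) / (sigma + eps) + bias
```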
Article
We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.
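The reparameterization itself is one line per weight vector: w = g · v / ‖v‖, with the scalar g carrying the length and v the direction. A NumPy sketch:

```python
import numpy as np

def weight_norm(v, g):
    # Decouple length from direction: the optimizer updates g and v,
    # and the effective weight w is reconstructed on the fly.
    return g * v / np.linalg.norm(v)

w = weight_norm(np.array([3.0, 4.0]), g=2.0)  # -> [1.2, 1.6], with ||w|| = g = 2
```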
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
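The residual formulation, output = F(x) + x with an identity shortcut, can be sketched as a small PyTorch module; the channel count and the conv-BN-ReLU arrangement here are illustrative choices.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Learn the residual F(x) = H(x) - x instead of the mapping H(x)."""
    def __init__(self, channels=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Identity shortcut: the block only has to learn the correction.
        return self.relu(self.f(x) + x)
```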
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
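The training-time computation reduces to standardizing each feature with minibatch statistics, then applying the learnable scale and shift. A NumPy sketch of this path only (inference would substitute accumulated moving statistics):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (N, D). Statistics are taken over the minibatch dimension.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Learnable per-feature scale and shift restore representational power.
    return gamma * x_hat + beta
```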
Article
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Conference Paper
In many recent object recognition systems, feature extraction stages are generally composed of a filter bank, a non-linear transformation, and some sort of feature pooling layer. Most systems use only one stage of feature extraction in which the filters are hard-wired, or two stages where the filters in one or both stages are learned in supervised or unsupervised mode. This paper addresses three questions: 1. How do the non-linearities that follow the filter banks influence recognition accuracy? 2. Does learning the filter banks in an unsupervised or supervised manner improve performance over random or hard-wired filters? 3. Is there any advantage to using an architecture with two stages of feature extraction rather than one? We show that using non-linearities that include rectification and local contrast normalization is the single most important ingredient for good accuracy on object recognition benchmarks. We show that two stages of feature extraction yield better accuracy than one. Most surprisingly, we show that a two-stage system with random filters can yield almost 63% recognition rate on Caltech-101, provided that the proper non-linearities and pooling layers are used. Finally, we show that with supervised refinement the system achieves state-of-the-art performance on the NORB dataset (5.6%), and that unsupervised pre-training followed by supervised refinement produces good accuracy on Caltech-101 (> 65%) and the lowest known error rate on the undistorted, unprocessed MNIST dataset (0.53%).
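The rectification-plus-local-contrast-normalization combination singled out above can be approximated in a few lines. The box filter and window size below are simplifying assumptions; the paper's version uses Gaussian-weighted neighborhoods spanning feature maps.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def rectify_and_lcn(x, size=9, eps=1e-4):
    # x: a 2-D feature map.
    x = np.abs(x)                                   # rectification
    centered = x - uniform_filter(x, size)          # subtractive normalization
    local_std = np.sqrt(uniform_filter(centered ** 2, size))
    return centered / np.maximum(local_std, eps)    # divisive normalization
```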
Article
With the advent of the Internet, billions of images are now freely available online and constitute a dense sampling of the visual world. Using a variety of non-parametric methods, we explore this world with the aid of a large dataset of 79,302,017 images collected from the Internet. Motivated by psychophysical results showing the remarkable tolerance of the human visual system to degradations in image resolution, the images in the dataset are stored as 32 x 32 color images. Each image is loosely labeled with one of the 75,062 non-abstract nouns in English, as listed in the WordNet lexical database. Hence the image database gives a comprehensive coverage of all object categories and scenes. The semantic information from WordNet can be used in conjunction with nearest-neighbor methods to perform object classification over a range of semantic levels, minimizing the effects of labeling noise. For certain classes that are particularly prevalent in the dataset, such as people, we are able to demonstrate a recognition performance comparable to class-specific Viola-Jones style detectors.