Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Authors: Sergey Ioffe, Christian Szegedy

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
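For readers who want the mechanics at a glance, the following NumPy sketch illustrates the training-time transform described above: activations are normalized with mini-batch statistics and then rescaled by the learned parameters gamma and beta. This is a minimal illustration rather than the paper's reference implementation; the epsilon constant and the toy shapes are implementation choices.

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch over the batch axis, then scale and shift.

    x:     (N, D) activations of one mini-batch
    gamma: (D,) learned scale
    beta:  (D,) learned shift
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # restore representational capacity

# toy usage: outputs have roughly zero mean and unit variance per feature
x = np.random.randn(32, 4) * 5.0 + 3.0
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))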


... For the discriminator, PatchGAN [34] is more effective than StyleGAN [37] and UNet [74] discriminators. However, PatchGAN with BatchNorm [33] can be too strong for VAEs with high compression ratios. In experiments, we find that SpectralNorm [59] improves the training stability more effectively than the commonly used R1 penalty [57] or LeCAM regularization [80]. ...
Preprint
Full-text available
This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, that of larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continued training. See the project page at https://seaweed.video/
... The BN normalization algorithm is presented below. For further details, see Ioffe and Szegedy (2015). ...
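As a complement to the excerpt above, the sketch below adds the inference-time side of the algorithm, in which mini-batch statistics are replaced by running estimates of the population mean and variance. The momentum-style update follows common library implementations and is an assumption; Ioffe and Szegedy (2015) also describe computing population statistics after training.

import numpy as np

class BatchNorm1d:
    """Minimal sketch: mini-batch statistics during training,
    running averages at inference time."""

    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # update population estimates with an exponential moving average
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta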
Article
Full-text available
This paper presents a deep learning approach for option pricing using a long short-term memory (LSTM) neural network applied to European call options on the S&P 500 index. We utilize a rolling window approach that trains 12 instances of the LSTM model, one for each month of 2021. To gain further insight into the model performance, we use explainable artificial intelligence (XAI) through SHapley Additive Explanations (SHAP). We find that the LSTM model outperforms the Black–Scholes and the Heston models and a multilayer perceptron (MLP) neural network regarding overall pricing accuracy. Most notably, the time-sequencing nature of LSTM enables the proposed model to capture sufficient short-term volatility from recently traded options. This result is still robust when controlling for time-varying volatility dynamics. Thus, the model is less prone to measurement errors in volatility.
... Inception-v3, DenseNet, AlexNet, ResNet, and U-Net are neural network architectures with distinct structural variations that have been applied to retinal image analysis [133]. Inception-v3, from the GoogLeNet family, was designed for multi-level feature extraction tailored to large-scale image recognition tasks, incorporating factorization and other approaches, such as batch normalization, to minimize parameters and maximize performance [134,135]. DenseNet is well-suited for image classification and segmentation due to dense connections between layers, which improve gradient flow and encourage feature reuse [136]. AlexNet, an early CNN, is adept at image classification due to its deep architecture and use of the rectified linear unit (ReLU) activation but may be outperformed by later models in terms of efficiency [137]. ...
Article
Full-text available
Background: Diabetes mellitus (DM) increases the risk of vascular complications, and retinal vasculature imaging serves as a valuable indicator of both microvascular and macrovascular health. Moreover, artificial intelligence (AI)-enabled systems developed for high-throughput detection of diabetic retinopathy (DR) using digitized retinal images have become clinically adopted. This study reviews AI applications using retinal images for DM-related complications, highlighting advancements beyond DR screening, diagnosis, and prognosis, and addresses implementation challenges, such as ethics, data privacy, equitable access, and explainability. Methods: We conducted a thorough literature search across several databases, including PubMed, Scopus, and Web of Science, focusing on studies involving diabetes, the retina, and artificial intelligence. We reviewed the original research based on their methodology, AI algorithms, data processing techniques, and validation procedures to ensure a detailed analysis of AI applications in diabetic retinal imaging. Results: Retinal images can be used to diagnose DM complications including DR, neuropathy, nephropathy, and atherosclerotic cardiovascular disease, as well as to predict the risk of cardiovascular events. Beyond DR screening, AI integration also offers significant potential to address the challenges in the comprehensive care of patients with DM. Conclusion: With the ability to evaluate the patient’s health status in relation to DM complications as well as risk prognostication of future cardiovascular complications, AI-assisted retinal image analysis has the potential to become a central tool for modern personalized medicine in patients with DM.
... The model was based on a U-Net-type deep convolutional neural network [22] with 11 convolution layers, batch normalization [23], and ReLU activation functions [24]. Transposed convolutions in the decoder stage restored spatial resolution, while optimization was performed using the Adam optimizer with a categorical cross-entropy loss function. ...
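The layer ordering mentioned in this excerpt (convolution, then batch normalization, then ReLU, with transposed convolutions restoring resolution in the decoder) is sketched below in PyTorch. The channel counts, kernel sizes, and strides are placeholders, not the cited model's actual configuration.

import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # encoder block: convolution followed by batch normalization and ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def upconv_bn_relu(in_ch, out_ch):
    # decoder block: transposed convolution restores spatial resolution
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )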
Article
Full-text available
Background Artificial intelligence (AI)-driven biomarker segmentation offers an objective and reproducible approach for quantifying key anatomical features in neovascular age-related macular degeneration (nAMD) using optical coherence tomography (OCT). Currently, Faricimab, a novel bispecific inhibitor of vascular endothelial growth factor (VEGF) and angiopoietin-2 (Ang-2), offers new potential in the management of nAMD, particularly in treatment-resistant cases. This study utilizes an advanced deep learning-based segmentation algorithm to analyze OCT biomarkers and evaluate the efficacy and durability of Faricimab over nine months in patients with therapy-refractory nAMD. Methods This retrospective real-world study analyzed patients with treatment-resistant nAMD who switched to Faricimab following inadequate responses to ranibizumab or aflibercept. Automated segmentation of key OCT biomarkers - including fibrovascular pigment epithelium detachment (fvPED), intraretinal fluid (IRF), subretinal fluid (SRF), subretinal hyperreflective material (SHRM), choroidal volume, and central retinal thickness (CRT) - was conducted using a deep learning algorithm based on a convolutional neural network. Results A total of 46 eyes from 41 patients completed the nine-month follow-up. Significant reductions in SRF, fvPED, and choroidal volume were observed from baseline (mo0) to three months (mo3) and sustained at nine months (mo9). CRT decreased significantly from 342.7 (interquartile range (iqr): 117.1) µm at mo0 to 296.6 (iqr: 84.3) µm at mo3 and 310.2 (iqr: 93.6) µm at mo9. The deep learning model provided precise quantification of biomarkers, enabling reliable tracking of disease progression. The median injection interval extended from 35 (iqr: 15) days at mo0 to 56 (iqr: 20) days at mo9, representing a 60% increase. Visual acuity remained stable throughout the study. Correlation analysis revealed that higher baseline CRT and fvPED volumes were associated with greater best-corrected visual acuity (BCVA) improvements and longer treatment intervals. Conclusions This study highlights the potential of AI-driven biomarker segmentation as a precise and scalable tool for monitoring disease progression in treatment-resistant nAMD. By enabling objective and reproducible analysis of OCT biomarkers, deep learning algorithms provide critical insights into treatment response. Faricimab demonstrated significant and sustained anatomical improvements, allowing for extended treatment intervals while maintaining disease stability. Future research should focus on refining AI models to improve predictive accuracy and assessing long-term outcomes to further optimize disease management. Trial registration Ethics approval was obtained from the Institutional Review Board of LMU Munich (study ID: 20–0382). This study was conducted in accordance with the Declaration of Helsinki.
... The MLP consists of the sequence of BatchNorm (Ioffe and Szegedy 2015), ReLU, and fully connected layers. ...
Article
Object re-identification (ReID) is committed to searching for objects of the same identity across cameras, and its real-world deployment is gradually increasing. Current ReID methods assume that the deployed system follows the centralized processing paradigm, i.e., all computations are conducted in the cloud server and edge devices are only used to capture images. As the number of videos experiences a rapid escalation, this paradigm has become impractical due to the finite computational resources in the cloud server. Therefore, the ReID system should be converted to fit in the cloud-edge collaborative processing paradigm, which is crucial to boost its scalability and practicality. However, current works lack relevant research on this important specific issue, making it difficult to adapt them into a cloud-edge framework effectively. In this paper, we propose a cloud-edge collaborative inference framework for ReID systems, aiming to expedite the return of the desired image captured by the camera to the cloud server by learning the spatial-temporal correlations among objects. In the system, a Distribution-aware Correlation Modeling network (DaCM) is particularly proposed to embed the spatial-temporal correlations of the camera network implicitly into a graph structure, and it can be applied 1) in the cloud to regulate the size of the upload window and 2) on the edge device to adjust the sequence of images, respectively. Notably, the proposed DaCM can be seamlessly combined with traditional ReID methods, enabling their application within our proposed edge-cloud collaborative framework. Extensive experiments demonstrate that our method obviously reduces transmission overhead and significantly improves performance.
... where W_red ∈ R^(n_d C × C) is the reduction matrix and BN(·) denotes the batch normalization (Ioffe and Szegedy 2015). ...
Article
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by combining complementary information from multiple modalities. Existing multi-modal object ReID methods primarily focus on the fusion of heterogeneous features. However, they often overlook the dynamic quality changes in multi-modal imaging. In addition, the shared information between different modalities can weaken modality-specific information. To address these issues, we propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts. To be specific, we first deploy a Patch-Integrated Feature Extractor (PIFE) to extract multi-granularity and multi-modal features. Then, we introduce a Hierarchical Decoupling Module (HDM) to decouple multi-modal features into non-overlapping forms, preserving the modality uniqueness and increasing the feature diversity. Finally, we propose an Attention-Triggered Mixture of Experts (ATMoE), which replaces traditional gating with dynamic attention weights derived from decoupled features. With these modules, our DeMo can generate more robust multi-modal features. Extensive experiments on three object ReID benchmarks verify the effectiveness of our methods.
... The Network Inversion Technique was evaluated on classifiers trained on the MNIST, FashionMNIST, SVHN, and CIFAR-10 datasets by training a generator to produce images that, when passed through a classifier, elicit the desired labels. The generator is based on Vector-Matrix Conditioning followed by multiple layers of transposed convolutions, with batch normalization (Ioffe and Szegedy 2015) and dropout layers (Srivastava et al. 2014) to encourage diversity in the generated images. ...
Article
Neural networks have emerged as powerful tools across various applications, yet their decision-making process often remains opaque, leading to them being perceived as "black boxes." This opacity raises concerns about their interpretability and reliability, especially in safety-critical scenarios. Network inversion techniques offer a solution by allowing us to peek inside these black boxes, revealing the features and patterns learned by the networks behind their decision-making processes and thereby provide valuable insights into how neural networks arrive at their conclusions, making them more interpretable and trustworthy. This paper presents a simple yet effective approach to network inversion using a meticulously conditioned generator that learns the data distribution in the input space of the trained neural network, enabling the reconstruction of inputs that would most likely lead to the desired outputs. To capture the diversity in the input space for a given output, instead of simply revealing the conditioning labels to the generator, we encode the conditioning label information into vectors and intermediate matrices and further minimize the cosine similarity between features of the generated images.
... Accordingly, they suffer from extremely large time costs (more than an hour on ImageNet (Deng et al. 2009)) to perform the collect&prune. Inspired by previous work (Ioffe and Szegedy 2015), we adopt the Exponential Moving Average (EMA) mechanism to accumulate the edge weight matrix A_l during training, as shown in Equation 4. In order to mitigate the large performance drop after pruning, we progressively remove channels in T step iterations. We prune the model and finetune it for a few iterations to restore the performance, which makes the pruning less destructive. ...
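The excerpt does not reproduce its Equation 4, but an exponential-moving-average accumulation of a statistic across training iterations generally takes the form sketched below; the momentum value and the names A_running and A_batch are illustrative assumptions rather than the cited paper's notation.

def ema_update(A_running, A_batch, momentum=0.9):
    """Accumulate a per-iteration statistic (here an edge-weight matrix) with an
    exponential moving average, in the spirit of the running mean and variance
    estimates used by batch normalization."""
    return momentum * A_running + (1.0 - momentum) * A_batch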
Article
In recent years, semantic segmentation has flourished in various applications. However, the high computational cost remains a significant challenge that hinders its further adoption. The filter pruning method for structured network slimming offers a direct and effective solution for the reduction of segmentation networks. Nevertheless, we argue that most existing pruning methods, originally designed for image classification, overlook the fact that segmentation is a location-sensitive task, which consequently leads to their suboptimal performance when applied to segmentation networks. To address this issue, this paper proposes a novel approach, denoted as Spatial-aware Information Redundancy Filter Pruning (SIRFP), which aims to reduce feature redundancy between channels. First, we formulate the pruning process as a maximum edge weight clique problem (MEWCP) in graph theory, thereby minimizing the redundancy among the remaining features after pruning. Within this framework, we introduce a spatial-aware redundancy metric based on feature maps, thus endowing the pruning process with location sensitivity to better adapt to pruning segmentation networks. Additionally, based on the MEWCP, we propose a low computational complexity greedy strategy to solve this NP-hard problem, making it feasible and efficient for structured pruning. To validate the effectiveness of our method, we conducted extensive comparative experiments on various challenging datasets. The results demonstrate the superior performance of SIRFP for semantic segmentation tasks.
... The results were obtained by testing the new ablated models against the testing dataset. Additionally, given the importance of batch normalization layers for performance and computational cost improvement, a final ablation study was conducted by incorporating batch normalization layers along with dropout (Romijnders et al., 2023; Garbin et al., 2020; Ioffe and Szegedy, 2015) to study the impact of including batch normalization layers. ...
Article
Full-text available
Daily motor activities are affected by motor disabilities caused by Parkinson's disease (PD). Monitoring motor disabilities frequently observed in PD is difficult for physicians, as they are limited to the information observed or self-reported during routine consultations, resulting in subjectivity and limited assessment. Thus, it is necessary to evaluate more frequently and objectively, ideally in a continuous manner that includes daily life tasks. While wearable sensory devices, such as inertial sensors, and their applications are steadily growing, advances in artificial intelligence have revolutionized their ability to extract deeply hidden information for accurate detection and interpretation of motor activities. However, further studies are required, mainly focused on PD. This study aimed to implement a deep learning (DL) based model for recognizing daily motor activities based on inertial data to contribute to PD. The model relied on a convolutional neural network (CNN) architecture, trained and tested on a custom dataset. We further benchmarked our model against other popular DL frameworks. The dataset included inertial data captured from 18 patients while performing trivial quotidian tasks, such as walking, turning, sitting, and lying. We hypothesized that a DL model based on a CNN architecture could be an appropriate solution for modeling daily motor non-steady and steady-state tasks from a single inertial sensor. We measured an F1 score of 0.906 and an accuracy of 91.1% on final testing with our optimized CNN model, with standing and walking being the tasks most accurately recognized by the model. Future work should explore attention-based models and increase the dataset size.
... • FedBN [173] deals with the challenge of different workers storing data with different distributions locally, or feature-shift non-IID. This approach uses local batch normalization [139] to mitigate feature shift before the synchronization period. FedBN outperforms both FedAvg and FedProx by achieving 6-10% more accuracy on various non-IID datasets like Office-Caltech-10 [15] and DomainNet [212]. ...
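A rough sketch of the FedBN idea referenced above: server-side aggregation averages all parameters except those belonging to batch-normalization layers, which stay local to each client. The key-matching rule and the PyTorch-style state dictionaries are assumptions made for illustration, not FedBN's actual code.

import torch

def aggregate_without_bn(global_state, client_states):
    """FedAvg-style aggregation that skips batch-normalization entries, so each
    client keeps its own BN weights and running statistics."""
    new_state = {}
    for key, value in global_state.items():
        if "bn" in key:  # assumed naming convention for BN parameters/buffers
            new_state[key] = value          # leave BN entries local
        else:
            new_state[key] = torch.stack(
                [cs[key].float() for cs in client_states]
            ).mean(dim=0)                   # average everything else
    return new_state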
... Figure 7 further elaborates on computational details of each colored block. Batch Norm [30] was applied to all layers except the decoder output and encoder input. In line with the principle of design simplicity [31], the model exclusively used convolutional layers, where downsampling was achieved by increasing the stride. ...
... However, practical experiments have found that, owing to issues related to vanishing or exploding gradients and the degradation problem, deeper networks often perform worse than shallower networks [93,94]. The problem of vanishing or exploding gradients can be addressed through normalization techniques [88,94,95]. Techniques including Batch Normalization, Dropout, and regularization are incorporated into this model to address issues related to gradient vanishing and explosion. ...
Article
Full-text available
Control room operators encounter a substantial risk of mental fatigue, which can reduce their human reliability by diminishing concentration and responsiveness, leading to unsafe operations. There is value in detecting individuals' mental fatigue status in the workplace. This study introduces a new method for mental fatigue detection (MFD) that combines computer vision and machine learning. Traditional methods for MFD typically rely on multi-dimensional data for fatigue analysis and detection, which can be challenging to apply in real situations. Traditional methods such as the use of biological data, e.g., electrocardiograms, require operators to be in constant contact with sensors, whereas this study utilizes computer vision to collect facial data and a machine learning model to assess fatigue states. The developed machine learning method consists of both a Deep Residual Network and a Random Forest (DRN-RF). A comparison with existing MFD methods, including K Nearest Neighbors and Gradient Boosting Machine, has been carried out. The results show that the accuracy of the DRN-RF model reaches 94.2% and the deviation is 0.004. Evidently, the DRN-RF model demonstrates high accuracy and stability. Overall, the proposed method has the potential to contribute to improving the safety of process system operations, particularly in the aspect of human factor management.
... The DCCA models were trained for 5 learning epochs for simulation experiments and 20 learning epochs for RSA dataset experiments. To enhance training stability and convergence speed, we pre-processed each dataset such that all trace samples in a batch are normalized between 0 and 1 [IS15]. ...
Article
Full-text available
In order to protect against side-channel attacks, the masking countermeasure is widely considered. Its application to asymmetric cryptographic algorithms, such as RSA implementations, rendered multiple-trace aggregation inefficient and led to the development of single-trace horizontal attacks. Among the horizontal attacks proposed in the literature, many are based on the use of clustering techniques or statistical distinguishers to identify operand collisions. These attacks can be difficult to implement in practice, as they often require advanced trace pre-processing, including the selection of points of interest, a step that is particularly complex to perform in a non-profiling context. In recent years, numerous studies have shown the effectiveness of deep learning in security evaluation for conducting side-channel attacks. However, little attention has been given to its application in asymmetric cryptography and horizontal attack scenarios. Additionally, the majority of deep learning attacks tend to focus on profiling attacks, which involve a supervised learning phase. In this paper, we propose a new non-profiling horizontal attack using an unsupervised deep learning method called Deep Canonical Correlation Analysis. In this approach, we propose to use a siamese neural network to maximize the correlation between pairs of modular operation traces through canonical correlation analysis, projecting them into a highly correlated latent space that is more suitable for identifying operand collisions. Several experimental results, on simulated traces and a protected RSA implementation with up-to-date countermeasures, show that our proposal outperforms state-of-the-art attacks despite being simpler to implement. This suggests that the use of deep learning can be impactful for security evaluators, even in a non-profiling context and in a fully unsupervised way.
... Note that if we put all the channels into one group, we get Layer Normalization [1], where the normalization is applied to each instance of the batch across all channels together. Finally, note that the most widely known method, Batch Normalization [16], applies the normalization across each individual channel separately, but does so by considering all the instances in the batch together. ...
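The distinction drawn in this excerpt comes down to which axes the statistics are computed over; the short PyTorch example below makes the three cases concrete (GroupNorm with a single group behaves like Layer Normalization over all channels and spatial positions of each instance).

import torch

x = torch.randn(8, 32, 16, 16)  # (batch N, channels C, height H, width W)

# Batch Normalization: per channel, statistics over (N, H, W) -- all instances together
bn = torch.nn.BatchNorm2d(32)

# Layer Normalization (one group): per instance, over all channels and positions together
ln = torch.nn.GroupNorm(num_groups=1, num_channels=32)

# Group Normalization: per instance, channels split into groups (here 8 groups of 4)
gn = torch.nn.GroupNorm(num_groups=8, num_channels=32)

print(bn(x).shape, ln(x).shape, gn(x).shape)  # normalization never changes the shape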
Preprint
Full-text available
Much of the federated learning (FL) literature focuses on settings where local dataset statistics remain the same between training and testing time. Recent advances in domain generalization (DG) aim to use data from source (training) domains to train a model that generalizes well to data from unseen target (testing) domains. In this paper, we are motivated by two major gaps in existing work on FL and DG: (1) the lack of formal mathematical analysis of DG objectives and training processes; and (2) DG research in FL being limited to the conventional star-topology architecture. Addressing the second gap, we develop Decentralized Federated Domain Generalization with Style Sharing (StyleDDG), a fully decentralized DG algorithm designed to allow devices in a peer-to-peer network to achieve DG based on sharing style information inferred from their datasets. Additionally, we fill the first gap by providing the first systematic approach to mathematically analyzing style-based DG training optimization. We cast existing centralized DG algorithms within our framework, and employ their formalisms to model StyleDDG. Based on this, we obtain analytical conditions under which a sub-linear convergence rate of StyleDDG can be obtained. Through experiments on two popular DG datasets, we demonstrate that StyleDDG can obtain significant improvements in accuracy across target domains with minimal added communication overhead compared to decentralized gradient methods that do not employ style sharing.
... In the case of weight parameters, such as the weights in a convolutional layer, the sign configuration has a critical role in determining the functional mechanism of the layer as discussed in Wang et al. (2023b;a); Gadhikar & Burkholz (2024). By contrast, for parameters in a normalization layer, such as batch normalization (Ioffe & Szegedy, 2015) or layer normalization (Ba et al., 2016), the magnitude may be much more important. ...
Preprint
The Lottery Ticket Hypothesis (LTH) posits the existence of a sparse subnetwork (a.k.a. winning ticket) that can generalize comparably to its over-parameterized counterpart when trained from scratch. The common approach to finding a winning ticket is to preserve the original strong generalization through Iterative Pruning (IP) and transfer information useful for achieving the learned generalization by applying the resulting sparse mask to an untrained network. However, existing IP methods still struggle to generalize their observations beyond ad-hoc initialization and small-scale architectures or datasets, or they bypass these challenges by applying their mask to trained weights instead of initialized ones. In this paper, we demonstrate that the parameter sign configuration plays a crucial role in conveying useful information for generalization to any randomly initialized network. Through linear mode connectivity analysis, we observe that a sparse network trained by an existing IP method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved. To take a step closer to finding a winning ticket, we alleviate the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with initialized normalization layer parameters. Interestingly, across various architectures and datasets, we observe that any randomly initialized network can be optimized to exhibit low error barriers along the linear path to the sparse network trained by our method by inheriting its sparsity and parameter sign information, potentially achieving performance comparable to the original. The code is available at https://github.com/JungHunOh/AWS_ICLR2025.git
Article
Artificial Intelligence (AI) and deep learning models have revolutionized diagnosis, prognostication, and treatment planning by extracting complex patterns from medical images, enabling more accurate, personalized, and timely clinical decisions. Despite its promise, challenges such as image heterogeneity across different centers, variability in acquisition protocols and scanners, and sensitivity to artifacts hinder the reliability and clinical integration of deep learning models. Addressing these issues is critical for ensuring accurate and practical AI-powered neuroimaging applications. We reviewed and summarized the strategies for improving the robustness and generalizability of deep learning models for the segmentation and classification of neuroimages. This review follows a structured protocol, comprehensively searching Google Scholar, PubMed, and Scopus for studies on neuroimaging, task-specific applications, and model attributes. Peer-reviewed, English-language studies on brain imaging were included. The extracted data were analyzed to evaluate the implementation and effectiveness of these techniques. The study identifies key strategies to enhance deep learning in neuroimaging, including regularization, data augmentation, transfer learning, and uncertainty estimation. These approaches address major challenges such as data variability and domain shifts, improving model robustness and ensuring consistent performance across diverse clinical settings. The technical strategies summarized in this review can enhance the robustness and generalizability of deep learning models for segmentation and classification to improve their reliability for real-world clinical practice.
Article
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation methods have obvious limitations in dealing with long sequences from different modalities. To address above issues, we introduce a novel framework called MambaPro for multi-modal object ReID. To be specific, we first employ a Parallel Feed-Forward Adapter (PFA) for adapting CLIP to multi-modal object ReID. Then, we propose the Synergistic Residual Prompt (SRP) to guide the joint learning of multi-modal features. Finally, leveraging Mamba's superior scalability for long sequences, we introduce Mamba Aggregation (MA) to efficiently model interactions between different modalities. As a result, MambaPro could extract more robust features with lower complexity. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) validate the effectiveness of our proposed methods.
Article
Proxy-based metric learning has enhanced semantic similarity with class representatives and exhibited noteworthy performance in deep metric learning (DML) tasks. While these methods alleviate computational demands by learning instance-to-class relationships rather than instance-to-instance relationships, they often limit features to be class-specific, thereby degrading generalization performance for unseen classes. In this paper, we introduce a novel perspective called Disentangled Deep Metric Learning (DDML), grounded in the framework of information bottleneck, which applies class-agnostic regularization to existing DML methods. Unlike conventional NormSoftmax methods, which primarily emphasize distinct class-specific features, our DDML enables a diverse feature representation by seamlessly transitioning between class-specific features with the aid of class-agnostic features. It smooths decision boundaries, allowing unseen classes to have stable semantic representations in the embedding space. To achieve this, we learn disentangled representations of both class-specific and class-agnostic features in the context of DML. Empirical results demonstrate that our method addresses the limitations of conventional approaches. Our method easily integrates into existing proxy-based algorithms, consistently delivering improved performance.
Article
Full-text available
The article examines deep learning methods applied to regression problems in the manufacturing sector. The main focus is on comparing fully connected neural networks (MLP) and recurrent neural networks (RNN, LSTM, GRU) for forecasting key indicators: production volume, costs, equipment downtime, defect rates, and energy consumption. The characteristics of these models, their advantages, and their limitations are analyzed depending on the structure of the data and its temporal dependencies. Practical examples are considered of using regression models for optimizing production processes, budget planning, and resource management. Particular attention is paid to processing historical data, including time series, and to choosing a suitable neural network architecture for the task at hand. The article provides recommendations on applying MLP and RNN in various scenarios, taking into account computational resources, implementation complexity, and forecasting effectiveness. It concludes that MLP is appropriate for problems with limited temporal dependencies, while RNN is preferable for analyzing sequential data with pronounced dynamics. The results may be useful to specialists in data analysis, production management, and resource planning.
Article
In this paper, we explore how to develop salient object detection models using adder neural networks (ANNs), which are more energy efficient than convolutional neural networks (CNNs), especially for real-world applications. Based on our empirical studies, we show that directly replacing the convolutions in CNN-based models with adder layers leads to a substantial loss of activations in the decoder part. This makes the feature maps learned in the decoder lack pattern diversity and hence results in a significant performance drop. To alleviate this issue, by investigating the statistics of the feature maps produced by adder layers, we introduce a simple yet effective differential merging strategy to augment the feature representations learned by adder layers and present a simple baseline for SOD using ANNs. Experiments on popular salient object detection benchmarks demonstrate that our proposed method with a simple feature pyramid network (FPN) architecture achieves comparable performance to previous state-of-the-art CNN-based models and consumes much less energy. We hope this work could facilitate the development of ANNs in binary segmentation tasks.
Article
Model Inversion (MI) attacks, which reconstruct the training dataset of neural networks, pose significant privacy concerns in machine learning. Recent MI attacks have managed to reconstruct realistic label-level private data, such as the general appearance of a target person from all training images labeled as that person. Beyond label-level privacy, in this paper we show that sample-level privacy, the private information of a single target sample, is also important but under-explored in the MI literature due to the limitations of existing evaluation metrics. To address this gap, this study introduces a novel metric tailored for training-sample analysis, namely, the Diversity and Distance Composite Score (DDCS), which evaluates the reconstruction fidelity of each training sample by encompassing various MI attack attributes. This, in turn, enhances the precision of sample-level privacy assessments. Leveraging DDCS as a new evaluative lens, we observe that many training samples remain resilient against even the most advanced MI attacks. As such, we further propose a transfer learning framework that augments the generative capabilities of MI attackers through the integration of entropy loss and natural gradient descent. Extensive experiments verify the effectiveness of our framework on improving state-of-the-art MI attacks over various metrics including DDCS, coverage and FID. Finally, we demonstrate that DDCS can also be useful for MI defense, by identifying samples susceptible to MI attacks in an unsupervised manner.
Article
Full-text available
Underwater object detection remains a challenging task due to the presence of noise, lighting variations, and occlusions in underwater images. To address these challenges, this study proposes an improved underwater object detection model based on YOLOv9, integrating advanced attention mechanisms and a dilated large-kernel algorithm. Specifically, the model incorporates a residual attention block (RAB) to enhance local feature extraction and denoising capabilities. Additionally, a content-guided hybrid multi-attention fusion module is designed to improve contextual awareness and target focus. Finally, a dilated large-kernel network, GLSKNet, is employed to dynamically adjust the receptive field, making the model more suitable for detecting underwater objects of varying sizes, particularly small and blurred targets. Experimental results on the RUOD and DUO datasets demonstrate that our model outperforms several state-of-the-art models, achieving impressive mAPs (mean average precision) of 88.8% and 89.7% on the RUOD and DUO datasets, respectively. These findings underscore the effectiveness of attention mechanisms and the dilated large-kernel algorithm in enhancing underwater object detection performance. Our code and datasets can be found in https://github.com/down-with-me/RGM-YOLO.git
Article
Load distributing band (LDB) mechanical chest compression (CC) devices are used to treat out-of-hospital cardiac arrest (OHCA) patients. Mechanical CCs induce artifacts in the electrocardiogram (ECG) recorded by defibrillators, potentially leading to inaccurate cardiac rhythm analysis. A reliable analysis of the cardiac rhythm is essential for guiding resuscitation treatment and understanding, retrospectively, the patients' response to treatment. The aim of this study was to design a deep learning (DL)-based framework for automatic multiclass cardiac rhythm classification in the presence of CC artifacts during OHCA. Concretely, an automatic multiclass cardiac rhythm classification was addressed to distinguish the following types of rhythms: shockable (Sh), asystole (AS), and organized (OR) rhythms. A total of 15,479 segments (2406 Sh, 5481 AS, and 7592 OR) were extracted from 2058 patients during LDB CCs, of which 9666 were used to train the algorithms and 5813 to assess the performance. The proposed architecture consists of an adaptive filter for CC artifact suppression and a multiclass rhythm classifier. Two DL alternatives were considered for the multiclass classifier: convolutional neural networks (CNNs) and residual networks (ResNets). A traditional machine learning-based classifier, which incorporates the research conducted over the past two decades in ECG rhythm analysis using more than 90 state-of-the-art features, was used as a point of comparison. The unweighted mean of sensitivities, the unweighted mean of F1-scores, and the accuracy of the best method (ResNets) were 88.3%, 88.3%, and 88.2%, respectively. These results highlight the potential of DL-based methods to provide accurate cardiac rhythm diagnoses without interrupting mechanical CC therapy.
Article
Wheat is one of the most essential food crops globally, but diseases significantly threaten its yield and quality, resulting in considerable economic losses. The identification of wheat diseases faces challenges, such as interference from complex environments in the field, the inefficiency of traditional machine learning methods, and difficulty in deploying the existing deep learning models. To address these challenges, this study proposes a multi-scale feature fusion shuffle network model (MFFSNet) for wheat disease identification from complex environments in the field. MFFSNet incorporates a multi-scale feature extraction and fusion module (MFEF), utilizing inflated convolution to efficiently capture diverse features, and its main constituent units are improved by ShuffleNetV2 units. A dual-branch shuffle attention mechanism (DSA) is also integrated to enhance the model’s focus on critical features, reducing interference from complex backgrounds. The model is characterized by its smaller size and fast operation speed. The experimental results demonstrate that the proposed DSA attention mechanism outperforms the best-performing Squeeze-and-Excitation (SE) block by approximately 1% in accuracy, with the final model achieving 97.38% accuracy and 97.96% recall on the test set, which are higher than classical models such as GoogleNet, MobileNetV3, and Swin Transformer. In addition, the number of parameters of this model is only 0.45 M, one-third that of MobileNetV3 Small, which is very suitable for deploying on devices with limited memory resources, demonstrating great potential for practical applications in agricultural production.
Article
Full-text available
Composites are widely used in wind turbine blades due to their excellent strength-to-weight ratio and operational flexibility. However, wind turbines often operate in harsh environmental conditions that can lead to various types of damage, including abrasion, corrosion, fractures, cracks, and delamination. Early detection through structural health monitoring (SHM) is essential for maintaining the efficient and reliable operation of wind turbines, minimizing downtime and maintenance costs, and optimizing energy output. Further, damage detection and localization are challenging in curved composites due to their anisotropic nature, edge reflections, and the generation of higher harmonics. Previous work has focused on damage localization using deep-learning approaches. However, these models are computationally expensive, and multiple models need to be trained independently for various tasks such as damage classification, localization, and sizing identification. In addition, the data generated by AE waveforms at a minimum sampling rate of 1 MSPS is huge, calling for tinyML-enabled hardware that can run real-time ML models and reduce the amount of cloud storage required. TinyML hardware can run ML models efficiently with low power consumption. This paper presents a Hybrid Hierarchical Machine-Learning Model (HHMLM) that leverages acoustic emission (AE) data to identify, classify, and locate different types of damage using a single unified model. The AE data is collected using a single sensor, with damage simulated by artificial AE sources (pencil lead breaks) and low-velocity impacts. Additionally, simulated abrasion on the blade's leading edge resembles environmental wear. The HHMLM model achieved 96.4% overall accuracy, compared with 83.8% for separate conventional Convolutional Neural Network (CNN) models, while requiring less computation time. The developed SHM solution provides a more effective and practical approach for in-service monitoring of wind turbine blades, particularly in wind farm settings, with the potential for future wireless sensors with tinyML applications.
Article
Significant challenges continue to confront fuzzy systems (FS) in enhancing the performance of machine learning algorithms, overcoming the curse of dimensionality, and maintaining model interpretability. Mini-batch Gradient Descent (MBGD) is characterized by its fast convergence and strong generalization performance. However, its applications have generally been restricted to low-dimensional problems with small datasets. In this paper, we propose a novel deep-learning-based prediction method. This method optimizes deep neural-fuzzy systems (ODNFS) by considering the essential correlations of external and internal factors. Specifically, the Maximal Information Coefficient (MIC) is used to sort features based on their significance and eliminate the least relevant ones, and then a uniform regularization is introduced, which enforces consistency in the average normalized activation levels across rules. An improved MBGD technique with DropRule and AdaBound (MBGD-RDA) is put forward to train deep fuzzy systems, training each sub-FS layer by layer. Experiments on several datasets show that ODNFS can effectively balance efficiency, accuracy, and stability within the system, and can be used for training datasets of any size. The proposed ODNFS outperforms MBGD-RDA and the state-of-the-art methods in terms of accuracy and generalization, with fewer parameters and rules.
Article
The extensibility of dough and its resistance to extension (toughness) are important indicators, since they are directly linked to dough quality. Therefore, this paper used an independently developed device to blow sheeted dough, and a three-dimensional (3D) camera was then used to continuously collect point cloud images of the sheeted dough forming bubbles. After data collection, a rotation algorithm, a region of interest (ROI) extraction algorithm, and a statistical filtering algorithm were used to process the original point cloud images. Lastly, the oriented bounding box (OBB) algorithm was proposed to calculate the deformation height of each data point, and the point cloud image with the largest deformation depth was selected as the input to the 3D convolutional neural network (CNN) models. The Convolutional Block Attention Module (CBAM) was introduced into the 3D Visual Geometry Group 11 (Vgg11) model to build the enhanced Vgg11, and we compared it with other classical 3D CNN models (MobileNet, ResNet18, and Vgg11) by inputting the voxel-point-based data and the voxel-based data separately into these models. The results showed that the enhanced 3D Vgg11 model using voxel-point-based data was superior to the other models. For prediction of dough extensibility and toughness, the Rp was 0.893 and 0.878, respectively.
Article
Full-text available
Human cells consist of a complex hierarchy of components, many of which remain unexplored¹,². Here we construct a global map of human subcellular architecture through joint measurement of biophysical interactions and immunofluorescence images for over 5,100 proteins in U2OS osteosarcoma cells. Self-supervised multimodal data integration resolves 275 molecular assemblies spanning the range of 10⁻⁸ to 10⁻⁵ m, which we validate systematically using whole-cell size-exclusion chromatography and annotate using large language models³. We explore key applications in structural biology, yielding structures for 111 heterodimeric complexes and an expanded Rag–Ragulator assembly. The map assigns unexpected functions to 975 proteins, including roles for C18orf21 in RNA processing and DPP9 in interferon signalling, and identifies assemblies with multiple localizations or cell type specificity. It decodes paediatric cancer genomes⁴, identifying 21 recurrently mutated assemblies and implicating 102 validated new cancer proteins. The associated Cell Visualization Portal and Mapping Toolkit provide a reference platform for structural and functional cell biology.
Article
Power distribution systems frequently encounter various fault-causing events. Thus, prompt and accurate fault diagnosis is crucial for maintaining system stability and safety. This study presents an innovative residual block-convolutional block attention module-convolutional neural network (ResBlock-CBAM-CNN)-based method for fault cause diagnosis. To enhance diagnostic precision further, the proposed approach incorporates a multimodal data fusion model. This model combines raw on-site measurements, processed data, and external environmental information to extract relevant fault-related details. Empirical results show that the ResBlock-CBAM-CNN method, with data fusion, outperforms existing techniques significantly in fault identification accuracy. Additionally, t-SNE visualization of fault data validates the effectiveness of this approach. Unlike studies that rely on simulated datasets, this research uses real-world measurements, highlighting the practical applicability and value of the proposed model for fault cause diagnosis in power distribution systems.
Article
Full-text available
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly-sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
Conference Paper
Full-text available
Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach.
Article
Full-text available
We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed to learn. We motivate our work from the hypothesis that humans learn such intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset with 64x64 binary input images, each image containing three sprites. The final task is to decide whether all the sprites are the same or one of them is different. Sprites are pentomino tetris shapes and they are placed in an image at different locations using scaling and rotation transformations. The first part of the two-tiered MLP is pre-trained with intermediate-level targets being the presence of sprites at each location, while the second part takes the output of the first part as input and predicts the final task's target binary event. The two-tiered MLP architecture, with a few tens of thousands of examples, was able to learn the task perfectly, whereas all other algorithms (including unsupervised pre-training, but also traditional algorithms like SVMs, decision trees and boosting) perform no better than chance. We hypothesize that the optimization difficulty involved when the intermediate pre-training is not performed is due to the composition of two highly non-linear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by the observations of optimization problems with deep learning, presumably because of effective local minima.
Article
Full-text available
There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.
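A minimal sketch of the gradient-norm clipping strategy described in this abstract: when the global L2 norm of the gradient exceeds a threshold, the whole gradient is rescaled to that threshold. The threshold value and the list-of-arrays representation are illustrative assumptions.

import numpy as np

def clip_gradient_norm(grads, max_norm=1.0):
    """Rescale the gradient (a list of parameter-gradient arrays) whenever its
    global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads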
Conference Paper
Full-text available
We transform the outputs of each hidden neuron in a multi-layer perceptron network to be zero mean and zero slope, and use separate shortcut connections to model the linear dependencies instead. This transformation aims at separating the problems of learning the linear and nonlinear parts of the whole input-output mapping, which has many benefits. We study the theoretical properties of the transformations by noting that they make the Fisher information matrix closer to a diagonal matrix, and thus the standard gradient closer to the natural gradient. We experimentally confirm the usefulness of the transformations by noting that they make basic stochastic gradient learning competitive with state-of-the-art learning algorithms in speed, and that they also seem to help find solutions that generalize better. The experiments include both classification of handwritten digits with a 3-layer network and learning a low-dimensional representation for images using a 6-layer auto-encoder network. The transformations were beneficial in all cases, with and without regularization.
Article
Full-text available
A class of predictive densities is derived by weighting the observed samples in maximizing the log-likelihood function. This approach is effective in cases such as sample surveys or design of experiments, where the observed covariate follows a different distribution than that in the whole population. Under misspecification of the parametric model, the optimal choice of the weight function is asymptotically shown to be the ratio of the density function of the covariate in the population to that in the observations. This is the pseudo-maximum likelihood estimation of sample surveys. The optimality is defined by the expected Kullback–Leibler loss, and the optimal weight is obtained by considering the importance sampling identity. Under correct specification of the model, however, the ordinary maximum likelihood estimate (i.e. the uniform weight) is shown to be optimal asymptotically. For moderate sample size, the situation is in between the two extreme cases, and the weight function is selected by minimizing a variant of the information criterion derived as an estimate of the expected loss. The method is also applied to a weighted version of the Bayesian predictive density. Numerical examples as well as Monte-Carlo simulations are shown for polynomial regression. A connection with the robust parametric estimation is discussed.
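The weighting scheme described here, correcting for covariate shift by the density ratio, can be written compactly as a weighted negative log-likelihood; the sketch below assumes user-supplied density and log-likelihood callables (q_density, p_density, log_lik) and only illustrates the form of the estimator.

import numpy as np

def weighted_nll(params, xs, ys, log_lik, q_density, p_density):
    """Weighted maximum likelihood under covariate shift: each observation is
    weighted by w(x) = q(x) / p(x), the ratio of the target (population)
    covariate density to the density that generated the observations."""
    w = q_density(xs) / p_density(xs)
    return -np.sum(w * log_lik(params, xs, ys))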
Conference Paper
Full-text available
In this paper, we describe a nonlinear image representation based on divisive normalization that is designed to match the statistical properties of photographic images, as well as the perceptual sensitivity of biological visual systems. We decompose an image using a multi-scale oriented representation, and use Student's t as a model of the dependencies within local clusters of coefficients. We then show that normalization of each coefficient by the square root of a linear combination of the amplitudes of the coefficients in the cluster reduces statistical dependencies. We further show that the resulting divisive normalization transform is invertible and provide an efficient iterative inversion algorithm. Finally, we probe the statistical and perceptual advantages of this image representation by examining its robustness to added noise, and using it to enhance image contrast.
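A small sketch of the divisive normalization operation described in this abstract: each coefficient is divided by the square root of a constant plus a weighted combination of the squared amplitudes of the coefficients in its local cluster. The squared-amplitude form, the additive constant, and the uniform weights follow the common formulation and are assumptions rather than the paper's exact parameterization.

import numpy as np

def divisive_normalize(coeffs, weights, const=1.0):
    """Divide each coefficient by the square root of a constant plus a weighted
    sum of the squared amplitudes of the coefficients in its cluster."""
    denom = np.sqrt(const + weights @ (coeffs ** 2))
    return coeffs / denom

c = np.array([1.5, -0.3, 2.0, 0.7])
W = np.full((4, 4), 0.25)  # uniform weighting within the cluster (illustrative)
print(divisive_normalize(c, W))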
Article
Full-text available
We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. The adaptation, in essence, allows us to find needles in haystacks in the form of very predictive yet rarely observed features. Our paradigm stems from recent advances in online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies the task of setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We corroborate our theoretical results with experiments on a text classification task, showing substantial improvements for classification with sparse datasets.
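The commonly used diagonal form of this adaptive scheme (AdaGrad) is easy to state; the sketch below is a generic textbook version with arbitrary learning rate and epsilon, not the full proximal-function machinery analyzed in the paper.

    import numpy as np

    def adagrad_update(params, grad, accum, lr=0.01, eps=1e-8):
        # Accumulate squared gradients and scale each coordinate's step
        # by the inverse square root of its accumulated magnitude.
        accum = accum + grad ** 2
        params = params - lr * grad / (np.sqrt(accum) + eps)
        return params, accum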
Article
Full-text available
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
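The initialization scheme proposed here is usually quoted as the "normalized" (Glorot/Xavier) initialization; a minimal sketch, assuming the uniform variant with limit sqrt(6 / (fan_in + fan_out)):

    import numpy as np

    def glorot_uniform(fan_in, fan_out, rng=None):
        # Uniform on [-limit, limit], chosen to keep activation and gradient
        # variances roughly constant across layers.
        rng = np.random.default_rng() if rng is None else rng
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))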
Article
Full-text available
Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
Book
We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation approaches, and usage of multi-scale high-resolution images. On one of the most challenging computer vision benchmarks, the ImageNet classification challenge, our system has achieved the best result to date, with a top-5 error rate of 5.33%, a relative 20.0% improvement over the previous best result.
Conference Paper
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
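One way to obtain the random orthogonal initial conditions mentioned above is a QR decomposition of a Gaussian matrix; this sketch covers only the square case and is my own illustration, not the authors' implementation.

    import numpy as np

    def orthogonal_init(n, gain=1.0, rng=None):
        # Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix.
        rng = np.random.default_rng() if rng is None else rng
        q, r = np.linalg.qr(rng.standard_normal((n, n)))
        q = q * np.sign(np.diag(r))  # fix column signs so the distribution is uniform
        return gain * q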
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
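A minimal forward-pass sketch of dropout follows. Note it uses the "inverted" convention (scaling at training time), which is a common implementation choice rather than the test-time weight scaling described in the abstract; keep_prob is an arbitrary example value.

    import numpy as np

    def dropout_forward(x, keep_prob=0.5, train=True, rng=None):
        # During training, zero each unit independently with probability 1 - keep_prob
        # and rescale the survivors so the expected activation is unchanged.
        if not train:
            return x
        rng = np.random.default_rng() if rng is None else rng
        mask = rng.random(x.shape) < keep_prob
        return x * mask / keep_prob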
Article
Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.
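The update rule being discussed is classical momentum; the sketch below shows the update together with one slowly increasing schedule of the kind described (the specific constants are illustrative, not necessarily the paper's).

    import numpy as np

    def momentum_step(params, grad, velocity, lr, mu):
        # Classical momentum: accumulate a velocity, then step along it.
        velocity = mu * velocity - lr * grad
        return params + velocity, velocity

    def momentum_schedule(t, mu_max=0.99):
        # Momentum coefficient that rises slowly toward mu_max as training proceeds.
        return min(1.0 - 2.0 ** (-1.0 - np.log2(np.floor(t / 250.0) + 1.0)), mu_max)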
Article
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
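Both the activation and the initialization have simple closed forms; a minimal sketch (weights drawn with standard deviation sqrt(2 / fan_in), and a single shared slope parameter for PReLU):

    import numpy as np

    def prelu(x, a):
        # Parametric ReLU: identity for positive inputs, learned slope a for negative ones.
        return np.where(x > 0, x, a * x)

    def he_normal(fan_in, fan_out, rng=None):
        # Initialization derived for rectifier nonlinearities: std = sqrt(2 / fan_in).
        rng = np.random.default_rng() if rng is None else rng
        return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)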
Article
We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.
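The periodic averaging step itself is trivial; the sketch below only shows the averaging and broadcast, not the natural-gradient preconditioning (NG-SGD) that the paper credits for making it work well.

    import numpy as np

    def average_parameters(worker_params):
        # worker_params: list of parameter vectors, one per machine,
        # averaged periodically and sent back to every worker.
        return np.mean(np.stack(worker_params), axis=0)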
Conference Paper
Deep neural networks are typically optimized with stochastic gradient descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on analytic results showing that a non-zero mean of features is harmful for the optimization. We prove convergence of our algorithm in a convex setting. In our experiments we show that our proposed algorithm converges faster than SGD. Further, in contrast to earlier work, our algorithm allows for training models with a factorized structure from scratch. We found this structure to be very useful not only because it accelerates training and decoding, but also because it is a very effective means against overfitting. Combining our proposed optimization algorithm with this model structure, model size can be reduced by a factor of eight and still improvements in recognition error rate are obtained. Additional gains are obtained by improving the Newbob learning rate strategy.
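The analytic observation that non-zero feature means hurt optimization can be illustrated by centering the inputs to a linear layer; this toy sketch is my own illustration of the underlying idea, not the second-order algorithm proposed in the paper.

    import numpy as np

    def centered_linear(x, W, b):
        # Subtract the (batch-estimated) feature mean before the linear map,
        # so the effective features are zero-mean.
        mu = x.mean(axis=0, keepdims=True)
        return (x - mu) @ W + b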
Conference Paper
Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
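The noisy rectified linear approximation mentioned above is often written as max(0, x + N(0, sigmoid(x))); a minimal sketch, treating that form as an assumption about the intended unit:

    import numpy as np

    def noisy_relu(x, rng=None):
        # max(0, x + Gaussian noise whose variance is sigmoid(x)).
        rng = np.random.default_rng() if rng is None else rng
        sigma2 = 1.0 / (1.0 + np.exp(-x))
        return np.maximum(0.0, x + rng.standard_normal(x.shape) * np.sqrt(sigma2))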
Article
The singular value decomposition (SVD) is a factorization that is discontinuous on the subset of matrices having repeated singular values. In this paper the SVD is studied in the vicinity of this critical set. Each one-parameter C^k perturbation transversal to the critical set is shown to uniquely determine an SVD at the critical point that extends to an SVD along the perturbation path that is C^{k-1} in the perturbation parameter. Derivatives of the singular vectors at the critical point are found explicitly. Application is made to the effect on the singular vectors of perturbations from a matrix in the critical set and compared to the information provided by the sin(θ) theorem. Estimates of the derivative of the singular vectors are applied to inequalities involving the matrix absolute value, such as the generalized Araki-Yamagami inequality.
Article
A fundamental problem in neural network research, as well as in many other disciplines, is finding a suitable representation of multivariate data, i.e. random vectors. For reasons of computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data. In other words, each component of the representation is a linear combination of the original variables. Well-known linear transformation methods include principal component analysis, factor analysis, and projection pursuit. Independent component analysis (ICA) is a recently developed method in which the goal is to find a linear representation of non-Gaussian data so that the components are statistically independent, or as independent as possible. Such a representation seems to capture the essential structure of the data in many applications, including feature extraction and signal separation. In this paper, we present the basic theory and applications of ICA, and our recent work on the subject.
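For a quick sense of what ICA does in practice, scikit-learn's FastICA (one common estimator, chosen here purely for convenience) recovers non-Gaussian sources from linear mixtures up to permutation and scaling:

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 8, 2000)
    sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # two non-Gaussian sources
    mixing = np.array([[1.0, 0.5], [0.4, 1.2]])
    observed = sources @ mixing.T                            # observed linear mixtures
    estimated = FastICA(n_components=2, random_state=0).fit_transform(observed)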
Knowledge matters: Importance of prior information for optimization
  • Çağlar Gülçehre
  • Yoshua Bengio
Gülçehre, Çağlar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
  • K. He
  • X. Zhang
  • S. Ren
  • J. Sun
He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015.
A literature survey on
  • Jing Jiang
Jiang, Jing. A literature survey on