Article

Gradient-Based Learning Applied to Document Recognition

Authors: Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner

Abstract

Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
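To make the convolutional architecture discussed in the abstract concrete, below is a minimal PyTorch sketch of a LeNet-5-style network for 32x32 grayscale digit images, followed by one gradient-based training step. Layer sizes follow the commonly cited LeNet-5 description (conv, subsampling, conv, subsampling, fully connected layers); it is an illustrative approximation, not the authors' original implementation.

import torch
import torch.nn as nn

class LeNet5Like(nn.Module):
    """Minimal LeNet-5-style CNN sketch (approximation, not the original)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28, 6 feature maps
            nn.Tanh(),
            nn.AvgPool2d(2),                   # 28x28 -> 14x14 (subsampling)
            nn.Conv2d(6, 16, kernel_size=5),   # 14x14 -> 10x10, 16 feature maps
            nn.Tanh(),
            nn.AvgPool2d(2),                   # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),        # one unit per digit class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One gradient-based training step on a dummy batch of 32x32 grayscale digits.
model = LeNet5Like()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 1, 32, 32)
y = torch.randint(0, 10, (8,))
loss = loss_fn(model(x), y)
loss.backward()        # back-propagation computes the gradients
optimizer.step()       # gradient-based parameter update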
... Hence, this thesis would be centered on the more specific approach of generative DNN models that infer latent variables of gait data which, ideally, should characterize the gait data and serve as interpretable feature descriptors. Figure 1: Learnt 2D-latent space of MNIST hand-written digit images [21]. The digit images are decoded from, and overlaid on their corresponding 2D latent vectors. ...
... In other words, the application of VAE is theoretically sound for inference of latent variables that are representative of input gait data. Figure 1 shows an example of latent space visualization of MNIST hand-written digits [21] from the study of [22], where the 8 × 8 input digit images were encoded into the latent space (2D coordinates in the figure) and decoded to reconstruct the original images. The latent space was constrained to be 2-dimensional in the experiment. ...
Thesis
Full-text available
People with vertigo and balance disorders often exhibit abnormal gait patterns while walking, such as staggering or overly rigid motion. These pathological gait phenotypes are usually identified by observational assessment from clinicians or by measuring gait parameters such as stride length and frequency. In this study we aim to represent the dynamics of these gait phenotypes directly in a low-dimensional manifold that allows quantitative but still high-level descriptions of the gait patterns. In particular, we use a Variational Auto-encoder (VAE) with auxiliary classification tasks that learns latent feature variables automatically from the partially labelled gait data in a semi-supervised manner. The latent space of the VAE was then projected onto a 2D manifold by Uniform Manifold Approximation and Projection (UMAP) to visualize the clustering of different spatio-temporal gait dynamics. In the results, we found clear clustering of postural positions, walking speeds and gait instability as the represented spatio-temporal characteristics in the latent space. Such a representation of gait dynamics in a low-dimensional manifold would be useful as an objective measure of gait patterns for clinical diagnostic purposes.
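As an illustration of the latent-space idea referenced above (encoding digit images into a 2-dimensional latent space and decoding them back), here is a minimal PyTorch sketch of a variational autoencoder with a 2-D latent space. It is a generic VAE skeleton under assumed layer sizes, not the thesis' actual model, which additionally uses auxiliary classification and UMAP projection.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE sketch with a 2-D latent space (illustrative, assumed sizes)."""
    def __init__(self, input_dim: int = 784, hidden: int = 256, latent: int = 2):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden)
        self.mu = nn.Linear(hidden, latent)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden, latent)   # log-variance of q(z|x)
        self.dec1 = nn.Linear(latent, hidden)
        self.dec2 = nn.Linear(hidden, input_dim)

    def encode(self, x):
        h = torch.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return torch.sigmoid(self.dec2(torch.relu(self.dec1(z))))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decode(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the unit Gaussian prior.
    bce = F.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld

x = torch.rand(16, 784)                 # stand-in batch of flattened images in [0, 1]
recon, mu, logvar = TinyVAE()(x)
loss = vae_loss(recon, x, mu, logvar)
# The 2-D mu vectors can be plotted directly to visualize the latent space.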
... Error in the output is given as: Error = right answer - left answer. These error signals drive a process called gradient descent [15]. For each of the magic numbers, each of the feature pixels is adjusted up and down to see how the error changes. The adjustment depends on the error: for larger errors the outputs are adjusted more, for small errors they are adjusted only slightly, and if there is no error they need not be adjusted, indicating that the output is right. ...
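The adjust-and-observe procedure described in this excerpt amounts to estimating the gradient by finite differences and then taking a descent step proportional to it. A minimal NumPy sketch of that idea, with an invented toy linear model and squared-error loss standing in for any differentiable error:

import numpy as np

def loss(w, x, y):
    # Squared error of a tiny linear model; stands in for any differentiable error.
    return float(np.mean((x @ w - y) ** 2))

def finite_difference_step(w, x, y, lr=0.1, eps=1e-5):
    """Nudge each weight up and down, see how the error changes, then update."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_up, w_down = w.copy(), w.copy()
        w_up[i] += eps
        w_down[i] -= eps
        grad[i] = (loss(w_up, x, y) - loss(w_down, x, y)) / (2 * eps)
    return w - lr * grad   # larger errors yield larger gradients, hence larger steps

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w
w = np.zeros(3)
for _ in range(200):
    w = finite_difference_step(w, x, y)
print(w)   # approaches true_w as the error shrinks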
... And random search samples a given number of candidates from a parameter space with a specified distribution. A. LeNet-5 (Fig. 7: Architecture of LeNet-5 [15]). The aim of LeNet-5 [18] was to recognize handwritten digits. It uses average pooling instead of max pooling, and sigmoid or tanh to provide nonlinearity. ...
Conference Paper
Full-text available
Today most of the data flowing over the internet is visual data, so we need ways to interpret that data. This is where computer vision comes in. In this paper we describe what computer vision is, where it is used, the popular computer vision models, and why CNNs work best. The paper also describes how CNN models work and how machines are trained using CNN models. Further, we compare and explain classic CNN architectures. The paper also describes the machine learning library TensorFlow, which is used to build CNN models for image and text recognition.
... We evaluate the defense performance of BlackLight and PIHA on the four most common datasets: MNIST, CIFAR10, ImageNet and CelebaHQ. For MNIST, we trained a LeNet-5 [26] convolutional network for digit classification. For CIFAR10, ImageNet and CelebaHQ, we selected the targeted models that previous work has used [13]. ...
Preprint
Query-based black-box attacks have emerged as a significant threat to machine learning systems, where adversaries can manipulate the input queries to generate adversarial examples that can cause misclassification of the model. To counter these attacks, researchers have proposed Stateful Defense Models (SDMs) for detecting adversarial query sequences and rejecting queries that are "similar" to the history queries. Existing state-of-the-art (SOTA) SDMs (e.g., BlackLight and PIHA) have shown great effectiveness in defending against these attacks. However, recent studies have shown that they are vulnerable to Oracle-guided Adaptive Rejection Sampling (OARS) attacks, a stronger adaptive attack strategy that can be easily integrated with existing attack algorithms to evade the SDMs by generating queries with fine-tuned perturbation directions and step sizes, using the decision information leaked from the SDMs. In this paper, we propose a novel approach, Query Provenance Analysis (QPA), for more robust and efficient SDMs. QPA encapsulates the historical relationships among queries as the sequence feature to capture the fundamental difference between benign and adversarial query sequences. To utilize the query provenance, we propose an efficient query provenance analysis algorithm with dynamic management. We evaluate QPA compared with two baselines, BlackLight and PIHA, on four widely used datasets with six query-based black-box attack algorithms. The results show that QPA outperforms the baselines in terms of defense effectiveness and efficiency on both non-adaptive and adaptive attacks. Specifically, QPA reduces the Attack Success Rate (ASR) of OARS to 4.08%, compared to 77.63% and 87.72% for BlackLight and PIHA, respectively. Moreover, QPA also achieves 7.67x and 2.25x higher throughput than BlackLight and PIHA.
... The core experiments involve training Convolutional Neural Networks (CNNs) of varying sizes and complexities on two datasets: MNIST [15] and CIFAR-10 [36]. For the MNIST dataset, we employ LeNet-5 [37], composed of approximately 62 thousand parameters, and a modified version of VGG16 [60], denoted as VGG16*, consisting of 2.6 million parameters. VGG16* was specifically adapted for the MNIST dataset, a less demanding learning problem compared to ImageNet [54], for which VGG16 was designed. ...
Preprint
Full-text available
Driven by the ever-growing volume and decentralized nature of data, coupled with the escalating size of modern models, distributed deep learning (DDL) has been entrenched as the preferred paradigm for training. However, frequent synchronization of DL models, encompassing millions to many billions of parameters, creates a communication bottleneck, severely hindering scalability. Worse yet, DDL algorithms typically waste valuable bandwidth, and make themselves less practical in bandwidth-constrained federated settings, by relying on overly simplistic, periodic, and rigid synchronization schedules. To address these shortcomings, we propose Federated Dynamic Averaging (FDA), a communication-efficient DDL strategy that dynamically triggers synchronization based on the value of the model variance. Through extensive experiments across a wide range of learning tasks we demonstrate that FDA reduces communication cost by orders of magnitude, compared to both traditional and cutting-edge communication-efficient algorithms. Remarkably, FDA achieves this without sacrificing convergence speed - in stark contrast to the trade-offs encountered in the field. Additionally, we show that FDA maintains robust performance across diverse data heterogeneity settings.
... In this study, we performed training using the MNIST [26] and Fashion-MNIST [27] datasets. ...
Preprint
Full-text available
Successive image generation using cyclic transformations is demonstrated by extending the CycleGAN model to transform images among three different categories. Repeated application of the trained generators produces sequences of images that transition among the different categories. The generated image sequences occupy a more limited region of the image space compared with the original training dataset. Quantitative evaluation using precision and recall metrics indicates that the generated images have high quality but reduced diversity relative to the training dataset. Such successive generation processes are characterized as chaotic dynamics in terms of dynamical system theory. Positive Lyapunov exponents estimated from the generated trajectories confirm the presence of chaotic dynamics, with the Lyapunov dimension of the attractor found to be comparable to the intrinsic dimension of the training data manifold. The results suggest that chaotic dynamics in the image space defined by the deep generative model contribute to the diversity of the generated images, constituting a novel approach for multi-class image generation. This model can be interpreted as an extension of classical associative memory to perform hetero-association among image categories.
... Thus, at the early stage, researchers focus on fabricating the architecture to improve model capacity. GRU4Rec [29] and Caser [37] apply RNNs and CNNs [38] for sequence modeling. Later, inspired by the great success of self-attention [25] in natural language processing, SASRec [5] and Bert4Rec [33] verify its potential in SRS. ...
Preprint
Sequential recommendation systems (SRS) serve the purpose of predicting users' subsequent preferences based on their past interactions and have been applied across various domains such as e-commerce and social networking platforms. However, practical SRS encounters challenges due to the fact that most users engage with only a limited number of items, while the majority of items are seldom consumed. These challenges, termed as the long-tail user and long-tail item dilemmas, often create obstacles for traditional SRS methods. Mitigating these challenges is crucial as they can significantly impact user satisfaction and business profitability. While some research endeavors have alleviated these issues, they still grapple with issues such as seesaw or noise stemming from the scarcity of interactions. The emergence of large language models (LLMs) presents a promising avenue to address these challenges from a semantic standpoint. In this study, we introduce the Large Language Models Enhancement framework for Sequential Recommendation (LLM-ESR), which leverages semantic embeddings from LLMs to enhance SRS performance without increasing computational overhead. To combat the long-tail item challenge, we propose a dual-view modeling approach that fuses semantic information from LLMs with collaborative signals from traditional SRS. To address the long-tail user challenge, we introduce a retrieval augmented self-distillation technique to refine user preference representations by incorporating richer interaction data from similar users. Through comprehensive experiments conducted on three authentic datasets using three widely used SRS models, our proposed enhancement framework demonstrates superior performance compared to existing methodologies.
... Traditionally, the search for activation functions in neural networks has relied heavily on trial and error, with researchers exploring various functions and evaluating their performance empirically [LeCun et al., 1998, Nair and Hinton, 2010, Clevert et al., 2016]. In this paper, we aim to identify the favorable properties of existing activation functions to guide the design of new and improved activation functions. ...
Preprint
Full-text available
Activation functions are core components of all deep learning architectures. Currently, the most popular activation functions are smooth ReLU variants like GELU and SiLU. These are self-gated activation functions where the range of the gating function is between zero and one. In this paper, we explore the viability of using arctan as a gating mechanism. A self-gated activation function that uses arctan as its gating function has a monotonically increasing first derivative. To make this activation function competitive, it is necessary to introduce a trainable parameter for every MLP block to expand the range of the gating function beyond zero and one. We find that this technique also improves existing self-gated activation functions. We conduct an empirical evaluation of Expanded ArcTan Linear Unit (xATLU), Expanded GELU (xGELU), and Expanded SiLU (xSiLU) and show that they outperform existing activation functions within a transformer architecture. Additionally, expanded gating ranges show promising results in improving first-order Gated Linear Units (GLU).
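To make the gating idea in this abstract concrete: a self-gated activation multiplies the input by a gate whose range is (0, 1), and a trainable parameter can expand that range. The sketch below uses an arctan gate consistent with that description; the exact parameterization of xATLU in the cited preprint is not given here, so treat the formula as an assumption.

import math
import torch
import torch.nn as nn

class ArctanGatedUnit(nn.Module):
    """Self-gated activation with an arctan gate and a trainable range-expansion
    parameter alpha (illustrative; the exact xATLU formula is assumed here)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # alpha = 0 keeps the gate in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.atan(x) / math.pi + 0.5                  # monotone gate in (0, 1)
        expanded = (1.0 + self.alpha) * (gate - 0.5) + 0.5    # range grows with alpha
        return x * expanded

x = torch.linspace(-3, 3, 7)
print(ArctanGatedUnit()(x))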
... A Convolutional Neural Network (CNN) is an example of a specialized neural network architecture that incorporates information about the invariances of two-dimensional shapes and applies constraints on the weights using local connectivity patterns [6]. A CNN is a collection of neurons organized in a non-cyclic graph. ...
Conference Paper
In this study, a deep learning model aiming to detect the cooking level of chicken doner was developed using a neural network-based approach. The developed model is able to analyze the color and texture changes on the doner surface in detail, effectively monitor the cooking process and accurately detect the cooking level, which is critical for the prevention of foodborne diseases. The model has been trained to ensure high accuracy in determining the cooking level and increase efficiency in kitchen operations. The study involved collecting images of chicken doner at different cooking stages, extracting meaningful data from these images using image processing and machine learning methods, training classification algorithms and effectively predicting the cooking level. The planned outputs of this study aim to optimize cooking processes, improve food safety, and support health standards for both home users and professional kitchens.
... In 1998, the name convolutional neural network originated with the design of LeNet-5 for the handwritten digit recognition task by Yann LeCun [3]. The architecture of LeNet-5 has two convolutional layers, each connected to a pooling layer, plus two fully connected layers. ...
Article
Full-text available
With the increasing use of Artificial Neural Networks (ANNs), machine learning has taken a forceful turn in recent years. One of the most spectacular kinds of ANN design is the Convolutional Neural Network (CNN), a class of neural networks that has proven very effective in image recognition, processing, and classification. This paper proposes a convolutional neural network model for image classification, and the backpropagation of the proposed CNN model is derived. The experimental results show that the proposed CNN model can detect and classify images as face or non-face with a training accuracy of 99% and a validation accuracy of 98%.
Preprint
Sharpness-Aware Minimization (SAM) has emerged as a promising alternative optimizer to stochastic gradient descent (SGD). The originally-proposed motivation behind SAM was to bias neural networks towards flatter minima that are believed to generalize better. However, recent studies have shown conflicting evidence on the relationship between flatness and generalization, suggesting that flatness does not fully explain SAM's success. Sidestepping this debate, we identify an orthogonal effect of SAM that is beneficial out-of-distribution: we argue that SAM implicitly balances the quality of diverse features. SAM achieves this effect by adaptively suppressing well-learned features, which gives the remaining features an opportunity to be learned. We show that this mechanism is beneficial in datasets that contain redundant or spurious features, where SGD falls for the simplicity bias and would not otherwise learn all available features. Our insights are supported by experiments on real data: we demonstrate that SAM improves the quality of features in datasets containing redundant or spurious features, including CelebA, Waterbirds, CIFAR-MNIST, and DomainBed.
Article
Full-text available
This paper describes the development of an algorithm for verification of signatures written on a touch-sensitive pad. The signature verification algorithm is based on an artificial neural network. The novel network presented here, called a “Siamese” time delay neural network, consists of two identical networks joined at their output. During training the network learns to measure the similarity between pairs of signatures. When used for verification, only one half of the Siamese network is evaluated. The output of this half network is the feature vector for the input signature. Verification consists of comparing this feature vector with a stored feature vector for the signer. Signatures closer than a chosen threshold to this stored representation are accepted, all other signatures are rejected as forgeries. System performance is illustrated with experiments performed in the laboratory.
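A minimal sketch of the verification step described above: both signatures pass through the same weight-shared network, and the candidate's feature vector is compared against the stored one with a distance threshold. The layer sizes and the cosine-distance measure below are placeholders for illustration, not the paper's time-delay architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared feature extractor: both signatures go through the same weights.
embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))

def verify(candidate: torch.Tensor, stored_feature: torch.Tensor,
           threshold: float = 0.5) -> bool:
    """Accept the signature if its feature vector is close to the stored one."""
    feature = embed(candidate)                      # only one half of the Siamese pair
    dist = 1.0 - F.cosine_similarity(feature, stored_feature, dim=-1)
    return bool(dist.item() < threshold)

# Enrollment: store the feature vector for the genuine signer.
genuine = torch.randn(1, 128)      # placeholder for a preprocessed signature
stored = embed(genuine).detach()

# Verification of a new signature against the stored representation.
print(verify(genuine + 0.01 * torch.randn(1, 128), stored))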
Chapter
Threshold functions and related operators are widely used as basic elements of adaptive and associative networks [Nakano 72, Amari 72, Hopfield 82]. There exist numerous learning rules for finding a set of weights to achieve a particular correspondence between input-output pairs. But early works in the field have shown that the number of threshold functions (or linearly separable functions) in N binary variables is small compared to the number of all possible boolean mappings in N variables, especially if N is large. This problem is one of the main limitations of most neural network models where the state is fully specified by the environment during learning: they can only learn linearly separable functions of their inputs. Moreover, a learning procedure which requires the outside world to specify the state of every neuron during the learning session can hardly be considered a general learning rule, because in real-world conditions only partial information about the “ideal” network state for each task is available from the environment. It is possible to use a set of so-called “hidden units” [Hinton, Sejnowski, Ackley 84], without direct interaction with the environment, which can compute intermediate predicates. Unfortunately, the global response depends on the output of a particular hidden unit in a highly non-linear way; moreover, the nature of this dependence is influenced by the states of the other cells.
Article
This paper addresses the problem of improving the accuracy of an hypothesis output by a learning algorithm in the distribution-free (PAC) learning model. A concept class is learnable (or strongly learnable) if, given access to a source of examples of the unknown concept, the learner with high probability is able to output an hypothesis that is correct on all but an arbitrarily small fraction of the instances. The concept class is weakly learnable if the learner can produce an hypothesis that performs only slightly better than random guessing. In this paper, it is shown that these two notions of learnability are equivalent. A method is described for converting a weak learning algorithm into one that achieves arbitrarily high accuracy. This construction may have practical applications as a tool for efficiently converting a mediocre learning algorithm into one that performs extremely well. In addition, the construction has some interesting theoretical consequences, including a set of general upper bounds on the complexity of any strong learning algorithm as a function of the allowed error ε.
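The weak-to-strong conversion described in this abstract is the origin of boosting. The best-known later instantiation of the idea is AdaBoost, sketched below in NumPy with decision stumps as the weak learner; it is shown only to illustrate how reweighting examples focuses the weak learner on its mistakes, and it is not the specific construction analyzed in this paper.

import numpy as np

def train_stump(x, y, w):
    """Weak learner: best single-feature threshold classifier under weights w."""
    best = (None, None, 1, np.inf)              # (feature, threshold, sign, error)
    for j in range(x.shape[1]):
        for t in np.unique(x[:, j]):
            for s in (1, -1):
                pred = s * np.where(x[:, j] > t, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(x, y, rounds=20):
    """Combine weak (slightly better than random) stumps into a strong classifier."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        j, t, s, err = train_stump(x, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak hypothesis
        pred = s * np.where(x[:, j] > t, 1, -1)
        w = w * np.exp(-alpha * y * pred)        # upweight the examples it got wrong
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, x):
    votes = sum(a * s * np.where(x[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(votes)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)       # toy target
model = adaboost(x, y)
print(np.mean(predict(model, x) == y))           # training accuracy of the ensemble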