Article

Gradient-Based Learning Applied to Document Recognition

Authors: Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner

Abstract

Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
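To make the convolutional architecture discussed in the abstract concrete, below is a minimal PyTorch sketch of a LeNet-5-style network for 32x32 grayscale digit images, followed by one gradient-based training step. Layer sizes follow the commonly cited LeNet-5 description (conv, subsampling, conv, subsampling, fully connected layers); it is an illustrative approximation, not the authors' original implementation.

import torch
import torch.nn as nn

class LeNet5Like(nn.Module):
    """Minimal LeNet-5-style CNN sketch (approximation, not the original)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28, 6 feature maps
            nn.Tanh(),
            nn.AvgPool2d(2),                   # 28x28 -> 14x14 (subsampling)
            nn.Conv2d(6, 16, kernel_size=5),   # 14x14 -> 10x10, 16 feature maps
            nn.Tanh(),
            nn.AvgPool2d(2),                   # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),        # one unit per digit class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One gradient-based training step on a dummy batch of 32x32 grayscale digits.
model = LeNet5Like()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 1, 32, 32)
y = torch.randint(0, 10, (8,))
loss = loss_fn(model(x), y)
loss.backward()        # back-propagation computes the gradients
optimizer.step()       # gradient-based parameter update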
... Hence, this thesis would be centered on the more specific approach of generative DNN models that infer latent variables of gait data which, ideally, should characterize the gait data and serve as interpretable feature descriptors. Figure 1: Learnt 2D-latent space of MNIST hand-written digit images [21]. The digit images are decoded from, and overlaid on their corresponding 2D latent vectors. ...
... In other words, the application of VAE is theoretically sound for inference of latent variables that are representative of input gait data. Figure 1 shows an example of latent space visualization of MNIST hand-written digits [21] from the study of [22], where the 8 × 8 input digit images were encoded into the latent space (2D coordinates in the figure) and decoded to reconstruct the original images. The latent space was constrained to be 2-dimensional in the experiment. ...
Thesis
Full-text available
People with vertigo and balance disorders often exhibit abnormal gait patterns while walking, such as staggering or overly rigid motion. These pathological gait phenotypes are usually identified by observational assessment from clinicians or by measuring gait parameters such as stride length and frequency. In this study we aim to represent the dynamics of these gait phenotypes directly in a low-dimensional manifold that allows quantitative but still high-level descriptions of the gait patterns. In particular, we use a Variational Auto-encoder (VAE) with auxiliary classification tasks that learns latent feature variables automatically from the partially labelled gait data in a semi-supervised manner. The latent space of the VAE was then projected onto a 2D manifold by Uniform Manifold Approximation and Projection (UMAP) to visualize the clustering of different spatio-temporal gait dynamics. In the results, we found clear clustering of postural positions, walking speeds and gait instability as the represented spatio-temporal characteristics in the latent space. Such a representation of gait dynamics in a low-dimensional manifold would be useful as an objective measure of gait patterns for clinical diagnostic purposes.
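As an illustration of the latent-space idea referenced above (encoding digit images into a 2-dimensional latent space and decoding them back), here is a minimal PyTorch sketch of a variational autoencoder with a 2-D latent space. It is a generic VAE skeleton under assumed layer sizes, not the thesis' actual model, which additionally uses auxiliary classification and UMAP projection.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE sketch with a 2-D latent space (illustrative, assumed sizes)."""
    def __init__(self, input_dim: int = 784, hidden: int = 256, latent: int = 2):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden)
        self.mu = nn.Linear(hidden, latent)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden, latent)   # log-variance of q(z|x)
        self.dec1 = nn.Linear(latent, hidden)
        self.dec2 = nn.Linear(hidden, input_dim)

    def encode(self, x):
        h = torch.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return torch.sigmoid(self.dec2(torch.relu(self.dec1(z))))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decode(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the unit Gaussian prior.
    bce = F.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld

x = torch.rand(16, 784)                 # stand-in batch of flattened images in [0, 1]
recon, mu, logvar = TinyVAE()(x)
loss = vae_loss(recon, x, mu, logvar)
# The 2-D mu vectors can be plotted directly to visualize the latent space.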
... Error in the output is given as: Error = right answer - left answer. These error signals drive a process called gradient descent [15]. For each of the magic numbers, each of the feature pixels is adjusted up and down to see how the error changes. The adjustment depends on the error: for larger errors the outputs are adjusted more, for small errors they are adjusted only slightly, and if there is no error they need not be adjusted, indicating that the output is right. ...
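The adjust-and-observe procedure described in this excerpt amounts to estimating the gradient by finite differences and then taking a descent step proportional to it. A minimal NumPy sketch of that idea, with an invented toy linear model and squared-error loss standing in for any differentiable error:

import numpy as np

def loss(w, x, y):
    # Squared error of a tiny linear model; stands in for any differentiable error.
    return float(np.mean((x @ w - y) ** 2))

def finite_difference_step(w, x, y, lr=0.1, eps=1e-5):
    """Nudge each weight up and down, see how the error changes, then update."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_up, w_down = w.copy(), w.copy()
        w_up[i] += eps
        w_down[i] -= eps
        grad[i] = (loss(w_up, x, y) - loss(w_down, x, y)) / (2 * eps)
    return w - lr * grad   # larger errors yield larger gradients, hence larger steps

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w
w = np.zeros(3)
for _ in range(200):
    w = finite_difference_step(w, x, y)
print(w)   # approaches true_w as the error shrinks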
... And random search samples a given number of candidates from a parameter space with a specified distribution. A. LeNet-5 (Fig. 7: Architecture of LeNet-5 [15]). The aim of LeNet-5 [18] was to recognize handwritten digits. It uses average pooling instead of max pooling, and sigmoid or tanh to provide nonlinearity. ...
Conference Paper
Full-text available
Today most of the data flowing over the internet is visual data, so we need ways to interpret that data. This is where computer vision comes in. In this paper we describe what computer vision is, where it is used, the popular computer vision models, and why CNNs work best. The paper also describes how CNN models work and how machines are trained using CNN models. Further, we compare and explain classic CNN architectures. The paper also describes the machine learning library TensorFlow, which is used to build CNN models for image and text recognition.
... We evaluate the defense performance of BlackLight and PIHA on the four most common datasets: MNIST, CIFAR10, ImageNet and CelebaHQ. For MNIST, we trained a LeNet-5 [26] convolutional network for digit classification. For CIFAR10, ImageNet and CelebaHQ, we selected the targeted models that previous work has used [13]. ...
Preprint
Query-based black-box attacks have emerged as a significant threat to machine learning systems, where adversaries can manipulate the input queries to generate adversarial examples that can cause misclassification of the model. To counter these attacks, researchers have proposed Stateful Defense Models (SDMs) for detecting adversarial query sequences and rejecting queries that are "similar" to the history queries. Existing state-of-the-art (SOTA) SDMs (e.g., BlackLight and PIHA) have shown great effectiveness in defending against these attacks. However, recent studies have shown that they are vulnerable to Oracle-guided Adaptive Rejection Sampling (OARS) attacks, a stronger adaptive attack strategy that can be easily integrated with existing attack algorithms to evade the SDMs by generating queries with fine-tuned perturbation directions and step sizes, using the decision information leaked from the SDMs. In this paper, we propose a novel approach, Query Provenance Analysis (QPA), for more robust and efficient SDMs. QPA encapsulates the historical relationships among queries as the sequence feature to capture the fundamental difference between benign and adversarial query sequences. To utilize the query provenance, we propose an efficient query provenance analysis algorithm with dynamic management. We evaluate QPA compared with two baselines, BlackLight and PIHA, on four widely used datasets with six query-based black-box attack algorithms. The results show that QPA outperforms the baselines in terms of defense effectiveness and efficiency on both non-adaptive and adaptive attacks. Specifically, QPA reduces the Attack Success Rate (ASR) of OARS to 4.08%, compared to 77.63% and 87.72% for BlackLight and PIHA, respectively. Moreover, QPA also achieves 7.67x and 2.25x higher throughput than BlackLight and PIHA.
... The core experiments involve training Convolutional Neural Networks (CNNs) of varying sizes and complexities on two datasets: MNIST [15] and CIFAR-10 [36]. For the MNIST dataset, we employ LeNet-5 [37], composed of approximately 62 thousand parameters, and a modified version of VGG16 [60], denoted as VGG16*, consisting of 2.6 million parameters. VGG16* was specifically adapted for the MNIST dataset, a less demanding learning problem compared to ImageNet [54], for which VGG16 was designed. ...
Preprint
Full-text available
Driven by the ever-growing volume and decentralized nature of data, coupled with the escalating size of modern models, distributed deep learning (DDL) has been entrenched as the preferred paradigm for training. However, frequent synchronization of DL models, encompassing millions to many billions of parameters, creates a communication bottleneck, severely hindering scalability. Worse yet, DDL algorithms typically waste valuable bandwidth, and make themselves less practical in bandwidth-constrained federated settings, by relying on overly simplistic, periodic, and rigid synchronization schedules. To address these shortcomings, we propose Federated Dynamic Averaging (FDA), a communication-efficient DDL strategy that dynamically triggers synchronization based on the value of the model variance. Through extensive experiments across a wide range of learning tasks we demonstrate that FDA reduces communication cost by orders of magnitude, compared to both traditional and cutting-edge communication-efficient algorithms. Remarkably, FDA achieves this without sacrificing convergence speed - in stark contrast to the trade-offs encountered in the field. Additionally, we show that FDA maintains robust performance across diverse data heterogeneity settings.
... In this study, we performed training using the MNIST [26] and Fashion-MNIST [27] datasets. ...
Preprint
Full-text available
Successive image generation using cyclic transformations is demonstrated by extending the CycleGAN model to transform images among three different categories. Repeated application of the trained generators produces sequences of images that transition among the different categories. The generated image sequences occupy a more limited region of the image space compared with the original training dataset. Quantitative evaluation using precision and recall metrics indicates that the generated images have high quality but reduced diversity relative to the training dataset. Such successive generation processes are characterized as chaotic dynamics in terms of dynamical system theory. Positive Lyapunov exponents estimated from the generated trajectories confirm the presence of chaotic dynamics, with the Lyapunov dimension of the attractor found to be comparable to the intrinsic dimension of the training data manifold. The results suggest that chaotic dynamics in the image space defined by the deep generative model contribute to the diversity of the generated images, constituting a novel approach for multi-class image generation. This model can be interpreted as an extension of classical associative memory to perform hetero-association among image categories.
... Thus, at the early stage, researchers focus on fabricating the architecture to improve model capacity. GRU4Rec [29] and Caser [37] apply RNNs and CNNs [38] for sequence modeling. Later, inspired by the great success of self-attention [25] in natural language processing, SASRec [5] and Bert4Rec [33] verify its potential in SRS. ...
Preprint
Sequential recommendation systems (SRS) serve the purpose of predicting users' subsequent preferences based on their past interactions and have been applied across various domains such as e-commerce and social networking platforms. However, practical SRS encounters challenges due to the fact that most users engage with only a limited number of items, while the majority of items are seldom consumed. These challenges, termed as the long-tail user and long-tail item dilemmas, often create obstacles for traditional SRS methods. Mitigating these challenges is crucial as they can significantly impact user satisfaction and business profitability. While some research endeavors have alleviated these issues, they still grapple with issues such as seesaw or noise stemming from the scarcity of interactions. The emergence of large language models (LLMs) presents a promising avenue to address these challenges from a semantic standpoint. In this study, we introduce the Large Language Models Enhancement framework for Sequential Recommendation (LLM-ESR), which leverages semantic embeddings from LLMs to enhance SRS performance without increasing computational overhead. To combat the long-tail item challenge, we propose a dual-view modeling approach that fuses semantic information from LLMs with collaborative signals from traditional SRS. To address the long-tail user challenge, we introduce a retrieval augmented self-distillation technique to refine user preference representations by incorporating richer interaction data from similar users. Through comprehensive experiments conducted on three authentic datasets using three widely used SRS models, our proposed enhancement framework demonstrates superior performance compared to existing methodologies.
... Traditionally, the search for activation functions in neural networks has relied heavily on trial and error, with researchers exploring various functions and evaluating their performance empirically [LeCun et al., 1998, Nair and Hinton, 2010, Clevert et al., 2016]. In this paper, we aim to identify the favorable properties of existing activation functions to guide the design of new and improved activation functions. ...
Preprint
Full-text available
Activation functions are core components of all deep learning architectures. Currently, the most popular activation functions are smooth ReLU variants like GELU and SiLU. These are self-gated activation functions where the range of the gating function is between zero and one. In this paper, we explore the viability of using arctan as a gating mechanism. A self-gated activation function that uses arctan as its gating function has a monotonically increasing first derivative. To make this activation function competitive, it is necessary to introduce a trainable parameter for every MLP block to expand the range of the gating function beyond zero and one. We find that this technique also improves existing self-gated activation functions. We conduct an empirical evaluation of Expanded ArcTan Linear Unit (xATLU), Expanded GELU (xGELU), and Expanded SiLU (xSiLU) and show that they outperform existing activation functions within a transformer architecture. Additionally, expanded gating ranges show promising results in improving first-order Gated Linear Units (GLU).
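To make the gating idea in this abstract concrete: a self-gated activation multiplies the input by a gate whose range is (0, 1), and a trainable parameter can expand that range. The sketch below uses an arctan gate consistent with that description; the exact parameterization of xATLU in the cited preprint is not given here, so treat the formula as an assumption.

import math
import torch
import torch.nn as nn

class ArctanGatedUnit(nn.Module):
    """Self-gated activation with an arctan gate and a trainable range-expansion
    parameter alpha (illustrative; the exact xATLU formula is assumed here)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # alpha = 0 keeps the gate in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.atan(x) / math.pi + 0.5                  # monotone gate in (0, 1)
        expanded = (1.0 + self.alpha) * (gate - 0.5) + 0.5    # range grows with alpha
        return x * expanded

x = torch.linspace(-3, 3, 7)
print(ArctanGatedUnit()(x))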
... A Convolutional Neural Network (CNN) is an example of a specialized neural network architecture that incorporates information about the invariances of two-dimensional shapes and applies constraints on the weights using local connectivity patterns [6]. A CNN is a collection of neurons organized in a non-cyclic graph. ...
Conference Paper
In this study, a deep learning model aiming to detect the cooking level of chicken doner was developed using a neural network-based approach. The developed model is able to analyze the color and texture changes on the doner surface in detail, effectively monitor the cooking process and accurately detect the cooking level, which is critical for the prevention of foodborne diseases. The model has been trained to ensure high accuracy in determining the cooking level and increase efficiency in kitchen operations. The study involved collecting images of chicken doner at different cooking stages, extracting meaningful data from these images using image processing and machine learning methods, training classification algorithms and effectively predicting the cooking level. The planned outputs of this study aim to optimize cooking processes, improve food safety, and support health standards for both home users and professional kitchens.
... In 1998, the name convolutional neural network originated with the design of LeNet-5 for the handwritten digit recognition task by Yann LeCun [3]. The architecture of LeNet-5 has two convolutional layers, each connected to a pooling layer, plus two fully connected layers. ...
Article
Full-text available
With the increasing use of Artificial Neural Networks (ANNs), machine learning has taken a forceful turn in recent years. One of the most spectacular kinds of ANN design is the Convolutional Neural Network (CNN), a class of neural networks that has proven very effective in image recognition, processing, and classification. This paper proposes a convolutional neural network model for image classification, and the backpropagation of the proposed CNN model is derived. The experimental results show that the proposed CNN model can detect and classify images as face or non-face with a training accuracy of 99% and a validation accuracy of 98%.
Preprint
Sharpness-Aware Minimization (SAM) has emerged as a promising alternative optimizer to stochastic gradient descent (SGD). The originally-proposed motivation behind SAM was to bias neural networks towards flatter minima that are believed to generalize better. However, recent studies have shown conflicting evidence on the relationship between flatness and generalization, suggesting that flatness does not fully explain SAM's success. Sidestepping this debate, we identify an orthogonal effect of SAM that is beneficial out-of-distribution: we argue that SAM implicitly balances the quality of diverse features. SAM achieves this effect by adaptively suppressing well-learned features, which gives the remaining features an opportunity to be learned. We show that this mechanism is beneficial in datasets that contain redundant or spurious features, where SGD falls for the simplicity bias and would not otherwise learn all available features. Our insights are supported by experiments on real data: we demonstrate that SAM improves the quality of features in datasets containing redundant or spurious features, including CelebA, Waterbirds, CIFAR-MNIST, and DomainBed.
Article
Full-text available
This paper describes the development of an algorithm for verification of signatures written on a touch-sensitive pad. The signature verification algorithm is based on an artificial neural network. The novel network presented here, called a “Siamese” time delay neural network, consists of two identical networks joined at their output. During training the network learns to measure the similarity between pairs of signatures. When used for verification, only one half of the Siamese network is evaluated. The output of this half network is the feature vector for the input signature. Verification consists of comparing this feature vector with a stored feature vector for the signer. Signatures closer than a chosen threshold to this stored representation are accepted, all other signatures are rejected as forgeries. System performance is illustrated with experiments performed in the laboratory.
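A minimal sketch of the verification step described above: both signatures pass through the same weight-shared network, and the candidate's feature vector is compared against the stored one with a distance threshold. The layer sizes and the cosine-distance measure below are placeholders for illustration, not the paper's time-delay architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared feature extractor: both signatures go through the same weights.
embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))

def verify(candidate: torch.Tensor, stored_feature: torch.Tensor,
           threshold: float = 0.5) -> bool:
    """Accept the signature if its feature vector is close to the stored one."""
    feature = embed(candidate)                      # only one half of the Siamese pair
    dist = 1.0 - F.cosine_similarity(feature, stored_feature, dim=-1)
    return bool(dist.item() < threshold)

# Enrollment: store the feature vector for the genuine signer.
genuine = torch.randn(1, 128)      # placeholder for a preprocessed signature
stored = embed(genuine).detach()

# Verification of a new signature against the stored representation.
print(verify(genuine + 0.01 * torch.randn(1, 128), stored))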
Chapter
Threshold functions and related operators are widely used as basic elements of adaptive and associative networks [Nakano 72, Amari 72, Hopfield 82]. There exist numerous learning rules for finding a set of weights to achieve a particular correspondence between input-output pairs. But early works in the field have shown that the number of threshold functions (or linearly separable functions) in N binary variables is small compared to the number of all possible boolean mappings in N variables, especially if N is large. This problem is one of the main limitations of most neural network models where the state is fully specified by the environment during learning: they can only learn linearly separable functions of their inputs. Moreover, a learning procedure which requires the outside world to specify the state of every neuron during the learning session can hardly be considered a general learning rule, because in real-world conditions only partial information about the “ideal” network state for each task is available from the environment. It is possible to use a set of so-called “hidden units” [Hinton, Sejnowski, Ackley 84], without direct interaction with the environment, which can compute intermediate predicates. Unfortunately, the global response depends on the output of a particular hidden unit in a highly non-linear way; moreover, the nature of this dependence is influenced by the states of the other cells.
Article
This paper addresses the problem of improving the accuracy of an hypothesis output by a learning algorithm in the distribution-free (PAC) learning model. A concept class is learnable (or strongly learnable) if, given access to a source of examples of the unknown concept, the learner with high probability is able to output an hypothesis that is correct on all but an arbitrarily small fraction of the instances. The concept class is weakly learnable if the learner can produce an hypothesis that performs only slightly better than random guessing. In this paper, it is shown that these two notions of learnability are equivalent. A method is described for converting a weak learning algorithm into one that achieves arbitrarily high accuracy. This construction may have practical applications as a tool for efficiently converting a mediocre learning algorithm into one that performs extremely well. In addition, the construction has some interesting theoretical consequences, including a set of general upper bounds on the complexity of any strong learning algorithm as a function of the allowed error ε.
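The weak-to-strong conversion described in this abstract is the origin of boosting. The best-known later instantiation of the idea is AdaBoost, sketched below in NumPy with decision stumps as the weak learner; it is shown only to illustrate how reweighting examples focuses the weak learner on its mistakes, and it is not the specific construction analyzed in this paper.

import numpy as np

def train_stump(x, y, w):
    """Weak learner: best single-feature threshold classifier under weights w."""
    best = (None, None, 1, np.inf)              # (feature, threshold, sign, error)
    for j in range(x.shape[1]):
        for t in np.unique(x[:, j]):
            for s in (1, -1):
                pred = s * np.where(x[:, j] > t, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(x, y, rounds=20):
    """Combine weak (slightly better than random) stumps into a strong classifier."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        j, t, s, err = train_stump(x, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak hypothesis
        pred = s * np.where(x[:, j] > t, 1, -1)
        w = w * np.exp(-alpha * y * pred)        # upweight the examples it got wrong
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, x):
    votes = sum(a * s * np.where(x[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(votes)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)       # toy target
model = adaboost(x, y)
print(np.mean(predict(model, x) == y))           # training accuracy of the ensemble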