Conference Paper

Reading Checks With Multilayer Graph Transformer Networks

Authors:

Abstract

We propose a new machine learning paradigm called multilayer graph transformer network that extends the applicability of gradient-based learning algorithms to systems composed of modules that take graphs as input and produce graphs as output. A complete check reading system based on this concept is described. The system combines convolutional neural network character recognizers with graph-based stochastic models trained cooperatively at the document level. It is deployed commercially and reads millions of business and personal checks per month with record accuracy.
... For instance, giving computers the ability to learn representations without being directly programmed for a specific task has been extensively leveraged in computer vision (Sebe et al., 2005). Convolutional Neural Networks (CNNs) were particularly developed for image recognition tasks (Le Cun et al., 1997; Krizhevsky et al., 2012). Inspired by biological visual perception, CNNs are trained to react to specific image features, starting from simple forms, such as lines or edges, and then detecting more complex and abstract patterns in subsequent layers (Ghosh et al., 2020). ...
Article
Solar granulation is the visible signature of convective cells at the solar surface. The granulation cellular pattern observed in the continuum intensity images is characterised by diverse structures, e.g., bright individual granules of hot rising gas or dark intergranular lanes. Recently, access to new instrumentation capabilities has made it possible to obtain high-resolution images, which have revealed the overwhelming complexity of granulation (e.g., exploding granules and granular lanes). In that sense, any research focused on understanding small-scale phenomena on the solar surface depends on the effective identification and localization of the different resolved structures. In this work, we present the initial results of a proposed classification model of solar granulation structures based on neural semantic segmentation. We inspect the ability of the U-net architecture, a convolutional neural network initially proposed for biomedical image segmentation, to be applied to the dense segmentation of solar granulation. We use continuum intensity maps of the IMaX instrument onboard the Sunrise I balloon-borne solar observatory and their corresponding segmented maps as a training set. The training data have been labeled using the multiple-level technique (MLT) and also by hand. We performed several tests of the performance and precision of this approach in order to evaluate the versatility of the U-net architecture. We found that the U-net architecture shows promising potential for identifying cellular patterns in solar granulation images, reaching an average accuracy above 80% in the initial training experiments.
... Specifically, convolutional networks (ConvNets) have been widely applied in various computer vision applications (Szegedy et al., 2016), supported by high-performance computing facilities and the availability of large public image repositories such as ImageNet (Deng et al., 2009). Convolutional neural networks were originally employed for character recognition (Le Cun et al., 1997). Their applicability to computer vision became apparent through their ImageNet classification performance (Krizhevsky et al., 2012). ...
Article
Automated and intelligent classification of defects can improve productivity, quality, and safety of various welded components used in industries. This study presents a transfer learning approach for accurate classification of tungsten inert gas (TIG) welding defects while joining stainless steel parts. In this approach, eight pre-trained deep learning models (VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, Xception, MobileNetV2, and DenseNet169) were explored to classify welding images into two-class (good weld/bad weld) and multi-class (good weld/burn through/contamination/lack of fusion/lack of shielding gas/high travel speed) classifications. Moreover, four optimizers (SGD, Adam, Adagrad, and RMSprop) were applied separately to each of the deep learning models to maximize prediction accuracies. All models were evaluated based on testing accuracy, precision, recall, F1 scores, training/validation losses, and accuracies over successive training epochs. Primary results show that the VGG19-SGD and DenseNet169-SGD architectures attained the best testing accuracies for two-class (99.69%) and multi-class (97.28%) defect classification, respectively. For "burn through," "contamination," and "high travel speed" defects, most deep learning models ensured productivity over quality assurance of TIG welded joints. On the other hand, the weld quality was promoted over productivity during classification of "lack of fusion" and "lack of shielding gas" defects. Thus, transfer learning methodology can help boost productivity and quality of welded joints by accurate classification of good and bad welds.
... In the HTR problem domain, CNNs were initially used for handwritten digit recognition [134]. Later, these networks were applied to handwritten text recognition in specific domains such as online handwriting [9] or bank check transcription [133] [135]. ...
Preprint
Handwritten text recognition is an open problem of great interest in the area of automatic document image analysis. The transcription of handwritten content present in digitized documents is significant in analyzing historical archives or digitizing information from handwritten documents, forms, and communications. In recent years, great advances have been made in this area by applying deep learning techniques. This thesis addresses the offline continuous handwritten text recognition (HTR) problem, which consists of developing algorithms and models capable of transcribing the text present in an image without the need for the text to be segmented into characters. For this purpose, we have proposed a new recognition model based on integrating two types of deep learning architectures: convolutional neural networks (CNN) and sequence-to-sequence (seq2seq) models. The convolutional component of the model is oriented to identify relevant features present in characters, and the seq2seq component builds the transcription of the text by modeling the sequential nature of the text. For the design of this new model, an extensive analysis of the capabilities of different convolutional architectures in the simplified problem of isolated character recognition has been carried out in order to identify the most suitable ones to be integrated into the continuous model. Additionally, extensive experimentation of the proposed model for the continuous problem has been carried out to determine its robustness to changes in parameterization. The generalization capacity of the model has also been validated by evaluating it on three handwritten text databases in different languages: IAM in English, RIMES in French, and Osborne in Spanish. The new proposed model provides results competitive with those obtained with other well-established methodologies.
... The CNN has been widely applied for the FER system and has significantly improved state-of-the-art practices as well as analyzed the performance of ImageNet classification challenges [22]. Earlier CNN models were used to solve character recognition tasks [24], but nowadays, CNN is widely used in various object recognition problems. Here, the most important ingredient for the success of CNN is the availability of large quantities of training data, i.e., the use of image augmentation techniques [15]. ...
Article
This work proposes a facial expression recognition system for a diversified field of applications. The purpose of the proposed system is to predict the type of expressions in a human face region. The implementation of the proposed method is divided into three components. In the first component, a tree-structured part model is applied to the given input image to predict landmark points and detect facial regions. The detected face region was normalized to a fixed size and then down-sampled to varying sizes so that the advantages of multi-resolution images can be introduced. Then, some convolutional neural network (CNN) architectures were proposed in the second component to analyze the texture patterns in the facial regions. To enhance the proposed CNN model’s performance, some advanced techniques, such as data augmentation, progressive image resizing, transfer-learning, and fine-tuning of the parameters, were employed in the third component to extract more distinctive and discriminant features for the proposed facial expression recognition system. The outputs of the different CNN models are fused to achieve better performance than the existing state-of-the-art methods, and for this reason, extensive experimentation has been carried out using the Karolinska-directed emotional faces (KDEF), GENKI-4k, Cohn-Kanade (CK+), and Static Facial Expressions in the Wild (SFEW) benchmark databases. The performance has been compared with some existing methods on these databases, which shows that the proposed facial expression recognition system outperforms other competing methods.
... Generally, convolution filters (for example a 3 × 3 matrix) slide along input features and subsequently provide a corresponding output feature map. This methodology can significantly reduce the number of input parameters when compared to standard neural networks, and hence reduces the need for large data sets [21]. Moreover, from a practical point of view, CNNs are more straightforward to train than recurrent neural networks, as the latter face issues like exploding or vanishing gradients [22,23]. ...
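The sliding-filter idea in the excerpt above can be illustrated with a minimal sketch in plain Python. The 5 × 5 input, the 3 × 3 vertical-edge filter, and the function name `conv2d_valid` are all hypothetical choices made for illustration; none of them comes from the cited work. The last lines compare the number of weights in one shared 3 × 3 filter with a fully connected layer producing the same 3 × 3 output, which is one way to read the parameter saving the excerpt mentions.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the window whose top-left corner is (i, j).
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 input: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical 3x3 filter that responds to vertical dark-to-bright edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)

# The same 9 weights are reused at every position of the input.
conv_weights = len(kernel) * len(kernel[0])

# A fully connected layer mapping 25 inputs to the same 9 outputs
# needs one weight per (input, output) pair.
dense_weights = (5 * 5) * (len(feature_map) * len(feature_map[0]))

for row in feature_map:
    print(row)
print(conv_weights, dense_weights)
```

The feature map is largest where the window straddles the edge and zero where the window is uniformly bright, and the weight counts (9 versus 225) show why weight sharing keeps the parameter count small.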
Article
Our objective was to evaluate the diagnostic performance of a convolutional neural network (CNN) trained on multiple MR imaging features of the lumbar spine, to detect a variety of different degenerative changes of the lumbar spine. One hundred and forty-six consecutive patients underwent routine clinical MRI of the lumbar spine including T2-weighted imaging and were retrospectively analyzed using a CNN for detection and labeling of vertebrae, disc segments, as well as presence of disc herniation, disc bulging, spinal canal stenosis, nerve root compression, and spondylolisthesis. The assessment of a radiologist served as the diagnostic reference standard. We assessed the CNN’s diagnostic accuracy and consistency using confusion matrices and McNemar’s test. In our data, 77 disc herniations (thereof 46 further classified as extrusions), 133 disc bulgings, 35 spinal canal stenoses, 59 nerve root compressions, and 20 segments with spondylolisthesis were present in a total of 888 lumbar spine segments. The CNN yielded a perfect accuracy score for intervertebral disc detection and labeling (100%), and moderate to high diagnostic accuracy for the detection of disc herniations (87%; 95% CI: 0.84, 0.89), extrusions (86%; 95% CI: 0.84, 0.89), bulgings (76%; 95% CI: 0.73, 0.78), spinal canal stenoses (98%; 95% CI: 0.97, 0.99), nerve root compressions (91%; 95% CI: 0.89, 0.92), and spondylolisthesis (87.61%; 95% CI: 85.26, 89.21), respectively. Our data suggest that automatic diagnosis of multiple different degenerative changes of the lumbar spine is feasible using a single comprehensive CNN. The CNN provides high diagnostic accuracy for intervertebral disc labeling and detection of clinically relevant degenerative changes such as spinal canal stenosis and disc extrusion of the lumbar spine.
Article
Artificial intelligence (AI) and operations research (OR) have long been intertwined because of their synergistic relationship. Given the increasing popularity of AI and machine learning in particular, we face growing demand for educational offerings in this area from our students. This paper describes two courses that introduce machine learning concepts to undergraduate, predominantly industrial engineering and operations research students. Instead of taking a methods-first approach, these courses use real-world applications to motivate, introduce, and explore these machine learning techniques and highlight meaningful overlap with operations research. Significant hands-on coding experience is used to build student proficiency with the techniques. Student feedback indicates that these courses have greatly increased student interest in machine learning and appreciation of the real-world impact that analytics can have and helped students develop practical skills that they can apply. We believe that similar application-driven courses that connect machine learning and operations research would be valuable additions to undergraduate OR curricula broadly. Supplemental Material: Supplemental material is available at https://doi.org/10.1287/ited.2021.0256.
Chapter
Laser‐induced plasma emission spectra contain vast amounts of information. Yet, the discovery of appropriate patterns in laser‐induced breakdown spectra is paramount to reliably performing both quantitative and qualitative analysis. This chapter provides a brief introduction of artificial neural network (ANN) classification models, which have recently become a fundamental part of most pattern recognition toolboxes. The working principles of ANNs are discussed along with the most frequently used architecture types. Special attention is given to the training process of ANNs with the aim of aiding the reader's troubleshooting capabilities. Moreover, some of the potential perils of ANN models are presented. Namely, the risk of overtraining is addressed extensively while providing several potential ailments. Lastly, a comprehensive overview of the applications of ANNs for the classification of LIBS spectra is provided and a few exemplary use‐cases of ANN classifiers are discussed in detail.
Thesis
Due to the massive and increasing amount of documents received each day and the number of steps to process them, the largest companies have turned to document automation software to reach low processing costs. One crucial step of such software is the automatic extraction of information from the documents, particularly retrieving fields that repeatedly appear in the incoming documents. To deal with the variability of structure of the information contained in such documents, industrial and academic practitioners have progressively moved from rule-based methods to machine and deep learning models for performing the extraction task. The goal of this thesis is to provide methods for learning to extract information from business documents. In the first part of this manuscript, we embrace the sequence labeling approach by training deep neural networks to classify the information type carried by each token in the documents. When provided perfect token labels for learning, we show that these token classifiers can extract complex tabular information from document issuers and layouts that were unknown at the model training time. However, when the token level supervision must be deduced from the high-level ground truth naturally produced by the extraction task, we demonstrate that the token classifiers extract information from real-world documents with a significantly lower accuracy due to the noise introduced in the labels. In the second part of this thesis, we explore methods that learn to extract information directly from the high-level ground truth at our disposal, thus bypassing the need for costly token level supervision. We adapt an attention-based sequence-to-sequence model in order to alternately copy the document tokens carrying relevant information and generate the XML tags structuring the output extraction schema. Unlike prior work in end-to-end information extraction, our approach can retrieve arbitrarily structured information schemas. By comparing its extraction performance with the previous token classifiers, we show that end-to-end methods are competitive with sequence labeling approaches and can greatly outperform them when their token labels are not immediately accessible. Finally, in a third part, we confirm that using pre-trained models to extract information greatly reduces the need for annotated documents. We leverage an existing Transformer based language model which has been pre-trained on a large collection of business documents. When adapted for an information extraction task through sequence labeling, the language model requires very few training documents to attain close to maximal extraction performance. This underlines that the pre-trained models are significantly more data-efficient than models learning the extraction task from scratch. We also reveal valuable knowledge transfer abilities of this language model since the few-shot performance is improved when learning beforehand to extract information on another dataset, even if its targeted fields differ from the initial task.
Article
Drivers are more prone to road accidents when their mental workload is too high or too low. Recognizing a driver's mental workload is therefore valuable, since it gives a vehicle's driving assistance system a basis for warning the driver or even taking over driving. In this study, we conducted a simulated driving experiment and collected various physiological signals from drivers under different driving conditions. We compared machine learning and deep learning methods on the recognition task. Physiological signal samples of different lengths were tested and the resulting accuracies were compared. The results indicate that the deep learning model based on a combination of CNN and LSTM achieves higher accuracy than the others, and that deep learning methods perform better than those based on manual feature extraction and traditional classifiers.
Article
Modern-day techniques for designing neural network architectures are highly reliant on trial and error, heuristics, and so-called best practices, without much rigorous justification. After choosing a network architecture, an energy function (or loss) is minimized, choosing from a wide variety of optimization and regularization methods. Given the ad-hoc nature of network architecture design, it would be useful if the optimization led to a sparse solution so that one could ascertain the importance or unimportance of various parts of the network architecture. Of course, historically, sparsity has always been a useful notion for inverse problems where researchers often prefer the L1 norm over L2. Similarly for control, one often includes the control variables in the objective function in order to minimize their efforts. Motivated by the design and training of neural networks, we propose a novel column space search approach that emphasizes the data over the model, as well as a novel iterative Levenberg-Marquardt algorithm that smoothly converges to a regularized SVD as opposed to the abrupt truncation inherent to PCA. In the case of our iterative Levenberg-Marquardt algorithm, it suffices to consider only the linearized subproblem in order to verify our claims. However, the claims we make about our novel column space search approach require examining the impact of the solution method for the linearized subproblem on the fully nonlinear original problem; thus, we consider a complex real-world inverse problem (determining facial expressions from RGB images).
Conference Paper
We present the concepts of weighted language, transduction and automaton from algebraic automata theory as a general framework for describing and implementing decoding cascades in speech and language processing. This generality allows us to represent uniformly such information sources as pronunciation dictionaries, language models and lattices, and to use uniform algorithms for building decoding stages and for optimizing and combining them. In particular, a single algorithm can be used either to combine information sources such as a pronunciation dictionary and a context-dependency model during the construction of a decoder, or dynamically during the operation of the decoder. Applications to speech recognition and to Chinese text segmentation will be discussed.
Conference Paper
INTRODUCTION The ability of multilayer back-propagation networks to learn complex, high-dimensional, nonlinear mappings from large collections of examples makes them obvious candidates for image recognition or speech recognition tasks (see PATTERN RECOGNITION AND NEURAL NETWORKS). In the traditional model of pattern recognition, a hand-designed feature extractor gathers relevant information from the input and eliminates irrelevant variabilities. A trainable classifier then categorizes the...
Article
We introduce a new approach for on-line recognition of handwritten words written in unconstrained mixed style. The preprocessor performs a word-level normalization by fitting a model of the word structure using the EM algorithm. Words are then coded into low resolution "annotated images" where each pixel contains information about trajectory direction and curvature. The recognizer is a convolution network that can be spatially replicated. From the network output, a hidden Markov model produces word scores. The entire system is globally trained to minimize word-level errors.
Article
We introduce a recurrent architecture having a modular structure and we formulate a training procedure based on the EM algorithm. The resulting model has similarities to hidden Markov models, but supports a recurrent-network processing style and allows the supervised learning paradigm to be exploited while using maximum likelihood estimation. 1 INTRODUCTION Learning problems involving sequentially structured data cannot be effectively handled by static models such as feedforward networks. Recurrent networks can model complex dynamical systems and can store and retrieve contextual information in a flexible way. Up until the present time, research efforts on supervised learning for recurrent networks have almost exclusively focused on error minimization by gradient descent methods. Although effective for learning short term memories, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the in...
Conference Paper
The authors describe two systems in which neural network classifiers are merged with dynamic programming (DP) time alignment methods to produce high-performance continuous speech recognizers. One system uses the connectionist Viterbi-training (CVT) procedure, in which a neural network with frame-level outputs is trained using guidance from a time alignment procedure. The other system uses multi-state time-delay neural networks (MS-TDNNs), in which embedded DP time alignment allows network training with only word-level external supervision. The CVT results on TI Digits are 99.1% word accuracy and 98.0% string accuracy. The MS-TDNNs are described in detail, with attention focused on their architecture, the training procedure, and results of applying the MS-TDNNs to continuous speaker-dependent alphabet recognition: on two speakers, word accuracy is 97.5% and 89.7%, respectively.
Article
We introduce a framework for training architectures composed of several modules. This framework, which uses a statistical formulation of learning systems, provides a unique formalism for describing many classical connectionist algorithms as well as complex systems where several algorithms interact. It allows the design of hybrid systems which combine the advantages of connectionist algorithms with those of other learning algorithms. 1 INTRODUCTION Many recent achievements in the connectionist area have been carried out by designing systems where different algorithms interact. For example, Bourlard & Morgan (1991) have mixed a Multi-Layer Perceptron (MLP) with a Dynamic Programming algorithm. Another impressive application (Le Cun, Boser & al., 1990) uses a very complex multilayer architecture, followed by some statistical decision process. Also, in speech or image recognition systems, input signals are sequentially processed through different modules. Modular systems are the most promising wa...