Preprint

A Comprehensive Comparison of End-to-End Approaches for Handwritten Digit String Recognition


Abstract

Over the last decades, most approaches proposed for handwritten digit string recognition (HDSR) have resorted to digit segmentation, which is dominated by heuristics and thereby imposes substantial constraints on the final performance. Few approaches have been based on segmentation-free strategies, in which each pixel column is a potential cut location. Recently, segmentation-free strategies have added another perspective to the problem, leading to promising results. However, these strategies still show limitations when dealing with a large number of touching digits. To bridge the resulting gap, in this paper we hypothesize that a string of digits can be approached as a sequence of objects. We thus evaluate different end-to-end approaches to the HDSR problem along two verticals: those based on object detection (e.g., Yolo and RetinaNet) and those based on sequence-to-sequence representation (CRNN). The main contribution of this work lies in its comprehensive comparison, with a critical analysis, of the above-mentioned strategies on five benchmarks commonly used to assess HDSR, including the challenging Touching Pairs dataset, NIST SD19, and two real-world datasets (CAR and CVL) proposed for the ICFHR 2014 competition on HDSR. Our results show that the Yolo model compares favorably against segmentation-free models, with the advantage of a shorter pipeline that minimizes the presence of heuristics-based modules. It achieved recognition rates of 97%, 96%, and 84% on the NIST SD19, CAR, and CVL datasets, respectively.
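Treating a digit string as a sequence of objects implies a simple transcription step once detection is done: keep the confident per-digit boxes, order them left to right, and concatenate their class labels. A minimal sketch of that step, assuming a hypothetical detection format of (x_min, y_min, x_max, y_max, label, score) tuples (the tuple layout and threshold are illustrative assumptions, not taken from the paper):

```python
def detections_to_string(detections, score_threshold=0.5):
    """Transcribe per-digit detections into a digit string.

    Each detection is a (x_min, y_min, x_max, y_max, label, score)
    tuple; boxes below the confidence threshold are discarded, the
    rest are ordered by their left edge and their labels joined.
    """
    kept = [d for d in detections if d[5] >= score_threshold]
    kept.sort(key=lambda d: d[0])          # left-to-right reading order
    return "".join(str(d[4]) for d in kept)

# Example: three digit boxes detected out of reading order, plus
# one low-score false positive that gets filtered out.
boxes = [(40, 5, 60, 30, 7, 0.91),
         (2, 4, 22, 31, 3, 0.88),
         (21, 6, 41, 29, 1, 0.95),
         (70, 8, 85, 20, 9, 0.10)]
print(detections_to_string(boxes))  # -> "317"
```

This is what makes the object-detection pipeline "short": there is no segmentation module at all, only detection followed by this ordering step.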


References
Article
Automatic recognition of handwritten digit strings of unknown length has many potential real-world applications. The most challenging step in this problem is how to efficiently segment connected and/or overlapping digits in the input image. Most existing numeral string segmentation approaches combine several segmentation hypotheses to handle various types of connected digits. This paper proposes a new handwritten digit string recognition method that does not apply any explicit segmentation technique. The proposed method uses a new cascade of hybrid principal component analysis network (PCANet) and support vector machine (SVM) classifiers called PCA-SVMNet. PCANet is an emerging unsupervised simple deep neural network, typically with only two convolutional layers. The proposed PCA-SVMNet model adds a new fully connected layer trained separately using the SVM optimization method. Cascaded stages of PCA-SVMNet classifiers are constructed and trained to recognize various types of isolated and connected digits. Every PCA-SVMNet classifier is trained separately using combinations of real and synthetic touching digits. The first 1D-PCA-SVMNet stage is trained to recognize isolated handwritten digits (0-9) while forwarding non-isolated digits to the next stages. Each of the following stages is designed to recognize one class of connected digits and forwards higher classes to its successor. Multiple stages can be added accordingly to classify more complex touching digits. Experimental results on the real NIST SD19 dataset show that the cascade of PCA-SVMNet classifiers efficiently recognizes handwritten digit strings of unknown length without applying any sophisticated segmentation method. The proposed method achieves state-of-the-art recognition accuracy compared to other segmentation-free techniques.
Article
We tackle automatic meter reading (AMR) by leveraging the high capability of convolutional neural networks (CNNs). We design a two-stage approach that employs the Fast-YOLO object detector for counter detection and evaluates three different CNN-based approaches for counter recognition. In the AMR literature, most datasets are not available to the research community since the images belong to service companies. In this sense, we introduce a public dataset, called the UFPR-AMR dataset, with 2,000 fully and manually annotated images. This dataset is, to the best of our knowledge, three times larger than the largest public dataset found in the literature and contains a well-defined evaluation protocol to assist the development and evaluation of AMR methods. Furthermore, we propose the use of a data augmentation technique to generate a balanced training set with many more examples for training the CNN models for counter recognition. On the proposed dataset, impressive results were obtained, and a detailed speed/accuracy trade-off evaluation of each model was performed. On a public dataset, state-of-the-art results were achieved using fewer than 200 images for training.
Conference Paper
This paper presents segmentation-free strategies for the recognition of handwritten numeral strings of unknown length. A synthetic dataset of touching numeral strings of 2, 3, and 4 digits was created to train end-to-end solutions based on Convolutional Neural Networks. A robust experimental protocol is used to show that the proposed segmentation-free methods may reach state-of-the-art performance without suffering the heavy burden of over-segmentation-based methods. In addition, the experiments confirmed the importance of introducing contextual information into the design of end-to-end solutions, such as the proposed length classifier used when recognizing numeral strings.
Conference Paper
Automatic License Plate Recognition (ALPR) has been a frequent topic of research due to its many practical applications. However, many current solutions are still not robust in real-world situations, commonly depending on many constraints. This paper presents a robust and efficient ALPR system based on the state-of-the-art YOLO object detector. The Convolutional Neural Networks (CNNs) are trained and fine-tuned for each ALPR stage so that they are robust under different conditions (e.g., variations in camera, lighting, and background). Especially for character segmentation and recognition, we design a two-stage approach employing simple data augmentation tricks such as inverted License Plates (LPs) and flipped characters. The resulting ALPR approach achieved impressive results on two datasets. First, on the SSIG dataset, composed of 2,000 frames from 101 vehicle videos, our system achieved a recognition rate of 93.53% at 47 Frames Per Second (FPS), performing better than both the Sighthound and OpenALPR commercial systems (89.80% and 93.03%, respectively) and considerably outperforming previous results (81.80%). Second, targeting a more realistic scenario, we introduce a larger public dataset, called the UFPR-ALPR dataset, designed for ALPR. This dataset contains 150 videos and 4,500 frames captured while both the camera and the vehicles are moving, and it covers different types of vehicles (cars, motorcycles, buses, and trucks). On our proposed dataset, the trial versions of the commercial systems achieved recognition rates below 70%. Our system, on the other hand, performed better, with a recognition rate of 78.33% at 35 FPS.
Article
Over the last decades, a great deal of research has been devoted to handwritten digit segmentation. Algorithms based on different features extracted from the background, foreground, and contour of images have been proposed, with those achieving the best results usually relying on a heavy set of heuristics and over-segmentation. Here, the challenge lies in finding a good set of heuristics to reduce the number of segmentation hypotheses. Independently of the heuristic over-segmentation strategy adopted, all such algorithms show their limitations when faced with complex cases such as overlapping digits. In this work, we postulate that handwritten digit segmentation can be successfully replaced by a set of classifiers trained to predict the size of the string and classify the digits without any segmentation. To support our position, we trained four Convolutional Neural Networks (CNNs) on synthetically generated data and validated the proposed method on two well-known databases, namely the Touching Pairs dataset and NIST SD19. Our experimental results show that the CNN classifiers can handle complex cases of touching digits more efficiently than all segmentation algorithms available in the literature.
Article
Recurrent neural networks (RNNs) and connectionist temporal classification (CTC) have shown success in many sequence labeling tasks, owing to their strong ability to deal with problems where the alignment between the inputs and the target labels is unknown. The residual network is a newer convolutional neural network structure that works well in various computer vision tasks. In this paper, we take advantage of the architectures mentioned above to create a new network for handwritten digit string recognition. First we design a residual network to extract features from input images; then we employ an RNN to model the contextual information within the feature sequences and predict recognition results. At the top of this network, a standard CTC layer is applied to calculate the loss and yield the final results. These three parts compose an end-to-end trainable network. The proposed architecture achieves the highest performance on ORAND-CAR-A and ORAND-CAR-B, with recognition rates of 89.75% and 91.14%, respectively. In addition, experiments on a generated captcha dataset with much longer strings show the potential of the proposed network to handle long strings.
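The CTC transcription step described above turns framewise predictions into a label string by collapsing repeated symbols and removing the blank symbol. A minimal greedy (best-path) decoder, written here as an illustrative sketch in plain Python rather than the authors' implementation:

```python
def ctc_greedy_decode(frame_probs, blank=0):
    """Best-path CTC decoding.

    frame_probs: list of per-frame probability vectors over the
    label alphabet (index `blank` is the CTC blank). Picks the
    argmax label per frame, collapses consecutive repeats, then
    removes blanks.
    """
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Frames whose argmax path is [1, 1, 0, 1, 2, 2]:
# repeats collapse to [1, 0, 1, 2], blanks drop -> [1, 1, 2].
probs = [[0.1, 0.8, 0.1],
         [0.2, 0.7, 0.1],
         [0.9, 0.05, 0.05],
         [0.1, 0.8, 0.1],
         [0.1, 0.2, 0.7],
         [0.1, 0.2, 0.7]]
print(ctc_greedy_decode(probs))  # -> [1, 1, 2]
```

Note how the blank between the two 1-frames is what allows a genuinely repeated digit to survive the collapse step; this is the mechanism that lets CTC handle strings of unknown length without any digit segmentation.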
Article
The segmentation of handwritten digit strings into isolated digits remains a challenging task. The difficulty of recognizing handwritten digit strings is related to several factors, such as slant, overlapping, connections, and the unknown length of the digit string. Hence, this paper proposes a segmentation and recognition system for unknown-length handwritten digit strings that combines several explicit segmentation methods depending on the configuration of the link between digits. Three segmentation methods are combined, based on the histogram of the vertical projection, contour analysis, and the sliding-window Radon transform. A recognition and verification module based on support vector machine classifiers analyzes each segmented digit image and decides its rejection or acceptance. Moreover, various submodules are included to enhance the robustness of the proposed system. Experimental results conducted on the benchmark dataset show that the proposed system is effective for segmenting handwritten digit strings without prior knowledge of their length, in comparison to the state of the art.
Article
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling, and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text, and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies its generality.
Article
In this paper we propose a system to recognize handwritten digit strings, which constitutes a difficult task because of the overlapping and/or joining of adjacent digits. To resolve this problem, we use a segmentation-verification of handwritten connected digits based jointly on an oriented sliding window and support vector machine (SVM) classifiers. The proposed approach separates adjacent digits according to the connection configuration by finding, at the same time, the interconnection points between adjacent digits and the cutting path. SVM-based segmentation-verification using the global decision module allows the rejection or acceptance of the processed image. Experimental results conducted on a large synthetic database of handwritten digits show the effectiveness of the oriented sliding window for segmentation-verification.
Article
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection over hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground-truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge and propose future directions and improvements.
Article
This paper addresses the problem of generating possible object locations for use in object recognition. We introduce selective search, which combines the strength of both an exhaustive search and segmentation. Like segmentation, we use the image structure to guide our sampling process. Like exhaustive search, we aim to capture all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a variety of complementary image partitionings to deal with as many image conditions as possible. Our selective search results in a small set of data-driven, class-independent, high-quality locations, yielding 99% recall and a Mean Average Best Overlap of 0.879 at 10,097 locations. The reduced number of locations compared to an exhaustive search enables the use of stronger machine learning techniques and stronger appearance models for object recognition. In this paper we show that our selective search enables the use of the powerful Bag-of-Words model for recognition. The selective search software is made publicly available (Software: http://disi.unitn.it/~uijlings/SelectiveSearch.html).
Article
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.
Article
A method for word recognition based on the use of hidden Markov models (HMMs) is described. An evaluation of its performance is presented using a test set of real printed documents that have been subjected to severe photocopy and fax transmission distortions. A comparison with a commercial OCR package highlights the inherent advantages of a segmentation-free recognition strategy when the word images are severely distorted, as well as the importance of using contextual knowledge. The HMM method makes only one quarter of the number of word errors made by the commercial package when tested on word images taken from faxed pages.
Conference Paper
We apply an HMM-based text recognition system to the recognition of handwritten digit strings of unknown length. The algorithm is tailored to the input data by controlling the maximum number of levels searched by the level building (LB) search algorithm. We demonstrate that setting this parameter according to the pixel length of the observation sequence, rather than using a fixed value for all input data, results in a faster and more accurate system. Best results were achieved by setting the maximum number of levels to twice the estimated number of characters in the input string. We also describe experiments which show the potential for further improvement by using an adaptive termination criterion in the LB search.
Article
Object detection, including objectness detection (OD), salient object detection (SOD), and category-specific object detection (COD), is one of the most fundamental yet challenging problems in the computer vision community. Over the last several decades, great efforts have been made by researchers to tackle this problem, due to its broad range of applications in other computer vision tasks such as activity or event recognition, content-based image retrieval, and scene understanding. While numerous methods have been presented in recent years, a comprehensive review of the proposed high-quality object detection techniques, especially those based on advanced deep-learning techniques, is still lacking. To this end, this article delves into the recent progress in this research field, including 1) the definitions, motivations, and tasks of each subdirection; 2) modern techniques and essential research trends; 3) benchmark datasets and evaluation metrics; and 4) comparisons and analysis of the experimental results. More importantly, we reveal the underlying relationship among OD, SOD, and COD, discuss in detail some open questions, and point out several unsolved challenges and promising directions for future work.
Article
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity, such as an object or object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule.
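The "length of the activity vector as probability" idea relies on a squashing nonlinearity that maps any vector to one of length strictly below 1 while preserving its direction, so that the norm can be read as an existence probability. A NumPy sketch of that squash function, for illustration (the small epsilon is an assumption for numerical safety, not part of the formula):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squash a capsule's raw output vector s.

    Short vectors shrink toward length ~0 and long vectors
    approach (but never reach) length 1, while the direction
    of s is preserved:
        v = (|s|^2 / (1 + |s|^2)) * (s / |s|)
    """
    norm_sq = np.dot(s, s)
    scale = norm_sq / (1.0 + norm_sq)
    return scale * s / np.sqrt(norm_sq + eps)

long_vec = squash(np.array([10.0, 0.0]))   # length ~0.99
short_vec = squash(np.array([0.1, 0.0]))   # length ~0.01
print(np.linalg.norm(long_vec))
print(np.linalg.norm(short_vec))
```

The saturation toward 1 for long vectors is what makes the norm usable directly as the probability that the entity a capsule represents is present.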
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Article
Most successful object recognition systems rely on binary classification, deciding only if an object is present or not, but not providing information on the actual object location. To perform localization, one can take a sliding window approach, but this strongly increases the computational cost, because the classifier function has to be evaluated over a large set of candidate subwindows. In this paper, we propose a simple yet powerful branch-and-bound scheme that allows efficient maximization of a large class of classifier functions over all possible subimages. It converges to a globally optimal solution typically in sublinear time. We show how our method is applicable to different object detection and retrieval scenarios. The achieved speedup allows the use of classifiers for localization that formerly were considered too slow for this task, such as SVMs with a spatial pyramid kernel or nearest neighbor classifiers based on the chi2-distance. We demonstrate state-of-the-art performance of the resulting systems on the UIUC Cars dataset, the PASCAL VOC 2006 dataset and in the PASCAL VOC 2007 competition.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, the RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
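Region-proposal pipelines like the RPN score many overlapping candidate boxes and then prune near-duplicates, typically with an intersection-over-union (IoU) test followed by greedy non-maximum suppression. A small, self-contained sketch of both steps (illustrative only, not the paper's code; the 0.5 threshold is a common default, assumed here):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it overlaps no already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: the near-duplicate box 1 is suppressed
```

The same IoU/NMS machinery is what a detection-based HDSR system uses to avoid reporting the same digit twice when anchor boxes overlap.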
Article
In this paper, we focus on the problem of script and handwritten/machine-printed identification of texts. We simultaneously identify the script (Chinese, English, Japanese, Korean, or Russian) and whether the text is handwritten or machine-printed by designing a dual-branch structured deep convolutional neural network (CNN). For the training stage, we propose a two-stage multi-task learning strategy to learn robust shared features for script and handwritten/machine-printed identification. Accordingly, we can implement the two identification tasks using the proposed single CNN model. We compare the effects of using inputs of different lengths to train the CNN. The experimental results show that text-line input is a suitable choice for the two identification tasks, as it can effectively capture more discriminative content for both script and handwritten/machine-printed identification. Furthermore, we evaluate three CNN networks of different scales (small, medium, and large) to determine the best CNN architecture for script and handwritten/machine-printed identification. As shown by our experimental validation, integrating the text-line input with the larger architecture significantly improves performance. The accuracies achieved by the two-stage multi-task CNN for handwritten/machine-printed and script identification are 99% and 95%, respectively.
Article
Verifying the identity of a person using handwritten signatures is challenging in the presence of skilled forgeries, where a forger has access to a person's signature and deliberately attempts to imitate it. In offline (static) signature verification, the dynamic information of the signature writing process is lost, and it is difficult to design good feature extractors that can distinguish genuine signatures from skilled forgeries. This is reflected in relatively poor performance, with verification errors around 7% in the best systems in the literature. To address both the difficulty of obtaining good features and the need to improve system performance, we propose learning the representations from signature images, in a Writer-Independent format, using Convolutional Neural Networks. In particular, we propose a novel formulation of the problem that includes knowledge of skilled forgeries from a subset of users in the feature learning process, which aims to capture visual cues that distinguish genuine signatures from forgeries regardless of the user. Extensive experiments were conducted on four datasets: the GPDS, MCYT, CEDAR, and Brazilian PUC-PR datasets. On GPDS-160, we obtained a large improvement in state-of-the-art performance, achieving a 1.72% Equal Error Rate, compared to 6.97% in the literature. We also verified that the features generalize beyond the GPDS dataset, surpassing the state-of-the-art performance on the other datasets without requiring the representation to be fine-tuned to each particular dataset.
Article
Like other problems in computer vision, offline handwritten Chinese character recognition (HCCR) has achieved impressive results using convolutional neural network (CNN)-based methods. However, larger and deeper networks are needed to deliver state-of-the-art results in this domain. Such networks intuitively incur high computational cost and require the storage of a large number of parameters, which renders them infeasible for deployment on portable devices. To solve this problem, we propose a Global Supervised Low-rank Expansion (GSLRE) method and an Adaptive Drop-weight (ADW) technique to address the problems of speed and storage capacity. We design a nine-layer CNN for HCCR covering 3,755 classes, and devise an algorithm that can reduce the network's computational cost by nine times and compress the network to 1/18 of the original size of the baseline model, with only a 0.21% drop in accuracy. In tests, the proposed algorithm surpassed the best single-network performance reported thus far in the literature while requiring only 2.3 MB of storage. Furthermore, when integrated with our efficient forward implementation, the recognition of an offline character image took only 9.7 ms on a CPU. Compared with the state-of-the-art CNN model for HCCR, our approach is approximately 30 times faster, yet 10 times more cost efficient.
Article
In this paper, we propose an efficient multiple classifier system for Arabic handwritten word recognition. First, we use Chebyshev moments (CM) enhanced with some Statistical and Contour-based Features (SCF) to describe word images. Then, we combine several classifiers integrated at the decision level. We consider the multilayer perceptron (MLP), support vector machine (SVM), and Extreme Learning Machine (ELM) classifiers. We propose several combination rules between the MLP, SVM, and ELM classifiers trained with CM and SCF features. Further, we consider a second level of combination that merges the three best rules among the proposed ones. The system is evaluated on the IFN/ENIT database and compared to some well-known systems for Arabic handwriting recognition. The numerical results are competitive and show that our system is able to achieve a global recognition rate of 96.82% on the considered dataset.
Article
Handwritten Chinese text recognition based on over-segmentation and path search integrating multiple contexts has been demonstrated to be successful, and in it the language model (LM) and character shape models play important roles. Although back-off N-gram LMs (BLMs) have been used dominantly for decades, they suffer from the data sparseness problem, especially for high-order LMs. Recently, neural network LMs (NNLMs) have been applied to handwriting recognition with superiority over BLMs. With the aim of improving Chinese handwriting recognition, this paper evaluates the effects of two types of character-level NNLMs, namely feedforward neural network LMs (FNNLMs) and recurrent neural network LMs (RNNLMs). Both FNNLMs and RNNLMs are also combined with BLMs to construct hybrid LMs. For fair comparison with BLMs and a state-of-the-art system, we evaluate within a system using the same character over-segmentation and classification techniques as before, and compare various LMs using a small text corpus used previously. Experimental results on the Chinese handwriting database CASIA-HWDB validate that NNLMs improve recognition performance, and hybrid RNNLMs outperform the other LMs. To report a new benchmark, we also evaluate selected LMs on a large corpus, and replace the baseline character classifier, over-segmentation, and geometric context models with convolutional neural network (CNN)-based models. The performance on both the CASIA-HWDB and the ICDAR-2013 competition datasets is improved significantly. On the CASIA-HWDB test set, the character-level accurate rate (AR) and correct rate (CR) reach 95.88% and 95.95%, respectively.
Article
This paper presents a novel approach to Indic handwritten word recognition using zone-wise information. Because of the complex nature of Indic scripts (e.g., Devanagari, Bangla, Gurumukhi, and other similar scripts), with compound characters, modifiers, overlapping, touching, etc., character segmentation and recognition is a tedious job. To avoid character segmentation in such scripts, HMM-based sequence modeling has been used earlier in a holistic way. This paper proposes an efficient word recognition framework that segments handwritten word images horizontally into three zones (upper, middle, and lower) and recognizes the corresponding zones. The main aim of this zone segmentation approach is to reduce the number of distinct component classes compared to the total number of classes in Indic scripts. As a result, the zone segmentation approach enhances the recognition performance of the system. The components in the middle zone, where characters are mostly touching, are recognized using an HMM. After the recognition of the middle zone, HMM-based Viterbi forced alignment is applied to mark the left and right boundaries of the characters. Next, the residue components, if any, in the upper and lower zones within their respective boundaries are combined to achieve the final word-level recognition. A water reservoir feature has been integrated into this framework to correct zone segmentation and character alignment defects during segmentation. A novel sliding-window-based feature, called the Pyramid Histogram of Oriented Gradients (PHOG), is proposed for middle zone recognition. An exhaustive experiment is performed on two Indic scripts, namely Bangla and Devanagari, for performance evaluation. The experiments show that the proposed zone-wise recognition improves accuracy with respect to the traditional way of Indic word recognition.
Article
Segmentation is an important issue in document image processing systems, as it can break a sequence of characters into its components. Its application to digits is common in bank check, mail, and historical document processing, among others. This paper presents an algorithm for the segmentation of connected handwritten digits based on the selection of feature points, obtained through a skeletonization process, and the clustering of the touching region via Self-Organizing Maps. The segmentation points are then found, leading to the final segmentation. The method can deal with several types of connection between the digits and is also able to handle multiple touching points. The proposed algorithm achieved encouraging results compared with other state-of-the-art algorithms, while leaving room for possible improvements.
Article
In this work, algorithms for segmenting handwritten digits based on different concepts are compared by evaluating them under the same conditions of implementation. A robust experimental protocol based on a large synthetic database is used to assess each algorithm in terms of correct segmentation and computational time. Results on a real database are also presented. In addition to the overall performance of each algorithm, we show the performance for different types of connections, which provides an interesting categorization of each algorithm. Another contribution of this work concerns the complementarity of the algorithms. We have observed that each method is able to segment samples that cannot be segmented by any other method, and do so independently of their individual performance. Based on this observation, we conclude that combining different segmentation algorithms may be an appropriate strategy for improving the correct segmentation rate.
Article
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
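The core of SPP is pooling a feature map of arbitrary size into a fixed number of bins per pyramid level, so the output length depends only on the pyramid configuration. The single-channel, pure-Python sketch below shows that invariance; real SPP-net applies this per channel on convolutional feature maps.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Max-pool a 2D feature map into a fixed-length vector,
    regardless of its height/width (sketch of SPP, single channel)."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for n in levels:                      # n x n grid at each pyramid level
        for i in range(n):
            for j in range(n):
                r0, r1 = i * h // n, (i + 1) * h // n
                c0, c1 = j * w // n, (j + 1) * w // n
                out.append(max(feature_map[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
    return out

# Two inputs of different sizes yield the same output length (1 + 4 = 5 bins).
small = [[1, 2], [3, 4]]
large = [[v for v in range(6)] for _ in range(4)]
assert len(spatial_pyramid_pool(small)) == len(spatial_pyramid_pool(large)) == 5
```

This fixed-length property is what removes the fixed-size input requirement: the fully connected layers after the pyramid always see the same dimensionality.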
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
This paper presents an empirical evaluation of the role of context in a contemporary, challenging object detection task - the PASCAL VOC 2008. Previous experiments with context have mostly been done on home-grown datasets, often with non-standard baselines, making it difficult to isolate the contribution of contextual information. In this work, we present our analysis on a standard dataset, using top-performing local appearance detectors as baseline. We evaluate several different sources of context and ways to utilize it. While we employ many contextual cues that have been used before, we also propose a few novel ones including the use of geographic context and a new approach for using object spatial support.
Article
In this paper, a two-stage HMM-based recognition method is proposed to compensate for the possible loss of recognition performance caused by the necessary trade-off between segmentation and recognition in an implicit segmentation-based strategy. The first stage consists of an implicit segmentation process that takes into account some contextual information to provide multiple segmentation-recognition hypotheses for a given preprocessed string. These hypotheses are verified and re-ranked in a second stage by an isolated digit classifier. This method enables the use of two sets of features and numeral models: one taking into account both the segmentation and recognition aspects in an implicit segmentation-based strategy, and the other considering just the recognition aspects of isolated digits. The two stages have been shown to be complementary, in the sense that the verification stage compensates for the loss of recognition performance brought about by the trade-off between segmentation and recognition carried out in the first stage. Experiments on 12,802 handwritten numeral strings of different lengths have shown that the use of a two-stage recognition strategy is a promising idea. The verification stage brought an average improvement of 9.9% in the string recognition rates. On touching digit pairs, the method achieved a recognition rate of 89.6%.
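The verification step above amounts to re-ranking first-stage hypotheses with an isolated-digit classifier. The sketch below shows one plausible combination rule (a weighted average of the two scores); the scores, the `verify` stub, and the weight are illustrative assumptions, not the paper's actual models.

```python
# Sketch of the two-stage idea: hypotheses from an implicit segmentation
# stage are re-ranked by an (assumed) isolated-digit verifier. The scores
# and the `verify` function below are illustrative stand-ins.

def rerank(hypotheses, verify, w=0.5):
    """Combine the first-stage score with the mean verifier score
    over the hypothesis' segments, then sort best-first."""
    rescored = []
    for digits, segments, hmm_score in hypotheses:
        v = sum(verify(seg, d) for seg, d in zip(segments, digits)) / len(digits)
        rescored.append((digits, w * hmm_score + (1 - w) * v))
    return sorted(rescored, key=lambda t: t[1], reverse=True)

# Toy verifier: confidence that a segment image matches a digit label.
conf = {("img_a", "3"): 0.9, ("img_b", "5"): 0.8,
        ("img_c", "8"): 0.4, ("img_d", "5"): 0.5}
verify = lambda seg, d: conf.get((seg, d), 0.0)

hyps = [("35", ["img_a", "img_b"], 0.6),   # correct string, lower HMM score
        ("85", ["img_c", "img_d"], 0.7)]   # wrong string, higher HMM score
best = rerank(hyps, verify)[0][0]
print(best)  # the verifier flips the ranking to "35"
```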
Article
This paper deals with a new technique for the automatic segmentation of unconstrained handwritten connected numerals. To take care of the variability in the writing styles of different individuals, a robust scheme is presented here. The scheme is mainly based on features derived from the concept of a water reservoir. A reservoir is a metaphor for the region where numerals touch, obtained by considering the accumulation of water poured from the top or from the bottom of the numerals. At first, considering the reservoir's location and size, the touching position (top, middle or bottom) is decided. Next, by analyzing the reservoir boundary, the touching position, and the topological features of the touching pattern, the best cutting point is determined. Finally, combined with morphological structural features, the cutting path for segmentation is generated. The proposed scheme is tested on French bank check data, and an accuracy of about 94.8% is obtained.
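The "water poured from the top" metaphor can be computed from the top profile of a binary image, exactly like the classic trapped-rain-water calculation: cavities between tall strokes fill up, and their depth and extent become features of the touching region. The sketch below is a minimal single-direction version under that assumption; the paper's full scheme also pours from the bottom and analyzes reservoir boundaries.

```python
# Sketch of the top "water reservoir": pour water from above onto the
# top profile of a binary image; cavities between strokes fill up.
# The image and profile values here are toy examples.

def top_profile(img):
    """Depth of the first foreground pixel from the top, per column
    (0 = ink reaches the top row; number of rows = empty column)."""
    rows = len(img)
    prof = []
    for c in range(len(img[0])):
        col = [img[r][c] for r in range(rows)]
        prof.append(col.index(1) if 1 in col else rows)
    return prof

def reservoir_depths(profile):
    """Water trapped per column, bounded by the tallest strokes on
    either side (a smaller profile value means a taller stroke)."""
    n = len(profile)
    left = [min(profile[: i + 1]) for i in range(n)]
    right = [min(profile[i:]) for i in range(n)]
    return [max(0, profile[i] - max(left[i], right[i])) for i in range(n)]

# A 'U'-shaped cavity: two tall strokes with a dip between them.
img = [[1, 0, 1],
       [1, 0, 1],
       [1, 1, 1]]
print(reservoir_depths(top_profile(img)))  # the middle column holds water
```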
Article
A new approach to separating single touching handwritten digit strings is presented. The image of the connected numerals is normalized, preprocessed, and thinned before feature points are detected. Potential segmentation points are determined based on a decision line estimated from the deepest valley or highest hill in the image. The partitioning path is determined precisely, and the numerals are separated before restoration is applied. Experimental results on NIST Database 19, the CEDAR CD-ROM, and our own collection of images show that our algorithm achieves a successful recognition rate of 96%, which compares favorably with those reported in the literature.
Article
Researchers have thus far focused on the recognition of alpha and numeric characters in isolation as well as in context. In this paper we introduce a new genre of problems where the input pattern is taken to be a pair of characters. This adds to the complexity of the classification task. The 10-class digit recognition problem is now transformed into a 100-class problem where the classes are {00,…,99}. Similarly, the alpha character recognition problem is transformed into a 26×26-class problem, where the classes are {AA,…,ZZ}. If lower-case characters are also considered, the number of classes increases further. The justification for adding to the complexity of the classification task is described in this paper. There are many applications where pairs of characters occur naturally as an indivisible unit. Therefore, an approach which recognizes pairs of characters, whether or not they are separable, can lead to superior results. In fact, the holistic method described in this paper outperforms the traditional approaches that are based on segmentation. The correct recognition rate on a set of US state abbreviations and digit pairs, touching in various ways, is above 86%.
Article
In this paper we propose a method to evaluate segmentation cuts for handwritten touching digits. The idea is for the method to work as a filter in a segmentation-based recognition system. This kind of system usually relies on over-segmentation methods, where several segmentation hypotheses are created for each touching group of digits and then assessed by a general-purpose classifier. The novelty of the proposed methodology lies in the fact that unnecessary segmentation cuts can be identified without any classification attempt by a general-purpose classifier, reducing the number of paths in the segmentation graph, which can consequently reduce the computational cost. A cost-based approach using ROC (receiver operating characteristic) analysis was deployed to optimize the filter. Experimental results show that the filter can eliminate up to 83% of the unnecessary segmentation hypotheses and increase the overall performance of the system.
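The filtering idea can be sketched as scoring each candidate cut with a cheap cost function and rejecting those above a threshold chosen offline from labeled examples, in the spirit of the paper's ROC-based optimization. The cost values, labels, and the simple weighted-error threshold search below are all toy stand-ins for the paper's actual features and ROC procedure.

```python
# Sketch of the cut filter: score candidate cuts cheaply, then pick a
# rejection threshold from labeled training cuts (toy numbers throughout).

def pick_threshold(costs, labels, fp_weight=1.0, fn_weight=1.0):
    """Choose the threshold with minimum weighted error on labeled cuts
    (label 1 = necessary cut, 0 = unnecessary)."""
    best_t, best_err = None, float("inf")
    for t in sorted(set(costs)) + [max(costs) + 1]:
        fp = sum(1 for c, y in zip(costs, labels) if c < t and y == 0)
        fn = sum(1 for c, y in zip(costs, labels) if c >= t and y == 1)
        err = fp_weight * fp + fn_weight * fn
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def filter_cuts(costs, threshold):
    """Indices of cuts cheap enough to pass on to the full classifier."""
    return [i for i, c in enumerate(costs) if c < threshold]

train_costs = [0.1, 0.2, 0.8, 0.9]
train_labels = [1, 1, 0, 0]          # low-cost cuts happen to be necessary
t = pick_threshold(train_costs, train_labels)
print(filter_cuts([0.15, 0.85, 0.3], t))  # cuts 0 and 2 survive
```

Only the surviving cuts reach the expensive general-purpose classifier, which is where the computational saving comes from.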
Article
For the first time, a genetic framework using contextual knowledge is proposed for the segmentation and recognition of unconstrained handwritten numeral strings. New algorithms have been developed to locate feature points on the string image and to generate possible segmentation hypotheses. A genetic representation scheme is utilized to represent the space of all segmentation hypotheses (chromosomes). For the evaluation of segmentation hypotheses, a novel evaluation scheme is introduced in order to improve the outlier resistance of the system. Our genetic algorithm searches and evolves the population of segmentation hypotheses to find the one with the highest segmentation/recognition confidence. The NIST NSTRING SD19 and CENPARMI databases were used to evaluate the performance of the proposed method. Our experiments showed that the proper use of contextual knowledge in segmentation, evaluation, and search greatly improves the overall performance of the system. On average, our system obtained correct recognition rates of 95.28% and 96.42% on handwritten numeral strings using neural network and support vector classifiers, respectively. These results compare favorably with those reported in the literature.
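A chromosome in such a framework can be encoded as a bit vector over candidate cut points (1 = use this cut), evolved with standard selection, crossover, and mutation. The sketch below uses a toy fitness in place of the paper's segmentation/recognition confidence; the target bit pattern and all GA parameters are assumptions for illustration only.

```python
import random

# Sketch of the genetic search: a chromosome is a bit vector over
# candidate cut points. The fitness function is a toy stand-in for the
# paper's segmentation/recognition confidence.

TARGET = (0, 1, 0, 1, 0)   # pretend these cut choices give the best confidence

def fitness(chrom):
    """Toy confidence: fraction of cut decisions matching the best ones."""
    return sum(a == b for a, b in zip(chrom, TARGET)) / len(TARGET)

def evolve(pop, generations=50, rng=random.Random(0)):
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: len(pop) // 2]        # elitism: keep the best half
        children = []
        while len(children) < len(pop) - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))    # one-point crossover
            child = list(a[:cut] + b[cut:])
            i = rng.randrange(len(child))     # occasional point mutation
            child[i] ^= rng.random() < 0.1
            children.append(tuple(child))
        pop = parents + children
    return max(pop, key=fitness)

rng = random.Random(0)
pop = [tuple(rng.randint(0, 1) for _ in range(5)) for _ in range(10)]
best = evolve(pop)
print(best, fitness(best))
```

Because the best parents are carried over each generation, the best fitness in the population never decreases.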
Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
This paper describes a discriminatively trained, multi-scale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three-dimensional pose.