International Journal on Document Analysis and Recognition (IJDAR) (2023) 26:131–147
https://doi.org/10.1007/s10032-022-00422-7
ORIGINAL PAPER
Refocus attention span networks for handwriting line recognition
Mohammed Hamdan¹ · Himanshu Chaudhary² · Ahmed Bali³ · Mohamed Cheriet¹
Received: 1 June 2022 / Revised: 24 July 2022 / Accepted: 9 December 2022 / Published online: 25 December 2022
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2022
Abstract
Recurrent neural networks have achieved outstanding performance for handwriting recognition despite the enormous variety observed across diverse handwriting styles and poor-quality scanned documents. We initially propose a BiLSTM baseline model with a sequential architecture well suited to modeling text lines, owing to its ability to learn probability distributions over character or word sequences. However, such recurrent paradigms prevent parallelization and suffer from vanishing gradients on long sequences during training. To alleviate these limitations, we make four significant contributions in this work. First, we devise an end-to-end model composed of a split-attention CNN backbone that serves as the feature extractor and a self-attention transformer encoder–decoder that serves as the transcriber for recognizing handwritten manuscripts. The multi-head self-attention layers in the transformer-based encoder–decoder enhance the model's ability to tackle handwriting recognition and to learn the linguistic dependencies of character sequences. Second, we conduct various studies on transfer learning (TL) from large datasets to a small database, determining which model layers require fine-tuning. Third, we attain an efficient paradigm by combining different TL strategies with data augmentation (DA). Finally, since the proposed model is lexicon-free and can recognize sentences not presented in the training phase, it is trained on only a few labeled examples, with no extra cost of generating and training on synthetic datasets. We record Character and Word Error Rates (CER/WER) on four benchmark datasets that are comparable to, and in some cases better than, the most recent state-of-the-art (SOTA) models.
Keywords Split attention convolutional network · Multi-head attention transformer · Seq2Seq model · BiLSTM · Line handwriting recognition
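As a rough illustration of the pipeline the abstract describes (a convolutional feature extractor feeding a self-attention transformer encoder–decoder transcriber), the following is a minimal, hypothetical PyTorch sketch. The generic ResNet-18 backbone stands in for the paper's split-attention CNN, positional encodings are omitted, and every layer size and name here is an assumption rather than the authors' implementation.

```python
# Hypothetical sketch: CNN backbone + transformer encoder-decoder for line HTR.
# ResNet-18 stands in for the paper's split-attention CNN; positional
# encodings are omitted for brevity. Not the authors' code.
import torch
import torch.nn as nn
import torchvision

class LineHTR(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Keep the convolutional stages; drop global pooling and classifier.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 3, H, W) text-line crops; tgt_tokens: (B, T) char ids.
        f = self.proj(self.backbone(images))      # (B, d_model, h, w)
        src = f.flatten(2).transpose(1, 2)        # (B, h*w, d_model) sequence
        tgt = self.embed(tgt_tokens)              # (B, T, d_model)
        T = tgt.size(1)                           # causal mask: no peeking ahead
        mask = torch.triu(torch.full((T, T), float("-inf"),
                                     device=tgt.device), diagonal=1)
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.head(out)                     # (B, T, vocab) logits

model = LineHTR(vocab_size=100)
logits = model(torch.randn(2, 3, 64, 256), torch.randint(0, 100, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 100])
```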
✉ Mohammed Hamdan
mohammed.hamdan.1@ens.etsmtl.ca
Himanshu Chaudhary
him4318@gmail.com
Ahmed Bali
ahmed.bali.1@ens.etsmtl.ca
Mohamed Cheriet
mohamed.cheriet@etsmtl.ca
1 Synchromedia Lab, System Engineering, University of Quebec (ETS), 1100 Notre-Dame St W, Montreal, Quebec H3C 1K3, Canada
2 Data Science, Dr. A.P.J. Abdul Kalam Technical University, CDRI Rd, Naya Khera, Jankipuram, Lucknow, Uttar Pradesh 226031, India
3 Department of Software and IT Engineering, University of Quebec (ETS), 1100 Notre-Dame St W, Montreal, Quebec H3C 1K3, Canada
1 Introduction
Handwriting Text Recognition (HTR) systems allow computers to read and understand human handwriting. HTR is useful for digitizing the textual contents of old document images in historical records and of contemporary administrative material such as cheques, legal letters, forms, and other documents. While HTR research has been ongoing since the early 1960s [34], it remains a challenging and unsolved research problem. The fundamental difficulty is the wide range of variation and ambiguity introduced by different writers when crafting words. Because the words to be deciphered usually adhere to well-defined grammar rules, it is possible to eliminate gibberish hypotheses and enhance recognition accuracy by modeling these linguistic practices. HTR is therefore usually approached with a blend of computer vision and natural language processing (NLP).
By nature, handwritten text is a signal that follows a particular sequence. Texts in Latin languages are written in ...
... These challenges are further compounded by the lack of robust mechanisms for handling historical character variations and archaic writing styles [1]. Combined with the computational demands of handwritten text recognition [15, 16], these limitations highlight the need for an integrated approach that addresses both accuracy and efficiency requirements. ...
Preprint
Full-text available
Despite significant advances in deep learning, current Handwritten Text Recognition (HTR) systems struggle with the inherent complexity of historical documents, including diverse writing styles, degraded text quality, and computational efficiency requirements across multiple languages and time periods. This paper introduces HTR-JAND (Handwritten Text Recognition with Joint Attention Network and Knowledge Distillation), an efficient HTR framework that combines advanced feature extraction with knowledge distillation. Our architecture incorporates three key components: (1) a CNN architecture integrating FullGatedConv2d layers with Squeeze-and-Excitation blocks for adaptive feature extraction, (2) a Combined Attention mechanism fusing Multi-Head Self-Attention with Proxima Attention for robust sequence modeling, and (3) a Knowledge Distillation framework enabling efficient model compression while preserving accuracy through curriculum-based training. The HTR-JAND framework implements a multi-stage training approach combining curriculum learning, synthetic data generation, and multi-task learning for cross-dataset knowledge transfer. We enhance recognition accuracy through context-aware T5 post-processing, particularly effective for historical documents. Comprehensive evaluations demonstrate HTR-JAND's effectiveness, achieving state-of-the-art Character Error Rates (CER) of 1.23%, 1.02%, and 2.02% on the IAM, RIMES, and Bentham datasets respectively. Our student model achieves a 48% parameter reduction (0.75M versus 1.5M parameters) while maintaining competitive performance through efficient knowledge transfer. Source code and pre-trained models are available at https://github.com/DocumentRecognitionModels/HTR-JAND.
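The knowledge-distillation component above can be illustrated with the standard soft-target loss of Hinton et al.; the sketch below is generic, with temperature and weighting chosen arbitrarily, and is not HTR-JAND's actual training recipe.

```python
# Minimal sketch of the standard knowledge-distillation loss (Hinton et al.).
# Temperature and weighting are illustrative assumptions, not HTR-JAND's recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      T: float = 2.0, alpha: float = 0.5):
    # Soft targets: match the teacher's tempered distribution (KL divergence).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: usual cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 100)            # student logits over 100 characters
t = torch.randn(8, 100)            # frozen teacher logits
y = torch.randint(0, 100, (8,))    # ground-truth character ids
print(distillation_loss(s, t, y))
```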
... Sampath and Gomathi [24] used MNIST to train neural networks for English handwritten recognition. RIMES, IAM, and READ were used in [25–27] to generate variable-length symbol sequences from English handwritten text. More related work is shown in Table 1, which includes each dataset's name, size, and the procedure used. ...
Article
Full-text available
Given the prevalence of handwritten documents in human interactions, optical character recognition (OCR) for documents holds immense practical value. OCR is a field that empowers the translation of various document types and images into data that can be analyzed, edited, and searched. In handwritten recognition techniques, symmetry can be crucial to improving accuracy. It can be used as a preprocessing step to normalize the input data, making it easier for the recognition algorithm to identify and classify characters accurately. This review paper aims to summarize the research conducted on character recognition for handwritten documents and offer insights into future research directions. Within this review, the research articles focused on handwritten OCR were gathered, synthesized, and examined, along with closely related topics, published between 2019 and the first quarter of 2024. Well-established electronic databases and a predefined review protocol were utilized for article selection. The articles were identified through keyword, forward, and backward reference searches to comprehensively cover all relevant literature. Following a rigorous selection process, 116 articles were included in this systematic literature review. This review article presents cutting-edge achievements and techniques in OCR and underscores areas where further research is needed.
... The purpose of sequence modeling is to capture the contextual information of characters in image text for the next prediction, which is more suitable than handling each character separately. Bidirectional Long Short-Term Memory (BiLSTM) can capture long-term dependencies better than traditional RNN structures in the sequence modeling stage, and many studies have begun to use BiLSTM in character recognition tasks [18]. However, although BiLSTM can efficiently capture contextual information, its sequential structure makes it very time-consuming during both training and inference. ...
Article
In steel production, the recognition of hot-cast billet numbers suffers from low efficiency and susceptibility to misjudgment. This paper proposes a novel method for identifying hot-cast billet numbers based on an improved Convolutional Recurrent Neural Network (ICRNN). Although the existing CRNN has achieved acceptable results in text recognition and music symbol recognition, it is not effective in recognizing industrial characters that are blurred or low-contrast. Based on the theoretical framework of CRNN, a Grayscale Spatial Transformation Network (GSTN) is added before character recognition to rectify the skew caused by shooting angles. The backbone network for feature extraction is changed to ResNet50. Moreover, the Efficient Channel Attention (ECA) module is added to construct the Res-ECA network, which extracts more features of the billet number characters. In the sequence modeling stage, the Bidirectional Gated Recurrent Unit (BiGRU) is used to reduce the risk of overfitting and accelerate convergence. In experimental comparisons on a self-made billet number dataset, the proposed ICRNN has faster recognition speed and higher accuracy, with a recognition accuracy of 99.49%, which is 4.8% higher than that of the CRNN. The result fully demonstrates that ICRNN meets the requirements of accuracy and speed for billet number recognition.
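The Efficient Channel Attention (ECA) module mentioned above admits a compact sketch: global average pooling followed by a 1D convolution across channels produces per-channel gates. The fixed kernel size below is a simplification (ECA-Net derives it adaptively from the channel count).

```python
# Sketch of an Efficient Channel Attention (ECA) block; the kernel size is
# fixed here rather than adaptively derived as in the ECA-Net paper.
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):               # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))          # global average pool -> (B, C)
        # 1D conv across the channel axis captures local cross-channel interaction.
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # (B, C)
        w = self.sigmoid(y)[:, :, None, None]      # per-channel gates
        return x * w

feat = torch.randn(2, 64, 16, 50)
print(ECA()(feat).shape)  # torch.Size([2, 64, 16, 50])
```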
... Some studies, like [25, 26], use attention mechanisms for paragraph recognition tasks with implicit line and character segmentation. The authors of [30] proposed attention blocks with encoder-decoder architectures at the line and character levels. The decoder subnetwork (an LSTM) generates output based on character likelihoods computed from the representation. ...
Preprint
Full-text available
Handwritten document recognition (HDR) is one of the most challenging tasks in the field of computer vision, due to the various writing styles and complex layouts inherent in handwritten texts. Traditionally, this problem has been approached as two separate tasks, handwritten text recognition and layout analysis, and struggled to integrate the two processes effectively. This paper introduces HAND (Hierarchical Attention Network for Multi-Scale Document), a novel end-to-end and segmentation-free architecture for simultaneous text recognition and layout analysis tasks. Our model's key components include an advanced convolutional encoder integrating Gated Depth-wise Separable and Octave Convolutions for robust feature extraction, a Multi-Scale Adaptive Processing (MSAP) framework that dynamically adjusts to document complexity and a hierarchical attention decoder with memory-augmented and sparse attention mechanisms. These components enable our model to scale effectively from single-line to triple-column pages while maintaining computational efficiency. Additionally, HAND adopts curriculum learning across five complexity levels. To improve the recognition accuracy of complex ancient manuscripts, we fine-tune and integrate a Domain-Adaptive Pre-trained mT5 model for post-processing refinement. Extensive evaluations on the READ 2016 dataset demonstrate the superior performance of HAND, achieving up to 59.8% reduction in CER for line-level recognition and 31.2% for page-level recognition compared to state-of-the-art methods. The model also maintains a compact size of 5.60M parameters while establishing new benchmarks in both text recognition and layout analysis. Source code and pre-trained models are available at https://github.com/MHHamdan/HAND.
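As a hedged reading of the "Gated Depth-wise Separable" convolutions named in the abstract, the sketch below combines a depthwise-separable convolution with GLU-style multiplicative gating; HAND's exact formulation may differ.

```python
# Hedged sketch of a gated depthwise-separable convolution: one plausible
# reading of the block named in the abstract, not HAND's exact formulation.
import torch
import torch.nn as nn

class GatedDWSeparableConv(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Depthwise: one filter per channel (groups=channels).
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=pad, groups=channels)
        # Pointwise: 1x1 conv producing features and a gate in one pass.
        self.pointwise = nn.Conv2d(channels, 2 * channels, kernel_size=1)

    def forward(self, x):
        h = self.pointwise(self.depthwise(x))
        feat, gate = h.chunk(2, dim=1)      # split into value / gate halves
        return feat * torch.sigmoid(gate)   # GLU-style multiplicative gating

x = torch.randn(1, 32, 64, 256)
print(GatedDWSeparableConv(32)(x).shape)  # torch.Size([1, 32, 64, 256])
```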
Chapter
In the past years, understanding human behavior and classifying human actions and intentions through human activity recognition (HAR) using traditional pattern recognition has made great progress. In this paper, a novel approach is proposed based on a convolutional neural network (CNN) and attention-based long short-term memory (attention LSTM) architecture. Human activity recognition (HAR), or recognizing human behavior, is a challenging task owing to human tendencies, as the activities are not only complex but also multitasking. This deep learning-based long short-term memory network using convolutional neural networks (CNN-attention LSTM) predicts the activities performed by humans and improves accuracy by reducing the complexity of raw data and removing unnecessarily complex data. The convolutional layers act as a feature extractor, learning hierarchical representations of the image by applying multiple filters to the input image and passing the resulting feature maps through multiple activation functions. These learned features are then used as input to another classifier, such as an attention LSTM network, to make the final prediction. On the internal UCF50 dataset, the proposed model achieves 84.43% accuracy. The outcomes demonstrate that the suggested model is more robust and capable of activity detection than some previously reported results.
Keywords: Human activity recognition · Deep learning · Convolutional neural network · Activity recognition
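A minimal sketch of the CNN-features-into-attention-LSTM pattern described here might look as follows; the per-frame feature dimension, the simple learned soft attention over time steps, and all names are illustrative assumptions.

```python
# Rough sketch of a CNN-feature + attention-LSTM activity classifier in the
# spirit of the chapter; shapes and the soft attention are assumptions.
import torch
import torch.nn as nn

class AttnLSTMClassifier(nn.Module):
    def __init__(self, feat_dim: int, hidden: int, n_classes: int):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)       # scores each time step
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (B, T, feat_dim) per-frame CNN features
        h, _ = self.lstm(x)                    # (B, T, hidden)
        a = torch.softmax(self.attn(h), dim=1) # (B, T, 1) attention weights
        ctx = (a * h).sum(dim=1)               # attention-weighted sum over time
        return self.fc(ctx)

frames = torch.randn(4, 30, 512)  # e.g. 30 frames of pooled CNN features
print(AttnLSTMClassifier(512, 128, 50)(frames).shape)  # torch.Size([4, 50])
```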
Article
Full-text available
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task that can provide a relevant boost to the digitization of handwritten documents and reuse of their content. The task becomes even more challenging when dealing with historical documents due to the variability of the writing style and degradation of the page quality. State-of-the-art HTR approaches typically couple recurrent structures for sequence modeling with Convolutional Neural Networks for visual feature extraction. Since convolutional kernels are defined on fixed grids and focus on all input pixels independently while moving over the input image, this strategy disregards the fact that handwritten characters can vary in shape, scale, and orientation even within the same document and that the ink pixels are more relevant than the background ones. To cope with these specific HTR difficulties, we propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text. We design two deformable architectures and conduct extensive experiments on both modern and historical datasets. Experimental results confirm the suitability of deformable convolutions for the HTR task.
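Deformable convolutions are available off the shelf in torchvision; the usage sketch below shows the technique generically (offsets predicted from the input shift each kernel tap), not the authors' architecture.

```python
# Usage sketch of deformable convolution via torchvision.ops.DeformConv2d;
# a generic illustration of the technique, not the authors' architecture.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

in_ch, out_ch, k = 32, 64, 3
deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=1)
# Offsets are predicted from the input itself: 2 (x, y) values per kernel tap.
offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=3, padding=1)

x = torch.randn(1, in_ch, 64, 256)   # a text-line feature map
offsets = offset_pred(x)             # (1, 2*k*k, 64, 256)
y = deform(x, offsets)               # kernel taps shift per spatial location
print(y.shape)                       # torch.Size([1, 64, 64, 256])
```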
Article
Full-text available
In this paper we address the problem of offline handwritten text recognition (HTR) in historical documents when few labeled samples are available and some of them contain errors in the train set. Our three main contributions are: first, we analyze how to perform transfer learning (TL) from a massive database to a smaller historical database, analyzing which layers of the model need fine-tuning. Second, we analyze methods to efficiently combine TL and data augmentation (DA). Finally, we propose an algorithm to mitigate the effects of incorrect labeling in the training set. The methods are analyzed over the ICFHR 2018 competition database, Washington and Parzival. Combining all these techniques, we demonstrate a remarkable reduction of CER (up to 6 percentage points in some cases) in the test set with little complexity overhead.
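A minimal sketch of the layer-wise fine-tuning strategy studied here: load weights pretrained on a massive database, freeze the early feature layers, and adapt the rest on the small target set. The stand-in network, checkpoint path, and frozen-layer choice below are hypothetical.

```python
# Sketch of transfer learning with selective layer freezing. The ResNet is a
# stand-in for an HTR model pretrained on a large corpus; the checkpoint path
# and which layers to freeze are hypothetical choices.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)
# model.load_state_dict(torch.load("pretrained_large.pt"))  # hypothetical path

# Freeze the earliest stages; fine-tune deeper layers and the head only.
for name, p in model.named_parameters():
    p.requires_grad = not name.startswith(("conv1", "bn1", "layer1"))

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable")
```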
Article
Full-text available
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition. To account for the sequence-to-sequence structure, each feature map is divided into different instances over which the contrastive loss is computed. This operation enables us to contrast in a sub-word level, where from each image we extract several positive pairs and multiple negative examples. To yield effective visual representations for text recognition, we further suggest novel augmentation heuristics, different encoder architectures and custom projection heads. Experiments on handwritten text and on scene text show that when a text decoder is trained on the learned representations, our method outperforms non-sequential contrastive methods. In addition, when the amount of supervision is reduced, SeqCLR significantly improves performance compared with supervised training, and when fine-tuned with 100% of the labels, our method achieves state-of-the-art results on standard handwritten text recognition benchmarks.
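The instance-slicing idea can be sketched as follows: pool each feature sequence into sub-word instances, then contrast matched instances across two augmented views with a simplified, one-directional InfoNCE loss. Window size, pooling, and temperature are assumptions, not SeqCLR's exact recipe.

```python
# Hedged sketch of the SeqCLR idea: split sequences of visual features into
# sub-word "instances" and contrast them across two augmented views.
# Simplified one-directional InfoNCE; parameters are assumptions.
import torch
import torch.nn.functional as F

def to_instances(feats, window=5):
    # feats: (B, T, D) frame features; average non-overlapping windows.
    B, T, D = feats.shape
    T = (T // window) * window
    return feats[:, :T].reshape(B, T // window, window, D).mean(2)  # (B, N, D)

def info_nce(z1, z2, tau=0.1):
    # z1, z2: (M, D) matched instance embeddings from the two views.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # similarity of every pair
    labels = torch.arange(z1.size(0))          # positives on the diagonal
    return F.cross_entropy(logits, labels)

v1 = to_instances(torch.randn(4, 50, 256)).reshape(-1, 256)
v2 = to_instances(torch.randn(4, 50, 256)).reshape(-1, 256)
print(info_nce(v1, v2))
```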
Article
The advent of recurrent neural networks for handwriting recognition marked an important milestone reaching impressive recognition accuracies despite the great variability that we observe across different writing styles. Sequential architectures are a perfect fit to model text lines, not only because of the inherent temporal aspect of text, but also to learn probability distributions over sequences of characters and words. However, using such recurrent paradigms comes at a cost at training stage, since their sequential pipelines prevent parallelization. In this work, we introduce a novel method that bypasses any recurrence during the training process with the use of transformer models. By using multi-head self-attention layers both at the visual and textual stages, we are able to tackle character recognition as well as to learn language-related dependencies of the character sequences to be decoded. Our model is unconstrained to any predefined vocabulary, being able to recognize out-of-vocabulary words, i.e. words that do not appear in the training vocabulary. We significantly advance over prior art and demonstrate that satisfactory recognition accuracies are yielded even in few-shot learning scenarios.
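Lexicon-free decoding of the kind described here can be sketched as a greedy loop over a transformer decoder: each step conditions only on the image features and the characters emitted so far, so out-of-vocabulary words pose no constraint. The model interface assumed below matches the hypothetical LineHTR sketch given earlier; the token ids are placeholders.

```python
# Sketch of lexicon-free greedy decoding with a transformer decoder; assumes
# a model with the (image, tokens) -> logits interface of the LineHTR sketch.
import torch

@torch.no_grad()
def greedy_decode(model, image, sos=1, eos=2, max_len=80):
    tokens = torch.tensor([[sos]])                    # (1, 1) running prefix
    for _ in range(max_len):
        logits = model(image, tokens)                 # (1, T, vocab)
        nxt = logits[:, -1].argmax(-1, keepdim=True)  # most likely next char
        tokens = torch.cat([tokens, nxt], dim=1)
        if nxt.item() == eos:                         # stop at end-of-sequence
            break
    return tokens[0, 1:]                              # drop the <sos> token

# Example: chars = greedy_decode(model, torch.randn(1, 3, 64, 256))
```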
Article
Skeleton-based Human Activity Recognition has achieved great interest in recent years as skeleton data has demonstrated being robust to illumination changes, body scales, dynamic camera views, and complex background. In particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) demonstrated to be effective in learning both spatial and temporal dependencies on non-Euclidean data such as skeleton graphs. Nevertheless, an effective encoding of the latent information underlying the 3D skeleton is still an open problem, especially when it comes to extracting effective information from joint motion patterns and their correlations. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) to model inter-frame correlations. The two are combined in a two-stream network, whose performance is evaluated on three large-scale datasets, NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics Skeleton 400, consistently improving backbone results. Compared with methods that use the same input data, the proposed ST-TR achieves state-of-the-art performance on all datasets when using joints’ coordinates as input, and results on-par with state-of-the-art when adding bones information.
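The Spatial Self-Attention (SSA) idea, each joint attending to all joints within the same frame, can be sketched with a stock multi-head attention layer; the dimensions below are illustrative assumptions.

```python
# Minimal sketch of spatial self-attention over skeleton joints (the SSA idea):
# every joint attends to every joint within the same frame. Dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn

joints = torch.randn(8, 25, 64)   # (batch*frames, 25 joints, channel dim)
ssa = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
out, attn = ssa(joints, joints, joints)   # intra-frame joint-to-joint attention
print(out.shape, attn.shape)  # torch.Size([8, 25, 64]) torch.Size([8, 25, 25])
```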
Chapter
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.
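The bipartite-matching step at the heart of DETR can be sketched with the Hungarian algorithm; the plain L1 cost below is a simplification of DETR's full matching cost, which also mixes in classification and generalized-IoU terms.

```python
# Sketch of DETR-style bipartite matching: assign each predicted box to at
# most one ground-truth box by minimizing a pairwise cost. Plain L1 cost is a
# simplification of DETR's full matching objective.
import torch
from scipy.optimize import linear_sum_assignment

pred = torch.rand(5, 4)    # 5 predicted boxes (cx, cy, w, h)
gt = torch.rand(3, 4)      # 3 ground-truth boxes
cost = torch.cdist(pred, gt, p=1)               # (5, 3) pairwise L1 costs
row, col = linear_sum_assignment(cost.numpy())  # optimal one-to-one matching
for r, c in zip(row, col):
    print(f"prediction {r} -> ground truth {c}")
```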