RMLP-ViT: Recognizing Handwritten Chinese
Examination Texts with ViT
Kaihe Zhong
Tianjin Normal University
Yuanping Zhu
Tianjin Normal University
Research Article
Keywords: Vision Transformer, Handwritten Chinese text recognition, Examination paper text, Linear
Projection
Posted Date: July 1st, 2024
DOI: https://doi.org/10.21203/rs.3.rs-4555340/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 
Additional Declarations: No competing interests reported.
RMLP-ViT: Recognizing Handwritten Chinese Examination
Texts with ViT
Kaihe Zhong1 and Yuanping Zhu1*
1*School of Computing and Information Engineering, Tianjin Normal University, Tianjin
300387, China.
*Corresponding author(s). E-mail(s): zhuyuanping@tjnu.edu.cn;
Abstract
Text lines in examination papers often exhibit nonuniform character size, char-
acter erasure, diverse text lengths, and dense long texts. This paper proposes an improved method for
ViT to enhance its capability in recognizing text lines in handwritten Chinese examination papers.
First, this method employs a segmentation method suitable for text line recognition and proposes a
repeated multiscale linear projection (RMLP) method to enrich the spatial information of the image
vectors, which improves the model’s integration capability for patch vectors of multiple scales. Second,
ViT is combined with CTC to achieve prediction for each patch, thus improving the robustness of ViT
in Chinese handwritten text recognition. Experiments show that RMLP-ViT handles examination
paper text lines well and achieves good performance on the SCUT-EPT dataset.
Keywords: Vision Transformer, Handwritten Chinese text recognition, Examination paper text, Linear
Projection
1 Introduction
Handwritten Chinese Text Recognition (HCTR)
has been extensively studied due to its diverse
writing styles, challenges associated with char-
acter segmentation, large character sets, and
complex semantics. In examination paper text,
handwritten text recognition faces more complex
situations, such as nonuniform word size, charac-
ter erasure, noisy background, diverse text length,
and dense long texts. These problems are visually
depicted in Fig. 1.
In the field of handwritten Chinese text recog-
nition, the dominant approach is a segmentation-
free recognition methodology[1][2][3][4][5][6]. This
obviates the need for explicit segmentation of text
lines into individual characters, allowing for recog-
nition outcomes to be attained solely through text
annotation. The current state-of-the-art methods
largely revolve around the Convolutional Recur-
rent Neural Network (CRNN) architecture pro-
posed by Shi et al. [7]. This framework syner-
gizes convolutional layers, recurrent layers, and
transcription layers to extract highly abstracted
feature sequences from serialized input data.
Within this architecture, Long Short-Term Mem-
ory (LSTM) networks directly handle pointwise
feature vectors that incorporate (x, y) coordinates
and relative feature values, dynamically updat-
ing hidden states and offering frame predictions
for each temporal step in the sequence. Finally, a
sequence transcription process is executed using
the Connectionist Temporal Classification (CTC)
algorithm.
Fig. 1 Problems of Chinese text lines in real examination paper text: soft erasure (a, b), nonuniform word size (b), hard
erasure (c, d), noised background (b, f), diverse text length (a, c, d are short text; b, e, f are long text), dense long texts (e, f).
This integrative approach, which utilizes deep
convolutional networks for feature extraction,
recurrent networks to process serialized informa-
tion and dependencies, and CTC for sequence out-
put without predefined segmentation, establishes
a new technological benchmark for handwritten
text recognition. It demonstrates the profound
potential of deep learning in processing complex
serialized data.
In recent years, the Transformer has been
employed in the field of text recognition due
to its formidable ability to capture long-range
dependencies. Works based on a hybrid CNN-
Transformer architecture have surpassed tradi-
tional models based on CNNs and RNNs. The
Vision Transformer (ViT), proposed by Dosovit-
skiy et al. [8], has achieved state-of-the-art perfor-
mance on multiple image recognition benchmarks.
Beyond image classification, the ViT has also been
applied to a wide range of other visual tasks,
including object detection, semantic segmenta-
tion, image processing, and video understanding.
The superior performance of these models has
drawn an increasing number of researchers to
study and improve upon ViT, and to explore these
models' application in the field of text recognition.
Considering the complex spatial structure of
handwritten Chinese examination papers and con-
textual semantics, we believe that ViT can effec-
tively address these issues. However, Transformer-
based models are constrained by the length
of sequences they process, with these models’
scale growing quadratically with sequence length,
making them less practical for handling long
sequences. Consequently, most current research
focuses on scene text, and there is a dearth of
experimental research on handwritten text. There
are two main reasons for this.
On one hand, the ViT model has achieved
such exemplary results owing to its simple yet
powerful and scalable architecture (the larger the
model, the better the performance). Without the
restrictions of inductive biases inherent in the
transformer design, ViT models only outperform
CNNs when pretrained with an ample supply of
data. CNNs possess two inductive biases—locality
and translation invariance, which mean that adja-
cent regions in an image are likely to exhibit
similar features. These biases provide CNNs with
a wealth of prior knowledge, enabling the learn-
ing of robust models with smaller datasets. The
commonly used datasets for handwritten Chi-
nese text are not voluminous enough to meet the
requirement of ViT.
Fig. 2 Patches from different segmentation methods
On the other hand, as shown in Fig. 2, main-
stream ViT models typically divide images into
patches of size 16x16. When applied to lines
of Chinese text, this patch size can lead to an
overly granular breakdown of the image. More-
over, it often splits a single character into multiple
patches, and upon flattening, separates different
patches of the same character, disrupting the
internal structure within each patch. This issue
is particularly problematic for Chinese characters
with complex strokes, and it also complicates the
decoding challenge related to the length of the
divided sequence and the positional encoding.
For the above demands and challenges, to
enhance the applicability of the Vision Trans-
former for the task of handwritten Chinese exam-
ination paper text recognition, this study intro-
duces an improved model, named RMLP-ViT.
This model harnesses the superior image under-
standing capabilities of ViT to address the unique
challenges present in the recognition of hand-
written Chinese examination paper text. As shown
in Fig. 2, RMLP-ViT adopts a patch partition-
ing strategy consistent with that of MaskOCR[9],
transforming text images into a one-dimensional
patch sequence based on image height h, while
maintaining the relative position and semantic
relationship of the segmented patch sequence. To
amplify the model’s ability to perceive complex
text image information, a novel algorithm named
Repeated Multi-Scale Linear Projection (RMLP)
is proposed. This approach enables the model to
more effectively process image features at multiple
scales.
Considering the diversity of Chinese charac-
ter structures, the patch vectors produced by
RMLP (Repeated Multi-Scale Linear Projection)
are fused with vectors obtained through tradi-
tional linear projection methods. This fusion strat-
egy significantly enhances the model’s capability
in capturing information across different image
resolutions. In the decoding phase, the linear layer
is combined with CTC to align the prediction of
each patch with the label, so that ViT’s predic-
tion for each patch can form a complete text line,
making ViT more suitable for text recognition.
Notably, this process improves the robustness of
the model against handwritten Chinese examina-
tion paper text images.
In summary, the contributions of this article
are as follows:
1 A ViT-based model for handwritten Chinese
examination paper text is proposed. The
model combines ViT with CTC, enabling ViT
to be applied effectively to handwritten long-text
recognition tasks.
2 A repeated multi-scale linear projec-
tion (RMLP) method is proposed. This fusion strategy
significantly enhances the model's efficiency in
capturing information across different image
resolutions.
3 Experimental results show that the proposed
RMLP-ViT achieves good performance on the
SCUT-EPT dataset.
2 Related Work
2.1 CRNN based methods
In the field of optical character recognition for
handwritten Chinese text, traditional recognition
tasks require character segmentation before pro-
cessing. However, RNN-based segmentation-free
approaches eliminate the need for individual char-
acter segmentation and achieve superior perfor-
mance over segmentation-based methods. These
RNN-based techniques begin with feature extrac-
tion from text images via CNN. Subsequently, the
extracted features are transformed into a sequence
of feature maps. The RNN model is utilized to pro-
cess these sequences, projecting the feature map
sequences onto an output sequence, and employs
the CTC loss function[10] to align the model-
produced sequence with the actual text label
sequence. This facilitates the model’s autonomous
learning of the correspondence between charac-
ters.
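As a rough illustration only (not the implementation of any cited model; layer sizes and the height-averaging step are assumptions), such a CRNN-style pipeline can be sketched in PyTorch as:

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Generic CRNN-style recognizer sketch: CNN features -> horizontal
    sequence -> bidirectional LSTM -> per-frame class scores for CTC."""
    def __init__(self, num_classes, in_ch=1, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(128, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                  # x: (B, C, H, W)
        f = self.cnn(x)                    # (B, 128, H/4, W/4)
        f = f.mean(dim=2).transpose(1, 2)  # collapse height -> (B, W/4, 128)
        out, _ = self.rnn(f)               # (B, W/4, 2*hidden)
        return self.fc(out)                # per-frame logits for the CTC loss
```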
The significant advantage of models like
CRNN is that they can handle text sequences of
any length without the need for character segmen-
tation or predefined dictionaries. Building upon
this framework, numerous subsequent studies have
enhanced Chinese text line recognition perfor-
mance in OCR by substituting the conventional
RNN with advanced Long Short-Term Memory
(LSTM) networks or bi-directional LSTMs (Bi-
LSTM) [5,11]. Unfortunately, they are likely to
suffer from vanishing/exploding gradient prob-
lems when processing long text images, which are
commonly found in scanned documents.
2.2 Transformer based methods
As the Transformer architecture[12] has demon-
strated significant success in the field of Natu-
ral Language Processing (NLP), some researchers
have integrated Transformer units in place of
LSTM modules to adapt to the challenges of
Optical Character Recognition. Lee et al. [13] pro-
posed the Self-Attention Text Recognition Net-
work (SATRN), which utilizes the self-attention
mechanism to capture the two-dimensional spatial
dependencies of characters in scene text images.
The comprehensive propagation properties of self-
attention enable SATRN to recognize texts with
arbitrary arrangements and large character spac-
ing.
To better utilize Transformer models in the
Computer Vision (CV) field, Dosovitskiy et al. [8]
developed the Vision Transformer (ViT), which
has become popular among CV researchers. This
popularity has spurred ViT and its variants to
achieve remarkable results in areas such as object
detection, semantic segmentation, image process-
ing, and video understanding. Moreover, there is
an increasing trend of adopting ViT in the text
recognition field. LevOCR[14] encodes the initial
prediction sequence generated by the pure vision
model and feeds it into a cross-modal Transformer,
interacting and integrating with visual features
to progressively converge on the true text values.
TrOCR [15] introduced an encoder-decoder struc-
ture based on pretrained Transformers from both
CV and NLP, marking the first effort to utilize
combined pretrained image and text Transformers
for OCR tasks.
However, the above approaches utilize a fixed
patch resolution when segmenting with ViT,
which can negatively impact text images of par-
ticular word lengths and scales. To address this,
[16] proposed PTIE (Pure Transformer with Inte-
grated Experts), a pure Transformer model that
accommodates different patch resolutions and
decodes in both the original and reverse character
orders. MGP-STR[17] presented a custom Adap-
tive Addressing and Aggregation (A3) module
that selects meaningful token combinations from
ViT and integrates them into an output token
corresponding to a specific character, named the
Character A3 module. MaskOCR[9] divides the
text image into a set of vertical image blocks and
randomly masks blocks that may contain a part
or an entire character. In the learned representa-
tion space from the encoder, the masked block’s
representation is predicted from the visible blocks
and the predicted representation is projected onto
the masked block image.
To enhance the suitability of ViT models for
Chinese handwritten examination paper text line
recognition tasks, this study proposed the RMLP-
ViT model. This method uses a segmentation
method similar to MaskOCR to prevent over seg-
mentation of text line images. In order to equip
the ViT with inductively biased global representa-
tional capabilities for better performance on small
datasets, and to overcome the insufficient spatial
information and homogeneity of scale inherent in
the original ViT, this paper proposes a method
of linear projection that differs from that used in
ViT, named Repeated Multi-scale Linear Projec-
tion (RMLP). With the introduction of ViT, trans-
forming images from 2D to 1D sequences has been
greatly simplified. Thus, this study combines
Connectionist Temporal Classification (CTC)[10]
directly with the Transformer architecture, elim-
inating the need for character-level alignment
annotations during training, which makes the
model markedly more adaptable for processing
lengthy texts of variable length.
3 Method
3.1 Framework
The RMLP-ViT framework is illustrated in Fig.
3. The framework comprises two stages: the lin-
ear projection stage and the Transformer-CTC
stage. In the linear projection stage, the image
is projected into fixed-length vectors, which are
then encoded through the Encoder module of
the Transformer. Following this, the vectors are
decoded by a fully connected layer and finally, the
predicted text is obtained using the CTC module.
Fig. 3 Architecture of the RMLP-ViT
3.2 Linear Projection Stage
In the context of the Vision Transformer (ViT)
algorithm, an image is partitioned into small pic-
torial blocks, which are then linearly embedded
and fed into the network as input for the Trans-
former model. Customarily, ViT segments the
input image into fixed-size patches; each patch
is flattened into a vector, followed by a linear
projection of each patch’s vector onto a fixed-
dimensional embedding space. Such a linear pro-
jection transforms the primal pixel values from
the input image blocks into representations with
higher dimensions and abstraction. This facilitates
the model’s ability to capture semantic details and
features present within the image.
Given the intricate structural information
inherent in Chinese characters, the conventional
patch division method of ViT might induce unsuit-
able segmentation granularity. This can lead to
the excessive dispersion of encoded vectors that
pertain to the same character once flattened into
patches, as shown in Fig. 2(b).
In order to solve the above problems, this
paper adopts the same segmentation method as [9],
as shown in Fig. 2(c). This segmentation method
divides the image into image blocks with the same
height as the height h of the text line image. It
retains the internal structure of a single Chinese
character in the text line image, and also main-
tains the linear relationship of the whole text line
after segmentation. The influence of segmentation
methods on text recognition accuracy is shown in
Section 4.3.4.
Considering the variation in stroke com-
plexity and font size of Chinese text, the
embedding after linear projection should include
as much image texture information and multi-
scale information as possible. Multi-scale infor-
mation addresses the problem of differing text
sizes and allows the model to learn both global
and local information of the image, which is
very helpful for understanding Chinese charac-
ter images. Therefore, the Repeated Multi-scale
Linear Projection (RMLP) method is proposed in
this paper. The process is portrayed in Fig. 3,
within the 'Linear Projection Stage'. During the
segmentation of patches, two distinct linear pro-
jections are employed:
1 One set conducts a traditional linear projec-
tion across the image's height h, categorizing
text line images into a one-dimensional array of
patches. This group of projection results only
contains the global information of each patch.
2 A second set utilizes the Repeated Multi-scale
Linear Projection (RMLP) method, as illus-
trated in Fig. 3. It initially segments the image
into smaller-scale patch blocks while retaining
the relative positioning of the patches, with-
out flattening. Then it subjects the segmented
group of patches to repeated linear projections,
culminating in a patch sequence of equivalent
dimensions to the first set. By intensifying the
linear projection depth and inducing additional
nonlinear transformations, experimental out-
comes indicate optimal results upon the third
linear projection iteration.
Following that, the two results of linear pro-
jections are fused. The synthesized sequence of
patches is subsequently subjected to another
linear projection, thereby making the post-
projection embedding vectors encompass greater
spatial details and diverse scale information.
In current ViT architectures, the linear
projection module can be instantiated in two pri-
mary ways. The first employs the 'rearrange'
operation, which reorders the image pixels and
then applies a linear layer to perform the
linear projection of an image, as delineated in
Fig. 4(a). The alternative method leverages a con-
volution operation whose stride equals
the size of the convolutional kernel, accompa-
nied by a normalization step, to implement the
linear projection, as depicted in Fig. 4(b). This
convolution-based method maintains the rela-
tive position of the patches, which makes it more
suitable for repeated linear projection operations,
so this paper adopts the convolution-based method
in its implementation of the linear projection
module. Furthermore, the convergence of the
model is improved by integrating a Rectified
Linear Unit (ReLU) activation function after each
normalization operation, as shown in Fig. 4(c).
This not only preserves the relative positioning of
the patches but also introduces non-linearity into
the model, enhancing its feature extraction capa-
bilities and improving its representational power.
Fig. 4 Linear projection in different ways
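As a minimal sketch, the convolution-based projection step of Fig. 4(c) might be implemented as follows; the module name and the use of BatchNorm for the normalization step are assumptions:

```python
import torch.nn as nn

class ConvLinearProjection(nn.Module):
    """One linear-projection step as in Fig. 4(c): a convolution whose stride
    equals its kernel size (each patch is projected independently), followed
    by normalization and a ReLU non-linearity."""
    def __init__(self, in_channels, embed_dim, patch_size):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.BatchNorm2d(embed_dim)   # normalization step (BatchNorm assumed)
        self.act = nn.ReLU(inplace=True)        # non-linearity after normalization

    def forward(self, x):                          # x: (B, C, H, W)
        return self.act(self.norm(self.proj(x)))  # (B, embed_dim, H/ph, W/pw)
```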
Specifically, given an input image of dimen-
sions W×H, the objective is to transform it
into a 2D sequence of patches, denoted as
$x_p \in \mathbb{R}^{N \times (H \times w \times C)}$, where H represents the height
of the text image, w represents the width of each
patch, C represents the number of image channels,
and N represents the total number of resulting
patches. Initially, the first set of linear projec-
tions straightforwardly projects the original input
image into patches of size $H \times 2w \times C$. Simultaneously,
the second group of partitions uses the RMLP
method, projecting the text line image onto
patches of dimension $(H/2^{n}) \times (2w/2^{n}) \times C$, where
n is the repetition factor for the linear projec-
tions. Subsequently, a spatial attention module is
deployed to capture the intricate spatial relations
within each small patch. A second linear projection
then amalgamates four adjacent patches into a
larger patch of dimensions $(H/2^{n-1}) \times (2w/2^{n-1}) \times C$.
After undergoing n iterations
of such linear projections, the outcome is a patch
of dimensions $H \times 2w \times C$. The results from
the two described processes are concatenated,
thereby integrating multiple nonlinear transfor-
mations through linear projections and propelling
features into various feature spaces; finally, a
patch embedding of size $H \times w \times C$ is obtained.
By merging the feature representations from each
projection, an embedding vector endowed with
multi-scale spatial features is produced.
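The flow described above can be sketched as follows, reusing the ConvLinearProjection step from the previous listing. The concrete sizes (height 96, final patch width 8, n = 3 repetitions), the channel-wise concatenation used for fusion, and the omission of the spatial-attention module are illustrative assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class RMLPEmbedding(nn.Module):
    """Two-branch patch embedding sketch: branch 1 projects full-height
    vertical patches in one step; branch 2 starts from patches 2**n times
    smaller and merges 2x2 neighbouring patches n times with stride-2
    projections, ending on the same grid as branch 1. The two sequences are
    concatenated and fused by a final linear layer."""
    def __init__(self, in_ch=1, embed_dim=256, height=96, patch_w=8, n=3):
        super().__init__()
        self.global_proj = ConvLinearProjection(in_ch, embed_dim, (height, patch_w))
        self.local_proj = ConvLinearProjection(
            in_ch, embed_dim, (height // 2 ** n, max(patch_w // 2 ** n, 1)))
        self.merge = nn.Sequential(*[
            ConvLinearProjection(embed_dim, embed_dim, (2, 2)) for _ in range(n)])
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, x):                     # x: (B, C, 96, W)
        g = self.global_proj(x)               # (B, D, 1, W/8)
        l = self.merge(self.local_proj(x))    # (B, D, 1, W/8) after n merges
        seq = torch.cat([g, l], dim=1)        # (B, 2D, 1, N)
        seq = seq.flatten(2).transpose(1, 2)  # (B, N, 2D)
        return self.fuse(seq)                 # (B, N, D) multi-scale patch embedding
```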
Accordingly, an input image is represented by
a D-dimensional patch embedding. Parallel to the
original ViT architecture, a learnable class token
encoding of D dimensions is introduced into the
patch embeddings to enable classification tasks.
To retain positional information, standard learn-
able 1D positional embeddings are amalgamated
with each patch embedding. Thus, the formation
of block embedding vectors can be described in
detail as follows:
$z_0 = [x_{class};\; x_p^1 E;\; x_p^2 E;\; \ldots;\; x_p^N E] + E_{pos}$ (1)
where $x_{class} \in \mathbb{R}^{1 \times D}$ denotes the class
embedding, $E \in \mathbb{R}^{(H \times w \times C) \times D}$ denotes the linear
projection matrix, and $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ denotes the
positional embedding.
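As a small illustration of Eq. (1), a hypothetical module that prepends the class token and adds the positional embeddings could look like this (all sizes are placeholders):

```python
import torch
import torch.nn as nn

class ViTInput(nn.Module):
    """Prepend a learnable class token and add learnable 1D positional
    embeddings to the patch embeddings, as in Eq. (1)."""
    def __init__(self, embed_dim=256, num_patches=180):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patches):                  # patches: (B, N, D)
        cls = self.cls_token.expand(patches.size(0), -1, -1)
        return torch.cat([cls, patches], dim=1) + self.pos_embed  # z_0: (B, N+1, D)
```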
The experimental results show that setting the
final patch width to 8 and the number of repeated
projections to 3 has the best effect. The impact
of different patch widths and projections on the
experimental results can be found in Section 4.3.4.
3.3 Transformer-CTC Stage
This article combines Vision Transformer with
CTC, decoding the output of the Encoder layer
directly using the fully connected layer, and
then aligning the predicted sequence through
CTC. CTC (Connectionist Temporal Classifi-
cation) is a loss function and training method
used for sequence labeling tasks. The basic idea
behind CTC is to interpret the network out-
put as a probability distribution over all possible
label sequences (alignments) given an input
sequence. First, CTC handles duplicate labels
in the target label sequence; for most Chinese text
recognition tasks, the blank symbol is rarely needed
to separate identical consecutive characters. Second,
the blank symbol serves as a marker for feature
frames without a character label. Then, the complete
output label sequence is expanded into all possible
alignments, where each alignment is a possible
path $\pi$ composed of labels from the label set L.
The probability of one path $\pi$ is:
$p(\pi \mid Y) = \prod_{t=1}^{T} p(o^{t}_{\pi_t} \mid y^{t})$ (2)
Finally, CTC defines a many-to-one mapping
$\mathcal{F}$ that merges consecutive identical labels
(and removes blanks) to obtain the predicted label
sequence from a path. The probability of a label
sequence L is then computed by summing the
probabilities of all paths that map to it:
$P(L \mid Y) = \sum_{\pi \in \mathcal{F}^{-1}(L)} p(\pi \mid Y)$ (3)
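A minimal sketch of this decoding head in PyTorch is given below; the blank index, the class count (4,250 SCUT-EPT classes plus one blank), and all names are assumptions:

```python
import torch
import torch.nn as nn

class CTCHead(nn.Module):
    """Fully connected layer over each encoder output, trained with CTC so
    that the per-patch predictions align with the label sequence."""
    def __init__(self, embed_dim=256, num_classes=4251):   # 4250 classes + blank
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, enc_out, targets, target_lengths):
        # enc_out: (B, N, D), one vector per patch from the Transformer encoder
        logits = self.fc(enc_out)                            # (B, N, num_classes)
        log_probs = logits.log_softmax(-1).permute(1, 0, 2)  # (N, B, C) for CTCLoss
        input_lengths = torch.full((enc_out.size(0),), enc_out.size(1),
                                   dtype=torch.long)
        return self.ctc(log_probs, targets, input_lengths, target_lengths)
```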
4 Experiments
4.1 Datasets
SCUT-EPT[18] is a dataset published by the
Deep Learning and Visual Computing Laboratory
of South China University of Technology for offline
handwritten Chinese recognition research in edu-
cational literature. The dataset was collected from
2986 volunteer test papers, containing 50000 text
line images, of which 40000 were used for training
and 10000 were used for testing. There are a total
of 4250 categories, including 4033 commonly used
Chinese characters, 104 symbols, and 113 outlier
Chinese characters. The outlier Chinese character
refers to a Chinese character that is not in the
popular CASIA-HWDB1.0-1.2 character set[19].
4.2 Implementation Details
The experiment was conducted using an NVIDIA
GTX 1080ti GPU with 11GB of memory, and
the method was implemented using PyTorch. This
network was optimized using Adaptive Moment
Estimation (Adam) with an initial learning rate of
0.0001 for pre-training and 0.001 for training. The
batch size was set to 64, and the learning rate was
adjusted every 5 epochs.
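The optimization setup above corresponds roughly to the following sketch; the learning-rate decay factor is not stated in the paper and is assumed, and the model is assumed to return the CTC loss of Section 3.3 directly:

```python
import torch

def train(model, train_loader, num_epochs, pretraining=False):
    """Optimization sketch: Adam (lr 1e-4 for pre-training, 1e-3 for training),
    batch size 64 set in the DataLoader, learning rate adjusted every 5 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 if pretraining else 1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # gamma assumed
    for _ in range(num_epochs):
        for images, targets, target_lengths in train_loader:
            optimizer.zero_grad()
            loss = model(images, targets, target_lengths)  # CTC loss (Sec. 3.3)
            loss.backward()
            optimizer.step()
        scheduler.step()
```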
4.3 Comparative experiment
4.3.1 Experimental setting
The experimental setup on SCUT-EPT is consis-
tent with that of [18]. The text line images were
resized to a height of 96 pixels while maintaining
the aspect ratio, and their widths were then padded
to 1440 pixels, without cleaning up challenging
conditions such as character erasure, text line fill-
ing, and noisy backgrounds. During training, the
dataset is augmented using simple data augmenta-
tion techniques where the addition of noise, affine
transformations, blur, sharpening, and skewing
are randomly applied.
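This preprocessing can be sketched as follows; right-side zero padding is an assumption:

```python
import torch
import torch.nn.functional as F

def preprocess(image: torch.Tensor, target_h: int = 96, target_w: int = 1440) -> torch.Tensor:
    """Resize a (C, H, W) text-line image to height 96 keeping the aspect
    ratio, then pad the width to 1440 pixels."""
    c, h, w = image.shape
    new_w = min(max(1, round(w * target_h / h)), target_w)
    image = F.interpolate(image.unsqueeze(0), size=(target_h, new_w),
                          mode="bilinear", align_corners=False).squeeze(0)
    return F.pad(image, (0, target_w - new_w))   # zero-pad on the right
```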
4.3.2 Evaluation
Following previous studies[20] on HCTR, the
correct rate (CR) and accuracy rate (AR) are
adopted to evaluate the performance of methods.
They are given by:
$AR = (N_t - D_e - S_e - I_e)/N_t$ (4)
Table 1 Correct Rate (CR) and Accuracy Rate
(AR) on SCUT-EPT dataset
Model CR AR
CRNN(LSTM)[18] 78.60% 75.37%
Attention[18] 69.83% 64.78%
Cascaded Attention-CTC[18] 54.09% 48.98%
CNN + MDLSTM + CTC[5] 78.30% 73.26%
CNN + MDirLSTM + CTC[6] 78.53% 73.65%
$CR = (N_t - D_e - S_e)/N_t$ (5)
where $N_t$ is the total number of characters in
the test set, and $D_e$, $S_e$, and $I_e$ are the numbers of
deletion errors, substitution errors, and insertion
errors, respectively.
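Concretely, the two metrics can be computed from the edit-operation counts as:

```python
def cr_ar(n_total: int, deletions: int, substitutions: int, insertions: int):
    """Correct rate (CR) and accuracy rate (AR) as defined in Eqs. (4)-(5)."""
    cr = (n_total - deletions - substitutions) / n_total
    ar = (n_total - deletions - substitutions - insertions) / n_total
    return cr, ar
```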
4.3.3 Comparisons with other methods
Table 1 shows the comparison between the
method of this article and other methods. As the
baseline of the task, the CRNN method reaches
a very high accuracy, much higher than the
attention model baseline. Compared with the
CRNN model[18], the CR value of our method is
increased by 0.96%, and the AR value is increased
by 0.44%.
Fig. 5 shows the results of RMLP-ViT on the
SCUT-EPT dataset, in which gray marks incorrect
recognition and missed characters are shown in
parentheses. As shown in (a), neither CRNN nor
RMLP-ViT copes with this challenge well; (b)
shows the performance on dense text; (c) shows
that RMLP-ViT copes with the challenge of hard
erasure better than CRNN; (d) shows that, when
facing nonuniform word size and noisy background
challenges where CRNN performs poorly, RMLP-
ViT can still handle the problem; (e) shows that
RMLP-ViT is still able to cope with a variety
of mixed challenges. On the whole, RMLP-ViT
performs well on Chinese examination paper text
recognition tasks with different lengths.
4.3.4 Ablation study
An extensive series of ablation studies were con-
ducted to validate the efficacy of this method.
Table 2 presents a comparative analysis of
the method proposed in this article against
the conventional ViT linear projection, using
the SCUT-EPT dataset as a benchmark. To
Table 2 Ablation experiment.
Model CR AR
Baseline 77.79% 73.77%
+RMLP 78.23% 74.41%
RMLP-ViT 79.56% 75.81%
Table 3 Comparisons with different numbers of
linear projections
Model CR AR
+ RMLP (repeated once) 77.79% 73.77%
+ RMLP (repeated twice) 77.41% 73.33%
+ RMLP (repeated three times) 79.56% 75.81%
+ RMLP (repeated four times) 77.58% 73.69%
ensure a fair comparison, in the patch parti-
tioning method, dimensions were maintained uni-
formly, with ’Baseline’ denoting the original ViT
approach, and ’+RMLP’ representing linear pro-
jection using only the second set of projection
methods. The findings revealed that using RMLP
alone, without fusing with traditional linear pro-
jection, only improves performance by 0.44% for
CR (Correct Rate) and 0.64% for AR (Accu-
racy Rate) over the baseline projection method.
Moreover, by fusing the RMLP result and the
traditional linear projection result, the results
increased by 1.77% for CR and 2.04% for AR
when compared to the baseline method, highlight-
ing the significant advantage gained through our
method’s integration of multiscale features.
To obtain the best experimental outcomes, a
series of experiments were conducted focusing on
the number of repetitive linear projections in the
RMLP method and the effects of varying patch
sizes.
Table 3 presents the experimental results asso-
ciated with different numbers of linear projections.
The model used in this experiment only utilizes
the second set of linear projection methods. The
findings clearly indicate that, within the context of
this task, thrice-repeated linear projection yielded
the most effective results.
Fig. 5 Recognition results on SCUT-EPT: (a) soft erasure; (b) dense text; (c) hard erasure; (d) nonuniform word size and
noise background; (e) a variety of mixed challenges.
In the conventional ViT model, images are par-
titioned into rectangular patches of size 16 × 16.
For images consisting of lines of text, this method
of dividing patches can excessively disperse indi-
vidual characters, leading to poor performance.
Therefore, the method proposed in this article
opts to segment text line images into ver-
tical image strips. Table 4 illustrates the impact
of different patch sizes on experimental outcomes.
The results indicate that for text line images, the
segmentation into vertical patches proves to be
more effective than the conventional approach.
Moreover, the optimal recognition results are
achieved when the patch width is set at 8 pixels.
Table 4 Comparisons with
different patch size
Model CR AR
96x2 76.87% 72.88%
96x4 78.31% 74.57%
96x8 79.56% 75.81%
96x16 78.83% 75.05%
16x16 70.87% 66.70%
5 Conclusion
This paper proposes a novel linear projection
method and applies it to the task of recog-
nizing handwritten Chinese examination paper
text lines, effectively solving common problems in
examination paper text, such as non-uniform word
size, character erasure, noisy background, diverse
text length, and dense long texts. Through exten-
sive experimentation, we have verified the efficacy
of our proposed method and demonstrated that
repeated projection can effectively improve the
image understanding ability of ViT. In the future,
we aim to further optimize the attention mod-
ule of the model and consider the integration of
a language model into our system, with the goal
of enhancing the precision and robustness of text
line recognition.
6 Acknowledgements
This work was supported partly by National Nat-
ural Science Foundation of China (No. 62201385),
Natural Science Foundation of Tianjin (Grant No.
18JCYBJC85000).
References
[1] Zhou, X.-D., Wang, D.-H., Tian, F., Liu,
C.-L. & Nakagawa, M. Handwritten chinese/-
japanese text recognition using semi-markov
conditional random fields. IEEE transactions
on pattern analysis and machine intelligence
35, 2413–2426 (2013).
[2] Jiang, Z., Ding, X., Liu, C. & Wang, Y. A
novel short merged off-line handwritten chi-
nese character string segmentation algorithm
using hidden markov model 668–672 (2011).
[3] Liu, B., Xu, X. & Zhang, Y. Offline hand-
written chinese text recognition with con-
volutional neural networks. arXiv preprint
arXiv:2006.15619 (2020).
[4] Wang, Y., Yang, Y., Ding, W. & Li, S. A
residual-attention offline handwritten chinese
text recognition based on fully convolutional
neural networks. IEEE Access 9, 132301–
132310 (2021).
[5] Messina, R. & Louradour, J. Segmentation-
free handwritten chinese text recognition
with lstm-rnn 171–175 (2015).
[6] Sun, Z., Jin, L., Xie, Z., Feng, Z. & Zhang,
S. Convolutional multi-directional recurrent
network for offline handwritten text recogni-
tion 240–245 (2016).
[7] Shi, B., Bai, X. & Yao, C. An end-to-
end trainable neural network for image-based
sequence recognition and its application to
scene text recognition. IEEE transactions on
pattern analysis and machine intelligence 39,
2298–2304 (2016).
[8] Dosovitskiy, A. et al. An image is
worth 16x16 words: Transformers for image
recognition at scale. arXiv preprint
arXiv:2010.11929 (2020).
[9] Lyu, P. et al. Maskocr: Text recognition with
masked encoder-decoder pretraining (2023).
2206.00311.
[10] Graves, A., Fernández, S., Gomez, F.
& Schmidhuber, J. Connectionist tem-
poral classification: labelling unsegmented
sequence data with recurrent neural networks
369–376 (2006).
[11] Zhai, C., Chen, Z., Li, J. & Xu, B. Chi-
nese image text recognition with blstm-ctc: a
segmentation-free method 525–536 (2016).
[12] Vaswani, A. et al. Attention is all you need.
Advances in neural information processing
systems 30 (2017).
[13] Lee, J. et al. On recognizing texts of arbitrary
shapes with 2d self-attention 546–547 (2020).
[14] Da, C., Wang, P. & Yao, C. Levenshtein ocr
322–338 (2022).
[15] Li, M. et al. Trocr: Transformer-based opti-
cal character recognition with pre-trained
models 37, 13094–13102 (2023).
[16] Tan, Y. L., Kong, A. W.-K. & Kim, J.-J. Pure
transformer with integrated experts for scene
text recognition 481–497 (2022).
[17] Wang, P., Da, C. & Yao, C. Multi-granularity
prediction for scene text recognition 339–355
(2022).
[18] Zhu, Y. et al. Scut-ept: new dataset and
benchmark for offline chinese text recogni-
tion in examination paper. IEEE Access 7,
370–382 (2018).
[19] Liu, C.-L., Yin, F., Wang, D.-H. & Wang,
Q.-F. Casia online and offline chinese hand-
writing databases 37–41 (2011).
[20] Yin, F., Wang, Q.-F., Zhang, X.-Y. & Liu, C.-
L. Icdar 2013 chinese handwriting recognition
competition 1464–1470 (2013).