Fig 2 - available via license: Creative Commons Attribution 4.0 International
Source publication
Text lines in examination papers often exhibit inconsistent, nonuniform word sizes, erased characters, widely varying text lengths, and dense long passages. This paper proposes an improved ViT-based method that strengthens the recognition of text lines in handwritten Chinese examination papers. First, this method employs a segmentation method sui...
Contexts in source publication
Context 1
... mean that adjacent regions in an image are likely to exhibit similar features. These biases provide CNNs with a wealth of prior knowledge, enabling them to learn robust models from smaller datasets. The commonly used datasets for handwritten Chinese text are not large enough to meet the data requirements of ViT. On the other hand, as shown in Fig. 2, mainstream ViT models typically divide images into patches of size 16×16. When applied to lines of Chinese text, this patch size can lead to an overly granular breakdown of the image. Moreover, it often splits a single character into multiple patches, and upon flattening, separates different patches of the same character, disrupting ...
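The fragmentation described above can be made concrete with a small sketch. It is illustrative only and not taken from the source publication: the 64×512 line image, the 64×64 character box, and the helper name `patches_covering` are all assumptions chosen for the example. It lists which row-major (flattened) 16×16 patch indices overlap one character.

```python
def patches_covering(box, img_w, patch=16):
    """Return the row-major patch indices that overlap a character box.

    box is (top, left, height, width) in pixels. The image width is
    assumed to be a multiple of the patch size (hypothetical setup).
    """
    top, left, height, width = box
    cols = img_w // patch                      # patches per grid row
    r0, r1 = top // patch, (top + height - 1) // patch
    c0, c1 = left // patch, (left + width - 1) // patch
    return [r * cols + c
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)]


# Hypothetical 64x512 text line with one 64x64 character starting at x=64.
indices = patches_covering((0, 64, 64, 64), img_w=512)
print(len(indices), indices[0], indices[-1])
```

In this assumed setup the single character is covered by 16 patches whose positions in the flattened sequence range from 4 to 103, so pieces of one character end up far apart in the token sequence.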
Context 2
... the above demands and challenges, to enhance the applicability of the Vision Transformer for the task of handwritten Chinese examination paper text recognition, this study introduces an improved model, named RMLP-ViT. This model harnesses the superior image understanding capabilities of ViT to address the unique challenges present in the recognition of handwritten Chinese examination paper text. As shown in Fig. 2, RMLP-ViT adopts a patch partitioning strategy consistent with that of Mask-OCR [9], transforming text images into a one-dimensional patch sequence based on image height h, while maintaining the relative position and semantic relationship of the segmented patch sequence. ...
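A minimal sketch of a height-based, one-dimensional partition follows. It is an interpretation, not the RMLP-ViT or Mask-OCR implementation: the excerpt only states that patches are formed "based on image height h", so the function name `height_strips` and the strip width `strip_w` are assumptions (setting `strip_w` equal to the image height would give square h×h patches). Each patch spans the full image height, and patches are emitted left to right so that sequence order matches reading order.

```python
def height_strips(image, strip_w):
    """Split a 2-D image (list of rows) into full-height vertical strips.

    Returns a list of flattened strips ordered left to right. strip_w is
    a hypothetical parameter; the source does not state the strip width.
    """
    h, w = len(image), len(image[0])
    if w % strip_w != 0:
        raise ValueError("image width must be a multiple of strip_w")
    sequence = []
    for left in range(0, w, strip_w):
        # Flatten one h x strip_w strip in row-major order.
        strip = [image[r][c]
                 for r in range(h)
                 for c in range(left, left + strip_w)]
        sequence.append(strip)
    return sequence


# Toy 4x8 "image" whose pixel value encodes its position (row * 8 + col).
toy = [[r * 8 + c for c in range(8)] for r in range(4)]
seq = height_strips(toy, strip_w=2)
print(len(seq), seq[0])
```

Because every strip covers the full height, a character is only ever cut along the horizontal axis, and its pieces stay adjacent in the resulting sequence.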
Context 3
... the intricate structural information inherent in Chinese characters, the conventional patch division method of ViT might induce unsuitable segmentation granularity. This can lead to the excessive dispersion of encoded vectors that pertain to the same character once flattened into patches, as shown in Fig. ...
Context 4
... order to solve the above problems, this paper chose the same segmentation method as [9], as shown in Fig. 2. Given the complex strokes and varying font sizes of Chinese text, the embedding after linear projection should include as much image texture information and multiscale information as possible, which is considered to be effective. ...