Fig 4
Example detections from the submissions. Green polygons are correctly detected. Yellow ones are false detections.
Source publication
Chinese is the most widely used language in the world. Algorithms that read Chinese text in natural images facilitate applications of various kinds. Despite the large potential value, datasets and competitions in the past primarily focused on English, which bears very different characteristics than Chinese. This report introduces RCTW, a new competit...
Contexts in source publication
Context 1
... noticed that the detection performance on digital-born images is generally better than that on natural images. Figure 4 (i) shows such an example. The likely reason is the cleaner backgrounds and simpler fonts of digital-born images. ...
Context 2
... common mistake we have discovered is failing to detect long text. Figure 4 (a), (c), and (f) are examples of this kind. Long text lines are often not fully detected, i.e. missing a few characters, or detected in multiple separate pieces. ...
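To make this failure mode concrete, the following minimal sketch (assuming the common ICDAR-style one-to-one matching at polygon IoU >= 0.5, which is not necessarily RCTW's exact protocol) shows why a long line returned as two separate pieces is scored as a miss: each piece on its own overlaps less than half of the ground-truth polygon.

```python
# Hypothetical illustration: why a long text line detected as several pieces
# can count as a miss under one-to-one polygon IoU matching (assumed here to
# be the ICDAR-style criterion of IoU >= 0.5 against each ground-truth box).
from shapely.geometry import Polygon

def iou(a, b):
    """Intersection-over-union of two shapely polygons."""
    inter = a.intersection(b).area
    union = a.union(b).area
    return inter / union if union > 0 else 0.0

# One long ground-truth line, 100 units wide and 10 units tall.
gt = Polygon([(0, 0), (100, 0), (100, 10), (0, 10)])

# The detector returns the same line split into two halves.
pieces = [
    Polygon([(0, 0), (48, 0), (48, 10), (0, 10)]),
    Polygon([(52, 0), (100, 0), (100, 10), (52, 10)]),
]

# Each piece overlaps less than half of the ground truth, so neither one
# passes the 0.5 threshold and the whole line is scored as undetected.
for p in pieces:
    print(f"IoU = {iou(gt, p):.2f}")   # ~0.48 each, below 0.5
```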
Similar publications
Shuhua Liu, Huixin Xu, Qi Li, [...], Kun Hou
With the aim to solve issues of robot object recognition in complex scenes, this paper proposes an object recognition method based on scene text reading. The proposed method simulates human-like behavior and accurately identifies objects with texts through careful reading. First, deep learning models with high accuracy are adopted to detect and rec...
Citations
... Section 3.1 provides a comprehensive overview of Mini-InternVL. Then, Section 3.2 details InternViT-300M, a lightweight vision model developed through knowledge distillation, which inherits the strengths of a powerful vision encoder. Finally, Section 3.3 describes a transfer learning framework designed to enhance the model's adaptation to downstream tasks. [The passage is interleaved with a table of training datasets grouped by category; the recoverable groups are: (unlabeled) Laion [63], COYO [64], GRIT [39], COCO [65], LVIS [66], Objects365 [67], Flickr30K [68], VG [69], All-Seeing [61,62], MMInstruct [70], LRV-Instruction [71]; OCR: TextCaps [72], Wukong-OCR [73], CTW [74], MMC-Inst [75], LSVT [76], ST-VQA [77], RCTW-17 [78], ReCTs [79], ArT [80], SynthDoG [81], LaionCOCO-OCR [82], COCO-Text [83], DocVQA [84], TextOCR [85], LLaVAR [86], TQA [87], SynthText [88], DocReason25K [89], Common Crawl PDF; Chart: AI2D [90], PlotQA [91], InfoVQA [92], ChartQA [30], MapQA [93], FigureQA [94], IconQA [95], MMC-Instruction [96]; Multidisciplinary: CLEVR-Math/Super(en) [97,98], GeoQA+ [99], UniChart [100], ScienceQA [101], Inter-GPS [102], UniGeo [103], PMC-VQA [104], TabMWP [105], MetaMathQA [106]; Other: Stanford40 [107], GQA [108], MovieNet [109], KonIQ-10K [110], ART500K [111], ViQuAE [112].] ...
Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical images, and remote sensing. We believe that our study can provide valuable insights and resources to advance the development of efficient and effective MLLMs. Code is available at https://github.com/OpenGVLab/InternVL.
... Due to the prevalent availability of English datasets, we selected English as the source language. We utilized ten real datasets: SVT [59], IIIT [40], IC13 [26], IC15 [25], RCTW [52], Uber [76], ArT [9], LSVT [55], MLT19 [42], and ReCTS [73]. Furthermore, we incorporated two extensive real datasets derived from Open Images [30]: TextOCR [53] and annotations from the OpenVINO toolkit [32]. ...
Scene text recognition in low-resource languages frequently faces challenges due to the limited availability of training datasets derived from real-world scenes. This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. Our approach utilizes a diffusion model that is conditioned on binary states: ``synthetic'' and ``real.'' The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images, based on the binary states. This approach not only effectively differentiates between the two domains but also facilitates the model's explicit recognition of characters in the target language. Furthermore, to enhance the accuracy and variety of generated text images, we introduce two guidance techniques: Fidelity-Diversity Balancing Guidance and Fidelity Enhancement Guidance. Our experimental results demonstrate that the text images generated by our proposed framework can significantly improve the performance of scene text recognition models for low-resource languages.
... Therefore, following previous studies [2,3], we trained our models on both synthetic and real datasets to validate the performance. Specifically, we used MJSynth [24] and SynthText [19] as synthetic datasets and COCO-Text [52], RCTW [49], Uber-Text [67], ArT [7], LSVT [50], MLT19 [39], and ReCTS [66] as real datasets. Each model was evaluated on six standard scene text datasets: ICDAR 2013 (IC13) [27], Street View Text (SVT) [54], IIIT5K-Words (IIIT5K) [37], ICDAR 2015 (IC15) [26], Street View Text-Perspective (SVTP) [40], and CUTE80 (CUTE) [45]. ...
... This dataset includes four subsets (scene, web, document, and handwriting), with a total of 1.4 million fully labeled images. The scene subset is derived from scene text datasets such as RCTW [49], ReCTS [66], LSVT [50], ArT [7], and CTW [63]. It consists of 509,164, 63,645, and 63,646 samples for training, validation, and testing, respectively. ...
Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image, and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generative language model that has been successful in natural language processing can also be effective for text recognition in computer vision. Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
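The decoder-only idea can be illustrated with a small, hedged sketch (not the authors' DTrOCR code): image patches are linearly embedded and prepended as a prefix to the text embeddings of a pre-trained GPT-2, which then predicts the characters autoregressively. The patch size, image shape, and prefix handling below are assumptions for illustration.

```python
# Hedged sketch of a decoder-only recognizer: visual patch embeddings are fed
# to a pre-trained GPT-2 decoder as a prefix, followed by text token embeddings.
# This is an illustrative approximation, not the authors' DTrOCR implementation.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

class DecoderOnlyOCR(nn.Module):
    def __init__(self, patch=16):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        d = self.gpt2.config.n_embd
        # Linear patch embedding: each patch x patch x 3 region becomes one "visual token".
        self.patch_embed = nn.Conv2d(3, d, kernel_size=patch, stride=patch)

    def forward(self, images, input_ids):
        # (B, 3, H, W) -> (B, n_patches, d)
        vis = self.patch_embed(images).flatten(2).transpose(1, 2)
        txt = self.gpt2.transformer.wte(input_ids)   # text token embeddings
        embeds = torch.cat([vis, txt], dim=1)         # visual prefix + text
        return self.gpt2(inputs_embeds=embeds).logits

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = DecoderOnlyOCR()
imgs = torch.randn(1, 3, 32, 128)                    # a 32x128 text-line crop
ids = torch.tensor([tok.encode("hello")])
print(model(imgs, ids).shape)                         # (1, n_patches + n_text_tokens, vocab_size)
```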
... In the experiments, ICDAR2019 ArT [31], LSVT [32], and ReCTS [33] are used to generate the text dataset and its inverse, which are used for training. Validation is performed on the training set of RCTW-17 [34], and the test set of RCTW-17 serves as the test set. RCTW-17 is a common dataset used in scene text detection and recognition tasks. ...
... Given that the RCTW-17 dataset covers Traditional and Simplified Chinese characters as well as uppercase and lowercase English letters, we simulated this characteristic in our generated dataset. In this section, the Average Edit Distance (AED) and Normalized Edit Distance (NED) [34] are used to evaluate the recognition results; a lower AED indicates better performance. ...
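As a concrete reference for these metrics, the sketch below computes the Levenshtein edit distance, averages it per image (AED), and also reports a normalized variant (NED). The normalization convention of dividing by the longer of the ground-truth and predicted strings is a common choice assumed here; it may differ from the exact definition in [34].

```python
# Minimal sketch of the recognition metrics mentioned above. The Levenshtein
# distance is standard; the normalization rule (dividing by the longer of the
# two strings) is an assumption, not necessarily RCTW-17's exact definition.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete ca
                        dp[j - 1] + 1,      # insert cb
                        prev + (ca != cb))  # substitute
            prev = cur
    return dp[-1]

def aed(gts, preds):
    """Average Edit Distance over a set of images (lower is better)."""
    return sum(edit_distance(g, p) for g, p in zip(gts, preds)) / len(gts)

def ned(gts, preds):
    """Mean normalized edit distance; each distance divided by the longer string."""
    return sum(edit_distance(g, p) / max(len(g), len(p), 1)
               for g, p in zip(gts, preds)) / len(gts)

gts   = ["招商银行", "Hello"]
preds = ["招商很行", "Helo"]
print(aed(gts, preds), ned(gts, preds))  # 1.0 and (0.25 + 0.2) / 2 = 0.225
```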
To address the problem of robot object recognition in complex scenes, this paper proposes an object recognition method based on scene text reading. The proposed method simulates human-like behavior and accurately identifies objects bearing text by reading it carefully. First, deep learning models with high accuracy are adopted to detect and recognize text from multiple views. Second, datasets comprising 102,000 Chinese and English scene text images and their inverses are generated. Training on these two datasets improves the F-measure of text detection by 0.4% and the recognition accuracy by 1.26%. Finally, a robot object recognition method based on scene text reading is proposed. The robot detects and recognizes texts in the image and stores the recognition results in a text file. When the user gives the robot a fetching instruction, the robot searches the text files for the corresponding keywords and obtains a confidence score for each object in the scene image. The object with the maximum confidence is then selected as the target. The results show that the robot can accurately distinguish objects of arbitrary shape and category, and that it can effectively solve the problem of object recognition in home environments.
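The final retrieval step described above can be sketched as simple keyword matching over stored OCR results. The function name, data layout, and confidence-aggregation rule below are hypothetical illustrations rather than the authors' implementation.

```python
# Illustrative sketch of the retrieval step: the robot has already stored
# per-object OCR results, and a fetch instruction is matched against them.
# Names, structures, and the scoring rule are hypothetical.

def find_target(instruction: str, ocr_results: dict):
    """Return the object whose recognized text best matches the instruction.

    ocr_results maps an object id to a list of (recognized_text, confidence)
    pairs produced by the text detector/recognizer from multiple views.
    """
    keywords = instruction.lower().split()
    best_obj, best_score = None, 0.0
    for obj_id, readings in ocr_results.items():
        # Sum the confidences of readings that contain any instruction keyword
        # (a simple aggregation rule, assumed here for illustration).
        score = sum(conf for text, conf in readings
                    if any(kw in text.lower() for kw in keywords))
        if score > best_score:
            best_obj, best_score = obj_id, score
    return best_obj

ocr_results = {
    "bottle_1": [("Nescafe coffee", 0.92), ("net 500g", 0.60)],
    "box_2":    [("green tea", 0.88)],
}
print(find_target("fetch the coffee", ocr_results))  # bottle_1
```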
... In this section, we evaluate our approach on MSRA-TD500 [34], ICDAR2017-RCTW [35], ICDAR2015 [36], and COCO-Text [37] to show its effectiveness. ...
... We perform an ablation study on MSRA-TD500 with different settings to analyze the contribution of corner refinement and scoring. To better illustrate the ability to detect long texts, we use 4k well-annotated samples from [35] for pretraining, as adopted in [11]. The results are shown in Table 1. ...
Recent scene text detection works mainly focus on curved text detection. However, in real applications, curved text is rarer than multi-oriented text. Accurate detection of multi-oriented text with large variations in scale, orientation, and aspect ratio is of great significance. Among multi-oriented detection methods, direct regression of the geometry of scene text offers a simple yet powerful pipeline and has become popular in academic and industrial communities, but it may produce imperfect detections, especially for long texts, due to the limitation of the receptive field. In this work, we aim to improve this while keeping the pipeline simple. A fully convolutional corner refinement network (FC2RN) is proposed for accurate multi-oriented text detection, in which an initial corner prediction and a refined corner prediction are obtained in one pass. With a novel quadrilateral RoI convolution operation tailored to multi-oriented scene text, the initial quadrilateral prediction is encoded into the feature maps, which can be further used to predict the offset between the initial prediction and the ground truth as well as to output a refined confidence score. Experimental results on four public datasets, including MSRA-TD500, ICDAR2017-RCTW, ICDAR2015, and COCO-Text, demonstrate that FC2RN outperforms state-of-the-art methods. The ablation study shows the effectiveness of corner refinement and scoring for accurate text localization.
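The refinement step can be pictured with a small numpy sketch: an initial quadrilateral is corrected by per-corner offsets and its confidence is rescored. The offset values and the rescoring rule here are placeholders; in FC2RN both come from the network's refined prediction rather than hand-written formulas.

```python
# Minimal numpy sketch of the corner-refinement idea: an initial quadrilateral
# prediction is corrected by predicted per-corner offsets, and a refined score
# replaces the initial one. The numbers and the rescoring rule are placeholders.
import numpy as np

# Initial quadrilateral: 4 corners (x, y), clockwise from top-left.
initial_quad = np.array([[10.0, 12.0],
                         [118.0, 10.0],
                         [119.0, 40.0],
                         [11.0, 42.0]])

# Predicted offsets between the initial corners and the ground truth.
corner_offsets = np.array([[-1.5, 0.5],
                           [2.0, 1.0],
                           [1.0, -0.5],
                           [-0.5, -1.0]])

refined_quad = initial_quad + corner_offsets

# Assumed rescoring rule for illustration only: down-weight the initial score
# when the predicted offsets are large.
initial_score = 0.83
refined_score = initial_score * float(np.exp(-np.abs(corner_offsets).mean() / 10.0))

print(refined_quad)
print(round(refined_score, 3))   # 0.83 -> roughly 0.75
```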
... Automated detection and recognition of various texts in scenes has attracted increasing interest, as witnessed by a growing number of benchmarking competitions [24,51]. Various detection techniques have been proposed, from earlier ones using hand-crafted features [42,36] to recent ones using DNNs [76,22,68,59,71,64]. ...
Recent adversarial learning research has achieved very impressive progress in modelling cross-domain data shifts in appearance space, but its counterpart in modelling cross-domain shifts in geometry space lags far behind. This paper presents an innovative Geometry-Aware Domain Adaptation Network (GA-DAN) that is capable of modelling cross-domain shifts concurrently in both geometry space and appearance space and of realistically converting images across domains with very different characteristics. In the proposed GA-DAN, a novel multi-modal spatial learning technique is designed that converts a source-domain image into multiple images of different spatial views, as in the target domain. A new disentangled cycle-consistency loss is introduced that balances the cycle consistency in the appearance and geometry spaces and greatly improves the learning of the whole network. The proposed GA-DAN has been evaluated on the classic scene text detection and recognition tasks, and experiments show that the domain-adapted images achieve superior scene text detection and recognition performance when applied to network training.
... Despite promising initial results, things did not work out well on more complicated detection [the sentence is interrupted by an inline table of text detection datasets with columns Dataset, Year, Description, and #Cites; the recoverable row reads: ICDAR [71], 2003, "ICDAR2003 is one of the first public datasets for text detection. ICDAR 2015 and 2017 are other popular iterations of the ICDAR challenge [72,73]. url: http://rrc.cvc.uab.es/", 530 cites; the next row (STV) is cut off]. ...
Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Its development over the past two decades can be regarded as an epitome of computer vision history. If we think of today's object detection as a technical aesthetic under the power of deep learning, then turning back the clock 20 years we would witness the wisdom of the cold-weapon era. This paper extensively reviews 400+ papers on object detection in light of its technical evolution, spanning over a quarter-century (from the 1990s to 2019). A number of topics are covered, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of the detection system, speed-up techniques, and recent state-of-the-art detection methods. The paper also reviews important detection applications, such as pedestrian detection, face detection, and text detection, and provides an in-depth analysis of their challenges as well as the technical improvements made in recent years.
... Most existing text detection methods [14,15,24,42,8] achieve good performance in a controlled environment where text instances have regular shapes and aspect ratios, e.g., the cases in ICDAR 2015 [12]. Nevertheless, due to the limited receptive field size of CNNs and the text representation forms, these methods fail to detect more complex scene text, especially extremely long text and arbitrarily shaped text in datasets such as ICDAR2017-RCTW [31], SCUT-CTW1500 [39], Total-Text [2], and ICDAR2017-MLT [26]. When detecting extremely long text, previous text detection methods such as EAST [42] and Deep Regression [8] fail to provide a complete bounding box proposal, as the blue box shown in Fig. 1 (a), since the size of the whole text instance is far beyond the receptive field size of the text detectors. ...
... Firstly, in the process of center line sampling, we sample n points at equidistant intervals from left to right on the predicted text center line map. Following the label definition of SCUT-CTW1500 [39], we set n to 7 in the curved text detection experiments (Section 4.5) and to 2 when dealing with benchmarks [12,26,31] labeled with quadrangle annotations, considering the dataset complexity. Afterwards, we determine the corresponding border points from the sampled center line points, using the information provided by the 4 border offset maps at the same locations. ...
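A hedged sketch of this sampling step follows: n points are taken at equidistant intervals along the predicted center line and pushed to the top and bottom text borders using four border-offset maps, yielding a closed polygon. The array names and the assumed layout of the offset maps (top dx, top dy, bottom dx, bottom dy) are illustrative, not LOMO's exact implementation.

```python
# Hedged sketch of center-line sampling with border offsets. The offset-map
# layout (top dx, top dy, bottom dx, bottom dy) and names are assumptions.
import numpy as np

def reconstruct_polygon(center_line_pts, border_offsets, n=7):
    """center_line_pts: (m, 2) ordered (x, y) points on the predicted center line.
    border_offsets: (4, H, W) per-pixel offset maps. Returns a (2n, 2) polygon."""
    # Sample n points at equidistant intervals along the center line.
    dists = np.cumsum(np.r_[0, np.linalg.norm(np.diff(center_line_pts, axis=0), axis=1)])
    targets = np.linspace(0, dists[-1], n)
    xs = np.interp(targets, dists, center_line_pts[:, 0])
    ys = np.interp(targets, dists, center_line_pts[:, 1])

    top, bottom = [], []
    for x, y in zip(xs, ys):
        ix, iy = int(round(x)), int(round(y))
        tx, ty, bx, by = border_offsets[:, iy, ix]
        top.append((x + tx, y + ty))
        bottom.append((x + bx, y + by))

    # Top border left-to-right, then bottom border right-to-left -> closed polygon.
    return np.array(top + bottom[::-1])

# Toy example: a horizontal center line in a 64x256 map whose offsets push
# points 10 px up to the top border and 10 px down to the bottom border.
offsets = np.zeros((4, 64, 256))
offsets[1], offsets[3] = -10, 10
center = np.array([[20, 32], [120, 32], [220, 32]], dtype=float)
print(reconstruct_polygon(center, offsets, n=7))
```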
... In this way, DR can generate high-recall proposals that cover most text instances in real data. At the fine-tuning step, we fine-tune all three branches on real datasets, including ICDAR2015 [12], ICDAR2017-RCTW [31], SCUT-CTW1500 [39], Total-Text [2], and ICDAR2017-MLT [26], for about another 10 epochs. Both the IRM and SEM branches use the same proposals, which are generated by the DR branch. ...
Previous scene text detection methods have progressed substantially over the past years. However, limited by the receptive field of CNNs and by the simple representations, such as rectangular bounding boxes or quadrangles, adopted to describe text, previous methods may fall short when dealing with more challenging text instances, such as extremely long text and arbitrarily shaped text. To address these two problems, we present a novel text detector, namely LOMO, which localizes text progressively, multiple times (in other words, it LOoks More than Once). LOMO consists of a direct regressor (DR), an iterative refinement module (IRM), and a shape expression module (SEM). At first, text proposals in the form of quadrangles are generated by the DR branch. Next, the IRM progressively perceives the entire long text through iterative refinement based on the extracted feature blocks of the preliminary proposals. Finally, the SEM is introduced to reconstruct a more precise representation of irregular text by considering the geometric properties of the text instance, including its text region, text center line, and border offsets. State-of-the-art results on several public benchmarks, including ICDAR2017-RCTW, SCUT-CTW1500, Total-Text, ICDAR2015, and ICDAR17-MLT, confirm the striking robustness and effectiveness of LOMO.