Conference Paper

Detecting Oriented Text in Natural Images by Linking Segments

Authors: Baoguang Shi, Xiang Bai, Serge Belongie

... On this basis, the EAST (Efficient and Accurate Scene Text Detector) algorithm proposed by Zhou [3] draws on the concept of U-Net [4], employing a fully convolutional network (FCN) and non-maximum suppression (NMS) to streamline intermediate processing steps and enable efficient detection at the single-character level. SegLink [5] is an improved version of CTPN: it not only adds support for tilted text box prediction but also refines text box localization by linking candidate boxes through the network. PixelLink [6] trains a convolutional neural network to classify text and non-text pixels and then links neighboring text pixels together for instance segmentation. ...
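The segment-linking idea mentioned in this excerpt can be illustrated with a short, hypothetical Python sketch (not the authors' implementation): detected segments are merged into text-line groups with a union-find over the links whose scores pass a threshold.

```python
# Hypothetical sketch of SegLink-style grouping: merge segments whose
# predicted links pass a threshold into text-line groups (union-find).

def group_segments(num_segments, links):
    """links: iterable of (i, j) index pairs with positive link predictions."""
    parent = list(range(num_segments))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for s in range(num_segments):
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())  # each group corresponds to one text line

# Example: segments 0-1-2 are linked, segment 3 stands alone.
print(group_segments(4, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3]]
```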
... To further validate the performance of the handwriting detection model proposed in this paper, experimental comparisons with other mainstream text detection algorithms, CTPN [5], PixelLink [6], TextSnake [7], Boundary [8], R50_DBU [9] and TextFuse [10], were conducted on the ICDAR2015 dataset. The ICDAR [30] datasets are used for the ICDAR challenges; the ICDAR2015 dataset contains 1500 images, of which 1000 are used for the training set and the remaining 500 for the test set. ...
Article
Full-text available
Featured Application This study explores the intelligent application of Shanghai’s writing proficiency assessment, targeting efficient recognition of handwritten Chinese characters in promotion exams for primary and secondary students. Abstract The dense text detection and segmentation of Chinese characters has always been a research hotspot due to complex backgrounds and diverse scenarios. In the field of education, the detection of handwritten Chinese characters is affected by background noise, texture interference, etc. Especially in low-quality handwritten text, character overlap or occlusion blurs the character boundaries, which increases the difficulty of detection and segmentation. In this paper, an improved EAST network, CEE (Components-ECA-EAST Network), which fuses an attention mechanism with the feature pyramid structure, is proposed based on an analysis of the structure of Chinese character mini-components. The ECA (Efficient Channel Attention) attention mechanism is incorporated during the feature extraction phase; in the feature fusion stage, convolutional features are extracted from the self-constructed mini-component dataset and then fused with the feature pyramid in a cascade manner; finally, Dice Loss is used as the regression task loss function. These improvements comprehensively improve the performance of the network in detecting and segmenting the mini-components and subtle strokes of handwritten Chinese characters. The CEE model was tested on the self-constructed dataset with an accuracy of 84.6% and a mini-component mAP of 77.6%, an improvement of 7.4% and 8.4%, respectively, over the original model. The constructed dataset and improved model are well suited to applications such as writing grade examinations, and represent an important exploration of the development of educational intelligence.
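Since the abstract names Dice Loss as the regression loss, here is a minimal PyTorch-style sketch of a standard soft Dice loss; the exact formulation used in the CEE network is an assumption.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Standard soft Dice loss; pred and target are probability maps in [0, 1]."""
    pred = pred.flatten(1)        # (N, H*W)
    target = target.flatten(1)
    inter = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    return 1.0 - (2.0 * inter + eps) / (union + eps)  # per-sample loss in [0, 1]

# toy usage
pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(dice_loss(pred, target).mean())
```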
... The detection task, also called localization, takes an image as input and outputs the locations of text within it. Along with advances in deep learning and general object detection, increasingly accurate and efficient scene text detection algorithms have been proposed, e.g., CTPN (Tian et al. 2016), TextBoxes (Liao et al. 2017), SegLink (Shi, Bai, and Belongie 2017) and EAST (Zhou et al. 2017). Most of these state-of-the-art methods are built on Fully Convolutional Networks (Long, Shelhamer, and Darrell 2015), and perform at least two kinds of predictions: 1. Text/non-text classification. ...
... RRPN (Ma et al. 2017) adds rotation to both anchors and RoIPooling in Faster R-CNN, to deal with the orientation of scene text. SegLink (Shi, Bai, and Belongie 2017) adopts SSD to predict text segments, which are linked into complete instances using link predictions. EAST (Zhou et al. 2017) performs very dense predictions that are processed using locality-aware NMS. ...
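EAST's locality-aware NMS is a streaming variant of standard non-maximum suppression; the sketch below shows only the plain IoU-based NMS it builds on, with axis-aligned boxes for simplicity (a deliberate simplification, not the EAST code).

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop heavily overlapping ones, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```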
Preprint
Most state-of-the-art scene text detection algorithms are deep learning based methods that depend on bounding box regression and perform at least two kinds of predictions: text/non-text classification and location regression. Regression plays a key role in the acquisition of bounding boxes in these methods, but it is not indispensable because text/non-text prediction can also be considered as a kind of semantic segmentation that contains full location information in itself. However, text instances in scene images often lie very close to each other, making them very difficult to separate via semantic segmentation. Therefore, instance segmentation is needed to address this problem. In this paper, PixelLink, a novel scene text detection algorithm based on instance segmentation, is proposed. Text instances are first segmented out by linking pixels within the same instance together. Text bounding boxes are then extracted directly from the segmentation result without location regression. Experiments show that, compared with regression-based methods, PixelLink can achieve better or comparable performance on several benchmarks, while requiring many fewer training iterations and less training data.
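A rough sketch of the "boxes directly from segmentation" idea described above: group positive pixels into instances (here with plain connected components rather than PixelLink's learned per-direction link maps, an assumption made for brevity) and fit a minimum-area rectangle per instance.

```python
import numpy as np
import cv2
from scipy import ndimage

def boxes_from_text_mask(text_prob, thresh=0.5, min_area=10):
    """Extract oriented boxes directly from a text/non-text probability map."""
    mask = (text_prob > thresh).astype(np.uint8)
    labels, num = ndimage.label(mask)  # 4-connectivity; PixelLink learns 8 link maps instead
    boxes = []
    for k in range(1, num + 1):
        ys, xs = np.nonzero(labels == k)
        if len(xs) < min_area:
            continue
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(pts)        # ((cx, cy), (w, h), angle)
        boxes.append(cv2.boxPoints(rect))  # 4 corner points, no location regression needed
    return boxes
```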
... Furthermore, specific techniques, as described in references [23,24], utilize object detection methods to first identify text segments; instead of directly identifying whole words or text lines, the segments are then grouped into words or lines using basic text-line clustering or connectivity-based approaches. While these methods can automatically detect curved text, they introduce increased complexity into the overall process. ...
... To address text detection as an instance segmentation task, [23,24] applied object identification algorithms to identify text segments, which were subsequently categorized into words or lines. In this paper, a state-of-the-art instance segmentation method called Mask Scoring R-ConvNN was used to enhance text detection performance. ...
Article
In recent research, the importance of text detection and recognition in natural scene images has been increasingly emphasized. Natural scene text contains an enormous amount of useful semantic data that can be applied in a variety of vision-related applications. The detection of shape-robust text confronts two major challenges: 1. many traditional quadrangular bounding box-based detectors fail to identify text with irregular forms, making it difficult to enclose such text within perfect rectangles; 2. pixel-wise segmentation-based detectors sometimes struggle to separate closely positioned text instances from one another. Understanding the surroundings and extracting information from natural scene images depends heavily on the ability to detect and recognise text. Scene text can be aligned in a variety of ways, including vertical, curved, random, and horizontal alignments. This paper presents a novel method, the Transformation Scaling Extension Algorithm (TSEA), for text detection using a mask-scoring R-ConvNN (Region Convolutional Neural Network). This method works exceptionally well at accurately identifying curved text and text with multiple orientations inside real-world input images. This study incorporates a mask-scoring R-ConvNN network framework to enhance the model's ability to score masks correctly for the observed instances. By giving more weight to accurate mask predictions, our scoring system eliminates inconsistencies between mask quality and score and enhances the effectiveness of instance segmentation. This paper also incorporates a Pyramid-based Text Proposal Network (PBTPN) and a Transformation Component Network (TCN) to enhance the feature extraction capabilities of the mask-scoring R-ConvNN for text identification and segmentation with the TSEA. Studies show that pyramid networks are especially effective in reducing false alarms caused by images with backgrounds that mimic text. On the benchmark datasets ICDAR 2015 and SCUT-CTW1500, which contain multi-oriented and curved text, this method outperforms existing methods by conducting extensive testing across several scales and utilizing a single model. This study expands the field of vision-oriented applications by highlighting the growing significance of effectively locating and detecting text in natural scenes.
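The mask-scoring idea referenced here can be summarized in a few lines: the network learns to predict the IoU between its own mask and the ground truth, and the detection score is re-weighted by that predicted mask quality. A minimal, illustrative sketch (function names are hypothetical):

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between a binary predicted mask and its ground truth; a mask-scoring
    head is trained to regress this value for use at inference time."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / max(union, 1)

def rescore(cls_score, predicted_mask_iou):
    # final instance score = classification confidence * predicted mask quality
    return cls_score * predicted_mask_iou

print(rescore(0.9, 0.75))  # 0.675
```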
... With the rapid development of object detection [8], semantic segmentation, and instance segmentation, text detection has achieved tremendous progress recently. Existing scene text detection methods based on deep learning can be grouped broadly into three categories: regression-based [9], [10], connected-component-based [11], [12], and segmentation-based [13], [14]. Compared to other methods, segmentation-based methods offer flexible pixel-level predictions, which are effective for handling text with arbitrary shapes. ...
... Then, it predicted the affinity of characters to judge whether they belong to the same instance. SegLink [12] predicted segments and links of text instances and connected the segments according to the predicted links. DRRG [27] utilized graph convolutional networks (GCN) to group the text parts of instances. ...
Preprint
The irregular contour representation is one of the tough challenges in scene text detection. Although segmentation-based methods have achieved significant progress with the help of flexible pixel prediction, the overlap of spatially close texts hinders detecting them separately. To alleviate this problem, some shrink-based methods predict text kernels and expand them to reconstruct texts. However, the text kernel is an artificial object with incomplete semantic features that is prone to incorrect or missed detection. In addition, unlike general objects, the geometry features (aspect ratio, scale, and shape) of scene texts vary significantly, which makes them difficult to detect accurately. To address these problems, we propose an effective spotlight text detector (STD), which consists of a spotlight calibration module (SCM) and a multivariate information extraction module (MIEM). The former concentrates on the candidate kernel, like a camera focusing on its target. It obtains candidate features through a mapping filter and calibrates them precisely to eliminate some false positive samples. The latter designs different shape schemes to explore multiple geometric features of scene texts. It helps extract various spatial relationships to improve the model's ability to recognize kernel regions. Ablation studies prove the effectiveness of the designed SCM and MIEM. Extensive experiments verify that our STD is superior to existing state-of-the-art methods on various datasets, including ICDAR2015, CTW1500, MSRA-TD500, and Total-Text.
... Subsequent advancements, such as TextBoxes [10] and its extension TextBoxes++ [11,12], adapted the convolutional neural network (CNN) architecture to better detect long text lines by changing the convolution kernel size. SegLink [13] improved upon this by predicting text candidate boxes and merging them through an area link algorithm, allowing for the detection of text lines at various angles. ...
Article
Full-text available
In the traditional education assessment landscape, the manual grading of subjective exam questions poses significant challenges. The labor-intensive nature of this process and the potential for human error can negatively impact teaching and learning outcomes. As society transitions towards a low-carbon future, there is a pressing need to reform educational evaluation methods, reduce paper-based exams, and leverage advanced intelligent technologies. Inspired by the principles of biomechanics, this research introduces a novel image-based handwritten text recognition algorithm powered by recurrent neural networks, specifically designed for the automated scoring of primary school mathematics subjective questions. Drawing insights from the human visual and cognitive systems, the proposed approach mimics the hierarchical and adaptive nature of biological information processing to tackle the complexities inherent in handwritten text detection, recognition, and understanding. The study first constructs a comprehensive dataset of real primary school math exam answer sheets, capturing the diverse range of handwriting styles and mathematical notations. This dataset serves as a robust training and evaluation platform, akin to the diverse sensory inputs that biological systems process. The recurrent neural network architecture employed in this work exhibits biomimetic properties, such as the ability to dynamically process sequential information and adaptively refine its internal representations, much like the human brain's neural networks. This allows the algorithm to effectively handle the contextual cues and structural patterns present in handwritten mathematical responses, enabling accurate recognition and interpretation. Rigorous comparative and ablation experiments were conducted to assess the performance of the proposed algorithm. The results demonstrate high accuracy in recognizing and interpreting handwritten subjective responses, showcasing the practical value of this biomechanics-inspired approach. These findings align with the study's overarching goal of developing resource-saving and environmentally-friendly education evaluation systems, paving the way for the widespread adoption of intelligent technologies in the assessment of subjective questions. By drawing inspiration from the elegant and efficient information processing mechanisms observed in biological systems, this research contributes to the advancement of intelligent handwritten text recognition, ultimately supporting the transition towards a more sustainable and equitable educational landscape.
... Yin et al. (2015) adopted clustering methods to generate readable text regions for slanted fonts, involving three types of clustering, which may reduce robustness. Shi et al. (2017) used a segmentation approach to detect local parts of text individually and then linked the detections into words using a proposed directional adjacent-link method, which required substantial computational resources. F. Gao et al. (2020) further applied similar techniques to metal OCR tasks, generating character bounding points and fitting a connected curve to rectify rotated characters. ...
Article
Full-text available
The identification of tire text codes (TTC) during the production and operational phases of tires can significantly improve safety and maintenance practices. Current methods for TTC identification face challenges related to stability, computational efficiency, and outdoor applicability. This paper introduces an automated TTC identification system founded on a robust framework that is both user‐friendly and easy to implement, thereby enhancing the practical use and industrial applicability of TTC identification technologies. Initially, instance segmentation is creatively utilized for detecting TTC regions on the tire sidewall through You Only Look Once (YOLO)‐v8‐based models, which are trained on a dataset comprising 430 real‐world tire images. Subsequently, a computationally efficient rotation algorithm, along with specific image pre‐processing techniques, is developed to tackle common issues associated with centripetal rotation in the TTC region and to improve the accuracy of TTC region detection. Furthermore, a series of YOLO‐v8 object detection models were assessed using an independently collected dataset of 1127 images to optimize the recognition of TTC characters. Ultimately, a portable Internet of Things (IoT) vision device is created, featuring a comprehensive workflow to support the proposed TTC identification framework. The TTC region detection model achieves a segmentation precision of 0.8812, while the TTC recognition model reaches a precision of 0.9710, based on the datasets presented in this paper. Field tests demonstrate the system's advancements, reliability, and potential industrial significance for practical applications. The IoT device is shown to be portable, cost‐effective, and capable of processing each tire in 200 ms.
... The output segmentation mask is then utilized to extract the text regions. PixelLink [9], SegLink [50], and TextSnake [33] are examples of such segmentation-based methods. The main advantage of segmentation-based methods is that they can even detect arbitrarily shaped text very well, but they require a complex post-processing step that takes a considerable amount of inference time. ...
Preprint
Full-text available
In recent years, the field of Handwritten Text Recognition (HTR) has seen the emergence of various new models, each claiming to perform competitively better than the other in specific scenarios. However, making a fair comparison of these models is challenging due to inconsistent choices and diversity in test sets. Furthermore, recent advancements in HTR often fail to account for the diverse languages, especially Indic languages, likely due to the scarcity of relevant labeled datasets. Moreover, much of the previous work has focused primarily on character-level or word-level recognition, overlooking the crucial stage of Handwritten Text Detection (HTD) necessary for building a page-level end-to-end handwritten OCR pipeline. Through our paper, we address these gaps by making three pivotal contributions. Firstly, we present an end-to-end framework for Page-Level hAndwriTTen TExt Recognition (PLATTER) by treating it as a two-stage problem involving word-level HTD followed by HTR. This approach enables us to identify, assess, and address challenges in each stage independently. Secondly, we demonstrate the usage of PLATTER to measure the performance of our language-agnostic HTD model and present a consistent comparison of six trained HTR models on ten diverse Indic languages thereby encouraging consistent comparisons. Finally, we also release a Corpus of Handwritten Indic Scripts (CHIPS), a meticulously curated, page-level Indic handwritten OCR dataset labeled for both detection and recognition purposes. Additionally, we release our code and trained models, to encourage further contributions in this direction.
... Liao et al. [17] developed the TextBoxes method to locate text instances of varying lengths by adapting the convolution kernels. Tian et al. [18] proposed the CTPN method, which uses edge detection and corrects the slant angle of text lines. Shi et al. [19] presented the SegLink method to identify local portions of text at different scales and combine them according to established rules to form the final text box detection results. This enhances the detection ability for curved text and long text. ...
Article
Full-text available
Image feature information cannot be fully exploited by most existing scene text detection methods, resulting in false detection of multi-scale text and missed detection of crooked text. This paper focuses on the missed detection of small text, false detection of large text, and low text boundary localization accuracy caused by multi-scale scene text variations. To overcome these limitations, this paper proposes a text detection method based on multi-threshold and multi-scale feature fusion (MTMFNet). The lightweight network ResNet18 serves as the primary backbone network. Deformable convolution kernels are utilized in the feature pyramid to expand the receptive field. A fused attention mechanism module is introduced to utilize various feature information efficiently and improve detail feature extraction. A multi-threshold boundary detection module is deployed to generate more accurate text boundaries. This multi-branch module senses the cooperative relationship of text boxes using differentiable binarization submodules with associated thresholds. Experimental results indicate that MTMFNet obtains the best comprehensive performance compared with most state-of-the-art text detection methods in terms of four evaluation metrics on three datasets involving multi-directional, curved, and multi-scale text. The related code of our method is available at https://github.com/Jinfu/MiLNet.
... One can effectively capture the contextual information of the image by further extracting semantic information. PSPNet [11] utilized pyramid pooling to fuse features at different scales, thereby reducing the loss of contextual information in sub-regions. PANet [12] employed a pyramid structure to extract and fuse contextual information, while also utilizing global pooling to obtain global information. ...
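A compact PyTorch sketch of the pyramid-pooling idea mentioned above (pool at several bin sizes, project, upsample, and concatenate); the bin sizes and channel split are illustrative assumptions, not the cited configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Simplified PSPNet-style pyramid pooling module."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), kernel_size=1, bias=False),
                          nn.ReLU(inplace=True))
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                     align_corners=False) for stage in self.stages]
        return torch.cat(feats, dim=1)  # original features + multi-scale context

out = PyramidPooling(64)(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```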
... Tian et al. [1] introduced a text detection framework with a vertical positioning mechanism called CTPN, which detects text lines within fine-grained text proposals in the convolutional feature map and extracts contextual information, effectively spotting deeply blurred text. Shi et al. [2] designed an oriented text detection method, SegLink. It decomposes text into locally detectable segments and links, and enables fully convolutional neural networks to perform dense detection at multiple scales through end-to-end training. ...
... On the model front, traditional CNN-based approaches [10,25,44] trained on generic object detection datasets have often struggled to achieve satisfactory results in this unique context. Unlike natural images, which often contain simple and distinctive visual information [6,29], X-ray images are characterized by a lack of strong discriminative properties and often contain heavy visual noise [20,22,37]. Significant strides have been made in advancing the detection of prohibited items in X-ray imagery, with many breakthroughs in model performance. ...
Preprint
The detection of prohibited items in X-ray security inspections is vital for ensuring public safety. However, the long-tail distribution of item categories, where certain prohibited items are far less common, poses a significant challenge for detection models, as rare categories often lack sufficient training data. Existing methods struggle to classify these rare items accurately due to this imbalance. In this paper, we propose a Dual-level Boost Network (DBNet) specifically designed to overcome these challenges in X-ray security screening. Our approach introduces two key innovations: (1) a specific data augmentation strategy employing Poisson blending, inspired by the characteristics of X-ray images, to generate realistic synthetic instances of rare items, which effectively mitigates data imbalance; and (2) a context-aware feature enhancement module that captures the spatial and semantic interactions between objects and their surroundings, enhancing classification accuracy for underrepresented categories. Extensive experimental results demonstrate that DBNet improves detection performance for tail categories, outperforming state-of-the-art methods in X-ray security inspection scenarios by a large margin of 17.2%, thereby ensuring enhanced public safety.
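The Poisson-blending augmentation described in this abstract can be approximated with OpenCV's seamless cloning; the sketch below is a generic illustration with synthetic arrays, not the paper's augmentation pipeline.

```python
import cv2
import numpy as np

def paste_rare_item(item_img, item_mask, scene_img, center):
    """Poisson-blend a cropped rare-item patch into a scene at the given (x, y) center."""
    return cv2.seamlessClone(item_img, scene_img, item_mask, center, cv2.NORMAL_CLONE)

# synthetic example data (stand-ins for an X-ray scene and a prohibited-item crop)
scene = np.full((256, 256, 3), 200, np.uint8)
item = np.zeros((64, 64, 3), np.uint8)
cv2.circle(item, (32, 32), 20, (0, 0, 255), -1)
mask = np.zeros((64, 64), np.uint8)
cv2.circle(mask, (32, 32), 22, 255, -1)
augmented = paste_rare_item(item, mask, scene, (128, 128))
```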
... This method adapts well to curved and deformed text in natural scenes. Shi et al. [14] defined text as two detectable elements, segments and links, where a segment is typically a word or character. By aggregating the predicted segments based on the links, text lines can be obtained. ...
Article
Full-text available
A two-stage algorithm based on deep learning for the detection and recognition of can bottom spray codes and numbers is proposed to address the problems of small character areas and fast production line speeds in can bottom spray code number recognition. In the coding number detection stage, a Differentiable Binarization Network is used as the backbone network, combined with the Attention and Dilation Convolutions Path Aggregation Network feature fusion structure to enhance the detection effect of the model. For text recognition, using the Scene Visual Text Recognition coding number recognition network for end-to-end training can alleviate the problem of coding recognition errors caused by image color distortion due to variations in lighting and background noise. In addition, model pruning and quantization are used to reduce the number of model parameters to meet deployment requirements in resource-constrained environments. A comparative experiment was conducted using a dataset of can bottom spray code numbers collected on-site, and a transfer experiment was conducted using a dataset of packaging box production dates. The experimental results show that the algorithm proposed in this study can effectively locate the coding of cans at different positions on the roller conveyor, and can accurately identify the coding numbers at high production line speeds. The Hmean value of coding number detection is 97.32%, and the accuracy of coding number recognition is 98.21%. This verifies that the algorithm proposed in this paper achieves high accuracy in coding number detection and recognition.
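For the pruning/quantization step mentioned at the end of the abstract, a generic PyTorch post-training dynamic quantization looks like the sketch below; the toy model is hypothetical and merely stands in for the recognition network.

```python
import torch
import torch.nn as nn

# toy stand-in for a recognition head (hypothetical, not the paper's model)
model = nn.Sequential(nn.Flatten(), nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 100))

# Post-training dynamic quantization: Linear weights stored as int8,
# a common way to shrink models for resource-constrained deployment.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 100])
```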
... TextBoxes [11] builds on SSD [12] by modifying the aspect ratios of the anchor points and adding a textbox layer using a horizontal convolution kernel to detect lines of text. SegLink [13] detects feature maps at different scales separately and introduces angle prediction for anchors, making it easier to detect text at different scales and angles. PCR [14] proposes to first regress the horizontal contour of the text and then regress the arbitrary shape of the text to detect irregular text. ...
Article
Full-text available
Industrial barrel labels generally have low visual contrast, uneven lighting, and cluttered backgrounds, making it challenging to accurately locate text regions. This paper proposes a text detection network based on DBNet to solve the inaccurate localization problem. First, a convolutional attention mechanism is applied to the feature extraction network to obtain more valuable text feature maps. Then, a dual-branch convolutional feature module is proposed in the feature pyramid to enrich contextual information. Besides, during the probability map generation stage, a feature remodeling enhancement module is used to further distinguish text from text boundaries. This paper designs comparative experiments on the ILTD, ICDAR2015 and MSRA-TD500 datasets, achieving F-measures of 92.3%, 86.0% and 84.1%, which are 2.2%, 2.3%, and 1.9% higher than DBNet, respectively. These results demonstrate that our proposed method exhibits competitive performance and strong robustness. © 2024 Institute of Electrical Engineers of Japan and Wiley Periodicals LLC.
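As background for the DBNet-based pipeline above, the differentiable binarization step used by DB-style detectors approximates hard thresholding with a steep sigmoid of (P - T), so the threshold map T can be learned; this sketch follows the original DB formulation, not this paper's added modules.

```python
import torch

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map B = sigmoid(k * (P - T)); differentiable in both P and T."""
    return torch.sigmoid(k * (prob_map - thresh_map))

P = torch.rand(1, 1, 32, 32)   # text probability map
T = torch.full_like(P, 0.3)    # (learned) threshold map
B = differentiable_binarization(P, T)
```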
... Zhang et al. [23] proposed DRRG, which introduces graph convolutional networks (GCN) to model the relationships between the predicted text components. Shi et al. [24] represented instances as segments and links, adopting a prediction scheme to judge which segments belong to the same instances. The above methods need complex post-processing, which limits the detection speed significantly and is not suitable for traffic and industrial scene text detection. ...
Preprint
Texts in the intelligent transportation scene carry a large amount of information. Fully harnessing this information is one of the critical drivers for advancing intelligent transportation. Unlike the general scene, detecting text in transportation has extra demands, such as fast inference speed in addition to high accuracy. Most existing real-time text detection methods are based on the shrink mask, which loses some geometric semantic information and needs complex post-processing. In addition, previous methods usually focus on correct output, which ignores feature correction and lacks guidance during the intermediate process. To this end, we propose an efficient multi-scene text detector that contains an effective text representation, the similar mask (SM), and a feature correction module (FCM). Unlike previous methods, the former aims to preserve the geometric information of the instances as much as possible. Its post-processing saves 50% of the time while accurately and efficiently reconstructing text contours. The latter encourages false positive features to move away from the positive feature center, optimizing the predictions at the feature level. Ablation studies demonstrate the efficiency of the SM and the effectiveness of the FCM. Moreover, the deficiencies of existing traffic datasets (such as low-quality annotation or closed-source data unavailability) motivated us to collect and annotate a traffic text dataset, which introduces motion blur. In addition, to validate the scene robustness of SM-Net, we conduct experiments on traffic, industrial, and natural scene datasets. Extensive experiments verify that it achieves state-of-the-art (SOTA) performance on several benchmarks. The code and dataset are available at: https://github.com/fengmulin/SMNet.
... As text detection (Shi, Bai, and Belongie 2017; Liao et al. 2020b) and text recognition (Shi, Bai, and Yao 2016; Shi et al. 2018; Fang et al. 2021) technologies gradually mature, the task of end-to-end text detection and recognition, known as text spotting (Li, Wang, and Shen 2017), is increasingly becoming a focus of research. Most pioneering works (He et al. 2018; Liao et al. 2020a; Liu et al. 2018, 2020; Wang et al. 2021c) in text spotting follow a pipeline that first detects and then recognizes text, whereby the features of the Region-of-Interest (ROI) are passed to the recognizer. ...
Preprint
End-to-end visual information extraction (VIE) aims at integrating the hierarchical subtasks of VIE, including text spotting, word grouping, and entity labeling, into a unified framework. Dealing with the gaps among the three subtasks plays a pivotal role in designing an effective VIE model. OCR-dependent methods heavily rely on offline OCR engines and inevitably suffer from OCR errors, while OCR-free methods, particularly those employing a black-box model, might produce outputs that lack interpretability or contain hallucinated content. Inspired by CenterNet, DeepSolo, and ESP, we propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task. Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities. Furthermore, we devise corresponding hierarchical pre-training strategies, categorized as image reconstruction, layout learning, and language enhancement, to reinforce the cross-modality representation of the hierarchical encoders. Quantitative experiments on public benchmarks demonstrate that HIP outperforms previous state-of-the-art methods, while qualitative results show its excellent interpretability.
... PixelLink [25] predicted a pixel score map and the relationships with surrounding pixels to detect text. SegLink [26] represented text instances as segments and links and merged segments according to the predictions of links. Long et al. proposed TextSnake [27], which represents text like a snake. ...
Preprint
Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, the prediction of pixels is isolated without introducing pixel-feature interaction, which also influences the detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noises. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish varying scale texts effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling different scale texts and confirm the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.
... TextBoxes++ improves the localization of multi-oriented text in natural scene images by optimizing the network structure and training procedure. Shi et al. [50] introduced the SegLink method to localize oriented text in natural scene images. This method operates in two stages: segmentation, where rectangular boxes are placed over specific word or text line areas, and linking, which connects adjacent segments. ...
Article
Full-text available
Text localization and recognition from natural scene images has gained a lot of attention recently due to its crucial role in various applications, such as autonomous driving and intelligent navigation. However, two significant gaps exist in this area: (1) prior research has primarily focused on recognizing English text, whereas Arabic text has been underrepresented, and (2) most prior research has adopted separate approaches for scene text localization and recognition, as opposed to one integrated framework. To address these gaps, we propose a novel bilingual end-to-end approach that localizes and recognizes both Arabic and English text within a single natural scene image. Specifically, our approach utilizes pre-trained CNN models (ResNet and EfficientNetV2) with kernel representation for localization text and RNN models (LSTM and BiLSTM) with an attention mechanism for text recognition. In addition, the AraElectra Arabic language model was incorporated to enhance Arabic text recognition. Experimental results on the EvArest, ICDAR2017, and ICDAR2019 datasets demonstrated that our model not only achieves superior performance in recognizing horizontally oriented text but also in recognizing multi-oriented and curved Arabic and English text in natural scene images.
... Traditional text reading systems typically consist of two distinct stages: text detection [21, 51,65,68,74,75,90] followed by text recognition [7,15,63,64]. The overall performance is highly reliant on the accuracy of the initial text detection phase, creating a cascading effect when errors occur. ...
Preprint
Full-text available
Reading text from images (either natural scenes or documents) has been a long-standing research topic for decades, due to the high technical challenge and wide application range. Previously, individual specialist models are developed to tackle the sub-tasks of text reading (e.g., scene text recognition, handwritten text recognition and mathematical expression recognition). However, such specialist models usually cannot effectively generalize across different sub-tasks. Recently, generalist models (such as GPT-4V), trained on tremendous data in a unified way, have shown enormous potential in reading text in various scenarios, but with the drawbacks of limited accuracy and low efficiency. In this work, we propose Platypus, a generalized specialist model for text reading. Specifically, Platypus combines the best of both worlds: being able to recognize text of various forms with a single unified architecture, while achieving excellent accuracy and high efficiency. To better exploit the advantage of Platypus, we also construct a text reading dataset (called Worms), the images of which are curated from previous datasets and partially re-labeled. Experiments on standard benchmarks demonstrate the effectiveness and superiority of the proposed Platypus model. Model and data will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/Platypus.
... OBBs can tightly encompass objects, reducing the impact of complex background information, and thus offering more precise localization compared to traditional HBBs. This detection approach holds significant value in many fields such as scene text recognition [37], remote sensing images [38], and biomedical science [39]. Key advancements in this field include two-stage and single-stage detectors. ...
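The HBB-versus-OBB contrast is easy to see with OpenCV: for the same tilted point set, the axis-aligned rectangle encloses far more background than the minimum-area rotated rectangle. The points below are made-up data for illustration only.

```python
import cv2
import numpy as np

# corners of an elongated, tilted object footprint (illustrative values)
pts = np.array([[10, 10], [60, 35], [55, 45], [5, 20]], dtype=np.int32)

x, y, w, h = cv2.boundingRect(pts)                 # horizontal bounding box (HBB)
(cx, cy), (rw, rh), angle = cv2.minAreaRect(pts)   # oriented bounding box (OBB)

print("HBB area:", w * h)
print("OBB area:", rw * rh)  # noticeably smaller for tilted, elongated objects
```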
Article
Full-text available
Vehicle detection is vital for urban planning and traffic management. Optical remote sensing imagery, known for its high resolution and extensive coverage, is ideal for this task. Traditional horizontal bounding box (HBB) annotations often include excessive background, leading to reduced accuracy, whereas oriented bounding box (OBB) annotations are more precise but costly and prone to human error. To address these issues, we propose the OGR-SM (Oriented Group R-CNN + Student Model) framework, a weakly semi-supervised oriented vehicle detection method based on single-point annotations. It leverages a small amount of OBB annotations along with a large quantity of single-point annotations for training, achieving performance comparable to fully-supervised learning with 100% complete annotations. Specifically, we train the teacher model (OGR) using a small set of accurately annotated OBBs and their corresponding single-point annotations. This model employs an instance-point driven proposal grouping strategy and a group-based proposal assignment strategy enhanced with point location to generate pseudo-OBBs from a large set of weakly annotated single-point data. We then utilize conventional oriented detectors as student models to perform the vehicle detection task in a standard manner. Extensive experiments on two datasets show that our framework, using limited accurate OBBs and many pseudo-OBBs, can achieve or surpass the accuracy of fully-supervised models. Our method balances high-quality annotations with data availability, enhancing scalability and robustness for vehicle detection in remote sensing applications.
... Regression-based methods, such as TextBoxes [8], modify SSD anchor points and convolution kernel scales to address extreme aspect ratios, but they fail to handle some difficult cases, such as overexposure and large character spacing. SegLink [9] applies a bottom-up mechanism to predict text segments and their links to handle long text but fails to detect text that has very large character spacing. EAST [10] applies pixel-level regression to multi-oriented text instances, but it might miss or yield imprecise predictions for vertical text instances. ...
Article
Full-text available
Intelligent systems, such as driving assistance systems, can assist drivers by providing basic traffic, road blockage and possible route information to enable safe driving. The goal of scene text tracking in driver assistance systems is to locate and track scene text, milestone signs, traffic panels and road signs in real time. Therefore, the accuracy and real-time performance of scene text localization tracking play vital roles in intelligent driving assistance systems. However, traffic video text tracking often has the problems of missed and false detections because of illumination occlusion and similar appearances. In this paper, we propose a new Swin transformer-based traffic video text tracking method, known as STVT, which is composed of a Siamese SwinDC transformer module, a deformable text detection module, and a text matching module. The STVT method employs the Siamese SwinDC transformer module, which performs text detection by considering both temporal and spatial dimensions, mitigating the issue of missed detections caused by occlusion. The text matching module combines the semantic, visual, and geometric features of text instances to effectively differentiate visually similar text instances. Extensive experiments demonstrated that our proposed STVT method outperformed the state-of-the-art methods on various benchmark datasets. On the ICDAR2015 dataset, compared with those of the Free method, the mostly matched (MM) result increased by 32.0% (702 vs. 926), and the mostly lost (ML) result decreased by 33.2% (568 vs. 850). The visualization results demonstrated that the proposed STVT model can accurately detect and track occluded text instances in traffic videos. On the ICDAR2023 dataset, our method achieved a 6.01% improvement in MOTA compared to that of the TransDETR method, demonstrating that our proposed method is effective for small and dense text detection problems. In addition, qualitative and quantitative analyses confirmed the effectiveness and real-time performance of our proposed STVT method.
... Various deep neural network algorithms proficiently extract image features and yield satisfactory detection outcomes, including YOLO [42], RCNN [11], Faster R-CNN [43], SSD [29], and similar methods. Moreover, there are notable approaches such as CTPN [56], TextBoxes++ [26], ABCNet [49], SegLink [52], etc., which fall within this class of methods. ...
Article
Identifying and acknowledging Traffic Panels (TP) and the text they display constitute significant use cases for Advanced Driver Assistance Systems (ADAS). In recent years, particularly in the context of the Arabic language, extracting textual information from TP and signs has emerged as a challenging problem in the field of computer vision. Furthermore, the significant rise in road traffic accidents within Arabic-speaking countries has resulted in substantial financial losses and loss of human lives. This is largely attributed to the limited number of diverse datasets for traffic signs and the absence of a reliable system for TP detection. Implementing warning and guidance systems for drivers on the road not only addresses this issue but also paves the way for the integration of intelligent components into future vehicles, offering decision support for transitioning to semi-automatic or fully automatic driving based on the driver’s health condition. These tasks present us with two main challenges. First, it involves developing a new Arabic dataset called the Syphax Traffic Panels dataset (STP) tailored to the diverse conditions of natural scenes gathered from “Sfax,” a city in Tunisia. This dataset aims to provide high-quality images of Arabic TP. Secondly, we suggest a deep learning method for detecting Arabic text on TP by evaluating the performance of the state-of-the-art algorithms in this context. In our study, we enhance the architecture of the most successful result achieved. The experiments conducted reveal promising results, affirming the significant contribution of our dataset to this research area, and even more encouraging results stemming from the enhancements made to the proposed method. The dataset we possess is accessible to the general public on IEEE DataPort https://dx.doi.org/10.21227/5zd9-pe55
Article
Full-text available
End-to-end scene text spotting methods have garnered significant research attention due to their promising results. However, most existing approaches are not well suited for real-world applications because of their inherently complex pipelines. In this paper, we propose an end-to-end Character Region Excavation Network (CRENet) to streamline the text spotting pipeline. Our contributions are threefold: (i) Pipeline simplification: For the first time, we eliminate the text region retrieval step, allowing characters to be directly spotted from scene images. (ii) ROA layer: We introduce a novel RoI (Region of Interest) feature sampling layer for multi-oriented character region feature sampling, significantly enhancing the recognizer’s performance. (iii) Progressive learning strategy: We propose a progressive learning strategy to gradually bridge the gap between synthetic data and real-world images, addressing the challenge posed by the high cost of character-level annotations required during training. Extensive experiments demonstrate that our proposed method is robust and effective across horizontal, oriented, and curved text, achieving results comparable to state-of-the-art methods on ICDAR 2013, ICDAR 2015, Total-Text and ReCTS.
Article
The detection of scene text holds significant importance across a variety of application scenarios. However, previous methods struggle with challenges such as variations in text size, chaotic backgrounds and diverse text orientations. To address these challenges, this paper proposes a novel methodology based on Text Stroke Components (TSC). The method leverages Harris corner detection to identify critical points of text strokes, such as endpoints, turning points, and curvatures. By analyzing the clustered regions of these points, the approach effectively localizes text characters. To enhance the detection process, a transparency parameter is introduced to control the fusion between the original images and the corner-detection images. This improves the localization of key stroke points and reduces background noise interference. The proposed method is evaluated through extensive experiments, demonstrating superior performance compared to existing scene text detectors. Furthermore, the method is jointly trained with the ABINet recognition model across all stages. Comprehensive experiments conducted on 13 datasets reveal that this approach significantly outperforms SOTA methods. These results underscore the advantages of using text stroke components for key-point localization through the corner detection algorithm in scene text detection.
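The corner-based key-point localization and transparency fusion described above can be sketched with standard OpenCV calls; the threshold and weights here are illustrative choices, not the paper's settings.

```python
import cv2
import numpy as np

def stroke_keypoint_overlay(img_bgr, alpha=0.6):
    """Mark Harris-corner responses (stroke endpoints/turning points) and blend
    them with the original image using a transparency weight alpha."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    response = cv2.cornerHarris(gray, 2, 3, 0.04)   # blockSize=2, ksize=3, k=0.04
    corner_vis = np.zeros_like(img_bgr)
    corner_vis[response > 0.01 * response.max()] = (0, 0, 255)  # mark corners in red
    return cv2.addWeighted(img_bgr, alpha, corner_vis, 1.0 - alpha, 0)
```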
Article
Full-text available
At present, scene text detection is attracting more and more attention as an offshoot of machine vision. However, due to long text instances and complex background context, inexact localization and missed detections still remain in the text detection domain. Accordingly, with the aim of tackling these two issues, we propose a text detector named GP-PSENet that comprises a group-related dilated encoder, a parallel extensional dilation-wise residual encoder and a mixed upsample module. Firstly, feature maps of the lowest level processed by the backbone network are sent to a dilated encoder with group linkage, and the group residual module provides stratification to join group coefficients and dilation factors. This module can enhance the correctness of predictions for longer bounding boxes. Secondly, semantic information from the highest level is fed into a parallel extensional dilation-wise residual encoder. The extensional dilation-wise module is capable of obtaining diverse receptive fields through more parallel branches, and it can alleviate erroneous detections caused by interfering material in the background. Thirdly, the feature maps processed in the second step are given to the mixed upsample module for transformation before the subsequent fusion. Finally, the processed two-level feature maps are fused and sent to the progressive scale expansion algorithm for final post-processing to obtain the predicted coordinate points. Ablation experiments are conducted on the CTW1500, ICDAR15, MSRA-TD500 and Total-Text datasets to confirm the effectiveness of the proposed method. The precision values on these datasets reach 86.24%, 87.84%, 73.98% and 90.48%. The proposed method is also competitive with other scene text detection methods.
Article
Text detection in natural images is a crucial task for extracting and recognizing valuable information, but it comes with significant challenges. Traditional image processing methods often rely on synthetic features, which may not be suitable for handling the diverse range of unstructured text scenarios encountered in real-world environments. To overcome this challenge, deep learning techniques have emerged as a powerful tool for adaptively learning and extracting features to digitize text. In this paper, we introduce a novel approach based on the YOLOv5 architecture for text detection. Our method leverages the strengths of YOLOv5, including its efficient and lightweight architecture, multi-scale prediction capability, and advanced data augmentation techniques. By employing these features, our proposed model can accurately identify regions in an image that are more likely to contain text. Through extensive experimentation on popular benchmark datasets, we have evaluated the performance of our proposed model. The results have demonstrated that our method outperforms state-of-the-art approaches in terms of both detection efficiency and accuracy. This indicates the effectiveness of our approach in handling the challenges posed by text detection in natural images.
Article
Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, the prediction of pixels is isolated without introducing pixel-feature interaction, which also influences the detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noises. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish varying scale texts effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling different scale texts and confirm the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.
Article
The irregular contour representation is one of the tough challenges in scene text detection. Although segmentation-based methods have achieved significant progress with the help of flexible pixel prediction, the overlap of spatially close texts hinders detecting them separately. To alleviate this problem, some shrink-based methods predict text kernels and expand them to reconstruct texts. However, the text kernel is an artificial object with incomplete semantic features that is prone to incorrect or missed detection. In addition, unlike general objects, the geometry features (aspect ratio, scale, and shape) of scene texts vary significantly, which makes them difficult to detect accurately. To address these problems, we propose an effective spotlight text detector (STD), which consists of a spotlight calibration module (SCM) and a multivariate information extraction module (MIEM). The former concentrates on the candidate kernel, like a camera focusing on its target. It obtains candidate features through a mapping filter and calibrates them precisely to eliminate some false positive samples. The latter designs different shape schemes to explore multiple geometric features of scene texts. It helps extract various spatial relationships to improve the model's ability to recognize kernel regions. Ablation studies prove the effectiveness of the designed SCM and MIEM. Extensive experiments verify that our STD is superior to existing state-of-the-art methods on various datasets, including ICDAR2015, CTW1500, MSRA-TD500, and Total-Text.
Preprint
Full-text available
This paper introduces a novel rotation-based framework for arbitrary-oriented text detection in natural scene images. We present the Rotation Region Proposal Networks (RRPN), which are designed to generate inclined proposals with text orientation angle information. The angle information is then adapted for bounding box regression to make the proposals more accurately fit into the text region in terms of the orientation. The Rotation Region-of-Interest (RRoI) pooling layer is proposed to project arbitrary-oriented proposals to a feature map for a text region classifier. The whole framework is built upon a region-proposal-based architecture, which ensures the computational efficiency of the arbitrary-oriented text detection compared with previous text detection systems. We conduct experiments using the rotation-based framework on three real-world scene text detection datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
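The inclined-proposal idea boils down to adding an angle dimension to the anchor set; a toy enumeration (scales, ratios, and angles are made-up values, not RRPN's configuration) looks like this:

```python
import numpy as np

def rotated_anchors(cx, cy, scales=(8, 16, 32), ratios=(0.2, 0.5, 1.0),
                    angles=(-60, -30, 0, 30, 60, 90)):
    """Enumerate inclined anchors (cx, cy, w, h, angle in degrees) at one location."""
    anchors = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(1.0 / r), s * np.sqrt(r)
            for a in angles:
                anchors.append((cx, cy, w, h, a))
    return np.array(anchors)

print(rotated_anchors(32, 32).shape)  # (54, 5): 3 scales x 3 ratios x 6 angles
```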
Article
Texts in the intelligent transportation scene carry a large amount of information. Fully harnessing this information is one of the critical drivers for advancing intelligent transportation. Unlike the general scene, detecting text in transportation has extra demands, such as fast inference speed in addition to high accuracy. Most existing real-time text detection methods are based on the shrink mask, which loses some geometric semantic information and needs complex post-processing. In addition, previous methods usually focus on correct output, which ignores feature correction and lacks guidance during the intermediate process. To this end, we propose an efficient multi-scene text detector that contains an effective text representation, the similar mask (SM), and a feature correction module (FCM). Unlike previous methods, the former aims to preserve the geometric information of the instances as much as possible. Its post-processing saves 50% of the time while accurately and efficiently reconstructing text contours. The latter encourages false positive features to move away from the positive feature center, optimizing the predictions at the feature level. Ablation studies demonstrate the efficiency of the SM and the effectiveness of the FCM. Moreover, the deficiencies of existing traffic datasets (such as low-quality annotation or closed-source data unavailability) motivated us to collect and annotate a traffic text dataset, which introduces motion blur. In addition, to validate the scene robustness of SM-Net, we conduct experiments on traffic, industrial, and natural scene datasets. Extensive experiments verify that it achieves state-of-the-art (SOTA) performance on several benchmarks. The code and dataset are available at: https://github.com/fengmulin/SMNet.
Article
Full-text available
YouTube’s “Video Chapter” feature segments videos into different sections, marked by timestamps on the slider, enhancing user navigation. Given the vast volume of video data, processing these efficiently demands substantial time and computational resources. This paper addresses two key objectives: reducing the computational cost of deep model training for text detection and enhancing overall performance with minimal effort. We introduce a classroom-based multi-loss learning approach for text detection, extending its application to title detection without requiring annotations. In deep learning, loss functions play a crucial role in updating model weights. Our proposed multi-loss functions facilitate faster convergence compared to baseline methods. Additionally, we present a novel technique to handle annotation-less data by employing a text grouping method to differentiate between regular text and title text. Experimental results on the COCO-Text and Slidin’ Videos AI-5G Challenge datasets demonstrate the efficacy and practicality of our approach.
Article
Full-text available
Scene text detection is a challenging topic in computer vision, characterized by complex illumination, irregular shape, and arbitrary size. While recent advancements have been made in scene text detection, it remains difficult to simultaneously distinguish nearby text and accommodate irregularly shaped text. Therefore, this paper introduces HPNet, an enhanced text detector, based on the segmentation method that predicts two-scale results. To improve the shape robustness, the Hybrid Attentional Feature Fusion (HAFF) module is integrated into Feature Pyramid Networks (FPN) to dynamically perform feature fusion. Additionally, to distinguish nearby text, the model predicts the text region covering text instances and the text kernel covering the central region of the text. The improved Pixel Aggregation (PA) algorithm is then utilized to guide the expansion from the text kernel to the text region. Experiments on IC15, Total-Text, and CTW1500 validate the effectiveness of these improvements and the superiority of HPNet. Compared with the previous method PSENet for nearby texts, the proposed HPNet has improved inference speed by 63.6% and F-measure metric by 2.6%, 3.7%, and 2.5% on three datasets, respectively.
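The kernel-to-region expansion that HPNet's improved Pixel Aggregation performs can be approximated with a breadth-first growth of labeled kernels inside the predicted text region; the sketch below is a simplified stand-in for the similarity-guided version described in the abstract.

```python
import numpy as np
from collections import deque
from scipy import ndimage

def expand_kernels(text_mask, kernel_mask):
    """Grow labeled kernels outward (4-neighbour BFS), but only into text-region pixels."""
    labels, _ = ndimage.label(kernel_mask)       # one label per text kernel
    out = labels.copy()
    queue = deque(zip(*np.nonzero(labels)))
    h, w = text_mask.shape
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and out[ny, nx] == 0 and text_mask[ny, nx]:
                out[ny, nx] = out[y, x]          # inherit the kernel's instance label
                queue.append((ny, nx))
    return out
```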
Article
Full-text available
Existing segmentation-based text detection methods generally face the problems of insufficient receptive fields, insufficient text information filtering, and difficulty in balancing detection accuracy and speed, limiting their ability to detect arbitrary-shaped text in complex backgrounds. To address these problems, we propose a new text detection method fusing the pure ConvNet model InceptionNeXt and the multi-scale attention mechanism. Firstly, we propose a text information reinforcement module to fully extract effective text information from features of different scales while preserving spatial position information. Secondly, we construct the InceptionNeXt Block module to compensate for insufficient receptive fields without significantly reducing speed. Finally, the INA-DBNet network structure is designed to fuse local and global features and achieve the balance of accuracy and speed. Experimental results demonstrate the efficacy of our method. Particularly, on the MSRA-TD500 and Total-text datasets, INA-DBNet achieves 91.3% and 86.7% F-measure while maintaining real-time inference speed. Code is available at: https://github.com/yuyu678/INANET.
Conference Paper
Full-text available
We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images. The CTPN detects a text line as a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts the location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information in the image, making it powerful for detecting extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post-filtering. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8, 35] by a large margin. The CTPN is computationally efficient at 0.14 s/image, using the very deep VGG16 model [27]. Online demo is available at: http://textdet.com/.
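A minimal sketch of the two mechanisms described above, assuming a 16-pixel feature stride; the anchor heights, gap and overlap thresholds are illustrative rather than the paper's exact values, and the recurrent connection is omitted.

    import numpy as np

    def vertical_anchors(feat_h, feat_w, stride=16,
                         heights=(11, 16, 23, 33, 48, 68, 97, 139, 198, 283)):
        """Fixed-width (one-stride-wide) anchors with several heights per feature-map cell."""
        boxes = []
        for y in range(feat_h):
            for x in range(feat_w):
                cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
                for h in heights:
                    boxes.append([cx - stride / 2, cy - h / 2, cx + stride / 2, cy + h / 2])
        return np.array(boxes)                           # (x1, y1, x2, y2) per anchor

    def link_proposals(proposals, max_gap=50, min_v_overlap=0.7):
        """Greedily chain fine-scale proposals with high vertical overlap into text lines."""
        if len(proposals) == 0:
            return []
        proposals = sorted(proposals, key=lambda b: b[0])
        lines, current = [], [proposals[0]]
        for box in proposals[1:]:
            prev = current[-1]
            overlap = min(prev[3], box[3]) - max(prev[1], box[1])
            v_ratio = overlap / min(prev[3] - prev[1], box[3] - box[1])
            if box[0] - prev[2] < max_gap and v_ratio > min_v_overlap:
                current.append(box)
            else:
                lines.append(current)
                current = [box]
        lines.append(current)
        return lines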
Article
Full-text available
Recently, scene text detection has become an active research topic in computer vision and document analysis, because of its great importance and significant challenge. However, the vast majority of existing methods detect text within local regions, typically through extracting character, word or line level candidates followed by candidate aggregation and false positive elimination, which potentially excludes the effect of wide-scope and long-range contextual cues in the scene. To take full advantage of the rich information available in the whole natural image, we propose to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. The proposed algorithm directly runs on full images and produces global, pixel-wise prediction maps, in which detections are subsequently formed. To better make use of the properties of text, three types of information regarding text region, individual characters and their relationship are estimated, with a single Fully Convolutional Network (FCN) model. With such predictions of text properties, the proposed algorithm can simultaneously handle horizontal, multi-oriented and curved text in real-world natural images. The experiments on standard benchmarks, including ICDAR 2013, ICDAR 2015 and MSRA-TD500, demonstrate that the proposed algorithm substantially outperforms previous state-of-the-art approaches. Moreover, we report the first baseline result on the recently-released, large-scale dataset COCO-Text.
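A minimal sketch of the kind of post-processing such a segmentation-based detector needs: threshold the pixel-wise text-region map and fit a (possibly rotated) box to each connected component. score_map is an assumed H×W array of per-pixel text probabilities.

    import numpy as np
    import cv2

    def regions_from_score_map(score_map, thresh=0.5, min_area=10):
        """Threshold a pixel-wise text score map and fit a rotated box to each component."""
        binary = (score_map > thresh).astype(np.uint8)
        num, labels = cv2.connectedComponents(binary)
        boxes = []
        for k in range(1, num):                          # label 0 is the background
            ys, xs = np.nonzero(labels == k)
            if len(xs) < min_area:
                continue
            pts = np.stack([xs, ys], axis=1).astype(np.float32)
            boxes.append(cv2.boxPoints(cv2.minAreaRect(pts)))   # 4 corners, any orientation
        return boxes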
Article
Full-text available
In this paper, we propose a novel approach for text detection in natural images. Both local and global cues are taken into account for localizing text lines in a coarse-to-fine procedure. First, a Fully Convolutional Network (FCN) model is trained to predict the salient map of text regions in a holistic manner. Then, text line hypotheses are estimated by combining the salient map and character components. Finally, another FCN classifier is used to predict the centroid of each character, in order to remove the false hypotheses. The framework is general for handling text in multiple orientations, languages and fonts. The proposed method consistently achieves the state-of-the-art performance on three text detection benchmarks: MSRA-TD500, ICDAR2015 and ICDAR2013.
Article
Full-text available
The field of object detection has made significant advances riding on the wave of region-based ConvNets, but their training procedure still includes many heuristics and hyperparameters that are costly to tune. We present a simple yet surprisingly effective online hard example mining (OHEM) algorithm for training region-based ConvNet detectors. Our motivation is the same as it has always been -- detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more effective and efficient. OHEM is a simple and intuitive algorithm that eliminates several heuristics and hyperparameters in common use. But more importantly, it yields consistent and significant boosts in detection performance on benchmarks like PASCAL VOC 2007 and 2012. Its effectiveness increases as datasets become larger and more difficult, as demonstrated by the results on the MS COCO dataset. Moreover, combined with complementary advances in the field, OHEM leads to state-of-the-art results of 78.9% and 76.3% mAP on PASCAL VOC 2007 and 2012 respectively.
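A minimal PyTorch sketch of the hard-example selection idea (not the authors' RoI-level implementation): compute per-example losses without reduction, keep only the keep_num largest, and backpropagate their mean. logits and targets are assumed to be a batch of classification outputs and labels.

    import torch
    import torch.nn.functional as F

    def ohem_loss(logits, targets, keep_num=128):
        """Backpropagate only through the keep_num examples with the highest loss."""
        per_example = F.cross_entropy(logits, targets, reduction='none')
        keep_num = min(keep_num, per_example.numel())
        hard, _ = torch.topk(per_example, keep_num)      # largest losses = hardest examples
        return hard.mean()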
Article
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of bounding box priors over different aspect ratios and scales per feature map location. At prediction time, the network generates confidences that each prior corresponds to objects of interest and produces adjustments to the prior to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals, such as R-CNN and MultiBox, because it completely discards the proposal generation step and encapsulates all the computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the ILSVRC DET and PASCAL VOC datasets confirm that SSD has comparable performance with methods that utilize an additional object proposal step and yet is 100-1000x faster. Compared to other single-stage methods, SSD has similar or better performance, while providing a unified framework for both training and inference.
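A minimal sketch of how per-location prior (default) boxes over several aspect ratios can be generated for one square feature map; the scale and ratios are illustrative, not SSD's exact configuration.

    import numpy as np

    def default_boxes(feat_size, scale, ratios=(1.0, 2.0, 0.5)):
        """One set of (cx, cy, w, h) priors per feature-map cell, normalized to [0, 1]."""
        boxes = []
        for i in range(feat_size):
            for j in range(feat_size):
                cx, cy = (j + 0.5) / feat_size, (i + 0.5) / feat_size
                for r in ratios:
                    boxes.append([cx, cy, scale * np.sqrt(r), scale / np.sqrt(r)])
        return np.array(boxes)                           # feat_size * feat_size * len(ratios) priors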
Conference Paper
Full-text available
Maximally Stable Extremal Regions (MSERs) have achieved great success in scene text detection. However, this low-level pixel operation inherently limits its capability for handling complex text information efficiently (e.g. connections between text or background components), leading to difficulty in distinguishing texts from background components. In this paper, we propose a novel framework to tackle this problem by leveraging the high capability of convolutional neural networks (CNNs). In contrast to recent methods using a set of low-level heuristic features, the CNN is capable of learning high-level features to robustly identify text components from text-like outliers (e.g. bikes, windows, or leaves). Our approach takes advantage of both MSERs and sliding-window based methods: the MSER operator dramatically reduces the number of windows scanned and enhances detection of low-quality texts, while the sliding window with CNN is applied to correctly separate the connections of multiple characters in components. The proposed system achieves strong robustness against a number of extreme text variations and serious real-world problems. It was evaluated on the ICDAR 2011 benchmark dataset, and achieved over 78% in F-measure, which is significantly higher than previous methods.
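The MSER candidate-generation step described here can be reproduced with OpenCV's built-in detector; a minimal sketch in which 'scene.jpg' is a placeholder path and the CNN verification stage is omitted:

    import cv2

    img = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)    # placeholder image path
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(img)               # candidate character components
    # Each region is a pixel set and each bbox is (x, y, w, h); in the approach above these
    # candidates would next be verified by a sliding-window CNN classifier.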
Conference Paper
Full-text available
In this paper, we present a new approach for text localization in natural images, by discriminating text and non-text regions at three levels: pixel, component and text line levels. Firstly, a powerful low-level filter called the Stroke Feature Transform (SFT) is proposed, which extends the widely-used Stroke Width Transform (SWT) by incorporating color cues of text pixels, leading to significantly enhanced performance on inter-component separation and intra-component connection. Secondly, based on the output of SFT, we apply two classifiers, a text component classifier and a text-line classifier, sequentially to extract text regions, eliminating the heuristic procedures that are commonly used in previous approaches. The two classifiers are built upon two novel Text Covariance Descriptors (TCDs) that encode both the heuristic properties and the statistical characteristics of text strokes. Finally, text regions are located by simply thresholding the text-line confidence map. Our method was evaluated on two benchmark datasets: ICDAR 2005 and ICDAR 2011, and the corresponding F-measure values are 0.72 and 0.73, respectively, surpassing previous methods in accuracy by a large margin.
Conference Paper
Full-text available
This report presents the final results of the ICDAR 2013 Robust Reading Competition. The competition is structured in three Challenges addressing text extraction in different application domains, namely born-digital images, real scene images and real-scene videos. The Challenges are organised around specific tasks covering text localisation, text segmentation and word recognition. The competition took place in the first quarter of 2013, and received a total of 42 submissions over the different tasks offered. This report describes the datasets and ground truth specification, details the performance evaluation protocols used and presents the final results along with a brief summary of the participating methods.
Technical Report
A common goal of many computer vision and robotics algorithms is to extract geometric information from the sensory data. Due to noisy measurements and errors in matching or segmentation, the available data are often corrupted with outliers. In such instances robust estimation methods are employed for the problem of parametric model estimation. In the presence of a large fraction of outliers, sampling-based methods are often the preferred choice. The traditionally used RANSAC algorithm, however, requires a large number of samples, prior knowledge of the outlier ratio and an additional, difficult to obtain, inlier threshold for hypothesis evaluation. To tackle these problems we propose a novel, efficient sampling-based method for the robust estimation of model parameters. The method is based on the observation that for each data point, the properties of the residual distribution with respect to the generated hypotheses reveal whether the point is an outlier or an inlier. The problem of inlier/outlier identification can then be formulated as a classification problem. The proposed method is demonstrated on motion estimation problems from image correspondences with a large percentage of outliers (70%) on both synthetic and real data, and on the estimation of planar models from range data. The method is shown to be an order of magnitude more efficient than currently existing methods and does not require prior knowledge of the outlier ratio or the inlier threshold.
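For contrast, a minimal sketch of the classical RANSAC loop (here fitting a 2D line) that the report improves on; note it needs exactly the ingredients criticized above: a fixed number of samples and a hand-set inlier threshold.

    import numpy as np

    def ransac_line(points, n_iters=500, inlier_thresh=1.0, seed=0):
        """Classical RANSAC: sample 2 points, count inliers by a residual threshold, keep the best."""
        rng = np.random.default_rng(seed)
        best_inliers = np.zeros(len(points), dtype=bool)
        for _ in range(n_iters):
            p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
            d = p2 - p1
            norm = np.hypot(d[0], d[1])
            if norm == 0:
                continue
            # Perpendicular distance of every point to the line through p1 and p2.
            residuals = np.abs(d[0] * (points[:, 1] - p1[1]) -
                               d[1] * (points[:, 0] - p1[0])) / norm
            inliers = residuals < inlier_thresh
            if inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        return best_inliers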
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
The goal of this work is text spotting in natural images. This is divided into two sequential tasks: detecting word regions in the image, and recognizing the words within these regions. We make the following contributions: first, we develop a Convolutional Neural Network (CNN) classifier that can be used for both tasks. The CNN has a novel architecture that enables efficient feature sharing (by using a number of layers in common) for text detection, character case-sensitive and insensitive classification, and bigram classification. It exceeds the state-of-the-art performance for all of these. Second, we make a number of technical changes over the traditional CNN architectures, including no downsampling for a per-pixel sliding window, and multi-mode learning with a mixture of linear models (maxout). Third, we have a method of automated data mining of Flickr, that generates word and character level annotations. Finally, these components are used together to form an end-to-end, state-of-the-art text spotting system. We evaluate the text-spotting system on two standard benchmarks, the ICDAR Robust Reading data set and the Street View Text data set, and demonstrate improvements over the state-of-the-art on multiple measures.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
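A minimal PyTorch sketch of the core idea (a toy network, not the paper's architecture): replace the fully connected classifier with a 1x1 convolution so the model accepts any input size and emits a spatial score map, then upsample back to the input resolution.

    import torch
    import torch.nn as nn

    class TinyFCN(nn.Module):
        """Toy fully convolutional net: small conv backbone + 1x1 conv head + upsampling."""
        def __init__(self, num_classes=21):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            )
            self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)  # replaces fc layers

        def forward(self, x):
            score = self.classifier(self.backbone(x))                     # coarse per-pixel scores
            return nn.functional.interpolate(score, size=x.shape[2:],
                                             mode='bilinear', align_corners=False)

    scores = TinyFCN()(torch.randn(1, 3, 96, 160))       # arbitrary input size -> (1, 21, 96, 160)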
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
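A minimal PyTorch sketch of an RPN-style head: a shared 3x3 convolution over the feature map followed by two sibling 1x1 convolutions that emit, for each of num_anchors anchors at every position, an objectness score pair and four box offsets; channel sizes and the anchor count are illustrative.

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        """Sliding 3x3 conv plus sibling 1x1 convs for objectness and box regression."""
        def __init__(self, in_channels=512, num_anchors=9):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)
            self.cls = nn.Conv2d(512, num_anchors * 2, 1)   # object vs. background per anchor
            self.reg = nn.Conv2d(512, num_anchors * 4, 1)   # 4 box deltas per anchor

        def forward(self, feat):
            h = torch.relu(self.conv(feat))
            return self.cls(h), self.reg(h)

    scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))  # dense per-position predictions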
Article
An end-to-end real-time text localization and recognition method is presented. Its real-time performance is achieved by posing the character detection and segmentation problem as an efficient sequential selection from the set of Extremal Regions. The ER detector is robust against blur, low contrast and illumination, color and texture variation. In the first stage, the probability of each ER being a character is estimated using features calculated by a novel algorithm in constant time and only ERs with locally maximal probability are selected for the second stage, where the classification accuracy is improved using computationally more expensive features. A highly efficient clustering algorithm then groups ERs into text lines and an OCR classifier trained on synthetic fonts is exploited to label character regions. The most probable character sequence is selected in the last stage when the context of each character is known. The method was evaluated on three public datasets. On the ICDAR 2013 dataset the method achieves state-of-the-art results in text localization; on the more challenging SVT dataset, the proposed method significantly outperforms the state-of-the-art methods and demonstrates that the proposed pipeline can incorporate additional prior knowledge about the detected text. The proposed method was exploited as the baseline in the ICDAR 2015 Robust Reading competition, where it compares favourably to the state-of-the art.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
Article
In this paper we introduce a new method for text detection in natural images. The method comprises two contributions: first, a fast and scalable engine to generate synthetic images of text in clutter. This engine overlays synthetic text onto existing background images in a natural way, accounting for the local 3D scene geometry. Second, we use the synthetic images to train a Fully-Convolutional Regression Network (FCRN) which efficiently performs text detection and bounding-box regression at all locations and multiple scales in an image. We discuss the relation of FCRN to the recently-introduced YOLO detector, as well as other end-to-end object detection systems based on deep learning. The resulting detection network significantly outperforms current methods for text detection in natural images, achieving an F-measure of 84.2% on the standard ICDAR 2013 benchmark. Furthermore, it can process 15 images per second on a GPU.
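The synthetic-data idea can be illustrated with a few lines of PIL: render a word onto an existing background image and record its bounding box as the label. This flat overlay ignores the local 3D geometry and blending described above, and the path and word are placeholders.

    from PIL import Image, ImageDraw, ImageFont

    def overlay_word(background_path, word, xy=(40, 60)):
        """Paste a word onto a background image and return the image plus its box label."""
        img = Image.open(background_path).convert('RGB')
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()                  # a real .ttf font would be used in practice
        box = draw.textbbox(xy, word, font=font)         # (x1, y1, x2, y2) ground-truth box
        draw.text(xy, word, fill=(255, 255, 0), font=font)
        return img, box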
Conference Paper
In this paper, higher-order correlation clustering (HOCC) is used for text line detection in natural images. We treat text line detection as a graph partitioning problem, where each vertex is represented by a Maximally Stable Extremal Region (MSER). First, weak hypotheses are proposed by coarsely grouping MSERs based on their spatial alignment and appearance consistency. Then, higher-order correlation clustering (HOCC) is used to partition the MSERs into text line candidates, using the hypotheses as soft constraints to enforce long range interactions. We further propose a regularization method to solve the Semidefinite Programming problem in the inference. Finally we use a simple texton-based texture classifier to filter out the non-text areas. This framework allows us to naturally handle multiple orientations, languages and fonts. Experiments show that our approach achieves competitive performance compared to the state of the art.
Article
Text detection in natural scene images is an important prerequisite for many content-based image analysis tasks, while most current research efforts only focus on horizontal or near horizontal scene text. In this paper, first we present a unified distance metric learning framework for adaptive hierarchical clustering, which can simultaneously learn similarity weights (to adaptively combine different feature similarities) and the clustering threshold (to automatically determine the number of clusters). Then, we propose an effective multi-orientation scene text detection system, which constructs text candidates by grouping characters based on this adaptive clustering. Our text candidates construction method consists of several sequential coarse-to-fine grouping steps: morphology-based grouping via single-link clustering, orientation-based grouping via divisive hierarchical clustering, and projection-based grouping also via divisive clustering. The effectiveness of our proposed system is evaluated on several public scene text databases, e.g., ICDAR Robust Reading Competition data sets (2011 and 2013), MSRA-TD500 and NEOCR. Specifically, on the multi-orientation text data set MSRA-TD500, the f measure of our system is 71 percent, much better than the state-of-the-art performance. We also construct and release a practical challenging multi-orientation scene text data set (USTB-SV1K), which is available at http://prir.ustb.edu.cn/TexStar/MOMV-text-detection/.
Article
We present YOLO, a unified pipeline for object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is also extremely fast; YOLO processes images in real-time at 45 frames per second, hundreds to thousands of times faster than existing detection systems. Our system uses global image context to detect and localize objects, making it less prone to background errors than top detection systems like R-CNN. By itself, YOLO detects objects at unprecedented speeds with moderate accuracy. When combined with state-of-the-art detectors, YOLO boosts performance by 2-3 mAP points.
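A minimal sketch of decoding a YOLO-style S x S x (B*5 + C) output grid into image-space boxes; the tensor layout and thresholds are illustrative rather than the paper's exact formulation.

    import numpy as np

    def decode_grid(pred, img_size=448, S=7, B=2, conf_thresh=0.2):
        """pred: (S, S, B*5 + C); each cell predicts B boxes as (x, y, w, h, conf) plus class scores."""
        boxes, cell = [], img_size / S
        for i in range(S):
            for j in range(S):
                for b in range(B):
                    x, y, w, h, conf = pred[i, j, b * 5: b * 5 + 5]
                    if conf < conf_thresh:
                        continue
                    cx, cy = (j + x) * cell, (i + y) * cell    # x, y are offsets within the cell
                    bw, bh = w * img_size, h * img_size        # w, h are relative to the image
                    boxes.append([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2, conf])
        return boxes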
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
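The mechanism that lets all proposals share a single backbone pass, RoI pooling, is available in torchvision; a minimal sketch with a made-up feature map and proposals, where spatial_scale assumes an 800-pixel input image mapped to a 50-cell feature map.

    import torch
    from torchvision.ops import roi_pool

    feat = torch.randn(1, 256, 50, 50)                   # backbone features for one image
    # Proposals as (batch_index, x1, y1, x2, y2) in input-image coordinates.
    rois = torch.tensor([[0, 10., 10., 200., 120.],
                         [0, 30., 40., 400., 300.]])
    pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=50 / 800)
    print(pooled.shape)                                  # torch.Size([2, 256, 7, 7])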
Article
An unconstrained end-to-end text localization and recognition method is presented. The method detects initial text hypotheses in a single pass by an efficient region-based method and subsequently refines the text hypotheses using a more robust local text model, which deviates from the common assumption of region-based methods that all characters are detected as connected components. Additionally, a novel feature based on character stroke area estimation is introduced. The feature is efficiently computed from a region distance map, it is invariant to scaling and rotations, and it allows text regions to be detected efficiently regardless of what portion of text they capture. The method runs in real time and achieves state-of-the-art text localization and recognition results on the ICDAR 2013 Robust Reading dataset.
Conference Paper
We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
Article
In this work we present an end-to-end system for text spotting -- localising and recognising text in natural scene images -- and text based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character classifier based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.
Article
Full end-to-end text recognition in natural images is a challenging problem that has received much attention recently. Traditional systems in this area have relied on elaborate models incorporating carefully hand-engineered features or large amounts of prior knowledge. In this paper, we take a different route and combine the representational power of large, multilayer neural networks together with recent developments in unsupervised feature learning, which allows us to use a common framework to train highly-accurate text detector and character recognizer modules. Then, using only simple off-the-shelf methods, we integrate these two modules into a full end-to-end, lexicon-driven, scene text recognition system that achieves state-of-the-art performance on standard benchmarks, namely Street View Text and ICDAR 2003.
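A minimal sketch of the sliding-window detection stage such a system relies on: score fixed-size patches with a text/character classifier, keep those above a threshold, and suppress overlapping windows with standard non-maximum suppression. classify is a stand-in for the learned detector module, and the window size, stride and thresholds are illustrative.

    import numpy as np

    def sliding_window_detect(img, classify, win=32, stride=8, thresh=0.5):
        """Run a patch classifier over the image, then apply non-maximum suppression."""
        boxes = []
        for y in range(0, img.shape[0] - win + 1, stride):
            for x in range(0, img.shape[1] - win + 1, stride):
                score = classify(img[y:y + win, x:x + win])   # text probability of the patch
                if score > thresh:
                    boxes.append([x, y, x + win, y + win, score])
        return nms(np.array(boxes)) if boxes else []

    def nms(boxes, iou_thresh=0.3):
        """Greedy NMS over (x1, y1, x2, y2, score) rows."""
        area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        order, keep = boxes[:, 4].argsort()[::-1], []
        while order.size:
            i, rest = order[0], order[1:]
            keep.append(boxes[i])
            xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
            yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
            xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
            yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
            inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
            iou = inter / (area[i] + area[rest] - inter)
            order = rest[iou < iou_thresh]
        return keep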
Conference Paper
With the increasing popularity of practical vision systems and smart phones, text detection in natural scenes becomes a critical yet challenging task. Most existing methods have focused on detecting horizontal or near-horizontal texts. In this paper, we propose a system which detects texts of arbitrary orientations in natural images. Our algorithm is equipped with a two-level classification scheme and two sets of features specially designed for capturing the intrinsic characteristics of texts. To better evaluate our algorithm and compare it with other competing algorithms, we generate a new dataset, which includes various texts in diverse real-world scenarios; we also propose a protocol for performance evaluation. Experiments on benchmark datasets and the proposed dataset demonstrate that our algorithm compares favorably with the state-of-the-art algorithms when handling horizontal texts and achieves significantly enhanced performance on texts of arbitrary orientations in complex natural scenes.
Article
Text detection in natural scene images is an important prerequisite for many content-based image analysis tasks. In this paper, we propose an accurate and robust method for detecting texts in natural scene images. A fast and effective pruning algorithm is designed to extract Maximally Stable Extremal Regions (MSERs) as character candidates using the strategy of minimizing regularized variations. Character candidates are grouped into text candidates by the single-link clustering algorithm, where distance weights and clustering threshold are learned automatically by a novel self-training distance metric learning algorithm. The posterior probabilities of text candidates corresponding to non-text are estimated with a character classifier; text candidates with high non-text probabilities are eliminated and texts are identified with a text classifier. The proposed system is evaluated on the ICDAR 2011 Robust Reading Competition database; the f measure is over 76%, much better than the state-of-the-art performance of 71%. Experiments on multilingual, street view, multi-orientation and even born-digital databases also demonstrate the effectiveness of the proposed method. Finally, an online demo of our proposed scene text detection system has been set up at http://kems.ustb.edu.cn/learning/yin/dtext.
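The single-link grouping step can be illustrated with SciPy's hierarchical clustering: cluster character-candidate centers with 'single' linkage and cut the tree at a distance threshold. Here the coordinates and the threshold are made up and fixed, whereas the method above learns both the distance weights and the clustering threshold.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Each row is the (center_x, center_y) of a character candidate; values are illustrative.
    centers = np.array([[10, 20], [28, 21], [46, 22], [200, 180], [220, 182]], dtype=float)

    Z = linkage(centers, method='single')                # single-link agglomerative clustering
    groups = fcluster(Z, t=40, criterion='distance')     # cut the dendrogram at distance 40
    print(groups)                                        # e.g. [1 1 1 2 2]: two text candidates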
Conference Paper
We present a method for spotting words in the wild, i.e., in real images taken in unconstrained environments. Text found in the wild has a surprising range of difficulty. At one end of the spectrum, Optical Character Recognition (OCR) applied to scanned pages of well formatted printed text is one of the most successful applications of computer vision to date. At the other extreme lie visual CAPTCHAs - text that is constructed explicitly to fool computer vision algorithms. Both tasks involve recognizing text, yet one is nearly solved while the other remains extremely challenging. In this work, we argue that the appearance of words in the wild spans this range of difficulties and propose a new word recognition approach based on state-of-the-art methods from generic object recognition, in which we consider object categories to be the words themselves. We compare performance of leading OCR engines - one open source and one proprietary - with our new approach on the ICDAR Robust Reading data set and a new word spotting data set we introduce in this paper: the Street View Text data set. We show improvements of up to 16% on the data sets, demonstrating the feasibility of a new approach to a seemingly old problem.
Fastext: Efficient unconstrained scene text detector
  • M Busta
  • L Neumann
  • J Matas