Figure 5. Binarized and dilated regions

Source publication
Conference Paper
Full-text available
Text detection is one of the most challenging and most commonly addressed problems in computer vision. Detecting text regions is the first step of text recognition systems known as Optical Character Recognition. This process requires separating text regions from non-text regions. In this paper, we utilize Maximally Stable Extremal Regions to acqu...

Contexts in source publication

Context 1
... a dilation operation is used to connect characters. In the end, the output shown in Figure 5 is obtained. ...
Context 2
... as can be observed in Figure 5, there may still be unwanted regions left if the image contains foliage or small-shaped objects. However, the characteristics of these regions are irregular and random. ...
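For illustration, here is a minimal OpenCV sketch of the binarize-then-dilate step these contexts describe; the synthetic input, threshold choice, and kernel size are assumptions, not the paper's exact settings:

```python
import cv2
import numpy as np

# Synthetic grayscale stand-in for an image of candidate character regions.
gray = np.full((60, 240), 230, dtype=np.uint8)
cv2.putText(gray, "TEXT", (20, 42), cv2.FONT_HERSHEY_SIMPLEX, 1.2, 10, 2)

# Binarize with Otsu's method so dark character strokes become foreground.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate with a wide rectangular kernel so neighboring characters merge
# into a single connected region (the kernel size is an assumed tuning).
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3))
dilated = cv2.dilate(binary, kernel, iterations=1)
```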

Citations

... When deep learning-based text detection models became available, Ravagli et al. [27] applied several scene text recognition methods, including Efficient and Accurate Scene Text Detector (EAST) [37], Stroke Width Transform (SWT) [38] and the Connectionist Text Proposal Network (CTPN) [39] to the CVC-FP data set in order to extract room information with Tesseract OCR. Similarly, as one aspect of their floor plan analysis method, Dodge et al. [11] apply the Google Vision API to detect and recognize text elements. ...
Article
An important aspect of automatic floor plan analysis is the extraction of textual information, as it is essential for a thorough understanding of the drawing. This paper presents a text extraction approach utilizing a deep learning-based object detection model and state-of-the-art Optical Character Recognition (OCR) methods. The paper contributes to the research community in three ways: First, it introduces additional annotations to existing data sets to encompass text elements. Second, it proposes a specialized data synthesis pipeline, allowing for generating training images that mimic important characteristics of real data. Finally, it documents a comparative study of deep learning-based object detection architectures (Tesseract, EAST, CRAFT, Faster R-CNN, YOLOv5, YOLOR, YOLOv7, and YOLOv8) and OCR tools (PARSEq, MATRN, EasyOCR, and Tesseract) for the task. Results indicate that YOLOv7 yields the best text detection performance (up to 97.5% wmAP) and PARSEq excels in character recognition (85.2% CER). The data sets are made available.
... A new approach for detecting text in both computer-generated and natural images was developed in article [12]. For detecting and recognizing text, they utilized the properties of Maximally Stable Extremal Regions. ...
Article
The text extraction process can play a vital role in detecting valuable information in a selected image. This process involves text detection, localization, marking, tracking, extraction, enhancement, and finally recognition. Detecting text characters is a difficult task because of their variation in size, style, font, orientation, alignment, contrast, color, and textured background. There is a growing demand for information detection, indexing, and retrieval from various multimedia documents nowadays. Several methods have been developed for extracting text from an image. This article proposes a novel method for image-to-text extraction. In this paper, we present a multiresolution morphology-based text segmentation process suitable for various types of non-text elements such as drawings, pictures, and halftones. For image processing, the Python library OpenCV is used, and for text extraction Tesseract is used. The Python Imaging Library (PIL) can handle the opening and manipulation of images in many formats in Python. We are also testing an application that can produce output correctly in every language.
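A minimal sketch of the OpenCV/PIL/Tesseract toolchain the abstract names, assuming the pytesseract wrapper and a local Tesseract installation (the abstract does not specify the exact binding):

```python
import cv2
import numpy as np
import pytesseract  # assumed wrapper; requires a local Tesseract install
from PIL import Image

# Synthetic stand-in for a scanned document region.
page = np.full((60, 300), 255, dtype=np.uint8)
cv2.putText(page, "Hello OCR", (10, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.0, 0, 2)

# Binarize with OpenCV so drawings and halftones interfere less with strokes.
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Hand the cleaned image to Tesseract through PIL for recognition.
text = pytesseract.image_to_string(Image.fromarray(binary))
print(text)
```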
... Statistical features from candidate regions are used in [17] to train a Random Forest classifier that identifies the text proposals. The MSER-SWT combination is utilized in [48] to generate candidate regions that are further geometrically filtered and passed through an Optical Character Recognition (OCR) [36] system to determine the text regions. Standard features like Histogram of Oriented Gradients (HOG), along with stroke width and entropy-based features, are employed in [2] to classify the MSER proposals as text/non-text. ...
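A hedged sketch of the HOG-plus-Random-Forest text/non-text classification idea referenced above; the patches and labels are random stand-ins, and the feature parameters are assumptions, not those of the cited works:

```python
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

def hog_features(patches):
    # One HOG descriptor per fixed-size grayscale patch.
    return np.array([
        hog(p, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for p in patches
    ])

# Random stand-ins for MSER proposals resized to 32x32 and their labels
# (1 = text, 0 = non-text); real features and labels would come from data.
rng = np.random.default_rng(0)
train_patches = rng.random((40, 32, 32))
labels = rng.integers(0, 2, size=40)
test_patches = rng.random((5, 32, 32))

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(hog_features(train_patches), labels)
is_text = clf.predict(hog_features(test_patches))
```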
Article
Full-text available
Detection and language identification of multi-lingual texts in natural scene images (NSI) and born-digital images (BDI) are popular research problems in the domain of information retrieval. Several methods addressing these problems have been evaluated over the years, mostly upon NSI-based standard datasets. However, datasets highlighting bi/tri-lingual Indic texts in a single image are quite few. Also, datasets housing BDIs with multi-lingual texts are hardly available. To this end, a new dataset called Mixed-lingual Indic Texts in Digital Images (MITDI), having 500 NSIs and 500 BDIs, is introduced, where each image contains texts written in at least two of English, Bangla, and Hindi, languages quite commonly used in India. Overall, the NSI pool contains 360 images with bi-lingual texts and 140 with tri-lingual texts, whereas the BDI pool contains 489 images with bi-lingual texts and 11 with tri-lingual texts. To benchmark performance on MITDI, a deep learning-based Connectionist-DenseNet framework is built and evaluated for each data pool (NSI, BDI, and the combined set). The proposed dataset can serve as an important resource for evaluating state-of-the-art methods in this domain. The dataset is publicly available at: https://github.com/NCJUCSE/MITDI
... Moreover, these digital images play a necessary role in various fields because of the availability of many tools such as digital cameras and software [25]. Computer Generated Images (CGI) [23] are images developed by computer software for various purposes, such as developing designs, creating animation for movies, and creating realistic-looking images in computer games [12]. Moreover, CGI is one of the computer graphics applications used to generate images in films, TV programs, videogames, commercials, videos, and shorts [23]. ...
... Moreover, CGI is one of the computer graphics applications used to generate images in films, TV programs, videogames, commercials, videos, and shorts [23]. Generally, CGI is developed by 3D artists using different compositing components such as modelling, animating, and painting [12,20]. Image splicing is the process of creating a composite image by cutting a selected object from one image and adding it to another [10]. ...
... This metric is calculated as the ratio of true positive detections to the sum of accurate and inaccurate detections, as given in Eq. (12). It is defined as the probability of detecting the images correctly. ...
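Read literally, the metric described matches the standard precision ratio; a hedged reconstruction of what Eq. (12) likely denotes, with TP the accurate detections and FP the inaccurate ones:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}
```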
Article
Full-text available
Digital images are very important in various fields because of the availability of software and applications that can create images which look real. Moreover, distinguishing Computer-Generated Images (CGI) from Natural Images (NI) is difficult because they cannot be differentiated by the human eye. So, this research developed a novel Knowledge-based Fuzzy Approximation (KBFA) model for differentiating CGI, NI, and Spliced Images (SI). Subsequently, a Hybrid Grey Wolf Ant Lion (H-GWAL) optimization approach is developed for localizing the tampered region in a spliced image. Moreover, the proposed H-GWAL algorithm has been utilized to enhance the classification accuracy of the proposed method. Hence, this method distinguishes CGI from NI and SI from original images, and the classified images are detected by the proposed KBFA with the H-GWAL model. Additionally, this method is simulated using Python, and the obtained results demonstrate the performance of the proposed method. Moreover, the obtained results in terms of detection accuracy, precision, recall, and F1-measure are compared with recent existing approaches.
... M. Jiang et al. [11] introduced frequency-tuned (FT) visual saliency to remove non-text candidate regions after adopting the SWT algorithm. Azmi Can et al. [12] used the Tesseract optical character recognition engine to eliminate non-text groups after extracting text candidate frames with traditional methods. In this paper, the method we propose is based on the stroke width transform and provides a pixel-level gradient ray map to filter out the most irrelevant backgrounds. ...
Article
Full-text available
Compared with deep learning methods, traditional image processing methods have lower computational costs and have advantages for typical problems such as sign text. This paper proposes a sign text localization method based on a pixel-level gradient ray map and edge contour mapping. First, after image pre-processing, the pixel-level gradient ray map is obtained using the Stroke Width Transform. Then, the Maximally Stable Extremal Region method is used to locate irregular text candidate areas in the ray map. At the same time, a mask layer is added to filter the irregular candidate areas, after which the candidate areas are regularized and Non-Maximum Suppression is performed. Finally, background non-text candidate areas are processed by extracting the mapping of the closed edge contours of the image, and the coordinates of the remaining text-area candidate frames are mapped back to the original picture. The experimental results show that the method is simple and clear, its effect is remarkable, and it is robust to signs of different sizes.
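A minimal OpenCV sketch of the MSER candidate-location step in such a pipeline; the synthetic input and the omission of the masking, NMS, and edge-contour stages are simplifications, not the paper's implementation:

```python
import cv2
import numpy as np

# Synthetic grayscale stand-in for a pre-processed sign image.
gray = np.full((80, 200), 220, dtype=np.uint8)
cv2.putText(gray, "EXIT", (20, 55), cv2.FONT_HERSHEY_SIMPLEX, 1.5, 20, 3)

# Detect Maximally Stable Extremal Regions as irregular text candidates.
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)

# Regularize each candidate into an axis-aligned box; the masking, NMS,
# and edge-contour filtering stages of the paper are omitted here.
for (x, y, w, h) in bboxes:
    cv2.rectangle(gray, (x, y), (x + w, y + h), 0, 1)
```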
... The method proposed in [14] first converts an input image into a grayscale image and then applies edge-enhanced MSER to extract a number of homogeneous blobs that may or may not be text components. Then, certain geometric properties of these components, like area, perimeter, compactness, and occupancy, are calculated so as to eliminate the spurious components on the basis of thresholds set for each geometric property. ...
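A hedged sketch of the geometric filtering step described in [14], using scikit-image region properties; the threshold values are illustrative assumptions, not the paper's:

```python
import numpy as np
from skimage import measure

def keep_component(region, min_area=30, max_compactness=40.0, min_occupancy=0.2):
    # Filter a connected component by the geometric cues named in [14]:
    # area, perimeter, compactness, and bounding-box occupancy.
    area = region.area
    perimeter = region.perimeter
    if area < min_area or perimeter == 0:
        return False
    compactness = perimeter ** 2 / area        # low for blob-like shapes
    bbox_h = region.bbox[2] - region.bbox[0]
    bbox_w = region.bbox[3] - region.bbox[1]
    occupancy = area / (bbox_h * bbox_w)       # fill ratio of the bounding box
    return compactness <= max_compactness and occupancy >= min_occupancy

# `binary` is a toy stand-in for a binarized candidate map.
binary = np.zeros((64, 64), dtype=np.uint8)
binary[20:40, 10:50] = 1
labels = measure.label(binary)
kept = [r for r in measure.regionprops(labels) if keep_component(r)]
```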
Chapter
Recent years have witnessed an exponential surge of interest in exploring the domain of scene text detection and analysis in natural scene images. However, owing to complexities arising from various factors, existing techniques may fail at times when attempting to detect text components. This paper presents a system wherein an image is taken as input and its color components are extracted first. Next, the intensity values from each color channel are grouped together using the K-means++ clustering algorithm. A memetic algorithm is then applied to obtain an optimal set of candidate components from the color maps while eliminating the background. Spurious components are removed on the basis of their dimensions and entropy measure. The system is experimentally evaluated on two standard datasets, namely MLe2e and KAIST, and on an in-house dataset of 400 images, all having multi-lingual texts. The results obtained are comparable with some state-of-the-art methods.
... The gray levels of this image are clustered into bright and dark levels, and components of both clusters undergo geometric and morphological filtering to eliminate non-text components while retaining text components. This method is experimentally evaluated on standard multi-lingual datasets like KAIST [8,9] and MLe2e [10,11], and on an in-house dataset. ...
... However, spurious components resembling textual properties may limit the performance of SWT-based approaches. Some recent methodologies [1,8,9] combine MSER and SWT to minimize spurious regions while retaining as much text as possible. ...
Chapter
Detecting text in natural scene images is currently becoming a popular trend in the field of information retrieval. Researchers find it interesting due to the challenges faced while processing an image. In this paper, a relatively simple but effective approach is proposed in which bright texts on a dark background and dark texts on a bright background are detected in natural scene images. This approach is based on the fact that there is usually stark contrast between the background and the foreground. Hence, the K-means clustering algorithm is applied to the gray levels of the image, generating bright and dark gray-level clusters. Each of these clusters is then analyzed to extract the text components. This method proves to be robust compared to existing methods, giving reasonably satisfactory results when evaluated on multi-lingual standard datasets like KAIST and MLe2e, and on an in-house dataset of images having multi-lingual texts written in English, Bangla, and Hindi.
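A minimal sketch of the two-cluster gray-level split the abstract describes, using scikit-learn's K-means (k-means++ initialization is sklearn's default); the synthetic image is a stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a grayscale scene: dark text band on a bright board.
gray = np.full((60, 120), 200, dtype=np.uint8)
gray[20:40, 10:110] = 30

# Cluster pixel intensities into bright and dark groups (k = 2), exploiting
# the stark foreground/background contrast assumption.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(gray.reshape(-1, 1).astype(float)).reshape(gray.shape)

# The cluster with the higher center is the bright one; each mask is then
# analyzed separately for text components.
bright = int(np.argmax(km.cluster_centers_.ravel()))
bright_mask = labels == bright
dark_mask = ~bright_mask
```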
... With the advances in information capture and pattern recognition, especially text recognition [5], new technologies have permitted the development of new services such as document analysis, access to devices via patterns, time savings over manual writing, and minimization of errors in search engines [6]. With optical character recognition (OCR), a technique widely used with or without an Internet connection, the text in an image is recognized when the image is digitized or, if it fails to be recognized during digitization, while the user is writing in real time [7,8]. An external library is required to implement OCR. ...
Chapter
This paper presents the development of the smart-device mobile application “Book's Recognition”. The app recognizes the text of library book titles at the library of the Universidad Politécnica Salesiana in the city of Guayaquil, Ecuador. Through a service hosted on Amazon Web Services (AWS), Mobile Vision's algorithms for text recognition, and Google's API on the Android platform, the app “Book's Recognition” allows its user to recognize the text of the title of a physical book in an innovative and effective way, showing the user basic information about the book in real time. The application can be offered as a service of the library. The purpose of this development is to awaken university students' interest in new and creative forms of intelligent investigation with resources from the university's main library, and furthermore to facilitate the investigative process by providing information on non-digitalized and digitalized books that hold valuable and relevant information for all generations. The mobile app can be downloaded from the following website: https://github.com/seimus96/mobile_vision.
... In the work [14], a new hybrid Naïve Bayes nearest neighbor model is built and trained with convolutional features of candidate regions to exploit the discriminative power of small stroke-parts in order to identify the language of the detected scene text. A stepwise process using MSER, with elimination of spurious components by geometric property analysis and SWT, is implemented in the work [32]. Later, these connected components are combined by applying a simple dilation process so that characters are merged into words. ...
Article
Full-text available
Recent years have witnessed significant development in the field of text detection in natural scene images. However, issues like poor image quality and complex backgrounds reduce the efficiency of such methods, thereby requiring a good pre-processing module for image enhancement. Also, conventional texture-based features have some limitations for classifying text and non-text components due to potential similarities between them. To this end, a new model is proposed in which the image quality is first enhanced by removing noise and blur. Then, a histogram-based adaptive K-means clustering of intensity values is performed in order to extract the text candidates. These candidates are then analyzed using the Daisy descriptor for text/non-text determination and language identification of the text. The proposed model is applied to an in-house multilingual dataset of images with texts in Indian languages, and to standard datasets including ICDAR 2017, MLe2e, and KAIST. The results indicate significant improvement in performance compared to some contemporary methods.
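A hedged sketch of computing dense Daisy descriptors over a text-candidate patch with scikit-image; the parameter values are assumptions, not the paper's settings:

```python
import numpy as np
from skimage.feature import daisy

# Synthetic grayscale candidate patch (a real pipeline would crop this
# from the clustered text-candidate map).
patch = np.random.default_rng(0).random((64, 64))

# Dense Daisy descriptors sampled on a regular grid over the patch.
descs = daisy(patch, step=8, radius=15, rings=2, histograms=6, orientations=8)
print(descs.shape)  # (grid_rows, grid_cols, (rings * histograms + 1) * orientations)
```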
... EAST is a segmentation-based method without an FPN; text instances with a small area have fewer supporting boxes (i.e., r_i) than those with a large area, which leads to an extremely unbalanced number of supporting boxes (i.e., N_s). On the other hand, SWT is more reasonable for eliminating non-text regions, and similar previous works [26,42] have shown its effectiveness. SWT is a local image operator that computes the most likely stroke width for each pixel. ...
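For intuition, here is a simplified per-pixel stroke-width estimate via a distance transform over a binarized stroke mask; this is a rough stand-in, not the full gradient ray-casting SWT of [38]:

```python
import cv2
import numpy as np

# Synthetic stroke mask: a 6-pixel-wide horizontal bar as a stand-in for
# a binarized character stroke.
strokes = np.zeros((40, 80), dtype=np.uint8)
strokes[17:23, 10:70] = 255

# Distance to the nearest background pixel approximates half the local
# stroke width, so doubling it yields a per-pixel width estimate.
dist = cv2.distanceTransform(strokes, cv2.DIST_L2, 5)
stroke_width = 2.0 * dist

# Genuine text strokes show low width variance inside a component,
# which is the statistic SWT-based filters typically threshold on.
widths = stroke_width[strokes > 0]
print(widths.mean(), widths.std())
```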
Preprint
Full-text available
Deep learning-based scene text detection can achieve preferable performance when powered with sufficient labeled training data. However, manual labeling is time-consuming and laborious; at the extreme, the corresponding annotated data are unavailable. Exploiting synthetic data is a very promising solution, except for domain distribution mismatches between synthetic datasets and real datasets. To address the severe domain distribution mismatch, we propose a synthetic-to-real domain adaptation method for scene text detection, which transfers knowledge from synthetic data (source domain) to real data (target domain). In this paper, a text self-training (TST) method and adversarial text instance alignment (ATA) for domain adaptive scene text detection are introduced. ATA helps the network learn domain-invariant features by training a domain classifier in an adversarial manner. TST diminishes the adverse effects of false positives (FPs) and false negatives (FNs) from inaccurate pseudo-labels. The two components have positive effects on improving the performance of scene text detectors when adapting from synthetic to real scenes. We evaluate the proposed method by transferring from SynthText and VISD to ICDAR2015 and ICDAR2013. The results demonstrate the effectiveness of the proposed method, with up to 10% improvement, which has important exploratory significance for domain adaptive scene text detection. Code is available at https://github.com/weijiawu/SyntoReal_STD
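A hedged PyTorch sketch of the adversarial domain-classifier idea behind ATA, using a DANN-style gradient reversal layer; the feature dimensions and classifier head are toy assumptions, not the paper's architecture:

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negates (and scales) gradients on the
    # way back, so the feature extractor learns domain-invariant features.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Toy domain classifier head over pooled detector features.
domain_head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

feats = torch.randn(8, 256, requires_grad=True)   # stand-in features
logits = domain_head(GradReverse.apply(feats, 1.0))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
loss.backward()  # gradients flowing into `feats` are reversed
```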