Conference Paper

Image Text Detection and Documentation Using OCR

Article
In this paper, an automatic invoice processing system, which is in great demand among private and public companies, was proposed. The proposed system supports all invoice file types that companies may submit. Companies can easily submit invoices to the system via the web interface or email, and all invoices submitted to the system are queued and processed sequentially. If the invoice is a text file, the invoice information is extracted from the text by template matching. If the invoice is an image, the text and table areas are detected and extracted. For table detection, we used both an image-processing-based method and a YOLOv5-based deep learning method. Cell extraction was then performed on the extracted table images. As a result of these processes, all text and table cells were obtained as images, and these images were converted into machine-readable text using the open-source software Tesseract OCR. Tesseract already provides trained models for English and Turkish. However, these models do not give good results on the Turkish invoices submitted by companies. Therefore, a new model fine-tuned on Turkish invoices was used for OCR. The experimental results showed that the fine-tuned Turkish model was more accurate than the Turkish and English models provided by Tesseract. In addition, the YOLOv5-based table detection model was more accurate than the image-processing-based table detection method.
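A minimal sketch of the OCR step described above, assuming the table cells have already been extracted as separate images and that pytesseract is installed; the custom Turkish traineddata name used here (tur_invoice) is a placeholder, not the paper's actual fine-tuned model.

```python
# Minimal sketch: convert extracted cell/text-region images to machine-readable
# text with Tesseract. "tur_invoice" is a hypothetical fine-tuned model name.
import pytesseract
from PIL import Image

def ocr_cell(image_path, lang="tur_invoice"):
    """Run Tesseract on a single extracted cell or text-region image."""
    img = Image.open(image_path).convert("L")   # grayscale often helps OCR
    config = "--oem 1 --psm 7"                  # LSTM engine, single text line
    return pytesseract.image_to_string(img, lang=lang, config=config).strip()

# Example: OCR every cell produced by the table-extraction stage
# cells = ["cell_0_0.png", "cell_0_1.png"]
# rows = [ocr_cell(p) for p in cells]
```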
Article
Key information extraction from unstructured documents is a practical problem in many industries. Machine learning models aimed at solving this problem should efficiently utilize the textual, visual, and 2D spatial layout information of the document. Grid-based approaches achieve this by representing the document as a 2D grid and feeding it to a fully convolutional encoder-decoder network that solves a semantic instance segmentation problem. We propose a new method for the instance detection branch of that network for the task of automatic information extraction from invoices. Our approach reduces this problem to 1D region detection. The proposed network has fewer parameters and shorter inference times. Additionally, we suggest a new metric for evaluating the results.
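The reduction to 1D region detection can be illustrated with a small sketch (not the paper's code): assume the network emits a per-position score for a field class along one axis of the grid; thresholding and grouping consecutive positions then yields the detected regions.

```python
# Illustrative sketch: turn a 1D per-position score vector produced by a fully
# convolutional head into detected regions by thresholding and grouping.
import numpy as np

def detect_1d_regions(scores, threshold=0.5, min_length=2):
    """Return (start, end) index pairs of contiguous runs where scores >= threshold."""
    mask = np.asarray(scores) >= threshold
    regions, start = [], None
    for i, on in enumerate(mask):
        if on and start is None:
            start = i
        elif not on and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    if start is not None and len(mask) - start >= min_length:
        regions.append((start, len(mask)))
    return regions

# detect_1d_regions([0.1, 0.8, 0.9, 0.2, 0.7, 0.95, 0.9]) -> [(1, 3), (4, 7)]
```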
Article
The daily transactions of an organization generate a vast amount of unstructured data such as invoices and purchase orders. Managing and analyzing unstructured data is a costly affair for the organization, yet this data holds a wealth of hidden, valuable information. Extracting such insights automatically from unstructured documents can significantly increase the productivity of an organization, so there is a huge demand for a tool that automates the extraction of key fields from unstructured documents. Researchers have used different approaches for extracting key fields, but the lack of annotated, high-quality datasets is the biggest challenge. Existing work in this area has used standard and custom datasets for extracting key fields from unstructured documents, but these datasets face serious challenges such as poor-quality images, narrowly domain-specific content, and a lack of data validation approaches for evaluating data quality. This work describes the detailed process flow for end-to-end key-field extraction from unstructured documents and presents a high-quality, multi-layout unstructured invoice document dataset assessed with a statistical data validation technique. The proposed dataset is highly diverse in invoice layouts so that the key-field extraction task generalizes across unstructured documents. It is evaluated with feature extraction techniques such as GloVe, Word2Vec, and FastText, and with AI approaches such as BiLSTM and BiLSTM-CRF, and we present a comparative analysis of these techniques on the proposed dataset. We attained the best results with the BiLSTM-CRF model.
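As a rough illustration of the strongest model family evaluated above, the following PyTorch sketch shows a BiLSTM sequence tagger over invoice tokens; the CRF decoding layer of BiLSTM-CRF is omitted for brevity and all dimensions are illustrative.

```python
# Illustrative PyTorch sketch of a BiLSTM sequence tagger for key-field
# extraction (the CRF layer of BiLSTM-CRF is omitted for brevity).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)     # per-token tag scores

    def forward(self, token_ids):                           # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))             # (batch, seq_len, 2*hidden)
        return self.out(h)                                   # (batch, seq_len, num_tags)

# model = BiLSTMTagger(vocab_size=20000, num_tags=9)        # e.g. BIO tags for 4 fields + O
# logits = model(torch.randint(1, 20000, (2, 50)))           # -> shape (2, 50, 9)
```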
Article
Intelligent learning systems (ILS) have become a popular learning tool for students. An ILS can collect the questions students answer incorrectly in exercises and identify their unmastered knowledge points so that it can recommend personalized exercises. Detecting text accurately in images of students' exercises is therefore essential in an ILS. However, a big challenge is that traditional text detection algorithms cannot detect complete text lines in an exercise scene, and their detection boxes tend to split between Chinese characters and mathematical symbols. In this article, we propose a deep-learning-based approach for text detection that improves You Only Look Once version 3 (YOLOv3) by changing the regression target from a single character to a fixed-width text segment, and applies a stitching strategy that constructs text lines from a relation matrix, improving accuracy by 9.8%. Experimental results on both the RCTW Chinese text detection dataset and a real exercise scenario show that our model improves detection effectiveness. In addition, we compare our method with two state-of-the-art approaches for exercise text detection and discuss its capabilities and limitations. We have also provided a platform that implements the proposal for detecting text lines in students' daily homework or examination papers, which improves the user experience.
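The stitching step is only described at a high level above; the sketch below is a simplified stand-in that groups fixed-width detection boxes into text lines by vertical overlap, rather than the paper's learned relation matrix.

```python
# Illustrative sketch of stitching fixed-width detection boxes into text lines:
# boxes whose vertical extents overlap sufficiently are merged left to right.
def vertical_overlap(a, b):
    """Overlap ratio of two boxes (x1, y1, x2, y2) along the y axis."""
    inter = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    return inter / max(1e-6, min(a[3] - a[1], b[3] - b[1]))

def stitch_lines(boxes, min_overlap=0.5):
    """Group boxes into text lines and return one merged box per line."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for line in lines:
            if vertical_overlap(line[-1], box) >= min_overlap:
                line.append(box)
                break
        else:
            lines.append([box])
    return [(min(b[0] for b in l), min(b[1] for b in l),
             max(b[2] for b in l), max(b[3] for b in l)) for l in lines]

# stitch_lines([(0, 0, 10, 10), (12, 1, 22, 11), (0, 30, 10, 40)])
# -> [(0, 0, 22, 11), (0, 30, 10, 40)]
```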
Article
We present CameraKeyboard, a text entry technique that uses smartphone cameras to extract and digitise text from physical sources such as business cards and identification documents. After the user takes a picture of the text of interest, the smartphone recognises the text through OCR technologies (specifically, the Google Cloud Vision API) and organises it in the same way it is displayed in the original source. Users can then select the required text, which is split into lines and further into words, by tapping on it. CameraKeyboard is designed to easily complement the standard keyboard when users need to digitise text: the camera is integrated directly into the standard mobile QWERTY keyboard and can be accessed by tapping a button. CameraKeyboard is general-purpose like any other smartphone keyboard: it works with any application (e.g., Gmail, WhatsApp) and can extract text from any kind of source. We evaluated CameraKeyboard through a user study with 18 participants who carried out four different text entry tasks, comparing it to a standard smartphone keyboard (Google Keyboard) and a standard (physical) desktop keyboard. Results show that CameraKeyboard is the fastest of the three in most cases.
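A minimal sketch of the OCR call behind this kind of camera-based text capture, using the Google Cloud Vision API mentioned above; authentication setup is assumed, and the line/word splitting is a simplification of how CameraKeyboard presents selectable text.

```python
# Minimal sketch of camera-based text capture via the Google Cloud Vision API
# (credentials assumed to be configured in the environment).
from google.cloud import vision

def extract_text(image_path):
    """Return the recognised text split into lines and words."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(image=image)
    full_text = response.text_annotations[0].description if response.text_annotations else ""
    # Split into lines, then words, so the user can tap to select.
    return [line.split() for line in full_text.splitlines()]

# for line in extract_text("business_card.jpg"):
#     print(" ".join(line))
```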
Article
The main goal of existing word spotting approaches for searching document images has been the identification of visually similar word images in the absence of high quality text recognition output. Searching for a piece of arbitrary text is not possible unless the user identifies a sample word image from the document collection or generates the query word image synthetically. To address this problem, a Markov Random Field (MRF) framework is proposed for searching document images and shown to be effective for searching arbitrary text in real time for books printed in English (Latin script), Telugu and Ottoman scripts. The English experiments demonstrate that the dependencies between the visual terms and letter bigrams can be automatically learned using noisy OCR output. It is also shown that OCR text search accuracy can be significantly improved if it is combined with the proposed approach. No commercial OCR engine is available for Telugu or Ottoman script. In these cases the dependencies are trained using manually annotated document images. It is demonstrated that the trained model can be directly used to resolve arbitrary text queries across books despite font type and size differences. The proposed approach outperforms a state-of-the-art BLSTM baseline in these contexts.
Conference Paper
We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to a unified representation of word images and strings, which seamlessly allows one to perform both query-by-example, where the query is an image, and query-by-string, where the query is a string. We also propose a calibration scheme based on Canonical Correlation Analysis that corrects the attribute scores and greatly improves the results on a challenging dataset. We test our approach on two public datasets, showing state-of-the-art results.
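A hedged sketch of the CCA calibration idea using scikit-learn; the attribute embeddings, their dimensionality, and the training data below are placeholders rather than the paper's actual attribute representations.

```python
# Illustrative sketch of CCA-based calibration: image and string attribute
# embeddings are projected into a common subspace learned on training pairs,
# after which cosine similarity compares queries and words.
import numpy as np
from sklearn.cross_decomposition import CCA

# Placeholder training data: attribute embeddings of word images and of their
# transcriptions, one row per training word (dimensions are illustrative).
img_attrs = np.random.rand(200, 96)
str_attrs = np.random.rand(200, 96)

cca = CCA(n_components=64, max_iter=1000)
cca.fit(img_attrs, str_attrs)

def similarity(query_img_attr, word_str_attr):
    """Cosine similarity between a word image and a text string after CCA projection."""
    q, w = cca.transform(query_img_attr.reshape(1, -1), word_str_attr.reshape(1, -1))
    q, w = q.ravel(), w.ravel()
    return float(q @ w / (np.linalg.norm(q) * np.linalg.norm(w) + 1e-12))
```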
Article
An efficient word spotting framework is proposed to search text in scanned books. The proposed method allows one to search for words when optical character recognition (OCR) fails due to noise, or for languages where no OCR exists. Given a query word image, the aim is to retrieve matching words in the book sorted by similarity. In the offline stage, SIFT descriptors are extracted over the corner points of each word image. These features are quantized into visual terms (visterms) using a hierarchical K-means algorithm and indexed using an inverted file. In the query resolution stage, candidate matches are efficiently identified using the inverted index. These word images are then forwarded to the next stage, where the configuration of visterms on the image plane is tested. Configuration matching is performed efficiently by projecting the visterms onto the horizontal axis and searching for the Longest Common Subsequence (LCS) between the sequences of visterms. The proposed framework is tested on one English and two Telugu books. It is shown that the proposed method resolves a typical user query in under 10 milliseconds while providing very high retrieval accuracy (Mean Average Precision 0.93). The search accuracy for the English book is comparable to searching text in the high-accuracy output of a commercial OCR engine.
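The configuration-matching stage can be sketched as follows: each word image becomes a sequence of visterm IDs ordered by x coordinate, and candidates are scored by the length of the Longest Common Subsequence with the query sequence (a standard dynamic-programming LCS, not the paper's exact implementation).

```python
# Illustrative sketch of configuration matching between visterm sequences.
def lcs_length(query_visterms, candidate_visterms):
    """Classic O(n*m) dynamic-programming LCS length."""
    n, m = len(query_visterms), len(candidate_visterms)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if query_visterms[i - 1] == candidate_visterms[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

# Candidates can then be ranked by LCS length normalised by query length, e.g.:
# lcs_length([3, 17, 8, 42], [3, 9, 17, 8, 5, 42]) -> 4
```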
Article
Object detection techniques are a foundation of the artificial intelligence field. This paper gives a brief overview of the You Only Look Once (YOLO) algorithm and its subsequent advanced versions. Through this analysis, we reach several insightful remarks: the results show the differences and similarities among the YOLO versions and between YOLO and conventional Convolutional Neural Network (CNN) detectors, and the central insight is that improvement of the YOLO algorithm is still ongoing. The article briefly describes the development of the YOLO algorithm, summarizes methods for target recognition and feature selection, and provides literature support for targeted picture news and feature extraction in financial and other fields. The paper thus contributes to the YOLO and wider object detection literature.
Technical Report
TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
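A minimal sketch of expressing and executing a computation with TensorFlow; it uses the current Keras-style API rather than the 2015 graph/session interface the report describes, and the tiny model is purely illustrative.

```python
# Minimal sketch of expressing and executing a computation with TensorFlow
# (modern eager/Keras-style API rather than the original graph sessions).
import tensorflow as tf

# A tiny dense network for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# The same code runs unchanged on CPU, GPU, or TPU backends.
x = tf.random.normal((8, 32))
y = tf.constant([0, 1, 2, 3, 4, 5, 6, 7])
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x, verbose=0).shape)   # (8, 10)
```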
Conference Paper
In this paper, we study the influence of both long and short skip connections on Fully Convolutional Networks (FCN) for biomedical image segmentation. In standard FCNs, only long skip connections are used to skip features from the contracting path to the expanding path in order to recover spatial information lost during downsampling. We extend FCNs by adding short skip connections, that are similar to the ones introduced in residual networks, in order to build very deep FCNs (of hundreds of layers). A review of the gradient flow confirms that for a very deep FCN it is beneficial to have both long and short skip connections. Finally, we show that a very deep FCN can achieve near-to-state-of-the-art results on the EM dataset without any further post-processing.
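A compact PyTorch sketch of the two kinds of skip connection discussed above: a short (residual) skip inside a block and a long skip that carries encoder features across to the decoder; the channel counts and depth are illustrative, not the architecture from the paper.

```python
# Illustrative sketch: short (residual) skips inside blocks plus one long skip
# from the contracting path to the expanding path of a tiny encoder-decoder FCN.
import torch
import torch.nn as nn

class ResBlock(nn.Module):                       # short skip: x + F(x)
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class TinyFCN(nn.Module):
    def __init__(self, in_ch=1, ch=16, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), ResBlock(ch))
        self.down = nn.MaxPool2d(2)
        self.mid = ResBlock(ch)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), ResBlock(ch))
        self.head = nn.Conv2d(ch, num_classes, 1)
    def forward(self, x):
        e = self.enc(x)                                  # features before downsampling
        d = self.up(self.mid(self.down(e)))
        d = self.dec(torch.cat([d, e], dim=1))           # long skip: concatenate encoder features
        return self.head(d)

# TinyFCN()(torch.randn(1, 1, 64, 64)).shape -> torch.Size([1, 2, 64, 64])
```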
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
Robust extraction of text from scene images is essential for successful scene text recognition. Scene images usually have non-uniform illumination, complex backgrounds, and text-like objects. In this paper, we propose a text extraction algorithm that combines adaptive binarization with a perceptual color clustering method. The adaptive binarization method can handle gradual illumination changes on character regions, so it can extract whole character regions even when shadows and/or light variations affect image quality. However, binarization of gray-scale images cannot distinguish different color components that have the same luminance. The perceptual color clustering method complements this by extracting text regions whose colors are perceptually close, preventing the failure mode of the binarization method. Text verification based on local information of a single component and global relationships between multiple components is used to determine the true text components. The proposed method achieved reasonable text extraction accuracy on moderately difficult examples from the ICDAR 2003 database.
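An illustrative OpenCV sketch of combining the two complementary cues: adaptive binarization for gradual illumination changes and colour clustering for components that differ in colour but not luminance. Plain k-means in BGR space stands in for the paper's perceptual color clustering, so this is only an approximation.

```python
# Illustrative sketch: adaptive binarization plus colour clustering, each
# producing candidate text masks to be verified downstream.
import cv2
import numpy as np

def candidate_text_masks(bgr, k=3):
    """Return a binarization mask plus one mask per colour cluster."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 10)

    # K-means in colour space; each cluster becomes one candidate text mask.
    pixels = bgr.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, _ = cv2.kmeans(pixels, k, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    cluster_masks = [(labels.reshape(bgr.shape[:2]) == i).astype(np.uint8) * 255
                     for i in range(k)]
    return binary, cluster_masks

# binary, masks = candidate_text_masks(cv2.imread("scene.jpg"))
```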
Conference Paper
Conventional optical character recognition (OCR) systems operate on individual characters and words, and do not normally exploit document or collection context. We describe a Collection OCR which takes advantage of the fact that multiple examples of the same word (often in the same font) may occur in a document or collection. The idea here is that an OCR or a reCAPTCHA like process generates a partial set of recognized words. In the second stage, a nearest neighbor algorithm compares the remaining word-images to those already recognized and propagates labels from the nearest neighbors. It is shown that by using an approximate fast nearest neighbor algorithm based on Hierarchical K-Means (HKM), we can do this accurately and efficiently. It is also shown that profile based features perform much better than SIFT and Pyramid Histogram of Gradient (PHOG) features. We believe that this is because profile features are more robust to word degradations (common in our documents). This approach is applied to a collection of Telugu books - a language for which no commercial OCR exists. We show from a selection of 33 Telugu books that starting with OCR labels for only 30% of the collection we can recognize the remaining 70% of the words in the collection with 70% accuracy using this approach. Since the approach makes no language specific assumptions, it should be applicable to a large number of languages. In particular we are interested in its applicability to Indic languages and scripts.
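An illustrative sketch of the label-propagation idea: word images already recognised by OCR serve as labelled neighbours, and the remaining word images take the label of their nearest neighbour in a simple projection-profile feature space. Exact nearest-neighbour search is used here for brevity instead of the paper's Hierarchical K-Means.

```python
# Illustrative sketch of nearest-neighbor label propagation over word images
# using simple projection-profile features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def profile_features(word_img, width=64):
    """Vertical projection profile of a binarised word image, resampled to a fixed width."""
    ink = (word_img < 128).astype(np.float32)            # ink pixels
    profile = ink.sum(axis=0)                            # column-wise ink counts
    xs = np.linspace(0, len(profile) - 1, width)
    return np.interp(xs, np.arange(len(profile)), profile)

def propagate_labels(labeled_imgs, labels, unlabeled_imgs):
    feats = np.stack([profile_features(im) for im in labeled_imgs])
    nn = NearestNeighbors(n_neighbors=1).fit(feats)
    queries = np.stack([profile_features(im) for im in unlabeled_imgs])
    _, idx = nn.kneighbors(queries)
    return [labels[i] for i in idx.ravel()]
```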
Conference Paper
Libraries and other institutions are interested in providing access to scanned versions of their large collections of handwritten historical manuscripts on electronic media. Convenient access to a collection requires an index, which is manually created at great labor and expense. Since current handwriting recognizers do not perform well on historical documents, a technique called word spotting has been developed: clusters with occurrences of the same word in a collection are established using image matching. By annotating "interesting" clusters, an index can be built automatically. We present an algorithm for matching handwritten words in noisy historical documents. The segmented word images are preprocessed to create sets of 1-dimensional features, which are then compared using dynamic time warping. We present experimental results on two different data sets from the George Washington collection. Our experiments show that this algorithm performs better and is faster than competing matching techniques.
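A small sketch of the core matching step: dynamic time warping between the 1-D feature sequences (e.g., projection profiles) of two segmented word images, with a length-normalised cost so that lower values indicate a better match.

```python
# Illustrative sketch of DTW matching between 1-D word-image feature sequences.
import numpy as np

def dtw_distance(seq_a, seq_b):
    """DTW alignment cost between two 1-D feature sequences."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)     # length-normalised matching cost

# Clusters are formed from the lowest-cost pairs, e.g.:
# dtw_distance([0, 2, 4, 4, 2], [0, 2, 2, 4, 4, 2])
```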
Tokengrid: Toward More Efficient Data Extraction From Unstructured Documents
  • Arsen Yeghiazaryan
Ultralytics/YOLOv5: v6.0 - YOLOv5n 'nano' models, Roboflow integration, TensorFlow export, OpenCV DNN support
  • G Jocher
Complete scanning application using OpenCv
  • A Gangal
  • P Kumar
  • S Kumari
Nearest neighbor based collection OCR
  • K.