Figure 2: An example of a Persian word consisting of two PAWs, where p represents the pen width. The red line visualizes the imaginary baseline.


Source publication
Conference Paper
Full-text available
Compared to non-cursive scripts, optical character recognition of cursive documents involves extra challenges in layout analysis as well as in recognition of the printed script. This paper presents a front-end OCR for Persian/Arabic cursive documents, which utilizes an adaptive layout analysis system in addition to a combined MLP-SVM recognition process...

Context in source publication

Context 1
... languages with the Roman alphabet and East Asian scripts. Although there have been great efforts to produce omni-font OCR systems for the Persian/Arabic languages, the overall performance of such systems is far from perfect. The Persian written language, which uses a modified Arabic alphabet, is written cursively, and this intrinsic feature makes it difficult to recognize automatically. An essential part of a document understanding process, which seems trivial in human character recognition, is how we perceive the written text parts and how we distinguish between different sections of a document. To the eye of an educated person, the layout of a document is discerned at a glance, but to a machine this seemingly simple task becomes burdensome. To a large extent, the layouts of written materials appear as rectangular columns with all text and shapes aligned to the margins. This arrangement is usually referred to as Manhattan style. In addition to this style, there are more sophisticated styles in which printed text and figures are not aligned. Thus, the majority of efforts in layout analysis have focused on the Manhattan style, but recently there has been growing attention toward the complicated styles as well [3]. Persian/Arabic printed documents are treated the same way in layout analysis; however, there is an absence of solid resources for these languages in the literature. Once the layout of a document is extracted, the system has access to each line of text. Thereafter, the problem of understanding the script reduces to recognition of the words. There are two main approaches to automatic understanding of cursive scripts: holistic and segmentation-based [4]. In the first approach, each word is treated as a whole, and the recognition system does not consider it a combination of separable characters. Much as in speech recognition systems, almost all significant holistic results have used Hidden Markov Models as the recognition engine [5, 6]. The second strategy, which dominates the literature, segments each word into its constituent characters as building blocks and then recognizes each character. In comparison, the first strategy usually outperforms the second, but it requires a more detailed model of the language, whose complexity grows as the vocabulary gets larger. In addition, the number of recognition classes in this method is far greater than in segmentation-based methods. Recently, there has also been a trend toward hybrid methods that couple the segmentation and recognition systems to obtain the overall result; these are usually called segmentation-by-recognition methods [7, 8]. One of the main concerns in designing any OCR system is to make it robust to font variations. Thus, the successful examples are omni-font recognition systems with the ability to learn new fonts from a tutor. In holistic methods, since the character recognition problem is viewed from a different perspective and the system collectively applies learning mechanisms to a few connected characters, the transformation into an omni-font learning system is smooth. On the other hand, segmentation-based systems mainly use learning methods only in the recognition process and, to the best of our knowledge, learning systems have never been used for the segmentation process in the literature [9]. 
Humans usually recognize unfamiliar words by segmenting them and recognizing each character separately to understand the whole word. With this perspective, in this research the whole task is broken down into two separate learning systems, to gain both the reduced complexity of a hierarchy and the adaptability of learning systems. The layout of this paper is as follows: Section 2 describes the characteristics of Persian script that are crucial for the design of OCR systems. In Section 3, we discuss the proposed algorithm, covering the layout analysis, segmentation, and recognition modules in separate subsections. In Section 4, implementation details and results are discussed, followed by concluding remarks and acknowledgements. In this section, we briefly describe some of the main characteristics of Persian/Arabic script to point out the main difficulties an OCR system must overcome. As one of its main properties, the script consists of separated words that are aligned along a horizontal virtual line called the "baseline". Words are separated by long spaces, and each word consists of one or more isolated segments, each called a Piece of a Word (PAW). PAWs, by contrast, are separated by short spaces, and each PAW includes one or more characters. If a PAW has more than one character, each character is connected to its neighbors along the baseline. Figure 1 shows a sample Persian/Arabic script, where a represents the space between two different words and b is the short space between the PAWs of a word. In Figure 2, the first PAW on the right comprises three characters and the second one, on the left, consists of a single character; p denotes the pen width, which is heuristically taken as the most frequent value of the vertical projection in each line. The overall block diagram of the system is presented in Figure 3, which depicts the layout analysis, post-processing, and natural language processing (NLP) subsystems in addition to the recognition and segmentation blocks. This paper presents the design of the layout analysis system together with the segmentation, feature extraction, and recognition sections (Figure 3). The design of the NLP module is out of the scope of this paper. Given the popularity of the Manhattan style in Persian documents, this research focused on documents with aligned columns and no graphical figures; the automatic exclusion of graphical material from text is left for future work. The designed method includes a two-step skew detection that combines the basic Hough transform method [10] with the method proposed in [11], which uses fuzzy run-lengths. The layout analysis section processes a document and adaptively segments it into paragraphs and then into separated lines. The proposed system takes the segmentation-based approach, with measures in place to overcome its main weaknesses. The best results were achieved by using artificial neural networks (ANN) to perform segmentation with some extended features (Section 3.2). In the recognition section, we extract a definite set of features from each segmented symbol, which is fed to a support vector machine (SVM) classification engine to obtain the recognized symbol. Using large-margin classifiers enables us to achieve high recognition rates that are in line with the best results in the literature [2]. 
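The pen-width heuristic above is concrete enough to sketch. A minimal illustration in Python, assuming the text line is available as a binary NumPy array with ink pixels equal to 1; the function name and array convention are ours, not the paper's:

```python
import numpy as np

def estimate_pen_width(line_img):
    """Estimate pen width p as the most frequent nonzero value of the
    vertical projection (ink-pixel count per column) of a binary line."""
    projection = line_img.sum(axis=0)        # vertical projection
    nonzero = projection[projection > 0]     # ignore blank columns
    values, counts = np.unique(nonzero, return_counts=True)
    return int(values[np.argmax(counts)])    # mode of the projection
```

The mode works because most columns of a text line cross a single horizontal stroke, so the dominant projection value approximates the stroke thickness.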
We also decomposed each character of the Persian script into more primitive symbols called graphemes. This novel decomposition decreased the complexity of the recognition and segmentation procedures and improved the overall result. A few different characters may share a single grapheme, and, conversely, several joined graphemes may build a single character. In addition, the Persian language includes many characters whose only differences are the number and placement of their dots. To finalize the character recognition task, a post-processing section is implemented to combine the result of grapheme recognition with the number of dots. Besides, this section corrects some common grapheme recognition errors using an embedded confusion matrix. Figure 4 shows how the grapheme recognition and post-processing stages are combined with the dot recognition module. Before proceeding further, we clarify some frequently used terms in this paper. Grapheme: any graphical image that is a character or a part of one and acts as a fundamental building block of words; this resembles the concept of phonemes in speech, but we do not choose graphemes in direct relation to real phonemes. Pen tip: the vertical position of the pen in the skeleton of a PAW image. Junction points: the horizontal positions of grapheme boundaries; cutting the word at the junction points yields the separated graphemes. The layout analysis section is responsible for segmenting document images into lines. The input of this system is clean black-and-white images with low noise, and the output is images of the separated lines. In this research, we consider a 300 DPI scanned document as input, since this is a common resolution for Persian/Arabic OCR. All other internal parameters are fixed for this resolution; for other scanning resolutions, they only need to be scaled in proportion to this predefined DPI. Thus, the parameters of the proposed system depend only on the easily specified scanning resolution. Since scanning a regular-size page at such a detailed resolution creates a huge file, some measures are needed to keep the algorithm from slowing down. Hence, in the line segmentation block, most of the time-consuming processes are computed on a downsampled version of the document image, and only the final processes are performed on the full-size image, with slight adjustments to ensure accuracy. Figure 5 illustrates the block diagram of the layout analysis subsystem. Downsampling is performed with a ratio of 1/5, which yields a very rough version of the original image. In Figure 5, H and V smear refer to painting black every pixel within a distance of 1 of each black pixel in the horizontal or vertical direction, respectively. In the skew detection module, a method similar to [11] is used to estimate the skew angle, and with this estimate, the exhaustive search for the skew angle via the Hough transform of the full-size image is narrowed ...
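The smearing and downsampling steps just described are simple enough to sketch. Smearing, as defined above, amounts to a morphological dilation with a 1x3 (horizontal) or 3x1 (vertical) structuring element; for the 1/5 downsampling, the rule below (a block is black if it contains any black pixel) is our assumption, since the text only states the ratio:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def smear(img, direction="h"):
    """H or V smear: paint black the immediate horizontal or vertical
    neighbours of every black pixel (True = black)."""
    se = np.ones((1, 3), bool) if direction == "h" else np.ones((3, 1), bool)
    return binary_dilation(img, structure=se)

def downsample(img, ratio=5):
    """Rough 1/ratio downsampling: a block is black if any of its pixels is."""
    h, w = img.shape
    img = img[: h - h % ratio, : w - w % ratio]   # crop to a multiple of ratio
    blocks = img.reshape(h // ratio, ratio, w // ratio, ratio)
    return blocks.any(axis=(1, 3))
```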

Similar publications

Article
Full-text available
There are about 300 million people in India who speak Hindi and write the Devnagari script. Research in Optical Character Recognition (OCR) is popular for its application potential in banks, post offices, defense organizations, library automation, etc. However, most OCR systems are available only for European texts. In this paper, we have proposed a...
Conference Paper
Full-text available
This paper describes the system submitted by A2iA to the second Maurdor evaluation for multi-lingual text recognition. A system based on recurrent neural networks and weighted finite state transducers was used both for printed and handwritten recognition, in French, English and Arabic. To cope with the difficulty of the documents, multiple text lin...
Article
Full-text available
Considerable progress in recognition techniques for many non-Arabic characters has been achieved. In contrast, few efforts have been put into research on Arabic characters. In any Optical Character Recognition (OCR) system, the segmentation step is usually the essential stage, to which an extensive portion of processing is devoted and a considera...
Article
Full-text available
Deep learning-based character recognition of Tamil inscriptions plays a significant role in preserving the ancient Tamil language. The complexity of the task lies in the precise classification of the age-old Tamil letters (Vattezhuthu) into modern-day Tamil letter structures. Various methodologies and pre-processing techniques have been used for de...
Article
Full-text available
This paper describes the COCO-Text dataset. In recent years large-scale datasets like SUN and Imagenet drove the advancement of scene understanding and object recognition. The goal of COCO-Text is to advance state-of-the-art in text detection and recognition in natural images. The dataset is based on the MS COCO dataset, which contains images of co...

Citations

... Chaudhuri et al., 2016) that are very well developed. The Persian language, which belongs to the Indo-Iranian language family, whose roots also go back to the Proto-Indo-European family, and Arabic, a Proto-Semitic language (Huehnergard, 2017), also have their own OCR systems (Alginahi, 2013; Khosravi & Kabir, 2009; Märgner & El Abed, 2012; Mehran et al., 2005). OCR systems have also been developed for Indian sub-continental languages. ...
Article
Full-text available
The two main parts of Optical Character Recognition (OCR) systems are segmentation and recognition. In an OCR system, the input text is first segmented into distinctive character images before these segments are fed to the recognition system. The efficiency of a recognition system largely depends on the dataset. To build a good dataset, a highly interactive data collection tool proves to be extremely useful. This paper presents a tool that eases the segmentation of Bangla characters as well as the labeling of unlabeled character images. In this tool, a new segmentation strategy for characters and modifiers is introduced. A filtering-based decision-making strategy is also proposed, which gives better flexibility in character segmentation. Finally, the paper proposes an interactive tool using HCI principles that eases the labeling of the segmented data.
... Contour-based methods (Omidyeganeh et al., 2005; Mehran et al., 2005; Sari et al., 2002; Romeo-Pakker et al., 1995) extract information about the general shape of the word by finding the word contour, i.e., the pixels that form the outer shape of the word (Lawgali, 2015; Alginahi, 2013; Naz et al., 2016b). They exploit this representation of the word shape to locate potential segmentation points, based on the fact that each character consists of a high contour followed by a flat or low contour, with the segmentation points located just before the contour starts rising. ...
Article
Full-text available
Optical Character Recognition (OCR) is an essential part of many real-world applications such as digital archiving, automatic number plate recognition, cheque handling, etc. However, developing an OCR for printed Arabic text is still a challenging and open research field due to the special characteristics of the Arabic cursive script. In this paper, we propose a segmentation-based, omnifont, open-vocabulary OCR for printed Arabic text. The proposed approach does not require an explicit font type recognition stage. It uses an explicit, indirect character segmentation method. The presented segmentation method is baseline dependent and employs a hybrid, three-step character segmentation algorithm to handle the problem of character overlapping. Besides, it uses a set of topological features that are designed and generalized to make the segmentation approach font independent. The segmented characters are fed as input to a convolutional neural network for feature extraction and recognition. The APTID-MF data set has been used for testing and evaluation. The average accuracy of the proposed segmentation stage is 95%, while the average accuracy of the recognition stage is 99.97%. The whole approach achieves an average accuracy of 95% without using font-type recognition or any post-processing techniques.
... In many cases, the shape of characters after applying a thinning operation differs from the original one, making the segmentation process more difficult. In contour-tracing methods [22][23][24][25][26], the pixels that form the outer shape of the character or word are extracted. Researchers have used many ways to determine the cutting points on the contour. ...
... In contour-tracing methods (Omidyeganeh et al., 2005; Bushofa and Spann, 1997a; Mehran et al., 2005), the pixels that form the outer shape of the word, sub-word, or character are traced and then extracted. Contour-based methods provide a clear description of the shape of the characters, which can solve the under-segmentation problem caused by overlapping characters (Alginahi, 2013). ...
... To eliminate noise sensitivity, Bushofa applies a low-pass filter on the contour points. Mehran et al. (2005) investigate the segmentation and recognition of Persian/Arabic scripts. They employed three basic features, namely the vertical projection of the line image, the first derivative of the upper contour, and the distance of the pen tip from the baseline, to identify the junction points on the upper contour of the primary stroke of the sub-words called PAWs. ...
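The three features quoted here admit a compact per-column computation. A sketch, assuming a binary PAW image with row 0 at the top and ink pixels equal to 1; we approximate the pen tip by the lowest ink row in each column, whereas the paper derives it from the PAW skeleton:

```python
import numpy as np

def junction_features(paw, baseline_row):
    """Per-column features for junction-point detection: vertical
    projection, first derivative of the upper contour, and distance of
    the (approximated) pen tip from the baseline."""
    h, _ = paw.shape
    has_ink = paw.any(axis=0)
    v_proj = paw.sum(axis=0)                               # vertical projection
    # upper contour: topmost ink row per column (h where the column is blank)
    upper = np.where(has_ink, paw.argmax(axis=0), h).astype(float)
    d_upper = np.gradient(upper)                           # contour derivative
    lowest = h - 1 - paw[::-1].argmax(axis=0)              # lowest ink row
    tip = np.where(has_ink, lowest, baseline_row)
    tip_dist = np.abs(tip - baseline_row)                  # distance to baseline
    return np.stack([v_proj, d_upper, tip_dist], axis=1)   # shape (width, 3)
```

These column-wise signals can then be fed to a classifier (the cited work trains a neural network on them) to pick the actual junction points.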
Article
Full-text available
Character segmentation is a necessary and the most critical stage in an Arabic OCR system. It has attracted the interest of a wide range of researchers. However, the nature of the Arabic cursive script poses extra challenges that need further investigation. Therefore, a reliable and efficient Arabic OCR system that is independent of font variations is highly desirable. In this paper, an indirect, font-independent word and character segmentation algorithm for printed Arabic text is investigated. The proposed algorithm takes a binary line image as input and produces a set of binary images, each consisting of one character or ligature, as output. The segmentation is performed at two levels: at the first level, words are segmented by applying a vertical projection to the input line image and using the Interquartile Range (IQR) method to differentiate between word gaps and within-word gaps. At the second level, a projection profile method is used along with a set of statistical and topological features, which are font-independent, to identify the correct segmentation points among all potential points. The APTI dataset was used to test the proposed algorithm with a variety of font types, sizes, and styles. The algorithm was evaluated on 1800 lines (approximately 24,816 words), with an average accuracy of 97.7% for word segmentation and 97.51% for character segmentation.
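The IQR-based gap classification described in this abstract can be sketched in a few lines. The abstract does not give the exact threshold, so the standard Q3 + 1.5*IQR outlier rule used below is an assumption:

```python
import numpy as np

def classify_gaps(gap_widths):
    """Label the horizontal gaps between connected components on a line
    as word gaps or within-word gaps using an IQR outlier rule."""
    gaps = np.asarray(gap_widths, dtype=float)
    q1, q3 = np.percentile(gaps, [25, 75])
    threshold = q3 + 1.5 * (q3 - q1)        # threshold choice is an assumption
    return ["word" if g > threshold else "within-word" for g in gaps]
```

For example, classify_gaps([2, 3, 2, 14, 3]) marks only the 14-pixel gap as a word separator.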
... The most important of these disadvantages is that converting huge volumes of documents is costly, and the approach is not sufficiently successful on low-quality texts and on documents with complicated layouts. Additionally, there is no robust OCR method available yet for Farsi-language scripts [2,3]. To overcome these problems, researchers have suggested another method for document image retrieval, called keyword spotting or, more simply, word spotting [4]. ...
... The locative layout structure of a word image and the classification of its layout components carry useful information for recognizing Farsi words. Based on the literature review above and the method discussed in [2,3], in this paper we propose a new model for machine-printed Farsi text retrieval based on the similarities of the layout of components in Farsi words. The new method is in fact an implementation of the method proposed in [28,29]. ...
Article
Full-text available
In this paper, a new representation of Farsi words is proposed to address keyword spotting problems in Farsi document image retrieval. In this regard, we define a signature for each Farsi word based on the layout of the word's connected components. This signature is represented as boxes, and then, by sketching vertical and horizontal lines, we construct a grid for each word to provide a new descriptor. One advantage of this method is that it can be used for both handwritten and machine-printed texts. Finally, to evaluate the performance of our system against other methods, a database containing 19,582 printed Farsi words is examined; after applying this approach, a recall rate of 98.1% and a precision rate of 94.3% are obtained.
... The system has 92.3% segmentation accuracy, tested on 200 Arabic handwritten images from IFN/ENIT. Another segmentation technique uses the vertical projection profile, the first derivative of the upper contour, and the distance between the baseline and the pen tip to mark junction points, which are later filtered by a trained Neural Network (NN) [13]. Structural and statistical features are used to train a Support Vector Machine (SVM) using character labels. ...
Article
Full-text available
The state-of-the-art Urdu recognition approaches for Nastalique use features along with the sequence of characters' labels for classification and recognition. In Arabic-like cursive scripts, characters are joined together to form a ligature. Conventional methods process the connected stroke of a ligature as a sequence of characters. However, the connected stroke of a ligature image is a sequence of pairs of characters and their joiners, rather than a sequence of characters alone. A character has a distinctive shape that clearly distinguishes it from other characters. A joiner preserves the shape of the connecting stroke between a character and the next one. In this paper, an implicit Urdu character recognition technique is presented for the Nastalique writing style, based on the recognition of characters and joiners. A detailed analysis of Nastalique calligraphy is carried out to extract artistic features of characters and their joiners. The presented technique is tested on Dataset-1, comprising 1446 ligature classes covering 3,309,762 ligature instances and 91,129 unique Urdu words. In addition, the system is tested on 1,600 text lines of the UPTI dataset, called Dataset-2. The character recognition accuracies are 95.58% and 98.37% on Dataset-1 and Dataset-2, respectively. The results reveal that the system outperforms the state-of-the-art HMM- and deep-learning-based Urdu recognition techniques.
... SVM, which is a binary classifier, has been used in the implementation of printed Arabic OCR systems [106], [95], [115]. (For a comprehensive review of applying SVM to Arabic OCR, refer to [116]). ...
Article
Full-text available
Optical character recognition (OCR) is essential in various real-world applications, such as digitizing learning resources to assist visually impaired people and transforming printed resources into electronic media. However, the development of OCR for printed Arabic script is a challenging task. These challenges are due to the specific characteristics of Arabic script. Therefore, different methods have been proposed for developing Arabic OCR systems, and this paper aims to provide a comprehensive review of these methods. This paper also discusses relevant issues of printed Arabic OCR including the challenges of printed Arabic script and performance evaluation. It concludes with a discussion of the current status of printed Arabic OCR, analyzing the remaining problems in the field of printed Arabic OCR and providing several directions for future research. © 2018 International Journal of Advanced Computer Science and Applications.
... However, this rate comes down to 98% in the post-recognition phase of identifying the specific characters. He reported that the major part of these errors comes from corrupted data. In [38], the authors proposed a front-end OCR for Persian/Arabic cursive documents, which utilizes an adaptive layout analysis system in addition to a combined MLP-SVM recognition process. They reported achieving an accurate, font-size-independent OCR for Persian/Arabic printed documents with the ability to recognize omni-font scripts. ...
Thesis
Full-text available
Arabic Optical Character Recognition (AOCR) is the science of converting typed, printed, or handwritten Arabic text image documents into machine-encoded text. The role of OCR is to help or replace humans in computerizing paperwork in order to accelerate and improve it and to reduce cost as well as time and effort. It also provides the ability to edit documents electronically, store them more compactly, and search them. It is not a recent research field; it started about 40 years ago, and the need for it has become increasingly urgent due to the overwhelming amount of paperwork in our societies. A lot of research has been conducted on AOCR, as Arabic script is the mother tongue of over a quarter of the world's population; despite this fact, a robust and reliable AOCR system is still a challenge. This stands in contrast to Latin-script OCR, for which reliable font-written systems have been readily in use for a long time. This thesis aims to enhance the recognition accuracy of printed Arabic characters using local invariant features. A comparative study of four recent algorithms with highly reported recognition accuracy is presented. The algorithms have been evaluated on a proposed computer-generated Primitive Arabic Characters Noise-Free dataset (PAC-NF), since there is no publicly available dataset for primitive printed Arabic text; it contains two models, PAC-NFA and PAC-NFB. Accuracy is evaluated using the Character Recognition Rate (CRR) metric. Results show that one of the four approaches [1] achieved the highest CRR, averaging 99.36% on PAC-NFA and 75.21% on PAC-NFB. Taking this algorithm as the base technique to be improved, a combination of additional features has been proposed to achieve higher recognition rates, with three types of classifiers used to test the features (Random Forest, ANN, and SVM). The results showed that the Random Forest classifier achieved the highest CRR: the proposed technique reached an average CRR of 100% on PAC-NFA and 92.81% on PAC-NFB. The robustness of the proposed technique against two types of noise (scanning noise and artificial Gaussian noise) was tested, and the results showed that it is more robust to both types of noise than the base technique. Another process, Optical Font Recognition (OFR), has been added to the AOCR system to automate the recognition of omni-font documents.
... In [AAKM+12], printed text has been segmented into words, sub-words and characters, and segment classification has been performed using template matching. Another Persian/Arabic text recognizer has been proposed in [MPR05], where segment classification is based on an MLP/SVM combination. ...
... One of the first methods attempted for Arabic character segmentation is based on vertical projection (histogram) [MNY99, ZSSJ05]. Other segmentation techniques are based on contour tracing [RPML95, PLDW06, MPR05], which allows touching points between characters and character endings to be determined. Thinning approaches that determine the skeleton of the text have also been applied to character segmentation [TSAA93], in addition to graph theory [XPD06, Zid04], morphological operators [TF96] and template matching [BS97]. ...
Thesis
This thesis focuses on Arabic embedded text detection and recognition in videos. Different approaches robust to Arabic text variability (fonts, scales, sizes, etc.) as well as to environmental and acquisition condition challenges (contrasts, degradation, complex background, etc.) are proposed. We introduce different machine learning-based solutions for robust text detection without relying on any pre-processing. The first method is based on Convolutional Neural Networks (ConvNet) while the others use a specific boosting cascade to select relevant hand-crafted text features. For the text recognition, our methodology is segmentation-free. Text images are transformed into sequences of features using a multi-scale scanning scheme. Standing out from the dominant methodology of hand-crafted features, we propose to learn relevant text representations from data using different deep learning methods, namely Deep Auto-Encoders, ConvNets and unsupervised learning models. Each one leads to a specific OCR (Optical Character Recognition) solution. Sequence labeling is performed without any prior segmentation using a recurrent connectionist learning model. Proposed solutions are compared to other methods based on non-connectionist and hand-crafted features. In addition, we propose to enhance the recognition results using Recurrent Neural Network-based language models that are able to capture long-range linguistic dependencies. Both OCR and language model probabilities are incorporated in a joint decoding scheme where additional hyper-parameters are introduced to boost recognition results and reduce the response time. Given the lack of public multimedia Arabic datasets, we propose novel annotated datasets issued from Arabic videos. The OCR dataset, called ALIF, is publicly available for research purposes. To the best of our knowledge, it is the first public dataset dedicated to Arabic video OCR. Our proposed solutions were extensively evaluated. Obtained results highlight the genericity and the efficiency of our approaches, reaching a word recognition rate of 88.63% on the ALIF dataset and outperforming a well-known commercial OCR engine by more than 36%.
... Mehran et al. [67] investigated Persian/Arabic scripts and found that the upper contour of the primary stroke of sub-words, called PAWs (Piece of Arabic Word), has a high gradient at the junction points, and that after most junction points the vertical projection has a value larger than the mean. On the other hand, the pen tip is generally positioned near the baseline at the desired junction points. ...
Article
Full-text available
A system for the recognition of machine-printed Arabic script is proposed. The Arabic script is shared by three languages, i.e., Arabic, Urdu and Farsi. The three languages have a decent amount of vocabulary in common, thus compounding the identification problem. Therefore, in an ideal scenario, not only must the script be differentiated from other scripts, but the language of the script must also be recognized. The recognition process involves segregating Arabic-scripted documents from Latin, Han and other scripted documents using horizontal and vertical projection profiles, followed by identification of the language. Identification mainly involves extracting connected components, which are subjected to a Principal Component Analysis (PCA) transformation to extract uncorrelated features. The traditional K-Nearest Neighbours (KNN) algorithm is then used for recognition. Experiments were carried out by varying the number of principal components and connected components extracted per document to find the combination giving optimal accuracy. An accuracy of 100% is achieved for at least 18 connected components and 15 principal components. This proposed system would play a vital role in the automatic archiving of multilingual documents and in the selection of the appropriate Arabic script in multilingual Optical Character Recognition (OCR) systems.
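The PCA + KNN pipeline in this abstract maps directly onto standard library calls. A minimal scikit-learn sketch, assuming X_train holds flattened connected-component images, one per row, and y_train their language labels; n_components=15 matches the reported optimum, while K is not stated in the abstract, so k=5 is an assumption:

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def train_script_identifier(X_train, y_train, n_components=15, k=5):
    """Fit PCA to obtain uncorrelated features, then a KNN classifier
    on the projected connected-component images."""
    pca = PCA(n_components=n_components).fit(X_train)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(pca.transform(X_train), y_train)
    return pca, knn

def identify_language(pca, knn, X):
    """Project new components and predict their script/language."""
    return knn.predict(pca.transform(X))
```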