Figure 2: An example of a Persian word consisting of two PAWs, where p represents the pen width. The red line visualizes the imaginary baseline.


Source publication
Conference Paper
Full-text available
Compared to non-cursive scripts, optical character recognition of cursive documents involves extra challenges in layout analysis as well as in recognition of the printed script. This paper presents a front-end OCR for Persian/Arabic cursive documents, which utilizes an adaptive layout analysis system in addition to a combined MLP-SVM recognition process...

Context in source publication

Context 1
... languages with the Roman alphabet and East Asian scripts. Although there have been great efforts to produce omni-font OCR systems for the Persian/Arabic languages, the overall performance of such systems is far from perfect. The Persian written language, which uses a modified Arabic alphabet, is written cursively, and this intrinsic feature makes it difficult to recognize automatically. An essential part of a document understanding process, which seems trivial in human character recognition, is how we perceive the written text parts and how we distinguish between different sections of a document. To the eye of an educated person, the layout of a document is discerned at a glance, but to a machine this seemingly simple task becomes burdensome. To a large extent, the layouts of written materials appear as rectangular columns with all text and shapes aligned to the margins. This arrangement is usually referred to as Manhattan style. In addition to this style, there are more sophisticated styles in which printed text and figures are not aligned. Thus, the majority of efforts in layout analysis have focused on the Manhattan style, but recently there has been growing attention toward the complicated styles as well [3]. Persian/Arabic printed documents are treated the same way in layout analysis; however, there is an absence of solid resources for these languages in the literature. Once the layout of a document is extracted, the system has access to each line of text. Thereafter, the problem of understanding the script reduces to recognition of the words. There are two main approaches to automatic understanding of cursive scripts: holistic and segmentation-based [4]. In the first approach, each word is treated as a whole, and the recognition system does not consider it a combination of separable characters. Much as in speech recognition systems, almost all significant holistic results have used Hidden Markov Models as the recognition engine [5, 6]. The second strategy, which dominates the literature, segments each word into its constituent characters as building blocks and then recognizes each character. In comparison, the first strategy usually outperforms the second, but it requires a more detailed model of the language, whose complexity grows as the vocabulary gets larger. In addition, the number of recognition classes in this method is far greater than in segmentation-based methods. Recently, there has also been a trend toward hybrid methods that couple the segmentation and recognition systems to obtain the overall result; these are usually called segmentation-by-recognition methods [7, 8]. One of the main concerns in designing any OCR system is to make it robust to font variations. Thus, the successful examples are omni-font recognition systems with the ability to learn new fonts from a tutor. In holistic methods, since the character recognition problem is viewed from a different perspective and the system collectively applies learning mechanisms to a few connected characters, the transformation into an omni-font learning system is smooth. On the other hand, segmentation-based systems mainly use learning methods only in the recognition process and, to the best of our knowledge, learning systems have never been used for the segmentation process in the literature [9]. 
Humans usually recognize unfamiliar words by segmenting them and recognizing each character separately to understand the whole word. With this perspective, in this research the whole task is broken down into two separate learning systems, to gain both the reduced complexity of a hierarchy and the adaptability of learning systems. The layout of this paper is as follows: Section 2 describes the characteristics of Persian script that are crucial for the design of OCR systems. In Section 3, we discuss the proposed algorithm, covering the layout analysis, segmentation, and recognition modules in separate subsections. In Section 4, implementation details and results are discussed, followed by concluding remarks and acknowledgements. In this section, we briefly describe some of the main characteristics of Persian/Arabic script to point out the main difficulties an OCR system must overcome. As one of its main properties, the script consists of separated words that are aligned along a horizontal virtual line called the "baseline". Words are separated by long spaces, and each word consists of one or more isolated segments, each called a Piece of a Word (PAW). PAWs, by contrast, are separated by short spaces, and each PAW includes one or more characters. If a PAW has more than one character, each character is connected to its neighbors along the baseline. Figure 1 shows a sample Persian/Arabic script, where a represents the space between two different words and b is the short space between the PAWs of a word. In Figure 2, the first PAW on the right comprises three characters and the second one, on the left, consists of a single character; p denotes the pen width, which is heuristically taken as the most frequent value of the vertical projection in each line. The overall block diagram of the system is presented in Figure 3, which depicts the layout analysis, post-processing, and natural language processing (NLP) subsystems in addition to the recognition and segmentation blocks. This paper presents the design of the layout analysis system together with the segmentation, feature extraction, and recognition sections (Figure 3). The design of the NLP module is out of the scope of this paper. Given the popularity of the Manhattan style in Persian documents, this research focused on documents with aligned columns and no graphical figures; the automatic exclusion of graphical material from text is left for future work. The designed method includes a two-step skew detection that combines the basic Hough transform method [10] with the method proposed in [11], which uses fuzzy run-lengths. The layout analysis section processes a document and adaptively segments it into paragraphs and then into separated lines. The proposed system takes the segmentation-based approach, with measures in place to overcome its main weaknesses. The best results were achieved by using artificial neural networks (ANN) to perform segmentation with some extended features (Section 3.2). In the recognition section, we extract a definite set of features from each segmented symbol, which is fed to a support vector machine (SVM) classification engine to obtain the recognized symbol. Using large-margin classifiers enables us to achieve high recognition rates that are in line with the best results in the literature [2]. 
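The pen-width heuristic above is concrete enough to sketch. A minimal illustration in Python, assuming the text line is available as a binary NumPy array with ink pixels equal to 1; the function name and array convention are ours, not the paper's:

```python
import numpy as np

def estimate_pen_width(line_img):
    """Estimate pen width p as the most frequent nonzero value of the
    vertical projection (ink-pixel count per column) of a binary line."""
    projection = line_img.sum(axis=0)        # vertical projection
    nonzero = projection[projection > 0]     # ignore blank columns
    values, counts = np.unique(nonzero, return_counts=True)
    return int(values[np.argmax(counts)])    # mode of the projection
```

The mode works because most columns of a text line cross a single horizontal stroke, so the dominant projection value approximates the stroke thickness.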
We also decomposed each character of the Persian script into more primitive symbols called graphemes. This novel decomposition decreased the complexity of the recognition and segmentation procedures and improved the overall result. A few different characters may share a single grapheme, and, conversely, several joined graphemes may build a single character. In addition, the Persian language includes many characters whose only differences are the number and placement of their dots. To finalize the character recognition task, a post-processing section is implemented to combine the result of grapheme recognition with the number of dots. Besides, this section corrects some common grapheme recognition errors using an embedded confusion matrix. Figure 4 shows how the grapheme recognition and post-processing stages are combined with the dot recognition module. Before proceeding further, we clarify some frequently used terms in this paper. Grapheme: any graphical image that is a character or a part of one and acts as a fundamental building block of words; this resembles the concept of phonemes in speech, but we do not choose graphemes in direct relation to real phonemes. Pen tip: the vertical position of the pen in the skeleton of a PAW image. Junction points: the horizontal positions of grapheme boundaries; cutting the word at the junction points yields the separated graphemes. The layout analysis section is responsible for segmenting document images into lines. The input of this system is clean black-and-white images with low noise, and the output is images of the separated lines. In this research, we consider a 300 DPI scanned document as input, since this is a common resolution for Persian/Arabic OCR. All other internal parameters are fixed for this resolution; for other scanning resolutions, they only need to be scaled in proportion to this predefined DPI. Thus, the parameters of the proposed system depend only on the easily specified scanning resolution. Since scanning a regular-size page at such a detailed resolution creates a huge file, some measures are needed to keep the algorithm from slowing down. Hence, in the line segmentation block, most of the time-consuming processes are computed on a downsampled version of the document image, and only the final processes are performed on the full-size image, with slight adjustments to ensure accuracy. Figure 5 illustrates the block diagram of the layout analysis subsystem. Downsampling is performed with a ratio of 1/5, which yields a very rough version of the original image. In Figure 5, H and V smear refer to painting black every pixel within a distance of 1 of each black pixel in the horizontal or vertical direction, respectively. In the skew detection module, a method similar to [11] is used to estimate the skew angle, and with this estimate, the exhaustive search for the skew angle via the Hough transform of the full-size image is narrowed ...
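The smearing and downsampling steps just described are simple enough to sketch. Smearing, as defined above, amounts to a morphological dilation with a 1x3 (horizontal) or 3x1 (vertical) structuring element; for the 1/5 downsampling, the rule below (a block is black if it contains any black pixel) is our assumption, since the text only states the ratio:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def smear(img, direction="h"):
    """H or V smear: paint black the immediate horizontal or vertical
    neighbours of every black pixel (True = black)."""
    se = np.ones((1, 3), bool) if direction == "h" else np.ones((3, 1), bool)
    return binary_dilation(img, structure=se)

def downsample(img, ratio=5):
    """Rough 1/ratio downsampling: a block is black if any of its pixels is."""
    h, w = img.shape
    img = img[: h - h % ratio, : w - w % ratio]   # crop to a multiple of ratio
    blocks = img.reshape(h // ratio, ratio, w // ratio, ratio)
    return blocks.any(axis=(1, 3))
```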

Similar publications

Article
Full-text available
There are about 300 million people in India who speak Hindi and write the Devnagari script. Research in Optical Character Recognition (OCR) is popular for its application potential in banks, post offices, defense organizations, library automation, etc. However, most OCR systems are available only for European texts. In this paper, we have proposed a...
Conference Paper
Full-text available
This paper describes the system submitted by A2iA to the second Maurdor evaluation for multi-lingual text recognition. A system based on recurrent neural networks and weighted finite state transducers was used both for printed and handwritten recognition, in French, English and Arabic. To cope with the difficulty of the documents, multiple text lin...
Article
Full-text available
Considerable progress in recognition techniques for many non-Arabic characters has been achieved. In contrast, few efforts have been put into research on Arabic characters. In any Optical Character Recognition (OCR) system, the segmentation step is usually the essential stage, to which an extensive portion of processing is devoted and a considera...
Article
Full-text available
Deep learning-based character recognition of Tamil inscriptions plays a significant role in preserving the ancient Tamil language. The complexity of the task lies in the precise classification of the age-old Tamil letters (Vattezhuthu) into modern-day Tamil letter structures. Various methodologies and pre-processing techniques have been used for de...
Article
Full-text available
This paper describes the COCO-Text dataset. In recent years large-scale datasets like SUN and Imagenet drove the advancement of scene understanding and object recognition. The goal of COCO-Text is to advance state-of-the-art in text detection and recognition in natural images. The dataset is based on the MS COCO dataset, which contains images of co...

Citations

... Chaudhuri et al., 2016) that are very well developed. The Persian language, which belongs to the Indo-Iranian language family, whose roots also go back to the Proto-Indo-European family, and Arabic, a Proto-Semitic language (Huehnergard, 2017), also have their own OCR systems (Alginahi, 2013; Khosravi & Kabir, 2009; Märgner & El Abed, 2012; Mehran et al., 2005). OCR systems have also been developed for Indian sub-continental languages. ...
Article
Full-text available
The two main parts of Optical Character Recognition (OCR) systems are segmentation and recognition. In an OCR system, the input text is first segmented into distinctive character images before these segments are fed to the recognition system. The efficiency of a recognition system largely depends on the dataset. To build a good dataset, a highly interactive data collection tool proves to be extremely useful. This paper presents a tool that eases the segmentation of Bangla characters as well as the labeling of unlabeled character images. In this tool, a new segmentation strategy for characters and modifiers is introduced. A filtering-based decision-making strategy is also proposed, which gives better flexibility in character segmentation. Finally, the paper proposes an interactive tool using HCI principles that eases the labeling of the segmented data.
... Contour-based methods (Omidyeganeh et al., 2005; Mehran et al., 2005; Sari et al., 2002; Romeo-Pakker et al., 1995) extract information about the general shape of the word by finding the word contour, i.e., the pixels that form the outer shape of the word (Lawgali, 2015; Alginahi, 2013; Naz et al., 2016b). They exploit this representation of the word shape to locate potential segmentation points, based on the fact that each character consists of a high contour followed by a flat or low contour, with the segmentation points located just before the contour starts rising. ...
Article
Full-text available
Optical Character Recognition (OCR) is an essential part of many real-world applications such as digital archiving, automatic number plate recognition, cheque handling, etc. However, developing an OCR for printed Arabic text is still a challenging and open research field due to the special characteristics of the Arabic cursive script. In this paper, we propose a segmentation-based, omnifont, open-vocabulary OCR for printed Arabic text. The proposed approach does not require an explicit font type recognition stage. It uses an explicit, indirect character segmentation method. The presented segmentation method is baseline dependent and employs a hybrid, three-step character segmentation algorithm to handle the problem of character overlapping. Besides, it uses a set of topological features that are designed and generalized to make the segmentation approach font independent. The segmented characters are fed as input to a convolutional neural network for feature extraction and recognition. The APTID-MF data set has been used for testing and evaluation. The average accuracy of the proposed segmentation stage is 95%, while the average accuracy of the recognition stage is 99.97%. The whole approach achieves an average accuracy of 95% without using font-type recognition or any post-processing techniques.
... In many cases, the shape of characters after applying a thinning operation differs from the original one, making the segmentation process more difficult. In contour-tracing methods [22][23][24][25][26], the pixels that form the outer shape of the character or word are extracted. Researchers have used many ways to determine the cutting points on the contour. ...
... In contour-tracing methods (Omidyeganeh et al., 2005; Bushofa and Spann, 1997a; Mehran et al., 2005), the pixels that form the outer shape of the word, sub-word, or character are traced and then extracted. Contour-based methods provide a clear description of the shape of the characters, which can solve the under-segmentation problem caused by overlapping characters (Alginahi, 2013). ...
... To eliminate noise sensitivity, Bushofa applies a low-pass filter on the contour points. Mehran et al. (2005) investigate the segmentation and recognition of Persian/Arabic scripts. They employed three basic features, namely the vertical projection of the line image, the first derivative of the upper contour, and the distance of the pen tip from the baseline, to identify the junction points on the upper contour of the primary stroke of the sub-words called PAWs. ...
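The three features quoted here admit a compact per-column computation. A sketch, assuming a binary PAW image with row 0 at the top and ink pixels equal to 1; we approximate the pen tip by the lowest ink row in each column, whereas the paper derives it from the PAW skeleton:

```python
import numpy as np

def junction_features(paw, baseline_row):
    """Per-column features for junction-point detection: vertical
    projection, first derivative of the upper contour, and distance of
    the (approximated) pen tip from the baseline."""
    h, _ = paw.shape
    has_ink = paw.any(axis=0)
    v_proj = paw.sum(axis=0)                               # vertical projection
    # upper contour: topmost ink row per column (h where the column is blank)
    upper = np.where(has_ink, paw.argmax(axis=0), h).astype(float)
    d_upper = np.gradient(upper)                           # contour derivative
    lowest = h - 1 - paw[::-1].argmax(axis=0)              # lowest ink row
    tip = np.where(has_ink, lowest, baseline_row)
    tip_dist = np.abs(tip - baseline_row)                  # distance to baseline
    return np.stack([v_proj, d_upper, tip_dist], axis=1)   # shape (width, 3)
```

These column-wise signals can then be fed to a classifier (the cited work trains a neural network on them) to pick the actual junction points.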
Article
Full-text available
Character segmentation is a necessary and the most critical stage in an Arabic OCR system. It has attracted the interest of a wide range of researchers. However, the nature of the Arabic cursive script poses extra challenges that need further investigation. Therefore, a reliable and efficient Arabic OCR system that is independent of font variations is highly desirable. In this paper, an indirect, font-independent word and character segmentation algorithm for printed Arabic text is investigated. The proposed algorithm takes a binary line image as input and produces a set of binary images, each consisting of one character or ligature, as output. The segmentation is performed at two levels: at the first level, words are segmented by applying a vertical projection to the input line image and using the Interquartile Range (IQR) method to differentiate between word gaps and within-word gaps. At the second level, a projection profile method is used along with a set of statistical and topological features, which are font-independent, to identify the correct segmentation points among all potential points. The APTI dataset was used to test the proposed algorithm with a variety of font types, sizes, and styles. The algorithm was evaluated on 1800 lines (approximately 24,816 words), with an average accuracy of 97.7% for word segmentation and 97.51% for character segmentation.
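The IQR-based gap classification described in this abstract can be sketched in a few lines. The abstract does not give the exact threshold, so the standard Q3 + 1.5*IQR outlier rule used below is an assumption:

```python
import numpy as np

def classify_gaps(gap_widths):
    """Label the horizontal gaps between connected components on a line
    as word gaps or within-word gaps using an IQR outlier rule."""
    gaps = np.asarray(gap_widths, dtype=float)
    q1, q3 = np.percentile(gaps, [25, 75])
    threshold = q3 + 1.5 * (q3 - q1)        # threshold choice is an assumption
    return ["word" if g > threshold else "within-word" for g in gaps]
```

For example, classify_gaps([2, 3, 2, 14, 3]) marks only the 14-pixel gap as a word separator.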
... The most important of these disadvantages is that converting huge volumes of documents is costly, and the approach is not sufficiently successful on low-quality texts and on documents with complicated layouts. Additionally, there is no robust OCR method available yet for Farsi-language scripts [2,3]. To overcome these problems, researchers have suggested another method for document image retrieval, called keyword spotting or, more simply, word spotting [4]. ...
... The locative layout structure of a word image and the classification of its layout components carry useful information for recognizing Farsi words. Based on the literature review above and the method discussed in [2,3], in this paper we propose a new model for machine-printed Farsi text retrieval based on the similarities of the layout of components in Farsi words. The new method is in fact an implementation of the method proposed in [28,29]. ...
Article
Full-text available
In this paper, a new representation of Farsi words is proposed to address keyword spotting problems in Farsi document image retrieval. In this regard, we define a signature for each Farsi word based on the layout of the word's connected components. This signature is represented as boxes, and then, by sketching vertical and horizontal lines, we construct a grid for each word to provide a new descriptor. One advantage of this method is that it can be used for both handwritten and machine-printed texts. Finally, to evaluate the performance of our system against other methods, a database containing 19,582 printed Farsi words is examined; after applying this approach, a recall rate of 98.1% and a precision rate of 94.3% are obtained.
... The system has 92.3% segmentation accuracy, tested on 200 Arabic handwritten images from IFN/ENIT. Another segmentation technique uses the vertical projection profile, the first derivative of the upper contour, and the distance between the baseline and the pen tip to mark junction points, which are later filtered by a trained Neural Network (NN) [13]. Structural and statistical features are used to train a Support Vector Machine (SVM) using character labels. ...
Article
Full-text available
The state-of-the-art Urdu recognition approaches for Nastalique use features along with the sequence of characters' labels for classification and recognition. In Arabic-like cursive scripts, characters are joined together to form a ligature. Conventional methods process the connected stroke of a ligature as a sequence of characters. However, the connected stroke of a ligature image is a sequence of pairs of characters and their joiners, rather than a sequence of characters alone. A character has a distinctive shape that clearly distinguishes it from other characters. A joiner preserves the shape of the connecting stroke between a character and the next one. In this paper, an implicit Urdu character recognition technique is presented for the Nastalique writing style, based on the recognition of characters and joiners. A detailed analysis of Nastalique calligraphy is carried out to extract artistic features of characters and their joiners. The presented technique is tested on Dataset-1, comprising 1446 ligature classes covering 3,309,762 ligature instances and 91,129 unique Urdu words. In addition, the system is tested on 1,600 text lines of the UPTI dataset, called Dataset-2. The character recognition accuracies are 95.58% and 98.37% on Dataset-1 and Dataset-2, respectively. The results reveal that the system outperforms the state-of-the-art HMM- and deep-learning-based Urdu recognition techniques.
... SVM, which is a binary classifier, has been used in the implementation of printed Arabic OCR systems [106], [95], [115]. (For a comprehensive review of applying SVM to Arabic OCR, refer to [116]). ...
Article
Full-text available
Optical character recognition (OCR) is essential in various real-world applications, such as digitizing learning resources to assist visually impaired people and transforming printed resources into electronic media. However, the development of OCR for printed Arabic script is a challenging task. These challenges are due to the specific characteristics of Arabic script. Therefore, different methods have been proposed for developing Arabic OCR systems, and this paper aims to provide a comprehensive review of these methods. This paper also discusses relevant issues of printed Arabic OCR including the challenges of printed Arabic script and performance evaluation. It concludes with a discussion of the current status of printed Arabic OCR, analyzing the remaining problems in the field of printed Arabic OCR and providing several directions for future research. © 2018 International Journal of Advanced Computer Science and Applications.
... However, this rate comes down to 98% in the post-recognition phase of identifying the specific characters. He reported that the major part of these errors comes from corrupted data. In [38], the authors proposed a front-end OCR for Persian/Arabic cursive documents, which utilizes an adaptive layout analysis system in addition to a combined MLP-SVM recognition process. They reported achieving an accurate, font-size-independent OCR for Persian/Arabic printed documents with the ability to recognize omni-font scripts. ...
Thesis
Full-text available
Arabic Optical Character Recognition (AOCR) is the science of converting typed, printed, or handwritten Arabic text image documents into machine-encoded text. The role of OCR is to help or replace humans in computerizing paperwork in order to accelerate and improve it and to reduce cost as well as time and effort. It also provides the ability to edit documents electronically, store them more compactly, and search them. It is not a recent research field; it started about 40 years ago, and the need for it has become increasingly urgent due to the overwhelming amount of paperwork in our societies. A lot of research has been conducted on AOCR, as Arabic script is the mother tongue of over a quarter of the world's population; despite this fact, a robust and reliable AOCR system is still a challenge. This stands in contrast to Latin-script OCR, for which reliable font-written systems have been readily in use for a long time. This thesis aims to enhance the recognition accuracy of printed Arabic characters using local invariant features. A comparative study of four recent algorithms with highly reported recognition accuracy is presented. The algorithms have been evaluated on a proposed computer-generated Primitive Arabic Characters Noise-Free dataset (PAC-NF), since there is no publicly available dataset for primitive printed Arabic text; it contains two models, PAC-NFA and PAC-NFB. Accuracy is evaluated using the Character Recognition Rate (CRR) metric. Results show that one of the four approaches [1] achieved the highest CRR, averaging 99.36% on PAC-NFA and 75.21% on PAC-NFB. Taking this algorithm as the base technique to be improved, a combination of additional features has been proposed to achieve higher recognition rates, with three types of classifiers used to test the features (Random Forest, ANN, and SVM). The results showed that the Random Forest classifier achieved the highest CRR: the proposed technique reached an average CRR of 100% on PAC-NFA and 92.81% on PAC-NFB. The robustness of the proposed technique against two types of noise (scanning noise and artificial Gaussian noise) was tested, and the results showed that it is more robust to both types of noise than the base technique. Another process, Optical Font Recognition (OFR), has been added to the AOCR system to automate the recognition of omni-font documents.
... In [AAKM+12], printed text has been segmented into words, sub-words and characters, and segment classification has been performed using template matching. Another Persian/Arabic text recognizer has been proposed in [MPR05], where segment classification is based on an MLP/SVM combination. ...
... One of the first methods attempted for Arabic character segmentation is based on vertical projection (histogram) [MNY99, ZSSJ05]. Other segmentation techniques are based on contour tracing [RPML95, PLDW06, MPR05], which allows touching points between characters and character endings to be determined. Thinning approaches that determine the skeleton of the text have also been applied to character segmentation [TSAA93], in addition to graph theory [XPD06, Zid04], morphological operators [TF96] and template matching [BS97]. ...
Thesis
This thesis focuses on Arabic embedded text detection and recognition in videos. Different approaches robust to Arabic text variability (fonts, scales, sizes, etc.) as well as to environmental and acquisition condition challenges (contrasts, degradation, complex background, etc.) are proposed. We introduce different machine learning-based solutions for robust text detection without relying on any pre-processing. The first method is based on Convolutional Neural Networks (ConvNet) while the others use a specific boosting cascade to select relevant hand-crafted text features. For the text recognition, our methodology is segmentation-free. Text images are transformed into sequences of features using a multi-scale scanning scheme. Standing out from the dominant methodology of hand-crafted features, we propose to learn relevant text representations from data using different deep learning methods, namely Deep Auto-Encoders, ConvNets and unsupervised learning models. Each one leads to a specific OCR (Optical Character Recognition) solution. Sequence labeling is performed without any prior segmentation using a recurrent connectionist learning model. Proposed solutions are compared to other methods based on non-connectionist and hand-crafted features. In addition, we propose to enhance the recognition results using Recurrent Neural Network-based language models that are able to capture long-range linguistic dependencies. Both OCR and language model probabilities are incorporated in a joint decoding scheme where additional hyper-parameters are introduced to boost recognition results and reduce the response time. Given the lack of public multimedia Arabic datasets, we propose novel annotated datasets issued from Arabic videos. The OCR dataset, called ALIF, is publicly available for research purposes. To the best of our knowledge, it is the first public dataset dedicated to Arabic video OCR. Our proposed solutions were extensively evaluated. Obtained results highlight the genericity and the efficiency of our approaches, reaching a word recognition rate of 88.63% on the ALIF dataset and outperforming a well-known commercial OCR engine by more than 36%.
... Mehran et al. [67] investigated Persian/Arabic scripts and found that the upper contour of the primary stroke of sub-words, called PAWs (Piece of Arabic Word), has a high gradient at the junction points, and that after most junction points the vertical projection has a value larger than the mean. On the other hand, the pen tip is generally positioned near the baseline at the desired junction points. ...
Article
Full-text available
A system for the recognition of machine-printed Arabic script is proposed. The Arabic script is shared by three languages, i.e., Arabic, Urdu and Farsi. The three languages have a decent amount of vocabulary in common, thus compounding the identification problem. Therefore, in an ideal scenario, not only must the script be differentiated from other scripts, but the language of the script must also be recognized. The recognition process involves segregating Arabic-scripted documents from Latin, Han and other scripted documents using horizontal and vertical projection profiles, followed by identification of the language. Identification mainly involves extracting connected components, which are subjected to a Principal Component Analysis (PCA) transformation to extract uncorrelated features. The traditional K-Nearest Neighbours (KNN) algorithm is then used for recognition. Experiments were carried out by varying the number of principal components and connected components extracted per document to find the combination giving optimal accuracy. An accuracy of 100% is achieved for at least 18 connected components and 15 principal components. This proposed system would play a vital role in the automatic archiving of multilingual documents and in the selection of the appropriate Arabic script in multilingual Optical Character Recognition (OCR) systems.
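The PCA + KNN pipeline in this abstract maps directly onto standard library calls. A minimal scikit-learn sketch, assuming X_train holds flattened connected-component images, one per row, and y_train their language labels; n_components=15 matches the reported optimum, while K is not stated in the abstract, so k=5 is an assumption:

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def train_script_identifier(X_train, y_train, n_components=15, k=5):
    """Fit PCA to obtain uncorrelated features, then a KNN classifier
    on the projected connected-component images."""
    pca = PCA(n_components=n_components).fit(X_train)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(pca.transform(X_train), y_train)
    return pca, knn

def identify_language(pca, knn, X):
    """Project new components and predict their script/language."""
    return knn.predict(pca.transform(X))
```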