Can we build language-independent OCR using LSTM networks?

Conference Paper · August 2013
DOI: 10.1145/2505377.2505394
Conference: 4th International Workshop on Multilingual OCR, Washington D.C., USA
Language models or recognition dictionaries are usually considered an essential step in OCR. However, using a language model complicates training of OCR systems, and it also narrows the range of texts that an OCR system can be used with. Recent results have shown that Long Short-Term Memory (LSTM) based OCR yields low error rates even without language modeling. In this paper, we explore the question to what extent LSTM models can be used for multilingual OCR without the use of language models. To do this, we measure cross-language performance of LSTM models trained on different languages. LSTM models show good promise to be used for language-independent OCR. The recognition errors are very low (around 1%) without using any language model or dictionary correction.
Adnan Ul-Hasan
Technical University of Kaiserslautern
67663 Kaiserslautern, Germany
Thomas M. Breuel
Technical University of Kaiserslautern
67663 Kaiserslautern, Germany
Keywords: MOCR, LSTM Networks, RNN
Multilingual OCR (MOCR) is of interest for many reasons: digitizing historic books containing two or more scripts, bilingual books, dictionaries, and books with line-by-line translation are a few of the cases that call for reliable multilingual OCR systems. However, MOCR also presents several unique challenges, as Popat [1] pointed out in the context of the Google Books project. Some of these challenges are:
• Multiple scripts/languages on a page (multi-script identification).
• Multiple languages in the same or similar fonts, like Arabic-Persian or English-German.
• The same language in multiple scripts, like Urdu in Nastaleeq and Naskh scripts.
• Archaic and reformed orthographies, e.g. 18th-century English, Fraktur (historical German), etc.
MOCR ’13, August 24 2013, Washington, DC, USA
Copyright 2013 ACM 978-1-4503-2114-3/13/08 ...$15.00.
There have been efforts to adapt existing OCR systems to other languages. The open-source OCR system Tesseract [2] is one such example. Its basic classification is a hierarchical shape classification: the character set is first reduced to a few candidate characters, and in the last stage the test sample is matched against the representatives of this short list. Although Tesseract can be used for a variety of languages (support is available for many of them), it cannot be used as an all-in-one solution when multiple scripts occur together.
The usual approach to the multilingual OCR problem is to combine two or more separate classifiers [3], as it is believed that reasonable OCR output for even a single script cannot be obtained without sophisticated post-processing steps such as language modelling, dictionary-based correction of OCR errors, font adaptation, etc. Natarajan et al. [4] proposed an HMM-based script-independent multilingual OCR system. Its feature extraction, training and recognition components are all language independent; however, it uses language-specific word lexica and language models for recognition. To the best of our knowledge, no OCR method had been proposed that achieves very low error rates without the aforementioned post-processing techniques. Recent experiments on English and German text using LSTM networks [5], however, have shown that reliable OCR results can be obtained without such techniques.
Our hypothesis for multilingual OCR is that if a single model can be obtained, at least for a family of scripts (e.g. Latin, Arabic, Indic), then this single model can be used to recognize all scripts of that family, thereby reducing the effort of combining multiple classifiers. Since LSTM networks can achieve very low error rates without a language-modelling post-processing step, they are a natural candidate for multilingual OCR.
In this paper, we report the results of applying LSTM networks to the multilingual OCR problem. The basic aim is to determine how much LSTM networks rely on language modelling to predict the correct labels, or whether we can do well without language modelling and other post-processing steps. Additionally, we want to see how well LSTM networks use context to recognize a particular character. Specifically, we trained LSTM networks on English, German, French and a mixed set of these three languages, and tested each of them on all four. The LSTM-based models achieve very high recognition accuracy without the aid of language modelling and show good promise for multilingual OCR tasks.
Figure 1: Some sample images from our database. There are 96 variations of standard fonts in common use; for example, for the Times TrueType font, its normal, italic, bold and bold-italic variations were included. Note that these images were degraded to reflect scanning artefacts.
In what follows, the preprocessing step is described in the next section, Section 3 describes the configuration of the LSTM network used in the experiments, and Section 4 gives the details of the experimental evaluation. Section 5 concludes the paper with a discussion of the current work and future directions.
Scale and relative position of a character are important features for distinguishing characters in Latin script (and some other scripts), so text-line normalization is an essential step in applying 1D LSTM networks to OCR. In this work, we used the normalization approach introduced in [5], namely text-line normalization based on a trainable, shape-based model. A token dictionary created from a collection of text lines contains information about the x-height and baseline (geometric features) and the shape of individual characters. These models are then used to normalize any text line.
Recurrent Neural Networks (RNNs) have shown great promise in recent times due to the Long Short-Term Memory (LSTM) architecture [6], [7]. The LSTM architecture differs significantly from earlier architectures like Elman networks [8] and echo-state networks [9], and appears to overcome many of the limitations and problems of those earlier architectures.
Traditional RNNs, though good at context-aware processing [10], have not shown competitive performance on OCR and speech recognition tasks. Their shortcomings are attributed mainly to the vanishing gradient problem [11, 12]. The Long Short-Term Memory [6] architecture was designed to overcome this problem. It is a highly non-linear recurrent network with multiplicative “gates” and additive feedback. Graves et al. [7] introduced the bidirectional LSTM architecture for accessing context in both the forward and backward directions; both layers are then connected to a single output layer. To avoid the requirement of segmented training data, Graves et al. [13] used a forward-backward algorithm to align transcripts with the output of the neural network. The interested reader is referred to the above-mentioned references for further details regarding LSTM and RNN architectures.
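
To make the phrase “multiplicative gates and additive feedback” concrete, here is a minimal sketch of one time step of a single LSTM unit with scalar input and state. It uses the common forget-gate formulation without peephole connections, which is an assumption and may differ from the variant implemented in the libraries used in this paper; the function name `lstm_step` and the weight layout are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM cell (scalar input and state).

    `w` maps each gate name to a tuple (input weight, recurrent weight, bias).
    This layout is a hypothetical choice made for the sketch.
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("input"))    # input gate, in (0, 1)
    f = sigmoid(pre("forget"))   # forget gate, in (0, 1)
    o = sigmoid(pre("output"))   # output gate, in (0, 1)
    g = math.tanh(pre("cand"))   # candidate value to write into the cell

    # Gates act multiplicatively; the cell state is updated additively,
    # which is what lets error signals survive over many time steps.
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c
```

With the forget gate saturated near 1 and the input gate near 0, the cell state passes through a step essentially unchanged, which is the behaviour that counters the vanishing gradient problem mentioned above.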
For recognition, we used a 1D bidirectional LSTM architecture, as described in [7]. We found that the 1D architecture outperforms its 2D or higher-dimensional siblings for printed OCR tasks. For all the experiments reported in this paper, we used a modified version of the LSTM library described in [14]. That library provides 1D and multidimensional LSTM networks, together with ground-truth alignment using a forward-backward algorithm (“CTC”, connectionist temporal classification [13]). The library also provides a heuristic decoding mechanism to map the frame-wise network output onto a sequence of symbols. We have reimplemented LSTM networks and forward-backward alignment from scratch and reproduced these results (our implementation uses a slightly different decoding mechanism). This implementation has been released as open source [15] (OCRopus version 0.7).
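
The decoding mechanisms of the two implementations are not specified here. As an illustration of what mapping frame-wise output onto a symbol sequence involves, the following sketch shows standard CTC best-path decoding (take the most likely class per frame, merge repeats, drop blanks). It is not claimed to be the decoder of either library; the function name and the convention that class 0 is the blank are assumptions.

```python
def ctc_greedy_decode(frames, alphabet, blank=0):
    """Best-path CTC decoding.

    frames   -- list of per-frame score lists, one score per output class
    alphabet -- alphabet[k] is the character for class k (the blank entry is unused)
    blank    -- index of the CTC blank class (assumed to be 0 here)
    """
    # Most likely class for each frame.
    best = [max(range(len(f)), key=f.__getitem__) for f in frames]

    out = []
    prev = blank
    for k in best:
        # Emit a symbol only when it is not blank and differs from the
        # previous frame, so repeated frames collapse into one character.
        if k != blank and k != prev:
            out.append(alphabet[k])
        prev = k
    return "".join(out)
```

A blank between two identical labels keeps them apart, so a frame sequence such as a, a, blank, a, b, b, blank decodes to "aab".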
During the training stage, randomly chosen input text-line images are presented as 1D sequences to the forward propagation step through the LSTM cells, and then the forward-backward alignment of the output is performed. Errors are then back-propagated to update the weights, and the process is repeated for the next randomly selected text-line image. Note that raw pixel values are used as the only features; no other sophisticated features were extracted from the text-line images. The implicit features in the 1D sequence are the baseline and x-heights of individual characters.
The aim of our experiments was to evaluate LSTM performance on multilingual OCR without the aid of language modelling and other language-specific assistance. To explore the cross-language performance of LSTM networks, a number of experiments were performed. We trained four separate LSTM networks for English, German, French and a mixed set of all these languages. For testing, there are a total of 16 combinations: each LSTM network was tested on its own script and on the other three, e.g. the LSTM network trained on German was tested on German, French, English and mixed-script data. These results are summarized in Table 2, and some sample outputs are presented in Table 3. As the error measure, we used the ratio of insertions, deletions and substitutions relative to the length of the ground truth; accuracy was measured at the character level.
Table 1: Statistics on the number of text-line images for English, French, German and mixed-script data.
Language       Total     Training   Test
English        85,350    81,600     4,750
French         85,350    81,600     4,750
German         114,749   110,400    4,349
Mixed-script   85,350    81,600     4,750
Table 2: Experimental results of applying LSTM networks to multilingual OCR. These results validate our hypothesis that a single LSTM model trained with a mixture of scripts (from a single family of scripts) can be used to recognize text of individual family members. Note that the error rates for the German network tested on French, and for the English network tested on French and German, were obtained by ignoring words containing special characters (umlauts and accented letters), in order to correctly gauge the effect of the language models of a particular language. LSTM networks trained for individual languages can also be used to recognize other scripts, but they show some language dependence. All these results were achieved without the aid of any language model.
Model          English (%)   German (%)   French (%)   Mixed (%)
English        0.5           1.22         4.1          1.06
German         2.04          0.85         4.7          1.2
French         1.8           1.4          1.1          1.05
Mixed-script   1.7           1.1          2.9          1.1
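
The error measure used in this section (insertions, deletions and substitutions relative to the ground truth, at character level) corresponds to a Levenshtein edit distance divided by the number of ground-truth characters. The sketch below is one way to compute it; pooling errors and characters over all lines of a test set before dividing is an assumption, as the paper does not state how per-line results were aggregated.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # delete r
                           cur[j - 1] + 1,            # insert h
                           prev[j - 1] + (r != h)))   # substitute, or match at no cost
        prev = cur
    return prev[-1]


def character_error_rate(pairs):
    """pairs: iterable of (ground_truth, ocr_output) strings, one pair per text line."""
    pairs = list(pairs)
    errors = sum(edit_distance(gt, out) for gt, out in pairs)
    total = sum(len(gt) for gt, _ in pairs)
    return errors / total if total else 0.0
```

For example, the pair ("hello world", "hallo wrld") contains one substitution and one deletion against 11 ground-truth characters, giving an error rate of 2/11.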
4.1 Database
A separate synthetic database for each language was developed using OCRopus [16] (ocropus-linegen). This utility requires a set of UTF-8 encoded text files and a set of TrueType fonts. With these two things available, one can artificially generate any number of text-line images. The utility also provides control over scanning artefacts such as distortion, jitter and other degradations. Separate corpora of text-line images in German, English and French were generated in commonly used fonts (including bold, italic and bold-italic variations) from freely available literature. These images were degraded using degradation models [17] to reflect scanning artefacts. There are four degradation parameters, namely elastic elongation, jitter, sensitivity and threshold. Sample text-line images from our database are shown in Figure 1. Each database is further divided into training and test sets. Statistics on the number of text-line images for each of the four data sets are given in Table 1.
4.2 Parameters
The text lines were normalized to a height of 32 in the preprocessing step. Both the left-to-right and the right-to-left LSTM layers contain 100 LSTM memory blocks. The learning rate was set to 1e-4, and the momentum was set to 0.9. Training was carried out for one million steps (roughly corresponding to 100 epochs, given the size of the training set). Training errors were reported every 10,000 training steps and plotted. The network corresponding to the minimum training error was used for test-set evaluation.
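
The model-selection rule in this subsection (keep the network with the minimum reported training error) can be sketched as follows. The function name and the list-based bookkeeping are hypothetical; only the reporting interval of 10,000 steps is taken from the text.

```python
REPORT_EVERY = 10_000  # training steps between two reported training errors


def best_checkpoint(training_errors, report_every=REPORT_EVERY):
    """Return (training step, error) of the report with the lowest training error.

    training_errors[k] is assumed to be the error reported at
    step (k + 1) * report_every.
    """
    k = min(range(len(training_errors)), key=training_errors.__getitem__)
    return (k + 1) * report_every, training_errors[k]
```

With one million training steps this rule chooses among 100 reported networks.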
4.3 Results
Since there are no umlauts (German) or accented letters (French) in English, when testing the LSTM model trained for German on French, and the model trained for English on French and German, the words containing those special characters were omitted from the recognition results. The reason for doing this was to be able to correctly gauge the effect of not using language models. If those words were not removed, the resulting error would also contain a proportion of errors due to mis-recognition of characters that the network never saw during training. By removing the words with special characters, the true performance of an LSTM network trained on a language with a smaller alphabet, applied to a language with a larger alphabet, can be evaluated. It should be noted that these results were obtained without the aid of any post-processing step such as language modelling or the use of dictionaries to correct OCR errors.
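
How the words with special characters were matched between ground truth and recognition output is not described. The sketch below is one simple possibility under two loudly stated assumptions: “special character” is taken to mean any non-ASCII character, and words are paired by position, which only works when both lines contain the same number of whitespace-separated words. The function name is hypothetical.

```python
def drop_special_words(ground_truth, ocr_output):
    """Remove word pairs whose ground-truth word contains a non-ASCII character.

    Returns the filtered (ground_truth, ocr_output) pair, or None when the
    two lines have different word counts and positional pairing is unsafe.
    """
    gt_words = ground_truth.split()
    out_words = ocr_output.split()
    if len(gt_words) != len(out_words):
        return None  # alignment unclear; the caller decides how to handle the line

    # Keep only pairs whose ground-truth word is plain ASCII
    # (no umlauts, no accented letters).
    kept = [(g, o) for g, o in zip(gt_words, out_words) if g.isascii()]
    return " ".join(g for g, _ in kept), " ".join(o for _, o in kept)
```

The filtered pairs can then be scored with the same character-level error measure as the unfiltered test sets.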
The LSTM model trained on mixed data obtained similar recognition results (around 1% recognition error) when applied to English, German and French script individually. The other results indicate a small language dependence, in that LSTM models trained for a single language yielded lower error rates when tested on the same script than when evaluated on other scripts.
To gauge the magnitude of the effect of language modelling, we compared our results with the Tesseract open-source OCR system [18]. We applied the latest available models (as of the submission date) for English, French and German to the same test data. The Tesseract system yielded higher error rates than the LSTM-based models. Tesseract’s model for English yielded 1.33%, 5.02%, 5.09% and 4.82% recognition error when applied to English, French, German and Mixed-data respectively. The model for French yielded 2.06%, 2.7%, 3.5% and 2.96% recognition error when applied to English, German and Mixed-data respectively, while the model for German yielded 1.85%, 2.9%, 6.63% and 4.36% recognition error when applied to English, French and Mixed-data respectively. These results show that the absence of language modelling, or applying a different language model, affects the recognition. Since no model for mixed data is available for Tesseract, the effect of evaluating such a model on individual scripts could not be computed.
The results presented in this paper show that LSTM networks can be used for multilingual OCR. LSTM networks do not learn a particular language model internally (nor do we need such models as a post-processing step). They show great promise in learning the various shapes of a given character in different fonts and under degradations (as evident from our highly versatile data). The language dependence is observable, but the effects are small compared to other state-of-the-art OCR, where the absence of language models leads to considerably worse results. To gauge the language dependence more precisely, one can train LSTM networks on randomly generated data using n-gram statistics and test those models on natural languages. We are currently working in this direction, and the results will be reported elsewhere.
Table 3: Sample outputs from four LSTM networks trained for English, German, French and mixed data. An LSTM network trained on a specific language is unable to recognize special characters of other languages, as they were not part of its training data. It is therefore necessary to exclude these errors from the final error score. Thus we can train an LSTM model on mixed data from a family of scripts and use it to recognize the individual languages of this family with very low recognition error.
In the following, we analyse the errors made by our LSTM networks when applied to other scripts. The top 5 confusions for each case are tabulated in Table 4. The case of applying an LSTM network to the same language for which it was trained is not discussed here, as it is not relevant to the cross-language performance of LSTM networks.
Most of the errors made by the LSTM network trained on mixed data are non-recognition (deletion) of certain characters like l, t, r, i. These errors may be removed by better
Looking at the first column of Table 4 (applying the LSTM network trained for English to the other three scripts), most of the errors are due to confusion between characters of similar shape, like I to l (and vice versa), Z to 2 and c to e. Two confusions, namely Z with A and Z with L, are interesting as there is apparently no shape similarity between them. One possible explanation is that Z is the least frequent letter in English, so there may not be many Zs in the training samples, resulting in its poor recognition. Two other noticeable errors (also in other models) are unrecognised space and (denotes that this letter was deleted).
For the LSTM network trained on German (second column), most of the top errors are due to the inability to recognize a particular letter. The top errors when applying the LSTM network trained on French to other scripts are confusions of w/W with v/V. An interesting observation, which could be a possible reason for this behaviour, is that the relative frequency of w is very low in French. In other words, ‘w’ may be considered a special character with respect to French when applying the French model to German and English. This is therefore a language-dependent issue, which is not observable in the case of mixed data.
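
A confusion table such as Table 4 can be derived by aligning each recognized line with its ground truth and counting the differing character pairs. The sketch below uses Python's `difflib.SequenceMatcher` for the alignment, which is a convenient heuristic and not guaranteed to produce the minimum-edit-distance alignment; it is not the procedure used in the paper. The gap symbol `_` for deleted or inserted characters is a hypothetical choice.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_confusions(pairs, gap="_"):
    """Count (ground-truth char, output char) confusions over aligned line pairs.

    pairs -- iterable of (ground_truth, ocr_output) strings
    gap   -- placeholder for a missing character (deletion or insertion)
    """
    counts = Counter()
    for gt, out in pairs:
        matcher = SequenceMatcher(None, gt, out, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            a, b = gt[i1:i2], out[j1:j2]
            # Pair differing characters position by position and pad the
            # shorter side with the gap symbol.
            for k in range(max(len(a), len(b))):
                g = a[k] if k < len(a) else gap
                o = b[k] if k < len(b) else gap
                counts[(g, o)] += 1
    return counts
```

Calling `count_confusions(pairs).most_common(5)` then yields the five most frequent confusions for one model and test set.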
This work can be extended in many directions. First, more European languages such as Italian, Spanish and Dutch may be included in the current set-up to train an all-in-one LSTM network for these languages. Secondly, other families of scripts, especially Nabataean and Indic scripts, can be tested to further validate our hypothesis empirically.
[1] A. C. Popat, “Multilingual OCR Challenges in Google
Books,” 2012. [Online]. Available: multilingual
ocr challenges-handout.pdf
[2] R. Smith, D. Antonova, and D. S. Lee, “Adapting the
Tesseract Open Source OCR Engine for Multilingual
OCR,” in Int. Workshop on Multilingual OCR, Jul.
[3] M. A. Obaida, M. J. Hossain, M. Begum, and M. S.
Alam, “Multilingual OCR (MOCR): An Approach to
Classify Words to Languages,” Int’l Journal of
Computer Applications, vol. 32, no. 1, pp. 46–53, Oct.
Table 4: Top confusions when applying the LSTM models to the various tasks. The confusions of a model on the language for which it was trained are not listed, as they are not relevant to the present paper. The garbage class denotes that a character is not recognized at all. When the LSTM network trained on English was applied to other scripts, the resulting top errors are similar: shape confusions between characters. Non-recognition of the space character is another noticeable error. For the network trained on German, most errors are due to deletion of characters. Confusions of w/W with v/V are the top confusions when the LSTM network trained on French was applied to other scripts.
[4] P. Natarajan, Z. Lu, R. M. Schwartz, I. Bazzi, and
J. Makhoul, “Multilingual Machine Printed OCR,”
IJPRAI, vol. 15, no. 1, pp. 43–63, 2001.
[5] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and
F. Shafait, “High Performance OCR for English and
Fraktur using LSTM Networks,” in Int. Conf. on
Document Analysis and Recognition, Aug. 2013.
[6] S. Hochreiter and J. Schmidhuber, “Long Short-Term
Memory,” Neural Computation, vol. 9, no. 8, pp.
1735–1780, 1997.
[7] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami,
H. Bunke, and J. Schmidhuber, “A Novel
Connectionist System for Unconstrained Handwriting
Recognition,” IEEE Trans. on Pattern Analysis and
Machine Intelligence, vol. 31, no. 5, pp. 855–868, May
[8] J. L. Elman, “Finding Structure in Time.” Cognitive
Science, vol. 14, no. 2, pp. 179–211, 1990.
[9] H. Jaeger, “Tutorial on Training Recurrent Neural
Networks, Covering BPTT, RTRL, EKF and the
‘Echo State Network’ approach,” Sankt Augustin,
Tech. Rep., 2002.
[10] A. W. Senior, “Off-line Cursive Handwriting
Recognition using Recurrent Neural Networks,” Ph.D.
dissertation, England, 1994.
[11] S. Hochreiter, Y. Bengio, P. Frasconi, and
J. Schmidhuber, “Gradient flow in recurrent nets: the
difficulty of learning long-term dependencies,” in A
Field Guide to Dynamical Recurrent Neural
Networks, S. C. Kremer and J. F. Kolen, Eds. IEEE
Press, 2001.
[12] Y. Bengio, P. Simard, and P. Frasconi, “Learning
long-term dependencies with gradient descent is
difficult,” IEEE Trans. on Neural Networks, vol. 5,
no. 2, pp. 157–166, Mar. 1994.
[13] A. Graves, S. Fernandez, F. Gomez, and
J. Schmidhuber, “Connectionist Temporal
Classification: Labeling Unsegmented Sequence Data
with Recurrent Neural Networks,” in ICML,
Pennsylvania, USA, 2006, pp. 369–376.
[14] A. Graves, “RNNLIB: A recurrent neural network
library for sequence learning problems.” [Online].
[15] “OCRopus - Open Source Document Analysis and
OCR system.” [Online]. Available:
[16] T. M. Breuel, “The OCRopus open source OCR
system,” in DRR XV, vol. 6815, Jan. 2008, p. 68150F.
[17] H. S. Baird, “Document Image Defect Models,” in
Structured Document Image Analysis, H. S. Baird,
H. Bunke, and K. Yamamoto, Eds. New York:
Springer-Verlag, 1992.
[18] R. Smith, “An Overview of the Tesseract OCR
Engine,” in ICDAR, 2007, pp. 629–633.
  • Preprint
    Full-text available
    In this paper we describe a dataset of German and Latin \textit{ground truth} (GT) for historical OCR in the form of printed text line images paired with their transcription. This dataset, called \textit{GT4HistOCR}, consists of 313,173 line pairs covering a wide period of printing dates from incunabula from the 15th century to 19th century books printed in Fraktur types and is openly available under a CC-BY 4.0 license. The special form of GT as line image/transcription pairs makes it directly usable to train state-of-the-art recognition models for OCR software employing recurring neural networks in LSTM architecture such as Tesseract 4 or OCRopus. We also provide some pretrained OCRopus models for subcorpora of our dataset yielding between 95\% (early printings) and 98\% (19th century Fraktur printings) character accuracy rates on unseen test cases, a Perl script to harmonize GT produced by different transcription rules, and give hints on how to construct GT for OCR purposes which has requirements that may differ from linguistically motivated transcriptions.
  • Preprint
    Full-text available
    We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources our approach uses OCR models trained for other languages written in Roman. Currently, there exists no dataset available for Romanised Sanskrit OCR. So, we bootstrap a dataset of 430 images, scanned in two different settings and their corresponding ground truth. For training, we synthetically generate training images for both the settings. We find that the use of copying mechanism (Gu et al., 2016) yields a percentage increase of 7.69 in Character Recognition Rate (CRR) than the current state of the art model in solving monotone sequence-to-sequence tasks (Schnober et al., 2016). We find that our system is robust in combating OCR-prone errors, as it obtains a CRR of 87.01% from an OCR output with CRR of 35.76% for one of the dataset settings. A human judgment survey performed on the models shows that our proposed model results in predictions which are faster to comprehend and faster to improve for a human than the other systems.
  • Article
    Full-text available
    The digital comic book market is growing every year now, mixing digitized and digital-born comics. Digitized comics suffer from a limited automatic content understanding which restricts online content search and reading applications. This study shows how to combine state-of-the-art image analysis methods to encode and index images into an XML-like text file. Content description file can then be used to automatically split comic book images into sub-images corresponding to panels easily indexable with relevant information about their respective content. This allows advanced search in keywords said by specific comic characters, action and scene retrieval using natural language processing. We get down to panel, balloon, text, comic character and face detection using traditional approaches and breakthrough deep learning models, and also text recognition using LSTM model. Evaluations on a dataset composed of online library content are presented, and a new public dataset is also proposed. © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (
  • Article
    Full-text available
    This paper presents a comprehensive test of the principal tasks in document image analysis (DIA), starting with binarization, text line segmentation, and isolated character/glyph recognition, and continuing on to word recognition and transliteration for a new and challenging collection of palm leaf manuscripts from Southeast Asia. This research presents and is performed on a complete dataset collection of Southeast Asian palm leaf manuscripts. It contains three different scripts: Khmer script from Cambodia, and Balinese script and Sundanese script from Indonesia. The binarization task is evaluated on many methods up to the latest in some binarization competitions. The seam carving method is evaluated for the text line segmentation task, compared to a recently new text line segmentation method for palm leaf manuscripts. For the isolated character/glyph recognition task, the evaluation is reported from the handcrafted feature extraction method, the neural network with unsupervised learning feature, and the Convolutional Neural Network (CNN) based method. Finally, the Recurrent Neural Network-Long Short-Term Memory (RNN-LSTM) based method is used to analyze the word recognition and transliteration task for the palm leaf manuscripts. The results from all experiments provide the latest findings and a quantitative benchmark for palm leaf manuscripts analysis for researchers in the DIA community.
  • Chapter
    Nowadays, there is a great demand for multilingual optical character recognition (MOCR) in various web applications. And recently, Long Short-Term Memory (LSTM) networks have yielded excellent results on Latin-based printed recognition. However, it is not flexible enough to cope with challenges posed by web applications where we need to quickly get an OCR model for a certain set of languages. This paper proposes a Hybrid Model Reuse (HMR) training approach for multilingual OCR task, based on 1D bidirectional LSTM networks coupled with a model reuse scheme. Specifically, Fixed Model Reuse (FMR) scheme is analyzed and incorporated into our approach, which implicitly grabs the useful discriminative information from a fixed text generating model. Moreover, LSTM layers from pre-trained networks for unilingual OCR task are reused to initialize the weights of target networks. Experimental results show that our proposed HMR approach, without assistance of any post-processing techniques, is able to effectively accelerate the training process and finally yield higher accuracy than traditional approaches.
  • Chapter
    Book identification system is one of the core parts of a book sorting system. And the efficiency and accuracy of book identification are extremely critical to all libraries. This paper proposes a new image recognition method to identify books in libraries based on barcode decoding together with deep learning optical character recognition (OCR) and describes its application in library book identification system. The identification process relies on recognition of the images or videos of the book cover moving on a conveyor belt. Barcode is printed on or attached to the surface of each book. Deep learning OCR program is applied to improve the accuracy of recognition, especially when the barcode is blurred or faded. Book sorting system design based on this method will also be introduced. Experiment demonstrates that the accuracy of our method is high in real-time test and achieve good accuracy even when the barcode is blurred.
  • Article
    As we all know, inconsistent distribution and representation of different modalities, such as image, text and audio, cause the “media gap”, which poses a great challenge to deal with such heterogeneous data. Currently, state-of-the-art multimodal approaches mainly focus on the data provided by target task, neglecting the extra information on different but related tasks. In this paper, we explore a multimodal representation learning architecture by leveraging embedding representation trained from extra information. Specifically speaking, the approach of fixed model reuse is integrated into our architecture, which can incorporate helpful information from existing models/features into a new model. Based on our proposed architecture, we study multilingual OCR and long-text-based image retrieval tasks. Multilingual OCR is a difficult task that deals with multiple languages on the same page. We take advantage of extra textual embedding layer in an existing text-generating model to improve the accuracy of multilingual OCR. As for the long-text-based image retrieval, a cross-modal task, intermediate visual embedding layer in an off-the-shelf image-captioning model is leveraged to enhance the retrieval ability. The experimental results validate the effectiveness of our proposed architecture on narrowing down the “media gap” and yield observable improvement in these two tasks. Our architecture outperform the state-of-the-art approaches by 4.2% improvements in terms of accuracy in multilingual OCR task and yields improvement from 9 to 6 with regard to the median rank of retrieval result in long-text-based image retrieval task.
  • Urdu optical character recognition (OCR) is a complex problem due to the nature of its script, which is cursive. Recognizing characters of different font sizes further complicates the problem. In this research, long short term memory-recurrent neural network (LSTM-RNN) and convolution neural network (CNN) are used to recognize Urdu optical characters of different font sizes. LSTM-RNN is trained on formerly extracted feature sets, which are extracted for scale invariant recognition of Urdu characters. From these features, LSTM-RNN extracts meta features. CNN is trained on raw binary images. Two benchmark datasets, i.e. centre for language engineering text images (CLETI) and Urdu printed text images (UPTI) are used. LSTM-RNN reveals consistent results on both datasets, and outperforms CNN. Maximum 99% accuracy is achieved using LSTM-RNN.
  • Article
    We combine three methods which significantly improve the OCR accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome by also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement we select additional training lines which the voters disagree most on, expecting them to offer the highest information gain for a subsequent training (active learning). Evaluations on six early printed books yielded the following results: On average the combination of pretraining and voting improved the character accuracy by 46% when training five folds starting from the same mixed model. This number rose to 53% when using different models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% on average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5% when training on only 60 lines. Using a substantial ground truth pool of 1,000 lines brought the error rate down even further to less than 1% on average.
  • Conference Paper
    Long Short-Term Memory (LSTM) networks have yielded excellent results on handwriting recognition. This paper describes an application of bidirectional LSTM networks to the problem of machine-printed Latin and Fraktur recognition. Latin and Fraktur recognition differs significantly from handwriting recognition in both the statistical properties of the data, as well as in the required, much higher levels of accuracy. Applications of LSTM networks to handwriting recognition use two-dimensional recurrent networks, since the exact position and baseline of handwritten characters is variable. In contrast, for printed OCR, we used a one-dimensional recurrent network combined with a novel algorithm for baseline and x-height normalization. A number of databases were used for training and testing, including the UW3 database, artificially generated and degraded Fraktur text and scanned pages from a book digitization project. The LSTM architecture achieved 0.6% character-level test-set error on English text. When the artificially degraded Fraktur data set is divided into training and test sets, the system achieves an error rate of 1.64%. On specific books printed in Fraktur (not part of the training set), the system achieves error rates of 0.15% (Fontane) and 1.47% (Ersch-Gruber). These recognition accuracies were found without using any language modelling or any other post-processing techniques.
  • Article
    We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although change was required to various modules, including physical layout analysis, and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.
  • Article
    Time underlies many interesting human behaviors. Thus, the question of how to represent time in connectionist models is very important. One approach is to represent time implicitly by its effects on processing rather than explicitly (as in a spatial representation). The current report develops a proposal along these lines first described by Jordan (1986) which involves the use of recurrent links in order to provide networks with a dynamic memory. In this approach, hidden unit patterns are fed back to themselves; the internal representations which develop thus reflect task demands in the context of prior internal states. A set of simulations is reported which range from relatively simple problems (temporal version of XOR) to discovering syntactic/semantic features for words. The networks are able to learn interesting internal representations which incorporate task demands with memory demands; indeed, in this approach the notion of memory is inextricably bound up with task processing. These representations reveal a rich structure, which allows them to be highly context-dependent, while also expressing generalizations across classes of items. These representations suggest a method for representing lexical categories and the type/token distinction.
  • Conference Paper
    Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
  • Conference Paper
    OCRopus is a new, open source OCR system emphasizing modularity, easy extensibility, and reuse, aimed at both the research community and large scale commercial document conversions. This paper describes the current status of the system, its general architecture, as well as the major algorithms currently being used for layout analysis and text line recognition.
  • Book
    Document image analysis is the automatic computer interpretation of images of printed and handwritten documents, including text, drawings, maps, music scores, etc. Research in this field supports a rapidly growing international industry. This is the first book to offer a broad selection of state-of-the-art research papers, including authoritative critical surveys of the literature, and parallel studies of the architecture of complete high-performance printed-document reading systems. A unique feature is the extended section on music notation, an ideal vehicle for international sharing of basic research. Also, the collection includes important new work on line drawings, handwriting, character and symbol recognition, and basic methodological issues. The IAPR 1990 Workshop on Syntactic and Structural Pattern Recognition is summarized, including the reports of its expert working groups, whose debates provide a fascinating perspective on the field. The book is an excellent text for a first-year graduate seminar in document image analysis, and is likely to remain a standard reference in the field for years.
  • Article
    This paper presents a script-independent methodology for optical character recognition (OCR) based on the use of hidden Markov models (HMM). The feature extraction, training and recognition components of the system are all designed to be script independent. The training and recognition components were taken without modification from a continuous speech recognition system; the only component that is specific to OCR is the feature extraction component. To port the system to a new language, all that is needed is text image training data from the new language, along with ground truth which gives the identity of the sequences of characters along each line of each text image, without specifying the location of the characters on the image. The parameters of the character HMMs are estimated automatically from the training data, without the need for laborious handwritten rules. The system does not require presegmentation of the data, neither at the word level nor at the character level. Thus, the system is able to handle languages with connected characters in a straightforward manner. The script independence of the system is demonstrated in three languages with different types of script: Arabic, English, and Chinese. The robustness of the system is further demonstrated by testing the system on fax data. An unsupervised adaptation method is then described to improve performance under degraded conditions.
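Several of the abstracts above report character-level error rates without stating how they were computed. A common convention, assumed here and not confirmed by any of the listed papers, is to divide the Levenshtein edit distance between the recognized line and its ground-truth transcription by the length of the ground truth. The function names `edit_distance` and `character_error_rate` are hypothetical; this is a minimal sketch, not the evaluation code of any of these systems.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the first i-1 characters
    # of ref and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by ground-truth length (assumed convention)."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this convention, an error rate of about 1% corresponds to roughly one character edit per 100 ground-truth characters. Because insertions count as errors, the value can exceed 1.0 when the recognizer emits much more text than the reference contains.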