Transcription of Ottoman Machine-Print Documents
Esma F. Bilgin Tasdemir ( esmabilgin.tasdemir@medeniyet.edu.tr )
Istanbul Medeniyet University
Fırat Kızılırmak ( firat.kizilirmak@alumni.sabanciuniv.edu )
Sabanci University
M. Aysu Akcan ( melaysuakcan@gmail.com )
University of Vienna
Mehmet Kuru ( mehmet.kuru@sabanciuniv.edu )
Sabanci University
Zeynep Tandogan ( zeyneptandogan@sabanciuniv.edu )
Sabanci University
S. Dogan Akansu ( selami.akansu@alumni.sabanciuniv.edu )
Sabanci University
Berrin Yanıkoglu ( berrin@sabanciuniv.edu )
Sabanci University
Research Article
Keywords:
DOI: https://doi.org/
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 
Additional Declarations: No competing interests reported.
Springer Nature 2021 LaTeX template
Transcription of Ottoman Machine-Print Documents
Esma F. Bilgin Tasdemir1, Fırat Kızılırmak2, M. Aysu Akcan3, Mehmet
Kuru4,5, Zeynep Tandogan2, S. Dogan Akansu2 and Berrin Yanıkoglu2,5
1*Department of Information and Document Management, Istanbul Medeniyet University,
Istanbul, 34704, Türkiye.
2Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, 34956, Türkiye.
3Institute for Oriental Studies, University of Vienna, Vienna, Austria.
4Faculty of Arts and Sciences, Sabanci University, Istanbul, 34956, Türkiye.
5Center of Excellence in Data Analytics (VERIM), Sabanci University, Istanbul, 34956,
Türkiye.
Contributing authors: esmabilgin.tasdemir@medeniyet.edu.tr;
firat.kizilirmak@alumni.sabanciuniv.edu; melaysuakcan@gmail.com;
mehmet.kuru@sabanciuniv.edu; zeyneptandogan@sabanciuniv.edu;
selami.akansu@alumni.sabanciuniv.edu; berrin@sabanciuniv.edu;
These authors contributed equally to this work.
Abstract
With the ever-increasing speed of digitization, a large collection of Ottoman documents is now accessible to researchers and the general public. However, the majority of users interested in these documents cannot read them unless they are transcribed into the modern Turkish script, which uses an extended version of the Latin alphabet. Manual transcription of such a massive number of documents is beyond the capacity of human experts. As a solution, we propose an automatic recognition system for printed Ottoman documents that transcribes Ottoman texts directly into the modern Turkish script. We evaluated three decoding strategies, including the Word Beam Search decoder, which allows a recognition lexicon and n-gram statistics to be used during the decoding phase. The system achieves a 2.25% character error rate and a 6.42% word error rate on a test set of 1.4K samples when the test set transcriptions are used as the recognition lexicon. Using a general-purpose, large lexicon of the Ottoman era (260K words and 77% test coverage), the performance is measured as a 3.68% character error rate and a 16.61% word error rate.
Keywords: Ottoman Text Recognition, Optical Character Recognition, Deep Learning, Turkish
1 Introduction
Ottoman Turkish (OT) was the language of the Ottoman Empire, used for administrative and literary purposes from the early 15th century to the early 20th century. Although Ottoman Turkish is based on Turkish syntactical elements, it contains a considerable number of Arabic and Persian words, loan-words, and grammatical features [27,52].
The writing is based on an extended version of
the Arabic-Persian alphabet. There are 28 Arabic
letters, 1 Persian letter and 4 additional characters
representing some Turkish sounds in the Ottoman
alphabet.
Ottoman Turkish is the main writing system encountered in manuscripts prepared by scribes and compilers in earlier periods, as well as in printed works produced in printing houses after 1729. The same writing system was also used in the early period of the Turkish Republic, until the alphabet reform of 1928. Figure 1 shows some examples of OT documents in the Naskh font.
With the ever-increasing speed of digitization, a large collection of Ottoman Turkish documents is accessible to researchers and the general public. Despite this access to scanned documents, the majority of interested users cannot read the Ottoman script. Transcription of documents is therefore the necessary first step for many researchers across various disciplines of the social sciences, especially those investigating the period before 1928. In fact, researchers in the social sciences and humanities devote much of their time to scanning sources written in OT and transcribing the references they find. This prolongs research and reduces the time left for the actual analysis of the sources.
With the advances in deep learning, unprecedented recognition performance has been obtained in historical document recognition tasks in various languages; however, very few works exist on transcribing Ottoman documents with the latest technologies [15,16,20,22]. Ottoman document recognition is a problem that has been studied for many years without a sufficiently successful solution. Most of the previous works address document retrieval tasks using traditional machine learning methods [12,17,26]. Their modest success rates can be attributed to the limited sizes of the datasets they use.
There are a few existing commercial products with functions similar to that of the proposed system. Some of them adapt a pre-trained system for Ottoman documents [23], while others do not provide a transcription but only an Optical Character Recognition (OCR) service [1,2]. Furthermore, it is impractical to evaluate their performance because of the usage restrictions applied in the free versions.
This work presents the core engine for recognizing text lines in machine-printed Ottoman documents, as part of a new project that aims to transcribe Ottoman Turkish documents. Within the scope of this work, the aim is to recognize a given text line written in the Arabic-Persian alphabet and return its transcription as Turkish words written in the Latin alphabet. A wide range of writing styles has been used in Ottoman Turkish documents, ranging from the relatively simple Naskh style to very ornamental styles. We limit the scope to the Naskh style, which is not only less ornamental but also the most widely used font in printed materials, as exemplified in Figure 1.
2 Challenges in Ottoman
Turkish Recognition
In many languages, automatic transcription of historical documents aims to digitize a text in the same writing system. Digitized texts are more useful to users since they are more easily accessed and are usually searchable. However, for Ottoman Turkish documents to be comprehensible to a wider readership, they must be represented in the modern Turkish writing system, which has been in use since 1929. Automatic transcription of the Ottoman script therefore means transcription within the same language but between different writing systems. There are only a few studies dealing with this within-language transcription problem in the literature [36,37].
Many issues make automatic recognition and transcription of Ottoman documents a difficult problem. Problems associated with the cursive nature of the Arabic and Persian scripts are well known and well documented in many studies [14]. Unlike OCR for Latin alphabets, Arabic OCR does not enjoy easy character segmentation. Furthermore, connected characters that take multiple forms depending on their position in a word, together with diacritics and rich ligatures in certain fonts, complicate line and character segmentation.
The Ottoman script inherits all these difficulties and, in addition, presents problems arising from its orthography. In the Ottoman language, some vowels are omitted and must be deduced from the context. This practice is adopted from the Arabic language, where
Fig. 1 Examples of Ottoman Turkish Naskh documents; a-b) handwritten manuscripts, c-d) printed books.
the short vowels are not represented with letters. The practice of skipping vowels leads to many heteronyms, i.e., words with the same spelling but different pronunciations. Furthermore, there are some one-to-many mappings from the Ottoman alphabet to the Turkish alphabet. For example, the Ottoman word shown in Figure 2 can be transcribed as 'avlu', 'ölü', 'evli' or 'ulu' depending on the context. Transcription of such words requires the integration of context knowledge into the recognition process.
Fig. 2 An example of the one-to-many mapping between the Ottoman and Turkish alphabets. The same Ottoman word corresponds to four different words (avlu, evli, ölü and ulu) in Turkish.
Another issue is that, with the addition of Arabic and Persian words and some borrowed syntax, the recognition lexicon for Ottoman Turkish documents needs to be very large compared to that of Turkish, which itself already requires a large lexicon or another solution due to the agglutinative nature of the language [51,53].
3 Related Work
Background. The first Optical Character Recognition (OCR) studies in the literature addressed discrete character recognition in Latin alphabet-based printed texts in the 1950s [41]. Over time, the number of studies has increased and the scope of the problem has expanded. Present-day OCR systems can be designed to recognize discrete symbols, words, lines, or paragraphs in very diverse settings, including handwritten or printed documents, historical or modern documents, and texts in natural-scene images or screenshots (such as texts in video footage). A large body of research also exists on recognizing non-Latin alphabets, including Arabic, Chinese, Japanese, and Cyrillic texts [3,7,22]. While the Turkish alphabet is also based on the Latin alphabet, it has 8 extra characters to represent sounds in the language.
Prior to the deep learning era, neural networks and Hidden Markov Models (HMM) were two of the most popular methods in the literature [3,40]. With the introduction of Deep Learning (DL) methods in the 2000s, Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Long Short-Term Memory units, and their derivatives such as bidirectional LSTMs (BLSTM) and multi-dimensional LSTMs (MDLSTM) have been frequently preferred for OCR tasks [19,29,30].
Research on the recognition of printed texts in languages written with Latin-based alphabets has achieved great success. In a recent study, Breuel et al. obtained a character error rate of 0.6% using an LSTM on the UW-III dataset [15], which contains English document images. In the same study, a character error rate of 0.82% was reached for the recognition of historical German newspaper texts printed in the Fraktur font. A similar
performance is observed in the recognition of handwritten texts. For example, in [54], a 4.7% character error rate is obtained on the IAM dataset by Yousef et al. using an encoder-decoder type deep neural network. Another recognition system employing a CNN-RNN network, proposed by Dutta et al., achieved 98.2% word recognition accuracy on the RIMES dataset [25], and a CNN-LSTM ensemble model is reported to reach a word recognition accuracy of 89.55% on the GW dataset by Ameryan et al. [11].
Systems for Turkish. The success of systems developed for Turkish is lower due to the challenges associated with recognizing Turkish. The main issue affecting recognition accuracy is the size of the lexicon used to narrow down possible recognition alternatives during the recognition process. For general-purpose English OCR or Handwritten Text Recognition (HTR) systems, a 30,000-word lexicon is sufficient to cover most of the valid words in the daily language. However, due to its agglutinative morphology, the number of different word forms in daily Turkish can easily exceed 1 million [53].
In a recent study, Tasdemir et al. developed an HMM-based online Turkish handwriting recognition system that achieved a 91.7% word recognition rate with a vocabulary of approximately 2,000 words; a sharp decrease to 67.9% was observed when the vocabulary size increased to 12,500 words [51].
Systems for Arabic and Persian. Research in OCR and HTR for languages besides English has gained momentum in the last few decades, especially for Arabic [8,9,38]. However, success rates for Arabic recognition systems are considerably lower than those of Latin script-based systems. Some characteristics of the Arabic script, such as its cursive nature, characters changing shape depending on their position within the word, and the omission of vowels, constitute the main difficulties for Arabic OCR. As with other languages, while HMM-based systems were most popular prior to the deep learning era [4,34,42], they have been gradually replaced by deep learning methods in recent years [9,10]. There are also studies in which HMM and Artificial Neural Network (ANN) techniques are used together, as in the study by Rahal et al. [45], where a Bag-of-Features (BoF) framework based on a deep Sparse Auto-Encoder (SAE) is employed for feature extraction and HMMs are used for sequence recognition.
Much of the Arabic machine-printed OCR work is conducted on the APTI dataset, which contains synthetically created Arabic word images rendered in several fonts. For example, a character error rate of 2.5% was reported using the HMM approach [50]. In a similar study, a 0.5% character recognition error was obtained on the same dataset [4]. Another system, based on LSTMs, is reported to achieve a 0.01 word error rate for the Naskh font on the same dataset [47].
Similar synthetic printed-text datasets are used for the recognition of other Arabic alphabet-based scripts as well. In [43], a character error rate of 7.6% was achieved on a synthetic Persian printed-text dataset using Support Vector Machines. Likewise, a character error rate of 0.8% was achieved using a CNN-LSTM model on synthetic Persian word images created with a font similar to Naskh in [46].
The P-KHATT dataset is another Arabic dataset consisting of real data obtained from scanned printed line images. In a study conducted by Ahmad et al. on the P-KHATT dataset, a 3.1% character recognition error was reported for the Naskh font with the HMM technique [4]. Using Bag-of-Features, autoencoder, and HMM techniques on the same dataset, a 2.4% character recognition error was obtained for the Naskh font by Rahal et al. [45].
In Arabic HTR, Graves et al. [31] achieved an 8.5% word error rate using Multi-dimensional Recurrent Neural Networks (MDRNN) on the IFN/ENIT handwritten isolated-word dataset. On the same dataset, Chherawala et al. achieved a word error rate of 11.1% with an MDLSTM in which automatically extracted and manually extracted features were used together [21].
Recognition of handwritten lines is always more difficult than recognition of isolated words. Hence, recognition accuracies are generally lower for line recognition tasks. Ahmad et al. used an MDLSTM network on the KHATT dataset, consisting of real handwritten line images, in [6], and reported a word error rate of 24.2%. Later, they obtained a 20% word error rate by using data augmentation methods in the same experimental setting [5].
Systems for Ottoman Turkish. A very limited number of studies on text recognition in Ottoman Turkish have been identified in the literature. Most of them date to the pre-deep-learning era and use traditional machine learning techniques [12,17,18,26]. In a study that applied deep learning techniques to Ottoman documents for the first time, Aydemir et al. trained an RNN system with manually extracted features on a dataset containing 169,148 discrete handwritten word images obtained from population registration documents [13]. The accuracy is reported as a 12.4% character error rate and a 22.1% word error rate on a small test set of 1,000 different words. Dolek et al. developed another Ottoman OCR system for printed Naskh line images using a CNN-LSTM network trained with both synthetic and real data in [24]. The system's accuracy is reported as an 88.86% letter recognition rate and a 64% word recognition rate on a small test set comprising 21 pages [2].
4 Methodology
Unlike previous approaches in the literature, we take a single-stage approach to produce the Turkish transcription directly from Ottoman Turkish document images. In two-stage approaches, the system performs OCR (into the Arabic-Persian alphabet) in the first stage to achieve what is called the transliteration, followed by word recognition in the second stage to obtain the Turkish word. It should be noted that, in our proposed method, the recognition system learns not only to recognize the characters but also the vowelization of recognized words when the vowels are not represented.
In our single-stage approach, we go directly from the image to Turkish text. This approach has the advantage of saving time and effort in the data annotation stage. For Turkish annotators, it is faster and easier to use Turkish characters instead of Ottoman letters when annotating collected images, owing to their familiarity with the Turkish letters and keyboard layout. In addition, the accuracy of the produced labels can be checked more efficiently for similar reasons. This direct transcription is the first for the Ottoman Turkish recognition problem.
4.1 Dataset
There is no publicly available Ottoman document dataset. All of the previous work, which is very limited in both number and scope, uses small proprietary datasets. However, training a deep network requires a large amount of data. In this work, we therefore first collected and annotated a large text-line dataset.
Two Ottoman novels¹ are used for creating the dataset. In total, 761 pages are semi-automatically segmented into lines and annotated manually at the line level. The ground truth of a line image is its transcription in modern Turkish. A special transcription scheme is designed to represent the mappings between the Arabic alphabet-based Ottoman letters and the Latin alphabet-based Turkish letters at a sufficient level. Figure 3 shows two sample lines from the dataset.
Fig. 3 Sample lines from the dataset and their corre-
sponding ground truths as Turkish transcriptions.
The resulting dataset contains 14,236 lines from 761 pages. There are 69 different symbols, 109,750 words, and 36,847 unique words in the dataset. The character set contains the Turkish lowercase letters plus 3 additional letters for the long vowels (i.e. â, î, û), 10 digits, the space character, some punctuation symbols, and some special characters such as parentheses. The average number of characters per line is 8. The minimum and maximum image widths are 714 and 1597 pixels, respectively.
We randomly split the dataset into three subsets, so that 70% of the text-line images are in the train set, 20% in the validation set, and 10% in the test set.
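The split above can be sketched as follows; the function name, seed, and use of Python's `random` module are our own illustrative assumptions, not details from the original pipeline.

```python
import random

def split_dataset(line_ids, train=0.70, val=0.20, seed=42):
    """Randomly split line IDs into train/validation/test subsets.
    The remainder after the train and validation fractions goes to test."""
    ids = list(line_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

# 14,236 lines split 70/20/10, as in the dataset description above.
train_set, val_set, test_set = split_dataset(range(14236))
```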
4.2 CNN-BiLSTM Model
Our recognition model is based on deep learn-
ing, and in particular combines a Convolutional
Neural Network (CNN) for feature extraction and
¹Hayattan Sahifeler by Hüseyin Rahmi Gürpınar, printed in 1918, and Son Yıldız by Mehmet Rauf, printed in 1926
a Recurrent Neural Network (RNN) for sequence
modelling. It is based on the system developed in
[35].
Hybrid models of convolutional and recurrent neural networks are frequently used for handwritten text recognition in the literature [33,39,44,49]. In such a system, the CNN part learns a sequence of feature maps from the input image, while the LSTM part learns sequence labeling from that input. A final CTC layer [28] functions as a softmax output layer that maps the character scores of each sequence element to characters from a predefined symbol set. More specifically, we use a CNN-BiLSTM network to encode the input image and a CTC layer to decode the encoded representation into a sequence of characters.
The network used in this preliminary study includes 14 CNN layers and 2 bidirectional LSTM layers. A bidirectional recurrent network is preferred intentionally, to incorporate knowledge from both the left-to-right and right-to-left directions.
The CNN layers apply convolution operations with 3×3 kernels and max-pooling operations with 2×2 kernels to extract a sequence of features from the image. ReLU is used as the activation function, and batch normalization is applied to train the model properly. The features produced by the CNN network are fed to the LSTM layers, consisting of 256 hidden neurons, to leverage sequence-based information. At the end, the CTC layer processes the sequence of probability distributions to obtain the final transcription, i.e., the sequence of recognized characters.
The training parameters of this baseline system, which were decided empirically, are a batch size of 4 and a learning rate of 1e-4. The network weights are initialized randomly and optimized using the Adam optimization algorithm. The model is trained with the CTC loss function until there is no notable improvement in the CTC loss.
4.3 Decoding
The BLSTM layers, together with the CTC layer, output a softmax probability over a predefined alphabet for every time step of the feature sequence produced by the CNN layers. This sequence of probabilities is labeled with symbols from the predefined alphabet in a decoding phase. Thus, the raw CNN-BLSTM output is a sequence that includes repeated and possibly erroneous characters.
A number of strategies can be employed for decoding. In the greedy approach, the symbol with the highest probability is chosen at each time frame. It is a simple approach, and thus often not optimal, since it does not use any context information. Beam search tries to overcome this limitation of the greedy approach by extending the current solutions with the k highest-rated alternatives, as opposed to just one [32]. Greedy and beam search are general-purpose decoding alternatives.
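The greedy (best-path) strategy can be illustrated with a short sketch of standard CTC decoding: pick the most probable symbol per frame, collapse consecutive repeats, and drop blanks. The toy alphabet and probability matrix below are assumptions for illustration only.

```python
def ctc_greedy_decode(probs, alphabet, blank=0):
    """Best-path CTC decoding: take the argmax symbol per time step,
    collapse consecutive repeats, then remove blank symbols."""
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    out, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

# Toy example: index 0 is the CTC blank, indices 1.. map to characters.
alphabet = {1: "e", 2: "v"}
probs = [
    [0.1, 0.8, 0.1],    # 'e'
    [0.1, 0.7, 0.2],    # 'e' again (repeat, collapsed away)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.1, 0.7],    # 'v'
]
print(ctc_greedy_decode(probs, alphabet))  # -> "ev"
```

A blank between two identical symbols keeps them distinct, which is how CTC distinguishes genuine double letters from frame-level repeats.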
For text recognition, Scheidl et al. [48] proposed Word Beam Search (WBS), a beam search method that uses a lexicon to guide the search towards valid words. While selecting the next best character, the algorithm considers only characters that result in valid word prefixes in the language. This is done efficiently by representing the lexicon as a trie, learned from a training corpus. WBS can also employ a 2-gram language model (LM), trained on a given corpus, to take the LM scores of words into account during decoding, which in turn helps obtain more meaningful outputs.
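The lexicon constraint at the heart of WBS can be sketched as follows. This is a deliberately simplified illustration, not Scheidl et al.'s algorithm: it ignores CTC blanks, n-gram scoring, and multi-word lines, and assumes the number of frames matches the word length. The toy lexicon and probabilities are assumptions.

```python
def build_trie(lexicon):
    """Build a character trie over the lexicon; '$' marks a complete word."""
    root = {}
    for word in lexicon:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def lexicon_beam_search(probs, alphabet, trie, beam_width=3):
    """Beam search over per-frame character probabilities in which a beam
    may only be extended with characters that keep it a valid word prefix."""
    beams = {"": 1.0}   # prefix -> accumulated probability
    nodes = {"": trie}  # prefix -> trie node reached by that prefix
    for frame in probs:
        new_beams, new_nodes = {}, {}
        for prefix, p in beams.items():
            for ch, child in nodes[prefix].items():
                if ch == "$":   # end-of-word marker, not a character
                    continue
                ext = prefix + ch
                new_beams[ext] = new_beams.get(ext, 0.0) + p * frame[alphabet.index(ch)]
                new_nodes[ext] = child
        beams = dict(sorted(new_beams.items(), key=lambda kv: -kv[1])[:beam_width])
        nodes = {k: new_nodes[k] for k in beams}
    # Keep only beams that end on a complete lexicon word.
    complete = {b: p for b, p in beams.items() if "$" in nodes[b]}
    return max(complete, key=complete.get)

# Toy lexicon and a 4-frame output favouring 'e', 'v', 'l', 'i' in turn.
alphabet = ["a", "v", "l", "u", "e", "i"]
trie = build_trie({"avlu", "evli", "ulu"})
probs = [
    [0.1, 0.1, 0.1, 0.1, 0.5, 0.1],
    [0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.5, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.5],
]
```

With these frames the decoder returns "evli": beams that stray outside the trie (e.g. invalid prefixes) are never created, which is exactly how the lexicon prunes the search space.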
The text corpus used to obtain the lexicon was collected from a set of novels, historical works, and periodicals, all printed between 1888 and 1927. The texts are written in the Latin-based Turkish script, following the same transcription scheme as the training dataset. There are approximately 1,761K words and 260K unique words in the corpus. It covers 77% of the test set lexicon.
In this work, we use WBS with the different lexicon and language model settings it offers and evaluate their effectiveness. We experiment with the three modes proposed for WBS [48]: words, n-gram, and n-gram + forecast. In words mode, no language model is applied during decoding, whereas the other modes use a 2-gram LM. While n-gram mode incorporates the LM score, n-gram + forecast mode also considers possible next words.
5 Experiments
A series of experiments was conducted to create a baseline system and to evaluate different decoding techniques and corpora. The system is evaluated on the 1,410 samples of the test set, which were chosen randomly as explained in Section 4.1. All lines are resized to a height of 128 pixels. The only preprocessing applied to the line images is binarization using Otsu thresholding. The conventional Character Error Rate (CER) and Word Error Rate (WER) metrics, based on the Levenshtein distance, are used to evaluate the results.
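The two metrics can be sketched as follows; the function names are ours, and the standard definitions are assumed: edit distance over characters for CER, over whitespace-separated tokens for WER, each normalized by the reference length.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: character edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / reference word count."""
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())
```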
Table 1 Recognition results of the baseline system using different decoding techniques. WBS (words) is the word beam search algorithm without language model scoring, but limiting the predictions to valid words. As the lexicon, we used a 260K-word general-purpose lexicon of the era.
Decoding Method   CER% Validation   CER% Test   WER% Validation   WER% Test
Greedy            36.55             36.86       86.01             86.73
Beam Search       36.81             37.35       86.32             86.91
WBS               -                 3.71        -                 16.67
Alternative decoding methods are tested on the CTC output of the baseline system to generate the recognized text. The same CTC output matrix is decoded with the greedy decoder, the beam search decoder, and the WBS decoder in a number of experiments. A beam width of 50 is used whenever a beam search decoder is involved.
Table 1 presents the results of the baseline system. As can be seen, the baseline system with a greedy decoder obtains a 36.86% CER and an 86.73% WER on the 1,410 test samples. Using a beam search decoder does not improve the results. We conclude that these general-purpose algorithms are not very useful without language modeling. On the other hand, Word Beam Search, which restricts predictions to valid words, significantly improves these results: it achieves a 3.71% CER and a 16.67% WER using a large lexicon of the Ottoman period. While WBS also offers the option of using n-gram word statistics, it is not used in this experiment.
Table 2 presents the experiments evaluating the effects of the corpus and language model on the WBS algorithm. Here, different corpora are used to generate the recognition lexicon. The n-gram statistics used for scoring the beams in the N-gram and N-gram + forecast modes of WBS are calculated from the same corpus.
In the best case, when the corpus contains only the transcriptions of the test set, the system obtains the lowest CER and WER, 2.25% and 6.42% respectively. Using WBS in N-gram mode (2nd row), where the beams are scored according to uni- and bi-gram statistics calculated on the corpus, has no effect on the CER but increases the WER to 7.60% without forecasting mode. In fact, in all of the WBS experiments, integrating N-gram statistics into the decoding process, with or without forecasting, fails to improve the results, possibly due to the relatively small test set.
When the train set transcriptions are added to the corpus of test set transcriptions, the performance decreases slightly in all three WBS modes. The results obtained with the WBS N-gram and N-gram + forecast modes are 8.70% WER and 7.55% WER, respectively. While the results obtained with the test set lexicon are very good, they are not realistic, as they assume knowledge of the test set lexicon.
The other two results in the table correspond to cases where the corpus does not cover all of the test lexicon. When the train set transcriptions are used as the corpus, the coverage of the test set is 58%. In that case, the character error rate increases to 6.81% in words mode, 7.16% in N-gram mode, and 6.80% in N-gram + forecast mode. A more dramatic increase is observed in the WER, which is over 28% for all modes.
Finally, decoding with a large corpus that has 77% coverage of the test set results in a CER of 3.71% and a WER of 16.67% in the words mode of WBS. The n-gram mode does not bring an improvement, with a CER of 4.10% and a WER of 17.27%. However, using forecasting with N-gram slightly improves over words mode, with a CER of 3.68% and a WER of 16.61%.
6 Discussion
Based on the results obtained from the experiments, we can conclude that using a recognition lexicon via the WBS decoder improves recognition accuracy significantly. The baseline results with both the greedy and beam search decoders are over 36% CER and 86% WER on the test set. Both figures decrease dramatically when the output is restricted to a word list derived from an appropriate corpus. The amount of decrease is proportional to how much
Table 2 Recognition results from the WBS experiments on the test set.
Corpus                 Lexicon Size   Test Coverage%   WBS mode            CER%   WER%
Test set               5,542          100              words               2.25   6.42
                                                       N-gram              2.41   7.60
                                                       N-gram + forecast   2.27   6.46
Train set              23,793         58.4             words               6.81   28.26
                                                       N-gram              7.16   28.49
                                                       N-gram + forecast   6.80   28.21
Test set + Train set   29,335         100              words               2.50   7.46
                                                       N-gram              2.70   8.70
                                                       N-gram + forecast   2.52   7.55
Large corpus           260,070        77               words               3.71   16.67
                                                       N-gram              4.10   17.27
                                                       N-gram + forecast   3.68   16.61
of the test set vocabulary is covered by the corpus. In the best case, when the corpus is the test set transcription itself, the performance is 2.25% CER and 6.42% WER. In the more realistic setting where a large lexicon of 260K words is used, the CER becomes 3.71%, which is very close to the best case. However, the WER increases by over 10 points, to 16.67%. This can be explained by the relatively low (77%) coverage of the test vocabulary. It can be deduced that achieving a lower WER would require an even larger lexicon.
Although there are approximately 1.76M words in the large corpus, the average frequency of a word is 6.7, and some 16.7K words have three or fewer occurrences. Given those figures, the corpus size is not sufficient to provide useful n-gram statistics. Indeed, in all of the WBS experiments in Table 2, the use of N-gram statistics, with or without forecasting, fails to improve the results over lexicon-only mode. A larger corpus could help integrate reliable n-gram statistics into the decoding process for agglutinative languages like Turkish, which suffer from the vocabulary explosion problem [51].
Even with a corpus large enough to compute robust n-gram statistics, a recognition lexicon with acceptable coverage of everyday vocabulary can easily become impractically large: a vast number of new words can be produced by adding suffixes to a word stem, which makes it impossible to store all valid words explicitly. For recognition-lexicon-based solutions such as Word Beam Search to be useful for agglutinative languages like Turkish, sub-word units can be stored in the prefix tree instead of whole words. A similar approach is to represent valid words as stems and suffixes within a finite state automaton (FSA), as in [51]. In future work, we plan to modify the WBS method to represent words as stems and suffixes, so that it adapts better to Turkish.
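The planned stem-plus-suffix lexicon can be pictured as a prefix tree whose stem-final nodes continue into a shared suffix tree, so inflected forms are stored once per stem rather than once per surface word. The sketch below is illustrative only: the stems and suffixes are toy examples, it handles single-suffix words, and the accepts() check stands in for the beam-state scoring an actual WBS integration would need.

```python
class SubwordTrie:
    """Prefix tree over stems whose stem-final nodes link into a shared
    suffix tree, so e.g. 'kitap' and 'kitaplar' share one stored stem."""

    def __init__(self, stems, suffixes):
        self.root = {}
        self.suffix_root = {}
        for s in suffixes:
            self._insert(self.suffix_root, s)
        for s in stems:
            node = self._insert(self.root, s)
            node["#suffixes"] = self.suffix_root  # continue into suffix tree

    @staticmethod
    def _insert(root, word):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#end"] = True
        return node

    def accepts(self, word):
        """True if word is a bare stem or a stem followed by one suffix."""
        node = self.root
        for i, ch in enumerate(word):
            if ch not in node:
                return False
            node = node[ch]
            if node.get("#end"):
                if i == len(word) - 1:
                    return True  # bare stem
                if "#suffixes" in node and self._match(node["#suffixes"], word[i + 1:]):
                    return True  # stem + suffix
        return False

    @staticmethod
    def _match(root, rest):
        node = root
        for ch in rest:
            if ch not in node:
                return False
            node = node[ch]
        return node.get("#end", False)


trie = SubwordTrie(stems=["kitap", "defter"], suffixes=["ler", "lar", "larda"])
print(trie.accepts("kitaplar"), trie.accepts("defterler"), trie.accepts("kalem"))
# True True False
```

Compared to a flat word lexicon, storage grows with the number of stems plus suffixes rather than with their product; handling suffix chains would amount to letting end nodes of the suffix tree loop back into it.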
To the best of our knowledge, our system is the first to transcribe the Ottoman script into the modern Turkish script automatically, so we cannot compare our results with any previous work.
7 Conclusion
In this paper, we presented an automatic transcription system for printed Ottoman documents. We proposed a CNN-BLSTM-CTC network architecture and evaluated several decoding strategies for the output of the network. Our system obtains 3.68% CER and 16.61% WER on a test set of 1.4K line images using the Word Beam Search decoder with a 260K-word lexicon. We showed that although using a recognition lexicon to restrict the output of the transcription system improves overall accuracy, it is not an optimal solution for Turkish due to its agglutinative nature. As a remedy, we plan to modify the WBS method to represent words as stems and suffixes, adapting it better to Turkish in future work.
Springer Nature 2021 LaTeX template
References
[1] http://miletos.co/tr/showcase/ottoman-ocr (accessed November 13, 2022)
[2] https://www.osmanlica.com/ (accessed November 13, 2022)
[3] Ahmad, I., Mahmoud, S.A., Fink, G.A.:
Open-vocabulary recognition of machine-
printed arabic text using hidden markov
models. Pattern Recognit. 51, 97–111 (2016)
[5] Ahmad, R., Naz, S., Afzal, M.Z., Rashid,
S.F., Liwicki, M.: A deep learning based
arabic script recognition system: benchmark
on KHAT. Int. Arab J. Inf. Technol. 17(3),
299–305 (2020)
[6] Ahmad, R., Naz, S., Afzal, M.Z., Rashid,
S.F., Liwicki, M., Dengel, A.: KHATT: A
deep learning benchmark on arabic script.
In: 6th International Workshop on Multilin-
gual OCR, 14th IAPR International Con-
ference on Document Analysis and Recog-
nition, MOCR@ICDAR 2017, Kyoto, Japan,
November 9-15, 2017. pp. 10–14. IEEE (2017)
[7] Ahmed, I., Mahmoud, S., Parvez, M.: Printed
Arabic Text Recognition, pp. 147–168 (01
2012)
[8] Al-Badr, B., Mahmoud, S.A.: Survey and bibliography of arabic optical text recognition. Signal Process. 41(1), 49–77 (1995)
[9] Al-Helali, B.M., Mahmoud, S.A.: Arabic
online handwriting recognition (AOHR): A
survey. ACM Comput. Surv. 50(3), 33:1–
33:35 (2017)
[10] Alrobah, N.A., Albahli, S.: Arabic handwrit-
ten recognition using deep learning: A survey.
Arabian Journal for Science and Engineering
(2022)
[11] Ameryan, M., Schomaker, L.: A limited-
size ensemble of homogeneous cnn/lstms for
high-performance word classification. Neural
Comput. Appl. 33(14), 8615–8634 (2021)
[12] Arifoglu, D., Sahin, E., Adiguzel, H.,
Duygulu, P., Kalpakli, M.: Matching islamic
patterns in kufic images. Pattern Anal. Appl.
18(3), 601–617 (2015)
[13] Aydemir, M.S., Aydin, B., Kaya, H., Karliaga, I., Demir, C.: Tübitak turkish - ottoman handwritten recognition system. In: 2014 22nd Signal Processing and Communications Applications Conference (SIU), Trabzon, Turkey, April 23-25, 2014. pp. 1918–1921. IEEE (2014)
[14] Biadsy, F., El-Sana, J., Habash, N.: Online
arabic handwriting recognition using hidden
markov models (2006)
[15] Breuel, T.M., Ul-Hasan, A., Azawi, M.I.A.A.,
Shafait, F.: High-performance OCR for
printed english and fraktur using LSTM net-
works. In: 12th International Conference on
Document Analysis and Recognition, ICDAR
2013, Washington, DC, USA, August 25-28,
2013. pp. 683–687. IEEE Computer Society
(2013)
[16] Cai, J., Peng, L., Tang, Y., Liu, C., Li,
P.: TH-GAN: generative adversarial network
based transfer learning for historical chi-
nese character recognition. In: 2019 Inter-
national Conference on Document Analysis
and Recognition, ICDAR 2019, Sydney, Aus-
tralia, September 20-25, 2019. pp. 178–183.
IEEE (2019)
[17] Can, E.F., Duygulu, P.: A line-based rep-
resentation for matching words in historical
manuscripts. Pattern Recognit. Lett. 32(8),
1126–1138 (2011)
[18] Can, E.F., Duygulu, P., Can, F., Kalpakli,
M.: Redif extraction in handwritten ottoman
literary texts. In: 20th International Con-
ference on Pattern Recognition, ICPR 2010,
Istanbul, Turkey, 23-26 August 2010. pp.
1941–1944. IEEE Computer Society (2010)
[19] Carbune, V., Gonnet, P., Deselaers, T., Row-
ley, H.A., Daryin, A.N., Calvo, M., Wang,
L., Keysers, D., Feuz, S., Gervais, P.: Fast
multi-language lstm-based online handwrit-
ing recognition. Int. J. Document Anal.
Recognit. 23(2), 89–102 (2020)
[20] Chammas, E., Mokbel, C., Likforman-Sulem,
L.: Handwriting recognition of historical doc-
uments with few labeled data. In: 13th IAPR
International Workshop on Document Anal-
ysis Systems, DAS 2018, Vienna, Austria,
April 24-27, 2018. pp. 43–48. IEEE Computer
Society (2018)
[21] Chherawala, Y., Roy, P.P., Cheriet, M.:
Feature design for offline arabic handwrit-
ing recognition: Handcrafted vs automated?
In: 12th International Conference on Doc-
ument Analysis and Recognition, ICDAR
2013, Washington, DC, USA, August 25-28,
2013. pp. 290–294. IEEE Computer Society
(2013)
[22] Clanuwat, T., Lamb, A., Kitamoto, A.:
Kuronet: Pre-modern japanese kuzushiji
character recognition with deep learning. In:
2019 International Conference on Document
Analysis and Recognition, ICDAR 2019, Syd-
ney, Australia, September 20-25, 2019. pp.
607–614. IEEE (2019)
[23] Colutto, S., Kahle, P., Hackl, G., Mühlberger,
G.: Transkribus. A platform for automated
text recognition and searching of historical
documents. In: 15th International Conference
on eScience, eScience 2019, San Diego, CA,
USA, September 24-27, 2019. pp. 463–466.
IEEE (2019)
[24] Dolek, I., Kurt, A.: A deep learning model
for ottoman OCR. Concurr. Comput. Pract.
Exp. 34(20) (2022)
[25] Dutta, K., Krishnan, P., Mathew, M., Jawa-
har, C.V.: Improving CNN-RNN hybrid net-
works for handwriting recognition. In: 16th
International Conference on Frontiers in
Handwriting Recognition, ICFHR 2018, Nia-
gara Falls, NY, USA, August 5-8, 2018. pp.
80–85. IEEE Computer Society (2018)
[26] Duygulu, P., Arifoglu, D., Kalpakli, M.:
Cross-document word matching for segmen-
tation and retrieval of ottoman divans. Pat-
tern Anal. Appl. 19(3), 647–663 (2016)
[27] Ergin, M.: Türk Dil Bilgisi. Boğaziçi Yayınları, İstanbul (2020)
[28] Graves, A., Fernández, S., Gomez, F.J.,
Schmidhuber, J.: Connectionist tempo-
ral classification: labelling unsegmented
sequence data with recurrent neural net-
works. In: Cohen, W.W., Moore, A.W.
(eds.) Machine Learning, Proceedings of
the Twenty-Third International Conference
(ICML 2006), Pittsburgh, Pennsylvania,
USA, June 25-29, 2006. ACM International
Conference Proceeding Series, vol. 148, pp.
369–376. ACM (2006)
[29] Graves, A., Fernández, S., Liwicki, M.,
Bunke, H., Schmidhuber, J.: Unconstrained
on-line handwriting recognition with recur-
rent neural networks. In: Platt, J.C., Koller,
D., Singer, Y., Roweis, S.T. (eds.) Advances
in Neural Information Processing Systems
20, Proceedings of the Twenty-First Annual
Conference on Neural Information Process-
ing Systems, Vancouver, British Columbia,
Canada, December 3-6, 2007. pp. 577–584.
Curran Associates, Inc. (2007)
[30] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel
connectionist system for unconstrained hand-
writing recognition. IEEE Trans. Pattern
Anal. Mach. Intell. 31(5), 855–868 (2009)
[31] Graves, A., Schmidhuber, J.: Offline hand-
writing recognition with multidimensional
recurrent neural networks. In: Koller, D.,
Schuurmans, D., Bengio, Y., Bottou, L.
(eds.) Advances in Neural Information Pro-
cessing Systems 21, Proceedings of the
Twenty-Second Annual Conference on Neural
Information Processing Systems, Vancouver,
British Columbia, Canada, December 8-11,
2008. pp. 545–552. Curran Associates, Inc.
(2008)
[32] Hwang, K., Sung, W.: Character-level incre-
mental speech recognition with recurrent
neural networks. In: 2016 IEEE International
Conference on Acoustics, Speech and Signal
Processing, ICASSP 2016, Shanghai, China,
March 20-25, 2016. pp. 5335–5339. IEEE
(2016)
[33] Jain, M., Mathew, M., Jawahar, C.V.: Uncon-
strained scene text and video text recognition
for arabic script. In: 1st International Work-
shop on Arabic Script Analysis and Recogni-
tion, ASAR 2017, Nancy, France, April 3-5,
2017. pp. 26–30. IEEE (2017)
[34] Khoury, I., Giménez, A., Juan, A., Andrés-Ferrer, J.: Window repositioning for printed
arabic recognition. Pattern Recognit. Lett.
51, 86–93 (2015)
[35] Kızılırmak, F.: Offline Handwriting Recogni-
tion using Deep Learning with Emphasis on
Data Augmentation Effects. Master’s thesis,
Sabanci University (2022)
[36] Lamb, A., Clanuwat, T., Kitamoto, A.:
Kuronet: Regularized residual u-nets for end-
to-end kuzushiji character recognition. SN
Comput. Sci. 1(3), 177 (2020)
[37] Le, A.D., Clanuwat, T., Kitamoto, A.: A
human-inspired recognition system for pre-
modern japanese historical documents. IEEE
Access 7, 84163–84169 (2019)
[38] Lorigo, L.M., Govindaraju, V.: Offline ara-
bic handwriting recognition: A survey. IEEE
Trans. Pattern Anal. Mach. Intell. 28(5),
712–724 (2006)
[39] Martínek, J., Lenc, L., Král, P., Nicolaou,
A., Christlein, V.: Hybrid training data
for historical text OCR. In: 2019 Inter-
national Conference on Document Analysis
and Recognition, ICDAR 2019, Sydney, Aus-
tralia, September 20-25, 2019. pp. 565–570.
IEEE (2019)
[40] Memon, J., Sami, M., Khan, R.A., Uddin,
M.: Handwritten optical character recogni-
tion (OCR): A comprehensive systematic
literature review (SLR). IEEE Access 8,
142642–142668 (2020)
[41] Mori, S., Suen, C.Y., Yamamoto, K.: Histori-
cal review of OCR research and development.
Proc. IEEE 80(7), 1029–1058 (1992)
[42] Natarajan, P., Lu, Z., Schwartz, R.M., Bazzi,
I., Makhoul, J.: Multilingual machine printed
OCR. Int. J. Pattern Recognit. Artif. Intell.
15(1), 43–63 (2001)
[43] PourReza, M., Derakhshan, R., Fayyazi, H.,
Sabokrou, M.: Sub-word based persian OCR
using auto-encoder features and cascade clas-
sifier. In: 9th International Symposium on
Telecommunications, IST 2018, Tehran, Iran,
December 17-19, 2018. pp. 481–485. IEEE
(2018)
[44] Puigcerver, J.: Are multidimensional recur-
rent layers really necessary for handwritten
text recognition? In: 14th IAPR Interna-
tional Conference on Document Analysis and
Recognition, ICDAR 2017, Kyoto, Japan,
November 9-15, 2017. pp. 67–72. IEEE (2017)
[45] Rahal, N., Tounsi, M., Hussain, A., Alimi,
A.M.: Deep sparse auto-encoder features
learning for arabic text recognition. IEEE
Access 9, 18569–18584 (2021)
[46] Rahmati, M., Fateh, M., Rezvani, M., Tajary,
A., Abolghasemi, V.: Printed persian OCR
system using deep learning. IET Image Pro-
cess. 14(15), 3920–3931 (2020)
[47] Rashid, S.F., Schambach, M., Rottland, J., von der Nüll, S.: Low resolution arabic recognition with multidimensional recurrent neural networks. In: Govindaraju, V., Natarajan, P., Chaudhury, S., Lopresti, D.P., Setlur, S., Cao, H. (eds.) Proceedings of the 4th International Workshop on Multilingual OCR, MOCR@ICDAR 2013, Washington, D.C., USA, August 24, 2013. pp. 6:1–6:5. ACM (2013)
[48] Scheidl, H., Fiel, S., Sablatnig, R.: Word
beam search: A connectionist temporal classi-
fication decoding algorithm. In: 16th Interna-
tional Conference on Frontiers in Handwrit-
ing Recognition, ICFHR 2018, Niagara Falls,
NY, USA, August 5-8, 2018. pp. 253–258.
IEEE Computer Society (2018)
[49] Shi, B., Bai, X., Yao, C.: An end-to-end
trainable neural network for image-based
sequence recognition and its application to
scene text recognition. IEEE Trans. Pattern
Anal. Mach. Intell. 39(11), 2298–2304 (2017)
[50] Slimane, F., Zayene, O., Kanoun, S., Alimi,
A.M., Hennebert, J., Ingold, R.: New features
for complex arabic fonts in cascading recog-
nition system. In: Proceedings of the 21st
International Conference on Pattern Recogni-
tion, ICPR 2012, Tsukuba, Japan, November
11-15, 2012. pp. 738–741. IEEE Computer
Society (2012)
[51] Tasdemir, E.F.B., Yanikoglu, B.A.: Large
vocabulary recognition for online turkish
handwriting with sublexical units. Turkish J.
Electr. Eng. Comput. Sci. 26(5), 2218–2233
(2018)
[52] Timurtaş, F.K.: Osmanlı Türkçesi Grameri III. Alfa, İstanbul (2017)
[53] Yanikoglu, B.A., Kholmatov, A.: Turkish
handwritten text recognition: a case of agglu-
tinative languages. In: Kanungo, T., Smith,
E.H.B., Hu, J., Kantor, P.B. (eds.) Document
Recognition and Retrieval X, Santa Clara,
California, USA, January 22-23, 2003, Pro-
ceedings. SPIE Proceedings, vol. 5010, pp.
227–233. SPIE (2003)
[54] Yousef, M., Bishop, T.E.: Origaminet:
Weakly-supervised, segmentation-free, one-
step, full page text recognition by learning to
unfold. CoRR abs/2006.07491 (2020)