How Well Does Multiple OCR Error Correction Generalize?
William B. Lunda, Eric K. Ringgerb, and Daniel D. Walkerc
aHarold B. Lee Library, Brigham Young University, Provo, UT 84602, USA
bComputer Science Department, Brigham Young University, Provo, UT 84602, USA
cMicrosoft, Redmond, WA 98052, USA
ABSTRACT
As the digitization of historical documents, such as newspapers, becomes more common, the need of the archive patron
for accurate digital text from those documents increases. Building on our earlier work, the contributions of this paper are:
1. demonstrating the applicability of novel methods for correcting optical character recognition (OCR) on disparate data
sets, including a new synthetic training set, 2. enhancing the correction algorithm with novel features, and 3. assessing the
data requirements of the correction learning method. First, we correct errors using conditional random fields (CRF) trained
on synthetic training data sets in order to demonstrate the applicability of the methodology to unrelated test sets. Second,
we show the strength of lexical features from the training sets on two unrelated test sets, yielding a relative reduction in
word error rate on the test sets of 6.52%. New features capture the recurrence of hypothesis tokens and yield an additional
relative reduction in WER of 2.30%. Further, we show that only 2.0% of the full training corpus of over 500,000 feature
cases is needed to achieve correction results comparable to those using the entire training corpus, effectively reducing both
the complexity of the training process and the learned correction model.
Keywords: Historical Documents, Optical Character Recognition, OCR Error Correction, Ensemble Methods
1. INTRODUCTION
Historical machine printed document images often exhibit significant noise, making the optical character recognition
(OCR) of the text difficult. Our previous work1,2 shows that it is possible for combined outputs from multiple OCR
engines using machine learning techniques to provide text output with a lower word error rate (WER) than the OCR of any
one OCR engine alone. Further, we use methods which are scalable to very large collections, up to millions of images,
without document- or test corpus-specific manipulation of training data, which would be infeasible given time and resource
constraints.
Ensemble methods are used effectively in a variety of problems such as machine translation, speech recognition, hand-
writing recognition, and OCR error correction, to name a few. In a paper on pattern recognition frameworks for ensemble
methods, Kittler et al.3 state: “It had been observed ... that although one of the [classifiers] would yield the best perfor-
mance, the sets of patterns mis-classified by the different classifiers would not necessarily overlap. This suggested that
different classifier designs potentially offered complementary information about the patterns to be classified which could
be harnessed to improve the performance of the selected classifier.” Previously we have merged complementary informa-
tion such as the output of multiple OCR engines2 and multiple binarizations of the same document image1 on a single test
set and training set. The goal of this paper is to demonstrate the generalizability of the methods involving multiple OCR
engines, to introduce a new test set and a new training set, to show the results of feature engineering, and to demonstrate the
degree to which a large training set may be reduced and still yield results consistent with the full training set.
The remainder of this paper proceeds as follows: Section 2 discusses existing work in several fields related to the
methods and outcomes of this research. Section 3 presents a brief overview of the methods used to extract corrected text
with machine learning techniques, leading to the heart of the paper: the evaluation of the extent to which these methods
are applicable across multiple test corpora and synthetic training sets in Section 4. Finally, Section 5 summarizes the
conclusions of this research.
Further author information: (Send correspondence to William Lund)
William Lund: E-mail: bill lund (at) byu.edu, Telephone: +1 801 422-4202
Eric Ringger: E-mail: ringger (at) cs.byu.edu
2. RELATED WORK
Extracting usable text from older, degraded documents is often unreliable, frequently to the point of being unusable.4 Kae
and Learned-Miller5remind us that OCR is not a solved problem and that “the goal of transcribing documents completely
and accurately... is still far off.” At some point the word error rate of the OCR output inhibits the ability of the user to
accomplish useful tasks.
Ensemble methods are used with success in this task as well as in a variety of settings. In 1998 Kittler et al.3 provided
a common theoretical framework for combining classifiers which is the basis for much of the work in ensemble methods.
In off-line handwriting recognition, Bertolami and Bunke6 use ensemble methods in the language model. Si et al.7 use
an ensemble of named entity recognizers to improve overall recognition in the bio-medical field. For machine translation,
Macherey and Och8 present a study of how different machine translation systems affect the quality of the machine translation
of the ensemble. Maximum entropy models have been used previously to select among multiple parses returned by a
generative model.9
Klein and Kopel10 as well as Cecotti and Belaïd11 note that the differences between OCR outputs can be used to
advantage. This observation is behind the success of ensemble methods, that multiple systems which are complementary
can be leveraged for an improved combined output. The question of how many inputs should be used in an ensemble system
is generally “the more the better.” Caruana et al.12 use on the order of 2000 models built by varying the parameters of the
training system to create different models. On a smaller number of inputs (five OCR engines), Lund et al.13 demonstrate
that the error rate of the ensemble decreases with each added system. It should be noted that the complementarity14 of
correct responses of the methods is critical. An important point is that even high error rate systems added to an ensemble
can contribute to reducing the ensemble error rate where the addition represents new cases or information not included
previously. Diversity in the ensemble is critical to improving the system’s performance over that of any individual in the
ensemble.15–18 This paper will expand on the observation regarding complementary sources, noting that one useful source
of diversity is the output of multiple OCR engines.
Necessary for our post-OCR error correction using multiple sequences is an alignment of the text sequences, which can
either use exact or approximate algorithms. The multiple sequence alignment problem has been shown to be NP-Hard by
Wang and Jiang.19 Lund and Ringger2 demonstrate an efficient means for exact alignment; however, alignment problems
on long sequences are still computationally intractable. Much of the work in multiple sequence alignment is done
in the field of bioinformatics, where the size of the alignment problems has forced the discovery and adoption of heuristic
solutions such as progressive alignment.20 Elias21 discusses how a simple edit distance metric, which may be appropriate
to text operations, is not directly applicable to biological alignment problems, which means that much of the alignment
work in bioinformatics requires some adaptation for use in the case of text.
It is well known from work by Lopresti and Zhou22 that voting among multiple sequences generated by the same OCR
engine can significantly improve OCR. One practical application of voting can be found in the Medical Article Record
System (MARS) of the National Library of Medicine (NLM) which uses a voting OCR server for text recognition.23
Esakov, Lopresti, and Sandberg24 evaluated recognition errors of OCR systems, and Kolak, Byrne, and Resnik25 specifi-
cally applied their algorithms to OCR systems for post-OCR error correction in natural language processing tasks. OCR
error correction with in-domain training26 as well as out-of-domain training using a synthetic training dataset13 have been
shown to be effective.
Recent work by Yamazoe et al.27 effectively uses multiple weighted finite-state transducers (WFST) with both the OCR
and a lexicon of the target language(s) to resolve the ambiguity inherent in line- and character-segmentation, and character
recognition, in which the number of combinations can be very large. Both conventional OCR and post-processing are
contained within their system, resolving the difference between various hypotheses before committing to an output string.
A non-ensemble method for improving historical document images prior to OCR is adaptive binarization. Our previous
work1 compared the OCR results of an ensemble of document image binarizations to the results of adaptive binarization.
From the perspective of the corpus WER, the results of the ensemble methods were superior to those of individual document
image adaptive binarizations.
A contribution of this paper is an extension of previous work in supervised, discriminative machine learning methods
to choose among all hypotheses, in which those methods are shown to be effective on two unrelated historical print
corpora and two unrelated training datasets. In this work the models are learned on synthetic, out-of-domain training data
sets, created and computationally degraded according to the methods proposed by Sarkar, Baird, and Zhang28 and Baird.29
Figure 1. The methodology used in this paper to: create the training set and CRF model, prepare the test set for processing by the CRF,
and evaluate the results using the test set transcription.
3. METHODOLOGY
The first step of our methodology prepares the test and training corpora, scanning document images and creating the
synthetic training sets. (See Figure 1 for a flow chart of this process.) Once the document images have been scanned, the
images are recognized with the selected OCR engines; in this case those are Abbyy FineReader 10, OmniPage 18, Adobe
Acrobat Pro X, and Tesseract 3 (an open source OCR system). The baseline OCR results of each OCR engine are seen in
Table 1.

Table 1. Baseline corpus word error rates using micro-averaging for datasets by OCR engine. These results are on the original document
images without modification. Note that WERs of greater than 100% are possible due to multiple insertions not found in the reference
text.
                                     Abbyy FineReader 10   OmniPage 18   Adobe Pro X   Tesseract 3
Test Set Corpora
  Eisenhower Communiqués                   19.98%             30.50%        52.59%        93.99%    (Average WER: 49.26%)
  19th Century Mormon Article
    Newspaper Index                         7.44%             11.77%        23.49%        18.35%    (Average WER: 15.26%)
Training Set Corpora
  Enron Synthetic Dataset                  24.31%             30.57%        68.95%        56.07%
  Reuters-21578 Dataset                    15.35%             20.28%        99.77%        82.37%
Figure 2. From “Communiqué No. 1” of the Eisenhower Communiqués. The word error rate of this document across the five OCR
engines used in this research varied from 10.63% to 63.41%, with a mean WER of 36.34%.
The OCR output is character aligned, yielding parallel hypotheses from the OCR engines. Where spaces occur
in the aligned texts, the process creates a column of text hypotheses. From these columns, features used by the machine
learner are extracted as described in Section 3.3. The machine learner, a trained conditional random field (CRF), labels
the aligned hypotheses with the OCR engine to be selected or “NONE”, indicating that no output for the column should
be selected. The CRF models are trained using the training sets, prepared similarly to the test corpora. Collectively, the
text associated with each label from all columns constitutes the error corrected output. (See Figure 4 for an example.) The
following sections describe this process in more detail.
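As a minimal sketch of this final assembly step, the following Python fragment turns a sequence of per-column labels into error corrected text, using the columns of Figure 4 as input; the function and variable names are illustrative, not taken from our implementation:

# Each aligned column maps an OCR engine name to its hypothesis token
# ("" for a gap), and the model has labeled each column with an engine
# name or "NONE".
def assemble_corrected_text(columns, labels):
    """Concatenate the hypothesis chosen for each column; 'NONE' emits nothing."""
    tokens = []
    for column, label in zip(columns, labels):
        if label == "NONE":
            continue                      # model chose to output nothing
        token = column.get(label, "")
        if token:                         # skip empty (gap-only) hypotheses
            tokens.append(token)
    return " ".join(tokens)

# The columns of Figure 4:
columns = [
    {"Abbyy": "Xfee", "OmniPage": "The", "Tesseract": "The", "Adobe": "The"},
    {"Abbyy": "Mormon", "OmniPage": "Morrison", "Tesseract": "Mormon", "Adobe": "lUornton"},
    {"Abbyy": "!.<", "OmniPage": "Ergistature", "Tesseract": "Lrgnslauure", "Adobe": "Lt-~~&aslature"},
    {"Abbyy": "gittiamre", "OmniPage": "", "Tesseract": "", "Adobe": ""},
    {"Abbyy": "Called", "OmniPage": "Called", "Tesseract": "Called", "Adobe": "()ailed"},
]
labels = ["OmniPage", "Abbyy", "NONE", "OmniPage", "Abbyy"]
print(assemble_corrected_text(columns, labels))  # -> "The Mormon Called"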
3.1 Corpora
Four datasets were used in this work: two test sets, the Eisenhower Communiqués30 and the Nineteenth Century Mormon
Article Newspaper Index;31 and two training sets, an extraction of the 2001 Topic Annotated Enron Email Data Set and an
extraction of the Reuters-21578 Text Categorization Test Collection.32,33 The following sections describe each dataset and
how it was created.
3.1.1 Eisenhower Communiqués
The Eisenhower Communiqués30 are a collection of 605 facsimiles of typewritten documents created by the Supreme
Headquarters Allied Expeditionary Force (SHAEF) during the last years of World War II. Having been typewritten and
duplicated using carbon paper, the quality of the print is poor. (See Figure 2 for an example.) A manual transcription of
these documents serves as the gold standard for evaluating the word error rates of the OCR. In the course of duplication
the documents have effectively become bi-tonal and are treated as such by this research.
An online presentation of The Eisenhower Communiqués is viewable at http://www.lib.byu.edu/digital/eisenhower.
3.1.2 Nineteenth Century Mormon Article Newspaper Index
Figure 3. A fragment from the grayscale scan of document BM 24May1859 p2 c3 from the 19th Century Mormon Newspaper
Article collection.31

The Nineteenth Century Mormon Article Newspaper Index31 (19thCMAN) corpus is a collection of 1055 color
images of articles dealing with events and persons of The Church of Jesus Christ of Latter-day Saints (Mormon) from
an archive of historical newspapers of the 19th century housed at the Harold B. Lee Library of Brigham Young
University. As expected from 19th century newsprint, the quality of the paper and the print was poor when first
printed and has further degraded over time. The newspapers were scanned at 400 dots per inch (dpi) in 24-bit RGB
color, and the individual articles were segmented and saved as TIFF images. For previous work the RGB images were
converted to 8-bit grayscale. The OCR output of each document was manually corrected by two reviewers to act as a gold
standard. An example from the document corpus can be seen in Figure 3.

An online presentation of The Nineteenth Century Mormon Article Newspaper Index is viewable at
http://lib.byu.edu/digital/19cMormonArticles.
3.1.3 Synthetic Training Sets
For the training sets, we created synthetic data sets from the 2001 Topic Annotated Enron Email Data Set34 and the
Reuters-21578 Text Categorization Test Collection.32 From the digital text of each document in the training corpora, a TIFF
document image was generated and randomly degraded using techniques inspired by Baird29 and Sarkar et al.28 Each
synthetic document image was produced using the following steps. First an image is rendered as a bi-tonal document.
Spatial sampling error is introduced by translating the entire image stochastically. The image is blurred using a Gaussian
convolution kernel. The document is sub-sampled. Gaussian noise is added to simulate pixel sensor sensitivity. To binarize
the document, a threshold is applied. For further details on the process, please consult the paper by Walker, Lund, and
Ringger (2012).33
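As a minimal sketch of this degradation sequence, the following Python fragment applies the steps to a bi-tonal image array; the parameter values (shift range, blur sigma, sub-sampling rate, noise level, threshold) are illustrative assumptions, not the settings used to build the training sets:

import numpy as np
from scipy.ndimage import gaussian_filter, shift

def degrade(bitonal, rng):
    img = bitonal.astype(float)                      # 1.0 = ink, 0.0 = paper
    # Spatial sampling error: stochastic sub-pixel translation of the image.
    img = shift(img, rng.uniform(-0.5, 0.5, size=2), order=1, mode="nearest")
    # Blur with a Gaussian convolution kernel.
    img = gaussian_filter(img, sigma=1.0)
    # Sub-sample the document (here: keep every other pixel).
    img = img[::2, ::2]
    # Gaussian noise simulating pixel sensor sensitivity.
    img = img + rng.normal(0.0, 0.1, size=img.shape)
    # Binarize by applying a threshold.
    return (img > 0.5).astype(np.uint8)

rng = np.random.default_rng(0)
page = np.zeros((200, 200), dtype=np.uint8)          # stand-in for a rendered page
degraded = degrade(page, rng)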
3.2 Progressive Alignment
Since exact n-way alignments become exponentially complex in n, we turned to greedy progressive alignment heuristics,
which are applied successfully in bioinformatics20 and textual variance analysis.35 In brief, progressive alignment algo-
rithms begin by selecting two sequences to be aligned that are most similar based on some similarity measure applied to all
sequences. Additional sequences are aligned, using the same selection criteria as for the first two, until all sequences have
been aligned. (Refer to Spencer et al.35 for details on progressive alignment in a textual context.) The order of pairwise
alignments is specified in a binary tree structure called the guide tree. Due to downstream consequences of greedy choices,
a progressive alignment heuristic is not optimal; however, the resulting alignments are good in practice.
In this paper, the order of the alignment, unless indicated otherwise, is a greedy approximation of the guide tree based
on sequence similarity of the training set; specifically in the order: Abbyy FineReader and OmniPage, then Tesseract,
and lastly Adobe Acrobat Pro X. The incremental results of this alignment order on the WER can be seen in Table 4.
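The following simplified Python sketch illustrates the idea: each new sequence is aligned pairwise against the current reference row, and the resulting gaps are propagated to all previously aligned rows. A real implementation would order the merges by a guide tree and align against a profile; the unit-cost alignment and fixed merge order here are simplifying assumptions.

def pairwise_align(a, b):
    """Needleman-Wunsch alignment with unit costs; returns index pairs into
    a and b, with None marking a gap (INDEL)."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:   # trace back through the cost matrix
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    return pairs[::-1]

def progressive_align(sequences):
    """Greedily fold each sequence into the alignment; rows are character
    lists with '-' for gaps, and the first row doubles as the reference."""
    rows = [list(sequences[0])]
    for seq in sequences[1:]:
        ref = ''.join(rows[0])              # current reference, gaps included
        pairs = pairwise_align(ref, seq)
        new_rows = [[] for _ in rows]
        new_row = []
        for i, j in pairs:
            for row, out in zip(rows, new_rows):
                out.append(row[i] if i is not None else '-')
            new_row.append(seq[j] if j is not None else '-')
        rows = new_rows + [new_row]
    return [''.join(r) for r in rows]

rows = progressive_align(["The Mormon", "Tfe Mormon", "The Morrison"])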
3.3 Assigning Features in an Alignment Column
We employ modern supervised discriminative machine learning methods trained on the training set. The role of the machine
learning model is to select the proper hypothesis from each aligned column in order to produce the best OCR correction.
We prepared training data from the training sets with the same OCR engines and aligned their output using the same
progressive alignment algorithm described above in order to produce aligned columns of hypotheses. (See Figure 4.) As a
base, we extracted the following feature types from each column:
Voting: multiple features to indicate where multiple hypotheses in a column match exactly.
Number: binary indicators for whether each hypothesis is a cardinal number.
Dictionary: binary indicators for whether each hypothesis appears in the Linux dictionary.
Gazetteer: binary indicators for whether each hypothesis appears in a gazetteer of place names.
Lexical Features: words that appear in the corpus are individually created as features.
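As a sketch of this per-column feature extraction, the following Python fragment produces feature names in the style of Figure 4; the small stand-in dictionary and the omission of the gazetteer are simplifications:

from itertools import combinations
import string

DICTIONARY = {"the", "mormon", "called"}          # stand-in for the Linux dictionary

def strip_punct(token):
    # Leading and trailing punctuation is removed before the dictionary check.
    return token.strip(string.punctuation)

def column_features(column):
    """column maps an engine code ('A','O','T','D') to its hypothesis."""
    feats = []
    for code, token in column.items():
        if token:
            feats.append(f"{code}:{token}")       # lexical feature
        if strip_punct(token).lower() in DICTIONARY:
            feats.append(f"Dict{code}")           # dictionary flag
        if token.isdigit():
            feats.append(f"Num{code}")            # cardinal number flag
    # Voting features: every subset of engines whose hypotheses match exactly,
    # punctuation included. Gap (empty) hypotheses may also agree, as in the
    # fourth column of Figure 4.
    for size in (4, 3, 2):
        for subset in combinations(sorted(column), size):
            if len({column[c] for c in subset}) == 1:
                feats.append("Vote" + "".join(subset))
    return feats

col = {"A": "Called", "O": "Called", "T": "Called", "D": "()ailed"}
print(column_features(col))  # includes VoteAOT, VoteAO, VoteAT, VoteOT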
OCR Engine      Hypothesis Columns (Aligned Text)
Abbyy (A)       Xfee   M-o-r-mon   !.<              -gittiamre   -Called
OmniPage (O)    The-   M-orrison   E-r--gistature                -Called
Tesseract (T)   The-   M-o-r-mon   L-r--gnslauure                -Called
Adobe (D)       The-   lUo-rnton   Lt-˜˜&aslature                ()ailed

Features by column:
  Column 1: A:Xfee, O:The, T:The, D:The; DictD, DictT, DictO; VoteDOT, VoteDO, VoteDT, VoteOT
  Column 2: A:Mormon, O:Morrison, T:Mormon, D:lUornton; DictT, DictA, DictO; VoteAT
  Column 3: A:!.<, O:Ergistature, T:Lrgnslauure, D:Lt-˜˜&aslature
  Column 4: A:gittiamre, O:, T:, D:; VoteDOT, VoteDO, VoteDT, VoteOT
  Column 5: A:Called, O:Called, T:Called, D:()ailed; DictT, DictA, DictO; VoteAOT, VoteAO, VoteAT, VoteOT

Selected Label:        OmniPage   Abbyy   NONE   OmniPage   Abbyy
Error Corrected Text:  The Mormon Called
Transcript:            The- M-o-r-mon L-e--gislature -Called

Figure 4. An example of an aligned lattice from document CT 2Apr1872 p2 c1. The “dash” character represents a gap or INDEL in
the alignment, where a character needs to be inserted in order to complete the alignment. Correct hypotheses in the aligned text are
underlined. The aligned text is divided into columns of hypotheses on spaces in the aligned sequences. Note that a space occurred in the
middle of the Abbyy mis-recognition of the word “Legislature”.
For each training case (an aligned column), the label indicates which OCR engine provided the correct hypothesis.
Ties were resolved by selecting the OCR engine with the lowest WER from the training set. Note that “DictA” (and so
forth) indicates that the entry from Abbyy is found in the dictionary. Leading and trailing punctuation is removed from
the hypothesis before checking in the dictionary. To produce a “Voting” feature type the match must be exact, including
punctuation. Once all of the feature vectors have been extracted, we use the maximum entropy learner in the Mallet36
toolkit to train a maximum entropy (a.k.a., multinomial logistic regression) model to predict choices on unseen alignment
columns.
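A minimal sketch of this training step, using scikit-learn's multinomial logistic regression in place of Mallet's maximum entropy learner (the toy feature vectors below are illustrative, not real training cases):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train_columns = [                       # one binary feature vector per aligned column
    {"A:The": 1, "O:The": 1, "DictA": 1, "DictO": 1, "VoteAO": 1},
    {"A:Xfee": 1, "O:The": 1, "DictO": 1},
    {"A:!.<": 1, "O:Ergistature": 1},
]
train_labels = ["Abbyy", "OmniPage", "NONE"]

vec = DictVectorizer()
X = vec.fit_transform(train_columns)
# Multinomial logistic regression is the maximum entropy classifier.
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

# Predict a label for an unseen alignment column (unseen feature names are
# simply ignored by the vectorizer).
test_column = {"A:The": 1, "O:The": 1, "DictA": 1, "DictO": 1, "VoteAO": 1}
print(clf.predict(vec.transform([test_column])))   # -> an engine label or "NONE"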
4. RESULTS
Previous work2,13,26 using the Eisenhower Communiqués test set and the Enron training set was focused on individual
document performance, in which corpus WERs were calculated as the average of the individual document WERs. A result
of this method is that corpus statistics would give more weight to the tokens of a short document than to the tokens
of a long document. This approach may be called a “macro-average” in which each document is given an equal weight,
regardless of its size. In contrast, the results reported in this paper are based on tokens, giving an equal weight to each
occurrence of a token in the corpus, the OCR output, and the resulting error corrections. All averages and other statistics in
this paper, unless stated otherwise, use this “micro-averaging” approach. We believe this approach is more suited for feature
engineering although there are important uses of a document-by-document evaluation. Examples where a document-by-
document analysis is useful would be observing improvement trends by document or in error analysis related to document
WER.
Both for documents and for the corpus as a whole, the word error rate is calculated as

    Word Error Rate = (Substitutions + Insertions + Deletions) / (Number of Reference Tokens)        (1)

which may be greater than 100% due to the number of insertions, which is not limited.
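The distinction between the two averaging schemes can be made concrete with a short sketch, given per-document error and reference-token counts such as those produced by sclite:

def micro_wer(docs):
    """Token-weighted corpus WER: pool errors and reference tokens."""
    errors = sum(e for e, _ in docs)
    ref_tokens = sum(n for _, n in docs)
    return errors / ref_tokens

def macro_wer(docs):
    """Document-weighted corpus WER: average the per-document WERs."""
    return sum(e / n for e, n in docs) / len(docs)

docs = [(5, 50), (40, 1000)]            # (errors, reference tokens) per document
print(micro_wer(docs))                  # 45/1050 ~ 0.043: the long document dominates
print(macro_wer(docs))                  # (0.10 + 0.04)/2 = 0.07: equal document weight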
The remainder of this section is organized as follows. Section 4.1 shows the baseline results from previous work
reinterpreted using the micro-averaging approach discussed above. New features for this work, recurring features and an
order 1 CRF, are described in Section 4.2 with the results of incorporating these features in Section 4.3. The generalization
of these methods on a new test corpus and a new training corpus is shown in Sections 4.4 and 4.5. Section 4.6 wraps up
with an evaluation of the effect of the training corpus size on the results of both test corpora.
4.1 Baseline Results
As a baseline, consider the WERs of the OCR output on the unmodified images of the documents from the Eisenhower
Communiqués test set and the Enron training set. Each document image in the Eisenhower Communiqués test set, as well
as all corpora in this work, was recognized using four OCR engines: Abbyy FineReader 10, OmniPage 18, Adobe Acrobat
as all corpora in this work, was recognized using four OCR engines: Abbyy FineReader 10, OmniPage 18, Adobe Acrobat
Pro X, and Tesseract 3. The resulting recognition hypotheses were evaluated using the NIST Sclite37 tool to compute the
number of correctly recognized tokens, as well as substitutions, insertions, and deletions for each document. These results,
shown in the first numerical column of Table 4, are a reference point for evaluating the effectiveness of the new techniques
and corpora introduced in this paper.
Our previous work13,26 showed the improvement in the WER using a trained machine learner with the alignment of the
output from multiple OCR engines. Reinterpreting these previous results using micro-averaging techniques, the underlined
entries in Table 4 show the decreasing WER as additional OCR outputs are added to the alignment. The order of the OCR
output alignment was determined by the order of increasing WER on the Enron training set.
4.2 New Features
In addition to the baseline feature types described in Section 3.3, this paper adds three new feature types for consideration
by the machine learner: voting, dictionary lookup, gazetteer lookup, identifying numbers, and the lexical features of the
training set. This paper adds three new features:
1. RecurSim, a binary feature indicating whether a token occurs more than once.
2. RecurBucket#, a multivalued feature dividing the number of times a token appears into one of ten buckets, numbered
0 to 9, with each bucket containing approximately the same number of tokens. The higher the bucket number, the
more frequently the individual tokens in that bucket appear in the corpus. (See Section 4.2.1 for more details.)
3. Order, the machine learner is trained using either an order 0 or an order 1 conditional random field (CRF). The order
0 CRF only considers the current column of hypotheses when deciding on the label. The order 1 CRF considers the
previous label in addition to the current column features.
4.2.1 Recurring Features
The recurring features (RecurSim and RecurBucket#) are calculated on the training and test sets, since the contents of the
OCR outputs are available without violating any restriction on using the gold standard transcription of the documents in the
corpus. The simple recurring feature (RecurSim) is created for every token that appears more than once anywhere in the
corpus. The hope is that, by tracking recurring tokens in the corpus, out-of-vocabulary tokens that do not appear in the
dictionary or gazetteer will be captured.
The bucket recurring feature (RecurBucket#) divides up the range of recurring feature counts into ten buckets tagged
with the labels RecurBucket0 to RecurBucket9. The method for assigning bucket labels calculates the number of times a
given token appears in the OCR of the corpus. For example, in the combined OCR of the Eisenhower test corpus there are
3,931 different tokens that each appear twice, one of which is “SCHEUERN”, likely a valid recognition by the OCR
engines of a town by that name in Germany. (See Figure 5.) The buckets are assigned labels in an ascending order such
that there are approximately the same number of recurring tokens in each bucket. For the Eisenhower test set the bucket
assignments are found in Table 2. The simpler “RecurSim” feature is assigned to all tokens that recur within the corpus, so
RecurBucket0 through RecurBucket9 would all be mapped to RecurSim.
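A sketch of the bucket-range construction follows. The text is ambiguous as to whether the buckets balance token types or token occurrences; this sketch balances occurrences, which is an assumption, and it may produce fewer than n_buckets ranges when the count distribution is lumpy:

from collections import Counter

def bucket_ranges(ocr_tokens, n_buckets=10):
    counts = Counter(ocr_tokens)
    # Occurrence mass contributed by each recurrence count value (count >= 2).
    mass_by_count = Counter()
    for token, c in counts.items():
        if c >= 2:
            mass_by_count[c] += c
    total = sum(mass_by_count.values())
    ranges, start, mass, bucket = {}, None, 0, 0
    for c in sorted(mass_by_count):
        if start is None:
            start = c
        mass += mass_by_count[c]
        # Close the current bucket once its share of the mass is reached.
        if bucket < n_buckets and mass >= total * (bucket + 1) / n_buckets:
            ranges[f"RecurBucket{bucket}"] = (start, c)
            start, bucket = None, bucket + 1
    return ranges   # label -> (lowest count, highest count), as in Table 2

tokens = ("the " * 20 + "of " * 12 + "allied " * 4 + "france " * 3 + "scheuern " * 2).split()
print(bucket_ranges(tokens, n_buckets=3))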
Figure 5. A histogram from the Eisenhower test set showing the number of times that a token with a given number of occurrences appears
in the OCR of the test corpus. For example there are 3,931 tokens that appear twice in the corpus and one token that appears 283 times.
Table 2. Assignments of token recurrences in the Eisenhower test set to feature RecurBucket#.

Bucket Label      Recurring Token Count (from–to)
RecurBucket0        2–3
RecurBucket1        4–12
RecurBucket2       13–21
RecurBucket3       22–30
RecurBucket4       31–41
RecurBucket5       42–52
RecurBucket6       53–62
RecurBucket7       63–76
RecurBucket8       77–150
RecurBucket9      151–283
4.2.2 CRF Order
Previously, the machine learner used an order 0 conditional random field (CRF), also called a log-linear classifier. This
means that only the features of the current column are used to select the label assigned to the features from the hypothesis
column. The selected label, which is one of the OCR engines or the label “NONE”, determines which OCR hypothesis to
select or whether to select none of the OCR outputs.
For this paper we have added a new set of models, trained using the same training sets but modeled using an order 1
CRF. The model considers not only the features found in the current hypothesis column, but also the label that was selected
previously. The results will clearly identify whether the order 0 or the order 1 CRF model is being used.
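The difference between the two orders can be sketched as follows: an order 0 model labels each column independently from its own features, while an order 1 model adds a transition score between adjacent labels and decodes the best label sequence with the Viterbi algorithm. The scoring functions below are illustrative stand-ins for the trained CRF weights:

LABELS = ["Abbyy", "OmniPage", "Tesseract", "Adobe", "NONE"]

def score0(column, label):
    # Stand-in emission score: reward engines whose hypothesis is a known word.
    return 1.0 if column.get(label, "").lower() in {"the", "mormon", "called"} else 0.0

def trans(prev_label, label):
    # Stand-in transition score: mildly discourage two NONE labels in a row.
    return -0.5 if prev_label == label == "NONE" else 0.0

def decode_order0(columns):
    # Order 0: each column is labeled independently from its own features.
    return [max(LABELS, key=lambda l: score0(col, l)) for col in columns]

def decode_order1(columns):
    # Order 1: Viterbi search over label sequences (emission + transition).
    best = {l: score0(columns[0], l) for l in LABELS}
    back = []
    for col in columns[1:]:
        prev, best, ptr = best, {}, {}
        for l in LABELS:
            p = max(LABELS, key=lambda q: prev[q] + trans(q, l))
            best[l] = prev[p] + trans(p, l) + score0(col, l)
            ptr[l] = p
        back.append(ptr)
    label = max(best, key=best.get)     # best final label, then walk back
    path = [label]
    for ptr in reversed(back):
        label = ptr[label]
        path.append(label)
    return path[::-1]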
4.3 Results of the New Features
One of the goals of this research was to determine the contribution of the lexical features learned from the training set to
the overall performance of the machine learner. To this end we group features into sets, including and excluding the
RecurSim, RecurBucket#, and Lexical feature types. The set names and features found in each set are found in Table 3.
Refer to Section 3.3 for an explanation of the feature names.
The results across all feature set groupings and CRF orders may be seen in Table 4. To orient the reader, the previously
published results are underlined in the “Order 0 CRF” section of the table. Note the italicized entries, which indicate
improvements within a given OCR alignment over the previous results.
The ground truth transcription of the Eisenhower test set consists of 145,346 words in 605 documents. A WER reduction of 0.01%
constitutes 15 tokens that are corrected across the corpus, consisting of insertions that are eliminated or words that are corrected.
Table 3. The grouping of features used in various model configurations.

Feature Set                  Feature Types
Base Set                     Voting, Dictionary, Gazetteer, and Number
RecurSim Set                 All features found in the Base Set along with the RecurSim feature type
RecurBucket# Set             All features found in the Base Set along with the RecurBucket# feature types
Lexical Set                  All features found in the Base Set with the Lexical feature type
Lexical+RecurSim Set         A combination of the Base Set, Lexical Set, and the RecurSim feature type
Lexical+RecurBucket# Set     A combination of the Base Set, Lexical Set, and the RecurBucket# feature types
Table 4. Eisenhower test set WERs with the Enron training set including the new features. The underlined entries correspond to the
results previously published. Italicized entries indicate improvement over previous results. The bolded entry is the lowest WER in the
table.

                                                                        Lexical       Lexical
Alignment Order                  Base    RecurSim  RecurBucket#  Lexical  +RecurSim  +RecurBucket#
Order 0 CRF (Abbyy OCR WER: 19.98%)
Abbyy+OmniPage                  21.91%   22.20%      22.15%      18.31%    18.13%      18.19%
Abbyy+OmniPage+Tesseract        17.47%   17.15%      17.47%      17.52%    17.29%      17.45%
Abbyy+OmniPage+Tesseract
  +Adobe                        17.80%   17.55%      17.78%      16.64%    16.23%      16.46%
Order 1 CRF
Abbyy+OmniPage                  21.56%   21.76%      21.68%      17.70%    17.61%      17.70%
Abbyy+OmniPage+Tesseract        17.68%   17.59%      17.70%      17.80%    17.62%      17.75%
Abbyy+OmniPage+Tesseract
  +Adobe                        17.82%   17.74%      17.79%      17.49%    17.17%      17.42%
Observe that the order 1 CRF is not in general an improvement over the order 0 CRF. With the exception of results
using the Abbyy+OmniPage alignment, all of the other results in the table have an increased WER. Further, the result
of 17.42% for the Abbyy+OmniPage+Tesseract+Adobe alignment with the Lexical+RecurBucket# Set in the order 1 CRF
is significantly higher than the best result in the order 0 CRF section at 16.23%. Based on this, the order 1 CRF will not be
included in the results that follow since it does not show an improvement over the order 0 CRF.
Clearly the Lexical features, as found in the three feature sets Lexical Set, Lexical+RecurSim Set, and Lexical+Recur-
Bucket# Set, show improvement over the feature sets without the Lexical features, yielding a 6.52% relative improve-
ment between the Base Set of the Abbyy+OmniPage+Tesseract+Adobe alignment and the Lexical Set as shown in Table 4.
In addition, the recurring features in conjunction with the Lexical features are superior to the Lexical features alone, with
the Lexical+RecurSim Set having the greatest improvement, showing an additional 2.30% relative improvement over the
Abbyy+OmniPage+Tesseract+Adobe alignment with the Lexical Set mentioned above. Overall the Lexical+RecurSim Set
performs best on the Eisenhower test set with the Enron training set. We will proceed with the Lexical Set and recurring
feature sets as we compare results with the new test corpus, the 19th Century Mormon Article Newspaper Index.
4.4 Results on a Different Test Corpus
The 19th Century Mormon Article Newspaper Index, described in Section 3.1.2, consists of 208,630 words§ in 1,055
documents digitized to 8-bit grayscale, in contrast to the Eisenhower dataset which is effectively bi-tonal. The results using
the selected feature sets from the previous section are found in Table 5.
§A reduction of 0.01% in the WER on the 19th Century Mormon Article Newspaper Index results in 21 tokens that are corrected
across the corpus.
Table 5. Results on the Eisenhower and the 19th Century Mormon Article Newspaper Index test sets using a CRF model trained on the
Enron, Reuters, and combined training sets. The bold entries are the lowest WERs for each section of the table. The underlined entries
are the lowest WERs for their respective test sets.

                                      Eisenhower Test Set                   19thCMAN Test Set
                                        Lexical     Lexical                   Lexical     Lexical
Alignment                    Lexical  +RecurSim  +RecurBucket#     Lexical  +RecurSim  +RecurBucket#
Enron Training Set          (Abbyy OCR WER: 19.98%)               (Abbyy OCR WER: 7.44%)
Abbyy+OmniPage               18.31%    18.13%      18.19%           6.91%     7.03%       6.92%
Abbyy+OmniPage+Tesseract     17.52%    17.29%      17.45%           7.06%     7.75%       7.46%
Abbyy+OmniPage+Tesseract
  +Adobe                     16.64%    16.23%      16.49%           6.83%     7.03%       6.90%
Reuters-21578 Training Set
Abbyy+OmniPage               19.91%    20.14%      19.66%           5.99%     5.99%       5.98%
Abbyy+OmniPage+Tesseract     16.29%    16.12%      15.97%           6.80%     7.64%       7.16%
Abbyy+OmniPage+Tesseract
  +Adobe                     16.41%    16.71%      16.37%           6.99%     7.68%       7.42%
Combined Training Set
Abbyy+OmniPage               20.01%    20.02%      19.83%           6.04%     5.97%       5.99%
Abbyy+OmniPage+Tesseract     16.88%    17.15%      16.96%           6.82%     6.82%       7.28%
Abbyy+OmniPage+Tesseract
  +Adobe                     16.63%    16.56%      16.63%           6.83%     6.82%       7.05%
The monotonic improvement in WER seen on the Eisenhower test set using the Lexical+RecurSim feature set is not re-
flected in the 19thCMAN test set using the Enron training set. In the 19thCMAN test set, the addition of the Tesseract OCR
increases the WER above the Abbyy+OmniPage alignment results across the board for all of the feature sets. Unlike the
Eisenhower test set, 19thCMAN appears to have a sensitivity to the relatively high WER of the Tesseract OCR on the
Enron training set (56.07%; see Table 1).
Adding the Adobe OCR output improves the resulting WER to a level equal to or below both the Abbyy FineReader WER
and the Abbyy+OmniPage alignment, even though the WER of Adobe on the Enron training set (68.95%) is higher than
that of Tesseract. The conclusion here is that although the high-WER OCR of Tesseract and Adobe in the training set were
able to contribute to lowering the WER for the Eisenhower test set, in combination they did not contribute in the same way
with the 19thCMAN test set. A possible solution may be to eliminate from the training set documents with high WERs.
Since the training set includes alignments of documents from multiple OCR engines, this may decrease the size of the
training set: if a document is eliminated due to a high WER from one OCR engine, it would need to be eliminated
from all of the OCR engine contributions to the training set. Section 4.6 explores whether the full contents of the training
set are needed to maintain the level of performance seen so far.
4.5 Results on New Training Corpora
New in this paper, the Reuters-21578 training set described in Section 3.1.3 is a synthetic dataset consisting of grayscale
images, in contrast to the Enron training set, which had previously been binarized. The hope was that since the
19thCMAN test set was grayscale, the Reuters-21578 training set would contribute to improving the overall WER. Note
that the effects of the high-WER OCR outputs from Tesseract and Adobe Pro X seem more pronounced with this training
set: the WER results for Abbyy+OmniPage were the best, and further adding Tesseract and Adobe Pro X to the alignment
each increased the WER. Overall, however, the Reuters-21578 training set showed better results than the Enron training
set.
Figure 6. Comparing the WER on the Eisenhower test set, with the Reuters-21578 training set, as the proportion of the training set
varies from 0.01% to 100%.

Figure 7. Comparing the WER on the 19thCMAN test set, with the Reuters-21578 training set, as the proportion of the training set
varies from 0.01% to 100%.
The last rows of Table 5 show the results of merging the Enron and the Reuters-21578 training sets. Although more
training data is available, there is no improvement on the Eisenhower test set and a small improvement of only 0.01%
for the 19thCMAN test set. The conclusion is that there is a point where more training set vectors do not necessarily
improve the outcome. Overall the best results were seen with the Reuters-21578 training set, although not consistently
across the complete set of OCR alignments.
4.6 Results of Sweeping the Size of the Training Sets
Exploring the observation from the last section, that the increased size of the training set does not necessarily improve
the WER outcome, we sweep the size of the Reuters-21578 training set from 0.01% to 100% to explore how the WER
of the Eisenhower and 19thCMAN test sets varies as the training set increases in size. We selected the best result on
the Eisenhower test set across all training sets, which was the Abbyy+OmniPage+Tesseract alignment using the Lexi-
cal+RecurBucket# Set as seen in Table 5. We selected ten proportion values (0.01%, 0.1%, 0.2%, 0.5%, 1%, 2%, 5%,
10%, 20%, and 50%) of the training set, which has 585,291 feature vectors. At each proportion value we took five random
samples from the full training set to create 50 new training sets from which order 0 CRF models were created. The results
shown in Figures 6 and 7 are the average WER, as well as the minimum and maximum WERs for the five models of
each proportion value. When selecting a proportion size of only 0.01% of the total training set, only 61 feature vectors on
average are included in the models created.
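A sketch of this sweep procedure follows; train_and_score is a hypothetical stand-in for the full CRF training and evaluation pipeline, not code from this work:

import random

PROPORTIONS = [0.0001, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.10, 0.20, 0.50]

def sweep(training_cases, train_and_score, samples_per_point=5, seed=0):
    rng = random.Random(seed)
    results = {}
    for p in PROPORTIONS:
        # Draw several random samples of this proportion of the training set,
        # train a model on each, and record the spread of corpus WERs.
        k = max(1, int(p * len(training_cases)))
        wers = [train_and_score(rng.sample(training_cases, k))
                for _ in range(samples_per_point)]
        results[p] = (min(wers), sum(wers) / len(wers), max(wers))
    return results  # per proportion: (minimum, average, maximum) WER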
Of interest in the Eisenhower results is that, beginning with a training set proportion of only 2.0%, the average
WER of the resulting error corrected output is less than the WER using the entire training corpus. Given that the full
training set takes a significant amount of time to train, the five 2.0% training sets are considerably faster to train. Further,
the models created from the 2.0% training sets consist of on the order of 11,500 features while the full model consists of
over 217,000 features. Clearly the complexity of the full model does not necessarily reward us with better results.
Regarding the 19thCMAN test set, similar to the Eisenhower test set, the full effect of the Reuters-21578 training set is
visible between 1% and 10% of the total training set. Interestingly, superior results are possible given individual training
set proportions within the same range.
5. CONCLUSIONS
In general, the results seen in previous work are also seen with the new test set and training set. Although the methodologies
used in previous papers still produced good results, they did not work consistently across the board. The 19thCMAN test
set seemed sensitive to the high-WER OCR engines included in the training set, but we demonstrated that the size of the
training set can be reduced, which opens the possibility of eliminating the troublesome high-WER documents and
improving the end results. Future work will include error analysis of how the models of the Enron and Reuters-21578
training sets differ in their performance on the Eisenhower and 19thCMAN test sets.
Acknowledgments
The authors express their appreciation to the Fulton Supercomputing Lab and the Harold B. Lee Library, L. Tom Perry
Special Collections, at Brigham Young University, without whose support this research would not have been possible.
We thank the employees of the Digital Initiatives Lab of the Harold B. Lee Library for their extensive work in digitizing,
OCRing, and correcting the OCR text of the Nineteenth Century Mormon Article Newspaper Index. Lastly, we express
gratitude to Patricia A. Frade of the Harold B. Lee Library for her extensive work in assembling and curating the Nineteenth
Century Mormon Article Newspaper Index and Dr. Richard Hacken for curating the Eisenhower Communiqu´
es.
REFERENCES
[1] Lund, W. B., Kennard, D. J., and Ringger, E. K., “Combining multiple thresholding binarization values to improve
OCR output,” in [Proceedings of Document Recognition and Retrieval XX], (Feb. 2013).
[2] Lund, W. B. and Ringger, E. K., “Improving optical character recognition through efficient multiple system align-
ment,” in [Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries], 231–240, ACM, Austin, TX,
USA (2009).
[3] Kittler, J., Hatef, M., Duin, R. P. W., and Matas, J., “On combining classifiers,” IEEE Trans. Pattern Anal. Mach.
Intell. 20, 226–239 (Mar. 1998).
[4] Antonacopoulos, A. and Karatzas, D., “Semantics-based content extraction in typewritten historical documents,” in
[Proceedings of the 8th International Conference on Document Analysis and Recognition, 2005], 1, 48–53 (Aug.
2005).
[5] Kae, A. and Learned-Miller, E., “Learning on the fly: Font-free approaches to difficult OCR problems,” in [Proceed-
ings of the International Conference on Document Analysis and Recognition (ICDAR) 2009 ], (2009).
[6] Bertolami, R. and Bunke, H., “Ensemble methods for handwritten text line recognition systems,” in [Proceedings of
the 2005 IEEE International Conference on Systems, Man and Cybernetics], (Oct. 2005).
[7] Si, L., Kanungo, T., and Huang, X., “Boosting performance of bio-entity recognition by combining results from
multiple systems,” in [Proceedings of the 5th international workshop on Bioinformatics ], 76–83, ACM, Chicago,
Illinois (2005).
[8] Macherey, W. and Och, F. J., “An empirical study on computing consensus translations from multiple machine trans-
lation systems.,” in [EMNLP-CoNLL], 986 (2007).
[9] Charniak, E. and Johnson, M., “Coarse-to-fine n-best parsing and MaxEnt discriminative reranking,” in [Proceedings
of the 43rd Annual Meeting of the ACL], 173–180 (June 2005).
[10] Klein, S. T. and Kopel, M., “A voting system for automatic OCR correction,” in [Proceedings of the SIGIR 2002
Workshop on Information Retrieval and OCR ], (Aug. 2002).
[11] Cecotti, H. and Belaid, A., “Hybrid OCR combination approach complemented by a specialized ICR applied on
ancient documents,” in [Proceedings of the 8th International Conference on Document Analysis and Recognition,
2005], 2, 1045–1049 (Aug. 2005).
[12] Caruana, R., Niculescu-Mizil, A., Crew, G., and Ksikes, A., “Ensemble selection from libraries of models,” in [Pro-
ceedings of the twenty-first international conference on Machine learning], 18 (2004).
[13] Lund, W. B., Walker, D. D., and Ringger, E. K., “Progressive alignment and discriminative error correction for
multiple OCR engines,” in [Proceedings of the 11th International Conference on Document Analysis and Recognition
(ICDAR 2011)], (Sept. 2011).
[14] Lund, W. B., Kennard, D. J., and Ringger, E. K., “Why multiple document image binarizations improve OCR,” in
[Proceedings of the Workshop on Historical Document Imaging and Processing 2013 (HIP 2013)], (Aug. 2013).
[15] Hansen, L. K. and Salamon, P., “Neural network ensembles,” Pattern Analysis and Machine Intelligence, IEEE
Transactions on 12(10), 993–1001 (1990).
[16] Dietterich, T. G., “Ensemble methods in machine learning,” in [Multiple classifier systems], Springer (2000).
[17] Cer, D., Manning, C. D., and Jurafsky, D., “Positive diversity tuning for machine translation system combination,” in
[Proceedings of the Eighth Workshop on Statistical Machine Translation ], 320–328, Association for Computational
Linguistics, Sofia, Bulgaria (2013).
[18] Gimpel, K., Batra, D., Dyer, C., and Shakhnarovich, G., “A systematic exploration of diversity in machine transla-
tion,” in [Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013) ],
(Oct. 2013).
[19] Wang, L. and Jiang, T., “On the complexity of multiple sequence alignment,” Journal of Computational Biology: A
Journal of Computational Molecular Cell Biology 1(4), 337–48 (1994).
[20] Notredame, C., “Recent evolutions of multiple sequence alignment algorithms,” PLoS Computational Biology 3(8),
e123 (2007).
[21] Elias, I., “Settling the intractability of multiple alignment,” Journal of Computational Biology: A Journal of Compu-
tational Molecular Cell Biology 13, 1323–39 (Sept. 2006).
[22] Lopresti, D. and Zhou, J., “Using consensus sequence voting to correct OCR error,” Computer Vision and Image
Understanding 67(1), 39–47 (1997).
[23] Thoma, G. and Le, D., “Medical database input using integrated OCR and document analysis and labeling technol-
ogy,” in [Proceedings 1997 Symposium on Document Image Understanding Technology ], 280 (1997).
[24] Esakov, J., Lopresti, D. P., and Sandberg, J., “Classification and distribution of optical character recognition errors,”
in [Proceedings of IS&T/SPIE International Symposium on Electronic Imaging ], 204–216 (Feb. 1994).
[25] Kolak, O., Byrne, W. J., and Resnik, P., “A generative probabilistic OCR model for NLP applications,” in [Proceed-
ings of HLT-NAACL 2003], 55–62 (May 2003).
[26] Lund, W. B. and Ringger, E. K., “Error correction with in-domain training across multiple OCR system outputs,” in
[Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR 2011)], (Sept.
2011).
[27] Yamazoe, T., Etoh, M., Yoshimura, T., and Tsujino, K., “Hypothesis preservation approach to scene text recognition
with weighted finite-state transducer,” in [2011 International Conference on Document Analysis and Recognition
(ICDAR)], 359 –363 (Sept. 2011).
[28] Sarkar, P., Baird, H. S., and Zhang, X., “Training on severely degraded text-line images,” in [Proceedings of the
Seventh International Conference on Document Analysis and Recognition - Volume 1], 38, IEEE Computer Society
(2003).
[29] Baird, H., “The state of the art of document image degradation modelling,” in [Digital Document Processing], 261–
279, Springer (2007).
[30] Jordan, D. R., “Daily battle communiques, 1944-1945,” Harold B. Lee Library, L. Tom Perry Special Collections,
MSS 2766 (1945).
[31] Frade, P. A., “19th century Mormon article newspaper index,” L. Tom Perry Special Collections, Brigham Young
University (2012).
[32] Lewis, D. D., “Reuters-21578,” http://www.daviddlewis.com/resources/testcollections/reuters21578/ (2013).
[33] Walker, D., Lund, W. B., and Ringger, E. K., “A synthetic document image dataset for developing and evaluating
historical document processing methods,” in [Proceedings of SPIE Volume 8297 ], 8297 (Jan. 2012).
[34] Berry, M. W., Browne, M., and Signer, B., “2001 topic annotated Enron email data set.” http://www.ldc.upenn.edu/
(June 2007).
[35] Spencer, M. and Howe, C., “Collating texts using progressive multiple alignment,” Computers and the Humanities 38,
253–270 (Aug. 2004).
[36] McCallum, A. K., “MALLET: a machine learning for language toolkit.” http://mallet.cs.umass.edu (2002).
[37] Ajot, J., Fiscus, J., Radde, N., and Laprun, C., “Asclite – multi-dimensional alignment program.”
http://www.nist.gov/speech/tools/asclite.html (2008).
... The second case is for low scanning resolution image [3][4][5]. The last case is when the image contains noise [6,7]. It is difficult to identify OCR error rate accurately for previous cases. ...
... Results from earlier studies [2,6,14], showed that the multiple thresholds technique is one of the best choices among them. Fig. 2 shows an example of the multiple thresholds technique proposed by Lund [13]. ...
Chapter
Full-text available
Optical character recognition (OCR) is the electronic transformation of images into a computer-encoded text. OCR systems often produce poor accu-racy for noisy images. Ensemble recognition techniques are used to improve OCR accuracy. The idea of the ensemble recognition techniques is to produce N-versions of an input image. These versions are similar but not identical. They are passed through the OCR engine to turn them into different OCR outputs, which later leads to select the best between them. Existing ensemble techniques need to be more effective to reduce OCR error rate. This research proposed en-hanced ensemble technique to overcome the drawbacks of existing techniques. The proposed technique was evaluated against three other relevant existing techniques. The performance measurements used in this research were Word Error Rate (WER) and Character Error Rate (CER). Experimental results showed a relative decrease of 14.37% and 40.13% over the WER and CER of the best existing technique. This study contributes to the OCR domain as the proposed technique could facilitate the automatic recognition of documents. Hence, it will lead to a better information extraction.
... Borovikov et al. [18] propose using Hidden Markov Model to find the correct candidate for English documents for which the authors achieved 0.9444 recall and 0.9787 precision. Similarly for historical [19] or scanned documents that lack in quality of the images, use multiple OCR outputs or runs multiple times on the same document using the same engine [20,21,22,23]. Thompson et al. used [19] rule-based as well as a medically tuned spell-checking strategy on historical medical documents from the British Medical Journal archive to improve the wordlevel accuracy up to 16%. ...
Article
Full-text available
According to a recent Deloitte study, the COVID-19 pandemic continues to place a huge strain on the global health care sector. Covid-19 has also catalysed digital transformation across the sector for improving operational efficiencies. As a result, the amount of digitally stored patient data such as discharge letters, scan images, test results or free text entries by doctors has grown significantly. In 2020, 2314 exabytes of medical data was generated globally. This medical data does not conform to a generic structure and is mostly in the form of unstructured digitally generated or scanned paper documents stored as part of a patient’s medical reports. This unstructured data is digitised using Optical Character Recognition (OCR) process. A key challenge here is that the accuracy of the OCR process varies due to the inability of current OCR engines to correctly transcribe scanned or handwritten documents in which text may be skewed, obscured or illegible. This is compounded by the fact that processed text is comprised of specific medical terminologies that do not necessarily form part of general language lexicons. The proposed work uses a deep neural network based self-supervised pre-training technique: Robustly Optimized Bidirectional Encoder Representations from Transformers (RoBERTa) that can learn to predict hidden (masked) sections of text to fill in the gaps of non-transcribable parts of the documents being processed. Evaluating the proposed method on domain-specific datasets which include real medical documents, shows a significantly reduced word error rate demonstrating the effectiveness of the approach.
... Redundancy Based. The works in Lund et al. (2013), Lund et al. (2011), Xu and Smith (2017), and Lund et al. (2014) view the problem of posthoc correction under the assumption of redundant text snippets. That is, multiple redundant text snippets are combined and under the majority voting scheme the correction is carried out. ...
Article
Full-text available
Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
... Redundancy based. The works in (Lund et al., 2013(Lund et al., , 2011Xu and Smith, 2017;Lund et al., 2014) 2000;Dreyer et al., 2008;Wang et al., 2014;Silfverberg et al., 2016;Farra et al., 2014). WFSM require predefined rules (insertion, deletion, etc. of characters) and a lexicon, which is used to assess the transformations. ...
Preprint
Full-text available
Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model's correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
... There are various approaches which have been proposed to OCR post-processing, both manual (including human intervention) and automatic. For instance, the approaches in [32,33] employ supervised machine learning techniques to select the best correction among candidates from the recognition outputs of multiple OCR systems. Commonly, OCR post-processing does not rely on any specified parameters of the OCR system as well as original scanned documents. ...
Article
Full-text available
Optical character recognition (OCR) systems help to digitize paper-based historical achieves. However, poor quality of scanned documents and limitations of text recognition techniques result in different kinds of errors in OCR outputs. Post-processing is an essential step in improving the output quality of OCR systems by detecting and cleaning the errors. In this paper, we present an automatic model consisting of both error detection and error correction phases for OCR post-processing. We propose a novel approach of OCR post-processing error correction using correction pattern edits and evolutionary algorithm which has been mainly used for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits involving all types of edit operations and being directly learned from the training dataset. Through efficient settings of the algorithm parameters, our model can be performed with high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches as evaluated on the benchmark dataset of ICDAR 2017 Post-OCR text correction competition.
... Multiple outputs approach includes a selection process, which is used to select the best features between the outputs of OCR. Most existing selection techniques are based on a lexicon in selecting the best candidate for wrong word [4,14,15]. This type of correction is based only on the wrong word itself. ...
... Multiple outputs approach includes a selection process, which is used to select the best features between the outputs of OCR. Most existing selection techniques are based on a lexicon in selecting the best candidate for wrong word [4,14,15]. This type of correction is based only on the wrong word itself. ...
Article
Full-text available
The approach of OCR multiple outputs is used to improve accuracy for low scanning resolution images. The idea of this approach is to incorporate information from multiple outputs of OCR to improve the final OCR output. This approach includes a selection process for choosing the best resulting words among multiple outputs of OCR. However, most existing selection techniques used in the selection process are not context-aware. Therefore, this research proposed a selection technique to overcome the drawbacks of existing techniques. It uses context information of sentences collected from the N-gram language model to improve the final OCR output. The proposed selection technique was evaluated against three other related existing techniques. The evaluation metrics used in this research were Character Error Rate (CER) and Word Error Rate (WER). Experiments showed a relative decrease of 18.26% and 14.23% over the CER and WER of the best existing technique. The proposed selection technique will result in better information extraction through the automatic recognition of low scanning documents.
... Those systems are not scalable, since human annotations are expensive to acquire, and they are not capable of utilizing complementary sources of information. Another line of work is ensemble methods (Lund et al., 2013, 2014) combining OCR results from multiple scans of the same document. Most of these ensemble methods, however, require aligning multiple OCR outputs (Lund and Ringger, 2009; Lund et al., 2011), which is intractable in general and might introduce noise into the later correction stage. ...
... A supervised prediction model that selects the best word recognitions among different OCR outputs is introduced by Lund, Douglas, and Ringger (2013). Combining OCR outputs with word lexical features and training a Conditional Random Field (CRF) model for word selection is further studied in Lund, Ringger, and Walker (2014). While such models have proved useful, they select words only among OCR-generated recognitions and are blind to other candidate words. ...
Article
Modern OCR engines incorporate some form of error correction, typically based on dictionaries. However, residual errors remain that decrease the performance of natural language processing algorithms applied to OCR text. In this paper, we present a statistical learning model for post-processing OCR errors, either in a fully automatic manner or followed by minimal user interaction to further reduce the error rate. Our model employs web-scale corpora and integrates a rich set of linguistic features. Through an interdependent learning pipeline, our model produces and continuously refines error detections and suggested candidate corrections. Evaluated on a historical biology book with complex error patterns, our model outperforms various baseline methods in the automatic mode and shows an even greater advantage when minimal user interaction is involved. Quantitative analysis of each computational step further suggests that our proposed model is well suited to handling volatile and complex OCR error patterns, which are beyond the capabilities of the error correction incorporated in OCR engines.
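A detect-then-correct pipeline of the general kind described above can be sketched as follows; the in-vocabulary test for detection, the edit-distance-1 candidate generator, and the frequency-based ranking are stand-ins of ours for the paper's web-scale corpora and richer linguistic features.

import string

# Generate all strings within one character edit of the input word.
def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in string.ascii_lowercase]
    inserts = [a + c + b for a, b in splits for c in string.ascii_lowercase]
    return set(deletes + replaces + inserts)

def correct(tokens, freq):
    out = []
    for t in tokens:
        if t in freq:  # detection: in-vocabulary tokens pass through unchanged
            out.append(t)
            continue
        cands = [c for c in edits1(t) if c in freq]
        out.append(max(cands, key=freq.get) if cands else t)
    return out

freq = {"the": 1000, "quick": 50, "brown": 40}
print(correct(["tbe", "quick", "brwn"], freq))  # ['the', 'quick', 'brown']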
Conference Paper
Our previous work has shown that the error correction of optical character recognition (OCR) on degraded historical machine-printed documents is improved with the use of multiple information sources and multiple OCR hypotheses, including from multiple document image binarizations. The contributions of this paper are in demonstrating how diversity among multiple binarizations makes those improvements to OCR accuracy possible. We demonstrate the degree and breadth to which the information required for correction is distributed across multiple binarizations of a given document image. Our analysis reveals that the sources of these corrections are not limited to any single binarization and that the full range of binarizations holds information needed to achieve the best result as measured by the word error rate (WER) of the final OCR decision. Even binarizations with high WERs contribute to improving the final OCR. For the corpus used in this research, fully 2.68% of all tokens are corrected using hypotheses not found in the OCR of the binarized image with the lowest WER. Further, we show that the higher the WER of the OCR overall, the more the corrections are distributed among all binarizations of the document image.
Conference Paper
For noisy historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a significant body of research has sought to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates information from multiple global threshold binarizations of the same image to improve the text output. Using a new corpus of 19th-century newspaper grayscale images for which the text transcription is known, we observe WERs of 13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines the OCR outputs from multiple thresholded images by aligning the text output and producing a lattice of word alternatives from which a lattice word error rate (LWER) is calculated. Our results show an LWER of 7.6% when aligning two threshold images, down to an LWER of 6.8% when aligning five. From the word lattice we commit to one hypothesis by applying the methods of Lund et al. (2011), achieving an 8.41% WER, a 39.1% relative reduction in error rate compared to the performance of the original OCR engine on this data set.
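The lattice word error rate can be read as an oracle measure: an aligned position counts as correct if any binarization's hypothesis matches the reference. The sketch below assumes a clean one-to-one column alignment, which glosses over the insertions and deletions a real lattice must also handle.

# Oracle WER over a word lattice: a position is an error only if no
# hypothesis in its column matches the reference word.
def lattice_wer(columns, reference):
    errors = sum(1 for cands, ref in zip(columns, reference) if ref not in cands)
    return errors / len(reference)

columns = [{"the", "tlie"}, {"quick", "qnick"}, {"brown", "brovvn"}, {"fox", "f0x"}]
print(lattice_wer(columns, ["the", "quick", "brown", "fax"]))  # 0.25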
Conference Paper
Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation model applied in a novel way. Included in the datasets is the OCR output from real OCR engines including the commercial ABBYY FineReader and the open-source Tesseract engines. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset, the Eisenhower Communiqués. The new datasets also benefit from additional metadata that exist due to the nature of their collection and prior labeling efforts. We demonstrate the usefulness of the synthetic datasets by training an existing multi-engine OCR correction method on the synthetic data and then applying the model to reduce word error rates on the historical document dataset. The synthetic datasets will be made available for use by other researchers.
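While the paper applies an existing document degradation model, the flavor of such models can be suggested with a small sketch: pixels flip with a probability that decays with their distance from the nearest foreground/background boundary, so character edges erode or bleed. The exponential decay and its constant here are illustrative choices of ours, not the model the paper uses.

# Flip pixels of a binary image with edge-distance-dependent probability.
import numpy as np
from scipy.ndimage import distance_transform_edt

def degrade(binary_img, alpha=1.5, seed=0):
    rng = np.random.default_rng(seed)
    d_fg = distance_transform_edt(binary_img)      # distance to background, inside strokes
    d_bg = distance_transform_edt(1 - binary_img)  # distance to foreground, outside strokes
    dist = np.where(binary_img == 1, d_fg, d_bg)
    p_flip = np.exp(-alpha * dist)                 # pixels near edges flip most often
    return np.where(rng.random(binary_img.shape) < p_flip, 1 - binary_img, binary_img)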
Conference Paper
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that AdaBoost does not overfit rapidly.
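The core voting step common to these ensemble methods is simple enough to state directly; the labels and weights below are illustrative.

# Weighted majority vote over the predictions of several classifiers.
from collections import defaultdict

def weighted_vote(predictions, weights):
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w
    return max(scores, key=scores.get)

print(weighted_vote(["cat", "cot", "cat"], [0.5, 0.9, 0.7]))  # 'cat' (1.2 vs 0.9)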
Conference Paper
This paper addresses the problem of producing a diverse set of plausible translations. We present a simple procedure that can be used with any statistical machine translation (MT) system. We explore three ways of using diverse translations: (1) system combination, (2) discriminative reranking with rich features, and (3) a novel post-editing scenario in which multiple translations are presented to users. We find that diversity can improve performance on these tasks, especially for sentences that are difficult for MT.
Article
This paper describes an approach for classifying OCR errors based on a new variation of a well-known dynamic programming algorithm. We present results from a large-scale experiment we performed involving the printing, scanning, and OCRing of over one million characters in each of three fonts: Times, Helvetica, and Courier. Our data allow us to draw a number of interesting conclusions about the nature of OCR errors for a particular font, as well as the relationship between error sets for different fonts.
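The well-known algorithm in question is the dynamic-programming edit-distance computation; the sketch below shows the plain Levenshtein version with a backtrace that tallies substitutions, spurious characters, and missing characters, our simplified stand-in for the paper's variant.

# Align OCR output against ground truth and tally error classes.
from collections import Counter

def classify_errors(ocr, truth):
    n, m = len(ocr), len(truth)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i-1] == truth[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)
    errors, i, j = Counter(), n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (ocr[i-1] != truth[j-1]):
            if ocr[i-1] != truth[j-1]:
                errors[("substitute", ocr[i-1], truth[j-1])] += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            errors[("spurious", ocr[i-1], "")] += 1   # extra character in the OCR
            i -= 1
        else:
            errors[("missing", "", truth[j-1])] += 1  # character dropped by the OCR
            j -= 1
    return errors

print(classify_errors("tbe cot", "the cat"))  # two substitutions: b->h and o->a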