How Well Does Multiple OCR Error Correction Generalize?
William B. Lund (a), Eric K. Ringger (b), and Daniel D. Walker (c)
(a) Harold B. Lee Library, Brigham Young University, Provo, UT 84602, USA
(b) Computer Science Department, Brigham Young University, Provo, UT 84602, USA
(c) Microsoft, Redmond, WA 98052, USA
As the digitization of historical documents, such as newspapers, becomes more common, archive patrons increasingly
need accurate digital text from those documents. Building on our earlier work, this paper makes three contributions:
(1) demonstrating the applicability of novel methods for correcting optical character recognition (OCR) errors on
disparate data sets, including a new synthetic training set; (2) enhancing the correction algorithm with novel features;
and (3) assessing the data requirements of the correction learning method. First, we correct errors using conditional
random fields (CRF) trained on synthetic training data sets in order to demonstrate the applicability of the methodology
to unrelated test sets. Second, we show the strength of lexical features from the training sets on two unrelated test
sets, yielding a relative reduction in word error rate (WER) on the test sets of 6.52%. New features capture the
recurrence of hypothesis tokens and yield an additional relative reduction in WER of 2.30%. Further, we show that only
2.0% of the full training corpus of over 500,000 feature cases is needed to achieve correction results comparable to
those using the entire training corpus, effectively reducing both the complexity of the training process and the
learned correction model.
Keywords: Historical Documents, Optical Character Recognition, OCR Error Correction, Ensemble Methods
Historical machine printed document images often exhibit signiﬁcant noise, making the optical character recognition
(OCR) of the text difﬁcult. Our previous work1,2 shows that it is possible for combined outputs from multiple OCR
engines using machine learning techniques to provide text output with a lower word error rate (WER) than the OCR of any
one OCR engine alone. Further, we use methods which are scalable to very large collections, up to millions of images,
without document- or test corpus-speciﬁc manipulation of training data, which would be infeasible given time and resource
Ensemble methods are used effectively in a variety of problems such as machine translation, speech recognition,
handwriting recognition, and OCR error correction, to name a few. In a paper on pattern recognition frameworks for
ensemble methods, Kittler et al.3 state: “It had been observed ... that although one of the [classifiers] would yield
the best performance, the sets of patterns mis-classified by the different classifiers would not necessarily overlap.
This suggested that different classifier designs potentially offered complementary information about the patterns to
be classified which could be harnessed to improve the performance of the selected classifier.” Previously we have
merged complementary information such as the output of multiple OCR engines2 and multiple binarizations of the same
document image1 on a single test set and training set. The goal of this paper is to demonstrate the generalizability of
the methods involving multiple OCR engines, to introduce a new test set and a new training set, to show the results of
feature engineering, and to demonstrate the degree to which a large training set may be reduced and still yield results
consistent with the full training set.
The remainder of this paper proceeds as follows: Section 2 discusses existing work in several ﬁelds related to the
methods and outcomes of this research. Section 3 outlines a brief overview of the methods used to extract corrected text
with machine learning techniques, leading to the heart of the paper: the evaluation of the extent to which these methods
are applicable across multiple test corpora and synthetic training sets in Section 4. Finally, Section 5 summarizes the
conclusions of this research.
Further author information: (Send correspondence to William Lund)
William Lund: E-mail: bill lund (at) byu.edu, Telephone: +1 801 422-4202
Eric Ringger: E-mail: ringger (at) cs.byu.edu
2. RELATED WORK
Extracting usable text from older, degraded documents is often unreliable, frequently to the point of being unusable.4
Kae and Learned-Miller5 remind us that OCR is not a solved problem and that “the goal of transcribing documents
completely and accurately... is still far off.” At some point the word error rate of the OCR output inhibits the
ability of the user to accomplish useful tasks.
Ensemble methods are used with success in this task as well as in a variety of settings. In 1998 Kittler et al.3
provided a common theoretical framework for combining classifiers which is the basis for much of the work in ensemble
methods. In off-line handwriting recognition Bertolami and Bunke6 use ensemble methods in the language model. Si et
al.7 use an ensemble of named entity recognizers to improve overall recognition in the bio-medical field. For machine
translation, Macherey and Och8 present a study of how different machine translation systems affect the quality of the
machine translation of the ensemble. Maximum entropy models have been used previously to select among multiple parses
returned by a
Klein and Kobel10 as well as Cecotti and Belaïd11 note that the differences between OCR outputs can be used to
advantage. This observation is behind the success of ensemble methods, that multiple systems which are complementary
can be leveraged for an improved combined output. The question of how many inputs should be used in an ensemble system
is generally “the more the better.” Caruana et al.12 use on the order of 2000 models built by varying the parameters of the
training system to create different models. On a smaller number of inputs (ﬁve OCR engines), Lund et al.13 demonstrate
that the error rate of the ensemble decreases with each added system. It should be noted that the complementarity14 of
correct responses of the methods is critical. An important point is that even high error rate systems added to an ensemble
can contribute to reducing the ensemble error rate where the addition represents new cases or information not included
previously. Diversity in the ensemble is critical to improving the system’s performance over that of any individual in the
ensemble.15–18 This paper will expand on the observation regarding complementary sources, noting that one useful source
of diversity is the output of multiple OCR engines.
Our post-OCR error correction using multiple sequences requires an alignment of the text sequences, which can use
either exact or approximate algorithms. The multiple sequence alignment problem has been shown to be NP-Hard by Wang
and Jiang.19 Lund and Ringger2 demonstrate an efficient means for exact alignment; however, alignment problems on long
sequences are still computationally intractable. Much of the work in multiple sequence alignment is done in the field
of bioinformatics, where the size of the alignment problems has forced the discovery and adoption of heuristic
solutions such as progressive alignment.20 Elias21 discusses how a simple edit distance metric, which may be
appropriate to text operations, is not directly applicable to biological alignment problems, which means that much of
the alignment work in bioinformatics requires some adaptation for use in the case of text.
It is well known from work by Lopresti and Zhou22 that voting among multiple sequences generated by the same OCR
engine can signiﬁcantly improve OCR. One practical application of voting can be found in the Medical Article Record
System (MARS) of the National Library of Medicine (NLM) which uses a voting OCR server for text recognition.23
Esakov, Lopresti, and Sandberg24 evaluated recognition errors of OCR systems, and Kolak, Byrne, and Resnik25 specifically
applied their algorithms to OCR systems for post-OCR error correction in natural language processing tasks. OCR
error correction with in-domain training26 as well as out-of-domain training using a synthetic training dataset13 have been
shown to be effective.
Recent work by Yamazoe et al.27 effectively uses multiple weighted ﬁnite-state transducers (WFST) with both the OCR
and a lexicon of the target language(s) to resolve the ambiguity inherent in line- and character-segmentation, and character
recognition, in which the number of combinations can be very large. Both conventional OCR and post-processing are
contained within their system, resolving the difference between various hypotheses before committing to an output string.
A non-ensemble method for improving historical document images prior to OCR is adaptive binarization. Our previous
work1 compared the OCR results of an ensemble of document image binarizations to the results of adaptive binarization.
From the perspective of the corpus WER, the results of the ensemble methods were superior to those of individual
document image adaptive binarizations.
A contribution of this paper is an extension of previous supervised, discriminative machine learning methods for
choosing among all hypotheses; these methods are shown to be effective on two unrelated historical print corpora and
with two unrelated training datasets. In this work the models are learned on synthetic, out-of-domain training data
sets, created and computationally degraded according to the methods proposed by Sarkar, Baird, and Zhang28 and Baird.29
Figure 1. The methodology used in this paper to: create the training set and CRF model, prepare the test set for processing by the CRF,
and evaluate the results using the test set transcription.
The first step of our methodology prepares the test and training corpora, scanning document images and creating the
synthetic training sets. (See Figure 1 for a flow chart of this process.) Once the document images have been scanned,
the images are recognized with the selected OCR engines; in this case those are Abbyy FineReader 10, OmniPage 18, Adobe
Acrobat Pro X, and Tesseract 3 (an open source OCR system). The baseline OCR results of each OCR engine are seen in
Table 1. The OCR output is character aligned, yielding parallel hypotheses from the OCR engines. Where spaces occur
in the aligned texts, the process creates a column of text hypotheses. From these columns, features used by the machine
learner are extracted as described in Section 3.3. The machine learner, a trained conditional random field (CRF), labels
the aligned hypotheses with the OCR engine to be selected or “NONE”, indicating that no output for the column should
be selected. The CRF models are trained using the training sets prepared similarly to the test corpora. Collectively,
the text associated with each label from all columns constitutes the error corrected output. (See Figure 4 for an
example.) The following sections describe this process in more detail.

Table 1. Baseline corpus word error rates using micro-averaging for datasets by OCR engine. These results are on the original document
images without modification. Note that WERs of greater than 100% are possible due to multiple insertions not found in the reference

Dataset                          Abbyy FineReader 10   OmniPage 18   Adobe Pro X   Tesseract 3
Test Set Corpora
  Eisenhower Communiqués               19.98%             30.50%        52.59%        93.99%
    Average WER: 49.26%
  19th Century Mormon Article           7.44%             11.77%        23.49%        18.35%
    Average WER: 15.26%
Training Set Corpora
  Enron Synthetic Dataset              24.31%             30.57%        68.95%        56.07%
  Reuters-21578 Dataset                15.35%             20.28%        99.77%        82.37%

Figure 2. From “Communiqué No. 1” of the Eisenhower Communiqués. The word error rate of this document across the five OCR
engines used in this research varied from 10.63% to 63.41%, with a mean WER of 36.34%.
Four datasets were used in this work: two test sets, the Eisenhower Communiqués30 and the Nineteenth Century Mormon
Article Newspaper Index;31 and two training sets, an extraction of the 2001 Topic Annotated Enron Email Data Set and an
extraction of the Reuters-21578 Text Categorization Test Collection.32,33 The following sections describe each dataset
and how it was created.
3.1.1 Eisenhower Communiqués
The Eisenhower Communiqués30 are a collection of 605 facsimiles* of typewritten documents created by the Supreme
Headquarters Allied Expeditionary Force (SHAEF) during the last years of World War II. Having been typewritten and
duplicated using carbon paper, the quality of the print is poor. (See Figure 2 for an example.) A manual transcription of
these documents serves as the gold standard for evaluating the word error rates of the OCR. In the course of duplication
the documents have effectively become bi-tonal and are treated as such by this research.
*An online presentation of The Eisenhower Communiqués is viewable at http://www.lib.byu.edu/digital/eisenhower.
3.1.2 Nineteenth Century Mormon Article Newspaper Index
The Nineteenth Century Mormon Article Newspaper Index31 (19thCMAN) corpus is a collection of 1055 color images† of
articles dealing with events and persons of The Church of Jesus Christ of Latter-day Saints (Mormon) from an archive of
historical newspapers of the 19th century housed at the Harold B. Lee Library of Brigham Young University. As expected
from 19th century newsprint, the quality of the paper and the print was poor when first printed and has further
degraded over time. The newspapers were scanned at 400 dots per inch (dpi) in 24-bit RGB color, and the individual
articles were segmented and saved as TIFF images. For previous work the RGB images were converted to 8-bit grayscale.
The OCR output of each document was manually corrected by two reviewers to act as a gold standard. An example from the
document corpus can be seen in Figure 3.

Figure 3. A fragment from the grayscale scan of document
BM 24May1859 p2 c3 from the 19th Century Mormon Newspaper
3.1.3 Synthetic Training Sets
For the training sets, we created synthetic data sets from the 2001 Topic Annotated Enron Email Data Set34 and the
Reuters-21578 Text Categorization Test Collection.32 From the digital text of each document in these corpora, a TIFF
document image was generated and randomly degraded using techniques inspired by Baird29 and Sarkar et al.28 Each
synthetic document image was produced using the following steps. First, an image is rendered as a bi-tonal document.
Spatial sampling error is introduced by translating the entire image stochastically. The image is blurred using a
Gaussian convolution kernel. The document is sub-sampled. Gaussian noise is added to simulate pixel sensor sensitivity.
Finally, a threshold is applied to binarize the document. For further details on the process, please consult the paper
by Walker, Lund, and
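The degradation steps listed above can be illustrated with a minimal sketch. It is not the generator used for the paper's training sets: the pixel convention (1.0 for white paper, 0.0 for ink), every parameter value, and the function names `translate`, `gaussian_blur`, and `degrade` are assumptions chosen only to make the order of operations concrete on a small image stored as nested lists.

```python
import math
import random

WHITE = 1.0  # assumed convention for this sketch: 1.0 = paper, 0.0 = ink


def translate(img, dx, dy):
    """Shift the whole page by (dx, dy) pixels; exposed borders become white."""
    h, w = len(img), len(img[0])
    return [[img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else WHITE
             for x in range(w)] for y in range(h)]


def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian convolution; pixels beyond the edge are clamped."""
    kernel = [math.exp(-(i * i) / (2.0 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]
    h, w = len(img), len(img[0])

    def clamp(v, hi):
        return max(0, min(hi, v))

    rows = [[sum(kernel[j + radius] * img[y][clamp(x + j, w - 1)]
                 for j in range(-radius, radius + 1)) for x in range(w)]
            for y in range(h)]
    return [[sum(kernel[j + radius] * rows[clamp(y + j, h - 1)][x]
                 for j in range(-radius, radius + 1)) for x in range(w)]
            for y in range(h)]


def degrade(img, seed=0, max_shift=1, sigma=1.0, factor=2,
            noise_sigma=0.05, threshold=0.5):
    """Apply the steps in the order given in the text to a bi-tonal image."""
    rng = random.Random(seed)
    # 1. Spatial sampling error: random translation of the entire image.
    out = translate(img, rng.randint(-max_shift, max_shift),
                    rng.randint(-max_shift, max_shift))
    # 2. Blur with a Gaussian kernel.
    out = gaussian_blur(out, sigma)
    # 3. Sub-sample by keeping every `factor`-th pixel.
    out = [row[::factor] for row in out[::factor]]
    # 4. Additive Gaussian noise to mimic sensor sensitivity.
    out = [[p + rng.gauss(0.0, noise_sigma) for p in row] for row in out]
    # 5. Threshold back to a bi-tonal image (1 = white, 0 = ink).
    return [[1 if p >= threshold else 0 for p in row] for row in out]
```

Calling `degrade(page, seed=1)` on an 8x8 page returns a 4x4 bi-tonal image; fixing the seed makes the random shift and noise repeatable.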
3.2 Progressive Alignment
Since exact n-way alignments become exponentially complex in n, we turned to greedy progressive alignment heuristics,
which are applied successfully in bioinformatics20 and textual variance analysis.35 In brief, progressive alignment
algorithms begin by selecting two sequences to be aligned that are most similar based on some similarity measure
applied to all sequences. Additional sequences are aligned, using the same selection criteria as for the first two,
until all sequences have been aligned. (Refer to Spencer et al.35 for details on progressive alignment in a textual
context.) The order of pairwise alignments is specified in a binary tree structure called the guide tree. Due to
downstream consequences of greedy choices, a progressive alignment heuristic is not optimal; however, the resulting
alignments are good in practice.
In this paper, the order of the alignment, unless indicated otherwise, is a greedy approximation of the guide tree based
on sequence similarity of the training set; specifically in the order: Abbyy FineReader and OmniPage, then Adobe
Acrobat Pro X, and lastly Tesseract. The incremental results of this alignment order on the WER can be seen in Table 4.
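As an illustration of the progressive idea described in this section, the following sketch aligns sequences one at a time against the alignment built so far. It is a simplified stand-in rather than the aligner used in this work: it assumes unit costs, takes the guide order as given by the caller instead of computing a guide tree, counts a character as matching a column if any engine already placed that character there, and uses "-" as the gap symbol (so literal hyphens in the text would be ambiguous). The names `align_to_profile` and `progressive_align` are hypothetical.

```python
GAP = "-"


def align_to_profile(columns, seq, width):
    """Globally align `seq` to an existing alignment.

    `columns` is a list of tuples, one character (or GAP) per already-aligned
    sequence; `width` is the number of sequences already aligned.
    """
    n, m = len(columns), len(seq)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if seq[j - 1] in columns[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match or substitute
                             cost[i - 1][j] + 1,        # gap in the new sequence
                             cost[i][j - 1] + 1)        # gap in the old alignment
    # Trace back from the bottom-right corner to recover the alignment.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if seq[j - 1] in columns[i - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                out.append(columns[i - 1] + (seq[j - 1],))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out.append(columns[i - 1] + (GAP,))
            i -= 1
        else:
            out.append((GAP,) * width + (seq[j - 1],))
            j -= 1
    out.reverse()
    return out


def progressive_align(sequences):
    """Align sequences in the given (guide) order; return equal-length rows."""
    columns = [(ch,) for ch in sequences[0]]
    for width, seq in enumerate(sequences[1:], start=1):
        columns = align_to_profile(columns, seq, width)
    return ["".join(col[r] for col in columns) for r in range(len(sequences))]
```

For instance, `progressive_align(["abc", "ac"])` pads the second sequence with one gap so that both rows have the same length. Because each step commits to the alignment built so far, earlier greedy choices are never revisited, which is the source of the sub-optimality noted above.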
3.3 Assigning Features in an Alignment Column
We employ modern supervised discriminative machine learning methods trained on the training set. The role of the machine
learning model is to select the proper hypothesis from each aligned column in order to produce the best OCR correction.
We prepared training data from the training sets with the same OCR engines and aligned their output using the same
progressive alignment algorithm described above in order to produce aligned columns of hypotheses. (See Figure 4.) As a
base we extracted the following kinds of feature types from each column:
• Voting: multiple features to indicate where multiple hypotheses in a column match exactly.
• Number: binary indicators for whether each hypothesis is a cardinal number.
• Dictionary: binary indicators for whether each hypothesis appears in the Linux dictionary.
• Gazetteer: binary indicators for whether each hypothesis appears in a gazetteer of place names.
• Lexical Features: words that appear in the corpus are individually created as features.
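A small sketch may make the base feature types listed above more concrete. Only the name "DictA" comes from this paper; the other feature names (`VoteAT`, `NumA`, `GazA`, `LexA=...`), the function `column_features`, the use of `str.isdigit` as the number test, and the lower-casing before the dictionary lookup are illustrative assumptions. As described in this section, punctuation is stripped before the dictionary lookup, while voting requires an exact match including punctuation.

```python
import string


def column_features(column, dictionary, gazetteer):
    """Return the set of active feature names for one aligned column.

    `column` maps a one-letter engine code (e.g. "A" for Abbyy) to that
    engine's hypothesis token; an empty string means the engine had no output.
    """
    feats = set()
    engines = sorted(column)
    # Voting: pairs of engines whose hypotheses match exactly, punctuation included.
    for i, a in enumerate(engines):
        for b in engines[i + 1:]:
            if column[a] and column[a] == column[b]:
                feats.add(f"Vote{a}{b}")
    for e in engines:
        token = column[e]
        core = token.strip(string.punctuation)  # drop leading/trailing punctuation
        if core.isdigit():
            feats.add(f"Num{e}")    # simplistic stand-in for "cardinal number"
        if core.lower() in dictionary:
            feats.add(f"Dict{e}")
        if core in gazetteer:
            feats.add(f"Gaz{e}")
        if token:
            feats.add(f"Lex{e}={token}")  # one lexical feature per observed word
    return feats
```

For a column such as `{"A": "Mormon", "O": "Morrison", "T": "Mormon", "D": "lUornton"}`, the sketch activates a voting feature for the Abbyy/Tesseract agreement and dictionary features only for the engines whose token is in the supplied word list.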
†An online presentation of The Nineteenth Century Mormon Article Newspaper Index is viewable at
OCR Engine Hypothesis Columns (Aligned Text)
Abbyy (A) Xfee M-o-r-mon !.< -gittiamre -Called
OmniPage (O) The- M-orrison E-r--gistature -Called
Tesseract (T) The- M-o-r-mon L-r--gnslauure -Called
Adobe (D) The- lUo-rnton Lt-˜˜&aslature ()ailed
Selected Label OmniPage Abbyy NONE OmniPage Abbyy
Error Corrected Text The Mormon Called
Transcript The- M-o-r-mon L-e--gislature -Called
Figure 4. An example of an aligned lattice from document CT 2Apr1872 p2 c1. The “dash” character represents a gap or INDEL in
the alignment, where a character needs to be inserted in order to complete the alignment. Correct hypotheses in the aligned text are
underlined. The aligned text is divided into columns of hypotheses on spaces in the aligned sequences. Note that a space occurred in the
middle of the Abbyy mis-recognition of the word “Legislature”.
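The division of aligned text into hypothesis columns can be sketched as follows. This is an illustrative reading of the caption above, not the paper's code: it assumes that a space in any one of the aligned sequences closes the current column (which is why a stray space inside a mis-recognized word produces an extra column), that "-" marks a gap, and that gaps are removed from each hypothesis. The function name `split_columns` is hypothetical.

```python
GAP = "-"


def split_columns(aligned_rows):
    """Split equal-length aligned strings into columns of hypotheses."""
    length = len(aligned_rows[0])
    if any(len(row) != length for row in aligned_rows):
        raise ValueError("aligned rows must all have the same length")
    columns, start = [], 0
    for i in range(length + 1):
        # Close a column at the end of the text or wherever any row has a space.
        if i == length or any(row[i] == " " for row in aligned_rows):
            pieces = [row[start:i + 1].replace(GAP, "").strip()
                      for row in aligned_rows]
            if any(pieces):  # skip columns that are empty for every engine
                columns.append(pieces)
            start = i + 1
    return columns
```

With two rows `"Xfee Mormon"` and `"The- Mormon"`, the result is two columns: `["Xfee", "The"]` and `["Mormon", "Mormon"]`.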
For each training case (an aligned column), the label indicates which OCR engine provided the correct hypothesis.
Ties were resolved by selecting the OCR engine with the lowest WER from the training set. Note that “DictA” (and so
forth) indicates that the entry from Abbyy is found in the dictionary. Leading and trailing punctuation is removed from
the hypothesis before checking in the dictionary. To produce a “Voting” feature type the match must be exact, including
punctuation. Once all of the feature vectors have been extracted, we use the maximum entropy learner in the Mallet36
toolkit to train a maximum entropy (a.k.a., multinomial logistic regression) model to predict choices on unseen alignment
Previous work2,13,26 using the Eisenhower Communiqués test set and the Enron training set was focused on individual
document performance in which corpus WERs were calculated as the average of the individual document WERs. A result
of this method is that corpus statistics would give more weight to the tokens of a short document than to the tokens
of a long document. This approach may be called a “macro-average” in which each document is given an equal weight,
regardless of its size. In contrast, the results reported in this paper are based on tokens, giving an equal weight to
each occurrence of a token in the corpus, the OCR output, and the resulting error corrections. All averages and other
statistics in this paper, unless stated otherwise, use this “micro-averaging” approach. We believe this approach is
better suited to feature engineering, although there are important uses of a document-by-document evaluation. Examples
where a document-by-document analysis is useful would be observing improvement trends by document or in error analysis
related to document
Both for documents and for the corpus as a whole, the word error rate is calculated as
\[ \mathrm{WordErrorRate} = \frac{\mathrm{Substitutions} + \mathrm{Insertions} + \mathrm{Deletions}}{\mathrm{Number\ of\ Reference\ Tokens}} \tag{1} \]
which may be greater than 100% due to the number of insertions, which is not limited.
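The difference between the micro- and macro-averaged WER can be seen in a short sketch. It is a simplified illustration, not the NIST Sclite scoring used in this paper: errors are counted as the minimum word-level edit distance with unit costs, tokens are split on whitespace, and the two example documents are invented.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions (word level)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j - 1] + (r != h),  # substitution or match
                           prev[j] + 1,             # deletion
                           cur[j - 1] + 1))         # insertion
        prev = cur
    return prev[-1]


def micro_wer(pairs):
    """Corpus WER: total errors over total reference tokens."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    tokens = sum(len(r.split()) for r, _ in pairs)
    return errors / tokens


def macro_wer(pairs):
    """Average of per-document WERs; every document has equal weight."""
    rates = [edit_distance(r.split(), h.split()) / len(r.split()) for r, h in pairs]
    return sum(rates) / len(rates)


# Invented (reference, hypothesis) pairs for illustration only.
docs = [
    ("the mormon legislature called", "the morrison called"),  # 2 errors / 4 tokens
    ("yes", "no no no"),                                       # 3 errors / 1 token
]
```

Here the one-word document has a WER of 300% because of its two insertions, so `macro_wer(docs)` is dominated by that short document, whereas `micro_wer(docs)` weights every reference token equally.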
The remainder of this section is organized as follows. Section 4.1 shows the baseline results from previous work
reinterpreted using the micro-averaging approach discussed above. New features for this work, recurring features and an
order 1 CRF, are described in Section 4.2 with the results of incorporating these features in Section 4.3. The generalization
of these methods on a new test corpus and a new training corpus is shown in Sections 4.4 and 4.5. Section 4.6 wraps up
with an evaluation of the effect of the training corpus size on the results of both test corpora.
4.1 Baseline Results
As a baseline consider the WERs of the OCR output on the unmodified images of the documents from the Eisenhower
Communiqués test set and the Enron training set. Each document image in the Eisenhower Communiqués test set, as well
as all corpora in this work, was recognized using four OCR engines: Abbyy FineReader 10, OmniPage 18, Adobe Acrobat
Pro X, and Tesseract 3. The resulting recognition hypotheses were evaluated using the NIST Sclite37 tool to compute the
number of correctly recognized tokens, as well as substitutions, insertions, and deletions for each document. These
results, shown in the first numerical column of Table 4, are a reference point for evaluating the effectiveness of the
new techniques and corpora introduced in this paper.
Our previous work13,26 showed the improvement in the WER using a trained machine learner with the alignment of the
output from multiple OCR engines. Reinterpreting these previous results using micro-averaging techniques, the underlined
entries in Table 4 show the decreasing WER as additional OCR outputs are added to the alignment. The order of the OCR
output alignment was determined by the order of increasing WER on the Enron training set.
4.2 New Features
In addition to the baseline feature types described in Section 3.3 (voting, dictionary lookup, gazetteer lookup,
identifying numbers, and the lexical features of the training set), this paper adds three new features for
consideration by the machine learner:
1. RecurSim, a binary feature indicating whether a token occurs more than once.
2. RecurBucket#, a multivalued feature dividing the number of times a token appears into one of ten buckets, numbered
0 to 9, with each bucket containing approximately the same number of tokens. The higher the bucket number, the
fewer times the individual tokens in that bucket appeared in the corpus. (See Section 4.2.1 for more details.)
3. Order, the machine learner is trained using either an order 0 or an order 1 conditional random ﬁeld (CRF). The order
0 CRF only considers the current column of hypotheses when deciding on the label. The order 1 CRF considers the
previous label in addition to the current column features.
4.2.1 Recurring Features
The recurring features (RecurSim and RecurBucket#) are calculated on the training and test sets since the contents of
the OCR outputs are available without violating the restriction on using the gold standard transcription of the
documents in the corpus. The simple recurring feature (RecurSim) is created for every token that appears more than once
anywhere in the corpus. The hope is that tracking recurring tokens in the corpus will capture out-of-vocabulary tokens
that do not appear in the dictionary or gazetteer.
The bucket recurring feature (RecurBucket#) divides up the range of recurring feature counts into ten buckets tagged
with the labels RecurBucket0 to RecurBucket9. The method for assigning bucket labels calculates the number of times a
given token appears in the OCR of the corpus. For example in the combined OCR of the Eisenhower test corpus there are
3,931 different tokens that each appear twice, one of which is “SCHEUERN” and is likely a valid recognition by the OCR
engines of a town by that name in Germany. (See Figure 5.) The buckets are assigned labels in an ascending order such
that there are approximately the same number of recurring tokens in each bucket. For the Eisenhower test set the bucket
assignments are found in Table 2. The simpler “RecurSim” feature is assigned to all tokens that recur within the corpus, so
RecurBucket0 through RecurBucket9 would all be mapped to RecurSim.
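One way to compute such recurrence features is sketched below. The exact bucketing procedure is not fully specified in this section, so the sketch makes its own assumptions: recurring tokens are sorted by count and cut into groups of roughly equal numbers of distinct tokens, and tokens with the same count are always kept in the same bucket so that each bucket corresponds to a contiguous range of counts. The function name `recurrence_features` is hypothetical.

```python
from collections import Counter


def recurrence_features(tokens, n_buckets=10):
    """Map each recurring token to its (RecurSim, RecurBucket#) feature names."""
    counts = Counter(tokens)
    # Only tokens seen at least twice are "recurring"; sort them by count.
    recurring = sorted((c, t) for t, c in counts.items() if c >= 2)
    features = {}
    if not recurring:
        return features
    per_bucket = max(1, -(-len(recurring) // n_buckets))  # ceiling division
    bucket, filled, prev_count = 0, 0, None
    for count, token in recurring:
        # Start a new bucket once the current one is full, but never split
        # tokens that share the same count across two buckets.
        if filled >= per_bucket and count != prev_count and bucket < n_buckets - 1:
            bucket += 1
            filled = 0
        features[token] = ("RecurSim", f"RecurBucket{bucket}")
        filled += 1
        prev_count = count
    return features
```

Every token that receives a bucket label also receives `RecurSim`, mirroring the statement above that all bucket labels map to the simpler feature; tokens that occur only once receive neither.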
Figure 5. A histogram from the Eisenhower test set showing the number of times that a token with a given number of occurrences appears
in the OCR of the test corpus. For example there are 3,931 tokens that appear twice in the corpus and one token that appears 283 times.
Table 2. Assignments of token recurrences in the Eisenhower test set to feature RecurBucket#.

                 Recurring Token Count
Bucket Label       from       to
RecurBucket0          2        3
RecurBucket1          4       12
RecurBucket2          5       21
RecurBucket3         22       30
RecurBucket4         31       41
RecurBucket5         42       52
RecurBucket6         53       62
RecurBucket7         63       76
RecurBucket8         77      150
RecurBucket9        151      283
4.2.2 CRF Order
Previously, the machine learner used an order 0 conditional random ﬁeld (CRF), also called a log-linear classiﬁer. This
means that only the features of the current column are used to select the label assigned to the features from the hypothesis
column. The selected label, which is one of the OCR engines or the label “NONE”, determines which OCR hypothesis to
select or whether to select none of the OCR outputs.
For this paper we have added a new set of models, trained using the same training sets but modeled using an order 1
CRF. The model considers not only the features found in the current hypothesis column, but also the label that was selected
previously. The results will clearly identify whether the order 0 or the order 1 CRF model is being used.
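The distinction between the two model orders can be written out in standard notation. The symbols are introduced here for illustration and are not taken from the paper: $x_t$ denotes the features of hypothesis column $t$, $y_t$ its label (an OCR engine or NONE), $f_k$ the feature functions, $\lambda_k$ their learned weights, and $y_0$ a fixed start label.

```latex
% Order 0: each column is labeled independently (log-linear / maximum entropy model)
p(y_t \mid x_t) =
  \frac{\exp\left(\sum_k \lambda_k f_k(y_t, x_t)\right)}
       {\sum_{y'} \exp\left(\sum_k \lambda_k f_k(y', x_t)\right)}

% Order 1: linear-chain CRF over the T columns of a document; each factor
% also sees the previous label y_{t-1}
p(y_{1:T} \mid x_{1:T}) =
  \frac{1}{Z(x_{1:T})}
  \exp\left(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y_{t-1}, y_t, x_t)\right),
\qquad
Z(x_{1:T}) = \sum_{y'_{1:T}} \exp\left(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y'_{t-1}, y'_t, x_t)\right)
```

In the order 0 case the normalization is over the labels of a single column, while in the order 1 case it is over all label sequences, which is what allows the previous label to influence the current choice.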
4.3 Results of the New Features
One of the goals of this research was to determine the contribution of the lexical features learned from the training
set to the overall performance of the machine learner. To this end we group features into sets that include or exclude
the RecurSim, RecurBucket#, and Lexical feature types. The set names and the features in each set are listed in
Table 3. Refer to Section 3.3 for an explanation of the feature names.
The results across all feature set groupings and CRF orders may be seen in Table 4. To orient the reader, the previously
published results are underlined in the “Order 0 CRF” section of the table. Note the italicized entries which indicate
improvements‡ within a given OCR alignment over the previous results.
‡The ground truth transcription of the Eisenhower test set consists of 145,346 words in 605 documents. A WER reduction of 0.01%
constitutes 15 tokens that are corrected across the corpus, consisting of insertions that are eliminated or words that are corrected.
Table 3. The grouping of features used in various model configurations.

Feature Set                 Feature Types
Base Set                    Voting, Dictionary, Gazetteer, and Number
RecurSim Set                All features found in the Base Set along with the RecurSim feature type
RecurBucket# Set            All features found in the Base Set along with the RecurBucket# feature types
Lexical Set                 All features found in the Base Set with the Lexical feature type
Lexical+RecurSim Set        A combination of the Base Set, Lexical Set, and the RecurSim feature type
Lexical+RecurBucket# Set    A combination of the Base Set, Lexical Set, and the RecurBucket# feature types
Table 4. Eisenhower test set WERs with the Enron training set including the new features. The underlined entries correspond to the
results previously published. Italicized entries indicate improvement over previous results. The bolded entry is the lowest WER in the
Alignment Order                          Base     RecurSim  RecurBucket#  Lexical  Lexical+RecurSim  Lexical+RecurBucket#
Order 0 CRF (Abbyy OCR WER: 19.98%)
  Abbyy + OmniPage                       21.91%   22.20%    22.15%        18.31%   18.13%            18.19%
  Abbyy + OmniPage + Tesseract           17.47%   17.15%    17.47%        17.52%   17.29%            17.45%
  Abbyy + OmniPage + Tesseract + Adobe   17.80%   17.55%    17.78%        16.64%   16.23%            16.46%
Order 1 CRF
  Abbyy + OmniPage                       21.56%   21.76%    21.68%        17.70%   17.61%            17.70%
  Abbyy + OmniPage + Tesseract           17.68%   17.59%    17.70%        17.80%   17.62%            17.75%
  Abbyy + OmniPage + Tesseract + Adobe   17.82%   17.74%    17.79%        17.49%   17.17%            17.42%
Observe that the order 1 CRF is not in general an improvement over the order 0 CRF. With the exception of results
using the Abbyy+OmniPage alignment, all of the other results in the table have an increased WER. Further, the best
result of 17.42% in the order 1 CRF, from the Abbyy+OmniPage+Tesseract+Adobe alignment with the Lexical+RecurBucket#
Set, is significantly higher than the best result in the order 0 CRF table at 16.23%. Based on this, the order 1 CRF
will not be included in the results that follow since it does not show an improvement over the order 0 CRF.
Clearly the Lexical features, as found in the three feature sets Lexical Set, Lexical+RecurSim Set, and
Lexical+RecurBucket# Set, show improvement over the feature sets without the Lexical features, yielding a 6.52% relative
improvement between the Base Set of the Abbyy+OmniPage+Tesseract+Adobe alignment and the Lexical Set as shown in Table 4.
In addition, the recurring features in conjunction with the Lexical features are superior to the Lexical features alone,
with the Lexical+RecurSim Set having the greatest improvement, showing an additional 2.30% relative improvement over the
Abbyy+OmniPage+Tesseract+Adobe alignment and the Lexical Set mentioned above. Overall the Lexical+RecurSim Set
performs best on the Eisenhower test set and the Enron training set. We will proceed with the Lexical Set and recurring
feature sets as we compare results with the new test corpus, the 19th Century Mormon Article Newspaper Index.
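The relative improvements quoted above can be checked against the order 0, four-engine row of Table 4. The sketch below is only arithmetic on the reported WERs; the helper `relative_reduction` and its `reference` parameter are introduced here for illustration. It suggests that the additional 2.30% figure is reproduced when the Lexical to Lexical+RecurSim difference is expressed relative to the Base Set WER; relative to the Lexical Set WER the same difference is roughly 2.46%.

```python
def relative_reduction(before, after, reference=None):
    """Percentage reduction from `before` to `after`, relative to `reference`
    (defaults to `before`)."""
    ref = before if reference is None else reference
    return 100.0 * (before - after) / ref


# WERs (in percent) from Table 4, order 0 CRF, Abbyy+OmniPage+Tesseract+Adobe.
base, lexical, lexical_recursim = 17.80, 16.64, 16.23

lexical_gain = relative_reduction(base, lexical)
recur_gain_vs_base = relative_reduction(lexical, lexical_recursim, reference=base)
recur_gain_vs_lexical = relative_reduction(lexical, lexical_recursim)

# A WER change of 0.01% on the 145,346-word Eisenhower transcription,
# expressed as a number of tokens.
tokens_per_hundredth_percent = round(145_346 * 0.0001)
```

Rounded to two decimals, `lexical_gain` is 6.52 and `recur_gain_vs_base` is 2.30, while `tokens_per_hundredth_percent` comes to 15, matching the footnote on the size of the Eisenhower ground truth.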
4.4 Results on a Different Test Corpus
The 19th Century Mormon Article Newspaper Index, described in Section 3.1.2, consists of 208,630 words§ in 1,055
documents digitized to 8-bit grayscale, in contrast to the Eisenhower dataset which is effectively bitonal. The results using
the selected feature sets from the previous section are found in Table 5.
§A reduction of 0.01% in the WER on the 19th Century Mormon Article Newspaper Index results in 21 tokens that are corrected
across the corpus.
Table 5. Results on the Eisenhower and the 19th Century Mormon Article Newspaper Index test sets using a CRF model trained on the
Enron, Reuters, and combined training sets. The bold entries are the lowest WERs for each section of the table. The underlined entries
are the lowest WERs for their respective test sets.
                             Eisenhower Test Set                                19thCMAN Test Set
Alignment                    Lexical  Lexical+RecurSim  Lexical+RecurBucket#    Lexical  Lexical+RecurSim  Lexical+RecurBucket#
Enron Training Set           (Abbyy OCR WER: 19.98%)                            (Abbyy OCR WER: 7.44%)
  Abbyy+OmniPage             18.31%   18.13%            18.19%                  6.91%    7.03%             6.92%
  +Tesseract                 17.52%   17.29%            17.45%                  7.06%    7.75%             7.46%
  +Tesseract+Adobe           16.64%   16.23%            16.49%                  6.83%    7.03%             6.90%
Reuters-21578 Training Set
  Abbyy+OmniPage             19.91%   20.14%            19.66%                  5.99%    5.99%             5.98%
  +Tesseract                 16.29%   16.12%            15.97%                  6.80%    7.64%             7.16%
  +Tesseract+Adobe           16.41%   16.71%            16.37%                  6.99%    7.68%             7.42%
Combined Training Set
  Abbyy+OmniPage             20.01%   20.02%            19.83%                  6.04%    5.97%             5.99%
  +Tesseract                 16.88%   17.15%            16.96%                  6.82%    6.82%             7.28%
  +Tesseract+Adobe           16.63%   16.56%            16.63%                  6.83%    6.82%             7.05%
The monotonic improvement in WER seen on the Eisenhower test set using the Lexical+RecurSim feature set is not
reflected in the 19thCMAN test set using the Enron training set. On the 19thCMAN test set, adding the Tesseract OCR
output increases the WER above the Abbyy+OmniPage alignment results for all of the feature sets. Unlike the
Eisenhower test set, 19thCMAN appears to be sensitive to the relatively high WER of the Tesseract OCR (56.08%).
Adding the Adobe OCR output improves the resulting WER to a level equal to or below both the Abbyy FineReader WER
and the Abbyy+OmniPage alignment, even though the WER of Adobe on the 19thCMAN test set (68.95%) is higher than
that of Tesseract. The conclusion here is that although the high WER OCR outputs of Tesseract and Adobe in the training
set were able to contribute to lowering the WER for the Eisenhower test set, in combination they did not contribute in the
same way with the 19thCMAN test set. A possible solution may be to eliminate documents with high WERs from the
training set. Because the training set consists of alignments of documents from multiple OCR engines, a document
eliminated due to a high WER from one OCR engine would need to be eliminated from all of the OCR engine
contributions to the training set, which may decrease the size of the training set. Section 4.6 explores whether the full
contents of the training set are needed to maintain the level of performance seen so far.
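The document-elimination idea proposed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the `wer_by_engine` layout (engine name mapped to per-document WER), and the `max_wer` threshold are all assumptions introduced here.

```python
from typing import Dict, List


def documents_to_keep(wer_by_engine: Dict[str, Dict[str, float]],
                      max_wer: float) -> List[str]:
    """Return ids of documents whose WER is at most max_wer for every engine.

    wer_by_engine is a hypothetical layout: engine name -> {document id -> WER}.
    A document that fails the threshold for any one engine is dropped from all
    engines, because the training cases come from multi-engine alignments.
    """
    per_engine = list(wer_by_engine.values())
    if not per_engine:
        return []
    # Only documents recognized by every engine can be aligned across engines.
    shared = set(per_engine[0])
    for per_doc in per_engine[1:]:
        shared &= set(per_doc)
    return sorted(doc for doc in shared
                  if all(per_doc[doc] <= max_wer for per_doc in per_engine))
```

The returned ids would then select which aligned documents contribute feature vectors to the training set; choosing a threshold is left open here, as it is in the text.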
4.5 Results on New Training Corpora
New in this paper, the Reuters-21578 training set described in Section 3.1.3 is a synthetic dataset consisting of grayscale
images, in contrast to the Enron training set, which had previously been binarized. The hope was that, since the
19thCMAN test set was grayscale, the Reuters-21578 training set would contribute to improving the overall WER. Note
that the effects of the high WER OCR outputs from Tesseract and Adobe Pro X seem more pronounced with this training
set. On the 19thCMAN test set, the WER results for Abbyy+OmniPage were the best, and further adding Tesseract and
Adobe Pro X to the alignment each increased the WER. Overall, however, the Reuters-21578 training set showed better
results than the Enron training set.
Figure 6. Comparing the WER on the Eisenhower test set, with the Reuters-21578 training set, as the proportion of the
training set varies from 0.01% to 100%.
Figure 7. Comparing the WER on the 19thCMAN test set, with the Reuters-21578 training set, as the proportion of the
training set varies from 0.01% to 100%.
The last rows of Table 5 show the results of merging both the Enron and the Reuters-21578 training sets. Although more
training data is available, there is no improvement on the Eisenhower test set and a small improvement of only 0.01%
for the 19thCMAN test set. The conclusion is that there is a point beyond which more training set vectors do not
necessarily improve the outcome. Overall the best results were seen with the Reuters-21578 training set, although not
consistently with the complete set of OCR alignments.
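For readers who want to compare the entries of Table 5 on a common scale, the sketch below computes absolute and relative WER reductions against the Abbyy OCR baseline. The function name and the `table5` dictionary are introduced here for illustration; the four WER values are copied from Table 5 (the Abbyy baseline and the lowest corrected WER listed for each test set), and the derived reductions are not figures reported by the authors.

```python
def relative_wer_reduction(baseline_wer: float, corrected_wer: float) -> float:
    """Fraction of the baseline WER removed by correction (negative if WER rose)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - corrected_wer) / baseline_wer


# WERs in percent, copied from Table 5: (Abbyy OCR baseline, lowest corrected WER).
table5 = {
    "Eisenhower": (19.98, 15.97),
    "19thCMAN": (7.44, 5.97),
}

for test_set, (baseline, corrected) in table5.items():
    absolute = baseline - corrected  # reduction in percentage points
    relative = relative_wer_reduction(baseline, corrected)
    print(f"{test_set}: absolute {absolute:.2f} points, relative {relative:.1%}")
```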
4.6 Results of Sweeping the Size of the Training Sets
Following up on the observation from the last section that increasing the size of the training set does not necessarily
improve the WER outcome, we sweep the size of the Reuters-21578 training set from 0.01% to 100% to explore how the
WER of the Eisenhower and 19thCMAN test sets varies as the training set increases in size. We selected the best result on
the Eisenhower test set across all training sets, which was the Abbyy+OmniPage+Tesseract alignment using the
Lexical+RecurBucket# feature set as seen in Table 5. We selected ten proportion values (0.01%, 0.1%, 0.2%, 0.5%, 1%,
2%, 5%, 10%, 20%, and 50%) of the training set, which has 585,291 feature vectors. At each proportion value we took five
random samples from the full training set to create 50 new training sets, from which order 0 CRF models were created.
The results shown in Figures 6 and 7 are the average WER, as well as the minimum and maximum WERs, for the five
models of each proportion value. At a proportion of only 0.01% of the total training set, only 61 feature vectors on
average are included in the models created.
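The sampling scheme described above (ten proportions, five random samples each, 50 training sets in total) can be sketched as below. This is a hedged illustration: the names `PROPORTIONS`, `SAMPLES_PER_PROPORTION`, and `sweep_training_sets` are introduced here, and drawing a fixed-size sample without replacement is an assumption, since the text does not state the exact sampling procedure. Sample sizes from this sketch therefore need not match the counts reported in the paper.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")

# The ten proportion values listed in the text, as fractions of the full set.
PROPORTIONS = [0.0001, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.10, 0.20, 0.50]
SAMPLES_PER_PROPORTION = 5


def sweep_training_sets(vectors: Sequence[T], seed: int = 0) -> Dict[float, List[List[T]]]:
    """Draw five random subsets of the feature vectors at each proportion.

    Assumption: each subset is a fixed-size sample without replacement.
    """
    rng = random.Random(seed)  # seeded so that the sweep is reproducible
    sweep: Dict[float, List[List[T]]] = {}
    for proportion in PROPORTIONS:
        size = max(1, round(proportion * len(vectors)))
        sweep[proportion] = [rng.sample(vectors, size)
                             for _ in range(SAMPLES_PER_PROPORTION)]
    return sweep
```

Each of the resulting subsets would then be used to train one model, and the WER of each model on a test set recorded for the comparison in Figures 6 and 7.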
Of interest in the Eisenhower results is that, beginning with a training set proportion of only 2.0%, the average
WER of the resulting error-corrected output is less than the WER using the entire training corpus. Whereas the full
training set takes a significant amount of time to train, the five 2.0% training sets are considerably faster to train.
Further, the models created from the 2.0% training sets consist of on the order of 11,500 features, while the full model
consists of over 217,000 features. Clearly the complexity of the full model does not necessarily reward us with better
results.
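The comparison made in this paragraph can be expressed as a small summary over the sweep. The sketch below assumes a hypothetical mapping from each proportion to the WERs of its five models. It reports the average, minimum, and maximum per proportion (the quantities plotted in Figures 6 and 7) and finds the smallest proportion whose average WER is no worse than that of the full training set. Function names are introduced here for illustration only.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def summarize(wers_by_proportion: Dict[float, List[float]]
              ) -> Dict[float, Tuple[float, float, float]]:
    """Map each proportion to (average, minimum, maximum) WER, smallest first."""
    return {p: (mean(wers), min(wers), max(wers))
            for p, wers in sorted(wers_by_proportion.items())}


def smallest_sufficient_proportion(wers_by_proportion: Dict[float, List[float]],
                                   full_wer: float) -> Optional[float]:
    """Smallest proportion whose average WER is at most full_wer, if any."""
    for proportion, (average, _, _) in summarize(wers_by_proportion).items():
        if average <= full_wer:
            return proportion
    return None
```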
Regarding the 19thCMAN test set, similar to the Eisenhower test set, the full effect of the Reuters-21578 training set is
visible between 1% and 10% of the total training set. Interestingly, superior results are possible given individual training
set proportions within the same range.
In general, the results seen in previous work are also seen with the new test set and training set. Although the
methodologies used in previous papers still produced good results, they do not work consistently across the board. The
19thCMAN test set seemed sensitive to the high WER OCR engines included in the training set, but we demonstrated that
the size of the training set can be reduced, which could eliminate the troublesome high WER documents and potentially
improve the end results. Future work will include error analysis of how the models of the Enron and Reuters-21578
training sets differ in their performance on the Eisenhower and 19thCMAN test sets.
The authors express their appreciation to the Fulton Supercomputing Lab and the Harold B. Lee Library, L. Tom Perry
Special Collections, at Brigham Young University, without whose support this research would not have been possible.
We thank the employees of the Digital Initiatives Lab of the Harold B. Lee Library for their extensive work in digitizing,
OCRing, and correcting the OCR text of the Nineteenth Century Mormon Article Newspaper Index. Lastly, we express
gratitude to Patricia A. Frade of the Harold B. Lee Library for her extensive work in assembling and curating the Nineteenth
Century Mormon Article Newspaper Index and Dr. Richard Hacken for curating the Eisenhower Communiqués.
 Lund, W. B., Kennard, D. J., and Ringger, E. K., “Combining multiple thresholding binarization values to improve
OCR output,” in [Proceedings of Document Recognition and Retrieval XX], (Feb. 2013).
 Lund, W. B. and Ringger, E. K., “Improving optical character recognition through efficient multiple system alignment,”
in [Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries], 231–240, ACM, Austin, TX,
 Kittler, J., Hatef, M., Duin, R. P. W., and Matas, J., “On combining classifiers,” IEEE Trans. Pattern Anal. Mach.
Intell. 20, 226–239 (Mar. 1998).
 Antonacopoulos, A. and Karatzas, D., “Semantics-based content extraction in typewritten historical documents,” in
[Proceedings of the 8th International Conference on Document Analysis and Recognition, 2005], 1, 48–53 (Aug.
 Kae, A. and Learned-Miller, E., “Learning on the fly: Font-free approaches to difficult OCR problems,” in
[Proceedings of the International Conference on Document Analysis and Recognition (ICDAR) 2009], (2009).
 Bertolami, R. and Bunke, H., “Ensemble methods for handwritten text line recognition systems,” in [Proceedings of
the 2005 IEEE International Conference on Systems, Man and Cybernetics], (Oct. 2005).
 Si, L., Kanungo, T., and Huang, X., “Boosting performance of bio-entity recognition by combining results from
multiple systems,” in [Proceedings of the 5th international workshop on Bioinformatics ], 76–83, ACM, Chicago,
 Macherey, W. and Och, F. J., “An empirical study on computing consensus translations from multiple machine
translation systems,” in [EMNLP-CoNLL], 986 (2007).
 Charniak, E. and Johnson, M., “Coarse-to-fine n-best parsing and MaxEnt discriminative reranking,” in [Proceedings
of the 43rd Annual Meeting of the ACL], 173–180 (June 2005).
 Klein, S. T. and Kopel, M., “A voting system for automatic OCR correction,” in [Proceedings of the SIGIR 2002
Workshop on Information Retrieval and OCR ], (Aug. 2002).
 Cecotti, H. and Belaid, A., “Hybrid OCR combination approach complemented by a specialized ICR applied on
ancient documents,” in [Proceedings of the 8th International Conference on Document Analysis and Recognition,
2005], 2, 1045–1049 (Aug. 2005).
 Caruana, R., Niculescu-Mizil, A., Crew, G., and Ksikes, A., “Ensemble selection from libraries of models,” in
[Proceedings of the twenty-first international conference on Machine learning], 18 (2004).
 Lund, W. B., Walker, D. D., and Ringger, E. K., “Progressive alignment and discriminative error correction for
multiple OCR engines,” in [Proceedings of the 11th International Conference on Document Analysis and Recognition
(ICDAR 2011)], (Sept. 2011).
 Lund, W. B., Kennard, D. J., and Ringger, E. K., “Why multiple document image binarizations improve OCR,” in
[Proceedings of the Workshop on Historical Document Imaging and Processing 2013 (HIP 2013)], (Aug. 2013).
 Hansen, L. K. and Salamon, P., “Neural network ensembles,” Pattern Analysis and Machine Intelligence, IEEE
Transactions on 12(10), 993–1001 (1990).
 Dietterich, T. G., “Ensemble methods in machine learning,” in [Multiple classifier systems], Springer (2000).
 Cer, D., Manning, C. D., and Jurafsky, D., “Positive diversity tuning for machine translation system combination,” in
[Proceedings of the Eighth Workshop on Statistical Machine Translation], 320–328, Association for Computational
Linguistics, Sofia, Bulgaria (2013).
 Gimpel, K., Batra, D., Dyer, C., and Shakhnarovich, G., “A systematic exploration of diversity in machine transla-
tion,” in [Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013) ],
 Wang, L. and Jiang, T., “On the complexity of multiple sequence alignment,” Journal of Computational Biology: A
Journal of Computational Molecular Cell Biology 1(4), 337–48 (1994).
 Notredame, C., “Recent evolutions of multiple sequence alignment algorithms,” PLoS Computational Biology 3(8),
 Elias, I., “Settling the intractability of multiple alignment,” Journal of Computational Biology: A Journal of Compu-
tational Molecular Cell Biology 13, 1323–39 (Sept. 2006).
 Lopresti, D. and Zhou, J., “Using consensus sequence voting to correct OCR error,” Computer Vision and Image
Understanding 67(1), 39–47 (1997).
 Thoma, G. and Le, D., “Medical database input using integrated OCR and document analysis and labeling technol-
ogy,” in [Proceedings 1997 Symposium on Document Image Understanding Technology ], 280 (1997).
 Esakov, J., Lopresti, D. P., and Sandberg, J., “Classification and distribution of optical character recognition errors,”
in [Proceedings of IS&T/SPIE International Symposium on Electronic Imaging ], 204–216 (Feb. 1994).
 Kolak, O., Byrne, W. J., and Resnik, P., “A generative probabilistic OCR model for NLP applications,” in [Proceed-
ings of HLT-NAACL 2003], 55–62 (May 2003).
 Lund, W. B. and Ringger, E. K., “Error correction with in-domain training across multiple OCR system outputs,” in
[Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR 2011)], (Sept.
 Yamazoe, T., Etoh, M., Yoshimura, T., and Tsujino, K., “Hypothesis preservation approach to scene text recognition
with weighted finite-state transducer,” in [2011 International Conference on Document Analysis and Recognition
(ICDAR)], 359–363 (Sept. 2011).
 Sarkar, P., Baird, H. S., and Zhang, X., “Training on severely degraded text-line images,” in [Proceedings of the
Seventh International Conference on Document Analysis and Recognition - Volume 1], 38, IEEE Computer Society
 Baird, H., “The state of the art of document image degradation modelling,” in [Digital Document Processing], 261–
279, Springer (2007).
 Jordan, D. R., “Daily battle communiques, 1944-1945,” Harold B. Lee Library, L. Tom Perry Special Collections,
MSS 2766 (1945).
 Frade, P., “19th century Mormon article newspaper index,” L. Tom Perry Special Collections, Brigham Young Uni-
 Lewis, D. D., “Reuters-21578,” http://www.daviddlewis.com/resources/testcollections/reuters21578/ (2013).
 Walker, D., Lund, W. B., and Ringger, E. K., “A synthetic document image dataset for developing and evaluating
historical document processing methods,” in [Proceedings of SPIE Volume 8297 ], 8297 (Jan. 2012).
 Berry, M. W., Browne, M., and Signer, B., “2001 topic annotated Enron email data set.” http://www.ldc.upenn.edu/
 Spencer, M. and Howe, C., “Collating texts using progressive multiple alignment,” Computers and the Humanities 38,
253–270 (Aug. 2004).
 McCallum, A. K., “MALLET: a machine learning for language toolkit.” http://mallet.cs.umass.edu (2002).
 Ajot, J., Fiscus, J., Radde, N., and Laprun, C., “Asclite – Multi-dimensional alignment program.”