A multi-evidence, multi-engine OCR system
Ilya Zavorin^a, Eugene Borovikov^a, Anna Borovikov^a, Luis Hernandez^b, Kristen Summers^a, Mark Turner^a

^a CACI International Inc, 4831 Walden Lane, Lanham, MD 20706, USA
^b Army Research Laboratory, Adelphi, MD, USA

Send correspondence to izavorin@caci.com
ABSTRACT
Although modern OCR technology is capable of handling a wide variety of document images, there is no single
OCR engine that performs equally well on all documents for a given single language script. Naturally, each
OCR engine has its strengths and weaknesses, and therefore different engines tend to differ in the accuracy on
different documents, and in the errors on the same document image. While the idea of using multiple OCR
engines to boost output accuracy is not new, most of the existing systems do not go beyond variations on
majority voting. While this approach may work well in many cases, it has limitations, especially when the OCR technology used to process a given script has not yet fully matured. Our goal is to develop a system called MEMOE (for “Multi-Evidence Multi-OCR-Engine”) that combines, in an optimal or near-optimal way, output streams of
one or more OCR engines together with various types of evidence extracted from these streams as well as from
original document images, to produce output of higher quality than that of the individual OCR engines. The
MEMOE system functions as an OCR engine taking document images and some configuration parameters as
input and producing a single output text stream. In this paper, we describe the design of the system, various
evidence types and how they are incorporated into MEMOE in the form of filters. Results of initial tests on
two corpora of Arabic documents show that, even in its initial configuration, the system is superior to a voting
algorithm and that even more improvement may be achieved by incorporating additional evidence types into the
system.
Keywords: multi-engine document image processing, machine learning, OCR, classifier combination
1. INTRODUCTION
Multi-engine OCR can be thought of as voting among multiple OCR engines, but the multi-evidence approach (as the name implies) uses multiple types of evidence, such as the level of image noise, character confusion maps, language models, and others, to help resolve the disagreements among the OCR engines via some underlying statistical model, and to produce the most likely OCR output given the evidence. Notice that similar approaches are used elsewhere in NLP technology, e.g., in multi-engine machine translation systems [1, 2]. The resulting system is considerably more sophisticated and accurate than simple voting. Unlike pure voting, the multi-evidence approach is also effective in the single OCR engine case [3], producing a more accurate, statistically optimal OCR output.
1.1. Related Work
Each OCR engine can be thought of as a pattern classifier; mathematically, then, the multi-engine OCR problem is one of classifier combination, an approach that has proven effective in increasing recognition accuracy [4–7]. Classifier combination methods strive to classify patterns given outputs from multiple (independent) classifiers, taking into account some types of external evidence (e.g. confidence levels). In OCR applications, this approach works fairly well when the task is to recognize individual characters [5], i.e. when the number of classes is limited and the OCR output alignment is trivial. This may not be the case when we are dealing with multiple degraded OCR streams with no confidence values available.
Although multi-engine OCR systems are available and fairly widely used, most of them typically employ some variation of majority voting [8]. Pure voting is fairly simple to implement and may work well in many cases, but it also has some limitations, as it typically
• assumes that one of the candidate symbols is correct (not always the case for noisy documents)
• requires more than two OCR streams, as in the case of majority voting
More sophisticated heuristically driven voting schemes [9] may suffer less from the above limitations, since they usually take into account additional sources of evidence, but they rarely assess the statistical optimality of the corrected OCR output. Our literature search has not produced a comprehensive study of optimal multi-engine OCR output combination based on multiple types of evidence.
1.2. Approach
We approach the multi-engine OCR problem as a statistically optimal combination of multiple OCR streams given multiple types of evidence. The system can be viewed as a generalization of a previous post-OCR correction approach [3] with the following important extensions:
• one or more OCR streams (including the possibility of a single stream)
• multiple types of evidence (described in Section 2.1)
• multi-input/output filters (described in Section 2.2)
As in its single-engine predecessor, the current solution is a filter-based system that uses statistical inference to optimally correct the OCR output. Like the previous system, it can work with multiple languages and scripts. In the multiple-engine case, it performs as well as or better than the majority voting algorithm.
2. SYSTEM OVERVIEW
The MEMOE system is structured as shown in Figure 1. Document images are processed with multiple OCR
engines (or, in the trivial case, a single engine). The resulting output sequences of characters go to the Multi-
Evidence Processor Engine, along with the original images. The multi-evidence processor aligns the output
text from the engines [10] and also measures a number of other features, discussed in Section 2.1. These features
include both image-based features that are measured on the original input and text-based features that reflect
the OCR engines’ output. They can serve as evidence in selecting the preferred engine output for a particular
document and/or in directly selecting the most likely text, words or characters. The output streams and cal-
culated features go to the Evidence Combiner and File Generator Engine, which combines these various inputs
into a single output sequence of characters that it deems most likely. More specifically, its Evidence Analyzer
Engine component identifies the significance of the measured features, turning them into evidence for particular
choices; the Combiner and File Generator resolves conflicts between the indicated results and produces a single
output sequence of characters. Possible implementations of the Evidence Analyzer Engine and the Combiner are
discussed in more detail in Section 2.2.
2.1. Evidence Types
This section briefly describes the results of the initial evidence investigation phase of the project. The purpose
of this phase was to make a preliminary assessment of usefulness of various evidence types in terms of predicting
and/or improving OCR accuracy of a document processed by one or more OCR engines. We investigated the
following types of evidence:
• Image optical properties measured directly from a document image.
• Disagreement among OCR engines at the character and word level.
• Character confusion matrices that record information about single characters as well as groups of characters.
• Language models on character level (e.g. bigram statistics) and word level (e.g. dictionaries).
• Characters as deformable shapes, a measure that functions like a confusion matrix, but uses a measurement
of the energy required to transform a skeleton of one character into that of another.
• OCR confidence indicators, such as the presence of unrecognized character markers in OCR output.
• High-level context such as language models tuned to specific subject domains.
• Co-location, using other words in a document as sources of corrections for a word likely to be incorrect.
• Multiple scripts in a document, which help select an appropriate engine, or engines, for a given document.

Figure 1. Overview of the MEMOE system.
Experiments with all evidence types involved qualitative or quantitative evaluations of relationships between
various continuous and discrete features associated with each type and extracted from a given document or set
of documents, and OCR accuracy for this document or set of documents. Several data sets were employed for the evidence type investigation: images of Spanish and Arabic documents, separated into several groups based on
their source, image quality, availability of ground truth and sometimes divided into training and testing subsets
for specific experiments. While the experiments had common components such as using the same corpora of
documents and software for collecting “raw” information (confusion matrices, OCR accuracy measurements etc.),
they were designed individually on a type-by-type basis. We determined that most evidence types provide useful
information that can be exploited to boost OCR accuracy. Furthermore, for each evidence type, we identified
three key characteristics:
• Whether it may serve as a basis for evaluation of OCR output, its correction, or both.
• Whether it may be attributed to single characters, words or the entire document, i.e., contain potentially
useful information at these different levels of output granularity.
• The extent to which it depends on the script or language of the document.
More details of the evidence types and how they are incorporated into the multi-engine system are given in the
following sections.
2.2. Design
The proposed multi-evidence processor engine combines multiple types of evidence, as available, to determine the statistical combination of evidence that helps boost the output OCR accuracy (see Figure 1). Contingent
upon the availability of suitable OCR engines and the developed multi-evidence filters contained in the multi-
evidence processor engine, the system should work for any script/language. The OCR text streams can be fed from concurrently running OCR engines, from off-line OCR text repositories, or both. Being modular and
filter-based, the system can run its components on several computers taking advantage of parallel computing. It
also provides an extensible component-based framework for adding custom evidence filters and OCR engines as
required for future processing needs.
2.2.1. Previous work
Initial work in this area started with a filter-based post-OCR accuracy boost system that combined different post-OCR correction filters to improve the OCR accuracy of a single OCR output stream [3]. The major focus was on developing an OCR accuracy booster based on a Hidden Markov Model (HMM). The HMM filter modeled
OCR engine noise as a two-layer stochastic process. Experiments with the HMM based post-OCR filter revealed
its versatility in applications to different languages as well as its robustness and generalization power (e.g. in
correcting words that the filter was not trained on).
The single stream approach used multiple evidence types (e.g. n-gram frequencies, confusion matrices, etc.) and
was able to handle multiple languages (e.g. Arabic, English, etc.). It seems natural to extend this multi-evidence, filter-based approach to handle multiple OCR streams. Therefore, our current multi-engine OCR system can be viewed as an extension of our previous single-stream solution.
2.2.2. Multiple filter approach
One way of implementing a multi-evidence system is to develop a collection of filters, each based on a single evidence type, and then to combine them, e.g. into a cascade. The main advantages of this approach are that the individual filters may be simpler than those that incorporate many evidence types, and that it may be easier to find a mathematical apparatus (such as an HMM) that implements them efficiently. The main challenge is that even if the individual filters are known to be optimal or near-optimal, it may be difficult to prove the optimality of their combination. Nevertheless, in this paper we focus on this approach and plan to develop alternative approaches later.
3. EVIDENCE COMBINER
In Sections 3.1 through 3.4 we describe the various evaluating and correcting filters that we experimented with, each based on a single evidence type. Section 3.5 describes how these filters were connected together to form a cascade-type multi-filter processing system.
3.1. Dictionary look-up
Perhaps one of the simplest correcting filters that can be applied to a single OCR stream is a spell-checker based
on dictionary lookup. Given a large body of correct text in a given language, a dictionary of all unique words
is extracted from this text and stored, together with a frequency count for every stored word. Throughout this
paper, we define the term “word” as a sequence of consecutive non-delimiter characters between two delimiters,
such as whitespace characters or punctuation symbols. To correct a stream of OCR text, all or some words from
the stream are compared against the dictionary. For every output word, a dictionary entry is found that is closest
to the word in terms of the Levenshtein distance. A “correct word” is a word with an exact match (distance of
0). Frequency counts are used to break ties, with more frequent dictionary words given more weight. Instead of
processing all output words, only some may be selected according to a certain criterion. For instance, only words
that contain unrecognized character markers produced by some OCR engines may be considered. This not only
may save processing time but also yield higher accuracy than all-word processing in the case when OCR output
contains many specialized words or when the dictionary is relatively small.
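To make this procedure concrete, here is a minimal Python sketch of such a dictionary-lookup corrector. The function names are ours, and the linear scan of the dictionary is purely illustrative; a practical implementation would index the dictionary (e.g., by word length) to avoid computing the distance to every entry.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def build_dictionary(correct_text: str) -> Counter:
    """All unique words of a large body of correct text, with frequencies."""
    return Counter(correct_text.split())

def correct_word(word: str, dictionary: Counter) -> str:
    """Return the dictionary entry closest to `word` in Levenshtein
    distance; frequency counts break ties (more frequent wins)."""
    if word in dictionary:   # exact match, i.e. a "correct word"
        return word
    return min(dictionary, key=lambda w: (levenshtein(word, w), -dictionary[w]))
```

In the selective mode described above, correct_word would be applied only to words that meet the chosen criterion, e.g., those containing unrecognized character markers.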
3.2. Multiple OCR output stream processing using OCRtk tools
In order to make any decision regarding quality of OCR output based on multiple OCR streams, it is necessary,
first, to synchronize these streams. Character-level synchronization is done via the synctext utility of the ISRI OCR toolkit (open source software) by Nartker et al. [10]. In Section 3.3 below, we describe a filter methodology
based on this character-level synchronization. The OCRtk package includes a vote utility that implements a
simple voting algorithm augmented by various heuristics to break ties. We use its results to establish a baseline
of performance for the MEMOE system. Voting uses the synctext alignment as its first step to determine the smallest set of mismatching character blocks from each of the OCR streams. It works well only when there
is a relatively small amount of disagreement between streams. Given large amounts of disagreement, the tool
produces rather suboptimal results, e.g. with very large blocks from one stream corresponding to small or empty
blocks from another stream.
Figure 2. Voting correction by disagreement pattern matching
3.3. Disagreement-pattern-based voting corrector
Voting usually fails when OCR engines exhibit significant disagreement. For instance, if three OCR streams
each produce a different candidate character for the same position in the document, additional processing is
needed to break the tie correctly. We make this decision by detecting patterns of disagreement between engines.
This means that even if an OCR engine makes a mistake, as long as it makes it consistently, this consistency may be exploited to our advantage when combined with information coming from another OCR engine. We illustrate
our approach in Figure 2. Suppose we have three OCR engines. OCR1 is overall the most accurate, OCR2 is somewhat worse (e.g. it consistently confuses “e” with “o”), while OCR3 is the worst. Suppose the correct
word is “plate” and we need to determine what are the correct third and fifth characters. Further suppose
that we have some training corpus, i.e. OCR output with accompanying ground truth, and we have previously
collected disagreement character block sets from this corpus. In the case illustrated in the figure, these would
be {a,u,#;a} and {e,o,?;e} from OCR outputs corresponding to “table” and {e,c,#;c} from “came”. We can
determine whether disagreement character block sets that need to be corrected (in this case {a,z,?} and {e,o,&})
are similar to any of the training sets. One similarity measure is simply the number of matching blocks. For
instance, there is a single match between {a,z,?} and {a,u,#}, two matches between {e,o,&} and {e,o,?} and no
matches between {e,o,&} and {a,u,#}. Once we find the best matching training block set for a given block set
being corrected, we can correct the latter with the ground truth block corresponding to the former. For instance,
the best match for {e,o,&} is {e,o,?}, so we replace {e,o,&} with “e”. Similarly, we replace {a,z,?} with “a”.
Note that all character blocks in this example consist of a single character. In general, there may be some empty
blocks as well as blocks consisting of multiple characters.
This approach is intended for cases where there is enough agreement between the engines that, when they do disagree, the resulting character blocks are small. In other situations, additional logic is used. For instance,
when one of the blocks is empty or very small while a corresponding block from a competing engine is large, we
pick the latter.
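The following sketch illustrates this block-matching step, under the assumption that a disagreement block set is represented as a tuple of per-engine character blocks and that each training entry pairs such a tuple with its ground-truth block; the similarity measure is the matching-block count described above, and ties resolve to the earliest training entry in this simplified version.

```python
def block_similarity(blocks_a, blocks_b):
    """Number of engine positions whose character blocks match exactly."""
    return sum(a == b for a, b in zip(blocks_a, blocks_b))

def correct_blocks(query, training):
    """Replace a disagreement block set with the ground-truth block of its
    best-matching training set.

    query    -- tuple of per-engine blocks, e.g. ('e', 'o', '&')
    training -- list of (blocks, truth) pairs collected from a corpus
                with ground truth
    """
    _, best_truth = max(training, key=lambda t: block_similarity(query, t[0]))
    return best_truth

# Training block sets from the "plate" example in Figure 2:
training = [(('a', 'u', '#'), 'a'),   # from "table"
            (('e', 'o', '?'), 'e'),   # from "table"
            (('e', 'c', '#'), 'c')]   # from "came"

print(correct_blocks(('e', 'o', '&'), training))  # -> 'e' (two matching blocks)
print(correct_blocks(('a', 'z', '?'), training))  # -> 'a'
```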
3.4. OCR stream triage
When there is significant disagreement between OCR engines, combining multiple OCR streams often yields OCR accuracy that is lower than that of the individual engines. Therefore it is often more beneficial to try to select the single best-quality stream, and to retain more than one stream only when it is very likely that they all produce comparably good results.
We propose a triage filter that is applied to two OCR streams. It works in two steps. First, given two OCR outputs for a given document image, we compute the F-score value that is based on the Levenshtein distance. Only if the value is higher than a certain threshold can we conclude that both streams are good enough to be combined by another filter, e.g. by the voting corrector. Otherwise we proceed to the second stage, where we select the best stream of the two. Given OCR engines I and II, let $\{p_j^{(i)}\}$, $j = 1, \ldots, K$ be a set of $K$ measurements collected over an entire OCR output document, where $i$ is either I or II depending on which of the engines produces the output document. Further assume that these measurements are such that, often, engine I is shown to be more accurate than engine II iff $p_j^{(I)} \geq p_j^{(II)}$, and the bigger the difference between the two measurement values, the bigger the difference between accuracies. Examples of such measurements include
• The total number of words
• The total size of the document in bytes
• The median and mean word lengths
• The total number of correct words
• The fraction of the correct words
• The median and mean lengths of correct words
Suppose, without loss of generality, that for a given pair of OCR documents, $p_j^{(I)} \geq p_j^{(II)}$ for $j = 1, \ldots, M$ and $p_j^{(I)} < p_j^{(II)}$ for $j = M+1, \ldots, K$. Define

$$\rho^{(I)} = \sum_{j=1}^{M} \log \frac{p_j^{(I)}}{p_j^{(II)}},$$

and similarly for $\rho^{(II)}$ (summing over $j = M+1, \ldots, K$ with the ratio inverted). If $\rho^{(I)} \geq \rho^{(II)}$, the filter picks OCR engine I as the best; otherwise it picks engine II. Note that a set of three or more OCR streams may be triaged by successive application of this filter to pairs of streams.
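Below is a minimal sketch of this second triage stage, assuming the K measurements are supplied as two parallel lists of strictly positive values (so that the logarithms are defined); the function name is ours.

```python
import math

def triage(p1, p2):
    """Predict the better of two OCR streams.

    p1, p2 -- per-document measurements for engines I and II, e.g.
              word count, document size, mean word length; all values
              must be strictly positive.
    Returns 1 or 2, the index of the predicted winner.
    """
    # rho^(I): log-ratios summed over indices where engine I dominates.
    rho1 = sum(math.log(a / b) for a, b in zip(p1, p2) if a >= b)
    # rho^(II): inverted log-ratios over the remaining indices.
    rho2 = sum(math.log(b / a) for a, b in zip(p1, p2) if a < b)
    return 1 if rho1 >= rho2 else 2
```

Three or more streams would then be triaged by applying this function to successive pairs, carrying the winner forward.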
3.5. Filter Combination
The filters described above may be connected in various ways. Since filters based on multiple streams usually
produce a single output stream, it seems logical to apply single-stream filters, like the dictionary lookup, first, followed by multiple-stream filters. Also, due to the importance of triage before combining streams when disagreement is significant, we believe that an effective cascade configuration of the proposed filters would be the following:
• Apply the dictionary lookup to each of the OCR streams separately.
• Apply triage to weed out poor-quality streams.
• If more than one engine produces output good enough to warrant combining them, align these streams using the synctext tool, then apply the voting corrector to the synchronized outputs.
• Otherwise, pass the predicted best output as the final output of the cascaded system.
We emphasize that as we develop new filters and improve the existing ones, the configuration of the resulting
system will most likely change. In particular, individual filters may be switched on and off depending on the
availability of specific evidence types to the system.
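One possible wiring of this cascade is sketched below. The four injected callables stand in for the dictionary lookup, the pairwise triage, the pairwise F-score, and the synctext alignment followed by the voting corrector; the quality threshold is purely illustrative.

```python
from typing import Callable, List

def memoe_cascade(
    streams: List[str],
    dict_correct: Callable[[str], str],
    triage_pair: Callable[[str, str], str],
    pair_quality: Callable[[str, str], float],
    align_and_vote: Callable[[List[str]], str],
    threshold: float = 0.9,
) -> str:
    """Cascade of Section 3.5 with the individual filters injected."""
    # 1. Single-stream filters first: dictionary lookup on each stream.
    streams = [dict_correct(s) for s in streams]
    # 2. Successive pairwise triage predicts the single best stream.
    best = streams[0]
    for s in streams[1:]:
        best = triage_pair(best, s)
    # 3. Retain every stream whose quality is comparable to the best.
    good = [s for s in streams if pair_quality(best, s) >= threshold]
    if len(good) > 1:
        # 4a. Align the surviving streams, then apply the voting corrector.
        return align_and_vote(good)
    # 4b. Otherwise the predicted best stream is the final output.
    return best
```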
4. EXPERIMENTS
4.1. OCR engines, test data and performance metrics
We experimented with two- and three-engine combinations. Both cases included engines denoted as OCR1 and
OCR2; the three-engine case included a third, denoted as OCR3. We used two corpora of Arabic document
images with corresponding ground truth. The first data set (CORPUS1) consists of 136 scans of various Arabic
newspapers and magazines. The second dataset (CORPUS2) includes 200 images provided by a commercial
vendor. All performance assessments are based on the standard measures of precision, recall, and F-score, which are expressed in terms of the numbers of matches, insertions, deletions, and substitutions, and are computed at either the character or word level by the same tool that computes the Levenshtein distance between two text streams.
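For reference, these measures can be computed from the alignment counts as follows, under the common convention that the OCR output length is matches + substitutions + insertions and the ground-truth length is matches + substitutions + deletions; zero-count edge cases are ignored in this sketch.

```python
def precision_recall_f(matches, insertions, deletions, substitutions):
    """Precision, recall and F-score from Levenshtein alignment counts."""
    precision = matches / (matches + substitutions + insertions)
    recall = matches / (matches + substitutions + deletions)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```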
In Sections 4.2 and 4.3 we present results of unit-testing of the individual filters described in Section 3. More
comprehensive tests of the complete cascade system as described in Section 3.5 will be performed in the near
future.
4.2. Triage
We evaluated a slightly modified version of the triage filter described in Section 3.4, with the first step, the evaluation of the pairwise F-score, omitted. We applied the OCR1 and OCR2 engines to both corpora and evaluated
how often the filter would correctly predict which of the two OCR streams was better. We also experimented
with different subsets of the measurement set listed in Section 3.4. The best results were achieved when only
3 out of the 8 measurements were used, namely the fraction of the correct words as well as the mean lengths
of all words and of correct words in the document. The results are summarized in Table 1. For instance, for
CORPUS1, the filter correctly predicted OCR1 and OCR2 as winners 76 and 49 times, respectively, and missed
only 11 times. In all the cases when it did miss, the corresponding accuracies of the two OCR engines were
fairly similar, which implies that these misses would not affect the resulting quality. All the measurements based
on statistics of correct words in an output stream require matching of output against a dictionary, which may
be time-consuming. Therefore we repeated our experiments with only the dictionary-independent measurements included, namely the total number of words, the document size, and the median and mean word lengths. The
best results (not shown here) were achieved with all four such measurements. Performance was somewhat worse
than when dictionary-dependent measurements were included, especially on CORPUS1, yet still reasonably good.
Table 1. Performance of the triage filter using both dictionary-dependent and -independent measurements

                          Predicted Winner
                    CORPUS1             CORPUS2
True Winner       OCR1    OCR2        OCR1    OCR2
OCR1                76       8         191       3
OCR2                 3      49           0       6
4.3. Dictionary lookup and voting correction
We evaluated the voting corrector filter on the three-engine combination. Output produced by these engines on
the documents from CORPUS2, together with the corresponding ground truth, was used for training, and COR-
PUS1 was used for testing. Before applying the voting corrector, the OCR2 output documents were processed by
the dictionary lookup and all words that contained unrecognized character markers were replaced by dictionary
entries that differed from the output words only in the unrecognized character positions, provided such entries
were found. The OCR1 and OCR3 output documents were not processed by the lookup because the correspond-
ing engines did not generate any unrecognized character markers. We compared the performance of the voting
corrector with that of OCRtk voting using three criteria. First, we counted the number of documents (out of
the total of 136) on which the filter’s word accuracy (F-score) exceeded that of the voting. Second, we measured
the average difference in accuracy between output produced by the two methods over the entire corpus. Third,
as sizes of test documents varied significantly, we computed average accuracy difference weighted by the sizes of
the corresponding ground truth files. The results are shown in Table 2. As one can see, we achieved a significant
improvement over voting.
Table 2. Performance of the three-engine voting corrector prepended with dictionary lookup

Count (out of 136 docs)    Avg accuracy diff    Weighted avg accuracy diff
         101                    17.8%                     19.2%
5. SUMMARY AND FUTURE WORK
Initial experiments indicate that it is possible to design an effective and robust multi-engine OCR system based
on multiple types of evidence. They also provide insights into how individual filters may be modified, into alternative ways of using individual evidence types, and into ways to better combine individual filters. For instance,
character block sets described in Section 3.3 may be used as observation vectors in a probabilistic framework
similar to the one described in Ref. 3. Engine disagreement information obtained from the synctext utility may
be used to determine words in individual OCR output streams that need to be matched against a dictionary.
Investigation of these and other ideas, thorough testing of the complete cascade filter system, and collection of additional test data with ground truth are currently being performed by the authors.
ACKNOWLEDGMENTS
The authors would like to thank J.J. Tavernier for their help with software development and for their helpful
comments.
The research reported in this document/presentation was performed in connection with Contract No. DAAD19-
03-C-0059 with the U.S. Army Research Laboratory. The views and conclusions contained in this docu-
ment/presentation are those of the authors and should not be interpreted as presenting the official policies
or position, whether expressed or implied, of the U.S. Army Research Laboratory or the U.S. Government unless
so designated by other authorized documents. Citation of manufacturers or trade names does not constitute
an official endorsement or approval of the use thereof. The U.S. Government is authorized to reproduce and
distribute reprints for Government purposes notwithstanding any copyright notation hereon.
REFERENCES
1. T. Nomoto, “Predictive models of performance in multi-engine machine translation,” in Proceedings of the
MT Summit IX, pp. 269–276, (New Orleans, USA), September 2003.
2. Y. Zuo and C. Zong, “Multi-engine based Chinese-to-English translation system,” in Proceedings of the
INTERSPEECH/ICSLP-2004 satellite workshop: International Workshop on Spoken Language Translation,
(Japan), 2004.
3. E. Borovikov, I. Zavorin, and M. Turner, “A filter based post-OCR accuracy boost system,” in Proceedings
of the 1st ACM workshop on Hardcopy document processing, 2004.
4. A. Al-Ani and M. Deriche, “A new technique for combining multiple classifiers using the Dempster-Shafer
theory of evidence,” Journal of Artificial Intelligence Research 17, pp. 333–361, 2002.
5. S. Jaeger, “Informational classifier fusion,” in Proceedings of the International Conference on Pattern Recog-
nition, pp. I: 216–219, 2004.
6. D.-S. Lee, A theory of classifier combination: the neural network approach. PhD thesis, State University of
New York at Buffalo, 1995.
7. X. Lin, “DRR research beyond COTS OCR software: A survey,” Tech. Rep. HPL-2004-167, Imaging Systems
Laboratory, HP Laboratories Palo Alto, 2004.
8. X. Lin, “Reliable OCR solution for digital content re-mastering,” in Proceedings of SPIE Conference on
Document Recognition and Retrieval IX, 2002.
9. S. T. Klein and M. Kopel, “A voting system for automatic OCR correction,” in Proceedings of the SIGIR 2002
Workshop on Information Retrieval and OCR: From Converting Content to Grasping Meaning, (University
of Tampere), August 2002.
10. T. A. Nartker, S. V. Rice, and S. E. Lumos, “Software tools and test data for research and testing of
page-reading OCR systems,” in Proceedings of the SPIE Conference on Document Recognition and Retrieval,
5676, pp. 37–47, SPIE, (San Jose, CA), January 2005.