A multi-evidence, multi-engine OCR system
Ilya Zavorin^a, Eugene Borovikov^a, Anna Borovikov^a, Luis Hernandez^b, Kristen Summers^a, Mark Turner^a
^a CACI International Inc, 4831 Walden Lane, Lanham, MD 20706, USA
^b Army Research Laboratory, Adelphi, MD, USA
Send correspondence to izavorin@caci.com
ABSTRACT
Although modern OCR technology is capable of handling a wide variety of document images, there is no single
OCR engine that performs equally well on all documents for a given single language script. Naturally, each
OCR engine has its strengths and weaknesses, and therefore different engines tend to differ in the accuracy on
different documents, and in the errors on the same document image. While the idea of using multiple OCR
engines to boost output accuracy is not new, most of the existing systems do not go beyond variations on
majority voting. While this approach may work well in many cases, it has limitations, especially when OCR
technology used to process a given script has not yet fully matured. Our goal is to develop a system called MEMOE
(for “Multi-Evidence Multi-OCR-Engine”) that combines, in an optimal or near-optimal way, output streams of
one or more OCR engines together with various types of evidence extracted from these streams as well as from
original document images, to produce output of higher quality than that of the individual OCR engines. The
MEMOE system functions as an OCR engine taking document images and some configuration parameters as
input and producing a single output text stream. In this paper, we describe the design of the system, various
evidence types and how they are incorporated into MEMOE in the form of filters. Results of initial tests on
two corpora of Arabic documents show that, even in its initial configuration, the system is superior to a voting
algorithm and that even more improvement may be achieved by incorporating additional evidence types into the
system.
Keywords: multi-engine document image processing, machine learning, OCR, classifier combination
1. INTRODUCTION
Multi-engine OCR can be thought of as voting among multiple OCR engines, but the multi-evidence approach (as the name implies) uses multiple types of evidence, such as the level of image noise, character confusion maps, language models, and others, to help resolve the disagreements among the OCR engines via some underlying statistical model, and produce the most likely OCR output given the evidence. Notice that similar approaches are used elsewhere in NLP technology, e.g., in multi-engine machine translation systems [1, 2]. The resulting system is considerably more sophisticated and accurate than simple voting. Unlike pure voting, the multi-evidence approach is also effective in the single-OCR-engine case [3], producing a more accurate, statistically optimal OCR output.
1.1. Related Work
Each OCR engine can be thought of as a pattern classifier; then, mathematically, the multi-engine OCR problem is one of classifier combination, an approach that has proven to be effective in increasing recognition accuracy [4–7]. Classifier combination methods strive to classify patterns given outputs from multiple (independent) classifiers, taking into account some types of external evidence (e.g., confidence levels). In OCR applications, this approach works fairly well when the task is to recognize individual characters [5], i.e., when the number of classes is limited and the OCR output alignment is trivial. This may not be the case when we are dealing with multiple degraded OCR streams with no confidence values available.
Although multi-engine OCR systems are available and fairly widely used, most of them typically employ some variation of majority voting [8]. Pure voting is fairly simple to implement and may work well in many cases, but it also has some limitations, as it typically
• assumes that one of the candidate symbols is correct (not always the case for noisy documents), and
• requires more than two OCR streams, as is the case for majority voting.
More sophisticated, heuristically driven voting schemes [9] may suffer less from the above limitations, since they usually take into account additional sources of evidence, but they rarely assess the statistical optimality of the corrected OCR output. A literature search has not produced a comprehensive study of optimal multi-engine OCR output combination based on multiple types of evidence.
1.2. Approach
We approach the multi-engine OCR problem as a statistically optimal combination of multiple OCR streams given multiple types of evidence. This system can be viewed as a generalization of a previous post-OCR correction approach [3] with the following important extensions:
• one or more OCR streams (including the possibility of a single stream),
• multiple types of evidence (described in Section 2.1), and
• multi-input/output filters (described in Section 2.2).
As in its single-engine predecessor, the current solution is a filter-based system that uses statistical inference to optimally correct the OCR output. Like the previous system, it can work with multiple languages and scripts. In the multi-engine case, it performs as well as or better than the majority voting algorithm.
2. SYSTEM OVERVIEW
The MEMOE system is structured as shown in Figure 1. Document images are processed with multiple OCR engines (or, in the trivial case, a single engine). The resulting output sequences of characters go to the Multi-Evidence Processor Engine, along with the original images. The multi-evidence processor aligns the output text from the engines [10] and also measures a number of other features, discussed in Section 2.1. These features include both image-based features that are measured on the original input and text-based features that reflect the OCR engines’ output. They can serve as evidence in selecting the preferred engine output for a particular document and/or in directly selecting the most likely text, words or characters. The output streams and calculated features go to the Evidence Combiner and File Generator Engine, which combines these various inputs into a single output sequence of characters that it deems most likely. More specifically, its Evidence Analyzer Engine component identifies the significance of the measured features, turning them into evidence for particular choices; the Combiner and File Generator resolves conflicts between the indicated results and produces a single output sequence of characters. Possible implementations of the Evidence Analyzer Engine and the Combiner are discussed in more detail in Section 2.2.
2.1. Evidence Types
This section briefly describes the results of the initial evidence investigation phase of the project. The purpose
of this phase was to make a preliminary assessment of usefulness of various evidence types in terms of predicting
and/or improving OCR accuracy of a document processed by one or more OCR engines. We investigated the
following types of evidence:
• Image optical properties measured directly from a document image.
• Disagreement among OCR engines at the character and word level.
• Character confusion matrices that record information about single characters as well as groups of characters.
• Language models at the character level (e.g., bigram statistics) and word level (e.g., dictionaries).
• Characters as deformable shapes, a measure that functions like a confusion matrix, but uses a measurement of the energy required to transform a skeleton of one character into that of another.
• OCR confidence indicators, such as the presence of unrecognized character markers in OCR output.
• High-level context, such as language models tuned to specific subject domains.
• Co-location, using other words in a document as sources of corrections for a word likely to be incorrect.
• Multiple scripts in a document, which help select an appropriate engine, or engines, for a given document.

Figure 1. Overview of the MEMOE system.
Experiments with all evidence types involved qualitative or quantitative evaluations of relationships between various continuous and discrete features associated with each type and extracted from a given document or set of documents, and OCR accuracy for this document or set of documents. Several data sets were employed for the evidence type investigation: images of Spanish and Arabic documents separated into several groups based on their source, image quality and availability of ground truth, and sometimes divided into training and testing subsets for specific experiments. While the experiments had common components, such as using the same corpora of documents and software for collecting “raw” information (confusion matrices, OCR accuracy measurements, etc.), they were designed individually on a type-by-type basis. We determined that most evidence types provide useful information that can be exploited to boost OCR accuracy. Furthermore, for each evidence type, we identified three key characteristics:
• Whether it may serve as a basis for evaluation of OCR output, its correction, or both.
• Whether it may be attributed to single characters, words or the entire document, i.e., whether it contains potentially useful information at these different levels of output granularity.
• The extent to which it depends on the script or language of the document.
More details of the evidence types and how they are incorporated into the multi-engine system are given in the
following sections.
2.2. Design
The proposed multi-evidence processor engine combines multiple types of evidence, as available, to determine the statistical combination of evidence that helps boost the output OCR accuracy (see Figure 1). Contingent upon the availability of suitable OCR engines and of the multi-evidence filters developed for the multi-evidence processor engine, the system should work for any script or language. The OCR text streams can be fed either from concurrently running OCR engines or from off-line OCR text repositories. Being modular and filter-based, the system can run its components on several computers, taking advantage of parallel computing. It also provides an extensible, component-based framework for adding custom evidence filters and OCR engines as required for future processing needs.
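To make the filter-based design concrete, the following minimal Python sketch shows one way such a modular framework could be organized. The class and method names (EvidenceFilter, Cascade, apply, run) are our own illustrative choices and are not taken from the MEMOE implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class EvidenceFilter(ABC):
    """One evidence-based filter: consumes one or more OCR text streams
    (plus optional image- or text-derived features) and emits one or more
    corrected or down-selected streams."""

    @abstractmethod
    def apply(self, streams: List[str], features: Dict) -> List[str]:
        ...


class Cascade:
    """Runs a configurable sequence of filters, each operating on the
    output streams of the previous one, and returns a single stream."""

    def __init__(self, filters: List[EvidenceFilter]):
        self.filters = filters

    def run(self, streams: List[str], features: Dict) -> str:
        for f in self.filters:
            streams = f.apply(streams, features)
        return streams[0]


class IdentityFilter(EvidenceFilter):
    """Trivial placeholder filter used only to exercise the framework."""

    def apply(self, streams: List[str], features: Dict) -> List[str]:
        return streams


print(Cascade([IdentityFilter()]).run(["some OCR text"], {}))  # -> "some OCR text"
```

Custom evidence filters would subclass EvidenceFilter, so new evidence types can be plugged into an existing cascade configuration without changing the surrounding framework.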
2.2.1. Previous work
Initial work in this area started with a filter-based post-OCR accuracy boost system that combined different post-OCR correction filters to improve the OCR accuracy for a single OCR output stream [3]. The major focus was on developing an OCR accuracy booster based on a Hidden Markov Model (HMM). The HMM filter modeled OCR engine noise as a two-layer stochastic process. Experiments with the HMM-based post-OCR filter revealed its versatility in applications to different languages as well as its robustness and generalization power (e.g., in correcting words that the filter was not trained on).
The single-stream approach used multiple evidence types (e.g., n-gram frequencies, confusion matrices, etc.) and was able to handle multiple languages (e.g., Arabic, English, etc.). It seems natural to extend this multi-evidence, filter-based approach to handle multiple OCR streams. Therefore, our current multi-engine OCR system can be viewed as an extension of our previous single-stream solution.
2.2.2. Multiple filter approach
One way of implementing a multi-evidence system is to develop a collection of filters, each based on a single evidence type, and then combine them, e.g., into a cascade. The main advantages of this approach are that the individual filters may be relatively simpler than those that incorporate many evidence types, and that it is more likely that a mathematical apparatus (such as an HMM) can be found that implements them efficiently. The main challenge is that, even if individual filters are known to be optimal or near-optimal, it may be difficult to prove optimality of their combination. Nevertheless, in this paper we focus on this approach and plan to develop alternative approaches later.
3. EVIDENCE COMBINER
In Sections 3.1 through 3.4 we describe the various evaluating and correcting filters that we experimented with, each based on a single evidence type. Section 3.5 describes how these filters were connected together to form a cascade-type multi-filter processing system.
3.1. Dictionary look-up
Perhaps one of the simplest correcting filters that can be applied to a single OCR stream is a spell-checker based
on dictionary lookup. Given a large body of correct text in a given language, a dictionary of all unique words
is extracted from this text and stored, together with a frequency count for every stored word. Throughout this
paper, we define the term “word” as a sequence of consecutive non-delimiter characters between two delimiters,
such as whitespace characters or punctuation symbols. To correct a stream of OCR text, all or some words from
the stream are compared against the dictionary. For every output word, a dictionary entry is found that is closest
to the word in terms of the Levenshtein distance. A “correct word” is a word with an exact match (distance of
0). Frequency counts are used to break ties, with more frequent dictionary words given more weight. Instead of
processing all output words, only some may be selected according to certain criteria. For instance, only words that contain unrecognized character markers produced by some OCR engines may be considered. This may not only save processing time but also yield higher accuracy than all-word processing when the OCR output contains many specialized words or when the dictionary is relatively small.
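As an illustration, the following Python sketch implements the kind of dictionary lookup correction described above: nearest-entry search under the Levenshtein distance, with frequency counts breaking distance ties. It is a simplified sketch; the sample word list, the frequency counts and the brute-force scan over the whole dictionary are our own (a production system would index the dictionary), and the function names are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def correct_word(word: str, dictionary: Counter) -> str:
    """Return the closest dictionary entry; distance ties are broken in
    favour of the more frequent word, and an exact match (distance 0) is
    returned unchanged."""
    if word in dictionary:
        return word
    return min(dictionary, key=lambda w: (levenshtein(word, w), -dictionary[w]))


# Frequency counts would be harvested from a large body of correct text.
dictionary = Counter({"plate": 120, "place": 340, "plane": 95})
print(correct_word("plzte", dictionary))   # -> "plate" (edit distance 1)
```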
3.2. Multiple OCR output stream processing using OCRtk tools
In order to make any decision regarding the quality of OCR output based on multiple OCR streams, it is necessary, first, to synchronize these streams. Character-level synchronization is done via the synctext utility of the ISRI OCR toolkit (open source software) by Nartker et al. [10] In Section 3.3 below, we describe a filter methodology based on this character-level synchronization. The OCRtk package includes a vote utility that implements a simple voting algorithm augmented by various heuristics to break ties. We use its results to establish a baseline of performance for the MEMOE system. Voting uses the synctext alignment as its first step to determine the smallest set of mismatching character blocks from each of the OCR streams. It works well only when there is a relatively small amount of disagreement between streams. Given large amounts of disagreement, the tool produces rather suboptimal results, e.g., with very large blocks from one stream corresponding to small or empty blocks from another stream.
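For reference, the sketch below shows the kind of block-level majority voting that such a baseline performs; the actual vote utility applies additional tie-breaking heuristics, and here the synctext-style alignment is assumed to be already available as parallel lists of character blocks, one per engine.

```python
from collections import Counter
from typing import List


def majority_vote(aligned_blocks: List[List[str]]) -> str:
    """aligned_blocks[k] holds the k-th aligned character block from each
    engine (an empty string marks a missing block).  The most common block
    wins; ties fall to whichever block Counter encounters first."""
    output = []
    for candidates in aligned_blocks:
        block, _count = Counter(candidates).most_common(1)[0]
        output.append(block)
    return "".join(output)


# Three engines disagreeing at two positions of the word "plate":
print(majority_vote([["p", "p", "p"], ["l", "l", "l"],
                     ["a", "a", "z"], ["t", "t", "t"], ["e", "o", "e"]]))
# -> "plate"
```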
Figure 2. Voting correction by disagreement pattern matching
3.3. Disagreement-pattern-based voting corrector
Voting usually fails when OCR engines exhibit significant disagreement. For instance, if three OCR streams each produce a different candidate character for the same position in the document, additional processing is needed to break the tie correctly. We make this decision by detecting patterns of disagreement between engines. This means that even if an OCR engine makes a mistake, provided it makes it consistently, this consistency may be exploited to our advantage when combined with information coming from another OCR engine. We illustrate our approach in Figure 2. Suppose we have three OCR engines. OCR1 is overall the most accurate, OCR2 is somewhat worse (e.g., it consistently confuses “e” with “o”), while OCR3 is the worst. Suppose the correct word is “plate” and we need to determine the correct third and fifth characters. Further suppose that we have some training corpus, i.e., OCR output with accompanying ground truth, and we have previously collected disagreement character block sets from this corpus. In the case illustrated in the figure, these would be {a,u,#;a} and {e,o,?;e} from OCR outputs corresponding to “table” and {e,c,#;c} from “came”. We can determine whether the disagreement character block sets that need to be corrected (in this case {a,z,?} and {e,o,&}) are similar to any of the training sets. One similarity measure is simply the number of matching blocks. For instance, there is a single match between {a,z,?} and {a,u,#}, two matches between {e,o,&} and {e,o,?}, and no matches between {e,o,&} and {a,u,#}. Once we find the best matching training block set for a given block set being corrected, we can correct the latter with the ground truth block corresponding to the former. For instance, the best match for {e,o,&} is {e,o,?}, so we replace {e,o,&} with “e”. Similarly, we replace {a,z,?} with “a”. Note that all character blocks in this example consist of a single character. In general, there may be some empty blocks as well as blocks consisting of multiple characters.
This approach is intended for the cases when there is enough agreement between different engines that when they
do disagree, the resulting character blocks are small. In other situations, additional logic is used. For instance,
when one of the blocks is empty or very small while a corresponding block from a competing engine is large, we
pick the latter.
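The Python sketch below illustrates the block-set matching just described on the “plate” example. The data structures and function names are our own, and the match-count similarity is the one discussed above; ties (e.g., {a,z,?} matches both {a,u,#} and {e,o,?} on one block) are broken by insertion order in this simplified sketch.

```python
from typing import Dict, Tuple


def block_matches(a: Tuple[str, ...], b: Tuple[str, ...]) -> int:
    """Similarity of two disagreement block sets: the number of engines
    whose blocks match exactly."""
    return sum(x == y for x, y in zip(a, b))


def correct_block(observed: Tuple[str, ...],
                  training: Dict[Tuple[str, ...], str]) -> str:
    """Replace an observed disagreement block set with the ground-truth
    block of its best-matching training block set.  `training` maps a
    block set (one block per engine) to the corresponding truth block."""
    best_set = max(training, key=lambda t: block_matches(observed, t))
    return training[best_set]


# Training block sets harvested from aligned OCR output plus ground truth,
# mirroring the "table" / "came" examples in the text:
training = {("a", "u", "#"): "a", ("e", "o", "?"): "e", ("e", "c", "#"): "c"}

print(correct_block(("e", "o", "&"), training))   # -> "e"
print(correct_block(("a", "z", "?"), training))   # -> "a"
```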
3.4. OCR stream triage
It is often the case that, when there is significant disagreement between OCR engines, combining multiple OCR streams yields OCR accuracy that is lower than that of the individual engines. Therefore, it is often more beneficial to try to select the single best-quality stream, and to retain more than one stream only when it is very likely that they all produce comparably good results.
We propose a triage filter that is applied to two OCR streams. It works in two steps. First, given two OCR outputs for a given document image, we compute the F-score value that is based on the Levenshtein distance. Only if the value is higher than a certain threshold can we conclude that both streams are good enough to be combined by another filter, e.g., by the voting corrector. Otherwise, we proceed to the second stage, where we select the better of the two streams. Given OCR engines I and II, let $\{p_j^{(e)}\}$, $j = 1, \ldots, K$ be a set of $K$ measurements collected over an entire OCR output document, where $e$ is either I or II depending on which of the engines produced the output document. Further assume that these measurements are such that, often, engine I is shown to be more accurate than engine II iff $p_j^{(I)} \ge p_j^{(II)}$, and the bigger the difference between the two measurement values, the bigger the difference between accuracies. Examples of such measurements include
• The total number of words
• The total size of the document in bytes
• The median and mean word lengths
• The total number of correct words
• The fraction of the correct words
• The median and mean lengths of correct words
Suppose, without loss of generality, that for a given pair of OCR documents, $p_j^{(I)} \ge p_j^{(II)}$ for $j = 1, \ldots, M$ and $p_j^{(I)} < p_j^{(II)}$ for $j = M+1, \ldots, K$. Define
$$\rho^{(I)} = \sum_{j=1}^{M} \log \frac{p_j^{(I)}}{p_j^{(II)}},$$
and similarly for $\rho^{(II)}$. If $\rho^{(I)} \ge \rho^{(II)}$, the filter picks OCR engine I as the best; otherwise it picks engine II. Note that a set of three or more OCR streams may be triaged by successive application of this filter to pairs of streams.
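A minimal sketch of this second triage stage is given below, assuming the per-document measurements have already been computed for both engines; the function name, the example measurement values and the zero-guard on the measurements are our own additions.

```python
import math
from typing import Sequence


def pick_better_stream(p_one: Sequence[float], p_two: Sequence[float]) -> int:
    """Second triage stage: given K per-document measurements for engines
    I and II (larger values suggesting higher accuracy), accumulate the
    log-ratios in favour of each engine and pick the one with the larger
    total.  Returns 1 or 2."""
    rho_one = sum(math.log(a / b) for a, b in zip(p_one, p_two) if a >= b and b > 0)
    rho_two = sum(math.log(b / a) for a, b in zip(p_one, p_two) if b > a and a > 0)
    return 1 if rho_one >= rho_two else 2


# e.g. measurements: [mean word length, fraction of correct words, mean correct-word length]
print(pick_better_stream([4.8, 0.71, 5.0], [4.1, 0.52, 4.6]))   # -> 1
```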
3.5. Filter Combination
The filters described above may be connected in various ways. Since filters based on multiple streams usually produce a single output stream, it seems logical to apply single-stream filters, like the dictionary lookup, first, followed by multiple-stream filters. Also, owing to the importance of triage before combining streams when disagreement is significant, we believe that an effective cascade configuration of the proposed filters would be the following (a minimal orchestration sketch is given after the list):
• Apply the dictionary lookup to each of the OCR streams separately.
• Apply triage to weed out poor-quality streams.
• If more than one engine produces output good enough to warrant combining, align these streams using the synctext tool and then apply the voting corrector to the synchronized outputs.
• Otherwise, pass the predicted best output as the final output of the cascaded system.
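The sketch below shows one possible orchestration of this cascade; the filter stand-ins (plain callables and trivial lambdas) are our own placeholders so that the example runs on its own, and the actual MEMOE components would take their place.

```python
from typing import Callable, List

# Stand-in types for the filters described above; in a real configuration
# these would be the dictionary lookup, triage and voting-corrector components.
DictionaryLookup = Callable[[str], str]
Triage = Callable[[List[str]], List[str]]
VotingCorrector = Callable[[List[str]], str]


def run_cascade(streams: List[str],
                lookup: DictionaryLookup,
                triage: Triage,
                corrector: VotingCorrector) -> str:
    """One possible ordering of the Section 3.5 cascade: per-stream
    dictionary lookup, then triage, then (if more than one stream
    survives) alignment plus voting correction."""
    streams = [lookup(s) for s in streams]      # single-stream filters first
    survivors = triage(streams)                 # weed out poor-quality streams
    if len(survivors) > 1:
        return corrector(survivors)             # align (synctext) + vote-correct
    return survivors[0]                         # predicted best stream


# Trivial stand-in components just to make the sketch executable:
output = run_cascade(
    ["plate", "plato", "plzte"],
    lookup=lambda s: s,
    triage=lambda ss: ss[:2],
    corrector=lambda ss: ss[0],
)
print(output)   # -> "plate"
```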
We emphasize that as we develop new filters and improve the existing ones, the configuration of the resulting
system will most likely change. In particular, individual filters may be switched on and off depending on the
availability of specific evidence types to the system.
4. EXPERIMENTS
4.1. OCR engines, test data and performance metrics
We experimented with two- and three-engine combinations. Both cases included engines denoted as OCR1 and
OCR2; the three-engine case included a third, denoted as OCR3. We used two corpora of Arabic document
images with corresponding ground truth. The first data set (CORPUS1) consists of 136 scans of various Arabic
newspapers and magazines. The second dataset (CORPUS2) includes 200 images provided by a commercial
vendor. All performance assessments are based on the standard measures of precision, recall and F-score that
are expressed in terms of the numbers of matches, insertions, deletions and substitutions and are computed either
on the character or word level by the same tool that computes the Levenshtein distance between two text streams.
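One common formulation consistent with this description (our own rendering of the standard definitions, not a formula quoted from the paper) uses the counts of matches $m$, insertions $i$, deletions $d$ and substitutions $s$ from the alignment, with insertions counted against the OCR output and deletions against the ground truth:
$$P = \frac{m}{m + i + s}, \qquad R = \frac{m}{m + d + s}, \qquad F = \frac{2PR}{P + R}.$$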
In Sections 4.2 and 4.3 we present results of unit-testing of the individual filters described in Section 3. More
comprehensive tests of the complete cascade system as described in Section 3.5 will be performed in the near
future.
4.2. Triage
We evaluated a slightly modified version of the triage filter described in Section 3.4, with the first step (evaluation of the pairwise F-score) omitted. We applied the OCR1 and OCR2 engines to both corpora and evaluated
how often the filter would correctly predict which of the two OCR streams was better. We also experimented
with different subsets of the measurement set listed in Section 3.4. The best results were achieved when only
3 out of the 8 measurements were used, namely the fraction of the correct words as well as the mean lengths
of all words and of correct words in the document. The results are summarized in Table 1. For instance, for
CORPUS1, the filter correctly predicted OCR1 and OCR2 as winners 76 and 49 times, respectively, and missed
only 11 times. In all the cases when it did miss, the corresponding accuracies of the two OCR engines were
fairly similar, which implies that these misses would not affect the resulting quality. All the measurements based
on statistics of correct words in an output stream require matching of output against a dictionary, which may
be time-consuming. Therefore we repeated our experiments with only dictionary-independent measurements,
namely, the total number of words, the document size and the median and mean word lengths, included. The
best results (not shown here) were achieved with all four such measurements. Performance was somewhat worse
than when dictionary-dependent measurements were included, especially on CORPUS1, yet still reasonably good.
Table 1. Performance of the triage filter using both dictionary-dependent and -independent measurements

                          Predicted Winner
                    CORPUS1             CORPUS2
True Winner       OCR1    OCR2       OCR1    OCR2
OCR1                76       8        191       3
OCR2                 3      49          0       6
4.3. Dictionary lookup and voting correction
We evaluated the voting corrector filter on the three-engine combination. Output produced by these engines on
the documents from CORPUS2, together with the corresponding ground truth, was used for training, and CORPUS1 was used for testing. Before applying the voting corrector, the OCR2 output documents were processed by
the dictionary lookup and all words that contained unrecognized character markers were replaced by dictionary
entries that differed from the output words only in the unrecognized character positions, provided such entries
were found. The OCR1 and OCR3 output documents were not processed by the lookup because the corresponding engines did not generate any unrecognized character markers. We compared the performance of the voting corrector with that of OCRtk voting using three criteria. First, we counted the number of documents (out of the total of 136) on which the filter’s word accuracy (F-score) exceeded that of the voting. Second, we measured the average difference in accuracy between output produced by the two methods over the entire corpus. Third,
as sizes of test documents varied significantly, we computed average accuracy difference weighted by the sizes of
the corresponding ground truth files. The results are shown in Table 2. As one can see, we achieved a significant
improvement over voting.
Table 2. Performance of the three-engine voting corrector prepended with dictionary lookup

Count (out of 136 docs)    Avg accuracy diff    Weighted avg accuracy diff
          101                   17.8%                    19.2%
5. SUMMARY AND FUTURE WORK
Initial experiments indicate that it is possible to design an effective and robust multi-engine OCR system based
on multiple types of evidence. They also provide insights on how individual filters may be modified, on alternative ways of using individual evidence types, as well as on ways to better combine individual filters. For instance,
character block sets described in Section 3.3 may be used as observation vectors in a probabilistic framework
similar to the one described in Ref. 3. Engine disagreement information obtained from the synctext utility may
be used to determine words in individual OCR output streams that need to be matched against a dictionary.
Investigation of these and other ideas, thorough testing of a complete cascade filter system, and collection of additional test data with ground truth are currently being performed by the authors.
ACKNOWLEDGMENTS
The authors would like to thank J.J. Tavernier for their help with software development and for their helpful
comments.
The research reported in this document/presentation was performed in connection with Contract No. DAAD19-
03-C-0059 with the U.S. Army Research Laboratory. The views and conclusions contained in this docu-
ment/presentation are those of the authors and should not be interpreted as presenting the official policies
or position, whether expressed or implied, of the U.S. Army Research Laboratory or the U.S. Government unless
so designated by other authorized documents. Citation of manufacturers or trade names does not constitute
an official endorsement or approval of the use thereof. The U.S. Government is authorized to reproduce and
distribute reprints for Government purposes notwithstanding any copyright notation hereon.
REFERENCES
1. T. Nomoto, “Predictive models of performance in multi-engine machine translation,” in Proceedings of the
MT Summit IX, pp. 269–276, (New Orleans, USA), September 2003.
2. Y. Zuo and C. Zong, “Multi-engine based Chinese-to-English translation system,” in Proceedings of the
INTERSPEECH/ICSLP-2004 satellite workshop: International Workshop on Spoken Language Translation,
(Japan), 2004.
3. E. Borovikov, I. Zavorin, and M. Turner, “A filter based post-OCR accuracy boost system,” in Proceedings
of the 1st ACM workshop on Hardcopy document processing, 2004.
4. A. Al-Ani and M. Deriche, “A new technique for combining multiple classifiers using the Dempster-Shafer
theory of evidence,” Journal of Artificial Intelligence Research 17, pp. 333–361, 2002.
5. S. Jaeger, “Informational classifier fusion,” in Proceedings of the International Conference on Pattern Recognition, pp. I: 216–219, 2004.
6. D.-S. Lee, A theory of classifier combination: the neural network approach. PhD thesis, State University of
New York at Buffalo, 1995.
7. X. Lin, “DRR research beyond COTS OCR software: A survey,” Tech. Rep. HPL-2004-167, Imaging Systems
Laboratory, HP Laboratories Palo Alto, 2004.
8. X. Lin, “Reliable OCR solution for digital content re-mastering,” in Proceedings of SPIE Conference on
Document Recognition and Retrieval IX, 2002.
9. S. T. Klein and M. Kopel, “A voting system for automatic OCR correction,” in Proceedings of the SIGIR 2002
Workshop on Information Retrieval and OCR: From Converting Content to Grasping Meaning, (University
of Tampere), August 2002.
10. T. A. Nartker, S. V. Rice, and S. E. Lumos, “Software tools and test data for research and testing of page-reading OCR systems,” in Proceedings of the SPIE Conference on Document Recognition and Retrieval,
5676, pp. 37–47, SPIE, (San Jose, CA), January 2005.