Extracting Person Names from Diverse and Noisy OCR Text
Thomas Packer, Joshua Lutes, Aaron Stewart, David Embley, Eric Ringger, Kevin Seppi
Department of Computer Science
Brigham Young University
Provo, Utah, USA
tpacker@byu.net
Abstract

Named entity recognition from scanned and OCRed historical documents can contribute to historical research. However, entity recognition from historical documents is more difficult than from natively digital data because of the presence of word errors and the absence of complete formatting information. We apply four extraction algorithms to various types of noisy OCR data found “in the wild” and focus on full name extraction. We evaluate the extraction quality with respect to hand-labeled test data and improve upon the extraction performance of the individual systems by means of ensemble extraction. We also evaluate the strategies with different applications in mind: the target applications (browsing versus retrieval) involve a trade-off between precision and recall. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
1 Introduction
Information extraction (IE) facilitates efficient knowledge acquisition for the benefit of many applications. Perhaps most importantly, IE from unstructured documents allows us to go beyond now-traditional keyword search and enables semantic search. Semantic search allows a user to search specifically for only those instances of an ambiguous name that belong to a semantic type such as person and to exclude instances of other entity types. By extracting information from noisy OCR data we aim to broaden the impact of IE technology to include printed documents that are otherwise inaccessible to digital tools. In particular, we are interested in books, newspapers, typed manuscripts, printed records, and other printed documents important for genealogy, family history, and other historical research.
The specific task we target in the present study is the extraction of person names from a variety of types and formats of historical OCR documents. This task is an example of named entity recognition (NER) as described in (Nadeau and Sekine, 2007) and (Ratinov and Roth, 2009). Accurately and efficiently identifying names in noisy OCR documents containing many OCR errors presents a challenge beyond standard NER and requires adapting existing techniques or tools. Our applications of interest are search and machine-assisted browsing of document collections. Search requires names to be pre-identified and indexed; machine-assisted browsing of document collections has greater tolerance for misidentified names.
There has been little published research on named entity extraction from noisy OCR data, but interest in this field is growing. Recent work by Grover et al. (2008) uses hand-written rules on two kinds of British parliamentary proceedings. Earlier work by Miller et al. (2000) uses an HMM extractor under matched conditions: for their OCR task, they printed digital documents and then scanned and OCRed the resulting copy to produce the OCR data for both training and test sets. To our knowledge, no published research targets the full extent of noisiness and diversity present in some real corpora or compares competing NER techniques on the same OCR corpus.
In starting such a project, we had several questions: What variation of word error rate (WER) can be expected over multiple OCR engines and types of documents? What level of NER quality is achievable in a couple of months of development time, particularly when no annotated data is available for the corpus for training or evaluation purposes? How well can we do on a truly noisy and diverse corpus of OCR data? How do competing extraction approaches compare over different document types? Can improvements in extraction quality be gained by combining the strengths of different extractors?
We provide answers to these questions in the following sections. In §2 we describe the data we used as well as the names extracted. In §3 we present each of the basic extraction methods and examine their performance. In §4 we present a straightforward ensemble method for combining the basic extraction methods and show an improvement in performance over each of the component extractors. Finally, we conclude and discuss future work (§5).
2 Data and Task
The data used as input to our named entity recognizers is the OCR output for 12 titles spanning a diverse range of printed historical documents with relevance to genealogy and family history research. These documents are described in Table 1. To the best of our knowledge, this collection has greater variety in formatting and genre than any other image-and-text corpus used in a published NER experiment. The data includes unstructured text (full sentences), structured (tabular) text including long lists of names and end-of-book indexes, and multi-column formatted text from the books and newspapers.
2.1 OCR
Three OCR engines were used in the production of the data used in this study. PrimeOCR, a commercial voting system utilizing six OCR engines, selects the best results from those engines (PrimeRecognition, 2009). Abby is a version of Abby FineReader used within an OCR engine produced by Kofax (Kofax, 2009). The newspapers were OCRed by an engine that was not identified by the corpus owner.

Examples of images and corresponding OCR output are given in Figures 1 and 2. Figure 1 is an example of one of the poorer-quality images and accompanying OCR output. Causes of poor quality are dark splotches in the noisy image and the fact that the OCR engine failed to recognize column boundaries during zoning. In Figure 2, letter-spacing is incorrectly interpreted by the OCR engine, resulting in the introduction of superfluous spaces within words. This figure also illustrates the common problem of words that are split and hyphenated at line boundaries, as well as other types of errors.
This particular collection of OCR documents was originally intended to be indexed for keyword search. Because the search application requires no more than a bag-of-words representation, much of the document structure and formatting, including punctuation and line boundaries in many cases, was discarded before the data was made available, which affects the quality of the documents with respect to NER. Furthermore, in parts of some of the documents, the original token ordering was not preserved consistently: in some cases this was caused by the OCR engine being unable to first separate columns into distinct sections of text, while in other cases it was more likely caused by the noisiness and poor thresholding (binarization) of the image. The quality of the original images on which OCR was performed varied greatly. Consequently, this corpus represents a very noisy and diverse setting for extracting information.
2.2 Annotation
Our task is the extraction of the full names of people, e.g., “Mrs Herschel Williams”, from OCRed documents. Since the corpus was not originally intended as a public benchmark for NER, the pages used for development test and blind test data were hand-annotated for the current project. One to two pages from each document were annotated for each of the development test and blind test sets. The annotations consisted of marking person names, including titles. The number of names annotated in the blind test set is given in Table 1 for each document in the corpus. Blind test pages were not inspected during the development of the extraction systems.
Title and Years                                           | Genre            | Engine | N   | WER | Fc | Ff
Birmingham, Alabama; 1888-1890                            | City Directory   | Abby   | 23  | 53  | 35 | 61
Portland, Oregon; 1878-1881                               | City Directory   | Abby   | 69  | 21  | 44 | 55
Year Book of the First Church of Christ in Hartford; 1904 | Church Year Book | Prime  | 13  | 28  | 38 | 37
The New York Church Year Book; 1859-60                    | Church Year Book | Prime  | 26  | 46  | 47 | 62
The Blake Family in England; 1891                         | Family History   | Prime  | 0   | NA  | NA | NA
The Libby Family in America; 1602-1881                    | Family History   | Prime  | 24  | 85  | 32 | 42
History and Genealogy of the Families of Old Fairfield    | Local History    | Abby   | 52  | 28  | 64 | 89
History of Inverness County, Nova Scotia                  | Local History    | Abby   | 9   | 75  | 15 | 28
United States Ship Ajax; 1980                             | Navy Cruise Book | Abby   | 0   | NA  | NA | NA
United States Ship Albany; 1962-1964                      | Navy Cruise Book | Abby   | 9   | 114 | 0  | 34
Montclair Tribune; 1967-1968                              | Newspaper        | Unk.   | 174 | 15  | 64 | 68
The Story City Herald; 1955                               | Newspaper        | Unk.   | 91  | 92  | 44 | 55
Over all                                                  |                  |        | 490 | 56  | 38 | 53

Table 1: A summary of the documents used. In the first column, the nickname used throughout this paper is shown in bold. Engine = the OCR engine used for each title. The remaining columns refer to the blind test set. N = number of person names annotated; WER = word error rate (percent) for OCR output, averaged over pages if more than one page; Fc = coarse-grained F-measure for the coarse-grained Majority Ensemble; Ff = fine-grained F-measure for the coarse-grained Majority Ensemble. Ensembles are defined in §4. Though there were no names annotated for the blind test set in Blake and Ajax, they are included above for completeness (they contributed to the training and development test sets).
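Table 1's WER column follows the standard word error rate definition. As a point of reference, here is a minimal sketch of that computation; the whitespace tokenization and the function name are our assumptions, since the paper does not specify how pages were tokenized for scoring:

    def word_error_rate(reference: str, hypothesis: str) -> float:
        # Word-level Levenshtein distance divided by reference length.
        ref, hyp = reference.split(), hypothesis.split()
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, start=1):
            curr = [i]
            for j, h in enumerate(hyp, start=1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (r != h)))  # substitution
            prev = curr
        return prev[-1] / max(len(ref), 1)

    # E.g., word_error_rate("Charles and John Bahn", "Charlea and Jobn Bahn")
    # is 2/4 = 0.5, i.e. a 50% WER for that fragment.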
Figure 1: Example of poor quality data found in Inverness. Names to be extracted are “Charles” and “John Bahn”.
Figure 2: Pairs of image and corresponding OCR text from one page of Montclair.
All extractors, including ensembles, were applied to the same pages. When variations in individual systems were considered, the options which performed best on development test data were selected and executed on the blind test data, with scores for blind test data reported.
The OCR text in our corpus was sufficiently noisy to necessitate labeling guidelines that accommodate the errors. On the one hand, we considered labeling only named entities that appeared correctly in the OCR text; on the other hand, we considered labeling all named entities occurring in the original images. In the end, we settled on a middle ground to accommodate some character recognition errors: any token having a character error rate above 50% was excluded from annotation. In this way, we attempted to balance two negative impacts: removing too many tokens that could legitimately be identified by some named entity recognizers based solely on context, and harming the real application of the extraction, which in our case was a name search engine index. Such an index would likely grow unnecessarily large if it were filled with garbled names for which users are unlikely to search or which are sufficiently dissimilar to real names.
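The 50% threshold can be made concrete with a small helper. This is a sketch under the assumption that each OCR token has already been aligned with its ground-truth counterpart (the paper does not describe its alignment procedure); the character-level edit distance mirrors the word-level WER computation shown earlier:

    def char_error_rate(truth: str, ocr: str) -> float:
        # Character-level Levenshtein distance over the true token's length.
        prev = list(range(len(ocr) + 1))
        for i, t in enumerate(truth, start=1):
            curr = [i]
            for j, o in enumerate(ocr, start=1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (t != o)))
            prev = curr
        return prev[-1] / max(len(truth), 1)

    def annotatable(truth_token: str, ocr_token: str) -> bool:
        # Tokens with a character error rate above 50% were excluded.
        return char_error_rate(truth_token, ocr_token) <= 0.5

    print(annotatable("Williams", "W1lliams"))   # True: CER = 1/8
    print(annotatable("Williams", "V/;U:anL5"))  # False: badly garbled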
2.3 Metrics
Precision, recall, and F-measure scores were calculated for person names in both a coarse and a fine manner. The coarse-grained metrics score extractor output in an all-or-none manner: they count an extracted full name as correct only if it matches a full name in the hand-labeled test set (including token positions/IDs). Using the above example (“Mrs Herschel Williams”), if an extractor misses the title “Mrs” and labels only “Herschel Williams” as a full name, then this is counted as one false positive (since “Herschel Williams” is not found among the manual annotations) and one false negative (since “Mrs Herschel Williams” is not found among the extractor's output). Thus this one mistake counts against both precision and recall.
The fine-grained metrics are more forgiving and would be more appropriate for a document browsing application, as opposed to searching for a complete name. They give partial credit if any part of a name is recognized because they look for matches between the individual tokens in the hand-annotated data and the extracted data. Continuing the example above, the extractor that recognizes only “Herschel Williams” as a full name will have two true positives, one false negative, and no false positives.

These two metrics partially acknowledge the same issues addressed by the MUC evaluation metrics described in (Nadeau and Sekine, 2007), in which evaluation is decomposed into two complementary dimensions: TEXT and TYPE.
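A minimal sketch of the two scoring schemes, assuming each name is represented as a tuple of (token position, token) pairs; that representation is our assumption, not the paper's:

    def prf(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    def coarse_scores(gold, predicted):
        # All-or-none: a predicted full name counts only if it matches a
        # gold name exactly, including token positions.
        tp = len(gold & predicted)
        return prf(tp, len(predicted - gold), len(gold - predicted))

    def fine_scores(gold, predicted):
        # Partial credit: compare individual (position, token) pairs.
        g = {pair for name in gold for pair in name}
        p = {pair for name in predicted for pair in name}
        return prf(len(g & p), len(p - g), len(g - p))

    gold = {((10, "Mrs"), (11, "Herschel"), (12, "Williams"))}
    pred = {((11, "Herschel"), (12, "Williams"))}
    print(coarse_scores(gold, pred))  # (0.0, 0.0, 0.0): one FP, one FN
    print(fine_scores(gold, pred))    # 2 TP, 0 FP, 1 FN, as in the text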
3 Basic Extraction Methods and Results
We built four person name recognizers while exploring possible adaptations of existing named entity recognition methods to the genre and especially the noisiness of our corpus. The work required a couple of months, with each extractor being built by a different researcher. The extractors are designated as dictionary-based, regular-expression-based, MEMM (maximum-entropy Markov model), and CRF (conditional random field). In these four extractors, we are comparing solutions from two competing disciplines for NER: the hand-written, rule-based approach and the supervised machine learning approach.

We applied them individually and collectively (within the ensemble extractors) on the blind test data and report a summary of their results in Figures 3 and 5 (coarse metrics) and Figures 4 and 6 (fine metrics). Only the results for the coarse-grained ensembles are reported in these four figures.
3.1 Dictionary-Based Extractor
The dictionary extractor is a simple extractor, intended as a baseline and requiring about 20 to 30 hours to develop. It identifies any token as part of a name if the token is found in a case-sensitive name dictionary. It then combines each contiguous sequence of name tokens into a full name if the sequence meets a few constraints. The following constraints were developed manually while inspecting the names in a few of the pages in the “training” (unlabeled, non-test) data: a name must either contain one or more tokens that are not initials or must contain exactly two initials (for partial credit when only two initials can be identified), and a name must consist of five or fewer tokens.
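A sketch of this grouping step; the predicate name and token interface are illustrative, not from the paper:

    def group_names(tokens, in_name_dict):
        # in_name_dict(tok) is assumed to perform the case-sensitive lookup
        # against the given-name, surname, initial, and title dictionaries.
        names, run = [], []
        for tok in tokens + [None]:  # sentinel flushes the final run
            if tok is not None and in_name_dict(tok):
                run.append(tok)
                continue
            if run:
                initials = sum(1 for t in run if len(t) == 1 and t.isupper())
                non_initials = len(run) - initials
                # Constraints from the text: five or fewer tokens, and either
                # at least one non-initial token or exactly two initials.
                if len(run) <= 5 and (non_initials >= 1 or initials == 2):
                    names.append(" ".join(run))
                run = []
        return names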
Name dictionaries include the following, collected from online sources: a given name dictionary (18,000 instances), a surname dictionary (150,000 instances), a list of common initial letters (capital letters A through W), and a list of titles (10 hand-written instances including “Mr” and “Jr”).

Figure 3: Coarse-grained precision, recall and F-measure for person names in the blind test set.
Figure 4: Fine-grained precision, recall and F-measure for person names in the blind test set.
Figure 5: Coarse-grained F-measure for person names for each title in the blind test set.
Figure 6: Fine-grained F-measure for person names for each title in the blind test set.

The surname dictionary was pruned by sorting the original list by an approximation of P(label = Surname | word), computed automatically from statistics collected from a corpus of web pages, and then removing the low-scoring words from the dictionary below a cut-off that was determined by maximizing extraction accuracy over the development test (validation) set.
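A sketch of this pruning step; the two count tables standing in for the web-page statistics are assumptions about data we do not have:

    def prune_surnames(surnames, surname_count, total_count, cutoff):
        # surname_count[w] / total_count[w] approximates
        # P(label = Surname | word) from the web-corpus statistics.
        # The cutoff itself was chosen by maximizing extraction accuracy
        # on the development test (validation) set.
        def p_surname(w):
            return surname_count.get(w, 0) / max(total_count.get(w, 0), 1)
        return {w for w in surnames if p_surname(w) >= cutoff}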
3.2 Regular Expression Rule-Based Extractor
The regular expression rule-based (Regex) extractor was based on the Ontology-based Extraction System (OntoES) of Embley et al. (1999). OntoES was designed to extract a variety of information from terse, data-rich, structured and semi-structured text found in certain types of web pages, such as car sale ads. In the current work, we adapt OntoES to work with noisy, unstructured text and therefore do not make use of many of its features associated with conceptual modeling and web page structure. Like the dictionary-based extractor, the Regex extractor uses dictionaries to recognize tokens that should be considered components of a person name. Matching of entries in these dictionaries is stage-wise case-sensitive. By this we mean that the extractor first finds matching tokens in a case-sensitive manner; then, for each page in which a dictionary entry is found, the extractor looks for case-insensitive matches of that word.
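A sketch of this two-stage, per-page matching; function and variable names are illustrative:

    def stagewise_matches(page_tokens, dictionary):
        # Stage 1: case-sensitive hits against the dictionary.
        exact = {t for t in page_tokens if t in dictionary}
        # Stage 2: for each entry found on this page, also accept
        # case-insensitive matches of that same entry.
        found_lower = {e.lower() for e in exact}
        return {t for t in page_tokens if t in exact or t.lower() in found_lower}

    # E.g., if "Smith" appears on a page, "SMITH" and "smith" on that same
    # page also match, but "JONES" does not unless "Jones" appears exactly.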
The Regex extractor then labels a token sequence as a full name wherever one of the following regular expression patterns matches. Note that the patterns are described in Perl5 regular expression syntax, where {Title}, {First}, and {Last} denote matches against the title, given name, and surname dictionaries, respectively.

optional title, given name, optional initial, surname:
    \b({Title}\s+){0,1}({First})\s+([A-Z]\s+){0,1}{Last}\b

title, surname:
    \b{Title}\s+{Last}\b

title, capitalized words:
    \b{Title}([A-Z][A-Za-z]*){1,3}\b

surname, title, given names or initials:
    \b({Last})(\s+{Title})?(\s+({First}|[A-Z])){1,2}\b

initials, surname:
    \b([A-Z]\s+){1,2}{Last}\b
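The placeholders can be compiled by substituting dictionary alternations into the patterns. A sketch of the first pattern using Python's re module, with tiny illustrative dictionaries in place of the census-derived lists (and ? in place of {0,1}):

    import re

    TITLES = ["Mr", "Mrs", "Dr", "Rev"]   # real list: 773 Wikipedia titles
    FIRST = ["John", "Charles", "Mary"]   # real list: 5,000 census names
    LAST = ["Bahn", "Williams", "Libby"]  # real list: 89,000 census names

    def alt(words):
        return "(?:" + "|".join(map(re.escape, words)) + ")"

    # optional title, given name, optional initial, surname
    name_pattern = re.compile(
        rf"\b(?:{alt(TITLES)}\s+)?{alt(FIRST)}\s+(?:[A-Z]\s+)?{alt(LAST)}\b")

    print(name_pattern.findall("met Mr John F Bahn and Mary Williams there"))
    # ['Mr John F Bahn', 'Mary Williams']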
The name dictionaries used in the Regex extractor include the following: a given name dictionary (5,000 names taken from the 1990 study of the US Census), a surname dictionary (89,000 names taken from the same US Census study), and a list of titles (773 titles manually taken from Wikipedia, http://en.wikipedia.org/wiki/Title). Counts exclude stop-words, which had been removed (570 words). Other words used to eliminate false positives were also taken from short lists of mutually exclusive categories: US states (149), street signs (11), and school suffixes (6).

Among the four base extractors, Figures 3 and 4 show that the Regex extractor generally produces the highest quality extractions overall. Much of the improvement exhibited by the Regex extractor over the simpler dictionary extractor comes from the regular expression pattern matching, which constrains possible matches to only the above patterns. The Regex extractor does less well on family and local histories (e.g., Libby and Fairfield) where the given regular expressions do not consistently apply: there are many names that consist of only a single given name. This could be corrected with contextual clues.
3.3 Maximum Entropy Markov Model
The MEMM extractor is a maximum entropy Markov model similar to that used in (Chieu and Ng, 2003) and trained on CoNLL NER training data (Sang and Meulder, 2003) in the newswire genre. Because of the training data, this MEMM was trained to recognize persons, places, dates, and organizations in unstructured text, but we evaluated it only on the person names in the OCR corpus.

The feature templates used in the MEMM follow; a sketch of how such templates can be instantiated appears after the list. For dictionary features, there was one feature template per dictionary, with the dictionaries including all the dictionaries used by the previous two extractors.

- current word
- previous tag
- previous previous tag
- bigram of previous two tags
- next word
- current word's suffix and prefix, lengths 1 through 10 characters
- all upper case word (case-folded word)
- current word starts with an upper case character
- current word starts with an upper case character and is not the first word of a sentence
- next word starts with an upper case character
- previous word starts with an upper case character
- contains a number
- contains a hyphen
- the word is in a dictionary
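The sketch below shows one way these templates could be instantiated as string-valued features for a position in a token sequence; the feature names and encoding are illustrative, since the paper does not publish the MEMM's actual feature representation:

    def features(words, i, prev_tag, prev_prev_tag, dictionaries):
        w = words[i]
        feats = {f"w={w}",
                 f"t-1={prev_tag}", f"t-2={prev_prev_tag}",
                 f"t-2,t-1={prev_prev_tag}+{prev_tag}",
                 "w+1=" + (words[i + 1] if i + 1 < len(words) else "<END>")}
        for k in range(1, 11):  # suffixes and prefixes, lengths 1 through 10
            feats.add(f"suf{k}={w[-k:]}")
            feats.add(f"pre{k}={w[:k]}")
        if w.isupper():
            feats.add(f"folded={w.lower()}")  # all-upper-case template
        if w[:1].isupper():
            feats.add("init_cap")
            if i > 0:
                feats.add("init_cap_not_sentence_start")
        if i + 1 < len(words) and words[i + 1][:1].isupper():
            feats.add("next_init_cap")
        if i > 0 and words[i - 1][:1].isupper():
            feats.add("prev_init_cap")
        if any(c.isdigit() for c in w):
            feats.add("has_number")
        if "-" in w:
            feats.add("has_hyphen")
        for name, d in dictionaries.items():  # one template per dictionary
            if w in d:
                feats.add(f"in_dict={name}")
        return feats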
The validation / development test set was used to select the most promising variation of the MEMM. Variations considered but rejected included the use of a character noise model in conjunction with an allowance for small edit distances (from zero to three) when matching dictionary entries, similar in spirit to, though less well developed than, the approach of (Wang et al., 2009). Variations also included additional feature templates based on centered 5-grams.

By way of comparison, this same MEMM was trained and tested on CoNLL data, where it achieved an 83.1% F-measure using the same feature templates enumerated above and applied to the OCR data. This is not a state-of-the-art CoNLL NER system, but it allows for more flexible experimentation.
Figures 5 and 6 show the greatest quality difference with respect to the other extractors in the two city directories (Birmingham and Portland). These directories essentially consist of lists of the names of people living in the respective cities, followed by terse information about them such as addresses and business names, one or two lines per person. Furthermore, each entry begins with the name of the person, starting with the surname, which is less common in the data on which the MEMM was trained. The contrast between the newswire genre and most of the test data explains the MEMM's relatively poor performance overall. Previous studies of domain mismatch in supervised learning, especially in NER (Vilain et al., 2007), document similar dramatic shortfalls in performance.
3.4 Conditional Random Field
The CRF extractor uses the conditional random field implementation in the Mallet toolkit (McCallum, 2002). It was trained and executed in the same way as the MEMM extractor described above, including the use of identical feature templates. Training and testing on the CoNLL data, as we did with the MEMM extractor, yielded an 87.0% F-measure.
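Mallet is a Java toolkit; as a rough illustration of the training setup, the sketch below uses the Python package sklearn-crfsuite as a stand-in (an assumption of ours, not the authors' tool), with a toy featurizer and a single CoNLL-style training sentence:

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    def token_feats(words, i):
        w = words[i]
        return {"w": w, "lower": w.lower(), "init_cap": w[:1].isupper(),
                "suffix3": w[-3:], "prefix3": w[:3]}

    def featurize(sentences):
        return [[token_feats(s, i) for i in range(len(s))] for s in sentences]

    # BIO-tagged person names, as in the CoNLL training data.
    X = featurize([["Mrs", "Herschel", "Williams", "arrived", "."]])
    y = [["B-PER", "I-PER", "I-PER", "O", "O"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))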
The CRF extractor is the only one of the four base extractors not included in the ensemble. Adding the CRF resulted in slightly lower scores on the development test set. We also ran the ensemble with the CRF but without the MEMM, resulting in a 2% lower score on the development test set, ruling that configuration out. Separate experiments on CoNLL test data with artificial noise introduced showed similarly worse behavior by the CRF relative to the MEMM.
4 Ensemble Extraction Methods and Results
We combined the decisions of the first three base extractors described above using a simple voting-based ensemble. The ensemble interprets a full name in each base extractor's output as one vote in favor of that entity as a person name. The general ensemble extractor is parameterized by a threshold, t, indicating how many of the base extractors must agree on a person name before it can be included in the ensemble's output. By varying this parameter, we produced the three following ensemble extractors (sketched in code after the list):

Union (t = 1): any full name identified by any of the base extractors is output.

Majority (t = 2): if a majority of the base extractors (two or more) recognizes the same text as a name, then that name is recognized.

Intersection (t = 3): the three base extractors must be unanimous in choosing a full name to be extracted before that name will be output.
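A sketch of the general thresholded ensemble; full names are represented as in the metrics sketch of §2.3 (hashable tuples), which is our assumption:

    from collections import Counter

    def vote_ensemble(extractor_outputs, t):
        # extractor_outputs: one set of predicted full names per base
        # extractor; a name is kept if at least t extractors proposed it.
        votes = Counter(name for out in extractor_outputs for name in out)
        return {name for name, n in votes.items() if n >= t}

    # union        = vote_ensemble(outputs, t=1)
    # majority     = vote_ensemble(outputs, t=2)
    # intersection = vote_ensemble(outputs, t=3)

The token-level (fine-grained) ensemble described in the next paragraph works the same way, with individual (position, token) pairs as the voting unit instead of full names.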
Figure 3 shows that the Majority Ensemble outperforms each base extractor in terms of F-measure. A second set of ensembles was developed; they are identical to the first three except that they allow each base extractor to vote on individual tokens. This fine-grained ensemble did not produce accuracies as high as the coarse-grained approach when scored with the coarse-grained metrics, but under the fine-grained metrics it did better, achieving 68% F-measure over the entire corpus while the coarse-grained ensemble achieved only 60.7% F-measure. The highest-scoring base extractor (Regex) achieved 66.5% using the fine-grained metric. So, again, an ensemble did better than each base extractor regardless of the metric (coarse or fine), as long as the matching version of the ensemble was applied.
Figure 7: Coarse F-measure of the coarse majority voting ensemble for person names as a function of word error rate for pages in the blind test set.
5 Conclusions and Future Work
In conclusion, we answer the questions posed in the introduction. WER varies widely in this dataset: the average is much higher than the 20% reported in other papers (Miller et al., 2000). In a plot of WER versus NER performance, shown in Figure 7, the linear fit is substantially poorer than for the data reported in the work of Miller et al.

Ranges of 0–64% or 28–89% F-measure for NER can be expected on noisy OCR data, depending on the document and the metric. Figure 7 shows some, but not perfect, correlation between NER quality and WER. Among those errors that directly cause greater WER, different kinds of errors affect NER quality to different degrees.
The Libby text's WER was largely due to poor character-level recognition (its word order was actually good), while Inverness had more errors in word order, where text from two columns was incorrectly interleaved by the OCR engine (its character-level recognition was good). From error analysis on such examples, it seems likely that word-order errors play a bigger role in extraction errors than do character recognition errors.
We also conclude that combining basic methods can produce higher quality NER. Each of the three ensembles maximizes a different metric. The Majority Ensemble achieves the highest F-measure over the entire corpus, compared to any of the base extractors and to the other ensembles. The Intersection Ensemble achieves the highest precision, and the Union Ensemble achieves the highest recall. Each of these results is useful for a different application. If the intended application is a person name search engine, users do not want to sift manually through many false positives; with a sufficiently large corpus containing millions of book and newspaper titles, a precision of 89.6% would be more desirable than a precision of 61.6%, even when only 14.1% of the names available in the corpus can be recognized (low recall). Alternatively, if higher recall is necessary for an application in which no instances should be missed, then the high-recall Union Ensemble could be used as a filter of the candidates to be shown. Browsing and exploration of a data set, where every case must be examined, may be such an application. High-recall name browsing could facilitate manual labeling or checking.
This work is a starting point against which to compare techniques which we hope will be more effective in automatically adapting to new document formats and genres in the noisy OCR setting. One way to adapt the supervised machine learning approaches is to apply a more realistic noise model of OCR errors to the CoNLL data. Another is to use semi-supervised machine learning techniques to take advantage of the large volume of unlabeled and previously unused data available in each of the titles in this corpus. We plan to contrast this with the more laborious method of producing labeled training data from within the present corpus. Additional feature engineering and additional labeled pages for evaluation are also in order. The rule-based Regex extractor could also be adapted automatically to differing document or page formats by filtering a larger set of regular expressions in the first of two passes over each document. Finally, we plan to combine NER with work on OCR error correction (Lund and Ringger, 2009) to see if the combination can improve accuracies jointly in both OCR and information extraction.
6 Acknowledgements
We would like to acknowledge Ancestry.com and Lee Jensen of Ancestry.com for providing the OCR data from their free-text collection and for financial support. We would also like to thank Lee Jensen for discussions regarding applications of this work and the related constraints.
References
Hai Leong Chieu and Hwee Tou Ng. 2003. Named entity recognition with a maximum entropy approach. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 160–163.

D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, D. W. Lonsdale, Y.-K. Ng, and R. D. Smith. 1999. Conceptual-model-based data extraction from multiple-record web pages. Data & Knowledge Engineering, 31(3):227–251, November.

Claire Grover, Sharon Givon, Richard Tobin, and Julian Ball. 2008. Named entity recognition for digitised historical texts. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).

Kofax. 2009. Kofax homepage. http://www.kofax.com/.

W. B. Lund and E. K. Ringger. 2009. Improving optical character recognition through efficient multiple system alignment. In Proceedings of the 2009 Joint International Conference on Digital Libraries, pages 231–240.

Andrew Kachites McCallum. 2002. MALLET: a machine learning for language toolkit. http://mallet.cs.umass.edu/.

David Miller, Sean Boisen, Richard Schwartz, Rebecca Stone, and Ralph Weischedel. 2000. Named entity extraction from noisy input: speech and OCR. In Proceedings of ANLP-NAACL 2000, pages 316–324.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

PrimeRecognition. 2009. PrimeOCR web page. http://www.primerecognition.com/augprime/prime_ocr.htm.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155, Boulder, Colorado. Association for Computational Linguistics.

E. F. Tjong Kim Sang and F. De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), volume 922, page 1341.

Marc Vilain, Jennifer Su, and Suzi Lubar. 2007. Entity extraction is a boring solved problem: or is it? In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 181–184, Rochester, New York. Association for Computational Linguistics.

Wei Wang, Chuan Xiao, Xuemin Lin, and Chengqi Zhang. 2009. Efficient approximate entity extraction with edit distance constraints. In Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 759–770, Providence, Rhode Island, USA. ACM.