Figure 3 - uploaded by Eric K. Ringger
Coarse-grained precision, recall and F-measure for person names in the blind test set. 

Source publication
Article
Full-text available
Named entity recognition from scanned and OCRed historical documents can contribute to historical research. However, entity recognition from historical documents is more difficult than from natively digital data because of the presence of word errors and the absence of complete formatting information. We apply four extraction algorithms to vario...

Context in source publication

Context 1
... by maximizing extraction accuracy over the development test (validation) set. Like the dictionary-based extractor, the Regex extractor also uses dictionaries to recognize tokens that should be considered components of a person name. The positives were also taken from short lists of mutually exclusive categories: US States (149), street signs (11) and school suffixes (6). Matching of entries in these dictionaries is stage-wise case-sensitive. By this we mean that the extractor first finds matching tokens in a case-sensitive manner. Then for each page in which a dictionary entry is found, the extractor looks for case-insensitive matches of that word. The Regex extractor then labels any token pattern as a full name wherever one of the following regular expression patterns is found. Note that the patterns are described in Perl5 regular expression syntax.

Among the four base extractors, figures 3 and 4 show that the Regex extractor generally produces the highest quality extractions overall. Much of the improvement exhibited by the Regex extractor over the simpler dictionary extractor comes from the regular expression pattern matching, which constrains possible matches to only the above patterns. The Regex extractor does less well on family and local histories (e.g. Libby and Fairfield) where the given regular expressions do not consistently apply: there are many names that consist of only a single given name. This could be corrected with contextual clues.

The MEMM extractor is a maximum entropy Markov model similar to that used in (Chieu and Ng, 2003) and trained on CoNLL NER training data (Sang and Meulder, 2003) in the newswire genre. Because of the training data, this MEMM was trained to recognize persons, places, dates and organizations in unstructured text, but we evaluated it only on the person names in the OCR corpus.
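The stage-wise case-sensitive matching described above can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary, the page tokens, and the single regular expression are all hypothetical (the paper's actual Perl5 patterns are not reproduced in this excerpt).

```python
import re

def stagewise_matches(page_tokens, dictionary):
    """Stage-wise case-sensitive matching: first find exact,
    case-sensitive dictionary hits; then, on a page containing a
    hit, also accept case-insensitive variants of those entries."""
    exact = {tok for tok in page_tokens if tok in dictionary}
    if not exact:
        return set()
    found = {entry.lower() for entry in exact}
    return {tok for tok in page_tokens if tok.lower() in found}

# Hypothetical full-name pattern in the spirit of the paper's
# Perl5 regexes (e.g. "Given M. Surname").
full_name = re.compile(r"\b[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+\b")

print(stagewise_matches(["John", "JOHN", "Mary", "smith"], {"John", "Mary"}))
print(full_name.findall("Mr. George Q. Cannon arrived."))
```

Because "John" matches case-sensitively somewhere on the page, the all-caps OCR variant "JOHN" is also accepted in the second stage, while "smith" (never matched exactly) is not.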
The feature templates used in the MEMM follow. For dictionary features, there was one feature template per dictionary, with dictionaries including all the dictionaries used by the previous two extractors. The validation / development test set was used to select the most promising variation of the MEMM. Variations considered but rejected included the use of a character noise model in conjunction with an allowance for small edit distances (from zero to three) when matching dictionary entries, similar in spirit to, though less well developed than, (Wang et al., 2009). Variations also included additional feature templates based on centered 5-grams. By way of comparison, this same MEMM was trained and tested on CoNLL data, where it achieved 83.1% F-measure using the same feature templates applied to the OCR data, as enumerated. This is not a state-of-the-art CoNLL NER system, but it allows for more flexible experimentation.

Figures 5 and 6 show the greatest quality difference with respect to the other extractors in the two city directories (Birmingham and Portland). These directories essentially consist of lists of the names of people living in the respective cities, followed by terse information about them such as addresses and business names, one or two lines per person. Furthermore, the beginning of each entry is the name of the person, starting with the surname, which is less common in the data on which the MEMM was trained. The contrast between the newswire genre and most of the test data explains its relatively poor performance overall. Previous studies on domain mismatch in supervised learning, and especially in NER (Vilain et al., 2007), document similar dramatic shortfalls in performance.

The CRF extractor uses the conditional random field implementation in the Mallet toolkit (McCallum, 2002). It was trained and executed in the same way as the MEMM extractor described above, including the use of identical feature templates.
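A sketch of what "one feature template per dictionary" and a centered 5-gram template might look like; the feature names, dictionaries, and tokens below are illustrative assumptions, not the paper's actual feature set.

```python
def dictionary_features(token, dictionaries):
    """One binary feature per dictionary, firing when the token
    appears (case-insensitively) in that dictionary."""
    return {f"in_dict={name}": token.lower() in {e.lower() for e in entries}
            for name, entries in dictionaries.items()}

def centered_5gram_features(tokens, i):
    """A centered 5-gram template: the token at position i plus
    two tokens of context on each side, padded at sentence edges."""
    padded = ["<s>", "<s>"] + tokens + ["</s>", "</s>"]
    window = padded[i:i + 5]
    return {f"w[{offset - 2}]={w}": True for offset, w in enumerate(window)}

# Hypothetical dictionaries and input.
dicts = {"given_names": {"John", "Mary"}, "surnames": {"Smith"}}
print(dictionary_features("JOHN", dicts))
print(centered_5gram_features(["John", "Smith", "lived", "here"], 0))
```

Feature functions of this shape plug directly into a maximum-entropy sequence model, which is one reason the same templates could be reused unchanged for the CRF extractor.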
Training and testing on the CoNLL data, as we did with the MEMM extractor, yielded an 87.0% F-measure. The CRF extractor is the only one of the four base extractors not included in the ensemble. Adding the CRF resulted in slightly lower scores on the development test set. We also ran the ensemble with the CRF but without the MEMM, resulting in a 2% lower score on the development test set, ruling it out. Separate experiments on CoNLL test data with artificial noise introduced showed similarly worse behavior by the CRF, relative to the MEMM.

We combined the decisions of the first three base extractors described above using a simple voting-based ensemble. The ensemble interprets a full name in each base extractor's output as one vote in favor of that entity as a person name. The general ensemble extractor is parameterized by a threshold, t, indicating how many of the base extractors must agree on a person name before it can be included in the ensemble's output. By varying this parameter, we produced the three following ensemble extractors. Figure 3 shows that the Majority Ensemble outperforms each base extractor in terms of F-measure.

A second set of ensembles was developed. They are identical to the first three except that they allowed each base extractor to vote on individual tokens. This fine-grained ensemble did not produce accuracies as high as the coarse-grained approach when using the coarse-grained metrics, but when we use the fine-grained metrics it did better, achieving 68% F-measure over the entire corpus while the coarse-grained ensemble achieved only 60.7% F-measure. The highest-scoring base extractor (Regex) achieved 66.5% using the fine-grained metric. So, again, an ensemble did better than each base extractor regardless of the metric (coarse or fine), as long as the matching version of the ensemble was applied.
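The threshold-voting scheme can be sketched as follows. The extractor outputs are invented, but with three base extractors, thresholds t = 1, 2, 3 presumably correspond to the Union, Majority, and Intersection Ensembles respectively.

```python
from collections import Counter

def vote_ensemble(extractor_outputs, t):
    """Keep a candidate full name iff at least t of the base
    extractors proposed it."""
    votes = Counter(name for output in extractor_outputs
                    for name in set(output))
    return {name for name, count in votes.items() if count >= t}

# Invented outputs from the three base extractors.
outputs = [
    {"John Smith", "Mary Jones"},                 # Dictionary
    {"John Smith", "Eliza Libby"},                # Regex
    {"John Smith", "Mary Jones", "Main Street"},  # MEMM
]
union = vote_ensemble(outputs, t=1)         # highest recall
majority = vote_ensemble(outputs, t=2)      # highest F-measure
intersection = vote_ensemble(outputs, t=3)  # highest precision
print(majority)
```

The fine-grained variant would apply the same vote counting per token rather than per full name.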
In conclusion, we answer the questions posed in the introduction. WER varies widely in this dataset: the average is much higher than the 20% reported in other papers (Miller et al., 2000). In a plot of WER versus NER performance shown in figure 7, the linear fit is substantially poorer than for the data reported in the work of Miller et al. Ranges of 0–64% or 28–89% F-measure for NER can be expected on noisy OCR data, depending on the document and the metric. Figure 7 shows some but not perfect correlation between NER quality and WER. Among those errors that directly cause greater WER, different kinds of errors affect NER quality to different degrees. The Libby text's WER was driven by poor character-level recognition (word order was actually good), while Inverness had more errors in word order, where text from two columns has been incorrectly interleaved by the OCR engine (its character-level recognition was good). From error analysis on such examples, it seems likely that word order errors play a bigger role in extraction errors than do character recognition errors.

We also conclude that combining basic methods can produce higher quality NER. Each of the three ensembles maximizes a different metric. The Majority Ensemble achieves the highest F-measure over the entire corpus, compared to any of the base extractors and to the other ensembles. The Intersection Ensemble achieves the highest precision and the Union Ensemble achieves the highest recall. Each of these results is useful for a different application. If the intended application is a person name search engine, users do not want to manually sift through many false positives; with a sufficiently large corpus containing millions of book and newspaper titles, a precision of 89.6% would be more desirable than a precision of 61.6%, even when only 14.1% of the names available in the corpus can be recognized (low recall). Alternatively, if higher recall is necessary for an application in which no instances should be missed, then the high-recall Union Ensemble could be used as a filter of the candidates to be shown. Browsing and exploration of a data set for every case may be such an application. High-recall name browsing could facilitate manual labeling or checking.

This work is a starting point against which to compare techniques which we hope will be more effective in automatically adapting to new document formats and genres in the noisy OCR setting. One way to adapt the supervised machine learning approaches is to apply a more realistic noise model of OCR errors to the CoNLL data. Another is to use semi-supervised machine learning techniques to take advantage of the large volume of unlabeled and previously unused data available in each of the titles in this corpus. We plan to contrast this with the more laborious method of producing labeled training data from within the present corpus. Additional feature engineering and additional labeled pages for evaluation are also in order. The rule-based Regex extractor could also be adapted automatically to differing document or page formats by filtering a larger set of regular expressions in the first of two passes over each document. Finally, we plan to combine NER with work on OCR error correction (Lund and Ringger, 2009) to see if the combination can improve accuracies jointly in both OCR and information extraction.

We would like to acknowledge Ancestry.com and Lee Jensen of Ancestry.com for providing the OCR data from their free-text collection and for financial support. We would also like to thank Lee Jensen for discussions regarding applications of this work and the related constraints.
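For reference, WER as discussed above is conventionally the word-level edit distance normalized by the reference length. A minimal sketch (assuming whitespace tokenization; not the authors' evaluation code):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance
    (substitutions + insertions + deletions) divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution / match
        prev = cur
    return prev[-1] / len(ref)

print(wer("john smith lived here", "john smyth lived"))  # 2 errors / 4 words = 0.5
```

Note that plain WER counts a column-interleaving error of the kind described for Inverness the same way it counts character-level misrecognitions, which is one reason WER correlates only loosely with NER quality.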

Similar publications

Article
Full-text available
Text line segmentation is an essential pre-processing stage for off-line handwriting recognition in many Optical Character Recognition (OCR) systems. It is an important step because inaccurately segmented text lines will cause errors in the recognition stage. Text line segmentation of handwritten documents is still one of the most compl...
Article
Full-text available
This paper evaluates the retrieval effectiveness degradation when faced with a noisy text corpus. With the use of a test collection having the clean text, another version with around a 5% error rate in recognition and a third with a 20% error rate, we have evaluated six IR models based on three text representations (bag-of-words, n-grams, trunc-n) as we...
Conference Paper
Full-text available
Segmentation of handwritten document images is a complex task due to the variability in writing styles. The segmentation technique has to deal with non-uniformly skewed, overlapped and touching lines. Very few works addressing these issues have been carried out yet. This paper presents a novel methodology for segmenting handwritten Malayalam...
Conference Paper
Full-text available
Text preprocessing in document images is an important task and a prerequisite before performing segmentation, feature extraction and recognition of text. Many systems have been proposed, but less attention has been given to images acquired by a smartphone. In this paper, we propose a new system for text preprocessing in document images captured...
Conference Paper
Full-text available
Market demand for an embedded realization of video OCR motivated the authors to attempt to evaluate the performance of existing document image OCR techniques for the same. Thus the authors have tried to port open source OCR systems like GOCR and Tesseract to an embedded platform. But their performance on an embedded platform shows that th...

Citations

... Due to a combination of the above issues, a number of historical TM efforts have either completely or partially abandoned the usual ML-based supervised approach to NE recognition. Instead, the methods employed are either based upon, or incorporate, hand-written rules (which attempt to model the textual patterns that can signify the existence of NEs) and/or dictionaries that contain inventories of known NEs (e.g., [21–26]). Such methods tend to be less successful than ML-based approaches. ...
Article
Full-text available
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, due to differences and evolution in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system.
The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.
... They describe many of the problems encountered by NER systems that result from both OCR artefacts and the archaic nature of the sources themselves, such as conflation of marginal notes with body text, multi-line quoting rules, and capitalization of common nouns. Packer et al. (2010) tested three different methods for the extraction of person names from noisy OCR output, scoring the results from each method against a hand-annotated reference. They noted that a correlation between OCR word error rate (WER) and NER quality did exist, but was small, and hypothesised that errors in the OCR deriving from misunderstanding of page-level features (i.e. ...
Conference Paper
Full-text available
This short paper analyses an experiment comparing the efficacy of several Named Entity Recognition (NER) tools at extracting entities directly from the output of an optical character recognition (OCR) workflow. The authors present how they first created a set of test data, consisting of raw and corrected OCR output manually annotated with people, locations, and organizations. They then ran each of the NER tools against both raw and corrected OCR output, comparing the precision, recall, and F1 score against the manually annotated data.
... The enumerators also introduced errors because of spelling errors of surnames and geographic names during the input step, or because they misinterpreted the instructions given to them by the census takers [1]. Numerous new errors were further introduced during the digitisation process, which is common in the processing of historical documents [21]. Because "dirty" data is one of the biggest obstacles to accurate linking, extensive and accurate data cleaning is essential before any data linking can be performed [15]. ...
Conference Paper
Full-text available
Historical census data captures information about our ancestors. These data contain the social status at a certain point in time. They contain valuable information for genealogists, historians, and social scientists. Historical census data can be used to reconstruct important aspects of a particular era in order to trace the changes in households and families. Record linkage across different historical census datasets can help to improve the quality of the data, enrich existing census data with additional information, and facilitate improved retrieval of information. In this paper, we introduce a domain driven approach to automatically clean and link historical census data based on recent developments in group linkage techniques. The key contribution of our approach is to first detect households, and to use this information to refine the cleaned data and improve the accuracy of linking records between census datasets. We have developed a two-step linking approach, which first links individual records using approximate string similarity measures, and then performs a group linking based on the previously detected households. The results show that this approach is effective and can greatly reduce the manual efforts required for data cleaning and linking by social scientists.
Article
Full-text available
Text databases have grown tremendously in number, size, and volume over the last few decades. Optical Character Recognition (OCR) software scans texts and makes them available in online repositories. The OCR transcription process is often not accurate, resulting in large volumes of garbled text in the repositories. Spell correction and other post-processing of OCR text often prove to be very expensive and time-consuming. While it is possible to rely on the OCR model to assess the quality of text in a corpus, many natural language processing and information retrieval tasks prefer the extrinsic evaluation of the effect of noise on the task at hand. This paper examines the effect of noise on the unsupervised ranking of person name entities by first populating a list of person names using an out-of-the-box Named Entity Recognition (NER) software, extracting content-based features for the identified entities, and ranking them using a novel unsupervised Kernel Density Estimation (KDE) based ranking algorithm. This generative model has the ability to learn rankings using the data distribution and therefore requires limited manual intervention. Empirical results are presented on a carefully curated parallel corpus of OCR and clean text and "in the wild" using a large real-world corpus. Experiments on the parallel corpus reveal that even with a reasonable degree of noise in the dataset, it is possible to generate ranked lists using the KDE algorithm with a high degree of precision and recall. Furthermore, since the KDE algorithm has comparable performance to state-of-the-art unsupervised rankers, it is feasible to use on real-world corpora. The paper concludes by reflecting on other methods for enhancing the performance of the unsupervised algorithm on OCR text such as cleaning entity names, disambiguating names concatenated to one another, and correcting OCR errors that are statistically significant in the corpus.
Chapter
Access to long-run historical data in the fields of social sciences, economics and political sciences has been identified as one necessary condition to understand the dynamics of the past and the way those dynamics structure our present and future. Financial yearbooks are historical records reporting information about the companies of stock exchanges. This paper concentrates on the description of the key components that implement a financial information extraction system from financial yearbooks. The proposed system consists of three steps: OCR, linked named entities extraction, active learning. The core of the system is related to linked named entities extraction (LNE). LNE are coherent n-tuples of named entities describing high-level semantic information. In this respect we developed, tested and compared a CRF and a hybrid RNN/CRF based system. Active learning allows us to cope with the lack of annotated data for training the system. Promising performance results are reported on two yearbooks (the French Desfossé yearbook (1962) and the German Handbuch (1914–15)) and for two LNE extraction tasks: capital information of companies and constitution information of companies.
Conference Paper
Full-text available
Named Entity Recognition (NER), search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system's performance is genre and domain dependent and the entity categories used also vary [16]. The most general set of named entities is usually some version of a three-partite categorization of locations, persons and organizations. In this paper we report evaluation results of NER with data out of a digitized Finnish historical newspaper collection, Digi. Experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910, both in Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word-level correctness is about 70–75% [7]. Our baseline NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. Three other available tools are also evaluated: a Finnish Semantic Tagger (FST), Connexor's NER tool and Polyglot's NER.
Conference Paper
Full-text available
Named entity recognition (NER), search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system's performance is genre and domain dependent and the entity categories used also vary [1]. The most general set of named entities is usually some version of a three-partite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data out of a digitized Finnish historical newspaper collection, Digi. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910, both in Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word-level correctness is about 74–75% [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. SeCo's tools achieve 30.0–60.0 F-score with locations and persons. Performance with the data shows that at best about half of named entities can be recognized even in a quite erroneous OCRed text.
Article
Full-text available
Purpose – This paper aims to present an evaluation of open source OCR for supporting research on material in small‐ to medium‐scale historical archives. Design/methodology/approach – The approach was to develop a workflow engine to support the easy customisation of the OCR process towards the historical materials using open source technologies. Commercial OCR often fails to deliver sufficient results here, as their processing is optimised towards large‐scale commercially relevant collections. The approach presented here allows users to combine the most effective parts of different OCR tools. Findings – The authors demonstrate their application and its flexibility and present two case studies, which demonstrate how OCR can be embedded into wider digitally enabled historical research. The first case study produces high‐quality research‐oriented digitisation outputs, utilizing services that the authors developed to allow for the direct linkage of digitisation image and OCR output. The second case study demonstrates what becomes possible if OCR can be customised directly within a larger research infrastructure for history. In such a scenario, further semantics can be added easily to the workflow, enhancing the research browse experience significantly. Originality/value – There has been little work on the use of open source OCR technologies for historical research. This paper demonstrates that the authors' workflow approach allows users to combine commercial engines' ability to read a wider range of character sets with the flexibility of open source tools in terms of customisable pre‐processing and layout analysis. All this can be done without the need to develop dedicated code.