Figure 2
Source publication
Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficul...
Contexts in source publication
Context 1
... To the best of our knowledge, this collection has greater variety in formatting and genre than any other image-and-text corpus used in a published NER experiment. The data includes unstructured text (full sentences), structured (tabular) text including long lists of names and end-of-book indexes, and multi-column formatted text from the books and newspapers. The data used as input to our named entity recognizers is the OCR output for 12 titles spanning a diverse range of printed historical documents with relevance to genealogy and family history research. These documents are described in table 1.

In starting such a project, we had several questions: What variation of word error rate (WER) can be expected over multiple OCR engines and types of documents? What level of NER quality is achievable in a couple of months of development time, particularly when no annotated data is available for the corpus for training or evaluation purposes? How well can we do on a truly noisy and diverse corpus of OCR data? How do competing extraction approaches compare over different document types? Can improvements in extraction quality be gained by combining the strengths of different extractors? We provide answers to these questions in the following sections. In § 2 we describe the data we used as well as the names extracted. In § 3 we present each of the basic extraction methods and examine their performance. In § 4 we present a straightforward ensemble method for combining the basic extraction methods and show an improvement in performance over each of the component extractors. Finally, we conclude and discuss future work (§ 5).

This particular collection of OCR documents were originally intended to be indexed for keyword search. Because the search application requires no more than a bag-of-words representation, much of the document structure and formatting, including punctuation and line boundaries in many cases, were discarded before the data was made available, which affects the quality of the documents with respect to NER. Furthermore, in parts of some of the documents, the original token ordering was not preserved consistently: in some cases this was caused by the OCR engine being unable to first separate columns into distinct sections of text, while in other cases this was more likely caused by the noisiness and poor thresholding (binarization) of the image. The quality of the original images on which OCR was performed varied greatly. Consequently, this corpus represents a very noisy and diverse setting for extracting information.

Three OCR engines were used in the production of the data used in this study. PrimeOCR, a commercial voting system utilizing six OCR engines, selects the best results from those engines (PrimeRecognition, 2009). Abby is a version of Abby FineReader used within an OCR engine produced by Kofax (Kofax, 2009). The newspapers were OCRed by an engine that was not identified by the corpus owner.

Examples of images and corresponding OCR output are given in figures 1 and 2. Figure 1 is an example of one of the poorer quality images and the accompanying OCR output; causes of poor quality are dark splotches in the noisy image and the fact that the OCR engine failed to recognize column boundaries during zoning. In figure 2, letter-spacing is incorrectly interpreted by the OCR engine, resulting in the introduction of superfluous spaces within words. This figure also illustrates the common problem of words that are split and hyphenated at line boundaries as well as other types of errors.

Our task is the extraction of the full names of people, e.g., "Mrs Herschel Williams", from OCRed documents. Since the corpus was not originally intended as a public benchmark for NER, the pages used for development test and blind test data were hand-annotated for the current project. One to two pages from each document were annotated for each of the development test and blind test sets. The annotations consisted of marking person names, including titles. The number of names annotated in the blind test set is given in table 1 for each document in the corpus. Blind test pages were not inspected during the development of the extraction systems. All extractors, including ensembles, were applied to the same pages.

We built four person name recognizers while exploring possible adaptations of existing named entity recognition methods to the genre and especially the noisiness of our corpus. The work required a couple of months, with each extractor being built by a different researcher. The extractors are designated as dictionary-based, regular expression based, MEMM (maximum-entropy Markov model) and CRF (conditional random field). In these four extractors, we are comparing solutions from two competing disciplines for NER: the hand-written, rule-based approach and the supervised machine learning approach. We applied them individually and collectively (within the ensemble extractors) on the blind test data and report a summary of their results in figures 3 and 5 (coarse metrics), and 4 and 6 (fine metrics). Only the results for the coarse-grained ensembles are reported in these four figures.

Precision, recall and F-measure scores were calculated for person names in both a coarse and a fine manner. When variations in individual systems were considered, the options which performed best on development test data were used. The dictionary extractor is a simple extractor, intended as a baseline ...
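As an aside, the following minimal sketch shows how precision, recall and F-measure over extracted person names might be computed with a simple exact-match (coarse-style) criterion. The representation of names as plain strings and the example data are assumptions for illustration; this is not the authors' actual scoring code, and their coarse and fine metrics are defined differently from this toy version.

```python
# Minimal sketch: precision/recall/F-measure for extracted person names,
# using exact string matching. Illustrative only, not the paper's scorer.

def name_prf(gold: set[str], predicted: set[str]) -> tuple[float, float, float]:
    """Score exact matches between predicted and gold name strings."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    gold = {"Mrs Herschel Williams", "John Adams"}        # hypothetical annotations
    predicted = {"Herschel Williams", "John Adams"}       # hypothetical extractor output
    print(name_prf(gold, predicted))  # the partial name counts as a miss under exact matching
```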
Context 2
... for the benefit of many applications. Perhaps most importantly, IE from unstructured documents allows us to go beyond now-traditional keyword search and enables semantic search. Semantic search allows a user to search specifically for only those instances of an ambiguous name that belong to a semantic type such as person and to exclude instances of other entity types. By extracting information from noisy OCR data we aim to broaden the impact of IE technology to include printed documents that are otherwise inaccessible to digital tools. In particular, we are interested in books, newspapers, typed manuscripts, printed records, and other printed documents important for genealogy, family history and other historical research.

The specific task we target in the present study is the extraction of person names from a variety of types and formats of historical OCR documents. This task is an example of named entity recognition (NER) as described in (Nadeau and Sekine, 2007) and (Ratinov and Roth, 2009). Accurately and efficiently identifying names in noisy OCR documents containing many OCR errors presents a challenge beyond standard NER and requires adapting existing techniques or tools. Our applications of interest are search and machine-assisted browsing of document collections. Search requires names to be pre-identified and indexed. Machine-assisted browsing of document collections has greater tolerance for misidentified names.

There has been little published research on named entity extraction from noisy OCR data, but interest in this field is growing. Recent work by Grover et al. uses hand-written rules on two kinds of British parliamentary proceedings (2008). Earlier work by Miller et al. (2000) uses an HMM extractor on matched conditions: for their OCR task, they printed digital documents and scanned and OCRed the resulting copy to produce the OCR data for both training and test sets. To our knowledge, no published research targets the full extent of noisiness and diversity present in some real corpora or compares competing NER techniques used on the same OCR corpus.

The data used as input to our named entity recognizers is the OCR output for 12 titles spanning a diverse range of printed historical documents with relevance to genealogy and family history research. These documents are described in table 1. To the best of our knowledge, this collection has greater variety in formatting and genre than any other image-and-text corpus used in a published NER experiment. The data includes unstructured text (full sentences), structured (tabular) text including long lists of names and end-of-book indexes, and multi-column formatted text from the books and newspapers.

Three OCR engines were used in the production of the data used in this study. PrimeOCR, a commercial voting system utilizing six OCR engines, selects the best results from those engines (PrimeRecognition, 2009). Abby is a version of Abby FineReader used within an OCR engine produced by Kofax (Kofax, 2009). The newspapers were OCRed by an engine that was not identified by the corpus owner. Examples of images and corresponding OCR output are given in figures 1 and 2. Figure 1 is an example of one of the poorer quality images and the accompanying OCR output; causes of poor quality are dark splotches in the noisy image and the fact that the OCR engine failed to recognize column boundaries during zoning. In figure 2, letter-spacing is incorrectly interpreted by the OCR engine, resulting in the introduction of superfluous spaces within words. This figure also illustrates the common ...
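The two OCR error types that figure 2 illustrates, spurious letter-spacing inside words and hyphenation at line breaks, are commonly handled with simple text-normalization heuristics before extraction. The sketch below shows generic versions of such heuristics; it is an assumption-laden illustration, not the preprocessing actually used in the source publication.

```python
import re

# Illustrative cleanup heuristics for two OCR error types shown in figure 2:
# words hyphenated at line breaks and spurious letter-spacing inside words.
# Generic sketches only, not the authors' preprocessing.

def join_hyphenated_linebreaks(text: str) -> str:
    """Rejoin words split like 'Wil-\\nliams' across line boundaries."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

def collapse_letter_spacing(text: str) -> str:
    """Collapse runs of single spaced letters such as 'H e r s c h e l' into one token.

    Note: this heuristic would also collapse legitimate sequences of initials."""
    pattern = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)

if __name__ == "__main__":
    noisy = "Mrs H e r s c h e l Wil-\nliams attended."
    print(collapse_letter_spacing(join_hyphenated_linebreaks(noisy)))
    # -> "Mrs Herschel Williams attended."
```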
Citations
... Due to the challenging and complex nature of its writing and speech, Urdu is considered more difficult for language and text processing. Its complex structure, number of dots, sounds, shapes and context sensitivity are entirely different from those of other languages such as Chinese, English, Russian and Korean [23]. ...
NER is a natural language processing technique that primarily classifies parts of parsed text into well-known named entities. In natural language processing, named entity recognition is used to classify nouns that appear in bulk text data into predefined groups, such as names of people, places, times, dates, organizations, etc. There is a great deal of fragmented material and data in cyberspace, so scholars are working on several languages (e.g., Sindhi, English) with various approaches and techniques, depending on their locations, to improve the accessibility of filtered information for online users. NER enhances the quality of NLP in applications including automated summarization, semantic web search, information extraction and retrieval, machine translation and question answering, chatbots and others. This study designs an efficient framework to extract noun entities in Urdu using a hybrid approach. The UNER system not only extracts entities by searching through a list of names, but also extracts named entities by recognizing phrases in a given text. The UNER system is designed to recognize Urdu noun entities in predefined categories such as places, personal names, titled personal names, organizations, object names, trade names, abbreviations, dates and times, measurements, and text names in Urdu.
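To make the hybrid idea concrete, the toy sketch below combines a gazetteer (list) lookup with a simple title-plus-capitalized-words pattern. The gazetteer entries, the pattern and the English example are invented for illustration and are not taken from the UNER system described above, which works on Urdu text.

```python
import re

# Toy sketch of a hybrid NER lookup: gazetteer matching plus a title+name pattern.
# Entries and pattern are hypothetical; this is not the UNER implementation.

GAZETTEER = {
    "lahore": "PLACE",
    "karachi": "PLACE",
    "dawn": "ORGANIZATION",
}
TITLE_PATTERN = re.compile(r"\b(?:Mr|Mrs|Dr|Prof)\.? +([A-Z][a-z]+(?: [A-Z][a-z]+)*)")

def hybrid_tag(text: str) -> list[tuple[str, str]]:
    entities = []
    # 1) dictionary pass: label tokens found in the gazetteer
    for token in re.findall(r"\w+", text):
        label = GAZETTEER.get(token.lower())
        if label:
            entities.append((token, label))
    # 2) pattern pass: label titled personal names
    for match in TITLE_PATTERN.finditer(text):
        entities.append((match.group(0), "PERSON"))
    return entities

print(hybrid_tag("Dr. Ahmed Khan travelled from Lahore to Karachi."))
```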
... Finally, some adopt the approach of ensembling systems, i.e. of considering NE predictions not from one but several recognisers, according to various voting strategies. Packer et al. [138] applied three algorithms (dictionary-based, regular expressions-based, and HMM-based) in isolation and in combination for the recognition of person names in various types of English OCRed documents. They observed increased performances (particularly a better P/R balance) with a majority vote ensembling. ...
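A minimal sketch of this kind of majority-vote ensembling is given below. It assumes each recognizer returns a set of extracted name strings for a page and keeps candidates proposed by at least a given number of recognizers; the threshold and example outputs are illustrative assumptions, not the exact voting scheme used by Packer et al.

```python
from collections import Counter

# Minimal sketch of majority-vote ensembling over several recognizers' outputs.
# Each recognizer is assumed to return a set of extracted name strings per page.

def majority_vote(predictions: list[set[str]], min_votes: int = 2) -> set[str]:
    """Keep a candidate name if at least `min_votes` recognizers proposed it."""
    votes = Counter(name for prediction in predictions for name in prediction)
    return {name for name, count in votes.items() if count >= min_votes}

dictionary_out = {"John Adams", "Mary Smith"}   # hypothetical extractor outputs
regex_out = {"John Adams", "Wm. Jones"}
hmm_out = {"John Adams", "Mary Smith", "Wm. Jones"}
print(majority_vote([dictionary_out, regex_out, hmm_out]))
# -> all three names survive a 2-of-3 vote
```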
After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.
... re-tokenization of words, and OCR correction made by exploiting the reference correction provided for entity surface and manual correction. Packer et al. [34] applied four extraction algorithms (dictionary-based, regular expression-based, Maximum Entropy Markov Model (MEMM), and Conditional Random Field (CRF)) and improved upon the performance of individual models by making an ensemble. Hamdi et al. [35] simulated different kinds of OCR errors by adding noise to the CONLL-03 NER corpus. ...
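The following is a toy character-level noise injector in the spirit of such OCR-error simulations. The confusion pairs and error rate are arbitrary choices for illustration and are not the error model used in the cited experiments.

```python
import random

# Toy character-level noise injector, loosely in the spirit of simulating OCR
# errors on a clean NER corpus. Confusions and rate are arbitrary assumptions.

CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn", "i": "í"}

def add_ocr_noise(text: str, error_rate: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.lower() in CONFUSIONS and rng.random() < error_rate:
            out.append(CONFUSIONS[ch.lower()])
        else:
            out.append(ch)
    return "".join(out)

print(add_ocr_noise("William Herschel lived in Slough.", error_rate=0.3))
```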
Text databases have grown tremendously in number, size, and volume over the last few decades. Optical Character Recognition (OCR) software scans the text and makes them available in online repositories. The OCR transcription process is often not accurate resulting in large volumes of garbled text in the repositories. Spell correction and other post-processing of OCR text often prove to be very expensive and time-consuming. While it is possible to rely on the OCR model to assess the quality of text in a corpus, many natural language processing and information retrieval tasks prefer the extrinsic evaluation of the effect of noise on the task at hand. This paper examines the effect of noise on the unsupervised ranking of person name entities by first populating a list of person names using an out-of-the-box Named Entity Recognition (NER) software, extracting content-based features for the identified entities, and ranking them using a novel unsupervised Kernel Density Estimation (KDE) based ranking algorithm. This generative model has the ability to learn rankings using the data distribution and therefore requires limited manual intervention.
Empirical results are presented on a carefully curated parallel corpus of OCR and clean text and "in the wild" using a large real-world corpus. Experiments on the parallel corpus reveal that even with a reasonable degree of noise in the dataset, it is possible to generate ranked lists using the KDE algorithm with a high degree of precision and recall. Furthermore, since the KDE algorithm has comparable performance to state-of-the-art unsupervised rankers, it is feasible to use on real-world corpora. The paper concludes by reflecting on other methods for enhancing the performance of the unsupervised algorithm on OCR text such as cleaning entity names, disambiguating names concatenated to one another, and correcting OCR errors that are statistically significant in the corpus.
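A rough sketch of the general density-based ranking idea follows: fit a kernel density estimate over per-entity feature vectors and rank entities by their estimated density. The feature choices and data here are invented, and this is not the cited paper's KDE ranking algorithm, only an illustration of the underlying mechanism using SciPy's `gaussian_kde`.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Sketch: rank candidate entities by the density of their feature vectors.
# Features (mention count, mean sentence position) are hypothetical.

entities = ["John Adams", "Mary Smith", "J0hn Adarns", "Wm. Jones"]
feature_matrix = np.array([[12, 0.4], [9, 0.5], [1, 0.9], [5, 0.3]], dtype=float)

kde = gaussian_kde(feature_matrix.T)   # gaussian_kde expects shape (n_dims, n_points)
scores = kde(feature_matrix.T)         # estimated density at each entity's feature vector

for name, score in sorted(zip(entities, scores), key=lambda pair: -pair[1]):
    print(f"{name}\t{score:.4f}")      # higher density -> higher rank
```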
... They observed a consistent relationship between decreased OCR quality and worse NER accuracy. [15] explored the relationship between the word error rates of noisy speech and OCR on downstream NER, and [19] noted the difficulty of extracting names from noisy OCR text. ...
Document digitization is essential for the digital transformation of our societies, yet a crucial step in the process, Optical Character Recognition (OCR), is still not perfect. Even commercial OCR systems can produce questionable output depending on the fidelity of the scanned documents. In this paper, we demonstrate an effective framework for mitigating OCR errors for any downstream NLP task, using Named Entity Recognition (NER) as an example. We first address the data scarcity problem for model training by constructing a document synthesis pipeline, generating realistic but degraded data with NER labels. We measure the NER accuracy drop at various degradation levels and show that a text restoration model, trained on the degraded data, significantly closes the NER accuracy gaps caused by OCR errors, including on an out-of-domain dataset. For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
... Person names are often recognised together with other entities, such as locations and organisations [6], [7], [8], [14]. [22] focus on extracting names from noisy OCR data by combining rule-based methods, the Maximum Entropy Markov Model, and the CRF model using a simple voting-based ensemble. [23] extract person names from emails using CRF. ...
Person names are essential entities in the Named Entity Recognition (NER) task. Traditional NER models have good performance in recognising well-formed person names from text with consistent and complete syntax, such as news articles. However, user-generated documents such as academic homepages and articles in online forums may contain lots of free-form text with incomplete syntax and person names in various forms. To address person name recognition in this context, we propose a fine-grained annotation scheme based on anthroponymy. To take full advantage of the fine-grained annotations, we propose a Co-guided Neural Network (CogNN) for person name recognition. CogNN fully explores the intra-sentence context and rich training signals of name forms. However, the inter-sentence context and implicit relations, which are essential for recognizing person names in long documents, are not captured. To address this issue, we propose a Multi-inference Overlapped BERT Model (NameRec*) through an overlapped input processor, and an inter-sentence encoder with bidirectional overlapped contextual embedding learning and multiple inference mechanisms. NameRec* takes full advantage of inter-sentence context in long documents, but loses this advantage for short documents without much inter-sentence context. To derive benefit from documents with differing amounts of context, we further propose an advanced Adaptive Multi-inference Overlapping BERT Model (Ada-NameRec*) to dynamically adjust the inter-sentence overlapping ratio to different documents. We conduct extensive experiments to demonstrate the superiority of the proposed methods on both academic homepages and news articles.
... In terms of NER in historical texts, different methods have been applied for English so far: rule-based NER (Grover et al., 2008), and Maximum Entropy Markov Models and Conditional Random Fields (Packer et al., 2010). Different tools for NER (Rodriguez et al., 2012) such as OpenNLP, Stanford NER, AlchemyAPI and OpenCalais are available. ...
... This may be explained by the fact that person name recognition is more dependent on finding patterns in the text while location name recognition is more dependent on gazetteer resources. In [10] the authors focus on full name extraction on a corpus composed of 12 titles spanning a diverse range of printed historical documents with relevance to genealogy and family history research. The results show a certain correlation between WER and the F-measure of the NER systems. From the analysis of errors the authors conclude that word order errors play a bigger role in extraction errors than do character recognition errors, which seems reasonable as extraction systems rely heavily on information from contextual word neighbours. ...
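For reference, word error rate (WER) is the word-level edit distance between an OCR hypothesis and the reference transcription, normalized by the reference length. The sketch below gives the textbook computation to make the WER/F-measure discussion concrete; it is not code from the cited study.

```python
# Standard word error rate (WER) via word-level edit distance.
# Included only to make the WER discussion concrete; not from the cited study.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("Mrs Herschel Williams attended", "Mrs H erschel Wi1liams attended"))
# -> 0.75 (three word-level errors against a four-word reference)
```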
Access to long-run historical data in the field of social sciences, economics and political sciences has been identified as one necessary condition to understand the dynamics of the past and the way those dynamics structure our present and future. Financial yearbooks are historical records reporting on information about the companies of stock exchanges. This paper concentrates on the description of the key components that implement a financial information extraction system from financial yearbooks. The proposed system consists in three steps: OCR, linked named entities extraction, active learning. The core of the system is related to linked named entities extraction (LNE). LNE are coherent n-tuple of named entities describing high level semantic information. In this respect we developed, tested and compared a CRF and a hybrid RNN/CRF based system. Active learning allows to cope with the lack of annotated data for training the system. Promising performance results are reported on two yearbooks (the French Desfossé yearbook (1962) and the German Handbuch (1914–15)) and for two LNE extraction tasks: capital information of companies and constitution information of companies.
... Finally, we created a second version of the evaluation pages, in which the text has been produced by an automatic OCR system instead of manually recognized and the NE annotation is manually verified to take into account the errors introduced by the OCR system. Specifically, we employ the 50% rule, i.e., we remove NE annotation from tokens with more than 50% character errors [21]. ...
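One way the 50% rule described above could be implemented is to compute a per-token character error rate (character edit distance divided by the length of the ground-truth token) and drop the NE label when it exceeds 0.5. The sketch below shows that reading; the exact matching procedure used in the cited work may differ.

```python
# Sketch of the "50% rule": drop the NE label from a token whose character
# error rate against its ground-truth form exceeds 0.5. Illustrative only.

def char_edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]

def keep_annotation(gold_token: str, ocr_token: str, threshold: float = 0.5) -> bool:
    cer = char_edit_distance(gold_token, ocr_token) / max(len(gold_token), 1)
    return cer <= threshold

print(keep_annotation("Herschel", "Hcrschel"))   # True: 1/8 characters wrong
print(keep_annotation("Herschel", "H3r$c#c1"))   # False: most characters wrong
```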
... NER experiments with OCRed data in other languages usually show improvement of NER when the quality of the OCRed data has been improved from very poor to somewhat better (see e.g. [21,38,39]). Miller et al. [39] show that the NER performance of a trainable statistical tagger degraded linearly as a function of word error rate. ...
Named Entity Recognition (NER), the search, classification, and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. Performance of a NER system is usually quite heavily genre and domain dependent. Entity categories used in NER may also vary. The most used set of named entity categories is usually some version of a three-partite categorization of locations, persons, and organizations. In this paper we report evaluation results with data extracted from a digitized Finnish historical newspaper collection Digi using two statistical NER systems, namely, Stanford Named Entity Recognizer and an LSTM-CRF NER model. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 70-75%. Our NER evaluation collection and training data are based on ca. 500 000 words which have been manually corrected from OCR output of ABBYY FineReader 11. We also have available evaluation data of new uncorrected OCR output of Tesseract 3.04.01. Our Stanford NER results are mostly satisfactory. With our ground truth data we achieve an F-score of 0.89 with locations and 0.84 with persons. With organizations the result is 0.60. With re-OCRed Tesseract output the results are 0.79, 0.72, and 0.42, respectively. Results of LSTM-CRF are similar.