Figure 2 - uploaded by Eric K. Ringger
Pairs of image and corresponding OCR text from one page of Montclair. 


Source publication
Article
Named entity recognition from scanned and OCRed historical documents can contribute to historical research. However, entity recognition from historical documents is more difficult than from natively digital data because of the presence of word errors and the absence of complete formatting information. We apply four extraction algorithms to vario...

Contexts in source publication

Context 1
... In starting such a project, we had several questions: What variation of word error rate (WER) can be expected over multiple OCR engines and types of documents? What level of NER quality is achievable in a couple of months of development time, particularly when no annotated data is available for the corpus for training or evaluation purposes? How well can we do on a truly noisy and diverse corpus of OCR data? How do competing extraction approaches compare over different document types? Can improvements in extraction quality be gained by combining the strengths of different extractors?

We provide answers to these questions in the following sections. In § 2 we describe the data we used as well as the names extracted. In § 3 we present each of the basic extraction methods and examine their performance. In § 4 we present a straightforward ensemble method for combining the basic extraction methods and show an improvement in performance over each of the component extractors. Finally, we conclude and discuss future work (§ 5).

The data used as input to our named entity recognizers is the OCR output for 12 titles spanning a diverse range of printed historical documents with relevance to genealogy and family history research. These documents are described in table 1. To the best of our knowledge, this collection has greater variety in formatting and genre than any other image-and-text corpus used in a published NER experiment. The data includes unstructured text (full sentences), structured (tabular) text including long lists of names and end-of-book indexes, and multi-column formatted text from the books and newspapers.

This particular collection of OCR documents were originally intended to be indexed for keyword search. Because the search application requires no more than a bag-of-words representation, much of the document structure and formatting, including punctuation and line boundaries in many cases, were discarded before the data was made available, which affects the quality of the documents with respect to NER. Furthermore, in parts of some of the documents, the original token ordering was not preserved consistently: in some cases this was caused by the OCR engine being unable to first separate columns into distinct sections of text, while in other cases this was more likely caused by the noisiness and poor thresholding (binarization) of the image. The quality of the original images on which OCR was performed varied greatly. Consequently, this corpus represents a very noisy and diverse setting for extracting information.

Three OCR engines were used in the production of the data used in this study. PrimeOCR, a commercial voting system utilizing six OCR engines, selects the best results from those engines (PrimeRecognition, 2009). Abby is a version of Abby FineReader used within an OCR engine produced by Kofax (Kofax, 2009). The newspapers were OCRed by an engine that was not identified by the corpus owner.

Examples of images and corresponding OCR output are given in figures 1 and 2. Figure 1 is an example of one of the poorer quality images and accompanying OCR output. Causes of poor quality are dark splotches in the noisy image and the fact that the OCR engine failed to recognize column boundaries during zoning. In figure 2, letter-spacing is incorrectly interpreted by the OCR engine, resulting in the introduction of superfluous spaces within words. This figure also illustrates the common problem of words that are split and hyphenated at line boundaries as well as other types of errors.

Our task is the extraction of the full names of people, e.g., “Mrs Herschel Williams”, from OCRed documents. Since the corpus was not originally intended as a public benchmark for NER, the pages used for development test and blind test data were hand-annotated for the current project. One to two pages from each document were annotated for each of the development test and blind test sets. The annotations consisted of marking person names, including titles. The number of names annotated in the blind test set is given in table 1 for each document in the corpus. Blind test pages were not inspected during the development of the extraction systems. All extractors, including ensembles, were applied to the same pages. When variations in individual systems were considered, the options which performed best on development test ...

Precision, recall and F-measure scores were calculated for person names in both a coarse and fine manner. The coarse-grained ... above, the extractor that recognizes only “William Herschel” as a full name will have two true positives, ...

We built four person name recognizers while exploring possible adaptations of existing named entity recognition methods to the genre and especially the noisiness of our corpus. The work required a couple of months, with each extractor being built by a different researcher. The extractors are designated as dictionary-based, regular expression based, MEMM (maximum-entropy Markov model) and CRF (conditional random field). In these four extractors, we are comparing solutions from two competing disciplines for NER: the hand-written, rule-based approach and the supervised machine learning approach. We applied them individually and collectively (within the ensemble extractors) on the blind test data and report a summary of their results in figures 3 and 5 (coarse metrics), and 4 and 6 (fine metrics). Only the results for the coarse-grained ensembles are reported in these four figures.

The dictionary extractor is a simple extractor, intended as a baseline and requiring about 20 to 30 ...
Context 2
... for the benefit of many applications. Perhaps most importantly, IE from unstructured documents allows us to go beyond now-traditional keyword search and enables semantic search. Semantic search allows a user to search specifically for only those instances of an ambiguous name that belong to a semantic type such as person and to exclude instances of other entity types. By extracting information from noisy OCR data we aim to broaden the impact of IE technology to include printed documents that are otherwise inaccessible to digital tools. In particular, we are interested in books, newspapers, typed manuscripts, printed records, and other printed documents important for genealogy, family history and other historical research.

The specific task we target in the present study is the extraction of person names from a variety of types and formats of historical OCR documents. This task is an example of named entity recognition (NER) as described in (Nadeau and Sekine, 2007) and (Ratinov and Roth, 2009). Accurately and efficiently identifying names in noisy OCR documents containing many OCR errors presents a challenge beyond standard NER and requires adapting existing techniques or tools. Our applications of interest are search and machine-assisted browsing of document collections. Search requires names to be pre-identified and indexed. Machine-assisted browsing of document collections has greater tolerance for misidentified names.

There has been little published research on named entity extraction from noisy OCR data, but interest in this field is growing. Recent work by Grover et al. uses hand-written rules on two kinds of British parliamentary proceedings (2008). Earlier work by Miller et al. (2000) uses an HMM extractor on matched conditions: for their OCR task, they printed digital documents and scanned and OCRed the resulting copy to produce the OCR data for both training and test sets. To our knowledge, no published research targets the full extent of noisiness and diversity present in some real corpora or compares competing NER techniques on the same OCR corpus.

In starting such a project, we had several questions: What variation of word error rate (WER) can be expected over multiple OCR engines and types of documents? What level of NER quality is achievable in a couple of months of development time, particularly when no annotated data is available for the corpus for training or evaluation purposes? How well can we do on a truly noisy and diverse corpus of OCR data? How do competing extraction approaches compare over different document types? Can improvements in extraction quality be gained by combining the strengths of different extractors?

We provide answers to these questions in the following sections. In § 2 we describe the data we used as well as the names extracted. In § 3 we present each of the basic extraction methods and examine their performance. In § 4 we present a straightforward ensemble method for combining the basic extraction methods and show an improvement in performance over each of the component extractors. Finally, we conclude and discuss future work (§ 5).

The data used as input to our named entity recognizers is the OCR output for 12 titles spanning a diverse range of printed historical documents with relevance to genealogy and family history research. These documents are described in table 1. To the best of our knowledge, this collection has greater variety in formatting and genre than any other image-and-text corpus used in a published NER experiment. The data includes unstructured text (full sentences), structured (tabular) text including long lists of names and end-of-book indexes, and multi-column formatted text from the books and newspapers.

This particular collection of OCR documents were originally intended to be indexed for keyword search. Because the search application requires no more than a bag-of-words representation, much of the document structure and formatting, including punctuation and line boundaries in many cases, were discarded before the data was made available, which affects the quality of the documents with respect to NER. Furthermore, in parts of some of the documents, the original token ordering was not preserved consistently: in some cases this was caused by the OCR engine being unable to first separate columns into distinct sections of text, while in other cases this was more likely caused by the noisiness and poor thresholding (binarization) of the image. The quality of the original images on which OCR was performed varied greatly. Consequently, this corpus represents a very noisy and diverse setting for extracting information.

Three OCR engines were used in the production of the data used in this study. PrimeOCR, a commercial voting system utilizing six OCR engines, selects the best results from those engines (PrimeRecognition, 2009). Abby is a version of Abby FineReader used within an OCR engine produced by Kofax (Kofax, 2009). The newspapers were OCRed by an engine that was not identified by the corpus owner.

Examples of images and corresponding OCR output are given in figures 1 and 2. Figure 1 is an example of one of the poorer quality images and accompanying OCR output. Causes of poor quality are dark splotches in the noisy image and the fact that the OCR engine failed to recognize column boundaries during zoning. In figure 2, letter-spacing is incorrectly interpreted by the OCR engine, resulting in the introduction of superfluous spaces within words. This figure also illustrates the common ...
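The context above notes that OCR output often contains words split and hyphenated at line boundaries. A minimal sketch of one common repair heuristic follows; it is an assumed preprocessing step for illustration, not something the source publication says it applied. The function name and the regular expression are hypothetical.

```python
import re

# A word character, a hyphen, optional trailing blanks, a newline, optional
# leading blanks, then a lowercase letter that continues the word.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-[ \t]*\n[ \t]*([a-z])")


def dehyphenate(text):
    """Rejoin words that were hyphenated across a line break."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


sample = "entity recogni-\ntion from historical docu-\nments"
print(dehyphenate(sample))  # entity recognition from historical documents
```

Requiring a lowercase continuation avoids merging double-barrelled surnames broken across lines, but the heuristic still damages genuine compounds: "image-" followed by "and-text" on the next line would become "imageand-text". A dictionary check on the joined word is one way to reduce such errors.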

Similar publications

Article
Summary Text line segmentation is an essential pre-processing stage for off-line handwriting recognition in many Optical Character Recognition (OCR) systems. It is an important step because inaccurately segmented text lines will cause errors in the recognition stage. Text line segmentation of the handwritten documents is still one of the most compl...
Article
This paper evaluates the degradation in retrieval effectiveness when facing a noisy text corpus. With the use of a test-collection having the clean text, another version with around 5% error rate in recognition and a third with 20% error rate, we have evaluated six IR models based on three text representations (bag-of-words, n-grams, trunc-n) as we...
Conference Paper
Segmentation of handwritten document images is a complex task due to the variability in the writing styles. The segmentation technique has to deal with non-uniformly skewed, overlapped and touching lines. A very few works have been carried out yet, addressing these issues. This paper presents a novel methodology for segmenting handwritten Malayalam...
Conference Paper
Text preprocessing in document image is an important task and a prerequisite before performing segmentation, features extraction and recognition of text. Many systems have been proposed, but less attention has been given to the images acquired by a smartphone. In this paper, we propose a new system of text preprocessing in document images captured...
Conference Paper
Market demand for an embedded realization of video OCR motivated the authors to evaluate the performance of existing document image OCR techniques for the same. Thus the authors have tried to port open source OCR systems like GOCR and Tesseract to an embedded platform. But their performance on an embedded platform shows that th...

Citations

... Due to a combination of the above issues, a number of historical TM efforts have either completely or partially abandoned the usual ML-based supervised approach to NE recognition. Instead, the methods employed are either based upon, or incorporate, hand-written rules (which attempt to model the textual patterns that can signify the existence of NEs) and/or dictionaries that contain inventories of known NEs (e.g., [21, 22, 23, 24, 25, 26]). Such methods tend to be less successful than ML-based approaches. ...
Article
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. 
The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.
... They describe many of the problems encountered by NER systems that result from both OCR artefacts and the archaic nature of the sources themselves, such as conflation of marginal notes with body text, multi-line quoting rules, and capitalization of common nouns. Packer et al. (2010) tested three different methods for the extraction of person names from noisy OCR output, scoring the results from each method against a hand-annotated reference. They noted a correlation between OCR word error rate (WER) and NER quality did exist, but was small, and hypothesised that errors in the OCR deriving from misunderstanding of page-level features (i.e. ...
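The snippet above refers to OCR word error rate (WER). For readers unfamiliar with the measure, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between OCR output and a reference transcription, divided by the number of reference words. Whitespace tokenization and the sample sentence are assumptions made for the example; the cited studies may tokenize and align differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Invented example: a superfluous space splits one word into two tokens,
# which costs one substitution plus one insertion against five reference words.
reference = "Mrs Herschel Williams of Montclair"
ocr_output = "Mrs Her schel Williams of Montclair"
print(word_error_rate(reference, ocr_output))  # 0.4
```

The example shows why a single spacing error inside a name counts as two word errors under this measure.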
Conference Paper
This short paper analyses an experiment comparing the efficacy of several Named Entity Recognition (NER) tools at extracting entities directly from the output of an optical character recognition (OCR) workflow. The authors present how they first created a set of test data, consisting of raw and corrected OCR output manually annotated with people, locations, and organizations. They then ran each of the NER tools against both raw and corrected OCR output, comparing the precision, recall, and F1 score against the manually annotated data.
... The enumerators also introduced errors because of spelling errors of surnames and geographic names during the input step, or because they misinterpreted the instructions given to them by the census takers [1]. Numerous new errors were further introduced during the digitisation process, which is common in processing of historical documents [21]. Because "dirty" data is one of the biggest obstacles to accurate linking, extensive and accurate data cleaning is essential before any data linking can be performed [15]. ...
Conference Paper
Historical census data captures information about our ancestors. These data contain the social status at a certain point in time. They contain valuable information for genealogists, historians, and social scientists. Historical census data can be used to reconstruct important aspects of a particular era in order to trace the changes in households and families. Record linkage across different historical census datasets can help to improve the quality of the data, enrich existing census data with additional information, and facilitate improved retrieval of information. In this paper, we introduce a domain driven approach to automatically clean and link historical census data based on recent developments in group linkage techniques. The key contribution of our approach is to first detect households, and to use this information to refine the cleaned data and improve the accuracy of linking records between census datasets. We have developed a two-step linking approach, which first links individual records using approximate string similarity measures, and then performs a group linking based on the previously detected households. The results show that this approach is effective and can greatly reduce the manual efforts required for data cleaning and linking by social scientists.
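The abstract above describes a first linking step based on approximate string similarity. The sketch below illustrates that general idea with Python's standard `difflib.SequenceMatcher`. The similarity measure, the 0.85 threshold, the function names and the sample names are all assumptions for illustration; the abstract does not state which measures the authors used, and the household-based group linking step is not shown.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(old_names, new_names, threshold=0.85):
    """Link each old name to its most similar new name, if similar enough."""
    links = []
    for old in old_names:
        best = max(new_names, key=lambda new: similarity(old, new), default=None)
        if best is not None and similarity(old, best) >= threshold:
            links.append((old, best))
    return links


# Invented example: a spelling variant is linked, an unrelated name is not.
print(link_records(["John Smith"], ["Jon Smith", "Mary Jones"]))
# [('John Smith', 'Jon Smith')]
```

Comparing every pair is quadratic in the number of records, so a real linkage system would normally restrict candidate pairs first (blocking) before scoring them.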
Article
Text databases have grown tremendously in number, size, and volume over the last few decades. Optical Character Recognition (OCR) software scans the text and makes them available in online repositories. The OCR transcription process is often not accurate resulting in large volumes of garbled text in the repositories. Spell correction and other post-processing of OCR text often prove to be very expensive and time-consuming. While it is possible to rely on the OCR model to assess the quality of text in a corpus, many natural language processing and information retrieval tasks prefer the extrinsic evaluation of the effect of noise on the task at hand. This paper examines the effect of noise on the unsupervised ranking of person name entities by first populating a list of person names using an out-of-the-box Named Entity Recognition (NER) software, extracting content-based features for the identified entities, and ranking them using a novel unsupervised Kernel Density Estimation (KDE) based ranking algorithm. This generative model has the ability to learn rankings using the data distribution and therefore requires limited manual intervention. Empirical results are presented on a carefully curated parallel corpus of OCR and clean text and "in the wild" using a large real-world corpus. Experiments on the parallel corpus reveal that even with a reasonable degree of noise in the dataset, it is possible to generate ranked lists using the KDE algorithm with a high degree of precision and recall. Furthermore, since the KDE algorithm has comparable performance to state-of-the-art unsupervised rankers, it is feasible to use on real-world corpora. The paper concludes by reflecting on other methods for enhancing the performance of the unsupervised algorithm on OCR text such as cleaning entity names, disambiguating names concatenated to one another, and correcting OCR errors that are statistically significant in the corpus.
Chapter
Access to long-run historical data in the field of social sciences, economics and political sciences has been identified as one necessary condition to understand the dynamics of the past and the way those dynamics structure our present and future. Financial yearbooks are historical records reporting on information about the companies of stock exchanges. This paper concentrates on the description of the key components that implement a financial information extraction system from financial yearbooks. The proposed system consists in three steps: OCR, linked named entities extraction, active learning. The core of the system is related to linked named entities extraction (LNE). LNE are coherent n-tuple of named entities describing high level semantic information. In this respect we developed, tested and compared a CRF and a hybrid RNN/CRF based system. Active learning allows to cope with the lack of annotated data for training the system. Promising performance results are reported on two yearbooks (the French Desfossé yearbook (1962) and the German Handbuch (1914–15)) and for two LNE extraction tasks: capital information of companies and constitution information of companies.
Conference Paper
Named Entity Recognition (NER), search, classification and tagging of names and name like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system's performance is genre and domain dependent and also used entity categories vary [16]. The most general set of named entities is usually some version of three partite categorization of locations, persons and organizations. In this paper we report evaluation result of NER with data out of a digitized Finnish historical newspaper collection Digi. Experiments, results and discussion of this research serve development of the Web collection of historical Finnish newspapers. Digi collection contains 1,960,921 pages of newspaper material from years 1771-1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 70-75% [7]. Our baseline NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. Three other available tools are also evaluated: a Finnish Semantic Tagger (FST), Connexor's NER tool and Polyglot's NER.
Conference Paper
Named entity recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre and domain dependent, and the entity categories used also vary [1]. The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report first trials and an evaluation of NER with data from Digi, a digitized Finnish historical newspaper collection. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771-1910 in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 74-75% [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. FiNER achieves an F-score of up to 60.0 on named entities in the evaluation data. SeCo's tools achieve F-scores of 30.0-60.0 on locations and persons. Performance on this data shows that at best about half of the named entities can be recognized, even in quite erroneous OCRed text.
Article
Purpose – This paper aims to present an evaluation of open source OCR for supporting research on material in small‐ to medium‐scale historical archives. Design/methodology/approach – The approach was to develop a workflow engine to support the easy customisation of the OCR process towards the historical materials using open source technologies. Commercial OCR often fails to deliver sufficient results here, as its processing is optimised towards large‐scale commercially relevant collections. The approach presented here allows users to combine the most effective parts of different OCR tools. Findings – The authors demonstrate their application and its flexibility and present two case studies, which demonstrate how OCR can be embedded into wider digitally enabled historical research. The first case study produces high‐quality research‐oriented digitisation outputs, utilising services that the authors developed to allow for the direct linkage of the digitisation image and the OCR output. The second case study demonstrates what becomes possible if OCR can be customised directly within a larger research infrastructure for history. In such a scenario, further semantics can be added easily to the workflow, significantly enhancing the research browsing experience. Originality/value – There has been little work on the use of open source OCR technologies for historical research. This paper demonstrates that the authors' workflow approach allows users to combine commercial engines' ability to read a wider range of character sets with the flexibility of open source tools in terms of customisable pre‐processing and layout analysis. All this can be done without the need to develop dedicated code.