Tagging Named Entities in 19th Century Finnish Newspaper Material with a Variety of Tools
Kimmo Kettunen and Teemu Ruokolainen
The National Library of Finland, Centre for Preservation and Digitization, Mikkeli, Finland
kimmo.kettunen@helsinki.fi
teemu.ruokolainen@helsinki.fi
Abstract
Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and to different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals, etc. In general, the performance of a NER system depends on genre and domain, and the entity categories used also vary (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report an evaluation of NER with data from Digi (digi.kansalliskirjasto.fi), a digitized collection of historical Finnish newspapers. The experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers.
The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910 in both Finnish and Swedish. The total number of Finnish pages is 1,063,648, and the total number of Swedish pages is 892,101. We use only Finnish documents in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70–75 % (Kettunen and Pääkkönen, 2016). Our baseline NER tagger is FiNER, a rule-based tagger of Finnish provided by the FIN-CLARIN consortium. Three other tools are also evaluated: ARPA, a semantic web linking tool; the Finnish Semantic Tagger; and Connexor’s NER software. We also report development work on a statistical tagger of Finnish and on a new evaluation and training corpus for NER of historical Finnish.
Keywords: named entity recognition, historical newspaper collections, Finnish
Introduction
Digital newspapers and journals, either OCRed or born digital, form a growing global network of data that is available 24/7, and as such they are an important source of information. As the amount of digitized journalistic information grows, tools for harvesting the information are also needed. Named Entity Recognition (NER) has been one of the basic techniques for information extraction from texts since the mid-1990s (Nadeau and Sekine, 2007). In its initial form NER was used to find and mark semantic entities such as persons, locations and organizations in texts to enable information extraction related to this kind of material. Later, other types of extractable entities, such as time, artefact, event and measure/numerical, were added to the repertoires of NER software (Nadeau and Sekine, 2007). In this paper we report evaluation results of NER for historical 19th century Finnish. Our historical data consists of an evaluation collection drawn from an OCRed Finnish historical newspaper collection covering the years 1771–1910 (Kettunen and Pääkkönen, 2016).
Kettunen et al. (2016) reported the first NER evaluation results for the historical Finnish data with two tools, FiNER and ARPA. FiNER is provided by the FIN-CLARIN consortium; ARPA is a semantic web tool produced by the Semantic Computing group at Aalto University. Both tools achieved F-scores of about 60 at best, but in many categories the results were much weaker. Word-level accuracy of the evaluation collection was about 73 %, and thus the data can be considered very noisy. NER results for modern Finnish have not been reported extensively so far. Silfverberg (2015) mentions a few results in his description of transferring an older version of FiNER to a new version. With modern Finnish data, F-scores of around 90 are achieved.
In this paper we add two more analysis tools to our earlier NER repertoire. The Finnish Semantic Tagger (FST) is not a NER tool as such; it has first and foremost been developed for semantic analysis of full text. FST assigns a semantic category to each word in a text using a comprehensive semantic category scheme (USAS Semantic Tagset, available in English [1] and also in Finnish [2]; Löfberg et al., 2005). The scheme contains three name-related categories: persons, locations and organizations. Our other new tool is Connexor’s NER software [3], which is a commercial tool for modern Finnish.
[1] http://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf
Results for the historical data
Our historical Finnish evaluation data consists of 75,931 lines of manually annotated newspaper text, one word per line. Most of the data is from the last decades of the 19th century. Earlier NER evaluations with this data achieved F-scores of 50–60 at best in some name categories (Kettunen et al., 2016). Our baseline tagger, FiNER, is described in more detail in Kettunen et al. (2016). Briefly, it is a rule-based NER tagger that uses morphological recognition, morphological disambiguation, gazetteers (name lists), and pattern and context rules for name tagging.
We evaluated the performance of the NER tools using the conlleval [4] script used in the Conference on Computational Natural Language Learning (CoNLL). Conlleval uses the standard measures of precision, recall and F-score, the last one defined as 2PR/(R+P), where P is precision and R is recall (Manning and Schütze, 1999, p. 269). As FST and Connexor’s tagger do not mark the boundaries of multipart names, only a comparable loose evaluation without entity boundary detection is reported here (Poibeau and Kosseim, 2001).
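The measures above can be sketched in a few lines of Python. This is a minimal illustration of a loose, token-level evaluation and not the conlleval script itself. It assumes one tag per token, counts a token as correct when its category matches the gold category regardless of entity boundaries, and uses "O" as a hypothetical label for tokens that are not names. The function name loose_prf and the sample tag sequences are invented for the example.

```python
from typing import List, Tuple


def loose_prf(gold: List[str], pred: List[str], category: str) -> Tuple[float, float, float]:
    """Token-level precision, recall and F-score for one name category.

    Boundaries of multipart names are ignored: each token is judged on its own.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Tokens where both the gold standard and the tagger assign the category.
    true_pos = sum(1 for g, p in zip(gold, pred) if g == category and p == category)
    n_pred = sum(1 for p in pred if p == category)   # tags found by the tagger
    n_gold = sum(1 for g in gold if g == category)   # tags in the gold standard

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    # F-score as defined in the text: 2PR / (R + P).
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (recall + precision)
    return precision, recall, f_score


# Invented toy data: "O" marks a token without a name tag (an assumption).
gold = ["EnamexPrsHum", "EnamexPrsHum", "O", "EnamexLocXxx", "O", "EnamexPrsHum"]
pred = ["EnamexPrsHum", "O", "O", "EnamexLocXxx", "EnamexPrsHum", "EnamexPrsHum"]

p, r, f = loose_prf(gold, pred, "EnamexPrsHum")
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In the toy data the tagger finds three person tags, two of which are correct, and the gold standard contains three person tags, so precision, recall and F-score all equal 2/3.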
Table 1 shows the F-score results of the four tools for locations and persons in our evaluation data. The EnamexPrsHum category contains both first names and last names; EnamexLocXxx is a general location category that combines three more refined location categories into one.
Tool       <EnamexPrsHum> F-score   Number of found tags   <EnamexLocXxx> F-score
ARPA       52.9                     3636                   52.4
Connexor   56.4                     5321                   60.9*
FiNER      58.1*                    2681                   57.5
FST        51.1                     1496                   56.7
Table 1. Evaluation of four tools with loose criteria and two name categories in the historical newspaper collection. Best results are marked with an asterisk.
All taggers recognize locations and persons at a fairly even level; the differences are small. Our baseline tagger FiNER achieves the best F-score with persons, and Connexor with locations. The overall performance of the taggers is low, which is to be expected as the data is very noisy.
It is evident that the main reason for the low NER performance of the tools is the quality of the OCRed texts. If we analyze the tagged words with a morphological analyzer (Omorfi v. 0.3 [5]), we can see that wrongly tagged words are of lower quality than those that are tagged correctly: the analyzer fails to recognize a clearly larger share of them. The figures are shown in Table 2. Thus an improvement in OCR quality will most probably bring a clear improvement in NER of the material.
Word unrecognition rate (%)   Locations   Persons
ARPA, right tag               1.9         4.5
Connexor, right tag           10.2        25.0
FiNER, right tag              6.3         12.8
FST, right tag                5.6         0.06
ARPA, wrong tag               22.7        29.3
Connexor, wrong tag           53.5        57.4
FiNER, wrong tag              38.3        34.0
FST, wrong tag                44.0        33.3
Table 2. Unrecognition rates for rightly and wrongly tagged words, per cent.
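The unrecognition rate used in Table 2 can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the procedure actually run with Omorfi: the analyzer is replaced by a hypothetical callable is_recognized backed by a tiny invented lexicon, and the word lists are made-up examples of clean and OCR-damaged forms.

```python
from typing import Callable, Iterable


def unrecognition_rate(words: Iterable[str], is_recognized: Callable[[str], bool]) -> float:
    """Share of words (per cent) that the analyzer does not recognize."""
    words = list(words)
    if not words:
        return 0.0
    unknown = sum(1 for w in words if not is_recognized(w))
    return 100.0 * unknown / len(words)


# Hypothetical stand-in for a morphological analyzer: a toy lexicon lookup.
TOY_LEXICON = {"helsinki", "turku", "oulu", "viipuri"}


def toy_recognizer(word: str) -> bool:
    return word.lower() in TOY_LEXICON


# Invented examples: words a tagger tagged rightly and wrongly.
rightly_tagged = ["Helsinki", "Turku", "Oulu", "Viipuri"]
wrongly_tagged = ["Hclsinki", "Tnrku", "Oulu", "Wiipnri"]

rate_right = unrecognition_rate(rightly_tagged, toy_recognizer)
rate_wrong = unrecognition_rate(wrongly_tagged, toy_recognizer)
print(f"right tag: {rate_right:.1f} %, wrong tag: {rate_wrong:.1f} %")
```

With these toy lists every rightly tagged word is recognized, while three of the four wrongly tagged words are not, which mirrors the direction of the difference reported in Table 2.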
[2] https://github.com/UCREL/Multilingual-USAS/raw/master/Finnish/USASSemanticTagset-Finnish.pdf
[3] https://www.connexor.com/nlplib/?q=technology/name-recognition
[4] http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt, author Erik Tjong Kim Sang, version 2004-01-26
[5] https://github.com/flammie/omorfi
Development of a new statistical tagger
Our baseline tagger FiNER, employed in the above experiments, is a rule-based system utilizing morphological analysis, gazetteers, and pattern and context rules. However, while there is some recent work on rule-based systems for NER (Kokkinakis et al., 2014), the most prominent research on NER has long focused on statistical machine learning methodology (Nadeau and Sekine, 2007; Neudecker, 2016). Therefore, we are currently developing a statistical NER tagger for historical Finnish text. For training and evaluation of the statistical system, we are manually annotating newspaper and magazine text from the years 1862–1910 with the classes person, organization, and location. The text contains approximately 650,000 word tokens. After the annotation is complete, we can use freely available toolkits, such as the Stanford Named Entity Recognizer (Finkel et al., 2005), to train the NER tagger. We expect that the rich feature sets enabled by statistical learning will alleviate the effect of poor OCR quality on the recognition accuracy of NEs. For recent work on statistical learning of NER taggers for historical data, see Neudecker (2016).
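To make the idea of a rich feature set more concrete, the following Python sketch extracts a few token-level features of the kind commonly used in statistical sequence taggers. It is only an illustration under our own assumptions: the feature names, the functions word_shape and token_features, and the example words are hypothetical, and the sketch does not reproduce the feature set of the Stanford Named Entity Recognizer or of the tagger under development.

```python
from typing import Dict, List


def word_shape(token: str) -> str:
    """Collapse a token into a coarse shape, e.g. 'Helsinki' -> 'Xx', '1862' -> 'd'."""
    shape: List[str] = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        # Keep only one symbol per run of identical symbols.
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Features for the token at position i, including its immediate context."""
    tok = tokens[i]
    feats: Dict[str, object] = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "shape": word_shape(tok),
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }
    # Character trigrams with boundary markers; these survive isolated OCR errors.
    padded = f"^{tok.lower()}$"
    for j in range(len(padded) - 2):
        feats["tri=" + padded[j:j + 3]] = True
    return feats


clean = token_features(["Helsinki"], 0)
noisy = token_features(["Hclsinki"], 0)   # invented OCR error: 'e' read as 'c'
shared = {k for k in clean if k.startswith("tri=")} & set(noisy)
print(sorted(shared))
```

In this invented example the garbled form still shares five of the eight character trigrams, its capitalization and its word shape with the clean form, which is the kind of partial evidence a statistical tagger can exploit when the exact word form is damaged.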
Discussion
In this paper we have shown results of NE tagging of historical OCRed Finnish with four tools: FiNER, ARPA, the Finnish Semantic Tagger (FST), and Connexor’s NE software. FiNER and Connexor’s tagger are dedicated NER tools for modern Finnish, whereas FST is a general semantic tagger and ARPA a semantic web linking tool. Our results show that they all tag names of locations and persons at almost the same level in the noisy OCRed historical newspaper collection. FiNER is best with names of persons, Connexor with locations. Differences between tagger performances are at most 7–8 percentage points.
In general our results show that NE tagging in a noisy historical newspaper collection can be done to a reasonable extent with tools that have been developed for modern Finnish. However, it seems obvious that better results could be achieved with a new tool trained on the noisy historical data. Development work on such a tool is ongoing. We are also trying to improve the quality of our OCRed text data with new OCRing and post-correction. Together these should yield better NER results in the future.
Finally, a note about the usage of Named Entity Recognition is in order. Named Entity Recognition is a tool that needs to serve some useful purpose. In our case, extraction of person and place names is primarily a tool for improving access to the Digi collection. After getting the recognition rate of some NER tool to an acceptable level, we need to decide how we are going to use the extracted names in Digi. Some exemplary suggestions are provided by the archives of La Stampa [6] and Trove Names (Mac Kim and Cassidy, 2015). La Stampa style usage of names provides informational filters after a basic search has been conducted in the newspaper collection. The user can then look further for persons, locations and organizations mentioned in the article results. This kind of approach enhances browsing access to the collection (Bates, 2007; McNamee, Mayfield and Piatko, 2011; Toms, 2000). Trove Names’ name search takes the opposite approach: the user first searches for names and then gets the articles where the names occur. We believe that La Stampa style usage of names in the GUI of a newspaper collection is more informative and useful for users, as the Trove style can be achieved with the normal search function in the GUI of the newspaper collection.
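The two access styles can be contrasted with a small Python sketch. It is a toy illustration under our own assumptions and does not describe how La Stampa, Trove or Digi are implemented: the Article class, the function names and the three sample articles are all invented. The first style runs a keyword search and then summarizes the names found in the hits as filters; the second builds an index from names to articles so that a search can start from a name.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Article:
    article_id: str
    text: str
    persons: Set[str] = field(default_factory=set)      # names found by a NER tool
    locations: Set[str] = field(default_factory=set)


def keyword_search(articles: List[Article], query: str) -> List[Article]:
    """Basic full-text search: case-insensitive substring match."""
    q = query.lower()
    return [a for a in articles if q in a.text.lower()]


def name_facets(results: List[Article]) -> Dict[str, Counter]:
    """Filter-style access: count the names occurring in the search results."""
    return {
        "persons": Counter(n for a in results for n in a.persons),
        "locations": Counter(n for a in results for n in a.locations),
    }


def build_name_index(articles: List[Article]) -> Dict[str, Set[str]]:
    """Name-first access: map each name to the articles it occurs in."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for a in articles:
        for name in a.persons | a.locations:
            index[name].add(a.article_id)
    return index


# Invented sample collection.
ARTICLES = [
    Article("a1", "Fire in Helsinki harbour", locations={"Helsinki"}),
    Article("a2", "Snellman speaks in Helsinki", persons={"Snellman"}, locations={"Helsinki"}),
    Article("a3", "Market day in Turku", locations={"Turku"}),
]

hits = keyword_search(ARTICLES, "helsinki")
facets = name_facets(hits)              # filters offered after the basic search
name_index = build_name_index(ARTICLES)  # lookup that starts from a name

print([a.article_id for a in hits], dict(facets["persons"]), sorted(name_index["Helsinki"]))
```

In this sketch the name index returns the same articles for "Helsinki" as the plain keyword search does, which illustrates the argument above that name-first access can largely be covered by the normal search function, whereas the facets add information that the search alone does not provide.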
Our main emphasis with NER will be on using the names in the newspaper collection as a means of improving the structuring, browsing and general informational usability of the collection. A good enough coverage of the names with NER needs to be achieved for this use as well, of course. A reasonable balance of precision and recall should be found for this purpose, but other capabilities of the software also need to be considered. These remain to be seen later, if we are able to connect functional NER to the user interface of our historical newspaper collection.
[6] http://www.archiviolastampa.it/
Acknowledgements
This work is funded by the EU Commission through its European Regional Development Fund, and the program Leverage from the EU 2014–2020.
References
Bates, M. (2007). What is Browsing really? A Model Drawing from Behavioural Science Research. Information Research 12. http://www.informationr.net/ir/12-4/paper330.html.
Finkel, J.R., Grenager, T. and Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL 2005), 363–370, available at http://dl.acm.org/citation.cfm?id=1219885.
Kettunen, K., Mäkelä, E., Kuokkala, J., Ruokolainen, T. and Niemi, J. (2016). Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910. LWDA 2016, available at: http://ceur-ws.org/Vol-1670/paper-35.pdf
Kettunen, K. and Pääkkönen, T. (2016). Measuring Lexical Quality of a Historical Finnish Newspaper Collection - Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. In LREC 2016, Tenth International Conference on Language Resources and Evaluation, available at http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf.
Kokkinakis, D., Niemi, J., Hardwick, S., Lindén, K., and Borin, L. (2014). HFST-SweNER - a New NER Resource for Swedish. In Proceedings of LREC 2014, available at: http://www.lrec-conf.org/proceedings/lrec2014/pdf/391_Paper.pdf.
Löfberg, L., Piao, S., Rayson, P., Juntunen, J-P, Nykänen, A. and Varantola, K. (2005). A semantic tagger for the Finnish language, available at http://eprints.lancs.ac.uk/12685/1/cl2005_fst.pdf.
McNamee, P., Mayfield, J.C., and Piatko, C.D. (2011). Processing Named Entities in Text. Johns Hopkins APL Technical Digest, 30, 31–40.
Mac Kim, S., Cassidy, S. (2015). Finding Names in Trove: Named Entity Recognition for Australian. In Proceedings of Australasian Language Technology Association Workshop, available at https://aclweb.org/anthology/U/U15/U15-1007.pdf.
Manning, C. D., Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
Nadeau, D., and Sekine, S. (2007). A Survey of Named Entity Recognition and Classification. Linguisticae Investigationes, 30(1): 3–26.
Neudecker, C. (2016). An Open Corpus for Named Entity Recognition in Historic Newspapers. In LREC 2016, Tenth International Conference on Language Resources and Evaluation, available at http://www.lrec-conf.org/proceedings/lrec2016/pdf/110_Paper.pdf.
Poibeau, T. and Kosseim, L. (2001). Proper Name Extraction from Non-Journalistic Texts. Language and Computers, 37(1): 144–157.
Silfverberg, M. (2015). Reverse Engineering a Rule-Based Finnish Named Entity Recognizer. Paper presented at Named Entity Recognition in Digital Humanities Workshop, June 15, Helsinki, available at: https://kitwiki.csc.fi/twiki/pub/FinCLARIN/KielipankkiEventNERWorkshop2015/Silfverberg_presentation.pdf
Toms, E.G. (2000). Understanding and Facilitating the Browsing of Electronic Text. International Journal of Human-Computer Studies, 52, 423–452.