All content in this area was uploaded by Kimmo Kettunen on Jan 20, 2017
Tagging Named Entities in 19th Century Finnish Newspaper Material with a
Variety of Tools
Kimmo Kettunen and Teemu Ruokolainen
The National Library of Finland, Centre for Preservation and Digitization, Mikkeli, Finland
kimmo.kettunen@helsinki.fi
teemu.ruokolainen@helsinki.fi
Abstract Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent
informational elements in texts, has become a standard information extraction procedure for textual data. NER has been
applied to many types of texts and to different types of entities: newspapers, fiction, historical records, persons,
locations, chemical compounds, protein families, animals etc. In general, a NER system’s performance is genre and domain
dependent, and the entity categories used also vary (Nadeau and Sekine, 2007). The most general set of named entities is
usually some version of a tripartite categorization of locations, persons and organizations. In this paper we report an
evaluation of NER with data from a digitized Finnish historical newspaper collection, Digi (digi.kansalliskirjasto.fi).
The experiments, results and discussion of this research serve the development of the Web collection of historical
Finnish newspapers.
The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910 in both Finnish and Swedish.
The total number of Finnish pages is 1,063,648, and the total number of Swedish pages is 892,101. We use only the Finnish
documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is
about 70–75 % (Kettunen and Pääkkönen, 2016). Our baseline NER tagger is FiNER, a rule-based tagger of Finnish provided
by the FIN-CLARIN consortium. Three other tools are also evaluated: ARPA, a semantic web linking tool; the Finnish
Semantic Tagger; and Connexor’s NER software. We also report development work on a statistical tagger of Finnish and on a
new evaluation and learning corpus for NER of historical Finnish.
Keywords: named entity recognition, historical newspaper collections, Finnish
Introduction
Digital newspapers and journals, either OCRed or born digital, form a growing global network of data that is available
24/7, and as such they are an important source of information. As the amount of digitized journalistic information grows,
tools for harvesting the information are also needed. Named Entity Recognition (NER) has been one of the basic techniques
for information extraction from texts since the mid-1990s (Nadeau and Sekine, 2007). In its initial form NER was used to
find and mark semantic entities such as persons, locations and organizations in texts, to enable information extraction
related to this kind of material. Later, other types of extractable entities, such as time, artefact, event and
measure/numerical, were added to the repertoires of NER software (Nadeau and Sekine, 2007). In this paper we report
evaluation results of NER for historical 19th century Finnish. Our historical data consists of an evaluation collection
drawn from an OCRed Finnish historical newspaper collection covering 1771–1910 (Kettunen and Pääkkönen, 2016).
Kettunen et al. (2016) reported the first NER evaluation results for the historical Finnish data with two tools, FiNER
and ARPA. FiNER is provided by the FIN-CLARIN consortium, and ARPA is a semantic web tool produced by the Semantic
Computing group at Aalto University. Both tools achieved F-scores of about 60 at best, but in many categories the results
were much weaker. The word-level accuracy of the evaluation collection was about 73 %, so the data can be considered very
noisy. NER results for modern Finnish have not been reported extensively so far. Silfverberg (2015) mentions a few
results in his description of transferring an older version of FiNER to a new version. With modern Finnish data,
F-scores of around 90 are achieved.
In this paper we add two more analysis tools to our earlier NER repertoire. The Finnish Semantic Tagger (FST) is not a
NER tool as such; it has first and foremost been developed for semantic analysis of full text. FST assigns a semantic
category to each word in a text using a comprehensive semantic category scheme (the USAS Semantic Tagset, available in
English¹ and also in Finnish²; Löfberg et al., 2005). The scheme contains three name-related categories: persons,
locations and organizations. Our other new tool is Connexor’s NER software³, which is a commercial tool for modern
Finnish.
1 http://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf
Results for the historical data
Our historical Finnish evaluation data consists of 75,931 lines of manually annotated newspaper text, one word per line.
Most of the data is from the last decades of the 19th century. Earlier NER evaluations with this data achieved F-scores
of 50–60 at best in some name categories (Kettunen et al., 2016). Our baseline tagger, FiNER, is described in more detail
in Kettunen et al. (2016). In brief, it is a rule-based NER tagger that uses morphological recognition, morphological
disambiguation, gazetteers (name lists), and pattern and context rules for name tagging.
We evaluated the performance of our NER tools using the conlleval⁴ script used in the Conference on Computational
Natural Language Learning (CoNLL). Conlleval uses the standard measures of precision, recall and F-score, the last one
defined as 2PR/(R+P), where P is precision and R is recall (Manning and Schütze, 1999, p. 269). As FST and Connexor’s
tagger do not mark the boundaries of multipart names, only a comparable loose evaluation without entity boundary
detection is reported here (Poibeau and Kosseim, 2001).
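The measures above can be illustrated with a minimal sketch. It is not the conlleval script. It assumes that loose evaluation means judging each token's label on its own, with no check of entity boundaries. The function name `loose_prf`, the label `O` for untagged tokens and the toy label sequences are all hypothetical.

```python
from typing import Sequence, Tuple


def loose_prf(gold: Sequence[str], predicted: Sequence[str], category: str) -> Tuple[float, float, float]:
    """Token-level precision, recall and F-score for one category.

    Entity boundaries are ignored: each token is judged on its own label.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have one label per token")
    found = sum(1 for p in predicted if p == category)       # tags the tool produced
    relevant = sum(1 for g in gold if g == category)         # tags in the annotation
    correct = sum(1 for g, p in zip(gold, predicted) if g == p == category)
    precision = correct / found if found else 0.0
    recall = correct / relevant if relevant else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0  # F = 2PR/(R+P)
    return precision, recall, f_score


# Invented toy data, one label per token ("O" = no name tag; an assumption).
gold = ["EnamexPrsHum", "EnamexPrsHum", "O", "EnamexLocXxx", "O"]
pred = ["EnamexPrsHum", "O", "O", "EnamexLocXxx", "EnamexLocXxx"]
p, r, f = loose_prf(gold, pred, "EnamexLocXxx")
```

For the toy location labels, two tags are found and one of them is correct, so precision is 0.5, recall is 1.0 and the F-score is 2/3. The sketch returns fractions; multiply by 100 to compare with the scale used in Table 1.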
Table 1 shows the F-score results of four evaluations of locations and persons in our evaluation data. The EnamexPrsHum
category contains both first names and last names; EnamexLocXxx is a general location category that combines three more
refined location categories into one.
Tool       <EnamexPrsHum>                      <EnamexLocXxx>
           F-score   Number of found tags      F-score   Number of found tags
ARPA       52.9      3636                      52.4      2933
Connexor   56.4      5321                      60.9*     1802
FiNER      58.1*     2681                      57.5      1541
FST        51.1      1496                      56.7      1253
Table 1. Evaluation of four tools with loose criteria and two name categories in the historical newspaper collection. Best
results are marked with an asterisk.
All taggers recognize locations and persons at a fairly even level; the differences are small. Our baseline tagger FiNER
achieves the best F-score with persons, Connexor with locations. The performance of the taggers is quite poor, which is
to be expected as the data is very noisy.
It is evident that the main reason for the low NER performance of the tools is the quality of the OCRed texts. If we
analyze the tagged words with a morphological analyzer (Omorfi v. 0.3⁵), we can see that wrongly tagged words are of
lower quality than those that are tagged correctly: a clearly larger share of them is not recognized by the analyzer.
The figures are shown in Table 2. Thus improvement in OCR quality will most probably bring a clear improvement in NER of
the material.
                                                Locations   Persons
ARPA right tag, word unrecognition rate         1.9         4.5
Connexor right tag, word unrecognition rate     10.2        25.0
FiNER right tag, word unrecognition rate        6.3         12.8
FST right tag, word unrecognition rate          5.6         0.06
ARPA wrong tag, word unrecognition rate         22.7        29.3
Connexor wrong tag, word unrecognition rate     53.5        57.4
FiNER wrong tag, word unrecognition rate        38.3        34.0
FST wrong tag, word unrecognition rate          44.0        33.3
Table 2. Unrecognition rates for rightly and wrongly tagged words, per cent.
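An unrecognition rate of the kind reported in Table 2 can be computed along the following lines. This is only a sketch under stated assumptions. The analyzer is represented by a hypothetical callable `is_recognized`, and a tiny word list stands in for it here. The sketch does not use Omorfi's actual interface, and the example words are invented.

```python
from typing import Callable, Iterable


def unrecognition_rate(words: Iterable[str], is_recognized: Callable[[str], bool]) -> float:
    """Percentage of words that the analyzer fails to recognize."""
    words = list(words)
    if not words:
        return 0.0
    unknown = sum(1 for w in words if not is_recognized(w))
    return 100.0 * unknown / len(words)


# Hypothetical stand-in for a morphological analyzer: a tiny word list.
toy_lexicon = {"helsinki", "turku", "mikkeli"}


def toy_recognizer(word: str) -> bool:
    return word.lower() in toy_lexicon


# Invented examples: words that received a right tag and a wrong tag.
rightly_tagged = ["Helsinki", "Turku", "Mikkeli", "Turku"]
wrongly_tagged = ["Helfinki", "Turku", "Mikkcli", "Tnrku"]

rate_right = unrecognition_rate(rightly_tagged, toy_recognizer)
rate_wrong = unrecognition_rate(wrongly_tagged, toy_recognizer)
```

In this toy case none of the rightly tagged words is unknown to the recognizer, while three of the four wrongly tagged words are, giving rates of 0.0 and 75.0 per cent.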
2 https://github.com/UCREL/Multilingual-USAS/raw/master/Finnish/USASSemanticTagset-Finnish.pdf
3 https://www.connexor.com/nlplib/?q=technology/name-recognition
4 http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt, author Erik Tjong Kim Sang, version 2004-01-26
5 https://github.com/flammie/omorfi
Development of a new statistical tagger
Our baseline tagger FiNER, employed in the above experiments, is a rule-based system utilizing morphological analysis,
gazetteers, and pattern and context rules. However, while there is some recent work on rule-based systems for NER
(Kokkinakis et al., 2014), the most prominent research on NER has long focused on statistical machine learning
methodology (Nadeau and Sekine, 2007; Neudecker, 2016). Therefore, we are currently developing a statistical NER tagger
for historical Finnish text. For training and evaluation of the statistical system, we are manually annotating newspaper
and magazine text from the years 1862–1910 with the classes person, organization, and location. The text contains
approximately 650,000 word tokens. After annotation, we can use freely available toolkits, such as the Stanford Named
Entity Recognizer (Finkel et al., 2005), for training the NER tagger. We expect that the rich feature sets enabled by
statistical learning will alleviate the effect of poor OCR quality on the recognition accuracy of NEs. For recent work on
statistical learning of NER taggers for historical data, see Neudecker (2016).
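To make the idea of a rich feature set concrete, the following sketch extracts some token features of the kind a statistical sequence tagger could use. All features and names are hypothetical illustrations. They are not the feature set of the tagger under development, and the code does not use the Stanford Named Entity Recognizer's own configuration or API.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Illustrative features for token i; none are taken from the paper."""
    word = tokens[i]
    lower = word.lower()
    feats: Dict[str, object] = {
        "lower": lower,
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
        # Neighbouring words give context even when the token itself is garbled.
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }
    # Character trigrams: an OCR error in one letter leaves most trigrams intact.
    padded = f"#{lower}#"
    for j in range(len(padded) - 2):
        feats["tri=" + padded[j:j + 3]] = True
    return feats


# Invented example with an OCR-style error ("Helfinkiin" for "Helsinkiin").
sentence = ["Herra", "Snellman", "matkusti", "Helfinkiin"]
feats = token_features(sentence, 3)
```

Although the garbled token as a whole would be missing from a name list, it still shares its capitalization, its suffix and several character trigrams with the correct form. This is one way such features might soften the effect of OCR noise.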
Discussion
In this paper we have shown results of NE tagging of historical OCRed Finnish with four tools: FiNER, ARPA, the Finnish
Semantic Tagger (FST), and Connexor’s NE software. FiNER and Connexor’s tagger are dedicated NER tools for modern
Finnish, whereas FST is a general semantic tagger and ARPA a semantic web linking tool. Our results show that they all
tag names of locations and persons at almost the same level in the noisy OCRed historical newspaper collection. FiNER is
best with names of persons, Connexor with locations. The differences between tagger performances are at most 7–8
percentage points.
In general, our results show that NE tagging in a noisy historical newspaper collection can be done to a reasonable
extent with tools that have been developed for modern Finnish. However, it seems clear that better results could be
achieved with a new tool trained on the noisy historical data. We have ongoing development work on this. We are also
trying to improve the quality of our OCRed text data with new OCRing and post-correction. Together these should yield
better NER results in the future.
Finally, a note about the usage of Named Entity Recognition is in order. Named Entity Recognition is a tool that needs to
be used for some useful purpose. In our case, extraction of person and place names is primarily a tool for improving
access to the Digi collection. After getting the recognition rate of some NER tool to an acceptable level, we need to
decide how we are going to use the extracted names in Digi. Some exemplary suggestions are provided by the archives of
La Stampa⁶ and Trove Names (Mac Kim and Cassidy, 2015). La Stampa style usage of names provides informational filters
after a basic search has been conducted in the newspaper collection. The user can further look for persons, locations and
organizations mentioned in the article results. This kind of approach enhances browsing access to the collection (Bates,
2007; McNamee, Mayfield and Piatko, 2011; Toms, 2000). Trove Names’ name search takes the opposite approach: the user
first searches for names and then gets the articles where the names occur. We believe that La Stampa style usage of
names in the GUI of a newspaper collection is more informative and useful for users, as the Trove style can be achieved
with the normal search function in the GUI of the newspaper collection.
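The filter-after-search idea can be sketched as follows. This is a hypothetical illustration, not a description of how La Stampa, Trove or Digi is implemented. The data layout, the function names and the article identifiers are all assumptions.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical search hits: article id -> names a NER tool extracted from it.
hits: Dict[str, Dict[str, List[str]]] = {
    "article-1": {"person": ["Snellman"], "location": ["Helsinki", "Turku"]},
    "article-2": {"person": ["Snellman", "Runeberg"], "location": ["Helsinki"]},
    "article-3": {"person": [], "location": ["Mikkeli"]},
}


def facet_counts(results: Dict[str, Dict[str, List[str]]], category: str) -> Counter:
    """Count in how many result articles each name of a category occurs."""
    counts: Counter = Counter()
    for names in results.values():
        counts.update(set(names.get(category, [])))
    return counts


def filter_by_name(results: Dict[str, Dict[str, List[str]]], category: str, name: str):
    """Keep only the result articles that mention the given name."""
    return {aid: names for aid, names in results.items() if name in names.get(category, [])}


location_facets = facet_counts(hits, "location")      # filters offered to the user
narrowed = filter_by_name(hits, "person", "Runeberg")  # user picks one person filter
```

Here the search results are summarized into name filters first, and choosing a filter narrows the result list. The opposite, name-first approach would start from the name and look up the articles.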
Our main emphasis with NER will be to use the names with the newspaper collection as a means to improve structuring,
browsing and general informational usability of the collection. Good enough coverage of the names with NER needs to be
achieved for this use as well, of course. A reasonable balance of P/R should be found for this purpose, but other
capabilities of the software also need to be considered. These remain to be seen later, if we are able to connect
functional NER to our historical newspaper collection’s user interface.
Acknowledgements
This work is funded by the EU Commission through its European Regional Development Fund, and the program Leverage
from the EU 2014–2020.
References
Bates, M. (2007). What is Browsing – really? A Model Drawing from Behavioural Science Research. Information Research
12. http://www.informationr.net/ir/12-4/paper330.html.
6 http://www.archiviolastampa.it/
Finkel, J.R., Grenager, T. and Manning, C. (2005). Incorporating non-local information into information extraction systems
by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL 2005),
363–370, available at http://dl.acm.org/citation.cfm?id=1219885.
Kettunen, K., Mäkelä, E., Kuokkala, J., Ruokolainen, T. and Niemi, J. (2016). Modern Tools for Old Content - in Search of
Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910. LWDA 2016, available at:
http://ceur-ws.org/Vol-1670/paper-35.pdf
Kettunen, K. and Pääkkönen, T. (2016). Measuring Lexical Quality of a Historical Finnish Newspaper Collection –
Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. In LREC 2016, Tenth International
Conference on Language Resources and Evaluation, available at http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf.
Kokkinakis, D., Niemi, J., Hardwick, S., Lindén, K., and Borin, L. (2014). HFST-SweNER – a New NER Resource for
Swedish. In Proceedings of LREC 2014, available at: http://www.lrec-conf.org/proceedings/lrec2014/pdf/391_Paper.pdf.
Löfberg, L., Piao, S., Rayson, P., Juntunen, J-P, Nykänen, A. and Varantola, K. (2005). A semantic tagger for the Finnish
language, available at http://eprints.lancs.ac.uk/12685/1/cl2005_fst.pdf.
McNamee, P., Mayfield, J.C., and Piatko, C.D. (2011). Processing Named Entities in Text. Johns Hopkins APL Technical
Digest, 30, 31–40.
Mac Kim, S., Cassidy, S. (2015). Finding Names in Trove: Named Entity Recognition for Australian. In Proceedings of
Australasian Language Technology Association Workshop, available at https://aclweb.org/anthology/U/U15/U15-1007.pdf.
Manning, C. D., Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge,
Massachusetts.
Nadeau, D., and Sekine, S. (2007). A Survey of Named Entity Recognition and Classification. Linguisticae Investigationes,
30(1): 3–26.
Neudecker, C. (2016). An Open Corpus for Named Entity Recognition in Historic Newspapers. In LREC 2016, Tenth
International Conference on Language Resources and Evaluation, available at
http://www.lrec-conf.org/proceedings/lrec2016/pdf/110_Paper.pdf.
Poibeau, T. and Kosseim, L. (2001). Proper Name Extraction from Non-Journalistic Texts. Language and Computers,
37(1): 144–157.
Silfverberg, M. (2015). Reverse Engineering a Rule-Based Finnish Named Entity Recognizer. Paper presented at Named
Entity Recognition in Digital Humanities Workshop, June 15, Helsinki available at:
https://kitwiki.csc.fi/twiki/pub/FinCLARIN/KielipankkiEventNERWorkshop2015/Silfverberg_presentation.pdf
Toms, E.G. (2000). Understanding and Facilitating the Browsing of Electronic Text. International Journal of Human-
Computer Studies, 52, 423–452.