Krister Lindén

  • PhD
  • Research Director at University of Helsinki

About

151
Publications
32,604
Reads
1,338
Citations
Current institution
University of Helsinki
Current position
  • Research Director
Additional affiliations
January 2010 - present
University of Helsinki
Position
  • Research Director

Publications (151)
Article
Full-text available
Digital Humanities has an increasing need for widely applicable and easy-to-use methods across disciplinary boundaries. We believe that the cross-disciplinary methods we have developed for the study of the ancient Near East are useful for the study of other research questions outside the field of “Digital Assyriology”. The article presents an overv...
Conference Paper
Full-text available
For language scientists, a prima facie advantage of AI-generated data over human-created content is that AI outputs are generally regarded as free from copyright. This contribution addresses this issue in some detail.
Chapter
In this section, we introduce some of the evaluation metrics and the evaluation setups for language identification. Unsurprisingly, these metrics and setups have strong overlap with other text classification tasks. However, the particulars of evaluation are both strongly informed by the specific nature of LI and have impacted on how the field has e...
Chapter
Language identification (LI) is the task of predicting the language(s) in a text or speech input. The main difference between LI of text and speech is that the characters that make up the text are discrete, whereas with speech, the input is usually a continuous signal. This means that different styles of mathematical methods are needed to process t...
Chapter
In addition to features and methods used in LI, this chapter introduces the notation devised by Jauhiainen et al. (2019e) that is used throughout this book to describe LI methods. For easier reference, we include the complete description of the notation in the first section of this chapter. It may be difficult to digest the notation without concret...
Chapter
One fascinating aspect of language identification which makes it difficult is the similarity between languages. Some languages seem to be extremely easy to distinguish from each other, whereas for some others, it is extremely difficult. This phenomenon is closely tied to the definition of “language”, which is much less trivial than what one might t...
Chapter
In the first section of this chapter, we showcase some of the applications that have traditionally incorporated language identification. In effect, this encompasses all “mixed monolingual” NLP tasks, in routing instances to the monolingual model appropriate to the source language.
Chapter
This work has investigated the automatic language identification of digitally-encoded texts. Over the last 50 years, automatic language identification of text has emerged as a separate field of study related to general text categorization.
Chapter
In general, the more recognizable languages there are, the more difficult it is to recognize the language (Brown 2012; Rodrigues 2012; Jauhiainen et al. 2017a). It is intuitively easy to understand that if classes are added, the classification becomes more difficult. However, this depends in part on the evaluation measures used. For example, if the...
Article
Full-text available
CLARIN is a European Research Infrastructure Consortium developing and providing a federated and interoperable platform to support scientists in the field of the Social Sciences and Humanities in carrying out language-related research. This contribution provides an overview of the entire infrastructure with a particular focus on tool interoperabili...
Conference Paper
Full-text available
We present BabyLemmatizer, a hybrid lemmatizer and POS-tagger for Akkadian, the language of the ancient Assyrians and Babylonians, documented from 2350 BCE to 100 CE. In our approach the text is first POS-tagged and lemmatized with TurkuNLP trained with human-verified labels, and then post-corrected with dictionary-based methods to improve the lemm...
Conference Paper
Full-text available
The Data Governance Act was proposed in late 2020 as part of the European Strategy for Data, and adopted on 30 May 2022 (as Regulation 2022/868). It will enter into application on 24 September 2023. The Data Governance Act is a major development in the legal framework affecting CLARIN and the whole language community. With its new rules on the re-u...
Conference Paper
Full-text available
We present the process of publishing resources in Kielipankki, the Language Bank of Finland. Our pipeline includes all the steps that are needed to publish a resource: from finding and receiving the original data until making the data available via different platforms, e.g., the Korp concordance tool or the download service. Our goal is to standard...
Book
Full-text available
The book describes various approaches in rule-based language technology. The authors are leading experts in this research field. The book is the first of its kind and it gives a comprehensive picture of the state-of-the-art in rule-based language technology. The book shows the suitability of the technology to all language types, including languages...
Article
Full-text available
Sentiment analysis and opinion mining are essential tasks with many prominent application areas, e.g., when researching popular opinions on products or brands. Sentiments expressed in social media can be used in brand name monitoring and indicating fake news. In our survey of previous work, we note that there is no large-scale social media data set...
Chapter
Full-text available
The normative layer of CLARIN is, alongside the organizational and technical layers, an essential part of the infrastructure. It consists of the regulatory framework (statutory law, case law, authoritative guidelines, etc.), the contractual framework (licenses, terms of service, etc.), and ethical norms. Navigating the normative layer requires expe...
Chapter
Full-text available
The Donate Speech campaign aimed to collect 10,000 hours of ordinary, casual Finnish speech to be used for studying language as well as for developing technology and services that can be readily used in the languages spoken in Finland. In this project, particular attention has been devoted to allowing for both academic and commercial use of the mat...
Article
Full-text available
The Language Bank of Finland hosts text corpora originating from Finland. Two of the most used ones are the Newspaper and Periodical Corpus of the National Library of Finland and the Suomi24 Corpus. The Language Bank has received considerable additions to both corpora and is currently creating new versions of the corpora. We are debuting language i...
Article
Full-text available
The Donate Speech campaign has so far succeeded in gathering approximately 3600 h of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus. The corpus includes over twenty thousand speakers from all the regions of Finland and from all age brackets. The primary goals of the collection were to create a representative, l...
Conference Paper
Full-text available
Language researchers are usually aware of intellectual property and personal data (PD) requirements. The problem, however, arises when these two legal regimes have conflicting requirements. For instance, when copyright law requires the acknowledgement of the author, but personal data law enshrines the data minimisation principle. It is a practical...
Conference Paper
Full-text available
Twitter data is used in a wide variety of research disciplines in Social Sciences and Humanities. Although most Twitter data is publicly available, its re-use and sharing raise many legal questions related to intellectual property and personal data protection. Moreover, the use of Twitter and its content is subject to the Terms of Service, which al...
Chapter
Large computational lexicons are central NLP resources. Swedish FrameNet++ aims to be a versatile full-scale lexical resource for NLP containing many kinds of linguistic information. Although focused on Swedish, this ongoing effort, which includes building a new Swedish framenet and recycling existing lexicons, has offered valuable insights into ge...
Conference Paper
Full-text available
The article focuses on determining responsible parties and the division of potential liability arising from sharing language data (LD) containing personal data (PD). A key issue here is to identify who has to make sure and guarantee the GDPR compliance. The authors aim to answer 1) whether an individual researcher is a controller and 2) whether sha...
Article
Full-text available
The study focuses on the sharing of language data containing personal data, which constitutes processing of personal data. In international practice, it is not unambiguously clear how responsibility for the processing of personal data is divided between an individual researcher and a research institution. For example, the French and German model differs from the Estonian, Lithuanian and Finnish model. A distinctive approach...
Chapter
This is the Festschrift of Dr. Jack Rueter. The book presents peer-reviewed scientific work from Dr. Rueter’s colleagues related to the latest advances in natural language processing, digital resources and endangered languages in a variety of languages such as historical English, Chukchi, Mansi, Erzya, Komi, Finnish, Apurina, Sign Languages, Sami l...
Preprint
Full-text available
Sentiment analysis and opinion mining are important tasks with obvious application areas in social media, e.g. when indicating hate speech and fake news. In our survey of previous work, we note that there is no large-scale social media data set with sentiment polarity annotations for Finnish. This publication aims to remedy this shortcoming by in...
Article
Full-text available
The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train hig...
Conference Paper
Full-text available
The article analyses the responsibility for ensuring compliance with the General Data Protection Regulation (GDPR) in research settings. As a general rule, organisations are considered the data controller (responsible party for the GDPR compliance). Research constitutes a unique setting influenced by academic freedom. This raises the question of wh...
Preprint
Full-text available
This article introduces the Wanca 2017 corpus of texts crawled from the internet from which the sentences in rare Uralic languages for the use of the Uralic Language Identification (ULI) 2020 shared task were collected. We describe the ULI dataset and how it was constructed using the Wanca 2017 corpus and texts in different languages from the Leipz...
Article
Full-text available
Language technology provides several possibilities to commercialise collected language resources (data) in the form of providing access to databases, dictionaries, translation, text analysis and localisation services, storage of documents and personal language data, software, and other types of digital content. This article focuses on the contractu...
Conference Paper
Full-text available
Akkadian is a fairly well resourced extinct language that does not yet have a comprehensive morphological analyzer available. In this paper we describe a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language, that can achieve a coverage up to 97.3% and recall up to 93.7% on lemmatization and POS-...
Conference Paper
Full-text available
Akkadian was an East-Semitic language spoken in ancient Mesopotamia. The language is attested on hundreds of thousands of cuneiform clay tablets. Several Akkadian text corpora contain only the transliterated text. In this paper, we investigate automated phonological transcription of the transliterated corpora. The phonological transcription provide...
Conference Paper
Full-text available
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade h...
Preprint
Full-text available
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade h...
Preprint
Full-text available
We present a corpus of Finnish news articles with a manually prepared named entity annotation. The corpus consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event, and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source. The cor...
Article
Full-text available
We present a corpus of Finnish news articles with a manually prepared named entity annotation. The corpus consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event, and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source. The cor...
Article
Full-text available
This article describes an unsupervised language model (LM) adaptation approach that can be used to enhance the performance of language identification methods. The approach is applied to a current version of the HeLI language identification method, which is now called HeLI 2.0. We describe the HeLI 2.0 method in detail. The resulting system is evalu...
Conference Paper
This paper presents experiments on Optical character recognition (OCR) of historical newspapers and journals published in Finland. The corpus has two main languages: Finnish and Swedish and is written in both Blackletter and Antiqua fonts. Here we experiment with how much training data is enough to train high accuracy models, and try to train a joi...
Preprint
Full-text available
This article describes an unsupervised language model adaptation approach that can be used to enhance the performance of language identification methods. The approach is applied to a current version of the HeLI language identification method, which is now called HeLI 2.0. We describe the HeLI 2.0 method in detail. The resulting system is evaluated...
Preprint
Full-text available
This article introduces a corpus of cuneiform texts from which the dataset for the use of the Cuneiform Language Identification (CLI) 2019 shared task was derived as well as some preliminary language identification experiments conducted using that corpus. We also describe the CLI dataset and how it was derived from the corpus. In addition, we provi...
Article
Full-text available
The authors address the legal issues relating to the creation and use of language models. The article begins with an explanation of the development of language technologies. The authors analyse the technological process within the framework of copyright, related rights and personal data protection law. The authors also cover commercial use of language...
Article
Full-text available
The article details the formational process of the FinnTransFrame corpus, a part of the FinnFrameNet project. In addition to a large annotated frame semantic corpus of natural language examples, the project created a separate corpus of examples translated from English to Finnish. The research question when creating the FinnTransFrame corpus was to...
Article
Full-text available
We describe the state-of-the-art in performance modeling and prediction for Information Retrieval (IR), Natural Language Processing (NLP) and Recommender Systems (RecSys) along with its shortcomings and strengths. We present a framework for further research, identifying five major problem areas: understanding measures, performance analysis, making...
Article
Full-text available
Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Rese...
Preprint
Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Rese...
Article
Full-text available
"Data protection issues relating to the development and utilisation of language resources" Copyright issues related to language resources have received considerable attention. Personal data protection in the field of language research is an area which is not thoroughly explored yet. This topic is very relevant since the General Data Protection Reg...
Article
The article describes the process of creating a Finnish language FrameNet or FinnFN, based on the original English language FrameNet hosted at the International Computer Science Institute in Berkeley, California. We outline the goals and results relating to the FinnFN project and especially to the creation of the FinnFrame corpus. The main aim of t...
Conference Paper
In this work, the task is to assist human transcribers to produce, for example, interview or parliament speech transcriptions. The system will perform in-document adaptation based on a small amount of manually corrected automatic speech recognition results. The corrected segments of the spoken document are used to adapt the speech recognizer’s acou...
Article
Full-text available
"Regulatory framework determining the development and utilization of digital language resources in Estonia and its compatibility with CLARIN infrastructure" The article focuses on legal issues relating to language resources. These issues are analyzed throughout the process of the creation of language resources and their subsequent distribution. La...
Article
Full-text available
The article focuses on the regulatory and contractual frameworks in CLARIN. A process analysis approach has been adopted to allow an evaluation of the functionality and shortcomings of the entire legal framework applicable to language resources and technologies. The article discusses and provides background information to amendments of key provisio...
Conference Paper
Full-text available
This paper presents two systems for spelling correction formulated as a sequence labeling task. One of the systems is an unstructured classifier and the other one is structured. Both systems are implemented using weighted finite-state methods. The structured system delivers state-of-the-art results on the task of tweet normalization when compared wi...
Article
This paper describes FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish. The morphological tagging model is based on the averaged structured perceptron classifier. Given training data, new taggers are estimated in a computationally efficient manner using a combination of beam search and model cascade. The lemmatizat...
Conference Paper
Full-text available
To recognize semantic frames in languages with a rich morphology, we need computational morphology. In this paper, we look at one particular framework, HFST–Helsinki Finite-State Technology, and how to use it for recognizing semantic frames in context. HFST enables tokenization, morphological analysis, tagging, and frame annotation in one single fr...
Article
Full-text available
This paper describes a Kone Foundation funded project called "The Finno-Ugric Languages and The Internet" together with some of the achieved results. The main activity of the project is to crawl the internet and gather texts written in small Uralic languages. The sentences and words of the found texts will be assembled into a freely available corpu...
Conference Paper
In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters and this gives us enough reason to again...
Conference Paper
Full-text available
In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the outcome of the OCR process is limited especially with regard to the oldest parts of the collection. Approaches such as crowdsourcing have been used in...
Article
Full-text available
Wordnets are large-scale lexical databases of related words and concepts, useful for language-aware software applications. They have recently been built for many languages by using various approaches. The Finnish wordnet, FinnWordNet (FiWN), was created by translating the more than 200,000 word senses in the English Princeton WordNet (PWN) 3.0 in 1...
Conference Paper
Full-text available
We discuss part-of-speech (POS) tagging in presence of large, fine-grained label sets using conditional random fields (CRFs). We propose improving tagging accuracy by utilizing dependencies within sub-components of the fine-grained labels. These sub-label dependencies are incorporated into the CRF model via a (relatively) straightforward feature ex...
Conference Paper
Full-text available
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress a...
Conference Paper
The following claims can be made about finite-state methods for spell-checking: 1) Finite-state language models provide support for morphologically complex languages that word lists, affix stripping and similar approaches do not provide; 2) Weighted finite-state models have expressive power equal to other, state-of-the-art string algorithms used by...
Conference Paper
Full-text available
The paper presents and evaluates various NLP tools that have been created using the open source library HFST - Helsinki Finite-State Technology and outlines the minimal extensions that this has required to a pure finite-state system. In particular, the paper describes an implementation and application of Pmatch presented by Karttunen at SFCM 2011.
Article
Full-text available
In this paper we present simple methods for construction and evaluation of finite-state spell-checking tools using an existing finite-state lexical automaton, freely available finite-state tools and Internet corpora acquired from projects such as Wikipedia. As an example, we use a freely available open-source implementation of Finnish morphology, m...
Article
Full-text available
This document describes Hutmegs, the Helsinki University of Technology Morphological Evaluation Gold Standard package, which contains gold-standard morphological segmentations for 1.4 million Finnish and 120 000 English words. The Gold Standards comprise surface-string, or allomorph, segmentations of word forms, as well as deep-level, or morpheme...
Book
There are not many people who can be said to have influenced and impressed researchers in so many disparate areas and language-geographic fields as Lauri Carlson, as is evidenced in the present Festschrift. His insight and acute linguistic sensitivity and linguistic rationality have spawned findings and research work in many areas, from non-standar...
Conference Paper
Systems for predictive text entry on ambiguous keyboards typically rely on dictionaries with word frequencies which are used to suggest the most likely words matching user input. This approach is insufficient for agglutinative languages, where morphological phenomena increase the rate of out-of-vocabulary words. We propose a method for text entry,...
Article
Full-text available
FinnWordNet is a Finnish wordnet which complies with the structure of the Princeton WordNet (Fellbaum, 1998). It was created by translating all the words in Princeton WordNet. It is open source and contains over 117 000 synsets. We are now testing different methods in order to improve and expand the content of FinnWordNet. Since wordnets are struct...
Conference Paper
Full-text available
The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and...
Article
Full-text available
HFST-Helsinki Finite-State Technology (http://hfst.sf.net/) is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently collects some of the most important finite-state tools for creating morphologies and spellcheckers into one open-source platform and supports extending and improving the descriptions with...
Article
This project report describes a multilingual wordnet initiative launched in the META-NORD project and concerned with the validation and pilot linking between Nordic and Baltic wordnets. The builders of these wordnets have applied very different compilation strategies: The Danish, Icelandic and Swedish wordnets are being developed via monolingual di...
Article
Full-text available
This paper presents a simple method for finding new synonym candidates for a bilingual wordnet by using another bilingual resource. Our goal is to add new synonyms to the existing synsets of the Finnish WordNet, which has direct word sense translation correspondences to the Princeton WordNet. For this task, we use Wikipedia and its links between t...
Chapter
This paper presents simple methods for adding new words to a wordnet. We use the Finnish wordnet, FinnWordNet, as an example. We pay particular attention to high- and medium-frequency words thus far missing from FinnWordNet, and arrive at an estimate for the number of culture-specific words among them. We also find that the majority of the high- an...
Conference Paper
Full-text available
HFST–Helsinki Finite-State Technology (hfst.sf.net) is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently connects some of the most important finite-state tools for creating morphologies and spellers into one open-source platform and supports extending and improving the descriptions with weigh...
Conference Paper
Full-text available
This paper introduces the META-NORD project which develops the Nordic and Baltic part of the European open language resource infrastructure. META-NORD works on assembling, linking across languages, and making widely available the basic language resources used by developers, professionals and researchers to build specific products and applications....
Article
Full-text available
We introduce a framework for POS tagging which can incorporate a variety of different information sources such as statistical models and handwritten rules. The information sources are compiled into a set of weighted finite-state transducers and tagging is accomplished using weighted finite-state algorithms. Our aim is to develop a fast and flexible...
