Mika Hämäläinen

University of Helsinki | HY · Department of Digital Humanities

Doctor of Philosophy


108
Publications
7,527
Reads
445
Citations
Introduction
Before starting my PhD in language technology in 2017, I worked as a research assistant in computer science (2015-2017), developing a computationally creative Poem Machine. Since then I have worked on NLP for endangered Uralic languages with extremely low resources, and on historical English corpora, dealing with the complexity arising from non-standard historical spelling and OCR errors. I am also continuing my computational creativity research. My current interest is in the theoretical grounding of creativity when developing NLG systems that exhibit creative behavior. I maintain a number of open-source Python libraries related to NLP on GitHub: https://github.com/mikahama
Additional affiliations
November 2020 - present
University of Helsinki
Position
  • PostDoc Position
Education
September 2017 - November 2020
University of Helsinki
Field of study
  • Natural Language Processing
October 2013 - October 2016
University of Helsinki
Field of study
  • Spanish Philology
September 2010 - October 2013
University of Helsinki
Field of study
  • Spanish Philology

Publications (108)
Preprint
Full-text available
This paper presents work on modelling the social psychological aspect of socialization in the case of a computationally creative master-apprentice system. In each master-apprentice pair, the master, a genetic algorithm, is seen as a parent for its apprentice, which is an NMT based sequence-to-sequence model. The effect of different parenting styles...
Conference Paper
Full-text available
A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupe...
Preprint
Full-text available
We present a creative poem generator for the morphologically rich Finnish language. Our method falls into the master-apprentice paradigm, where a computationally creative genetic algorithm teaches a BRNN model to generate poetry. We model several parts of poetic aesthetics in the fitness function of the genetic algorithm, such as sonic features, se...
Conference Paper
Full-text available
We compare different LSTMs and transformer models in terms of their effectiveness in normalizing dialectal Finnish into the normative standard Finnish. As dialect is the common way of communication for people online in Finnish, such a normalization is a necessary step to improve the accuracy of the existing Finnish NLP tools that are tailored for n...
Chapter
Full-text available
The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-re...
Conference Paper
Full-text available
We present a novel approach to generating news headlines in Finnish for a given news story. We model this as a summarization task where a model is given a news article, and its task is to produce a concise headline describing the main topic of the article. Because there are no openly available GPT-2 models for Finnish, we will first build such a mo...
Conference Paper
Full-text available
Role-playing games (RPGs) have a considerable amount of text in video game dialogues. Quite often this text is semi-annotated by the game developers. In this paper, we extract a multilingual dataset of persuasive dialogue from several RPGs. We show the viability of this data in building a persuasion detection system using a natural language process...
Preprint
Full-text available
Role-playing games (RPGs) have a considerable amount of text in video game dialogues. Quite often this text is semi-annotated by the game developers. In this paper, we extract a multilingual dataset of persuasive dialogue from several RPGs. We show the viability of this data in building a persuasion detection system using a natural language process...
Conference Paper
Full-text available
We present a novel neural model for modern poetry generation in French. The model consists of two pretrained neural models that are fine-tuned for the poem generation task. The encoder of the model is a RoBERTa based one while the decoder is based on GPT-2. This way the model can benefit from the superior natural language understanding performance...
Conference Paper
Full-text available
Many endangered Uralic languages have multilingual machine readable dictionaries saved in an XML format. However, the dictionaries cover translations very inconsistently between language pairs; for instance, the Livonian dictionary has some translations to Finnish, Latvian and Estonian, and the Komi-Zyrian dictionary has some translations to Finni...
Preprint
Full-text available
The goal of the paper is to predict answers to questions given a passage of Qur'an. The answers are always found in the passage, so the task of the model is to predict where an answer starts and where it ends. As the initial data set is rather small for training, we make use of multilingual BERT so that we can augment the training data by using dat...
Preprint
Full-text available
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and at the same time different research groups have...
Preprint
Full-text available
Measuring the semantic similarity of different texts has many important applications in Digital Humanities research such as information retrieval, document clustering and text summarization. The performance of different methods depends on the length of the text, the domain and the language. This study focuses on experimenting with some of the curre...
Article
Full-text available
We present a new approach to inducing a bilingual dictionary between the endangered Erzya and Moksha languages automatically based on existing dictionaries in other languages. This work is, for the most part, complementary to the Mordvin research done by PhD László Keresztes, who has demonstrated the alignment of the two language forms in morpholog...
Conference Paper
Full-text available
Measuring the semantic similarity of different texts has many important applications in Digital Humanities research such as information retrieval, document clustering and text summarization. The performance of different methods depends on the length of the text, the domain and the language. This study focuses on experimenting with some of the curre...
Conference Paper
Full-text available
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and at the same time different research groups have als...
Conference Paper
Full-text available
We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled from expert-verified cases of depression in several online blogs. We experiment with two different LSTM based models and two different BERT based models. We achieve a 77.53% accuracy with a Thai BERT model in detecting depression. This establishes a...
Conference Paper
Full-text available
We present our current work on developing keyboard layouts for a critically endangered Uralic language called Livonian. Our layouts work on Windows, MacOS and Linux. In addition, we have developed keyboard apps with predictive text for Android and iOS. This work has been conducted in collaboration with the language community.
Preprint
Full-text available
We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled from expert-verified cases of depression in several online blogs. We experiment with two different LSTM based models and two different BERT based models. We achieve a 77.53% accuracy with a Thai BERT model in detecting depression. This establishes a...
Conference Paper
Full-text available
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of...
Preprint
Full-text available
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of...
Preprint
Full-text available
There are a lot of tools and resources available for processing Finnish. In this paper, we survey recent papers focusing on Finnish NLP related to many different subcategories of NLP such as parsing, generation, semantics and speech. NLP research is conducted in many different research groups in Finland, and it is frequently the case that NLP tools...
Conference Paper
Full-text available
There are a lot of tools and resources available for processing Finnish. In this paper, we survey recent papers focusing on Finnish NLP related to many different subcategories of NLP such as parsing, generation, semantics and speech. NLP research is conducted in many different research groups in Finland, and it is frequently the case that NLP to...
Conference Paper
Full-text available
This study describes the ongoing development of the finite-state description for an endangered minority language, Komi-Zyrian. This work is located in the context where large written and spoken language corpora are available, which creates a set of unique challenges that have to be, and can be, addressed. We describe how we have designed the transd...
Preprint
Full-text available
Automated news generation has become a major interest for news agencies in the past. Oftentimes headlines for such automatically generated news articles are unimaginative as they have been generated with ready-made templates. We present a computationally creative approach for headline generation that can generate humorous versions of existing headli...
Conference Paper
Full-text available
Automated news generation has become a major interest for news agencies in the past. Oftentimes headlines for such automatically generated news articles are unimaginative as they have been generated with ready-made templates. We present a computationally creative approach for headline generation that can generate humorous versions of existing headli...
Article
Full-text available
We present our infrastructure for the documentation of Uralic languages, which consists of tools for writing dictionaries in such a way that the entries are structured in the XML (Extensible Markup Language) format. From the XML dictionaries we can generate code for morphological analyzers that are useful for all kinds of...
Conference Paper
Full-text available
We investigate both rule-based and machine learning methods for the task of compound error correction and evaluate their efficiency for North Sámi, a low resource language. The lack of error-free data needed for a neural approach is a challenge to the development of these tools, which is not shared by bigger languages. In order to compensate for th...
Chapter
Full-text available
In this article, we experiment with different methods for producing descriptive features for the 151 original Pokémon. Using different methods, we produce word vector models from a web corpus, and use them to automatically classify English adjectives according to how characteristic they are of a given Pokémon. Based on our experiments, one can...
Preprint
Full-text available
We present different methods for obtaining descriptive properties automatically for the 151 original Pokémon. We train several different word embeddings models on a crawled Pokémon corpus, and use them to rank automatically English adjectives based on how characteristic they are to a given Pokémon. Based on our experiments, it is better to train a...
Preprint
Full-text available
We survey human evaluation in papers presenting work on creative natural language generation that have been published in INLG 2020 and ICCC 2020. The most typical human evaluation method is a scaled survey, typically on a 5 point scale, while many other less common methods exist. The most commonly evaluated parameters are meaning, syntactic correct...
Conference Paper
Full-text available
In this paper, we present our free and open-source online dictionary editing system that has been developed for editing the new edition of the Finnish-Skolt Sami dictionary. We describe how the system can be used in post-editing a dictionary and how NLP methods have been incorporated as a part of the workflow. In practice, this means the use of FST...
Preprint
Full-text available
Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods in such data poses great challenges. In this paper we propose an...
Conference Paper
Full-text available
Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods in such data poses great challenges. In this paper we propose an...
Conference Paper
Full-text available
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features, some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop finite-sta...
Preprint
Full-text available
This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and a rumor label accuracy of 96.0%. However,...
Preprint
Full-text available
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features, some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop fini...
Conference Paper
Full-text available
This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and a rumor label accuracy of 96.0%. However,...
Conference Paper
Full-text available
We construct the first ever multimodal sarcasm dataset for Spanish. The audiovisual dataset consists of sarcasm annotated text that is aligned with video and audio. The dataset represents two varieties of Spanish, a Latin American variety and a Peninsular Spanish variety, which ensures a wider dialectal coverage for this global language. We presen...
Preprint
Full-text available
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting a substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible...
Conference Paper
Full-text available
Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machi...
Conference Paper
Full-text available
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting a substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible...
Preprint
Full-text available
We construct the first ever multimodal sarcasm dataset for Spanish. The audiovisual dataset consists of sarcasm annotated text that is aligned with video and audio. The dataset represents two varieties of Spanish, a Latin American variety and a Peninsular Spanish variety, which ensures a wider dialectal coverage for this global language. We present...
Conference Paper
Full-text available
We outline the Great Misalignment Problem in natural language processing research. Simply put, the problem definition is not in line with the method proposed, and the human evaluation is not in line with the definition or the method. We study this misalignment problem by surveying 10 randomly sampled papers published in ACL 2020 that rep...
Preprint
Full-text available
We outline the Great Misalignment Problem in natural language processing research. Simply put, the problem definition is not in line with the method proposed, and the human evaluation is not in line with the definition or the method. We study this misalignment problem by surveying 10 randomly sampled papers published in ACL 2020 that rep...
Preprint
Full-text available
This paper presents the current lexical, morphological, syntactic and rule-based machine translation work for Erzya and Moksha that can and should be used in the development of a roadmap for Mordvin linguistic research. We seek to illustrate and outline initial problem types to be encountered in the construction of an Apertium-based shallow-transfe...
Book
Full-text available
This is a Festschrift for Dr. Jack Rueter, compiled on the occasion of his 60th birthday. The book consists of peer-reviewed scientific work by Dr. Rueter’s colleagues. Its contents, compiled by well-established scholars and researchers in NLP, linguistics, philology and digital humanities, pertain to latest advances in natural language processing,...
Chapter
Full-text available
Every NLP researcher has to work with different XML or JSON encoded files. This often involves writing code that serves a very specific purpose. Corpona is meant to streamline any workflow that involves XML and JSON based corpora, by offering easy and reusable functionalities. The current functionalities relate to easy parsing and access to XML fi...
Chapter
Full-text available
We study neologism use in two samples of early English correspondence, from 1640-1660 and 1760-1780. Of especial interest are the early adopters of new vocabulary, the social groups they represent, and the types and functions of their neologisms. We describe our computer assisted approach and note the difficulties associated with massive variation...
Preprint
Full-text available
The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-re...
Preprint
Full-text available
We study neologism use in two samples of early English correspondence, from 1640-1660 and 1760-1780. Of especial interest are the early adopters of new vocabulary, the social groups they represent, and the types and functions of their neologisms. We describe our computer-assisted approach and note the difficulties associated with massive variatio...
Conference Paper
Full-text available
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami...
Article
Full-text available
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami...
Preprint
Full-text available
Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To the best of our knowledge, this is the first time a functional ASR system is built for an extinct language. With the Kamas language, we achieve a Label Error Rate of 15%, and conclude through careful...
Preprint
Full-text available
Our study presents a dialect normalization method for different Finland Swedish dialects covering six regions. We tested 5 different models, and the best model improved the word error rate from 76.45 to 28.58. Contrary to results reported in earlier research on Finnish dialects, we found that training the model with one word at a time gave best res...
Preprint
Full-text available
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami...
Conference Paper
Full-text available
This document describes shared development of finite-state description of two closely related but endangered minority languages, Erzya and Moksha. It touches upon morpholexical unity and diversity of the two languages and how this provides a motivation for shared open-source FST development. We describe how we have designed the transducers so that...
Preprint
Full-text available
Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machi...
Conference Paper
Our study presents a dialect normalization method for different Finland Swedish dialects covering six regions. We tested 5 different models, and the best model improved the word error rate from 76.45 to 28.58. Contrary to results reported in earlier research on Finnish dialects, we found that training the model with one word at a time gave best res...
Preprint
Full-text available
This study uses a character level neural machine translation approach trained on a long short-term memory-based bi-directional recurrent neural network architecture for diacritization of Medieval Arabic. The results improve on the online tool used as a baseline. A diacritization model has been published openly through an easy-to-use Python packa...
Thesis
Full-text available
This thesis presents approaches to computationally creative natural language generation focusing on theoretical foundations, practical solutions and evaluation. I defend that a theoretical definition is crucial for computational creativity and that the practical solution must closely follow the theoretical definition. Finally, evaluation must be ba...
Preprint
Full-text available
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character level NMT models both by using a multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dial...
Preprint
Full-text available
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character level NMT models both by using a multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dial...
Preprint
The aim of this work is, first, to analyze sarcasm in the chosen corpus and, second, based on this analysis, to develop a supervised machine learning algorithm capable of distinguishing between sarcastic and non-sarcastic input. To this end, NLTK, a Python library, will be used, which allows the construction of...
Preprint
The aim of this study is to find out whether the generalization made by Spanish textbooks can be verified in the chosen corpus. As the corpus, I will use the first three episodes of the twentieth season of the television series The Simpsons. I will compare the use of forms of address in the two versions dubbed into Span...
Conference Paper
Full-text available
We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological an...
Conference Paper
Full-text available
We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and...
Preprint
Full-text available
We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and...
Preprint
Full-text available
We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological an...
Chapter
This paper will provide a brief description of Skolt Sami and how it might be construed as a pluricentric language. Historical factors are identified that might contribute to a pluricentric identity: geographic location and political history; shortages of language documentation, and the establishment of a normative body for the development of a sta...