Mikko Kurimo

Aalto University · Department of Signal Processing and Acoustics

About

296
Publications
31,314
Reads
3,660
Citations

Publications (296)
Preprint
Full-text available
Rule-based language processing systems have been overshadowed by neural systems in terms of utility, but it remains unclear whether neural NLP systems, in practice, learn the grammar rules that humans use. This work aims to shed light on the issue by evaluating state-of-the-art LLMs in a task of morphological analysis of complex Finnish noun forms....
Preprint
Full-text available
Test data is said to be out-of-distribution (OOD) when it unexpectedly differs from the training data, a common challenge in real-world use cases of machine learning. Although OOD generalisation has gained interest in recent years, few works have focused on OOD generalisation in spoken language understanding (SLU) tasks. To facilitate research on t...
Preprint
Full-text available
A large proportion of text is produced using mobile devices. However, very little research looks at the special characteristics of how this happens and, importantly, how it is affected by the design of the language model (LM). The operating systems of modern devices offer a number of LM-based intelligent text entry methods (ITEs) such as Autocorrec...
Article
Speech training apps are being developed that provide automatic feedback concerning children's production of known target words, as a score on a 1-5 scale. However, this 'goodness' scale is still poorly understood. We investigated listeners' ratings of 'how many stars the app should provide as feedback' on children's utterances, and whether listene...
Article
Full-text available
Speech embeddings, fixed-size representations derived from raw audio data, play a crucial role in diverse machine learning applications. Despite the abundance of speech embedding techniques, selecting the most suitable one remains challenging. Existing studies often focus on intrinsic or extrinsic aspects, seldom exploring both simultaneously. Furt...
Chapter
Children’s speech recognition performs poorly compared to adult speech recognition. Neural network models require a large amount of data to achieve good performance, but only a very limited amount of children’s speech data is publicly available. A baseline system was developed using adult speech for training and children’s speech for testing. This k...
Conference Paper
Children’s speech recognition performs poorly compared to adult speech recognition. Neural network models require a large amount of data to achieve good performance, but only a very limited amount of children’s speech data is publicly available. A baseline system was developed using adult speech for training and children’s speech for testing. This k...
Article
Full-text available
In this paper, we present our effort to develop an automatic speaker verification (ASV) system for low-resource children’s data. For child speakers, only a very limited amount of speech data is available in the majority of languages for training the ASV system. Developing an ASV system under low-resource conditions is a very challenging problem....
Chapter
Full-text available
Automatic Speech Recognition (ASR) for high-resource languages like English is often considered a solved problem. However, most high-resource ASR systems favor socioeconomically advantaged dialects. In the case of English, this leaves behind many L2 speakers and speakers of low-resource accents (a majority of English speakers). One way to mitigate...
Preprint
Full-text available
Traditional topic identification solutions from audio rely on an automatic speech recognition (ASR) system to produce transcripts used as input to a text-based model. These approaches work well in high-resource scenarios, where there are sufficient data to train both components of the pipeline. However, in low-resource situations, the ASR system, e...
Conference Paper
Full-text available
This study investigates the feasibility of automated content scoring for spontaneous spoken responses from Finnish and Finland Swedish learners. Our experiments reveal that pretrained Transformer-based models outperform the tf-idf baseline in automatic task completion grading. Furthermore, we demonstrate that pre-fine-tuning these models to differe...
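As an illustration of the kind of tf-idf baseline this entry compares against, here is a minimal sketch (not the authors' implementation; the reference answer, learner response, and scoring scheme are invented for the example): a response is scored by the cosine similarity of its tf-idf vector to a reference answer.

```python
# Toy tf-idf + cosine-similarity baseline for content scoring.
# All texts below are invented examples, not data from the study.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))               # document frequency per term
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed idf
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

reference = "i usually eat breakfast at home before school".split()
response  = "i eat breakfast at home every morning".split()
vec_ref, vec_resp = tfidf_vectors([reference, response])
score = cosine(vec_ref, vec_resp)  # higher = closer to the reference
```

A Transformer-based grader replaces this surface-overlap score with a learned representation, which is what the study finds to work better.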
Article
Full-text available
Children with dyslexia often face difficulties in learning foreign languages, which is reflected as weaker neural activation. However, digital language-learning applications could support learning-induced plastic changes in the brain. Here we aimed to investigate whether plastic changes occur in children with dyslexia more readily after targeted tr...
Article
Full-text available
Computer-assisted Language Learning (CALL) is a rapidly developing area accelerated by advancements in the field of AI. A well-designed and reliable CALL system allows students to practice language skills, like pronunciation, any time outside of the classroom. Furthermore, gamification via mobile applications has shown encouraging results on learni...
Article
Full-text available
End-to-End speech recognition has become the center of attention for speech recognition research, but hybrid Hidden Markov Model / Deep Neural Network (HMM/DNN) systems remain a competitive approach in terms of performance. End-to-End models may be better at very large data scales, and HMM/DNN systems may have an advantage in low-resource scenario...
Article
Full-text available
In low-resource children’s automatic speech recognition (ASR), performance is degraded due to the limited acoustic and speaker variability available in small datasets. In this paper, we propose a spectral warping based data augmentation method to capture more acoustic and speaker variability. This is carried out by warping the linear prediction (LP) s...
Conference Paper
Full-text available
Prerecorded laughter accompanying dialog in comedy TV shows encourages the audience to laugh by clearly marking humorous moments in the show. We present an approach for automatically detecting humor in the Friends TV show using multimodal data. Our model is capable of recognizing whether an utterance is humorous or not and assessing the intensity of i...
Preprint
Full-text available
Prerecorded laughter accompanying dialog in comedy TV shows encourages the audience to laugh by clearly marking humorous moments in the show. We present an approach for automatically detecting humor in the Friends TV show using multimodal data. Our model is capable of recognizing whether an utterance is humorous or not and assessing the intensity of i...
Preprint
Full-text available
The events of recent years have highlighted the importance of telemedicine solutions which could potentially allow remote treatment and diagnosis. Relatedly, Computational Paralinguistics, a unique subfield of Speech Processing, aims to extract information about the speaker and form an important part of telemedicine applications. In this work, we f...
Chapter
Full-text available
The Donate Speech campaign aimed to collect 10,000 hours of ordinary, casual Finnish speech to be used for studying language as well as for developing technology and services that can be readily used in the languages spoken in Finland. In this project, particular attention has been devoted to allowing for both academic and commercial use of the mat...
Article
Full-text available
This paper traces signs of urban culture in Finnish fiction films from the 1950s by drawing on a multimodal analysis of audiovisual content. The Finnish National Filmography includes 208 feature films released between 1950–1959. Our approach to the automatic analysis of media content includes aural and visual object recognition and speech recogniti...
Preprint
Full-text available
It is common knowledge that the quantity and quality of the training data play a significant role in the creation of a good machine learning model. In this paper, we take it one step further and demonstrate that the way the training examples are arranged is also of crucial importance. Curriculum Learning is built on the observation that organized a...
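The curriculum-learning idea in the entry above — that the order of training examples matters — can be sketched minimally as follows (a hypothetical illustration, not the paper's setup; utterance list and difficulty proxy are invented):

```python
# Toy curriculum-learning sketch: order training utterances from "easy"
# to "hard" using transcript length as a crude difficulty proxy, then
# release them to the trainer in cumulative stages.
def curriculum_batches(examples, difficulty, stages=3):
    """Yield cumulative subsets of examples, easiest first."""
    ordered = sorted(examples, key=difficulty)
    step = max(1, len(ordered) // stages)
    for end in range(step, len(ordered) + step, step):
        yield ordered[:min(end, len(ordered))]

utterances = ["hi", "good morning", "could you repeat the last sentence"]
stages = list(curriculum_batches(utterances, difficulty=len, stages=3))
# each successive stage adds harder (here: longer) examples to the pool
```

In practice the difficulty measure is the interesting design choice (utterance length, signal-to-noise ratio, model loss, etc.); length is used here only to keep the sketch self-contained.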
Article
Full-text available
The Donate Speech campaign has so far succeeded in gathering approximately 3600 h of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus. The corpus includes over twenty thousand speakers from all the regions of Finland and from all age brackets. The primary goals of the collection were to create a representative, l...
Article
Full-text available
Digital games may benefit children’s learning, yet the factors that induce gaming benefits to cognition are not well known. In this study, we investigated the effectiveness of digital game-based learning in children by comparing the learning of foreign speech sounds and words in a digital game or a non-game digital application. To evaluate gaming-i...
Preprint
Full-text available
Public sources like parliament meeting recordings and transcripts provide ever-growing material for the training and evaluation of automatic speech recognition (ASR) systems. In this paper, we publish and analyse the Finnish parliament ASR corpus, the largest publicly available collection of manually transcribed speech data for Finnish with over 30...
Preprint
Full-text available
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus. The corpus includes over twenty thousand speakers from all the regions of Finland and from all age brackets. The primary goals of the collection were to create a representativ...
Article
Full-text available
Differences in acoustic characteristics between children’s and adults’ speech degrade performance of automatic speech recognition systems when systems trained using adults’ speech are used to recognize children’s speech. This performance degradation is due to the acoustic mismatch between training and testing. One of the main sources of the acousti...
Chapter
Standard end-to-end training of attention-based ASR models only uses transcribed speech. If they are compared to HMM/DNN systems, which additionally leverage a large corpus of text-only data and expert-crafted lexica, the differences in modeling cannot be disentangled from differences in data. We propose an experimental setup, where only transcribe...
Chapter
Successful speech recognition for children requires large training data with sufficient speaker variability. Collecting such a training database of children’s voices is challenging and very expensive for a zero/low-resource language like Punjabi. In this paper, the data scarcity issue of the low-resource language Punjabi is addressed through...
Article
Full-text available
Current ASR systems show poor performance in recognizing children’s speech in noisy environments, because recognizers are typically trained with clean adults’ speech and therefore there are two mismatches between the training and testing phases (i.e., clean speech in training vs. noisy speech in testing, and adult speech in training vs. child speech i...
Chapter
Named entities are heavily used in the field of spoken language understanding, which uses speech as an input. The standard way of doing named entity recognition from speech involves a pipeline of two systems, where first the automatic speech recognition system generates the transcripts, and then the named entity recognition system produces the name...
Chapter
Long Short-Term Memory (LSTM) cells, frequently used in state-of-the-art language models, struggle with long sequences of inputs. One major problem in their design is that they try to summarize long-term information into a single vector, which is difficult. The attention mechanism aims to alleviate this problem by accumulating the relevant outputs...
Conference Paper
Successful speech recognition for children requires large training data with sufficient speaker variability. Collecting such a training database of children’s voices is challenging and very expensive for a zero/low-resource language like Punjabi. In this paper, the data scarcity issue of the low-resource language Punjabi is addressed through...
Conference Paper
For children, a system trained on a large corpus of adult speakers performed worse than a system trained on a much smaller corpus of children’s speech. This is due to the acoustic mismatch between training and testing data. To capture more acoustic variability, we trained a shared system with mixed data from adults and children. The shared syste...
Conference Paper
Full-text available
In this paper, we propose spectral modification by sharpening formants and by reducing the spectral tilt to recognize children’s speech by automatic speech recognition (ASR) systems developed using adult speech. In this type of mismatched condition, the ASR performance is degraded due to the acoustic and linguistic mismatch in the attributes betwee...
Article
Full-text available
Digital and mobile devices enable easy access to applications for the learning of foreign languages. However, experimental studies on the effectiveness of these applications are scarce. Moreover, it is not understood whether the effects of speech and language training generalize to features that are not trained. To this end, we conducted a four-wee...
Conference Paper
Acoustic differences between children’s and adults’ speech cause degradation in automatic speech recognition performance when a system is trained on adults’ speech and tested on children’s speech. The key acoustic mismatch factors are formants, speaking rate, and pitch. In this paper, we propose a linear prediction based spectral warping...
Article
We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabularies of several million word types. Class-based lan...
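The class-based factorization that makes such huge vocabularies tractable can be shown in a toy example (all words, classes, and probabilities below are invented; this is not the paper's model): the word bigram P(w2 | w1) is factored as P(C(w2) | C(w1)) · P(w2 | C(w2)), so the model needs class-transition and word-emission tables instead of a full word-bigram table.

```python
# Toy class-based bigram: P(w2|w1) = P(class(w2)|class(w1)) * P(w2|class(w2)).
# For a vocabulary of millions of word types, the two small tables replace
# an intractably large word-by-word bigram table.
word_class = {"talo": "NOUN", "kirja": "NOUN", "iso": "ADJ"}
class_bigram = {("ADJ", "NOUN"): 0.6}            # P(next class | prev class)
word_given_class = {"talo": 0.3, "kirja": 0.2}   # P(word | its class)

def p_next(word, prev_word):
    """Class-factored bigram probability of `word` following `prev_word`."""
    c_prev, c = word_class[prev_word], word_class[word]
    return class_bigram.get((c_prev, c), 0.0) * word_given_class.get(word, 0.0)

prob = p_next("talo", "iso")  # 0.6 * 0.3 = 0.18
```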
Article
Full-text available
We describe a novel way to implement subword language models in speech recognition systems based on weighted finite state transducers, hidden Markov models, and deep neural networks. The acoustic models are built on graphemes in a way that no pronunciation dictionaries are needed, and they can be used together with any type of subword language mode...
Article
Full-text available
There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; subword segmentation and regularization techniques can be...
Preprint
Learning is increasingly assisted by technology. Digital games may be useful for learning, especially in children. However, more research is needed to understand the factors that induce gaming benefits to cognition. In this study, we investigated the effectiveness of digital game-based learning approach in children by comparing the learning of fore...
Preprint
Full-text available
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children's speech. The task is to recognize non-native speech from children of various age groups given a limited amount of speech. Moreover, the speech being spontaneous has false starts transcribed as...
Preprint
Creating open-domain chatbots requires large amounts of conversational data and related benchmark tasks to evaluate them. Standardized evaluation tasks are crucial for creating automatic evaluation metrics for model development; otherwise, comparing the models would require resource-expensive human evaluation. While chatbot challenges have recently...
Preprint
Full-text available
End-to-end neural network models (E2E) have shown significant performance benefits on different INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model for a task or the same E2E architecture for different tasks. However, applying a single model is unstable or using the same architecture under-utilizes task-specif...
Preprint
Character-based Neural Network Language Models (NNLM) have the advantage of smaller vocabulary and thus faster training times in comparison to NNLMs based on multi-character units. However, in low-resource scenarios, both the character and multi-character NNLMs suffer from data sparsity. In such scenarios, cross-lingual transfer has improved multi-...
Preprint
In spoken Keyword Search, the query may contain out-of-vocabulary (OOV) words not observed when training the speech recognition system. Using subword language models (LMs) in the first-pass recognition makes it possible to recognize the OOV words, but even the subword n-gram LMs suffer from data sparsity. Recurrent Neural Network (RNN) LMs alleviat...
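The subword idea behind this entry — an OOV query word can still be recognized if it is covered by in-vocabulary subword units — can be illustrated with a toy greedy segmenter (the unit inventory and Finnish word below are invented examples, and real systems learn units data-drivenly rather than using longest-match):

```python
# Toy subword segmentation: split an out-of-vocabulary word into known
# subword units by greedy left-to-right longest match, backing off to
# single characters, so a subword LM can still score the word.
def segment(word, units):
    """Greedy longest-match segmentation of `word` over the set `units`."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest span first
            if word[i:j] in units:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])             # back off: single character
            i += 1
    return out

units = {"talo", "ssa", "kirja", "sto"}
pieces = segment("kirjastossa", units)      # ["kirja", "sto", "ssa"]
```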
Article
Full-text available
Natural speech builds on contextual relations that can prompt predictions of upcoming utterances. To study the neural underpinnings of such predictive processing we asked 10 healthy adults to listen to a 1-h-long audiobook while their magnetoencephalographic (MEG) brain activity was recorded. We correlated the MEG signals with acoustic speech envel...
Preprint
There are several approaches for improving neural machine translation for low-resource languages: Monolingual data can be exploited via pretraining or data augmentation; Parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; Subword segmentation and regularization techniques can be...
Preprint
Full-text available
Data-driven segmentation of words into subword units has been used in various natural language processing applications such as automatic speech recognition and statistical machine translation for almost 20 years. Recently it has become more widely adopted, as models based on deep neural networks often benefit from subword units even for morphologic...