Tamás Grósz

  • Doctor of Computer Science
  • Research Fellow at Aalto University

About

68 Publications
14,335 Reads
795 Citations
Introduction
My main research interests are new machine learning methods for speech recognition and paralinguistics. I am also interested in interpreting and understanding how Deep Neural Networks work and what kinds of weaknesses (unwanted biases) they develop during the training process.
Current institution
Aalto University
Current position
  • Research Fellow

Publications

Conference Paper
Full-text available
Deep neural network (DNN) based speech recognizers have recently replaced Gaussian mixture (GMM) based systems as the state of the art. HMM/DNN systems have kept many refinements of the HMM/GMM framework, even though some of these may be suboptimal for them. One such example is the creation of context-dependent tied states, for which an efficient d...
Conference Paper
Full-text available
While Hidden Markov Modeling (HMM) has been the dominant technology in speech recognition for many decades, deep neural networks (DNNs) now seem to have taken over. The current DNN technology requires frame-aligned labels, which are usually created by first training an HMM system. Obviously, it would be desirable to have ways of training D...
Conference Paper
Full-text available
The Interspeech ComParE 2014 Challenge consists of two machine learning tasks, which have quite a small number of examples. Due to our good results in ComParE 2013, we considered AdaBoost a suitable machine learning meta-algorithm for these tasks; in addition, we experimented with Deep Rectifier Neural Networks. These differ from traditional neura...
Article
Speech training apps are being developed that provide automatic feedback concerning children's production of known target words, as a score on a 1-5 scale. However, this 'goodness' scale is still poorly understood. We investigated listeners' ratings of 'how many stars the app should provide as feedback' on children's utterances, and whether listene...
Article
Full-text available
Speech embeddings, fixed-size representations derived from raw audio data, play a crucial role in diverse machine learning applications. Despite the abundance of speech embedding techniques, selecting the most suitable one remains challenging. Existing studies often focus on intrinsic or extrinsic aspects, seldom exploring both simultaneously. Furt...
Preprint
Full-text available
Traditional topic identification solutions from audio rely on an automatic speech recognition system (ASR) to produce transcripts used as input to a text-based model. These approaches work well in high-resource scenarios, where there are sufficient data to train both components of the pipeline. However, in low-resource situations, the ASR system, e...
Conference Paper
Full-text available
This study investigates the feasibility of automated content scoring for spontaneous spoken responses from Finnish and Finland Swedish learners. Our experiments reveal that pretrained Transformer-based models outperform the tf-idf baseline in automatic task completion grading. Furthermore, we demonstrate that pre-fine-tuning these models to differe...
Article
Full-text available
Computer-assisted Language Learning (CALL) is a rapidly developing area accelerated by advancements in the field of AI. A well-designed and reliable CALL system allows students to practice language skills, like pronunciation, any time outside of the classroom. Furthermore, gamification via mobile applications has shown encouraging results on learni...
Article
Full-text available
End-to-End speech recognition has become the center of attention for speech recognition research, but Hybrid Hidden Markov Model Deep Neural Network (HMM/DNN) systems remain a competitive approach in terms of performance. End-to-End models may be better at very large data scales, and HMM/DNN systems may have an advantage in low-resource scenario...
Preprint
Full-text available
The events of recent years have highlighted the importance of telemedicine solutions which could potentially allow remote treatment and diagnosis. Relatedly, Computational Paralinguistics, a unique subfield of Speech Processing, aims to extract information about the speaker and form an important part of telemedicine applications. In this work, we f...
Article
Full-text available
This paper traces signs of urban culture in Finnish fiction films from the 1950s by drawing on a multimodal analysis of audiovisual content. The Finnish National Filmography includes 208 feature films released between 1950 and 1959. Our approach to the automatic analysis of media content includes aural and visual object recognition and speech recogniti...
Preprint
Full-text available
It is common knowledge that the quantity and quality of the training data play a significant role in the creation of a good machine learning model. In this paper, we take it one step further and demonstrate that the way the training examples are arranged is also of crucial importance. Curriculum Learning is built on the observation that organized a...
Article
Full-text available
The Donate Speech campaign has so far succeeded in gathering approximately 3600 h of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus. The corpus includes over twenty thousand speakers from all the regions of Finland and from all age brackets. The primary goals of the collection were to create a representative, l...
Preprint
Full-text available
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus. The corpus includes over twenty thousand speakers from all the regions of Finland and from all age brackets. The primary goals of the collection were to create a representativ...
Chapter
Long Short-Term Memory (LSTM) cells, frequently used in state-of-the-art language models, struggle with long sequences of inputs. One major problem in their design is that they try to summarize long-term information into a single vector, which is difficult. The attention mechanism aims to alleviate this problem by accumulating the relevant outputs...
Preprint
Full-text available
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children's speech. The task is to recognize non-native speech from children of various age groups given a limited amount of speech. Moreover, because the speech is spontaneous, it has false starts transcribed as...
Preprint
Full-text available
End-to-end neural network models (E2E) have shown significant performance benefits on different INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model for a task or the same E2E architecture for different tasks. However, applying a single model is unstable or using the same architecture under-utilizes task-specif...
Article
Full-text available
Silent Speech Interfaces (SSI) perform articulatory-to-acoustic mapping to convert articulatory movement into synthesized speech. Their main goal is to aid the speech handicapped, or to be used as a part of a communication system operating in silence-required environments or in those with high background noise. Although many previous studies addresse...
Article
Full-text available
Our increasing reliance on software products and the amount of money we spend on creating and maintaining them makes it crucial to find bugs as early and as easily as possible. At the same time, it is not enough to know that we should be paying more attention to bugs; finding them must become a quick and seamless process in order to be actually use...
Chapter
In discrete tomography sometimes it is necessary to reduce the number of projections used for reconstructing the image. Earlier, it was shown that the choice of projection angles can significantly influence the quality of the reconstructions. In this study, we apply convolutional neural networks to select projections in order to reconstruct the ori...
Chapter
In the interaction between humans and computers as well as in the interaction among humans, topical units (TUs) have an important role. This motivated our investigation of topical unit recognition. To lay foundations for this, we first create a classifier for topical units using Deep Neural Nets with rectifier units (DRNs) and the probabilistic sam...
Preprint
Full-text available
Recently it was shown that within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher quality speech when using a continuous pitch es...
Article
Background and objective: The leading cause of vision loss in the Western World is Age-related Macular Degeneration (AMD), but together with modern medicines, tracking the number of Hyperreflective Foci (HF) on Optical Coherence Tomography (OCT) images should assist the treatment of patients. Here, we developed a framework based on deep learning f...
Preprint
Full-text available
When using ultrasound video as input, Deep Neural Network-based Silent Speech Interfaces usually rely on the whole image to estimate the spectral parameters required for the speech synthesis step. Although this approach is quite straightforward, and it permits the synthesis of understandable speech, it has several disadvantages as well. Besides the...
Article
When our task is to detect social signals such as laughter and filler events in an audio recording, the most straightforward way is to apply a Hidden Markov Model, or a Hidden Markov Model/Deep Neural Network (HMM/DNN) hybrid, which is considered state-of-the-art nowadays. In this hybrid model, the DNN component is trained on frame-level samples...
Conference Paper
Full-text available
Silent Speech Interface systems apply two different strategies to solve the articulatory-to-acoustic conversion task. The recognition-and-synthesis approach applies speech recognition techniques to map the articulatory data to a textual transcript, which is then converted to speech by a conventional text-to-speech system. The direct synthesis appro...
Chapter
Optical Coherence Tomography (OCT) is one of the most advanced non-invasive methods of eye examination. Age-related macular degeneration (AMD) is one of the most frequent causes of acquired blindness. Our aim is to develop automatic methods that can accurately identify and characterize biomarkers in OCT images related to AMD. We present methods f...
Conference Paper
Full-text available
State-of-the-art silent speech interface systems apply vocoders to generate the speech signal directly from articulatory data. Most of these approaches concentrate on estimating just the spectral features of the vocoder, and use the original F0, a constant F0 or white noise as excitation. This solution is based on the assumption that the F0 curve i...
Article
Full-text available
The use of computer-readable visual codes has become common in our everyday life, both in industrial environments and for private use. The reading process of visual codes consists of two steps, namely, localization and data decoding. In this paper we examine the localization step of visual codes using conventional and deep rectifier neural networks. The...
Conference Paper
Full-text available
In this paper we present our initial results in articulatory-to-acoustic conversion based on tongue movement recordings using Deep Neural Networks (DNNs). Despite the fact that deep learning has revolutionized several fields, so far only a few researchers have applied DNNs for this task. Here, we compare various possible feature representation appr...
Article
Recently, attempts have been made to remove Gaussian mixture models (GMM) from the training process of deep neural network-based hidden Markov models (HMM/DNN). For the GMM-free training of an HMM/DNN hybrid we have to solve two problems, namely the initial alignment of the frame-level state labels and the creation of context-dependent states. Altho...
Conference Paper
Understanding topical units is important for improved human-computer interaction (HCI) as well as for a better understanding of human-human interaction. Here, we take the first steps towards topical unit recognition by creating a topical unit classifier based on the HuComTech multimodal database. We create this classifier by means of Deep Rectifier...
Conference Paper
Full-text available
For several years, the Interspeech ComParE Challenge has focused on paralinguistic tasks of various kinds. In this paper we focus on the Native Language and the Deception sub-challenges of ComParE 2016, where the goal is to identify the native language of the speaker, and to recognize deceptive speech. As both tasks can be treated as classificatio...
Article
Full-text available
Wireless sensors are recent, portable, low-powered devices, designed to record and transmit observations of their environment such as speech. To allow portability they are designed to have a small size and weight; this, however, along with their low power consumption, usually means that they have only quite basic recording equipment (e.g. microphon...
Conference Paper
Full-text available
Over the past few years, deep neural networks (DNNs) have pushed the Gaussian mixture models (GMMs) of hidden Markov models into the background in speech recognition. At the same time, recognizers built on neural networks have inherited numerous training algorithms (in unchanged form or with minor modifications) which...
Article
The use of computer-readable visual codes has become common in our everyday life, in both industrial environments and private use. The reading process of visual codes consists of two steps, localization and data decoding. This paper introduces a new method for QR code localization using conventional and deep rectifier neural networks. The structure of the neur...
Conference Paper
Spectro-temporal feature extraction and multi-band processing were both designed to make the speech recognizers more robust. Although they have been used for a long time now, very few attempts have been made to combine them. This is why here we integrate two spectro-temporal feature extraction methods into a multi-band framework. We assess the perf...
Conference Paper
Full-text available
Deep learning is regarded by some as one of the most important technological breakthroughs of this decade. In recent years it has been shown that using rectified neurons, one can match or surpass the performance achieved using hyperbolic tangent or sigmoid neurons, especially in deep networks. With rectified neurons we can readily create sparse rep...
Conference Paper
The introduction of deep neural networks to acoustic modelling has brought significant improvements in speech recognition accuracy. However, this technology has huge computational costs, even when the algorithms are implemented on graphic processors. Hence, finding the right training algorithm that offers the best performance with the lowest traini...
