Martti Vainio
University of Helsinki | HY · Department of Modern Languages

Professor

About

195 Publications
31,927 Reads
3,784 Citations
Citations since 2017
43 Research Items
1367 Citations
Introduction
Martti Vainio currently works at the Department of Modern Languages, University of Helsinki. He does research in phonetics and speech synthesis.
Additional affiliations
January 2017 - April 2017
University of Helsinki
Position
  • Professor
January 1996 - December 2013
University of Helsinki

Publications
Conference Paper
Full-text available
A promising strategy for the multilingual annotation of speech prosody is to use manual annotation of a small corpus of speech to bootstrap a fully automatic annotation system. We make a systematic distinction between functional annotation and formal annotation. The use of functional prosodic labelling for prosody control in a Finnish speech synthe...
Article
Full-text available
Grasping and mouth movements have been proposed to be integrated anatomically, functionally and evolutionarily. In line with this, we have shown that there is a systematic interaction between particular speech units and grip performance. For example, when the task requires pronouncing a speech unit simultaneously with grasp response, the speech uni...
Article
Full-text available
Recent evidence has shown linkages between actions and segmental elements of speech. For instance, close-front vowels are sound symbolically associated with the precision grip, and front vowels are associated with forward-directed limb movements. The current review article presents a variety of such sound-action effects and proposes that they compo...
Article
Full-text available
Autism spectrum disorder is a neurobiological developmental disorder of the brain characterized by, among other things, problems in social interaction, sensory hypersensitivities, and restricted interests (APA, 2013). The speech prosody of people on the autism spectrum often shows atypical features. This article examines, in the speech of boys on the autism spectrum, utterance-...
Preprint
Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, the state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis a...
Article
Full-text available
Ternary length contrast is a rare phonological feature, investigated here both in terms of its realization and possible ongoing changes. In North Sámi, a phonetically under-documented and endangered Finno-Ugric language spoken by indigenous people in Northern Europe, the ternary quantity contrast is assumed to be signalled by a progressive lengt...
Article
Full-text available
Prosodic characteristics, such as lexical and phrasal stress, are among the most challenging features for second language (L2) speakers to learn. The ability to quantify language learners’ proficiency in terms of prosody can be of use to language teachers and improve the assessment of L2 speaking skills. Automatic assessment, however, requires rel...
Preprint
Full-text available
This work explores the application of various supervised classification approaches using prosodic information for the identification of spoken North Sámi language varieties. Dialects are language varieties that enclose characteristics specific for a given region or community. These characteristics reflect segmental and suprasegmental (prosodic) dif...
Article
Previous research shows that simultaneously executed grasp and vocalization responses are faster when the precision grip is performed with the vowel [i] and the power grip is performed with the vowel [ɑ]. Research also shows that observing an object that is graspable with a precision or power grip can activate the grip congruent with the object. Gi...
Conference Paper
Full-text available
It is known that persons afflicted with autism often have deviant prosodic features in their speech. For example, they may have a limited range of intonation, their speech can be overly fast, jerky or loud, or it can be characterized by large pitch excursions, quiet voice, inconsistent pause structure, prominent word stress and/or by creaky or nasa...
Conference Paper
Full-text available
We present a methodology for assessing similarities and differences between language varieties and dialects in terms of prosodic characteristics. A multi-speaker, multi-dialect WaveNet network is trained on low sample-rate signal retaining only prosodic characteristics of the original speech. The network is conditioned on labels related to speakers...
Conference Paper
Full-text available
Prominence perception has been known to correlate with a complex interplay of the acoustic features of energy, fundamental frequency, spectral tilt, and duration. The contribution and importance of each of these features in distinguishing between prominent and non-prominent units in speech is not always easy to determine, and more so, the prosodic...
Preprint
Full-text available
In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge this will be the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark dataset in detail and train a number of different models...
Article
It has recently been shown that when participants are required to pronounce a vowel at the same time as a hand movement, the vocal and manual responses are facilitated when a front vowel is produced with forward-directed hand movements and a back vowel is produced with backward-directed hand movements. This finding suggests a coupling between s...
Conference Paper
We present a methodology for assessing similarities and differences between language varieties and dialects in terms of prosodic characteristics. A multi-speaker, multi-dialect WaveNet network is trained on low sample-rate signal retaining only prosodic characteristics of the original speech. The network is conditioned on labels related to speakers...
Article
Research has shown connections between articulatory mouth actions and manual actions. This study investigates whether forward–backward hand movements could be associated with vowel production processes that programme tongue fronting/backing, lip rounding/spreading (Experiment 1), and/or consonant production processes that programme tongue tip and t...
Article
Full-text available
This study investigates whether temporal features in speech can predict the perceived proficiency level in Finnish learners of Swedish. To this end, seven expert raters assessed speech samples produced by upper secondary school students using the revised CEFR scale for phonological control. The effect of temporal features was studied with a cumulat...
Article
The study investigated whether number magnitude can influence vocal responses. Participants produced either a short or a long version of the vowel [ɑ] (Experiment 1), or a high- or low-pitched version of that vowel (Experiment 2), according to the parity of a visually presented number. In addition to measuring reaction times (RT) of vocal responses, we me...
Article
Full-text available
We study transformational computational creativity in the context of writing songs and describe an implemented system that is able to modify its own goals and operation. With this, we contribute to three aspects of computational creativity and song generation: (1) Application-wise, songs are an interesting and challenging target for creativity, as...
Article
Full-text available
During voiced speech, vocal folds interact with the vocal tract acoustics. The resulting glottal source–resonator coupling has been observed using mathematical and physical models as well as in in vivo phonation. We propose a computational time-domain model of the full speech apparatus that contains a feedback mechanism from the vocal tract acousti...
Article
Full-text available
The perceived duration of a sound is affected by its fundamental frequency and intensity: higher sounds are judged to be longer, as are sounds with greater intensity. Since increasing intensity lengthens the perceived duration of the auditory object, and increasing the fundamental frequency increases the sound’s perceived loudness (up to ca. 3 kHz)...
Data
Raw results of the intensity discrimination task. This spreadsheet contains the raw results from the intensity discrimination task, showing the values generated for each parameter. Each row represents one trial.
Data
Raw results from the duration discrimination task. This spreadsheet contains the raw results from the duration discrimination task, showing the values generated for each parameter. Each row represents one trial.
Article
Full-text available
The shape and size-related sound symbolism phenomena assume that, for example, the vowel [i] and the consonant [t] are associated with sharp-shaped and small-sized objects, whereas [ɑ] and [m] are associated with round and large objects. It has been proposed that these phenomena are mostly based on the involvement of articulatory processes in repre...
Article
Manual actions and speech are connected: for example, grip execution can influence simultaneous vocalizations and vice versa. Our previous studies show that the consonant [k] is associated with the power grip and the consonant [t] with the precision grip. Here we studied whether the interaction between speech sounds and grips could operate already...
Article
Full-text available
Musical experiences and native language are both known to affect auditory processing. The present work aims to disentangle the influences of native language phonology and musicality on behavioral and subcortical sound feature processing in a population of musically diverse Finnish speakers as well as to investigate the specificity of enhancement fr...
Article
Full-text available
We have recently shown in Finnish speakers that articulation of certain vowels and consonants has a systematic influence on simultaneous grasp actions as well as on forward and backward hand movements. Here we studied whether these effects generalize to another language, namely Czech. We reasoned that if the results generalized to another language...
Article
Contraction of a muscle modulates not only the corticospinal excitability (CSE) of the contracting muscle but also that of different muscles. We investigated to what extent the CSE of a hand muscle is modulated during preparation and execution of teeth clenching and ipsilateral foot dorsiflexion either separately or in combination. Hand-muscle CSE...
Article
Full-text available
Previous research has shown that precision and power grip performance is consistently influenced by simultaneous articulation. For example, power grip responses are performed relatively fast with the open-back vowel [a], whereas precision grip responses are performed relatively fast with the close-front vowel [i]. In the present study, the particip...
Data
Datasets for Experiment 2 used for statistical analyses. (XLSX)
Data
Datasets for Experiment 1 used for statistical analyses. (XLSX)
Article
Full-text available
Prominences and boundaries are the essential constituents of prosodic structure in speech. They provide the means to chunk the speech stream into linguistically relevant units by providing them with relative saliences and demarcating them within utterance structures. Prominences and boundaries have been widely used in both basic research on pr...
Article
Full-text available
Previous studies have shown congruency effects between specific speech articulations and manual grasping actions. For example, uttering the syllable [kɑ] facilitates power grip responses in terms of reaction time and response accuracy. A similar association of the syllable [ti] with precision grip has also been observed. As these congruency effects...
Article
Previous studies have shown a congruency effect between manual grasping and syllable articulation. For instance, a power grip is associated with syllables whose articulation involves the tongue body and/or large mouth aperture ([kɑ]) whereas a precision grip is associated with articulations that involve the tongue tip and/or small mouth aperture ([...
Article
The complex auditory brainstem response (cABR) can reflect language-based plasticity in subcortical stages of auditory processing. It is sensitive to differences between language groups as well as stimulus properties, e.g. intensity or frequency. It is also sensitive to the synchronicity of the neural population stimulated by sound, which results i...
Conference Paper
Full-text available
Increase in fundamental frequency (f0) is one of the most robust and best-studied phenomena characterizing Lombard speech. In this work, three types of global transformation of f0 contours from normal speech to Lombard condition are investigated: (1) a linear re-scaling of the quiet condition contour to match the mean and standard deviation of f0 i...
Conference Paper
Full-text available
Unsupervised boundary detection and classification is both a theoretically interesting question and an important challenge for speech technology. Theoretical interest lies in exploring how and to what extent boundary information is encoded in purely acoustic material. For technology, automatic boundary detection facilitates cheap and fast label...
Article
Full-text available
Recent studies have shown that articulatory gestures are systematically associated with specific manual grip actions. Here we show that executing such actions can influence performance on a speech-categorization task. Participants watched and/or listened to speech stimuli while executing either a power or a precision grip. Grip performance influenc...
Data
Datasets used for statistical analyses. (XLSX)
Article
While the characteristics of the amplitude spectrum of the voiced excitation have been studied widely both in natural and synthetic speech, the role of the excitation phase has remained less explored. This contradicts findings observed in sound perception studies indicating that humans are not phase deaf. Especially in speech synthesis, phase infor...
Article
Full-text available
Over the last century, researchers have collected a considerable amount of data reflecting the properties of Lombard speech, i.e., speech in a noisy environment. The documented phenomena predominately report effects on the speech signal produced in ambient noise. In comparison, relatively little is known about the underlying articulatory patterns o...
Article
Full-text available
Prominences and boundaries are the essential constituents of prosodic structure in speech. They provide the means to chunk the speech stream into linguistically relevant units by providing them with relative saliences and demarcating them within coherent utterance structures. Prominences and boundaries have been widely used in both basic resea...
Conference Paper
Full-text available
Prosodic prominence is an umbrella term encompassing various related but conceptually and functionally different phenomena such as phonological stress, paralinguistic emphasis, lexical, syntactic, semantic or pragmatic salience, to mention a few. Due to the high interest prominence has received from various disciplines, it has been studied from mul...
Conference Paper
Full-text available
In addition to fundamental frequency height, its movement is also generally assumed to lengthen the perceived duration of syllable-like sounds. The lengthening effect has been observed for some languages (US English, French, Swiss German, Japanese) but reported to be absent for others (Thai, Latin American Spanish, German). In this work, native sp...
Article
Some theories concerning speech mechanisms assume that overlapping representations are involved in programming certain articulatory gestures and hand actions. The present study investigated whether planning of movement direction for articulatory gestures and manual actions could interact. The participants were presented with written vowels (Experim...
Article
Full-text available
A unique feature of the human communication system is our ability to rapidly acquire new words and build large vocabularies. However, its neurobiological foundations remain largely unknown. In an electrophysiological study optimally designed to probe this rapid formation of new word memory circuits, we employed acoustically controlled novel word-forms...
Article
Full-text available
It is well known that during voiced speech, the human vocal folds interact with the vocal tract acoustics. The resulting source-filter coupling has been observed using mathematical and physical models as well as in in vivo phonation. We propose a computational time-domain model of the full speech apparatus that, in particular, contains a feedback m...
Chapter
Full-text available
Speech prosody, especially intonation, is hierarchical in nature. That is, the temporal changes in, e.g., fundamental frequency are caused by different factors in the production of an utterance. The small changes due to segmental articulation—consonants and vowels—are different both in their temporal scope and magnitude when compared to word, phras...
Conference Paper
Full-text available
Text-to-speech synthesis is a task that solves many real-world problems such as providing speaking and reading ability to people who lack those capabilities. It is thus viewed mainly as an engineering problem rather than a purely scientific one. Therefore many of the solutions in speech synthesis are purely practical. However, from the point of vie...
Conference Paper
Full-text available
This paper studies a deep neural network (DNN) based voice source modelling method in the synthesis of speech with varying vocal effort. The new trainable voice source model learns a mapping between the acoustic features and the time-domain pitch-synchronous glottal flow waveform using a DNN. The voice source model is trained with various speech m...
Article
We describe an arrangement for simultaneous recording of speech and vocal tract geometry in patients undergoing surgery involving this area. Experimental design is considered from an articulatory phonetic point of view. The speech signals are recorded with an acoustic-electrical arrangement. The vocal tract is simultaneously imaged with MRI. A MATL...