Jens Edlund
KTH Royal Institute of Technology | KTH · Department of Speech, Music and Hearing (TMH)

PhD Speech communication

About

154 Publications
23,444 Reads
2,219 Citations
Citations since 2017: 20 research items, 888 citations
[Citations per year, 2017–2023]
Additional affiliations
January 1999 - present
KTH Royal Institute of Technology
Position
  • Professor
August 1998 - December 1998
SRI Cambridge
Position
  • Researcher
July 1996 - July 1998
Telia Research
Position
  • Researcher

Publications (154)
Conference Paper
Full-text available
Micro-prosodic effects are segmental influences on the overall prosodic contour of the utterance, as a result of coarticulation with different phonological classes of segments (Taylor, 2009). Within Text-To-Speech (TTS) synthesis, especially for unit selection, these influences have been continually addressed. By controlling for, and preserving...
Chapter
This paper presents the design of one of Sweden’s largest digital humanities projects, SweTerror, that through an interdisciplinary multi-modal methodological approach develops an extensive speech-to-text digital HSS resource. SweTerror makes a major contribution to the study of terrorism in Sweden through a comprehensive mixed methods study of the...
Article
Purpose . We explore the validity and reliability of an Audience Response Systems (ARS)-based measure of acceptability, applied to speech produced by children with speech sound disorder (SSD). We further explore how the suggested measure relates to an ARS-based measure of intelligibility. Finally, we explore potential differences between speech-lan...
Article
Full-text available
Children's speech acquisition is influenced by universal and language-specific forces. Some speech error patterns (or phonological processes) in children's speech are observed in many languages, but the same error pattern may have different effects in different languages. We aimed to explore phonological effects of the same speech error patterns ac...
Article
Full-text available
Purpose We assessed audience response systems (ARS)-based evaluation of intelligibility, with a view to find a valid and reliable intelligibility measure that is accessible to non-trained participants. In addition, we investigated potential listener differences between pediatric speech and language pathologists (SLPs) and untrained adults. Method...
Conference Paper
Full-text available
Speech synthesis applications have become ubiquitous, in navigation systems, digital assistants, and screen or audio book readers. Despite their impact on the acceptability of the systems in which they are embedded, and despite the fact that different applications probably need different types of TTS voices, TTS evaluation is still largely treate...
Conference Paper
This provocation paper calls for a deeper understanding of what spoken human-computer interaction is, and what it can be. It is given structure by a story of humanlikeness and fraudulent spoken dialogue systems - specifically, systems that deliberately attempt to mislead their interlocutors into believing that they are speaking to a human. Against th...
Article
Speech interfaces are growing in popularity. Through a review of 99 research papers this work maps the trends, themes, findings and methods of empirical research on speech interfaces in the field of human–computer interaction (HCI). We find that studies are usability/theory-focused or explore wider system experiences, evaluating Wizard of Oz, proto...
Conference Paper
Full-text available
The use of speech as an interaction modality has grown considerably through the integration of Intelligent Personal Assistants (IPAs, e.g. Siri, Google Assistant) into smartphones and voice-based devices (e.g. Amazon Echo). However, there remain significant gaps in using theoretical frameworks to understand user behaviours and choices and how they...
Preprint
Full-text available
*Please follow the link on https://lmhclark.com/publications/ to find the published version of this article. The information on metrics in particular differs from this preprint version* Speech interfaces are growing in popularity. Through a review of 68 research papers this work maps the trends, themes, findings and methods of empirical research o...
Preprint
Speech interfaces are growing in popularity. Through a review of 68 research papers this work maps the trends, themes, findings and methods of empirical research on speech interfaces in HCI. We find that most studies are usability/theory-focused or explore wider system experiences, evaluating Wizard of Oz, prototypes, or developed systems by using...
Chapter
In the field of language technology, researchers are starting to pay more attention to various interactional aspects of language – a development prompted by a confluence of factors, and one which applies equally to the processing of written and spoken language. Notably, the so-called ‘phatic’ aspects of linguistic communication are coming into focu...
Conference Paper
Full-text available
We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so called found data-data not recorded with the specific purpose to be analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology res...
Chapter
The paper describes the design of a novel corpus of respiratory activity in spontaneous multiparty face-to-face conversations in Swedish. The corpus is collected with the primary goal of investigating the role of breathing for interactive control of interaction. Physiological correlates of breathing are captured by means of respiratory belts, which...
Article
Full-text available
Children with speech disorders often present with systematic speech error patterns. In clinical assessments of speech disorders, evaluating the severity of the disorder is central. Current measures of severity have limited sensitivity to factors like the frequency of the target sounds in the child's language and the degree of phonological diversity...
Chapter
The interest in embodying and situating computer programmes took off in the autonomous agents community in the 90s. Today, researchers and designers of programmes that interact with people on human terms endow their systems with humanoid physiognomies for a variety of reasons. In most cases, attempts at achieving this embodiment and situatedness ha...
Chapter
Full-text available
This contribution introduces backchannel relevance spaces – intervals where it is relevant for a listener in a conversation to produce a backchannel. By annotating and comparing actual visual and vocal backchannels with potential backchannels established using a group of subjects acting as third-party listeners, we show (i) that visual only backcha...
Article
In order to understand and model the dynamics between interaction phenomena such as gaze and speech in face-to-face multiparty interaction between humans, we need large quantities of reliable, objective data of such interactions. To date, this type of data is in short supply. We present a data collection setup using automated, objective techniques...
Conference Paper
This paper investigates gaze patterns in turn-taking. We focus on the difference between speaker changes resulting in gaps and overlaps. We also investigate gaze patters around backchannels and around silences not involving speaker changes.
Conference Paper
Full-text available
Overlap, although short in duration, occurs frequently in multiparty conversation. We show that its duration is approximately log-normal, and inversely proportional to the number of simultaneously speaking parties. Using a simple model, we demonstrate that simultaneous talk tends to end simultaneously less frequently than it begins simultaneously,...
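As a loose illustration of the distributional claim above, the Python sketch below fits a log-normal distribution to a set of overlap durations and runs a crude goodness-of-fit check; the synthetic data and the Kolmogorov-Smirnov test are assumptions for illustration, not the corpus or analysis used in the paper.

    # Sketch: fit a log-normal distribution to overlap durations (in seconds).
    # The durations here are synthetic placeholders, not data from the paper.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    overlap_durations = rng.lognormal(mean=-1.0, sigma=0.8, size=500)  # fake sample

    # Fit with the location fixed at 0, since durations are strictly positive.
    shape, loc, scale = stats.lognorm.fit(overlap_durations, floc=0)
    print(f"fitted sigma={shape:.3f}, median={scale:.3f} s")

    # A Kolmogorov-Smirnov test gives a rough check of the log-normal hypothesis.
    ks_stat, p_value = stats.kstest(overlap_durations, 'lognorm', args=(shape, loc, scale))
    print(f"KS statistic={ks_stat:.3f}, p={p_value:.3f}")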
Conference Paper
Full-text available
Studies of questions present strong evidence that there is no one-to-one relationship between intonation and interrogative mode. In this paper, we describe some aspects of prosodic variation in the Spontal corpus of 120 half-hour spontaneous dialogues in Swedish. The study is part of ongoing work aimed at extracting a database of 600 questions from...
Conference Paper
Spoken face-to-face interaction is a rich and complex form of communication that includes a wide array of phenomena that are not fully explored or understood. While there have been extensive studies of many aspects of face-to-face interaction, these are traditionally of a qualitative nature, relying on hand annotated corpora, typically rather limite...
Article
The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs)...
Conference Paper
Full-text available
The ability of people, and of machines, to determine the position of a sound source in a room is well studied. The related ability to determine the orientation of a directed sound source, on the other hand, is not, but the few studies there are show people to be surprisingly skilled at it. This has bearing for studies of face-to-face interaction an...
Conference Paper
Full-text available
We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency of speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing...
Conference Paper
Full-text available
Studies of questions present strong evidence that there is no one-to-one relationship between intonation and interrogative mode. We present initial steps of a larger project investigating and describing intonational variation in the Spontal database of 120 half-hour spontaneous dialogues in Swedish, and testing the hypothesis that the concept of a...
Conference Paper
We propose to utilize the Mona Lisa gaze effect for an objective and repeatable measure of the extent to which a viewer perceives an object as cospatial. Preliminary results suggest that the metric behaves as expected. Keywords: Copresence, Face-to-face interaction. Face-to-face interaction evolved between humans present in the same physical space...
Conference Paper
Full-text available
Segmentation of speech signals is a crucial task in many types of speech analysis. We present a novel approach at segmentation on a syllable level, using a Bidirectional Long-Short-Term Memory Neural Network. It performs estimation of syllable nucleus positions based on regression of perceptually motivated input features to a smooth target function...
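A minimal PyTorch sketch of the general idea, a bidirectional LSTM regressing frame-wise features onto a smooth nucleus target function, is given below; the feature dimensionality, layer sizes and random placeholder data are assumptions, not the published configuration.

    # Sketch of a bidirectional LSTM regressing frame-wise features to a smooth
    # syllable-nucleus target function, in the spirit of the approach above.
    import torch
    import torch.nn as nn

    class NucleusRegressor(nn.Module):
        def __init__(self, n_features=10, hidden=64):
            super().__init__()
            self.blstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, 1)   # one target value per frame

        def forward(self, x):                     # x: (batch, frames, n_features)
            h, _ = self.blstm(x)
            return self.out(h).squeeze(-1)        # (batch, frames)

    model = NucleusRegressor()
    features = torch.randn(4, 200, 10)            # 4 utterances, 200 frames each (placeholder)
    targets = torch.rand(4, 200)                  # smooth target function (placeholder)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = nn.MSELoss()(model(features), targets)
    loss.backward()
    optimizer.step()
    print(f"training loss: {loss.item():.4f}")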
Conference Paper
Full-text available
The taking of turns to speak is an intrinsic property of conversation. It is expected that models of taking turns, providing a prior distribution over conversational form, can reduce the perplexity of what is attended to and processed by spoken dialogue systems. We propose a single-port model of multi-party turn-taking which allows conversants to b...
Conference Paper
Full-text available
We present empirical justification of why logistic regression may acceptably approximate, using the number of currently vocalizing interlocutors, the probabilities returned by a time-invariant, conditionally independent model of turn-taking. The resulting parametric model with 3 degrees of freedom is shown to be identical to an infinite-range Ising...
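The sketch below shows, under stated assumptions, what such a logistic regression looks like in practice: the single predictor is the number of currently vocalizing interlocutors, and the synthetic onset labels stand in for real turn-taking data.

    # Sketch: logistic regression with the number of currently vocalizing
    # interlocutors as sole predictor of whether another party starts speaking
    # in the next time step. Data below are synthetic placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n_vocalizing = rng.integers(0, 4, size=2000).reshape(-1, 1)   # 0..3 active speakers
    # Fake outcome: speech onsets become less likely as more parties already speak.
    p_onset = 1.0 / (1.0 + np.exp(1.5 * n_vocalizing.ravel() - 1.0))
    onset = rng.random(2000) < p_onset

    clf = LogisticRegression().fit(n_vocalizing, onset)
    print("intercept:", clf.intercept_, "coefficient:", clf.coef_)
    print("P(onset | k active speakers):",
          clf.predict_proba(np.arange(4).reshape(-1, 1))[:, 1])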
Conference Paper
Full-text available
This work explores the timing of very short utterances in conversations, as well as the effects of excluding intervals adjacent to such utterances from distributions of between-speaker interval durations. The results show that very short utterances are more precisely timed to the preceding utterance than longer utterances in terms of a smaller varia...
Conference Paper
Full-text available
We present a computational framework for stochastically modeling dyad interaction chronograms. The framework's most novel feature is the capacity for incremental learning and forgetting. To showcase its flexibility, we design experiments answering four concrete questions about the systematics of spoken interaction. The results show that: (1) indivi...
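A much-simplified sketch of incremental learning with forgetting over a dyadic chronogram follows: it keeps exponentially decaying counts over the four joint vocal-activity states. The forgetting factor and frame-based encoding are illustrative assumptions, not the framework described in the paper.

    # Sketch: incrementally updated, exponentially forgetting estimate of the
    # distribution over joint vocal-activity states in a dyad.
    import numpy as np

    class ForgettingChronogramModel:
        def __init__(self, forgetting=0.995):
            self.forgetting = forgetting
            self.counts = np.ones(4)          # Laplace-style initialisation

        def update(self, a_speaking: bool, b_speaking: bool):
            state = 2 * int(a_speaking) + int(b_speaking)   # 0..3
            self.counts *= self.forgetting                  # forget old evidence
            self.counts[state] += 1.0                       # learn from the new frame

        def probabilities(self):
            return self.counts / self.counts.sum()

    model = ForgettingChronogramModel()
    for a, b in [(True, False)] * 50 + [(True, True)] * 10 + [(False, False)] * 40:
        model.update(a, b)
    print(model.probabilities())   # P(neither), P(B only), P(A only), P(both)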
Thesis
Full-text available
In the group of people with whom I have worked most closely, we recently attempted to dress our visionary goal in words: “to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is humanlike”. The “conversational homunculus” figuring in the title of this book represents this “artific...
Conference Paper
Full-text available
Speech was conceived in a face-to-face interaction setting, and spoken dialogue is the cradle in which it evolved. As of this day, every-day face-to-face communicative interaction is the context in which most of our language use occurs. We present the Spontal database of spontaneous Swedish dialogues, a new resource for studies of everyday face-to-...
Conference Paper
Full-text available
Intonation is an important aspect of vocal production, used for a variety of communicative needs. Its modeling is therefore crucial in many speech understanding systems, particularly those requiring inference of speaker intent in real-time. However, the estimation of pitch, traditionally the first step in intonation modeling, is computationally inc...
Conference Paper
Full-text available
Spontal-N is a corpus of spontaneous, interactional Norwegian. To our knowledge, it is the first corpus of Norwegian in which the majority of speakers have spent significant parts of their lives in Sweden, and in which the recorded speech displays varying degrees of interference from Swedish. The corpus consists of studio quality audio- and video-r...
Conference Paper
Full-text available
We have a visionary goal: to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is human-like. We take the opportunity here to present four new projects inaugurated in 2010, each adding pieces of the puzzle through a shared research focus: interactional aspects of spoken face-to-fa...
Conference Paper
Full-text available
Dynamic modeling of spoken dialogue seeks to capture how interlocutors change their speech over the course of a conversation. Much work has focused on how speakers adapt or entrain to different aspects of one another's speaking style. In this paper we focus on local aspects of this adaptation. We investigate the relationship between backchannels an...
Chapter
Full-text available
We present work fuelled by an urge to understand speech in its original and most fundamental context: in conversation between people. And what better way than to look to the experts? Regarding human conversation, authority lies with the speakers themselves, and asking the experts is a matter of observing and analyzing what speakers do. This is the...
Conference Paper
Full-text available
A large number of vocalizations in everyday conversation are traditionally not regarded as part of the information exchange. Examples include confirmations such as yeah and ok as well as traditionally non-lexical items, such as uh-huh, um, and hmm. Vocalizations like these have been grouped in different constellations
Article
This paper explores durational aspects of pauses, gaps and overlaps in three different conversational corpora with a view to challenge claims about precision timing in turn-taking. Distributions of pause, gap and overlap durations in conversations are presented, and methodological issues regarding the statistical treatment of such distributions are...
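For concreteness, a small Python sketch of how pause, gap and overlap durations can be derived from two speakers' talkspurts follows; the operational definitions used here (pause = silence without speaker change, gap = silence at a speaker change, overlap = simultaneous talk) follow common usage and are assumptions about, not a reproduction of, the paper's method.

    # Sketch: derive pause, gap and overlap durations from two speakers'
    # talkspurts, given as (start, end) times in seconds. Simplified: nested
    # talkspurts and three-way overlaps are not handled specially.
    def silences_and_overlaps(spurts_a, spurts_b):
        events = sorted([(s, e, 'A') for s, e in spurts_a] +
                        [(s, e, 'B') for s, e in spurts_b])
        pauses, gaps, overlaps = [], [], []
        prev_end, prev_spk = events[0][1], events[0][2]
        for start, end, spk in events[1:]:
            if start > prev_end:                       # silence between talkspurts
                (pauses if spk == prev_spk else gaps).append(start - prev_end)
            elif start < prev_end:                     # simultaneous talk
                overlaps.append(min(end, prev_end) - start)
            if end > prev_end:
                prev_end, prev_spk = end, spk
        return pauses, gaps, overlaps

    a = [(0.0, 1.0), (1.3, 2.2), (3.0, 3.8)]
    b = [(2.4, 3.2), (4.0, 5.0)]
    print(silences_and_overlaps(a, b))   # ([0.3], [gap durations], [overlap durations])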
Conference Paper
We introduce an approach to using animated faces for robotics where a static physical object is used as a projection surface for an animation. The talking head is projected onto a 3D physical head model. In this chapter we discuss the different benefits this approach adds over mechanical heads. After that, we investigate a phenomenon commonly refer...
Article
In recent years there has been a substantial debate about the need for increasingly spontaneous, conversational corpora of spoken interaction that are not controlled or task directed. In parallel the need arises for the recording of multi-modal corpora which are not restricted to the audio domain alone. With a corpus that would fulfill both needs, i...
Conference Paper
Full-text available
We present MMAE – Massively Multi-component Audio Environments – a new concept in auditory presentation, and Cocktail – a demonstrator built on this technology. MMAE creates a dynamic audio environment by playing a large number of sound clips simultaneously at different locations in a virtual 3D space. The technique utilizes standard soundboards an...
Conference Paper
In this demo, we show (a) affordable and relatively easy-to-implement means to facilitate synchronization of audio, video and motion capture data in post processing, and (b) a flexible tool for 3D visualization of recorded motion capture data aligned with audio and video sequences. The synchronisation is made possible by the use of two simple and a...
Conference Paper
Full-text available
Faced with the difficulties of finding an operationalized definition of backchannels, we have previously proposed an intermediate, auxiliary unit – the very short utterance (VSU) – which is defined operationally and is automatically extractable from recorded or ongoing dialogues. Here, we extend that work in the following ways: (1) we test the exte...
Article
Full-text available
This paper investigates learner response to a novel kind of intonation feedback generated from speech analysis. Instead of displays of pitch curves, our feedback is flashing lights that show how much pitch variation the speaker has produced. The variable used to generate the feedback is the standard deviation of fundamental frequency as measured in...
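A hedged sketch of the underlying measure, the standard deviation of fundamental frequency mapped to a coarse feedback level, might look as follows; the use of librosa's pyin tracker, the Hz scale and the thresholds are placeholders, not the system described in the paper.

    # Sketch: estimate F0 standard deviation over an utterance and map it to a
    # coarse feedback level, loosely mirroring light-based pitch-variation feedback.
    import numpy as np
    import librosa

    def pitch_variation_level(wav_path, low=10.0, high=40.0):
        y, sr = librosa.load(wav_path, sr=16000)
        f0, voiced_flag, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
        f0_voiced = f0[voiced_flag & ~np.isnan(f0)]
        if f0_voiced.size == 0:
            return 0.0, "no voicing detected"
        std_hz = float(np.std(f0_voiced))
        if std_hz < low:
            return std_hz, "low variation"      # e.g. no lights flashing
        if std_hz < high:
            return std_hz, "some variation"
        return std_hz, "lively intonation"      # e.g. all lights flashing

    # Example (the path is a placeholder):
    # print(pitch_variation_level("utterance.wav"))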
Conference Paper
Full-text available
Prosody plays a central role in communicating via speech, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency...
Conference Paper
Full-text available
It has long been noted that conversational partners tend to exhibit increasingly similar pitch, intensity, and timing behavior over the course of a conversation. However, the metrics developed to measure this similarity to date have generally failed to capture the dynamic temporal aspects of this process. In this paper, we propose new approaches to...
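One simple way to make such a similarity measure time-varying is a sliding-window correlation over the two speakers' pitch tracks, sketched below; the window length, hop and use of Pearson correlation are illustrative assumptions rather than the metrics proposed in the paper.

    # Sketch: windowed similarity between two speakers' pitch tracks, yielding a
    # trajectory of similarity over the conversation rather than a single score.
    import numpy as np

    def windowed_similarity(f0_a, f0_b, win=100, hop=50):
        scores = []
        for start in range(0, min(len(f0_a), len(f0_b)) - win + 1, hop):
            a = f0_a[start:start + win]
            b = f0_b[start:start + win]
            if np.std(a) > 0 and np.std(b) > 0:
                scores.append(np.corrcoef(a, b)[0, 1])
            else:
                scores.append(np.nan)
        return np.array(scores)

    rng = np.random.default_rng(2)
    f0_a = 120 + rng.normal(0, 10, 1000)            # placeholder pitch tracks (Hz)
    f0_b = 0.5 * f0_a + 60 + rng.normal(0, 10, 1000)
    print(windowed_similarity(f0_a, f0_b)[:5])      # similarity trajectory over time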
Article
Evaluation of methods and techniques for conversational and multimodal spoken dialogue systems is complex, as is gathering data for the modeling and tuning of such techniques. This article describes MushyPeek, an experiment framework that allows us to manipulate the audiovisual behavior of interlocutors in a setting similar to face-to-face human-hu...
Conference Paper
We share our experiences with integrating motion capture recordings in speech and dialogue research by describing (1) Spontal, a large project collecting 60 hours of video, audio and motion capture of spontaneous dialogues, described with special attention to motion capture and its pitfalls; (2) a tutorial where we use motion capture, speech synthe...
Conference Paper
Full-text available
We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and p...
Chapter
No matter how well hidden our systems are and how well they do their magic unnoticed in the background, there are times when direct interaction between system and human is a necessity. As long as the interaction can take place unobtrusively and without techno-clutter, this is desirable. It is hard to picture a means of interaction less obtrusive an...
Conference Paper
Full-text available
A basic requirement for participation in conversation is the ability to jointly manage interaction. Examples of interaction management include indications to acquire, re-acquire, hold, release, and acknowledge floor ownership, and these are often implemented using specialized dialog act (DA) types. In this work, we explore the prosody of one class...
Conference Paper
We describe the ongoing Swedish speech database project Spontal: Multimodal database of spontaneous speech in dialog (VR 2006-7482). The project takes as its point of departure the fact that both vocal signals and gesture involving the face and body are important in every- day, face-to-face communicative interaction, and that there is a great need...
Conference Paper
Full-text available
In this study, we describe the range of prosodic variation observed in two types of dialogue contexts, using fully automatic methods. The first type of context is that of speaker-changes, or transitions from only one participant speaking to only the other, involving either acoustic silences or acoustic overlaps. The second type of context is compri...
Conference Paper
Full-text available
This paper discusses the feasibility of using prosodic features for interaction control in spoken dialogue systems, and points to experimental evidence that automatically extracted prosodic features can be used to improve the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor, as well...
Article
Full-text available
Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources. Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson and Koenraad de Smedt. NEALT Proceedings Series, Vol. 5 (2009), 1-5. © 2009 The editors and contributors. Published by Northern European As...
Article
This paper presents an overview of methods that can be used to collect and analyse data on user responses to spoken dialogue system components intended to increase human-likeness, and to evaluate how well the components succeed in reaching that goal. Wizard-of-Oz variations, human–human data manipulation, and micro-domains are discussed in this con...
Conference Paper
Full-text available
This demo paper presents the first version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remembe...