Khiet Truong
  • PhD
  • Associate Professor at University of Twente

About

97 Publications
28,702 Reads
4,068 Citations
Current institution: University of Twente
Current position: Associate Professor

Publications (97)
Conference Paper
Nanogami is a bioresponsive garment that visualizes the importance of the microbiome to collective wellbeing. The microbiome is the group of bacteria, viruses, and cells that live within and on our bodies. This galaxy of particles makes up more than half of the human body and is noted to be responsible for overall health and mood. From nano - micro s...
Poster
Poster on the construction of the MULAI corpus, a multimodal corpus focused on laughter in interactions, designed for the affective computing field. The proceedings are published online.
Article
As children aged 5–8 often play with each other in small groups, their differences in social development and personality traits usually lead to varying levels of engagement with one another. For example, one child may just observe without engaging at all with others while another child may be interested in both the other children as well as the activi...
Conference Paper
Full-text available
Due to the complex nature of engagement and the naturalistic and noisy nature of the data, detecting engagement "in the wild" in a group of children remains a challenging task. Engagement has been linked to speech behaviour, pose behaviour and emotional states. However, only recently has it become more feasible and reliable to extract pose and emot...
Conference Paper
Full-text available
In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) in order to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and a Long Short-Term Memory network (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and lon...
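A minimal sketch of the idea, assuming PyTorch; the layer sizes, input shape, and four-way label set below are illustrative assumptions, not the paper's architecture:

```python
# Hedged sketch of a 3D CNN for SER over stacked spectrogram clips,
# assuming PyTorch. All sizes and the 4-class output are illustrative
# assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class SER3DCNN(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),   # short-term patterns
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),  # longer-term dynamics
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # pool over the whole clip
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, context_frames, time, frequency)
        return self.classifier(self.features(x).flatten(1))

# Example: logits = SER3DCNN()(torch.randn(8, 1, 16, 64, 40))
```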
Conference Paper
Full-text available
Our moral conscience as the "inner light" that guides us shines brighter during moments of ethical conflict, when we notice a tension between our many oughts and/or wants. We present the first analyses of speech-related stress and affect in accounts of moral conflicts. For our exploratory study, we started with interviews on moral and immoral ev...
Conference Paper
Full-text available
Deep architectures using identity skip-connections have demonstrated groundbreaking performance in the field of image classification. Recently, empirical studies suggested that identity skip-connections enable ensemble-like behaviour of shallow networks, and that depth is not the sole ingredient of their success. Therefore, we examine the potential...
Conference Paper
Full-text available
One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). In order to improve the generalisation capabilities of the emotion models, we propose to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks. This...
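A minimal sketch of such an MTL setup, assuming PyTorch; the dimensions, emotion label set, and auxiliary loss weights are illustrative assumptions:

```python
# Hedged sketch of MTL with a shared trunk and separate heads for the
# main task (emotion) and the auxiliary tasks (gender, naturalness),
# assuming PyTorch. Dimensions and loss weights are illustrative.
import torch
import torch.nn as nn

class MTLEmotionNet(nn.Module):
    def __init__(self, n_feats: int = 88, n_emotions: int = 4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_feats, 128), nn.ReLU())
        self.emotion = nn.Linear(128, n_emotions)  # main task
        self.gender = nn.Linear(128, 2)            # auxiliary task
        self.natural = nn.Linear(128, 2)           # acted vs. spontaneous

    def forward(self, x):
        h = self.shared(x)
        return self.emotion(h), self.gender(h), self.natural(h)

# Training combines the losses, down-weighting the auxiliary tasks, e.g.:
# loss = ce(emo, y_emo) + 0.3 * ce(gen, y_gen) + 0.3 * ce(nat, y_nat)
```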
Technical Report
Full-text available
In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) in order to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and a Long Short-Term Memory network (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and lon...
Technical Report
Full-text available
One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). In order to improve the generalisation capabilities of the emotion models, we propose to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks. This...
Article
Full-text available
Developing systems that motivate people to change their behaviors, such as an exercise application for the smartphone, is challenging. One solution is to implement motivational strategies from existing behavior change theory and tailor these strategies to preferences based on personal characteristics, like personality and gender. We operationalized...
Conference Paper
Full-text available
We present a comparative analysis of motivational messages designed with a theory-driven approach. A previous study [4] involved crowdsourcing to design and evaluate motivational text messages for physical activity, and showed that these peer-designed text messages aligned to behavior change strategies from theory. However, the messages were predom...
Conference Paper
Full-text available
Carefully designed nonverbal behavior can supply minimally actuated non-anthropomorphic robots with the communicative power necessary to engage in playful tasks with children. However, this user group brings unique challenges for interaction designers as (i) it is often hard to understand how nonverbal robot behavior is perceived and interpreted by...
Conference Paper
Full-text available
In this note, we present minimal robot movements for robotic technology for children. Two types of minimal gaze movements were designed: social-gaze movements to communicate social engagement and deictic-gaze movements to communicate task-related referential information. In a two (social-gaze movements vs. none) by two (deictic-gaze movements vs. n...
Conference Paper
Awe is a powerful, visceral sensation described as a sudden chill or shudder accompanied by goosebumps. People feel awe in the face of extraordinary experiences: the sublimity of nature, the beauty of art and music, the adrenaline rush of fear. Awe is healthy, both physically and mentally. It can be shared by people who are witnessing the same phen...
Conference Paper
In this paper the organisers present a brief overview of the 2nd International Workshop on Emotion Representations and Modelling for Companion Systems (ERM4CT). The ERM4CT 2016 Workshop is held in conjunction with the 18th ACM International Conference on Multimodal Interaction (ICMI 2016) taking place in Tokyo, Japan. ERM4CT is the follow-up of th...
Conference Paper
This paper gives a summary of the 2nd International Workshop on Advancements in Social Signal Processing for Multimodal Interaction (ASSP4MI). Following our successful 1st International Workshop on Advancements in Social Signal Processing for Multimodal Interaction, held during ICMI-2015, we proposed the 2nd ASSP4MI workshop during ICMI-2016. The t...
Conference Paper
Full-text available
In collaborative play, children exhibit different levels of engagement. Some children are engaged with other children while some play alone. In this study, we investigated multimodal detection of individual levels of engagement using a ranking method and non-verbal features: turn-taking and body movement. Firstly, we automatically extracted turn-ta...
Conference Paper
Full-text available
In collaborative play, young children can exhibit different types of engagement. Some children are engaged with other children in the play activity while others are just looking. In this study, we investigated methods to automatically detect the children's levels of engagement in play settings using non-verbal vocal features. Rather than labelling...
Conference Paper
Full-text available
We explored the automatic analysis of vocal non-verbal cues of a group of children in the context of engagement and collaborative play. For the current study, we defined two types of engagement for groups of children: harmonised and unharmonised. A spontaneous audiovisual corpus with groups of children who collaboratively build a 3D puzzle was coll...
Conference Paper
Full-text available
Developing motivational technology to support long-term behavior change is a challenge. A solution is to incorporate insights from behavior change theory and design technology to tailor to individual users. We carried out two studies to investigate whether the processes of change, from the Transtheoretical Model, can be effectively represented by m...
Conference Paper
Full-text available
Current approaches to design motivational technology for behavior change focus on either tailoring motivational strategies to individual preferences or on implementing strategies from behavior change theory. Our goal is to combine these two approaches and translate behavior change theory to text messages, tailored to personality. To this end, we co...
Conference Paper
Full-text available
An increasing number of applications for social robots focus on learning and playing with children. One of the unanswered questions is what kind of social character a robot should have in order to positively engage children in a task. In this paper, we present a study on the effect of two different social characters of a robot (peer vs. tutor) on...
Conference Paper
Physical activity leads to respiratory behaviour that is very different from that in a resting state and that influences speech production. Exactly how speech parameters are affected by physical activity remains largely unknown. Hence, we investigated how several prosodic parameters change under the influence of physical activity and focused on temporal and br...
Conference Paper
When a mobile robot interacts with a group of people, it has to consider its position and orientation. We introduce a novel study aimed at generating hypotheses on suitable behavior for such social positioning, explicitly focusing on interaction with small groups of users and allowing for the temporal and social dynamics inherent in most interactio...
Conference Paper
Full-text available
Since children (5-9 years old) are still developing their emotional and social skills, their social interactional behaviors in small groups might differ from adults' interactional behaviors. In order to develop a robot that is able to support children performing collaborative tasks in small groups, it is necessary to gain a better understanding of...
Conference Paper
Full-text available
TERESA is a socially intelligent semi-autonomous telepresence system that is currently being developed as part of an FP7-STREP project funded by the European Union. The ultimate goal of the project is to deploy this system in an elderly day centre to allow elderly people to participate in social events even when they are unable to travel to the cen...
Article
Full-text available
Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods allowing ap...
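As a hedged illustration of extracting such a standardized acoustic parameter set, the eGeMAPS functionals can be computed with the opensmile Python package; the path "speech.wav" is a placeholder, and the enum names follow the package's documented API:

```python
# Hedged illustration: extracting the eGeMAPS functionals with the
# opensmile Python package (pip install opensmile). "speech.wav" is a
# placeholder path.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("speech.wav")  # one row of 88 functionals
print(features.shape)
```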
Article
The article aims to model the verbal and prosodic features of emotional expression in interviews to investigate the potential for synergy between scholarly fields that have the narrative as object of study. Using a digital collection of oral history interviews that contains narrative aspects addressing war and violence in Croatia, we analyzed emoti...
Conference Paper
Full-text available
In dialogue, it is not uncommon for people to laugh together. This joint laughter often results in overlapping laughter, consisting of an initiating laugh (the first one), and a responding laugh (the second one). In previous studies, we found that overlapping laughs are acoustically different from non-overlapping ones. So far, we have considered...
Chapter
It is often reported that in spontaneous discourse the laughter of interlocutors overlaps. Although transcripts in interactional linguistics do consider features of overlap, the main prosodic information regarding pitch, intensity, duration and interactional timing of laughter has so far remained untracked. This paper aims to show tha...
Article
Full-text available
Sighs are non-verbal vocalisations that can carry important information about a speaker's emotional (and psychological) state. Although sighs are commonly associated with negative emotions (e.g. giving up on something, 'a sigh of despair', sadness), sighs can also be associated with positive emotions such as relief. In order to gain a better unders...
Article
Artificial listeners are virtual agents that can listen attentively to a human speaker in a dialog. In this paper, we present two experiments where we investigate the perception of rule-based backchannel strategies for artificial listeners. In both, we collect subjective judgements of humans who observe a video of a speaker together with a correspo...
Article
Full-text available
Audiovisual collections of narratives about war-traumas are rich in descriptions of personal and emotional experiences which can be expressed through verbal and nonverbal means. We complement a commonly used verbal analysis with a nonverbal one to study emotional developments in narratives. Using automatic text, vocal, and facial expression analysi...
Article
One of the major properties of overlapping speech is that it can be perceived as competitive or cooperative. For the development of real-time spoken dialog systems and the analysis of affective and social human behavior in conversations, it is important to (automatically) distinguish between these two types of overlap. We investigate acoustic chara...
Article
Full-text available
In this paper, we analyze acoustic profiles of fillers (i.e. filled pauses, FPs) and laughter with the aim to automatically localize these nonverbal vocalizations in a stream of audio. Among other features, we use voice quality features to capture the distinctive production modes of laughter and spectral similarity measures to capture the stability...
Article
The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two types of ratings affect the development and perform...
Conference Paper
Existing laughter annotations provided with several publicly available conversational speech corpora (both multiparty and dyadic conversations) were investigated and compared. We discuss the possibilities and limitations of these rather coarse and shallow laughter annotations. There are definition issues to be considered with respect to speech-laug...
Conference Paper
The social nature of laughter invites people to laugh together. This joint vocal action often results in overlapping laughter. In this paper, we show that the acoustics of overlapping laughs are different from non-overlapping laughs. We found that overlapping laughs are stronger prosodically marked than non-overlapping ones, in terms of higher valu...
Conference Paper
We explore different forms and functions of one of the most common feedback expressions in Dutch, English, and German, namely "yeah/ja", which is known for its multi-functionality and ambiguous usage in dialog. For example, it can be used as a yes-answer, or as a pure continuer, or as a way to show agreement. In addition, "yeah/ja" can be used in i...
Article
In this paper, we investigate prosodic alignment in task-based conversations. We use the HCRC Map Task Corpus and investigate how familiarity affects prosodic alignment and how task success is related to prosodic alignment. A variety of existing alignment measures is used and applied to our data. In particular, a windowed cross-correlation procedur...
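A rough sketch of a windowed cross-correlation measure over two speakers' pitch tracks, assuming NumPy; the window and step sizes are illustrative, not the paper's settings:

```python
# Hedged sketch of a windowed cross-correlation measure of prosodic
# alignment between two speakers' pitch tracks, assuming NumPy.
import numpy as np

def windowed_xcorr(a: np.ndarray, b: np.ndarray, win: int = 50, step: int = 25):
    """Per-window Pearson correlation of two equal-length prosodic series."""
    scores = []
    for start in range(0, len(a) - win + 1, step):
        x, y = a[start:start + win], b[start:start + win]
        if x.std() > 0 and y.std() > 0:            # skip flat windows
            scores.append(np.corrcoef(x, y)[0, 1])
    return np.array(scores)

# Example: mean alignment score = windowed_xcorr(pitch_spk1, pitch_spk2).mean()
```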
Conference Paper
Full-text available
Human mimicry is one of the important behavioral cues displayed during social interaction that inform us about the interlocutors' interpersonal states and attitudes. For example, the absence of mimicry is usually associated with negative attitudes. A system capable of analyzing and understanding mimicry behavior could enhance social interaction, bo...
Conference Paper
During face-to-face interpersonal interaction, people have a tendency to mimic each other. People not only mimic postures, mannerisms, moods or emotions, but they also mimic several speech-related behaviors. In this paper we describe how visual and vocal behavioral information expressed between two interlocutors can be used to detect and identify v...
Article
Full-text available
Different turn-taking strategies of an agent influence the impression that people have of it and the behaviors that they display in response. To study these influences, we carried out several studies. In the first study, subjects listened as bystanders to computer-generated, unintelligible conversations between two speakers. In the second study, su...
Conference Paper
Full-text available
In a perception experiment, we systematically varied the quantity, type and timing of backchannels. Participants viewed stimuli of a real speaker side-by-side with an animated listener and rated how human-like they perceived the latter's backchannel behavior. In addition, we obtained measures of appropriateness and optionality for each backchannel...
Article
Full-text available
This paper presents our progress in developing a Virtual Human capable of being an attentive speaker. Such a Virtual Human should be able to attend to its interaction partner while it is speaking—and modify its communicative behavior on-the-fly based on what it observes in the behavior of its partner. We report new developments concerning a number...
Conference Paper
Mimicry occurs in conversations both when people agree with each other and when they do not. However, it has been reported that there is more mimicry when people agree than when they disagree: when people want to express shared opinions and attitudes, they do so by displaying behavior that is similar to their interlocutors' behavior. In a conversat...
Conference Paper
Full-text available
When human listeners utter Listener Responses (e.g. back-channels or acknowledgments) such as 'yeah' and 'mmhmm', interlocutors commonly continue to speak or resume their speech even before the listener has finished his/her response. This type of speech interactivity results in frequent speech overlap, which is common in human-human conversation. To...
Chapter
Full-text available
This paper presents two examples of how nonverbal communication can be automatically detected and interpreted in terms of social phenomena. In particular, the presented approaches use simple prosodic features to distinguish between journalists and non-journalists in media, and extract social networks from turn-taking to recognize roles in different...
Conference Paper
Full-text available
Backchannels (BCs) are short vocal and visual listener responses that signal attention, interest, and understanding to the speaker. Previous studies have investigated BC prediction in telephone-style dialogs from prosodic cues. In contrast, we consider spontaneous face-to-face dialogs. The additional visual modality allows speaker and listener to m...
Article
Full-text available
This paper proposes an approach for the automatic recognition of roles in settings like news and talk-shows, where roles correspond to specific functions like Anchorman, Guest or Interview Participant. The approach is based on purely nonverbal vocal behavioral cues, including who talks when and how much (turn-taking behavior), and statistical prope...
Conference Paper
We evaluate multimodal rule-based strategies for backchannel (BC) generation in face-to-face conversations. Such strategies can be used by artificial listeners to determine when to produce a BC in dialogs with human speakers. In this research, we consider features from the speaker’s speech and gaze. We used six rule-based strategies to determine th...
Conference Paper
Full-text available
Different turn-taking strategies of an agent influence the impression that people have of it. We recorded conversations of a human with an interviewing agent, controlled by a wizard and using a particular turn-taking strategy. A questionnaire with 27 semantic differential scales concerning personality, emotion, social skills and interviewing skills...
Conference Paper
Full-text available
This paper presents two examples of how nonverbal communication can be automatically detected and interpreted in terms of social phenomena. In particular, the presented approaches use simple prosodic features to distinguish between journalists and non-journalists in media, and extract social networks from turn-taking to recognize roles in diffe...
Conference Paper
Full-text available
In this paper, we look at how prosody can be used to automatically distinguish between different dialogue act functions and how it determines degree of speaker incipiency. We focus on the different uses of 'yeah'. Firstly, we investigate ambiguous dialogue act functions of 'yeah': 'yeah' is most frequently used as a backchannel or an assessment. Se...
Conference Paper
Full-text available
We carry out two studies on affective state modeling for communication settings that involve unilateral intent on the part of one participant (the evoker) to shift the affective state of another participant (the experiencer). The first investigates viewer response in a narrative setting using a corpus of documentaries annotated with viewer-reported...
Conference Paper
Full-text available
We manually designed rules for a backchannel (BC) prediction model based on pitch and pause information. In short, the model predicts a BC when there is a pause of a certain length that is preceded by a falling or rising pitch. This model was validated against the Dutch IFADV Corpus in a corpus-based evaluation method. The results showed that our m...
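A rough sketch of the stated rule, assuming frame-level pitch values with NaN for unvoiced/pause frames; the frame rate and thresholds are illustrative assumptions, not the paper's tuned values:

```python
# Hedged sketch of the pause-plus-pitch-slope backchannel rule.
import numpy as np

def predict_backchannel(pitch, frame_ms=10, min_pause_ms=200, slope_thresh=0.5):
    """Predict a BC if the input ends in a long-enough pause that is
    preceded by a clearly rising or falling pitch contour."""
    pause_frames = int(min_pause_ms / frame_ms)
    if not np.all(np.isnan(pitch[-pause_frames:])):  # no long-enough pause
        return False
    voiced = pitch[~np.isnan(pitch)][-20:]           # last voiced stretch
    if len(voiced) < 2:
        return False
    slope = np.polyfit(np.arange(len(voiced)), voiced, 1)[0]
    return abs(slope) > slope_thresh                 # rising or falling pitch
```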
Article
Full-text available
We present a procedure for conversational floor annotation and discuss floor types and floor switches in face-to-face meetings and the relation with addressing behavior. It seems that for understanding interactions in meetings an agent needs a layered floor model and that turn and floor changes are constrained by the activities and the roles that t...
Article
One of the biggest challenges in designing computer assisted language learning (CALL) applications that provide automatic feedback on pronunciation errors consists in reliably detecting the pronunciation errors at such a detailed level that the information provided can be useful to learners. In our research we investigate pronunciation errors frequ...
Article
Full-text available
This paper presents a methodology to apply speech technology for compensating sensory, motor, cognitive and affective usage difficulties. It distinguishes (1) an analysis of accessibility and technological issues for the identification of context-dependent user needs and corresponding opportunities to include speech in multimodal user interfaces, a...
Conference Paper
Full-text available
In this paper, we describe emotion recognition experiments carried out for spontaneous affective speech with the aim to compare the added value of annotation of felt emotion versus annotation of perceived emotion. Using speech material available in the TNO-GAMING corpus (a corpus containing audiovisual recordings of people playing videoga...
Article
The aim of the research described in this thesis was to develop speech-based affect recognition systems that can deal with spontaneous (‘real’) affect instead of acted affect. Several affect recognition experiments with spontaneous affective speech data were carried out to investigate what combination of acoustic (and also lexical) features and cla...
Conference Paper
Full-text available
We developed acoustic and lexical classifiers, based on a boosting algorithm, to assess the separability on arousal and valence dimensions in spontaneous emotional speech. The spontaneous emotional speech data was acquired by inviting subjects to play a first-person shooter video game. Our acoustic classifiers performed significantly better than th...
Conference Paper
Full-text available
Laughter is a highly variable signal, which can be caused by a spectrum of emotions. This makes the automatic detection of laughter a challenging, but interesting task. We perform automatic laughter detection using audio-visual data from the AMI Meeting Corpus. Audio-visual laughter detection is performed by fusing the results of separate audio and...
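A minimal sketch of decision-level audio-visual fusion, one of several possible fusion schemes; the weight is an assumption, not the paper's value:

```python
# Hedged sketch: fuse per-modality laughter probabilities with a
# weighted sum and threshold the result.
def fuse_laughter(p_audio: float, p_video: float, w_audio: float = 0.6) -> bool:
    return w_audio * p_audio + (1 - w_audio) * p_video > 0.5
```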
Conference Paper
Full-text available
We investigate the combination of several sources of information for the purpose of subjectivity recognition and polarity classification in meetings. We focus on features from two modalities, transcribed words and acoustics, and we compare the performance of three different textual representations: words, characters, and phonemes. Our experim...
Conference Paper
Full-text available
We investigated inter-observer agreement and the reliability of self-reported emotion ratings (i.e., self-raters judging their own emotions) in spontaneous multimodal emotion data. During a multiplayer video game, vocal and facial expressions were recorded (including the game content itself) and were annotated by the players themselves on arousal a...
Conference Paper
Full-text available
The application of increasing automation in process control shifts the operator's task from manual to supervisory control. Increasing system autonomy, complexity and information fluctuations make it extremely difficult to develop static support concepts that cover all critical situations after implementing the system. Therefore, support systems in d...
Conference Paper
Full-text available
Two unobtrusive modalities for automatic emotion recognition are discussed: speech and facial expressions. First, an overview is given of emotion recognition studies based on a combination of speech and facial expressions. We will identify difficulties concerning data collection, data fusion, system evaluation and emotion annotation that one is mos...
Article
Emotions can be recognized by audible paralinguistic cues in speech. By detecting these paralinguistic cues that can consist of laughter, a trembling voice, coughs, changes in the intonation contour etc., information about the speaker’s state and emotion can be revealed. This paper describes the development of a gender-independent laugh detector wi...
Conference Paper
Full-text available
In this paper, we introduce a visual analysis method to assess the discriminability and confusability between emotions according to automatic emotion classifiers. The degree of acoustic similarities between emotions can be defined in terms of distances that are based on pair-wise emotion discrimination experiments. By employing Multidimensional Sca...
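A hedged sketch of the multidimensional scaling step, assuming scikit-learn; the emotion labels and distance matrix below are made-up illustration, not the paper's data:

```python
# Hedged sketch: project pair-wise emotion discrimination distances
# into 2-D coordinates with MDS for visual inspection.
import numpy as np
from sklearn.manifold import MDS

emotions = ["anger", "joy", "sadness", "neutral"]
# Symmetric pair-wise distances from discrimination experiments
# (larger = easier to tell the two emotions apart).
dist = np.array([[0.0, 0.4, 0.8, 0.6],
                 [0.4, 0.0, 0.9, 0.5],
                 [0.8, 0.9, 0.0, 0.3],
                 [0.6, 0.5, 0.3, 0.0]])

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
for emo, (x, y) in zip(emotions, coords):
    print(f"{emo}: ({x:.2f}, {y:.2f})")
```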
Conference Paper
Full-text available
This paper introduces a detection methodology for recognition technologies in speech for which it is difficult to obtain an abundance of non-target classes. An example is language recognition, where we would like to be able to measure the detection capability of a single target language without confounding with the modeling capability of non-targ...
Conference Paper
Full-text available
Providing feedback on pronunciation errors in computer assisted language learning systems requires that pronunciation errors be detected automatically. In the present study we compare four types of classifiers that can be used for this purpose: two acoustic-phonetic classifiers (one of which employs linear-discriminant analysis (LDA)), a classifier...
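A minimal sketch of the LDA-based variant, assuming scikit-learn; the three acoustic-phonetic features and the random data are placeholders:

```python
# Hedged sketch: an LDA classifier over acoustic-phonetic measurements
# labelling a realisation as correct vs. mispronounced.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # e.g. [F1, F2, duration] per segment
y = rng.integers(0, 2, 200)          # 0 = correct, 1 = pronunciation error

clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.predict(X[:5]), clf.score(X, y))
```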
Article
Full-text available
In this paper, we present a multimodal affective mirror that senses and elicits laughter. Currently, the mirror contains a vocal and a facial affect-sensing module, a component that fuses the output of these two modules to achieve a user-state assessment, a user state transition model, and a component to present audiovisual affective feedback that...
Article
Full-text available
In this study, we investigated automatic laughter segmentation in meetings. We first performed laughter-speech discrimination experiments with traditional spectral features and subsequently used acoustic-phonetic features. In segmentation, we used Gaussian Mixture Models that were trained with spectral features. For the evaluation of the laughte...
Conference Paper
Full-text available
In this paper, we present a multimodal affective mirror that senses and elicits laughter. Currently, the mirror contains a vocal and a facial affect-sensing module, a component that fuses the output of these two modules to achieve a user-state assessment, a user state transition model, and a component to present audiovisual affective feedback that...
Conference Paper
Full-text available
To develop an annotated database of spontaneous, multi-modal, emotional expressions, recordings were made of facial and vocal expressions of emotions while participants were playing a multiplayer first-person shooter (fps) computer game. During a replay of the session, participants scored their own emotions by assigning values to them on an arousal...
Conference Paper
Full-text available
In the context of detecting 'paralinguistic events' with the aim to make classification of the speaker's emotional state possible, a detector was developed for one of the most obvious 'paralinguistic events', namely laughter. Gaussian Mixture Models were trained with Perceptual Linear Prediction features, pitch & energy, pitch & voicing and modulat...
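A rough sketch of such a two-class GMM detector, assuming scikit-learn in place of the original toolchain; the frame features (PLP in the paper) and component count are placeholders:

```python
# Hedged sketch: one Gaussian mixture per class over frame-level
# features, compared via a log-likelihood ratio.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_detector(laugh_frames, speech_frames, n_components=8):
    gmm_laugh = GaussianMixture(n_components, random_state=0).fit(laugh_frames)
    gmm_speech = GaussianMixture(n_components, random_state=0).fit(speech_frames)
    return gmm_laugh, gmm_speech

def is_laughter(segment_frames, gmm_laugh, gmm_speech):
    # Compare average per-frame log-likelihoods of the two models.
    return gmm_laugh.score(segment_frames) > gmm_speech.score(segment_frames)
```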
Conference Paper
Full-text available
In this paper, we present an acoustic-phonetic approach to automatic pronunciation error detection. Classifiers using techniques such as Linear Discriminant Analysis and Decision Trees were developed for three sounds that are frequently pronounced incorrectly by L2-learners of Dutch: /A/, /Y/ and /x/. This paper will focus mainly on the problems w...
Article
Full-text available
To develop an annotated database of spontaneous, multi-modal, emotional expressions, recordings were made of facial and vocal expressions of emotions while participants were playing a multiplayer first-person shooter (fps) computer game. During a replay of the session, participants scored their own emotions by assigning values to them on an arousal...
Article
Full-text available
In this paper, we present an acoustic-phonetic approach to automatic pronunciation error detection. Classifiers using techniques such as Linear Discriminant Analysis or a decision tree were developed for three sounds that are frequently pronounced incorrectly by L2-learners of Dutch: /A/, /Y/ and /x/. The acoustic properties of these pronunciation...
Article
Full-text available
In this paper, we present a detection approach and an 'open-set' detection evaluation methodology for automatic emotion recognition in speech. The traditional classification approach does not seem to be suitable and flexible enough for typical emotion recognition tasks. For example, classification does not have an appropriate way to cope with 'new...
