Engin Erzin
Assoc. Prof., Koç University · College of Engineering
About
145 Publications
29,549 Reads
2,232 Citations
Additional affiliations
September 1996 - December 2000: Lucent Technologies, Position: MTS
January 2001 - present
September 1995 - August 1996

Education
September 1992 - August 1995
September 1990 - August 1992
September 1986 - June 1990
Publications (145)
Continuous emotion recognition (CER) aims to track the dynamic changes in a person's emotional state over time. This paper proposes a novel approach to translating CER into a prediction problem of dynamic affect-contour clusters from speech, where the affect-contour is defined as the contour of annotated affect attributes in a temporal window. Our...
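To make the affect-contour idea above concrete, here is a minimal sketch of slicing a frame-level arousal/valence track into temporal windows and clustering the resulting contours. The window length, hop and cluster count are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: cluster windowed affect contours (arousal/valence over time).
import numpy as np
from sklearn.cluster import KMeans

def affect_contour_windows(track, win=50, hop=25):
    """Slice a (T, 2) arousal/valence track into overlapping contour windows."""
    windows = [track[s:s + win].ravel()
               for s in range(0, len(track) - win + 1, hop)]
    return np.array(windows)

# toy annotation track standing in for continuous arousal/valence labels
rng = np.random.default_rng(0)
track = np.cumsum(rng.normal(scale=0.01, size=(1000, 2)), axis=0)

X = affect_contour_windows(track)                        # (n_windows, win * 2)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
print(X.shape, np.bincount(labels))
```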
We present the engagement in human–robot interaction (eHRI) database containing natural interactions between two human participants and a robot under a story-shaping game scenario. The audio-visual recordings provided with the database are fully annotated at a 5-intensity scale for head nods and smiles, as well as with speech transcription and cont...
Video summarization attracts attention for efficient video representation, retrieval, and browsing to ease volume and traffic surge problems. Although video summarization mostly uses the visual channel for compaction, the benefits of audio-visual modeling have appeared in the recent literature. The information coming from the audio channel can be a result o...
A key aspect of social human-robot interaction is natural non-verbal communication. In this work, we train an agent with batch reinforcement learning to generate nods and smiles as backchannels in order to increase the naturalness of the interaction and to engage humans. We introduce the Sequential Random Deep Q-Network (SRDQN) method to learn a po...
As speech-interfaces are getting richer and widespread, speech emotion recognition promises more attractive applications. In the continuous emotion recognition (CER) problem, tracking changes across affective states is an important and desired capability. Although CER studies widely use correlation metrics in evaluations, these metrics do not alway...
We address the problem of continuous arousal detection for emotion recognition in musical audio pieces where emotions are represented in the two-dimensional arousal-valence space. We propose a novel method which is a combination of two recurrent neural networks using mel-spectrogram features: A bidirectional GRU network along the frequency dimensio...
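A minimal PyTorch sketch of the two-network idea described above: a bidirectional GRU scans the mel bins of each frame, and a second recurrent layer tracks the frame embeddings over time to regress arousal. Layer sizes, the single-layer time GRU and the linear head are assumptions.

```python
# Hedged sketch: frequency-wise BiGRU + temporal GRU for continuous arousal.
import torch
import torch.nn as nn

class FreqTimeArousalNet(nn.Module):
    def __init__(self, n_mels=80, freq_hidden=32, time_hidden=64):
        super().__init__()
        self.freq_gru = nn.GRU(input_size=1, hidden_size=freq_hidden,
                               batch_first=True, bidirectional=True)
        self.time_gru = nn.GRU(input_size=2 * freq_hidden, hidden_size=time_hidden,
                               batch_first=True)
        self.head = nn.Linear(time_hidden, 1)

    def forward(self, mel):                      # mel: (batch, time, n_mels)
        b, t, f = mel.shape
        bins = mel.reshape(b * t, f, 1)          # treat mel bins as a sequence
        _, h = self.freq_gru(bins)               # h: (2, b*t, freq_hidden)
        frame_emb = h.permute(1, 0, 2).reshape(b, t, -1)
        out, _ = self.time_gru(frame_emb)        # track frames over time
        return self.head(out).squeeze(-1)        # per-frame arousal estimate

model = FreqTimeArousalNet()
print(model(torch.randn(4, 100, 80)).shape)      # torch.Size([4, 100])
```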
Increasing volume of user-generated human-centric video content and their applications, such as video retrieval and browsing, require compact representations that are addressed by the video summarization literature. Current supervised studies formulate video summarization as a sequence-to-sequence learning problem and the existing solutions often n...
Throat microphones (TM) are a type of skin-attached non-acoustic sensors, which are robust to environmental noise but carry a lower signal bandwidth characterization than the traditional close-talk microphones (CM). Attaining high-performance phoneme recognition is a challenging task when the training data from a degrading channel, such as TM, is l...
Due to its expressivity, natural language is paramount for explicit and implicit affective state communication among humans. The same linguistic inquiry (e.g., How are you?) might induce responses with different affects depending on the affective state of the conversational partner(s) and the context of the conversation. Yet, most dialog systems do...
In this study, we focus on continuous emotion recognition using body motion and speech signals to estimate Activation, Valence, and Dominance (AVD) attributes. A semi-end-to-end network architecture is proposed, where both extracted features and raw signals are fed, and this network is trained using multi-task learning (MTL) rather than the state-of-t...
Although speech driven facial animation has been studied extensively in the literature, works focusing on the affective content of the speech are limited. This is mostly due to the scarcity of affective audio-visual data. In this paper, we improve the affective facial animation using domain adaptation by partially reducing the data scarcity. We fir...
In human-to-human communication, speech signals carry rich emotional cues that are further emphasized by affect-expressive gestures. In this regard, automatic synthesis and animation of gestures accompanying affective verbal communication can help to create more naturalistic virtual agents in human-computer interaction systems. Speech-driven gestur...
Recent advances in real-time Magnetic Resonance Imaging (rtMRI) provide an invaluable tool to study speech articulation. In this paper, we present an effective deep learning approach for supervised detection and tracking of vocal tract contours in a sequence of rtMRI frames. We train a single input multiple output deep temporal regression network (...
In human-to-computer interaction, facial animation in synchrony with affective speech can deliver more naturalistic conversational agents. In this paper, we present a two-stage deep learning approach for affective speech driven facial shape animation. In the first stage, we classify affective speech into seven emotion categories. In the second stag...
The ability to generate appropriate verbal and non-verbal backchannels by an agent during human-robot interaction greatly enhances the interaction experience. Backchannels are particularly important in applications like tutoring and counseling, which require constant attention and engagement of the user. We present here a method for training a robo...
We present a novel method for training a social robot to generate backchannels during human-robot interaction. We address the problem within an off-policy reinforcement learning framework, and show how a robot may learn to produce non-verbal backchannels like laughs, when trained to maximize the engagement and attention of the user. A major contrib...
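As a rough illustration of learning a backchannel policy off-policy from logged interaction data, the sketch below runs a plain fitted Q-iteration over (state, action, reward, next state) tuples. This is a generic stand-in rather than the method from the paper, and the feature layout, action set, reward signal and regressor are all assumptions.

```python
# Hedged sketch: fitted Q-iteration on a logged batch of interaction data,
# where the reward would be an engagement estimate.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
N, STATE_DIM, N_ACTIONS, GAMMA = 2000, 8, 3, 0.95   # actions: none, nod, laugh

# toy logged batch standing in for recorded human-robot sessions
S  = rng.normal(size=(N, STATE_DIM))
A  = rng.integers(0, N_ACTIONS, size=N)
R  = rng.normal(size=N)
S2 = rng.normal(size=(N, STATE_DIM))

def featurize(states, actions):
    """Concatenate the state with a one-hot action encoding."""
    onehot = np.eye(N_ACTIONS)[actions]
    return np.hstack([states, onehot])

q = GradientBoostingRegressor(random_state=0)
targets = R.copy()
for _ in range(5):                                   # a few FQI sweeps
    q.fit(featurize(S, A), targets)
    next_q = np.stack([q.predict(featurize(S2, np.full(N, a)))
                       for a in range(N_ACTIONS)], axis=1)
    targets = R + GAMMA * next_q.max(axis=1)         # bootstrapped Q targets

# greedy backchannel choice for the first logged state
greedy_action = np.argmax([q.predict(featurize(S[:1], np.array([a])))
                           for a in range(N_ACTIONS)])
print("backchannel chosen for first state:", greedy_action)
```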
In this paper we present a deep learning multimodal approach for speech driven generation of face animations. Training a speaker independent model, capable of generating different emotions of the speaker, is crucial for realistic animations. Unlike the previous approaches which either use acoustic features or phoneme label features to estimate the...
Wearable sensor systems can deliver promising solutions to automatic monitoring of ingestive behavior. This study presents an on-body sensor system and related signal processing techniques to classify different types of food intake sounds. A piezoelectric throat microphone is used to capture food consumption sounds from the neck. The recorded signa...
We address the problem of continuous laughter detection over audio-facial input streams obtained from naturalistic dyadic conversations. We first present meticulous annotation of laughters, cross-talks and environmental noise in an audio-facial database with explicit 3D facial mocap data. Using this annotated database, we rigorously investigate the...
In human-to-human communication, gesture and speech co-exist in time with a tight synchrony, and gestures are often utilized to complement or to emphasize speech. In human–computer interaction systems, natural, affective and believable use of gestures would be a valuable key component in adopting and emphasizing human-centered aspects. However, nat...
The aim of this paper is to track Parkinson's disease (PD) progression based on its symptoms in the vocal system using the Unified Parkinson's Disease Rating Scale (UPDRS). We utilize a standard speech signal feature set, which contains 6373 static features as functionals of low-level descriptor (LLD) contours, and select the most informative ones using the...
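The selection-then-regression pipeline described above might look roughly like the sketch below, which picks the most informative functionals out of a large static feature vector and regresses a UPDRS-like score. The selector, regressor and number of kept features are stand-ins, not necessarily the paper's choices.

```python
# Hedged sketch: select informative functionals, then regress the UPDRS score.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6373))     # toy stand-in for 6373 static functionals
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=120)   # toy UPDRS-like target

model = make_pipeline(
    StandardScaler(),
    SelectKBest(mutual_info_regression, k=100),   # keep the most informative features
    SVR(kernel="rbf", C=1.0),
)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("cross-validated MAE:", -scores.mean())
```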
We propose a framework for joint analysis of speech prosody and arm motion towards automatic synthesis and realistic animation of beat gestures from speech prosody and rhythm. In the analysis stage, we first segment motion capture data and speech audio into gesture phrases and prosodic units via temporal clustering, and assign a class label to each...
Gesticulation, together with the speech, is an important part of natural and affective human-human interaction. Analysis of gesticulation and speech is expected to help designing more natural human-computer interaction (HCI) systems. We build the JestKOD database, which consists of speech and motion capture recordings of dyadic interactions. In thi...
Gesture and speech co-exist in time with a tight synchrony; they are planned and shaped by the emotional state and produced together. In our early studies we have developed joint gesture-speech models and proposed algorithms for speech-driven gesture animation. These algorithms are mainly based on Viterbi decoders and cannot run in real time. I...
Swallowing is one of the two fundamental elements of the food intake mechanism. Classification of different swallowing patterns forms an important part of nutrient activity analysis. This paper presents preliminary research investigating ingestion monitoring. We observe that throat microphone recordings can reveal certain characteristi...
In studies on artificial bandwidth extension (ABE), there is a lack of international coordination in subjective tests between multiple methods and languages. Here we present the design of absolute category rating listening tests evaluating 12 ABE variants of six approaches in multiple languages, namely in American English, Chinese, German, and Kore...
Speech and hand gestures form a composite communicative signal that boosts the naturalness and affectiveness of the communication. We present a multimodal framework for joint analysis of continuous affect, speech prosody and hand gestures towards automatic synthesis of realistic hand gestures from spontaneous speech using the hidden semi-Markov mod...
Recently, affect bursts have gained significant importance in the field of emotion recognition since they can serve as priors in recognising the underlying affect. In this paper we propose a data-driven approach for detecting affect bursts using multimodal streams of input such as audio and facial landmark points. The proposed Gaussian Mixture Mo...
In the nature of human-to-human communication, gesture and speech co-exist in time with a tight synchrony. We tend to use gestures to complement or to emphasize speech. In this study we present the JESTKOD database, which will be a valuable asset to examine gesture and speech in defining more natural human-computer interaction systems. This JESTKOD...
In this paper, a new approach that extends narrowband excitation signals to synthesize wideband speech is proposed. The bandwidth extension problem is analyzed using a source-filter separation framework where a speech signal is decomposed into two independent components. For spectral envelope extension, our former work based on hidden Markov mode...
In this paper, we propose a new statistical enhancement system for throat microphone recordings through source and filter separation. Throat microphones (TM) are skin-attached piezoelectric sensors that can capture speech sound signals in the form of tissue vibrations. Due to their limited bandwidth, TM recorded speech suffers from intelligibility...
Hand gesture is one of the most expressive, natural and common types of body language for conveying attitudes and emotions in human interactions. In this paper, we study the role of hand gesture in expressing attitudes of friendliness or conflict towards the interlocutors during interactions. We first employ an unsupervised clustering method using...
Affect bursts, which are nonverbal expressions of emotions in conversations, play a critical role in analyzing affective states. Although there exist a number of methods on affect burst detection and recognition using only audio information, little effort has been spent for combining cues in a multi-modal setup. We suggest that facial gestures cons...
In this analysis paper, we investigate the effect of phonetic clustering based on place and manner of articulation for the enhancement of throat-microphone speech through spectral envelope mapping. Place of articulation (PoA) and manner of articulation (MoA) dependent GMM-based spectral envelope mapping schemes have been investigated using the refl...
Gesticulation is an essential component of face-to-face communication, and it contributes significantly to the natural and affective perception of human-to-human communication. In this work we investigate a new multimodal analysis framework to model relationships between intonational and gesture phrases using the hidden semi-Markov models (HSMMs)....
We investigate spectral envelope mapping problem with joint analysis of throat- and acoustic-microphone recordings to enhance throat-microphone speech. A new phone-dependent GMM-based spectral envelope mapping scheme, which performs the minimum mean square error (MMSE) estimation of the acoustic-microphone spectral envelope, has been proposed. Expe...
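The MMSE estimate under a joint GMM over throat- and acoustic-microphone envelope features has the standard closed form y_hat = sum_m P(m|x) (mu_y_m + Syx_m Sxx_m^{-1} (x - mu_x_m)); the sketch below fits a joint GMM with scikit-learn and applies that mapping. The feature dimension, component count and the toy data are assumptions.

```python
# Hedged sketch: GMM-based MMSE mapping from throat-mic features x to
# acoustic-mic features y under a joint GMM over [x; y].
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
D = 12                                     # envelope feature dimension (e.g. LSFs)
X = rng.normal(size=(5000, D))             # toy throat-microphone features
Y = X @ rng.normal(scale=0.3, size=(D, D)) + rng.normal(scale=0.1, size=(5000, D))

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(np.hstack([X, Y]))                 # joint model over [x; y]

def mmse_map(x):
    """MMSE estimate of y given x under the joint GMM."""
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    Sxx = gmm.covariances_[:, :D, :D]
    Syx = gmm.covariances_[:, D:, :D]
    # component responsibilities P(m | x) from the marginal model over x
    log_px = np.array([
        -0.5 * (x - mu_x[m]) @ np.linalg.solve(Sxx[m], x - mu_x[m])
        - 0.5 * np.linalg.slogdet(Sxx[m])[1] + np.log(gmm.weights_[m])
        for m in range(gmm.n_components)])
    post = np.exp(log_px - log_px.max())
    post /= post.sum()
    return sum(post[m] * (mu_y[m] + Syx[m] @ np.linalg.solve(Sxx[m], x - mu_x[m]))
               for m in range(gmm.n_components))

print(mmse_map(X[0]).shape)                # (12,)
```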
Gesticulation is an essential component of face-to-face communication, and it contributes significantly to the natural and affective perception of human-to-human communication. In this work, we investigate a new multimodal analysis framework to model the relationship between speech rhythm and gesture phrases. We extract speech rhythm using Fourier...
In this paper, we propose a hidden Markov model (HMM)-based wideband spectral envelope estimation method for the artificial bandwidth extension problem. The proposed HMM-based estimator decodes an optimal Viterbi path based on the temporal contour of the narrowband spectral envelope and then performs the minimum mean square error (MMSE) estimation...
In this paper we investigate a new statistical excitation mapping technique to enhance throat-microphone speech using joint analysis of throat- and acoustic-microphone recordings. In a recent study we employed source-filter decomposition to enhance the spectral envelope of the throat-microphone recordings. In the source-filter decomposition framework w...
We propose a novel framework for learning many-to-many statistical mappings from musical measures to dance figures towards generating plausible music-driven dance choreographies. We obtain music-to-dance mappings through use of four statistical models: 1) musical measure models, representing a many-to-one relation, each of which associates differen...
We propose a multimodal framework for correlation analysis of upper body gestures, facial expressions and speech prosody patterns of a speaker in spontaneous and natural conversation. Spontaneous upper body, face and speech gestures exhibit a broad range of structural relationships and have not been previously analyzed together to the best of our k...
Over the last few years, interest on paralinguistic information classification has grown considerably. However, in comparison to related speech processing tasks such as Automatic Speech and Speaker Recognition, practically no standardised corpora and test-conditions exist to compare performances under exactly the same conditions. The successive cha...
Driving behavior signals differ in how and under which conditions the driver uses vehicle control units, such as the pedals and the steering wheel. In this study, we investigate how driving behavior signals differ among drivers and among different driving tasks. Statistically significant findings from these investigations are used to define driver and driving...
In this paper, we propose novel spectrally weighted mel-frequency cepstral coefficient (WMFCC) features for emotion recognition from speech. The idea is based on the fact that formant locations carry emotion-related information, and therefore critical spectral bands around formant locations can be emphasized during the calculation of MFCC features....
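One illustrative way to realize the formant-emphasis idea is to weight each frame's power spectrum toward its spectral peaks before mel filtering and the DCT, as in the sketch below. The specific peak-emphasis weight used here is a stand-in and differs from the weighting defined in the paper.

```python
# Hedged sketch: emphasize high-energy (formant-like) spectral regions before
# mel filtering and the DCT, as a stand-in for spectrally weighted MFCCs.
import numpy as np
import librosa
from scipy.fftpack import dct

def weighted_mfcc(y, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13, gamma=0.5):
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2   # power spectrum
    weight = (spec / (spec.max(axis=0, keepdims=True) + 1e-10)) ** gamma
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(mel_fb @ (weight * spec) + 1e-10)                 # weighted mel energies
    return dct(log_mel, type=2, axis=0, norm="ortho")[:n_ceps]         # (n_ceps, frames)

# toy input: one second of a 440 Hz tone in place of a speech recording
t = np.arange(0, 1.0, 1.0 / 16000)
y = np.sin(2 * np.pi * 440 * t).astype(np.float32)
print(weighted_mfcc(y).shape)
```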
We present a Random Sampling Consensus (RANSAC) based training approach for the problem of speaker state recognition from spontaneous speech. Our system is trained and tested with the INTERSPEECH 2011 Speaker State Challenge corpora that includes the Intoxication and the Sleepiness Sub-challenges, where each sub-challenge defines a two-class classi...
We present a new wideband spectral envelope estimation framework for the artificial bandwidth extension problem. The proposed framework builds temporal clusters of the joint sub-phone patterns of the narrowband and wideband speech signals using a parallel branch HMM structure. The joint sub-phone patterns define temporally correlated neighborhoods...
We target to learn correlation models between music and dance performances to synthesize music driven dance choreographies. The proposed framework learns statistical mappings from musical measures to dance figures using musical measure models, exchangeable figures model, choreography model and dance figure models. Alternative dance choreographies...
Training datasets containing spontaneous emotional expressions are often imperfect due to the ambiguities and difficulties of labeling such data by human observers. In this paper, we present a Random Sampling Consensus (RANSAC) based training approach for the problem of emotion recognition from spontaneous speech recordings. Our motivation is to inser...
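A RANSAC-style training loop for noisily labeled emotion data could look like the sketch below: repeatedly fit a classifier on a small random subset, score the remaining data, and keep the largest consensus set of correctly labeled samples for the final model. The subset size, iteration count and base classifier are assumptions, not the paper's settings.

```python
# Hedged sketch: RANSAC-style selection of a consensus training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# toy data with flipped labels standing in for noisily annotated emotional speech
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           flip_y=0.15, random_state=0)
rng = np.random.default_rng(0)
best_inliers = None

for _ in range(20):
    subset = rng.choice(len(X), size=60, replace=False)
    if len(np.unique(y[subset])) < 2:                    # need both classes to fit
        continue
    model = LinearSVC(dual=False).fit(X[subset], y[subset])
    inliers = np.flatnonzero(model.predict(X) == y)      # consensus set
    if best_inliers is None or len(inliers) > len(best_inliers):
        best_inliers = inliers

final_model = LinearSVC(dual=False).fit(X[best_inliers], y[best_inliers])
print("consensus size:", len(best_inliers), "of", len(X))
```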
We propose the use of line spectral frequency (LSF) features for emotion recognition from speech, which, to the best of our knowledge, have not been previously employed for emotion recognition. Spectral features such as mel-scaled cepstral coefficients have already been successfully used for the parameterization of speech signals for emotion...
In this paper we evaluate INTERSPEECH 2009 Emotion Recognition Challenge results. The challenge presents the problem of accurate classification of natural and emotionally rich FAU Aibo recordings into five and two emotion classes. We evaluate prosody related, spectral and HMM-based features with Gaussian mixture model (GMM) classifiers to attack th...
Training datasets containing spontaneous emotional speech are often imperfect due to the ambiguities and difficulties of labeling such data by human observers. In this paper, we present a Random Sampling Consensus (RANSAC) based training approach for the problem of emotion recognition from spontaneous speech recordings. Our motivation is to insert a d...
We present a new framework for joint analysis of throat and acoustic microphone (TAM) recordings to improve throat microphone only speech recognition. The proposed analysis framework aims to learn joint sub-phone patterns of throat and acoustic microphone recordings through a parallel branch HMM structure. The joint sub-phone patterns define tempor...
We present a speech signal driven emotion recognition system. Our system is trained and tested with the INTERSPEECH 2009 Emotion Challenge corpus, which includes spontaneous and emotionally rich recordings. The challenge includes classifier and feature sub-challenges with five-class and two-class classification problems. We investigate prosody...
Music genre classification is an essential tool for music information retrieval systems and it has been finding critical applications in various media platforms. Two important problems of the automatic music genre classification are feature extraction and classifier design. This paper investigates inter-genre similarity modelling (IGS) to improve t...
We present a framework for automatically generating the facial expression animation of 3D talking heads using only the speech information. Our system is trained on the Berlin emotional speech dataset that is in German and includes seven emotions. We first parameterize the speech signal with prosody related features and spectral features. Then, we i...
This chapter presents a multimodal speaker identification system that integrates audio, lip texture, and lip motion modalities, and the authors propose to use the "explicit" lip motion information that best represents the modality for the given problem. The work is presented in two stages: First, they consider several lip motion feature candidates s...
We present a framework for training and synthesis of an audio-driven dancing avatar. The avatar is trained for a given musical genre using the multicamera video recordings of a dance performance. The video is analyzed to capture the time-varying posture of the dancer’s body whereas the musical audio signal is processed to extract the beat informati...
This paper focuses on the problem of automatically generating speech synchronous facial expressions for 3D talking heads. The proposed system is speaker and language independent. We parameterize speech data with prosody related features and spectral features together with their first and second order derivatives. Then, we classify the seven emotion...
This paper presents a framework for audio-driven human body motion analysis and synthesis. We address the problem in the context of a dance performance, where gestures and movements of the dancer are mainly driven by a musical piece and characterized by the repetition of a set of dance figures. The system is trained in a supervised manner using the...
We present a framework for selecting best audio features for audiovisual analysis and synthesis of dance figures. Dance figures are performed synchronously with the musical rhythm. They can be analyzed through the audio spectra using spectral and rhythmic musical features. In the proposed audio feature evaluation system, dance figures are manually...
This paper presents a framework for audio-driven human body motion analysis and synthesis. The video is analyzed to capture the time-varying posture of the dancer's body whereas the musical audio signal is processed to extract the beat information. The human body posture is extracted from multiview video information without any human interventi...