Y. Yemez

Philips, Eindhoven, North Brabant, Netherlands

Publications (47) · 48.89 Total impact

  • Conference Proceeding: Multi-modal analysis of dance performances for music-driven choreography synthesis
    ABSTRACT: We propose a framework for modeling, analysis, annotation and synthesis of multi-modal dance performances. We analyze correlations between music features and dance figure labels on training dance videos in order to construct a mapping from music measures (segments) to dance figures towards generating music-driven dance choreographies. We assume that dance figure segment boundaries coincide with music measures (audio boundaries). For each training video, figure segments are manually labeled by an expert to indicate the type of dance motion. Chroma features of each measure are used for music analysis. We model temporal statistics of such chroma features corresponding to each dance figure label to identify different rhythmic patterns for that dance motion. The correlations between dance figures and music measures, as well as correlations between consecutive dance figures, are used to construct a mapping for music-driven dance choreography synthesis. Experimental results demonstrate the success of the proposed music-driven choreography synthesis framework.
    Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on; 04/2010 · 4.63 Impact Factor
  • Article: 3D Model Retrieval Using Probability Density-Based Shape Descriptors
    ABSTRACT: We address content-based retrieval of complete 3D object models by a probabilistic generative description of local shape properties. The proposed shape description framework characterizes a 3D object with sampled multivariate probability density functions of its local surface features. This density-based descriptor can be efficiently computed via kernel density estimation (KDE) coupled with fast Gauss transform. The non-parametric KDE technique allows reliable characterization of a diverse set of shapes and yields descriptors which remain relatively insensitive to small shape perturbations and mesh resolution. Density-based characterization also induces a permutation property which can be used to guarantee invariance at the shape matching stage. As proven by extensive retrieval experiments on several 3D databases, our framework provides state-of-the-art discrimination over a broad and heterogeneous set of shape categories.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 07/2009; · 4.91 Impact Factor
  • Conference Proceeding: Unsupervised dance figure analysis from video for dancing Avatar animation
    ABSTRACT: This paper presents a framework for unsupervised video analysis in the context of dance performances, where gestures and 3D movements of a dancer are characterized by repetition of a set of unknown dance figures. The system is trained in an unsupervised manner using hidden Markov models (HMMs) to automatically segment multiview video recordings of a dancer into recurring elementary temporal body motion patterns to identify the dance figures. That is, a parallel HMM structure is employed to automatically determine the number and the temporal boundaries of different dance figures in a given dance video. The success of the analysis framework has been evaluated by visualizing these dance figures on a dancing avatar animated by the computed 3D analysis parameters. Experimental results demonstrate that the proposed framework enables synthetic agents and/or robots to learn dance figures from video automatically.
    Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on; 11/2008
  • Article: Analysis of Head Gesture and Prosody Patterns for Prosody-Driven Head-Gesture Animation
    ABSTRACT: We propose a new two-stage framework for joint analysis of head gesture and speech prosody patterns of a speaker towards automatic realistic synthesis of head gestures from speech prosody. In the first stage analysis, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of head gesture and speech prosody features separately to determine elementary head gesture and speech prosody patterns, respectively, for a particular speaker. In the second stage, joint analysis of correlations between these elementary head gesture and prosody patterns is performed using Multi-Stream HMMs to determine an audio-visual mapping model. The resulting audio-visual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech given a head model for the speaker. In the synthesis stage, the audio-visual mapping model is used to predict a sequence of gesture patterns from the prosody pattern sequence computed for the input test speech. The Euler angles associated with each gesture pattern are then applied to animate the speaker head model. Objective and subjective evaluations indicate that the proposed synthesis by analysis scheme provides natural looking head gestures for the speaker with any input test speech, as well as in "prosody transplant" and "gesture transplant" scenarios.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 08/2008; vol. 30, no. 8, pp. 1330-1345. · 4.91 Impact Factor
  • Conference Proceeding: Audio-driven human body motion analysis and synthesis
    ABSTRACT: This paper presents a framework for audio-driven human body motion analysis and synthesis. We address the problem in the context of a dance performance, where gestures and movements of the dancer are mainly driven by a musical piece and characterized by the repetition of a set of dance figures. The system is trained in a supervised manner using the multiview video recordings of the dancer. The human body posture is extracted from multiview video information without any human intervention using a novel marker-based algorithm based on annealing particle filtering. Audio is analyzed to extract beat and tempo information. The joint analysis of audio and motion features provides a correlation model that is then used to animate a dancing avatar when driven with any musical piece of the same genre. Results are provided showing the effectiveness of the proposed algorithm.
    Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on; 05/2008 · 4.63 Impact Factor
  • Article: Audiovisual Synchronization and Fusion Using Canonical Correlation Analysis
    ABSTRACT: It is well-known that early integration (also called data fusion) is effective when the modalities are correlated, and late integration (also called decision or opinion fusion) is optimal when modalities are uncorrelated. In this paper, we propose a new multimodal fusion strategy for open-set speaker identification using a combination of early and late integration following canonical correlation analysis (CCA) of speech and lip texture features. We also propose a method for high precision synchronization of the speech and lip features using CCA prior to the proposed fusion. Experimental results show that i) the proposed fusion strategy yields the best equal error rates (EER), which are used to quantify the performance of the fusion strategy for open-set speaker identification, and ii) precise synchronization prior to fusion improves the EER; hence, the best EER is obtained when the proposed synchronization scheme is employed together with the proposed fusion strategy. We note that the proposed fusion strategy outperforms others because the features used in the late integration are truly uncorrelated, since they are output of the CCA analysis.
    IEEE Transactions on Multimedia 12/2007; · 1.93 Impact Factor
  • Conference Proceeding: Feature-Level and Descriptor-Level Information Fusion for Density-Based 3D Shape Descriptors
    ABSTRACT: We address the 3D object retrieval problem using density-based shape descriptors. We explore first and second order local surface features and their multivariate combinations in the density estimation framework. We also experiment with descriptor level information fusion. The results, obtained using two different databases, Princeton Shape Benchmark and Sculpteur, show that, boosted with both feature level and descriptor level information fusion, the density-based shape description framework enables effective and efficient 3D object retrieval.
    Signal Processing and Communications Applications, 2007. SIU 2007. IEEE 15th; 07/2007
  • Conference Proceeding: Multivariate Density-Based 3D Shape Descriptors
    ABSTRACT: We address the 3D object retrieval problem using multivariate density-based shape descriptors. Considering the fusion of first and second order local surface information, we construct multivariate features up to five dimensions and process them by the kernel density estimation methodology to obtain descriptor vectors. We can compute these descriptors very efficiently using the fast Gauss transform algorithm. We also make use of descriptor level information fusion by concatenating descriptor vectors to increase their discrimination power further. To render the resulting descriptors storage-wise efficient, we develop two analytical tools, marginalization and probability density suppression, for descriptor dimensionality reduction. The experiments on two different databases, Princeton Shape Benchmark and Sculpteur, show that, boosted with both feature level and descriptor level information fusion, and powered with fast computational schemes, the density-based shape description framework enables effective and efficient 3D object retrieval.
    Shape Modeling and Applications, 2007. SMI '07. IEEE International Conference on; 07/2007
  • Conference Proceeding: Joint Correlation Analysis of Audio-Visual Dance Figures
    ABSTRACT: In this paper we present a framework for analysis of dance figures from audio-visual data. Our audio-visual data is the multiview video of a dancing actor which is acquired using 8 synchronized cameras. The multi-camera motion capture technique of this framework is based on 3D tracking of the markers attached to the dancer's body, using stereo color information. The extracted 3D points are used to calculate the body motion features as 3D displacement vectors. On the other hand, MFC coefficients serve as the audio features. In the first stage of the two-stage analysis task, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of the audio and body motion features, separately, to extract the recurrent elementary audio and body motion patterns. In the second stage, the correlation of body motion patterns with audio patterns is investigated to create a correlation model that can be used during the synthesis of an audio-driven body animation.
    Signal Processing and Communications Applications, 2007. SIU 2007. IEEE 15th; 07/2007
  • Conference Proceeding: Estimation of Personalized Facial Gesture Patterns
    ABSTRACT: We propose a framework for estimation and analysis of temporal facial expression patterns of a speaker. The goal of this framework is to learn the personalized elementary dynamic facial expression patterns for a particular speaker. We track lip, eyebrow, and eyelid of the speaker in 3D across a head-and-shoulder stereo video sequence. We use MPEG-4 facial definition parameters (FDPs) to create the feature set, and MPEG-4 facial animation parameters (FAPs) to represent the temporal facial expression patterns. Hidden Markov model (HMM) based unsupervised temporal segmentation of upper and lower facial expression features is performed separately to determine recurrent elementary facial expression patterns for the particular speaker. These facial expression patterns, which are coded by FAP sequences and may not be tied with prespecified emotions, can be used for personalized emotion estimation and synthesis of a speaker. Experimental results are presented.
    Signal Processing and Communications Applications, 2007. SIU 2007. IEEE 15th; 07/2007
  • Conference Proceeding: Multicamera Audio-Visual Analysis of Dance Figures
    ABSTRACT: We present an automated system for multicamera motion capture and audio-visual analysis of dance figures. The multiview video of a dancing actor is acquired using 8 synchronized cameras. The motion capture technique is based on 3D tracking of the markers attached to the person's body in the scene, using stereo color information without need for an explicit 3D model. The resulting set of 3D points is then used to extract the body motion features as 3D displacement vectors whereas MFC coefficients serve as the audio features. In the first stage of multimodal analysis, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of the audio and body motion features, separately, to determine the recurrent elementary audio and body motion patterns. Then in the second stage, we investigate the correlation of body motion patterns with audio patterns, that can be used for estimation and synthesis of realistic audio-driven body animation.
    IEEE International Conference on Multimedia and Expo, 2007, Beijing, China (2-5 July, 2007); 07/2007
  • Conference Proceeding: Prosody-Driven Head-Gesture Animation
    ABSTRACT: We present a new framework for joint analysis of head gesture and speech prosody patterns of a speaker towards automatic realistic synthesis of head gestures from speech prosody. The proposed two-stage analysis aims to "learn" both elementary prosody and head gesture patterns for a particular speaker, as well as the correlations between these head gesture and prosody patterns from a training video sequence. The resulting audio-visual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech given a head model for the speaker. Objective and subjective evaluations indicate that the proposed synthesis by analysis scheme provides natural looking head gestures for the speaker with any input test speech.
    Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on; 05/2007 · 4.63 Impact Factor
  • Article: Discriminative Analysis of Lip Motion Features for Speaker Identification and Speech-Reading
    ABSTRACT: There have been several studies that jointly use audio, lip intensity, and lip geometry information for speaker identification and speech-reading applications. This paper proposes using explicit lip motion information, instead of or in addition to lip intensity and/or geometry information, for speaker identification and speech-reading within a unified feature selection and discrimination analysis framework, and addresses two important issues: 1) Is using explicit lip motion information useful, and, 2) if so, what are the best lip motion features for these two applications? The best lip motion features for speaker identification are considered to be those that result in the highest discrimination of individual speakers in a population, whereas for speech-reading, the best features are those providing the highest phoneme/word/phrase recognition rate. Several lip motion feature candidates have been considered including dense motion features within a bounding box about the lip, lip contour motion features, and combination of these with lip shape features. Furthermore, a novel two-stage, spatial, and temporal discrimination analysis is introduced to select the best lip motion features for speaker identification and speech-reading applications. Experimental results using a hidden-Markov-model-based recognition system indicate that using explicit lip motion information provides additional performance gains in both applications, and lip motion features prove more valuable in the case of speech-reading application.
    IEEE Transactions on Image Processing 11/2006; · 3.04 Impact Factor
  • Article: Combined Gesture-Speech Analysis and Speech Driven Gesture Synthesis
    ABSTRACT: Multimodal speech and speaker modeling and recognition are widely accepted as vital aspects of state of the art human-machine interaction systems. While correlations between speech and lip motion as well as speech and facial expressions are widely studied, relatively little work has been done to investigate the correlations between speech and gesture. Detection and modeling of head, hand and arm gestures of a speaker have been studied extensively and these gestures were shown to carry linguistic information. A typical example is the head gesture while saying "yes/no". In this study, correlation between gestures and speech is investigated. In speech signal analysis, keyword spotting and prosodic accent event detection have been performed. In gesture analysis, hand positions and parameters of global head motion are used as features. The detection of gestures is based on discrete predesignated symbol sets, which are manually labeled during the training phase. The gesture-speech correlation is modelled by examining the co-occurring speech and gesture patterns. This correlation can be used to fuse gesture and speech modalities for edutainment applications (e.g., video games, 3-D animations) where natural gestures of talking avatars are animated from speech. A speech driven gesture animation example has been implemented for demonstration.
    2012 IEEE International Conference on Multimedia and Expo; 07/2006
  • Conference Proceeding: Multimodal Speaker Identification Using Canonical Correlation Analysis
    ABSTRACT: In this work, we explore the use of canonical correlation analysis to improve the performance of multimodal recognition systems that involve multiple correlated modalities. More specifically, we consider the audiovisual speaker identification problem, where speech and lip texture (or intensity) modalities are fused in an open-set identification framework. Our motivation is based on the following observation. The late integration strategy, which is also referred to as decision or opinion fusion, is effective especially in case the contributing modalities are uncorrelated and thus the resulting partial decisions are statistically independent. Early integration techniques on the other hand can be favored only if a couple of modalities are highly correlated. However, coupled modalities such as audio and lip texture also consist of some components that are mutually independent. Thus we first perform a cross-correlation analysis on the audio and lip modalities so as to extract the correlated part of the information, and then employ an optimal combination of early and late integration techniques to fuse the extracted features. The results of the experiments testing the performance of the proposed system are also provided.
    Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on; 06/2006 · 4.63 Impact Factor
  • Article: Multimodal person recognition for human-vehicle interaction
    ABSTRACT: The authors combine two different biometric modalities for next-generation vehicles that use biometric person recognition. Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, borne out in two case studies.
    IEEE Multimedia 05/2006; · 0.44 Impact Factor
  • Article: Multimodal speaker identification using an adaptive classifier cascade based on modality reliability
    ABSTRACT: We present a multimodal open-set speaker identification system that integrates information coming from audio, face and lip motion modalities. For fusion of multiple modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers. The order of the classifiers in the cascade is adaptively determined based on the reliability of each modality combination. A novel reliability measure, that genuinely fits to the open-set speaker identification problem, is also proposed to assess accept or reject decisions of a classifier. A formal framework is developed based on probability of correct decision for analytical comparison of the proposed adaptive rule with other classifier combination rules. The proposed adaptive rule is more robust in the presence of unreliable modalities, and outperforms the hard-level max rule and soft-level weighted summation rule, provided that the employed reliability measure is effective in assessment of classifier decisions. Experimental results that support this assertion are provided.
    IEEE Transactions on Multimedia 11/2005; · 1.93 Impact Factor
  • Conference Proceeding: Lip feature extraction based on audio-visual correlation
    ABSTRACT: In this paper, the lip feature that has the highest correlation with audio features is investigated. Audio features are selected as Mel Frequency Cepstral Coefficients (MFCC) of the audio signal. Three different lip features are considered for the visual lip information, where these features are 2D DCT coefficients of the intensity based image and the optical flow vectors within the lip region, and the distances between pre-defined points on the lip contour which carries the lip shape information. In this study, we present two techniques based on class conditional probability analysis and canonical correlation analysis to estimate and compare the correlations between audio feature and each lip feature. The lip feature, which has the highest correlation to audio features, is identified among the above lip features. Isolation of lip features, which are highly correlated with audio signal, can be used for audio-visual speech recognition, audio-visual lip synchronization and estimation of lip shapes using audio signal for visual synthesis.
    European Signal Processing Conference (EUSIPCO), Antalya, Turkey; 09/2005
  • Conference Proceeding: Audio-visual Correlation Analysis For Lip Feature Extraction
    ABSTRACT: Not Available
    Signal Processing and Communications Applications Conference, 2005. Proceedings of the IEEE 13th; 06/2005

Institutions

  • 2009
    • Philips
      Eindhoven, North Brabant, Netherlands
  • 2008
    • University of California, Santa Barbara
      • Department of Electrical and Computer Engineering
      Santa Barbara, CA, USA
  • 2003–2008
    • Koc University
      • College of Engineering
      İstanbul, Istanbul, Turkey
  • 1993–2007
    • Bogazici University
      • Department of Electrical and Electronic Engineering
      İstanbul, Istanbul, Turkey
  • 1999
    • French National Centre for Scientific Research
      Lyon, Rhone-Alpes, France