Véronique Aubergé

Stockholm University, Stockholm, Stockholm, Sweden


Publications (59) · 7.92 Total Impact

  • Petri Laukka, Nicolas Audibert, Véronique Aubergé
    ABSTRACT: We examined what determines the typicality, or graded structure, of vocal emotion expressions. Separate groups of judges rated acted and spontaneous expressions of anger, fear, and joy with regard to their typicality and three main determinants of the graded structure of categories: category members' similarity to the central tendency of their category (CT); category members' frequency of instantiation, i.e., how often they are encountered as category members (FI); and category members' similarity to ideals associated with the goals served by its category, i.e., suitability to express particular emotions. Partial correlations and multiple regression analysis revealed that similarity to ideals, rather than CT or FI, explained most variance in judged typicality. Results thus suggest that vocal emotion expressions constitute ideal-based goal-derived categories, rather than taxonomic categories based on CT and FI. This could explain how prototypical expressions can be acoustically distinct and highly recognisable but occur relatively rarely in everyday speech. (See the regression sketch after the publication list.)
    Cognition and Emotion 08/2011; 26(4):710-9. · 2.52 Impact Factor
  • ABSTRACT: Prosodic attitudes, or social affects, are a main part of face-to-face interaction and are linked to the language through the culture. This paper presents a study of prosodic attitudes in Vietnamese, a tonal language. Perception experiments on 16 Vietnamese attitudes were carried out with Vietnamese and French participants. The results revealed perception differences between native and non-native listeners. As attitudinal expressions are partially carried by speech prosody, an analysis was also carried out in order to better understand why these attitudes are recognized or confused, and to bring out some prosodic characteristics of Vietnamese social affects.
    01/2011;
  • ABSTRACT: Prosodic attitudes (social affects) are strongly linked to the language through the culture, and are a main part of face-to-face interaction. For description and modeling, as well as for applications such as translation, language learning or synthesis, a cross-cultural approach is therefore relevant. This paper presents a cross-cultural perception study of audio-visual prosodic attitudes in Vietnamese, an under-resourced tonal language. Based on an audio-visual corpus of 16 attitudes, perception experiments were carried out with Vietnamese and French participants: firstly, to understand the contribution of the audio and visual modalities to affective communication; secondly, to measure perceptually how native and non-native listeners recognize and confuse the Vietnamese attitudes. The results reveal both cultural specificities and cross-cultural common attitudes in Vietnamese.
    01/2010;
  • ABSTRACT: This paper investigates differences in the perception of six culturally encoded French social affects between Japanese and native listeners. Half of the Japanese listeners had followed six months of training on both the prosodic and facial realization of French social affects. Audio-visual stimuli were presented to listeners, who guessed the speaker's intended attitude and rated the intensity of the expressiveness. Results showed that the trained Japanese listeners recognized the attitudes better than the untrained ones; however, culturally specific attitudes (i.e. suspicious irony and obviousness) were confused by Japanese listeners, including trained ones. Facial cues seem to be more salient than audio ones.
    01/2010;
  • Nicolas Audibert, Véronique Aubergé, Albert Rilliard
    Technique et Science Informatiques. 01/2010; 29:833-857.
  • A. Vanpé, Véronique Aubergé
    Technique et Science Informatiques. 01/2010; 29:807-832.
  • Nicolas Audibert, Véronique Aubergé, Albert Rilliard
    ABSTRACT: This paper presents the first results of an acoustic analysis of 12 pairs of monosyllabic acted vs. spontaneous expressions of satisfaction, irritation and anxiety produced by 4 subjects, which had been discriminated and rated for emotional intensity differences in previous perceptual experiments. Acoustic features were extracted from the utterances in each pair, compared, and correlated with the perceptual ratings, showing mainly significant correlations between the overall F0 level difference within a pair and the perceived emotional intensity difference, but failing to explain all the observed variability of the discrimination scores. The influence of the F0 contour shape of selected stimuli on perceptual discrimination scores and perceived emotional intensity is discussed. (See the correlation sketch after the publication list.)
    01/2010;
  • ABSTRACT: Previous work examined the contribution of the audio and visual modalities to the perception of Japanese social affects by adults. The results showed that audio and visual information both contribute to the perception of culturally encoded expressions, and show an important synergy when presented together. Multimodal presentation allows foreign adult listeners to recognize culturally encoded expressions of Japanese politeness which they cannot recognize from audio-only stimuli. The current work analyzes the recognition performance of politeness expressions by Japanese children aged 13 to 14 years. Stimuli, based on one sentence with an affectively neutral meaning, are performed with five different expressions of politeness. Subjects listen to each stimulus three times and judge the intended message of the speaker. The stimuli are presented as audio-only, visual-only, and audio-visual. Listeners rate the social status of the hearer and the degree of politeness on a nine-point scale ranging from polite to impolite. The results are analyzed to capture the relative ability of adults and children to use both modalities to recognize social affects. [This work was supported in part by the Japanese Ministry of Education, Science, Sport, and Culture, Grant-in-Aid for Scientific Research (C) (2007-2010): 19520371 and SCOPE (071705001) of the Ministry of Internal Affairs and Communications (MIC), Japan.]
    The Journal of the Acoustical Society of America 05/2009; 125(4):2754. · 1.65 Impact Factor
  • ABSTRACT: Whereas several studies have explored the expression of emotions, little is known about how the visual and audio channels are combined during the production of what we call the more controlled social affects, for example "attitudinal" expressions. This article presents a perception study of the audiovisual expression of 12 Japanese and 6 French attitudes, in order to understand the contribution of the audio and visual modalities to affective communication. The relative importance of each modality in the perceptual decoding of the expressions of four speakers is analyzed as a first step towards a deeper comprehension of their influence on the expression of social affects. Then, the audiovisual productions of two speakers (one for each language) are analyzed acoustically (F0, duration and intensity) and visually (in terms of Action Units), in order to relate these objective parameters to listeners' perception of the social affects. The most pertinent objective features, either acoustic or visual, are then discussed from a bilingual perspective: for example, the relative influence of fundamental frequency on attitudinal expression in both languages is discussed, and the importance of a certain aspect of the voice quality dimension in Japanese is underlined.
    Language and Speech 02/2009; 52(Pt 2-3):223-43. · 0.82 Impact Factor
  • INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009; 01/2009
  • INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009; 01/2009
  • ABSTRACT: Irregular phonation can serve as a cue to segmental contrasts and prosodic structure as well as to the affective state and identity of the speaker. Thus algorithms for transforming between voice qualities, such as regular and irregular phonation, may contribute to building more natural sounding, expressive and personalized speech synthesizers. We describe a semiautomatic transformation method that introduces irregular pitch periods into a modal speech signal by amplitude scaling of the individual cycles. First the periods are separated by windowing, then multiplied by scaling factors, and finally overlapped and added. Thus, amplitude irregularities are introduced via boosting or attenuating selected cycles. The abrupt, substantial changes in cycle lengths that are characteristic of naturally-occurring irregular phonation can be achieved by removing (scaling to zero) one or more consecutive periods. A freely available graphical tool has been developed for copying stylized pulse patterns (glottal pulse spacings and amplitudes) from an irregular recording to a regular one, allowing the scaling factors to be refined and the waveform regenerated interactively. We present the effects of the transformation on harmonic structure, and perceptual test results showing that transformed signals are similar to natural irregular recordings in both roughness and naturalness. (See the cycle-scaling sketch after the publication list.)
    The Journal of the Acoustical Society of America 06/2008; 123(5):3886. · 1.65 Impact Factor
  • Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco; 01/2008
  • ABSTRACT: The aim of the present work is to investigate how Japanese listeners recognize 12 audio-visual prosodic attitudes of Japanese. Significant effects of the two speakers and of the three presentation modalities were observed. The audio-visual condition generally showed the best recognition scores, and interesting behavior was observed for the audio and visual modalities. For the first speaker, the attitudes were regrouped into three higher-level perceptual categories: polite expressions, attitudes of query, and expressions imposing one's own opinion. The attitudes of kyoshuku and surprise are particularly well recognized from visual information.
    01/2008;
  • Anne Vanpé, Véronique Aubergé
    ABSTRACT: The interaction between two humans (or, by projection, between a human and a computer) is cadenced by turn-taking and speech acts. Nevertheless, the communication flow of each participant never stops: the time outside the turn-taking is continuously filled with expressions that concern the processes of listening, understanding and reacting ("feedback" in the backchannel sense (4)). Such information is very varied (3,4) and is expressed voluntarily or involuntarily: mental states (concentration, "Feeling of Knowing" (4)), intentions, attitudes (politeness, satisfaction, agreement...), emotions (disappointment, irritation...) and moods (stress, relaxation). We call such expressions of mental and affective states the "Feeling of Thinking".
    The perceptual study presented here aimed to ascertain the relevance of non-verbal audio-visual stimuli chosen from a large spontaneous expressive French database (Sound Teacher/Ewiz (1)): two female subjects (introvert S and extravert T), selected among 17, were tricked in a Wizard of Oz setup that made them react with strongly contrasted affective and cognitive states during a human-computer interaction. The out-of-turn audio-visual signals of the two subjects, together with their self-annotation in terms of affects and other feelings, were edited in order to precisely segment and then extract stimuli. The selected stimuli are supposed to be minimal icons representative of the naive self-annotation labels given by the subjects. Even though many non-speech sounds appear informative, only silent visual stimuli were chosen for this experiment. Ten self-annotation labels were retained for subject T ("hesitant", "stressed", "ill at ease/worried", "anxious/oppressed", "at ease/more relaxed", "quiet/fine", "a bit lost/perplexed", "disappointed", "astonished", "concentrating") and nine for subject S ("not concentrating and feeling like laughing", "deriding my results", "listening with attention", ""holding over me" by the software", "stressed", "feeling like laughing and answering by chance", "concentrating and answering by chance", "concentrating", "disappointed"). For each selected minimal gestural icon (lasting from 0.5 to 4 seconds), a static picture supposed to be typical was extracted. The experiment aimed to compare how efficiently dynamic vs. static icons convey the information referenced by the labels. Since most studies on this topic concern emotions, and since it has been shown, in the many works around Ekman's theory, that the face is very informative and that its upper and lower parts do not carry the same information, the dynamic and static stimuli were presented in three balanced conditions: upper part of the face ("upper"), lower part of the face ("lower") and whole face ("whole"). Two identical perceptual tests were implemented, the first with the dynamic stimuli and the second with static pictures extracted from each dynamic stimulus. Each stimulus was presented once in each condition in every session of the perceptual test. Sixteen judges took both tests, which consisted of closed choices among the self-annotation labels. Presentation time was not limited for static stimuli, while dynamic stimuli could be replayed for 8 seconds.
    The main result is that there is no additivity between the upper and lower parts of the face. No part of the face alone contains sufficient information, whatever the label, and this holds especially for the expressions of mental states (even if, for example, "concentrating" and "feeling like laughing and answering by chance" carry more information in the upper part of the face and "stressed" in the lower part). Moreover, this sharing between the lower and upper parts of the face can change depending on whether the stimulus is dynamic or an extracted static picture. More globally, the gain from static to dynamic presentation seems to depend strongly on the nature of the information: for some stimuli, a below-chance identification becomes a clearly above-chance score, whereas for others the dynamics seem to be a disturbance (recall that the dynamic presentation is the ecological one, and leaves the judge free to use static processing when observing the natural visual signal).
    Speech and Face-to-Face communication. 01/2008;
  • Nicolas Audibert, Véronique Aubergé, Albert Rilliard
    ABSTRACT: This paper reports how acted vs. spontaneous expressive speech can be discriminated by human listeners, with varying performance depending on the listener (in line with preliminary results for amusement by [3]). The perceptual material was taken from the Sound Teacher/E-Wiz corpus [1], for 4 French-speaking actors trapped into spontaneous expressive mono-word utterances and then acting immediately afterwards, in an acting protocol supposed to be very convenient for them. Pairs of acted vs. spontaneous stimuli expressing affective states related to anxiety, irritation and satisfaction were rated by 33 native French listeners in audio-only, visual-only and audio-visual conditions. In the visual-only condition, 70% of listeners were able to identify acted vs. spontaneous pairs above chance level, compared to 78% in the audio-only condition and up to 85% in the audio-visual condition. Globally, a highly significant subject effect confirms the hypothesis of a varied affective competence for separating involuntary vs. simulated affects [2]. One feature used by listeners in the acoustic discrimination task may be the perceived emotional intensity, in accordance with the measurement of this intensity level for the same stimuli in a previous perception experiment by Laukka et al. [9]. (See the chance-level sketch after the publication list.)
    01/2008;
  • Nicolas Audibert, Véronique Aubergé, Albert Rilliard
    ABSTRACT: The cognitive processing involved in the decoding of emotional expressions vs. attitudes in speech, as well as the modeling of emotional prosody as contours vs. gradual cues, are debated questions. This work aims at measuring the anticipated perception of emotions on minimal linguistic units, to evaluate whether the underlying processing is compatible with the hypothesis of gradient contour processing. Monosyllabic speech stimuli extracted from an expressive corpus and expressing anxiety, disappointment, disgust, disquiet, joy, resignation, sadness and satisfaction were gradually presented in a gating experiment. Results show that the identification of most expressions along the gates follows a linear pattern typical of contour-like processing, while expressions of satisfaction present distinct gradient values that make an early identification of affective values possible. (See the gating sketch after the publication list.)
    01/2007;
  • Nicolas Audibert, Véronique Aubergé
    ABSTRACT: This work aims at measuring the anticipated perception of emotions on minimal linguistic units, to evaluate whether the underlying cognitive processing is compatible with the hypothesis of gradient contours. Selected monosyllabic stimuli extracted from an expressive corpus and expressing anxiety, disappointment, disgust, disquiet, joy, resignation, sadness and satisfaction were gradually presented to naive judges in a gating experiment. Results strengthen the hypothesis of gradient processing by showing that the identification of most expressions along successive gates follows a linear pattern typical of contour-like processing, while expressions of satisfaction present distinct gradient values that make an early identification of affective values possible.
    Affective Computing and Intelligent Interaction, Second International Conference, ACII 2007, Lisbon, Portugal, September 12-14, 2007, Proceedings; 01/2007
  • Véronique Aubergé, Nicolas Audibert, Albert Rilliard
    Revue d'Intelligence Artificielle. 01/2006; 20:499-527.
  • Takaaki Shochi, Véronique Aubergé, Albert Rilliard
    ABSTRACT: The attitudes of the speaker during a verbal interaction are affects linked to the speaker's intentions, and are built by the language and the culture. They are a very large part of the affects expressed during an interaction, and are voluntarily controlled. This paper describes several experiments which show that some attitudes belong to both Japanese and French and are implemented in perceptively similar prosody, but that some Japanese attitudes do not exist in French and/or are wrongly decoded by French listeners. Results are presented for 12 attitudes and three levels of language proficiency (naive, beginner, intermediate). It must particularly be noted that French listeners naive in Japanese can recognize admiration, authority and irritation very well; that they do not discriminate Japanese questions from declarations before the intermediate level; and that extreme Japanese politeness is interpreted as impoliteness by French listeners, even when they speak Japanese at a good level.
    01/2006;
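
The typicality study above (Laukka, Audibert and Aubergé, Cognition and Emotion 2011) compares three candidate determinants of graded structure using partial correlations and multiple regression. The regression sketch below only illustrates the general shape of such an analysis; the data, effect sizes and variable names are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of a typicality regression: judged typicality is
# regressed on similarity to the central tendency (CT), frequency of
# instantiation (FI), and similarity to ideals. Invented data, not the paper's.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 120  # hypothetical number of rated expressions

ct = rng.normal(size=n)       # similarity to central tendency (z-scored)
fi = rng.normal(size=n)       # frequency of instantiation (z-scored)
ideals = rng.normal(size=n)   # similarity to ideals (z-scored)
typicality = 0.1 * ct + 0.1 * fi + 0.7 * ideals + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([ct, fi, ideals]))
model = sm.OLS(typicality, X).fit()
print(model.params)    # which determinant carries the most weight
print(model.rsquared)  # proportion of variance in typicality explained
```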
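
The acoustic analysis of acted vs. spontaneous pairs above (Audibert, Aubergé and Rilliard) correlates the F0 level difference within each pair with the perceived emotional intensity difference. The correlation sketch below shows only that final step, under the assumption that both per-pair difference measures have already been extracted; the numbers are invented placeholders, not the paper's data.

```python
# Hypothetical per-pair measurements for 12 acted/spontaneous pairs:
# difference in overall F0 level and difference in perceived emotional intensity.
import numpy as np
from scipy.stats import pearsonr

f0_level_diff = np.array([2.1, 0.4, 3.0, 1.2, 0.8, 2.6, 1.9, 0.3, 2.2, 1.1, 0.6, 2.8])
intensity_diff = np.array([1.8, 0.2, 2.5, 1.0, 0.9, 2.1, 1.6, 0.5, 2.0, 0.8, 0.4, 2.4])

r, p = pearsonr(f0_level_diff, intensity_diff)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```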
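
The irregular-phonation abstract above describes a transformation that separates pitch periods by windowing, multiplies them by scaling factors and overlap-adds them back. The cycle-scaling sketch below is a rough reconstruction of that idea, not the authors' implementation: it assumes pitch marks and scaling factors are already given, and uses pitch-synchronous Hann windows spanning two periods, which is one common choice.

```python
import numpy as np

def scale_cycles(signal, pitch_marks, scale_factors):
    """Amplitude-scale individual pitch cycles and overlap-add them.

    Each frame is a Hann window spanning two pitch periods, centred on a
    pitch mark; adjacent frames overlap by roughly one period, so with all
    factors equal to 1 the signal is approximately reconstructed. A factor
    of 0 removes a cycle, producing the abrupt cycle-length changes typical
    of irregular phonation.
    """
    out = np.zeros(len(signal), dtype=float)
    for i, factor in enumerate(scale_factors):
        left, right = pitch_marks[i], pitch_marks[i + 2]
        frame = signal[left:right].astype(float)
        out[left:right] += factor * np.hanning(len(frame)) * frame
    return out

# Toy usage: a synthetic 100 Hz "modal" signal with pitch marks every 80
# samples; one cycle is attenuated and the next removed to add irregularity.
fs = 8000
t = np.arange(fs) / fs
modal = np.sin(2 * np.pi * 100 * t)
marks = np.arange(0, len(modal), 80)
factors = np.ones(len(marks) - 2)
factors[20] = 0.3  # attenuate one cycle
factors[21] = 0.0  # remove the following cycle entirely
irregular = scale_cycles(modal, marks, factors)
```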
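
The acted vs. spontaneous discrimination study above reports, per modality, the proportion of listeners who identified the pairs above chance level. The paper does not say how that criterion was computed; the chance-level sketch below shows one common per-listener check, a one-sided binomial test against a 0.5 chance rate, on hypothetical counts.

```python
# Hypothetical above-chance check for one listener. Each trial is a forced
# choice between the acted and the spontaneous member of a pair, so the
# chance rate is 0.5. Requires scipy >= 1.7 for binomtest.
from scipy.stats import binomtest

correct, trials = 19, 24  # invented listener score
result = binomtest(correct, trials, p=0.5, alternative="greater")
print(f"{correct}/{trials} correct, one-sided p = {result.pvalue:.3f}")
```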
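
The two gating studies above contrast a linear, contour-like rise of identification across gates with early, gradient-based identification. The gating sketch below illustrates that contrast with a simple least-squares line fit on hypothetical identification rates; neither the rates nor the criterion come from the papers.

```python
# Hypothetical identification rates across six successive gates for two
# expressions: a near-linear rise (contour-like) vs. an early jump (gradient).
import numpy as np

gates = np.arange(1, 7)
rates_by_expression = {
    "anxiety":      np.array([0.15, 0.28, 0.42, 0.55, 0.70, 0.82]),
    "satisfaction": np.array([0.60, 0.72, 0.78, 0.80, 0.83, 0.85]),
}

for name, rates in rates_by_expression.items():
    slope, intercept = np.polyfit(gates, rates, 1)
    fitted = slope * gates + intercept
    ss_res = np.sum((rates - fitted) ** 2)
    ss_tot = np.sum((rates - rates.mean()) ** 2)
    print(f"{name}: slope = {slope:.2f}, linear R^2 = {1 - ss_res / ss_tot:.2f}")
```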

Publication Stats

237 Citations
7.92 Total Impact Points

Institutions

  • 2011
    • Stockholm University
      • Department of Psychology
      Stockholm, Stockholm, Sweden
  • 2008–2011
    • GIPSA-lab
      Grenoble, Rhône-Alpes, France
    • Budapest University of Technology and Economics
      • Department of Telecommunications and Media Informatics
      Budapest, Budapest főváros, Hungary
    • French National Centre for Scientific Research
      Paris, Île-de-France, France
  • 2010
    • Université d'Avignon et des Pays de Vaucluse
      Avignon, Provence-Alpes-Côte d'Azur, France
    • University of Grenoble
      Grenoble, Rhône-Alpes, France
  • 2009
    • Computer Sciences Laboratory for Mechanics and Engineering Sciences
      Orsay, Île-de-France, France
    • Kumamoto University
      Kumamoto, Kumamoto Prefecture, Japan
  • 1995–2007
    • Université Stendhal - Grenoble 3
      Saint-Martin-d'Hères, Rhône-Alpes, France