Emotional Communication in the Human Voice
William Forde Thompson
Macquarie University, Sydney
Bill.Thompson@psy.mq.edu.au
ABSTRACT
The human voice is often used to communicate emotional
messages. These messages may be conveyed during
speaking and singing. This paper describes research on
emotional communication in the human voice. I first
describe the qualities of speech that contribute to
emotional communication. The research involved creating
a new battery of 96 phrases spoken with the intention of
communicating each of 6 emotional messages: happiness,
sadness, fear, irritation, tenderness, and no emotion. The
battery was then validated using a group of listeners and
subjected to acoustic analyses. Acoustic attributes
associated with different emotions are discussed.
I next describe research on the facial expressions that
accompany emotional communication by the human voice.
Facial and gestural cues are important for a number of
reasons. First, they play a crucial role in reinforcing,
amplifying, or otherwise modifying the transmission of
emotional signals, and it is interesting to know the extent
of their influence on our experiences of speech, music,
and emotional communication. Second, they exemplify
the multimodal nature of emotional communication in
speech and music. Do listeners attend to facial
expressions while listening to someone singing, and do
those facial expressions influence their experiences of the
music? Third, it is important to understand the nature of
facial expressions themselves. Facial expressions are a
point of convergence for emotional signals in the music as well as for many of its structural attributes. They convey phonetic content, musical dissonance, interval size, closure, tonality, and emotional intentions. The significance of this merging of structural
and emotional signals in facial movements is discussed.
INTRODUCTION
How do people communicate emotional messages while
speaking or singing? What are the vocal cues or signals
that are conveyed, and how can listeners understand or
“decode” those signals? Emotional communication does
not involve the use of individual defining features that
unambiguously communicate an emotion. Rather,
emotional communication works by using multiple cues
that are each associated probabilistically with different
emotional interpretations. These signals may include pitch
cues (speaking in a high pitch register, or varying the
pitch of one’s speaking voice), timbre (smooth or rough
speaking voice), intensity (speaking loudly or softly),
facial expressions (smiling, frowning, furrowing one’s
eyebrows, raised eyebrows), and bodily gestures.
Speakers and singers express themselves by providing
perceivers with a large number of signals in the form of
acoustic attributes (tone of voice), facial expressions, and
body movements. The result is a large amount of
information that perceivers attempt to “read” in order to
understand emotional intentions.
No one cue is sufficient for accurate emotional
communication. Rather, emotional understanding is a
process of evaluating the accumulation of many signals
and inferring the most likely emotion being expressed.
Because we are never certain about the emotion being
expressed, the mechanisms used for emotional decoding
must be probabilistic in nature. That is, the mechanism
must consider a large amount of information and generate
a set of hypotheses about the emotions that are expressed.
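As a rough illustration of such probabilistic decoding, the sketch below combines several hypothetical cue-emotion likelihoods into a ranked set of emotion hypotheses. The cue names and probability values are placeholders chosen for illustration; they are not estimates from our data.

```python
# A minimal sketch of probabilistic emotion decoding from multiple cues.
# Cue names and likelihood values are illustrative placeholders only.

EMOTIONS = ["happiness", "sadness", "fear", "irritation", "tenderness", "neutral"]

# P(cue is observed | emotion): each cue is ambiguous on its own.
LIKELIHOODS = {
    "high_pitch":      {"happiness": 0.7, "sadness": 0.1, "fear": 0.8,
                        "irritation": 0.5, "tenderness": 0.2, "neutral": 0.3},
    "large_pitch_var": {"happiness": 0.8, "sadness": 0.2, "fear": 0.3,
                        "irritation": 0.6, "tenderness": 0.2, "neutral": 0.3},
    "fast_rate":       {"happiness": 0.6, "sadness": 0.1, "fear": 0.8,
                        "irritation": 0.6, "tenderness": 0.2, "neutral": 0.4},
    "high_intensity":  {"happiness": 0.5, "sadness": 0.1, "fear": 0.7,
                        "irritation": 0.8, "tenderness": 0.1, "neutral": 0.3},
}

def decode(observed_cues):
    """Combine cues (naively assuming independence) and return a ranked
    list of hypotheses about the emotion being expressed."""
    scores = {}
    for emotion in EMOTIONS:
        p = 1.0 / len(EMOTIONS)            # flat prior over the six emotions
        for cue in observed_cues:
            p *= LIKELIHOODS[cue][emotion]
        scores[emotion] = p
    total = sum(scores.values())
    return sorted(((e, p / total) for e, p in scores.items()),
                  key=lambda pair: pair[1], reverse=True)

# A fast, loud, high-pitched utterance: fear and irritation dominate,
# but no single cue settles the interpretation on its own.
print(decode(["high_pitch", "fast_rate", "high_intensity"]))
```

No single cue decides the outcome; each observation merely shifts the relative plausibility of the competing hypotheses.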
To understand the processes by which we communicate
and understand emotion in the human voice, I will
describe two research projects. In the first, we developed
a new battery for evaluating human sensitivity to
emotional communication in the voice. We recorded 96
phrases spoken by 10 actors and other speakers who
spoke with the intention of communicating each of 6
emotional messages: happiness, sadness, fear, irritation,
tenderness, and no emotion. The battery was then
validated using a group of listeners and subjected to
acoustic analyses.
The second research project entailed a large number of
experiments on the nature of the facial expressions used
during vocal emotional communication. This work was
done in collaboration with many colleagues, including
Frank Russo (Ryerson University), Steven Livingstone
(McGill University), Phil Graham (Queensland University
of Technology) and many others. Our most recent
research on the topic is being conducted as part of a grant
from the Australian Research Council, which I was
awarded in 2009 in partnership with Caroline Palmer
(McGill University).
To anticipate the outcomes, our findings illustrate that
facial expressions greatly assist in the communication of
emotions, but they also communicate other kinds of
information about language and music. In effect, the
movements of the face are a point of convergence for
language, musical structure, and human emotion.
SENSORY CUES: BACKGROUND
How is emotional communication accomplished? One of
the most important ways is through the transmission of
sensory cues. Sensory cues are features or statistical
properties of music and speech. For example, the pitch
height of speech that communicates joy is higher than the
pitch height of speech that conveys sadness. Similarly, the
variability in pitch is greater in angry speech than in
tender speech. Why did we develop sensitivity to such
statistical properties?
Our sensitivity to statistical properties of speech and
music was developed out of necessity as an adaptation
that evolved long ago in our ancestral history. We live in a
probabilistic and uncertain world, and most information
cannot be determined through the detection of defining
properties. Adaptation to a probabilistic world requires the
development of mechanisms that draw inferences from
probabilistic, uncertain evidence. These statistical
properties in the environment are called sensory cues.
A sensory cue is defined as any statistic or signal that can
be extracted from sensory input, which helps to indicate
the state of a property of the world that the perceiver is
interested in. Sensory cues include visual features,
auditory cues, olfactory cues, kinaesthetic cues, and so on.
In addition to sensory cues, other mechanisms of
emotional communication include (a) expectancies, which
generate surprise or relaxation responses on a moment-to-moment basis; (b) learning and memory, in which emotions arise by association; and (c) physiological
effects, whereby certain acoustic attributes directly affect
physiological states.
It is important to keep the probabilistic nature of
emotional communication in mind when designing
experiments on emotional communication. If very simple
tasks are designed, then it may be possible for participants
to use one or two sensory cues to make an accurate
judgement about an emotional intention. However, such
results are misleading because they imply that individual
sensory cues are uniquely linked to specific emotions.
Under normal circumstances there is a complex
relationship between our perceptions of the environment
and the sensory cues available. Sensory cues are
inherently ambiguous for several reasons. One reason is
that perceivers are often unable to judge the precise level of any one cue accurately: loudness is
difficult to judge in a noisy environment; lapses in
attention might mean certain cues are missed or misheard;
and people vary in their sensitivity and attention to
different attributes. Another reason is that communicators
are often unable to control attributes precisely, and many
factors can influence their expressions of emotion. Sad
speech is often soft, but in a noisy environment a speaker
may need to balance emotional communication with
intelligibility.
On balance, pitch-related cues are quite influential for
emotional communication. Pieces written in a major key
are described as “happy”, “light”, “bright” and “cheerful”;
pieces written in a minor key are described as “restless”,
“sad” and “mystical”. Large pitch variability suggests
excitement, happiness, pleasantness, surprise, and activity.
Small pitch variability is associated with negative
emotions like disgust, anger and fear. Lower average pitch
height is associated with sadness and boredom; higher
pitch tends to imply happiness, serenity, anger and fear.
Loudness and timing are also important, but there is never
a unique association between any one sensory cue and a
particular emotion. Every cue is highly ambiguous.
Juslin and Laukka (2003) made similar observations in
their meta-analysis of emotional cues in music and speech
prosody. Disregarding verbal cues (the words themselves)
and considering only vocal cues (tone of voice, or speech
prosody), they confirmed that the sensory cues associated
with a specific emotion are never uniquely associated with
that emotion. Rather, they are inherently ambiguous.
Many cues are associated with several possible emotions.
This ambiguity is reflected in overall decoding rates for
vocal or prosodic cues in speech. Juslin and Laukka (2003)
reported that the ability to decode emotions from vocal
cues was rarely above 70% across studies (the words were
emotionally neutral). For example, vocal expressions of
sadness were decoded correctly an average of 63% of the time; the corresponding rates were 60% for fear, 58% for anger, 51% for happiness, and 40% for disgust.
Although these decoding rates are seemingly low, it
should be emphasized that under normal circumstances
people have far more than just vocal or prosodic cues:
they have the words to go along with those cues, facial
expressions, body movements, and an understanding of
the context in which the person is speaking. The modest
decoding rates reported by Juslin and Laukka (2003)
reflect only the ability to decode emotional tone of voice
in the absence of all other information.
EMOTION IN THE VOICE
As a further investigation of emotional communication in
the human voice, we developed a new battery for
evaluating sensitivity to emotional speech. Our battery
was developed to estimate the ability of listeners to
decode different types of emotional messages, and to
examine individual differences in sensitivity to emotion in
the human voice. Some individuals are highly sensitive to
emotional connotations in speech, whereas others have
difficulty appreciating the emotional signals conveyed in
speech.
The outcome of our research project was an emotional sensitivity battery consisting of 96 spoken phrases that
communicated each of six different emotions: joy, sadness,
fear, irritation, tenderness, and no emotion. The verbal
content of the spoken phrases was always neutral or
semantically ambiguous (e.g., “the boy and girl went to
the store to fetch some milk for lunch”). There were an
equal number of male and female speakers.
To create our battery, we first recruited a number of
trained actors and laypersons to speak in emotional ways
while they were recorded in a professional recording
studio. Speakers were recorded individually. To help our
speakers communicate each emotion, they were asked to
read an emotional scenario on a piece of paper. The
scenario was designed to “prime” the speakers
emotionally; that is, to enhance their ability to
communicate the emotion. Once they read the scenario,
they were asked to speak one of seven different sentences.
Each of the seven sentences was in English and contained 14 syllables. Speakers could repeat a sentence as many times as they liked until they were satisfied with their portrayal of the emotion.
In total, our recording engineer recorded 462 spoken
utterances. After recording all these phrases, we then
reduced the number of phrases to 150 by having four
expert listeners listen to the entire battery and rate their
quality and effectiveness at emotional communication.
We then asked a number of participants to judge the
emotion intended by each of the 150 phrases. Based on
their judgments, we accepted the best 8 male and 8 female
utterances for each of the six emotions, producing a final
battery of 96 spoken phrases. We then ran acoustic
analyses on the final set of 96 phrases.
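The selection step can be sketched roughly as follows; the record fields and the accuracy measure are assumptions made for illustration, not the actual data format used in the study.

```python
# Sketch of the selection step: keep the 8 best-rated utterances per
# (emotion, speaker gender) cell, giving 6 emotions x 2 genders x 8 = 96.
# Field names and the accuracy measure are assumed for illustration.
from collections import defaultdict

def select_battery(utterances, per_cell=8):
    """utterances: list of dicts with 'emotion', 'gender', and
    'mean_accuracy' (proportion of listeners choosing the intended emotion)."""
    cells = defaultdict(list)
    for u in utterances:
        cells[(u["emotion"], u["gender"])].append(u)
    battery = []
    for items in cells.values():
        items.sort(key=lambda u: u["mean_accuracy"], reverse=True)
        battery.extend(items[:per_cell])
    return battery
```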
Speakers were recorded in a soundproof booth within the Department of Media, Music, and Cultural Studies recording studios at Macquarie University. They spoke into a Rode K2 condenser microphone and were recorded with Cubase SX 4 (Prochak, 2004) at a sample rate of 44.1 kHz and a bit depth of 16 bits (mono).
Figure 1 illustrates the mean decoding rates for each of
the six emotions in the battery. The Figure shows that
decoding rates were considerably higher than the mean
decoding rates reported in the meta-analysis by Juslin and
Laukka (2003), with a mean decoding rate across
emotions of 88% correct. Our measures of reliability
showed that there was a high level of internal consistency,
with a Cronbach’s alpha level of .92. As expected, there
were significant differences in decoding rates for the
different emotions.
Figure 1. Mean decoding rates for each of the six emotions,
as judged by 32 adult participants.
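For readers unfamiliar with the reliability measure, the sketch below shows how Cronbach's alpha is computed from a participants-by-items score matrix; the data generated here are random placeholders used only to demonstrate the calculation, not scores from our battery.

```python
# Cronbach's alpha from a (participants x items) score matrix.
# The demo data are random placeholders, not scores from the battery.
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = participants, columns = items (phrases)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of participants' totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
demo = rng.integers(0, 2, size=(32, 96))         # 32 listeners x 96 phrases
print(cronbach_alpha(demo))                      # near 0 for random data
```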
We next ran acoustic analyses on the spoken utterances in
the emotion battery. Among the many differences in
acoustic attributes of the emotional speech samples, we
found significant differences in speech rate, intensity,
pitch height, and pitch variability.
Figure 2. Acoustic analysis of rate (left) and intensity
(right) for the speech samples in the battery.
Figure 2 shows the average speech rate (syllables per
second) and intensity (decibels) for each type of emotion
that was communicated in the battery. As seen in the
figure, the emotion of fear was associated with a fast rate
of speech and high intensity. In contrast, the emotions of tenderness and sadness were characterized by a slow speaking
rate and low intensity. The other emotions varied in rate
and intensity.
Figure 3. Acoustic analysis of pitch height (left) and
standard deviation in pitch (right) for the speech samples
in the battery.
Figure 3 illustrates the average pitch height and pitch
variability for the six types of emotional speech samples.
As seen in the Figure, the emotion of fear was associated
with a high pitch height, but little variability in pitch. In
contrast, the emotion of happiness was characterized by a
high pitch height and high variability in pitch.
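As an illustration of how these four attributes might be measured, the sketch below extracts speech rate, intensity, mean pitch, and pitch variability from a recording using the open-source librosa library. The tools and analysis settings are assumptions; they are not those used in the original study.

```python
# Illustrative extraction of the four attributes shown in Figures 2 and 3.
# Tools and settings are assumptions, not those used in the original study.
import numpy as np
import librosa

N_SYLLABLES = 14  # every sentence in the battery contains 14 syllables

def analyze_phrase(path):
    y, sr = librosa.load(path, sr=None, mono=True)
    duration = len(y) / sr

    # Speech rate: syllables per second.
    rate = N_SYLLABLES / duration

    # Intensity: mean RMS energy expressed in decibels.
    rms = librosa.feature.rms(y=y)[0]
    intensity_db = float(np.mean(librosa.amplitude_to_db(rms, ref=1.0)))

    # Pitch height and variability from an F0 track (NaN = unvoiced frames).
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = f0[~np.isnan(f0)]

    return {"rate_syll_per_s": rate,
            "intensity_db": intensity_db,
            "pitch_mean_hz": float(np.mean(f0)),
            "pitch_sd_hz": float(np.std(f0))}
```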
Figures 2 and 3 illustrate that the communication of any
one emotion is characterized by a pattern of acoustic
signals that must be evaluated in a probabilistic manner.
No one attribute can be used to determine the emotion
unambiguously. The figures illustrate just four of the
attributes of the speaking voice that are involved in the
communication of emotion. There are many other such
attributes that are used for emotional communication.
EMOTION IN THE FACE
In exploring the imperfect and probabilistic nature of cue transmission, it is important to consider the full range of information provided to decoders. Acoustic
information provides one part of the picture (the sounded
voice), but under many circumstances that information
does not adequately communicate the emotional signal. In
such cases, perceivers must grasp at other signals in order
to maximize their ability to understand the emotional
intention. These other signals include visual signals
arising from the facial expressions and body movements
of speakers or singers.
The use of facial expressions and gestures during emotional
communication has been the topic of a series of
investigations in our laboratory. Facial expressions are
intriguing for a number of reasons. First, they play a
crucial role in reinforcing, amplifying, or otherwise
modifying the transmission of emotional signals.
Second, they can tell us about the multimodal nature of
emotional communication in speech and music. Speech
and music are traditionally viewed as auditory phenomena:
we “listen” to music and “hear” someone speaking. The
impact of facial expressions on music and speech
perception raises important questions about how we
should characterise and investigate these two channels of
communication. They are not purely auditory channels but
are multimodal experiences.
Third, the study of facial expressions may provide insight
into the nature of sensory cues. Should sensory cues be
described in abstract terms that apply equally across
domains, to acoustic signals and facial expressions alike?
In several of our studies, we observed that cues that are
traditionally defined acoustically (e.g., interval size,
dissonance, tonality) are also communicated in facial
expressions. Such evidence suggests that many cues
transcend the boundaries between sensory modalities, and
need to be defined as multimodal cues.
Fourth, linking facial expressions to auditory cues can tell
us about the function of facial expressions. Facial
expressions map multiple acoustic cues simultaneously:
they reflect phonetic information (as in lip reading),
musical structure, and emotion. How do the continuous
and subtle changes in facial expressions convey phonetic content, dissonance, interval size, tonality, and emotion all at the same time?
Although we think of facial expressions as an emotional
signal, they also reflect these other kinds of information.
In a series of studies on emotional song, we confirmed
that facial expressions reflect and communicate all of
these attributes, providing a point of convergence between
emotion, language, and musical structure.
For example, research by Frank Russo and myself
indicates that facial expressions of singers clearly convey
information about melodic structure. In one study, we
showed that facial expressions communicate the size of
melodic intervals being sung (Thompson & Russo, 2007).
Three trained female vocalists sang 13 ascending melodic
intervals (sung on the syllable “la”) spanning 0 to 12 semitones. Participants were presented with the visual signal only and rated the size of the interval they imagined was being sung on a 7-point scale.
Figure 4 shows ratings of interval size when participants
were presented with silent video recordings of musicians
singing ascending intervals of different sizes. The Figure
illustrates a close correspondence between the perceived
size of the interval and the interval that was being sung.
Importantly, judgments were based purely on visual
information: the stimuli were silent. These results indicate
that visual signals carry information about interval size,
and listeners are capable of “reading” this information
directly from the facial expressions of singers.
Figure 4. Mean ratings of interval size based on silent
videos of sung melodic intervals.
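The correspondence in Figure 4 can be quantified with a simple correlation between the sung interval (in semitones) and the mean rating it received; the rating values below are illustrative placeholders, not the study's data.

```python
# Correlating sung interval size (0-12 semitones) with mean ratings from
# silent video (1-7 scale). The ratings below are placeholders, not data
# from the study.
import numpy as np
from scipy.stats import pearsonr

intervals = np.arange(13)                                  # 0-12 semitones
mean_ratings = np.array([1.2, 1.5, 1.9, 2.3, 2.8, 3.2, 3.7,
                         4.1, 4.6, 5.0, 5.5, 5.9, 6.3])    # placeholder values

r, p = pearsonr(intervals, mean_ratings)
print(f"r = {r:.2f}, p = {p:.3g}")
```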
A subsequent analysis of the facial movements of our
singers revealed that interval size was reflected in several
visual cues, including head and eyebrow movement.
In a more recent study, we showed that these visual cues
influence judgements of interval size even when the sound
is present, and even when participants are explicitly
instructed to focus on the sound alone. Moreover, when
we burdened the attentional capacity of our listeners with
a secondary task, the effect remained, suggesting that
participants were incorporating and merging visual and
auditory cues automatically, without conscious control.
Several subsequent studies have confirmed that the facial
expressions of speakers and singers are a rich source of
information. This information not only includes interval
size (Thompson, Russo & Livingstone, in press), but also
tonality, dissonance, and closure (Thompson, Graham &
Russo, 2005; Ceaser, Thompson & Russo, 2009). These
various sources of information are merged into the
continuous movements of the face during vocal
production. Our hypothesis is that emotional
communication in speakers and singers, reflected in facial
expressions, is a point of convergence for other signals
about the speech or music. That is, facial expressions not
only communicate an intended emotional message; they
simultaneously reflect information about melodic
structure, tonality, closure, and dissonance.
In our most recent work, done with Caroline Palmer of
McGill University, we examined the time course of facial
expressions during emotional singing. When do they begin? Which facial features are involved? When do they end? Using
motion capture, we demonstrated that emotional facial
expressions begin well before the vocal production of
sound, and linger well after the voice has stopped
producing sound (Livingstone, Thompson & Russo, 2009).
These ancillary facial expressions seem to anticipate an
emotional voice, and support emotional communication
by sustaining a visual signal of the intended emotion well
after vocalization has stopped.
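A rough sketch of how such lead and lag times could be estimated from motion-capture data is given below; the threshold and input format are assumptions for illustration, not the method of Livingstone, Thompson and Russo (2009).

```python
# Estimating how far facial movement leads the voice onset and lags the
# voice offset. Threshold and input format are assumptions for illustration.
import numpy as np

def expression_span(marker_speed, frame_rate, threshold):
    """marker_speed: 1-D array of frame-to-frame facial-marker speed.
    Returns (onset_s, offset_s) of above-threshold movement, or None."""
    moving = np.flatnonzero(np.asarray(marker_speed) > threshold)
    if moving.size == 0:
        return None
    return moving[0] / frame_rate, moving[-1] / frame_rate

def lead_and_lag(marker_speed, frame_rate, voice_on_s, voice_off_s, threshold=0.5):
    """Seconds of facial movement before the voice starts and after it stops."""
    span = expression_span(marker_speed, frame_rate, threshold)
    if span is None:
        return None
    face_on, face_off = span
    return voice_on_s - face_on, face_off - voice_off_s
```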
Perceivers, in turn, are acutely sensitive to visual signals
of emotion, and can interpret them even before the voice
has produced sound and after the voice has stopped
producing sound. Thus, emotional communication in the
human voice involves not only vocal sounds but also visual signals that support those sounds, enhancing the communication of emotion and merging it with signals about numerous structural properties of language or music.
REFERENCES
Ceaser, D.K., Thompson, W.F., & Russo, F.A. (2009).
Expressing tonal closure in music performance: Auditory and
visual cues. Canadian Acoustics, 37(1), 29-34.
Juslin, P. N., & Laukka, P. (2003). Communication of emotions
in vocal expression and music performance: Different
channels, same code? Psychological Bulletin, 129, 770-814.
Thompson, W.F., Graham, P., & Russo, F.A. (2005). Seeing
music performance: Visual influences on perception and
experience. Semiotica, 156(1/4), 203-227.
Thompson, W. F., & Russo, F. A. (2007). Facing the music.
Psychological Science, 18, 756-757.
Thompson, W. F., Russo, F. A., & Livingstone, S. R. (in press). Facial expressions of singers influence perceived pitch relations. Psychonomic Bulletin & Review. Accepted for publication November 24, 2009.