INTRODUCTION
Across cultures and throughout history, speech and song have often been regarded as a dyadic form of vocal expression (Nettl, 2000; 2005). In Persia, vocal music was understood by the term khāndan, referring to an activity involving reciting, reading, and singing (Nettl, 2005). For members of the Blackfoot tribe, saapup entailed a display of singing, dancing, and ceremonial chanting (Nettl, 2000). Among the southern Nguni people, ngoma refers to singing, divining, and the designation of people who engage in these activities (Janzen, 1992), while for the ancient Greeks, the acts of singing and speaking were described interchangeably and did not exist as the distinct forms we know today (Stamou, 2002). Evolutionary theorists have proposed that speech and song may once have existed
as a coupled means of vocal communication, a central goal of which was the expression of emotion (Darwin, 1871;
Brown, 2000; Mithen, 2005). Similarly, speech and song have long been considered to share a common ‘acoustic
code’ in the expression of emotion (Spencer, 1875; Scherer, 1995). In a major review of the subject, Juslin and
Laukka (2003) concluded that music performance, under which singing was classified, shared many of the same acoustic features as speech in the expression of emotion. However, to the authors’ knowledge, despite their long
historical and academic association, there have been no direct comparative analyses of the acoustic properties of
emotional speech and song. In this paper we report preliminary findings on the acoustic commonalities of emotion
in matched productions of speech and song.
Acoustic analyses were run on recordings of six vocalists drawn from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (Livingstone et al., 2012). The RAVDESS, which is being prepared for public release, consists of 24 professional actors speaking and singing matched statements across a large range of emotions, each produced at two emotional intensities. The RAVDESS contains over 7000 files (audio-only, video-only, and full audio-video in 720p), and will be released with perceptual validation data, acoustic analyses, and facial motion analyses. The purpose of creating the RAVDESS was to provide researchers with an open-access repository of high-quality, audio-visual recordings of speech and song in North American English. Perceptual accuracy of the acoustic recordings used in the present analysis was confirmed in a separate pilot experiment. Based on previous reviews of
the acoustic cues of emotion in speech (Cowie et al., 2001; Juslin and Laukka, 2001) and song (Sundberg, 1998), we
hypothesized that the two would exhibit similar patterns of change in fundamental frequency, vocal intensity,
utterance duration, first formant frequency, and spectral energy distribution.
METHOD
Participants
Six highly trained actors (mean age = 25.0, SD = 4.04, 3 females) were recruited from the Toronto community. Participants were native English speakers with a North American accent, had at least six years of acting experience (M = 10.17, SD = 3.72), and had varying amounts of singing experience (M = 6.83, SD = 2.85).
Stimulus and Materials
Two neutral English statements were used (“Kids are talking by the door”, “Dogs are sitting by the door”).
Statements were seven syllables in length and were matched in word frequency and familiarity using the MRC
psycholinguistic database (Coltheart, 1981). Statements were chosen to enable a matched production in speech and
song. In the song condition, three isochronous melodies were used; one each for emotionally neutral (F4, F4, G4,
G4, F4, E4, F4), positively valenced (F4, F4, A4, A4, F4, E4, F4), and negatively valenced (F4, F4, Ab4, Ab4, F4,
E4, F4) emotions. The neutral melody did not contain the 3rd scale degree, while the positive and negative melodies
were in the major and minor mode respectively (Kastner and Crowder, 1990). Stimuli were presented visually on a
15” MacBook Pro running Windows XP SP3 and Matlab 2009b, and auditorily over KRK Rokit 5 speakers, controlled by Matlab and the Psychophysics Toolbox (3.0.8 SVN 1648, Brainard, 1997). Vocal utterances were captured with an AKG C414 B-XLS cardioid microphone with a pop filter, positioned 30 cm from the actor, on a Mac Pro computer running Pro Tools at 48 kHz. Recordings were edited using Adobe Premiere 6. To avoid perceptual confusion between the three melodies, song trials were pitch-corrected using Melodyne to ensure that the mean fundamental frequency of each note did not deviate from the notated melody by more than ± 35 cents (Vurma and Ross, 2006). Occasional pops were removed with a high-pass filter (100 Hz) in Adobe Audition. Vocal intensity was peak-normalized
within each actor to retain intensity variability across their emotions. The perceptual validity of the recordings was
tested in a pilot experiment with 8 raters. An average accuracy of 70.9% was recorded across the analyzed emotions.
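To make the pitch-correction criterion concrete, the following minimal sketch (Python with NumPy) checks whether the mean fundamental frequency of each sung note falls within ± 35 cents of the notated melody. Only the three notated melodies and the 35 cent tolerance come from the text above; the note-to-frequency helper and the example F0 values are illustrative assumptions, not the authors' tooling.

    # A minimal sketch (not the authors' tooling) of the +/- 35 cent acceptance
    # check applied after pitch correction. The note-to-frequency helper and the
    # example F0 values are illustrative assumptions; the three notated melodies
    # and the 35 cent tolerance are taken from the text above.
    import numpy as np

    A4_HZ = 440.0
    SEMITONES_FROM_A = {"C": -9, "Db": -8, "D": -7, "Eb": -6, "E": -5, "F": -4,
                        "Gb": -3, "G": -2, "Ab": -1, "A": 0, "Bb": 1, "B": 2}

    def note_to_hz(name):
        """Convert a note name such as 'F4' or 'Ab4' to its equal-tempered frequency."""
        pitch_class, octave = name[:-1], int(name[-1])
        semitones = SEMITONES_FROM_A[pitch_class] + 12 * (octave - 4)
        return A4_HZ * 2.0 ** (semitones / 12.0)

    MELODIES = {
        "neutral":  ["F4", "F4", "G4", "G4", "F4", "E4", "F4"],
        "positive": ["F4", "F4", "A4", "A4", "F4", "E4", "F4"],
        "negative": ["F4", "F4", "Ab4", "Ab4", "F4", "E4", "F4"],
    }

    def within_tolerance(measured_f0s, melody_name, tol_cents=35.0):
        """True if each note's mean F0 is within +/- tol_cents of the notated pitch."""
        targets = np.array([note_to_hz(n) for n in MELODIES[melody_name]])
        cents = 1200.0 * np.log2(np.asarray(measured_f0s) / targets)
        return bool(np.all(np.abs(cents) <= tol_cents))

    # Hypothetical per-note mean F0 values (Hz) for one sung trial:
    print(within_tolerance([350.1, 349.0, 392.5, 391.0, 349.5, 330.2, 349.8], "neutral"))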
Design, Procedure, and Analysis
The experimental design was a 2 (Domain: speech, song) × 11 (Emotion: neutral, calm, very calm, happy, very happy, sad, very sad, angry, very angry, fearful, very fearful) × 2 (Statement: Kids, Dogs) × 2 (Repetition: 1, 2) within-subjects design, with 88 trials per participant¹. Trials were blocked by Domain, with speech presented first to reduce any temporal influences from the regularity of the song condition. Within each Domain, trials were further blocked by Emotion to reduce fatigue, and the order of emotion blocks was counter-balanced across participants. A dialogue script was
used when working with the actors. Each emotion was described, along with a vignette outlining a scenario
involving that emotion. It was emphasized that actors were to produce realistic expressions of emotion, and that they
were to prepare themselves physiologically using method acting to induce the desired emotion prior to recording. In
the singing condition, actors were told to sing the basic notated pitches, but they were free to vary other acoustic
characteristics in order to convey the desired emotion.
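As a minimal illustration of the factorial structure, the snippet below (Python) enumerates the 2 × 11 × 2 × 2 = 88 trials per participant; the blocking by Domain and Emotion and the counter-balancing of block order described above are deliberately not reproduced.

    # Factor levels as given in the Design; blocking and counter-balancing of
    # block order are not reproduced in this simplified enumeration.
    from itertools import product

    domains = ["speech", "song"]
    emotions = ["neutral", "calm", "very calm", "happy", "very happy", "sad",
                "very sad", "angry", "very angry", "fearful", "very fearful"]
    statements = ["Kids are talking by the door", "Dogs are sitting by the door"]
    repetitions = [1, 2]

    trials = list(product(domains, emotions, statements, repetitions))
    assert len(trials) == 2 * 11 * 2 * 2  # 88 trials per participant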
Acoustic recordings were analyzed with Praat (Boersma and Weenink, 2013). Vocal duration, fundamental frequency (floor and range), vocal intensity, first formant frequency (F1, mean), and HF500 (the ratio of spectral energy above 500 Hz to energy below 500 Hz; Juslin and Laukka, 2001) were extracted. For this preliminary analysis, measures were computed over the entire utterance. Data were collapsed across actor, statement, and repetition for statistical analysis.
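As an illustration of how such whole-utterance measures can be scripted, the sketch below uses the parselmouth Python interface to Praat on a hypothetical file kids_happy_song.wav. The analysis settings (default pitch and formant parameters, and the 5th percentile used here as the F0 floor) and the file name are assumptions; the paper does not specify the authors' exact Praat parameters.

    # A minimal sketch, assuming the parselmouth interface to Praat; analysis
    # settings (pitch range, formant ceiling, the percentile used as the F0
    # floor) are assumptions, not the authors' exact parameters.
    import numpy as np
    import parselmouth
    from parselmouth.praat import call

    def utterance_features(wav_path):
        """Whole-utterance measures mirroring the five analyzed features."""
        snd = parselmouth.Sound(wav_path)

        duration = snd.get_total_duration()           # vocal duration (s)

        pitch = snd.to_pitch()                        # default pitch analysis
        f0 = pitch.selected_array["frequency"]
        f0 = f0[f0 > 0]                               # voiced frames only
        f0_floor = np.percentile(f0, 5)               # assumed floor estimator
        f0_range = f0.max() - f0.min()                # F0 range (Hz)

        intensity = snd.to_intensity()
        mean_intensity = call(intensity, "Get mean", 0, 0, "energy")  # dB

        formant = snd.to_formant_burg()
        f1_mean = call(formant, "Get mean", 1, 0, 0, "hertz")         # mean F1 (Hz)

        # HF500: energy above 500 Hz divided by energy below 500 Hz.
        spectrum = snd.to_spectrum()
        low = call(spectrum, "Get band energy", 0, 500)
        high = call(spectrum, "Get band energy", 500, snd.sampling_frequency / 2)
        hf500 = high / low

        return {"duration": duration, "f0_floor": f0_floor, "f0_range": f0_range,
                "intensity": mean_intensity, "f1_mean": f1_mean, "hf500": hf500}

    # Hypothetical usage: print(utterance_features("kids_happy_song.wav"))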
RESULTS
Separate two-way analyses of variance by Domain and Emotion were conducted on Duration, Fundamental
Frequency (floor and range), Vocal Intensity, F1, and HF500, as listed in Table 1.
TABLE 1. Summary of results from the analyses of variance of vocal features, showing F-values. Analyses examined 2 Domains (speech, song) and 11 Emotions (neutral, calm, happy, sad, angry, and fearful, with all emotions except neutral produced at two intensities). * p < .01, ** p < .001, otherwise p < .05; Dunn-Bonferroni corrected.
Feature                        Domain      Emotion     D × E
Duration                       124.73**    5.87**      6.16**
Fundamental freq. (Floor)      31.10       6.52**      3.72*
Fundamental freq. (Range)      n.s.        5.48**      6.38**
Vocal intensity (Mean)         132.70**    24.80**     10.98**
First formant freq. (Mean)     n.s.        14.48**     2.91
HF500 (Mean)                   n.s.        12.70**     n.s.
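As a rough illustration of the analysis summarized in Table 1, the sketch below runs a 2 (Domain) × 11 (Emotion) within-subjects ANOVA for each feature using statsmodels, with actor as the subject factor. The data frame layout, column names, and file name are assumptions, and the Dunn-Bonferroni correction applied across tests is not reproduced; this is not necessarily the authors' analysis pipeline.

    # A rough sketch of a 2 (Domain) x 11 (Emotion) within-subjects ANOVA per
    # feature, with actor as the subject factor. File and column names are
    # assumptions; the Dunn-Bonferroni correction is not applied here.
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Assumed long format: one row per actor x domain x emotion x statement x
    # repetition, with one column per extracted feature.
    df = pd.read_csv("ravdess_features.csv")

    for feature in ["duration", "f0_floor", "f0_range", "intensity", "f1_mean", "hf500"]:
        result = AnovaRM(
            data=df,
            depvar=feature,
            subject="actor",
            within=["domain", "emotion"],
            aggregate_func="mean",   # collapse over statement and repetition
        ).fit()
        print(feature)
        print(result.anova_table)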
The main effect of Domain (speech, song) was significant for three of the six analyzed features (duration, F0 floor, vocal intensity), with singing having a longer duration (2.69 s vs 1.69 s), a higher pitch floor (202.56 Hz vs 147.72 Hz)², and a greater vocal intensity (50.45 dB vs 46.0 dB) than speech. Interestingly, the main effect of Emotion was significant for all six analyzed features. Across speech and song, the low-arousal emotions calm and sad exhibited longer durations than the high-arousal emotions happy, angry, and fearful. Similarly, the low-arousal emotions neutral, calm, and sad (but not very sad) exhibited a smaller range in fundamental frequency than the high-arousal emotions happy, angry, and fearful. This pattern of effects was again reflected in vocal intensity, displayed in Figure 1, with high-arousal emotions generally exhibiting greater vocal intensity than low-arousal emotions. First formant frequency and HF500 were also elevated for happy, angry, and fearful emotions. Significant interactions of Domain and Emotion were found for five of the six analyzed features. Overall, the interactions did not alter the general pattern of main effects. Pronounced differences were observed in fundamental frequency (floor and range), though this was expected as vocalists were required to sing the prescribed melody pitches.
¹ Four additional emotions in speech (surprised, very surprised, disgusted, very disgusted) were not included in the analyses, as the singing condition did not contain these emotions; it was felt that song could not adequately express them.
² Future analyses will consider the effect of gender, which is important for song.
FIGURE 1. Mean vocal intensity across the 11 emotions for speech (a) and song (b). The figure illustrates the two main effects
of Domain and Emotion, with song being louder overall than speech, and with speech emotions appearing to show greater
variability in intensity than song.
CONCLUSION
Speech and song have long been considered an entwined form of vocal expression. Despite their long
association, there have until now been no direct comparisons of the acoustic similarities of speech and song in their
expression of emotion. In this paper we presented preliminary data from the Ryerson Audio-Visual Database of
Emotional Speech and Song (RAVDESS). We showed that speech and song shared many of the same acoustic
features in their expression of emotion, while also exhibiting differences that distinguish speech from song.
Collectively, these data support the notion that speech and song may have emerged from a common vocal origin.
ACKNOWLEDGMENTS
This research was supported through grants from AIRS (Advancing Interdisciplinary Research in Singing), a
Major Collaborative Research Initiative of the Social Sciences and Humanities Research Council of Canada,
awarded to the first and third authors, and by a Discovery grant from the Natural Sciences and Engineering Research
Council of Canada awarded to the third author. The authors thank Gabe Nespoli and Alex Andrews for their
assistance.
REFERENCES
Boersma, P., and Weenink, D. (2013). "Praat: doing phonetics by computer" [Computer program].
Brainard, D. H. (1997). "The psychophysics toolbox," Spatial Vision 10, 433-436.
Brown, S. (2000). "The 'musilanguage' model of musical evolution," in The origins of music, edited by N. L. Wallin, B. Merker, and S. Brown (The MIT Press, Cambridge, Mass), pp. 271-300.
Coltheart, M. (1981). "The MRC psycholinguistic database," The Quarterly Journal of Experimental Psychology Section A 33,
497-505.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and Taylor, J. G. (2001). "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine 18, 32-80.
Darwin, C. (1871). The descent of man and selection in relation to sex (John Murray, London).
Janzen, J. M. (1992). Ngoma: discourses of healing in Central and Southern Africa (University of California Press).
Juslin, P. N., and Laukka, P. (2001). "Impact of intended emotion intensity on cue utilization and decoding accuracy in vocal
expression of emotion," Emotion 1, 381.
Juslin, P. N., and Laukka, P. (2003). "Communication of emotions in vocal expression and music performance: Different
channels, same code?," Psychological Bulletin 129, 770.
Kastner, M. P., and Crowder, R. G. (1990). "Perception of the major/minor distinction: IV. Emotional connotations in young children," Music Perception 8, 189-201.
Livingstone, S. R., Peck, K., and Russo, F. A. (2012). "RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song," in 22nd Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (CSBBCS) (Kingston, ON).
Mithen, S. J. (2005). The singing Neanderthals: The origins of music, language, mind, and body (Harvard University Press).
Nettl, B. (2000). "An ethnomusicologist contemplates universals in musical sound and musical culture," in The origins of music, edited by N. L. Wallin, B. Merker, and S. Brown (The MIT Press, Cambridge, Mass), pp. 463-472.
Nettl, B. (2005). The study of ethnomusicology: thirty-one issues and concepts (University of Illinois Press).
Scherer, K. R. (1995). "Expression of emotion in voice and music," Journal of Voice 9, 235-248.
Spencer, H. (1875). "The origin and function of music," in Fraser's Magazine, pp. 396-408.
Stamou, L. (2002). "Plato and Aristotle on music and music education: Lessons from ancient Greece," International Journal of Music Education 39, 3-16.
Sundberg, J. (1998). "Expressivity in singing. A review of some recent investigations," Logopedics Phoniatrics Vocology 23, 121-127.
Vurma, A., and Ross, J. (2006). "Production and perception of musical intervals," Music Perception 23, 331-344.