Illusory transformation from speech to song a)
Diana Deutsch,b) Trevor Henthorn, and Rachael Lapidis
Department of Psychology, University of California, San Diego, La Jolla, California 92093
a) Portions of this work were presented at a meeting of the Acoustical Society of America, Miami, November 2008, as Deutsch, D., Lapidis, R., and Henthorn, T. (2008). “The speech-to-song illusion,” J. Acoust. Soc. Am. 124, 2471.
b) Author to whom correspondence should be addressed. Electronic mail: ddeutsch@ucsd.edu
(Received 8 December 2010; revised 12 February 2011; accepted 14 February 2011)
An illusion is explored in which a spoken phrase is perceptually transformed to sound like song
rather than speech, simply by repeating it several times over. In experiment I, subjects listened to
ten presentations of the phrase and judged how it sounded on a five-point scale with endpoints
marked “exactly like speech” and “exactly like singing.” The initial and final presentations of the
phrase were identical. When the intervening presentations were also identical, judgments moved
solidly from speech to song. However, this did not occur when the intervening phrases were trans-
posed slightly or when the syllables were presented in jumbled orderings. In experiment II, the
phrase was presented either once or ten times, and subjects repeated it back as they finally heard it.
Following one presentation, the subjects repeated the phrase back as speech; however, following
ten presentations they repeated it back as song. The pitch values of the subjects’ renditions follow-
ing ten presentations were closer to those of the original spoken phrase than were the pitch values
following a single presentation. Furthermore, the renditions following ten presentations were even
closer to a hypothesized representation in terms of a simple tonal melody than they were to the orig-
inal spoken phrase. © 2011 Acoustical Society of America. [DOI: 10.1121/1.3562174]
PACS number(s): 43.75.Cd, 43.75.Rs [NHF] Pages: 2245–2252
I. INTRODUCTION
There has recently been an upsurge of interest in rela-
tionships between music and speech, particularly in how
these two forms of communication are processed by the au-
ditory system (cf. Zatorre et al., 2002; Koelsch et al., 2002;
Koelsch and Siebel, 2005; Zatorre and Gandour, 2007;
Schön et al., 2004; Peretz and Coltheart, 2003; Patel, 2008;
Hyde et al., 2009; Deutsch, 2010). In exploring this issue, it
is generally assumed that whether a phrase is heard as spo-
ken or sung depends on its acoustical characteristics. Speech
consists of frequency glides that are often steep, and of rapid
amplitude and frequency transitions. In contrast, song con-
sists largely of discrete pitches that are sustained over rela-
tively long durations and that tend to follow each other in
small steps. At the phenomenological level, speech appears
as a succession of rapid changes in timbre, which are inter-
preted as consonants and vowels, and in which pitch
contours are only broadly defined (at least in nontone lan-
guages). In contrast, song is heard primarily as a succession
of well-defined musical notes (though also with consonants
and vowels) and these are combined to form well-defined
pitch relationships and rhythmic patterns. The dichotomy
between the physical characteristics of speech and non-
speech is not clearcut, however. It has been found that cer-
tain nonspeech sounds can be interpreted as speech as a
consequence of training (Remez et al., 1981; Mottonen
et al., 2006) or when they are placed in verbal contexts
(Shtyrov et al., 2005).
The widespread view that speech and music can be
defined in terms of their acoustical properties is reflected in
studies that explore their perceptual characteristics and neu-
rological underpinnings. For speech, researchers have
focused on features such as fast formant transitions and
voice onset time (Diehl et al., 2004), while for music,
researchers have examined such issues as the processing of
pitch sequences, musical instrument timbres, and rhythmic
patterns (Stewart et al., 2006).
The use of signals with different physical characteristics
is necessary for studying music and speech taken independ-
ently. However, when differences are found in the ways in
which they are processed, these could be either due to the
differences in the signals employed or due to the processing
of these signals by different neural pathways (Zatorre and
Gandour, 2007). In contrast, this paper describes and
explores an illusion in which a spoken phrase is perceptually
transformed so as to be heard as sung rather than spoken.
The illusion occurs without altering the signal in any way,
without training, and without any context provided by other
sounds, but simply as a result of repeating the phrase several
times over. Research on this illusion therefore provides
insights into differences in the processing of speech and
music, without the complication of invoking different signal
parameters or different contexts.
The illusion was first published as a demonstration on
the compact disc by Deutsch (2003). Here, a spoken sen-
tence is presented, followed by repeated presentations of a phrase
that had been embedded in it. Most people hear the repeated phrase
transform into a sung melody, generally as notated in Fig. 1.
This paper describes the first formal exploration of the illu-
sion and presents a discussion of its possible underlying
bases.
The study consisted of two experiments. Experiment I
explored certain constraints governing the illusion, using a
rating task as the measure. The illusion was found to occur
when the repeated presentations of the spoken phrase were
exact replicas of the original one. However, when on repeti-
tion, the phrase was transposed slightly or the syllables were
jumbled, the illusion did not occur. In experiment II, the
characteristics of this perceptual transformation were
explored in detail by having subjects repeat back the phrase
exactly as they had heard it, both following a single presenta-
tion and following ten presentations. It is hypothesized that
during the process of repetition, the pitches forming the
phrase increase in perceptual salience and that in addition
they are perceptually distorted so as to conform to a tonal
melody. The findings from the experiment provide evidence
in favor of this hypothesis. Finally, the hypothesized neuro-
logical underpinnings of this illusion are explored and its
general implications for relationships between speech and
music are discussed.
II. EXPERIMENT I
A. Method
1. Subjects
Fifty-four subjects with at least 5 yr of musical training
participated in the experiment and were paid for their serv-
ices. They were divided into three groups of 18 subjects
each, with each group serving in one condition. The subjects
in the first group (three males and 15 females) were of aver-
age age 21.7 yr (range, 18–33 yr) and with an average of
10.2 yr of musical training (range, 6–14 yr). Those in the
second group (four males and 14 females) were of average
age 22.4 yr (range, 18–29 yr) and with an average of 10.6 yr
(range, 6–15 yr) of musical training. Those in the third group
(three males and 15 females) were of average age 20.3 yr
(range 18–28 yr) and with an average of 10.0 yr (range 6–14
yr) of musical training. None of the subjects had perfect
pitch. All had normal hearing in the range of 250 Hz–6 kHz,
as determined by audiometric testing, and all were naïve
concerning the purpose of the experiment and the nature of
the illusion.
2. Stimulus patterns and procedure
The experiment was carried out in a quiet room. The
stimulus patterns were derived from the sentence on track 22
of the compact disc by Deutsch (2003). The sentence states
“The sounds as they appear to you are not only different
from those that are really present, but they sometimes
behave so strangely as to seem quite impossible.” In all con-
ditions, this sentence was presented, followed by a pause of
2300 ms in duration and then by ten presentations of the em-
bedded phrase “sometimes behave so strangely,” which were
separated by pauses of 2300 ms in duration. During the pause
following each presentation, the subjects judged how the
phrase had sounded on a five-point scale with endpoints 1
and 5 marked “exactly like speech” and “exactly like
singing.”
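By way of illustration, the trial structure just described can be expressed as a short audio-assembly script. The sketch below is not the workflow used in the study (the stimuli were prepared in BIAS Peak Pro and played from compact disc); the file names, and the assumption of mono audio, are hypothetical.

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library; not part of the original study

# Sketch of one untransformed-condition trial: the full sentence, a 2300-ms
# pause, then ten identical presentations of the embedded phrase, each
# followed by a 2300-ms pause in which the speech/song rating is made.
sentence, sr = sf.read("sentence.wav")                     # hypothetical file name
phrase, _ = sf.read("sometimes_behave_so_strangely.wav")   # hypothetical file name
pause = np.zeros(int(2.3 * sr))                            # 2300 ms of silence (mono assumed)
segments = [sentence, pause]
for _ in range(10):
    segments += [phrase, pause]
sf.write("trial_untransformed.wav", np.concatenate(segments), sr)
```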
In all conditions, the initial and final presentations of the
phrase were untransformed; however, the phrases in the
intervening presentations varied depending on the condition:
In the untransformed condition, the intervening phrases were
also untransformed. In the transposed condition, the inter-
vening phrases were transposed slightly, while the formant
frequencies were preserved. The degree of transposition on
each of the intervening presentations, given in the order of
presentation, was +2/3 semitone; −1 1/3 semitone; +1 1/3
semitone; −2/3 semitone; +1 1/3 semitone; −1 1/3 semitone;
+2/3 semitone; and −2/3 semitone. In the jumbled condi-
tion, the intervening phrases were untransposed, but they
were presented in jumbled orderings. The phrase consisted
of seven syllables (1 = “some;” 2 = “times;” 3 = “be;”
4 = “have;” 5 = “so;” 6 = “strange;” and 7 = “ly”), and in
the intervening repetitions, the orderings of the syllables,
given in the order of presentation, were 6, 4, 3, 2, 5, 7, 1;
7, 5, 4, 1, 3, 2, 6; 1, 3, 5, 7, 6, 2, 4; 3, 6, 2, 5, 7, 1, 4; 2,
6, 1, 7, 4, 3, 5; 4, 7, 1, 3, 5, 2, 6; 6, 1,5, 3, 2, 4, 7; and 2,
5, 4, 3, 7, 1, 6.
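The jumbled orderings lend themselves to a simple segment-permutation operation. The following sketch illustrates the idea only; the original jumbled stimuli were edited in BIAS Peak Pro, and the file name and syllable boundary times given here are hypothetical placeholders.

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library

# Re-order the seven syllables of the phrase according to the first jumbled
# ordering listed above (indices are 1-based syllable numbers).
audio, sr = sf.read("sometimes_behave_so_strangely.wav")        # hypothetical file name
boundaries = [0.00, 0.20, 0.45, 0.60, 0.85, 1.10, 1.55, 1.90]   # hypothetical syllable edges (s)
syllables = [audio[int(t0 * sr):int(t1 * sr)]
             for t0, t1 in zip(boundaries[:-1], boundaries[1:])]
ordering = [6, 4, 3, 2, 5, 7, 1]
jumbled = np.concatenate([syllables[i - 1] for i in ordering])
sf.write("jumbled_presentation_1.wav", jumbled, sr)
```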
Finally, all subjects filled out a questionnaire that
enquired into their age and musical training.
3. Instrumentation and software
The original sentence was recorded from track 22 of the
Deutsch (2003) compact disc onto a Power Mac G5 com-
puter, and it was saved as an AIF file at a sampling fre-
quency of 44.1 kHz. The software package BIAS PEAK PRO
Version 4.01 was employed to create the stimuli in all condi-
tions and also to create the jumbled orderings in the jumbled
condition. The software package PRAAT Version 4.5.06
(Boersma and Weenink, 2006) was employed to create the
transpositions in the transposed condition, using the pitch-
synchronous overlap-and-add method. The reconstituted sig-
nals were then recorded onto compact disc. They were
played to subjects on a Denon DCD-815 compact disc
player, the output of which was passed through a Mackie CR
1604-VLZ mixer and presented via two Dynaudio BM15A
loudspeakers at a level of approximately 70 dB sound pres-
sure level (SPL) at the subjects’ ears.
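For readers who wish to reproduce a transposition of the kind described above, the sketch below uses parselmouth, a Python interface to Praat, as a stand-in for the Praat 4.5.06 workflow; it shifts F0 by +2/3 semitone with pitch-synchronous overlap-and-add resynthesis, which leaves timing and the spectral envelope (and hence the formant frequencies) essentially unchanged. The file names are hypothetical, and the command strings follow Praat's menu commands, so they may need adjustment for other Praat versions.

```python
import parselmouth                       # Python interface to Praat (assumed stand-in)
from parselmouth.praat import call

snd = parselmouth.Sound("sometimes_behave_so_strangely.wav")    # hypothetical file name
manipulation = call(snd, "To Manipulation", 0.01, 75, 600)      # time step (s), F0 floor/ceiling (Hz)
pitch_tier = call(manipulation, "Extract pitch tier")
factor = 2 ** ((2 / 3) / 12)                                    # +2/3 semitone as a frequency ratio
call(pitch_tier, "Multiply frequencies", snd.xmin, snd.xmax, factor)
call([pitch_tier, manipulation], "Replace pitch tier")
shifted = call(manipulation, "Get resynthesis (overlap-add)")   # PSOLA resynthesis
shifted.save("phrase_plus_two_thirds_semitone.wav", parselmouth.SoundFileFormat.WAV)
```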
B. Results
Figure 2 shows the average ratings of the phrase on the
initial and final presentations and under each of the three
conditions. It can be seen that the phrase was perceived
as speech on the initial presentation; however, the way it
was perceived on the final presentation depended on the
nature of the intervening presentations. When these were
untransformed, the phrase on the final presentation was
heard as song. However, when the intervening presentations
were transposed, judgments on the final presentation
remained as speech, though displaced slightly towards song.
FIG. 1. The spoken phrase, as it appears to be sung. From Deutsch (2003).
When the syllables in the intervening presentations were pre-
sented in jumbled orderings, the phrase on the final presenta-
tion was heard as speech.
To make statistical comparison between the judgments
under the different conditions, a 2 × 3 analysis of variance
(ANOVA) was performed, with presentation (initial and
final) as a within-subjects factor, and condition (untrans-
formed, transposed, and jumbled) as a between-subjects fac-
tor. The overall difference between the initial and final
presentations was highly significant [F(1, 51) = 62.817;
p < 0.001], the effect of condition was highly significant
[F(2, 51) = 16.965; p < 0.001], and the interaction between
presentation and condition was highly significant [F(2,
51) = 25.593; p < 0.001].
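As an illustration of the analysis, the same 2 × 3 mixed ANOVA can be specified in a few lines with the pingouin package (an assumed modern tool, not the software used for the original analysis); the ratings below are synthetic placeholders, one row per subject and presentation.

```python
import numpy as np
import pandas as pd
import pingouin as pg   # assumed statistics library; not part of the original analysis

rng = np.random.default_rng(0)
rows = []
for subject in range(54):
    condition = ["untransformed", "transposed", "jumbled"][subject % 3]
    for presentation in ["initial", "final"]:
        rows.append((subject, condition, presentation, rng.integers(1, 6)))  # placeholder 1-5 rating
df = pd.DataFrame(rows, columns=["subject", "condition", "presentation", "rating"])

aov = pg.mixed_anova(data=df, dv="rating", within="presentation",
                     between="condition", subject="subject")
print(aov)   # F and p values for presentation, condition, and their interaction
```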
Given these findings, two further ANOVAs were per-
formed, in which judgments in the three conditions were
compared for the initial and final presentations taken sepa-
rately. For the initial presentation, judgments in the different
conditions did not differ significantly [F(2, 51) = 1.912;
p > 0.05]. However, for the final presentation, the effect of
type of intervening phrase was highly significant [F(2,
51) = 27.317, p < 0.001]. Post hoc comparisons were there-
fore made taking the final presentation alone. It was found
that judgments in the untransformed condition were signifi-
cantly different from those in the transposed condition
(p < 0.001) and were also significantly different from those
in the jumbled condition (p < 0.001). The difference between
judgments in the transposed and jumbled conditions was
nonsignificant (p > 0.05).
C. Discussion
In this experiment, it was found that for a group of
subjects who were naïve concerning the purpose of the
experiment and who had been selected only on the basis of
having had at least 5 yr of musical training, the repeated pre-
sentation of a spoken phrase caused it to be heard as sung
rather than spoken. However, this perceptual transformation
did not occur when, during the intervening presentations, the
phrase was transposed slightly or the syllables were pre-
sented in jumbled orderings. The illusion could not, there-
fore, have been due to repetition of the pitch contour of the
phrase or even repetition of the exact melodic intervals,
since these were preserved under transposition. Further,
since the perceptual transformation did not occur when the
intervening patterns were transposed leaving the timing of
the signal unaltered, it could not have been due to the repeti-
tion of the exact timing of the phrase. In addition, since the
perceptual transformation did not occur when the syllables
were presented in jumbled orderings, it could not have been
due to the exact repetition of the unordered set of syllables.
The illusion therefore appears to require repetition of the
untransposed set of syllables, presented in the same
ordering.
Experiment II explored this transformation effect in fur-
ther detail, by employing a production task. The embedded
phrase was presented either once or ten times following the
complete sentence, and the subjects were asked to repeat it
back exactly as they had most recently heard it. Differences
in the subjects’ renditions under these two conditions were
analyzed.
The experiment was motivated by two hypotheses. First,
in contrast to song, the pitch characteristics of speech are
rarely salient perceptually, and one striking characteristic of
the present illusion is that the perceived pitch salience of the
syllables increases substantially through repetition. It was
therefore hypothesized that following repeated listening to
the phrase, this perceptual increase in pitch salience would
result in renditions whose pitches would be closer to the
original spoken phrase, and with more inter-subject consis-
tency. Second, it was hypothesized that once the syllables
were heard as forming salient pitches, they would also be
perceptually distorted so as to be in accordance with a plau-
sible melodic representation; specifically, it was hypothe-
sized that the pitches produced by the subjects would be as
notated in Fig. 1.
III. EXPERIMENT II
A. Method
1. Subjects
Thirty-one female subjects participated in the experi-
ment and were paid for their services. They were divided
into three groups. The first group consisted of 11 subjects, of
average age 23.8 yr (range, 19–35 yr), and with an average
of 11.2 yr (range, 5–15 yr) of musical training. Before partic-
ipating in the experiment, they had listened to the sentence
followed by the repeating phrase and had all reported that
they heard the phrase as having been transformed into song.
The second group also consisted of 11 subjects and were of
average age 18.9 yr (range, 18–20 yr) and with an average of
8.9 yr (range, 6–12 yr) of musical training. They had not
been presented with the stimulus pattern before participating
in the experiment. The third group consisted of nine subjects,
of average age 19.2 yr (range, 18–22 yr), and with an
FIG. 2. Average ratings of the spoken phrase on the initial and final presen-
tations, under the three conditions: untransformed, transposed, and jumbled.
Subjects rated the phrase on a five-point scale with endpoints 1 and 5
marked “exactly like speech” and “exactly like singing,” respectively.
average of 8.0 yr (range, 4–11 yr) of musical training. None
of the subjects had perfect pitch. All had normal hearing in
the range of 250 Hz–6 kHz, as determined by audiometric
testing, and all were naïve concerning the purpose of the
experiment and the nature of the illusion.
2. Experimental conditions and procedure
There were four conditions in the experiment. In the
repeat speech condition, the stimulus pattern was as in
experiment I, so that the embedded phrase “sometimes
behave so strangely” was presented ten times in succession,
except that the pauses between repeated presentations were
780 ms in duration. The nonrepeat speech condition was
identical to the repeat speech condition, except that the em-
bedded phrase was presented only once. In the nonrepeat
song condition, the stimulus pattern consisted of a recording
of a single rendition of the embedded phrase sung by one of
the authors (R.L.) as she had heard it following multiple pre-
sentations. In all three conditions, the subjects were asked to
listen to the stimulus pattern and then to repeat it back three
or four times exactly as they had heard it; the second rendi-
tion of the phrase was then extracted for analysis. The first
group of subjects served in the repeat speech condition and
the second group served in the nonrepeat speech condition,
followed immediately by the nonrepeat song condition.
Finally, in the evaluation condition, the 22 renditions of
the spoken phrase, which were taken from the utterances of
the 11 subjects in the repeat speech condition and the 11
subjects in the nonrepeat speech condition, were presented
to the third group of subjects. The phrases were presented in
random order and were separated by 8-s pauses. During the
pause following each presentation, the subject indicated on a
response sheet whether the phrase sounded as speech or as
song.
3. Instrumentation and software
The subjects produced their renditions individually in a
quiet room. The instrumentation used to deliver the stimulus
patterns was identical to that in experiment I. The subjects’
vocalizations were recorded onto an Edirol R-1 24 bit re-
corder at a sampling frequency of 44.1 kHz. The recordings
were made using an AKG C 1000 S microphone placed
roughly 8 in. from the subject’s mouth. The sound files were
transferred to an iMac computer, where they were saved as
AIF files at a sampling frequency of 44.1 kHz. Then from each
sound file, the second rendition of the phrase was extracted,
saved as a separate sound file, and normalized for amplitude
using the software package BIAS PEAK PRO Version 5.2. F0
estimates of the subject’s vocalizations were then obtained
at 5 ms intervals using the software package PRAAT Version
5.0.09 (autocorrelation method). Then for each sound file,
the F0 estimates were averaged along the musical scale; that
is, along a log frequency continuum, so producing an aver-
age F0 for the phrase. In addition, each phrase was seg-
mented into the seven syllables (some, times, be, have, so,
strange, and ly), and the F0 estimates were averaged over
each syllable separately.
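A minimal sketch of this F0 analysis, using parselmouth as a stand-in for Praat 5.0.09, is given below; averaging log-transformed F0 values (equivalently, taking the geometric mean) implements the averaging along the musical scale described above. The file name is a hypothetical placeholder, and per-syllable averages would be obtained in the same way from the frames falling within each syllable's boundaries.

```python
import numpy as np
import parselmouth   # Python interface to Praat (assumed stand-in)

snd = parselmouth.Sound("rendition_subject01.wav")   # hypothetical file name
pitch = snd.to_pitch_ac(time_step=0.005)             # F0 every 5 ms, autocorrelation method
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                      # discard unvoiced frames
average_f0 = np.exp(np.mean(np.log(f0)))             # average along a log-frequency continuum
print(f"Average F0 for the phrase: {average_f0:.1f} Hz")
```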
B. Results
The judgments made in the evaluation condition showed
that renditions in the nonrepeat speech condition were heard
as spoken, while those in the repeat speech condition were
heard as sung. Specifically, this was true for 97.5% of the
198 judgments that were made. This result is as expected
from the findings from experiment I, in which subjects
judged the initial presentation of the original spoken phrase
as spoken and the final presentation as sung.
Detailed analyses of the pitch patterns in the renditions
were undertaken in order to characterize the changes that
resulted from repeated exposure to the original spoken
phrase. Figure 3 displays the average F0s of all the syllables
in the original spoken phrase, together with those averaged
over all renditions in the repeat speech condition and in the
nonrepeat speech condition. As further illustration, Fig. 4
displays the pitch tracings of the original spoken phrase, to-
gether with those from four subjects in the repeat speech
condition and four subjects in the nonrepeat speech condi-
tion. These pitch tracings are representative of those pro-
duced by all subjects in each condition.
Two findings emerged that were predicted from the hy-
pothesis that pitch salience increased as a result of repetition.
These showed that the average pitch for the phrase as a
whole, and also for each syllable taken separately, was more
consistent across subjects and closer to the original spoken
phrase, in the repeat speech condition than in the nonrepeat
speech condition.
First, the across-subjects variance in average F0 was
considerably lower for renditions in the repeat speech condi-
tion than in the nonrepeat speech condition. Taking the aver-
age F0s for renditions of the entire phrase, this difference in
variance was highly significant statistically [F(10,
10) = 5.62, p < 0.01]. This pattern held when comparisons
were made for each syllable taken separately: for some,
F(10, 10) = 19.72, p < 0.0001; for times, F(10, 10) = 69.22,
p < 0.0001; for be, F(10, 10) = 6.2, p < 0.01; for have, F(10,
10) = 9.71, p < 0.001; for so, F(10, 10) = 22.68, p < 0.0001;
FIG. 3. Triangles show the average F0 of each syllable in the original spo-
ken phrase. Diamonds show the average F0 of each syllable, averaged over
the renditions of all subjects in the repeat speech condition. Squares show
the average F0 of each syllable, averaged over the renditions of all subjects
in the nonrepeat speech condition. F3 = 174.6 Hz; G3 = 196.0 Hz;
A3 = 220 Hz; B3 = 246.9 Hz; C#4 = 277.2 Hz; and D#4 = 311.1 Hz.
for strange, F(10, 10) = 35.76, p < 0.0001; and for ly, F(10,
10) = 12.71, p < 0.001.
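The variance-ratio (F) tests reported above compare the across-subjects variances of the two conditions with 10 and 10 degrees of freedom. A minimal sketch, with hypothetical per-subject average F0 values standing in for the real data:

```python
import numpy as np
from scipy import stats

def variance_ratio_test(a, b):
    """Two-sample variance-ratio F test, larger sample variance in the numerator."""
    va, vb = np.var(a, ddof=1), np.var(b, ddof=1)
    if va >= vb:
        f, df1, df2 = va / vb, len(a) - 1, len(b) - 1
    else:
        f, df1, df2 = vb / va, len(b) - 1, len(a) - 1
    return f, df1, df2, stats.f.sf(f, df1, df2)   # one-tailed p value

repeat_f0 = np.array([233, 230, 236, 228, 234, 231, 229, 235, 232, 230, 233], float)      # hypothetical
nonrepeat_f0 = np.array([210, 250, 195, 262, 228, 205, 248, 190, 240, 224, 258], float)   # hypothetical
print(variance_ratio_test(nonrepeat_f0, repeat_f0))   # F, df1, df2, one-tailed p
```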
Second, the average F0s for renditions in the repeat
speech condition were found to be considerably closer to the
average F0 of the original spoken phrase compared with
those in the nonrepeat speech condition. To evaluate this
effect statistically, for each subject a difference score was
obtained between the average F0 of her rendition of the
phrase and that of the original spoken phrase. Using an inde-
pendent samples t-test assuming unequal sample variances,
the difference scores were found to be significantly lower for
renditions following ten presentations than for those follow-
ing a single presentation [t(13.34) = 4.03, p < 0.01]. This
pattern held for all syllables taken individually except for the
last one: for some, t(11.01) = 5.37, p < 0.001; for times,
t(10.29) = 7.68, p < 0.0001; for be, t(13.14) = 6.46,
p < 0.0001; for have, t(12.04) = 6.28, p < 0.0001; for so,
t(10.88) = 4.23, p < 0.01; for strange, t(10.73) = 6.07,
p < 0.001; and for ly, there was a nonsignificant
[t(11.19) = 1.34, p = 0.23] trend in the same direction.
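The t-tests above are Welch (unequal-variance) independent-samples tests on the per-subject difference scores. A short SciPy sketch, with hypothetical difference scores in hertz:

```python
import numpy as np
from scipy import stats

repeat_diff = np.array([5, 8, 4, 10, 7, 6, 9, 5, 8, 6, 7], float)               # hypothetical |delta F0| (Hz)
nonrepeat_diff = np.array([35, 12, 48, 22, 60, 18, 41, 27, 55, 30, 44], float)  # hypothetical
t, p = stats.ttest_ind(repeat_diff, nonrepeat_diff, equal_var=False)            # Welch's t-test
print(f"t = {t:.2f}, p = {p:.4f}")
```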
To ensure that the differences between conditions found
here were not due to a simple effect of repetition, compari-
son was made between renditions in the nonrepeat speech
condition and the nonrepeat song condition, in which the
stimulus pattern was also presented to the subjects only
once. Figure 5 displays the average F0s of all syllables in the
original sung phrase, together with those averaged over all
renditions in the nonrepeat song condition. It can be seen
that the subjects’ renditions in this condition corresponded
closely to each other and also to the original sung phrase. As
further illustration, Fig. 6 displays the pitch tracings from
the original sung phrase, together with those from four repre-
sentative subjects in the nonrepeat song condition. (The trac-
ings were taken from those subjects whose tracings in the
nonrepeat speech condition are shown in Fig. 4.) It can be
seen that the renditions in the nonrepeat song condition were
more consistent across subjects and considerably closer to
the original sung phrase than were those in the nonrepeat
speech condition in relation to the original spoken phrase.
Two types of statistical comparison were made between
renditions in these two conditions. First, it was found that
the across-subjects variance in average F0 was considerably
lower for renditions in the nonrepeat song condition than in
the nonrepeat speech condition. Taking the average F0s for
renditions of the entire phrase, the difference in variance was
highly significant statistically [F(10, 10) = 7.39, p < 0.01].
This pattern held for all the syllables taken separately: for
some, F(10, 10) = 19.89, p < 0.0001; for times, F(10,
10) = 97.66, p < 0.0001; for be, F(10, 10) = 5.31, p < 0.01;
for have, F(10, 10) = 21.63, p < 0.0001; for so, F(10,
10) = 60.51, p < 0.0001; for strange, F(10, 10) = 66.06,
p < 0.0001; and for ly, F(10, 10) = 17.74, p < 0.0001.
Second, for each subject a difference score was obtained
in the nonrepeat song condition, taking the difference
between the average F0 of her rendition of the entire phrase
and that of the original sung phrase. Using a correlated sam-
ples t-test, the difference scores in the nonrepeat song condi-
tion were found to be significantly lower than those in the
nonrepeat speech condition [t(10) = 3.31, p < 0.01]. The
same pattern held for each of the seven syllables taken indi-
vidually, with the exception of the last one: for some,
t(10) = 4.16, p < 0.01; for times, t(10) = 7.56, p < 0.0001;
for be, t(10) = 5.69, p < 0.001; for have, t(10) = 6.12,
p < 0.001; for so, t(10) = 2.41, p < 0.05; for strange,
t(10) = 6.6, p < 0.0001; and for ly, there was a nonsignifi-
cant trend in the same direction, t(10) = 1.37, p = 0.200.
The above findings are in accordance with the hypothe-
sis that repeated listening to the original spoken phrase
causes its pitches to be heard more saliently, in a manner
more appropriate to song than to speech.
FIG. 4. Pitch tracings of the original spoken phrase, together with those
from four representative subjects in the repeat speech condition, and from
four representative subjects in the nonrepeat speech condition. F3 = 174.6
Hz; B3 = 246.9 Hz; and F4 = 349.2 Hz.
FIG. 5. Triangles show the average F0s of each syllable in the original sung
phrase. Squares show the average F0 of each syllable, averaged over the ren-
ditions of all subjects in the nonrepeat song condition. F3 = 174.6 Hz;
G3 = 196.0 Hz; A3 = 220 Hz; B3 = 246.9 Hz; C#4 = 277.2 Hz; and
D#4 = 311.1 Hz.
Renditions in the repeat speech condition were considerably
closer to the original, and showed considerably closer inter-sub-
ject agreement, than renditions in the nonrepeat speech
condition. However, renditions in the nonrepeat song condi-
tion were very close to the original, with strong inter-subject
agreement.
We now turn to the prediction that once the syllables
constituting the original phrase were heard as forming salient
pitches, these would also be perceptually distorted so as to
conform to a tonal melody. Specifically, it was predicted that
the subjects’ renditions of the spoken phrase in the repeat
speech condition would correspond to the pattern of pitches
notated in Fig. 1 and so would correspond to the sequence of
intervals (in semitones) of 0, −2, −2, +4, −2, and −7. It
was therefore hypothesized that the sequence of intervals
formed by the subjects’ renditions of the spoken phrase in
the repeat speech condition would be more consistent with
this melodic representation than with the sequence of inter-
vals formed by the original spoken phrase.
To test this hypothesis, the six melodic intervals formed
by the original spoken phrase were calculated, as were those
produced by the subjects’ renditions in the repeat speech
condition and the nonrepeat speech condition, taking the av-
erage F0 of each syllable as the basic measure. Then for
each condition two sets of difference scores were calculated
for each of the six intervals: (a) between the interval pro-
duced by the subject and that in the original spoken phrase,
and (b) between the interval produced by the subject and that
based on the hypothesized melodic representation. These
two sets of difference scores were then compared
statistically.
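The interval computation reduces to taking successive ratios of the per-syllable average F0s on a semitone scale (12 × log2 of the frequency ratio) and comparing them with (a) the intervals of the original spoken phrase and (b) the hypothesized sequence 0, −2, −2, +4, −2, −7. A minimal sketch, with hypothetical per-syllable F0 values:

```python
import numpy as np

def intervals_semitones(f0_per_syllable):
    f0 = np.asarray(f0_per_syllable, dtype=float)
    return 12 * np.log2(f0[1:] / f0[:-1])   # six intervals from seven syllables

hypothesized = np.array([0, -2, -2, 4, -2, -7], float)                 # melody notated in Fig. 1
original = intervals_semitones([233, 235, 208, 189, 238, 205, 150])    # hypothetical F0s (Hz)
rendition = intervals_semitones([247, 247, 220, 196, 247, 220, 147])   # hypothetical F0s (Hz)
diff_from_original = np.abs(rendition - original)       # difference scores, type (a)
diff_from_melody = np.abs(rendition - hypothesized)     # difference scores, type (b)
print(diff_from_original, diff_from_melody)
```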
The results are shown in Table I. It can be seen that, as
expected, the difference scores for renditions in the nonrep-
eat speech condition did not differ significantly for the two
types of comparison. However, the renditions in the repeat
speech condition were significantly closer to the hypothe-
sized melodic representation than they were to the original
spoken phrase. Specifically, in the repeat speech condition,
for each of the six intervals, the difference between the sub-
jects’ renditions and the hypothesized melodic representation
was smaller than between the subjects’ renditions and the
original spoken phrase. This difference between the two
types of comparison was statistically significant (p < 0.016,
one-tailed, on a binomial test). This result is in accordance
with the hypothesis that the subjects’ renditions following
repeated listening to the spoken phrase would be heavily
TABLE I. Difference scores, averaged across subjects, between the intervals produced by the subjects’ renditions and (a) produced by the original spoken phrase, and (b) based on the hypothesized melodic representation.

                                 Average difference (semitones)
                           (a) From original        (b) From melodic
                           spoken phrase            representation
Repeat speech condition
  some to times                  0.89                     0.75
  times to be                    0.63                     0.41
  be to have                     0.68                     0.43
  have to so                     1.42                     0.55
  so to strange                  2.04                     0.51
  strange to ly                  1.41                     0.72
Nonrepeat speech condition
  some to times                  1.85                     1.53
  times to be                    1.63                     1.31
  be to have                     0.91                     1.20
  have to so                     2.16                     2.53
  so to strange                  2.35                     1.68
  strange to ly                  5.06                     4.06
FIG. 6. Pitch tracings from the original sung phrase, together with those
from four representative subjects in the nonrepeat song condition. These
were the same subjects whose pitch tracings, taken from renditions in the
nonrepeat speech condition, are shown in Fig. 4. F3 = 174.6 Hz; B3 = 246.9
Hz; and F4 = 349.2 Hz.
influenced by the hypothesized perceptual representation in
terms of a tonal melody.
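The binomial figure can be reproduced directly: under the null hypothesis each of the six intervals is equally likely to lie closer to either reference, so the one-tailed probability that all six favor the melodic representation is (1/2)^6 = 1/64 ≈ 0.016. A one-line check with SciPy (a recent version is assumed; this was not part of the original analysis):

```python
from scipy.stats import binomtest

result = binomtest(6, n=6, p=0.5, alternative="greater")   # 6 of 6 intervals favor the melody
print(result.pvalue)   # 0.015625
```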
Table II shows the results of the same calculations that
were carried out based on the original sung phrase, i.e., in
the nonrepeat song condition. It can be seen that the differ-
ence scores were here very small and that they did not differ
depending on whether comparison was made with the origi-
nal sung phrase or with the hypothesized melodic representa-
tion. This result is expected on the assumption that the
original sung phrase was itself strongly influenced by the
melodic representation that had been constructed by the
singer.
IV. DISCUSSION
In considering the possible basis for this transformation
effect, we note that the vowel components of speech are
composed of harmonic series, so that one might expect their
pitches to be clearly perceived even in the absence of repeti-
tion. Yet in contrast to song, the pitch characteristics of
speech are rarely salient perceptually. We can therefore
hypothesize that in listening to the normal flow of speech,
the neural circuitry underlying pitch salience is somewhat
inhibited, perhaps to enable the listener to focus more on
other characteristics of the speech stream that are essential to
meaning, i.e., consonants and vowels. We can also hypothe-
size that exact repetition of the phrase causes this circuitry to
become disinhibited, with the result that the salience of the
perceived pitches is enhanced. Concerning the brain regions
that might be involved here, brain imaging studies have
identified a bilateral region in the temporal lobe, anterolat-
eral to primary auditory cortex, which responds preferen-
tially to sounds of high pitch salience (Patterson et al.,
2002; Penagos et al., 2004; Schneider et al., 2005). This
leads to the prediction that, as a result of repeated listen-
ing to this phrase, activation in these regions would be
enhanced.
The process underlying the perceptual transformation of
the spoken phrase into a well-formed tonal melody must nec-
essarily be a complex one, involving several levels of
abstraction (Deutsch, 1999). At the lowest level, the forma-
tion of melodic intervals occurs (Deutsch, 1969; Demany
and Ramos, 2005). This process involves regions of the tem-
poral lobe that are further removed from the primary audi-
tory cortex, with emphasis on the right hemisphere
(Patterson et al., 2002; Hyde et al., 2008; Stewart et al.,
2008). Furthermore, in order for listeners to perceive the
transformed phrase as a tonal melody, they must draw on
their long term memories for familiar music. This entails
projecting the pitch information onto overlearned scales
(Burns, 1999) and invoking further rule-based characteristics
of our tonal system (Deutsch, 1999; Deutsch and Feroe,
1981; Lerdahl and Jackendoff, 1983; Krumhansl, 1990; Ler-
dahl, 2001) and so requires the processing of musical syntax.
Further brain regions must therefore also be involved in the
perception of this phrase once it is perceived as song. Brain
imaging studies have shown that regions in the frontal lobe
in both hemispheres, particularly Broca’s area and its
homologue, are involved in processing musical syntax (Patel
et al., 1998; Maess et al., 2001; Janata et al., 2002; Koelsch
et al., 2002; Koelsch and Siebel, 2005). Furthermore, regions
in the parietal lobe, particularly in the left supramarginal
gyrus, have been found to be involved in the short term
memory for musical tones (Schmithorst and Holland, 2003;
Vines et al., 2006; Koelsch et al., 2009), as have other corti-
cal regions, such as the superior temporal gyrus (Janata
et al., 2002; Koelsch et al., 2002; Schmithorst and Holland,
2003; Warrier and Zatorre, 2004). We can therefore
hypothesize that when the phrase is perceived as sung, there
would be further activation in these regions also.
It should be pointed out that the subjects in the pres-
ent study had all received musical training. It has been
found that musically trained listeners are more sensitive to
pitch structures than untrained listeners (Schneider et al.,
2002; Thompson et al., 2004; Magne et al., 2006; Musac-
chia et al., 2007; Wong et al., 2007; Kraus et al., 2009;
Hyde et al., 2009), so it is possible that musically
untrained subjects may not produce results as clear as
those obtained in the present experiment.
Finally, the present findings have implications for gen-
eral theories concerning the substrates of music and speech
perception. As reviewed by Diehl et al. (2004) and Zatorre
and Gandour (2007), much research in this area has been
motivated by two competing theories. The domain-specific
theory assumes that the sounds of speech and of music are
each processed by a system that is dedicated specifically to
processing these sounds and that excludes other sounds (Lib-
erman and Mattingly, 1985; Peretz and Coltheart, 2003).
The cue-based theory assumes instead that whether a stimu-
lus is processed as speech, music, or some other sound
depends on its acoustic characteristics and that it is unneces-
sary to posit special-purpose mechanisms for processing
speech and music (Diehl et al., 2004; Zatorre et al., 2002).
The present findings cannot be accommodated by a strong
version of either theory, since the identical stimulus pattern
is perceived convincingly as speech under some conditions
and as music under others. It is proposed instead (similarly
to the proposal advanced by Zatorre and Gandour, 2007) that
speech and music are processed to a large extent by common
neural pathways, but that certain circuitries that are specific
either to speech or to music are ultimately invoked to pro-
duce the final percept.
TABLE II. Difference scores, averaged across subjects, between the intervals produced by the subjects and (a) produced by the original sung phrase and (b) based on the hypothesized melodic representation.

                                 Average difference (semitones)
                           (a) From original        (b) From melodic
                           sung phrase              representation
Nonrepeat song condition
  some to times                  0.46                     0.52
  times to be                    0.40                     0.43
  be to have                     0.35                     0.35
  have to so                     0.39                     0.43
  so to strange                  0.40                     0.31
  strange to ly                  0.82                     0.36
V. CONCLUSION
In conclusion, we have described and explored a new
perceptual transformation effect, in which a spoken phrase
comes to be heard as sung rather than spoken, simply as a
result of repetition. This effect is not just one of interpreta-
tion, since listeners upon hearing several repetitions of the
phrase sing it back with the pitches distorted so as to give
rise to a well-formed melody. Further research is needed to
characterize the features of a spoken phrase that are neces-
sary to produce this illusion, to document its neural under-
pinnings, and to understand why it occurs. Such research
should provide useful information concerning the brain
mechanisms underlying speech and song.
ACKNOWLEDGMENTS
We thank Adam Tierney, David Huber, and Julian Par-
ris for discussions and Stefan Koelsch and an anonymous
reviewer for helpful comments on an earlier draft of this
manuscript.
Boersma, P., and Weenink, D. (2006). “Praat: Doing phonetics by computer
(version 4.5.06),” http://www.praat.org/ (Last viewed December 8, 2010).
Burns, E. M. (1999). “Intervals, scales, and tuning,” in The Psychology of
Music, 2nd ed., edited by D. Deutsch (Academic Press, New York), pp.
215–258.
Demany, L., and Ramos, C. (2005). “On the binding of successive sounds:
Perceiving shifts in nonperceived pitches,” J. Acoust. Soc. Am. 117, 833–
841.
Deutsch, D. (2010). “Speaking in tones,” Sci. Am. Mind 21, 36–43.
Deutsch, D. (2003). Phantom Words, and Other Curiosities (Philomel
Records, La Jolla) (compact disc; Track 22).
Deutsch, D. (1999). “Processing of pitch combinations,” in The Psychology
of Music, 2nd ed., edited by D. Deutsch (Academic Press, New York), pp.
349–412.
Deutsch, D. (1969). “Music recognition,” Psychol. Rev. 76, 300–309.
Deutsch, D., and Feroe, J. (1981). “The internal representation of pitch
sequences in tonal music,” Psychol. Rev. 88, 503–522.
Diehl, R. L., Lotto, A. J., and Holt, L. L. (2004). “Speech perception,”
Annu. Rev. Psychol. 55, 149–179.
Hyde, K. L., Peretz, I., and Zatorre, R. J. (2008). “Evidence for the role of
the right auditory cortex in fine pitch resolution,” Neuropsychologia 46,
632–639.
Hyde, K. L., Lerch, J., Norton, A., Forgeard, M., Winner, E., Evans, A. C.,
and Schlaug, G. (2009). “Musical training shapes structural brain devel-
opment,” J. Neurosci. 29, 3019–3025.
Janata, P., Birk, J. L., Horn, J. D. V., Leman, M., Tillmann, B., and Bharu-
cha, J. J. (2002). “The cortical topography of tonal structures underlying
Western music” Science 298, 2167–2170.
Koelsch, S., Gunter, T. C., von Cramon, D. Y., Zysset, S., Lohmann, G.,
and Friederici, A. D. (2002). “Bach speaks: A cortical ‘language-network’
serves the processing of music,” Neuroimage 17, 956–966.
Koelsch, S., and Siebel, W.A. (2005). “Towards a neural basis of music
perception,” Trends Cogn. Sci. 9, 578–584.
Koelsch, S., Schulze, K., Sammler, D., Fritz, T., Muller, K., and Gruber, O.
(2009). “Functional architecture of verbal and tonal working memory: An
fMRI study,” Hum. Brain Mapp. 30, 859–873.
Kraus, N., Skoe, E., Parbery-Clark, A., and Ashley, R. (2009).
“Experience-induced malleability in neural encoding of pitch, timbre and
timing,” Ann. N. Y. Acad. Sci. 1169, 543–557.
Krumhansl, C. L. (1990). Cognitive Foundations of Musical Pitch (Oxford
University Press, New York), pp 1–318.
Lerdahl, F. (2001). Tonal Pitch Space (Oxford University Press, New York),
pp. 1–411.
Lerdahl, F., and Jackendoff, R. (1983). A Generative Theory of Tonal Music
(MIT Press, Cambridge, MA), pp. 1–368.
Liberman, A. M., and Mattingly, I. G. (1985). “The motor theory of speech
perception revised,” Cognition 21, 1–36.
Maess, B., Koelsch, S., Gunter, T. C., and Friederici, A. D. (2001). “Musical
syntax is processed in Broca’s area: An MEG study,” Nat. Neurosci. 4,
540–545.
Magne, C., Schön, D., and Besson, M. (2006). “Musician children detect
pitch violations in both music and language better than nonmusician chil-
dren: Behavioral and electrophysiological approaches,” J. Cogn. Neurosci.
18, 199–211.
Mottonen, R., Calvert, G. A., Jaaskelainen, I. P., Matthews, P. M., Thesen,
T., Tuomainen, J., and Sams, M. (2006). “Perceiving identical sounds as
speech or non-speech modulates activity in the left posterior superior tem-
poral sulcus,” Neuroimage 30, 563–569.
Musacchia, G., Sams, M., Skoe, E., and Kraus, N. (2007). “Musicians have
enhanced subcortical auditory and audiovisual processing of speech and
music,” Proc. Natl. Acad. Sci. U.S.A. 104, 15894–15898.
Patel, A. D., (2008). Music, Language, and the Brain (Oxford University
Press, Oxford), pp. 1–513.
Patel, A. D., Peretz, I., Tramo, M. J., and Lebreque, R. (1998). “Processing
prosodic and musical patterns: A neuropsychological investigation,” Brain
Lang. 61, 123–144.
Patterson, R. D., Uppenkamp, S., Johnsrude, I. S., and Griffiths, T. D.
(2002). “The processing of temporal pitch and melody information in au-
ditory cortex,” Neuron 36, 767–776.
Penagos, H., Melcher, J. R., and Oxenham, A. J. (2004). “A neural represen-
tation of pitch salience in nonprimary human auditory cortex revealed
with functional magnetic resonance imaging,” J. Neurosci. 24, 6810–
6815.
Peretz, I., and Coltheart, M. (2003). “Modularity of music processing,” Nat.
Neurosci. 6, 688–691.
Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981).
“Speech perception without traditional speech cues,” Science 212, 947–
949.
Schmithorst, V. J., and Holland, S. K. (2003). “The effect of musical train-
ing on music processing: A functional magnetic resonance imaging study
in humans,” Neurosci. Lett. 348, 65–68.
Schneider, P., Scherg, M., Dosch, H. G., Specht, H. J., Gutschalk, A., and
Rupp, A. (2002). “Morphology of Heschl’s gyrus reflects enhanced
activation in the auditory cortex of musicians,” Nat. Neurosci. 5, 688–
694.
Schneider, P., Sluming, V., Roberts, N., Scherg, M., Goebel, R., Specht, H.
J., Dosch, H. G., Bleeck, S., Stippich, C., and Rupp, A. (2005). “Structural
and functional asymmetry of lateral Heschl’s gyrus reflects pitch percep-
tion preference,” Nat. Neurosci. 8, 1241–1247.
Schön, D., Magne, C., and Besson, M. (2004). “The music of speech: Music
training facilitates pitch processing in both music and language,” Psycho-
physiology 41, 341–349.
Shtyrov, Y., Pihko, E., and Pulvermuller, F. (2005). “Determinants of domi-
nance: Is language laterality explained by physical or linguistic features of
speech?” Neuroimage 27, 37–47.
Stewart, L., Von Kriegstein, K., Warren, J. D., and Griffiths, T. D. (2006).
“Music and the brain: Disorders of musical listening,” Brain 129, 2533–
2553.
Stewart, L., Overath, T., Warren, J. D., Foxton, J. M., and Griffiths, T. D.
(2008). “fMRI evidence for a cortical hierarchy of pitch pattern proc-
essing,” PLoS ONE 3, e1470.
Thompson, W. F., Schellenberg, E.G. and Husain, G. (2004). “Decoding
speech prosody: Do music lessons help?” Emotion 4, 46–64.
Vines, B. W., Schnider N. M., and Schlaug, G. (2006). “Testing for causality
with transcranial direct current stimulation: Pitch memory and the left
supramarginal gyrus,” NeuroReport 17, 1047–1050.
Warrier, C. M., and Zatorre, R. J. (2004). “Right temporal cortex is critical
for utilization of melodic contextual cues in a pitch constancy task,” Brain
127, 1616–1625.
Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T., and Kraus, N. (2007).
“Musical experience shapes human brainstem encoding of linguistic pitch
patterns,” Nat. Neurosci. 10, 420–422.
Zatorre, R. J., and Gandour, J. T. (2007). “Neural specializations for speech
and pitch: Moving beyond the dichotomies,” Philos. Trans. R. Soc. Lon-
don Ser. B 2161, 1–18.
Zatorre, R. J., Belin, P., and Penhune, V. B. (2002). “Structure and function
of auditory cortex: Music and speech,” Trends Cogn. Sci. 6, 37–46.
2252 J. Acoust. Soc. Am., Vol. 129, No. 4, April 2011 Deutsch et al.: The speech-to-song illusion
Author's complimentary copy
... It is rather slow, shows more pitch variation, and is often perceived to be more melodic in its characteristics than adult speech (Kuhl et al. 1997;McMullen and Saffran 2004). Indeed, song and melody are based on discrete pitches, which are sustained over longer durations compared to speech (Deutsch et al. 2011). Even though language and music show many similarities, they are based on different sound systems. ...
... Interestingly, while the relationship between language and music has been addressed in various domains, looking at how melodic languages are perceived has largely been neglected. There is some research that focuses on a phenomenon in which spoken utterances are transformed to sound like song, which is achieved by repetition (Deutsch et al. 2011). In a series of experiments, the researchers concluded that this phenomenon is valid as long as the samples, which were repeatedly provided, were exactly the same. ...
... We provided only little evidence that musical capacity is related to how melodic languages appear to individuals. Research has illustrated that the repetition of speech can transform language to sound like song (Deutsch et al. 2011). Margulis et al. (2015) used this speech-to-song illusion and found that languages that were more difficult to pronounce appeared to be more musical even before any repetition took place. ...
Article
Research has shown that melody not only plays a crucial role in music but also in language acquisition processes. Evidence has been provided that melody helps in retrieving, remembering, and memorizing new language material, while relatively little is known about whether individuals who perceive speech as more melodic than others also benefit in the acquisition of oral languages. In this investigation, we wanted to show which impact the subjective melodic perception of speech has on the pronunciation of unfamiliar foreign languages. We tested 86 participants for how melodic they perceived five unfamiliar languages, for their ability to repeat and pronounce the respective five languages, for their musical abilities, and for their short-term memory (STM). The results revealed that 59 percent of the variance in the language pronunciation tasks could be explained by five predictors: the number of foreign languages spoken, short-term memory capacity, tonal aptitude, melodic singing ability, and how melodic the languages appeared to the participants. Group comparisons showed that individuals who perceived languages as more melodic performed significantly better in all language tasks than those who did not. However, even though we expected musical measures to be related to the melodic perception of foreign languages, we could only detect some correlations to rhythmical and tonal musical aptitude. Overall, the findings of this investigation add a new dimension to language research, which shows that individuals who perceive natural languages to be more melodic than others also retrieve and pronounce utterances
... For instance, singing and its positive effect on speech recovery for therapeutic purposes have been investigated in detail (Norton et al. 2009). Other research on the relationship of song and speech have reported that languages can perceptually also be transformed into song by repetition (Deutsch et al. 2011). For instance, Deutsch et al. (2011) have noted that spoken utterances can be transformed to song by repeating the same language stimuli several times. ...
... Other research on the relationship of song and speech have reported that languages can perceptually also be transformed into song by repetition (Deutsch et al. 2011). For instance, Deutsch et al. (2011) have noted that spoken utterances can be transformed to song by repeating the same language stimuli several times. This finding was taken up by Margulis et al. (2015) who investigated language pronunciation abilities. ...
Article
Full-text available
Research on singing and language abilities has gained considerable interest in the past decade. While several studies about singing ability and language capacity have been published, investigations on individual differences in singing behavior during childhood and its relationship to language capacity in adulthood have largely been neglected. We wanted to focus our study on whether individuals who had sung more often during childhood than their peers were also better in language and music capacity during adulthood. We used questionnaires to assess singing behavior of adults during childhood and tested them for their singing ability, their music perception skills, and their ability to perceive and pronounce unfamiliar languages. The results have revealed that the more often individuals had sung during childhood, the better their singing ability and language pronunciation skills were, while the amount of childhood singing was less predictive on music and language perception skills. We suggest that the amount of singing during childhood seems to influence the ability to sing and the ability to acquire foreign language pronunciation later in adulthood.
... Por tanto, utilizar una definición subjetiva no sería del todo desacertado. También tendría sentido si utilizamos la literatura científica al respecto, ya que en realidad la música no es música hasta que un sujeto la interpreta como tal (Deutsch, Henthorn & Lapidis, 2011). ...
Chapter
En un modelo educativo donde se apuesta cada vez más por las competencias STEM (Science, Technology, Engineering and Mathematics), basadas en la ciencia, la tecnología, la ingeniería y las matemáticas, resulta fundamental introducir los aspectos relativos a la creatividad a través de las artes, pasando entonces el modelo STEAM (Science, Technology, Engineering, Arts and Mathematics). La música ha estado presente a lo largo de toda la historia del ser humano. Su estudio ha permitido no solamente comprender mejor qué es, sino también qué mitos han estado involucrados en sus posibles efectos y usos. Mediante esta caracterización de la música como herramienta, es posible identificar nuevos ámbitos de aplicación. Este es el caso de su incorporación a contextos científico-tecnológicos para su empleo con fines didácticos como vía de transmisión de conocimientos. Esta propuesta introduce un caso práctico de cohesión entre la música y el fenómeno de la fotosíntesis a través de la energía fotovoltaica, presentado mediante un recurso didáctico audiovisual. Su publicación, además, ha permitido evaluar su impacto en la audiencia.
... Doris' repetitions while remembering certain events in the past creates the illusion of a song. This is a phenomenon "in which a spoken phrase is perceptually transformed to sound like a song rather than speech, simply by repeating it several times over" and which is called "speech-to-song illusion" (Deutsch et al. 2014(Deutsch et al. : 2245. Doris uses repetitions especially when she needs to present and signal the intensity of her emotions. ...
Article
Full-text available
Abstract: Marginalized cultures regularly seek strategies for supplanting the norms of the dominant culture in order to survive the encroaching sovereignty of its hegemonic discourse. This article aims to identify African-American underlying strategies for resisting the “white” culture, as represented in Toni Morrison’s Beloved and Alice Walker’s Meridian. The subalterns in these novels use faultlines to destabilize the nodal points on which an overarching culture’s field of discursivity is installed. The dissident culture ultimately opens up spaces for residual and emergent signifiers to redefine the existing signs in the supposedly fixed field of discursivity. Keywords: dominant discourse, faultlines, field of discursivity, nodal points, residual and emergent elements, dissidence
... Repetition can involve single notes, melodic motifs, chord progressions, rhythmic patterns, and the entire musical piece. Repetitiveness in music seems to be also a foundational perceptual principle: the speech-to-song illusion is a striking phenomenon in psychological research on music and language, whereby repetition of speech phrases leads to them being perceived as sung speech (Deutsch et al., 2011). Certain speech phrases, especially when characterised by relatively flat within-syllable pitch contours and less variability in tempo, are more prone to be judged as musical by Western listeners (Tierney et al., 2018). ...
Article
Full-text available
Music and spoken language share certain characteristics: both consist of sequences of acoustic elements that are combinatorically combined, and these elements partition the same continuous acoustic dimensions (frequency, formant space and duration). However, the resulting categories differ sharply: scale tones and note durations of small integer ratios appear in music, while speech uses phonemes, lexical tone, and non-isochronous durations. Why did music and language diverge into the two systems we have today, differing in these specific features? We propose a framework based on information theory and a reverse-engineering perspective, suggesting that design features of music and language are a response to their differential deployment along three different continuous dimensions. These include the familiar propositional-aesthetic (‘goal’) and repetitive-novel (‘novelty’) dimensions, and a dialogic-choric (‘interactivity’) dimension that is our focus here. Specifically, we hypothesize that music exhibits specializations enhancing coherent production by several individuals concurrently—the ‘choric’ context. In contrast, language is specialized for exchange in tightly coordinated turn-taking—‘dialogic’ contexts. We examine the evidence for our framework, both from humans and non-human animals, and conclude that many proposed design features of music and language follow naturally from their use in distinct dialogic and choric communicative contexts. Furthermore, the hybrid nature of intermediate systems like poetry, chant, or solo lament follows from their deployment in the less typical interactive context.
... If we are to consider the meaning of a signal as the desired response of its receiver, then the cyclicity of music carries one of its most basic meanings: it invites participation. This is supported by studies showing that repetition of sound stimuli makes them feel more musical (Simchy-Gross & Margulis, 2018), and perhaps best illustrated by the speech-to-song illusion, in which the repetition of a speech phrase re-orients listeners away from decoding its linguistic meaning, to a focus on reproducing its rhythmic and tonal structure (Deutsch et al., 2011). Repetitive rhythmic patterns are, in that sense, like gaze following and pointing for referential communication. ...
Article
Theories of music evolution rely on our understanding of what music is. Here, I argue that music is best conceptualized as an interactive technology and propose a coevolutionary framework for its emergence. I present two basic models of attachment formation through behavioral alignment, applicable to all forms of affiliative interaction, and argue that the most critical distinguishing feature of music is entrained temporal coordination. Music's unique interactive strategy invites active participation and allows interactions to last longer, include more participants, and unify emotional states more effectively. Regarding its evolution, I propose that music, like language, evolved through a process of collective invention followed by genetic accommodation. I provide an outline of the initial evolutionary process that led to the emergence of music, centered on four key features: technology, shared intentionality, extended kinship, and multilevel society. Implications of this framework for music evolution, psychology, and cross-species and cross-cultural research are discussed.
Chapter
This chapter deals with the use of traditional Western musical notation to transcribe and analyze Speech Prosody. Prosody is everything that extends beyond vocalic and consonantal segments; the term can be used in the singular or the plural. Among the prosodies are intonation, speech contours, word stress, sentence stress, speech rhythm, voice quality, duration, the tempo of speech, (re)syllabification, and all phonological processes. From the perspective of this book, the equivalent of Speech Prosody in music is Speech Melody, or Speech Melodies. As these terms indicate, both Speech Prosodies and Speech Melodies deal with speech phenomena. However, Speech Prosodies are mainly associated with speech and hearing in any situation outside music, whereas Speech Melodies are associated with speech and hearing in music.
Article
We examined pitch-error detection in well-known songs sung with or without meaningful lyrics. In Experiment 1, adults heard the initial phrase of familiar songs sung with lyrics or with repeating syllables (la) and judged whether they heard an out-of-tune note. Half of the renditions had a single pitch error (50 or 100 cents); half were in tune. Listeners were poorer at pitch-error detection in songs with lyrics. In Experiment 2, within-note pitch fluctuations in the same performances were eliminated by auto-tuning. Again, pitch-error detection was worse for renditions with lyrics (50 cents), suggesting adverse effects of semantic processing. In Experiment 3, songs were sung with repeating syllables or with scat syllables to ascertain the role of phonetic variability. Performance was poorer for scat than for repeating syllables, indicating adverse effects of phonetic variability, but overall performance exceeded that in Experiment 1. In Experiment 4, listeners evaluated songs in all three styles (repeating syllables, scat, lyrics) within the same session. Performance was best with repeating syllables (50 cents) and did not differ between the scat and lyric versions. In short, tracking the pitches of highly familiar songs was impaired by the presence of words, an impairment stemming primarily from phonetic variability rather than from interference from semantic processing.
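To put these cent values in perspective: by the definition of the cent, a mistuning of c cents corresponds to a frequency ratio of

r = 2^(c/1200),

so the 50-cent errors above amount to frequency shifts of roughly 2.9% (half an equal-tempered semitone) and the 100-cent errors to roughly 5.9% (one full semitone).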
Article
Music is often described in the laboratory and in the classroom as a beneficial tool for memory encoding and retention, with a particularly strong effect when words are sung to familiar compared to unfamiliar melodies. However, the neural mechanisms underlying this memory benefit, especially those related to familiar music, are not well understood. The current study examined whether neural tracking of the slow syllable rhythms of speech and song is modulated by melody familiarity. Participants became familiar with twelve novel melodies over four days prior to MEG testing. Neural tracking of the same utterances spoken and sung revealed greater cerebro-acoustic phase coherence for sung than for spoken utterances, but did not show an effect of familiar melody when stimuli were grouped by their assigned (trained) familiarity. When participants' subjective ratings of perceived familiarity during the MEG testing session were used to group stimuli, however, a large effect of familiarity was observed. This effect was not specific to song, as it was observed in both sung and spoken utterances. Exploratory analyses revealed some in-session learning of unfamiliar and spoken utterances, with increased neural tracking for untrained stimuli by the end of the MEG testing session. Our results indicate that top-down factors like familiarity are strong modulators of neural tracking for music and language. Participants' neural tracking was related to their perception of familiarity, which was likely driven by a combination of effects from repeated listening, stimulus-specific melodic simplicity, and individual differences. Beyond the acoustic features of music alone, top-down factors built into the music listening experience, like repetition and familiarity, play a large role in the way we attend to and encode information presented in a musical context.
Article
In the speech-to-song illusion, a spoken phrase is presented repeatedly and begins to sound as if it is being sung. Anecdotal reports suggest that subsequent presentations of a previously heard phrase enhance the illusion, even if several hours or days have elapsed between presentations. In Experiment 1, we examined in a controlled laboratory setting whether memory traces of a previously heard phrase influence song-like ratings of a subsequent presentation of that phrase. The results showed that word lists played several times throughout the experimental session were rated as more song-like at the end of the experiment than word lists played only once. In Experiment 2, we examined whether the memory traces that influence the speech-to-song illusion are abstract in nature or exemplar-based, by playing some word lists several times during the experiment in the same voice and playing other word lists several times but in different voices. The results showed that word lists played in the same voice were rated as more song-like at the end of the experiment than word lists played in different voices. Many previous studies have examined how various aspects of the stimulus itself influence perception of the speech-to-song illusion. The present experiments demonstrate that memory traces of the stimulus also influence the speech-to-song illusion.
Article
Speech and music are highly complex signals that have many shared acoustic features. Pitch, timbre, and timing can be used as overarching perceptual categories for describing these shared properties. The acoustic cues contributing to these percepts also have distinct subcortical representations which can be selectively enhanced or degraded in different populations. Musically trained subjects are found to have enhanced subcortical representations of pitch, timbre, and timing. The effects of musical experience on subcortical auditory processing are pervasive and extend beyond music to the domains of language and emotion. The sensory malleability of the neural encoding of pitch, timbre, and timing can be affected by lifelong experience and short-term training. This conceptual framework and supporting data can be applied to consider sensory learning of speech and music through a hearing aid or cochlear implant.
Article
The relative pitch of harmonic complex sounds, such as instrumental sounds, may be perceived by decoding either the fundamental pitch (f0) or the spectral pitch (fSP) of the stimuli. We classified a large cohort of 420 subjects, including symphony orchestra musicians, as either f0 or fSP listeners, depending on their dominant perceptual mode. In a subgroup of 87 subjects, MRI (magnetic resonance imaging) and magnetoencephalography studies demonstrated a strong neural basis for both types of pitch perception, irrespective of musical aptitude. Compared with f0 listeners, fSP listeners possessed a pronounced rightward, rather than leftward, asymmetry of gray matter volume and P50m activity within the pitch-sensitive lateral Heschl's gyrus. Our data link relative hemispheric lateralization with perceptual stimulus properties, whereas the absolute size of Heschl's gyrus depends on musical aptitude.
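The f0 versus fSP distinction can be made concrete with a tone whose fundamental is physically absent. The short Python sketch below is a hypothetical illustration (the function name and parameter values are chosen for demonstration and are not taken from the study above): it synthesizes a harmonic complex containing only the 4th, 5th, and 6th harmonics of 200 Hz. An f0 listener would tend to hear the missing fundamental at 200 Hz, whereas an fSP listener would tend to hear a pitch near the lowest partial actually present, 800 Hz; in the study above, the dominant perceptual mode with such ambiguous tones is what classifies a listener as f0 or fSP.

import numpy as np

def harmonic_complex(f0, harmonics, dur=0.5, sr=44100):
    # Sum equal-amplitude sinusoids at the chosen harmonic numbers of f0;
    # with harmonics (4, 5, 6) the fundamental itself is never synthesized.
    t = np.arange(int(dur * sr)) / sr
    return sum(np.sin(2 * np.pi * f0 * h * t) for h in harmonics) / len(harmonics)

tone = harmonic_complex(200.0, harmonics=(4, 5, 6))       # partials at 800, 1000, 1200 Hz
print("partials (Hz):", [200.0 * h for h in (4, 5, 6)])   # the 200-Hz fundamental is absent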
Article
This article examines how humans optimally form hierarchies, using a model based on the assumptions that tonal music is solely the product of human processing systems and that pitch sequences are retained as hierarchical networks. At each level of the hierarchy, elements are organized as structural units in accordance with Gestalt principles such as proximity and good continuation. In addition, the elements present at each hierarchical level are elaborated by further elements to form structural units at the next-lower level, until the lowest level is reached. Processing advantages include the following: (a) redundancy of representation, (b) use of distinct alphabets at different structural levels, and (c) representations formed in accordance with the laws of figural goodness.
Book
This book builds on and in many ways completes the project of Fred Lerdahl and Ray Jackendoff's influential A Generative Theory of Tonal Music. Like the earlier volume, this book is both a music-theoretic treatise and a contribution to the cognitive science of music. After presenting some modifications to Lerdahl and Jackendoff's original framework, the book develops a quantitative model of listeners' intuitions of the relative distances of pitches, chords, and regions from a given tonic. The model is used to derive prolongational structure, trace paths through pitch space at multiple prolongational levels, and compute patterns of tonal tension and attraction as musical events unfold. The consideration of pitch-space paths illuminates issues of musical narrative, and the treatment of tonal tension and attraction provides a technical basis for studies of musical expectation and expression. These investigations lead to a fresh theory of tonal function and reveal an underlying parallel between tonal and metrical structures. Later portions of the book apply these ideas to highly chromatic tonal as well as atonal music. In response to stylistic differences, the shape of pitch space changes and psychoacoustic features become increasingly important, while underlying features of the theory remain constant, reflecting unvarying features of the musical mind. The theory is illustrated throughout by analyses of music from Bach to Schoenberg, and frequent connections are made to the music-theoretic and psychological literature.
Article
In the first comprehensive study of the relationship between music and language from the standpoint of cognitive neuroscience, the author challenges the widespread belief that music and language are processed independently. Since Plato's time, the relationship between music and language has attracted interest and debate from a wide range of thinkers. Recently, scientific research on this topic has been growing rapidly, as scholars from diverse disciplines including linguistics, cognitive science, music cognition, and neuroscience are drawn to the music-language interface as one way to explore the extent to which different mental abilities are processed by separate brain mechanisms. Accordingly, the relevant data and theories have been spread across a range of disciplines. This book provides the first synthesis, arguing that music and language share deep and critical connections, and that comparative research provides a powerful way to study the cognitive and neural mechanisms underlying these uniquely human abilities.
Article
The chapter discusses the possible origins and bases of scales, including those aspects of scales that are universal across musical cultures. It also addresses the perception of the basic unit of melodies and scales, the musical interval. Natural intervals, defined as intervals that show maximum sensory consonance and harmony, have influenced the evolution of the scales of many musical cultures, but the standards of intonation for a given culture are the learned interval categories of the scales of that culture. Based on the results of musical interval adjustment and identification experiments, and on measurements of intonation in performance, the intonation standard for Western music appears to be a version of the equal-tempered scale that is slightly compressed for small intervals and stretched for wide intervals, including the octave. The perception of musical intervals shares a number of commonalities with the perception of phonemes in speech, most notably categorical-like perception and an equivalence of spacing, in sensation units, of categories along the respective continua. However, the perception of melodic musical intervals appears to be the only example of ideal categorical perception in which discrimination is totally dependent on identification. The chapter therefore concludes that, rather than speech being "special," as is often proclaimed by experimental psychologists, it seems that music is truly special.
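As a brief numerical illustration of the natural-interval idea (the specific values here are standard textbook figures, not taken from the chapter): an interval with frequency ratio r spans 1200·log2(r) cents, so the just fifth (3/2) spans about 702 cents against the equal-tempered fifth's 700, and the just major third (5/4) about 386 cents against the tempered third's 400. The slight stretching of wide intervals mentioned above means, for example, that preferred and performed octaves tend to exceed the exact 1200 cents of the 2:1 ratio.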
Article
The left superior temporal cortex shows greater responsiveness to speech than to non-speech sounds according to previous neuroimaging studies, suggesting that this brain region has a special role in speech processing. However, since speech sounds differ acoustically from non-speech sounds, it is possible that this region is not involved in speech perception per se, but rather in the processing of some complex acoustic features. "Sine-wave speech" (SWS) provides a tool for studying neural speech specificity using identical acoustic stimuli, which can be perceived either as speech or as non-speech, depending on previous experience with the stimuli. We scanned 21 subjects using 3T functional MRI in two sessions, both including SWS and control stimuli. In the pre-training session, all subjects perceived the SWS stimuli as non-speech. In the post-training session, the identical stimuli were perceived as speech by 16 subjects. In these subjects, SWS stimuli elicited significantly stronger activity within the left posterior superior temporal sulcus (STSp) in the post- vs. pre-training session. In contrast, activity in this region was not enhanced after training in the 5 subjects who did not perceive the SWS stimuli as speech. Moreover, the control stimuli, which were always perceived as non-speech, elicited similar activity in this region in both sessions. Altogether, the present findings suggest that activation of the neural speech representations in the left STSp might be a prerequisite for hearing sounds as speech.
Chapter
This book addresses the central problem of music cognition: how listeners' responses move beyond mere registration of auditory events to include the organization, interpretation, and remembrance of these events in terms of their function in a musical context of pitch and rhythm. The work offers an analysis of the relationship between the psychological organization of music and its internal structure. It combines over a decade of original research on music cognition with an overview of the available literature. The author also provides a background in experimental methodology and music theory.