Cognitive, Affective, & Behavioral Neuroscience
2002, 2 (2), 121-140
Most natural environments contain many simultane-
ously active sound sources, thereby presenting listeners
with a complex auditory scene. As with visual scenes, in
which it is important to segregate objects from one another
in order to be able to orient to and interact with the visual
environment, the auditory scene must also undergo ex-
tensive analysis before individual sources can be identi-
fied and segregated from others (Bregman, 1990). Natural
auditory scenes vary greatly in their content and complex-
ity, and it is often necessary to focus attention selectively on
a single sound source. The most famous example of an au-
ditory scene that humans analyze is the cocktail party in
which the voice of an individual must be separated from
the voices of others (Cherry, 1953). Music forms the
basis for another intriguing and diverse class of auditory
scenes, which are created and appreciated in all human
cultures.
Polyphonic music is a composite of multiple streams,
in which the aggregate can be appreciated, yet individual
components can be analyzed. For example, when listening
to a blues band, it is relatively easy to focus attention se-
lectively on the drums, bass, keyboards, or lead guitar.
Relevant acoustic features that facilitate stream segrega-
tion are the relative pitch height and rate at which tones
are presented (Bregman & Campbell, 1971), timbre (Iver-
son, 1995), rhythmic patterns (Jones, 1993), and spatial lo-
cation (see Bregman, 1990, for a review). These features
provide a basis for selective attention in music, and they
define auditory objects, sequences of which constitute au-
ditory streams. For example, when the tones of two differ-
ent melodies are interleaved temporally, a listener can dis-
cern the individual melodies, provided they differ from
each other along at least one feature dimension—for ex-
ample, pitch range, timbre, or loudness (Dowling, 1973).
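The interleaved-melody situation just described can be sketched as a toy demonstration: two melodies are temporally interleaved, and a listener who attends along a single feature dimension (here, pitch height) can recover each stream. All note values and the split point are hypothetical MIDI pitches chosen for illustration.

```python
# Toy illustration of stream segregation by pitch range (cf. Dowling, 1973).
# Two melodies are interleaved note by note; filtering by pitch height
# recovers each one.

high_melody = [72, 74, 76, 74, 72]   # upper pitch range (hypothetical)
low_melody = [48, 50, 52, 50, 48]    # lower pitch range (hypothetical)

# Temporal interleaving: h1, l1, h2, l2, ...
interleaved = [note for pair in zip(high_melody, low_melody) for note in pair]

# Segregation along one feature dimension (pitch height): split at a
# boundary between the two ranges.
BOUNDARY = 60  # hypothetical split point (middle C)
stream_a = [n for n in interleaved if n >= BOUNDARY]
stream_b = [n for n in interleaved if n < BOUNDARY]

print(stream_a == high_melody and stream_b == low_melody)  # True
```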
The neural mechanisms underlying auditory attention to
both single and multiple streams of auditory information
have been studied extensively, using event-related poten-
tial (ERP) recordings from the human scalp (see Näätänen,
1992, for a review). With few exceptions, however, audi-
tory streams in most ERP and psychophysical experiments
have consisted of extremely simple stimuli, such as re-
peated pure tones with occasional pitch or loudness de-
viants. Experiments in which selective attention to one of
two streams has been investigated generally have distin-
guished the streams spatially by presenting a different
stimulus train to each ear (Alho, 1992; Hillyard, Hink,
Schwent, & Picton, 1973; Woldorff, Hackley, & Hillyard,
1991). The neural mechanisms underlying selective atten-
Copyright 2002 Psychonomic Society, Inc.
This research was supported by NIH Grant P50 NS17778-18 to J.J.B.,
the National Institute of Drug Abuse, the McDonnel Foundation, and
the Dartmouth Brain Imaging Center. We thank Jeffrey L. Birk for tran-
scribing the Schubert excerpt into MIDI format for Experiment 2, Lau-
ren Fontein for helping us to test subjects, Matthew Brett for providing
display routines, Souheil Inati for assistance in pulse sequence selection
and for providing file conversion routines, and Scott Grafton for helpful
comments. The manuscript benefited from the criticisms and sugges-
tions of two anonymous reviewers. The data and stimuli from the ex-
periments reported in this paper are available upon request from the
fMRI Data Center at Dartmouth College (http://www.fmridc.org) under
Accession Number 2-2002-112YT. Correspondence concerning this ar-
ticle should be addressed to P. Janata, Department of Psychological and
Brain Sciences, 6207 Moore Hall, Dartmouth College, Hanover, NH
03755 (e-mail: petr.janata@dartmouth.edu).
Listening to polyphonic music
recruits domain-general attention
and working memory circuits
PETR JANATA, BARBARA TILLMANN, and JAMSHED J. BHARUCHA
Dartmouth College, Hanover, New Hampshire
Polyphonic music combines multiple auditory streams to create complex auditory scenes, thus pro-
viding a tool for investigating the neural mechanisms that orient attention in natural auditory contexts.
Across two fMRI experiments, we varied stimuli and task demands in order to identify the cortical areas
that are activated during attentive listening to real music. In individual experiments and in a conjunction
analysis of the two experiments, we found bilateral blood oxygen level dependent (BOLD) signal in-
creases in temporal (the superior temporal gyrus), parietal (the intraparietal sulcus), and frontal (the pre-
central sulcus, the inferior frontal sulcus and gyrus, and the frontal operculum) areas during selective and
global listening, as compared with passive rest without musical stimulation. Direct comparisons of the
listening conditions showed significant differences between attending to single timbres (instruments)
and attending across multiple instruments, although the patterns that were observed depended on the rel-
ative demands of the tasks being compared. The overall pattern of BOLD signal increases indicated that
attentive listening to music recruits neural circuits underlying multiple forms of working memory, atten-
tion, semantic processing, target detection, and motor imagery. Thus, attentive listening to music appears
to be enabled by areas that serve general functions, rather than by music-specific cortical modules.
tion to somewhat more complex streams consisting of re-
peating phoneme and consonant-vowel (CV) syllable to-
kens have also been investigated (Hink, Hillyard, & Ben-
son, 1978; Sams, Aulanko, Aaltonen, & Näätänen, 1990;
Szymanski, Yund, & Woods, 1999a, 1999b). Selective at-
tention in the presence of more than two streams has re-
ceived scant attention (Alain & Woods, 1994; Brochard,
Drake, Botte, & McAdams, 1999; Woods & Alain, 2001).
Functional neuroimaging (positron emission tomogra-
phy [PET] and functional magnetic resonance imaging
[fMRI]) investigations of attention in auditory contexts
have begun to provide converging evidence as to which
general brain circuits are recruited during detection of tar-
get objects within single streams with or without the pres-
ence of secondary streams. In an fMRI experiment, Pugh
et al. (1996) presented both binaural (single stream) and
dichotic (dual stream) conditions while subjects per-
formed a CV or a frequency modulation (FM) sweep de-
tection task. The increased attentional demands in the di-
chotic condition resulted in increased activation in the
inferior parietal, inferior frontal, and auditory association
areas. In a PET/ERP experiment modeled on the typical
auditory oddball paradigm used in the ERP field, Tzourio
et al. (1997) presented frequent low tones and rare high
tones, under passive-listening conditions and with deviance
detection, randomly to the right or the left ear. Subjects
attended to and reported deviants presented to one of the
two ears. Both Heschl’s gyrus (HG) and the planum tem-
porale (PT) were active in the passive and attentive con-
ditions, as compared with the rest condition, but there was
no significant activation difference in these areas between
the passive and the attentive conditions. However, selec-
tive attention increased activation significantly, as com-
pared with passive listening, in the supplementary motor
area (SMA), anterior cingulate, and precentral gyrus bi-
laterally. These results and others (Linden et al., 1999; Za-
torre, Mondor, & Evans, 1999) suggest that selective atten-
tion operates on representations of auditory objects after
processing in the primary and secondary auditory areas is
complete, although other fMRI and ERP data suggest that
selective attention operates on early auditory representa-
tions (Jäncke, Mirzazade, & Shah, 1999; Woldorff et al.,
1993; Woldorff & Hillyard, 1991). However, in the exper-
iments cited above, the stimuli were acoustically simple,
the contexts were unnatural, and attention was deployed in
the framework of a target detection task. Thus, the spe-
cific requirements of target detection and decision making
were combined with potentially more general attentional
processes.
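The oddball-style design attributed above to Tzourio et al. (1997) can be sketched as a stimulus schedule: frequent "standard" low tones and rare "deviant" high tones, each delivered randomly to the left or right ear. The probabilities and frequencies below are illustrative placeholders, not the study's actual values.

```python
import random

# Sketch of an auditory oddball stimulus schedule: mostly standards, with
# occasional deviants, randomized over ears. Parameters are hypothetical.

random.seed(0)  # reproducible illustration

N_TONES = 200
P_DEVIANT = 0.15                      # hypothetical deviant probability
STANDARD_HZ, DEVIANT_HZ = 500, 1000   # hypothetical tone frequencies

schedule = []
for _ in range(N_TONES):
    is_deviant = random.random() < P_DEVIANT
    schedule.append({
        "freq_hz": DEVIANT_HZ if is_deviant else STANDARD_HZ,
        "ear": random.choice(["left", "right"]),
    })

n_deviants = sum(tone["freq_hz"] == DEVIANT_HZ for tone in schedule)
print(f"{n_deviants} deviants out of {N_TONES} tones")
```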
In the music perception domain, Platel et al. (1997) per-
formed a PET study of selective attention to musical fea-
tures of short, simple tonal sequences. While listening to
the same body of material repeatedly, subjects identified
the category membership of individual sequences along
the feature dimensions of pitch, timbre, and rhythm. The
goal of the study was to identify those brain areas that
uniquely represent or process these different feature di-
mensions. By contrasting various combinations of activa-
tion maps—for example, pitch versus timbre or rhythm
versus pitch and timbre combined—several areas in the
frontal lobes were found to be differentially activated by
attention to the different feature dimensions. In a recent
PET study of attentive listening in musicians, Satoh,
Takeda, Nagata, Hatazawa, and Kuzuhara (2001) found
that listening for target pitches in the alto voice of four-
part harmonic progressions resulted in greater bilateral re-
gional cerebral blood flow (rCBF) in superior parietal and
frontal areas, relative to listening for minor chord targets
in the same progressions. The frontal activations appear to
include regions around the inferior frontal sulcus (IFS)
and the precentral sulcus (PcS), the left SMA/pre-SMA,
and the orbitofrontal cortex.
Although the use of acoustically and contextually sim-
ple stimuli is alluring from an experimental point of view,
our brains have adapted to rich and complex acoustic en-
vironments, and it is of interest to determine whether
stream segregation tasks using more complex stimuli ex-
hibit brain response patterns similar to those using sim-
pler stimuli. Therefore, just as experiments on the neural
mechanisms of visual processing have largely progressed
beyond the use of simple oriented bar stimuli to the use of
faces and other complex objects, both natural and artifi-
cial, presented individually (Haxby, Hoffman, & Gobbini,
2000) or as natural scenes consisting of multiple objects
(Coppola, White, Fitzpatrick, & Purves, 1998; Stanley, Li,
& Dan, 1999; Treves, Panzeri, Rolls, Booth, & Wakeman,
1999; Vinje & Gallant, 2000), we believe that functional
neuroimaging experiments using more complex acoustic
stimuli embedded in natural contexts may help identify
those brain regions that have adapted to the processing of
such stimuli. This type of neuroethological approach has
proven indispensable for identifying the neural substrates
of behaviors in other species that depend on complex au-
ditory stimuli—for example, birdsong (Brenowitz, Mar-
goliash, & Nordeen, 1997; Konishi, 1985).
We conducted two experiments in order to identify the
brain areas recruited during attentive listening to excerpts
of real music and to identify the brain areas recruited
during selective listening to one of several simultane-
ously presented auditory streams. To this end, we varied
task demands, musical material, and trial structure across
the experiments. In the first experiment, subjects were in-
structed to listen to the musical passages either globally/
holistically, without focusing their attention on any par-
ticular instrument, or selectively, by attentively tracking
the part played by a single instrument. Within the con-
straints of the fMRI environment, we sought to image
blood oxygen level dependent (BOLD) signal changes as
the subjects engaged in relatively natural forms of music
listening. Our objective of imaging natural attentive music
listening posed a significant challenge, because we did not
want to contaminate “normal” listening processes with
additional demands of secondary tasks, such as target de-
tection. Auditory attention is studied almost exclusively in
the context of target detection, in which a stream is moni-
tored for the occurrence of single target events. By virtue
of being able to measure the number of targets that are de-
tected as a function of attentional orienting, this type of
design provides a powerful tool for studying attention.
However, the subjective experience of listening attentively
to music is not the same as performing a target detection
task, and it is possible that brain dynamics vary with those
different forms of attentional processing.
To further investigate this hypothesis with a second ex-
periment, an excerpt of classical music was adapted for
use in a target detection task with two levels of attentional
instructions. As in the first experiment, subjects were re-
quired to orient their attention either globally or selec-
tively. This time, the attentional orienting was coupled to
detection of targets that could occur in any one of three
streams (global/divided-attending condition) or in the
single attended stream (selective-attending condition).
Finally, we sought to identify the cortical areas in which
the BOLD signal increased reliably during attentive music
perception across different experimental contexts. A con-
junction analysis enabled us to identify the cortical areas
that were activated in both of the experiments, thereby
identifying processes common to attentive music listen-
ing despite variation in the musical material and tasks.
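The logic of a conjunction analysis can be conveyed with a minimal sketch: a voxel counts as commonly activated only if it exceeds threshold in both experiments' statistical maps. (SPM-style conjunctions are often computed on a minimum statistic; this simple intersection of thresholded fake maps conveys the idea, and the grid size and threshold are illustrative.)

```python
import numpy as np

# Minimal conjunction sketch: intersect two thresholded Z maps defined on
# the same voxel grid. The maps here are random stand-ins.

rng = np.random.default_rng(1)
Z_THRESHOLD = 2.33  # hypothetical threshold, roughly p < .01 one-tailed

z_exp1 = rng.normal(size=(20, 20, 10))  # fake Z map, Experiment 1
z_exp2 = rng.normal(size=(20, 20, 10))  # fake Z map, Experiment 2

# A voxel survives only if it is suprathreshold in BOTH experiments.
conjunction = (z_exp1 > Z_THRESHOLD) & (z_exp2 > Z_THRESHOLD)
print(int(conjunction.sum()), "voxels survive the conjunction")
```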
EXPERIMENT 1
For this experiment, we chose a musical example con-
sisting of two simultaneous melodies (streams). Specifi-
cally, we selected excerpts from two baroque flute duets in
which the melodic contour and rhythmic properties of the
two voices were very similar. The feature dimension of
primary interest was timbre, which we manipulated by as-
signing two perceptually very different timbres (Krum-
hansl, 1989; McAdams, Winsberg, Donnadieu, Desoete,
& Krimphoff, 1995) to the melodies. We focused on tim-
bre because it is a potent cue for auditory stream segrega-
tion and should facilitate orienting of attention to a single
melody line and because we wanted to identify regions of
interest for further studies of timbre processing.
Method
Subjects. Twelve subjects (7 females; mean age, 28.5 years; age
range, 20–41) participated in the experiment: 8 fellows from the
2000 Summer Institute in Cognitive Neuroscience held at Dart-
mouth College, 3 visitors to Dartmouth, and 1 member of the Dart-
mouth community. All the subjects reported having normal hearing,
and 11 of the 12 subjects were right-handed. The duration of the sub-
jects’ formal musical training was 8.3 ± 3.5 years, and the average
duration for which they had played one or more instruments was
17.4 ± 6.9 years. All the subjects provided informed consent ac-
cording to Dartmouth human subjects committee guidelines.
Stimuli and Procedure. Melodies were excerpted from the Vi-
vace and Giga movements of the Sonata in F Major for two recorders
by J. B. Loeillet (1712, op. 1, no. 4), thus providing two sets of two
melodies (streams) each (Figure 1A). All of the individual melodies
were stored as separate MIDI tracks, using MIDI sequencing soft-
ware (Performer 6.03, Mark of the Unicorn). When stored as MIDI
sequences, every note in all of the melodies could be equated in in-
tensity by assigning the same velocity value. Each melody could be
assigned to an arbitrary timbre. We used the “vibes” and “strings”
timbres generated by an FM tone generator (TX802, Yamaha). These
two timbres occupy opposite corners of three-dimensional (3-D) tim-
bral space, as defined by multidimensional scaling solutions of per-
ceptual similarity judgments (Krumhansl, 1989; McAdams et al.,
1995). The most salient difference between the two timbres is the at-
tack time.
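The intensity-equating step described above (every note given the same MIDI velocity) can be sketched in a few lines. The event format below is a simplified stand-in for real MIDI data (the authors used Performer 6.03, and a real file would need a MIDI library); the notes and target velocity are hypothetical.

```python
# Sketch of equating note intensities: when a melody is stored as a list of
# MIDI-like note events, assigning every note the same velocity (0-127, as
# in MIDI) removes loudness differences between streams.

melody = [
    {"pitch": 72, "velocity": 90, "dur_ticks": 240},   # uneven velocities
    {"pitch": 74, "velocity": 45, "dur_ticks": 240},
    {"pitch": 76, "velocity": 110, "dur_ticks": 480},
]

TARGET_VELOCITY = 64  # hypothetical common value

for note in melody:
    note["velocity"] = TARGET_VELOCITY  # equate intensity across all notes

print({note["velocity"] for note in melody})  # {64}
```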
The assignment of timbres to melodies was completely counter-
balanced so that attention was directed to each of the four melodies
played by each of the two timbres over the course of the experiment
(Figure 1B). Every subject was presented with a complete set of
melody/timbre combinations across the two blocks. The ordering of
blocks was counterbalanced across subjects. The counterbalanced
design allowed us to compare directly the activations caused by se-
lective attention to the two different timbres. Stimulus files corre-
sponding to each of the different orders were arranged in a single track
of an audio file (SoundEdit16, Macromedia). Scanner trigger and
timing pulses were arranged on a second track, and both tracks were
written to CD.
The stimuli were delivered via ear-insert tubephones (ER-30,
Etymotic Research). In order to obtain better acoustic separation of
the music from the echo-planar imaging (EPI) pinging, the subject
wore ear muffs (HB-1000, Elvex), through which the tubephones
had been inserted. Prior to amplification, the audio signal from the
CD was filtered with a 31-band 1/3 octave band equalizer (Model
351, Applied Research Technologies) to compensate for the atten-
uation by the tubephones of frequencies above 1.5 kHz. The stim-
uli were presented at <102 dB SPL.
Each functional EPI run began with 30 sec of rest in order to adapt
the subject to the EPI sound and to allow the global signal intensity
in the EPI images to stabilize. Eight seconds prior to the onset of each
task epoch, a verbal cue indicated the task: listen, attend vibes, or at-
tend strings. Task epochs were 30 sec long and were immediately fol-
lowed by a 30-sec rest period (rest condition) prior to the next verbal
cue. We began and ended each block with epochs in which the subjects
were asked to listen to the melodies in a holistic, integrated manner,
rather than attempting to focus their attention on one timbre or the
other (listen condition). The subjects were instructed to focus their at-
tention as best they could on the cued instrument during the attend
epochs. Following the experiment, the subjects completed a question-
naire and an interview about their musical training, task difficulty rat-
ings, and the strategies they had used in performing the task.
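The epoch structure just described can be laid out as a simple timeline: 30 s of initial rest, then alternating 30-s task epochs and 30-s rest periods, with a verbal cue 8 s before each task epoch. The task order below is one illustrative run, not necessarily an order actually used.

```python
# Sketch of one run's epoch timeline. Each entry is (event, onset in s).

tasks = ["listen", "attend vibes", "attend strings", "attend vibes",
         "attend strings", "listen"]  # hypothetical order for one run

events = [("rest", 0.0)]
t = 30.0                              # first 30 s: adapt to scanner noise
for task in tasks:
    events.append(("cue: " + task, t - 8.0))  # cue 8 s before epoch onset
    events.append((task, t))                  # 30-s task epoch
    events.append(("rest", t + 30.0))         # 30-s rest before next cue
    t += 60.0

print(events[:3])
```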
Data acquisition. Data were acquired on a General Electric Signa
Horizon Echospeed MRI scanner (1.5 T), fitted with a GE birdcage
head coil. Functional images were collected with a gradient echo EPI
pulse sequence with the following parameters: TR = 2 sec; TE =
35 msec; field of view (FOV) = 240 × 240 mm; flip angle (α) = 90°;
matrix size = 64 × 64; resolution = 3.75 × 3.75 × 5.0 mm; inter-
slice spacing = 0 mm. Twenty-seven axial slices were collected, pro-
viding whole-brain coverage. Two sets of high-resolution T1-weighted
anatomical images were also obtained. The first was a set of 27 slices
taken in the same planes as the functional images (coplanar) and ob-
tained with a two-dimensional fast spin echo sequence with the fol-
lowing parameters: TR = 650 msec; TE = 6.6 msec; FOV = 240 ×
240 mm; α = 90°; matrix size = 256 × 256; resolution = 0.937 ×
0.937 × 5.0 mm; interslice spacing = 0 mm. The second set con-
sisted of 124 sagittal slices and was obtained with a 3-D SPGR se-
quence with the following parameters: TR = 25 msec; TE =
6.0 msec; FOV = 240 × 240 mm; α = 25°; matrix size = 256 × 192;
resolution = 1.2 × 0.937 × 0.937 mm.
Data processing. EPI volumes from the initial 30-sec adaptation
phase of each run, during which the subjects were waiting for the
task to begin, were discarded prior to the analysis. SPM99 was used
for image preprocessing and statistical analyses (Friston et al.,
1995). Unless otherwise specified, all algorithms ran with default
settings in the SPM99 distribution (http://www.fil.ion.ucl.ac.uk/spm).
In order to estimate and correct for subject motion, re-
alignment parameters relative to the first image of the first run were
computed for each of the functional runs. A mean functional image
was constructed and used to coregister the functional images with
the coplanar anatomical images. The mutual information algorithm
implemented in SPM99 was used for coregistration (Maes, Col-
lignon, Vandermeulen, Marchal, & Suetens, 1997). The coplanar
images were then coregistered with the 3-D high-resolution images.
These, in turn, were normalized to the Montreal Neurological Insti-
tute’s T1-template images that approximate Talairach space, and the
normalization parameters were applied to the functional images. De-
fault parameters were used for the normalization, with the exception
that normalized images retained the original voxel size of 3.75 ×
3.75 × 5 mm. Normalized images were smoothed with a 6 × 6 ×
8 mm (FWHM) Gaussian smoothing kernel. Both individual and
group statistical parametric maps (SPMs) were constructed from
these images.
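The smoothing step above amounts to converting the kernel FWHM from millimeters to a per-axis sigma in voxel units (sigma = FWHM / (2 * sqrt(2 * ln 2))). A sketch using SciPy rather than SPM99's own routine, with a random stand-in for an EPI volume:

```python
import numpy as np
from scipy import ndimage

# Sketch of spatial smoothing with a 6 x 6 x 8 mm FWHM Gaussian kernel on
# a volume with 3.75 x 3.75 x 5 mm voxels (as in this experiment).

FWHM_MM = np.array([6.0, 6.0, 8.0])
VOXEL_MM = np.array([3.75, 3.75, 5.0])

# FWHM (mm) -> Gaussian sigma (voxels): sigma = FWHM / (2 * sqrt(2 * ln 2))
sigma_voxels = FWHM_MM / VOXEL_MM / (2.0 * np.sqrt(2.0 * np.log(2.0)))

volume = np.random.default_rng(0).normal(size=(64, 64, 27))  # fake EPI data
smoothed = ndimage.gaussian_filter(volume, sigma=sigma_voxels)

print(np.round(sigma_voxels, 3))  # per-axis sigma in voxel units
```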
Regression coefficients were estimated for rest, verbal cue, listen,
attend vibes, and attend strings conditions. The verbal cue onset events
were modeled as consisting of an early and a late component, using a
modified version of the basis function routine in SPM99. Gamma
functions for the early and the late components were specified by pass-
ing shape parameter values of 2 and 3 to the gamma probability den-
sity function specified in SPM99, and these were then orthogonalized
by the same procedure as the default set of SPM99 gamma functions.
Similarly, the onset of each attentional condition was modeled with
these gamma functions in order to emphasize the initial period of at-
tentional orienting at the onset of the musical excerpt. The regressors
for the rest, listen, and attend conditions were modeled as a boxcar
waveform convolved with the SPM99 canonical hemodynamic re-
sponse function (HRF). Regressors for a linear trend and estimated
motion parameters were also included in the model for each subject.
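The boxcar-regressor construction described above can be sketched as follows, assuming a plain gamma-density HRF rather than SPM99's exact canonical HRF or its orthogonalized basis set; the shape parameters 2 and 3 echo the values the authors passed to SPM99's gamma pdf, but run length and timing here are illustrative.

```python
import numpy as np
from scipy import stats

# Sketch of block-design regressor construction: a boxcar over a 30-s task
# epoch, convolved with gamma-based "early" and "late" HRF components.

TR = 2.0                       # seconds per volume
n_scans = 60                   # hypothetical run length (120 s)
t = np.arange(n_scans) * TR

# Boxcar: task "on" from 30-60 s, as in a 30-s epoch design.
boxcar = ((t >= 30) & (t < 60)).astype(float)

# Gamma-density HRFs sampled at the TR; shapes 2 and 3 give an early and
# a late component (a simplification of the SPM99 basis set).
hrf_t = np.arange(0, 30, TR)
early = stats.gamma.pdf(hrf_t, a=2.0)
late = stats.gamma.pdf(hrf_t, a=3.0)

# Convolve and trim to the run length to obtain model regressors.
reg_early = np.convolve(boxcar, early)[:n_scans]
reg_late = np.convolve(boxcar, late)[:n_scans]

print(reg_early.argmax() * TR, "s: peak of the early-component regressor")
```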
In order to identify general task-related BOLD signal increases
and for purposes of comparison with Experiment 2, SPMs were cre-
ated for the following contrasts: listen minus rest, and attend (pooled
across timbres) minus rest. These contrasts identify general regions
that respond to the musical stimulus under the different attentional de-
mands. Because the attend conditions were also separated by timbre,
we compared activation maps elicited by attending to different tim-
bres, as compared with the rest condition. In addition, the focused-
attending (attend) condition was compared directly with the global-
Figure 1. Design of Experiment 1. (A) The excerpts from two baroque flute duets used in the experiment. Each melody could be
mapped to an arbitrary synthesized timbre, in this case “vibes” and “strings” timbres. (B) Counterbalancing scheme used in the ex-
periment. In each run, task blocks alternated with rest blocks. The subjects heard only echo-planar imaging pinging during the rest
blocks. Streams A and B refer to the melodic lines shown above—for example, Melody 1A and Melody 1B, respectively. The diagram
indicates which excerpt was played and which timbre was mapped to each stream (melodic line) in each task block. For task blocks in
which the subjects focused their attention selectively on a single stream, the attended stream (timbre) is highlighted with a box.
listening (listen) condition to determine whether the different modes
of attending to the music elicited different patterns of BOLD signals.
The contrast maps from each subject were then entered into a
random-effects analysis, and SPMs for the group data were created.
SPMs were thresholded at p < .01 (uncorrected). Activations signif-
icant at the cluster level (extent threshold: clusters larger than the es-
timated resolution element size of 19 voxels) were identified by
comparing SPMs projected onto the mean T1-weighted structural
image (averaged across the subjects in the study) with the atlas of
Duvernoy (Duvernoy, 1999).
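The random-effects step can be sketched as a voxelwise one-sample t test across subjects' contrast maps. The data below are fake, the grid is tiny, and the cluster-extent correction described above is omitted; only the core computation is shown.

```python
import numpy as np
from scipy import stats

# Sketch of a random-effects group analysis: each subject contributes one
# contrast map (e.g., listen minus rest), and a one-sample t test against
# zero is computed at every voxel across subjects.

rng = np.random.default_rng(2)
n_subjects = 12
contrast_maps = rng.normal(loc=0.2, scale=1.0, size=(n_subjects, 8, 8, 4))

t_map, p_map = stats.ttest_1samp(contrast_maps, popmean=0.0, axis=0)

# Threshold at p < .01 (uncorrected), keeping positive effects only.
suprathreshold = (p_map < 0.01) & (t_map > 0)
print(int(suprathreshold.sum()), "voxels exceed the threshold")
```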
Results and Discussion
All the subjects reported that they performed the task as
instructed (Table 1). Eleven of the 12 subjects judged the
two timbres to be equally loud. However, the ratings of how
easy it was to attend to each timbre were more variable. On
average, each timbre could be attended to with relative
ease, although the within-timbre ratings differed widely
across subjects for both timbres. Some subjects had an eas-
ier time attending to vibes than to strings, whereas for other
subjects the opposite was true. The difficulty ratings for
the two timbres were within one point of each other for 5
of the subjects.
Significant BOLD signal increases were observed in
three principal areas in the listen–rest contrast (Table 2,
Figure 2A). These included, bilaterally, the superior tem-
poral gyrus (STG) spanning from the planum polare (PP)
to the PT, thus encompassing the regions surrounding and
including portions of HG. The SMA/pre-SMA showed
increased BOLD signals bilaterally. The third area of sig-
nificantly increased BOLD signal was the right PcS.
When the subjects were instructed to attend selectively to
one instrument or the other, the pattern of BOLD signal in-
creases was largely the same as when they were instructed
to listen more holistically. However, increased BOLD sig-
nals during selective listening were noted along the left in-
traparietal sulcus (IPS) and the supramarginal gyrus (Fig-
ure 2A, +40 mm slice). Frontal BOLD signal increases
were observed bilaterally in a swath stretching from the dor-
sal aspect of the inferior frontal gyrus (IFG), along the IFS
to the inferior PcS (Figure 2A, slices +20 to +60). The pre-
motor areas on the right were activated during both holistic
listening and selective attending, whereas the frontal and
parietal regions in the left hemisphere appeared to be most
affected by the selective-listening task. Figure 2B shows
that a direct statistical comparison of the attend and the lis-
ten conditions largely conformed to these observations
(Table 2). In addition, selective attending resulted in signif-
icantly greater BOLD signals in the posterior part of the left
STG and the underlying superior temporal sulcus (STS). In
the opposite contrast, the listen condition exhibited clusters
of significantly greater BOLD signals than did the selective
attend condition bilaterally along the anterior calcarine sul-
cus and in the rostral fusiform gyrus.
Because the stimulus materials were completely coun-
terbalanced, we could explore whether there were any dif-
ferences in the responses to the two timbres. We observed
no striking differences in the group images as a function
of the timbre that was attended (note the overlap of con-
tours in Figure 2A). The direct comparison of attentively
listening to each timbre yielded only very small foci; these
fell neither in the auditory cortex nor in the surrounding
association areas, and they were not closely apposed, as
might be expected from activation of an area possessing
a timbre map (data not shown). Thus, we did not pursue
the issue further.
EXPERIMENT 2
Experiment 2 was performed in order to obtain objective
behavioral verification, in the context of a target detection
task, that subjects had oriented their attention as instructed
during the fMRI scans. In addition, we used an excerpt
from a Schubert trio to assess whether the activation pat-
terns elicited with the stimulus materials of Experiment 1
could be elicited with a musical excerpt consisting of three
streams. In contrast to the first experiment, Experiment 2
used a timbral deviance detection task to assess the degree
to which the subjects were able to focus on a single instru-
ment or divide their attention across instruments. Because
we needed to obtain behavioral responses during divided
(global/holistic) and selective attention conditions, our
ability to use the musical excerpt as a control stimulus in a
passive holistic-listening condition (as in Experiment 1) was
compromised. To take the place of the passive-listening
condition, we added a “scrambled” condition in which the
subjects listened to a nonmusical stimulus that was derived
from the musical excerpt by filtering white noise with the
average spectral characteristics in successive 1.5-sec win-
dows of the original Schubert trio excerpt. Thus, very coarse
spectrotemporal features were preserved, but stream and
rhythmic cues were removed. Owing to the change in task
structure, the experimental protocol was modified from one
in which 30-sec task epochs were alternated with 30-sec
rest epochs to one in which 15-sec trials were interleaved
with 8-sec rest periods.
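The scrambled control stimulus described above can be sketched as follows: for each successive 1.5-s window of the original signal, white noise is filtered to match that window's magnitude spectrum, preserving coarse spectrotemporal structure while destroying stream and rhythmic cues. The signal here is a random stand-in for the Schubert excerpt, the sampling rate is hypothetical, and the authors' exact filtering method may differ from this FFT-based version.

```python
import numpy as np

# Sketch of spectral "scrambling": impose each 1.5-s window's magnitude
# spectrum on white noise with random phase.

FS = 8000                    # hypothetical sampling rate, Hz
WINDOW_S = 1.5
win_len = int(FS * WINDOW_S)

rng = np.random.default_rng(3)
excerpt = rng.normal(size=win_len * 4)   # fake 6-s stand-in for the music

scrambled = np.empty_like(excerpt)
for start in range(0, excerpt.size, win_len):
    segment = excerpt[start:start + win_len]
    target_mag = np.abs(np.fft.rfft(segment))      # this window's spectrum
    noise_fft = np.fft.rfft(rng.normal(size=segment.size))
    # Keep the noise's random phase, impose the window's magnitudes.
    shaped = noise_fft / np.maximum(np.abs(noise_fft), 1e-12) * target_mag
    scrambled[start:start + win_len] = np.fft.irfft(shaped, n=segment.size)

print(scrambled.shape == excerpt.shape)  # True
```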
Method
Subjects. Fifteen subjects (8 females; mean age, 18.73 years;
range, 18–22) participated in behavioral pretesting of the stimulus ma-
terial. Of these subjects, 5 (4 females; age range, 18–19 years) partic-
ipated in the fMRI scanning sessions. Three additional subjects (2 fe-
males; age range, 19–20 years) participated in the fMRI scanning
session following a brief training session on the task. The average
amount of formal musical training in the behavioral pretest cohort was
4.4 ± 3.76 years, and for the fMRI cohort it was 6.13 ± 5.79 years. All
the subjects had normal hearing.
Stimuli. The musical passage is shown in Figure 3. The instru-
ments’ parts were transcribed into MIDI sequences to allow the in-
troduction of timbral deviants at various points in the excerpt. The
Table 1
Average Ratings of Task Difficulty in Experiment 1

                       Rating        Range
Relative loudness      3.9 ± 0.3     3–4
Attend to strings      2.73 ± 1.5    1–5
Attend to vibes        2.5 ± 1.2     1–5

Note—For relative loudness, 1 = strings, 7 = vibes; for the attend con-
ditions, 1 = easy, 7 = difficult.
Table 2
Areas of Activation in Experiment 1
Listen–Rest (Holistic) Attend–Rest (Selective) Listen–Attend Attend–Listen
Lobe Hemisphere Region x y z Z Score x y z Z Score x y z Z Score x y z Z Score
Temporal left STG 256 624 25 4.04
260 215 25 4.64 260 215 45 5.30
252 222 25 4.26 252 226 45 4.49
268 230 45 4.00
STS/STG 264 241 10 4.05
rostral FG 234 241 210 3.46
right STG 2 60 230 40 4.70
256 211 25 4.69
260 219 25 4.74
252 222 25 4.78 252 222 45 4.83
rostral FG 226 256 215 3.39
collateral sulcus 230 252 425 3.16
Frontal left SMA 428 254 55 3.30
PcG 252 250 40 4.06
inferior PcS 256 238 25 4.22 252 254 20 3.15
IFS/IFG/MFG 245 222 30 4.50
right SMA 240 238 60 4.00 230 234 60 4.05
rostral IFS 234 238 15 3.86
caudal IFS 245 211 30 4.07 238 215 25 4.26
PcS 245 230 45 4.21 241 420 40 4.91
superior PcS 230 428 50 3.67
PcG 241 628 60 3.44
MFG 245 234 55 3.53
Parietal left IPS 234 252 45 3.20
supramarginal gyrus 249 245 40 3.02 256 234 40 3.57
256 226 25 3.96
Occipital left anterior calcarine sulcus 428 260 425 4.21
right anterior calcarine sulcus 211 249 425 3.50
Other left caudate 219 424 20 3.80
right GP/putamen 215 424 25 3.66
anterior thalamus 211 428 10 3.25
cerebellum 2 19 275 245 3.71
Note—STG, superior temporal gyrus; STS, superior temporal sulcus; FG, fusiform gyrus; SMA, supplementary motor area; PcG, precentral gyrus; PcS, precentral sulcus; IFS, inferior frontal sulcus; IFG, inferior frontal gyrus; MFG, middle frontal gyrus; IPS, intraparietal sulcus; GP, globus pallidus.
FUNCTIONAL IMAGING OF ATTENTIVE LISTENING TO MUSIC 127
note-on and note-off velocities were set to the same value within each
part, and the relative note velocities between instruments were ad-
justed to achieve comparable salience of the individual parts. The
standard piano notes were rendered by a Korg SG-1D sampling grand
piano. Timbres for all violin and cello notes and for piano deviants
were rendered by an FM tone generator (TX802, Yamaha).
In addition to the original standard sequence, 4 sequences con-
taining deviants were constructed for each of the three instruments,
resulting in a total of 12 deviant sequences. For any given deviant
sequence, there was a deviant in a single instrument and one of four
locations. The locations were distributed across early and late por-
tions of the excerpt. In the case of the violin and cello, the deviant
notes encompassed one measure—that is, a window of 1,000 msec—
whereas for the piano, the deviant notes took up two thirds of a mea-
sure (666 msec). The deviant notes were played with a slightly dif-
ferent timbre—for example, one of the other “violin” sound patches
available on the Yamaha synthesizer. The standard and deviant se-
quences were recorded with MIDI and audio sequencing software
(Performer 6.03, MOTU). Each sequence was 15.5 sec long. The
same set of sequences was used for the selective and the divided at-
tention conditions described below.
The time courses of root-mean-square (RMS) power were com-
pared for the standard and the deviant sequences. Initially, the note
velocities of the deviants were adjusted until the RMS curves over-
lapped for the standard and the deviant sequences. Unfortunately,
deviant notes adjusted by this procedure were very difficult to de-
tect and often sounded quieter than the surrounding standard notes.
We therefore adjusted the velocities of deviants until detection per-
formance in subsequent groups of pilot subjects rose above chance
levels. This adjustment resulted in a combination of timbral and
Figure 2. Group (N = 12) images showing significant blood oxygen level dependent (BOLD) signal increases (p < .01) for two sets
of contrasts from Experiment 1, superimposed on the average T1-weighted anatomical image of the subject cohort. (A) Contrast of
global listening (green blobs) and selective listening to each of the timbres (red and blue contours) with rest. (B) The direct contrast
of selective listening with global listening. Brain areas with significantly stronger BOLD signals during selective listening, as compared
with global listening, are shown in a red gradient, whereas the opposite relationship is shown in a blue gradient. The white contour
line denotes the inclusion mask that shows the edges of the volume that contained data from all participants. Susceptibility artifact
signal dropout in orbitofrontal and inferotemporal regions is clearly seen in the top row of images.
loudness deviance. We must emphasize that the multidimensional
aspect of the deviants is tangential to the primary goal of this ex-
periment. In other words, we employed deviants in this task not be-
cause we were trying to investigate target detection along specific
feature dimensions, but rather because we wanted objective verifi-
cation that the subjects were attending to the excerpts and we
wanted to create two different types of this attention with this
task—divided and selective.
In addition to the musical excerpts, we synthesized a control ex-
cerpt (scrambled) designed to match the gross spectral characteris-
tics of the original excerpt in 1.5-sec time windows. This was ac-
complished as follows. First, the amplitude spectra of the standard
sequences were created for 64 consecutive short segments of 2^10
samples each (23.22-msec windows). These were then averaged to
yield the average spectrum of a longer segment (2^16 samples,
1,486.08-msec windows). Next, the spectra of corresponding short
segments of white noise were multiplied by the average spectrum
of the longer window and were converted into the time domain.
This process was repeated across the total duration of the original
excerpt. In order to eliminate clicks owing to transients between
successive noise segments, the beginning and end of each noise seg-
ment were multiplied by linear ramps (3.63 msec). At shorter win-
dow durations of average spectrum estimation, the filtered noise
stimulus tended to retain the rhythmic qualities of the original ex-
cerpt and induced a weak sense of stream segregation between a low
and a high register in the noise. Since we wanted to minimize re-
cruitment of musical processing or attention to one of multiple
streams in this control stimulus, we chose the 1.5-sec windows. We
did not use envelope-matched white noise (Zatorre, Evans, & Meyer,
1994), since that type of stimulus retained the rhythmic properties
of the excerpt.
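The scrambling procedure just described can be sketched as follows (a simplified reconstruction, not the authors' code; the sampling rate, random seed, and the decision to ramp each long noise segment are our assumptions):

```python
import numpy as np

def scramble(excerpt, fs, short=2**10, long=2**16, ramp_sec=3.63e-3, seed=0):
    """Replace a sound with spectrally matched noise: average the amplitude
    spectra of the short segments inside each long window, impose that
    average spectrum on white-noise segments, and ramp the segment edges
    to suppress clicks between successive noise segments."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(excerpt))
    ramp = np.linspace(0.0, 1.0, int(round(ramp_sec * fs)))
    for start in range(0, len(excerpt) - long + 1, long):
        win = excerpt[start:start + long]
        # Average amplitude spectrum over the short segments of this window
        segs = win.reshape(-1, short)
        avg_spec = np.abs(np.fft.rfft(segs, axis=1)).mean(axis=0)
        # Shape white-noise segments with that average spectrum
        noise = rng.standard_normal((long // short, short))
        shaped = np.fft.irfft(np.fft.rfft(noise, axis=1) * avg_spec,
                              n=short, axis=1).ravel()
        shaped[:len(ramp)] *= ramp           # fade in (assumed per long segment)
        shaped[-len(ramp):] *= ramp[::-1]    # fade out
        out[start:start + long] = shaped
    return out
```

At fs = 44100 Hz, the short window is 1024/44100 ≈ 23.22 msec and the long window is 65536/44100 ≈ 1486.08 msec, matching the durations reported above.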
Procedure. The subjects were scanned during three consecutive
10.5-min runs. Each run consisted of divided and selective attention
trials. During a divided attention trial, the subject heard the verbal
cue, “everything,” indicating that the deviance, if it occurred, could
occur in any of the three instruments. During the selective attention
trials, the subjects heard the name of one of the three instruments.
They were informed that if a deviant was to occur, it would occur
only in the cued instrument. Deviants never occurred in the uncued
instruments. The subjects were instructed to press a left button after
the excerpt finished playing if they detected a deviant and a right
button if they detected none. Since buttonpresses immediately fol-
lowing targets would have contaminated activations arising from
the process of attentive listening with motor response activations,
which would have introduced a requirement for additional control
conditions, the subjects were instructed to withhold their buttonpress
until the excerpt had finished playing. The attentional trials (divided
and selective) were interspersed with the scrambled excerpts, to
which the subjects listened passively, pressing the right button after
the excerpt finished playing. Verbal cues were given 6 sec prior to the
onset of the excerpt. The pause between the end of the excerpt and
the following verbal cue was 8.5 sec. Across the three runs, 12 devi-
ant and 9 standard sequences were presented in each of the atten-
tional conditions. Thus, the probability of hearing a sequence with a
deviant was .57. Overall, 12 scrambled excerpts were presented. Note
that in the behavioral pretest, the probability of deviants was .50, and
there were no scrambled trials. The greater proportion of deviants in
the fMRI experiment arose from reducing the overall length of func-
tional scans by eliminating 3 standard sequences during each run.
Stimuli were pseudorandomized so that the number and timbral
identity of deviants would be balanced across runs. Within each run,
the ordering of stimuli was randomized. Event timing and stimulus
Figure 3. The Experiment 2 stimulus in musical notation. The arrows mark the starting and stopping points of the 15-sec ex-
cerpt used as the stimulus. From F. Schubert, “Trios für Klavier, Violine, und Violoncello,” op. 100, D 929, HN 193, p. 96, mea-
sures 137 through 156. Copyright 1973 by G. Henle Verlag, Munich.
presentation were controlled by PsyScope (Cohen, MacWhinney,
Flatt, & Provost, 1993). Stimuli were presented from the left audio
channel, which was subsequently split for stereo presentation.
Data acquisition. The equipment and protocols described in Ex-
periment 1 were used for data acquisition and stimulus presentation.
A magnet-triggering pulse at the start of each run and event mark-
ers at the onsets of verbal cues and music excerpts (sent from the right
audio channel of the computer running PsyScope), together with the
output of the response button interface box (MRA Inc.), were mixed
(MX-8SR, Kawai) and were recorded to an audio file by SoundEdit
16 running on an independent iMac computer (Apple Computer).
Data processing. The initial nine EPI volumes, during which the
subjects performed no task, were discarded prior to any data analy-
sis. Procedures for image coregistration, spatial normalization, and
smoothing were the same as those in Experiment 1. A regression
model was specified in which experimental conditions were mod-
eled as either epochs or events. Conditions specified as epochs were
the rest condition, during which no stimulus was presented, the acous-
tic control sequence (scrambled), and a total of eight music conditions
reflecting the divided and the selective attention conditions and each
of the four deviance types (cello, piano, violin, and none) that were
present in each of the attention conditions. Verbal cues and responses
were modeled as events. The onset times of verbal cues, stimulus ex-
cerpts, and buttonpresses, relative to the start of data acquisition on
each run, were extracted from the audio files recorded during each
run, using custom-written event extraction scripts in Matlab (Math-
works). Epochs were modeled as a 15.5-sec boxcar convolved with
the canonical HRF in the SPM99 toolbox. In order to emphasize pro-
cesses related to auditory stream selection at the beginning of a mu-
sical excerpt, the onsets of each epoch in both the divided and the se-
lective attention conditions were modeled as events. The verbal cue
and attentional onset events were modeled as consisting of an early
and a late component, as in Experiment 1. Response events were
modeled as an impulse convolved with the canonical HRF. The mo-
tion parameters estimated during EPI realignment and a linear trend
were added to the model to remove additional variance.
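The epoch modeling described above — a boxcar convolved with a canonical HRF — can be sketched like this (a double-gamma response with illustrative parameters standing in for the SPM99 canonical HRF; the TR, scan count, and onset times below are hypothetical):

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr, duration=32.0):
    """Double-gamma hemodynamic response: a positive peak near 5-6 sec
    and a small undershoot near 15-16 sec (illustrative parameters)."""
    t = np.arange(0.0, duration, tr)
    hrf = gamma.pdf(t, a=6) - gamma.pdf(t, a=16) / 6.0
    return hrf / hrf.sum()

def epoch_regressor(onsets_sec, epoch_len_sec, tr, n_scans):
    """Boxcar with 1s spanning each epoch, convolved with the HRF."""
    boxcar = np.zeros(n_scans)
    for onset in onsets_sec:
        i0 = int(round(onset / tr))
        i1 = min(n_scans, i0 + int(round(epoch_len_sec / tr)))
        boxcar[i0:i1] = 1.0
    return np.convolve(boxcar, canonical_hrf(tr))[:n_scans]

# 15.5-sec music epochs; TR and onset times here are made up for illustration
reg = epoch_regressor(onsets_sec=[14.0, 37.5, 61.0], epoch_len_sec=15.5,
                      tr=2.0, n_scans=120)
```

Each experimental condition contributes one such column to the design matrix, and the fitted beta for that column is the parameter estimate compared across conditions below.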
In order to facilitate comparison with the data from Experiment 1,
the beta parameter estimates from the selective attention condition
regressors were combined and compared with the beta parameter
estimate for the rest condition. Similarly, the beta parameter estimates
from the divided attention condition regressors, which presumably
promoted active holistic listening, were combined and contrasted
with those for the rest condition. The response to the acoustic con-
trol stimulus (scrambled) condition was also contrasted with the rest
condition. Finally, the selective and divided attention conditions
were compared directly by calculating the difference between their
respective beta parameter estimates. Group effects were estimated
from the contrast maps for individual subjects, thresholded at p < .01
(magnitude) and one resolution element (24 voxels for the contrasts
with the rest condition and 17 voxels for the direct comparison
of attention conditions) and were projected onto the mean T1-
weighted anatomical image. Anatomical regions of significant BOLD
signal changes were identified as in Experiment 1.
Results and Discussion
Behavior. The best behavioral evidence in support of
dissociable neural mechanisms underlying selective and
divided attention in real musical contexts would be a cost
in detection accuracy during divided attention relative to
selective attention. The ideal distribution of responses in
the target detection framework would be one in which tar-
gets occurring during divided attention would be missed—
that is, detected at chance levels—and all targets in the se-
lectively attended stream would be detected correctly.
False alarms would also be lower under selective attention
conditions.
In the first set of calibration experiments, in which reg-
ular headphones were used, the subjects (n = 6) detected
deviants under the divided attention condition at chance
level (49%). In the selective attention condition, detection
of deviants was above chance (60%) and significantly
higher than in the divided attention condition (paired
t test, t = −4, p < .01).
The adjustment of target intensity levels proved more
difficult when fMRI tubephones were used. With a group
of 9 subjects, 71% of the divided attention targets were
detected on average, as compared with 81% of the selec-
tive attention targets, demonstrating that targets could be
reliably detected even under less than optimal listening
conditions. However, the difference between attentional
conditions only bordered on statistical significance (p =
.084). In practice, we found that we were unable to adjust
the amplitudes of the individual timbral deviants so that we
could observe a strong and reliable dissociation between
forms of attention in the detection of targets across sub-
jects. The subjects showed different propensities in detect-
ing deviants in the different instruments. Some detected
the piano deviants better than the cello deviants, and for
others the opposite was true. Similarly, the accuracy with
which the subjects detected deviants at each of the four po-
sitions within the passage varied from subject to subject.
The length of the passage (15 sec), the overall length of the
fMRI experiment (1.5 h), and our dependence on pre-
recorded material prevented us from engaging in an adap-
tive psychophysical procedure to determine each individ-
ual’s threshold for each timbral deviant in the context of a
musical piece.
Five subjects whose behavioral performance was better
for the selective than for the divided attention condition
during the calibration experiments participated in the
fMRI scanning session, as did 3 additional subjects who
received instruction and 12 training trials immediately pre-
ceding the fMRI session. Detection of deviants was sig-
nificantly above chance levels in both the selective (71% ±
7% hits [mean ± SEM]; t = 2.0, p < .05) and the divided
(73% ± 6%; t = 2.7, p < .02) conditions. These hit rates
were not significantly different from each other (paired
t test, t < 1). In addition, false alarm rates were low in both
the selective and the divided attention conditions (11% and
8%, respectively).
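The comparisons reported in this section are standard one-sample (against the 50% chance level) and paired t tests. With hypothetical per-subject hit rates (the individual data are not reported here), they look like this:

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Hypothetical hit rates for 8 subjects; NOT the data from the experiment
selective = np.array([0.80, 0.64, 0.71, 0.75, 0.58, 0.73, 0.66, 0.81])
divided   = np.array([0.79, 0.70, 0.72, 0.76, 0.64, 0.71, 0.70, 0.82])

t_sel, p_sel = ttest_1samp(selective, 0.5)     # above-chance test, chance = .50
t_pair, p_pair = ttest_rel(selective, divided) # selective vs. divided
print(f"selective vs. chance: t = {t_sel:.2f}, p = {p_sel:.4f}")
print(f"selective vs. divided: t = {t_pair:.2f}, p = {p_pair:.4f}")
```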
Table 3 summarizes the subjects’ ratings of task diffi-
culty. Ratings were made on an integer scale from 1 (very
easy) to 7 (very difficult). The subjects found attending to
the violin part and detecting violin deviants relatively more
difficult, as compared with the other instruments, during
fMRI scanning. The subjects reported that the EPI ping-
ing interfered with their ability to detect deviants in the
violin stream, because the two fell into the same register.
We were aware of the interaction between the magnet and
the violin streams at the outset but were not concerned by
it because the de facto task of separating the violin from
the magnet was consistent with our aim of investigating
processes underlying stream segregation in natural con-
texts. In general, both the listen condition of Experiment 1
and the divided attention condition in the present experi-
ment may involve a baseline level of selective attention as-
sociated with segregating the music from the EPI pinging.
The overall difficulty of the task was rated as moderately
difficult both outside and inside the magnet, as was the dif-
ficulty of detecting deviants when the instruction was to
attend to all the instruments. Ratings of how often (1,
often; 7, not often) the subjects found themselves ignoring the
music altogether indicated that the subjects almost always
attended to the music. These self-reports are in agreement
with the observed performance in the detection task.
Physiology. In contrast to Experiment 1, the attentional
load in this experiment was high whenever the subjects
heard the musical excerpt. Not surprisingly, the patterns of
increased BOLD signals were very similar in the divided
and the selective attention conditions, as compared with
the rest condition (Table 4, Figure 4A). This observation
is consistent with the absence of behavioral differentiation
of the two conditions. Despite the similar behavioral per-
formance and largely overlapping activation pattern, a di-
rect comparison of the BOLD signal elicited under the dif-
ferent attention conditions revealed significant differences
in the activation patterns (Table 4, Figure 4B). In the com-
parisons with the rest condition, both conditions recruited,
bilaterally, regions of the STG surrounding and including
the HG, the pre-SMA, and the frontal operculum. The
right-hemisphere IFG activation extended more laterally
than on the left. On the right, an area around the intersec-
tion of the superior precentral and superior frontal sulci
and the posterior part of the middle frontal gyrus (MFG)
was activated in both conditions, although in the selective
attention condition, the size of the cluster at this location
was slightly smaller than the extent threshold. The right
IPS and the angular gyrus were activated in both atten-
tional conditions, whereas the left IPS was activated pri-
marily in the divided attention condition.
The divided attention condition was associated with
two additional activation foci not observed in the previous
experiment. One was observed bilaterally in the ventro-
rostral part of the MFG, although the cluster on the right
was slightly smaller than the extent threshold used in com-
puting the contrasts. The other was activation of the ante-
rior cingulate gyrus just ventrolateral to the large pre-
SMA activation cluster.
The direct comparison of the selective and the divided
attention conditions further highlighted the activation dif-
ferences between the two conditions (Table 4, Figure 4B).
In particular, during divided attention, the BOLD signal
was stronger in bilateral parietal, right superior frontal,
and left anterior cingulate areas. In contrast, the selective
attention condition was associated with a stronger BOLD
signal in the left fusiform gyrus and in a number of oc-
cipital areas bilaterally, including the lingual gyrus, the
lateral occipital sulcus, and the anterior calcarine sulcus.
Note that none of the latter areas showed significant ac-
tivity for the selective attention condition, relative to the
rest condition.
When BOLD signal increases were determined for the
scrambled condition, relative to the rest condition (Fig-
ure 4A), only two areas exceeded the height and extent
thresholds: the right STG in the vicinity of Heschl’s sulcus
(x, y, z = 60, −15, 10), and the left postcentral gyrus
(x, y, z = −52, −19, 50). The only task during the acoustic
control sequences was to listen to the sound and make a re-
sponse with the right button when the sound terminated.
CONJUNCTION ANALYSIS
OF EXPERIMENTS 1 AND 2
Identification of Common Patterns Despite
Variations in Tasks and Stimuli
Despite the variability in stimulus materials and tasks,
the two experiments making up this study had two condi-
tions in common. In one, the subjects attempted to focus
their attention selectively on one of either two or three
different instruments that played concurrently (attend
condition in Experiment 1, selective condition in Exper-
iment 2), and in the other, they rested while listening to
the EPI pinging (rest condition in both experiments).
Other conditions that were somewhat related across both
experiments were those in which the subjects listened to
the music in a more global, integrative, holistic fashion
(listen condition in Experiment 1, divided condition in
Experiment 2). However, the latter conditions were not as
well equated, owing to the increased task demands in the
second experiment during the divided attention condi-
tion. Given these similarities across experiments, we were
interested in identifying the set of cortical areas activated
by similar conditions in both experiments.
Problems of Specifying Baseline Conditions
and Control Tasks in the Functional
Neuroimaging of Audition
Additional motivation for a conjunction analysis stems
from the observation that specification of baseline tasks
Table 3
Average Ratings of Task Difficulty
                Overall     Attend                           Detect                           Detect in
                Difficulty  Violin     Cello      Piano      Violin     Cello      Piano      Divided    Ignore
First pretest   5.8 ± 1.2   4.0 ± 2.6  3.5 ± 1.2  5.7 ± 1.0  4.8 ± 2.4  3.8 ± 1.5  5.3 ± 1.2  6.2 ± 1.0  5.0 ± 1.6
Second pretest  4.7 ± 1.1   3.6 ± 1.6  3.7 ± 1.9  3.1 ± 1.5  3.7 ± 1.7  4.3 ± 1.5  3.0 ± 1.3  5.1 ± 1.5  4.9 ± 2.2
fMRI            4.4 ± 1.2   4.5 ± 1.4  2.6 ± 1.1  3.0 ± 1.3  4.6 ± 1.8  2.8 ± 1.3  3.2 ± 1.7  4.0 ± 1.3  5.9 ± 1.4
Note—Ratings were made on an integer scale from 1 (easy) to 7 (difficult). “Attend” columns indicate how difficult it was to main-
tain attention to each stream, whereas “Detect” columns indicate how difficult it was to detect deviants within each stream. “Detect
in Divided” refers to difficulty of detecting targets in the divided attention condition. “Ignore” indicates how often the subjects found
themselves ignoring the music altogether (1, often; 7, not often).
Table 4
Areas of Activation in Experiment 2
Divided–Rest Selective–Rest Divided–Selective Selective–Divided
Lobe Hemisphere Region x y z Z Score x y z Z Score x y z Z Score x y z Z Score
Temporal left STG 252 219 5 4.16 256 0 0 3.29
256 238 15 3.13 256 222 10 4.38
260 238 15 3.26
FG 226 268 210 3.1
right STG 56 211 5 3.79 56 11 10 3.78
56 222 10 4.86 52 222 10 4.11
68 226 10 3.4
Frontal bilateral pre-SMA 0 4 45 4.07
0 19 45 4.76 0 19 45 4.72
0 34 40 3.99
left pre-SMA 24 8 50 3.34
frontal operculum 238 19 5 4.21 238 19 5 3.89
230 22 0 3.78
rostral MFG 238 56 10 3.58
230 49 5 3.56
CS 234 226 60 3.49
right frontal operculum 38 19 0 3.57 45 15 5 3.91
34 22 10 3.42
SFS 26 19 50 3.20
superior PcS/SFS/MFG 30 24 55 4.8 30 24 55 3.92
MFG 34 4 65 2.86
Parietal left PoCG 256 222 50 3.09
IPS 241 249 50 3.93 234 252 40 3.41
230 264 40 3.65
right IPS 38 249 50 3.47
IPS/angular gyrus 38 256 60 4.08 41 256 55 3.87 34 260 50 3.20
Limbic bilateral anterior cingulate sulcus 28 30 30 4.04
posterior cingulate gyrus 0 230 30 3.09
left anterior cingulate 28 34 25 3.28
Occipital left lingual gyrus 215 275 25 3.42
right lingual gyrus 15 282 210 3.78
MOG/lateral occipital sulcus 41 279 25 3.69
anterior calcarine sulcus 15 260 5 3.63
Other left caudate 215 15 0 3.12
anterior thalamus 222 4 10 3.43
cerebellum 211 275 245 3.24 222 268 230 3.17
right caudate 15 4 10 3.7 19 4 15 3.62
thalamus 11 28 10 4.65
11 215 5 3.62
Note—STG, superior temporal gyrus; FG, fusiform gyrus; SMA, supplementary motor area; MFG, middle frontal gyrus; CS, central sulcus; SFS, superior frontal sul-
cus; PcS, precentral sulcus; PoCG, postcentral gyrus; IPS, intraparietal sulcus; MOG, middle occipital gyrus.
causes problems for PET/f MRI studies that try to disso-
ciate levels of processing in audition, particularly as more
acoustically complex and structured stimuli are employed.
The crux of the problem is twofold: matching the acousti-
cal complexity of stimuli across conditions while inferring/
confirming the mental state of subjects across conditions
in which they hear the same stimulus. Seemingly, the sim-
plest solution to this is to present the identical stimulus in
the different conditions and independently vary the task re-
quirements—for example, passive listening versus atten-
tive listening. This type of design has been used to describe
mnemonic processing in the context of melodic perception
(Zatorre et al., 1994), phonetic and pitch judgments about
speech syllables (Zatorre, Evans, Meyer, & Gjedde, 1992),
and auditory selective attention to tones and syllables
(Benedict et al., 1998; Jäncke et al., 1999; Tzourio et al.,
1997). One argument against using passive listening as a
control task is that the mental activity associated with it is
not well defined and may vary among subjects, particu-
larly as stimulus complexity and naturalness increase. For
example, one subject may ignore melodies, speech sounds,
or tones and think, instead, about the time he or she is wast-
ing in the scanner, whereas other subjects may attempt to
shadow (sing/speak along with) the acoustic material.
Despite these problems—which are likely to increase the
amount of variance in the data that is unaccounted for—
passive-listening conditions have traditionally served as
an important intermediate state between resting in the ab-
Figure 4. Group (N = 8) images showing significant blood oxygen level dependent (BOLD) signal increases (p < .01) for two sets of
contrasts from Experiment 2, superimposed on the average T1-weighted anatomical image of the subject cohort. (A) Contrast of di-
vided attention (global) listening and rest is depicted with a red gradient. The contrasts of selective listening to each instrument, rel-
ative to rest, are denoted with a contour line of different color for each instrument. The responses to an acoustically matched control
stimulus are shown in green. (B) The direct comparison of selective and divided attention conditions. Areas with a stronger BOLD re-
sponse during selective attending, as compared with divided attending, are shown in red, whereas areas with a stronger BOLD response
during divided attending are shown in blue.
sence of a stimulus and very directed processing of a
stimulus. For this reason, we used resting and passive lis-
tening as control tasks in Experiment 1. We should cau-
tion, however, that even the passive-listening condition
may require a minimum of selective attention for segre-
gating the music from the continuous pinging sound of
the EPI pulse sequence. The degree to which separating
the music from the pinging requires attentive mecha-
nisms, rather than preattentive automatic mechanisms,
according to principles of auditory stream segregation
(Bregman, 1990) remains to be determined.
Other studies of auditory attention dispense with
acoustically matched control stimuli altogether, instead
comparing BOLD signal levels during attention with
silent rest blocks (Celsis et al., 1999; Linden et al., 1999;
Zatorre et al., 1999) or white noise bursts (Pugh et al.,
1996). This approach provides a comprehensive view of
auditory processing in a particular task/stimulus condi-
tion, although it precludes dissociating attentional pro-
cesses from those related more specifically to sensory
encoding and preattentive acoustic analysis. An addi-
tional concern is that the cognitive state of a resting sub-
ject is unknown to the experimenter and may influence
the activation patterns derived from comparisons of task
and rest activations, possibly even obscuring activations
of interest. Despite these shortcomings, passive rest is
often the common denominator across experiments. As
such, it affords a reasonably constant reference against
which task-specific activations can be contrasted within
experiments, and these contrasts can then be entered into
conjunction analyses to investigate the commonality
across experiments.
Conjunction Analysis Methods
For the purposes of the conjunction analysis, the listen
and the divided conditions were grouped together (labeled
global) to reflect the holistic, global listening that was sug-
gested or required. The attend and the selective conditions,
which collectively required focusing attention on individ-
ual instrument streams, were grouped together as the se-
lective conditions. Regions of SPMs (thresholded at
p < .05) that overlapped in the two experiments are shown in
Figure 5 and Table 5 for the global–rest and selective–rest
contrasts, respectively. We applied a lower significance cri-
terion than in the individual experiments, since the con-
junction analysis was essentially a screen for regions of in-
terest for further studies on the topic of attention in real
musical contexts. The likelihood of observing a significant
conjunction was 2.5 × 10⁻³ (i.e., .05 × .05). We did not perform a global
versus selective conjunction analysis, because within each
experiment, the global versus selective contrast would re-
flect the difference between high-level cognitive states that
were somewhat different in the two experiments. In other
words, in Experiment 2, the contrast would be between two
attentionally demanding states in the context of target de-
tection, whereas in Experiment 1, the contrast would be be-
tween a demanding selective attention state (not requiring
detection of single targets) and a less demanding passive-
listening state. Thus, a direct comparison of the difference
states in a conjunction analysis would not be readily inter-
pretable. The conjunction maps were projected onto the av-
erage T1-weighted anatomical image of the Experiment 1
subjects, and Brodmann’s areas were assigned to the acti-
vations, using the atlas of Duvernoy (Duvernoy, 1999) and
Brodmann’s areal descriptions (Brodmann, 1909/1999).
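A voxelwise conjunction of independently thresholded maps can be sketched as follows (a simplified stand-in for the SPM conjunction machinery; the map dimensions and random maps are arbitrary illustrations):

```python
import numpy as np

def conjunction_mask(p_map1, p_map2, p_thresh=0.05):
    """A voxel survives only if it passes the threshold in BOTH maps.
    For independent uniform (null) p-values, the chance rate of a
    conjunction is p_thresh**2, i.e., .05 * .05 = 2.5e-3."""
    return (p_map1 < p_thresh) & (p_map2 < p_thresh)

rng = np.random.default_rng(1)
p_exp1, p_exp2 = rng.random((2, 64, 64, 40))  # two null p-value maps
mask = conjunction_mask(p_exp1, p_exp2)
print(round(float(mask.mean()), 4))           # close to the 0.0025 chance rate
```

Requiring a voxel to pass in both experiments is what makes the lenient per-map threshold defensible: the joint false-positive rate is the product of the individual rates.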
Conjunction Analysis Results and Discussion
Figure 5 shows the results of the conjunction analysis
in red. In both experiments, the requirement to focus at-
tention on a single instrument, relative to passive resting
(selective–rest), resulted in bilateral activation of the STG,
pre-SMA, frontal operculum/rostral insula, superior PcS,
IPS, and cerebellum (Figure 5A). In addition, there was
right-lateralized activation in the thalamus and the caudate
nucleus. A similar pattern was observed for the global–rest
conjunction with respect to the STG, pre-SMA, superior
PcS, and IPS (Figure 5B).
Also shown in Figure 5 are the activations that were
unique to Experiment 1 (green patches) and Experiment 2
(yellow patches). When the activations unique to each ex-
periment are compared with the conjoint activations, both
within a contrast and across contrasts, a picture emerges
that can be accounted for, tentatively, on the basis of task
differences across the two experiments. Two major differ-
ences existed between the two experiments. First, Exper-
iment 2 was a target detection task, but Experiment 1 was
not. Second, owing to the different tasks, the attentional
characteristics were more closely matched between ex-
periments for the selective-listening conditions than for
the global-listening conditions. Thus, areas that are very
sensitive to attentional load would be expected to show a
conjunction in the selective–rest contrast, whereas a con-
junction would be less likely for the global–rest contrast,
owing to mismatched attentional demands.
Several areas that were activated in both experiments in
the selective–rest comparison were activated to a greater
extent or exclusively in Experiment 2 for the global–rest
comparison. These included the frontal operculum/rostral
insula and the IPS bilaterally and the caudate nucleus and
the thalamus on the right. In addition, the medial frontal
activation extended more ventrally into the anterior cingu-
late sulcus in Experiment 2, and there was an activation
focus in the posterior cingulate. The presence of these ac-
tivation foci in the global-listening condition of Experi-
ment 2, but not of Experiment 1, is consistent with the fact
that the global-listening condition was a demanding di-
vided attention target detection task in Experiment 2 but
was a passive-listening task in Experiment 1.
134 JANATA, TILLMANN, AND BHARUCHA

Experiment 1 exhibited unique bilateral activation along the IFS, extending caudally into the superior PcS, in the selective–rest contrast (Figure 5A, slices +20 to +50). In addition, there was a unique parietal activation in the left supramarginal gyrus. The activation of these areas might be explained by the strategy for focusing attention on a single stream that was suggested to all the subjects in Experiment 1. The suggested strategy was to listen to each instrument's part as though one was trying to learn/memorize it. Given that the observed frontal areas are intimately involved in working memory functions (see the General Discussion section), the suggested listening strategy may have facilitated recruitment of these areas during the selective-listening task. Because Experiment 2 was a target detection task, the goal of detecting a deviant within a specified stream was deemed to be sufficient for orienting attention, and the listening strategy was not suggested to the subjects.
GENERAL DISCUSSION
Our experiments were designed with several questions
in mind. First, we wanted to identify neural circuits that
are activated during attentive listening to music, as com-
pared with rest, regardless of the form of attentive listen-
ing. Second, we wanted to determine whether we could
dissociate activations in response to attending to a single
stream from more holistic attending. Third, we were curi-
ous about how the results of our experiments using real
musical stimuli would differ from several studies of audi-
tory attention that have used simpler acoustic stimuli.
Similarly, we wondered what the differences might be be-
tween attentive listening to real music in a target detection
context and attentive listening in a more natural context
without the constraints of target detection.
This last point presents a significant dilemma. A hall-
mark of cognitive psychology research is inference of the
cognitive state of a subject from behavioral responses.

Figure 5. Conjunction analysis showing activation regions common to both experiments. The conjunctions were performed by using the contrasts of global- and selective-listening conditions relative to rest, because rest was the only condition that was identical across experiments. Contrasts from the individual experiments were thresholded at p < .05, and these contrast masks were superimposed to identify common voxels (shown in red). Activations for each contrast that were unique to Experiment 1 are shown in green, whereas voxels unique to Experiment 2 are shown in yellow. The white line around the edge of the brain denotes the conjunction of the data inclusion masks from the two experiments.

For instance, how well a subject is paying attention to the features of a stimulus is usually inferred from his or her detection accuracy and response times to the target features.
Our concern is that requiring subjects to make decisions
about stimulus features or to detect targets simply in the
name of verifying compliance with attentional instruc-
tions fundamentally alters the constellations of brain
areas that are engaged by the mental states we are inter-
ested in. Thus, it becomes almost impossible to study at-
tentional states that occur outside the context of target de-
tection or decision making. These considerations led us
to vary the task demands across two experiments and
compare the results via a conjunction analysis of the two
experiments. In addition to the target detection accuracy
data obtained in Experiment 2, ratings were obtained (in
both experiments) of the difficulty each subject had ori-
enting his or her attention as instructed to each of the in-
struments. Although not as compelling as reaction time
or accuracy data, the ratings showed that the subjects at
least attempted to orient their attention as instructed.
Thus, the subjective ratings and the results of the con-
junction analysis left us fairly confident that we had ob-
tained fMRI measures of auditory attention to musical
stimuli, even outside of a target detection context.
Circuitry Underlying Attention
to Complex Musical Stimuli
For both the global and the selective attention condi-
tions, we observed increased BOLD signals in the tempo-
ral, frontal, and parietal areas during attentive listening to
excerpts of real music. This finding is in agreement with
recent studies that have supported the hypothesis (Zatorre
et al., 1999) that a supramodal attention circuit consisting
of the temporal, parietal, and frontal areas is recruited dur-
ing the detection of auditory targets presented in simple
acoustic contexts (Benedict et al., 1998; Kiehl, Laurens,
Duty, Forster, & Liddle, 2001; Linden et al., 1999; Pugh
et al., 1996; Sakai et al., 2000; Tzourio et al., 1997; Zatorre
et al., 1999) and musical contexts (Satoh et al., 2001). Most
selective attention experiments in audition are designed as
target detection tasks in which subjects form a search
image of a single target stimulus—for example, a pitch, a
syllable, or a location. In contrast to these studies, our sub-
jects were required to maintain attentional focus on a
melody carried by a specific timbre. Rather than selecting
a single search image from a background of stimuli within
an attended stream, the subjects were required to select an
entire stream in the presence of multiple streams. Despite
these differences, we observed an activation pattern across
both experiments that was very similar to that observed
during auditory selective attention to acoustic features
in dichotic listening to simple FM sweeps and syllable
stimuli, relative to detection of noise bursts (Pugh et al.,
1996). Common to the study of Pugh et al. and our present
set of experiments were activations of the STG, the IPS (bi-
lateral for syllables, right-lateralized for FM sweeps), the
bilateral IFS, the right-lateralized PcS, the SMA/pre-SMA,
and the bilateral frontal operculum.
Our observation that the set of cortical areas activated
by attentive listening to music is very similar to the set
that is activated during processing of simpler acoustical
stimuli might be seen as problematic, given that we argue
Table 5
Areas of Activation Common to Experiments 1 and 2

                                                        Global–Rest        Selective–Rest
Lobe       Hemisphere   Region (Brodmann Areas)         x     y     z      x     y     z
Temporal   left         STG (41/42/22)                −55     7    −5    −54     7    −4
                                                      −62   −42    20    −52   −44    26
           right        STG (41/42/22)                 55     7     0     53    16     0
                                                       64   −34    15     64   −34    16
                        STS (22)                       60   −28    −5     60   −31    −5
Frontal    bilateral    pre-SMA (6)                     0    10    54      0     4    55
           left         frontal operculum (44)                           −35    20     5
                        IFG, p. operc. (44)                              −60     7    20
                        superior PcS (6/8)            −31    −5    51    −52    −4    48
                        rostral MFG (10)                                 −27    56     9
                                                                         −41    49     9
           right        frontal operculum (44)         44    19     5     35    20     5
                        IFS/PcS (6/8)                  47    12    30     51    10    33
                        superior PcS (6/8)             39    −2    60     37    −2    57
Parietal   left         IPS (7/39)                                       −42   −48    46
           right        IPS (7/39)                                        44   −48    50
                        IPS/angular gyrus (39)                            35   −60    45
Other      left         cerebellum                                       −32   −65   −28
           right        caudate                        10     9     8     11     6    12
                        thalamus                                          11   −17    12
                        cerebellum                                        33   −66   −25
Note—If more than one set of coordinates