Scientific Reports | (2023) 13:6842
Revealing the stimulus‑driven component of attention through modulations of auditory salience by timbre attributes
Baptiste Bouvier1,2*, Patrick Susini1, Catherine Marquis‑Favre2 & Nicolas Misdariis
Attention allows the listener to select relevant information from their environment, and disregard
what is irrelevant. However, irrelevant stimuli sometimes manage to capture it and stand out from a
scene because of bottom‑up processes driven by salient stimuli. This attentional capture effect was
observed using an implicit approach based on the additional singleton paradigm. In the auditory
domain, it was shown that sound attributes such as intensity and frequency tend to capture attention
during auditory search (cost to performance) for targets defined on a different dimension such as
duration. In the present study, we examined whether a similar phenomenon occurs for
attributes of timbre such as brightness (related to the spectral centroid) and roughness (related
to the amplitude modulation depth). More specifically, we revealed the relationship between the
variations of these attributes and the magnitude of the attentional capture effect. In experiment 1,
the occurrence of a brighter sound (higher spectral centroid) embedded in sequences of successive
tones produced significant search costs. In experiments 2 and 3, different values of brightness
and roughness confirmed that attention capture is monotonically driven by the sound features. In
experiment 4, the effect was found to be symmetrical: positive or negative, the same difference in
brightness had the same negative effect on performance. Experiment 5 suggested that the effect
produced by the variations of the two attributes is additive. This work provides a methodology for
quantifying the bottom‑up component of attention and brings new insights on attention capture and
auditory salience.
The acoustic environment is so rich in information that our brain cannot process in detail all of the sounds it is
constantly receiving. Instead, the individual selects stimuli that they deem to be relevant for a particular task,
and ignores others1. The most famous example of selective attention is the cocktail party problem2. This ability
is made possible by an attentional process that filters the flow of stimulus information through certain irrelevant
channels3,4. The precise mechanisms involved in this filtering are still being investigated5. However, the brain
should not be completely blind to task-irrelevant stimuli since they could provide important information about
the environment. For example, if we are chatting to someone on the street, we can pick up what they are saying
and ignore the surrounding traffic noise. However, the squeal of tires associated with a car’s sudden braking may
still attract our attention. So, if the stimulus is sufficiently salient, the brain may have to process the information
it contains involuntarily. This phenomenon is known as involuntary attentional capture. Salience is the property
of a stimulus that makes it likely to capture attention, i.e., the bottom-up component of attention6.
Attention capture has been extensively studied in the visual modality (see 30 for a review). Implicit approaches
measure the behavioral costs (increased reaction times and error rates) of the presence of an irrelevant distractor
in focal tasks. Among other things, irrelevant stimuli defined by their color, shape or onset time are known to
attract the attention of participants performing a visual search task7–9.
However, there has been some debate about how salient objects can automatically capture attention. Some
have argued that salient objects have an automatic power to attract attention, regardless of the subject’s goals. They
observed that certain features, such as color or shape, make the salient object automatically capable of attracting
attention10. This led to a stimulus-driven conception of attentional capture11: visual selection is determined by the
physical properties of the stimuli, and attention is drawn to the location where one object differs from the others
along a particular dimension. However, others have argued that only items that match the target’s features can
capture attention. For them, capture depends on the attentional set that is encouraged by the task12. For example,
it has recently been found that salience does not influence the capture of visual stimuli. Instead, participants
can often learn to suppress salient objects13,14. Authors from different parties eventually came together to review
and compare their theories15. They agreed that "physically salient stimuli automatically generate a priority signal
that, in the absence of specific attentional control settings, will automatically capture attention, but there are
circumstances under which the actual capture of attention can be prevented", reconciling the stimulus-driven
and contingent capture approaches.
In the auditory modality, few studies have addressed this issue. Huang and Elhilali16 used an explicit approach
to measure auditory salience in complex sound scenes. Participants listened to the scenes dichotically (a different
scene in each ear), and continuously indicated which side their attention was focused on. Averaged across
scenes and participants, these responses indicate how listeners orient their attention, which allows salient
events in a scene to be identified. This protocol involves top-down processes, as participants actively
listen to the sounds and report the orientation of their attention. We therefore cannot infer any measurement of
the purely bottom-up component of attention. In Kaya et al.17 the authors asked their participants to focus on
a visual task and to ignore background acoustic melodies. Brain responses were recorded, showing that variations
in acoustic attributes could make notes in these melodies more salient, and how these different attributes
interacted to modulate brain responses.
Dalton and Lavie18 used an implicit approach based on the additional singleton paradigm to reveal an auditory
attentional capture effect by sound features such as frequency or intensity. This paradigm was first developed
in the visual modality to show that irrelevant stimuli can capture participants’ attention during a visual search
task, resulting in increased error rates and response times7,19.
Results from Dalton and Lavie18 showed a significant cost (increased response times and error rates) in an
auditory search task caused by irrelevant sounds. In their experiment, participants had to listen to sequences
of five sounds. Among these, they had to detect a target defined by a dimension (e.g., a change in frequency
compared to non-targets). In half of the trials, one of the non-targets was made different from the others on
a dimension other than that which defined the target, such as intensity. This sound is called a singleton and is
irrelevant to the task. In fact, paying attention to the dimension that defines the singleton is not an advantageous
strategy for detecting the target. The results showed that the singleton features could cause interference: participants
made more errors and took more time to detect the target when the singleton was present. The effect was
not due to low-level interactions between the singleton and the target, which would have caused it to be more
difficult to compare the target with the singleton than with a non-target. The effect was shown when the singleton
was separated from the target by another sound. Garrido et al.20 discussed the similarity to mismatch negativity
studies, which focus on the elicitation of an event-related potential by deviant tones that differ in frequency
or duration. The much shorter inter-stimulus interval, the frequency of occurrence of the deviant tones, and
the explicit instruction to ignore these irrelevant singletons limit the parallels that can be drawn in this area of
research. Dalton and Lavie18 focused on the attentional capture produced by singletons of different frequency or
intensity, but did not investigate the effects of sounds whose features are gradually modified.
In addition, the study of variations in intensity, and therefore loudness, of sounds may be compromised in
this paradigm. Masking effects are likely to occur for louder sounds and interfere with the attentional processes
we wish to study21. However, the paradigm is compatible with the study of variations in timbre. One precaution
would be to equalize all sounds in loudness to remove potential masking effects and the influence of loudness,
which can be affected by pitch or timbre variations22.
None of the approaches mentioned here focused on the relationship that may exist between variations in the
acoustic attributes and the attentional capture effect.
The first acoustic feature one might think of when studying salience is loudness. Sounds that are perceived
as louder are more likely to attract the listener’s attention. Loudness has been shown to be an important feature
of salience16,23,24. In addition to this feature, several studies have shown that some dimensions of timbre can be
sound markers for conveying relevant information. Lemaitre et al.25 found that listeners used common perceptual
dimensions to categorize car horns. Two of the three dimensions identified were roughness and brightness. Arnal
et al.26 noted that amplitude modulated sounds in the roughness range are found in both natural and artificial
alarm signals, and are better detected due to the privileged space they occupy in the communication landscape.
Rough sounds are also said to enhance aversiveness through specific neural processing27. Brightness has long
been known to be a major dimension of musical timbre28 and has therefore been included in most salience
models16,29. More recently, roughness has also been included30.
Thus, the existence of the stimulus-driven component of attention capture has been theoretically established.
Moreover, the additional singleton paradigm allows the measurement of the attentional capture effect due to
sound features. Finally, findings in the literature suggest that certain attributes of the sound timbre are potential
candidates that could be responsible for the salience of a sound, and thus its ability to capture attention. However,
to the authors’ knowledge, no study has ever established the relationship that might exist between variations
in these features and the magnitude of the attentional capture effect. In other words, the driving properties of
attentional capture by the stimulus features have not yet been revealed.
In the present work, we adopted the additional singleton paradigm to provide evidence for the effect of timbre
features on attentional capture. We then used this paradigm to quantify the relationship that may exist between
a sound feature and the associated capture effect. Thus, in the current study, we focused on how stimulus
properties drive the attentional capture effect.
To summarize, we wanted to answer the two following questions:
• Do timbre attributes such as brightness or roughness trigger attention capture?
• How do their variations drive attention capture?
First, the possibility of an attentional capture by a timbre variation was investigated. To that end, the spectral
centroid (SC) of the singleton, which correlates with its perceived brightness, was manipulated in experiment 1.
Then, the same experimental procedure was used to evaluate how the effect size was modulated by feature variations.
In experiments 2 and 3, the SC and the depth of amplitude modulation (correlated with roughness) could
take several different values. Finally, experiment 4 examined the effect of symmetric variations in brightness and
experiment 5 focused on combined variations in brightness and roughness to investigate the directionality and
additivity of attentional modulation.
Experiment 1: attentional capture by a bright singleton
Method. Transparency and openness. We report how we determined our sample size, all data exclusions, all
manipulations and all measures in the study. Data were collected in 2021 and 2022. All statistical analyses
were performed using Python 3.7 and the open-source pingouin package.
Participants. A previous pilot experiment involving 11 participants was conducted to calculate the power of
the effect of the singleton presence on response time. The calculation was made for a one-tailed t-test, with an effect
size of d = 0.8, α = 0.05 and aiming for a power of 0.8, and determined a minimum sample size N = 12.
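As a rough cross-check of this sample-size calculation, the sketch below uses a normal approximation with Guenther's small-sample correction for a one-tailed, one-sample (or paired) t-test. It is not the authors' computation, which is not described beyond the parameters above; the function name `approx_sample_size` is hypothetical, and an exact calculation would use the noncentral t distribution instead.

```python
import math
from statistics import NormalDist

def approx_sample_size(d, alpha=0.05, power=0.8):
    """Approximate N for a one-tailed one-sample (or paired) t-test.

    ASSUMPTION: normal approximation plus Guenther's correction
    (z_alpha**2 / 2), not the exact noncentral-t power computation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_power = NormalDist().inv_cdf(power)      # quantile for the target power
    n = ((z_alpha + z_power) / d) ** 2 + z_alpha ** 2 / 2
    return math.ceil(n)

# Parameters taken from the text: d = 0.8, alpha = 0.05, power = 0.8
print(approx_sample_size(d=0.8))  # prints 12
```

Under these assumptions the approximation gives 12, which agrees with the minimum sample size stated above.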
Thus, 15 participants (8 females, 7 males) took part in this experiment. They ranged in age from 20 to
45 years (mean age: 31 ± 8 years). They were all consenting and reported normal hearing. An audiometry in
the frequency range between 0.125 and 8 kHz was performed for each participant and revealed no hearing
impairment. The protocol was approved in accordance with the Declaration of Helsinki by the Ethics Committee of Institut
Européen d’Administration des Affaires (INSEAD). All methods were carried out in accordance with their guidelines
and regulations. Participants gave written informed consent and received financial compensation for their
participation.
Apparatus. The experiment was designed and run on Max software (version 7, https://cycling74.com), on a
Mac mini 2014 (OS Big Sur 11.2.3). The stimuli were designed with Python 3.7, and presented during the experiment
through headphones (Beyerdynamic 770 pro, 250 Ohm). The experiment took place in the STMS laboratory
of IRCAM in a soundproofed double-walled IAC booth.
Stimuli. The stimuli were made of sequences of 5 sounds (see Fig. 1). All notes follow the harmonic structure
of Bouvier et al.31, with 20 harmonics, the nth harmonic f_n having a frequency n*f_0 and a weight
[…]. Thus, decreasing α increased the sound spectral centroid (SC), and therefore its perceived brightness:
[…]
Distractor. For the reference distractor, α = 3. It lasted 170 ms, with 5 ms ramps at the beginning and end,
and had a SC equal to 512 Hz.
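The link between α and brightness can be illustrated with a short sketch. Since the weight expression is not recoverable here, the code ASSUMES a weight of 1/n^α for the nth harmonic, an amplitude-weighted spectral centroid, linear ramps, a 44.1 kHz sample rate and a placeholder fundamental `F0_HZ`; none of these is confirmed by the text, so the resulting centroid values are not expected to match the reported 512 Hz.

```python
import math

F0_HZ = 400.0  # HYPOTHETICAL fundamental; the text does not give f_0

def harmonic_weights(alpha, n_harmonics=20):
    # ASSUMPTION: the nth harmonic is weighted by 1 / n**alpha
    return [1.0 / n ** alpha for n in range(1, n_harmonics + 1)]

def spectral_centroid(f0, alpha, n_harmonics=20):
    """Amplitude-weighted mean frequency of the harmonic complex."""
    weights = harmonic_weights(alpha, n_harmonics)
    freqs = [n * f0 for n in range(1, n_harmonics + 1)]
    return sum(f * w for f, w in zip(freqs, weights)) / sum(weights)

def synthesize(f0, alpha, duration_s=0.170, ramp_s=0.005, sr=44100):
    """Sum of weighted sine harmonics with linear onset/offset ramps."""
    n_samples = int(round(duration_s * sr))
    n_ramp = int(round(ramp_s * sr))
    weights = harmonic_weights(alpha)
    samples = []
    for i in range(n_samples):
        t = i / sr
        x = sum(w * math.sin(2 * math.pi * (k + 1) * f0 * t)
                for k, w in enumerate(weights))
        # Gain rises linearly over the first ramp and falls over the last
        gain = min(1.0, i / n_ramp, (n_samples - 1 - i) / n_ramp)
        samples.append(gain * x)
    return samples

# Lower alpha puts more energy in high harmonics, raising the centroid
print(spectral_centroid(F0_HZ, alpha=3) < spectral_centroid(F0_HZ, alpha=2))
```

Whatever the exact weighting, the qualitative behaviour stated in the text holds in this sketch: decreasing α raises the spectral centroid.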
Targets. The targets were 50 ms shorter or longer than the distractor. This value is higher than what Abel32
found as a just-noticeable difference (jnd) for duration discrimination of sinusoidal sounds. Based on previous
tests done in the lab, the experimenters still ensured beforehand that the targets were clearly heard as distinct
from the distractors. The targets had the same fundamental frequency and spectrum distribution (α = 3) as the
reference distractor, but a duration of 220 ms for the long one and 120 ms for the short one.
Figure 1. Stimuli without (left) and with (right) a singleton (surrounded with a glow), which had a 50% chance of being
before or after the target (dark blue). Only sequences with the target in position 4 are shown here.
Singleton. The singleton had the same fundamental frequency and envelope as the reference distractor, but
a different spectrum distribution with α = 2. It resulted in a higher SC, equal to 822 Hz. Allen and Oxenham33
found a jnd of 5.0% for the SC, which ensures the singleton was indeed perceived as brighter. The experimenters
still ensured beforehand that the singleton was clearly heard as distinct from the distractors.
In the reference condition, the target was embedded in sequences of distractors only such that a sequence was
composed of four distractors and a target stimulus. In the test condition, one of the distractors was the singleton
such that a sequence was composed of three distractors, one target and one singleton. The IOI ("Inter-Onset
Interval") was kept constant at 230 ms. The first sound of each sequence was always a distractor. The target was
in 3rd or 4th position (50% of the trials each). In the trials containing a singleton, its position was either just
before or just after the target (50% of the trials each). All the conditions are presented in Fig. 1.
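The sequence structure described above can be sketched as follows. This is only an illustration of the stated constraints (five sounds, target in position 3 or 4, singleton immediately before or after the target, constant IOI); the function names and string labels are hypothetical and do not come from the authors' Max implementation.

```python
IOI_MS = 230  # inter-onset interval stated in the text

def build_sequence(target_pos, target_kind, singleton_side=None):
    """Return the 5 sound labels of one trial.

    target_pos: 3 or 4 (1-indexed position of the target)
    target_kind: "short" or "long"
    singleton_side: None (reference condition), "before" or "after"
    """
    if target_pos not in (3, 4):
        raise ValueError("target must be in position 3 or 4")
    seq = ["distractor"] * 5
    seq[target_pos - 1] = "target_" + target_kind
    if singleton_side is not None:
        offset = -1 if singleton_side == "before" else 1
        seq[target_pos - 1 + offset] = "singleton"
    return seq

def onsets_ms(n_sounds=5):
    """Onset time of each sound relative to the start of the sequence."""
    return [i * IOI_MS for i in range(n_sounds)]

print(build_sequence(4, "long", "before"))
print(onsets_ms())
```

With these constraints the singleton can only fall in positions 2 to 5, so the first sound is always a distractor, as the text requires.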
Loudness equalization. All the sounds were equalized in an adjustment experiment with 12 participants from
the lab, using the same setup as the main experiment. Loudness adjustments were performed by comparing all the
sounds (short target, long target or singleton) to a reference (the distractor presented at 80 dB SPL). The sounds
were randomly distributed and presented 8 times each. The levels were measured at the headphones output with
a Brüel and Kjaer 2238 Mediator sound level meter. The obtained levels were 81 dB SPL for the short target,
79 dB SPL for the long target and 74 dB SPL for the singleton. All inter-participant standard deviations of these
obtained levels were less than 1 dB SPL, i.e., less than a just-noticeable difference in sound level34.
Procedure. Six blocks of 60 randomly distributed trials were run for each participant. For every trial, the word
"Ready" was displayed on the screen for 1500 ms, then a sequence of 5 sounds was presented.
At the end of the sequence, the participant could respond by pressing a key on the keyboard: "1" for "short" and "2" for
"long" (2-alternative forced-choice protocol). Feedback regarding the participant’s response ("Correct" or "Incorrect")
was displayed after each trial and remained for 1500 ms. If no answer was given by the participant after 3000 ms,
the message "Too late. Answer faster!" was displayed. The response time was measured from the moment the target
was played in the sequence. Then, a 1500 ms pause occurred and the next trial began.
The participants were asked, at the beginning of the experiment, to focus on the duration of the sounds and
their duration only in order to discriminate the target. Each participant had a training block before taking the
test. We kept only the results of participants with an error rate below 40% on the sequences containing the target.
Due to this criterion, one participant had to be replaced at this step. The experiment lasted 90 min on average.
Results. For each participant, and for each singleton condition (absent or present), we calculated the mean
and the standard deviation of the response times. We then removed the data whose response time was more
than two standard deviations from the mean35. We also removed the data for which the response time was less
than 100 ms, and those for which the participant did not answer. 94.9% of the data were kept at this stage. For
the response time analysis, only the data where the participant’s response was correct were kept, i.e., 75.6%
of the data. The results of mean error rates and response times are presented in Table 1. For all the following
experiments, error rates follow the same trends as response time increases. The LISAS (Linear Integrated Speed
Accuracy Score36) were also computed and followed the same trends. For the sake of clarity, we therefore show
only the increases in response time.
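The exclusion rules above can be sketched for one participant and one singleton condition. The function name is hypothetical, and two details are assumptions: the sample standard deviation is used, and the mean and standard deviation are computed over all answered trials before the 100 ms floor is applied, since the text does not specify either point.

```python
from statistics import mean, stdev

def trim_response_times(rts_ms, n_sd=2.0, min_rt_ms=100.0):
    """Filter the response times of one participant in one condition.

    rts_ms: response times in ms, with None where no answer was given.
    ASSUMPTIONS: sample SD; mean and SD computed on all answered trials.
    """
    answered = [rt for rt in rts_ms if rt is not None]
    m, s = mean(answered), stdev(answered)
    return [rt for rt in answered
            if abs(rt - m) <= n_sd * s and rt >= min_rt_ms]

# Made-up response times: one slow outlier, one anticipation, one miss
raw = [900, 950, 1000, 1050, 1100, 3000, 50, None]
print(trim_response_times(raw))  # prints [900, 950, 1000, 1050, 1100]
```

The proportion of retained trials (such as the 94.9% reported above) would then be the length of the filtered list divided by the number of trials in that cell.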
The error rates (16.2% and 24.2% in the conditions without and with a singleton, respectively) confirm that
participants were able to complete the task correctly in both conditions. The mean response time increase, when
the singleton was present, was 137 ms. A t test revealed that the singleton presence had a significant effect on
response time increase (t test: t(14) = 8.33, p < 0.001). The effect of the singleton presence was very large (cohen-
d = 2.1). A very large effect of the singleton presence was found for error rates as well (t(14) = 3.85, p < 0.001,
cohen-d = 1.0).
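For readers who want to see what such a within-participant comparison involves, here is a minimal paired t-test sketch in plain Python. It is not the pingouin call used by the authors, the data are made up, and the effect size shown (d_z, the mean difference divided by the standard deviation of the differences) is only one of several paired-samples variants; the text does not say which variant the reported "cohen-d" corresponds to.

```python
import math
from statistics import mean, stdev

def paired_t(with_singleton, without_singleton):
    """Paired t statistic, degrees of freedom and d_z.

    Inputs are per-participant means, in the same participant order.
    """
    diffs = [a - b for a, b in zip(with_singleton, without_singleton)]
    n = len(diffs)
    sd = stdev(diffs)                       # sample SD of the differences
    t = mean(diffs) / (sd / math.sqrt(n))
    d_z = mean(diffs) / sd                  # ASSUMED effect-size variant
    return t, n - 1, d_z

# HYPOTHETICAL per-participant mean response times (ms), 3 participants
present = [1102.0, 1154.0, 1096.0]
absent = [1001.0, 1002.0, 1003.0]
t, df, d_z = paired_t(present, absent)
print(round(t, 2), df, round(d_z, 2))
```

With 15 participants the degrees of freedom are 15 - 1 = 14, which matches the t(14) values reported above.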
The effect of the singleton position on error rates was not significant (t(14) = 0.72, p = 0.48), suggesting that
attentional capture occurs as much whether the singleton appears before or after the target. However, there was
an effect of the singleton position on response times (t(14) = 4.38, p < 0.001): when the singleton appeared after
the target, the response times were greater. This absence of effect of the singleton position on error rates and
the increased reaction times when the singleton occurs after compared to before the target confirm that this
effect is not due to auditory masking. This is consistent with the loudness equalization that had been carried
out beforehand and the IOI which prevented auditory masking21. The observed effect is due to an attentional
capture caused by the bright singleton. Finally, one could claim that the effect is due to the surprise caused by the
occurrence of the singleton. However, this singleton is present in 50% of the trials, and the participants identified
and accustomed themselves to it during the training session. Moreover, no significant difference was found
for response times between trials where a singleton appears after one or more trials without any singleton (the
"surprising" condition), and trials where the singleton is present after one or more trials with a singleton (the
"non-surprising" condition): t(14) = 0.31, p = 0.76.
Table 1. Mean and standard deviation of response times and error rates (across the 15 participants)
depending on the presence of the bright singleton.
Singleton                             Absent          Present
Response time (standard deviation)    985 ms (142)    1121 ms
Error rate (standard deviation)       16.2% (13.5)    24.2%
This first experiment thus allowed us to validate the framework in which we can test modulations of timbre
features and observe how they drive the attentional capture effect. It was therefore decided to reproduce the
experiment, modifying it so that the singleton could take different values of brightness in a second experiment,
and different values of roughness in a third one.
Experiments 2 and 3: variations of brightness and roughness
Experiments 2 and 3 were conducted to study how the effect magnitude is modulated by the singleton feature
variations. In experiment 2, we replicated experiment 1 with four different values of the spectral centroid
(SC) for the singleton. In experiment 3, four values of the amplitude modulation depth for the singleton were
used. This latter sound feature is associated with an auditory attribute usually described by the semantic attribute
roughness.
Method. Participants. Twenty participants (10 females, 10 males) took part in experiment 2, and 20 others
(10 females, 10 males) in experiment 3. The sample size was increased to ensure that the power of the effect
produced by the second-brightest singleton was greater than 0.8. This was done in order to have at least two different
brightness conditions with sufficient power. The participants ranged in age from 19 to 34 years (mean age:
27 ± 4 years) for experiment 2, and from 22 to 50 years (mean age: 28 ± 8 years) for experiment 3. They were all
consenting and reported normal hearing. An audiometry in the frequency range between 0.125 and 8 kHz was
performed for each participant and revealed no hearing impairment. Participants gave written informed consent
and received financial compensation for their participation.
Apparatus. The apparatus was the same as in the first experiment, except that it took place in the INSEAD-
Sorbonne Université Behavioural Lab, in soundproofed rooms.
Stimuli. The distractors and targets were the same as in experiment 1. For experiment 2, the singleton SC could
take 4 values: 538, 563, 640 or 768 Hz. Each one was presented in 20% of the trials. To establish these values,
an increment of SC was calculated (using the estimation of 5% for SC jnd found by Allen and Oxenham33), and
then multiplied by 1, 2, 5 and 10. For experiment 3, the singleton signal s_sing(t) was the distractor signal s_dis(t)
modulated at a modulation frequency f_mod = 50 Hz:
[…]
The modulation
depth m could take 4 values: 0.1, 0.2, 0.5 or 1.0. Each one was presented in 20% of the trials. To establish these
values, the increment of modulation depth estimation proposed by Zwicker and Fastl37 (10%) was multiplied by
1, 2, 5 and 10 as well.
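The stimulus values above can be checked with a few lines of arithmetic. Adding 1, 2, 5 and 10 times a 5% increment to the 512 Hz reference reproduces the four listed SC values after rounding, and the same multipliers applied to a 0.1 increment give the four modulation depths. The amplitude-modulation function is a sketch only: the modulation equation is not recoverable here, so the standard form (1 + m·sin(2π·f_mod·t))·s_dis(t) and the 44.1 kHz sample rate are ASSUMPTIONS.

```python
import math

SC_REF_HZ = 512.0            # distractor spectral centroid (from the text)
SC_JND = 0.05                # 5% jnd for SC (from the text)
DEPTH_STEP = 0.10            # 10% modulation-depth increment (from the text)
MULTIPLIERS = (1, 2, 5, 10)  # multiples of the increment (from the text)

sc_values = [round(SC_REF_HZ * (1 + SC_JND * k)) for k in MULTIPLIERS]
depth_values = [round(DEPTH_STEP * k, 1) for k in MULTIPLIERS]

def amplitude_modulate(signal, m, f_mod=50.0, sr=44100):
    """Apply sinusoidal amplitude modulation of depth m to a sample list.

    ASSUMED form: (1 + m * sin(2*pi*f_mod*t)) * s_dis(t).
    """
    return [(1 + m * math.sin(2 * math.pi * f_mod * i / sr)) * x
            for i, x in enumerate(signal)]

print(sc_values)     # prints [538, 563, 640, 768]
print(depth_values)  # prints [0.1, 0.2, 0.5, 1.0]
```

A depth of m = 1.0 makes the envelope reach zero once per modulation cycle, which is the strongest modulation in the set.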
Loudness equalization. The loudness of the singletons was equalized as in experiment 1. The levels obtained for
each singleton after equalization were 79.5, 79.0, 77.5 and 75.0 dB SPL for the bright singletons with SC of 538,
563, 640 and 768 Hz, respectively, and 80 dB SPL for all the rough singletons. All inter-participant standard
deviations of the obtained levels were less than 1 dB SPL.
Procedure. The procedure was the same as in experiment 1, except that the number of trials had to be increased
because of the increased number of singletons. Eight blocks of 80 randomly distributed trials each were run for
each participant.
Results. The data processing was the same as for experiment 1. For the error rate analysis, 95.0% and 94.6%
of the data were kept for experiments 2 and 3, respectively. For the response time analysis, only the data where
the participant’s response was correct were kept, i.e., 78.6% and 76.4% of the data. The mean response time
and error rate across the 20 participants for sequences without singleton were 867 ms (std = 246 ms) and 12.6%
(std = 13.1%) for experiment 2, 1058 ms (std = 294 ms) and 15.2% (std = 12%) for experiment 3. The increase in
response time for each singleton, i.e., the difference between the condition with the considered singleton and the
reference condition without any singleton, is presented in Fig. 2 for each value of modulation depth and spectral
centroid.
For both experiments 2 and 3, t-tests were conducted with Holm corrections for multiple comparisons.
Complete statistics can be found in the Supplementary information (S1 and S2).
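The Holm correction mentioned above is a step-down procedure: the p-values are sorted, the smallest is multiplied by the number of tests m, the next by m - 1, and so on, with the adjusted values forced to be non-decreasing. The sketch below is a generic implementation of that procedure on made-up p-values; it is not the authors' code, and the function name is hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep values monotone
        adjusted[idx] = running_max
    return adjusted

# HYPOTHETICAL raw p-values for three pairwise comparisons
print(holm_adjust([0.01, 0.04, 0.03]))
```

A comparison is then declared significant when its adjusted p-value falls below the chosen α level.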
Data from experiment 2 confirmed and extended the result of experiment 1 as various bright singletons
produced an attentional capture effect. Moreover, the effect increased with SC values: the brighter the singleton,
the greater the effect. Experiment 3 showed that roughness is also a feature that triggers an attentional capture
effect: the presence of various rough singletons caused significant behavioral costs. The results confirmed that
salience depends on the variations of the feature that defines the singleton.
Interestingly, the manipulations of the two timbre attributes resulted in comparable effect magnitudes. An
increase of a few increments on brightness gives an effect similar to that obtained with an increase of the same
number of increments on roughness. This is discussed in the general discussion.
Experiment 4: symmetrical variations of brightness
Experiment 4 was conducted to study the symmetry or the directionality of the effect. We replicated experiment
2 with SC values for the singleton being either higher or lower than the distractors’ SC.
Method. Participants. Nineteen participants (8 females, 11 males) took part in experiment 4. They ranged
in age from 18 to 32 years (mean age: 25 ± 4 years). They were all consenting and reported normal hearing. An
audiometry in the frequency range between 0.125 and 8 kHz was performed for each participant and revealed
no hearing impairment. Participants gave written informed consent and received financial compensation for
their participation.
Apparatus. The apparatus was the same as in the first experiment, except that it took place in the INSEAD-
Sorbonne Université Behavioural Lab, in soundproofed rooms.
Stimuli. The distractor and target SC was equal to 631 Hz. The singleton SC was 2 or 4 jnd higher or lower
than the distractor one, i.e., 512, 569, 696 or 768 Hz. Each one was presented in 20% of the trials. All the sounds
were equalized in loudness (12 participants with the same procedure as in experiment 1): the obtained levels
were 80, 79, 77 and 75 dB SPL for the singletons with SC at 512, 569, 696 and 768 Hz respectively, and 78 dB SPL
for the distractor. All inter-participant standard deviations were less than 1 dB SPL.
Results. The data processing was the same as for experiment 1. For the error rate analysis, 94.7% of the data
were kept. For the response time analysis, only the data where the participant’s response was correct were kept,
i.e., 87.1% of the data. The mean response time and error rate across the 19 participants for sequences without
singleton were 940 ms (std = 195 ms) and 5.1% (std = 8.4%). The increase in response time for each singleton,
i.e., the difference between the condition with the considered singleton and the reference condition without any
singleton, is presented in Fig. 3. Complete statistics can be found in the Supplementary information (S3).
The effect magnitudes are comparable to those obtained in experiment 2. A clear symmetry is observed
in experiment 4: the effect of a brighter singleton is the same as that of a less bright singleton, if both differ
from the distractors by the same absolute amount of perceived brightness. This result tells us that it is the absolute variation of the
singleton feature that modulates the attention capture. The results of experiments 1, 2, 3 and 4 can be summarized
in Fig. 4, which shows the driving of response time increases by the perceived variations in the singleton feature.
These perceived variations are shown in terms of jnd values.
Figure 2. Increase in response time (ms) with singleton SC (left, experiment 2) and modulation depth (right,
experiment 3). Error bars represent the standard errors across participants in each condition compared to the
no-singleton condition. Significances between conditions are displayed on the horizontal braces. *: p < .05, **:
p < .01, ***: p < .001.
Interestingly, a linear relationship seems to emerge between increases of perceived brightness (combined
across experiments 1, 2 and the positive variations in experiment 4) and response time increase (rPearson(3) = 0.99,
p < 0.001, slope = 14.0 ms, std error = 0.9 ms), and for perceived roughness as well (rPearson(3) = 0.99, p < 0.01,
slope = 12.4 ms, std error = 0.9 ms). This relationship is only valid for this range of feature variations and is
discussed in the general discussion.
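A linear relationship of this kind is typically summarized by an ordinary least-squares slope and a Pearson correlation. The sketch below shows both computations in plain Python. The points are SYNTHETIC (placed exactly on a 14 ms-per-jnd line for illustration) and are not the study's data, and the function name is hypothetical.

```python
import math

def linear_fit(x, y):
    """Ordinary least-squares slope, intercept and Pearson r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# SYNTHETIC example: feature variation (jnd) vs. response time increase (ms)
jnd = [0, 1, 2, 5, 10]
rt_increase_ms = [0.0, 14.0, 28.0, 70.0, 140.0]
slope, intercept, r = linear_fit(jnd, rt_increase_ms)
print(round(slope, 1), round(intercept, 1), round(r, 2))
```

With five points the correlation has 5 - 2 = 3 degrees of freedom, which is the value shown in parentheses in the rPearson(3) notation above.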
Experiment 5: combination of roughness and brightness
Experiment 5 was conducted to study the additivity of the effects of different feature variations. We replicated
experiment 2 with four different singletons, having different combinations of roughness and brightness. The
singleton could have two different SC combined with two different amplitude modulation depths.
Figure 3. Increase in response time (ms) with singleton SC (experiment 4). Error bars represent the standard
errors across participants in each condition compared to the no-singleton condition. Significances between
conditions are displayed on the horizontal braces. *: p < .05, **: p < .01, ***: p < .001.
Figure 4. Increase in response time (ms) depending on the singleton perceived feature variations (jnd) in
experiments 2, 3, and 4. Error bars represent the standard errors across participants in each condition compared
to the no-singleton condition.
Method. Participants. Nineteen participants (9 females, 10 males) took part in experiment 5. Their
ages ranged from 21 to 36 years (mean age: 26 ± 5 years). They were all consenting and reported normal hearing.
An audiometry in the frequency range between 0.125 and 8 kHz was performed for each participant and
revealed no hearing impairment. Participants gave written informed consent and received financial compensation
for their participation.
Apparatus. The apparatus was the same as in the first experiment, except that it took place in the INSEAD-
Sorbonne Université Behavioural Lab, in soundproofed rooms.
Stimuli. The distractor and target SC was equal to 512 Hz, and they were not modulated, i.e., null roughness.
The singleton SC was 2 or 5 jnd higher than the distractor one, i.e., 564 or 653 Hz. The singleton modulation
depth was 2 or 5 jnd higher as well, i.e., 0.2 or 0.5. The four singletons were thus obtained with the four combinations
of these SC and modulation depths. Each one was presented in 20% of the trials. All the sounds were
equalized in loudness (12 participants with the same procedure as the one used in experiment 1): the obtained
levels were 79 dB SPL for the singletons with 2 jnds of brightness and 77.5 dB SPL for the singletons with 5 jnds of
brightness. All inter-participant standard deviations were less than 1 dB SPL.
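The two SC values of this experiment are reproduced, after rounding, if each jnd is treated as a successive 5% step from the 512 Hz reference (512 × 1.05² ≈ 564 and 512 × 1.05⁵ ≈ 653). That compounding rule is an inference from the listed numbers, not something the text states, and the dictionary keys below are hypothetical. The sketch also lists the total number of jnds for each of the four combined singletons, which is the quantity against which additivity is assessed.

```python
SC_REF_HZ = 512.0  # distractor SC (from the text)
SC_JND = 0.05      # 5% jnd for SC (from the text)
DEPTH_JND = 0.1    # modulation-depth increment per jnd (from the text)

def sc_after_jnd_steps(k):
    # ASSUMPTION: each jnd is a successive 5% step (compounding)
    return round(SC_REF_HZ * (1 + SC_JND) ** k)

# Four singletons: every combination of 2 or 5 jnd on each dimension
singletons = [
    {"sc_hz": sc_after_jnd_steps(kb),
     "depth": round(DEPTH_JND * kr, 1),
     "total_jnd": kb + kr}
    for kb in (2, 5) for kr in (2, 5)
]

for s in singletons:
    print(s)
```

Under an additive account, the two singletons with 7 jnds in total (2 + 5 and 5 + 2) would be expected to produce similar response time increases.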
Results. The data processing was the same as for experiment 1. For the error rate analysis, 94.9% of the data
were kept. For the response time analysis, only the data where the participant’s response was correct were kept,
i.e., 74.9% of the data. The mean response time and error rate across the 19 participants for sequences without
singleton were 994 ms (std = 158 ms) and 17.4% (std = 12.7%). The increase in response time for each singleton,
i.e., the difference between the condition with the considered singleton and the reference condition without any
singleton, is presented in Fig. 5. Complete statistics can be found in the Supplementary information (S4).
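The response time measure described above can be sketched as follows for a single participant. This is a minimal illustration, not the analysis code of the study: the trial format (dictionaries with the hypothetical keys `condition`, `rt_ms` and `correct`) and the label `"none"` for the no-singleton condition are assumptions, and the exclusion steps inherited from experiment 1 are not reproduced.

```python
from statistics import mean


def rt_increase_by_condition(trials, baseline="none"):
    """Mean response time of each singleton condition minus the no-singleton mean.

    `trials` is an iterable of dicts with the hypothetical keys
    'condition' (str), 'rt_ms' (float) and 'correct' (bool).
    """
    rts = {}
    for trial in trials:
        # The response time analysis keeps correct responses only.
        if trial["correct"]:
            rts.setdefault(trial["condition"], []).append(trial["rt_ms"])
    baseline_rt = mean(rts[baseline])
    return {
        condition: mean(values) - baseline_rt
        for condition, values in rts.items()
        if condition != baseline
    }


if __name__ == "__main__":
    # Made-up trials for illustration only; these are not data from the study.
    example = [
        {"condition": "none", "rt_ms": 1000.0, "correct": True},
        {"condition": "none", "rt_ms": 900.0, "correct": True},
        {"condition": "2+2", "rt_ms": 1000.0, "correct": True},
        {"condition": "2+2", "rt_ms": 2000.0, "correct": False},
        {"condition": "5+5", "rt_ms": 1050.0, "correct": True},
    ]
    print(rt_increase_by_condition(example))
```

Averaging these per-participant differences across participants would then give one value per singleton condition, with the standard error across participants as the error bar.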
The effect produced by a 2 + 2-jnds variation here is comparable to that produced by a 2-jnds variation in
experiments 2 and 3. It is uncertain whether this is due to a non-additivity of the effects of the combined features
or whether participants were simply less subject to attentional capture in this experiment. Nevertheless, within
their range of magnitudes, the response times in experiment 5 appear to increase linearly with the sum of
the perceptual variations on the two dimensions (rPearson(3) = 0.99, p < 0.01, slope = 8.5 ms, std error = 0.4 ms).
In other words, the effect seems to be additive across dimensions in this range of values.
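The linearity check reported above amounts to an ordinary least-squares fit of the response time increase against the summed perceptual variation. The sketch below shows that computation under stated assumptions: the predictor is taken to be the sum of the jnd deviations on the two dimensions, with 0 for the no-singleton reference, and the y values are synthetic, generated from the reported slope purely for illustration. They are not the measured data.

```python
from math import sqrt


def linear_fit(x, y):
    """Ordinary least-squares slope, intercept and Pearson r for paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


if __name__ == "__main__":
    # ASSUMPTION: summed jnd deviation per condition
    # (no singleton, 2+2, 2+5, 5+2, 5+5).
    summed_jnd = [0, 4, 7, 7, 10]
    # SYNTHETIC values on a perfect 8.5 ms/jnd line, not the study's measurements.
    rt_increase_ms = [8.5 * v for v in summed_jnd]
    slope, intercept, r = linear_fit(summed_jnd, rt_increase_ms)
    print(f"slope = {slope:.2f} ms per jnd, intercept = {intercept:.2f} ms, r = {r:.3f}")
```

With real per-condition means in place of the synthetic values, a slope close to the single-feature slopes of experiments 2 and 3 would be the signature of additivity across dimensions.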
Public significance statement. These findings provide evidence that the perception of certain auditory
features drives the ability of sounds to capture our attention, according to laws that are revealed.
General discussion
Results from experiment 1 showed that a singleton defined by its timbre, specifically its brightness, captured
participants’ attention despite being irrelevant to the task they had to perform. Experiment 2 showed that the effect
magnitude was driven by the singleton brightness. Experiment 3 showed that a different attribute, roughness,
also drives the attentional capture effect. Results from experiments 4 and 5 revealed that this effect is symmetrical,
i.e., that only the absolute perceived deviation matters, and additive, i.e., that combining features produces the
addition of the effects that each feature variation produces alone.
Figure 5. Increase in response time (ms) with singleton SC and modulation depth (experiment 5). Error bars
represent the standard errors across participants in each condition compared to the no-singleton condition.
Significances between conditions are displayed on the horizontal braces. *: p < .05, **: p < .01, ***: p < .001.
Thus, in a series of 4 different experiments (2, 3, 4 and 5), a driving of attentional capture by the singleton
feature was observed. All else being equal in the experiments, the participants’ attentional state remained identical
across the different values of the singleton feature. Nevertheless, the magnitude of the effect increased with
increasing brightness or roughness variation. The results cannot be explained by increasing singleton-target
similarity, because the timbre variations defining the singletons did not make them more similar to the target.
Since the increased response times cannot be explained by top-down processes that change with the value of
the singleton feature, the observed relationships represent purely feature-driven components of the effect. In
other words, the bottom-up component of the attentional capture effect is revealed here, not only confirming its
existence11, but also revealing its pattern.
Thus, by varying the timbre of the tones while keeping the participants’ attentional state fixed, we were able
to elicit only the bottom-up component of attentional capture. However, the nature of our protocol itself could
raise questions about the participants’ attentional state and thus the origin of the capture. The contingency on
participants’ attentional state12 is questionable here. Indeed, according to the contingency hypothesis15, the task
leads to an attentional state that favors the detection of singletons, and this is why attention is captured by the
singleton. However, in the present experiments, there were two single items (out of five) in 80% of the trials,
and the singleton was one out of 4 possible singletons. Furthermore, all sounds had a fundamental frequency
randomly drawn from a broad uniform distribution of 20 Hz. Thus, the variability of the items was increased in
our protocol, and the target was not a single item among all identical items. The single-item detection strategy
may therefore no longer be advantageous in this setting, and the adaptation of the singleton to participants’
attentional state may be different from that which was traditionally thought to be responsible for detection in this
paradigm. Further work is needed to understand the interactions between the bottom-up component revealed
here and top-down processes, and to address the issue of the compatibility of these results with the contingent
capture approach. For example, it would be important to investigate how the driving by the singleton features
evolves as participants change their attentional state.
The feature-driven relationships obtained make it possible to observe and compare how different features
modulate attention capture. Indeed, the marginal increase of the effect (the derivative of the curves of response
time increases with the perceptual variations of the feature) can be interpreted as the weight of the feature in
the sound salience. Interestingly, in experiments 2 and 3, both features drove the effect in a similar way. Either
these two features are by chance equally responsible for the salience of a sound, or it is the perceived deviation
on each dimension that is important in making a sound salient. This evolution of attentional capture with variations
of different features therefore deserves to be confirmed through more experiments involving more features
(harmonicity, attack time, spectral flux…). If a similar driving is found for other features, it would show that
it is precisely how different the sound is perceived that matters to trigger attentional capture, regardless of the
feature used. On the contrary, some features could drive the effect with more or less power. This would lead to a
hierarchy of features that influence the salience of a stimulus in terms of its ability to capture attention.
Furthermore, the combined results of experiments 2, 3 and 4 (summarized in Fig. 4) reveal a monotonic
relationship between the perceived difference of the singleton feature (quantified in just-noticeable differences)
and the increase in reaction time. Thus, the attentional capture effect increases progressively with the perceived
difference, according to a law that appears to be linear in the range of deviations tested. This law cannot extend
over a very wide range of values, as the capture effect must saturate at some point. In any case, we observe that
there is no threshold effect: the function is monotonic and continuous. A more precise and extensive determination
of this function could be further investigated in future studies.
This work also brings new insights into the understanding of auditory salience itself, confirming the importance
of timbre in this property. Both brightness and roughness were found to be responsible for an attentional
capture by irrelevant sounds. It therefore appears that timbre is also a key dimension in directing auditory attention,
in addition to the main dimensions of frequency and intensity highlighted by Dalton and Lavie18. The results
on brightness confirm the findings that previously led some researchers to consider this feature in their salience
model16,29. Roughness has only recently been included in some form: Kothinti et al.30, for example, added average
fast temporal modulations to the latest version of their model. The relationship found between attentional
capture and feature variations seems to be supported by both features and deserves further investigation, either
in other contexts (other tasks, more complex environments…) or with other features.
Our results show that attention capture is driven by absolute deviations of the sound features. In other words,
the features do not have an intrinsic polarity with respect to salience (e.g., the brighter, the more salient). Rather,
it is a dissimilarity effect that modulates it. This is consistent with predictive coding and theories of auditory
deviance detection38. They suggest that the deviations between the prediction and what is subsequently perceived
determine auditory salience and trigger notified events39,40. Here, we support these theories by showing that
absolute deviations of the sound features directly modulate the magnitude of the attentional capture effect, i.e.,
their salience.
Finally, our findings are interesting from the perspective of auditory salience modelling, which could be
improved by knowing the relevant parameters to consider and how salience depends on their variations. The
approach taken so far is to consider the absolute and normalized feature variations over time16,39,41, without
implying a more elaborate modulation of attention with these variations. The additivity of the effect produced
by different feature variations provides insights into how to combine them41. An interesting avenue might be
to consider more complex interactions and to go deeper in the understanding of the mechanisms underlying
auditory salience.
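To make the modelling implication concrete, the symmetry and additivity findings can be written as a toy rule in which the predicted response time cost is a weighted sum of absolute jnd deviations. This is an illustrative sketch only, not a model proposed in this work: the function name, the feature keys and the default weight of 8.5 ms per jnd (borrowed from the slope reported for experiment 5) are assumptions, and the rule ignores the saturation that must occur outside the tested range.

```python
def predicted_rt_cost_ms(deviations_jnd, weights_ms_per_jnd=None, default_weight=8.5):
    """Toy additive, symmetric rule: cost = sum over features of weight * |deviation|.

    `deviations_jnd` maps a feature name to its signed deviation in jnd.
    `weights_ms_per_jnd` optionally maps a feature name to a hypothetical weight;
    features without an entry use `default_weight` (ASSUMPTION: 8.5 ms per jnd).
    """
    weights = weights_ms_per_jnd or {}
    return sum(
        weights.get(feature, default_weight) * abs(deviation)
        for feature, deviation in deviations_jnd.items()
    )


if __name__ == "__main__":
    brighter = predicted_rt_cost_ms({"brightness": 2})
    duller = predicted_rt_cost_ms({"brightness": -2})
    combined = predicted_rt_cost_ms({"brightness": 2, "roughness": 5})
    print(brighter, duller, combined)
```

Using absolute values encodes the absence of polarity, and summing over features encodes additivity; unequal weights would express a hierarchy of features, should further experiments reveal one.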
This work provides contributions on a theoretical, methodological and practical level. From a theoretical point
of view, a driving of attention capture by a stimulus feature was revealed. This modulation of bottom-up attention
was found to be monotonic and similar for the two timbre attributes studied here: brightness and roughness.
The experiment with variations in brightness highlighted symmetric properties, and the experiment with
combinations of both attributes underlined the additive character. Methodologically, a way to measure
the feature-driven component of attention was proposed: it implies modulating the singleton features in an
additional singleton paradigm while keeping the attentional state constant. From a practical perspective, the
results may enrich salience models that can include these features and the way they modulate salience in their
Finally, this study opens perspectives and calls for further studies. The extendibility of the modulation law
to more features and to a wider range of feature variations, its dependence on attentional sets and top-down
processes, and a higher resolution of the modulation curves deserve further investigation.
All data are available at https://github.com/BouvierBaptiste/Revealing-the-stimulus-driven-component-of-attention-through-modulations-of-auditory-salience-by-tim.git.
Received: 22 December 2022; Accepted: 13 April 2023
We thank the INSEAD staff for their help in welcoming and organizing participant slots during the experiments,
and Claire Richards for proofreading the manuscript. The contributed work from LTDS was performed within
the Labex CeLyA (ANR-10-LABX-0060).
Author contributions
All authors designed the experiments. B.B. collected the experimental data, performed analyses, and wrote the
manuscript. All authors reviewed the manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1038/s41598-023-33496-2.
Correspondence and requests for materials should be addressed to B.B.
Reprints and permissions information is available at
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2023
