Modeling Multisensory Integration

Abstract

The different senses, such as vision, touch, or audition, often provide redundant information for perceiving our environment. For instance, the size of an object can be determined by both sight and touch. This chapter discusses the statistically optimal framework for combining redundant sensory information to maximize perceptual precision – the Maximum Likelihood Estimation (MLE) framework – and provides examples of how human performance approaches optimality. In the MLE framework, each cue is weighed according to its precision; that is, the more precise sensory estimate receives a higher weight when integration occurs. However, before integrating multisensory information, the perceptual system needs to determine whether or not the sensory signals correspond to the same object or event (the so-called correspondence problem). Current ideas on how the perceptual system solves the correspondence problem are presented in the same mathematical framework. Additionally, the chapter briefly reviews learning and developmental influences on multisensory integration.
10  Modeling Multisensory Integration
Loes C. J. van Dam, Cesare V. Parise, and Marc O. Ernst
When we move through the world, we constantly rely on sensory inputs
from vision, audition, touch, etc., to infer the structure of our surround-
ings. These sensory inputs often do not come in isolation, and multiple
senses can be stimulated at the same time. For instance, if we knock on
a door, we see our hand making impact on the door, we hear the result-
ing sound, and we feel the movement of our arm and our hand making
contact (see figure 10.1). How does our sensory system make sense of all
these different inputs if, for example, we need to judge where exactly we
are knocking?
We can illustrate that, in order to perform such a task, we need to com-
bine our sensory information in several fundamentally different ways. For
instance, to obtain a visual estimate of where we are knocking, we have
the information of the image of our hand on the retina, but this alone
is insufficient information for estimating the location of the hand in the
world. Such an estimate of the hand’s location can only be obtained if
the information from the retinal location of the hand is combined with
the information from the eye and neck muscles about the orientation
of the eyes relative to the world. In other words, retinal location and eye
and neck muscle signals provide complementary information, and without
either one or the other, a single estimate of the location of the hand in the
world would not be possible. In a similar fashion, the auditory cues about
the location of the knocking sound need to be combined with information
about the position of the ears with respect to the world to estimate where
the knocking sound is coming from.
Using the different senses, we can obtain several independent estimates
about our knocking location. As indicated above and in figure 10.1, we
have an estimate of where our hand is through vision (retinal location
combined with eye- and neck-muscle information), through audition (loca-
tion of the sound combined with neck-muscle information), and through
proprioception (location of the hand estimated from muscles and the
stretch of the skin on our arm). Each of these estimates alone is, in prin-
ciple, sufficient to provide a good and unambiguous estimate of the hand’s
location in the world. That is, besides providing complementary informa-
tion, the separate sensory modalities often provide redundant estimates. In
this chapter, we will describe how humans exploit such redundancy to get
more reliable perceptual estimates through multisensory integration.¹
In order to deal with redundant sensory information, the simplest strat-
egy would be to pick just one sensory modality to base our estimate on
and virtually ignore the others (see Rock & Victor, 1964). Though simple,
this strategy of basing our estimate on a single sensory modality has sev-
eral drawbacks. First, the sensory estimate of the chosen modality could
be biased, and relying on it might therefore not be the best we can do if, for
example, the bias in the ignored senses is smaller. Second, the sensory
signals may be noisy, such that the very same stimulus would be perceived
slightly differently each time it is presented. That is, the estimate based on
a single modality may also not be very precise, and integrating estimates
Figure 10.1
When we knock on a door, our sensory system receives information from several dif-
ferent modalities. We see our hand knocking on the door. Together with the signals
from our eye and neck muscles, this provides information about where the knocking
event takes place. Similarly, our auditory system receives input from the sound pro-
duced by the contact. At the same time, we also feel the contact itself and the effects
it has on our arm. To behave optimally, these separate sensory modalities have to be
combined to come to a unified percept (reprinted from Ernst & Bülthoff, 2004, with
permission from Elsevier).
from separate sensory modalities would cause their noise to partially can-
cel out, thus leading to a more precise estimate. Third, individual sensory
signals are often ambiguous, hence combining information from multiple
senses can help the system to resolve such ambiguities.
Therefore, rather than ignoring part of our senses, a much better solu-
tion to deal with multiple sensory inputs would be to integrate them to
come to a more precise and unambiguous estimate. Consider, for instance,
estimating the size of an object by simultaneously seeing and touching it
(Ernst & Banks, 2002). Both vision and touch provide a sensory signal that
is inherently corrupted by noise. Thus, when presented with a certain sig-
nal, we can only say something about the actual stimulus that may have
caused it with limited certainty. This can be represented probabilistically
by likelihood functions (or probability distributions) representing the like-
lihood of the occurrence of the actual stimulus given the specific signal.
The shape of these likelihood functions is determined by the sensory noise.
Assuming that the noise is Gaussian distributed and independent across
sensory modalities,² the amount of noise can be quantified by the variance $\sigma_i^2$ of the distribution (where $i$ refers to the different modalities). An estimate $\hat{S}_i$ of the relevant stimulus property can be obtained by taking the mean of the distribution.
That is, both vision and touch provide an independent estimate $\hat{S}_i$ of the size. If the sensory noises $\sigma_i^2$ in the separate estimates are independent, the optimal way to combine multiple sensory estimates in order to maximize precision is to multiply the probability distributions corresponding to the individual estimates. When the probability distributions are Gaussian, this corresponds to a weighted average of the sensory estimates, with weights proportional to the reciprocal of the amount of noise in each estimate (i.e., its precision, sometimes also referred to as the reliability). This is called maximum likelihood estimation (MLE) and can be written as follows:

$$\hat{S} = \sum_i w_i \hat{S}_i \quad \text{with} \quad w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2} \qquad (1)$$
Here, the combined estimate Ŝ is the weighted average of the individual
estimates Ŝi from each modality, with weights wi assigned according to
the relative reliability of each sensory estimate. Given that the weight is
inversely related to the amount of noise in each estimate, the estimate with
less variance will receive a higher weight. Furthermore, when the senses are
combined in this way, the variance in the combined estimate is reduced
with respect to either of the unisensory estimates:
$$\sigma^2 = \frac{1}{\sum_i 1/\sigma_i^2} \qquad (2)$$

or, as in our example with vision and touch:

$$\sigma_{VT}^2 = \frac{\sigma_V^2\,\sigma_T^2}{\sigma_V^2 + \sigma_T^2}.$$
This implies that when the variance in the two sensory estimates is very
different, the final combined estimate and the resulting variance are both
close to that of the best (i.e., more precise) estimate.
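To make these formulas concrete, the following minimal Python sketch combines a visual and a haptic size estimate under the Gaussian assumptions stated above. The particular numbers (estimates of 55 and 61 mm with standard deviations of 2 and 4 mm) are purely illustrative and not taken from any of the studies discussed here.

```python
import numpy as np

def mle_combine(estimates, variances):
    """Combine independent Gaussian cue estimates by maximum likelihood.

    Weights are proportional to the reciprocal variances (equation 1) and the
    combined variance is the reciprocal of the summed reciprocals (equation 2).
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = (1.0 / variances) / np.sum(1.0 / variances)
    combined = np.sum(weights * estimates)        # weighted average, eq. (1)
    combined_var = 1.0 / np.sum(1.0 / variances)  # reduced variance, eq. (2)
    return combined, combined_var, weights

# Hypothetical example: vision reports 55 mm (sigma = 2 mm), touch 61 mm (sigma = 4 mm).
s_vt, var_vt, w = mle_combine([55.0, 61.0], [2.0 ** 2, 4.0 ** 2])
print(s_vt, var_vt, w)  # ~56.2 mm, combined variance 3.2 (< 4 and < 16), weights (0.8, 0.2)
```

Running the sketch shows the combined estimate landing close to the more reliable visual cue, with a combined variance smaller than either unisensory variance, exactly as equations (1) and (2) predict.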
Ernst and Banks (2002) investigated whether human multisensory per-
ception is consistent with the MLE combination of the sensory estimates.
Participants estimated the size of a ridge with vision, touch, or with both
modalities simultaneously. To test how the estimates from both senses are
combined, Ernst and Banks used a virtual reality setup to bring the two
senses into conflict and to parametrically manipulate the relative reliabili-
ties. They found that observers’ behavior was indeed well predicted by the
MLE weighting scheme.
[Figure 10.2 plots the likelihood functions for touch (T), vision (V), and the combined estimate (VT) as probability over size, with the estimates $\hat{S}_T$, $\hat{S}_{VT}$, and $\hat{S}_V$ marked.]
Figure 10.2
To estimate the size of an object, the separate estimates obtained from vision ŜV and
touch ŜT have to be combined into a unified one ŜVT. These estimates can be repre-
sented by probability distribution (i.e., likelihood functions) of how likely different
sizes of the object are, based on the sensory input. The likelihood functions vary in
uncertainty σ. The less reliable the estimate (larger σ), the wider the distribution of
the likelihood function. The combined estimate corresponding to the normalized
product of the likelihood functions has a mean (i.e., maximum) equal to the weight-
ed average of the unisensory estimates, where the relative weights are contingent
on the levels of uncertainty in each estimate. The more reliable estimate (narrower
distribution) receives a higher weight. The combined estimate is more reliable than
either of the independent estimates, as indicated by the narrower distribution.
By now, many studies have demonstrated optimal integration in other
cue/modality combinations. For example, Alais and Burr (2004) showed
that integration of spatial cues from vision and audition is also obtained in
an optimal fashion, thereby accounting for the well-known ventriloquist
effect. Similarly, the numerosity of sequences of events presented in two
or more modalities such as tactile taps combined with auditory beeps and/
or visual flashes are perceived according to optimal integration principles
(e.g., Bresciani, Dammeier, & Ernst, 2006, 2008; see also Shams, Kamitani,
& Shimojo, 2000). Further examples of MLE integration across modalities
include visuohaptic shape perception (Helbig & Ernst, 2007b), perception
of the position of the hand using both vision and proprioception (e.g., van
Beers, Sittig, & Denier van der Gon, 1999), and visual-vestibular heading
perception (Fetsch, DeAngelis, & Angelaki, 2010), to name but a few.
Integration of redundant sensory cues does not occur only between, but
also within senses. This is, for instance, the case for visual slant perception,
whereby both binocular disparity and texture gradients provide redundant
slant cues. Notably, even in this case, cue integration is well predicted by a
weighted average of binocular and texture cues (Hillis et al., 2002; Knill &
Saunders, 2003; Hillis et al., 2004), thus suggesting that the MLE weighting
scheme might represent a universal principle of sensory integration.
When estimating a property of the environment, however, we do not
just rely on our sensory inputs, but also make use of prior knowledge.
For instance, we know that under most circumstances light comes from
above (e.g., Brewster, 1826; Mamassian & Goutcher, 2001) and that most
objects are static (Stocker & Simoncelli, 2006; Weiss, Simoncelli, & Adelson,
2002). It would therefore make sense to use such information to interpret
incoming sensory signals that may otherwise be ambiguous or less precise.
Within a probabilistic framework of perception, Bayes’ theorem provides
a formal approach to model the combination of sensory inputs and prior
information. In the Bayesian framework, the prior knowledge is repre-
sented by a probability distribution indicating the expected likelihood of
certain circumstances occurring. Making use of both the sensory inputs
and prior information, the combined perceptual estimate (i.e., the poste-
rior) corresponds to the normalized product of the probability distribu-
tions defining the sensory information (i.e., the likelihood) and the prior
knowledge (i.e., the prior). This is known as Bayes’ Rule, which can be for-
mally written as:
$$P(X\,|\,\hat{S}) \propto P(\hat{S}\,|\,X)\,P(X) \qquad (3)$$
Here $P(\hat{S}\,|\,X)$ represents the likelihood function of the sensory input Ŝ occurring given the possible state of the world X. $P(X)$ represents the prior probability of the state of the world X occurring based on, for instance, prior experience. Combining these two estimates provides us with the posterior distribution $P(X\,|\,\hat{S})$, whose maximum (known as the maximum a posteriori, or MAP) indicates the most likely state of the world X given the prior knowledge and the sensory input Ŝ. A number of researchers have demonstrated
that such prior information is also optimally integrated with sensory evi-
dence to provide a more reliable estimate of the properties of the environ-
ment (e.g., Adams, Graf, & Ernst, 2004; Girshick, Landy, & Simoncelli, 2011;
Stocker & Simoncelli, 2006; Weiss, Simoncelli, & Adelson, 2002). Notably,
MLE constitutes a particular case of Bayesian sensory combination in the
absence of prior information (i.e., flat prior), and in this case the combined
estimate Ŝ of equation (1) would correspond directly to the maximum of the posterior distribution, $\max[P(X\,|\,\hat{S})]$. For more information and further
examples concerning the integration of prior information using Bayes’ rule,
see also the tutorial on Bayesian cue combination contained in chapter 1 of
this volume by Bennett, Trommershäuser, and van Dam.
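As an illustration of equation (3) in the simplest conjugate case, the sketch below combines a Gaussian likelihood with a Gaussian prior; the posterior is then Gaussian as well, and its mean (the MAP estimate) is a precision-weighted average of the sensory estimate and the prior mean. The slow-speed prior and all numerical values are hypothetical and only meant to mirror the logic of the studies cited above.

```python
def gaussian_posterior(s_hat, var_likelihood, prior_mean, var_prior):
    """Posterior over the world state X for a Gaussian likelihood and a Gaussian prior.

    P(X | S_hat) is then Gaussian; its mean (the MAP estimate) is the
    precision-weighted average of the sensory estimate and the prior mean.
    """
    posterior_var = 1.0 / (1.0 / var_likelihood + 1.0 / var_prior)
    posterior_mean = posterior_var * (s_hat / var_likelihood + prior_mean / var_prior)
    return posterior_mean, posterior_var

# Hypothetical example: a noisy speed estimate of 10 deg/s and a prior favoring slow speeds.
map_mean, post_var = gaussian_posterior(10.0, var_likelihood=4.0,
                                        prior_mean=0.0, var_prior=16.0)
print(map_mean, post_var)  # 8.0 deg/s, variance 3.2: the estimate is pulled toward the prior

# With an (almost) flat prior the MAP coincides with the sensory estimate (the MLE case).
print(gaussian_posterior(10.0, 4.0, 0.0, 1e9)[0])  # ~10.0
```

With a very broad (effectively flat) prior the MAP collapses onto the sensory estimate, which is the MLE special case mentioned above.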
1  The Cost of Integration
As shown above, integration of sensory information brings about benefits
in the combined estimate. Particularly, the precision of the combined esti-
mate is increased compared to the estimates based on each individual sen-
sory modality, and combining the estimate from sensory information with
prior information further increases the precision of the final estimate. Inte-
gration, however, also comes at a cost. If the senses are completely fused
into a unified percept, this would mean we do not have access to the sepa-
rate estimates of the individual senses anymore. In other words, we would
not be able to discriminate between a small and a large intersensory conflict
if the weighted average in the two conflicts leads to the same combined
estimate (perceptual metamers, see figure 10.3). Hillis and colleagues (2002)
investigated to what extent the separate senses are fused, or put differently,
whether we still have access to the individual sensory estimates feeding
into the combined percept. Participants had to identify the odd one out in
a set of three stimuli and could use any difference, regardless of whether
unisensory or combined, to do the task. If participants had access to uni-
sensory estimates, they should have been able to perform this task accu-
rately. Conversely, if the senses were completely fused, identifying the odd
one out would be impossible when it is defined by a sensory conflict (i.e.,
metameric behavior). For cue combination within a single sensory modal-
ity, such as the combination of visual perspective and binocular disparity
cues to slant, they found that information about the separate estimates is
indeed lost. That is, participants clearly showed metameric behavior. How-
ever, for size estimation across two separate modalities (vision and haptics),
observers still had access to the individual senses in spite of the fact that
sensory fusion took place, as observed by Ernst and Banks (2002). Costs of
multisensory integration in terms of metameric behavior are also evident
when previously unrelated cues end up integrated after prolonged exposure
to multisensory correlation (Ernst, 2007; Parise et al., 2013).
2  The Correspondence Problem
It may seem that the cost of integration, that is, losing information about
the separate sensory modalities, is a small price to pay when combining the
separate sensory modalities. After all, through sensory integration we can
[Figure 10.3 panels, from left (no combination) to right (complete fusion), plot discrimination performance over the visual and haptic estimates SV and SH, both in JND units.]
Figure 10.3
Discrimination performance in an oddity task for different levels of strength of fu-
sion represented in 2-D sensory space. The x-axis represents the visual estimate of,
for example, object size; the y-axis represents the haptic estimate. Both dimensions
are normalized for the unisensory just noticeable difference (JND). Black corresponds
to discrimination performance at chance relative to the center of this 2-D space, and
white indicates perfect discrimination. If the separate estimates are independent, i.e.,
no combination takes place, discrimination performance should be about equal in
all directions (left panel) and fully dependent on detection of a difference in either
of the unisensory modalities. If, however, full fusion takes place (right panel), all sen-
sory combinations leading to the same combined estimate (negative diagonal) be-
come indistinguishable. Thus, discrimination performance along the negative diago-
nal should be at chance (perceptual metamers). Between these two extremes there are
different levels of coupling strength (middle) (reprinted from Ernst, 2007, © ARVO).
obtain more precise estimates about the same physical property of inter-
est compared to each sensory modality alone. However, in spite of being
a statistically optimal solution, a mandatory fusion of sensory informa-
tion according to MLE is not always the best strategy to deal with multiple
incoming signals. For example, if the spatial separation between a sound
source and a visual event is large, one may wonder whether those signals are
causally related at all and, hence, whether it makes sense to integrate
them. If we always integrated the signals from vision and audition,
information about this discrepancy would be lost and we would end up
with one inaccurate combined estimate rather than two separate estimates
of the two unrelated events. In other words, the perceptual system has to
determine whether or not two signals belong to the same event in order to
choose whether or not to integrate them. This is known as the correspon-
dence problem, or causal inference. It has been demonstrated that spatial
separation (Gepshtein et al., 2005), temporal delay (Bresciani et al., 2005),
and natural correlation (Parise & Spence, 2009) between two sensory sig-
nals are used by the perceptual system as parameters that influence the
integration of the signals. However, fusion can survive a large spatial sepa-
ration between the sensory signals, provided there is additional evidence
that the signals are coming from a common source (e.g., seeing your own
hand move via a mirror, Helbig & Ernst, 2007a). Moreover, it has been
demonstrated that prolonged exposure to spatiotemporal asynchrony can
destroy fusion (Rohde, Di Luca, & Ernst, 2011).
On top of being spatially and temporally coincident, multisensory sig-
nals originating from the same distal event are also often similar in nature.
Specifically, the correlation of temporal structure of multiple sensory sig-
nals, rather than merely their temporal coincidence, provides a powerful
cue for the brain to determine whether or not multiple sensory signals have
a common cause. In order to demonstrate the role of signal correlation in
causal inference, Parise and colleagues (2012) asked participants to localize
streams of beeps and flashes presented together or in isolation. In com-
bined audiovisual trials, the temporal structure of the visual and auditory
stimuli could either be correlated or not. Notably, localization was well pre-
dicted by the MLE weighting scheme only when the signals were correlated,
while it was suboptimal otherwise. This demonstrates that the correlation
in the fine temporal structure of multiple sensory signals is indeed a cue for
causal inference.
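A toy way to picture this correlation cue, not the actual analysis of Parise and colleagues (2012), is to correlate the temporal intensity profiles of the two streams directly, as in the Python sketch below; the event rate and noise levels are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Toy temporal intensity profiles: one common event train drives both modalities,
# plus independent sensory noise in each channel.
common_events = (rng.random(n_samples) < 0.1).astype(float)
visual = common_events + 0.3 * rng.standard_normal(n_samples)
audio_same_cause = common_events + 0.3 * rng.standard_normal(n_samples)

# Control: the auditory stream follows its own, independent event train.
other_events = (rng.random(n_samples) < 0.1).astype(float)
audio_other_cause = other_events + 0.3 * rng.standard_normal(n_samples)

# Correlation of the temporal fine structure as a (toy) common-cause cue.
r_same = np.corrcoef(visual, audio_same_cause)[0, 1]
r_other = np.corrcoef(visual, audio_other_cause)[0, 1]
print(f"same cause:      r = {r_same:.2f}")   # clearly positive -> integrate
print(f"different cause: r = {r_other:.2f}")  # near zero -> keep separate
```

A high correlation is evidence for a common cause and hence for integration, whereas a correlation near zero argues for treating the signals separately.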
In a Bayesian framework, the strength of sensory fusion can be mod-
eled as a coupling prior (Bresciani, Dammeier, & Ernst, 2006; Ernst 2005,
2007, 2012; Ernst & Di Luca, 2011; Knill, 2007; Körding et al., 2007; Roach,
Heron, & McGraw, 2006; Shams, Ma, & Beierholm, 2005) that represents
the belief that two signals go together. This belief is likely based on prior
experience, whereby having been exposed to consistent correlation between
the sensory modalities, we build up an expectation of whether certain com-
binations of signals do—or do not—belong together. This can be better
understood if we illustrate the integration process in a two-dimensional
perceptual space with the separate sensory modalities represented along
the x- and y-axes (figure 10.4). In such a space, a prior expectation for cou-
pling would be represented by the tuning of a 2-D probability distribution,
the coupling prior, along the identity line (figure 10.4, middle column).
In the case of strong perceptual coupling, this tuning will be very narrow
(i.e., the signals are expected to fully correlate; see figure 10.4, bottom). If
instead the sensory signals are expected to co-occur only by chance, the
coupling prior will be flat (no association, figure 10.4, top). In other words,
the coupling prior can be interpreted as a predictor of how strongly one
sensory signal reflects what the other one should be. Using the coupling
prior, the separate sensory estimates can be adjusted according to both the
estimate obtained from the other source and the strength of the expected
association.
In our 2-D representation, this means that the estimates from the uni-
sensory modalities (the likelihood function, figure 10.4, left column) are
multiplied by the coupling prior (figure 10.4, middle column), to obtain
a combined a posteriori estimate (right column). The maximum of the a posteriori distribution (MAP), that is, the most likely combined sensory estimates $\hat{S}_{MAP} = (\hat{S}_V^{MAP}, \hat{S}_H^{MAP})$, would eventually determine the final percept.
Depending on the shape of the coupling prior (e.g., a 2-D Gaussian probability distribution), and in particular on the width of the prior along the negative diagonal, which represents the coupling uncertainty σx, different a posteriori estimates will be obtained. The MLE type of integration in which the senses are completely fused can be illustrated as a prior that enables a one-to-one mapping from one sensory estimate to the other (i.e., an infinitely thin line along the positive diagonal; figure 10.4, bottom row). In this case, the system is certain the two signals always belong together, and thus the coupling uncertainty σx is zero. Here, the combined estimate in the posterior distribution will always end up on the identity line ($\hat{S}_V^{MAP} = \hat{S}_H^{MAP}$),
and information about a discrepancy between the senses would be com-
pletely lost. At the other extreme, if the sensory modalities are considered
independent by the perceptual system, the coupling prior is flat (i.e., the
coupling uncertainty σx is infinite). In this case, the a posteriori estimate
is fully determined by the likelihood function, and no combined percept
[Figure 10.4 panels: columns show the likelihood, the coupling prior, and the resulting posterior in the 2-D space spanned by the sensory estimates; rows range from a flat prior (no coupling) to an impulse prior with σx² = 0 (full fusion, MLE), with the unisensory estimates $(\hat{S}_V, \hat{S}_H)$ and the MAP estimate $(\hat{S}_V^{MAP}, \hat{S}_H^{MAP})$ marked.]
Figure 10.4
The sensory estimates from two separate modalities (e.g., vision and touch) are
represented in 2-D space (left side). The Gaussian blob represents the uncertainties
(blob-width) as well as the unisensory estimates (position along the corresponding
axis). The coupling prior linking these separate estimates together can have different
shapes: from completely flat, indicating no integration (top), to infinitely narrow,
indicating full fusion of the senses (bottom). To obtain the final a posteriori esti-
mate (right column), the distributions of the unisensory estimates are multiplied
with the prior. Depending on how strongly the senses are combined, the location
of the maximum a posteriori probability shifts in the 2-D perceptual space (adapted
from Ernst, 2007).
is formed (figure 10.4, top row). For intermediate levels of fusion, the cou-
pling prior can be modeled as a bivariate distribution aligned along the
diagonal (middle plot). In this case, partial fusion will take place, and there
will be a perceptual benefit for estimating the property of interest without
the information about the separate modalities being completely lost (see
figure 10.3). The narrower the distribution of the coupling prior, the stron-
ger the fusion between the senses.
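The role of the coupling prior can be sketched numerically by evaluating the 2-D likelihood and a ridge-shaped prior of chosen width σx on a grid and locating the maximum of their product, as below. The grid range, the unisensory estimates, and all widths are arbitrary illustrative values, not a fit to any data set.

```python
import numpy as np

def map_with_coupling_prior(s_v, s_h, sigma_v, sigma_h, sigma_x,
                            grid=np.linspace(0.0, 10.0, 401)):
    """MAP estimate (S_V, S_H) given unisensory estimates and a coupling prior.

    The likelihood is a separable 2-D Gaussian centered on (s_v, s_h); the
    coupling prior is a Gaussian ridge along the identity line whose width is
    sigma_x (sigma_x -> 0 means full fusion, sigma_x -> infinity no coupling).
    """
    SV, SH = np.meshgrid(grid, grid, indexing="ij")
    log_like = -0.5 * ((SV - s_v) / sigma_v) ** 2 - 0.5 * ((SH - s_h) / sigma_h) ** 2
    log_prior = -0.5 * ((SV - SH) / sigma_x) ** 2
    i, j = np.unravel_index(np.argmax(log_like + log_prior), SV.shape)
    return SV[i, j], SH[i, j]

# Hypothetical conflict: vision reports 4, touch reports 6, equal reliabilities.
print(map_with_coupling_prior(4.0, 6.0, sigma_v=1.0, sigma_h=1.0, sigma_x=0.1))
# -> roughly (5.0, 5.0): near-complete fusion on the identity line
print(map_with_coupling_prior(4.0, 6.0, sigma_v=1.0, sigma_h=1.0, sigma_x=100.0))
# -> roughly (4.0, 6.0): essentially no coupling, the unisensory estimates survive
```

With a narrow prior the MAP lands on the identity line (full fusion); with a very broad prior it stays at the unisensory estimates (no coupling), matching the two extremes in figure 10.4.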
Here it is important to note that as long as priors and likelihood distri-
butions are assumed to be Gaussian, sensory integration will always occur
for all possible signal combinations. That is, assuming Gaussian priors and
likelihoods, the weighting between the senses would not change regardless
of whether the discrepancy between the senses is small (close to the diago-
nal) or very large (off-diagonal, as demonstrated in figure 10.5a). Ideally,
however, if the discrepancy between the signals is large, one can wonder
whether they are caused by the same event at all, and rather than inte-
grating them, it would be better to treat them separately. Such a complete
breakdown of integration for large discrepancies is not possible to model
using a single 2-D Gaussian distribution for the coupling prior. In other
words, to model both integration for small sensory discrepancies as well as
[Figure 10.5 panels: (A) Gaussian coupling prior and (B) heavy-tailed coupling prior, each showing prior, likelihood, and posterior (with a 1-D cross section) over visual and haptic size for increasing prior-likelihood separation.]
Figure 10.5
(A) Integration using a Gaussian-shaped prior. The weighting for the a posteriori
estimate is always the same regardless of the discrepancy between the senses. (B)
Using a heavy-tailed prior, integration occurs for small discrepancies, but with large
discrepancies, the a posteriori estimate is the same as for the likelihood, indicating
a breakdown of integration. In other words, the heavy-tailed prior can explain both
integration for small sensory discrepancies as well as breakdown of integration for
large sensory discrepancies (adapted from Ernst, 2012, fig. 30.7).
breakdown of integration for large discrepancies, the shape of the coupling
prior is very important.
Using a mixture of the distributions described above, we can, however,
create a single coupling prior that can explain both integration for small
conflicts as well as breakdown of integration for large conflicts. For exam-
ple, in figure 10.5b, the shape of the prior is Gaussian-like, but instead of
approaching zero for large sensory conflicts, it has heavy tails added to
it. That is, for large sensory conflicts, the prior is flat and nonzero. For
the posterior estimates, this means that for small conflicts (top), the com-
bined estimate is influenced by the peak of the prior (integration) in a very
similar fashion, as discussed above for Gaussian priors. However, for large
conflicts, the flat part of the prior does not influence the combined esti-
mate (bottom) and thus will lead to independent treatment of the signals.
This demonstrates that simply adding heavy tails to the coupling prior (i.e.,
nonzero probability for large cue conflicts) can explain the breakdown of
integration when the conflict between the signals is too large, while at the
same time keeping sensory integration for small conflicts intact (Ernst &
Di Luca, 2011; Ernst, 2012; Körding et al., 2007; Roach, Heron, & McGraw,
2006). Such a smooth transition from integration to no integration depend-
ing on the sensory discrepancy, known as the “window of integration,”
has also been repeatedly shown experimentally (Bresciani et al., 2005; Jack
& Thurlow, 1973; Jackson, 1953; Radeau & Bertelson, 1987; Shams, Kami-
tani, & Shimojo, 2002; van Wassenhove, Grant, & Poeppel, 2007; Warren
& Cleaves 1971; Witkin, Wapner, & Leventhal, 1952), and the coupling
prior approach using heavy tails approximates these experimental findings
quite well. However, an empirical measure of the precise shape of the cou-
pling prior along all possible discrepancy axes (e.g., temporal, spatial, etc.)
is of course not so easy to obtain, as it would require a very large number
of measurements.
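To see how the heavy tails produce this breakdown, the sketch below compares the MAP estimates obtained with a purely Gaussian coupling prior and with a Gaussian-plus-flat mixture for a small and a large cue conflict. The mixture weight, the tail density, and all other numbers are arbitrary choices for illustration only.

```python
import numpy as np

def map_with_mixture_prior(s_v, s_h, sigma_v=1.0, sigma_h=1.0, sigma_x=1.0,
                           tail_weight=0.0, tail_density=0.025,
                           grid=np.linspace(-10.0, 30.0, 801)):
    """MAP estimate (S_V, S_H) under a coupling prior with optional heavy tails.

    The prior on the discrepancy d = S_V - S_H mixes a Gaussian of width sigma_x
    with a flat tail of constant density: tail_weight = 0 gives the pure Gaussian
    prior of figure 10.5a, tail_weight > 0 the heavy-tailed prior of figure 10.5b.
    """
    SV, SH = np.meshgrid(grid, grid, indexing="ij")
    likelihood = np.exp(-0.5 * ((SV - s_v) / sigma_v) ** 2
                        - 0.5 * ((SH - s_h) / sigma_h) ** 2)
    d = SV - SH
    gauss = np.exp(-0.5 * (d / sigma_x) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_x)
    prior = (1.0 - tail_weight) * gauss + tail_weight * tail_density
    i, j = np.unravel_index(np.argmax(likelihood * prior), SV.shape)
    return SV[i, j], SH[i, j]

# Small conflict (4 vs. 6): both priors pull the estimates together.
print(map_with_mixture_prior(4.0, 6.0, tail_weight=0.0))
print(map_with_mixture_prior(4.0, 6.0, tail_weight=0.05))
# Large conflict (4 vs. 16): the Gaussian prior still (incorrectly) fuses the cues,
# whereas the heavy-tailed prior lets integration break down.
print(map_with_mixture_prior(4.0, 16.0, tail_weight=0.0))   # -> roughly (8, 12)
print(map_with_mixture_prior(4.0, 16.0, tail_weight=0.05))  # -> roughly (4, 16)
```

For the small conflict both priors pull the estimates together, but for the large conflict only the Gaussian prior keeps (incorrectly) fusing the cues, while the heavy-tailed prior effectively switches integration off.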
3  Learning
In the integration process, there are several stages at which learning can
influence how and to what extent the sensory modalities are fused. First
of all, in order to integrate the sensory information optimally, we need
to know the reliabilities associated with each signal. That is, we need to
assign the relative weights to each cue. How the sensory system exactly
estimates the reliabilities, and thus the weights, is not entirely clear, but
learning appears to be involved. For instance, Ernst and colleagues (2000)
put two unisensory estimates of slant (perspective and stereo disparity) into
conflicts of different extents. At the same time, by moving an object along
the surface, haptic sensation provided a third estimate of the slant. The hap-
tic slant estimate was always in correspondence with one visual cue (either
perspective or disparity), and the slant specified by the remaining visual cue
was randomly chosen from several different levels of conflict. Given such a
situation, one of the visual estimates was always the odd one out from the
three slant estimates, and it would make sense for the perceptual system to
learn to weigh this cue less. This is exactly what Ernst and colleagues (2000)
found. After exposure to such a situation, a reweighing of the visual cues
had taken place, such that the always odd-one-out sensory estimate was
weighed less when only the visual signals were presented together. In an
analogous study using a different training paradigm, Atkins and colleagues
(2001) showed very similar results. An open question, however, is whether this type of learning leads to a change in the reliability of the individual cues when measured in isolation, or whether it merely reflects a change in the decision rule or in some prior used when combining a cue with other sensory signals.
In addition to learning the reliabilities, the mapping between multiple
sensory inputs can also be learned. For example, Ernst (2007) investigated
whether humans could learn to integrate two previously unrelated signals.
For this purpose, he brought two normally unrelated sensory signals, the
luminance of an object and the stiffness of the same object, into corre-
spondence during a training phase so that bright objects would be very
stiff and dark objects soft (or vice versa). Before training, observers treated
these signals as independent. That is, in an oddity task, performance was
similar regardless of whether the odd one out was defined by a difference
in brightness and stiffness congruent or incongruent to the to-be-trained
axis. However, after training, performance for identifying differences that
were incongruent to the learned correlation axis (i.e., the odd one out was
chosen from the anticorrelation axis) was significantly worse than when
the difference was defined along the congruent axis (luminance and stiff-
ness in correspondence to the learned correlation). Previous studies have
demonstrated that the fusion of sensory combinations leading to the same
combined estimate results in perceptual metamers (Hillis et al., 2002, see
“The Cost of Integration”). Therefore, such a difference in performance
before versus after training is consistent with integration of the two
signals and indicates that participants no longer have complete access to
the individual cues. If the cues were still treated independently, perfor-
mance for congruent and incongruent trials should still have been the
same after training. In short, Ernst (2007) demonstrated that we can learn
to integrate arbitrary signals from different sensory modalities if we are
exposed to correlations between these two. This learning is consistent with
a change in the distribution of the coupling prior after training (from flat
to nonflat).
Learning the mapping between multiple senses is also a key factor for
multisensory perceptual development. Recent studies have demonstrated
that until the age of eight or ten, children are not optimal integrators; that
is, when simultaneously presented with redundant sensory information,
they base their sensory estimates on a single sensory cue (not necessarily
the most precise one), virtually ignoring the others (Gori et al., 2008; Nar-
dini et al., 2008). Moreover, it has been shown that blind people regaining
sight are unable to immediately match visual and haptic shapes and that
some exposure to visuotactile stimuli (between five and nineteen days) is
necessary to perform such tasks (Held et al., 2011). This result mirrors a
developmental study by Meltzoff and Borton (1979), demonstrating that
infants between the ages of twenty-six and thirty-three days old are able to
match visual and haptic shape information. In a Bayesian framework, the
inability to optimally integrate multiple sensory inputs can be modeled by
a flat coupling prior, representing the lack of knowledge about the mapping
between the senses (Ernst, 2008). Then, during development, perceptual
systems learn the natural correlation between the senses—hence the shape
of the coupling prior. Note, however, that there is evidence that some mech-
anisms aimed at matching information from the separate senses may well
be present already at birth (see, e.g., Streri & Gentaz, 2003). Such findings
suggest that the brain might be carefully designed to search for similarities
between the senses, a necessary condition for learning of a coupling prior
to occur.
In short, in order to integrate the senses, learning is involved in deter-
mining the relative weights as well as the mapping and the strength of
the correlation between the senses. But perhaps the most obvious form of
learning is that we appear to integrate prior knowledge into our perceptual
estimates. This prior knowledge somehow has to be embedded in the per-
ceptual system. The question is whether we have such knowledge innately,
acquire it in early development, or are continuously updating our prior
knowledge of the world. In other words, how fixed are our priors? Adams
and colleagues (2004) investigated this question for the light-from-above
prior, that is, the expectation that light sources are normally positioned
above the observer’s head. Observers were presented with ambiguous visual
images of shaded objects that could either be perceived as concave or con-
vex. When observers actively touched those ambiguous shapes, percep-
tion (convex or concave) was disambiguated through a combination of
the senses. Using such a disambiguation scheme, Adams and colleagues
exposed the observers to a light direction that was rotated thirty degrees
from the observers’ initial light-from-above prior. After training, Adams and
colleagues found that the light-from-above prior had changed in a manner
corresponding to training, thus demonstrating that priors are dynamically
tuned to the statistics of the world. In a follow-up study, Adams and col-
leagues (2010) demonstrated that error signals, indicating that the percept
from the ambiguous input alone was wrong, are important for such learn-
ing to occur. Furthermore, using an entirely different task, Körding and
Wolpert (2004) showed that, depending on the experience, the learned pri-
ors can even take on relatively complex shapes.
Another form of multisensory perceptual learning is sensory recalibra-
tion. Consider, for instance, watching an out-of-sync video in which the
audio signal always precedes the corresponding visual events. Initially, we
are likely to notice such a discrepancy since we still have access to the indi-
vidual estimates from each modality with respect to time. But as you may
have experienced when watching such videos, the asynchrony between the
auditory and visual signals seems to decrease over time. In short, we are
remapping (i.e., learning) how these sensory signals link together in the
temporal domain based on the consistent information that they are out of
sync at a constant delay; that is, we are recalibrating to this sensory discrep-
ancy (Fujisaki et al., 2004).
In sensory recalibration, which modality is being changed also depends
on the assumed statistics of the signals and auxiliary information about
which source might be at fault. For instance, Di Luca and colleagues (2009)
showed, by investigating audiotactile and visuotactile synchrony perception
after audiovisual adaptation, that time perception for vision had changed
the most. In the temporal domain, vision is also the less precise cue of the
two, and the system may therefore also have assumed it to be the least accu-
rate. However, when the auditory signal was presented via headphones,
providing additional information that the auditory signal is coming from
a different location (auditory signals presented via headphones are often
perceived to be coming from inside the head), the auditory signal changed
the most after being exposed to an audiovisual delay (Di Luca, Machulla,
& Ernst, 2009). How quickly adaptation takes place is dependent, among
other factors, on the quality/reliability of the error signal. For instance,
Burge and colleagues (2008) showed that for visuomotor adaptation, very
reliable visual feedback results in faster adaptation than when the visual
feedback is blurred, and thus less reliable. Moreover, recent findings have
shown that adaptation to cross-sensory inconsistencies can sometimes be
very fast, occurring after a few milliseconds of exposure (Wozny & Shams,
2011; Van der Burg, Alais, & Cass, 2013).
Overall, as is evident from the examples above, we learn to interpret
the incoming signals by constantly updating our knowledge about the
general behavior of the world (i.e., its statistics). The Bayesian framework
introduced above explicitly incorporates the use of knowledge from prior
experience (priors) and thus also provides an elegant approach to model all
such learning effects (see, e.g., Ernst & Di Luca, 2011; Ernst, 2012). In par-
ticular, learning which modality (not) to trust (e.g., Ernst et al., 2000) can
be conceptualized as learning the weights (the reliability) of the individual
sensory estimates and thus the widths of the distributions. Learning the
mapping between multiple sensory cues (e.g., Ernst, 2007) can be modeled
as learning the shape (i.e., slope and variance) of a coupling prior. Crossmo-
dal recalibration (e.g., Di Luca, Machulla, & Ernst, 2009) can be represented
as a shift of the coupling prior, and learning the statistics of the environ-
ment (e.g., Adams, Graf, & Ernst, 2004; Körding & Wolpert, 2004) can be
modeled as a prior influencing the interpretation of sensory cues.
4  Concluding Remarks and Open Questions
Here we have discussed how we can model the combination of separate
estimates from multiple sensory modalities. Overall, sensory integration
seems to occur in an optimal fashion, but before integration takes place, the
system has to estimate how likely it is that the different signals belong together (the correspondence problem). We have discussed how the system makes use of prior knowledge to help in this process and that learning is involved at every stage of integration. Still, there are a number of open
questions, such as how such computational principles are implemented in
the brain; how the brain knows on an instance-by-instance basis the vari-
ance of the sensory input; how learning influences unisensory estimates;
how to reconcile apparently non-Bayesian findings (such as the size-weight
illusion, see Ernst, 2009; Flanagan, Bittner, & Johansson, 2008). It will be
a challenge for future research to tackle such questions, to provide further
insight into the actual validity of this framework for perception science,
and to explore its potential in other disciplines such as cognitive science
and cognitive robotics.
Notes
1. Though note that sensory systems also have to deal with complementary infor-
mation. Battaglia and colleagues (2010) examined an appealing case, in which cues
from the separate senses are complementary rather than redundant—the perception
of object-size change. Using only monocular vision, an object changing in size
cannot be distinguished from the same object moving in depth because both affect
the retinal image size in very much the same way. Therefore, to unambiguously
perceive the retinal image changes as object-size changes, additional information
about the object location is needed to “explain away” the effect of a change in depth
on the retinal image. Such complementary information can come from another
source such as touch or binocular vision.
2. See Ernst (2012) for a treatise on deviations from the assumptions of normally
distributed and independent estimates.
References
Adams, W. J., Graf, E. W., & Ernst, M. O. (2004). Experience can change the “light-
from-above” prior. Nature Neuroscience, 7, 1057–1058.
Adams, W. J., Kerrigan, I. S., & Graf, E. W. (2010). Efficient visual recalibration from
either visual or haptic feedback: The importance of being wrong. Journal of Neurosci-
ence, 30(44), 14745–14749.
Alais, D., & Burr, D. (2004). The ventriloquist effect results from near optimal cross-
modal integration. Current Biology, 14, 257–262.
Atkins, J., Fiser, J., & Jacobs, R. A. (2001). Experience-dependent visual cue integra-
tion based on consistencies between visual and haptic percepts. Vision Research, 41,
449–461.
Battaglia, P. W., Di Luca, M., Ernst, M. O., Schrater, P. R., Machulla, T., & Kersten, D.
(2010). Within- and cross-modal distance information disambiguate visual size-
change perception. PLoS Computational Biology, 6(3), e1000697. doi:10.1371/journal.
pcbi.1000697.
Bresciani, J.-P., Dammeier, F., & Ernst, M. O. (2006). Vision and touch are automati-
cally integrated for the perception of sequences of events. Journal of Vision, 6(5),
554–564.
Bresciani, J.-P., Dammeier, F., & Ernst, M. O. (2008). Tri-modal integration of visual,
tactile and auditory signals for the perception of sequences of events. Brain Research
Bulletin, 75, 753–760.
Bresciani, J.-P., Ernst, M. O., Drewing, K., Bouyer, G., Maury, V., & Kheddar, A.
(2005). Feeling what you hear: Auditory signals can modulate tactile taps percep-
tion. Experimental Brain Research, 162(2), 172–180.
Brewster, D. (1826). On the optical illusion of the conversion of cameos into inta-
glios, and of intaglios into cameos, with an account of other analogous phenomena.
Edinburgh Journal of Science, 4, 99–108.
Burge, J., Ernst, M. O., & Banks, M. S. (2008). The statistical determinants of adapta-
tion rate in human reaching. Journal of Vision, 8(4), 1–19.
Di Luca, M., Machulla, T., & Ernst, M. O. (2009). Recalibration of multisensory simulta-
neity: Cross-modal transfer coincides with a change in perceptual latency. Journal of
Vision, 9(12), 1–16.
Ernst, M. O. (2005). A Bayesian view on multimodal cue integration. In G. Knoblich,
I. M. Thornton, M. Grosjean, & M. Shiffrar (Eds.), Human body perception from the
inside out. Oxford: Oxford University Press.
Ernst, M. O. (2007). Learning to integrate arbitrary signals from vision and touch.
Journal of Vision, 7(5), 1–14.
Ernst, M. O. (2008). Multisensory integration: A late bloomer. Current Biology, 18(12),
R519–R521.
Ernst, M. O. (2009). Perceptual learning: Inverting the size-weight illusion. Current
Biology, 19(1), R23–R25.
Ernst, M. O. (2012). Optimal multisensory integration: Assumptions and limits. In
B. E. Stein (Ed.), The new handbook of multisensory processes (pp. 1084–1124). Cam-
bridge, MA: MIT Press.
Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information
in a statistically optimal fashion. Nature, 415, 429–433.
Ernst, M. O., Banks, M. S., & Bülthoff, H. H. (2000). Touch can change visual slant
perception. Nature Neuroscience, 3(1), 69–73.
Ernst, M. O., & Bülthoff, H. H. (2004). Merging the senses into a robust percept.
Trends in Cognitive Sciences, 8(4), 162–169.
Ernst, M. O., & Di Luca, M. (2011). Multisensory perception: From integration to
remapping. In J. Trommershäuser, K. Körding, & M. S. Landy (Eds.), Sensory cue inte-
gration (pp. 224–250). New York: Oxford University Press.
Fetsch, C. R., DeAngelis, G. C., & Angelaki, D. (2010). Visual-vestibular cue integra-
tion for heading perception: Applications of optimal cue integration theory. Euro-
pean Journal of Neuroscience, 31(10), 1721–1729.
Flanagan, J. R., Bittner, J. P., & Johansson, R. S. (2008). Experience can change dis-
tinct size-weight priors engaged in lifting objects and judging their weights. Current
Biology, 18(22), 1742–1747.
Fujisaki, W., Shimojo, S., Kashino, M., & Nishida, S. (2004). Recalibration of audiovi-
sual simultaneity. Nature Neuroscience, 7(7), 773–778.
Gepshtein, S., Burge, J., Ernst, M. O., & Banks, M. S. (2005). The combination of
vision and touch depends on spatial proximity. Journal of Vision, 5(11), 1013–1023.
Girshick, A. R., Landy, M. S., & Simoncelli, E. P. (2011). Cardinal rules: Visual orien-
tation perception reflects knowledge of environmental statistics. Nature Neuroscience,
14(7), 926–932.
Gori, M., Del Viva, M., Sandini, G., & Burr, D. C. (2008). Young children do not inte-
grate visual and haptic form information. Current Biology, 18(9), 694–698.
Helbig, H. B., & Ernst, M. O. (2007a). Knowledge about a common source can pro-
mote visual-haptic integration. Perception, 36(10), 1523–1533.
Helbig, H. B., & Ernst, M. O. (2007b). Optimal integration of shape information
from vision and touch. Experimental Brain Research, 179(4), 595–606.
Held, R., Ostrovsky, Y., de Gelder, B., Gandhi, T., Ganesh, S., Mathur, U., et al.
(2011). The newly sighted fail to match seen with felt. Nature Neuroscience, 14(5),
551–553.
Hillis, J. M., Ernst, M. O., Banks, M. S., & Landy, M. S. (2002). Combining sensory
information: Mandatory fusion within, but not between, senses. Science, 298(5598),
1627–1630.
Hillis, J. M., Watt, S. J., Landy, M. S., & Banks, M. S. (2004). Slant from texture and
disparity cues: Optimal cue combination. Journal of Vision, 4(12), 967–992.
Jack, C. E., & Thurlow, W. R. (1973). Effects of degree of visual association and angle of
displacement on the “ventriloquism” effect. Perceptual and Motor Skills, 37, 967–979.
Jackson, C. V. (1953). Visual factors in auditory localization. Quarterly Journal of
Experimental Psychology, 5, 52–65.
Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and tex-
ture information for judgments of surface slant? Vision Research, 43, 2539–2558.
Knill, D. C. (2007). Robust cue integration: A Bayesian model and evidence from
cue-conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7),
1–24.
Körding, K. P., Beierholm, U., Ma, W. J., Quartz, S., Tenenbaum, J. B., & Shams, L.
(2007). Causal inference in multisensory perception. PLoS ONE, 2(9), e943.
Körding, K. P., & Wolpert, D. M. (2004). Bayesian integration in sensorimotor learn-
ing. Nature, 427, 244–247.
Mamassian, P., & Goutcher, R. (2001). Prior knowledge on the illumination posi-
tion. Cognition, 81(1), B1–B9.
Meltzoff, A. N., & Borton, R. W. (1979). Intermodal matching by human neonates.
Nature, 282, 403–404.
Nardini, M., Jones, P., Bedford, R., & Braddick, O. (2008). Development of cue inte-
gration in human navigation. Current Biology, 18(9), 689–693.
Parise, C. V., Harrar, V., Ernst, M. O., & Spence, C. (2013). Cross-correlation between
auditory and visual signals promotes multisensory integration. Multisensory Research,
26, 307–316.
Parise, C. V., & Spence, C. (2009). When birds of a feather flock together: Synes-
thetic correspondences modulate audiovisual integration in nonsynesthetes. PLoS
ONE, 4(5), e5664.
Parise, C. V., Spence, C., & Ernst, M. O. (2012). When correlation implies causation
in multisensory integration. Current Biology, 22(1), 46–49.
Radeau, M., & Bertelson, P. (1987). Auditory-visual interaction and the timing of
inputs: Thomas (1941) revisited. Psychological Research, 49, 17–22.
Roach, N. W., Heron, J., & McGraw, P. V. (2006). Resolving multisensory conflict: A
strategy for balancing the costs and benefits of audio-visual integration. Proceedings
of the Royal Society of London, Series B: Biological Sciences, 273, 2159–2168.
Rock, I., & Victor, J. (1964). Vision and touch: An experimentally created conflict
between the two senses. Science, 143, 594–596.
Rohde, M., Di Luca, M., & Ernst, M. O. (2011). The rubber hand illusion: Feeling of
ownership and proprioceptive drift do not go hand in hand. PLoS ONE, 6(6), e21659.
Epub June 28, 2011.
Shams, L., Kamitani, Y., & Shimojo, S. (2000). What you see is what you hear.
Nature, 408, 788.
Shams, L., Kamitani, Y., & Shimojo, S. (2002). Visual illusion induced by sound. Cog-
nitive Brain Research, 14, 147–152.
Shams, L., Ma, W. J., & Beierholm, U. (2005). Sound-induced flash illusion as an
optimal percept. Neuroreport, 16, 1923–1927.
Stocker, A. A., & Simoncelli, E. P. (2006). Noise characteristics and prior expectations
in human visual speed perception. Nature Neuroscience, 9(4), 578–585.
Streri, A., & Gentaz, E. (2003). Cross-modal recognition of shape from hand to eyes
in human newborns. Somatosensory and Motor Research, 20(1), 13–18.
van Beers, R. J., Sittig, A. C., & Denier van der Gon, J. J. (1999). Integration of pro-
prioceptive and visual position information: An experimentally supported model.
Journal of Neurophysiology, 81, 1355–1364.
van der Burg, E., Alais, D., & Cass, J. (2013). Rapid recalibration to audiovisual asyn-
chrony. Journal of Neuroscience, 33(37), 14633–14637.
van Wassenhove, V., Grant, K., & Poeppel, D. (2007). Temporal window of integra-
tion in auditory-visual speech perception. Neuropsychologia, 45, 598–607.
Warren, D. H., & Cleaves, W. T. (1971). Visual-proprioceptive interaction under
large amounts of conflicts. Journal of Experimental Psychology, 90, 206–214.
Weiss, Y., Simoncelli, E. P., & Adelson, E. H. (2002). Motion illusions as optimal per-
cepts. Nature Neuroscience, 5(6), 598–604.
Witkin, H. A., Wapner, S., & Leventhal, T. J. (1952). Sound localization with con-
flicting visual and auditory cues. Journal of Experimental Psychology, 43, 58–67.
Wozny, D. R., & Shams, L. (2011). Recalibration of auditory space following millisec-
onds of cross-modal discrepancy. Journal of Neuroscience, 31(12), 4607–4612.
... Multisensory binding in this task is facilitated by the experienced coherence of hand and cursor trajectories, even when these are presented in different spatial planes 11 . This experienced coherence suggests a common cause for both signals, which is critical for binding multisensory signals both in this task and in conceptually similar audio-visual (e.g., 12,13 ) and rubber-hand paradigms (e.g., [14][15][16] ) as well as other variants of multisensory binding (reviews e.g., [17][18][19] ). ...
... Multisensory binding in this task is facilitated by the experienced coherence of hand and cursor trajectories, even when these are presented in different spatial planes 11 . This experienced coherence suggests a common cause for both signals, which is critical for binding multisensory signals both in this task and in conceptually similar audio-visual (e.g., 12,13 ) and rubber-hand paradigms (e.g., [14][15][16] ) as well as other variants of multisensory binding (reviews e.g., [17][18][19]. ...
... Importantly, for multisensory perception to function in every day settings, our brain has to first identify the relevant redundant signals for integration and recalibration www.nature.com/scientificreports/ (e.g., [17][18][19] ). Here we asked how two stimuli in one modality are combined with evidence in another modality. ...