
Evidence for an Event-Integration Window: A Cognitive Temporal Window Supports Flexible Integration of Multimodal Events

Madison Lee and Daniel T. Levin
Department of Psychology and Human Development, Vanderbilt University

Abstract

Just as the perception of simple events such as clapping hands requires a linkage of sound with movements that produce the sound, the integration of more complex events such as describing how to give an injection requires a linkage between the instructor’s utterances and their actions. However, the mechanism for integrating these complex multimodal events is unclear. For example, it is possible that predictive temporal relationships are important for multimodal event understanding, but it is also possible that this form of understanding arises more from meaningful causal between-event links that are temporally unspecified. This latter approach might be supported by a cognitive temporal window within which multimodal event information integrates flexibly with few default commitments about specific temporal relationships. To test this hypothesis, we assessed the consequences of disrupting temporal relationships between instructors’ actions and their speech in both narrated screen-capture instructional videos (Experiment 1) and live-action instructional videos (Experiment 2) by displacing the audio channel forward or backward relative to the video by 0, 1, 3, or 7 s. We assessed learning, event segmentation, disruption awareness, segmentation uncertainty, and perceived workload. Across two experiments, 7-s temporal disruptions consistently increased uncertainty and workload and decreased learning in Experiment 2. None of these effects appeared for 3-s disruptions, which were barely detectable. One-second disruptions produced no effects and were undetectable, even though much intraevent information falls within this range. Our results suggest the presence of an event-integration window that supports the integration of events independent of constraining temporal relationships between subevents.
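For concreteness, the audio-displacement manipulation described above could be produced with standard video tools. The sketch below, which assumes ffmpeg is installed and uses placeholder file names, illustrates one way to shift an audio track relative to its video by a fixed number of seconds; it is an illustrative assumption, not the authors' stimulus-preparation procedure.

```python
import subprocess

def shift_audio(src, dst, offset_s):
    """Write a copy of `src` whose audio is displaced by `offset_s` seconds
    relative to the video (positive = audio lags the video, negative = audio
    leads it). Minimal sketch using ffmpeg's -itsoffset flag; delaying one
    input or the other avoids negative timestamps. File names and this
    approach are illustrative, not the authors' actual pipeline."""
    delay = abs(offset_s)
    if offset_s >= 0:
        # Delay the audio: the offset is applied to the second (audio-source) input.
        cmd = ["ffmpeg", "-y", "-i", src, "-itsoffset", str(delay), "-i", src,
               "-map", "0:v", "-map", "1:a", "-c", "copy", dst]
    else:
        # Audio should lead: delay the video input instead.
        cmd = ["ffmpeg", "-y", "-itsoffset", str(delay), "-i", src, "-i", src,
               "-map", "0:v", "-map", "1:a", "-c", "copy", dst]
    subprocess.run(cmd, check=True)

# The experiments used displacements of 0, 1, 3, or 7 s in either direction.
for offset in (-7, -3, -1, 0, 1, 3, 7):
    label = f"{'plus' if offset >= 0 else 'minus'}{abs(offset)}s"
    shift_audio("lesson.mp4", f"lesson_audio_{label}.mp4", offset)
```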

Public Significance Statement

To perceive complex events, we must be able to integrate visual information such as people's actions or gestures with corresponding auditory information such as speech. Although these two forms of information are mutually supportive, it is not clear whether the precise temporal relationship between these streams is perceptually and cognitively important. These experiments demonstrate that, within a several-second window, the temporal relationship between these modalities can be disrupted without interfering with effective event perception and understanding. We demonstrate that cognitive integration of multimodal events is temporally flexible, and this may support forms of event understanding that are robust over small variations in event synchronization and temporal attention.
Keywords: event perception, learning, multimodal integration, psychological present
Effective perception and understanding of real-world events often require the integration of auditory and visual information. For simple events, such as clapping hands, visual movements must be tightly linked with the sounds that emanate from them. This form of integration is often referred to as multisensory integration. Research in the field of neuroscience has established that this form of integration is associated with a multisensory temporal binding window of approximately ±250 ms where multisensory event information (i.e., a beep and flash) can be asynchronous yet perceptually bound and perceived as occurring simultaneously (Wallace & Stevenson, 2014). Such a window is necessary in part because the relationship between auditory and visual features of multisensory events is incompletely determined by simple timing features. For example, propagation delays both externally (because of differences in the speed of
Madison Lee https://orcid.org/0000-0001-6395-0976

Results from Experiment 1 were presented as a poster at the Psychonomic Society's Annual Conference in 2021. The authors would like to thank Eric Hall at Vanderbilt's School of Nursing for contributing to the creation of our live-action instructional videos. These studies were not preregistered. The data and materials are publicly available (https://osf.io/x2cm3/).

Madison Lee served as lead for data curation, formal analysis, project administration, software, visualization, and writing–original draft, contributed equally to investigation, and served in a supporting role for resources. Daniel T. Levin served as lead for resources and supervision and served in a supporting role for formal analysis and writing–original draft. Madison Lee and Daniel T. Levin contributed equally to conceptualization, writing–review and editing, and methodology.

Correspondence concerning this article should be addressed to Madison Lee, Department of Psychology and Human Development, Vanderbilt University, 230 Appleton Place, Nashville, TN 37203-5721, United States. Email: madison.j.lee@vanderbilt.edu
Journal of Experimental Psychology: General
© 2024 American Psychological Association
ISSN: 0096-3445. https://doi.org/10.1037/xge0001577
2024, Vol. 153, No. 6, 1449–1463
This article was published Online First April 4, 2024.