Auditory recognition memory is inferior to visual
recognition memory
Michael A. Cohen, Todd S. Horowitz, and Jeremy M. Wolfe
Brigham and Women's Hospital; Harvard Medical School, Boston, MA 02115
Edited by Anne Treisman, Princeton University, Princeton, NJ, and approved February 24, 2009 (received for review November 24, 2008)
Visual memory for scenes is surprisingly robust. We wished to examine whether an analogous ability exists in the auditory domain. Participants listened to a variety of sound clips and were tested on their ability to distinguish old from new clips. Stimuli ranged from complex auditory scenes (e.g., talking in a pool hall) to isolated auditory objects (e.g., a dog barking) to music. In some conditions, additional information was provided to help participants with encoding. In every situation, however, auditory memory proved to be systematically inferior to visual memory. This suggests that there exists either a fundamental difference between auditory and visual stimuli, or, more plausibly, an asymmetry between auditory and visual processing.
For several decades, we have known that visual memory for scenes is very robust (1, 2). In the most dramatic demonstration, Standing (3) showed observers up to 10,000 images for a few seconds each and reported that they could subsequently identify which images they had seen before with 83% accuracy. This memory is far superior to verbal memory (4) and can persist for a week (5). Recent research has extended these findings to show that we have a massive memory for the details of thousands of objects (6). Here, we ask whether the same is true for auditory memory and find that it is not.
For Experiment 1, we recorded or acquired 96 distinctive 5-s sound clips from a variety of sources: birds chirping, a coffee shop, motorcycles, a pool hall, etc. Twelve participants listened to 64 sound clips during the study phase. Immediately following the study phase, we tested participants on another series of 64 clips, half from the study phase and half new. Participants were asked to indicate whether each clip was old or new. Memory was fairly poor for these stimuli: the hit rate was 78% and the false alarm rate 20%, yielding a d′ score* of 1.68 (s.e.m. = 0.14). To put this performance for a mere 64 sound clips in perspective, in Shepard's original study with 600 pictures, he reported a hit rate of 98%, whereas Standing reported a hit rate of 96% for 1,100 pictures.
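The d′ statistic reported here is computed from hit and false alarm rates (see the paper's footnote). Under the standard equal-variance signal detection model, d′ = z(H) − z(F), where z is the inverse of the standard normal cumulative distribution. A minimal sketch using Experiment 1's group-level rates; note that the paper's 1.68 is presumably a mean of per-participant d′ scores, so the pooled-rate value below differs slightly:

```python
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """d' = z(H) - z(F): separation between the signal and noise
    distributions in standard-deviation units."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Group-level rates from Experiment 1: hits 78%, false alarms 20%.
print(round(d_prime(0.78, 0.20), 2))  # prints 1.61
```

Because d′ folds the false alarm rate into a single sensitivity measure, it allows fair comparison across conditions with different response biases, which hit rate alone does not.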
There are several possible explanations for the poor performance on this auditory memory task. It could be that the remarkable ability to rapidly encode and remember meaningful stimuli is a feature of visual processing. Alternatively, these might have been the wrong sounds. A particular stimulus set might yield poor performance for a variety of reasons. Perhaps the perceptual quality was poor; for example, many of our stimuli were recorded monaurally but played over headphones. It is also possible that the sound clips were too closely clustered in the stimulus space for observers to distinguish between them. Or the stimuli might simply be the wrong sort of auditory stimuli for reasons unknown. To distinguish between the poor memory and poor stimuli hypotheses, we replicated the experiments with a second set of stimuli that were professionally recorded (e.g., binaurally) and designed to be as unique as possible (e.g., the sound of a tea kettle, the sound of bowling pins falling). Each sound was assigned a brief description (e.g., "small dog barking"). In a separate experiment, 12 participants were asked to choose the correct name for each sound clip from a list of 111 descriptions (chance = 0.90%), and they chose exactly the right description for 64% of the sounds. Two-thirds of the remaining errors were "near misses" (e.g., "big dog" for the sound of a small dog barking would be considered a near miss; "tea kettle" for the sound of bowling pins falling would not). Thus, with this second set of sound clips, participants were able to identify the sound clips relatively well. For each sound clip in this new set, we also obtained a picture that matched the description.
There were 5 conditions in Experiment 2. In each condition, 12 new participants were tested using the same testing protocol as Experiment 1. The study phase contained 64 stimuli. In the test phase, participants labeled 64 stimuli as old or new. We measured memory for the sound clips alone, the verbal descriptions alone, and the matching pictures alone. We also added 2 conditions intended to improve encoding of the sound clips. In 1 condition, the sound clips were paired with the pictures during the study phase. In the other, the sound clips were paired with their verbal descriptions during study. In both of these conditions, participants were tested for recognition of the sound clips alone.
The results, shown in Fig. 1, were unambiguous. According to Tukey's WSD test, memory for pictures was significantly better than for all other stimuli, while the remaining conditions did not differ from one another. Recall for sound clips was slightly higher than in the first experiment, but still quite low (s.e.m. = 0.21) and far inferior to recall for pictures (d′ = 3.57; s.e.m. = 0.24). Supplying the participants with descriptions and sound clips together in the study phase did not significantly improve recall for sound clips (d′ = 2.23; s.e.m. = 0.17). This may not be surprising, because recall for the verbal descriptions by themselves was also relatively poor (d′ = 2.39; s.e.m. = 0.15). However, even pairing sound clips with pictures of the objects at the time of encoding did not improve subsequent testing with sound clips alone (d′ = 1.83; s.e.m. = 0.16). Note that these were the same pictures that, by themselves, produced a d′ of 3.57.
Again, it is still possible that these were the wrong stimuli. In terms of information load, the auditory stimuli we used may simply be more impoverished than pictures. Thus, poor memory performance with sounds may be due solely to the nature of the particular stimuli we used. Perhaps richer stimuli would lead to more efficient encoding and storage in memory. To explore this possibility, in Experiment 3 we replicated the testing procedures from Experiments 1 and 2 using 2 new types of stimuli: spoken language and music. Both classes of stimuli might contain more information than the natural auditory sounds used in Experiments 1 and 2. Spoken language conveys information about the speaker's age, gender, and nationality, in addition to a wealth of semantic information about the topic being discussed. Music, when there is a vocalist, can convey much the same information as spoken language, in addition to information about rhythm, harmony, and instrumentation.

Author contributions: M.A.C., T.S.H., and J.M.W. designed research; M.A.C. performed research; M.A.C., T.S.H., and J.M.W. analyzed data; and M.A.C., T.S.H., and J.M.W. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

1To whom correspondence should be addressed. E-mail:

*d′, a standard index of detectability derived from signal detection theory (7), is computed from hit and false alarm rates. Because false alarm rates are not available for all of the early picture memory studies, we also report hit rates.

PNAS | April 7, 2009 | vol. 106 | no. 14 | www.pnas.org/cgi/doi/10.1073/pnas.0811884106
Experiment 3 consisted of 2 groups of 12 participants, all native English speakers. In the spoken language condition, participants were tested using 90 unique speech clips (7–15 s) on a variety of topics (e.g., politics, sports, current affairs, sections from novels). Participants were debriefed afterward to confirm that they had no problem understanding what was being said, in terms of both content and the speaker's pronunciation. Performance in this condition (d′ = 2.7; s.e.m. = 0.16) was better than every other sound condition, but was still worse than the picture only condition of Experiment 2 [t(11) = 3.31, P < 0.01]. In the music condition, participants were tested using 90 novel popular music clips (5–15 s). Each participant was debriefed after the experiment, and none reported having ever heard any of these specific clips before. Performance in this experiment (d′ = 1.28; s.e.m. = 0.11) was actually worse than in the sound only condition of Experiment 2 [t(11) = 2.509, P < 0.05], and far worse than the picture only condition [t(11) = 14.14, P < 0.001]. Thus, memory for a variety of auditory stimulus classes, some of which potentially carry more information than natural auditory sounds, is inferior to visual memory for scenes and objects.
Experiment 3 suggests that poor auditory memory is not simply the product of impoverished stimuli. However, it would be more satisfying to directly measure the quality of visual and auditory stimulus sets in the same units. Here, we used the classification task previously used to calibrate the auditory stimuli in Experiment 2, asking participants to assign each stimulus a label from a prespecified list of labels. Recall that for the auditory stimuli, participants were able to perform at 64% on this 111-alternative choice task, using a conservative scoring criterion. For comparison, we obtained a set of images that had been created by taking 256 × 256 pixel images, reducing them to 16 × 16 pixel resolution, then upsampling to create 256 × 256 pixel images for display. This resulted in very degraded, blurred versions of the originals (8). Previous work with these same images demonstrated that this procedure leads to a decrease in performance on a broad categorization task as compared to higher resolution images (8).
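The degradation pipeline (256 × 256 down to 16 × 16, then back up to 256 × 256) can be sketched as follows. This is an illustrative approximation only: it uses simple block averaging for the downsampling and pixel replication for the upsampling, whereas the actual stimuli (8) may have used a different interpolation.

```python
import numpy as np

def degrade(img: np.ndarray, low: int = 16) -> np.ndarray:
    """Reduce a square image to low x low resolution by block
    averaging, then upsample back to the original size by pixel
    replication, yielding a heavily blurred version."""
    n = img.shape[0]  # assumes a square n x n image with n % low == 0
    k = n // low
    # Average each k x k block to get a low x low image.
    small = img.reshape(low, k, low, k).mean(axis=(1, 3))
    # Replicate each low-res pixel back up to a k x k block.
    return np.repeat(np.repeat(small, k, axis=0), k, axis=1)

rng = np.random.default_rng(0)
original = rng.random((256, 256))
blurred = degrade(original)
print(blurred.shape)  # (256, 256)
```

The round trip preserves the display size while discarding all spatial detail finer than the 16 × 16 grid, which is what makes the images nearly unidentifiable yet still memorable.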
For the first part of Experiment 4, we tested 12 participants in the same memory protocol as in the previous experiments using 102 upsampled images. As Fig. 2 shows, performance in this condition (d′ = 1.89; s.e.m. = 0.17) was not significantly different from performance with the auditory stimuli from Experiment 2 [t(11) = 0.21, P > 0.8]. In the second condition, we then asked 12 participants to choose the correct name for each degraded image from a list of 102 descriptions (chance = 0.98%). Participants successfully matched an image with its description just 21% of the time, significantly worse than the 64% classification performance for the auditory stimuli reported earlier [t(11) = 21.22, P < 0.001]. Using the more liberal scoring criterion that corrects for "near misses" (e.g., "highway" for the image of a forest road would be considered a near miss; "bedroom" for the image of a beach would not), performance was still only 24% against 83% for the auditory stimuli [t(11) = 30.277, P < 0.001].
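The conservative (exact match) and liberal (near-miss) scoring criteria can be made concrete with a small sketch. The labels and near-miss groupings below are invented for illustration; they are not the actual lists used in the study.

```python
# Hypothetical near-miss groupings, keyed by the correct label.
near_misses = {
    "small dog barking": {"big dog barking"},
    "forest road": {"highway"},
}

def score(responses, answers, liberal=False):
    """Percent of trials counted correct. Under the conservative
    criterion only exact label matches count; with liberal=True,
    a response in the answer's near-miss set also counts."""
    hits = 0
    for resp, ans in zip(responses, answers):
        if resp == ans or (liberal and resp in near_misses.get(ans, set())):
            hits += 1
    return 100 * hits / len(answers)

answers = ["small dog barking", "forest road", "tea kettle"]
responses = ["big dog barking", "forest road", "bowling pins"]
print(score(responses, answers))                # conservative: 1 of 3 correct
print(score(responses, answers, liberal=True))  # liberal: near miss also counts
```

The point of the liberal criterion is to separate genuine identification failures from mere vocabulary mismatches; even with that correction, the degraded images remained far harder to name than the sounds.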
Fig. 2 makes our point graphically. To equate the memorability of visual and auditory stimuli, we needed to render the visual stimuli almost unrecognizable. Participants were much better at classifying/identifying the auditory stimuli than the degraded visual stimuli (triangles, right y-axis). This is consistent with an asymmetry between visual and auditory processing. Stimuli of equal memorability are not equally identifiable. Highly identifiable auditory stimuli are not remembered well.

It is clear from these results that auditory recognition memory performance is markedly inferior to visual recognition memory on this task. Note that we do not claim that long-term auditory memory, in general, is impoverished. Clearly, some form of auditory long-term memory allowed our participants to identify the stimuli as tea kettles, dogs, and so forth. Moreover, with practice, people can commit large bodies of auditory material
(e.g., music) to memory. The striking aspects of the original picture memory experiments are the speed and ease with which complex visual stimuli seem to slide into long-term memory. Hundreds or thousands of images, seen for a few seconds at a time, are available for subsequent recognition. It is this aspect of memory that seems to be markedly less impressive in audition. Two explanations suggest themselves. Auditory objects might be fundamentally different from visual objects. In their physics or psychophysics, they may actually be less memorable than their visual counterparts. Alternatively, auditory memory might be fundamentally different from, or smaller than, visual memory. We might simply lack the capacity to remember more than a few auditory objects, however memorable, when they are presented one after another in rapid succession. In either case, it is unlikely that anyone will find 1,000 sounds that can be remembered with anything like the accuracy of their visual counterparts.

Note that 5 participants participated in both conditions of Experiment 4, but were only allowed to complete the classification condition after having completed the memory experiment.

Fig. 1. Memory performance in units of d′. Error bars denote the standard error of the mean. The leftmost part shows the results from Experiment 1, the center part shows the results from Experiment 2, and the rightmost part shows the results from Experiment 3.

Fig. 2. Auditory stimuli vs. degraded visual images. Memory performance (squares, solid line) is plotted against the left y-axis in units of d′. Percent correct for the naming experiment is plotted against the right y-axis. Error bars denote the standard error of the mean.
Materials and Methods
Participants. One hundred thirteen total participants (aged 18–54) participated in the experiments. For each condition there were 12 participants, with a total of 11 conditions/experiments. Each participant passed the Ishihara test for color blindness and had normal or corrected-to-normal vision. All participants gave informed consent, as approved by the Partners Healthcare Corporation IRB, and were compensated $10/h for their time.
Stimuli. In Experiment 1, stimuli were gathered using a handheld recording device (Panasonic PV-GS180) or were obtained from a commercially available database (SoundSnap). In Experiment 2, stimuli were gathered from SoundSnap. In Experiment 3, music clips came from the collections of members of the laboratory. Songs were uploaded into WavePad and 5- to 15-s clips were extracted. Speech clips came from various podcasts obtained online and were also uploaded into WavePad to obtain 7- to 15-s clips. Degraded visual images used in Experiment 4 were obtained from A. Torralba (Massachusetts Institute of Technology, Cambridge, MA). A list of the stimuli used is provided on our website:
Experimental Blocks. The memory experiments consisted of a study block and a test block. In the study block, participants listened to or viewed a set of sound clips, or sound clips and their corresponding images/names (60–66 clips), for approximately 10 min. Their instructions were simply to study the clips carefully and try to commit them to memory as best they could. In the test block, participants were presented with another set of clips (60–64 clips), half of which were repeated from the study block (old) and half of which had never been presented before (new). Participants were asked to make an "old/new" discrimination after every trial. Note that in 1 condition of the memory experiments the basic paradigm remained the same, but participants were presented with only visual images (picture only). The naming/classification experiments comprised a single block lasting approximately 20 min. Participants were shown each stimulus for 5 s and would then type in the name of what they had heard/seen from a list provided (102–110 names).
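Using the counts from the main experiments (64 study clips; 64 test clips, half old and half new), the construction of the study and test lists might be sketched as follows; the clip names are placeholders.

```python
import random

def make_blocks(stimuli, n_study=64, n_test=64, seed=1):
    """Build study and test lists for the old/new recognition task:
    study n_study clips, then test half as many of them (old) mixed
    with an equal number of unstudied clips (new), in random order."""
    rng = random.Random(seed)
    pool = list(stimuli)
    rng.shuffle(pool)
    study = pool[:n_study]
    old = rng.sample(study, n_test // 2)        # half the test items are old
    new = pool[n_study:n_study + n_test // 2]   # the rest were never studied
    test = [(clip, "old") for clip in old] + [(clip, "new") for clip in new]
    rng.shuffle(test)
    return study, test

study, test = make_blocks([f"clip_{i:03d}" for i in range(96)])
print(len(study), len(test))  # 64 64
```

Drawing old and new items from a single shuffled pool ensures that which clips end up studied versus novel is counterbalanced by chance rather than by stimulus properties.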
Apparatus. Every experiment was conducted on a Macintosh computer run-
ning MacOS 9.2, controlled by Matlab 7.5.0 and the Psychophysics Toolbox,
version 3.
ACKNOWLEDGMENTS. We thank Christina Chang, Karla Evans, Yair Pinto,
Aude Oliva, and Barbara Shinn-Cunningham for helpful comments and sug-
gestions on the project, and Antonio Torralba for providing the degraded
images used in Experiment 4. This work was funded in part by NIMH-775561
and AFOSR-887783.
1. Shepard RN (1967) Recognition memory for words, sentences, and pictures. J Verb Learn Verb Behav 6:156–163.
2. Pezdek K, Whetstone T, Reynolds K, Askari N, Dougherty T (1989) Memory for real-world scenes: The role of consistency with schema expectation. J Exp Psychol Learn Mem Cogn 15:587–595.
3. Standing L (1973) Learning 10,000 pictures. Q J Exp Psychol 25:207–222.
4. Standing L, Conezio J, Haber RN (1970) Perception and memory for pictures: Single-trial learning of 2500 visual stimuli. Psychon Sci 19:73.
5. Dallett K, Wilcox SG, D'Andrea L (1968) Picture memory experiments. J Exp Psychol.
6. Brady TF, Konkle T, Alvarez GA, Oliva A (2008) Visual long-term memory has a massive storage capacity for object details. Proc Natl Acad Sci USA 105:14325–14329.
7. Macmillan NA, Creelman CD (2005) Detection Theory: A User's Guide (Lawrence Erlbaum Assoc, Mahwah, NJ), 2nd Ed.
8. Torralba A (2009) How many pixels make an image? Visual Neurosci, epub ahead of print.
... Compared to visual memory, memory for auditory stimuli appears to be inferior (Cohen et al., 2009; see also Kassim et al., 2018); however, recent work from our group has demonstrated that information from both modalities interacts during the formation of long-term memory representations (Meyerhoff & Huff, 2016). In this study, the participants studied brief auditory, visual, or audio-visual tracks from movies. ...
... According to this principle, the task-specific acuity of the involved modalities affects how they are integrated (Bertelson et al., 2000;Vroomen et al., 2001). With regard to our results for long-term memory representations, such an interpretation would suggest that long-term memory for auditory information is so unreliable (see Cohen et al., 2009) that it has no effect on visual information in a weighted integration. In any case, what seems clear from the asymmetric occurrence of study-test congruency effects is that memory representations are not just the product of an equally weighted integration of auditory and visual information. ...
... Second, visual information remained accessible rather independently even when auditory information was present during encoding (i.e., the absence of full studytest congruency effects for visual vs. audio-visual material). Considering the modality appropriateness principle (Welch & Warren, 1980) as well as the generally inferior auditory memory (Cohen et al., 2009), our study suggests that the core memory representation of dynamic scenes is visual in nature. What is more puzzling is the role of coinciding auditory information. ...
Full-text available
In this study, we investigated the nature of long-term memory representations for naturalistic audio-visual scenes. Whereas previous research has shown that audio-visual scenes are recognized more accurately than their unimodal counterparts, it remains unclear whether this benefit stems from audio-visually integrated long-term memory representations or a summation of independent retrieval cues. We tested two predictions for audio-visually integrated memory representations. First, we used a modeling approach to test whether recognition performance for audio-visual scenes is more accurate than would be expected from independent retrieval cues. This analysis shows that audio-visual integration is not necessary to explain the benefit of audio-visual scenes relative to purely auditory or purely visual scenes. Second, we report a series of experiments investigating the occurrence of study-test congruency effects for unimodal and audio-visual scenes. Most importantly, visually encoded information was immune to additional auditory information presented during testing, whereas auditory encoded information was susceptible to additional visual information presented during testing. This renders a true integration of visual and auditory information in long-term memory representations unlikely. In sum, our results instead provide evidence for visual dominance in long-term memory. Whereas associative auditory information is capable of enhancing memory performance, the long-term memory representations appear to be primarily visual.
... Auditory recognition, however, tends to be worse than recognition in the visual (Cohen, Horowitz & Wolfe, 2009) or tactile sensory modalities. Bigelow and Poremba (2014) have examined memory recognition for visual (silent videos), auditory (complex sound of everyday life) and tactile (objects of common use hidden and presented in such a way that they can be touched and manipulated) stimuli, showing that auditory recognition is significantly worse than in other modalities, with no significant differences between visual or tactile stimuli. ...
... Bigelow and Poremba (2014) have examined memory recognition for visual (silent videos), auditory (complex sound of everyday life) and tactile (objects of common use hidden and presented in such a way that they can be touched and manipulated) stimuli, showing that auditory recognition is significantly worse than in other modalities, with no significant differences between visual or tactile stimuli. Cohen et al. (2009) have argued that auditory recognition is worse than other modalities due to our tendency to primarily rely on visual stimuli. This might explain why auditory recognition is weaker than visual recognition even among musicians (Cauda et al., 2011). ...
Full-text available
Our world is full of sounds, either verbal or non-verbal, pleasant or unpleasant , meaningful or simply irrelevant noise. Understanding, memorizing, and predicting the sounds, even non-verbal ones which our environment is full of, is a complex perceptuo-cognitive function that we constantly refine by everyday experience and learning. Musical sounds are a peculiar case due to their culture-dependent complexity and hierarchical organization requiring cognitive functions such as memory to be understood, and due to the presence of individuals (musicians) who dedicate their lifetime to master the specifics of those sounds and rules. Thus far, most of the neuroimaging research focused on verbal sounds and how they are processed and stored in the human brain. Only recently, researchers have tried to elucidate the neural mechanisms and structures allowing non-verbal, musical sounds to be mod-eled, predicted and remembered. However, those neuroimaging studies often provide only a mere snapshot of a complex dynamic process unfolding over time. To capture the complexity of musical memory and cognition, new methods are needed. A promising analysis method is dynamic functional connectivity, which assumes that functional connectivity changes in a short time. We conclude that moving from a locationist to a dynamic perspective on auditory memory might allow us to finally comprehend the neural mechanisms that regulate encoding and retrieval of sounds.
... As such, the experiments performed supported that audio can benefit from a fusion of vision and textual-based features. However, audio, in a sense, has the ethical upper hand over video for instance, in that it can be captured in a pseudo-anonymous way, due to the inferiority of auditory vs visual memory [226]. Furthermore, as humans are generally visually dominant, video can be more challenging in regards to subject privacy, despite it being extremely valuable for a number of tasks. ...
This thesis is focused on the application of computer audition (i. e., machine listening) methodologies for monitoring states of emotional wellbeing. Computer audition is a growing field and has been successfully applied to an array of use cases in recent years. There are several advantages to audio-based computational analysis; for example, audio can be recorded non-invasively, stored economically, and can capture rich information on happenings in a given environment, e. g., human behaviour. With this in mind, maintaining emotional wellbeing is a challenge for humans and emotion-altering conditions, including stress and anxiety, have become increasingly common in recent years. Such conditions manifest in the body, inherently changing how we express ourselves. Research shows these alterations are perceivable within vocalisation, suggesting that speech-based audio monitoring may be valuable for developing artificially intelligent systems that target improved wellbeing. Furthermore, computer audition applies machine learning and other computational techniques to audio understanding, and so by combining computer audition with applications in the domain of computational paralinguistics and emotional wellbeing, this research concerns the broader field of empathy for Artificial Intelligence (AI). To this end, speech-based audio modelling that incorporates and understands paralinguistic wellbeing-related states may be a vital cornerstone for improving the degree of empathy that an artificial intelligence has. To summarise, this thesis investigates the extent to which speech-based computer audition methodologies can be utilised to understand human emotional wellbeing. A fundamental background on the fields in question as they pertain to emotional wellbeing is first presented, followed by an outline of the applied audio-based methodologies. 
Next, detail is provided for several machine learning experiments focused on emotional wellbeing applications, including analysis and recognition of under-researched phenomena in speech, e. g., anxiety, and markers of stress. Core contributions from this thesis include the collection of several related datasets, hybrid fusion strategies for an emotional gold standard, novel machine learning strategies for data interpretation, and an in-depth acoustic-based computational evaluation of several human states. All of these contributions focus on ascertaining the advantage of audio in the context of modelling emotional wellbeing. Given the sensitive nature of human wellbeing, the ethical implications involved with developing and applying such systems are discussed throughout.
... Although the recognition memory is generally deemed to be inferior in the auditory compared with the visual domain (Cohen et al., 2009), there is compelling evidence that the human brain is exceptionally capable of rapidly forming robust short-and longer-term memories for various types of random auditory patterns, such as tone pip sequences (Bianco et al., 2020), temporal patterns of clicks (Kang et al., 2017), and white noise (Agus et al., 2010). It has been argued that listeners build up these representations during perceptual learning, which refers to experience-dependent changes in the perceptual ability to effectively extract and use information from sensory input through repeated exposure (Gibson, 1969;Gilbert et al., 2001). ...
Full-text available
It is remarkable that human listeners can perceive periodicity in noise, as the isochronous repetition of a particular noise segment is not accompanied by salient physical cues in the acoustic signal. Previous research suggested that listeners rely on short temporally local and idiosyncratic features to perceptually segment periodic noise sequences. The present study sought to test this assumption by disentangling consistency of perceptual segmentation within and between listeners. Presented periodic noise sequences either consisted of seamless repetitions of a 500-ms segment or of repetitions of a 200-ms segment that were interleaved with 300-ms portions of random noise. Both within-and between-subject consistency was stronger for interleaved (compared with seamless) periodic sequences. The increased consistency likely resulted from reduced temporal jitter of potential features used for perceptual segmentation when the recurring segment was shorter and occurred interleaved with random noise. These results support the notion that perceptual segmentation of periodic noise relies on subtle temporally local features. However, the finding that some specific noise sequences were segmented more consistently across listeners than others challenges the assumption that the features are necessarily idiosyncratic. Instead, in some specific noise samples, a preference for certain spectral features is shared between individuals.
... Unlike visual and spatial domains, the auditory system relies on constant acoustic change for effective perception. Furthermore, prior work has demonstrated marked differences in memory abilities across vision and audition (Cohen et al., 2011(Cohen et al., , 2009Morey & Mall, 2012;Xu et al., 2020), suggesting a possible asymmetry for encoding these signals. ...
Full-text available
While our perceptual experience seems to unfold continuously over time, episodic memory preserves distinct events for storage and recollection. Previous work shows that stability in encoding context serves to temporally bind individual items into sequential composite events. This phenomenon has been almost exclusively studied using visual and spatial memory paradigms. Here we adapt these paradigms to test the role of speaker regularity for event segmentation of complex auditory information. The results of our auditory paradigm replicate the findings in other sensory modalities—finding greater within-event temporal memory for items within speaker-bound events and greater source memory for items at speaker or event transitions. The task we use significantly extends the ecological validity of past paradigms by allowing participants to encode the stimuli without any suggestions on the part of the experimenter. This unique property of our design reveals that, while memory performance is strongly dependent on self-reported mnemonic strategy, behavioral effects associated with event segmentation are robust to changes in mnemonic strategy. Finally, we consider the effect of serial position on segmentation effects during encoding and present a modeling approach to estimate the independent contribution of event segmentation. These findings provide several lines of evidence suggesting that contextual stability in perceptual features drives segmentation during word listening and supports a modality-independent role for mechanisms involved in event segmentation.
... They found that subjects' responses were more accurate, but slower, in the auditory version of the task compared with the visual. A comparison of auditory and visual recognition memory, on the contrary, revealed that the former was systematically inferior, as subjects were far worse at recognising already presented sound clips, rather than pictures (M. A. Cohen et al., 2009). Strong modalitydependent differences were found for the LDT as well, between the auditory and the visual versions of the task (Holcomb & Neville, 1990;Krause et al., 2006). ...
Prospective memory (PM) is the ability to perform an intended action when the appropriate conditions occur. Several features play a role in the successful retrieval of an intention: The activity we are concurrently engaged in, the number of intentions we are maintaining, where our attention is focused (outward vs. to inner states), and how outstanding the trigger of the intention is. Another factor that may play a crucial role is sensory modality: Do auditory and visual stimuli prompt PM processing in the same way? In this study, we explored for the first time the nature of PM for auditory stimuli and the presence of modality-dependent differences in PM processing. To do so, an identical paradigm composed of multiple PM tasks was administered in two versions, one with auditory stimuli and one with visual ones. Each PM task differed for features such as focality, saliency, and number of intentions (factors that are known in literature to modulate the monitoring and maintenance requests of PM) to explore the impact of sensory modality on a broad variety of classical PM tasks. In general, PM processing showed similar patterns between modalities, especially for low demanding prospective instructions. Conversely, substantial differences were found when the prospective load was increased and monitoring requests enhanced, as participants were significantly slower and less accurate with acoustic stimuli. These results represent the first evidence that modality-dependent effects arise in PM processing, especially in its interaction with features such as the difficulty of the task and the increased monitoring load.
... Visual aspects of an avatar might also be inherently more important than audial aspects. Humans have been shown to have better visual memory than auditory memory, and there appear to be fundamental differences between visual and auditory processing [48]. The picture superiority effect describes the phenomenon whereby pictures and images are more often remembered than words [43]. ...
Avatar customization is known to positively affect crucial outcomes in numerous domains. However, it is unknown whether audial customization can confer the same benefits as visual customization. We conducted a preregistered 2 × 2 (visual choice vs. visual assignment × audial choice vs. audial assignment) study in a Java programming game. Participants with visual choice experienced higher avatar identification and autonomy. Participants with audial choice experienced higher avatar identification and autonomy, but only within the group of participants who had visual choice available. Visual choice led to an increase in time spent, and indirectly led to increases in intrinsic motivation, immersion, time spent, future play motivation, and likelihood of game recommendation. Audial choice moderated the majority of these effects. Our results suggest that audial customization plays an important enhancing role vis-à-vis visual customization. However, audial customization appears to have a weaker effect compared to visual customization. We discuss the implications for avatar customization more generally across digital applications.
... It would be informative to perform similar comparisons in other (non-visual) domains. It is known, for instance, that auditory recognition memory is substantially inferior to visual recognition memory in humans (Cohen et al., 2009). It would be interesting to know if this result holds for gradient descent trained deep learning models too. ...
Humans have a remarkably large capacity to store detailed visual information in long-term memory even after a single exposure, as demonstrated by classic experiments in psychology. For example, Standing (1973) showed that humans could recognize with high accuracy thousands of pictures that they had seen only once a few days prior to a recognition test. In deep learning, the primary mode of incorporating new information into a model is through gradient descent in the model's parameter space. This paper asks whether deep learning via gradient descent can match the efficiency of human visual long-term memory to incorporate new information in a rigorous, head-to-head, quantitative comparison. We answer this in the negative: even in the best case, models learning via gradient descent appear to require approximately 10 exposures to the same visual materials in order to reach a recognition memory performance humans achieve after only a single exposure. Prior knowledge induced via pretraining and bigger model sizes improve performance, but these improvements are not very visible after a single exposure (it takes a few exposures for the improvements to become apparent), suggesting that simply scaling up the pretraining data size or model size might not be enough for the model to reach human-level memory efficiency.
Research has shown that novel words can be learned through the mechanism of statistical or cross‐situational word learning (CSWL). So far, CSWL studies using adult populations have focused on the presentation of spoken words. However, words can also be learned through their written form. This study compared auditory and orthographic presentations of novel words with different degrees of phonological overlap using CSWL in a laboratory‐based and an online‐based approach. In our analyses, we first compared accuracy across modalities, with our findings showing more accurate recognition performance for CSWL when novel words were presented through their written forms (orthographic condition) rather than through their spoken forms (auditory condition). Bayesian modeling suggested that accuracy for the orthographic condition was higher in the laboratory compared to online, whereas performance in the auditory condition was similar across both experiments. We discuss the implications of our findings for presentation modality and the benefits of our online testing protocol for future research. A one‐page Accessible Summary of this article in non‐technical language is freely available in the Supporting Information online and at https://oasis‐
Studies have found a multisensory memory benefit: higher recognition accuracy for unimodal test items that were studied as bimodal items than for those studied as unimodal items. This is a surprising finding because the encoding specificity principle predicts that memory performance should be better with greater overlap between processing during study and test. We used the method of Thelen, Talsma, and Murray (2015), who previously found a multisensory memory benefit. Items were presented as unimodal (picture or sound) or bimodal (picture and sound) items in a continuous recognition task in which only one modality was task-relevant. In four experiments we obtained little evidence for a difference in memory performance between items studied as unimodal or bimodal stimuli, but there was a benefit of study-test overlap in format if sound was the task-relevant modality. Task-induced attention for the irrelevant modality or response bias may have played a role in previous studies. We conclude that the multisensory memory benefit may not be a general finding, but rather one that is found only under conditions that induce participants to pay attention to the task-irrelevant modality.
Four experiments are reported which examined memory capacity and retrieval speed for pictures and for words. Single-trial learning tasks were employed throughout, with memory performance assessed by forced-choice recognition, recall measures or choice reaction-time tasks. The main experimental findings were: (1) memory capacity, as a function of the amount of material presented, follows a general power law with a characteristic exponent for each task; (2) pictorial material obeys this power law and shows an overall superiority to verbal material. The capacity of recognition memory for pictures is almost limitless, when measured under appropriate conditions; (3) when the recognition task is made harder by using more alternatives, memory capacity stays constant and the superiority of pictures is maintained; (4) picture memory also exceeds verbal memory in terms of verbal recall; comparable recognition/recall ratios are obtained for pictures, words and nonsense syllables; (5) verbal memory shows a higher retrieval speed than picture memory, as inferred from reaction-time measures. Both types of material obey a power law, when reaction-time is measured for various sizes of learning set, and both show very rapid rates of memory search. From a consideration of the experimental results and other data it is concluded that the superiority of the pictorial mode in recognition and free recall learning tasks is well established and cannot be attributed to methodological artifact.
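The power-law relation described in this abstract (recognized items as a function of items presented) can be estimated from raw counts with an ordinary least-squares fit in log-log space. A minimal sketch, where the counts are invented purely for illustration and are not data from the cited experiments:

```python
import math

def fit_power_law(n, r):
    """Least-squares fit of r = a * n**b in log-log coordinates.
    Returns the scale a and exponent b."""
    logn = [math.log(x) for x in n]
    logr = [math.log(y) for y in r]
    mx = sum(logn) / len(logn)
    my = sum(logr) / len(logr)
    # Slope of the log-log regression line is the power-law exponent.
    b = sum((x - mx) * (y - my) for x, y in zip(logn, logr)) / sum(
        (x - mx) ** 2 for x in logn
    )
    a = math.exp(my - b * mx)
    return a, b

# Hypothetical counts: items presented vs. items later recognized
presented = [20, 40, 80, 160, 320]
recognized = [19, 36, 68, 124, 230]
a, b = fit_power_law(presented, recognized)
print(f"r ≈ {a:.2f} * n**{b:.2f}")
```

An exponent near 1 would indicate near-limitless recognition capacity over the tested range, consistent with the abstract's claim for pictorial material; a smaller exponent indicates capacity growing more slowly than the amount of material presented.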
This study tested the generalizability of the consistency effect to real-world settings. The consistency effect refers to the finding that items inconsistent with expectations are better recalled and recognized than items consistent with expectations. In two experiments, subjects walked into a graduate student's office or a preschool classroom. Half of the items in each setting were consistent with expectations about that setting, and half were inconsistent. A recall and a same-changed recognition memory test followed immediately or 1 day later. In both experiments, the consistency effect was affirmed; items inconsistent with expectations were significantly better recalled and recognized than items consistent with expectations. This result is discussed in terms of differences in the encoding processes that operate on inconsistent and consistent items. The present study extends the generalizability of results from picture memory studies to real-world settings. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
The Ss looked through a series of about 600 stimuli selected at random from an initially larger population. They were then tested for their ability to recognize these “old” stimuli in pairs in which the alternative was always a “new” stimulus selected at random from the stimuli remaining in the original population. Depending upon whether this original population consisted solely of words, sentences, or pictures, median Ss were able correctly to recognize the “old” stimulus in 90, 88, or 98% of the test pairs, respectively. Estimated lower bounds on the informational capacity of human memory considerably exceed previously published estimates.
The human visual system is remarkably tolerant to degradation in image resolution: human performance in scene categorization remains high no matter whether low-resolution images or multimegapixel images are used. This observation raises the question of how many pixels are required to form a meaningful representation of an image and identify the objects it contains. In this article, we show that very small thumbnail images at the spatial resolution of 32 x 32 color pixels provide enough information to identify the semantic category of real-world scenes. Most strikingly, this low resolution permits observers to report, with 80% accuracy, four to five of the objects that the scene contains, despite the fact that some of these objects are unrecognizable in isolation. The robustness of the information available at very low resolution for describing the semantic content of natural images could be an important asset to explain the speed and efficiency with which the human brain comprehends the gist of visual scenes.
Macmillan, N. A., & Creelman, C. D. Detection Theory: A User's Guide, 2nd ed. Lawrence Erlbaum Associates.
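Old/new recognition results like those in the studies above are conventionally scored with the sensitivity index d′ from the detection-theory framework covered in the Macmillan and Creelman guide. A minimal sketch of the standard computation; the hit and false-alarm rates below are invented for illustration, not taken from any of the cited experiments:

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).
    Higher values mean better discrimination of old from new items."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# Invented example: 90% hits on old items, 20% false alarms on new items
print(round(d_prime(0.90, 0.20), 2))  # prints 2.12
```

Unlike raw percent correct, d′ separates discrimination ability from response bias, which is why it is the usual dependent measure when comparing recognition memory across modalities.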