Conference Paper

HCU400: an Annotated Dataset for Exploring Aural Phenomenology through Causal Uncertainty

... In noise pollution research, sounds perceived as signaling human activity are rated 'pleasant' (more positively valenced) regardless of their low-level acoustic features [26]. In general, the emotional impact of a sound is correlated with the clarity of its perceived source [27], though sounds can have emotional impact even without a direct mapping to an explicit abstract idea [28]. ...
... In our previous work [27] we curated a large dataset of everyday sounds annotated with high-level features that may influence a sound's memorability; most notably, its causal uncertainty, denoted H_cu (low when a sound implies a clear, unambiguous source), the implied source itself as determined by crowd workers, and its acoustic features. We also collected ratings for the valence and arousal of each sound, its familiarity, and how easily it conjures a mental image (features that correlate strongly with H_cu). ...
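For reference, H_cu follows the entropy formulation Ballas introduced for everyday-sound identification; a minimal statement of the standard definition (the clustering of free-text responses into cause categories is dataset-specific):

$$H_{cu} = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where $p_i$ is the fraction of annotators whose described cause falls into the $i$-th of $n$ cause clusters. H_cu = 0 when every listener names the same source, and it grows as responses spread across many plausible causes.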
... Audio samples for this test were taken from the HCU400 dataset [27]. Standard low-level acoustic features were extracted from each sample following prior work [17]. ...
Preprint
Our aural experience plays an integral role in the perception and memory of the events in our lives. Some of the sounds we encounter throughout the day stay lodged in our minds more easily than others; these, in turn, may serve as powerful triggers of our memories. In this paper, we measure the memorability of everyday sounds across 20,000 crowd-sourced aural memory games, and assess the degree to which a sound's memorability is constant across subjects. We then use this data to analyze the relationship between memorability and acoustic features like harmonicity, spectral skew, and models of cognitive salience; we also assess the relationship between memorability and high-level features that depend on the sound source itself, such as its familiarity, valence, arousal, source type, causal certainty, and verbalizability. We find (1) that our crowd-sourced measures of memorability and confusability are reliable and robust across participants; (2) that our measure of collective causal uncertainty, detailed in previous work, coupled with measures of visualizability and valence, are the strongest individual predictors of memorability; (3) that acoustic and salience features play a heightened role in determining "confusability" (the false-positive selection rate associated with a sound) relative to memorability; and (4) that, within the framework of our assessment, memorability is an intrinsic property of the sounds in the dataset, independent of surrounding context. We suggest that modeling these cognitive processes opens the door to human-inspired compression of sound environments, automatic curation of large-scale environmental recording datasets, and real-time modification of aural events to alter their likelihood of memorability.
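The reliability claim in (1) rests on consistency across participant splits; below is a minimal sketch of how such a split-half check is typically computed. The array layout, sizes, and placeholder responses are illustrative assumptions, not the authors' data or code.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical layout: hits[s, p] = True if participant p correctly
# recognized sound s on its repeat presentation (placeholder data).
n_sounds, n_participants = 402, 1000
hits = rng.random((n_sounds, n_participants)) < 0.7

def split_half_consistency(hits, n_splits=25):
    """Average Spearman correlation between per-sound memorability
    scores computed from two random halves of the participant pool."""
    rhos = []
    for _ in range(n_splits):
        perm = rng.permutation(hits.shape[1])
        a, b = perm[: len(perm) // 2], perm[len(perm) // 2:]
        rho, _ = spearmanr(hits[:, a].mean(axis=1), hits[:, b].mean(axis=1))
        rhos.append(rho)
    return float(np.mean(rhos))

print(f"split-half consistency: {split_half_consistency(hits):.3f}")
```

With real response data, a high average correlation indicates that memorability rankings are stable regardless of which subjects produced them.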
... Research into audio recall memorability shows that naming or verbalising sounds (phonological-articulation) can improve recall [22], and accordingly, non-verbal sounds have lower recall than verbal sounds [23]. Emotionality is known to play an important role in memory formation, and the emotional impact of a sound is correlated with the clarity of its perceived source [24]. Human activity is considered to be a positively valenced sound [25], and positive valence improves sound recall [26]. ...
... We use the PANNs [38] network to generate audio tags, labelling the audio as music (giving it a score of 1.0) if a musical tag is present in the top 75% confidence. H_cu and arousal scores are independently predicted with ImageNet-pretrained xResNet34 models fine-tuned on spectrograms from the HCU400 dataset [24]. Due to limited available options, for familiarity we use the top audio-tag confidence score of the PANNs [38] network as a proxy (Spearman ρ = 0.305, p = 4.749e-10 between the two scores in the HCU400 dataset). ...
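A rough sketch of this tagging-as-proxy step, using the open-source panns_inference package. The file path, the mono/32 kHz loading, and the reading of "musical tag in the top 75% confidence" as a threshold relative to the strongest tag are all assumptions for illustration:

```python
import librosa
import numpy as np
from panns_inference import AudioTagging, labels  # AudioSet-pretrained tagger

at = AudioTagging(checkpoint_path=None, device='cpu')  # downloads a pretrained CNN14

def tag_clip(path):
    """Return the 527-dim AudioSet tag confidence vector for one clip."""
    audio, _ = librosa.load(path, sr=32000, mono=True)  # PANNs expects 32 kHz
    clipwise_output, _ = at.inference(audio[None, :])   # shape (1, 527)
    return clipwise_output[0]

scores = tag_clip('some_clip.wav')  # placeholder path

# One plausible reading of the music rule: any music-related label whose
# confidence reaches 75% of the strongest tag marks the clip as music.
music_score = 1.0 if any('music' in labels[i].lower()
                         for i in np.where(scores >= 0.75 * scores.max())[0]) else 0.0

familiarity_proxy = float(scores.max())  # top tag confidence as familiarity proxy
# scipy.stats.spearmanr(proxies, human_familiarity) would reproduce the
# correlation check quoted above, given the HCU400 human ratings.
```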
Preprint
Full-text available
Memories are the tethering threads that tie us to the world, and memorability is the measure of their tensile strength. The threads of memory are spun from fibres of many modalities, obscuring the contribution of a single fibre to a thread's overall tensile strength. Unfurling these fibres is the key to understanding the nature of their interaction, and how we can ultimately create more meaningful media content. In this paper, we examine the influence of audio on video recognition memorability, finding evidence to suggest that audio can facilitate overall recognition memorability in videos rich in high-level (gestalt) audio features. We introduce a novel multimodal deep learning-based late-fusion system that uses audio gestalt to estimate the influence of a given video's audio on its overall short-term recognition memorability, and selectively leverages audio features to make a prediction accordingly. We benchmark our audio gestalt based system on the Memento10k short-term video memorability dataset, achieving top-2 state-of-the-art results.
... When choosing a set of data to validate our hypothesis, two aspects are of the utmost importance. Firstly, we are interested in video memorability, and therefore we exclude any image-only or audio-only corpora [16][17][18]. Secondly, we require that every video sample is accompanied by at least one textual description. ...
Article
Full-text available
Not every visual media production is equally retained in memory. Recent studies have shown that the elements of an image, as well as their mutual semantic dependencies, provide a strong clue as to whether a video clip will be recalled on a second viewing or not. We believe that short textual descriptions encapsulate most of these relationships among the elements of a video, and thus they represent a rich yet concise source of information to tackle the problem of media memorability prediction. In this paper, we deepen the study of short captions as a means to convey in natural language the visual semantics of a video. We propose to use vector embeddings from a pretrained SBERT topic detection model with no adaptation as input features to a linear regression model, showing that, from such a representation, simpler algorithms can outperform deep visual models. Our results suggest that text descriptions expressed in natural language might be effective in embodying the visual semantics required to model video memorability.
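A minimal sketch of the caption-embedding-plus-linear-regression recipe described above, assuming the sentence-transformers and scikit-learn packages; the checkpoint name and the toy captions and scores are placeholders, not the paper's exact setup:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LinearRegression

# Placeholder data: one short caption and one memorability score per video.
captions = ["a man throws a ball for his dog", "waves crash on a rocky beach"]
scores = [0.84, 0.62]

# Any pretrained SBERT checkpoint can stand in for the topic-detection
# model used in the paper; it is applied frozen, with no fine-tuning.
encoder = SentenceTransformer('all-MiniLM-L6-v2')
X = encoder.encode(captions)                 # (n_videos, embedding_dim)

reg = LinearRegression().fit(X, scores)      # simple linear model on embeddings
print(reg.predict(encoder.encode(["a crowd cheers at a concert"])))
```

The appeal of the design is that the representational heavy lifting happens in the frozen text encoder, leaving only a linear readout to fit.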
... Imageability is based on whether the audio is music or not, using the PANNs [17] network. H_cu and arousal scores are predicted with an xResNet34 pre-trained on ImageNet [8] and fine-tuned on HCU400 [1]. For familiarity, we chose to use the top audio-tag confidence score of the PANNs [17] network, as we had observed a correlation between the two scores. ...
Preprint
Full-text available
Memorability determines what evanesces into emptiness, and what worms its way into the deepest furrows of our minds. It is the key to curating more meaningful media content as we wade through daily digital torrents. The Predicting Media Memorability task in MediaEval 2020 aims to address the question of media memorability by setting the task of automatically predicting video memorability. Our approach is a multimodal deep learning-based late fusion that combines visual, semantic, and auditory features. We used audio gestalt to estimate the influence of the audio modality on overall video memorability, and accordingly inform which combination of features would best predict a given video's memorability scores.
Article
In a world of rich, complex, and demanding audio environments, intelligent systems can mediate our interaction with the sounds around us - both to enable meaningful, aesthetic experiences and to transition work from humans to computational agents. Drawing from several years of our research, we suggest that the design of such systems must be driven by a deep understanding of auditory cognition. In this article, we discuss two concrete approaches we take toward cognition-informed interface design - one that begins with sounds themselves to form explicit, contextualized, cognitive models, built on the foundations of large-data parsing infrastructure; and one that begins with the individual, built from intuition surrounding the influence of the cognitive state on perception. We point toward an unexplored and compelling future at their intersection.
Article
Full-text available
Learning phrase representations has been widely explored in many Natural Language Processing (NLP) tasks (e.g., Sentiment Analysis, Machine Translation) and has shown promising improvements. Previous studies either learn non-compositional phrase representations with general word embedding learning techniques or learn compositional phrase representations based on syntactic structures, which either require huge amounts of human annotation or cannot be easily generalized to all phrases. In this work, we propose to take advantage of a large-scale paraphrase database and present a pair-wise gated recurrent units (pairwise-GRU) framework to generate compositional phrase representations. Our framework can be re-used to generate representations for any phrase. Experimental results show that our framework achieves state-of-the-art results on several phrase similarity tasks.
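A bare-bones PyTorch sketch of the pair-wise idea: a shared GRU encoder trained so that paraphrase pairs embed nearby. It omits the paper's specific gating scheme and uses a generic cosine-embedding loss, so treat it as an illustration of the training setup rather than the authors' architecture:

```python
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    """Embed a token sequence; the final GRU hidden state is the phrase vector."""
    def __init__(self, vocab_size, emb_dim=100, hid_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, token_ids):              # (batch, seq_len)
        _, h = self.gru(self.emb(token_ids))   # h: (1, batch, hid_dim)
        return h.squeeze(0)

enc = PhraseEncoder(vocab_size=10_000)
a = torch.randint(0, 10_000, (32, 4))          # placeholder phrase pair batch
b = torch.randint(0, 10_000, (32, 4))
target = torch.ones(32)                        # +1: the two phrases are paraphrases
loss = nn.CosineEmbeddingLoss()(enc(a), enc(b), target)
loss.backward()
```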
Article
Full-text available
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
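The skip-gram / negative-sampling / subsampling / phrase-detection combination described above is exposed directly by gensim's Word2Vec and Phrases classes; a sketch on a toy corpus (real training needs millions of sentences):

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["air", "canada", "flights", "to", "toronto"],
    ["air", "canada", "is", "an", "airline"],
]  # placeholder corpus

# Learn collocations so that e.g. "air canada" becomes the token "air_canada".
bigrams = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
phrased = [bigrams[s] for s in sentences]

# sg=1 selects skip-gram; negative=5 enables negative sampling;
# sample subsamples frequent words for speed and more regular vectors.
model = Word2Vec(phrased, vector_size=100, sg=1, negative=5,
                 sample=1e-5, min_count=1, workers=4)
print(model.wv.most_similar(model.wv.index_to_key[0], topn=3))
```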
Article
Full-text available
Everyday listening is the experience of hearing events in the world rather than sounds per se. In this article, I take an ecological approach to everyday listening to overcome constraints on its study implied by more traditional approaches. In particular, I am concerned with developing a new framework for describing sound in terms of audible source attributes. An examination of the continuum of structured energy from event to audition suggests that sound conveys information about events at locations in an environment. Qualitative descriptions of the physics of sound-producing events, complemented by protocol studies, suggest a tripartite division of sound-producing events into those involving vibrating solids, gasses, or liquids. Within each of these categories, basic-level events are defined by the simple interactions that can cause these materials to sound, whereas more complex events can be described in terms of temporal patterning, compound, or hybrid sources. The results of these investigations are used to create a map of sound-producing events and their attributes useful in guiding further exploration.
Article
Full-text available
The present research on cognitive categories mediates between individual experiences of soundscapes and collective representations shared in language and elaborated as knowledge. This approach focuses on the meanings attributed to soundscapes in an attempt to bridge the gap between individual perceptual categories and sociological representations. First, results of several free categorisation experiments are presented, namely the categorical structures elicited using soundscape recordings and the underlying principles of organisation derived from the analysis of verbal comments. People categorised sound samples on the basis of semantic features that integrate perceptual ones. Specifically, soundscapes reflecting human activity were perceived as more pleasant than soundscapes where mechanical sounds were predominant. Second, linguistic exploration of free-format verbal descriptions of soundscapes indicated that the meanings attributed to sounds act as a determinant of sound quality evaluations. Soundscape evaluations are therefore qualitative first: they are semiotic in nature, grounded in the cultural values given to different types of activities. Physical descriptions of sound properties have to be reconsidered as cues pointing to diverse cognitive objects to be identified first, rather than as the only adequate, exhaustive and objective description of the sound itself. Finally, methodological and theoretical consequences of these findings are drawn, highlighting the need to address not only noise annoyance but also the sound quality of urban environments. To do so, cognitive evaluations must be conducted first, to identify the soundscape categories relevant to city users, and physical measurement then used to characterize the corresponding acoustic events.
Article
Full-text available
It is still unknown whether sonic environments influence the processing of individual sounds in a similar way as discourse or sentence context influences the processing of individual words. One obstacle to answering this question has been the failure to dissociate perceptual (i.e., how similar are sonic environment and target sound?) and conceptual (i.e., how related are sonic environment and target?) priming effects. In this study, we dissociate these effects by creating prime-target pairs with a purely perceptual or both a perceptual and conceptual relationship. Perceptual prime-target pairs were derived from perceptual-conceptual pairs (i.e., meaningful environmental sounds) by shuffling the spectral composition of primes and targets so as to preserve their perceptual relationship while making them unrecognizable. Hearing both original and shuffled targets elicited a more positive N1/P2 complex in the ERP when targets were related to a preceding prime as compared with unrelated. Only related original targets reduced the N400 amplitude. Related shuffled targets tended to decrease the amplitude of a late temporo-parietal positivity. Taken together, these effects indicate that sonic environments influence first the perceptual and then the conceptual processing of individual sounds. Moreover, the influence on conceptual processing is comparable to the influence linguistic context has on the processing of individual words.
Article
Full-text available
Since the discovery of 'mirror neurons' in the monkey premotor and parietal cortex, an increasing body of evidence in animals and humans alike has supported the notion of the inextricable link between action execution and action perception. Although research originally focused on the relationship between performed and viewed actions, more recent studies highlight the importance of representing the actions of others through audition. In the first part of this article, we discuss animal studies, which provide direct evidence that action is inherently linked to multi-sensory cues, as well as the studies carried out on healthy subjects by using state-of-the-art cognitive neuroscience techniques such as functional magnetic resonance imaging (fMRI), event-related potentials (ERP), magnetoencephalography (MEG), and transcranial magnetic stimulation (TMS). In the second section, we review the lesion analysis studies in brain-damaged patients demonstrating the link between 'resonant' fronto-parieto-temporal networks and the ability to represent an action by hearing its sound. Moreover, we examine the evidence in favour of somatotopy as a possible representational rule underlying the auditory mapping of actions and consider the links between language and audio-motor action mapping. We conclude with a discussion of some outstanding questions for future research on the link between actions and the sounds they produce.
Article
Full-text available
The influence of listener's expertise and sound identification on the categorization of environmental sounds is reported in three studies. In Study 1, the causal uncertainty of 96 sounds was measured by counting the different causes described by 29 participants. In Study 2, 15 experts and 15 nonexperts classified a selection of 60 sounds and indicated the similarities they used. In Study 3, 38 participants indicated their confidence in identifying the sounds. Participants reported using either acoustical similarities or similarities of the causes of the sounds. Experts used acoustical similarity more often than nonexperts, who used the similarity of the cause of the sounds. Sounds with a low causal uncertainty were more often grouped together because of the similarities of the cause, whereas sounds with a high causal uncertainty were grouped together more often because of the acoustical similarities. The same conclusions were reached for identification confidence. This measure allowed the sound classification to be predicted, and is a straightforward method to determine the appropriate description of a sound.
Article
Full-text available
Acoustic, ecological, perceptual and cognitive factors that are common in the identification of 41 brief, varied sounds were evaluated. In Experiment 1, identification time and accuracy, causal uncertainty values, and spectral and temporal properties of the sounds were obtained. Experiment 2 was a survey to obtain ecological frequency counts. Experiment 3 solicited perceptual-cognitive ratings. Factor analyses of spectral parameters and perceptual-cognitive ratings were performed. Identification time and causal uncertainty are highly interrelated, and both are related to ecological frequency and the presence of harmonics and similar spectral bursts. Experiments 4 and 5 used a priming paradigm to verify correlational relationships between identification time and causal uncertainty and to assess the effect of sound typicality. Results support a hybrid approach for theories of everyday sound identification.
Article
Full-text available
To identify the brain regions preferentially involved in environmental sound recognition (comprising portions of a putative auditory 'what' pathway), we collected functional imaging data while listeners attended to a wide range of sounds, including those produced by tools, animals, liquids and dropped objects. These recognizable sounds, in contrast to unrecognizable, temporally reversed control sounds, evoked activity in a distributed network of brain regions previously associated with semantic processing, located predominantly in the left hemisphere, but also included strong bilateral activity in posterior portions of the middle temporal gyri (pMTG). Comparisons with earlier studies suggest that these bilateral pMTG foci partially overlap cortex implicated in high-level visual processing of complex biological motion and recognition of tools and other artifacts. We propose that the pMTG foci process multimodal (or supramodal) information about objects and object-associated motion, and that this may represent 'action' knowledge that can be recruited for purposes of recognition of familiar environmental sound-sources. These data also provide a functional and anatomical explanation for the symptoms of pure auditory agnosia for environmental sounds reported in human lesion studies.
Conference Paper
The present work involved a sound-sorting and category-labelling task that elicits rather than prescribes words used to describe sounds, allowing categorization strategies to emerge spontaneously and the interpretation of the principal dimensions of categorization using the generated descriptive words. Previous soundscape work suggests that ‘everyday listening’ is primarily concerned with gathering information about sound sources, and that sounds are typically categorized by perceived similarities between the sound-causing events. The present work demonstrates that this is likely to be the case when sound-sources are sufficiently differentiated for this to be a useful cognitive strategy, such as when categorizing a variety of different sound sources, or when categorizing a broad class of sounds with multiple sources such as ‘water’. However, distinct strategies based upon alternative cues emerge for other types of sounds. For example, categorization of dog sounds is primarily determined by judgements relating to perceptual dimensions similar to valence (‘sad’/’lonely’-‘playful’/’friendly’) and arousal (‘bored’/’whining’- ‘threatening’/’vicious’), a finding that supports the circumplex model of affect as a meaningful framework for understanding human categorization of this type of sound. Categorization of engine sounds on the other hand was found to be based primarily upon explicit assessment of the acoustic signal, along dimensions which correlate strongly with the fluctuation strength (‘steady’-‘chugging’) and sharpness (‘muffled’-‘jarring’) of the recordings. These results demonstrate that categorization of sound is based upon different strategies depending on context and the availability of cues. It has implications for experimental methods in soundscapes that prescribe conceptual frameworks on test subjects. For instance, careful consideration should be given to the appropriateness of semantic differential scales in future perceptual soundscape work.
Article
Machine learning about language can be improved by supplying it with specific knowledge and sources of external information. We present here a new version of the linked open data resource ConceptNet that is particularly well suited to be used with modern NLP techniques such as word embeddings. ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources that include expert-created resources, crowd-sourcing, and games with a purpose. It is designed to represent the general knowledge involved in understanding language, improving natural language applications by allowing the application to better understand the meanings behind the words people use. When ConceptNet is combined with word embeddings acquired from distributional semantics (such as word2vec), it provides applications with understanding that they would not acquire from distributional semantics alone, nor from narrower resources such as WordNet or DBPedia. We demonstrate this with state-of-the-art results on intrinsic evaluations of word relatedness that translate into improvements on applications of word vectors, including solving SAT-style analogies.
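The combined ConceptNet-plus-distributional embeddings are distributed as ConceptNet Numberbatch in standard word2vec text format, so they load directly with gensim (the release file name below is an assumption; check the project's releases for the current one):

```python
from gensim.models import KeyedVectors

# English-only Numberbatch release, word2vec text format (gzip is fine).
vectors = KeyedVectors.load_word2vec_format('numberbatch-en-19.08.txt.gz',
                                            binary=False)
print(vectors.most_similar('saxophone', topn=5))
```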
Article
The faces we encounter throughout our lives make different impressions on us: Some are remembered at first glance, while others are forgotten. Previous work has found that the distinctiveness of a face influences its memorability-the degree to which face images are remembered or forgotten. Here, we generalize the concept of face memorability in a large-scale memory study. First, we find that memorability is an intrinsic feature of a face photograph-across observers some faces are consistently more remembered or forgotten than others-indicating that memorability can be used for measuring, predicting, and manipulating subsequent memories. Second, we determine the role that 20 personality, social, and memory-related traits play in face memorability. Whereas we find that certain traits (such as kindness, atypicality, and trustworthiness) contribute to face memorability, they do not suffice to explain the variance in memorability scores, even when accounting for noise and differences in subjective experience. This suggests that memorability itself is a consistent, singular measure of a face that cannot be reduced to a simple combination of personality and social facial attributes. We outline modern neuroscience questions that can be explored through the lens of memorability.
Article
In natural language processing, conflation is the process of merging or lumping together nonidentical words which refer to the same principal concept. This can relate both to words which are entirely different in form (e.g., "group" and "collection"), and to words which share some common root (e.g., "group", "grouping", "subgroups"). In the former case the words can only be mapped by referring to a dictionary or thesaurus, but in the latter case use can be made of the orthographic similarities between the forms. One popular approach is to remove affixes from the input words, thus reducing them to a stem; if this could be done correctly, all the variant forms of a word would be converted to the same standard form. Since the process is aimed at mapping for retrieval purposes, the stem need not be a linguistically correct lemma or root (see also Frakes 1982).
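A quick illustration with NLTK's Porter stemmer: affix stripping conflates variants that share a root, but pairs like "group"/"collection" still require a dictionary or thesaurus, exactly as described above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["group", "grouping", "subgroups", "collection"]:
    print(word, "->", stemmer.stem(word))
# group -> group, grouping -> group, subgroups -> subgroup,
# collection -> collect   (no stem links "group" to "collection")
```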
Article
Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
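NLTK exposes WordNet's synonym sets and semantic relations under program control; a small example (requires the wordnet corpus, e.g. via python -m nltk.downloader wordnet):

```python
from nltk.corpus import wordnet as wn

for synset in wn.synsets('bark')[:3]:           # one synset per word sense
    print(synset.name(), '-', synset.definition())

print(wn.synsets('dog')[0].hypernyms())         # is-a relation between synsets
```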
Article
The development of a set of everyday, nonverbal, digitized sounds for use in auditory confrontation naming applications is described. Normative data are reported for 120 sounds of varying lengths representing a wide variety of acoustic events such as sounds produced by animals, people, musical instruments, tools, signals, and liquids. In Study 1, criteria for scoring naming accuracy were developed and rating data were gathered on degree of confidence in sound identification and the perceived familiarity, complexity, and pleasantness of the sounds. In Study 2, the previously developed criteria for scoring naming accuracy were applied to the naming responses of a new sample of subjects, and oral naming times were measured. In Study 3 data were gathered on how subjects categorized the sounds: In the first categorization task - free classification - subjects generated category descriptions for the sounds; in the second task - constrained classification - a different sample of subjects selected the most appropriate category label for each sound from a list of 27 labels generated in the first task. Tables are provided in which the 120 stimuli are sorted by familiarity, complexity, pleasantness, duration, naming accuracy, speed of identification, and category placement. The .WAV sound files are freely available to researchers and clinicians via a sound archive on the World Wide Web; the URL is http://www.cofc.edu/~marcellm/confront.htm.
Article
The finding of a multisensory representation of actions in a premotor area of the monkey brain suggests that similar multimodal action-matching mechanisms may also be present in humans. Based on the existence of an audiovisual mirror system, we investigated whether sounds referring to actions that can be performed by the perceiver underlie different processing in the human brain. We recorded multichannel ERPs in a visuoauditory version of the repetition suppression paradigm to study the time course and the locus of the semantic processing of action-related sounds. Results show that the left posterior superior temporal and premotor areas are selectively modulated by action-related sounds; in contrast, the temporal pole is bilaterally modulated by non-action-related sounds. The present data, which support the hypothesis of distinctive action sound processing, may contribute to recent theories about the evolution of human language from a mirror system precursor.
Article
To hear a sequence of words and repeat them requires sensory-motor processing and something more: temporary storage. We investigated neural mechanisms of verbal memory by using fMRI and a task designed to tease apart perceptually based ("echoic") memory from phonological-articulatory memory. Sets of two- or three-word pairs were presented bimodally, followed by a cue indicating from which modality (auditory or visual) items were to be retrieved and rehearsed over a delay. Although delay-period activation in the planum temporale (PT) was insensitive to the source modality and showed sustained delay-period activity, the superior temporal gyrus (STG) activated more vigorously when the retrieved items had arrived in the auditory modality, and showed transient delay-period activity. Functional connectivity analysis revealed two topographically distinct fronto-temporal circuits, with the STG co-activating more strongly with ventrolateral prefrontal cortex and the PT co-activating more strongly with dorsolateral prefrontal cortex. These results argue for separate contributions of the ventral and dorsal auditory streams to verbal working memory.
Article
Environmental sound research is still in its early stages, although in recent years a body of research has begun to accumulate, both on the perception of environmental sounds themselves and on their practical applications in other areas of auditory research and cognitive science. In this chapter some of those practical applications are detailed, combined with a discussion of the implications of environmental sound research for auditory perception in general; finally, some outstanding issues and possible directions for future research are outlined.
The international affective digitized sounds (IADS-2): Affective ratings of sounds and instruction manual
  • Margaret M. Bradley
  • Peter J. Lang
CNN architectures for large-scale audio classification
  • Shawn Hershey
  • Sourish Chaudhuri
  • Daniel P. W. Ellis
  • Jort F. Gemmeke
  • Aren Jansen
  • R. Channing Moore