Article

Some Experiments on the Recognition of Speech With One and With Two Ears

Author: E. Colin Cherry

Abstract

This paper describes a number of objective experiments on recognition, concerning particularly the relation between the messages received by the two ears. Rather than use steady tones or clicks (frequency or time‐point signals) continuous speech is used, and the results interpreted in the main statistically. Two types of test are reported: (a) the behavior of a listener when presented with two speech signals simultaneously (statistical filtering problem) and (b) behavior when different speech signals are presented to his two ears.


... There have been several studies on the human ability to perceive speech amid other speech, the first by Cherry in 1953 [3]. Cherry conducted a series of experiments on the Cocktail Party Effect. ...
... This suggests that users will be able to understand simultaneous speech better if there are differences in the sound frequencies. In another experiment, Egan et al. confirmed the results found by Cherry [3] that participants were better able to understand the target message when the two messages were played in opposite ears instead of the same ear. ...
Preprint
We explore a method for presenting word suggestions for non-visual text input using simultaneous voices. We conduct two perceptual studies and investigate the impact of different presentations of voices on a user's ability to detect which voice, if any, spoke their desired word. Our sets of words simulated the word suggestions of a predictive keyboard during real-world text input. We find that when voices are simultaneous, user accuracy decreases significantly with each added word suggestion. However, adding a slight 0.15 s delay between the start of each subsequent word allows two simultaneous words to be presented with no significant decrease in accuracy compared to presenting two words sequentially (84% simultaneous versus 86% sequential). This allows two word suggestions to be presented to the user 32% faster than sequential playback without decreasing accuracy.
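The timing benefit above follows from the staggered onsets: a rough sketch, assuming a nominal word-audio duration (a value not reported in the preprint), is given below to show why overlapping voices with a 0.15 s onset delay finish roughly a third sooner than strictly sequential playback.

```python
# Hypothetical illustration of the timing benefit described above: sequential
# playback of word suggestions versus simultaneous voices whose onsets are
# staggered by 0.15 s. The word duration below is an assumed value, not a
# figure reported in the preprint.

WORD_DURATION_S = 0.45   # assumed average duration of one spoken word suggestion
ONSET_DELAY_S = 0.15     # stagger between the start of each subsequent word

def sequential_time(n_words: int, word_dur: float = WORD_DURATION_S) -> float:
    """Total time to play n word suggestions one after another."""
    return n_words * word_dur

def staggered_time(n_words: int, word_dur: float = WORD_DURATION_S,
                   delay: float = ONSET_DELAY_S) -> float:
    """Total time when voices overlap but each onset is delayed by `delay`."""
    if n_words == 0:
        return 0.0
    return (n_words - 1) * delay + word_dur

for n in (2, 3):
    seq, stag = sequential_time(n), staggered_time(n)
    print(f"{n} words: sequential {seq:.2f} s, staggered {stag:.2f} s "
          f"({100 * (seq - stag) / seq:.0f}% faster)")
```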
... Research on auditory attention in complex acoustical environments is a thriving field. Starting with Cherry's seminal paper [1]-which coined the expression 'cocktail party effect' to describe our ability to focus on selected aspects of a given acoustic scene while blocking out non-relevant sonic streams-an impressive amount of psychophysical as well as neuroimaging research has been conducted on both sides of the 'cocktail party problem' [2]: how do we segregate concurrent sonic streams that are somehow mixed? and how do we direct our attention to a source of interest while ignoring other sources? ...
... For each musician, we thus extracted the time series corresponding to our two audio descriptors (RMS and spectral centroid) over a 7 s window (the minimum temporal distance between two target sounds, so that there would never be any overlap in successive measurements) centred around the target sound under consideration. (See the electronic supplementary material for a replication of our main analysis using a much shorter time window, yielding similar results.) ...
Article
Full-text available
While research on auditory attention in complex acoustical environments is a thriving field, experimental studies thus far have typically treated participants as passive listeners. The present study, which combined real-time covert loudness manipulations and online probe detection, investigates, for the first time to our knowledge, the effects of acoustic salience on auditory attention during live interactions, using musical improvisation as an experimental paradigm. We found that musicians were more likely to pay attention to a given co-performer when that performer was made to sound louder or softer; that this salience effect was not driven by the local variations introduced by our manipulations but more likely by the longer-term context; and that improvisers tended to be more strongly and more stably coupled when a musician was made more salient. Our results thus demonstrate that a meaningful change in the acoustic context not only captured attention but also affected the ongoing musical interaction itself, highlighting the tight relationship between attentional selection and interaction in such social scenarios and opening novel perspectives on whether similar processes are at play in human linguistic interactions.
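The excerpt above mentions two audio descriptors, RMS and spectral centroid, extracted over windows around each target sound. A minimal NumPy sketch of those descriptors follows; frame and hop sizes are assumed values, not the study's analysis parameters.

```python
# Sketch of the two audio descriptors mentioned in the excerpt above (RMS and
# spectral centroid), computed over short frames with NumPy. Frame and hop
# sizes are assumed values; the original study's analysis parameters may differ.
import numpy as np

def frame_descriptors(x, sr, frame_len=2048, hop=1024):
    """Return per-frame RMS and spectral centroid for a mono signal x."""
    rms, centroid = [], []
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        rms.append(np.sqrt(np.mean(frame ** 2)))
        mag = np.abs(np.fft.rfft(frame))
        centroid.append(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
    return np.array(rms), np.array(centroid)

# Example: descriptors for 7 s of a (synthetic) signal sampled at 44.1 kHz.
sr = 44100
t = np.arange(0, 7.0, 1.0 / sr)
x = 0.5 * np.sin(2 * np.pi * 440 * t)        # placeholder for a musician's track
rms, centroid = frame_descriptors(x, sr)
print(rms.shape, centroid.shape)
```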
... A typical example of this phenomenon is the "cocktail party" effect (Conway et al. 2001). Cherry (1953) studied auditory selective attention, focusing on the analysis of this phenomenon. The "cocktail party" effect refers to the phenomenon whereby an individual, as commonly happens at a cocktail party, is able to concentrate on a single conversation while ignoring many others taking place at the same time. ...
... They hold that perceptual analysis runs to completion for all stimuli. It is fairly easy to register the physical characteristics of the input, even in an unattended channel (Cherry 1953). The habituation hypothesis is a reinterpretation of the filter mechanism that helps address this fundamental, unresolved problem of attention theory, namely: if information from a channel is filtered out or attenuated on a physical basis, why should it be so easy to notice a physical change in the unattended channel? ...
Book
Full-text available
The volume describes an experimental project aimed at studying the development of two cognitive processes, working memory and selective attention, in interpreting students. The hypothesis is that sustained interpreting practice strengthens these processes. Interpreting and translation students were compared; they completed a battery of tests at the beginning, midpoint, and end of their master's degree. In addition to the test results, information was also collected for the interpreting students on their self-directed interpreting practice and academic performance. The data analysis revealed an improvement in working-memory efficiency in the interpreting students, whereas no advantages were found for selective attention. Moreover, the cognitive processes improved mainly in the first year of the programme and had a positive influence on academic performance, unlike self-directed practice.
... Seven decades ago Cherry (1953) formulated the cocktail party problem: how does a listener "tune in" to one of two or more simultaneous voices and what are the limits of such selective listening? Research conducted since Cherry's seminal experiments has shone light on multiple aspects of the problem, in particular that: the primary challenge in the cocktail party scenario is not the processing of speech signal masked by other speech containing energy in the same spectral bands (energetic masking), but the selection of the relevant spoken message in the face of informational masking from the content of the other streams (e.g., Brungart, 2001;Brungart et al., 2001;Darwin, 2008); spatial separability of talkers benefits selection, but voices coming from the same location can also be effectively distinguished based on cues such as fundamental frequency, vocal tract size, prosody, accent, etc. (e.g., Darwin et al., 2003;Darwin & Hukin, 2000); listeners can integrate such physical/perceptual attributes into auditory objects, which benefit from the temporal constancy (continuity) of the constituent attributes (e.g., Best et al., 2008;Kitterick et al., 2010;Samson & Johnsrude, 2016); familiarity with a voice improves speech intelligibility (e.g., Holmes et al., 2018a; -to list only a few key findings relevant to the present study (for reviews, see Bronkhorst, 2015;Shinn-Cunningham & Best, 2015). ...
... In the studies in which only one voice per gender was presented (Monsell et al., 2019;Seibold et al., 2018;Strivens et al., 2024) participants likely learned to tune in to specific talkers. Yet, as Cherry (1953) showed in some of his seminal experiments, the listener is also able to select a voice based on its location/side. Spatial selection of voices in the context of dynamic changes in the location of the target voice was investigated by Best and colleagues (2008), who presented participants with sequences of digits spoken simultaneously from five loudspeakers in front of them. ...
Article
Full-text available
Can one shift attention among voices at a cocktail party during a silent pause? Researchers have required participants to attend to one of two simultaneous voices – cued by its gender or location. Switching the target gender or location has resulted in a performance ‘switch cost’ – which was recently shown to reduce with preparation when a gender cue was presented in advance. The current study asks if preparation for a switch is also effective when a voice is selected by location. We displayed a word or image 50/800/1400 ms before the onset of two simultaneous dichotic (male and female) voices to indicate whether participants should classify as odd/even the number spoken by the voice on the left or on the right; in another condition, we used gender cues. Preparation reduced the switch cost in both spatial- and gender-cueing conditions. Performance was better when each voice was heard on the same side as on the preceding trial, suggesting ‘binding’ of non-spatial and spatial voice features – but this did not materially influence the reduction in switch cost with preparation, indicating that preparatory attentional shifts can be effective within a single (task-relevant) dimension. We also asked whether words or pictures are more effective for cueing a voice. Picture cues resulted in better performance than word cues, especially when the interval between the cue and the stimulus was short, suggesting that (presumably phonological) processes involved in the recognition of the word cue interfered with the (near) concurrent encoding of the target voice’s speech.
... In an acoustic environment like a cocktail party, we seem capable of effortlessly following one speaker in the presence of other speakers and background noises. Speech separation is commonly called the "cocktail party problem," a term coined by Cherry in his famous 1953 paper [26]. ...
... We refer to speech separation or segregation as the general task of separating target speech from its background interference, which may include nonspeech noise, interfering speech, or both, as well as room reverberation. Furthermore, we equate speech separation and the cocktail party problem, which goes beyond the separation of two speech utterances originally experimented with by Cherry [26]. By speech enhancement (or denoising), we mean the separation of speech and nonspeech noise. ...
Preprint
Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This article provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multi-talker separation), and speech dereverberation, as well as multi-microphone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
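Since the overview above centres on training targets for supervised separation, a small sketch of one standard target, the ideal ratio mask, is given below. This is a generic textbook formulation under assumed toy spectrograms, not the specific targets reviewed in the article.

```python
# A minimal sketch of one common training target discussed in supervised speech
# separation: the ideal ratio mask (IRM), computed from the magnitude spectra of
# clean speech and noise. This is a generic textbook formulation, not the
# specific targets used in the overview above.
import numpy as np

def ideal_ratio_mask(speech_mag: np.ndarray, noise_mag: np.ndarray,
                     beta: float = 0.5) -> np.ndarray:
    """IRM(t, f) = (S^2 / (S^2 + N^2)) ** beta for magnitude spectrograms."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta

def apply_mask(mixture_mag: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Point-wise masking of the mixture magnitude (the separation step)."""
    return mixture_mag * mask

# Toy spectrograms: 100 frames x 257 frequency bins.
rng = np.random.default_rng(0)
speech, noise = rng.rayleigh(1.0, (100, 257)), rng.rayleigh(0.5, (100, 257))
mixture = speech + noise                     # crude approximation of mixing
mask = ideal_ratio_mask(speech, noise)
estimate = apply_mask(mixture, mask)
```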
... We show here that our microcircuit motif model M is able to carry out a computational operation on large-scale activity patterns that is fundamental for such a global communication scheme: It can demix superimposed spike patterns that impinge on a generic cortical area, and represent the presence of each pattern in their input stream through the firing of separate populations of neurons. This suggests that the layer 2/3 microcircuit motif has an inherent capability to solve the well known cocktail party problem (blind source separation) [Cherry, 1953] on the level of larger activity patterns. This capability emerges automatically through STDP, as demonstrated in Figure 3 for our data-based model M. ...
... Through this operation, each network module may produce one of a small repertoire of stereotypical firing patterns, commonly referred to as assemblies, assembly sequences, or packets of information [Luczak et al., 2015]. If these assembly activations are fundamental tokens of global cortical computation and communication, as proposed by [Luczak et al., 2015], then cortical columns have to solve a particular instance of the well-known cocktail party problem [Cherry, 1953]: They have to recognize and separately represent spike inputs from different assemblies that are superimposed in their network input stream. ...
Preprint
Cortical microcircuits are very complex networks, but they are composed of a relatively small number of stereotypical motifs. Hence one strategy for throwing light on the computational function of cortical microcircuits is to analyze emergent computational properties of these stereotypical microcircuit motifs. We are addressing here the question how spike-timing dependent plasticity (STDP) shapes the computational properties of one motif that has frequently been studied experimentally: interconnected populations of pyramidal cells and parvalbumin-positive inhibitory cells in layer 2/3. Experimental studies suggest that these inhibitory neurons exert some form of divisive inhibition on the pyramidal cells. We show that this data-based form of feedback inhibition, which is softer than that of winner-take-all models that are commonly considered in theoretical analyses, contributes to the emergence of an important computational function through STDP: The capability to disentangle superimposed firing patterns in upstream networks, and to represent their information content through a sparse assembly code.
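As a loose illustration of the ingredients named above (STDP combined with divisive rather than winner-take-all inhibition), here is a toy Python sketch. The update rule, parameters, and circuit are didactic assumptions, not the authors' data-based model M.

```python
# Toy sketch (assumptions throughout): pair-based STDP applied to excitatory
# weights, combined with divisive (rather than subtractive/winner-take-all)
# inhibition of pyramidal-cell activations. This is a didactic caricature of
# the microcircuit motif described above, not the authors' data-based model.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 50, 10
W = rng.uniform(0.0, 0.1, (n_out, n_in))      # excitatory input weights
A_plus, A_minus, tau = 0.01, 0.012, 20.0       # STDP parameters (assumed)

def divisive_inhibition(drive, gamma=1.0):
    """Each unit's drive is divided by the total population activity."""
    return drive / (1.0 + gamma * drive.sum())

def stdp_update(W, pre_spikes, post_spikes, dt_pre_post):
    """Potentiate when pre precedes post (dt > 0), depress otherwise."""
    outer = np.outer(post_spikes, pre_spikes)
    if dt_pre_post > 0:
        W += A_plus * np.exp(-dt_pre_post / tau) * outer
    else:
        W -= A_minus * np.exp(dt_pre_post / tau) * outer
    return np.clip(W, 0.0, 1.0)

# One illustrative step: a random input pattern drives the population.
x = (rng.random(n_in) < 0.2).astype(float)
rates = divisive_inhibition(W @ x)
post = (rng.random(n_out) < rates).astype(float)
W = stdp_update(W, x, post, dt_pre_post=5.0)
```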
... Beyond acoustic incoherence, even in an ideal scenario where all audio sources are perfectly adapted to each user's environment, group conversations in XR can still lead to the so-called "cocktail party effect," where multiple people speak at the same time (Cherry, 1953). This situation is particularly challenging because it involves both energetic and informational masking and requires a high level of selective auditory attention (Brungart et al., 2006;Oberfeld and Kloeckner-Nowotny, 2016). ...
Article
Full-text available
In recent years, extended reality (XR) has gained interest as a platform for human communication, with the emergence of the “Metaverse” promising to reshape social interactions. At the same time, concerns about harmful behavior and criminal activities in virtual environments have increased. This paper explores the potential of technology to support social harmony within XR, focusing specifically on audio aspects. We introduce the concept of acoustic coherence and discuss why it is crucial for smooth interaction. We further explain the challenges of speech communication in XR, including noise and reverberation, and review sound processing methods to enhance the auditory experience. We also comment on the potential of using virtual reality as a tool for the development and evaluation of audio algorithms aimed at enhancing communication. Finally, we present the results of a pilot study comparing several audio enhancement techniques inside a virtual environment.
... Hearing aids (HAs) enhance auditory perception and quality of life among those with hearing impairments [1], particularly in challenging, noisy environments. They aim to distinguish between speech and noise, enhancing the former while suppressing the latter, thereby allowing wearers to participate in conversations even in busy settings like restaurants - the well-known cocktail party problem [2]. Speech enhancement (SE) is arguably one of the core modules of HA devices, with real-time approaches traditionally depending on statistical models [3]. ...
Preprint
The DeepFilterNet (DFN) architecture was recently proposed as a deep learning model suited for hearing aid devices. Despite its competitive performance on numerous benchmarks, it still follows a 'one-size-fits-all' approach, which aims to train a single, monolithic architecture that generalises across different noises and environments. However, its limited size and computation budget can hamper its generalisability. Recent work has shown that, to mitigate this, in-context adaptation can improve performance by conditioning the denoising process on additional information extracted from background recordings. These recordings can be offloaded outside the hearing aid, thus improving performance while adding minimal computational overhead. We introduce these principles to the DFN model, thus proposing the DFingerNet (DFiN) model, which shows superior performance on various benchmarks inspired by the DNS Challenge.
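To make the in-context adaptation idea concrete, here is a hedged sketch of a denoiser conditioned on an embedding of a separate background recording via FiLM-style modulation. All module names, sizes, and the conditioning mechanism are assumptions for illustration and do not reproduce the DFN/DFiN architecture.

```python
# Rough sketch of the in-context adaptation idea mentioned above: a small
# speech-denoising network is conditioned on an embedding computed from a
# separate background-noise recording (e.g., via FiLM-style scaling). All
# module names, sizes and the FiLM choice are assumptions for illustration,
# not the DFingerNet architecture.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, n_freq=257, emb_dim=64):
        super().__init__()
        # Embedding network applied to the background recording's spectrogram.
        self.embed = nn.Sequential(nn.Linear(n_freq, emb_dim), nn.ReLU(),
                                   nn.Linear(emb_dim, emb_dim))
        self.film = nn.Linear(emb_dim, 2 * n_freq)   # per-bin scale and shift
        self.mask_net = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU(),
                                      nn.Linear(256, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, background_spec):
        # noisy_spec, background_spec: (batch, frames, n_freq) magnitude spectra
        ctx = self.embed(background_spec).mean(dim=1)          # (batch, emb_dim)
        scale, shift = self.film(ctx).chunk(2, dim=-1)         # (batch, n_freq)
        conditioned = noisy_spec * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        mask = self.mask_net(conditioned)                      # mask in [0, 1]
        return noisy_spec * mask

model = ConditionedDenoiser()
noisy = torch.rand(2, 100, 257)        # toy batch: 2 utterances, 100 frames
background = torch.rand(2, 50, 257)    # off-device background recordings
enhanced = model(noisy, background)
```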
... speech stream. This is particularly critical when speech is challenging, such as in the classic cocktail party situation, where conversations are happening simultaneously (Cherry, 1953). The cleaned data were epoched into trials that matched the length of the audiovisual stimuli. ...
Article
Observing lip movements of a speaker facilitates speech understanding, especially in challenging listening situations. Converging evidence from neuroscientific studies shows stronger neural responses to audiovisual stimuli compared to audio-only stimuli. However, the interindividual variability of this contribution of lip movement information and its consequences on behavior are unknown. We analyzed source-localized magnetoencephalographic (MEG) responses from 29 normal-hearing participants (12 female) listening to audiovisual speech, both with and without the speaker wearing a surgical face mask, and in the presence or absence of a distractor speaker. Using temporal response functions (TRFs) to quantify neural speech tracking, we show that neural responses to lip movements are, in general, enhanced when speech is challenging. After controlling for speech acoustics, we show that lip movements contribute to enhanced neural speech tracking, particularly when a distractor speaker is present. However, the extent of this visual contribution to neural speech tracking varied greatly among participants. Probing the behavioral relevance, we demonstrate that individuals who show a higher contribution of lip movements in terms of neural speech tracking, show a stronger drop in comprehension and an increase in perceived difficulty when the mouth is occluded by a surgical face mask. By contrast, no effect was found when the mouth was not occluded. We provide novel insights on how the contribution of lip movements in terms of neural speech tracking varies among individuals and its behavioral relevance, revealing negative consequences when visual speech is absent. Our results also offer potential implications for objective assessments of audiovisual speech perception.
... In daily life, listeners frequently find themselves in environments where they need to focus on a single talker's speech amid other simultaneous conversations. This situation is often referred to as the "cocktail party problem" (Cherry, 1953). Recognizing speech in such environments is notably challenging due to the spectro-temporal overlap of the concurrent talkers. ...
Article
Full-text available
Speech-on-speech masking is a common and challenging situation in everyday verbal communication. The ability to segregate competing auditory streams is a necessary requirement for focusing attention on the target speech. The Visual World Paradigm (VWP) provides insight into speech processing by capturing gaze fixations on visually presented icons that reflect the speech signal. This study aimed to propose a new VWP to examine the time course of speech segregation when competing sentences are presented and to collect pupil size data as a measure of listening effort. Twelve young normal-hearing participants were presented with competing matrix sentences (structure “name-verb-numeral-adjective-object”) diotically via headphones at four target-to-masker ratios (TMRs), corresponding to intermediate to near perfect speech recognition. The VWP visually presented the number and object words from both the target and masker sentences. Participants were instructed to gaze at the corresponding words of the target sentence without providing verbal responses. The gaze fixations consistently reflected the different TMRs for both number and object words. The slopes of the fixation curves were steeper, and the proportion of target fixations increased with higher TMRs, suggesting more efficient segregation under more favorable conditions. Temporal analysis of pupil data using Bayesian paired sample t-tests showed a corresponding reduction in pupil dilation with increasing TMR, indicating reduced listening effort. The results support the conclusion that the proposed VWP and the captured eye movements and pupil dilation are suitable for objective assessment of sentence-based speech-on-speech segregation and the corresponding listening effort.
... Here, the perceptual processes in a receiver involved in tracking and understanding one speaker amid competing speakers in a setting similar to a cocktail party are of focus [4]. The terminology was coined by Colin Cherry, who initially conducted experiments where participants were tasked with separating different speech signals presented diotically [5]; he showed that separability of the signals in the presence of background noise depended upon the rate of the speech, its direction of arrival, the participants' gender, and the average pitch of the speech signals. However, given the focus on speech perception, the task of detecting and tracking musical targets in the presence of accompanying musical maskers remains underexplored (i.e., Musical Scene Analysis or MSA tasks), especially within the context of sensorineural hearing impairment. ...
Article
Full-text available
Music pre-processing methods are currently becoming a recognized area of research with the goal of making music more accessible to listeners with a hearing impairment. Our previous study showed that hearing-impaired listeners preferred spectrally manipulated multi-track mixes. Nevertheless, the acoustical basis of mixing for hearing-impaired listeners remains poorly understood. Here, we assess listeners’ ability to detect a musical target within mixes with varying degrees of spectral manipulations using the so-called EQ-transform. This transform exaggerates or downplays the spectral distinctiveness of a track with respect to an ensemble average spectrum taken over a number of instruments. In an experiment, 30 young normal-hearing (yNH) and 24 older hearing-impaired (oHI) participants with predominantly moderate to severe hearing loss were tested. The target that was to be detected in the mixes was from the instrument categories Lead vocals, Bass guitar, Drums, Guitar, and Piano. Our results show that both hearing loss and target category affected performance, but there were no main effects of EQ-transform. yNH performed consistently better than oHI in all target categories, irrespective of the spectral manipulations. Both groups demonstrated the best performance in detecting Lead vocals, with yNH performing flawlessly at 100% median accuracy and oHI at 92.5% (IQR = 86.3–96.3%). Contrarily, performance in detecting Bass was arguably the worst among yNH (Mdn = 67.5% IQR = 60–75%) and oHI (Mdn = 60%, IQR = 50–66.3%), with the latter even performing close to chance-levels of 50% accuracy. Predictions from a generalized linear mixed-effects model indicated that for every decibel increase in hearing loss level, the odds of correctly detecting the target decreased by 3%. Therefore, baseline performance progressively declined to chance-level at moderately severe degrees of hearing loss thresholds, independent of target category. The frequency domain sparsity of mixes and larger differences in target and mix roll-off points were positively correlated with performance especially for oHI participants (r = .3, p < .01). Performance of yNH on the other hand remained robust to changes in mix sparsity. Our findings underscore the multifaceted nature of selective listening in musical scenes and the instrument-specific consequences of spectral adjustments of the audio.
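The reported effect size ("the odds of correctly detecting the target decreased by 3%" per decibel of hearing loss) corresponds to an odds ratio of about 0.97 per dB. A small sketch of that conversion, with the coefficient back-calculated from the stated percentage, is shown below.

```python
# Small sketch of how the reported effect size translates into odds: a
# generalized linear mixed-effects model with a logit link gives a coefficient
# per dB of hearing loss; exp(coefficient) is the odds ratio. The coefficient
# value below is back-calculated from the "3% decrease per dB" statement and is
# illustrative only.
import math

beta_per_db = math.log(0.97)        # odds ratio of 0.97 per dB => ~3% decrease

def odds_change(db_increase: float) -> float:
    """Multiplicative change in the odds of a correct detection."""
    return math.exp(beta_per_db * db_increase)

for db in (1, 10, 20):
    print(f"+{db} dB hearing loss -> odds multiplied by {odds_change(db):.2f}")
# +1 dB -> 0.97, +10 dB -> ~0.74, +20 dB -> ~0.54
```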
... In the real world, it is common to hear a mixture of sounds rather than the speech of only one person. Although humans can easily distinguish target speech from mixed sound, a phenomenon known as the cocktail party effect [1,2], it is difficult for a machine to do so. Hence, speech separation algorithms have many applications in automatic speech recognition, speaker identification, mobile telecommunication, and hearing aids [3,4]. ...
Article
Full-text available
Introduction Traditional speech separation algorithms used time–frequency (T-F) masks. Recently, deep neural networks (deep learning) have been used to estimate T-F masks, and separation performance has improved as a result. However, a conventional deep learning approach that achieves high separation performance requires a large number of parameters, because its network is composed of recurrent neural networks (RNNs) with recurrent and full connections. This is a disadvantage, because low memory consumption is desirable when speech separation algorithms are used in actual applications. To address this shortcoming, we propose a novel network architecture that balances high separation performance with low memory cost. Methods The proposed network is composed of convolutional neural networks (CNNs), which have sparse connections, instead of RNNs, to reduce the number of parameters. Additionally, to achieve high separation performance, we designed the proposed network for speech separation so that it can learn the features of speech. We evaluated the separation performance of each network on the task of separating a two-speaker mixture. Results In the simulations, the separation performance of the proposed network was competitive with that of the best-performing conventional network, despite a reduction of more than 80% in the number of parameters. Conclusion The results indicate that the proposed network is superior to the conventional one when considering both speech separation performance and memory cost.
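A back-of-the-envelope comparison helps explain where the parameter savings described above come from: a recurrent layer carries input, recurrent, and gate weights, whereas a 1-D convolution shares a small kernel across time. The layer sizes in the sketch below are assumed, illustrative values, not the paper's configuration.

```python
# Back-of-the-envelope comparison motivating the abstract above: parameter
# counts of a recurrent (LSTM) layer versus a 1-D convolutional layer of
# similar width. The layer sizes are assumed, illustrative values.
def lstm_params(input_size: int, hidden_size: int) -> int:
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

def conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    return out_channels * in_channels * kernel_size + out_channels

rnn = lstm_params(input_size=257, hidden_size=600)
cnn = conv1d_params(in_channels=257, out_channels=257, kernel_size=3)
print(f"LSTM layer: {rnn:,} params, Conv1d layer: {cnn:,} params "
      f"({100 * (1 - cnn / rnn):.0f}% fewer)")
```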
... Presumably, this function is aided by the fact that generic-you expresses generality using a word, you, that is typically used to refer to the addressee 21 . Thus, people may experience something similar to the "cocktail party" effect 22 , wherein a person's attention is piqued when they hear a piece of self-relevant information (e.g., most classically, their own name, amidst other stimuli at a cocktail party). In line with this theorizing, people find ideas expressed with generic-you to be more resonant 21 . ...
Article
Full-text available
Persuasion plays a crucial role in human communication. Yet, convincing someone to change their mind is often challenging. Here, we demonstrate that a subtle linguistic device, generic-you (i.e., "you" that refers to people in general, e.g., "You win some, you lose some"), is associated with successfully shifting people's pre-existing views in a naturalistic context. Leveraging Large Language Models, we conducted a preregistered study using a large (N = 204,120) online debate dataset. Every use of generic-you in an argument was associated with up to a 14% increase in the odds of successful persuasion. These findings underscore the need to distinguish between the specific and generic uses of "you" in large-scale linguistic analyses, an aspect that has been overlooked in the literature. The robust association between generic-you and persuasion persisted with the inclusion of various covariates, and above and beyond other pronouns (i.e., specific-you, I or we). However, these findings do not imply causality. In Supplementary Experiment 2, arguments with generic-you (vs. first-person singular pronouns, e.g., I) were rated as more persuasive by open-minded individuals. In Supplementary Experiment 3, generic-you (vs. specific-you) arguments did not differentially predict attitude change. We discuss explanations for these results, including differential mechanisms, boundary conditions, and the possibility that people intuitively draw on generic-you when expressing more persuasive ideas. Together, these findings add to a growing literature on the interpersonal implications of broadening one's perspective via a subtle shift in language, while motivating future research on contextual and individual differences that may moderate these effects.
... Humans have the ability to concentrate on the voice of a particular speaker in a noisy environment, known as the "Cocktail Party Effect" [1]. This is attributed to the brain's ability to filter out irrelevant sounds and selectively process the content of interest. ...
Preprint
Full-text available
Auditory attention decoding from electroencephalogram (EEG) could infer to which source the user is attending in noisy environments. Decoding algorithms and experimental paradigm designs are crucial for the development of technology in practical applications. To simulate real-world scenarios, this study proposed a cue-masked auditory attention paradigm to avoid information leakage before the experiment. To obtain high decoding accuracy with low latency, an end-to-end deep learning model, AADNet, was proposed to exploit the spatiotemporal information from the short time window of EEG signals. The results showed that with a 0.5-second EEG window, AADNet achieved an average accuracy of 93.46% and 91.09% in decoding auditory orientation attention (OA) and timbre attention (TA), respectively. It significantly outperformed five previous methods and did not need the knowledge of the original audio source. This work demonstrated that it was possible to detect the orientation and timbre of auditory attention from EEG signals fast and accurately. The results are promising for the real-time multi-property auditory attention decoding, facilitating the application of the neuro-steered hearing aids and other assistive listening devices.
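As a rough illustration of decoding from a short EEG window, the sketch below defines a compact end-to-end CNN classifier over a 0.5 s segment. The channel count, sampling rate, and layers are assumptions and do not describe AADNet itself.

```python
# Illustrative sketch (not the AADNet architecture): a compact end-to-end CNN
# that maps a short EEG window to an attention class. The channel count,
# sampling rate and layer sizes are assumptions chosen only to show the idea of
# decoding from a 0.5 s window.
import torch
import torch.nn as nn

N_CHANNELS, SAMPLE_RATE, WIN_S, N_CLASSES = 64, 128, 0.5, 2
N_SAMPLES = int(SAMPLE_RATE * WIN_S)           # 64 samples per window

class ShortWindowDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(N_CHANNELS, 32, kernel_size=7, padding=3),  # temporal filtering
            nn.BatchNorm1d(32), nn.ELU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32), nn.ELU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, N_CLASSES)

    def forward(self, eeg):                     # eeg: (batch, channels, samples)
        return self.classifier(self.features(eeg).squeeze(-1))

model = ShortWindowDecoder()
logits = model(torch.randn(8, N_CHANNELS, N_SAMPLES))   # toy batch of 8 windows
pred = logits.argmax(dim=1)                              # decoded attention class
```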
... A great deal of research on speech in noise has shown that people are able to communicate in a noisy environment such as a crowded room or a party. The ability to filter out background noises in order to focus on what is necessary, such as a single conversation, highlights the brain's ability to selectively process auditory information (Cherry, 1953). Many studies have investigated speech intelligibility in noise, looking at different factors such as the speech-to-noise ratio (SNR), with speech intelligibility decreasing and perceived difficulty increasing as the signal-to-noise ratio decreases (Brungart et al., 2020; Munro, 1998); the type of background noise, with speech in babble (i.e., with multiple competing talkers in the background) being more challenging than speech in noise (Assmann & Summerfield, 2004; Cooke et al., 2008; McLaughlin et al., 2018; Rogers et al., 2006); and participants' hearing ability (George et al., 2006), amongst others (for a review, see Bronkhorst, 2000). ...
... The combined coded signal is then decoded (decomposed) back into its original constituent signals at the receiving end of the communication line. Research into signal mixing and separation has its early origins in the study of the "Cocktail Party Effect" in speech recognition (Cherry, 1953). The cocktail party effect describes the human brain's ability to focus on a particular conversation while filtering out other sounds at a noisy party. ...
Preprint
The advent of high density 3D wide azimuth survey configurations has greatly increased the cost of seismic acquisition. Simultaneous source acquisition presents an opportunity to decrease costs by reducing the survey time. Source time delays are typically long enough for seismic reflection energy to decay to negligible levels before firing another source. Simultaneous source acquisition abandons this minimum time restriction and allows interference between seismic sources to compress the survey time. Seismic data processing methods must address the interference introduced by simultaneous overlapping sources. Simultaneous source data are characterized by high amplitude interference artefacts that may be stronger than the primary signal. These large amplitudes are due to the time delay between sources and the rapid decay of seismic energy with arrival time. Therefore, source interference will appear as outliers in denoising algorithms that make use of a Radon transform. This will reduce the accuracy of Radon transform de-noising especially for weak signals. Formulating the Radon transform as an inverse problem with an L1 misfit makes it robust to outliers caused by source interference. This provides the ability to attenuate strong source interference while preserving weak underlying signal. In order to improve coherent signal focusing, an apex shifted hyperbolic Radon transform (ASHRT) is used to remove source interferences. ASHRT transform basis functions are tailored to match the travel time hyperbolas of reflections in common receiver gathers. However, the ASHRT transform has a high computational cost due to the extension of the model dimensions by scanning for apex locations. By reformulating the ASHRT operator using a Stolt migration/demigration kernel that exploits the Fast Fourier Transform (FFT), the computational efficiency of the operator is drastically improved.
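The robustness argument above (an L1 misfit down-weights the high-amplitude source interference treated as outliers) can be illustrated with a generic iteratively reweighted least-squares inversion. The sketch below uses an arbitrary dense matrix as a stand-in for a Radon-type operator; the actual ASHRT/Stolt operator is far more structured and efficient.

```python
# Generic sketch of the robust (L1-misfit) inversion idea described above,
# using iteratively reweighted least squares (IRLS). The operator G here is an
# arbitrary dense matrix standing in for a Radon-type operator; the actual
# ASHRT/Stolt implementation is far more structured and efficient.
import numpy as np

def irls_l1(G, d, n_iter=10, eps=1e-3):
    """Minimize ||G m - d||_1 approximately via reweighted least squares."""
    m = np.linalg.lstsq(G, d, rcond=None)[0]          # L2 starting model
    for _ in range(n_iter):
        r = G @ m - d
        w = 1.0 / np.sqrt(r ** 2 + eps ** 2)          # down-weight large residuals
        Wsqrt = np.sqrt(w)
        m = np.linalg.lstsq(G * Wsqrt[:, None], d * Wsqrt, rcond=None)[0]
    return m

rng = np.random.default_rng(0)
G = rng.normal(size=(200, 50))
m_true = np.zeros(50); m_true[[5, 20]] = 1.0
d = G @ m_true
d[::17] += 10.0 * rng.normal(size=d[::17].shape)       # strong, sparse "interference"
m_est = irls_l1(G, d)
print(np.round(m_est[[5, 20]], 2))                     # close to 1.0 despite outliers
```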
... These challenges often make isolating individual sounds particularly difficult. Humans possess a remarkable ability to focus on specific sound sources, exemplified by the "cocktail party effect" in speech (Cherry, 1953). Similarly, in music, listeners can selectively attend to individual instruments (Bregman, 1984;McAdams & Bregman, 1979), discerning the unique contribution of each to the overall composition. ...
Preprint
In this work, we demonstrate the integration of a score-matching diffusion model into a deterministic architecture for time-domain musical source extraction, resulting in enhanced audio quality. To address the typically slow iterative sampling process of diffusion models, we apply consistency distillation and reduce the sampling process to a single step, achieving performance comparable to that of diffusion models, and with two or more steps, even surpassing them. Trained on the Slakh2100 dataset for four instruments (bass, drums, guitar, and piano), our model shows significant improvements across objective metrics compared to baseline methods. Sound examples are available at https://consistency-separation.github.io/.
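To show what single-step (and optional two-step) sampling with a distilled consistency model looks like operationally, here is a conceptual sketch. The `consistency_fn` is a placeholder for a trained network and the noise-schedule values are assumptions, not the paper's settings.

```python
# Conceptual sketch of the sampling scheme mentioned above: a distilled
# consistency model maps a noisy input directly to an estimate of the clean
# signal in one step, and can optionally be applied again after re-noising to a
# smaller noise level. `consistency_fn` is a placeholder for a trained network;
# the noise schedule values are assumptions.
import numpy as np

rng = np.random.default_rng(0)
SIGMA_MAX, SIGMA_MID = 80.0, 10.0

def consistency_fn(x_noisy, sigma):
    """Placeholder for the trained consistency model f_theta(x, sigma)."""
    return x_noisy / (1.0 + sigma)      # stand-in; a real model is a neural net

def sample(shape, steps=1):
    x = SIGMA_MAX * rng.normal(size=shape)            # start from pure noise
    x0 = consistency_fn(x, SIGMA_MAX)                 # single-step estimate
    if steps >= 2:                                    # optional refinement step
        x = x0 + SIGMA_MID * rng.normal(size=shape)   # re-noise to a lower level
        x0 = consistency_fn(x, SIGMA_MID)
    return x0

one_step = sample((4, 16000), steps=1)    # e.g. 1 s of audio at 16 kHz per item
two_step = sample((4, 16000), steps=2)
```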
... The cocktail party problem [1], [2], referring to multi-talker overlapped speech recognition, is critical for enabling automatic speech recognition (ASR) scenarios such as automatic meeting transcription, automatic captioning for audio/video recordings, and multi-party human-machine interactions, where overlapping speech is commonly observed and all streams need to be transcribed. The problem remains one of the hardest in ASR, despite encouraging progress [3], [4], [5], [6]. ...
Preprint
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state of the art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into three sub-tasks: frame-wise interpreting, utterance-level speaker tracing, and speech recognition. The pretraining regimen uses these modules to solve progressively harder tasks. Transfer learning leverages parallel clean speech to improve the training targets for the network. Our discriminative training formulation is a modification of standard formulations, that also penalizes competing outputs of the system. Experiments are conducted on the artificial overlapped Switchboard and hub5e-swb dataset. The proposed framework achieves over 30% relative improvement of WER over both a strong jointly trained system, PIT for ASR, and a separately optimized system, PIT for speech separation with clean speech ASR model. The improvement comes from better model generalization, training efficiency and the sequence level linguistic knowledge integration.
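A very rough sketch of the modular idea (three stacked modules trained progressively, with earlier ones frozen as later ones are added) follows. The module internals, sizes, and optimizer choices are placeholders, not the paper's architecture or training recipe.

```python
# Very rough sketch of the modular idea described above: three stacked modules
# (frame-wise interpretation, speaker tracing, recognition) trained
# progressively, with earlier modules frozen while later ones are added. The
# module internals and sizes are placeholders, not the paper's architecture.
import torch
import torch.nn as nn

feat_dim, hid, n_tokens = 80, 256, 1000

frame_module = nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU())
tracing_module = nn.GRU(hid, hid, batch_first=True)       # orders frames by speaker
recognition_module = nn.Linear(hid, n_tokens)              # frame-wise token logits

def forward(features):                                      # features: (B, T, feat_dim)
    h = frame_module(features)
    h, _ = tracing_module(h)
    return recognition_module(h)

def progressive_stage(trainable, frozen):
    """Freeze earlier modules, train only the newly added ones."""
    for module in frozen:
        for p in module.parameters():
            p.requires_grad = False
    return torch.optim.Adam(
        [p for m in trainable for p in m.parameters()], lr=1e-4)

# Stage 1: train the frame module alone; Stage 2: add tracing; Stage 3: add ASR.
opt1 = progressive_stage([frame_module], [])
opt2 = progressive_stage([tracing_module], [frame_module])
opt3 = progressive_stage([recognition_module], [frame_module, tracing_module])
```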
... Humans are remarkably capable of focusing their auditory attention on a single sound source within a noisy environment, while deemphasizing ("muting") all other voices and sounds. The way neural systems achieve this feat, which is known as the cocktail party effect [Cherry 1953], remains unclear. However, research has shown that viewing a speaker's face enhances a person's capacity to resolve perceptual ambiguity in a noisy environment [Golumbic et al. 2013;Ma et al. 2009]. ...
Preprint
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).
... Multispeaker babble noise is one of the frequently encountered interferences in daily life that greatly degrades the quality and intelligibility of a target speech signal. The problem of understanding the desired speech in the presence of other interfering speech signals and background noise (also known as the "cocktail party problem") has received great attention since it was popularized by Cherry in 1953 [1]. Different auditory aspects of this problem are investigated (e.g. ...
Preprint
Deriving a good model for multitalker babble noise can facilitate different speech processing algorithms, e.g. noise reduction, to reduce the so-called cocktail party difficulty. In the available systems, the fact that the babble waveform is generated as a sum of N different speech waveforms is not exploited explicitly. In this paper, first we develop a gamma hidden Markov model for power spectra of the speech signal, and then formulate it as a sparse nonnegative matrix factorization (NMF). Second, the sparse NMF is extended by relaxing the sparsity constraint, and a novel model for babble noise (gamma nonnegative HMM) is proposed in which the babble basis matrix is the same as the speech basis matrix, and only the activation factors (weights) of the basis vectors are different for the two signals over time. Finally, a noise reduction algorithm is proposed using the derived speech and babble models. All of the stationary model parameters are estimated using the expectation-maximization (EM) algorithm, whereas the time-varying parameters, i.e. the gain parameters of speech and babble signals, are estimated using a recursive EM algorithm. The objective and subjective listening evaluations show that the proposed babble model and the final noise reduction algorithm significantly outperform the conventional methods.
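The key modelling choice described above is that babble shares the speech basis matrix and differs only in its activations. A simplified stand-in using plain Euclidean NMF with the basis held fixed is sketched below; the paper's actual model uses gamma nonnegative HMMs.

```python
# Sketch of the shared-basis idea described above: babble is modeled with the
# *speech* basis matrix W (kept fixed), and only the activations H are learned
# for the babble spectrogram, here with standard multiplicative NMF updates and
# a Euclidean cost. The full paper uses gamma (nonnegative) HMMs; this is a
# simplified stand-in.
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames, n_basis = 129, 200, 40

W_speech = np.abs(rng.normal(size=(n_freq, n_basis)))   # pretrained speech basis (placeholder)
V_babble = np.abs(rng.normal(size=(n_freq, n_frames)))  # babble magnitude spectrogram

H = np.abs(rng.normal(size=(n_basis, n_frames)))        # babble activations to learn
for _ in range(100):
    # Multiplicative update for H with W fixed (Lee & Seung, Euclidean cost).
    H *= (W_speech.T @ V_babble) / (W_speech.T @ W_speech @ H + 1e-12)

V_model = W_speech @ H                                   # babble modeled on speech basis
print(np.linalg.norm(V_babble - V_model) / np.linalg.norm(V_babble))
```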
... Since the cocktail party problem was initially formalized [3], a large number of potential solutions have been proposed [5], and the most popular techniques originate from the field of Computational Auditory Scene Analysis (CASA) [6]- [10]. In CASA, different segmentation and grouping rules are used to group Time-Frequency (T-F) units that are believed to belong to the same speaker. ...
Preprint
In this paper we propose the utterance-level Permutation Invariant Training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning based solution for speaker independent multi-talker speech separation. Specifically, uPIT extends the recently proposed Permutation Invariant Training (PIT) technique with an utterance-level cost function, hence eliminating the need for solving an additional permutation problem during inference, which is otherwise required by frame-level PIT. We achieve this using Recurrent Neural Networks (RNNs) that, during training, minimize the utterance-level separation error, hence forcing separated frames belonging to the same speaker to be aligned to the same output stream. In practice, this allows RNNs, trained with uPIT, to separate multi-talker mixed speech without any prior knowledge of signal duration, number of speakers, speaker identity or gender. We evaluated uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on Non-negative Matrix Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network (DANet). Furthermore, we found that models trained with uPIT generalize well to unseen speakers and languages. Finally, we found that a single model, trained with uPIT, can handle both two-speaker, and three-speaker speech mixtures.
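The distinction the abstract draws between frame-level PIT and utterance-level PIT can be written down compactly: the former picks the best output-to-speaker permutation independently per frame, while uPIT commits to one permutation for the whole utterance. A toy NumPy sketch of both costs follows (illustrative, not the paper's RNN training code).

```python
# Minimal sketch contrasting frame-level PIT with utterance-level PIT (uPIT),
# following the description above: frame-level PIT picks the best permutation
# independently per frame, whereas uPIT commits to one permutation for the
# whole utterance. Toy NumPy arrays stand in for network outputs and targets.
import itertools
import numpy as np

def pairwise_mse(est, ref):
    """est, ref: (n_spk, n_frames, n_feat) -> (n_spk_est, n_spk_ref, n_frames)."""
    diff = est[:, None] - ref[None, :]
    return np.mean(diff ** 2, axis=-1)

def frame_level_pit(est, ref):
    errs = pairwise_mse(est, ref)                       # (S, S, T)
    perms = list(itertools.permutations(range(est.shape[0])))
    per_perm = np.stack([errs[list(p), range(len(p))].sum(0) for p in perms])
    return per_perm.min(axis=0).mean()                  # best permutation per frame

def utterance_level_pit(est, ref):
    errs = pairwise_mse(est, ref).mean(axis=-1)         # (S, S), averaged over frames
    perms = itertools.permutations(range(est.shape[0]))
    return min(sum(errs[i, p[i]] for i in range(est.shape[0])) for p in perms)

rng = np.random.default_rng(0)
est = rng.normal(size=(2, 100, 129))    # 2 estimated sources, 100 frames, 129 bins
ref = rng.normal(size=(2, 100, 129))
print(frame_level_pit(est, ref), utterance_level_pit(est, ref))
```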
... The human hearing sense is very good at focusing on a single source of interest and following a conversation even when several people are speaking at the same time. This ability is known as the cocktail party effect [1]. To operate in human and natural settings, autonomous mobile robots should be able to do the same. ...
Preprint
Full-text available
This paper describes a system that gives a mobile robot the ability to perform automatic speech recognition with simultaneous speakers. A microphone array is used along with a real-time implementation of Geometric Source Separation and a post-filter that gives a further reduction of interference from other sources. The post-filter is also used to estimate the reliability of spectral features and compute a missing feature mask. The mask is used in a missing feature theory-based speech recognition system to recognize the speech from simultaneous Japanese speakers in the context of a humanoid robot. Recognition rates are presented for three simultaneous speakers located at 2 meters from the robot. The system was evaluated on a 200 word vocabulary at different azimuths between sources, ranging from 10 to 90 degrees. Compared to the use of the microphone array source separation alone, we demonstrate an average reduction in relative recognition error rate of 24% with the post-filter and of 42% when the missing features approach is combined with the post-filter. We demonstrate the effectiveness of our multi-source microphone array post-filter and the improvement it provides when used in conjunction with the missing features theory.
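As a hedged illustration of the missing-feature step described above, the sketch below derives a binary reliability mask from post-filter gains and imputes unreliable feature bins before decoding. The threshold, the gain-to-reliability mapping, and the imputation are assumptions; the paper's recognizer uses proper missing-feature decoding rather than imputation.

```python
# Sketch of the missing-feature idea described above: spectral features whose
# estimated reliability (here derived from a post-filter gain) falls below a
# threshold are marked unreliable, and the recognizer is meant to ignore or
# marginalize them. The threshold and the way reliability is derived are
# assumptions for illustration.
import numpy as np

def missing_feature_mask(postfilter_gain: np.ndarray, threshold: float = 0.25):
    """1 = reliable feature, 0 = missing/unreliable (to be marginalized)."""
    return (postfilter_gain >= threshold).astype(float)

rng = np.random.default_rng(0)
gain = rng.uniform(0.0, 1.0, size=(100, 24))     # toy post-filter gains (frames x mel bands)
features = rng.normal(size=(100, 24))            # toy log-mel features
mask = missing_feature_mask(gain)

# A simple stand-in for marginalization: replace unreliable features with the
# per-band mean of the reliable ones before decoding.
band_means = np.where(mask.sum(0) > 0,
                      (features * mask).sum(0) / np.maximum(mask.sum(0), 1), 0.0)
imputed = features * mask + (1 - mask) * band_means
```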
... First, on the cocktail party effect. Cherry (1953) ran laboratory experiments in which participants listened to two different messages from the same speaker and were instructed to separate them. Whether, and the extent to which, people can perform such tasks accurately depends on factors such as the direction from which the messages are coming, the pitch of the messages, and the rate of speech. ...
Article
Machines can now match, or outperform, human performance in several reasoning and decision tasks. Some say that all that intelligence amounts to is smart computation. This is not a new thesis, dating back to Leibniz as well as Simon and Newell, but what is new is what smart means. Today it is identified with complex statistics and optimisation. Simon's meaning of smart, however, rested on bounded rationality, a unified view of human and artificial decision making. This view was fleshed out by Gigerenzer as fast-and-frugal heuristics. Interestingly, such heuristics are typically sparse, as some machine learning models are optimised to be. So, one might hope that we can make sense of artificial intelligence in human terms after all, and face the upcoming challenges with open-mindedness and courage, just like Simon, and of course Wilkes, would have done.
... Would it matter what language the other guests were speaking? There is considerable literature on the cocktail party problem (Cherry, 1953), and most of this research is concerned with the mechanisms underlying speech recognition, spatial hearing, auditory masking and source segregation, among other factors (McDermott, 2009). To conduct such studies, experimenters primarily recruit listeners who are first-language speakers of the materials used in the test and tend to test monolingual listeners disregarding other language(s) they may have been exposed to (Linck, Osthus, Koeth & Bunting, 2014;Melby-Lervåg & Lervåg, 2014;Yow & Li, 2015). ...
Article
Full-text available
Cocktail party environments require listeners to tune in to a target voice while ignoring surrounding speakers. This presents unique challenges for bilingual listeners who have familiarity with several languages. Our study recruited English-French bilinguals to listen to a male target speaking French or English, masked by two female voices speaking French, English or Tamil, or by speech-shaped noise, in a fully factorial design. Listeners struggled most with L1 maskers and least with foreign maskers. Critically, this finding held regardless of the target language (L1 or L2), challenging theories about the linguistic component of informational masking, which, contrary to our results, predict stronger interference with greater target-to-masker similarity (e.g., L2 vs. L2 compared to L2 vs. L1). Our findings suggest that the listener’s familiarity with the masker language is an important source of informational masking in multilingual environments.
Article
Distant speech processing is a critical downstream application in speech and audio signal processing. Traditionally, researchers have addressed this challenge by breaking it down into distinct subproblems, encompassing the extraction of clean speech signals from noisy inputs, feature extraction, and transcription. This approach led to the development of modular distant automatic speech recognition (DASR) models, which are often designed as multiple stages in cascade, each corresponding to a specific subproblem. Recently, the surge in the capabilities of deep learning is propelling the popularity of purely end-to-end (E2E) models that employ a single large neural network to tackle an entire DASR task in an extremely data-driven manner. However, an alternative paradigm persists in the form of a modular model design, in which we can often leverage speech and signal processing models. Although this approach mirrors the multistage model, it is trained through an E2E process. This article overviews the recent development of DASR systems, focusing on E2E module-based models and showcasing successful downstream applications of model-based and data-driven audio signal processing.
Article
Research on endogenous auditory spatial attention typically uses headphones or sounds in the frontal hemispace, which undersamples panoramic spatial hearing. Crossmodal attention studies also show that visual information impacts spatial hearing and attention. Given the overlap between vision and audition in frontal space, we tested the hypothesis that the distribution of endogenous auditory spatial attention would differ when attending to the front versus back hemispace. Participants performed a non-spatial discrimination task where most sounds were presented at a standard location, but occasionally shifted to other locations. Auditory spatial attention cueing and gradient effects across locations were measured in five experiments. Accuracy was greatest at standard versus shift locations, and was comparable when attending to the front or back. Reaction time measures of cueing and gradient effects were larger when attending to the front versus back midline, a finding that was evident when the range of spatial locations was 180° or 360°. When participants were blindfolded, the front/back differences were still present. Sound localization and divided attention tasks showed that the front/back differences were attentional, rather than perceptual, in nature. Collectively, the findings reveal that the impact of endogenous auditory spatial attention depends on where attention is being focused.
Article
The concept of attention, which has interested thinkers since ancient times, began to be investigated with experimental methods once psychology emerged as a modern scientific discipline. The earliest theories of attention focused on two of its fundamental properties: selectivity and limited capacity. In later years, with the development of information technologies, the view that the human mind possesses an information-processing mechanism much like a computer was adopted, and such a mechanism required a system that controls the flow of incoming information, and hence attention. This control system was named 'cognitive control' and came to be regarded as one of the most important components of the information-processing system. The aim of this review is to survey and bring together the behavioral methods and models used in the field of control, drawing attention to gaps in the literature and to the lack of a comprehensive theory. The first models of cognitive control focused on the distinction between controlled and automatic behaviors and described the characteristic properties of each. Subsequent and more recent models have focused on when and where control is applied through supervisory units in the mind. To measure cognitive control, experimentally administered conflict tasks such as the Stroop and flanker tasks have been used. Through these tasks, many effects that illuminate the mechanisms of cognitive control have been uncovered. Among the most important of these are congruency proportion effects. By manipulating congruency proportions in various ways, new experimental methods have been developed, and these methods have revealed that attention can be controlled proactively, reactively, and in a context-dependent manner. Along with these effects, control models have been updated and new conceptual frameworks have been proposed. Nevertheless, no model has yet been put forward that can comprehensively explain all these effects, and several problems in the literature remain to be solved.
Article
Binaural unmasking is a remarkable phenomenon whereby it is substantially easier to detect a signal in noise when the interaural parameters of the signal differ from those of the noise – a useful mechanism in so-called cocktail party scenarios. In this study, we investigated the effect of binaural unmasking on neural tracking of the speech envelope. We measured EEG in 8 participants who listened to speech in noise at a fixed signal-to-noise ratio, in two conditions: one where the speech and noise had the same interaural phase difference (both speech and noise having an opposite waveform across ears, SπNπ), and one where the interaural phase difference of the speech was different from that of the noise (only the speech having an opposite waveform across ears, SπN). We measured a clear benefit of binaural unmasking in behavioural speech understanding scores, accompanied by increased neural tracking of the speech envelope. Moreover, analysing the temporal response functions revealed that binaural unmasking also resulted in decreased peak latencies and increased peak amplitudes. Our results are consistent with previous research using auditory evoked potentials and steady-state responses to quantify binaural unmasking at cortical levels. Moreover, they confirm that neural tracking of speech is associated with speech understanding, even if the acoustic signal-to-noise ratio is kept constant. From a clinical perspective, these results offer the potential for the objective evaluation of binaural speech understanding mechanisms, and the objective detection of pathologies sensitive to binaural processing, such as asymmetric hearing loss, auditory neuropathy and age-related deficits.
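Temporal response functions of the kind analysed above are commonly estimated by regressing EEG onto time-lagged copies of the speech envelope with ridge regularization. A generic sketch of that estimation (assumed lag range, sampling rate, and regularization; not this study's exact pipeline) is given below.

```python
# Sketch of how a temporal response function (TRF) relating the speech envelope
# to EEG can be estimated with time-lagged ridge regression (the general
# approach behind neural-tracking analyses; not this study's exact pipeline).
# Lag range, sampling rate and regularization are assumed values.
import numpy as np

def lagged_design(envelope, n_lags):
    """Rows = time points, columns = envelope at lags 0..n_lags-1."""
    T = len(envelope)
    X = np.zeros((T, n_lags))
    for k in range(n_lags):
        X[k:, k] = envelope[:T - k]
    return X

def fit_trf(envelope, eeg_channel, n_lags=64, lam=1e2):
    """Ridge regression: w = (X'X + lam*I)^-1 X'y."""
    X = lagged_design(envelope, n_lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(n_lags), X.T @ eeg_channel)

fs = 128                                     # assumed sampling rate (Hz)
rng = np.random.default_rng(0)
envelope = np.abs(rng.normal(size=60 * fs))  # 60 s of a toy speech envelope
true_trf = np.exp(-np.arange(64) / 10.0)     # toy kernel peaking near lag 0
eeg = np.convolve(envelope, true_trf)[:len(envelope)] + rng.normal(size=60 * fs)
trf = fit_trf(envelope, eeg)                 # estimated response, lags 0..~0.5 s
```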
Article
Full-text available
Understanding speech in noisy environments is often challenging, but is easier if we are listening to someone familiar—for example, naturally familiar people (e.g., friends, partners) or voices that have been familiarized artificially in the lab. Thus, familiarizing people with voices they regularly encounter (e.g., new friends and colleagues) could improve speech intelligibility in everyday life, which might be particularly useful for people who struggle to comprehend speech in noisy environments, such as older adults. Yet, we do not currently understand whether computer-based voice familiarization is effective when delivered remotely, outside of a lab setting, and whether it is effective for older adults. Here, in an online computer-based study, we examined whether learned voices are more intelligible than unfamiliar voices in 20 older (55–73 years) and 20 younger (18–34 years) adults. Both groups benefited from training, and the magnitude of the intelligibility benefit (approximately 30% improvement in sentence report, or 9 dB release from masking) was similar between groups. These findings demonstrate that older adults can learn new voices as effectively as younger adults for improving speech intelligibility, even given a relatively short (<1 hr) duration of familiarization that is delivered in the comfort of their own homes.