The Cocktail Party Problem

Adaptive Systems Lab, McMaster University, Hamilton, Ontario, Canada L8S 4K1.
Neural Computation (Impact Factor: 2.21). 10/2005; 17(9):1875-902. DOI: 10.1162/0899766054322964
Source: PubMed


This review presents an overview of a challenging problem in auditory perception, the cocktail party phenomenon, the delineation of which goes back to a classic paper by Cherry in 1953. In this review, we address the following issues: (1) human auditory scene analysis, which is a general process carried out by the auditory system of a human listener; (2) insight into auditory perception, which is derived from Marr's vision theory; (3) computational auditory scene analysis, which focuses on specific approaches aimed at solving the machine cocktail party problem; (4) active audition, the proposal for which is motivated by analogy with active vision, and (5) discussion of brain theory and independent component analysis, on the one hand, and correlative neural firing, on the other.
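One thread of the review, the connection between the cocktail party problem and independent component analysis, can be made concrete with a toy sketch: two synthetic "voices" are mixed instantaneously and then unmixed by a small FastICA-style fixed-point iteration. Everything below (signals, mixing matrix, parameters) is illustrative and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
t = np.linspace(0, 8, n)

# Two independent, non-Gaussian "voices"
s1 = np.sign(np.sin(3 * t))               # square wave
s2 = rng.uniform(-1, 1, n)                # uniform noise
S = np.vstack([s1, s2])

# Instantaneous (memoryless) mixing: x = A s
A = np.array([[1.0, 0.6], [0.5, 1.0]])
X = A @ S

# Whiten the mixtures to unit covariance
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(Xc @ Xc.T / n)
Z = E @ np.diag(d ** -0.5) @ E.T @ Xc

# FastICA fixed-point iteration (tanh nonlinearity) with deflation
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        g = np.tanh(w @ Z)
        w_new = (Z * g).mean(axis=1) - (1 - g ** 2).mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)   # stay orthogonal to rows already found
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1) < 1e-10
        w = w_new
        if converged:
            break
    W[i] = w

Y = W @ Z   # recovered sources, up to permutation, sign, and scale
```

The recovered rows of `Y` match the original sources up to the usual ICA ambiguities (order, sign, and scale); in a real cocktail party recording the mixing is convolutive rather than instantaneous, which is precisely what makes the machine version of the problem hard.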

Available from: Zhe Chen, Aug 21, 2014
    • "Blind source separation (BSS) is an unsupervised technique for recovering the underlying sources from a set of their mixtures. In acoustic applications [1], such as the cocktail party problem [2] [3], the sources (speakers) are typically mixed in a convolutive manner, and the respective source separation task is referred to as convolutive BSS. The convolutive BSS problem is much more challenging than instantaneous BSS, since the separation filters may have thousands of coefficients in a typical room environment."
    ABSTRACT: A network of microphone pairs is utilized for the joint task of localizing and separating multiple concurrent speakers. The recently presented incremental distributed expectation-maximization (IDEM) algorithm addresses the first task, namely detection and localization. Here we extend this algorithm to address the second task, namely blind separation of the speech sources. We show that the proposed algorithm, denoted the distributed algorithm for localization and separation (DALAS), is capable of separating speakers in a reverberant enclosure without a priori information on their number and locations. In the first stage of the proposed algorithm, the IDEM algorithm is applied to blindly detect the active sources and to estimate their locations. In the second stage, the location estimates are used to select the most useful node of microphones for the subsequent separation stage. Separation is finally obtained by using the hidden variables of the IDEM algorithm to construct masks for each source in the relevant node.
    23rd European Signal Processing Conference (EUSIPCO); 09/2015
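The mask-based separation described in this abstract can be illustrated in its idealized form: an oracle binary mask in the STFT domain assigns each time-frequency bin to whichever source dominates it (the algorithm above estimates such masks blindly; here the sources are known). A minimal single-channel sketch with two synthetic narrowband "speakers", all signals and parameters illustrative:

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 220 * t)          # low-frequency "speaker"
s2 = np.sin(2 * np.pi * 1800 * t)         # high-frequency "speaker"
x = s1 + s2                               # single-channel mixture

# Frame the signals and take the STFT (Hann window, 50% overlap)
N, hop = 256, 128
win = np.hanning(N)
starts = range(0, len(x) - N, hop)
X  = np.fft.rfft([x[i:i + N] * win for i in starts], axis=1)
S1 = np.fft.rfft([s1[i:i + N] * win for i in starts], axis=1)
S2 = np.fft.rfft([s2[i:i + N] * win for i in starts], axis=1)

# Oracle binary mask: keep the bins where source 1 dominates
mask1 = np.abs(S1) > np.abs(S2)
Y1 = X * mask1

# Overlap-add resynthesis of the first source
y1 = np.zeros(len(x))
for k, frame in enumerate(np.fft.irfft(Y1, n=N, axis=1)):
    y1[k * hop:k * hop + N] += frame
```

Because the two tones occupy disjoint frequency bins, the masked reconstruction `y1` closely tracks `s1`; real speech sources overlap in time-frequency, which is why estimated (rather than oracle) masks only approximate this result.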
    • "According to the empiricist point of view (Helmholtz, 1867), the auditory system must therefore use heuristic computational processes, based on assumptions regarding the nature of the sound sources, to determine the actual source configuration (see, however, the contrasting view of direct perception; Gibson, 1979). This function has been termed 'auditory scene analysis' by Bregman (1990; for recent reviews, see Ciocca, 2008; Denham & Winkler, 2014; Haykin & Chen, 2005; Shinn-Cunningham & Wang, 2008; Snyder & Alain, 2007). Many of these assumptions have been described as laws of perception by the Gestalt school of psychology (Köhler, 1947)."
    ABSTRACT: Communication by sounds requires that the communication channels (i.e., speech/speakers and other sound sources) have been established. This allows the listener to separate concurrently active sound sources, to track their identity, to assess the type of message arriving from them, and to decide whether and when to react (e.g., reply to the message). We propose that these functions rely on a common generative model of the auditory environment. This model predicts upcoming sounds on the basis of representations describing temporal/sequential regularities. Predictions help to identify the continuation of previously discovered sound sources, to detect the emergence of new sources, and to register changes in the behavior of known ones. The model produces auditory event representations which provide a full sensory description of the sounds, including their relation to the auditory context and to the current goals of the organism. Event representations can be consciously perceived and serve as objects in various cognitive operations.
    Brain and Language 07/2015; 148:1-22. DOI:10.1016/j.bandl.2015.05.003 · 3.22 Impact Factor
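The abstract's idea of predicting upcoming sounds from learned sequential regularities can be caricatured in a few lines: a first-order transition model accumulates statistics over a regular tone pattern and flags a sound whose predicted probability is low. This is a toy illustration of regularity-based deviance detection, not the authors' model; all names and the threshold are invented for the sketch:

```python
from collections import defaultdict

def deviants(sequence, threshold=0.1):
    """Flag sounds that violate the first-order regularities
    learned from the sequence so far."""
    counts = defaultdict(lambda: defaultdict(int))
    flags = [False]                       # the first sound has no context
    for prev, cur in zip(sequence, sequence[1:]):
        total = sum(counts[prev].values())
        # Predicted probability of `cur` given `prev`; unseen contexts
        # are treated as uninformative rather than surprising
        p = counts[prev][cur] / total if total else 1.0
        flags.append(p < threshold)
        counts[prev][cur] += 1
    return flags

seq = ["A", "B"] * 50 + ["A", "C"]        # regular alternation, then a deviant
f = deviants(seq)                         # only the final "C" is flagged
```

A predictive model of this kind supports exactly the functions the abstract lists: the expected continuation identifies a known source, while a low-probability sound signals a change or a new source.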
    • "We can also extract, within a group of speakers talking simultaneously, the utterance emitted by the person we wish to focus on. Known as the cocktail party effect [1], this separation capacity enables us to efficiently and selectively process the acoustic data coming from our daily environment. Sensitive to the slightest tone and level variations of an audio message, we have developed a faculty to recognize its origin (ringtone, voice of a colleague, etc.) and to interpret its contents."
    ABSTRACT: This paper attempts to provide a state of the art of sound source localization in robotics. Notably, this context raises original constraints (e.g., embeddability, real time, broadband environments, noise, and reverberation) which are seldom simultaneously taken into account in acoustics or signal processing. A comprehensive review of recent robotics achievements is proposed, be they binaural or rooted in array processing techniques. Connections are highlighted with the underlying theory as well as with elements of the physiology and neurology of human hearing.
    Computer Speech & Language 03/2015; 34(1). DOI:10.1016/j.csl.2015.03.003 · 1.75 Impact Factor
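A standard building block of the localization methods this survey covers is time-difference-of-arrival (TDOA) estimation between a microphone pair, commonly done with the GCC-PHAT generalized cross-correlation. A minimal sketch with a synthetic broadband source and an artificial delay (the signals and numbers are illustrative, not from the paper):

```python
import numpy as np

def gcc_phat(ref, sig, fs):
    """Return the delay (in seconds) of `sig` relative to `ref`,
    estimated with the GCC-PHAT generalized cross-correlation."""
    n = len(ref) + len(sig)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12            # PHAT weighting: keep only the phase
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs = 16000
rng = np.random.default_rng(1)
s = rng.standard_normal(fs)           # broadband "speech-like" source
d = 12                                # true inter-microphone delay, in samples
mic1 = s
mic2 = np.concatenate((np.zeros(d), s[:-d]))
tau = gcc_phat(mic1, mic2, fs)        # estimated TDOA in seconds
```

The PHAT weighting discards magnitude and keeps only phase, which is what makes the estimator comparatively robust to the reverberation and broadband noise the survey highlights as robotics-specific constraints.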