Article

The Cocktail Party Problem

Simon Haykin and Zhe Chen, Adaptive Systems Lab, McMaster University, Hamilton, Ontario, Canada L8S 4K1.
Neural Computation, October 2005; 17(9):1875-1902. DOI: 10.1162/0899766054322964
Source: PubMed

ABSTRACT

This review presents an overview of a challenging problem in auditory perception, the cocktail party phenomenon, the delineation of which goes back to a classic paper by Cherry in 1953. In this review, we address the following issues: (1) human auditory scene analysis, which is a general process carried out by the auditory system of a human listener; (2) insight into auditory perception, which is derived from Marr's vision theory; (3) computational auditory scene analysis, which focuses on specific approaches aimed at solving the machine cocktail party problem; (4) active audition, the proposal for which is motivated by analogy with active vision; and (5) discussion of brain theory and independent component analysis, on the one hand, and correlative neural firing, on the other.
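To make the machine side of the problem concrete: in the simplest, instantaneous (non-reverberant) setting, the cocktail party problem reduces to blind source separation, and independent component analysis, mentioned in point (5) above, is a classical tool for it. The following is a minimal sketch using synthetic signals and scikit-learn's FastICA; the signals, mixing matrix, and parameters are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of instantaneous blind source separation via ICA.
# All signals and the mixing matrix below are illustrative assumptions.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0.0, 8.0, 4000)

# Two independent "speakers": a sinusoid and a square-wave-like signal.
s1 = np.sin(2.0 * np.pi * 1.0 * t)
s2 = np.sign(np.sin(2.0 * np.pi * 0.3 * t))
S = np.c_[s1, s2] + 0.05 * rng.standard_normal((t.size, 2))  # add sensor noise

# Instantaneous (memoryless) mixing observed at two microphones.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = S @ A.T

# Recover the sources, up to permutation and scaling.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)   # estimated source signals
A_hat = ica.mixing_            # estimated mixing matrix

print("Estimated mixing matrix:\n", A_hat)
```

Real cocktail party recordings are convolutive rather than instantaneous (each microphone hears delayed, filtered copies of every talker), which is one reason the review treats ICA as only one ingredient of a broader auditory scene analysis solution.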

    • "In the decades since, a large number of additional studies and theoretical work has clarified what properties of the acoustic and perceptual situation contribute to the effect. Providing a comprehensive review is not possible in the limited space available for this article, but relevant surveys are provided by Darwin (1997), Yost (1997), Bronkhorst (2000), Haykin and Chen (2005), Schneider, Li, and Daneman (2007), and Moore and Gockel (2012), and more recent individual studies are cited in what follows. "
    ABSTRACT: An important application of cognitive architectures is to provide human performance models that capture psychological mechanisms in a form that can be "programmed" to predict task performance of human-machine system designs. Although many aspects of human performance have been successfully modeled in this approach, accounting for multitalker speech task performance is a novel problem. This article presents a model for performance in a two-talker task that incorporates concepts from psychoacoustics, in particular, masking effects and stream formation.
    Article · Jan 2016 · Topics in Cognitive Science
    • "Blind source separation (BSS) is an unsupervised technique for recovering the underling sources from a set of their mixtures . In acoustic applications [1], as the cocktail party problem [2] [3], the sources (speakers) are typically mixed in a convolutive manner, and the respective source separation task is referred to as convolutive BSS. The convolutive BSS problem is much more challenging compared with the instantaneous BSS, since the separation filters might have thousands of coefficients in a typical room environment. "
    ABSTRACT: A network of microphone pairs is utilized for the joint task of localizing and separating multiple concurrent speakers. The recently presented incremental distributed expectation-maximization (IDEM) algorithm addresses the first task, namely detection and localization. Here we extend this algorithm to address the second task, namely blind separation of the speech sources. We show that the proposed algorithm, denoted distributed algorithm for localization and separation (DALAS), is capable of separating speakers in a reverberant enclosure without a priori information on their number and locations. In the first stage of the proposed algorithm, the IDEM algorithm is applied to blindly detect the active sources and estimate their locations. In the second stage, the location estimates are utilized to select the most useful node of microphones for the subsequent separation stage. Separation is finally obtained by utilizing the hidden variables of the IDEM algorithm to construct masks for each source in the relevant node.
    Conference Paper · Sep 2015
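The final, mask-based stage described in the abstract above can be illustrated generically. The sketch below applies time-frequency masking in the STFT domain; for simplicity it uses "ideal" binary masks computed from the known sources, whereas DALAS derives its masks from the hidden variables of the IDEM algorithm, which is not reproduced here. All signals and parameters are illustrative assumptions.

```python
# Generic sketch of time-frequency mask-based source separation (STFT domain).
# Masks here are "ideal" binary masks computed from known sources, for
# illustration only; DALAS itself derives masks from IDEM hidden variables.
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(0, 2.0, 1.0 / fs)
s1 = np.sin(2 * np.pi * 440 * t)                # source 1: steady 440 Hz tone
s2 = np.sin(2 * np.pi * 1200 * t) * (t > 1.0)   # source 2: 1.2 kHz tone, gated on at 1 s
x = s1 + s2                                     # single-channel mixture

f, frames, X = stft(x, fs=fs, nperseg=256)
_, _, S1 = stft(s1, fs=fs, nperseg=256)
_, _, S2 = stft(s2, fs=fs, nperseg=256)

# Assign each time-frequency bin to the locally dominant source.
M1 = (np.abs(S1) >= np.abs(S2)).astype(float)
M2 = 1.0 - M1

_, s1_hat = istft(M1 * X, fs=fs, nperseg=256)
_, s2_hat = istft(M2 * X, fs=fs, nperseg=256)

n = min(s1.size, s1_hat.size)
err = s1[:n] - s1_hat[:n]
print("Source 1 signal-to-error energy ratio:", np.sum(s1[:n] ** 2) / np.sum(err ** 2))
```

Binary masking of this kind works because concurrent sources rarely dominate the same time-frequency bin at the same time, so assigning each bin to its dominant source recovers each signal reasonably well from a single mixture.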
    • "According to the empiricist point of view (Helmholtz, 1867), the auditory system must, therefore use heuristic computational processes, which are based on assumptions regarding the nature of the sound sources to determine the actual source configuration (see, however, the contrasting view of direct perception; Gibson, 1979). This function has been termed the ''auditory scene analysis'' by Bregman (1990; for recent reviews, see Ciocca, 2008; Denham & Winkler, 2014; Haykin & Chen, 2005; Shinn-Cunningham & Wang, 2008; Snyder & Alain, 2007). Many of these assumptions have been described as the laws of perception by the Gestalt school of psychology (Köhler, 1947). "
    ABSTRACT: Communication by sounds requires that the communication channels (i.e., speech/speakers and other sound sources) have been established. This allows the listener to separate concurrently active sound sources, to track their identity, to assess the type of message arriving from them, and to decide whether and when to react (e.g., reply to the message). We propose that these functions rely on a common generative model of the auditory environment. This model predicts upcoming sounds on the basis of representations describing temporal/sequential regularities. Predictions help to identify the continuation of previously discovered sound sources, and to detect the emergence of new sources as well as changes in the behavior of the known ones. The model produces auditory event representations which provide a full sensory description of the sounds, including their relation to the auditory context and the current goals of the organism. Event representations can be consciously perceived and serve as objects in various cognitive operations.
    Article · Jul 2015 · Brain and Language