Conference Paper

MAD-EEG: an EEG dataset for decoding auditory attention to a target instrument in polyphonic music

Authors:
  • Giorgia Cantisani (Télécom Paris)
  • Gabriel Trégoat (Télécom Paris)
  • Slim Essid (Télécom Paris)
  • Gaël Richard (Télécom Paris)

... In this thesis, we consider these sound sources to be speech signals, although other sources, such as music [9][10][11][12][13][14][15][16], are also studied in this context. Furthermore, for the sake of an easy exposition, we assume that there are two competing speakers. ...
Thesis
Full-text available
One in five people experiences hearing loss. The World Health Organization estimates that this number will increase to one in four by 2050. Luckily, effective hearing devices such as hearing aids and cochlear implants exist, with advanced speaker enhancement algorithms that can significantly improve the quality of life of people suffering from hearing loss. State-of-the-art hearing devices, however, underperform in a so-called 'cocktail party' scenario, when multiple persons are talking simultaneously (such as at a family dinner or reception). In such a situation, the hearing device does not know which speaker the user intends to attend to, and thus which speaker to enhance and which other ones to suppress. Therefore, a new problem arises in such cocktail party scenarios: determining which speaker a user is attending to, referred to as the auditory attention decoding (AAD) problem. The problem of selecting the attended speaker could be tackled using simple heuristics such as selecting the loudest speaker or the one in the user's look direction. However, a potentially better approach is decoding the auditory attention from where it originates, i.e., the brain. Using neurorecording techniques such as electroencephalography (EEG), it is possible to perform AAD, for example, by reconstructing the attended speech envelope from the EEG using a neural decoder (i.e., the stimulus reconstruction (SR) algorithm). Integrating AAD algorithms in a hearing device could then lead to a so-called 'neuro-steered hearing device'. These traditional AAD algorithms are, however, not fast enough to adequately react to a switch in auditory attention, and are supervised and fixed over time, not adapting to non-stationarities in the EEG and audio data. Therefore, the general aim of this thesis is to develop novel signal processing algorithms for EEG-based AAD that allow fast, accurate, unsupervised, and time-adaptive decoding of the auditory attention. In the first part of the thesis, we compare different AAD algorithms, which allows us to identify the gaps in the current AAD literature that are partly addressed in this thesis. To be able to perform this comparative study, we develop a new performance metric to evaluate AAD algorithms in the context of adaptive gain control for neuro-steered hearing devices. In the second part, we address one of the main signal processing challenges in AAD: unsupervised and time-adaptive algorithms. We first develop an unsupervised version of the stimulus decoder that can be trained on a large batch of EEG and audio data without knowledge of ground-truth labels on the attention. This unsupervised but subject-specific stimulus decoder, starting from a random initial decoder, outperforms a supervised subject-independent decoder, and, using subject-independent information, even approximates the performance of a supervised subject-specific decoder. We also extend this unsupervised algorithm to an efficient time-adaptive algorithm, when EEG and audio are continuously streaming in, and show that it has the potential to outperform a fixed supervised decoder in a practical use case of AAD. In the third part, we develop novel AAD algorithms that decode the spatial focus of auditory attention to provide faster and more accurate decoding. The developed methods achieve a much higher accuracy compared to the SR algorithm at a very fast decision rate. Furthermore, we show that these methods are also applicable to different directions of auditory attention, using only EEG channels close to the ears, and when generalizing to data from an unseen subject. To summarize, in this thesis we have developed crucial building blocks for a plug-and-play, time-adaptive, unsupervised, fast, and accurate AAD algorithm that could be integrated with a low-latency speaker separation and enhancement algorithm and a wearable, miniaturized EEG system, to eventually lead to a neuro-steered hearing device.
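As a point of reference for the stimulus-reconstruction (SR) approach mentioned above, the following sketch shows the backward-model idea on synthetic data: a ridge-regression decoder maps time-lagged EEG to the attended envelope, and attention is decoded by correlating the reconstruction with the two candidate envelopes. The sampling rate, lag range, regularisation value and variable names are illustrative assumptions, not the thesis's pipeline.

```python
# Minimal stimulus-reconstruction (backward model) sketch for auditory attention
# decoding on synthetic data. Lag range, ridge strength and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
fs = 64                            # assumed EEG sampling rate (Hz)
n_ch, n_lags, lam = 16, 16, 1e2    # channels, time lags (~0-250 ms), ridge strength

def lagged(eeg, n_lags):
    """Stack time-lagged copies of the EEG into a (T, n_ch * n_lags) design matrix."""
    T, C = eeg.shape
    X = np.zeros((T, C * n_lags))
    for l in range(n_lags):
        X[l:, l * C:(l + 1) * C] = eeg[:T - l]
    return X

# toy data: EEG weakly driven by the "attended" envelope, plus noise
T = 60 * fs
env_att = np.convolve(np.abs(rng.standard_normal(T)), np.ones(8) / 8, mode="same")
env_un = np.convolve(np.abs(rng.standard_normal(T)), np.ones(8) / 8, mode="same")
mix = rng.standard_normal((n_ch, 1))
eeg = (mix @ env_att[None, :]).T + 5.0 * rng.standard_normal((T, n_ch))

# ridge decoder mapping lagged EEG -> attended envelope
X = lagged(eeg, n_lags)
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ env_att)

# decode attention by correlating the reconstruction with both candidate envelopes
# (in practice the decoder would be trained and evaluated on separate trials)
rec = X @ w
corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print("r(attended) =", corr(rec, env_att), " r(unattended) =", corr(rec, env_un))
print("decoded:", "attended" if corr(rec, env_att) > corr(rec, env_un) else "unattended")
```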
... MAD-EEG, represents our first main contribution and is available to the research community as a free resource. The dataset was acquired by my colleague Gabriel Trégoat during his internship at Télécom Paris and finalised by me, leading to the following conference publication: • Cantisani, Giorgia, Gabriel Trégoat, Slim Essid, and Gaël Richard (2019b). "MAD-EEG: an EEG dataset for decoding auditory attention to a target instrument in polyphonic music". ...
Thesis
In this PhD thesis, we address the challenge of integrating Brain-Computer Interfaces (BCI) and music technologies on the specific application of music source separation, which is the task of isolating individual sound sources that are mixed in the audio recording of a musical piece. This problem has been investigated for decades, but never considering BCI as a possible way to guide and inform separation systems. Specifically, we explored how the neural activity characterized by electroencephalographic signals (EEG) reflects information about the attended instrument and how we can use it to inform a source separation system. First, we studied the problem of EEG-based auditory attention decoding of a target instrument in polyphonic music, showing that the EEG tracks musically relevant features which are highly correlated with the time-frequency representation of the attended source and only weakly correlated with the unattended one. Second, we leveraged this "contrast" to inform an unsupervised source separation model based on a novel non-negative matrix factorisation (NMF) variant, named contrastive-NMF (C-NMF), and automatically separate the attended source. Unsupervised NMF represents a powerful approach in such applications with no or limited amounts of training data, as when neural recording is involved. Indeed, the available music-related EEG datasets are still costly and time-consuming to acquire, precluding the possibility of tackling the problem with fully supervised deep learning approaches. Thus, in the last part of the thesis, we explored alternative learning strategies to alleviate this problem. Specifically, we propose to adapt a state-of-the-art music source separation model to a specific mixture using the time activations of the sources derived from the user's neural activity. This paradigm can be referred to as one-shot adaptation, as it acts on the target song instance only. We conducted an extensive evaluation of the proposed systems on the MAD-EEG dataset, which was specifically assembled for this study, obtaining encouraging results, especially in difficult cases where non-informed models struggle.
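The contrastive-NMF model itself is specific to the thesis; as background, the sketch below shows the plain multiplicative-update NMF baseline that such an EEG-informed term would extend, run on a toy non-negative "spectrogram". The rank, iteration count and data are illustrative.

```python
# Plain NMF with multiplicative updates (Frobenius cost) on a toy magnitude
# spectrogram. The C-NMF described above adds an EEG-derived contrast term to
# this objective; that term is omitted here, so this is only the baseline.
import numpy as np

rng = np.random.default_rng(0)
F, T, K = 128, 200, 4                       # frequency bins, time frames, components
V = np.abs(rng.standard_normal((F, T)))     # toy non-negative "spectrogram"

W = np.abs(rng.standard_normal((F, K)))     # spectral templates
H = np.abs(rng.standard_normal((K, T)))     # time activations
eps = 1e-12

for _ in range(200):                        # standard multiplicative updates
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print("relative reconstruction error:", np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```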
... Taken together, there are a range of data types that are suitable for the kinds of analytical approaches that are common in MIR but applied to a lesser degree in music psychology and neuroscience, and very rarely in music therapy. Within the MIR community, biomarkers and their relation to perceiving various aspects of music have attracted attention recently (Cantisani et al., 2019;Stober et al., 2014). A collaboration between MIR, music cognition, and music therapy would provide a highly promising area for research and application development. ...
Article
Full-text available
The fields of music, health, and technology have seen significant interactions in recent years in developing music technology for health care and well-being. In an effort to strengthen the collaboration between the involved disciplines, the workshop “Music, Computing, and Health” was held to discuss best practices and state-of-the-art at the intersection of these areas with researchers from music psychology and neuroscience, music therapy, music information retrieval, music technology, medical technology (medtech), and robotics. Following the discussions at the workshop, this article provides an overview of the different methods of the involved disciplines and their potential contributions to developing music technology for health and well-being. Furthermore, the article summarizes the state of the art in music technology that can be applied in various health scenarios and provides a perspective on challenges and opportunities for developing music technology that (1) supports person-centered care and evidence-based treatments, and (2) contributes to developing standardized, large-scale research on music-based interventions in an interdisciplinary manner. The article provides a resource for those seeking to engage in interdisciplinary research using music-based computational methods to develop technology for health care, and aims to inspire future research directions by evaluating the state of the art with respect to the challenges facing each field.
Conference Paper
Full-text available
Retrieving music information from brain activity is a challenging and still largely unexplored research problem. In this paper we investigate the possibility of reconstructing perceived and imagined musical stimuli from electroencephalography (EEG) recordings based on two datasets. One dataset contains multi-channel EEG of subjects listening to and imagining rhythmical patterns presented both as sine wave tones and short looped spoken utterances. These utterances leverage the well-known speech-to-song illusory transformation which results in very catchy and easy to reproduce motifs. A second dataset provides EEG recordings for the perception of 10 full-length songs. Using a multi-view deep generative model we demonstrate the feasibility of learning a shared latent representation of brain activity and auditory concepts, such as rhythmical motifs appearing across different instrumentations. Introspection of the model trained on the rhythm dataset reveals disentangled rhythmical and timbral features within and across subjects. The model allows continuous interpolation between representations of different observed variants of the presented stimuli. By decoding the learned embeddings we were able to reconstruct both perceived and imagined music. Stimulus complexity and the choice of training data show a strong effect on the reconstruction quality.
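The published model is a multi-view deep generative model; the sketch below only illustrates the general idea of a shared latent space, using a deterministic two-view autoencoder in PyTorch (one encoder/decoder per view plus a term tying the two latents together). All dimensions, layer sizes and names are assumptions.

```python
# Two-view autoencoder with a shared latent space, as a rough illustration of
# learning a joint representation of EEG and audio features. The published model
# is a (variational) multi-view generative model; this deterministic sketch only
# mirrors its encoder/decoder structure. Dimensions and names are assumptions.
import torch
import torch.nn as nn

d_eeg, d_audio, d_latent = 64, 40, 16

class View(nn.Module):
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, d_in))

eeg_view, audio_view = View(d_eeg, d_latent), View(d_audio, d_latent)
opt = torch.optim.Adam(list(eeg_view.parameters()) + list(audio_view.parameters()), lr=1e-3)

# toy paired data: EEG features and audio features for the same excerpts
x_eeg, x_audio = torch.randn(256, d_eeg), torch.randn(256, d_audio)

for _ in range(100):
    z_e, z_a = eeg_view.enc(x_eeg), audio_view.enc(x_audio)
    loss = (nn.functional.mse_loss(eeg_view.dec(z_e), x_eeg)       # reconstruct each view
            + nn.functional.mse_loss(audio_view.dec(z_a), x_audio)
            + nn.functional.mse_loss(z_e, z_a))                    # tie the two latents together
    opt.zero_grad()
    loss.backward()
    opt.step()

# "reconstructing" audio features from EEG alone: encode EEG, decode with the audio decoder
audio_from_eeg = audio_view.dec(eeg_view.enc(x_eeg))
print(audio_from_eeg.shape)
```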
Article
Full-text available
The relation between a stimulus and the evoked brain response can shed light on perceptual processes within the brain. Signals derived from this relation can also be harnessed to control external devices for Brain Computer Interface (BCI) applications. While the classic event-related potential (ERP) is appropriate for isolated stimuli, more sophisticated "decoding" strategies are needed to address continuous stimuli such as speech, music or environmental sounds. Here we describe an approach based on Canonical Correlation Analysis (CCA) that finds the optimal transform to apply to both the stimulus and the response to reveal correlations between the two. Compared to prior methods based on forward or backward models for stimulus-response mapping, CCA finds significantly higher correlation scores, thus providing increased sensitivity to relatively small effects, and supports classifier schemes that yield higher classification scores. CCA strips the brain response of variance unrelated to the stimulus, and the stimulus representation of variance that does not affect the response, and thus improves observations of the relation between stimulus and response.
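A minimal illustration of the CCA idea described above, using scikit-learn on synthetic stimulus and EEG matrices driven by a shared latent signal; the lagging and denoising steps used in the actual method are omitted.

```python
# Canonical Correlation Analysis between a stimulus representation and EEG:
# both sides are projected onto components that maximise their mutual correlation.
# Toy data; the temporal lagging and dimensionality reduction are omitted.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
T, n_stim, n_eeg = 2000, 10, 32
shared = rng.standard_normal((T, 3))                     # latent activity driving both sides
stim = shared @ rng.standard_normal((3, n_stim)) + 0.5 * rng.standard_normal((T, n_stim))
eeg = shared @ rng.standard_normal((3, n_eeg)) + 2.0 * rng.standard_normal((T, n_eeg))

cca = CCA(n_components=3)
stim_c, eeg_c = cca.fit_transform(stim, eeg)             # canonical components of each side

for k in range(3):                                       # correlation per component pair
    r = np.corrcoef(stim_c[:, k], eeg_c[:, k])[0, 1]
    print(f"component {k}: r = {r:.2f}")
```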
Article
Full-text available
Understanding how brains process sensory signals in natural environments is one of the key goals of twenty-first century neuroscience. While brain imaging and invasive electrophysiology will play key roles in this endeavor, there is also an important role to be played by noninvasive, macroscopic techniques with high temporal resolution such as electro- and magnetoencephalography. But challenges exist in determining how best to analyze such complex, time-varying neural responses to complex, time-varying and multivariate natural sensory stimuli. There has been a long history of applying system identification techniques to relate the firing activity of neurons to complex sensory stimuli and such techniques are now seeing increased application to EEG and MEG data. One particular example involves fitting a filter—often referred to as a temporal response function—that describes a mapping between some feature(s) of a sensory stimulus and the neural response. Here, we first briefly review the history of these system identification approaches and describe a specific technique for deriving temporal response functions known as regularized linear regression. We then introduce a new open-source toolbox for performing this analysis. We describe how it can be used to derive (multivariate) temporal response functions describing a mapping between stimulus and response in both directions. We also explain the importance of regularizing the analysis and how this regularization can be optimized for a particular dataset. We then outline specifically how the toolbox implements these analyses and provide several examples of the types of results that the toolbox can produce. Finally, we consider some of the limitations of the toolbox and opportunities for future development and application.
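The mTRF Toolbox itself is a MATLAB package; the sketch below only mirrors the core computation it implements, a forward temporal response function estimated by regularized (ridge) linear regression on a lagged stimulus envelope, with synthetic data and an arbitrary regularisation value.

```python
# Forward temporal response function (TRF) via ridge regression: map a lagged
# stimulus envelope to each EEG channel. Synthetic data; lag range and the ridge
# parameter are illustrative (in practice lambda would be cross-validated).
import numpy as np

rng = np.random.default_rng(0)
fs, T, n_ch = 64, 64 * 120, 8
lags = np.arange(0, int(0.4 * fs))            # roughly 0-400 ms of lags
lam = 1e3                                     # ridge parameter (assumed, not tuned)

env = np.convolve(np.abs(rng.standard_normal(T)), np.ones(8) / 8, mode="same")
true_trf = np.hanning(len(lags))              # toy ground-truth response shape
eeg = np.stack([np.convolve(env, true_trf, mode="full")[:T] + rng.standard_normal(T)
                for _ in range(n_ch)], axis=1)

# design matrix of lagged stimulus samples
X = np.zeros((T, len(lags)))
for i, l in enumerate(lags):
    X[l:, i] = env[:T - l]

# ridge solution, one TRF per channel: shape (lags, channels)
trf = np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)
pred = X @ trf                                # EEG predicted from the stimulus
r = [np.corrcoef(pred[:, c], eeg[:, c])[0, 1] for c in range(n_ch)]
print("mean prediction correlation:", float(np.mean(r)))
```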
Article
Full-text available
When listening to ensemble music, even nonmusicians can follow single instruments effortlessly. Electrophysiological indices for neural sensory encoding of separate streams have been described using oddball paradigms that utilize brain reactions to sound events that deviate from a repeating standard pattern. Obviously, these paradigms put constraints on the compositional complexity of the musical stimulus. Here, we apply a regression-based method of multivariate electroencephalogram (EEG) analysis in order to reveal the neural encoding of separate voices of naturalistic ensemble music that is based on cortical responses to tone onsets, such as N1/P2 event-related potential components. Music clips (resembling minimalistic electro-pop) were presented to 11 subjects, either in an ensemble version (drums, bass, keyboard) or in the corresponding 3 solo versions. For each instrument, we train a spatiotemporal regression filter that optimizes the correlation between EEG and a target function that represents the sequence of note onsets in the audio signal of the respective solo voice. This filter extracts an EEG projection that reflects the brain’s reaction to note onsets with enhanced sensitivity. We apply these instrument-specific filters to 61-channel EEG recorded during the presentations of the ensemble version and assess by means of correlation measures how strongly the voice of each solo instrument is reflected in the EEG. Our results show that the reflection of the melody instrument keyboard in the EEG exceeds that of the other instruments by far, suggesting a high-voice superiority effect in the neural representation of note onsets. Moreover, the results indicated that focusing attention on a particular instrument can enhance this reflection. We conclude that the voice-discriminating neural representation of tone onsets at the level of early auditory event-related potentials parallels the perceptual segregation of multivoiced music.
Article
Full-text available
Note onsets in music are acoustic landmarks providing auditory cues that underlie the perception of more complex phenomena such as beat, rhythm, and meter. For naturalistic ongoing sounds a detailed view on the neural representation of onset structure is hard to obtain, since, typically, stimulus-related EEG signatures are derived by averaging a high number of identical stimulus presentations. Here, we propose a novel multivariate regression-based method extracting onset-related brain responses from the ongoing EEG. We analyse EEG recordings of nine subjects who passively listened to stimuli from various sound categories encompassing simple tone sequences, full-length romantic piano pieces and natural (non-music) soundscapes. The regression approach reduces the 61-channel EEG to one time course optimally reflecting note onsets. The neural signatures derived by this procedure indeed resemble canonical onset-related ERPs, such as the N1-P2 complex. This EEG projection was then utilized to determine the Cortico-Acoustic Correlation (CACor), a measure of synchronization between EEG signal and stimulus. We demonstrate that a significant CACor (i) can be detected in an individual listener's EEG of a single presentation of a full-length complex naturalistic music stimulus, and (ii) it co-varies with the stimuli's average magnitudes of sharpness, spectral centroid, and rhythmic complexity. In particular, the subset of stimuli eliciting a strong CACor also produces strongly coordinated tension ratings obtained from an independent listener group in a separate behavioral experiment. Thus musical features that lead to a marked physiological reflection of tone onsets also contribute to perceived tension in music.
Article
Full-text available
People readily extract regularity in rhythmic auditory patterns, enabling prediction of the onset of the next beat. Recent magnetoencephalography (MEG) research suggests that such prediction is reflected by the entrainment of oscillatory networks in the brain to the tempo of the sequence. In particular, induced beta-band oscillatory activity from auditory cortex decreases after each beat onset and rebounds prior to the onset of the next beat across tempi in a predictive manner. The objective of the present study was to examine the development of such oscillatory activity by comparing electroencephalography (EEG) measures of beta-band fluctuations in 7-year-old children to adults. EEG was recorded while participants listened passively to isochronous tone sequences at three tempi (390, 585, and 780 ms for onset-to-onset interval). In adults, induced power in the high beta-band (20–25 Hz) decreased after each tone onset and rebounded prior to the onset of the next tone across tempo conditions, consistent with MEG findings. In children, a similar pattern was measured in the two slower tempo conditions, but was weaker in the fastest condition. The results indicate that the beta-band timing network works similarly in children, although there are age-related changes in consistency and the tempo range over which it operates.
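A rough sketch of how beta-band fluctuations of this kind can be extracted: band-pass the EEG around 20-25 Hz, take the Hilbert envelope, and average the single-trial power around tone onsets. The filter order, epoch window and the reuse of the 585 ms inter-onset interval are illustrative; strictly isolating induced (non-phase-locked) power would additionally require subtracting the evoked response, which is omitted here.

```python
# Beta-band (20-25 Hz) power fluctuations around tone onsets: band-pass, Hilbert
# envelope, then average the single-trial power over onsets. Toy single-channel EEG.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

rng = np.random.default_rng(0)
fs = 256
eeg = rng.standard_normal(fs * 120)                              # 2 minutes of toy EEG
onsets = np.arange(2 * fs, len(eeg) - 2 * fs, int(0.585 * fs))   # 585 ms inter-onset interval

b, a = butter(4, [20 / (fs / 2), 25 / (fs / 2)], btype="band")
beta_power = np.abs(hilbert(filtfilt(b, a, eeg))) ** 2           # instantaneous beta power

# epoch the power time course around each onset (-200 ms .. +585 ms) and average
pre, post = int(0.2 * fs), int(0.585 * fs)
epochs = np.stack([beta_power[t - pre:t + post] for t in onsets])
avg_power = epochs.mean(axis=0)
print("average beta power time course:", avg_power.shape,
      "minimum at", avg_power.argmin() / fs - 0.2, "s after onset")
```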
Article
Full-text available
Objective: Polyphonic music (music consisting of several instruments playing in parallel) is an intuitive way of embedding multiple information streams. The different instruments in a musical piece form concurrent information streams that seamlessly integrate into a coherent and hedonistically appealing entity. Here, we explore polyphonic music as a novel stimulation approach for use in a brain-computer interface. Approach: In a multi-streamed oddball experiment, we had participants shift selective attention to one out of three different instruments in music audio clips. Each instrument formed an oddball stream with its own specific standard stimuli (a repetitive musical pattern) and oddballs (deviating musical pattern). Main results: Contrasting attended versus unattended instruments, ERP analysis shows subject- and instrument-specific responses including P300 and early auditory components. The attended instrument can be classified offline with a mean accuracy of 91% across 11 participants. Significance: This is a proof of concept that attention paid to a particular instrument in polyphonic music can be inferred from ongoing EEG, a finding that is potentially relevant for both brain-computer interface and music research.
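A sketch of the kind of ERP classification such a multi-streamed oddball paradigm allows: epoch the EEG around deviants and classify attended versus unattended responses with a shrinkage LDA. The synthetic "P300-like" deflection, epoch window and classifier settings are assumptions, not the authors' pipeline.

```python
# ERP-based classification of attended vs. unattended oddball responses:
# epoch around deviant onsets and classify with shrinkage LDA. Synthetic data;
# epoch window and classifier settings are illustrative choices.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
fs, n_ch, n_trials = 64, 16, 200
win = int(0.8 * fs)                                  # 0..800 ms after each deviant

# toy epochs: deviants of the attended instrument carry a small "P300-like" deflection
epochs = rng.standard_normal((n_trials, n_ch, win))
labels = rng.integers(0, 2, n_trials)                # 1 = deviant in the attended stream
p300 = np.exp(-0.5 * ((np.arange(win) / fs - 0.3) / 0.05) ** 2)
epochs[labels == 1] += 0.5 * p300                    # add the deflection on all channels

X = epochs.reshape(n_trials, -1)                     # flatten channels x time as features
clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
print("cross-validated accuracy:", cross_val_score(clf, X, labels, cv=5).mean())
```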
Article
Full-text available
How humans solve the cocktail party problem remains unknown. However, progress has been made recently thanks to the realization that cortical activity tracks the amplitude envelope of speech. This has led to the development of regression methods for studying the neurophysiology of continuous speech. One such method, known as stimulus-reconstruction, has been successfully utilized with cortical surface recordings and magnetoencephalography (MEG). However, the former is invasive and gives a relatively restricted view of processing along the auditory hierarchy, whereas the latter is expensive and rare. Thus it would be extremely useful for research in many populations if stimulus-reconstruction was effective using electroencephalography (EEG), a widely available and inexpensive technology. Here we show that single-trial (≈60 s) unaveraged EEG data can be decoded to determine attentional selection in a naturalistic multispeaker environment. Furthermore, we show a significant correlation between our EEG-based measure of attention and performance on a high-level attention task. In addition, by attempting to decode attention at individual latencies, we identify neural processing at ∼200 ms as being critical for solving the cocktail party problem. These findings open up new avenues for studying the ongoing dynamics of cognition using EEG and for developing effective and natural brain-computer interfaces.
Conference Paper
Full-text available
This study proposes a six-step approach to analyze ongoing EEG elicited by 512-second long natural music (modern tango). The spectrogram of the ongoing EEG was first produced, and then a fourth-order tensor including the spectrograms of multiple channels of multiple participants was decomposed via nonnegative tensor factorization (NTF) into four factors, including temporal, spectral and spatial components, and multi-domain features of all participants. We found one extracted temporal component by NTF significantly (p < 0.01) correlated with the temporal course of a long-term music feature, ‘fluctuation centroid’; moreover, the power of posterior alpha activity was found to be associated with this temporal component. Hence, it looks promising to apply the proposed method for analyzing other ongoing EEG elicited by other natural stimuli.
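A sketch of the tensor factorisation step using the tensorly library on a toy channel x frequency x time x subject tensor; the EEG preprocessing and the correlation with the "fluctuation centroid" feature are not reproduced, and the call assumes a recent tensorly version.

```python
# Nonnegative tensor factorisation (NTF) of a 4th-order EEG-spectrogram tensor
# (channel x frequency x time x subject) into spatial, spectral, temporal and
# subject factors, using tensorly. Toy data; the rank and sizes are illustrative.
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

rng = np.random.default_rng(0)
n_ch, n_freq, n_time, n_subj = 32, 40, 100, 9
tensor = tl.tensor(np.abs(rng.standard_normal((n_ch, n_freq, n_time, n_subj))))

weights, factors = non_negative_parafac(tensor, rank=4, n_iter_max=200)
spatial, spectral, temporal, subject = factors       # one factor matrix per mode

# the temporal components could then be correlated with long-term audio features
# such as a fluctuation-centroid time course (not computed here)
print([f.shape for f in factors])
```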
Article
Full-text available
We present a multimodal data set for the analysis of human affective states. The electroencephalogram (EEG) and peripheral physiological signals of 32 participants were recorded as each watched 40 one-minute long excerpts of music videos. Participants rated each video in terms of the levels of arousal, valence, like/dislike, dominance, and familiarity. For 22 of the 32 participants, frontal face video was also recorded. A novel method for stimuli selection is proposed using retrieval by affective tags from the last.fm website, video highlight detection, and an online assessment tool. An extensive analysis of the participants' ratings during the experiment is presented. Correlates between the EEG signal frequencies and the participants' ratings are investigated. Methods and results are presented for single-trial classification of arousal, valence, and like/dislike ratings using the modalities of EEG, peripheral physiological signals, and multimedia content analysis. Finally, decision fusion of the classification results from different modalities is performed. The data set is made publicly available and we encourage other researchers to use it for testing their own affective state estimation methods.
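A sketch of the kind of single-trial classification baseline typically run on such a dataset: band-power features per channel fed to an SVM. The frequency bands, channel count and classifier settings are illustrative assumptions, not the benchmark published with the dataset.

```python
# Single-trial affect classification from EEG band-power features (a common
# baseline for datasets of this kind). Synthetic data; bands, channels and the
# SVM settings are illustrative.
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
fs, n_trials, n_ch, T = 128, 120, 32, 128 * 60
eeg = rng.standard_normal((n_trials, n_ch, T))             # toy 60 s trials
valence = rng.integers(0, 2, n_trials)                     # high/low valence labels

bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def band_power_features(trial):
    f, psd = welch(trial, fs=fs, nperseg=fs * 2, axis=-1)  # PSD per channel
    return np.concatenate([psd[:, (f >= lo) & (f < hi)].mean(axis=-1)
                           for lo, hi in bands.values()])

X = np.stack([band_power_features(tr) for tr in eeg])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print("valence cross-validated accuracy:", cross_val_score(clf, X, valence, cv=5).mean())
```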
Article
Full-text available
Author Summary: Spoken language is a uniquely human trait. The human brain has evolved computational mechanisms that decode highly variable acoustic inputs into meaningful elements of language such as phonemes and words. Unraveling these decoding mechanisms in humans has proven difficult, because invasive recording of cortical activity is usually not possible. In this study, we take advantage of rare neurosurgical procedures for the treatment of epilepsy, in which neural activity is measured directly from the cortical surface and therefore provides a unique opportunity for characterizing how the human brain performs speech recognition. Using these recordings, we asked what aspects of speech sounds could be reconstructed, or decoded, from higher order brain areas in the human auditory system. We found that continuous auditory representations, for example the speech spectrogram, could be accurately reconstructed from measured neural signals. Reconstruction quality was highest for sound features most critical to speech intelligibility and allowed decoding of individual spoken words. The results provide insights into higher order neural speech processing and suggest it may be possible to readout intended speech directly from brain activity.
Article
Full-text available
Population responses of cortical neurons encode considerable details about sensory stimuli, and the encoded information is likely to change with stimulus context and behavioral conditions. The details of encoding are difficult to discern across large sets of single neuron data because of the complexity of naturally occurring stimulus features and cortical receptive fields. To overcome this problem, we used the method of stimulus reconstruction to study how complex sounds are encoded in primary auditory cortex (AI). This method uses a linear spectro-temporal model to map neural population responses to an estimate of the stimulus spectrogram, thereby enabling a direct comparison between the original stimulus and its reconstruction. By assessing the fidelity of such reconstructions from responses to modulated noise stimuli, we estimated the range over which AI neurons can faithfully encode spectro-temporal features. For stimuli containing statistical regularities (typical of those found in complex natural sounds), we found that knowledge of these regularities substantially improves reconstruction accuracy over reconstructions that do not take advantage of this prior knowledge. Finally, contrasting stimulus reconstructions under different behavioral states showed a novel view of the rapid changes in spectro-temporal response properties induced by attentional and motivational state.
Article
Speech intelligibility is currently measured by scoring how well a person can identify a speech signal. The results of such behavioral measures reflect neural processing of the speech signal, but are also influenced by language processing, motivation, and memory. Very often, electrophysiological measures of hearing give insight into the neural processing of sound. However, in most methods, non-speech stimuli are used, making it hard to relate the results to behavioral measures of speech intelligibility. The use of natural running speech as a stimulus in electrophysiological measures of hearing is a paradigm shift which makes it possible to bridge the gap between behavioral and electrophysiological measures. Here, by decoding the speech envelope from the electroencephalogram, and correlating it with the stimulus envelope, we demonstrate an electrophysiological measure of neural processing of running speech. We show that behaviorally measured speech intelligibility is strongly correlated with our electrophysiological measure.
Article
Selectively attending to one speaker in a multi-speaker scenario is thought to synchronize low-frequency cortical activity to the attended speech signal. In recent studies, reconstruction of speech from single-trial electroencephalogram (EEG) data has been used to decode which talker a listener is attending to in a two-talker situation. It is currently unclear how this generalizes to more complex sound environments. Behaviorally, speech perception is robust to the acoustic distortions that listeners typically encounter in everyday life, but it is unknown whether this is mirrored by a noise-robust neural tracking of attended speech. Here we used advanced acoustic simulations to recreate real-world acoustic scenes in the laboratory. In virtual acoustic realities with varying amounts of reverberation and number of interfering talkers, listeners selectively attended to the speech stream of a particular talker. Across the different listening environments, we found that the attended talker could be accurately decoded from single-trial EEG data irrespective of the different distortions in the acoustic input. For highly reverberant environments, speech envelopes reconstructed from neural responses to the distorted stimuli resembled the original clean signal more than the distorted input. With reverberant speech, we observed a late cortical response to the attended speech stream that encoded temporal modulations in the speech signal without its reverberant distortion. Single-trial attention decoding accuracies based on 40–50 s long blocks of data from 64 scalp electrodes were equally high (80–90% correct) in all considered listening environments and remained statistically significant using down to 10 scalp electrodes and short (< 30-s) unaveraged EEG segments. In contrast to the robust decoding of the attended talker we found that decoding of the unattended talker deteriorated with the acoustic distortions. These results suggest that cortical activity tracks an attended speech signal in a way that is invariant to acoustic distortions encountered in real-life sound environments. Noise-robust attention decoding additionally suggests a potential utility of stimulus reconstruction techniques in attention-controlled brain-computer interfaces.
Conference Paper
Recently it has been shown to be possible to ascertain which speaker a subject is attending to in a cocktail party environment from single-trial (~60 s) electroencephalography (EEG) data. The attentional selection of most subjects could be decoded with a very high accuracy (>90%). However, the performance of many subjects fell below what would be required for a potential brain computer interface (BCI). One potential reason for this is that activity related to the stimuli may have a lower signal-to-noise ratio on the scalp for some subjects than others. Independent component analysis (ICA) is a commonly used method for denoising EEG data. However, its effective use often requires the subjective choice of the experimenter to determine which independent components (ICs) to retain and which to reject. Algorithms do exist to automatically determine the reliability of ICs; however, they provide no information as to their relevance for the task at hand. Here we introduce a novel method for automatically selecting ICs which are relevant for decoding attentional selection. In doing so, we show a significant increase in classification accuracy at all test data durations from 60 s to 10 s. These findings have implications for the future development of naturalistic and user-friendly BCIs, as well as for smart hearing aids.
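A sketch of the general idea on synthetic data: unmix the EEG with FastICA, score each independent component by a task-relevance criterion, keep the highest-scoring components and project back. The scoring rule used here (correlation of each IC with the attended speech envelope) is an illustrative assumption, not necessarily the criterion proposed in the paper.

```python
# ICA denoising with automatic selection of task-relevant components: unmix with
# FastICA, score each IC by a relevance criterion, keep the top ICs, and project
# back to channel space. The scoring rule (correlation with the attended speech
# envelope) is an assumption for illustration only.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
fs, T, n_ch = 64, 64 * 60, 16
speech_env = np.convolve(np.abs(rng.standard_normal(T)), np.ones(8) / 8, mode="same")
eeg = rng.standard_normal((T, n_ch))
eeg[:, :4] += 0.3 * speech_env[:, None]                  # a few channels track the envelope

ica = FastICA(n_components=n_ch, random_state=0)
sources = ica.fit_transform(eeg)                         # (T, n_components)

# relevance score per IC: |correlation| between the IC and the attended envelope
scores = np.abs([np.corrcoef(sources[:, k], speech_env)[0, 1] for k in range(n_ch)])
keep = np.argsort(scores)[-4:]                           # retain the 4 most relevant ICs

# zero out the remaining components and reconstruct the denoised EEG
sources_clean = np.zeros_like(sources)
sources_clean[:, keep] = sources[:, keep]
eeg_clean = ica.inverse_transform(sources_clean)
print(eeg_clean.shape, scores[keep])
```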
Article
This chapter examines the neurological bases of musical communication. First, it presents behavioural data from psychophysical studies in terms of how these data provide insight into assumed brain function on a theoretical level. Second, it reviews studies using brain imaging and brain wave recordings to shed some light on what is known so far about neurophysiological processes mediating rhythm perception and rhythm production. Third, it considers biomedical applications of music's influence on brain and behaviour function in light of a changing paradigm for music therapy and medicine.
Article
The current work investigates the brain activation shared between perception and imagery of music as measured with electroencephalography (EEG). Meta-analyses of four separate EEG experiments are presented, each focusing on perception and imagination of musical sound, with differing levels of stimulus complexity. Imagination and perception of simple accented metronome trains, as manifested in the clock illusion, and of monophonic melodies are discussed, as well as of more complex rhythmic patterns and ecologically natural music stimuli. By decomposing the data with Principal Component Analysis (PCA), similar component distributions are found to explain most of the variance in each experiment. All data sets show a fronto-central and a more central component as the largest sources of variance, fitting with projections seen for the network of areas contributing to the N1/P2 complex. We expanded on these results using tensor decomposition. This makes it possible to include the tasks when identifying shared activation, does not make assumptions of independence or orthogonality, and yields the relative strengths of these components for each task. The components found in the PCA were shown to be further decomposable into parts that load primarily onto the perception or imagery task, or both, thereby adding more detail. It is shown that especially the frontal and central components have multiple parts that are differentially active during perception and imagination. A number of possible interpretations of these results are discussed, taking into account the different stimulus materials and measurement conditions.
Article
Humans possess a remarkable ability to attend to a single speaker's voice in a multi-talker background. How the auditory system manages to extract intelligible speech under such acoustically complex and adverse listening conditions is not known, and, indeed, it is not clear how attended speech is internally represented. Here, using multi-electrode surface recordings from the cortex of subjects engaged in a listening task with two simultaneous speakers, we demonstrate that population responses in non-primary human auditory cortex encode critical features of attended speech: speech spectrograms reconstructed based on cortical responses to the mixture of speakers reveal the salient spectral and temporal features of the attended speaker, as if subjects were listening to that speaker alone. A simple classifier trained solely on examples of single speakers can decode both attended words and speaker identity. We find that task performance is well predicted by a rapid increase in attention-modulated neural selectivity across both single-electrode and population-level cortical responses. These findings demonstrate that the cortical representation of speech does not merely reflect the external acoustic environment, but instead gives rise to the perceptual aspects relevant for the listener's intended goal.
Article
Timbre characterizes the identity of a sound source. Psychoacoustic studies have revealed that timbre is a multidimensional perceptual attribute with multiple underlying acoustic dimensions of both temporal and spectral types. Here we investigated the relations among the processing of three major timbre dimensions characterized acoustically by attack time, spectral centroid, and spectrum fine structure. All three pairs of these dimensions exhibited Garner interference: speeded categorization along one timbre dimension was affected by task-irrelevant variations along another timbre dimension. We also observed congruency effects: certain pairings of values along two different dimensions were categorized more rapidly than others. The exact profile of interactions varied across the three pairs of dimensions tested. The results are interpreted within the frame of a model postulating separate channels of processing for auditory attributes (pitch, loudness, timbre dimensions, etc.) with crosstalk between channels.
Article
Statistical significance testing of differences in values of metrics like recall, precision and balanced F-score is a necessary part of empirical natural language processing. Unfortunately, commonly used tests often underestimate the significance and so are less likely to detect differences that exist between different techniques. This underestimation comes from an independence assumption that is often violated. We point out some useful tests that do not make this assumption, including computationally-intensive randomization tests.
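A sketch of the kind of paired randomization test advocated above, applied to the difference in accuracy between two systems scored on the same items; this is a generic approximate-randomization implementation on synthetic data, not the paper's exact procedure.

```python
# Paired approximate randomization test for the difference in accuracy between
# two systems evaluated on the same items. Generic implementation; synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_items = 500
correct_a = rng.random(n_items) < 0.80        # per-item correctness of system A
correct_b = rng.random(n_items) < 0.76        # per-item correctness of system B
observed = correct_a.mean() - correct_b.mean()

n_perm, count = 10000, 0
for _ in range(n_perm):
    swap = rng.random(n_items) < 0.5          # randomly swap the A/B labels per item
    a = np.where(swap, correct_b, correct_a)
    b = np.where(swap, correct_a, correct_b)
    if abs(a.mean() - b.mean()) >= abs(observed):
        count += 1

p_value = (count + 1) / (n_perm + 1)          # add-one smoothing for the p-value
print(f"accuracy difference = {observed:.3f}, p = {p_value:.4f}")
```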
B. Kaneshiro, D. T. Nguyen, J. P. Dmochowski, A. M. Norcia, and J. Berger, "Naturalistic Music EEG Dataset - Hindi (NMED-H)," https://purl.stanford.edu/sd922db3535, 2016.
S. Losorelli, D. T. Nguyen, J. P. Dmochowski, and B. Kaneshiro, "NMED-T: A tempo-focused dataset of cortical and behavioral responses to naturalistic music," in ISMIR, 2017, pp. 339-346.
J. Appaji and B. Kaneshiro, "Neural tracking of simple and complex rhythms: Pilot study and dataset," Late-Breaking Demos Session for ISMIR, 2018.
S. Stober, A. Sternin, A. M. Owen, and J. A. Grahn, "Towards music imagery information retrieval: Introducing the OpenMIIR dataset of EEG recordings from music perception and imagination," in International Society for Music Information Retrieval Conference (ISMIR), 2015, pp. 763-769.
D. D. Wong, S. A. Fuglsang, J. Hjortkjaer, E. Ceolini, M. Slaney, and A. de Cheveigné, "A comparison of temporal response function estimation methods for auditory attention decoding," bioRxiv, p. 281345, 2018.