Bryan Pardo
Northwestern University · Department of Electrical Engineering and Computer Science

PhD

About

162 Publications · 33,661 Reads
3,084 Citations (since 2016: 57 research items, 1,749 citations)
Introduction
I head the Interactive Audio Lab. We make machines that hear. We develop new methods in Machine Learning, Signal Processing, and Human-Computer Interaction to make new tools for understanding and manipulating sound. When I'm not researching, you may find me teaching classes like Digital Musical Instrument Design, Computational Creativity, or Deep Learning.

Publications (162)
Preprint
Full-text available
Despite phenomenal progress in recent years, state-of-the-art music separation systems produce source estimates with significant perceptual shortcomings, such as adding extraneous noise or removing harmonics. We propose a post-processing model (the Make it Sound Good (MSG) post-processor) to enhance the output of music source separation systems. We...
Preprint
Full-text available
We propose a new method for training a supervised source separation system that aims to learn the interdependent relationships between all combinations of sources in a mixture. Rather than independently estimating each source from a mix, we reframe the source separation problem as an Orderless Neural Autoregressive Density Estimator (NADE), and est...
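As a rough illustration of the orderless training idea sketched in this abstract, the snippet below conditions a separator on the mixture plus a random subset of ground-truth sources and predicts one held-out source per step. The network, spectrogram shapes, and subset-sampling scheme are illustrative assumptions, not the paper's specification; the full orderless scheme would average over many subsets and orderings.

```python
import torch
import torch.nn as nn

class OrderlessSeparator(nn.Module):
    """Hypothetical separator: predicts one source's magnitude spectrogram from
    the mixture plus an arbitrary (possibly empty) subset of other sources."""
    def __init__(self, n_sources=4, n_freq=513):
        super().__init__()
        # Input: mixture + n_sources conditioning slots + a mask saying which
        # slots are filled. Output: one source estimate.
        self.net = nn.Sequential(
            nn.Linear(n_freq * (1 + n_sources) + n_sources, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_freq),
        )

    def forward(self, mix, sources, known_mask):
        # Zero out the sources the model is not allowed to see.
        cond = sources * known_mask[..., None]
        x = torch.cat([mix, cond.flatten(-2), known_mask], dim=-1)
        return self.net(x)

def training_step(model, mix, sources, optimizer):
    """One orderless-NADE-style step: condition on a random subset of ground
    truth sources and predict one of the remaining ones."""
    batch, n_src, _ = sources.shape
    # Random subset of "known" sources per example.
    known_mask = (torch.rand(batch, n_src) < 0.5).float()
    unknown = 1.0 - known_mask
    unknown[unknown.sum(dim=1) == 0, 0] = 1.0  # ensure at least one target
    target_idx = torch.multinomial(unknown, 1).squeeze(1)
    known_mask[torch.arange(batch), target_idx] = 0.0  # never show the target

    pred = model(mix, sources, known_mask)
    target = sources[torch.arange(batch), target_idx]
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```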
Preprint
Full-text available
Human perceptual studies are the gold standard for the evaluation of many research tasks in machine learning, linguistics, and psychology. However, these studies require significant time and cost to perform. As a result, many researchers use objective measures that can correlate poorly with human evaluation. When subjective evaluations are performe...
Preprint
Full-text available
We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation, without any retraining. An audio generation model is conditioned on an input mixture, producing a latent encoding from which audio is generated. This generated audio is fed to a pretrained music tagger tha...
Preprint
Full-text available
We present a software framework that integrates neural networks into the popular open-source audio editing software, Audacity, with a minimal amount of developer effort. In this paper, we showcase some example use cases for both end-users and neural network developers. We hope that this work fosters a new level of interactivity between deep learnin...
Preprint
Full-text available
Modifying the pitch and timing of an audio signal are fundamental audio editing operations with applications in speech manipulation, audio-visual synchronization, and singing voice editing and synthesis. Thus far, methods for pitch-shifting and time-stretching that use digital signal processing (DSP) have been favored over deep learning approaches...
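The DSP methods this abstract alludes to are readily available as baselines; a minimal sketch using librosa's phase-vocoder-based effects (the file paths are placeholders):

```python
import librosa
import soundfile as sf

# Load an example recording (path is a placeholder).
y, sr = librosa.load("vocal_take.wav", sr=None)

# Classic DSP manipulations of the kind the abstract compares against:
y_slow = librosa.effects.time_stretch(y, rate=0.8)       # 25% longer, same pitch
y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # up a whole tone, same duration

sf.write("vocal_slow.wav", y_slow, sr)
sf.write("vocal_up.wav", y_up, sr)
```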
Preprint
Full-text available
Deep learning work on musical instrument recognition has generally focused on instrument classes for which we have abundant data. In this work, we exploit hierarchical relationships between instruments in a few-shot learning setup to enable classification of a wider set of musical instruments, given a few examples at inference. We apply a hierarchi...
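To make the hierarchical few-shot setup concrete, here is a toy sketch of coarse-to-fine classification with class prototypes (mean embeddings, as in prototypical networks). The taxonomy, embeddings, and nearest-prototype rule are stand-ins, not the paper's model:

```python
import numpy as np

# Toy instrument taxonomy (a stand-in for the paper's hierarchy).
HIERARCHY = {"violin": "strings", "cello": "strings",
             "trumpet": "brass", "trombone": "brass"}

def prototypes(embeddings, labels):
    """Mean embedding per class: the standard prototypical-network prototype."""
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def hierarchical_classify(query, support_emb, support_labels):
    """Classify coarse-to-fine: nearest coarse prototype first, then the
    nearest fine prototype within that branch."""
    coarse = np.array([HIERARCHY[l] for l in support_labels])
    coarse_proto = prototypes(support_emb, coarse)
    best = min(coarse_proto, key=lambda c: np.linalg.norm(query - coarse_proto[c]))
    keep = coarse == best
    fine_proto = prototypes(support_emb[keep], support_labels[keep])
    return min(fine_proto, key=lambda c: np.linalg.norm(query - fine_proto[c]))

# A few support examples per class (random stand-ins for real embeddings).
rng = np.random.default_rng(0)
support_emb = rng.normal(size=(8, 16))
support_labels = np.array(["violin", "violin", "cello", "cello",
                           "trumpet", "trumpet", "trombone", "trombone"])
print(hierarchical_classify(support_emb[0], support_emb, support_labels))
```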
Preprint
Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new conte...
Preprint
Full-text available
Supervised deep learning methods for performing audio source separation can be very effective in domains where there is a large amount of training data. While some music domains have enough data suitable for training a separation system, such as rock and pop genres, many musical domains do not, such as classical music, choral music, and non-Western...
Preprint
Full-text available
In this paper, we introduce a simple method that can separate arbitrary musical instruments from an audio mixture. Given an unaligned MIDI transcription for a target instrument from an input mixture, we synthesize new mixtures from the MIDI transcription that sound similar to the mixture to be separated. This lets us create a labeled training set t...
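A minimal sketch of the data-generation idea, with pretty_midi's simple sinusoidal synthesizer standing in for whatever synthesis the paper actually uses; the paths and the SNR-based mixing rule are assumptions of this sketch:

```python
import numpy as np
import pretty_midi

# Load the unaligned transcription of the target instrument (placeholder path).
fs = 16000
target_midi = pretty_midi.PrettyMIDI("target_instrument.mid")

# pretty_midi's built-in synthesize() renders notes as simple sinusoids;
# the paper presumably uses a more realistic synthesizer.
target_audio = target_midi.synthesize(fs=fs)

def make_training_pair(target_audio, accompaniment, snr_db=0.0):
    """Mix the synthesized target with other material at a known level, giving
    a mixture whose ground-truth target is known by construction."""
    n = min(len(target_audio), len(accompaniment))
    t, a = target_audio[:n], accompaniment[:n]
    gain = np.sqrt(np.sum(t**2) / (np.sum(a**2) * 10**(snr_db / 10) + 1e-8))
    mixture = t + gain * a
    return mixture, t  # (input, label) for supervised separation training

# Example: mix against noise as a stand-in accompaniment.
rng = np.random.default_rng(0)
accompaniment = rng.normal(size=len(target_audio))
mix, label = make_training_pair(target_audio, accompaniment, snr_db=0.0)
```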
Preprint
Clipping the gradient is a known approach to improving gradient descent, but requires hand selection of a clipping threshold hyperparameter. We present AutoClip, a simple method for automatically and adaptively choosing a gradient clipping threshold, based on the history of gradient norms observed during training. Experimental results show that app...
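The abstract gives the idea; below is a minimal sketch assuming one natural instantiation, clipping at a fixed percentile of the gradient-norm history (the percentile hyperparameter and class name are ours, not the paper's):

```python
import numpy as np
import torch

class AutoClipper:
    """Adaptive gradient clipping: keep a running history of gradient norms and
    clip each step at a percentile of that history (an assumed rule for this
    sketch)."""
    def __init__(self, percentile=10.0):
        self.percentile = percentile
        self.history = []

    def __call__(self, model):
        # Overall gradient norm across all parameters.
        total_norm = torch.norm(torch.stack([
            p.grad.detach().norm() for p in model.parameters() if p.grad is not None
        ]))
        self.history.append(total_norm.item())
        clip_value = np.percentile(self.history, self.percentile)
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)

# Usage inside a training loop, between backward() and optimizer.step():
#   clipper = AutoClipper(percentile=10)
#   loss.backward(); clipper(model); optimizer.step()
```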
Preprint
Deep generative systems that learn probabilistic models from a corpus of existing music do not explicitly encode knowledge of a musical style, compared to traditional rule-based systems. Thus, it can be difficult to determine whether deep models generate stylistically correct output without expert evaluation, but this is expensive and time-consumin...
Preprint
Deep learning has rapidly become the state-of-the-art approach for music generation. However, training a deep model typically requires a large training set, which is often not available for specific musical styles. In this paper, we present augmentative generation (Aug-Gen), a method of dataset augmentation for any music generation system trained o...
Article
Full-text available
Detecting subtle signals from small earthquakes triggered by transient stresses from the surface waves of large magnitude earthquakes can contribute to a more general understanding of how earthquakes nucleate and interact with each other. However, searching for signals from such small earthquakes in thousands of seismograms is overwhelming, and dis...
Preprint
Early detection and repair of failing components in automobiles reduces the risk of vehicle failure in life-threatening situations. Many automobile components in need of repair produce characteristic sounds. For example, loose drive belts emit a high-pitched squeaking sound, and bad starter motors have a characteristic whirring or clicking noise. O...
Preprint
Full-text available
Separating an audio scene such as a cocktail party into constituent, meaningful components is a core task in computer audition. Deep networks are the state-of-the-art approach. They are trained on synthetic mixtures of audio made from isolated sound source recordings so that ground truth for the separation is known. However, the vast majority of av...
Preprint
Full-text available
Audio source separation is the process of separating a mixture (e.g. a pop band recording) into isolated sounds from individual sources (e.g. just the lead vocals). Deep learning models are the state-of-the-art in source separation, given that the mixture to be separated is similar to the mixtures the deep model was trained on. This requires the en...
Preprint
Full-text available
We present a single deep learning architecture that can both separate an audio recording of a musical mixture into constituent single-instrument recordings and transcribe these instruments into a human-readable format at the same time, learning a shared musical representation for both tasks. This novel architecture, which we call Cerberus, builds o...
Conference Paper
Sound Event Detection (SED) in audio scenes is a task that has been studied by an increasing number of researchers. Recent SED systems often use deep learning models. Building these systems typically requires a large amount of carefully annotated, strongly labeled data, where the exact time-span of a sound event (e.g. the 'dog bark' starts at 1.2...
Article
Full-text available
Improving audio production tools provides a great opportunity for meaningful enhancement of creative activities due to the disconnect between existing tools and the conceptual frameworks within which many people work. In our work, we focus on bridging the gap between the intentions of both amateur and professional musicians and the audio manipulati...
Conference Paper
Full-text available
Deep-learning-based audio processing algorithms have become very popular over the past decade. Due to promising results reported for deep-learning-based methods on many tasks, some now argue that signal processing audio representations (e.g. magnitude spectrograms) should be entirely discarded, in favor of learning representations from data using d...
Conference Paper
Content-based audio retrieval including query-by-example (QBE) and query-by-vocal imitation (QBV) is useful when search-relevant text labels for the audio are unavailable, or text labels do not sufficiently narrow the search. However, a single query example may not provide sufficient information to ensure the target sound(s) in the database are the...
Conference Paper
Full-text available
Voice recording is a challenging task with many pitfalls due to sub-par recording environments, mistakes in recording setup, microphone quality, etc. Newcomers to voice recording often have difficulty recording their voice, leading to recordings with low sound quality. Many amateur recordings of poor quality have two key problems: too much reverber...
Article
Full-text available
In this special issue of IEEE Signal Processing Magazine (SPM), we survey recent advances in music processing with a focus on audio signals. Eleven articles cover topics including music analysis, retrieval, source separation, singing-voice processing, musical sound synthesis, and user interfaces, to name a few. The tutorial-style articles provide a...
Conference Paper
Full-text available
Query-By-Vocal Imitation (QBV) search systems enable searching a collection of audio files using a vocal imitation as a query. This can be useful when sounds do not have commonly agreed-upon text-labels, or many sounds share a label. As deep learning approaches have been successfully applied to QBV systems, datasets to build models have become more...
Preprint
Full-text available
Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. Source separation systems based on deep learning are currently the most successful approaches for solving the underdetermined separation problem, where there are more sources than channels. Traditi...
Preprint
Full-text available
The Multi-resolution Common Fate Transform (MCFT) is an audio signal representation useful for representing mixtures of multiple audio signals that overlap in both time and frequency. The MCFT combines the invertibility of a state-of-the-art representation, the Common Fate Transform (CFT), and the multi-resolution property of the cortical stage out...
Poster
Full-text available
The Multi-resolution Common Fate Transform (MCFT) is an audio signal representation that increases the separability of audio sources with significant energy overlap in the time-frequency domain. The MCFT is computed based on a fully invertible complex time-frequency representation. Therefore, the MCFT domain, where there is less overlap between sou...
Article
Full-text available
Conventional methods for finding audio in databases typically search text labels, rather than the audio itself. This can be problematic as labels may be missing, irrelevant to the audio content, or not known by users. Query by vocal imitation lets users query using vocal imitations instead. To do so, appropriate audio feature representations and ef...
Chapter
Separation of existing audio into remixable elements is very useful for repurposing music audio. Applications include upmixing video soundtracks to surround sound (e.g. home theater 5.1 systems), facilitating music transcriptions, allowing better mashups and remixes for disk jockeys, and rebalancing sound levels on multiple instruments or voices recor...
Article
Labeling of audio events is essential for many tasks. However, finding sound events and labeling them within a long audio file is tedious and time-consuming. In cases where there is very little labeled data (e.g., a single labeled example), it is often not feasible to train an automatic labeler because many techniques (e.g., deep learning) require...
Article
Popular music is often composed of an accompaniment and a lead component, the latter typically consisting of vocals. Filtering such mixtures to extract one or both components has many applications, such as automatic karaoke and remixing. This particular case of source separation yields very specific challenges and opportunities, including the parti...
Chapter
When musical instruments are recorded in isolation, modern editing and mixing tools allow correction of small errors without requiring a group to re-record an entire passage. Isolated recording also allows rebalancing of levels between musicians without re-recording and application of audio effects to individual instruments. Many of these technique...
Conference Paper
Full-text available
Audio source separation is the act of isolating sound sources in an audio scene. One application of source separation is singing voice extraction. In this work, we present a novel approach for music/voice separation that uses the 2D Fourier Transform (2DFT). Our approach leverages how periodic patterns manifest in the 2D Fourier Transform and is co...
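To illustrate how periodic patterns in a spectrogram concentrate into isolated peaks of its 2D Fourier transform, here is a rough sketch; the simple magnitude-threshold peak criterion and paths are assumptions, not the paper's exact peak picking:

```python
import numpy as np
import librosa

# Load a song and compute its magnitude spectrogram (placeholder path).
y, sr = librosa.load("mixture.wav", sr=None)
stft = librosa.stft(y)
mag, phase = np.abs(stft), np.angle(stft)

# 2D Fourier transform of the spectrogram. Repeating (periodic) structure in
# the accompaniment concentrates into isolated peaks in this domain.
tf2d = np.fft.fft2(mag)

# Keep only large coefficients (assumed peak criterion for this sketch):
# peaks -> repeating background; the residual -> non-repeating foreground.
thresh = np.percentile(np.abs(tf2d), 99)
bg2d = np.where(np.abs(tf2d) >= thresh, tf2d, 0)
bg_mag = np.maximum(np.real(np.fft.ifft2(bg2d)), 0)

# Soft mask from the background estimate, applied with the mixture phase.
mask = np.clip(bg_mag / (mag + 1e-8), 0, 1)
background = librosa.istft(mask * mag * np.exp(1j * phase))
foreground = librosa.istft((1 - mask) * mag * np.exp(1j * phase))
```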
Conference Paper
Full-text available
Audio source separation is the process of decomposing a signal containing sounds from multiple sources into a set of signals, each from a single source. Source separation algorithms typically leverage assumptions about correlations between audio signal characteristics (“cues”) and the audio sources or mixing parameters, and exploit these to do sepa...
Conference Paper
Audio production includes processing audio tracks to adjust sound levels with tools like compressors and modifying the sound with reverberation and equalization. In this paper, we focus on audio equalizers. We seek to make a tactile interface that lets blind or visually impaired users create an equalization curve in an intuitive manner. This interf...
Conference Paper
Full-text available
Tagging of sound events is essential in many research areas. However, finding sound events and labeling them within a long audio file is tedious and time-consuming. Building an automatic recognition system using machine learning techniques is often not feasible because it requires a large number of human-labeled training examples and fine tuning th...
Conference Paper
Full-text available
We propose the Multi-resolution Common Fate Transform (MCFT), a signal representation that increases the separability of audio sources with significant energy overlap in the time-frequency domain. The MCFT combines the desirable features of two existing representations: the invertibility of the recently proposed Common Fate Transform (CFT) and the...
Conference Paper
Older adults and people with vision impairments are increasingly using phones to receive audio-based information and want to publish content online but must use complex audio recording/editing tools that often rely on inaccessible graphical interfaces. This poster describes the design of an accessible audio-based interface for post-processing audio...
Article
Potential users of audio production software, such as equalizers and reverberators, may be discouraged by the complexity and lack of clear affordances of typical interfaces. We seek to simplify interfaces for audio production (e.g., mastering a music album with ProTools), audio tools (e.g., equalizers), and related consumer devic...
Conference Paper
We present the analysis of crowdsourced studies into how a population of Amazon Mechanical Turk Workers describe three commonly used audio effects: equalization, reverberation, and dynamic range compression. We find three categories of words used to describe audio: ones that are generally used across effects, ones that tend towards a single effect,...
Article
Audio production involves using tools such as reverberators and equalizers to transform audio into a state ready for release. While these tools are widely available to musicians who are not expert audio engineers, existing interfaces can be frustrating for newcomers, as their interfaces are parameterized in terms of low-level signal manipulations t...
Conference Paper
A natural way of communicating an audio concept is to imitate it with one's voice. This creates an approximation of the imagined sound (e.g. a particular owl's hoot), much like how a visual sketch approximates a visual concept (e.g. a drawing of the owl). If a machine could understand vocal imitations, users could communicate with software in this n...
Article
Audio equalizers (EQs) are perhaps the most commonly used tools in audio production. The SocialEQ project is a web-based personalized audio equalization system that uses an alternative interface paradigm to the standard approach. Here, the user names a desired effect (e.g. make the sound 'warm') and teaches the tool (e.g. an equalizer) what se...
Article
Musical works are often composed of two characteristic components: the background (typically the musical accompaniment), which generally exhibits a strong rhythmic structure with distinctive repeating time elements, and the melody (typically the singing voice or a solo instrument), which generally exhibits a strong harmonic structure with a distinc...
Article
Audio production is central to every kind of media that involves sound, such as film, television, and music, and involves transforming audio into a state ready for consumption by the public. One of the most commonly used audio production tools is the reverberator. Current interfaces are often complex and hard to understand. We seek to simplify these...
Conference Paper
While programming an audio synthesizer can be difficult, if a user has a general idea of the sound they are trying to program, they may be able to imitate it with their voice. In this technical demonstration, we demonstrate SynthAssist, a system that allows the user to program an audio synthesizer using vocal imitation and interactive feedback. Thi...
Conference Paper
Full-text available
One of the most commonly used audio production tools is the reverberator. Reverberators apply subtle or large echo effects to sound and are typically used in commercial audio recordings. Current reverberator interfaces are often complex and hard to understand. In this work, we describe Reverbalize, a novel and easy-to-use interface for a reverberat...
Article
Recommending media objects to users typically requires users to rate existing media objects so as to understand their preferences. The number of ratings required to produce good suggestions can be reduced through collaborative filtering. Collaborative filtering is more difficult when prior users have not rated the same set of media objects as the c...
Article
Full-text available
Source separation consists of separating a signal into additive components. It is a topic of considerable interest, with many applications, that has recently gathered much attention. Here, we introduce a new framework for source separation called Kernel Additive Modelling, which is based on local regression and permits efficient separation of multid...
Conference Paper
Full-text available
Recently, Kernel Additive Modelling was proposed as a new framework for performing sound source separation. Kernel Additive Modelling assumes that a source at some location can be estimated using its values at nearby locations, where nearness is defined through a source-specific proximity kernel. Different proximity kernels can be used for different...
Book
Full-text available
Repetition is a fundamental element in generating and perceiving structure. In audio, mixtures are often composed of structures where a repeating background signal is superimposed with a varying foreground signal (e.g., a singer overlaying varying vocals on a repeating accompaniment or a varying speech signal mixed with a repeating background no...
Article
Full-text available
In recent years, source separation has been a central research topic in music signal processing, with applications in stereo-to-surround up-mixing, remixing tools for disc jockeys or producers, instrument-wise equalizing, karaoke systems, and preprocessing in music analysis tasks. Musical sound sources, however, are often strongly correlated in tim...
Conference Paper
Full-text available
In this study, we introduce a new framework called Kernel Additive Modelling for audio spectrograms that can be used for multichannel source separation. It assumes that the spectrogram of a source at any time-frequency bin is close to its value in a neighbourhood indicated by a source-specific proximity kernel. The rationale for this model is to ea...
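A widely cited instance of this kernel idea is harmonic/percussive separation with median filters: a harmonic source's proximity kernel extends along time, a percussive source's along frequency. A minimal single-pass sketch (the kernel sizes, paths, and soft-mask reassignment are illustrative choices, not the paper's full backfitting procedure):

```python
import numpy as np
from scipy.ndimage import median_filter
import librosa

y, sr = librosa.load("mixture.wav", sr=None)  # placeholder path
S = np.abs(librosa.stft(y))

# Source-specific proximity kernels: a harmonic source varies slowly in time
# (horizontal kernel); a percussive source spans frequency at one instant
# (vertical kernel). Median filtering robustly implements "a bin is close to
# its value in a neighbourhood".
harmonic = median_filter(S, size=(1, 17))    # 17 frames along time
percussive = median_filter(S, size=(17, 1))  # 17 bins along frequency

# One reassignment step: split the mixture energy between the two source
# models with soft masks.
total = harmonic + percussive + 1e-8
S_harm = S * (harmonic / total)
S_perc = S * (percussive / total)
```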
Conference Paper
We propose a novel cepstral representation called the uniform discrete cepstrum (UDC) to represent the timbre of sound sources in a sound mixture. Unlike the ordinary cepstrum and MFCCs, which have to be calculated from the full magnitude spectrum of a source after source separation, the UDC can be calculated directly from isolated spectral points th...
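For orientation, the classic discrete-cepstrum construction fits a low-order cosine basis to sparse spectral points by regularized least squares; the UDC is a variant of this kind of fit, so the sketch below should be read as background rather than the paper's exact estimator:

```python
import numpy as np

def discrete_cepstrum(freqs, log_mags, order=20, reg=1e-4):
    """Least-squares fit of a cosine basis to sparse spectral points
    (freqs normalized to [0, 0.5]; log_mags are log magnitudes). This is the
    classic discrete-cepstrum construction; the paper's UDC modifies how a
    fit like this is computed from isolated points."""
    p = np.arange(order + 1)
    # Basis row i: [1, 2*cos(2*pi*f_i*1), ..., 2*cos(2*pi*f_i*order)]
    M = np.cos(2 * np.pi * np.outer(freqs, p))
    M[:, 1:] *= 2.0
    # Regularized least squares for the cepstral coefficients.
    A = M.T @ M + reg * np.eye(order + 1)
    return np.linalg.solve(A, M.T @ log_mags)

# Example: a timbre feature from a handful of detected harmonic peaks.
freqs = np.array([0.05, 0.10, 0.15, 0.20])  # normalized frequencies
log_mags = np.log(np.array([1.0, 0.6, 0.3, 0.2]))
c = discrete_cepstrum(freqs, log_mags)
```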
Conference Paper
A typical audio mixer interface consists of faders and knobs that control the amplitude level as well as processing (e.g. equalization, compression and reverberation) parameters of individual tracks. This interface, while widely used and effective for optimizing a mix, may not be the best interface to facilitate exploration of different mixing opti...
Article
Multi-pitch analysis of concurrent sound sources is an important but challenging problem. It requires estimating pitch values of all harmonic sources in individual frames and streaming the pitch estimates into trajectories, each of which corresponds to a source. We address the streaming problem for monophonic sound sources. We take the original aud...
Article
In this paper, an online score-informed source separation system is proposed under the Non-negative Matrix Factorization (NMF) framework, using parametric instrument models. Each instrument is modelled using a multi-excitation source-filter model, which provides the flexibility to model different instruments. The instrument models are initially lea...
Patent
Full-text available
Systems, methods, and apparatus are provided for equalization preference learning for digital audio modification. A method for listener calibration of an audio signal includes modifying a reference sound using at least one equalization curve; playing the modified reference sound for a listener; accepting listener feedback regarding the modified ref...
Conference Paper
REPET-SIM is a generalization of the REpeating Pattern Extraction Technique (REPET) that uses a similarity matrix to separate the repeating background from the non-repeating foreground in a mixture. The method assumes that the background (typically the music accompaniment) is dense and low-ranked, while the foreground (typically the singing voice)...
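A rough sketch of the similarity-matrix idea: for each spectrogram frame, take the element-wise median of its most similar frames anywhere in the song as the repeating-background model, then mask. The neighbor count k, the masking rule, and the path are illustrative choices:

```python
import numpy as np
import librosa

y, sr = librosa.load("mixture.wav", sr=None)  # placeholder path
stft = librosa.stft(y)
V = np.abs(stft)

# Frame-to-frame cosine similarity matrix over spectrogram columns.
norms = np.linalg.norm(V, axis=0, keepdims=True) + 1e-8
sim = (V / norms).T @ (V / norms)

# For each frame, find its k most similar frames and take their element-wise
# median as the repeating-background estimate for that frame.
k = 10
neighbors = np.argsort(-sim, axis=1)[:, :k]
background = np.stack([np.median(V[:, neighbors[t]], axis=1)
                       for t in range(V.shape[1])], axis=1)

# The median captures the dense, repeating background; the residual is the
# non-repeating foreground (e.g. singing voice).
mask = np.minimum(background, V) / (V + 1e-8)
music = librosa.istft(mask * stft)
voice = librosa.istft((1 - mask) * stft)
```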
Conference Paper
Creative individuals increasingly rely on online crowdfunding platforms to crowdsource funding for new ventures. For novice crowdfunding project creators, however, there are few resources to turn to for assistance in the planning of crowdfunding projects. We are building a tool for novice project creators to get feedback on their project designs. O...