Nicholas J. Bryan
  • Researcher at Adobe

About

55 Publications
15,944 Reads
1,129 Citations
Current institution: Adobe
Current position: Researcher

Publications (55)
Preprint
We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can o...
Preprint
Full-text available
Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillati...
Preprint
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distille...
Article
Full-text available
Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less suitable for precise control over time-varying attributes such as the positions of beats in time or the cha...
Article
Full-text available
Diffusion-based audio and music generation models commonly perform generation by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their usefulness...
Preprint
Full-text available
Adaptive filters are applicable to many signal processing tasks including acoustic echo cancellation, beamforming, and more. Adaptive filters are typically controlled using algorithms such as least-mean-squares (LMS), recursive least-squares (RLS), or Kalman filter updates. Such models are often applied in the frequency domain, assume frequency indep...
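The LMS recursion named in this abstract is compact enough to sketch directly; the following is a generic textbook least-mean-squares filter, not the learned method the paper proposes, and the tap count and step size `mu` are illustrative assumptions:

```python
import numpy as np

def lms_filter(x, d, num_taps=8, mu=0.01):
    """Least-mean-squares adaptive filter (illustrative textbook sketch).

    x: input signal, d: desired signal; returns output, error, weights."""
    w = np.zeros(num_taps)
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for i in range(num_taps - 1, len(x)):
        u = x[i - num_taps + 1:i + 1][::-1]  # tap-input vector, newest first
        y[i] = w @ u                         # filter output
        e[i] = d[i] - y[i]                   # instantaneous error
        w = w + mu * e[i] * u                # LMS weight update
    return y, e, w
```

Given enough samples of a stationary input, the weights converge toward the unknown system's impulse response, which is the classic system-identification use of LMS.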
Preprint
We present a framework that can impose the audio effects and production style from one recording to another by example with the goal of simplifying the audio production process. We train a deep neural network to analyze an input recording and a style reference recording, and predict the control parameters of audio effects used to render the output....
Preprint
Adaptive filtering algorithms are pervasive throughout modern society and have had a significant impact on a wide variety of domains including audio processing, telecommunications, biomedical sensing, astrophysics and cosmology, seismology, and many more. Adaptive filters typically operate via specialized online, iterative optimization methods suc...
Article
Full-text available
Adaptive filtering algorithms are pervasive throughout signal processing and have had a material impact on a wide variety of domains including audio processing, telecommunications, biomedical sensing, astrophysics and cosmology, seismology, and many more. Adaptive filters typically operate via specialized online, iterative optimization methods such...
Preprint
Content creators often use music to enhance their stories, as it can be a powerful tool to convey emotion. In this paper, our goal is to help creators find music to match the emotion of their story. We focus on text-based stories that can be auralized (e.g., books), use multiple sentences as input queries, and automatically retrieve matching music....
Preprint
Full-text available
Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, they have often focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, resulting in unique pr...
Preprint
Full-text available
Adaptive filtering algorithms are commonplace in signal processing and have wide-ranging applications from single-channel denoising to multi-channel acoustic echo cancellation and adaptive beamforming. Such algorithms typically operate via specialized online, iterative optimization methods and have achieved tremendous success, but require expert kn...
Preprint
Full-text available
Modifying the pitch and timing of an audio signal are fundamental audio editing operations with applications in speech manipulation, audio-visual synchronization, and singing voice editing and synthesis. Thus far, methods for pitch-shifting and time-stretching that use digital signal processing (DSP) have been favored over deep learning approaches...
Preprint
Full-text available
The task of manipulating the level and/or effects of individual instruments to recompose a mixture recording, or remixing, is common across a variety of applications such as music production, audio-visual post-production, podcasts, and more. This process, however, traditionally requires access to individual source recordings, restricting the cre...
Preprint
Full-text available
We present a data-driven approach to automate audio signal processing by incorporating stateful third-party, audio effects as layers within a deep neural network. We then train a deep encoder to analyze input audio and control effect parameters to perform the desired signal manipulation, requiring only input-target paired audio data as supervision....
Conference Paper
Full-text available
Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with d...
Preprint
Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new conte...
Preprint
Deep representation learning offers a powerful paradigm for mapping input data onto an organized embedding space and is useful for many music information retrieval tasks. Two central methods for representation learning include deep metric learning and classification, both having the same goal of learning a representation that can generalize well ac...
Preprint
Music similarity search is useful for a variety of creative tasks such as replacing one music recording with another recording with a similar "feel", a common task in video editing. For this task, it is typically necessary to define a similarity metric to compare one recording to another. Music similarity, however, is hard to define and depends on...
Preprint
Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with d...
Preprint
Data-driven approaches to automatic drum transcription (ADT) are often limited to a predefined, small vocabulary of percussion instrument classes. Such models cannot recognize out-of-vocabulary classes nor are they able to adapt to finer-grained vocabularies. In this work, we address open vocabulary ADT by introducing few-shot learning to the task....
Conference Paper
Locating perceptually similar sound events within a continuous recording is a common task for various audio applications. However, current tools require users to manually listen to and label all the locations of the sound events of interest, which is tedious and time-consuming. In this work, we (1) adapt state-of-the-art metric-based few-shot learn...
Article
We present a new method to capture the acoustic characteristics of real-world rooms using commodity devices, and use the captured characteristics to generate similar sounding sources with virtual models. Given the captured audio and an approximate geometric model of a real-world room, we present a novel learning-based method to estimate its acousti...
Preprint
Full-text available
Assessment of many audio processing tasks relies on subjective evaluation which is time-consuming and expensive. Efforts have been made to create objective metrics but existing ones correlate poorly with human judgment. In this work, we construct a differentiable metric by fitting a deep neural network on a newly collected dataset of just-noticeabl...
Preprint
We present a new method to capture the acoustic characteristics of real-world rooms using commodity devices, and use the captured characteristics to generate similar sounding sources with virtual models. Given the captured audio and an approximate geometric model of a real-world room, we present a novel learning-based method to estimate its acousti...
Preprint
Full-text available
Reverberation time (T60) and the direct-to-reverberant ratio (DRR) are two commonly used parameters to characterize acoustic environments. Both parameters are useful for various speech processing applications and can be measured from an acoustic impulse response (AIR). In many scenarios, however, AIRs are not available, motivating blind estimation...
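For background on the abstract above, T60 is conventionally measured from an AIR via Schroeder backward integration of the squared impulse response. This is a sketch of that classical non-blind estimator, not the blind method the paper describes; the -5 to -25 dB fit range (the T20 convention) is an assumption:

```python
import numpy as np

def estimate_t60(h, fs):
    """Estimate T60 from an acoustic impulse response via Schroeder
    backward integration, fitting the -5 to -25 dB decay (T20 method)."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]       # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0])    # normalize and convert to dB
    i_hi = np.argmax(edc_db <= -5.0)          # start of fit region
    i_lo = np.argmax(edc_db <= -25.0)         # end of fit region
    t = np.arange(len(h)) / fs
    slope, _ = np.polyfit(t[i_hi:i_lo], edc_db[i_hi:i_lo], 1)
    return -60.0 / slope                      # extrapolate decay to -60 dB
```

On a synthetic exponentially decaying noise burst, the estimate recovers the decay rate closely, which is why blind estimators such as the one above are typically evaluated against Schroeder-based ground truth.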
Patent
Clustering and synchronizing content may include extracting audio features for each of a plurality of files that include audio content. The plurality of files may be clustered into one or more clusters. Clustering may include clustering based on a histogram that may be generated for each file pair of the plurality of files. Within each of the clust...
Conference Paper
Full-text available
Traditional audio editing tools do not facilitate the task of separating a single mixture recording (e.g. pop song) into its respective sources (e.g. drums, vocal, etc.). Such ability, however, would be very useful for a wide variety of audio applications such as music remixing, audio denoising, and audio-based forensics. To address this issue, w...
Conference Paper
We propose an interactive refinement method for supervised and semi-supervised single-channel source separation. The refinement method allows end-users to provide feedback to the separation process by painting on spectrogram displays of intermediate output results. The time-frequency annotations are then used to update the separation estimates and...
Conference Paper
Full-text available
The task of separating a single recording of a polyphonic instrument (e.g. piano, guitar, etc.) into distinctive pitch tracks is challenging. One promising class of methods to accomplish this task is based on non-negative matrix factorization (NMF). Such methods, however, are still far from perfect. Distinct pitches from a single instrument have si...
Article
In applications such as audio denoising, music transcription, music remixing, and audio-based forensics, it is desirable to decompose a single-channel recording into its respective sources. One of the current most effective class of methods to do so is based on nonnegative matrix factorization and related latent variable models. Such techniques, ho...
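As a concrete illustration of the nonnegative matrix factorization at the core of these methods, here is the standard Lee-Seung multiplicative-update rule for the Euclidean cost; this is a generic sketch, not the interactive extensions the article develops, and the iteration count is an arbitrary assumption:

```python
import numpy as np

def nmf(V, rank, n_iter=500, seed=0):
    """Factorize nonnegative V ~= W @ H with Lee-Seung multiplicative
    updates (Euclidean cost); eps guards against division by zero."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H
```

In audio separation, V is typically a magnitude spectrogram, the columns of W act as spectral templates, and the rows of H as their time-varying activations.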
Conference Paper
We propose a method to both identify and synchronize multi-camera video recordings within a large collection of video and/or audio files. Landmark-based audio fingerprinting is used to match multiple recordings of the same event together and time-synchronize each file within the groups. Compared to prior work, we offer improvements towards event id...
Conference Paper
Full-text available
User control over variable-rate time-stretching typically requires direct, manual adjustment of the time-dependent stretch rate. For time-stretching with transient preservation, rhythmic warping, rhythmic emphasis modification, or other effects that require additional timing constraints, however, direct manipulation is difficult. For a more user-fr...
Conference Paper
Full-text available
Computational analysis of musical influence networks and rank of sample-based music is presented with a unique outside examination of the WhoSampled.com dataset. The exemplary dataset maintains a large collection of artist-to-artist relationships of sample-based music, specifying the origins of borrowed or sampled material on a song-by-song basis....
Article
The behavior of the genetic algorithm (GA), a popular approach to search and optimization problems, is known to depend, among other factors, on the fitness function formula, the recombination operator, and the mutation operator. What has received less attention is the impact of the mating strategy that selects the chromosomes to be paired for recom...
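A minimal generational GA makes the roles of the fitness function, recombination, mutation, and mating strategy concrete. This sketch (all parameters and the OneMax toy problem are illustrative, not from the article) uses binary tournament selection as its mating strategy:

```python
import random

def genetic_algorithm(fitness, length=20, pop_size=30, generations=60,
                      p_mut=0.02, seed=0):
    """Minimal generational GA: tournament mating selection, one-point
    crossover, bit-flip mutation. Illustrative textbook sketch."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def tournament():
        a, b = rng.sample(pop, 2)    # mating strategy: fitter of a random pair
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, length)                  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# OneMax toy problem: fitness is simply the number of ones in the chromosome.
best = genetic_algorithm(sum)
```

Swapping the `tournament` function for a different pairing rule is exactly the kind of mating-strategy change whose impact the article studies.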
Conference Paper
Full-text available
Balloon pops are convenient for probing the acoustics of a space, as they generate relatively uniform radiation patterns and consistent "N-wave" waveforms. However, the N-wave spectrum contains nulls which impart an undesired comb-filter-like quality when the recorded balloon pop is convolved with audio. Here, a method for converting recorded ballo...
Conference Paper
Full-text available
The Mobile Music (MoMu) toolkit is a new open-source software development toolkit focusing on musical interaction design for mobile phones. The toolkit, currently implemented for iPhone OS, emphasizes usability and rapid prototyping with the end goal of aiding developers in creating real-time interactive audio applications. Simple and unified ac...
Conference Paper
Full-text available
In this paper, we describe the development of the Stanford Mobile Phone Orchestra (MoPhO) since its inception in 2007. As a newly structured ensemble of musicians with iPhones and wearable speakers, MoPhO takes advantage of the ubiquity and mobility of smartphones as well as the unique interaction techniques offered by such devices. MoPhO offers a...
Article
Two methods of extending measured room impulse responses below their noise floor and beyond their measured duration are presented. Both methods extract frequency-dependent reverberation energy decay rates, equalization levels, and noise floor levels, and subsequently extrapolate the reverberation decay towards silence. The first method crossfades i...
Article
There are many impulse response measurement scenarios in which the playback and recording devices maintain separate unsynchronized digital clocks resulting in clock drift. Clock drift is problematic for impulse response measurement techniques involving convolution, including sinusoidal sweeps and pseudo-random noise sequences. We present analysis o...
Conference Paper
Full-text available
In the paper, we chronicle the instantiation and adventures of the Stanford Laptop Orchestra (SLOrk), an ensemble of laptops, humans, hemispherical speaker arrays, interfaces, and, more recently, mobile smart phones. Motivated to deeply explore computer-mediated live performance, SLOrk provides a platform for research, instrument design, sound desi...
Article
An acoustically transparent, configurable microphone array with omnidirectional elements, designed for room acoustics analysis and synthesis, and archaeological acoustics applications, is presented. Omnidirectional microphone elements with 2 mm-diameter capsules and 1 mm-diameter wire mounts produce a nearly acoustically transparent array, and prov...
Article
Hidden Markov Models (HMMs) have been widely used for speech processing, understanding, and synthesis with great success. The purpose of this work is to apply this prior knowledge and investigate the effectiveness of HMMs on short-duration percussive musical signals. Three main topics of interest are investigated: isolated instrument recognition,...
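The likelihood computation underlying such HMM classifiers is the forward algorithm; a generic textbook sketch for discrete observations follows (scaled at each step to avoid underflow; this is not the paper's feature pipeline, and the toy parameters in the usage are assumptions):

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.

    pi: initial state probabilities, A: state transition matrix,
    B: emission matrix (states x symbols). Uses the scaled forward pass."""
    alpha = pi * B[:, obs[0]]          # initialize with first observation
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()               # scale to prevent underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate and weight by emission
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik
```

For isolated instrument recognition, one such model is trained per class and a test signal is assigned to the class whose model yields the highest log-likelihood.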