Gerald Schuller

Technische Universität Ilmenau | TUI · Institut für Medientechnik

PhD

About

139
Publications
18,085
Reads
1,488
Citations
Citations since 2017
33 Research Items
533 Citations
[Chart: citations per year, 2017 to 2023]

Publications (139)
Preprint
Full-text available
The goal is to obtain a simple multichannel source separation with very low latency. Applications include teleconferencing, hearing aids, augmented reality, and selective active noise cancellation. These real-time applications need very low latency, usually less than about 6 ms, and low complexity, because they usually run on small portable devices...
Chapter
This chapter gives a short description of the basics of quantization, for the computation of the quantization error power. This then is the connection to the admissible error power as computed by the psycho-acoustic model.
Chapter
This chapter describes scalable lossless audio coding, based on the Integer-to-Integer MDCT (IntMDCT), which allows an exact conversion between integer signal samples and integer subband values. First it describes the theory and Python implementation of the Integer-to-Integer MDCT, and then it goes on to its use in a lossless encoder and decoder, w...
Chapter
This chapter introduces time/frequency decompositions in general and the design of filter banks with perfect reconstruction and near perfect reconstruction for audio coding. First it describes their theoretical foundations, including down- and up-sampling, and the polyphase representation. Then it applies these principles to the design of MDCT filt...
Chapter
This chapter describes the Python implementation of a complete audio encoder and decoder, putting all parts together. It shows some experiments that can be done with it.
Chapter
This chapter describes predictive coders. It starts with the mean squared error solution, continues to online adaptive predictors, like LPC coders, the Least Mean Squares (LMS) adaptation, and the effect of quantization in predictive coding with Python examples. Then it describes prediction for lossless coders with Python implementations. As an ext...
Chapter
This chapter shows an example implementation of an LMS predictive lossless encoder and decoder, using the tools of the previous chapters, and compression tests with audio files.
Chapter
This chapter describes some of the basics of Entropy coding, and application examples in Python. It first treats the well-known Huffman coder, and then goes on to the Golomb–Rice coder, which can be more simply implemented. The latter is hence chosen for the following implementation of complete audio coders.
Chapter
This chapter describes a new approach to apply a psycho-acoustic model to audio signals. Previously we applied the psycho-acoustic model and the corresponding quantization and processing all in the same subbands. Here it is shown how to use the psycho-acoustic masking threshold to normalize the audio signal to it, and then do any processing in the...
Chapter
This chapter describes the basics and an example implementation of psycho-acoustic models. It starts with the Bark frequency scale for hearing as one of the basics, and mapping functions for the conversion from and to the linear frequency scale. Then it describes the hearing masking threshold in quiet, goes on to models for the masking threshold in...
Preprint
In this work we present a method for unsupervised learning of audio representations, focused on the task of singing voice separation. We build upon a previously proposed method for learning representations of time-domain music signals with a re-parameterized denoising autoencoder, extending it by using the family of Sinkhorn distances with entropic...
Chapter
The classification of musical instruments of the same type is a challenging case of study. In this paper we conduct feature-based machine learning experiments to classify electric guitar recordings from different manufacturers and models. The Constant-Q Transform features and the Support Vector Machine algorithm obtained an accuracy...
Preprint
Full-text available
In this work, we present a method for learning interpretable music signal representations directly from waveform signals. Our method can be trained using unsupervised objectives and relies on the denoising auto-encoder model that uses a simple sinusoidal model as decoding functions to reconstruct the singing voice. To demonstrate the benefits of ou...
Book
This textbook presents the fundamentals of audio coding, used to compress audio and music signals, using Python programs both as examples to illustrate the principles and for experiments for the reader. Together, these programs then form complete audio coders. The author starts with basic knowledge of digital signal processing (sampling, filtering)...
Article
The goal of this work is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspire...
Preprint
Full-text available
The goal of this work is to investigate what music source separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks that are based on the denoising autoencoder (DAE) model, and conditioned on the mixture magnitude spectra. For approximating the mapping functions, we propose an algorithm...
Article
Full-text available
Close miking is the widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It has been used by the audio engineering community for decades for audio recording, based on a number of empirical rules tha...
Article
The monaural singing voice separation task focuses on the prediction of the singing voice from a single-channel music mixture signal. Current state-of-the-art (SOTA) results in monaural singing voice separation are obtained with deep-learning-based methods. In this work we present a novel deep-learning-based method that learns long-term temporal patter...
Article
Full-text available
Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes...
Conference Paper
The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-...
Article
Estimating audio and musical signals from single channel mixtures often, if not always, involves a transformation of the mixture signal to the time-frequency (T-F) domain in which a masking operation takes place. Masking is realized as an element-wise multiplication of the mixture signal's T-F representation with a ratio of computed sources' spectr...
Conference Paper
Close miking is the widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It has been used by the audio engineering community for decades for audio recording, based on a number of empirical rules tha...
Article
This paper deals with the automatic transcription of solo bass guitar recordings with an additional estimation of playing techniques and fretboard positions used by the musician. Our goal is to first develop a system for a robust estimation of the note parameters pitch, onset, and duration (score-level parameters). As a second step, we aim to autom...
Conference Paper
The audio mixing process is an art that has proven to be extremely hard to model: What makes a certain mix better than another one? How can the mixing processing chain be automatically optimized to obtain better results in a more efficient manner? Over the last years, the scientific community has exploited methods from signal processing, music info...
Conference Paper
Audio mastering procedures include various processes like frequency equalisation and dynamic range compression. These processes rely solely on musical and perceptually pleasing facets of the acoustic characteristics, derived from subjective listening criteria according to the genre of the audio material or content. These facets are playing a signif...
Patent
Full-text available
An encoder for providing an audio stream on the basis of a transform-domain representation of an input audio signal includes a quantization error calculator configured to determine a multi-band quantization error over a plurality of frequency bands of the input audio signal for which separate band gain information is available. The encoder also inc...
Patent
Full-text available
An audio signal decoder has a time warp contour calculator, a time warp contour data rescaler and a warp decoder. The time warp contour calculator is configured to generate time warp contour data repeatedly restarting from a predetermined time warp contour start value, based on time warp contour evolution information describing a temporal evolution...
Patent
Full-text available
An embodiment of an analysis filterbank for filtering a plurality of time domain input frames, wherein an input frame comprises a number of ordered input samples, comprises a windower configured to generate a plurality of windowed frames, wherein a windowed frame comprises a plurality of windowed samples, wherein the windower is configured to proce...
Patent
Full-text available
An audio signal decoder for providing a decoded multi-channel audio signal representation on the basis of an encoded multi-channel audio signal representation has a time warp decoder configured to selectively use individual audio channel specific time warp contours or a joint multi-channel time warp contour for a reconstruction of a plurality of au...
Patent
Full-text available
An audio encoder has a window function controller, a windower, a time warper with a final quality check functionality, a time/frequency converter, a TNS stage or a quantizer encoder, the window function controller, the time warper, the TNS stage or an additional noise filling analyzer are controlled by signal analysis results obtained by a time war...
Patent
Full-text available
A noise filler for providing a noise-filled spectral representation of an audio signal on the basis of an input spectral representation of the audio signal has a spectral region identifier configured to identify spectral regions of the input spectral representation spaced from non-zero spectral regions of the input spectral representation by at lea...
Article
Full-text available
We present a system for the automatic separation of solo instruments and music accompaniment in polyphonic music recordings. Our approach is based on a pitch detection front-end and a tone-based spectral estimation. We assess the plausibility of using sound separation technologies to create practice material in a music education context. To better...
Patent
Full-text available
An apparatus for encoding an audio signal includes the windower for windowing a first block of the audio signal using an analysis window having an aliasing portion and a further portion. The apparatus furthermore includes a processor for processing the first sub-block of the audio signal associated with the aliasing portion by transforming the sub-...
Patent
Full-text available
An audio encoder has a common preprocessing stage, an information sink based encoding branch such as a spectral domain encoding branch, an information source based encoding branch such as an LPC-domain encoding branch and a switch for switching between these branches at inputs into these branches or outputs of these branches controlled by a decision s...
Patent
Full-text available
An embodiment of an apparatus for generating audio subband values in audio subband channels has an analysis windower for windowing a frame of time-domain audio input samples being in a time sequence extending from an early sample to a later sample using an analysis window function having a sequence of window coefficients to obtain windowed samples....
Patent
Full-text available
A processed representation of an audio signal having a sequence of frames is generated by sampling the audio signal within first and second frames of the sequence of frames, the second frame following the first frame, the sampling using information on a pitch contour of the first and second frames to derive a first sampled representation. The audio...
Conference Paper
In this paper we present an audio tampering detection method based on the analysis of discontinuities in the framing grid, caused either by manipulations within the same recording or across recordings even with codec changes. The approach extends state of the art methods for MP3 framing grid detection with respect to efficiency and robustness, and...
Conference Paper
In this paper, we propose an instrument-centered bass guitar transcription algorithm. Instead of aiming at a general-purpose bass transcription algorithm, we incorporate knowledge about the instrument construction and typical playing techniques of the electric bass guitar. In addition to the commonly extracted score-level parameters note onset, off...
Patent
Full-text available
An audio encoder adapted for encoding frames of a sampled audio signal to obtain encoded frames, wherein a frame includes a number of time domain audio samples. The audio encoder includes a predictive coding analysis stage for determining information on coefficients of a synthesis filter and a prediction domain frame based on a frame of audio sampl...
Patent
Full-text available
An audio encoder adapted for encoding frames of a sampled audio signal to obtain encoded frames, wherein a frame includes a number of time domain audio samples. The audio encoder includes a predictive coding analysis stage for determining information on coefficients of a synthesis filter and a prediction domain frame based on a frame of audio sampl...
Conference Paper
Modern audio coding technologies apply methods of bandwidth extension (BWE) to efficiently represent audio data at low bitrates. An established method is the well-known spectral band replication (SBR) that is part of MPEG High Efficiency Advanced Audio Coding (HE-AAC). However, if the signal features a distinct harmonic spectral structure, the use...
Conference Paper
Full-text available
In this paper, we study the effect of prior information on the quality of informed source separation algorithms. We present results with our system for solo and accompaniment separation and contrast our findings with two other state-of-the-art approaches. Results suggest current separation techniques limit performance when compared to extraction pr...
Article
Although there is steady progress in sensor technology, imaging with a high dynamic range (HDR) is still difficult for motion imaging with high image quality. This paper presents our new approach for video acquisition with high dynamic range. The principle is based on optical attenuation of some of the pixels of an existing image sensor. This well...
Conference Paper
We present a system which automatically generates 3D models from a sequence of stereo images and point clouds. We show the advantages of using parametric geometric shapes, the so-called superquadrics. We found that using a superposition of extended superquadrics results in higher quality models for sparse or relatively inaccurate point clouds.
Conference Paper
Full-text available
Our goal is to obtain improved perceptual quality for separated solo instruments and accompaniment in polyphonic music. The proposed approach uses a pitch detection algorithm in conjunction with a spectral filtering based source separation. The algorithm was designed to work with polyphonic signals regardless of the main instrument, type of accompa...
Conference Paper
In this paper, we present a novel audio synthesis model that allows us to simulate bass guitar tones with 11 different playing techniques to choose from. In contrast, previous approaches focussing on bass guitar synthesis only implemented the two slap techniques. We apply a digital waveguide model extended by different modular parts to imitate the...
Article
Full-text available
We describe an efficient system, which directly extracts features from compressed audio material. It consists of a time/frequency conversion method and a feature extraction algorithm. The conversion method provides the feature extraction algorithm with a suitable complex spectral representation directly from the compressed domain. It further allows...
Article
A technique for the compression of guitar signals is presented which utilizes a simple model of the guitar. The goal for the codec is to obtain acceptable quality at significantly lower bitrates compared to universal audio codecs. This instrument codec achieves its data compression by submitting an excitation function and model parameters to the re...
Article
In this paper, an audio coding procedure for piano signals is presented, based on a physical model of the piano. Instead of coding the waveform of the signal, the compression is realized by extracting relevant parameters at the encoder. The signal is then re-synthesized at the decoder using the physical model. We describe the development and implem...
Article
We describe an efficient system, which directly extracts features from compressed audio material. It consists of a time-frequency conversion method and a feature extraction algorithm. The conversion method provides the feature extraction algorithm with a suitable complex spectral representation directly from the compressed domain, and further allow...
Conference Paper
During the last year, many research efforts have been directed to the refinement of sound source separation algorithms. However, little or no effort has been made to assess the impact of different spectral parameters, such as phase, magnitude and location of harmonic components, on the resulting quality of the extracted signals. Recent developments in obj...
Conference Paper
Full-text available
In this paper, we present the results of a pre-study on music performance analysis of ensemble music. Our aim is to implement a music classification system for the description of live recordings, for instance to help musicologists and musicians to analyze improvised ensemble performances. The main problem we deal with is the extraction of a suitable...
Conference Paper
In this paper, we propose a novel method to parametrize and classify different frequency modulation techniques in bass guitar recordings. A parametric spectral estimation technique is applied to refine the fundamental frequency estimates derived from an existing bass transcription algorithm. We apply a two-stage taxonomy of bass playing styles with...
Article
Recent advances in networking technology (higher bit rates and lower transmission latencies) enable new applications where musicians can play together remotely, over the Internet. This application requires an audio coder providing sufficient compression to avoid overloading a connection and delay jitter, and also having a very low encoding/decoding...
Article
This paper proposes a new method for improving the audio quality of predictive perceptual audio coding in the context of the Ultra Low Delay (ULD) coding scheme for real time applications. The commonly used auto-regressive (AR) signal model is leading to an IIR predictor in the decoder. For random access of the transmission as well as for transmiss...
Conference Paper
This paper presents a novel method to detect and distinguish ten frequently used audio effects in recordings of electric guitar and bass. It is based on spectral analysis of audio segments located in the sustain part of previously detected guitar tones. Overall, 541 spectral, cepstral and harmonic features are extracted from short time spectra of t...
Conference Paper
In the field of audio coding a technique called Spectral Band Replication (SBR) is used in several codecs to reduce the data rate. We already developed an SBR tool with very little algorithmic delay for use in low delay applications like teleconferencing or live music performance but it produces a relatively high amount of side information. Further...
Conference Paper
In this paper, we present a feature-based approach for the classification of different playing techniques in bass guitar recordings. The applied audio features are chosen to capture typical instrument sounds induced by 10 different playing techniques. A novel database that consists of approx. 4300 isolated bass notes was assembled for the purpose of...
Conference Paper
Full-text available
In this paper, we compare two approaches for automatic classification of bass playing styles, one based on high-level features and another one based on similarity measures between bass patterns. For both approaches, we compare two different strategies: classification of patterns as a whole and classification of all measures of a pattern with a subseq...
Conference Paper
Full-text available
This paper compares two prediction structures for predictive perceptual audio coding in the context of the ultra low delay (ULD) coding scheme. One structure is based on the commonly used AR signal model, leading to an IIR predictor in the decoder. The other structure is based on an MA signal model, leading to an FIR predictor in the decoder. We fi...
Article
Full-text available
Traditionally, speech coding and audio coding were separate worlds. Based on different technical approaches and different assumptions about the source signal, neither of the two coding schemes could efficiently represent both speech and music at low bitrates. This paper presents a unified speech and audio codec, which efficiently combines technique...
Conference Paper
Full-text available
Considering its mediation role between the poles of rhythm, harmony, and melody, the bass plays a crucial role in most music genres. This paper introduces a novel set of transcription-based high-level features that characterize the bass and its interaction with other participating instruments. Furthermore, a new method to model and automatically...
Conference Paper
Full-text available
Coding of speech signals at low bitrates, such as 16 kbps, has to rely on an efficient speech reproduction model to achieve reasonable speech quality. However, for audio signals not fitting to the model this approach generally fails. On the other hand, generic audio codecs, designed to handle any kind of audio signal, tend to show unsatisfactory re...
Article
Full-text available
Coding of speech signals at low bitrates, such as 16 kbps, has to rely on an efficient speech reproduction model to achieve reasonable speech quality. However, for audio signals not fitting to the model this approach generally fails. On the other hand, generic audio codecs, designed to handle any kind of audio signal, tend to show unsatisfactory re...
Conference Paper
Full-text available
We describe an efficient conversion method, which directly converts a desired spectral representation from compressed audio material. The conversion method provides a feature extraction algorithm with a suitable complex frequency representation of an audio signal. The presented conversion allows us to trade off computational complexity with accurac...
Conference Paper
Communication applications are usually delay restricted, especially for the instance of musicians playing over the Internet. This requires a one-way delay of maximum 25 msec and also a high audio quality is desired at feasible bit rates. The ultra low delay (ULD) audio coding structure is well suited to this application and we investigate further t...
Article
Full-text available
An abstract is not available.
Conference Paper
In this paper a Spectral Band Replication (SBR) tool for low delay audio applications is presented. One goal of this enhancement tool is to reduce the needed bit rate for the representation of audio data using an arbitrary audio codec. Another goal is to keep the algorithmic delay as low as possible. A low coding delay is essential for instance for...
Conference Paper
Full-text available
Low delay perceptual audio coding has recently gained wide acceptance for high quality communication. While common schemes are based on the well-known Modified Discrete Cosine Transform (MDCT) filterbank, this paper describes novel coding algorithms that, for the first time, make use of dedicated low delay filterbanks, thus achieving improved codin...
Conference Paper
Full-text available
The MPEG-4 Low Delay Advanced Audio Coding (AAC-LD) scheme has recently evolved into a popular algorithm for audio communication. It produces excellent audio quality at bitrates between 64 kbit/s and 48 kbit/s per channel. This paper introduces an enhancement to AAC-LD which reduces the bitrate demand by 25-33 %. This is achieved by adding both a d...
Article
Full-text available
A key issue for successfully interconnecting musicians in real-time over the Internet is minimizing the end-to-end signal delay for transmission and coding. However, the variance of transmission delay ("jitter") occasionally causes some packets to arrive too late for playback. To avoid this problem previous approaches are working with rather large rece...
Article
Full-text available
In this paper, lossless audio coding using the integer modified discrete cosine transform (IntMDCT) is discussed. The IntMDCT is constructed as an integer approximation of the MDCT using the lifting scheme and is reversible. The rounding error shape of the IntMDCT is derived. When the spectral energy of the input audio signal is concentrated at the...
Conference Paper
The paper presents a comparison of a previous and a new approach to shape quantization noise in low bit rate predictive audio coding. The previous approach uses an adaptation of the step size of a uniform quantizer, the new approach uses a quantizer with clipping. Both approaches are evaluated using a predictive audio coding scheme. The presented r...
Conference Paper
Full-text available
This paper describes high data-rate audio data hiding using the IntMDCT. The IntMDCT is an integer approximation of the MDCT with perfect reconstruction. Based on this transform, we describe a straight-forward way to embed data and extract it in a bit-exact manner while perceptual transparency is maintained. Since the IntMDCT spectrum can be used t...
Article
Full-text available
Playing live music on the Internet is one of the hardest disciplines in terms of low delay audio capture and transmission, time synchronization and bandwidth requirements. This has already been successfully evaluated with the Soundjack software which can be described as a low latency UDP streaming application. In combination with the new Fraunhofer...
Article
An audio coder with a very low delay (6-8 ms) for reduced bit rates is presented. Previous coder versions were based on backward adaptive coding, which has suboptimal noise shaping capabilities for reduced bit rate coding. We propose to use a different noise shaping method instead, resulting in an approach which uses forward adaptive predictive cod...
Conference Paper
Full-text available
In this paper we present and evaluate several concealment strategies for packet losses in the context of a low delay predictive audio coder. Our goal is to minimize the audible impact of a packet loss. The problem is that the predictive coder is backward adaptive, hence depending on past values. There is a predictor reset, but to increase coding ef...