Article

Bandwidth extension of audio signals by spectral band replication


Abstract

Spectral Band Replication (SBR) is a new audio coding tool that significantly improves the coding gain of perceptual coders and speech coders. Currently, there are three different audio coders that have shown a vast improvement by the combination with SBR: MPEG-AAC, MPEG-Layer II and MPEG-Layer III (mp3), all three being parts of the open ISO-MPEG standard. The combination of AAC and SBR will be used in the standardized Digital Radio Mondiale (DRM) system, and SBR is currently also being standardized within MPEG-4. SBR is a so-called bandwidth extension technique, where a major part of a signal's bandwidth is reconstructed from the lowband on the receiving side. It is developed and marketed by Coding Technologies, an international company in the audio coding field. This paper will focus on the technical details of SBR and in particular on the filter bank, which is the basis of the SBR process.

... In November 2007 Coding Technologies merged into Dolby Laboratories, now being the leading international company in the audio coding field. SBR is a bandwidth extension method [3] which significantly improves the compression efficiency (coding gain) of perceptual audio and speech coding schemes. SBR cannot be used as a standalone coder; it always operates in conjunction with a conventional codec, the core codec. ...
... This permits integrating the SBR technology into existing systems, thus enabling a smooth transition from a conventional audio coder to its more efficient SBR-enhanced version. In December 2001, SBR was chosen as the initial reference model for the MPEG standardization process of bandwidth extension [3], a work item finalized in March 2003 [16]. Indeed, the SBR technology was initially integrated successfully into three existing international audio coding standards: MPEG-1 audio layer 2 [6], MPEG-1/2 audio layer 3 known as MP3 [8], and MPEG-2/4 AAC (Advanced Audio Coding) [9], all being parts of the open ISO/MPEG standard. ...
... The SBR-enhanced version of MPEG-1 audio layer 2 [6] has been adopted as a source coder by the Digital Audio Broadcasting (DAB) system for digital radio services [7]. The SBR-enhanced version of MP3 [8], called mp3PRO, released and marketed by Thomson Multimedia, led to several software- and hardware-based commercial products [3]. The combination of SBR and MPEG-4 AAC, the so-called MPEG-4 High Efficiency AAC (HE-AAC) or aacPlus standard [13, 14, 15, 16, 17, 18, 19, 20], has been adopted as a source coder by the advanced DAB digital broadcasting system (DAB+) [7], by the Digital Radio Mondiale (DRM) universal openly standardized digital broadcasting system [7, 10, 11], as well as by XM Satellite Radio [12], one of two satellite-based digital radio services (XM Satellite Radio and Sirius Satellite Radio) used in the United States and Canada. ...
Article
Spectral Band Replication (SBR) is an enhancement compression technology, a bandwidth extension method which significantly improves the compression efficiency of perceptual audio and speech coding schemes. There are two versions of the SBR technology: Standard SBR and low delay SBR (LD-SBR). Central to the operation of standard and LD-SBR are dedicated complex exponential-modulated and real-valued cosine-modulated quadrature mirror filter (QMF) banks as the basic mathematical tools to analyze and synthesize audio signals. This tutorial paper presents the complete unified efficient implementations of complex exponential-modulated and real-valued cosine-modulated QMF banks used both in the standard SBR and LD-SBR encoder and decoder. In general, for each QMF bank is presented: Definition of its equivalent block transform with a common parameter M representing the number of sub-bands, its general symmetry property in the frequency or time domain, and the derivation of a fast algorithm for its efficient implementation. All fast algorithms are analyzed in detail in terms of the arithmetic complexity, regularity and structural simplicity for a potential real-time low-cost implementation in hardware or software.
... Thus, BWE techniques reduce the bitrates of WB and SWB encoded signals. There are two main categories of BWE methods: time domain BWE [7,8,9] and frequency domain BWE [10,11,12,13]. The time domain methods generate the HF signal by upsampling the LF signal [8,9] or by LPC-based estimation [7]. ...
... Therefore, we focus on the frequency domain methods in this paper. Spectral Band Replication (SBR) [10,11] is the most widely used frequency domain BWE method, and is used in MPEG-4 High-Efficiency AAC (HE-AAC) [14] and AAC Enhanced Low Delay (AAC-ELD) [15]. As shown in Fig. 1, an SBR decoder receives the time-domain LF signal from a core decoder, analyzes the LF signal with a complex-valued Quadrature Mirror Filter (QMF) filterbank, generates the HF signal in the QMF domain, and then adjusts the HF signal using the auxiliary information included in the bitstream. ...
... The adjustment of the HF signal is done by first adjusting the spectral envelope, and then by controlling the 'tonality', that is the ratio between the tonal and noise-like components of the signal. Due to the sophisticated tonality control made possible by inverse LPC filtering and noise injection to the QMF coefficients, SBR achieves high subjective quality for both speech and audio signals [10] and improves the coding efficiency by more than 30% [11]. While SBR is a powerful BWE technique, its computational complexity is relatively high due to the QMF filterbank. ...
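The decoder steps described in the excerpts above (generate the HF signal from the LF signal, then adjust its spectral envelope) can be illustrated with a minimal sketch. It is not the standardized SBR algorithm: the function name `replicate_high_band`, the cyclic copy-up rule, and the fixed `band_width` are assumptions made for illustration, and the tonality control (inverse filtering and noise injection) is left out entirely.

```python
import math


def replicate_high_band(low, target_energies, band_width):
    """Copy low-band coefficients upward and rescale each band.

    low             -- low-band subband or spectral coefficients (real or complex)
    target_energies -- one target energy per high band (the "side information")
    band_width      -- number of coefficients per high band (hypothetical layout)
    """
    n_high = len(target_energies) * band_width
    # Copy-up: reuse the low band cyclically until the high band is filled.
    patch = [low[i % len(low)] for i in range(n_high)]

    high = []
    for b, target in enumerate(target_energies):
        band = patch[b * band_width:(b + 1) * band_width]
        energy = sum(abs(c) ** 2 for c in band)
        # Envelope adjustment: one gain per band so its energy matches the target.
        gain = math.sqrt(target / energy) if energy > 0 else 0.0
        high.extend(c * gain for c in band)
    return high


# Toy frame: four low-band coefficients, two high bands of two coefficients each.
low_band = [1.0, -2.0, 0.5, 3.0]
envelope = [4.0, 1.0]
high_band = replicate_high_band(low_band, envelope, band_width=2)
band_energies = [
    sum(abs(c) ** 2 for c in high_band[i:i + 2]) for i in range(0, len(high_band), 2)
]
```

In this sketch only the per-band energies travel as side information, which is why such a parametric description of the high band can be far more compact than coding the high-band coefficients themselves.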
Conference Paper
We propose a low-complexity bandwidth extension (BWE) method operating in the modified discrete cosine transform (MDCT) domain to reduce the bitrate of wideband and super-wideband speech codecs. The proposed method generates a high-frequency signal by copying the MDCT spectrum from the low frequency part, and then adjusts tonality to improve the subjective quality of the generated high-frequency signal. In combination with an MDCT-based transform codec, it requires only 64.9% of the computational complexity of MPEG-4 spectral band replication (SBR). It also achieves subjective quality better than SBR for many speech samples.
... The original time-domain input signal is first filtered in a 64-channel analysis QMF bank. The filter bank splits the time-domain signal into complex-valued subband signals and is thus oversampled by a factor of two compared to a regular real-valued QMF bank [14]. For every 64 time-domain input samples, the filter bank produces 64 subband samples. ...
... The terms containing M/2 (terms needed for aliasing cancellation) present in the traditional cosine modulated filter bank are omitted because of the complex-valued representation [14]. In Figure 9 the corresponding block scheme for a complex-valued filter bank implementation is outlined. ...
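The 64-channel complex-valued analysis bank described in these excerpts, and its factor-of-two oversampling, can be sketched directly from the definition. This is only an illustration under stated assumptions: the prototype below is a Hann-windowed sinc of length 640 chosen for this example, not the standardized SBR prototype coefficients, the modulation phase is simplified, and the direct-form loops ignore the fast algorithms discussed in the literature.

```python
import cmath
import math

M = 64    # number of subbands
L = 640   # prototype length (assumption: 10 * M)


def prototype(length=L, bands=M):
    """Illustrative lowpass prototype with cutoff pi / (2 * bands)."""
    centre = (length - 1) / 2
    taps = []
    for n in range(length):
        t = math.pi * (n - centre) / (2 * bands)
        sinc = 1.0 if t == 0 else math.sin(t) / t
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * (n + 0.5) / length)
        taps.append(sinc * hann / (2 * bands))
    return taps


def analyse(signal, taps, bands=M):
    """Return one frame of `bands` complex subband samples per `bands` inputs."""
    history = [0.0] * len(taps)
    frames = []
    for start in range(0, len(signal) - bands + 1, bands):
        # Shift in the next block; the newest samples sit at the end.
        history = history[bands:] + list(signal[start:start + bands])
        frame = []
        for k in range(bands):
            w = math.pi * (k + 0.5) / bands  # centre frequency of band k
            acc = 0j
            for n, (h, x) in enumerate(zip(taps, history)):
                acc += h * x * cmath.exp(-1j * w * n)
            frame.append(acc)
        frames.append(frame)
    return frames


# A sinusoid placed at the centre of band 10 (hypothetical test signal).
k0 = 10
w0 = math.pi * (k0 + 0.5) / M
signal = [math.cos(w0 * m) for m in range(12 * M)]
frames = analyse(signal, prototype())

peak_band = max(range(M), key=lambda k: abs(frames[-1][k]))
samples_in = len(frames) * M
real_values_out = 2 * sum(len(f) for f in frames)  # real + imaginary parts
oversampling = real_values_out / samples_in
```

Each block of 64 real input samples yields 64 complex subband samples, that is 128 real numbers, which is the factor-of-two oversampling relative to a critically sampled real-valued bank.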
Article
In 2003 and 2004, the ISO/IEC MPEG standardization committee added two amendments to their MPEG-4 audio coding standard. These amendments concern parametric coding techniques and encompass Spectral Band Replication (SBR), Sinusoidal Coding (SSC), and Parametric Stereo (PS). In this paper, we will give an overview of the basic ideas behind these techniques and references to more detailed information. Furthermore, the results of listening tests as performed during the final stages of the MPEG-4 standardization process are presented in order to illustrate the performance of these techniques.
... By reducing the bit rate, a limit is eventually reached when the quantization noise of a traditional audio encoder significantly exceeds the masking threshold, equivalent to exceeding the −13-dB noise level in the previous example. To optimize the quality in scenarios when bit rates of, for example, 24 kilobits per second per channel are applied, the method of spectral band replication (SBR) (Ekstrand, 2002) can be applied to utilize the typical temporal similarities between the low and high frequencies. ...
[Figure 15.14: Block diagram of an audio encoder and decoder based on perceptual masking. Adapted from Brandenburg (1999).]
... Other ratios of sampling frequencies are also possible. Adapted from Ekstrand (2002). ...
Article
A common trend in the field of audio is to process the audio signal in the time–frequency domain. This chapter elaborates on the techniques of time–frequency transforms to visualize audio signals and introduces some phenomena, concepts, and issues related to the processing of audio in the time–frequency domain. It describes the time–frequency processing methods first using the concepts of frame‐based analysis, and second using the concepts of downsampled filter banks. The modified discrete cosine transform is widely applied in audio coding due to its non‐redundancy and its property of representing narrowband signals with a relatively small number of prominent spectral coefficients. The chapter also describes a few time–frequency transforms that are commonly used in the audio industry. A set of key applications in the field of perceptually motivated time–frequency audio processing is reviewed.
... However, it also increases the data requirement and might not be applicable in some use cases. The most well-known method in this category is Spectral Band Replication (SBR) [2,3]. SBR is a technique that has been used in existing audio codecs such as MPEG-4 High-Efficiency Advanced Audio Coding (HE-AAC). ... (* The first author performed the work while at Dolby Laboratories.)
... It consists of two phases: training and testing. In the training phase, the audio signals are first converted into time-frequency representations using the Complex Quadrature Mirror Filter (CQMF) transformation as specified in [2]. The CQMF filter bank decomposes the signal into 64 complex-valued sub-bands using blocks of 64 samples. ...
Conference Paper
In this paper, a blind bandwidth extension algorithm for music signals has been proposed. This method applies the K-means algorithm to firstly cluster audio data in the feature space, and constructs multiple envelope predictors for each cluster accordingly using Support Vector Regression (SVR). A set of well-established audio features for Music Information Retrieval (MIR) has been used to characterize the audio content. The resulting system is applied to a variety of music signals without any side information provided. The subjective listening test results show that this method can improve the perceptual quality successfully, but the minor artifacts still leave room for future improvements.
... The SBR algorithm makes use of complex-exponential modulated (Pseudo) Quadrature Mirror Filter (QMF) banks as t/f and f/t transforms, enabling flexible signal modification at high efficiency [11]. Therefore, they seem like a suitable alternative to the FFT employed in the decoder as presented in Section 2. Furthermore, the potentially very powerful combination of SBR with PS should not result in a decoder much exceeding the complexity of either SBR or PS. ...
... The combination of MPEG-2/4 AAC with the SBR bandwidth extension tool is known as aacPlus and was standardized in MPEG-4 as the HE-AAC profile [18]. The basic principles of SBR have been elaborated on in several papers [3], [6], [11]. For the convenience of the reader a short review is given here. ...
Article
Parametric stereo coding is a technique to efficiently code a stereo audio signal as a monaural signal plus a small amount of stereo parameters. The monaural signal can be encoded using any audio coder. The stereo parameters can be embedded in the ancillary part of the mono bit stream creating backwards mono compatibility. In the decoder, first the monaural signal is decoded after which the stereo signal is reconstructed from the stereo parameters. In this paper, a low complexity decoder solution is described based on complex-modulated filter banks. Combinations of the parametric stereo decoder with both a parametric coding scheme and with aacPlus will be elucidated.
... Experiments have shown that the complex-exponential modulated (Pseudo) Quadrature Mirror Filter (QMF) bank is very well suited for PS. A more detailed introduction of using the QMF bank as a flexible signal modifier can be found in [13]. The QMF bank used in aacPlus is a 64 channel complex-valued filter-bank with near alias-free behavior even when altering the gains of neighboring subbands excessively, which is a fundamental requirement for use as t/f transform in a PS system. ...
... The IIR reverberator used for frequencies under 8625 Hz according to configuration Table 2, is designed according to Sections 3.4 and 3.5. Since fractional delay is used as described in Equations 7 and 8, the delay lines $z^{-m(n)}$ (13) are replaced by $e^{i\alpha_k(n)} z^{-m(n)}$, where $\alpha_k(n)$ for $n = 0, 1, 2$ are the phase rotation values for the corresponding all-pass link $n$ and QMF band $k$. ...
Article
Parametric stereo coding in combination with an efficient coder for the underlying monaural audio signal results in the most efficient coding scheme for stereo signals at very low bit rates available today. While techniques for lateral localization have been studied since early intensity stereo coding tools, synthesis of stereophonic ambience was only recently applied in parametric stereo coding systems. This paper studies different techniques for synthetic ambience generation in the context of parametric stereo coding systems and discusses their mono-compatibility. Implementations of these techniques in combination with mp3PRO and aacPlus are presented together with experimental results.
... Also, hybrid techniques were introduced that combine filter-bank or transform-domain compression with parametric representations. One such method is known as "spectral band replication" (SBR), which regenerates high-frequency content using a parameter-guided copy from the low-frequency components that are coded using filter-bank or transform coders [9]–[11]. Another well-known example of hybrid techniques is "parametric stereo" (PS), also known as "binaural cue coding" (BCC). ...
... The energy mode parameters can be used in situations where the encoding and decoding of the down mix by the henceforth-called core coder alters the signal waveforms to such an extent that it leads to problems for the prediction mode. For example, the HE-AAC coder, where SBR is used [9]–[11], completely modifies the waveform in the high-frequency range. When using this coder as a core coder, it is possible to use the prediction mode in the lower frequency range, where no SBR is used, and the energy mode in the high-frequency range where the original waveform is completely lost due to SBR. ...
... As the generated side information is appended to the bitstream from the encoder of the target codec, the proposed approach can ensure backward-compatibility with existing devices and content, while improving the quality of the decoded signal significantly. Examples of such side information can be seen in the MPEG high-efficiency advanced audio coding (HE-AAC) family [27], [28], where HE-AAC v1 appends the spectral band replication [29] as side information to the bitstream of AAC [30], and HE-AAC v2 adds the parametric stereo [31], [32] on top of the HE-AAC v1 bitstream. A neural network on the transmitter side generates side information from the signals, and another neural network on the receiver side estimates the log power spectra (LPS) of the original signal from the decoded signal and the quantized side information. ...
Article
Audio codecs generate notable artifacts when operating at low bitrates, which degrade the quality of the coded audio significantly. There have been several approaches to enhance the quality of decoded signals with and without side information. While pre- or post-processing approaches without side information can be applied directly to existing systems without modifying codecs, approaches utilizing side information can further enhance the performance while maintaining backward-compatibility with existing codecs. In this paper, we propose a method to improve decoded signals using neural network-based side information. A neural network in the transmitter side that generates the side information and another neural network in the receiver side that estimates the log power spectra (LPS) of the original signal from the decoded signal and the side information are jointly trained to accurately reconstruct the original signal. In the same line with the analysis-by-synthesis, the neural network that generates the side information in the transmitter side takes not only the LPS of the original signal but also the LPS of the decoded signal as the input by decoding the encoded bitstream at the transmitter side. Experimental results show that the proposed audio codec enhancement scheme using neural network-based side information outperformed the audio codec enhancement without side information for the same codec operating at higher bitrates.
... Most notably, the newly created intelligent gap filling (IGF) tool [17] provides an enhanced mechanism for noise filling, i.e., for filling spectral regions for which spectral coefficients cannot be transmitted due to a shortage of available bits. In contrast to earlier methods, such as perceptual noise substitution (PNS) [18] or spectral band replication (SBR) [19], the tool parametrically restores portions of the transmitted spectrum while allowing to "intelligently" intermix between transmitted spectral coefficients and coefficients that are parametrically restored from lower frequency spectral regions that have been transmitted in a waveform preserving way. The encoder has control over the assignment and the processing of these spectral regions, or tiles, based on an input signal analysis. ...
Article
The term "immersive audio" is frequently used to describe an audio experience that provides the listener the sensation of being fully immersed or "present" in a sound scene. This can be achieved via different presentation modes, such as surround sound (several loudspeakers horizontally arranged around the listener), 3D audio (with loudspeakers at, above, and below listener ear level), and binaural audio to headphones. This article provides an overview of two recent standards that support the bitrate-efficient carriage of high-quality immersive sound. The first is MPEG-H 3D audio, which is a versatile standard that supports multiple immersive sound signal formats (channels, objects, and higher order ambisonics) and is now being adopted in broadcast and streaming applications. The second is MPEG-I immersive audio, an extension of 3D audio, currently under development, which is targeted for virtual and augmented reality applications. This will support rendering of fully user-interactive immersive sound for three degrees of user movement [three degrees of freedom (3DoF)], i.e., yaw, pitch, and roll head movement, and for six degrees of user movement [six degrees of freedom (6DoF)], i.e., 3DoF plus translational x, y, and z user position movements.
... The inclusion of side information typically incurs an additional burden of 1-5 kbps [4]. Examples of non-blind approaches to SWBE include the spectral band replication (SBR)-based high-efficiency advanced audio codec (HE-AAC) [5], the extended adaptive multi-rate WB codec (AMR-WB+) [6] and the enhanced voice services (EVS) codec (SWB mode) [1]. Non-blind approaches are codec specific and require a matching decoder in order to recover HF components. ...
Conference Paper
Many smart devices now support high-quality speech communication services at super-wide bandwidths. Often, however, speech quality is degraded when they are used with networks or devices which lack super-wideband support. Artificial bandwidth extension can then be used to improve speech quality. While approaches to wideband extension have been reported previously, this paper proposes an approach to super-wide bandwidth extension. The algorithm is based upon a classical source filter model in which spectral envelope and residual error information are extracted from a wideband signal using conventional linear prediction analysis. A form of spectral mirroring is then used to extend the residual error component before an extended super-wideband signal is derived from its combination with the original wideband envelope. Improvements to speech quality are confirmed with both objective and subjective assessments. These show that the quality of super-wideband speech, derived from the bandwidth extension of wideband speech, is comparable to that of speech processed with the standard enhanced voice services (EVS) codec with a bitrate of 13.2kbps. Without the need for statistical estimation of missing super-wideband components, the proposed algorithm is highly efficient and introduces only negligible latency.
... Adaptive time-frequency audio processing systems require a transform that is robust in terms of spectral aliasing. The typical approach in the field is to employ a filter bank that is double-oversampled to avoid spectral aliasing between the adjacent frequency bands [53]-[55]. Such a filter bank has a prototype filter with sufficient stop band attenuation to suppress the spectral aliasing beyond the adjacent bands. ...
Article
Spatial filtering with microphone arrays is a technique that can be utilized to obtain the signal of a target sound source from a specific direction. Typical approaches in the field of audio underperform in practical environments with multiple sound sources and diffuse sound. In this contribution we propose a post-filtering technique to suppress the effect of interferers and diffuse sound. The proposed technique utilizes the cross-spectral estimates of the output of two beamformers to formulate a time-frequency soft masker. The beamformers' outputs are used only for parameter estimation and not for generating an audio signal. Two sets of beamformer weights, a constant and an adaptive, are applied to the microphone array signals for the parameter estimation. The weights of the constant beamformer are designed such that they provide a spatially narrow beam pattern that is time and frequency invariant, having a unity gain towards the direction of interest. The weights of the adaptive beamformer are formulated using linearly constrained optimization with the constraint of weighted orthogonality with respect to the constant beamformer weights, as well as the unity gain towards the look direction. The orthogonality constraint provides diffuse sound suppression, while the unity gain provides a distortionless response. The cross spectrum of these two beamformers provides the target energy at a given look direction for the post filter. The study focuses on compact microphone arrays with which the typical beamforming techniques feature a trade-off between noise amplification and spatial selectivity, especially in the low frequency region. The proposed method is evaluated with instrumental measures and listening tests under different reverberation times, in dual and multi-talker scenarios. The evaluation shows that the proposed method provides a better performance when compared with a previous state-of-the-art spatial filter based on cross-pattern coherence, a linearly constrained beamformer and a Wiener post-filter.
... The following formulation assumes time-frequency transformed signals. Any transform is applicable that allows independent processing of the frequency bands, such as the complex-modulated QMF bank [13] and the robust short-time Fourier processing techniques [14]. ...
Article
The spatial capture patterns according to head-related transfer-functions (HRTFs) can be approximated using linear beamforming techniques. However, assuming a fixed spatial aliasing frequency, with reduction of the number of sensors and thus the array size, the linear approach leads to an excessive amplification of the microphone noise, unless the beam patterns are made broader than determined by the HRTFs. An adaptive technique is proposed that builds upon the assumption that the binaural perception is largely determined by a set of short-time inter-aural parameters in frequency bands. The parameters are estimated from the noisy HRTF beam pattern signals as a function of time and frequency. As a result of temporal averaging, the effect of the noise is mitigated while the perceptual spatial information is preserved. Signals with higher SNR from broader patterns are adaptively processed to obtain the parameters from the estimation stage by means of least-squares optimized mixing and decorrelation. Listening tests confirmed the perceptual benefit of the proposed approach with respect to linear techniques.
... The complex-modulated and double-oversampled properties of the filter bank prevent aliasing to the neighboring frequency bands, and a long prototype filter suppresses the remaining aliasing components to the point of being negligible. A detailed description of the properties of the filter bank can be found in [19]. An efficient implementation of the structure is discussed in [20]. ...
Article
Adaptive perceptual spatial sound reproduction techniques that employ a parametric model describing the properties of the sound field can reproduce spatial sound with high perceptual accuracy when compared to linear techniques. On the other hand, applying a sound-field model to control the reproduced sound may compromise the perceived quality of individual channels in cases where the model does not match the sound field. An alternative parametrization is proposed that estimates directly the perceptually relevant parameters for the target loudspeaker signals without modeling the sound field. At the synthesis stage, the loudspeaker signals with the target parametric properties are generated from the microphone signals with regularized least-squares mixing and decorrelation. It is shown through listening experiments that the proposed method provides on average the overall perceived spatial sound reproduction quality of a state-of-the-art parametric spatial sound reproduction technique, while solving the past shortcomings related to the perceived quality of the individual channels.
... At the decoder, the components in the low-frequency (LF) band were copied into the HF band based on the received side information [1]. This method is often called spectral band replication (SBR) [2,3]. A well-known audio codec utilizing SBR is high-efficiency advanced audio coding (HE-AAC) [4,5], which has been applied in mobile multimedia players, mobile phones, and digital radio services. ...
Article
Bandwidth extension is an effective technique for enhancing the quality of audio signals by reconstructing their high-frequency components. In this paper, a novel blind bandwidth extension method is proposed based on phase space reconstruction. Phase space reconstruction is introduced to convert the low-frequency modified discrete cosine transform coefficients of wideband audio to a multi-dimensional space, and the high-frequency modified discrete cosine transform coefficients of the audio signal are reconstructed by a non-linear prediction model. The performance of the proposed method was evaluated through objective and subjective tests. It is found that the proposed method achieves a better performance than the typical linear extrapolation method, and its performance is comparable to the conventional efficient high-frequency bandwidth extension method.
... The popular BWE methods used in audio coding standards are non-blind, such as spectral band replication of MPEG-4 [3], the noise-filling (NF) technique of the International Telecommunication Union - Telecommunication Standardization Sector (ITU-T) G.722.1 WB audio codec [4], and the spectral folding technique of the ITU-T G.719 full-band audio codec [5]. In these methods, first the time-frequency energy of the audio signals is extracted at the encoder. ...
Article
The bandwidth limitation of wideband (WB) audio systems degrades the subjective quality and naturalness of audio signals. In this paper, a new method for blind bandwidth extension of WB audio signals is proposed based on non-linear prediction and hidden Markov model (HMM). The high-frequency (HF) components in the band of 7–14 kHz are artificially restored only from the low-frequency information of the WB audio. State-space reconstruction is used to convert the fine spectrum of WB audio to a multi-dimensional space, and a non-linear prediction based on nearest-neighbor mapping is employed in the state space to restore the fine spectrum of the HF components. The spectral envelope of the resulting HF components is estimated based on an HMM according to the features extracted from the WB audio. In addition, the proposed method and the reference methods are applied to the ITU-T G.722.1 WB audio codec for comparison with the ITU-T G.722.1C super WB audio codec. Objective quality evaluation results indicate that the proposed method is preferred over the reference bandwidth extension methods. Subjective listening results show that the proposed method has a comparable audio quality with G.722.1C and improves the extension performance compared with the reference methods.
... It compensates for the reduced bandwidth of the perceptual audio codec at low bit rates and helps the codec reach low bit rates without losing audio quality. For example, SBR, HFR and Plus V are bandwidth extension methods of this kind [12] [14]. ...
... The required filter bank, which is commonly used in SAC, exhibits a hybrid QMF structure [12], [13]. In addition, parametric information of the downmix process is computed and incorporated into the SAOC bitstream together with the OLDs, IOCs and other side information. ...
Article
Spatial sound reproduction on a multi-channel loudspeaker setup indicates a consistent trend in today's audio playback systems. Digital surround sound significantly improves the realism of the spatial sound experience, but also results in a drastic increase in required audio data rate. Spatial Audio Coding (SAC) technology provides means for efficient storage and transmission of multi-channel signals by a downmix signal and associated parametric side information describing the spatial sound image. More recently, SAC has been extended with an object-based concept termed Spatial Audio Object Coding (SAOC) enabling efficient coding and interactive spatial rendering of multiple individual audio objects at the playback side. Due to the underlying parametric coding approach, object level manipulations may affect the produced perceptual sound scene quality, and using extreme object attenuation or boosting may result in unacceptably degraded audio quality. The paper describes how regular SAOC processing is advanced to ensure high quality sound reproduction even in demanding remix applications.
... This is generally achieved by reducing the coded audio bandwidth and the sampling frequency. To overcome this limitation, the SBR decoder reconstructs higher frequency components with the help of the low-frequency base band and a very compact parametric description of the high band [9] [10]. The low-frequency base band of the signal is coded by a conventional core coder. ...
Conference Paper
The MPEG-4 Low Delay Advanced Audio Coding (AAC-LD) scheme has recently evolved into a popular algorithm for audio communication. It produces excellent audio quality at bitrates between 64 kbit/s and 48 kbit/s per channel. This paper introduces an enhancement to AAC-LD which reduces the bitrate demand by 25-33 %. This is achieved by adding both a delay-optimized version of the Spectral Band Replication (SBR) tool and by utilizing a dedicated low delay filterbank. The introduced techniques maintain the high audio quality and offer an algorithmic delay low enough for use in two way communication systems. This paper describes the coder enhancements including a detailed discussion of algorithmic delay issues, a performance assessment and possible applications.
... An essential part of the spatial audio technologies, MPEG Surround as well as SAOC, are the Quadrature Mirror Filter (QMF) banks [3] which serve as time/frequency transform and are required to enable the frequency selective processing. The QMF bank has near alias-free behavior even when altering the gains of neighboring subbands excessively, which is a fundamental requirement for these systems. ...
... One of them exploits the technique of spectrum replication from the family of audio bandwidth extension methods [2]. The High Efficiency Profile of the MPEG-4 AAC standard (MPEG-4 AAC HE) [1] includes a technique called Spectral Band Replication (SBR) [2] [3] [4]. The basic idea of SBR is based on the observation that the signal spectrum in the high-frequency bands is highly correlated with the signal spectrum in the low-frequency band. ...
Article
Proposed is a new technique of effective regeneration of high-frequency tonal components in an augmented MPEG-4 AAC HE decoder. The basic idea is to synthesize the tonal components using the technique called synthetic sinusoidal coding that is already adopted in the MPEG-4 codecs in another context. Here, the idea is to mix this technique with standard Spectral Band Replication (SBR), i.e. to add some control information to a standard MPEG-4 AAC HE bit-stream that is used to synthesize the high-frequency tonal components in a decoder. In that way, provided is proper synthesis of rapidly changing sinusoids as well as proper harmonic structure in the high-frequency band. The experiments show that the tool improves significantly the compression performance when added to an MPEG-4 AAC HE codec. This improvement has been confirmed by listening tests.
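The basic SBR idea quoted in this entry (a high band that is strongly correlated with the low band, restored with the help of compact control information) can be illustrated with a toy sketch. This is not the standardized SBR algorithm: it works on a plain list of spectral magnitudes rather than QMF subband samples, and the function names, the band size and the example spectrum are all assumptions made for illustration.

```python
import math

def band_energies(bins, band_size):
    """Sum of squared magnitudes over consecutive bands of band_size bins."""
    return [sum(v * v for v in bins[i:i + band_size])
            for i in range(0, len(bins), band_size)]

def encode_envelope(spectrum, crossover, band_size):
    """Encoder side: keep the low band plus a coarse energy envelope of the high band."""
    return spectrum[:crossover], band_energies(spectrum[crossover:], band_size)

def replicate_high_band(low, envelope, band_size):
    """Decoder side: copy the low band upward, then rescale each band to the sent energy."""
    n_high = len(envelope) * band_size
    patch = [low[i % len(low)] for i in range(n_high)]  # simple copy-up patch
    out = []
    for b, target in enumerate(envelope):
        band = patch[b * band_size:(b + 1) * band_size]
        current = sum(v * v for v in band)
        gain = math.sqrt(target / current) if current > 0 else 0.0
        out.extend(v * gain for v in band)
    return low + out

# Hypothetical 8-bin magnitude spectrum; the upper 4 bins are the "high band".
spectrum = [8.0, 6.0, 5.0, 4.0, 2.0, 1.5, 1.0, 0.5]
low, envelope = encode_envelope(spectrum, crossover=4, band_size=2)
rebuilt = replicate_high_band(low, envelope, band_size=2)
print(envelope)
print(band_energies(rebuilt[4:], 2))
```

The rebuilt high band does not match the original bin by bin; it only matches the per-band energies that were sent as side information, which is the trade-off a parametric description of the high band makes.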
... Concretely, the small overlap towards future Additionally, the codec utilizes a low delay version of the Spectral Band Replication (SBR) tool, known from High Efficiency AAC (HE-AAC) as standardized in MPEG-4. SBR is a semi-parametric so-called bandwidth extension technique, where a large part of the signal's bandwidth is reconstructed from the core coded low band signal in the SBR decoding process [7]. The SBR tool in AAC-ELD is optimized regarding delay by removing the overlap delay and exchanging the QMF bank with a Complex Low Delay Filter Bank (CLDFB) [8]. ...
Article
Tele- and video conferencing systems for modern business communication are managed by central hubs, so-called multipoint control units (MCU). One major task of these units is the mixing of audio streams from the participating sites. This is traditionally done by decoding the streams, mixing in the time domain and then re-encoding the mixed signals. This requires additional processing power and leads to increased delay and degraded audio quality. The paper demonstrates how the recently standardized MPEG-4 Enhanced Low Delay AAC (AAC-ELD) codec offers a solution to these problems by efficient and delayless mixing in the transform domain of the codec.
... Note that, with this configuration, the source data come from the lower 16 subbands or fewer of the compressed signal and therefore the target data come from the remaining upper 16 subbands or more of the uncompressed signal. SBR also uses a PQMF but with a slightly better alias reduction [4]. However, that filterbank produces complex valued subband samples which are not suitable for the next stages of our algorithm. ...
Conference Paper
Algorithmic and protocol constraints of most low bitrate compression schemes lead to audio signals of low bandwidth and, inevitably, of low perceptual audio quality. Audio bandwidth extension methods address this problem by reconstructing the high frequency spectrum of a degraded signal based on information from the low frequency part. In this work, a novel audio bandwidth extension method is presented in which high frequency reconstruction is achieved through statistical conversion between the low frequency spectrum of the compressed signal and the high frequency part of the uncompressed signal's spectrum. Even though no psychoacoustic model is used, quality evaluation tests show that the proposed method has similar performance to one of the most recent, state-of-the-art, bandwidth extension schemes.
... This codec addresses the drawbacks of AAC-LD by incorporating the low-delay spectral band replication (LD-SBR) tool and a new low-delay core coder filterbank. While the SBR technology [23][24] improves coding efficiency, LD-SBR tool also minimizes the introduced delay by avoiding the use of variable time grid [6,22] and by using low-delay analysis and synthesis quadrature mirror filter (QMF) banks [22]. The delay of the new core coder filterbank is independent of filter length [6,8] and hence, a window with multiple overlap for good frequency selectivity can be used. ...
Conference Paper
Recently MPEG has developed a new audio coding standard - MPEG-4 AAC enhanced low delay (ELD), targeting low bit rate, full-duplex communication applications such as audio and video conferencing. The AAC-ELD combines low delay SBR filterbank with a low delay core coder filterbank to achieve both high coding efficiency and low algorithmic delay. In this paper, we propose an efficient mapping of the AAC-ELD core coder filterbank to the well known MDCT. This provides a fast algorithm for the new filterbank. Since AAC-LD and AAC-LC profiles also use MDCT filterbank, this mapping enables efficient joint implementation of filterbanks for all 3 profiles. We also present a very efficient 15-point DCT-II algorithm that is useful for implementation of all 3 profiles with frame lengths of 960 and 480. This algorithm requires just 17 multiplications and 67 additions. The overall design structure and complexity analysis of proposed implementation of the filterbanks is also provided.
... Apart from this elegance in signal representation, MDCT poses a great difficulty in spectral processing due to frequency component aliasing, e.g., gain control [3] that works well in the DFT domain produces significant spectral distortion. For this reason, a complex-modulated Quadrature Mirror Filterbank (QMF) substitutes the MDCT in Spectral Band Replication (SBR [4]) and Parametric Stereo (PS [5]); the Modified Discrete Sine Transform (MDST), in addition to the MDCT, is used in the modified distortion metric for AAC [6] and the MDCT domain spatial audio coding [7]. Both the strategies are not free. ...
Conference Paper
The spectrum of a sinusoid using the Modified Discrete Cosine Transform (MDCT), when separated into an even subspectrum and an odd subspectrum by bin parity, gives rise to a distinctive property: subspectral shapes are independent of the sinusoid phase, which contributes only to scaling. Based on this finding, we propose an Even-Odd (EO) scheme for stereo coding: partitioning the even and odd subspectra separately into subbands to capture the fine spectral structures of sinusoidal and rich tone signals. The scheme reduces the coding noise by 0-20 dB for music signals. When integrated into an MDCT domain KLT-based stereo coder, the scheme boosts subjective listening test (MUSHRA) scores. This coder, called KLT-EO, competes with Parametric Stereo (PS) in quality at a slightly higher bitrate but without the 20 ms algorithmic delay resulting from the stereo processing.
... But the overlap also poses a great difficulty in LT domain spectral analysis and manipulation: the basis vectors of the critically sampled LT are not shift-invariant and the LT does not conserve energy, defying easy computation of signal phase and magnitude [8][9][10][11]; applying gain control or equalization directly in the LT domain results in large spectral distortion due to aliasing of LT spectral components [12][13][14]. ...
Article
Critically sampled perfect reconstruction lapped transforms (LT), apart from signal representation, are difficult to use for signal analysis and processing. We tackle this problem by establishing a novel connection between the LT and the DFT based on matrix analysis. Specifically, for the modified discrete cosine transform (MDCT) with an arbitrary symmetric window, one of the most widely used LT, a sparse matrix representation for the conversion to and from the DFT is developed, leading to an efficient implementation in the form of low-order frequency domain FIR filtering. We give two example applications, MDCT domain group delay estimation and aliasing reduction for the temporal noise shaping (TNS).
Article
Decimation of a discrete-time signal below the Nyquist rate without applying an appropriate lowpass filter results in a distortion called aliasing. If wideband speech sampled at 16 kHz is decimated by 2 to result in a signal sampled at 8 kHz with aliasing, the decimated signal would be the summation of two speech-like signals, which are the narrowband speech covering 0-4 kHz and the spectrally flipped aliasing component coming from 8-4 kHz. Recently, the performance of speech separation has been remarkably improved with deep learning-based approaches, implying that the narrowband and aliasing components may be able to be separated. In this letter, we propose a novel method for low-rate wideband speech coding utilizing a standard narrowband codec. Instead of coding wideband speech using a wideband codec with a limited bitrate, we propose to decimate the input wideband speech incurring aliasing, and then encode it with a narrowband codec by allocating all the allowed bitrate to 0-4 kHz. After decoding the encoded bitstream, we apply a speech separation technique to obtain the narrowband and aliasing signals, which are then used to reconstruct the wideband speech by expansion, low/highpass filtering, and summation. Experimental results showed that the proposed method could achieve subjective quality comparable to the speeches coded by wideband codecs at higher bitrates in a subjective MUSHRA test.
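The aliasing described in this abstract can be checked numerically. The sketch below is only an illustration and uses a single test tone instead of speech: the 6 kHz tone frequency and the 320-sample length are assumed values, chosen so that the tone falls exactly on a DFT bin. Decimating by 2 without a lowpass filter moves the tone from the 4-8 kHz band to its spectrally flipped position in the 0-4 kHz band.

```python
import cmath
import math

def dft_magnitude(x, k):
    """Magnitude of the k-th DFT bin of the real sequence x."""
    n = len(x)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * i / n)
                   for i, v in enumerate(x)))

def peak_frequency(x, fs):
    """Frequency in Hz of the strongest DFT bin between 0 and fs/2."""
    n = len(x)
    k = max(range(n // 2 + 1), key=lambda b: dft_magnitude(x, b))
    return k * fs / n

fs = 16000      # wideband sampling rate from the abstract
f_tone = 6000   # assumed test tone inside the 4-8 kHz band
n = 320         # 20 ms of signal, so bins are spaced 50 Hz apart

x = [math.sin(2 * math.pi * f_tone * i / fs) for i in range(n)]
y = x[::2]      # decimate by 2 with no lowpass filter

print(peak_frequency(x, fs))       # 6000.0
print(peak_frequency(y, fs // 2))  # 2000.0
```

The 6 kHz tone reappears at 8 kHz - 6 kHz = 2 kHz after decimation, which is the flipped aliasing component that the proposed method later separates from the genuine narrowband content.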
Article
Deep learning methods have been successfully applied to audio super-resolution tasks. Although deep learning methods produce good performance, they are not practical for real-world applications due to the large number of computations. To address this problem, we propose Recursive Feature Diversity Networks (RFD-Nets), a lightweight model for achieving fast and accurate audio super-resolution. RFD-Nets are composed of a Recursive Feature Diversity (RFD) block and a Back-Projection (BP) block. Specifically, the RFD block is a recursive structure to iteratively refine and extract hierarchical audio features. Subsequently, using an up-and-down sampling learner, the proposed BP block can effectively capture the deep relationships between High-Resolution (HR) and Low-Resolution (LR) audio pairs, thus producing high-quality audio reconstruction. Furthermore, we collect seven different types of complex audio datasets for training and comprehensively evaluating the proposed method. Extensive experiments demonstrate that our RFD-Nets can achieve superior accuracy on the proposed benchmark datasets against state-of-the-art methods while requiring less computation and memory. Datasets are released at https://github.com/JiangBoCS/RFDN.
Chapter
Spectral Band Replication (SBR) is an enhancement compression technology. The SBR is a bandwidth extension method which significantly improves the compression efficiency of perceptual audio and speech coding schemes. There are two versions of the SBR technology: Standard SBR and Low Delay SBR (LD-SBR). Central to the operation of standard SBR and LD-SBR are dedicated complex exponential-modulated and real-valued cosine-modulated quadrature mirror filter (QMF) banks as the basic mathematical tools to analyze and synthesize audio signals. This chapter presents the complete unified efficient implementations of complex exponential-modulated and real-valued cosine-modulated QMF banks used both in the standard SBR and LD-SBR encoder and decoder. In general, the following is presented for each QMF bank: its definition as an equivalent block transform with a common parameter M representing the number of sub-bands, its general symmetry property in the frequency or time domain, and the derivation of a fast algorithm for its efficient implementation. All the fast algorithms are analyzed in detail in terms of the arithmetic complexity, regularity, and structural simplicity for a potential real-time low-cost implementation in hardware or software.
Chapter
Perceptual audio coding at low bit rates often relies on semi-parametric or parametric techniques to efficiently transmit and restore audio content that, after receiving, may be very different to the original in its waveform, but is perceptually still very close to it. Audio bandwidth extension exploits the limited resolution of the human auditory perception at high frequencies to recreate a spectral high band from the transmitted spectral low band and post-processing parameters, which elicits the sensation of plausible high frequency content that perceptually fuses with the low band into a decent broadband audio perception. The following chapter details the underlying thoughts, design criteria, perceptual trade-offs and signal processing techniques found in contemporary low bit rate audio codecs using audio bandwidth extension.
Conference Paper
Modern audio coding technologies apply methods of bandwidth extension (BWE) to efficiently represent audio data at low bitrates. An established method is the well-known spectral band replication (SBR), which can provide very high sound quality with imperceptible artifacts. However, its bitrate and complexity are very high. Another notable method is LPC-based BWE, which is part of the 3GPP AMR-WB+ codec. Although its bitrate and complexity are distinctly lower, the sound quality it provides is unsatisfactory for music. In this paper, a novel bandwidth extension method is proposed that provides high sound quality close to eSBR at a bitrate of only 0.8 kbps. The proposed method predicts the fine structure of the high-frequency band from the low-frequency band with a deep auto-encoder and extracts only the high-frequency envelope as side information. The performance evaluation demonstrates the advantage of the proposed method compared to the state of the art. Compared with eSBR, the bitrate drops by about 63 %, and the subjective listening quality is close to it. Compared with LPC-based BWE, the subjective listening quality is better at the same bitrate.
Article
Spectral band replication (SBR) is an important tool in MPEG-4 high efficiency (HE) advanced audio coding (AAC). One key parameter affecting the quality of coded audio in SBR is the frequency band from which replication starts. This paper studies the influences of the start-band frequency on the quality of the coded audio and proposes an algorithm to determine this parameter in real time. The simulation results show that a piece of music coded with the proposed approach has a better objective difference grade (ODG) than that coded with the 3GPP reference program. For real-time applications such as encoding and broadcasting music in a live concert, the proposed approach is a better choice.
Article
Numerous applications require realistic and computationally efficient late reverberation. In this paper, the perceptually relevant properties of reverberation are identified, and a novel frequency transform domain reverberator that fulfills these properties is proposed. Listening test results confirm that the proposed reverberator has a perceptual quality equivalent to the ideal solution of decaying Gaussian noise in frequency bands.
Article
Multichannel audio, when reproduced or transmitted over a smaller number of channels, needs to be downmixed. In case the channels contain non-aligned interdependent sounds, the downmixed signal may attain perceivable spectral bias. In the present study a time-frequency domain phase-adaptive downmixing technique is proposed to reduce such spectral effects when necessary. In detail, the technique aligns the phases of the input channel pairs or groups having a high measured normalized inter-channel coherence prior to the downmixing. Simulations and listening tests were conducted to show the conditions in which the proposed method provides benefits with respect to the legacy methods that do not apply phase processing. Computational evaluations showed that the method may be implemented to run in real time with reasonable hardware requirements also with a larger number of input channels, such as with 22.2 surround. The proposed method was adopted in the upcoming MPEG-H 3D Audio coding standard with minor modifications.
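The spectral effect that motivates phase-adaptive downmixing can be shown on a pair of complex time-frequency bins. The sketch below is a simplified illustration under stated assumptions, not the method of the paper: it always rotates the second channel to the phase of the first, whereas the abstract describes aligning only channels with a high measured inter-channel coherence. The function names and the example bins are hypothetical.

```python
import cmath

def passive_downmix(a, b):
    """Plain average of two channels, bin by bin."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def phase_aligned_downmix(a, b):
    """Rotate each bin of b to the phase of the matching bin of a, then average.

    Simplification: no coherence measurement is made, alignment is unconditional.
    """
    out = []
    for x, y in zip(a, b):
        rotation = cmath.exp(1j * (cmath.phase(x) - cmath.phase(y)))
        out.append((x + y * rotation) / 2)
    return out

# Two hypothetical bins: the first is fully out of phase between the channels,
# the second is already in phase.
left = [1 + 0j, 1j]
right = [-1 + 0j, 1j]

print([abs(v) for v in passive_downmix(left, right)])        # first bin cancels
print([abs(v) for v in phase_aligned_downmix(left, right)])  # first bin preserved
```

In the passive downmix the out-of-phase bin cancels completely, which is the kind of spectral bias the abstract refers to; after phase alignment both bins keep their magnitude.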
Article
The bandwidth limitation of wideband audio degrades the subjective quality and the naturalness. In this paper, a bandwidth extension of audio signals from wideband to super-wideband was proposed by using a similarity correlation degree-based neural network. Firstly, the fine spectrum of wideband audio was converted to a multi-dimensional phase space. Then, a similarity correlation degree-based neural network was built up to reproduce the high-frequency fine spectrum. In addition, a Gaussian mixture model was used to estimate the high-frequency spectral envelope. Finally, the bandwidth was extended to super-wideband by the proposed method in the ITU-T G.722.1 wideband codec. Evaluation results indicate that the proposed method is preferred over the reference methods and achieves a subjective quality comparable to that of the G.722.1C super-wideband codec. © 2015, Chinese Institute of Electronics. All rights reserved.
Conference Paper
An enhanced bandwidth extension scheme is introduced in this paper for wideband speech coding using ADPCM. The coded lower band signal plus small side information (some parameters) are transmitted instead of the whole band. In the decoder both frequency parts are reconstructed from the coded signal and the received parameters. In the proposed method, the high frequency part is derived from the excitation signal of the low frequency part and the received spectral information in the form of vector quantized LPC. Optimum codebook element is chosen, among candidates, that when used in conjunction with the lower frequency part excitation best matches the high frequency signal. The search is carried out in a sorted codebook and the index of the chosen element is transmitted as a Huffman code. The results show that 23% reduction in the bit rate is achieved, in comparison with G.722 standard, bearing no degradation in quality.
Conference Paper
Modern audio coding technologies apply methods of bandwidth extension (BWE) to efficiently represent audio data at low bitrates. An established method is the well-known spectral band replication (SBR) that is part of MPEG High Efficiency Advanced Audio Coding (HE-AAC). However, if the signal features a distinct harmonic spectral structure, the use of these methods tends to result in audible artifacts, because the harmonic structure is not reconstructed correctly. In this paper a bandwidth extension method is proposed which eliminates the undesirable effects and allows for an efficient implementation in the Modified Discrete Cosine Transform (MDCT) domain. The proposed Harmonic Spectral Bandwidth Extension (HSBE) method uses arbitrary frequency shifts for modulating the replicated spectrum in a way that the harmonic structure of the signal is preserved. A listening test demonstrates the advantage of the proposed method compared to the state of the art.
Article
For conventional bandwidth extension, the spectral patching methods, such as spectral folding, spectral translation and non-linear processing, are employed to reconstruct high frequency signal, yet it leads to the spectral shifting between reconstructed and original signal, and does not retain the original harmonic relations. In this paper, a blind harmonic bandwidth extension method from wideband to super-wideband was proposed by estimating the energy of high frequency spectral envelope with Gaussian mixture model (GMM). Both the objective and subjective test results show that proposed algorithm performs better than conventional blind bandwidth extension algorithms.
Article
Parametric stereo is a state-of-the-art stereo coding method. It can efficiently encode a stereo audio signal into two parts: a monaural signal and a small amount of stereo parameters. In this paper, the implementation of parametric stereo in the Digital Radio Mondiale (DRM) system is presented. The difference between parametric stereo in DRM and MPEG-4 is also discussed. Using parametric stereo, the efficiency of the audio encoder with scalable bit rate in the DRM transmitter can be significantly improved. Experimental results indicate that the total bit rate can be reduced by more than 33% while the audio quality remains the same.
Article
Some existing blind bandwidth extension algorithms are restricted to linear extrapolation, such as linear spectrum replication. In this paper, we propose an improved blind bandwidth extension method using a non-linear extrapolation to reconstruct the missing high-frequency components. By training weights of the neural network in real time, the high-band signal can be reconstructed dynamically. Objective and subjective listening tests are used to evaluate the performance of the proposed algorithm, and the results show that it obtains higher quality signals.
Article
The theory of autoregressive (AR) modeling, also known as linear prediction, has been established by the Fourier analysis of infinite discrete-time sequences or continuous-time signals. Nevertheless, for various finite-length discrete trigonometric transforms (DTTs), including the discrete cosine and sine transforms of different types, the theory is not well established. Several DTTs have been used in current audio coding, and the AR modeling method can be applied to reduce coding artifacts or exploit data redundancies. This paper systematically develops the AR modeling fundamentals of temporal and spectral envelopes for the sixteen members of the DTTs. This paper first considers the AR modeling in the generalized discrete Fourier transforms (GDFTs). Then, we extend the modeling to all the DTTs by introducing the analytic transforms which convert the real-valued vectors into complex-valued ones. Through the process, we build the compact matrix representations for the AR modeling of the DTTs in both time domain and DTT domain. These compact forms also illustrate that the AR modeling for the envelopes can be performed through the Hilbert envelope and the power envelope. These compact forms can be used to develop new coding technologies or examine the possible defects in the existing AR modeling methods for DTTs. We apply the forms to analyze the current temporal noise shaping (TNS) tool in MPEG-2/4 advanced audio coding (AAC).
Conference Paper
In this paper a new method of blind bandwidth extension from wideband (WB) to super-wideband (SWB) audio is proposed. A Radial Basis Function (RBF) neural network is utilized to predict the high-frequency (HF) coefficients based on the nonlinear characteristics of audio spectrum series. In addition, linear extrapolation is used for reconstructing the envelope of the HF spectrum. The bandwidth of the reconstructed audio signals is extended to SWB by using the proposed method. The result of the objective performance evaluation indicates that the proposed method can reconstruct the truncated HF components effectively and outperforms the conventional algorithms of blind bandwidth extension.
Conference Paper
In systems for High Frequency Reconstruction (HFR) employing harmonic transposition, there is a risk of introducing artifacts due to conflicting design goals for stationary, transient, and periodic signals. We describe a filter bank implementation of the transposer which addresses this problem by window design, frequency domain oversampling, and cross products.
Conference Paper
Binaural rendering technology is used to generate a two-channel signal from one or more channel signals, where each channel signal has associated with it a position relative to the listener. The resulting binaural signal, when played back over an appropriate device such as headphones, gives the sensation of audio signal(s) originating from the assigned position. The binaural rendering process typically involves applying a pair of Head-Related Transfer Function (HRTF) equalizers to each input channel signal. The left and right ear signals from each of the input channels are then combined to generate a binaural signal. In this paper, we introduce a Time/Frequency (T/F) domain HRTF equalization technique which can be used to accomplish HRTF-based binaural rendering. The proposed technique can be conveniently combined with multi-channel/spatial decoding systems, such as a multi-channel HE-AAC or MPEG Surround decoder, for low-complexity binaural rendering of multi-channel programs.
Article
MPEG-4 High-Efficiency Advanced Audio Coding (HE-AAC) has adopted spectral band replication (SBR) to efficiently compress the high-frequency part of the audio. In SBR, linear prediction is applied to low-frequency subbands to suppress tonal components and smooth the associated spectra for replicating to high-frequency bands. Such a tone-suppressing process is referred to as whitening filtering. In SBR, to avoid the alias artifact incurred by spectral adjustment, a complex filterbank instead of real filterbank is adopted. For QMF subbands, this paper demonstrates that the linear prediction defined in the SBR standard results in a predictive bias. A new whitening filter, called the decimation-whitening filter, is proposed to eliminate the predictive bias and provide advantages in terms of noise-to-signal ratio measure, frequency resolution, energy leakage, and computational complexity for SBR.