Tom Bäckström

Aalto University · Department of Signal Processing and Acoustics

Professor D.Sc.

About

151
Publications
19,658
Reads
1,396
Citations
Introduction
Associate Professor at Aalto University, currently interested in 1. Privacy in speech and audio applications, 2. Distributed speech processing, 3. Speech and audio coding
Additional affiliations
December 2019 - present
Aalto University
Position
  • Professor (Associate)
August 2016 - November 2019
Aalto University
Position
  • Professor
January 2013 - July 2016

Publications

Publications (151)
Chapter
Recent studies show that the spatial distribution of the sentence representations generated from pre-trained language models is highly anisotropic. This results in a degradation in the performance of the models on the downstream task. Most methods improve the isotropy of the sentence embeddings by refining the corresponding contextual word represen...
Article
Full-text available
Current sound-based practices and systems developed in both academia and industry point to convergent research trends that bring together the field of Sound and Music Computing with that of the Internet of Things. This paper proposes a vision for the emerging field of the Internet of Sounds (IoS), which stems from such disciplines. The IoS relates...
Preprint
Full-text available
Speech technology for communication, accessing information and services has rapidly improved in quality. It is convenient and appealing because speech is the primary mode of communication for humans. Such technology however also presents proven threats to privacy. Speech is a tool for communication and it will thus inherently contain private inform...
Article
Full-text available
The use of speech source localization (SSL) and its applications offer great possibilities for the design of speaker local positioning systems with wireless acoustic sensor networks (WASNs). Recent works have shown that data-driven front-ends can outperform traditional algorithms for SSL when trained to work in specific domains, depending on factor...
Article
Full-text available
Machine learning algorithms have been shown to be highly effective in solving optimization problems in a wide range of applications. Such algorithms typically use gradient descent with backpropagation and the chain rule. Hence, the backpropagation fails if intermediate gradients are zero for some functions in the computational graph, because it cau...
Article
Full-text available
The state-of-the-art speaker recognition systems are usually trained on a single computer using speech data collected from multiple users. However, these speech samples may contain private information which users may not be willing to share. To overcome potential breaches of privacy, we investigate the use of federated learning with and without sec...
Article
Full-text available
Enhancement algorithms for wireless acoustic sensor networks (WASNs) are indispensable with the increasing availability and usage of connected devices with microphones. Conventional spatial filtering approaches for enhancement in WASNs approximate quantization noise with an additive Gaussian distribution, which limits performance due to the non-lin...
Preprint
Full-text available
Enhancement algorithms for wireless acoustic sensor networks (WASNs) are indispensable with the increasing availability and usage of connected devices with microphones. Conventional spatial filtering approaches for enhancement in WASNs approximate quantization noise with an additive Gaussian distribution, which limits performance due to the non-li...
Conference Paper
Full-text available
Voice based devices and virtual assistants are widely integrated into our daily life, but the growing popularity has also raised concerns about data privacy in processing and storage. While improvements in technology and data protection regulations have been made to provide users a more secure experience, the concept of privacy continues to be subj...
Preprint
Full-text available
The COVID-19 outbreak disrupted different organizations, employees and students, who turned to teleconference applications to collaborate and socialize even during the quarantine. Thus, the demand for teleconferencing applications surged with mobile application downloads reaching the highest number ever seen. However, some of the teleconference appl...
Article
Full-text available
Voice user interfaces can offer intuitive interaction with our devices, but the usability and audio quality could be further improved if multiple devices could collaborate to provide a distributed voice user interface. To ensure that users’ voices are not shared with unauthorized devices, it is however necessary to design an access management syste...
Conference Paper
Full-text available
Voice user interfaces have increased in popularity, as they enable natural interaction with different applications using one's voice. To improve their usability and audio quality, several devices could interact to provide a unified voice user interface. However, with devices cooperating and sharing voice-related information, user privacy may be at...
Preprint
Full-text available
Processing of speech and audio signals with time-frequency representations requires windowing methods which allow perfect reconstruction of the original signal and where processing artifacts have a predictable behavior. The most common approach for this purpose is overlap-add windowing, where signal segments are windowed before and after processing....
Preprint
Full-text available
Recent speech and audio coding standards such as 3GPP Enhanced Voice Services match the foreseeable needs and requirements in transmission of speech and audio, when using current transmission infrastructure and applications. Trends in Internet-of-Things technology and development in personal digital assistants (PDAs) however beg us to consider fut...
Conference Paper
Full-text available
Advanced coding algorithms yield high quality signals with good coding efficiency within their target bit-rate ranges, but their performance suffers outside the target range. At lower bitrates, the degradation in performance is because the decoded signals are sparse, which gives a perceptually muffled and distorted characteristic to the signal. Stan...
Conference Paper
Full-text available
State-of-the-art speech codecs achieve a good compromise between quality, bitrate and complexity. However, retaining performance outside the target bitrate range remains challenging. To improve performance, many codecs use pre- and post-filtering techniques to reduce the perceptual effect of quantization-noise. In this paper, we propose a postfilte...
Conference Paper
Spectral envelope modelling is a central part of speech and audio codecs and is traditionally based on either vector quantization or scalar quantization followed by entropy coding. To bridge the coding performance of vector quantization with the low complexity of the scalar case, we propose an iterative approach for entropy coding the spectral env...
Chapter
This chapter presents an overview of the theory and practice of the tools typically needed in time-frequency processing of audio channels. It introduces a few signal processing techniques commonly used, and serves as background information for the reader. The chapter assumes understanding of basic digital signal processing techniques from the reade...
Article
Efficient coding of speech and audio in a distributed system requires that quantization errors across nodes are uncorrelated. Yet with conventional methods at low bitrates, quantization levels become increasingly sparse, which does not correspond to the distribution of the input signal and importantly, also reduces coding efficiency in a distribute...
Chapter
The final basic component of the speech source model is that of the intensity, volume or energy of the two main components, voiced and unvoiced parts of the speech signal. Usually, signal output energy is modelled by simple gain factors applied to the excitations of the filters. The two parts, unvoiced excitation corresponding to the residual codeb...
Chapter
A model for the pitch is a central part of any source model of speech. It corresponds to the oscillations of the vocal folds and is usually modelled with a long-term predictor. In speech codecs it is generally implemented as an adaptive vector codebook, where the residual of the linear predictive filter is modelled using the past residual. The mode...
Chapter
At low bitrates, basic CELP codecs have an easily recognisable distortion characteristic often described as noisiness or roughness. To reduce the perceptual distortion, we can try to identify and remove typical artefacts by filtering the decoded signal. In this chapter we present most typical postfiltering techniques such as formant enhancement and...
Chapter
Signals which are sufficiently stationary permit highly efficient coding in the frequency domain. Such signals include speech signals such as sustained vowels and prolonged fricatives, as well as generic audio signals such as music and mixed material. The main components of frequency domain coding methods include windowing, a time-frequency transfo...
Chapter
Full-text available
Transmission over real-world networks will occasionally suffer from transmission errors, which can significantly deteriorate the perceived quality of a speech codec. This chapter addresses the problem of transmission errors in packet based voice applications, such as voice over Internet protocol (VoIP). A broad range of techniques for recovery from...
Chapter
Speech processing algorithms usually segment signals into finite-length blocks or windows, since block operations are generally more efficient in terms of both bitrate and computational complexity. Speech codecs model temporal correlations with linear prediction for both coding efficiency as well as to enable smooth transitions between frames. This...
Chapter
Envelope models describe the gross shape of a signal, such as the magnitude spectrum of a speech signal. An envelope model of the spectrum is thus a source model of the speech signal. Perceptual (frequency) masking models are evaluation models, which describe the magnitude of the perceptually detrimental effect of errors in different parts of the s...
Chapter
Perceptual audio coding at low bit rates often relies on semi-parametric or parametric techniques to efficiently transmit and restore audio content that, after receiving, may be very different to the original in its waveform, but is perceptually still very close to it. Audio bandwidth extension exploits the limited resolution of the human auditory...
Chapter
While code-excited linear prediction shows good performance for bitrates above 8 kbits/s, the quality of the speech-specific waveform coding scheme drops noticeably at lower rates. At such low rates, all information included in the waveform of the speech cannot be correctly coded. On the other hand, parametric speech coders also called vocoders con...
Chapter
For evaluating the performance of speech codecs or their algorithms, we need metrics for quality. Since humans are the ultimate users of most speech codecs, subjective testing is the gold standard in quality measurement. In this chapter we present a range of typical subjective evaluation methods and discuss their strengths and weaknesses. Subjectiv...
Chapter
The spectral envelope and fundamental frequency of a speech signal are generally modelled by linear, short- and long-term predictive synthesis filters. The residual from these two filters is a signal with almost no temporal correlation. In this section we describe modelling and optimisation of the residual quantisation. The most famous approach...
Chapter
The objective of speech coding is to transmit speech at the highest possible quality with the lowest possible amount of resources. To achieve the best compromise, we can use available information about 1. the source, which is the speech production system, 2. the quality measure or evaluation criteria, which depends on the performance of the human h...
Chapter
Voice Activity Detection (VAD) provides the information whether an audio signal contains speech or not. Besides speech coding and transmission, there are many other applications in speech and audio processing that benefit from this information, and their performance is crucially dependent on the accuracy and robustness of the applied VAD. Various a...
Chapter
Humans produce speech sounds by pushing air out of the lungs and letting the vocal folds oscillate by the airflow as well as by turbulent constrictions in the vocal tract. The flow-waveform thus created is further modulated by the resonances of the vocal tract. These features form the characteristic properties of phones. For efficient coding, we mu...
Article
This study proposes an approach for glottal inverse filtering of acoustic speech signals using quadratic programming (QPR). The method aims to jointly model the effect of vocal tract and lip radiation with a single filter whose coefficients are optimized using QPR. This optimization is based on the principles of closed phase analysis, where the con...
Conference Paper
In mobile communications, environmental noise often reduces the quality and intelligibility of speech. Problems caused by far-end noise, in the sending side of the communication channel, can be alleviated by using a noise reducing preprocessing stage before the encoder. In this study, a modification increasing the robustness of the encoder itself t...
Conference Paper
Conventional music coders, based on a modified discrete cosine transform (MDCT) suffer greatly when lowering their bit-rate and delay. In particular, tonal music signals are penalized by short analysis windows and the variable length coding of the quantized MDCT coefficients demands a significant amount of bits for coding the harmonic structure. Fo...
Conference Paper
Envelope models are common in speech and audio processing: for example, linear prediction is used for modeling the spectral envelope of speech, whereas audio coders use scale factor bands for perceptual masking models. In this work we introduce an envelope model called distribution quantizer (DQ), with the objective of combining the accuracy of lin...
Conference Paper
Following speech on TV or radio in the presence of interferers is sometimes challenging, in particular for the elderly and the hearing-impaired. To evaluate the performance of speech enhancement methods for such scenarios, we consider a stereo mixture composed of a speech signal and interfering sources. We apply different approaches to separate the...
Conference Paper
The majority of speech coding algorithms are based on the code excited linear prediction (CELP) paradigm, modelling the speech signal by linear prediction. This coding approach offers the advantage of a very short algorithmic delay, due to the windowing scheme based on rectangular windowing of the residual of the linear predictor. Although widely u...
Patent
Full-text available
An apparatus for encoding an audio signal having a stream of audio samples has: a windower for applying a prediction coding analysis window to the stream of audio samples to obtain windowed data for a prediction analysis and for applying a transform coding analysis window to the stream of audio samples to obtain windowed data for a transform analys...
Conference Paper
Main-stream speech codecs are based on modelling the speech source by a linear predictor. An efficient domain for quantization and coding of this linear predictor is the line spectral frequency representation, where the predictor is encoded into an ordered set of frequencies that correspond to the roots of the corresponding line spectral polynomial...
Article
Linear prediction is one of the most established techniques in signal estimation, and it is widely utilized in speech signal processing. It has been long understood that the nerve firing rate of the human auditory system can be approximated by power law non-linearity, and this has been the motivation behind using perceptual linear prediction in extract...
Article
The Vandermonde transform was recently presented as a time-frequency transform which, in contrast to the discrete Fourier transform, also decorrelates the signal. Although the approximate or asymptotic decorrelation provided by Fourier is sufficient in many cases, its performance is inadequate in applications which employ short windows. The Vande...
Article
Minimum variance distortionless response (MVDR) is a classic design criterion in signal adaptive spectral analysis as well as filter design. We extend this approach to filterbanks with the constraint that transform domain signal components must be uncorrelated. Our analysis shows that filterbanks based on Vandermonde decomposition of the autocorrela...
Conference Paper
In the analysis of speech production, glottal inverse filtering has proved to be an effective yet non-invasive method for obtaining information about the voice source. One of the main challenges of the existing methods is blind estimation of the contribution of the lip radiation, which must often be manually determined. To obtain a fully automatic...
Patent
Full-text available
A multi-mode audio signal decoder has a spectral value determinator to obtain sets of decoded spectral coefficients for a plurality of portions of an audio content and a spectrum processor configured to apply a spectral shaping to a set of spectral coefficients in dependence on a set of linear-prediction-domain parameters for a portion of the audio...
Article
Efficient speech signal representations are prerequisite for efficient speech processing algorithms. The Vandermonde transform is a recently introduced time-frequency transform which provides a sparse and uncorrelated speech signal representation. In contrast, the Fourier transform only decorrelates the signal approximately. To achieve complete dec...
Article
Modern speech codecs based on Code Excited Linear Prediction (CELP) employ an analysis-by-synthesis optimization loop to find the best quantization of the source model parameters. With this approach, optimal quantization can be achieved only with an exhaustive search. Instead, we propose to use matrix factorization to decorrelate the objective func...
Article
Full-text available
By deriving a factorization of Toeplitz matrices into the product of Vandermonde matrices, we demonstrate that the Euclidean norm of a filtered signal is equivalent with the Euclidean norm of the appropriately frequency-warped and scaled signal. In effect, we obtain an equivalence between the energy of frequency-warped and filtered signals. While t...
Patent
Full-text available
An apparatus for obtaining a parameter describing a variation of a signal characteristic of a signal on the basis of actual transform-domain parameters describing the audio signal in transform-domain includes a parameter determinator. The parameter determinator is configured to determine one or more model parameters of a transform-domain variation...
Conference Paper
Speech and audio coding have during the last decade converged to an increasingly unified technology. This contribution discusses one of the remaining fundamental differences between speech and audio paradigms, namely, windowing of the input signal. Audio codecs generally use lapped transforms and apply a perceptual model in the transform domain, wh...
Article
The covariance matrix of a multichannel audio signal is a measure that contains the channel energies and the inter-channel dependencies. This measure is perceptually relevant in frequency bands, since with the effect of the acoustic transfer path, it forms the inter-aural cues based on which the human spatial hearing system decodes the spatial soun...
Patent
Full-text available
A multi-mode audio signal decoder has a spectral value determinator to obtain sets of decoded spectral coefficients for a plurality of portions of an audio content and a spectrum processor configured to apply a spectral shaping to a set of spectral coefficients in dependence on a set of linear-prediction-domain parameters for a portion of the audio...
Conference Paper
Full-text available
Speech coding algorithms based on Algebraic Code Excited Linear Prediction (ACELP) quantize and code the residual signal with a fixed algebraic codebook. This paper generalizes the conventional encoding to provide optimal bit consumption for all codebook designs by enumerating all possible states. The codebook designed is aimed for constant bit-ra...
Patent
An apparatus for decoding an encoded audio signal, wherein one or more tracks are associated with the encoded audio signal, each one of the tracks having a plurality of track positions and a plurality of pulses is provided. The apparatus comprises a pulse information decoder (110) and a signal decoder (120). The pulse information decoder (110) is a...
Patent
An apparatus for decoding (10; 40; 60; 410), an apparatus for encoding (510), a method for decoding and a method for encoding positions of slots comprising events in an audio signal frame and respective computer programs and encoded signals, wherein the apparatus for decoding (10; 40; 60; 410) comprises: an analysing unit (20; 42; 70; 420) for anal...
Patent
A unified speech and audio decoder is described, which comprises a frame buffer configured to buffer a sub-part of a datastream composed of consecutive frames in units of the frames so that the subpart continuously comprises at least one frame, each frame representing a coded version of a respective portion of consecutive portions of an audio signa...
Patent
An apparatus for obtaining a parameter describing a variation of a signal characteristic of a signal on the basis of actual transform-domain parameters describing the audio signal in transform-domain comprises a parameter determinator. The parameter determinator is configured to determine one or more model parameters of a transform-domain variatio...

Questions

Question (1)
Question
I'm considering writing a "Speech Processing 101" compendium for the course I'm teaching, because I'm not aware of any existing good material for such a course. That leads to two questions:
- Could you recommend a forum for publishing educational material with an open-access license?
- Alternatively, are you aware of a good resource for Speech Processing 101, with an emphasis on the DSP side (that is, not speech recognition)?
