Jon Philip Barker
The University of Sheffield | Sheffield · Department of Computer Science (Faculty of Engineering)

PhD in Computer Science

About

189
Publications
27,814
Reads
5,866
Citations
Citations since 2017
75 Research Items
3,361 Citations
[Chart: citations per year, 2017–2023]
Introduction
I am a Professor in Computer Science at the University of Sheffield, and a member of the Speech and Hearing Research Group. My research interests include speech recognition by humans and machines, audio-visual speech processing, machine listening and the application of machine learning to audio processing.
Additional affiliations
January 2008 - present
The University of Sheffield

Publications

Publications (189)
Preprint
Full-text available
In this paper, we explore an improved framework to train a monaural neural enhancement model for robust speech recognition. The designed training framework extends the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data. It is found that the unpaired clean speech is crucial to improve the quality of...
Preprint
Accurate objective speech intelligibility prediction algorithms are of great interest for many applications such as speech enhancement for hearing aids. Most algorithms measure the signal-to-noise ratios or correlations between the acoustic features of clean reference signals and degraded signals. However, these hand-picked acoustic features are...
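As background for the correlation-based intrusive metrics this abstract alludes to, here is a minimal sketch in the spirit of STOI, not the paper's model: it compares short-time log-energy band envelopes of the clean and degraded signals by Pearson correlation. The band layout, window sizes, and toy signals are arbitrary choices for the demo.

```python
# Minimal sketch of an intrusive, correlation-based intelligibility metric.
import numpy as np

def band_envelopes(x, n_fft=512, hop=256, n_bands=15):
    """Short-time log-energy envelopes in linearly spaced FFT-bin bands."""
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    bands = np.stack([spec[:, lo:hi].sum(axis=1)
                      for lo, hi in zip(edges[:-1], edges[1:])], axis=1)
    return np.log(bands + 1e-10)                 # shape: (frames, bands)

def envelope_correlation(clean, degraded):
    """Mean band-wise Pearson correlation between envelope trajectories."""
    c, d = band_envelopes(clean), band_envelopes(degraded)
    c = (c - c.mean(0)) / (c.std(0) + 1e-10)
    d = (d - d.mean(0)) / (d.std(0) + 1e-10)
    return float((c * d).mean())                 # higher -> more intelligible

sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 220 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
noisy = clean + 0.5 * np.random.default_rng(0).standard_normal(sr)
print(envelope_correlation(clean, noisy))        # < 1.0; drops as noise grows
```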
Preprint
End-to-end models have achieved significant improvements in automatic speech recognition. One common method to improve the performance of these models is expanding the data space through data augmentation. Meanwhile, front-ends inspired by human audition have also demonstrated improvements for automatic speech recognisers. In this work, a well-verified audi...
Preprint
Non-intrusive intelligibility prediction is important for its application in realistic scenarios, where a clean reference signal is difficult to access. The construction of many non-intrusive predictors requires either ground truth intelligibility labels or clean reference signals for supervised learning. In this work, we leverage an unsupervised un...
Article
Full-text available
This paper presents the Clarity Speech Corpus, a publicly available, forty-speaker British English speech dataset. The corpus was created for the purpose of running listening tests to gauge speech intelligibility and quality in the Clarity Project, which has the goal of advancing speech signal processing by hearing aids through a series of challeng...
Article
Full-text available
Acoustic modelling for automatic dysarthric speech recognition (ADSR) is a challenging task. Data deficiency is a major problem and substantial differences between typical and dysarthric speech complicate the transfer learning. In this paper, we aim at building acoustic models using the raw magnitude spectra of the source and filter components for...
Conference Paper
Full-text available
In recent years, rapid advances in speech technology have been made possible by machine learning challenges such as CHiME, REVERB, Blizzard, and Hurricane. In the Clarity project, the machine learning approach is applied to the problem of hearing aid processing of speech-in-noise, where current technology in enhancing the speech signal for the hear...
Conference Paper
A novel crowdsourcing project to gather children's storytelling-based language samples using a mobile app was undertaken across the United Kingdom. Parents' scaffolding of children's narratives was observed in many of the samples. This study was designed to examine the relationship of scaffolding and young children's narrative language ability in a...
Conference Paper
Current hearing aids normally provide amplification based on a general prescriptive fitting, and the benefits provided by the hearing aids vary among different listening environments despite the inclusion of noise suppression features. Motivated by this fact, this paper proposes a data-driven machine learning technique to develop hearing aid fitting...
Preprint
In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard...
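To make the criterion concrete, here is a minimal sketch of the MixIT assignment search: two mixtures are summed, the model estimates M sources from the mixture-of-mixtures, and the loss minimises over all ways of assigning estimated sources back to the two mixtures. The MSE reconstruction loss is a simplification; published MixIT work typically uses a negative-SNR loss.

```python
# Minimal sketch of the mixture invariant training (MixIT) criterion.
import itertools
import numpy as np

def mixit_loss(mix1, mix2, est_sources):
    """est_sources: (M, T) model outputs for the summed mixture-of-mixtures."""
    best = np.inf
    M = est_sources.shape[0]
    # Each estimated source is assigned to exactly one of the two mixtures.
    for assign in itertools.product([0, 1], repeat=M):
        a = np.array(assign)
        r1 = est_sources[a == 0].sum(axis=0) - mix1
        r2 = est_sources[a == 1].sum(axis=0) - mix2
        best = min(best, float((r1 ** 2).mean() + (r2 ** 2).mean()))
    return best

rng = np.random.default_rng(0)
s = rng.standard_normal((4, 8000))           # four stand-in "sources"
mix1, mix2 = s[0] + s[1], s[2] + s[3]
print(mixit_loss(mix1, mix2, s))             # perfect estimates -> loss 0.0
```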
Preprint
Current hearing aids normally provide amplification based on a general prescriptive fitting, and the benefits provided by the hearing aids vary among different listening environments despite the inclusion of noise suppression features. Motivated by this fact, this paper proposes a data-driven machine learning technique to develop hearing aid fitting...
Preprint
Full-text available
Hearing aids are expected to improve speech intelligibility for listeners with hearing impairment. An appropriate amplification fitting tuned for the listener's hearing disability is critical for good performance. The development of most prescriptive fittings is based on data collected in subjective listening experiments, which are usually expens...
Preprint
Full-text available
In this paper, we ask whether vocal source features (pitch, shimmer, jitter, etc) can improve the performance of automatic sung speech recognition, arguing that conclusions previously drawn from spoken speech studies may not be valid in the sung speech domain. We first use a parallel singing/speaking corpus (NUS-48E) to illustrate differences in su...
Preprint
In this paper, we present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network which employs speaker embeddings to identify and extract multiple t...
Preprint
This paper introduces a new method for multi-channel time domain speech separation in reverberant environments. A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings, with no need of conventional spatial feature extraction. To reduce the influence of reverberation on spatial fea...
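For orientation, here is a bare-bones single-channel sketch of the encoder / mask-estimator / decoder pattern that time-domain separation networks of this family follow. The class name and layer sizes are illustrative only; the paper's network is multi-channel and considerably more elaborate.

```python
# A tiny time-domain separation skeleton in PyTorch (an illustration of the
# encoder / mask / decoder pattern, not the paper's architecture).
import torch
import torch.nn as nn

class TinyTimeDomainSeparator(nn.Module):
    def __init__(self, n_src=2, n_filters=64, kernel=16, stride=8):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)
        self.masker = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.ReLU(),
            nn.Conv1d(n_filters, n_src * n_filters, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)

    def forward(self, mix):                        # mix: (batch, samples)
        feats = torch.relu(self.encoder(mix.unsqueeze(1)))       # (B, F, T')
        masks = self.masker(feats).view(
            mix.size(0), self.n_src, -1, feats.size(-1))         # (B, S, F, T')
        out = [self.decoder(feats * masks[:, i]) for i in range(self.n_src)]
        return torch.cat(out, dim=1)               # (B, n_src, ~samples)

model = TinyTimeDomainSeparator()
print(model(torch.randn(2, 8000)).shape)           # torch.Size([2, 2, 8000])
```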
Conference Paper
Full-text available
This extended abstract describes the system we submitted to the MIREX 2020 Lyrics Transcription task. The system consists of two modules: a source separation front-end and an ASR back-end. The first module separates the vocal from a polyphonic song by utilising a convolutional time-domain audio separation network (ConvTasNet). The second module tra...
Article
Recent advances in machine learning raise the prospect of radically improving how hearing devices deal with speech in noise and so improve many aspects of health and well-being for an aging population. In many other aspects of speech processing, rapid transformations have been enabled by a research tradition of “open chall...
Conference Paper
Full-text available
Automatic recognition of dysarthric speech is a very challenging research problem where performances still lag far behind those achieved for typical speech. The main reason is the lack of suitable training data to accommodate for the large mismatch seen between dysarthric and typical speech. Only recently has focus moved from single-word tasks to e...
Article
Full-text available
In the Clarity project, we will run a series of machine learning challenges to revolutionise speech processing for hearing devices. Over five years, there will be three paired challenges. Each pair will consist of a competition focussed on hearing-device processing (“enhancement”) and another focussed on speech perception modelling (“prediction”)....
Preprint
Full-text available
In the Clarity project, we will run a series of machine learning challenges to revolutionise speech processing for hearing devices. Over five years, there will be three paired challenges. Each pair will consist of a competition focussed on hearing-device processing ("enhancement") and another focussed on speech perception modelling ("prediction")....
Conference Paper
Full-text available
There has been much recent interest in building continuous speech recognition systems for people with severe speech impairments, e.g., dysarthria. However, the datasets that are commonly used are typically designed for tasks other than ASR development, or they contain only isolated words. As such, they contain much overlap in the prompts read by th...
Conference Paper
This paper presents an improved transfer learning framework applied to robust personalised speech recognition models for speakers with dysarthria. As the baseline of transfer learning, a state-of-the-art CNN-TDNN-F ASR acoustic model trained solely on source domain data is adapted onto the target domain via neural network weight adaptation with the...
Preprint
Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environme...
Conference Paper
Full-text available
In the Clarity project, we will run a series of machine learning challenges to revolutionise speech processing for hearing devices. Over five years, there will be three paired challenges. Each pair will consist of a challenge focussed on hearing-device processing and another focussed on speech perception modelling. The series of processing challeng...
Conference Paper
Full-text available
Automatic sung speech recognition is a relatively under-studied topic that has been held back by a lack of large and freely available datasets. This has recently changed thanks to the release of the DAMP Sing! dataset, a 1100-hour karaoke dataset originating from the social music-making company, Smule. This paper presents work undertaken to define...
Conference Paper
Full-text available
Improving the accuracy of personalised speech recognition for speakers with dysarthria is a challenging research field. In this paper, we explore an approach that non-linearly modifies speech tempo to reduce mismatch between typical and atypical speech. Speech tempo analysis at the phonetic level is accomplished using a forced-alignment process fro...
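As an illustration of the general idea (not the paper's algorithm), the sketch below time-stretches each aligned segment at its own rate. The segment boundaries and rates are hypothetical stand-ins for what a forced aligner and tempo analysis would provide; `librosa` supplies the stretching.

```python
# Illustrative sketch: non-linear tempo modification by stretching each
# phone-level segment at its own rate. Boundaries/rates are hypothetical.
import numpy as np
import librosa

def piecewise_time_stretch(y, sr, segments):
    """segments: list of (start_sec, end_sec, rate); rate > 1 speeds up."""
    out = []
    for start, end, rate in segments:
        chunk = y[int(start * sr):int(end * sr)]
        out.append(librosa.effects.time_stretch(chunk, rate=rate))
    return np.concatenate(out)

sr = 16000
y = np.sin(2 * np.pi * 200 * np.arange(sr) / sr).astype(np.float32)
# Hypothetical alignment: slow the middle "phone" down by a factor of 0.7.
segments = [(0.0, 0.3, 1.0), (0.3, 0.6, 0.7), (0.6, 1.0, 1.0)]
print(len(piecewise_time_stretch(y, sr, segments)) / sr)  # ~1.13 seconds
```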
Article
When listeners misperceive words in noise, do they report words that are more common? Lexical frequency differences between misperceived and target words in English and Spanish were examined for five masker types. Misperceptions had a higher lexical frequency in the presence of pure energetic maskers, but frequency effects were reduced or absent fo...
Conference Paper
Full-text available
Improving the accuracy of dysarthric speech recognition is a challenging research field due to the high inter- and intra-speaker variability in disordered speech. In this work, we propose to use estimated articulatory-based representations to augment the conventional acoustic features for better modeling of the dysarthric speech variability in auto...
Preprint
Human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. The process of selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on the target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN) ba...
Conference Paper
Full-text available
Most frequency domain techniques for pitch extraction such as cepstrum, harmonic product spectrum (HPS) and summation residual harmonics (SRH) operate on the magnitude spectrum and turn it into a function in which the fundamental frequency emerges as the argmax. In this paper, we investigate the extension of these three techniques to the phase and grou...
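For reference, here is a minimal magnitude-spectrum HPS pitch extractor of the conventional kind the paper extends to the phase and group-delay domains. The window length, harmonic count, and 50 Hz floor are arbitrary choices for the demo.

```python
# Minimal harmonic product spectrum (HPS) pitch extraction on one frame.
import numpy as np

def hps_pitch(frame, sr, n_harmonics=4):
    """Return an f0 estimate: argmax of the product of downsampled spectra."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    hps = spec.copy()
    for h in range(2, n_harmonics + 1):
        hps[: len(spec) // h] *= spec[::h][: len(spec) // h]
    lo = int(50 * len(frame) / sr)        # ignore sub-speech frequencies
    return (lo + np.argmax(hps[lo:])) * sr / len(frame)

sr = 16000
t = np.arange(2048) / sr
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 5))
print(hps_pitch(frame, sr))               # close to 200 Hz
```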
Article
This paper presents a bi-view (front and side) audiovisual Lombard speech corpus, which is freely available for download. It contains 5400 utterances (2700 Lombard and 2700 plain reference utterances), produced by 54 talkers, with each utterance in the dataset following the same sentence format as the audiovisual “Grid” corpus [Cooke, Barker, Cunni...
Article
Full-text available
When producing speech in noisy backgrounds, talkers reflexively adapt their speaking style in ways that increase speech-in-noise intelligibility. This adaptation, known as the Lombard effect, is likely to have an adverse effect on the performance of automatic speech recognition systems that have not been designed to anticipate it. However, previous...
Article
Full-text available
The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning. This paper introduces the 5th CHiME Challenge, which considers the task of distant multi-microphone conversational ASR in real home enviro...
Conference Paper
Full-text available
In earlier work we studied the effect of statistical normalisation for phase-based features and observed it leads to a significant robustness improvement. This paper explores the extension of the generalised Vector Taylor Series (gVTS) noise compensation approach to the group delay (GD) domain. We discuss the problems it presents, propose some solu...
Article
Full-text available
An effective way to increase noise robustness in automatic speech recognition (ASR) systems is feature enhancement based on an analytical distortion model that describes the effects of noise on the speech features. One such distortion model that has been reported to achieve a good trade-off between accuracy and simplicity is the masking model....
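The masking model can be checked numerically in a few lines: in the log-power domain, the noisy observation is approximated by the element-wise maximum of the clean speech and noise log spectra, with error bounded by log 2 per bin. A minimal sketch, with random stand-in features:

```python
# Numerical check of the masking-model approximation in the log-power domain.
import numpy as np

rng = np.random.default_rng(0)
log_x = rng.normal(0.0, 2.0, size=(100, 40))     # clean log-power features
log_n = rng.normal(0.0, 2.0, size=(100, 40))     # noise log-power features

exact = np.log(np.exp(log_x) + np.exp(log_n))    # true log-sum of powers
masked = np.maximum(log_x, log_n)                # masking-model approximation

# Error is at most log(2) ~= 0.693 per bin, and small when one term dominates.
print(np.abs(exact - masked).max())
print(np.abs(exact - masked).mean())
```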
Conference Paper
Full-text available
In earlier work we proposed a framework for speech source-filter separation that employs phase-based signal processing. This paper presents a further theoretical investigation of the model and optimisations that make the filter and source representations less sensitive to the effects of noise and better matched to downstream processing. To this end...
Conference Paper
Full-text available
Vector Taylor Series (VTS) is a powerful technique for robust ASR but, in its standard form, it can only be applied to log-filter bank and MFCC features. In earlier work, we presented a generalised VTS (gVTS) that extends the applicability of VTS to front-ends which employ a power transformation non-linearity. gVTS was shown to provide performance...
Article
Visual speech information plays a key role in supporting speech perception, especially when acoustic features are distorted or inaccessible. Recent research suggests that for spectrally distorted speech, the use of visual speech in auditory training improves not only subjects’ audiovisual speech recognition, but also their subsequent auditory-only...
Chapter
The CHiME challenge series has been aiming to advance the development of robust automatic speech recognition for use in everyday environments by encouraging research at the interface of signal processing and statistical modelling. The series has been running since 2011 and is now entering its 4th iteration. This chapter provides an overview of the...
Chapter
Recent automatic speech recognition results are quite good when the training data is matched to the test data, but much worse when they differ in some important regard, like the number and arrangement of microphones or the reverberation and noise conditions. Because these configurations are difficult to predict a priori and difficult to exhaustivel...
Conference Paper
A limited number of research developments in the field of speech enhancement have been implemented into commercially available hearing-aids. However, even sophisticated aids remain ineffective in environments where there is overwhelming noise present. Human performance in such situations is known to be dependent upon input from both the aural and v...
Article
Computational auditory scene analysis is increasingly presented in the literature as a set of auditory-inspired techniques for estimating “Ideal Binary Masks” (IBM), i.e., time-frequency domain segregations of the attended source and the acoustic background based on a local signal-to-noise ratio objective (Wang and Brown, 2006). This talk argues th...
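For concreteness, the IBM itself is a one-line construction once clean and noise power spectrograms are available: keep a time-frequency cell when its local SNR exceeds a threshold (the local criterion, LC; 0 dB below is the conventional default). The gamma-distributed spectrograms are stand-ins for the demo.

```python
# Minimal ideal binary mask (IBM) construction from power spectrograms.
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """speech_power, noise_power: (frames, bins) power spectrograms."""
    local_snr_db = 10.0 * np.log10(speech_power / (noise_power + 1e-12) + 1e-12)
    return (local_snr_db > lc_db).astype(np.float32)

rng = np.random.default_rng(0)
S = rng.gamma(2.0, 1.0, size=(100, 257))   # stand-in speech power spectrogram
N = rng.gamma(2.0, 1.0, size=(100, 257))   # stand-in noise power spectrogram
mask = ideal_binary_mask(S, N)
print(mask.mean())                          # fraction of cells kept (~0.5 here)
```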
Conference Paper
Full-text available
In earlier work we have proposed a source-filter decomposition of speech through phase-based processing. The decomposition leads to novel speech features that are extracted from the filter component of the phase spectrum. This paper analyses this spectrum and the proposed representation by evaluating statistical properties at various points along...
Conference Paper
The concept of using visual information as part of audio speech processing has been of significant recent interest. This paper presents a data driven approach that considers estimating audio speech acoustics using only temporal visual information without considering linguistic features such as phonemes and visemes. Audio (log filterbank) and visual...
Article
Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recen...
Article
Words spoken against a noise background often form an ambiguous percept. However, in certain conditions, a listener will mishear a noisy word but report hearing the same incorrect word as reported by other listeners. These consistent hearing errors are valuable as tests of detailed models of speech perception. This paper describes the collection of...
Article
This paper presents the design and outcomes of the CHiME-3 challenge, the first open speech recognition evaluation designed to target the increasingly relevant multichannel, mobile-device speech recognition scenario. The paper serves two purposes. First, it provides a definitive reference for the challenge, including full descriptions of the task d...
Conference Paper
Full-text available
Designing good normalisation to counter the effect of environmental distortions is one of the major challenges for automatic speech recognition (ASR). The Vector Taylor series (VTS) method is a powerful and mathematically well-principled technique that can be applied to both the feature and model domains to compensate for both additive and convolut...
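For reference, the additive-noise mismatch function that VTS linearises, together with its Jacobian, can be written in a few lines. This sketch omits the convolutional (channel) term for brevity, so it covers only the additive part of the compensation.

```python
# The additive-noise mismatch function underlying VTS, in the log domain:
# y = x + log(1 + exp(n - x)), i.e. log(e^x + e^n).
import numpy as np

def mismatch(log_x, log_n):
    """Noisy log feature y from clean log feature x and noise log feature n."""
    return log_x + np.log1p(np.exp(log_n - log_x))

def mismatch_jacobian(log_x, log_n):
    """dy/dx, used to propagate the clean-speech distribution through y."""
    return 1.0 / (1.0 + np.exp(log_n - log_x))

log_x, log_n = 2.0, 0.5
print(mismatch(log_x, log_n), np.log(np.exp(log_x) + np.exp(log_n)))  # equal
print(mismatch_jacobian(log_x, log_n))   # near 1 when speech dominates noise
```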