Thomas Hain
The University of Sheffield | Sheffield · Department of Computer Science (Faculty of Engineering)

Dipl.-Ing., PhD

About

237
Publications
46,850
Reads
6,316
Citations
Introduction
Thomas leads the Speech and Hearing Research Group at the University of Sheffield. His interests are machine learning, speech, language and audio technology.
Additional affiliations
January 2004 - present
The University of Sheffield
Position
  • Professor
January 2001 - December 2003
University of Cambridge
Position
  • Lecturer
May 1997 - December 2001
University of Cambridge
Position
  • PhD Student

Publications

Publications (237)
Preprint
For speech emotion datasets, it has been difficult to acquire large quantities of reliable data, and acted emotions may be exaggerated compared to the less expressive emotions displayed in everyday life. Lately, larger datasets with natural emotions have been created. Instead of ignoring smaller, acted datasets, this study investigates whether informat...
Article
This paper investigates the feasibility of the task of automatic behaviour coding of spoken interactions in teamwork settings. We introduce the coding schema used to classify the behaviours of the group members and the corpus we collected to assess the coding schema reliability in real teamwork meetings. The behaviours embedded in spoken utterances...
Preprint
It was shown recently that a combination of ASR and TTS models yields highly competitive performance on standard voice conversion tasks such as the Voice Conversion Challenge 2020 (VCC2020). To obtain good performance, both models require pretraining on large amounts of data, thereby obtaining large models that are potentially inefficient in use. In...
Chapter
This paper proposes a one-shot voice conversion (VC) solution. In many one-shot voice conversion solutions (e.g., auto-encoder-based VC methods), performance has been dramatically improved thanks to instance normalization and adaptive instance normalization. However, one-shot voice conversion fluency is still lacking, and the similarity is not good...
Conference Paper
Anomalous audio in speech recordings is often caused by speaker voice distortion, external noise, or even electric interferences. These obstacles have become a serious problem in some fields, such as high-quality dubbing and speech processing. In this paper, a novel approach using a temporal convolutional attention network (TCAN) is proposed to tac...
Conference Paper
Many-to-many voice conversion with non-parallel training data has seen significant progress in recent years. It is challenging because of the lack of ground-truth parallel data. StarGAN-based models have gained attention because of their efficiency and effectiveness. However, most StarGAN-based works have focused only on a small number of speakers and...
Article
In this paper, a hierarchical attention network is proposed to generate robust utterance-level embeddings (H-vectors) for speaker identification and verification. Since different parts of an utterance may have different contributions to speaker identities, the use of hierarchical structure aims to learn speaker related information locally and globa...
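The two-level pooling behind H-vectors can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the segment length, feature dimension and the simple dot-product attention scorer are assumptions, and plain numpy stands in for a trained network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(frames, w):
    """Attention-weighted average of the rows of `frames`, scored by frames @ w."""
    a = softmax(frames @ w)            # one attention weight per row
    return a @ frames                  # pooled vector, shape (dim,)

def h_vector(utterance, seg_len, w_frame, w_seg):
    """Two-level pooling: frames -> segment vectors -> utterance embedding."""
    segs = [utterance[i:i + seg_len] for i in range(0, len(utterance), seg_len)]
    seg_vecs = np.stack([attentive_pool(s, w_frame) for s in segs])
    return attentive_pool(seg_vecs, w_seg)

rng = np.random.default_rng(0)
utt = rng.normal(size=(100, 32))       # 100 frames of 32-dim features
emb = h_vector(utt, seg_len=10,
               w_frame=rng.normal(size=32), w_seg=rng.normal(size=32))
print(emb.shape)                       # (32,)
```

The point of the hierarchy is that the frame-level attention can emphasise speaker-discriminative frames inside each segment before the segment-level attention weighs whole segments against each other.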
Preprint
Full-text available
This paper proposes an adaptation method for end-to-end speech recognition. In this method, multiple automatic speech recognition (ASR) 1-best hypotheses are integrated in the computation of the connectionist temporal classification (CTC) loss function. The integration of multiple ASR hypotheses helps to alleviate the impact of errors in the ASR hyp...
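The loss combination described above can be illustrated with a toy CTC forward pass. This is a readability-first sketch, not the paper's implementation: the interpolation weights are an assumption, and the recursion runs in the probability domain rather than the log domain used in practice.

```python
import numpy as np

BLANK = 0

def ctc_prob(probs, label):
    """Total CTC probability of `label` given per-frame distributions
    `probs` (T x V), via the standard blank-extended forward recursion."""
    ext = [BLANK]
    for u in label:
        ext += [u, BLANK]
    T, S = len(probs), len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]   # skip over blank between distinct labels
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)

def multi_hypothesis_ctc_loss(probs, hypotheses, weights):
    """Weighted sum of CTC negative log-likelihoods over several ASR hypotheses."""
    return sum(w * -np.log(ctc_prob(probs, h)) for h, w in zip(hypotheses, weights))

# Toy check: 2 frames, vocab {blank, 'a', 'b'}, uniform distributions.
probs = np.full((2, 3), 1 / 3)
print(ctc_prob(probs, [1]))       # 1/3: the paths a.a, a.blank, blank.a
loss = multi_hypothesis_ctc_loss(probs, [[1], [2]], [0.7, 0.3])
```

Summing losses over several hypotheses means a likely-erroneous single 1-best transcript no longer dominates the adaptation gradient.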
Conference Paper
Separating different speaker properties from a multi-speaker environment is challenging. Instead of separating a two-speaker signal in signal space like speech source separation, a speaker embedding de-mixing approach is proposed. The proposed approach separates different speaker properties from a two-speaker signal in embedding space. The proposed...
Conference Paper
Embedding acoustic information into fixed length representations is of interest for a whole range of applications in speech and audio technology. Two novel unsupervised approaches to generate acoustic embeddings by modelling of acoustic context are proposed. The first approach is a contextual joint factor synthesis encoder, where the encoder in an...
Article
Full-text available
Speech based human-machine interaction and natural language understanding applications have seen a rapid development and wide adoption over the last few decades. This has led to a proliferation of studies that investigate error detection and classification in Automatic Speech Recognition (ASR) systems. However, different data sets and evaluation pr...
Conference Paper
In this paper, a novel architecture for speaker recognition is proposed by cascading speech enhancement and speaker processing. Its aim is to improve speaker recognition performance when speech signals are corrupted by noise. Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one frame...
Conference Paper
Full-text available
The evaluation of scientific submissions through peer review is both the most fundamental component of the publication process, as well as the most frequently criticised and questioned. Academic journals and conferences request reviews from multiple reviewers per submission, which an editor or area chair aggregates into the final acceptance decis...
Preprint
Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and a memory mechanism to address this problem. The proposed model contains a frame-level encoder and a segment-level encoder, both of which make use of the transformer encode...
Conference Paper
Full-text available
Speech emotion recognition is essential for obtaining emotional intelligence which affects the understanding of context and meaning of speech. Harmonically structured vowel and consonant sounds add indexical and linguistic cues to spoken information. Previous research has debated whether vowel sound cues were more important in carrying the emotional con...
Conference Paper
Full-text available
Speech emotion recognition is essential for obtaining emotional intelligence which affects the understanding of context and meaning of speech. The fundamental challenge of speech emotion recognition from a machine learning standpoint is to extract patterns which carry maximum correlation with the emotion information encoded in this signal, and to...
Conference Paper
Identifying multiple speakers without knowing where a speaker’s voice is in a recording is a challenging task. In this paper, a hierarchical attention network is proposed to solve a weakly labelled speaker identification problem. The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn spea...
Conference Paper
While the use of deep neural networks has significantly boosted speaker recognition performance, it is still challenging to separate speakers in poor acoustic environments. Here speech enhancement methods have traditionally allowed improved performance. Recent works have shown that adapting speech enhancement can lead to further gains. This pap...
Conference Paper
Unsupervised representation learning of speech has been of keen interest in recent years, which is for example evident in the wide interest of the ZeroSpeech challenges. This work presents a new method for learning frame level representations based on WaveNet auto-encoders. Of particular interest in the ZeroSpeech Challenge 2019 were models with di...
Conference Paper
Many applications of speech technology require more and more audio data. Automatic assessment of the quality of the collected recordings is important to ensure they meet the requirements of the related applications. However, effective and high performing assessment remains a challenging task without a clean reference. In this paper, a novel model f...
Preprint
Many-to-many voice conversion with non-parallel training data has seen significant progress in recent years. StarGAN-based models have attracted interest for voice conversion. However, most of the StarGAN-based methods focused only on voice conversion experiments in situations where the number of speakers was small, and the amount of training data...
Preprint
Anomalous audio in speech recordings is often caused by speaker voice distortion, external noise, or even electric interferences. These obstacles have become a serious problem in some fields, such as recording high-quality music and speech processing. In this paper, a novel approach using a temporal convolutional attention network (TCAN) is propose...
Conference Paper
Full-text available
This paper investigates supervised adaptation of end-to-end speech recognition, which uses hybrid connectionist temporal classification (CTC)/Attention architecture, for noise robustness. The components of the architecture, namely the shared encoder, the attention decoder's long short-term memory (LSTM) layers, and the soft-max layers of the CTC pa...
Preprint
Unsupervised representation learning of speech has been of keen interest in recent years, which is for example evident in the wide interest of the ZeroSpeech challenges. This work presents a new method for learning frame level representations based on WaveNet auto-encoders. Of particular interest in the ZeroSpeech Challenge 2019 were models with di...
Preprint
Many applications of speech technology require more and more audio data. Automatic assessment of the quality of the collected recordings is important to ensure they meet the requirements of the related applications. However, effective and high performing assessment remains a challenging task without a clean reference. In this paper, a novel model f...
Preprint
While the use of deep neural networks has significantly boosted speaker recognition performance, it is still challenging to separate speakers in poor acoustic environments. Here speech enhancement methods have traditionally allowed improved performance. Recent works have shown that adapting speech enhancement can lead to further gains. This pap...
Preprint
Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. In this paper, a hierarchical attention network is proposed to solve a weakly labelled speaker identification problem. The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn spea...
Conference Paper
In this paper, a hierarchical attention network is proposed to generate utterance-level embeddings (H-vectors) for speaker identification and verification. Since different parts of an utterance may have different contributions to speaker identities, the use of hierarchical structure aims to learn speaker related information locally and globally. In...
Preprint
In this paper, a novel architecture for speaker recognition is proposed by cascading speech enhancement and speaker processing. Its aim is to improve speaker recognition performance when speech signals are corrupted by noise. Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one frame...
Preprint
In this work, a speaker embedding de-mixing approach is proposed. Instead of separating a two-speaker signal in signal space like speech source separation, the proposed approach separates different speaker properties from a two-speaker signal in embedding space. The proposed approach contains two steps. In step one, the clean speaker embeddings are lea...
Conference Paper
Full-text available
For speech emotion datasets, it has been difficult to acquire large quantities of reliable data, and acted emotions may be exaggerated compared to the less expressive emotions displayed in everyday life. Lately, larger datasets with natural emotions have been created. Instead of ignoring smaller, acted datasets, this study investigates whether informat...
Preprint
In this paper, a hierarchical attention network to generate utterance-level embeddings (H-vectors) for speaker identification is proposed. Since different parts of an utterance may have different contributions to speaker identities, the use of hierarchical structure aims to learn speaker related information locally and globally. In the proposed app...
Preprint
Embedding acoustic information into fixed length representations is of interest for a whole range of applications in speech and audio technology. We propose two novel unsupervised approaches to generate acoustic embeddings by modelling of acoustic context. The first approach is a contextual joint factor synthesis encoder, where the encoder in an en...
Preprint
In this paper, a novel framework to tackle speaker recognition using a two-stage attention model is proposed. In recent years, the use of deep neural networks, such as the time delay neural network (TDNN), and attention models have boosted speaker recognition performance. However, it is still a challenging task to tackle speaker recognition in severe aco...
Conference Paper
Emotion recognition from speech plays a significant role in adding emotional intelligence to machines and making human-machine interaction more natural. One of the key challenges from machine learning standpoint is to extract patterns which bear maximum correlation with the emotion information encoded in this signal while being as insensitive as po...
Preprint
Selecting in-domain data from a large pool of diverse and out-of-domain data is a non-trivial problem. In most cases simply using all of the available data will lead to sub-optimal and in some cases even worse performance compared to carefully selecting a matching set. This is true even for data-inefficient neural models. Acoustic Latent Dirichlet...
Article
Full-text available
Recurrent neural network language models (RNNLMs) generally outperform n-gram language models when used in automatic speech recognition. Adapting RNNLMs to new domains is an open problem and current approaches can be categorised as either feature-based or model-based. In feature-based adaptation, the input to the RNNLM is augmented with auxiliary...
Article
Full-text available
This paper describes a system for performing alignment of subtitles to audio on multigenre broadcasts using a lightly supervised approach. Accurate alignment of subtitles plays a substantial role in the daily work of media companies and currently still requires large human effort. Here, a comprehensive approach to performing this task in an automat...
Article
Full-text available
This paper addresses errors in continuous Automatic Speech Recognition (ASR) in two stages: error detection and error type classification. Unlike the majority of research in this field, we propose to handle the recognition errors independently from the ASR decoder. We first establish an effective set of generic features derived exclusively from the...
Conference Paper
Automatic literacy assessment of children is a complex task that normally requires carefully annotated data. This paper focuses on a system for the assessment of reading skills, aiming at the detection of a range of fluency and pronunciation errors. Naturally, reading is a prompted task, and thereby the acquisition of training data for acoustic modelli...
Conference Paper
Full-text available
Most frequency domain techniques for pitch extraction such as cepstrum, harmonic product spectrum (HPS) and summation residual harmonics (SRH) operate on the magnitude spectrum and turn it into a function in which the fundamental frequency emerges as argmax. In this paper, we investigate the extension of these three techniques to the phase and grou...
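The magnitude-domain harmonic product spectrum (HPS) mentioned above, which the paper extends to the phase and group-delay domains, can be sketched in a few lines of numpy. The sampling rate, tone and number of harmonics below are illustrative choices; a 1 s tone at an exact bin frequency keeps the toy example clean.

```python
import numpy as np

def hps_pitch(signal, fs, n_harmonics=4, f_min=50.0):
    """Harmonic product spectrum: multiply the magnitude spectrum with its
    integer-downsampled copies; the fundamental then emerges as the argmax."""
    mag = np.abs(np.fft.rfft(signal))
    n_bins = len(mag) // n_harmonics        # keep only bins whose harmonics fit
    hps = mag[:n_bins].copy()
    for k in range(2, n_harmonics + 1):
        hps *= mag[::k][:n_bins]            # mag[::k][i] == mag[i * k]
    lo = int(f_min * len(signal) / fs)      # skip DC and very low frequencies
    peak = lo + int(np.argmax(hps[lo:]))
    return peak * fs / len(signal)

fs = 8000
t = np.arange(fs) / fs                      # exactly 1 s, so harmonics hit exact bins
tone = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 5))
print(hps_pitch(tone, fs))                  # -> 200.0
```

Because the product only reinforces frequencies whose integer multiples all carry energy, the fundamental wins the argmax even when individual harmonics are stronger than it.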
Conference Paper
Full-text available
In earlier work we studied the effect of statistical normalisation for phase-based features and observed it leads to a significant robustness improvement. This paper explores the extension of the generalised Vector Taylor Series (gVTS) noise compensation approach to the group delay (GD) domain. We discuss the problems it presents, propose some solu...
Conference Paper
Full-text available
Neural Machine Translation (NMT) has recently demonstrated improved performance over statistical machine translation and relies on an encoder-decoder framework for translating text from source to target. The structure of NMT makes it amenable to add auxiliary features, which can provide complementary information to that present in the source text....
Conference Paper
Full-text available
In earlier work we proposed a framework for speech source-filter separation that employs phase-based signal processing. This paper presents a further theoretical investigation of the model and optimisations that make the filter and source representations less sensitive to the effects of noise and better matched to downstream processing. To this end...
Conference Paper
Full-text available
Vector Taylor Series (VTS) is a powerful technique for robust ASR but, in its standard form, it can only be applied to log-filter bank and MFCC features. In earlier work, we presented a generalised VTS (gVTS) that extends the applicability of VTS to front-ends which employ a power transformation non-linearity. gVTS was shown to provide performance...
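For context, the mismatch function that standard VTS linearises in the log-filter-bank domain is the well-known relation below, with clean speech x, additive noise n and convolutive channel h (all log-spectra); the gVTS of this paper generalises this to front-ends with a power-transformation non-linearity.

```latex
% Log-filter-bank mismatch model:
y = x + h + \log\bigl(1 + e^{\,n - x - h}\bigr)
% First-order VTS expansion around (\mu_x, \mu_n, \mu_h):
y \approx y_0 + G\,(x - \mu_x) + (I - G)\,(n - \mu_n),
\qquad
G = \operatorname{diag}\!\left(\frac{1}{1 + e^{\,\mu_n - \mu_x - \mu_h}}\right)
```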
Article
Full-text available
Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary phonetic information. We present a study on the use of deep neural network tokenisers. Unsupervised crosslingual adaptation was performed to adapt the baseline tokeniser trained on English conversational telephone speech data to different languages. Two training and...
Conference Paper
In active learning for Automatic Speech Recognition (ASR), a portion of data is automatically selected for manual transcription. The objective is to improve ASR performance with retrained acoustic models. The standard approaches are based on confidence of individual sentences. In this study, we look into an alternative view on transcript label qual...
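The standard per-sentence confidence baseline that the abstract contrasts with can be sketched in a few lines; the dictionary field names and the toy pool are hypothetical.

```python
def select_for_transcription(utterances, budget):
    """Confidence-based active learning baseline: rank utterances by mean
    word confidence and pick the least confident ones for manual transcription."""
    ranked = sorted(utterances, key=lambda u: sum(u["conf"]) / len(u["conf"]))
    return [u["id"] for u in ranked[:budget]]

pool = [
    {"id": "utt1", "conf": [0.9, 0.95, 0.8]},
    {"id": "utt2", "conf": [0.4, 0.5]},        # low confidence -> likely ASR errors
    {"id": "utt3", "conf": [0.7, 0.6, 0.9]},
]
print(select_for_transcription(pool, budget=2))  # -> ['utt2', 'utt3']
```

The intuition is that low-confidence utterances are where the recogniser is most likely wrong, so transcribing them adds the most new information per annotation hour.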
Conference Paper
Full-text available
Speaker diarisation addresses the question of 'who speaks when' in audio recordings, and has been studied extensively in the context of tasks such as broadcast news, meetings, etc. Performing diarisation on individual headset microphone (IHM) channels is sometimes assumed to easily give the desired output of speaker labelled segments with timing in...
Conference Paper
Full-text available
In earlier work we have proposed a source-filter decomposition of speech through phase-based processing. The decomposition leads to novel speech features that are extracted from the filter component of the phase spectrum. This paper analyses this spectrum and the proposed representation by evaluating statistical properties at various points along...
Conference Paper
Over recent years, many advances have been made in the field of Automatic Speech Recognition (ASR). However, the persistent presence of ASR errors is limiting the widespread adoption of speech technology in real life applications. This motivates the attempts to find alternative techniques to automatically detect and correct ASR errors, which can...
Conference Paper
Full-text available
The generalisation of dialogue state tracking to unseen dialogue states can be very challenging. In a slot-based dialogue system, dialogue states lie in discrete space where distances between states cannot be computed. Therefore, the model parameters to track states unseen in the training data can only be estimated from more general statistics, und...
Conference Paper
Full-text available
Speaker diarisation, the task of answering "who spoke when?", is often considered to consist of three independent stages: speech activity detection, speaker segmentation and speaker clustering. These represent the separation of speech and nonspeech, the splitting into speaker homogeneous speech segments, followed by grouping together those which be...
Conference Paper
Full-text available
This paper presents the most recent developments of the webASR service (www.webasr.org), the world’s first web-based fully functioning automatic speech recognition platform for scientific use. Initially released in 2008, the functionalities of webASR have recently been expanded with 3 main goals in mind: Facilitate access through a RESTful archite...
Conference Paper
Full-text available
In the phonotactic approach for language recognition, a phone tokeniser is normally used to transform the audio signal into acoustic tokens. The language identity of the speech is modelled by the occurrence statistics of the decoded tokens. The performance of this approach depends heavily on the quality of the audio tokeniser. A high-quality tokeni...
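The occurrence-statistics idea behind the phonotactic approach can be sketched as a bigram-profile classifier. The toy token streams and cosine scoring below are illustrative stand-ins for the n-gram language-model or vector-space scoring used in real systems.

```python
from collections import Counter
import math

def bigram_profile(tokens):
    """Normalised phone-bigram counts for one decoded token stream."""
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def cosine(p, q):
    dot = sum(p[k] * q.get(k, 0.0) for k in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm

def identify(test_tokens, language_profiles):
    """Pick the language whose bigram profile best matches the test utterance."""
    probe = bigram_profile(test_tokens)
    return max(language_profiles, key=lambda lang: cosine(probe, language_profiles[lang]))

# Toy phone streams (hypothetical tokeniser output)
profiles = {
    "lang_A": bigram_profile("p a t a k a p a t a".split()),
    "lang_B": bigram_profile("s i s u s i s u s i".split()),
}
print(identify("p a t a p a".split(), profiles))  # -> lang_A
```

This also makes the abstract's point concrete: if the tokeniser decodes phones unreliably, the bigram statistics degrade and so does the language decision, which is why tokeniser quality matters so much.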
Conference Paper
Full-text available
Huge amounts of digital videos are being produced and broadcast every day, leading to giant media archives. Effective techniques are needed to make such data more accessible. Automatic meta-data labelling of broadcast media is an essential task for multimedia indexing, where it is standard to use multi-modal input for such purposes. This paper d...
Conference Paper
Full-text available
Recurrent neural network language models (RNNLMs) have consistently outperformed n-gram language models when used in automatic speech recognition (ASR). This is because RNNLMs provide robust parameter estimation through the use of a continuous-space representation of words, and can generally model longer context dependencies than n-grams. The adapt...
Conference Paper
Full-text available
Modern standard Arabic (MSA) is the official language of spoken and written Arabic media. Colloquial Arabic (CA) is the set of spoken variants of modern Arabic that exist in the form of regional dialects. CA is used in informal and everyday conversations, while MSA is used in formal communication. An Arabic speaker switches between the two variants accordin...
Conference Paper
Full-text available
Users with disabilities can greatly benefit from personalised voice-enabled environmental-control interfaces, but for users with speech impairments (e.g. dysarthria) poor ASR performance poses a challenge to successful dialogue. Statistical dialogue management has shown resilience against high ASR error rates, hence making it useful to improve the...