About
145 Publications · 31,110 Reads
2,220 Citations
Publications (145)
We inspect the long-term learning ability of Long Short-Term Memory language models (LSTM LMs) by evaluating a contextual extension based on the Continuous Bag-of-Words (CBOW) model for both sentence- and discourse-level LSTM LMs and by analyzing its performance. We evaluate on text and speech. Sentence-level models using the long-term contextual m...
Gradients can be used to train neural networks, but they can also be used to interpret them. We investigate how well the inputs of RNNs are remembered by their state by calculating ‘state gradients’, and applying SVD on the gradient matrix to reveal which directions in embedding space are remembered and to what extent. Our method can be applied to...
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technol...
Neural cache language models (LMs) extend the idea of regular cache language models by making the cache probability dependent on the similarity between the current context and the context of the words in the cache. We make an extensive comparison of 'regular' cache models with neural cache models, both in terms of perplexity and WER after rescoring...
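The contrast with a 'regular' cache model can be made concrete with a toy sketch. Here the cache probability is simply the word's relative frequency in the recent history, interpolated with a base LM probability; the function name, the interpolation weight, and the base probability are illustrative stand-ins, not the paper's actual setup.

```python
from collections import Counter

def cache_lm_prob(word, history, base_prob, cache_size=100, lam=0.2):
    """Interpolate a base LM probability with a unigram cache estimate.

    Toy sketch of a 'regular' cache LM: unlike a neural cache, the cache
    probability ignores contextual similarity and only counts recency.
    """
    cache = Counter(history[-cache_size:])
    total = sum(cache.values())
    p_cache = cache[word] / total if total else 0.0
    return (1 - lam) * base_prob + lam * p_cache

history = "the cat sat on the mat and the cat slept".split()
# Hypothetical base probability for "cat"; a real LM would supply this.
p = cache_lm_prob("cat", history, base_prob=0.01)
print(round(p, 4))  # → 0.048
```

A neural cache would replace the count-based `p_cache` with a softmax over similarities between the current hidden state and the cached states.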
We present a framework for analyzing what the state in RNNs remembers from its input embeddings. Our approach is inspired by backpropagation, in the sense that we compute the gradients of the states with respect to the input embeddings. The gradient matrix is decomposed with Singular Value Decomposition to analyze which directions in the embedding...
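The gradient-plus-SVD idea above can be sketched in a few lines, assuming a one-layer Elman RNN; the weights below are random stand-ins, not a trained model.

```python
import numpy as np

# Minimal sketch of state-gradient analysis for an Elman RNN with
# h = tanh(W x + U h_prev); weights here are random placeholders.
rng = np.random.default_rng(0)
emb_dim, state_dim = 8, 16
W = rng.normal(0, 0.3, (state_dim, emb_dim))    # input-to-state weights
U = rng.normal(0, 0.3, (state_dim, state_dim))  # recurrent weights

x = rng.normal(size=emb_dim)      # one input embedding
h_prev = np.zeros(state_dim)
h = np.tanh(W @ x + U @ h_prev)   # new state

# Jacobian of the state w.r.t. the input embedding: dh/dx = diag(1 - h^2) @ W
G = (1.0 - h**2)[:, None] * W

# SVD of the gradient matrix: right singular vectors are directions in
# embedding space; singular values indicate how strongly each direction
# is carried into (i.e. "remembered" by) the state.
U_s, s, Vt = np.linalg.svd(G, full_matrices=False)
print("singular values:", np.round(s, 3))
print("best-remembered embedding direction:", np.round(Vt[0], 3))
```

For longer distances, the same analysis applies to the Jacobian of a later state with respect to an earlier input, accumulated through the chain rule.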
We test a series of techniques to predict punctuation and its effect on machine translation (MT) quality. Several techniques for punctuation prediction are compared: language modeling techniques, such as n-grams and long short-term memories (LSTM), sequence labeling LSTMs (unidirectional and bidirectional), and monolingual phrase-based, hierarchica...
We present the highlights of the now finished 4-year SCATE (Smart Computer-Aided Translation Environment) project, completed in February 2018. The project investigated algorithms, user interfaces and methods that can contribute to the development of more efficient tools for translation work.
Recently, an abundance of deep learning toolkits has been made freely available. These toolkits typically offer the building blocks and sometimes simple example scripts, but designing and training a model still takes a considerable amount of time and knowledge. We present language modeling scripts based on TensorFlow that allow one to train and tes...
"But I don’t know how to work with [name of tool or resource]" is something one often hears when researchers in Human and Social Sciences (HSS) are confronted with language technology, be it written or spoken, tools or resources. The TTNWW project shows that these researchers do not need to be experts in language or speech technology, or to know al...
In Flanders, all TV shows are subtitled. However, the process of subtitling is a very time-consuming one and can be sped up by providing the output of a speech recognizer run on the audio of the TV show, prior to the subtitling. Naturally, this speech recognition will perform much better if the employed language model is adapted to the register and...
Telecommunication standards are documents that contain consolidated knowledge about communication systems and implementation best practices. They are created based on long consensus processes in order to meet practical constraints. This article describes how the DVB-S2 standard is used in the electrical engineering curricula at KU Leuven within a d...
We present a Character-Word Long Short-Term Memory Language Model which both reduces the perplexity with respect to a baseline word-level language model and reduces the number of parameters of the model. Character information can reveal structural (dis)similarities between words and can even be used when a word is out-of-vocabulary, thus improving...
We present an overview of our work for the project STON – Spraak-en Taaltechnologisch Ondertitelen in het Nederlands (subtitling in Dutch with the help of speech and language technology), in collaboration with the University of Ghent, the Flemish national broadcaster VRT and three companies (Devoteam, Limecraft and PerVoice). The purpose of the pro...
Recent work has suggested that prominence perception could be driven by the predictability of the acoustic prosodic features of speech. On the other hand, lexical predictability and part of speech information are also known to correlate with prominence. In this paper, we investigate how the bottom-up acoustic and top-down lexical cues contribute to...
We present a modular video subtitling platform that integrates speech/non-speech segmentation, speaker diarisation, language identification, Dutch speech recognition with state-of-the-art acoustic models and language models optimised for efficient subtitling, appropriate pre-and postprocessing of the data and alignment of the final result with the...
Many tasks in natural language processing (e.g. speech recognition, machine translation, ...) require a language model: a model that predicts the next word given the history. Traditionally, these models are count-based and assign a probability to a sequence of n words based on the frequency of that sequence in the training text. However, these so-c...
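The count-based scheme described above can be illustrated with a minimal bigram model: the probability of a word given the previous one is just the relative frequency of that bigram in the training text. The corpus and function name below are toy examples.

```python
from collections import Counter

# Minimal count-based bigram LM: P(w | prev) is the relative frequency
# of the bigram (prev, w) in the training text.
text = "the cat sat on the mat the cat slept".split()
bigrams = Counter(zip(text, text[1:]))
unigrams = Counter(text[:-1])  # count each word in a "prev" position

def p_bigram(prev, word):
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("the", "cat"))  # "the" is followed by "cat" 2 times out of 3
```

The data-sparsity problem the abstract mentions is visible immediately: any bigram unseen in training gets probability zero, which is why smoothing, or the neural models discussed, is needed in practice.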
The subject of this paper is the expansion of n-gram training data with the aid of morpho-syntactic transformations, in order to create a larger amount of reliable n-grams for Dutch language models. The main aim of this technique is to alleviate a classical problem for language models: data sparsity. Moreover, since language models for automatic sp...
Due to their advantages over conventional n-gram language models, recurrent neural network language models (rnnlms) recently have attracted a fair amount of research attention in the speech recognition community. In this paper, we explore one advantage of rnnlms, namely, the ease with which they allow the integration of additional knowledge sources...
In automatic speech recognition, the language model helps to disambiguate between words with a similar pronunciation. A standard language model is typically based on n-grams (sequences of n consecutive words) and their probabilities of occurrence. These n-gram models however suffer from data sparsity and cannot model long-span dependencies. The pur...
In this paper we examine several combinations of classical N-gram language models with more advanced and well known techniques based on word similarity such as cache models and Latent Semantic Analysis. We compare the efficiency of these combined models to a model that combines N-grams with the recently proposed, state-of-the-art neural network-bas...
In this paper we present a novel clustering technique for compound words. By mapping compounds onto their semantic heads, the technique is able to estimate n-gram probabilities for unseen compounds. We argue that compounds are well represented by their heads which allows the clustering of rare words and reduces the risk of over-generalization. The...
The availability of a speech recognition system for Dutch is mentioned as one of the essential requirements for the language and speech technology community. Indeed, researchers now are faced with the problem that no good speech recognition tool is available for their purposes or existing tools lack functionality or flexibility. This project has tw...
In this paper we present our state-of-the-art automatic speech recognition system for Dutch that we made available on the web. The free, online disclosure of our software aims at allowing non-specialists to adopt ASR technology effortlessly. Access is possible via a standard web browser or as a web service in automated tools. We discuss the way t...
In this paper we investigate whether a layered architecture that has already proven its value for small tasks, works for a system with large lexica (400k words) and language models (5-grams) as well. The architecture was designed to decouple phone and word recognition which allows for the integration of more complex linguistic components, especiall...
This paper describes the use of speech alignment to aid in the process of subtitling Dutch TV programs. The recognizer aligns the audio stream with an existing transcript. The goal is therefore not to transcribe but to generate the correct timing of every word. The system performs subtasks such as audio segmentation, transcript preprocessing, align...
This paper describes the ESAT 2008 Broadcast News transcription system for the N-Best 2008 benchmark, developed in part for testing the recent SPRAAK Speech Recognition Toolkit. The ESAT system was developed for the Southern Dutch Broadcast News subtask of N-Best using standard methods of modern speech recognition. A combination of improvements was ma...
In this paper we investigate the behaviour of different acoustic distance measures for template based speech recognition in light of the combination of acoustic distances, linguistic knowledge and template concatenation fluency costs. To that end, different acoustic distance measures are compared on tasks with varying levels of fluency/lingui...
Despite their known weaknesses, hidden Markov models (HMMs) have been the dominant technique for acoustic modeling in speech recognition for over two decades. Still, the advances in the HMM framework have not solved its key problems: it discards information about time dependencies and is prone to overgeneralization. In this paper, we attempt to ove...
The objective of this paper is threefold: (1) to provide an extensive review of signal subspace speech enhancement, (2) to derive an upper bound for the performance of these techniques, and (3) to present a comprehensive study of the potential of subspace filtering to increase the robustness of automatic speech recognisers against stationary additi...
Nowadays read speech recognition already works pretty well, but the recognition of spontaneous speech is much more problematic. There are plenty of reasons for this, and we hypothesize that one of them is the regular occurrence of disfluencies in spontaneous speech. Disfluencies disrupt the normal course of the sentence and when for instance word i...
In this paper, several techniques are proposed to incorporate the uncertainty of the clean speech estimate in the decoding process of the backend recogniser in the context of model-based feature enhancement (MBFE) for noise robust speech recognition. Usually, the Gaussians in the acoustic space are sampled in a single point estimate, which means th...
Many compensation techniques, both in the model and feature domain, require an estimate of the noise statistics to compensate for the clean speech degradation in adverse environments. We explore how two spectral noise estimation approaches can be applied in the context of model-based feature enhancement. The minimum statistics method and the improv...
Model-based feature enhancement is an ASR front-end technique to increase the robustness of the recogniser in noisy environments. However, its MMSE-estimates of the clean speech feature vectors are based only on the static components at the current frame. In this paper, we show how the Kalman filter framework can be seen as a natural extension...
In state-of-the-art large vocabulary automatic recognition systems, a large statistical language model is used, typically an N-gram. However, in order to estimate this model, a large database of sentences or texts in the same style as the recognition task is needed. For spontaneous speech no such database is available, since it should consist...
In 1850, 'Assyriology', or the science of reading and interpreting cuneiform, was created. During this period, historians travelled to the Middle East and spent years copying cuneiform tablets. Now, at the beginning of the third millennium AD, not much has changed. Historians still rely on epigraphy which employs the copying of inscriptions and tex...
Model-based techniques for robust speech recognition often require the statistics of noisy speech. In this paper, we propose two modifications to obtain more accurate versions of the statistics of the combined HMM (starting from a clean speech and a noise model). Usually, the phase difference between speech and noise is neglected in the acoustic en...
This paper presents the derivation of a new perceptual model that represents speech and audio signals by a sum of exponentially damped sinusoids. Compared to a traditional sinusoidal model, the exponential sinusoidal model (ESM) is better suited to model transient segments that are readily found in audio signals. Total least squares (TLS) algorithms...
In this paper we present two techniques to cover the gap between the true and the estimated clean speech features in the context of Model-Based Feature Enhancement (MBFE) for noise robust speech recognition. While in the output of every feature enhancement algorithm some residual uncertainty remains, currently this information is mostly discarded...
In order to fully transform the perceived speaker identity, a voice conversion system should also convert the speaker's prosodic characteristics. When considering pitch contours, most systems only transform the pitch by simple scaling. A stochastic system that transforms pitch contours taking into account multiple pitch parameters, instead of onl...
This paper presents two different directions to build HMM models which give enough acoustic resolution and fit in limited user resources. They both refer to scaling down the acoustic models which are built with tied gaussian HMMs. The total number of gaussians is reduced by a pairwise merging, and the number of gaussians per state is reduced by sele...
The CGN corpus (Oostdijk, 2000) (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This...
In this paper we describe how we successfully extended the Model-Based Feature Enhancement (MBFE)-algorithm to jointly remove additive and convolutional noise from corrupted speech. Although a model of the clean speech can incorporate prior knowledge into the feature enhancement process, this model no longer yields an accurate fit if a different mi...
Subspace filtering is an extensively studied technique that has been proven very effective in the area of speech enhancement to improve the speech intelligibility. In this paper, we review different subspace estimation techniques (minimum variance, least squares, singular value adaptation, time domain constrained and spectral domain constrained) in...
State-of-the-art speech recognition relies on a state-dependent distance measure. In HMM systems, the distance measure is trained into state-dependent covariance matrices using a maximum likelihood or discriminative criterion. This "automatic" adjustment of the distance measure is traditionally considered an inherent advantage of HMMs over DTW (dyn...
The dominant acoustic modeling methodology based on Hidden Markov Models is known to have certain weaknesses. Partial solutions to these flaws have been presented, but the fundamental problem remains: compression of the data to a compact HMM discards useful information such as time dependencies and speaker information. In this paper, we look...
Maintaining a high level of robustness for Automatic Speech Recognition (ASR) systems is especially challenging when the background noise has a time-varying nature. We have implemented a Model-Based Feature Enhancement (MBFE) technique that not only can easily be embedded in the feature extraction module of a recogniser, but also is intrinsic...
In this paper, we describe a method to enhance the readability of out-of-vocabulary items (OOVs) in the textual output in a large vocabulary continuous speech recognition system. The basic idea is to indicate uncertain words in the transcriptions and replace them with phoneme recognition results that are post-processed using a phoneme-to-grapheme (...
In this paper we present a new method to animate the face of a speaking avatar (i.e., a synthetic 3D human face) such that it realistically pronounces any given text, based on the audio only. Especially the lip movements must be rendered carefully, and perfectly synchronised with the audio, in order to have a realistic looking result, from whic...
Total Least Squares (TLS) algorithms automatically decompose (audio) frames into a number of exponentially damped sinusoids. This can provide for more efficient modeling than plain sinusoidal modeling, especially in the case of transitional frames. Straightforward implementations of TLS optimize a SNR criterion. In our implementation we apply TLS in a...
We introduce the backward N-gram language model (LM) scores as a confidence measure in large vocabulary continuous speech recognition. Contrary to a forward N-gram LM, in which the probability of a word is dependent on the preceding words, a word in a backward N-gram LM is predicted based on the following words only. So the backward LM is a model f...
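The backward-LM idea above can be sketched with a toy backward bigram: each word is predicted from the word that follows it, so a low backward score flags a word whose right context does not support it. The corpus and names below are illustrative, not from the paper.

```python
from collections import Counter

# Backward bigram LM: P_bwd(w_i | w_{i+1}) reads the sentence right-to-left.
corpus = "good morning everyone good evening everyone".split()
bwd = Counter((corpus[i + 1], corpus[i]) for i in range(len(corpus) - 1))
nxt = Counter(corpus[1:])  # counts of words appearing as the "next" word

def p_bwd(word, next_word):
    return bwd[(next_word, word)] / nxt[next_word] if nxt[next_word] else 0.0

# "everyone" is preceded by "morning" and "evening" once each:
print(p_bwd("morning", "everyone"))  # → 0.5
```

Used as a confidence measure, such backward scores complement the forward LM: a recognized word that the following words fail to predict is a candidate error.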
In automatic speech recognition, a stochastic language model (LM) predicts the probability of the next word on the basis of previously recognized words. For the recognition of dictated speech this method works reasonably well since sentences are typically well-formed and reliable estimation of the probabilities is possible on the basis of large amo...
In this paper we describe an improved algorithm for the automatic segmentation of speech corpora. Apart from their usefulness in several speech technology domains, segmentations provide easy access to speech corpora by using time stamps to couple the orthographic transcription to the speech signal. The segmentation tool we propose is based on the F...
In this paper we study the concept of inter-signal transplantation of voice characteristics. We describe several example transplantation systems that were used to interchange selected features of the human voice amongst different utterances of a same text by a same or by different speakers. All systems are based on time-domain overlap-add (OLA) a...
While a traditional sinusoidal model is capable of representing audio segments, a sum of exponentially damped sinusoids is more efficient to model the transient segments that are readily found in audio signals. In this paper, Total Least Squares (TLS) algorithms are applied to automatically extract the modeling parameters in the Exponential Sinusoi...
In this paper, we investigate the use of the total likelihood (the weighted sum of the likelihoods of all possible state sequences) instead of the approximation with the Viterbi likelihood (the likelihood of the best state sequence) normally used in speech recognition. Next to its use in a recognizer, the use of total likelihoods in the context of...
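The total-versus-Viterbi distinction can be shown on a toy 2-state HMM: the forward recursion sums over all predecessor states, while Viterbi keeps only the best one. All probabilities below are made-up toy values, not from the paper.

```python
import numpy as np

# Toy 2-state HMM: total likelihood (forward algorithm) vs. the Viterbi
# approximation (probability of the single best state sequence).
pi = np.array([0.6, 0.4])               # initial state probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])  # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])  # emission probs for symbols 0/1
obs = [0, 1, 0]

fwd = pi * B[:, obs[0]]
vit = pi * B[:, obs[0]]
for o in obs[1:]:
    fwd = (fwd @ A) * B[:, o]                          # sum over predecessors
    vit = np.max(vit[:, None] * A, axis=0) * B[:, o]   # best predecessor only

total_lik, viterbi_lik = fwd.sum(), vit.max()
print(total_lik, viterbi_lik)  # total >= Viterbi: the sum includes the best path
```

Because the total likelihood sums the Viterbi path together with every other state sequence, it is always at least as large, and the gap indicates how much probability mass the Viterbi approximation discards.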
Most parametric audio coders use a traditional sinusoidal model to represent the tonal parts of audio signals, together with dedicated models for the noise and transient-like parts of the audio signal. In this paper we apply Total Least Squares (TLS) algorithms to automatically extract the modeling parameters of an Exponential Sinusoidal Model (ESM)...
We demonstrate that damped sinusoidal modeling can be used to improve the modeling accuracy of current perceptual audio coders. We show that the model parameter estimation can be performed with TLS algorithms, and that a subband modeling approach results in TLS problems that are computationally much more tractable than the fullband approach. Experi...
The aim of discriminant feature analysis techniques in the signal processing of speech recognition systems is to find a feature vector transformation which maps a high dimensional input vector onto a low dimensional vector while retaining a maximum amount of information in the feature vector to discriminate between predefined classes. This paper po...
We describe a method to enhance the readability of the textual output in a large vocabulary continuous speech recognition system when out-of-vocabulary words occur. The basic idea is to replace uncertain words in the transcriptions with a phoneme recognition result that is post-processed using a phoneme-to-grapheme converter. This converter turns p...
In Van Uytsel et al. (2001) a parsing language model based on a probabilistic left-corner grammar (PLCG) was proposed and encouraging performance on a speech recognition task using the PLCG-based language model was reported. In this paper we show how the PLCG-based language model can be further optimized by iterative parameter reestimation on unanno...
The Dutch-Flemish project Spoken Dutch Corpus (1998-2003) aims at the development of an annotated corpus of 10 million spoken words. In order to make the speech data easily accessible, a word segmentation couples the orthographic transcription to the speech signal by means of time stamps. Generally, such segmentations are produced manually. Since t...
The accuracy of the acoustic models in large vocabulary recognition systems can be improved by increasing the resolution in the acoustic feature space. This can be obtained by increasing the number of Gaussian densities in the models by splitting of the Gaussians. This paper proposes a novel algorithm for this splitting operation. It is based on th...
In pursuance of better performance, current speech recognition systems tend to use more and more complicated models for both the acoustic and the language component. Cross-word context dependent (CD) phone models and long-span statistical language models (LMs) are now widely used. In this paper, we present a memory-efficient search topology that en...
Recognizing crops and weeds online makes it possible to reduce the use of chemicals in agriculture. First, a sensor and classifier are proposed to measure and classify the plant reflectance online. However, as plant reflectance varies with unknown field-dependent plant stress factors, the classifier must be trained on each field separately in order to reco...
This paper describes an alternative to the commonly used linear discriminant analysis (LDA) for finding linear transformations that map large feature vectors onto smaller ones while maintaining most of the discriminative power. The main problem with LDA is that it over-simplifies the situation by condensing all class information into only two sca...
This paper presents a system for the detection of weed amongst crop based on the extraction of structural field information. Data is gathered online with a sensor built upon an imaging spectrograph optimized for this purpose. The optical sensor splits the light from a line on the ground parallel with the spray boom in its spectral components that a...
In an HMM based large vocabulary continuous speech recognition system, the evaluation of context dependent acoustic models is very time consuming. In Semi-Continuous HMMs, a state is modelled as a mixture of elementary, generally gaussian, probability density functions. Observation probability calculations of these states can be made faster...
In X-ray projection radiography, especially if area detectors are used, scattered radiation can strongly degrade the image contrast. Although the amount of radiation scatter is partly reduced by the use of anti-scatter grids or air gaps, the contrast is mostly still substantially degraded in large areas of the images. Here, the radiation scatter in...
Because of beam hardening in the target, the energy distribution of the x-radiation emitted by a conventional x-ray tube differs between different directions of emission. This fact is generally neglected in dual-energy subtraction imaging. In this paper we study the influence of the non-uniformity of the energy spectrum of emitted radiation on dual-e...
In this paper we present a method for the compensation for scattered radiation in dual-energy imaging. The method uses shields (PTS), partially transparent for x-radiation, positioned between the x-ray source and the detector. By comparing the detected signals under and beside the shields, and using data from the dual-energy calibration procedure w...
Dual-energy subtraction imaging uses the spectral differences in x-ray radiation attenuation caused by different tissues to obtain material-selective images. When the subtraction is based on decomposition of the object images into basis material images, accurate calibration measurements must be made. This calibration procedure consists of measuring...