ABSTRACT: Nearly all Statistical Parametric Speech Synthesizers today use Mel Cepstral
coefficients as the vocal tract parameterization of the speech signal. Mel
Cepstral coefficients were never intended to work in a parametric speech
synthesis framework, but as yet, there has been little success in creating a
better parameterization that is more suited to synthesis. In this paper, we use
deep learning algorithms to investigate a data-driven parameterization
technique that is designed for the specific requirements of synthesis. We
create an invertible, low-dimensional, noise-robust encoding of the Mel Log
Spectrum by training a tapered Stacked Denoising Autoencoder (SDA). This SDA is
then unwrapped and used as the initialization for a Multi-Layer Perceptron
(MLP). The MLP is fine-tuned by training it to reconstruct the input at the
output layer. This MLP is then split down the middle to form encoding and
decoding networks. These networks produce a parameterization of the Mel Log
Spectrum that is intended to better fulfill the requirements of synthesis.
Results are reported for experiments conducted using this resulting
parameterization with the ClusterGen speech synthesizer.
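The encode/decode split described above can be illustrated with a minimal sketch: a single-bottleneck denoising autoencoder trained to reconstruct the clean input from a corrupted copy, then split into separate encoder and decoder functions. The dimensions, learning rate, and noise level below are illustrative placeholders, not the paper's configuration (which tapers through several stacked layers before unrolling into an MLP).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 25-dim Mel Log Spectrum frame -> 10-dim code.
D_IN, D_CODE = 25, 10
W_enc = rng.normal(0.0, 0.1, (D_IN, D_CODE))
b_enc = np.zeros(D_CODE)
W_dec = rng.normal(0.0, 0.1, (D_CODE, D_IN))
b_dec = np.zeros(D_IN)

def encode(x):
    """Low-dimensional parameterization of a batch of frames."""
    return sigmoid(x @ W_enc + b_enc)

def decode(h):
    """Map the code back to the Mel Log Spectrum domain."""
    return h @ W_dec + b_dec

def train_step(x_clean, lr=0.05, noise=0.1):
    """One denoising step: corrupt the input, reconstruct the clean target."""
    global W_enc, b_enc, W_dec, b_dec
    x_noisy = x_clean + rng.normal(0.0, noise, x_clean.shape)
    h = encode(x_noisy)
    x_hat = decode(h)
    n = x_clean.shape[0]
    d_out = (x_hat - x_clean) / n           # grad of 0.5 * mean sq. error
    d_h = (d_out @ W_dec.T) * h * (1 - h)   # backprop through the sigmoid
    W_dec -= lr * (h.T @ d_out)
    b_dec -= lr * d_out.sum(axis=0)
    W_enc -= lr * (x_noisy.T @ d_h)
    b_enc -= lr * d_h.sum(axis=0)
    return float(np.mean((x_hat - x_clean) ** 2))

X = rng.normal(0.0, 1.0, (200, D_IN))       # stand-in for spectral frames
losses = [train_step(X) for _ in range(300)]
```

After training, `encode` alone yields the low-dimensional parameterization and `decode` alone inverts it, mirroring the split-down-the-middle step in the abstract.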
ABSTRACT: Intonational Phonology deals with the systematic way in which speakers effectively use pitch to add appropriate emphasis to the underlying string of words in an utterance. Two widely discussed aspects of pitch are the pitch accents and boundary events. These provide an insight into the sentence type, speaker attitude, linguistic background, and other aspects of prosodic form. The main hurdle, however, is the difficulty in getting annotations of these attributes in "real" speech. Besides being language independent, these attributes are known to be subjective and prone to high inter-annotator disagreements. Our investigations aim to automatically derive phonological aspects of intonation from large speech databases. Recurring and salient patterns in the pitch contours, observed jointly with an underlying linguistic context, are automatically detected. Our computational framework unifies complementary paradigms such as the physiological Fujisaki model, Autosegmental Metrical phonology, and elegant pitch stylization, to automatically (i) discover phonologically atomic units to describe the pitch contours and (ii) build inventories of tones and long term trends appropriate for the given speech database, either large multi-speaker or single speaker databases, such as audiobooks. We successfully demonstrate the framework in expressive speech synthesis. There is also immense potential for the approach in speaker, style, and language characterization.
The Journal of the Acoustical Society of America 11/2013; 134(5):4237.
ABSTRACT: This paper presents an 'Accent Group' based intonation model for statistical parametric speech synthesis. We propose an approach to automatically model phonetic realizations of fundamental frequency (F0) contours as a sequence of intonational events anchored to a group of syllables (an Accent Group). We train an accent grouping model specific to the speaker, using a stochastic context-free grammar and contextual decision trees on the syllables. This model is used to 'parse' unseen text into its constituent accent groups, over each of which appropriate intonation is predicted. The performance of the model is shown objectively and subjectively on a variety of prosodically diverse tasks: read speech, news broadcast and audio books.
Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
ABSTRACT: In this paper, we present a new approach to F0 transformation that can capture aspects of speaking style. Instead of using the traditional 5 ms frames as units in transformation, we propose a method that looks at longer phonological regions such as metrical feet. We automatically detect metrical feet in the source speech, and for each of the source speaker's feet, we find its phonological correspondence in the target speech. We use a statistical phrase accent model to represent the F0 contour, in which a 4-dimensional TILT representation of the F0 is parameterized over each foot region for the source and target speakers. This forms the parallel data that serves as the training data for our transformation. We transform the phrase component using simple z-score mapping. We use a joint density Gaussian mixture model to transform the accent contours. Our transformation method generates F0 contours that are significantly more correlated with the target speech than a baseline, frame-based method.
Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
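The phrase-component mapping mentioned in the abstract above reduces to a standard z-score (mean/variance) normalization between speakers. A minimal sketch, assuming the per-speaker F0 statistics have already been estimated from the parallel feet data (the function name and example values are illustrative, not from the paper):

```python
import numpy as np

def zscore_map_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Map a source-speaker F0 contour into the target speaker's range
    by matching first- and second-order statistics."""
    f0_src = np.asarray(f0_src, dtype=float)
    return tgt_mean + (f0_src - src_mean) / src_std * tgt_std

# Example: a contour around 120 Hz mapped into a 200 Hz target range.
src = np.array([110.0, 120.0, 130.0])
out = zscore_map_f0(src, src_mean=120.0, src_std=10.0,
                    tgt_mean=200.0, tgt_std=20.0)
```

Deviations of one source standard deviation become deviations of one target standard deviation, so the relative shape of the contour is preserved while its register and range follow the target speaker.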
ABSTRACT: In this paper, we present a technique that uses the information in multiple parallel speech streams, which are approximate translations of each other, to improve performance on a punctuation recovery task. We first build a phrase-level alignment of these multiple streams, using phrase tables to link the phrase pairs together. The information so collected is then used to make it more likely that sentence units are equivalent across streams. We applied this technique to a number of simultaneously interpreted speeches of the European Parliament Committees, for the recovery of the full stop, in four different languages (English, Italian, Portuguese and Spanish). We observed an average improvement in SER of 37% over an existing baseline in Portuguese and English.
ABSTRACT: A spoken dialog system consists of a number of non-trivially interacting components. In order to allow new students, researchers and developers to meaningfully and relatively rapidly enter the field it is critical that, despite their complexity, the resources be accessible and easy to use. Everyone should be able to start building new technologies without spending a significant amount of time re-inventing the wheel. There are four levels of support that we believe new entrants should have. 1) A flexible open source system that runs on many different operating systems, is well documented and supports both simple and complex dialog systems. 2) Logs and speech files from a large number of dialogs that enable analysis and training of new systems and techniques. 3) An actual set of real users that speak to the system on a regular basis. 4) The ability to run studies on complete real user platforms.
NAACL-HLT Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data; 06/2012
ABSTRACT: This paper describes some of the results from the project entitled “New Parameterization for Emotional Speech Synthesis” held at the Summer 2011 JHU CLSP workshop. We describe experiments on how to use articulatory features as a meaningful intermediate representation for speech synthesis. This parameterization not only allows us to reproduce natural sounding speech but also allows us to generate stylistically varying speech.
Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012
ABSTRACT: This paper presents an approach for transfer of speaker intent in speech-to-speech machine translation (S2SMT). Specifically, we describe techniques to retain the prominence patterns of the source language utterance through the translation pipeline and impose this information during speech synthesis in the target language. We first present an analysis of word focus across languages to motivate the problem of transfer. We then propose an approach for training an appropriate transfer function for intonation on a parallel speech corpus in the two languages within which the translation is carried out. We present our analysis and experiments on English↔Portuguese and English↔German language pairs and evaluate the proposed transformation techniques through objective measures.
Spoken Language Technology Workshop (SLT), 2012 IEEE; 01/2012
ABSTRACT: One of the issues in using audio books for building a synthetic voice is the segmentation of large speech files. The use of the Viterbi algorithm to obtain phone boundaries on large audio files fails primarily because of huge memory requirements. Earlier works have attempted to resolve this problem by using a large vocabulary speech recognition system employing a restricted dictionary and language model. In this paper, we propose suitable modifications to the Viterbi algorithm and demonstrate their usefulness for segmentation of large speech files in audio books. The utterances obtained from large speech files in audio books are used to build synthetic voices. We show that synthetic voices built from audio books in the public domain have Mel-cepstral distortion scores in the range of 4-7, which is similar to voices built from studio quality recordings such as CMU ARCTIC.
IEEE Transactions on Audio Speech and Language Processing 08/2011;
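The abstract above does not spell out the exact modification, but a common way to keep Viterbi decoding tractable on long recordings is beam pruning, which drops hypotheses that fall too far below the frame-best score. The following is a generic sketch of that idea over log-probability score tables, not the paper's specific algorithm; the toy example at the end is invented.

```python
import math

def viterbi_beam(obs_logp, trans_logp, beam=10.0):
    """Best state path through log-emission scores obs_logp (T x S) and
    log-transition scores trans_logp (S x S), skipping predecessor states
    whose score falls more than `beam` below the frame-best score."""
    T, S = len(obs_logp), len(obs_logp[0])
    NEG = float("-inf")
    score = list(obs_logp[0])            # start: emission scores only
    back = []
    for t in range(1, T):
        best = max(score)
        new, bp = [NEG] * S, [0] * S
        for j in range(S):
            for i in range(S):
                if score[i] < best - beam:    # beam pruning
                    continue
                s = score[i] + trans_logp[i][j] + obs_logp[t][j]
                if s > new[j]:
                    new[j], bp[j] = s, i
        score = new
        back.append(bp)
    # Backtrace from the best final state.
    j = max(range(S), key=lambda k: score[k])
    path = [j]
    for bp in reversed(back):
        j = bp[j]
        path.append(j)
    return path[::-1]

# Toy 2-state example: emissions favor state 0, then state 1.
obs = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
trans = [[math.log(0.5)] * 2, [math.log(0.5)] * 2]
path = viterbi_beam(obs, trans)  # -> [0, 0, 1, 1]
```

In practice the backpointer table, not the score vector, dominates memory on hour-long audio, so segmentation pipelines also split decoding at confidently aligned anchors; the sketch only illustrates the pruning half of that trade-off.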
ABSTRACT: In this paper, we will discuss state-of-the-art techniques for personality-aware user interfaces, and summarize recent work in automatically recognizing and synthesizing speech with “personality”. We present an overview of personality “metrics”, and show how they can be applied to the perception of voices, not only to the description of personally known individuals. We present use cases for personality-aware speech input and/or output, and discuss approaches to defining “personality” in this context. We take a middle-of-the-road approach: we will not try to uncover all fundamental aspects of personality in speech, but neither will we aim for ad-hoc solutions that serve a single purpose (for example, creating a positive attitude in a user) but do not generate transferable knowledge for other interfaces.
Human-Computer Interaction. Interaction Techniques and Environments - 14th International Conference, HCI International 2011, Orlando, FL, USA, July 9-14, 2011, Proceedings, Part II; 01/2011
ABSTRACT: The Spoken Dialog Challenge 2010 was an exercise to investigate how different spoken dialog systems perform on the same task. The existing Let's Go Pittsburgh Bus Information System was used as a task and four teams provided systems that were first tested in controlled conditions with speech researchers as users. The three most stable systems were then deployed to real callers. This paper presents the results of the live tests, and compares them with the control test results. Results show considerable variation both between systems and between the control and live tests. Interestingly, relatively high task completion for controlled tests did not always predict relatively high task completion for live tests. Moreover, even though the systems were quite different in their designs, we saw very similar correlations between word error rate and task completion for all the systems. The dialog data collected is available to the research community.
Proceedings of the SIGDIAL 2011 Conference, The 12th Annual Meeting of the Special Interest Group on Discourse and Dialogue, June 17-18, 2011, Oregon Science & Health University, Portland, Oregon, USA; 01/2011
ABSTRACT: In this paper, we use artificial neural networks (ANNs) for voice conversion and exploit the mapping abilities of an ANN model to perform mapping of spectral features of a source speaker to those of a target speaker. A comparative study of voice conversion using an ANN model and the state-of-the-art Gaussian mixture model (GMM) is conducted. The results of voice conversion, evaluated using subjective and objective measures, confirm that an ANN-based VC system performs as well as a GMM-based VC system, and that the transformed speech is intelligible and possesses the characteristics of the target speaker. In this paper, we also address the issue of dependency of voice conversion techniques on parallel data between the source and the target speakers. While there have been efforts to use nonparallel data and speaker adaptation techniques, it is important to investigate techniques which capture speaker-specific characteristics of a target speaker, and avoid any need for the source speaker's data either for training or for adaptation. In this paper, we propose a voice conversion approach using an ANN model to capture speaker-specific characteristics of a target speaker and demonstrate that such a voice conversion approach can perform monolingual as well as cross-lingual voice conversion of an arbitrary source speaker.
IEEE Transactions on Audio Speech and Language Processing 08/2010;
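The spectral mapping at the core of such conversion systems is a learned frame-wise regression from source features to target features. As the simplest possible stand-in for the ANN described above (a deliberate simplification with illustrative dimensions, not the paper's model), a least-squares linear map trained on time-aligned parallel frames:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_linear_map(src_frames, tgt_frames):
    """Least-squares W, b such that src @ W + b ~= tgt,
    given time-aligned parallel spectral frames."""
    X = np.hstack([src_frames, np.ones((len(src_frames), 1))])  # bias column
    coef, *_ = np.linalg.lstsq(X, tgt_frames, rcond=None)
    return coef[:-1], coef[-1]                                  # W, b

def convert(src_frames, W, b):
    """Apply the learned mapping frame by frame."""
    return src_frames @ W + b

# Synthetic check: frames related by a known linear map are recovered.
W_true = rng.normal(size=(25, 25))
b_true = rng.normal(size=25)
src = rng.normal(size=(500, 25))
tgt = src @ W_true + b_true
W, b = fit_linear_map(src, tgt)
out = convert(src, W, b)
```

An ANN replaces the single linear map with stacked nonlinear layers, which is what lets it capture the non-linear source-to-target spectral relationships the abstract compares against the GMM.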
ABSTRACT: We present the PICTOR browser, a visualization designed to facilitate the analysis of quotations about user-specified topics in large collections of news text. PICTOR focuses on quotations because they are a major vehicle of communication in the news genre. It extracts quotes from articles that match a user's text query, and groups these quotes into "threads" that illustrate the development of subtopics over time. It allows users to rapidly explore the space of relevant quotes by viewing their content and speakers, to examine the contexts in which quotes appear, and to tune how threads are constructed. We offer two case studies demonstrating how PICTOR can support a richer understanding of news events.
ABSTRACT: Short vowels in Arabic are normally omitted in written text, which leads to ambiguity in pronunciation. This is even more pronounced for dialectal Arabic, where a single word can be pronounced quite differently based on the speaker's nationality, level of education, social class and religion. In this paper we focus on pronunciation modeling for Iraqi-Arabic speech. We introduce multiple pronunciations into the Iraqi speech recognition lexicon, and compare performance when weights computed via forced alignment are assigned to the different pronunciations of a word. Incorporating multiple pronunciations improved recognition accuracy compared to a single-pronunciation baseline, and introducing pronunciation weights further improved performance. Using these techniques, an absolute word error rate reduction of 2.4% was obtained compared to the baseline system.
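Pronunciation weights of the kind described above can be estimated as relative frequencies of each variant in the forced alignments. A minimal sketch, assuming the alignment pass has already produced per-word variant counts; the word and pronunciation strings in the example are invented for illustration.

```python
def pronunciation_weights(variant_counts):
    """variant_counts: {word: {pronunciation: count from forced alignment}}.
    Returns per-word weights that sum to 1, usable as lexicon priors."""
    weights = {}
    for word, counts in variant_counts.items():
        total = sum(counts.values())
        weights[word] = {pron: c / total for pron, c in counts.items()}
    return weights

# Invented example: one variant aligned 30 times, the other 10 times.
counts = {"ktab": {"k i t a b": 30, "k t a b": 10}}
weights = pronunciation_weights(counts)  # -> {"ktab": {...0.75, ...0.25}}
```

In a recognizer these weights are typically applied as log-prior additions to the acoustic score of each variant, so frequent pronunciations are preferred without ruling out rare ones.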