Alan W Black

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

Publications (184) · 109.42 Total Impact

  • ABSTRACT: In spoken dialog systems, dialog state tracking refers to the task of correctly inferring the user's goal at a given turn, given all of the dialog history up to that turn. The Dialog State Tracking Challenge is a research community challenge task that has run for three rounds. The challenge has given rise to a host of new methods for dialog state tracking and also to deeper understanding about the problem itself, including methods for evaluation.
    AI Magazine 12/2014; 35(4):121-124. · 0.50 Impact Factor
  • Source
    Prasanna Kumar Muthukumar, Alan W. Black
    ABSTRACT: Nearly all Statistical Parametric Speech Synthesizers today use Mel Cepstral coefficients as the vocal tract parameterization of the speech signal. Mel Cepstral coefficients were never intended to work in a parametric speech synthesis framework, but as yet, there has been little success in creating a better parameterization that is more suited to synthesis. In this paper, we use deep learning algorithms to investigate a data-driven parameterization technique that is designed for the specific requirements of synthesis. We create an invertible, low-dimensional, noise-robust encoding of the Mel Log Spectrum by training a tapered Stacked Denoising Autoencoder (SDA). This SDA is then unwrapped and used as the initialization for a Multi-Layer Perceptron (MLP). The MLP is fine-tuned by training it to reconstruct the input at the output layer. This MLP is then split down the middle to form encoding and decoding networks. These networks produce a parameterization of the Mel Log Spectrum that is intended to better fulfill the requirements of synthesis. Results are reported for experiments conducted using this resulting parameterization with the ClusterGen speech synthesizer.
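    The autoencoder pipeline described above (taper down to a low-dimensional code, unwrap, fine-tune, then split into encoder and decoder halves) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, learning rate, and synthetic "spectral" data are assumptions, and the denoising pre-training step is omitted.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class TaperedAutoencoder:
        def __init__(self, dims):
            # dims, e.g. [64, 32, 8]: encoder tapers down; decoder mirrors it,
            # so the full network is [64, 32, 8, 32, 64].
            full = dims + dims[-2::-1]
            self.W = [rng.normal(0, 0.1, (a, b)) for a, b in zip(full[:-1], full[1:])]
            self.b = [np.zeros(b) for b in full[1:]]
            self.split = len(dims) - 1  # number of encoder layers

        def forward(self, x):
            acts = [x]
            for W, b in zip(self.W, self.b):
                acts.append(sigmoid(acts[-1] @ W + b))
            return acts

        def encode(self, x):  # the low-dimensional parameterization
            for W, b in zip(self.W[:self.split], self.b[:self.split]):
                x = sigmoid(x @ W + b)
            return x

        def decode(self, code):  # invert the code back to a frame
            for W, b in zip(self.W[self.split:], self.b[self.split:]):
                code = sigmoid(code @ W + b)
            return code

        def train_step(self, x, lr=0.5):
            # Fine-tuning: reconstruct the input at the output layer.
            acts = self.forward(x)
            delta = (acts[-1] - x) * acts[-1] * (1 - acts[-1])
            for i in reversed(range(len(self.W))):
                grad_W = acts[i].T @ delta / len(x)
                grad_b = delta.mean(axis=0)
                if i > 0:  # propagate before the weights are updated
                    delta = (delta @ self.W[i].T) * acts[i] * (1 - acts[i])
                self.W[i] -= lr * grad_W
                self.b[i] -= lr * grad_b

    # Toy stand-in for mel log-spectrum frames, scaled to [0, 1].
    X = rng.random((256, 64))
    ae = TaperedAutoencoder([64, 32, 8])
    for _ in range(200):
        ae.train_step(X)
    codes = ae.encode(X)    # 8-dimensional encoding of each frame
    recon = ae.decode(codes)  # back to 64-dimensional frames
    ```

    Splitting one reconstruction network into encoder/decoder halves is what makes the parameterization invertible: synthesis only needs the decoder half at run time.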
  • ABSTRACT: Speech synthesis systems are typically built with speech data and transcriptions. In this paper, we try to build synthesis systems when no transcriptions or knowledge about the language are available. It is usually necessary to at least possess phonetic knowledge about the language. We propose an automated way of obtaining phones and phonetic knowledge about the corpus at hand by making use of Articulatory Features (AFs). An Articulatory Feature predictor is trained on a bootstrap corpus in an arbitrary other language using a three-hidden-layer neural network. This neural network is run on the speech corpus to extract AFs. Hierarchical clustering is used to cluster the AFs into categories, i.e., phones. Phonetic information about each of these inferred phones is obtained by computing the mean of the AFs in each cluster. Results of systems built with this framework in multiple languages are reported.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
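    The clustering step above (group AF vectors hierarchically, treat each cluster as an inferred phone, and describe it by its mean AF vector) can be sketched as below. The three AF dimensions, cluster count, and synthetic data are illustrative assumptions, not the paper's setup.

    ```python
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(1)

    # Toy articulatory features: three underlying "phones" (rows are
    # hypothetical dimensions such as voicing, nasality, frontness),
    # each realized 40 times with noise.
    centers = np.array([[0.9, 0.1, 0.2],
                        [0.1, 0.8, 0.3],
                        [0.5, 0.2, 0.9]])
    afs = np.vstack([c + rng.normal(0, 0.05, (40, 3)) for c in centers])

    # Agglomerative (hierarchical) clustering, cut into 3 clusters.
    Z = linkage(afs, method="average")
    labels = fcluster(Z, t=3, criterion="maxclust")

    # "Phonetic description" of each inferred phone = mean AF per cluster.
    phone_means = {k: afs[labels == k].mean(axis=0) for k in np.unique(labels)}
    ```

    With well-separated clusters the mean AF vectors land close to the underlying centers, which is what lets each inferred phone be read off as a bundle of articulatory properties.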
  • ABSTRACT: Intonational Phonology deals with the systematic way in which speakers effectively use pitch to add appropriate emphasis to the underlying string of words in an utterance. Two widely discussed aspects of pitch are the pitch accents and boundary events. These provide an insight into the sentence type, speaker attitude, linguistic background, and other aspects of prosodic form. The main hurdle, however, is the difficulty in getting annotations of these attributes in "real" speech. Besides being language independent, these attributes are known to be subjective and prone to high inter-annotator disagreements. Our investigations aim to automatically derive phonological aspects of intonation from large speech databases. Recurring and salient patterns in the pitch contours, observed jointly with an underlying linguistic context, are automatically detected. Our computational framework unifies complementary paradigms such as the physiological Fujisaki model, Autosegmental Metrical phonology, and elegant pitch stylization, to automatically (i) discover phonologically atomic units to describe the pitch contours and (ii) build inventories of tones and long term trends appropriate for the given speech database, either large multi-speaker or single speaker databases, such as audiobooks. We successfully demonstrate the framework in expressive speech synthesis. There is also immense potential for the approach in speaker, style, and language characterization.
    The Journal of the Acoustical Society of America 11/2013; 134(5):4237. DOI:10.1121/1.4831574 · 1.56 Impact Factor
  • ABSTRACT: This paper presents an 'Accent Group' based intonation model for statistical parametric speech synthesis. We propose an approach to automatically model phonetic realizations of fundamental frequency (F0) contours as a sequence of intonational events anchored to a group of syllables (an Accent Group). We train a speaker-specific accent grouping model, using a stochastic context-free grammar and contextual decision trees on the syllables. This model is used to 'parse' unseen text into its constituent accent groups, over each of which appropriate intonation is predicted. The performance of the model is shown objectively and subjectively on a variety of prosodically diverse tasks: read speech, news broadcasts, and audiobooks.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • ABSTRACT: In this paper, we present a new approach to F0 transformation that can capture aspects of speaking style. Instead of using the traditional 5 ms frames as units in transformation, we propose a method that looks at longer phonological regions such as metrical feet. We automatically detect metrical feet in the source speech, and for each of the source speaker's feet, we find its phonological correspondence in the target speech. We use a statistical phrase accent model to represent the F0 contour, where a 4-dimensional TILT representation of the F0 is parameterized over each foot region for the source and target speakers. This forms the parallel data that is the training data for our transformation. We transform the phrase component using simple z-score mapping, and use a joint density Gaussian mixture model to transform the accent contours. Our transformation method generates F0 contours that are significantly more correlated with the target speech than a baseline, frame-based method.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
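    The z-score mapping used for the phrase component amounts to normalizing a source F0 value by the source speaker's statistics and rescaling with the target speaker's. A minimal sketch (the speaker statistics below are illustrative, not from the paper):

    ```python
    import numpy as np

    def zscore_map(f0_src, src_mean, src_std, tgt_mean, tgt_std):
        """Map source-speaker F0 values into the target speaker's range."""
        return (f0_src - src_mean) / src_std * tgt_std + tgt_mean

    # Example: lower-pitched source (~120 Hz) to higher-pitched target (~210 Hz).
    src = np.array([100.0, 120.0, 140.0])
    out = zscore_map(src, src_mean=120.0, src_std=20.0,
                     tgt_mean=210.0, tgt_std=30.0)
    # out → [180., 210., 240.]
    ```

    The mapping preserves each value's position in the speaker's distribution while matching the target's mean level and range; the finer accent contours need the richer GMM transformation.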
  • Joao Miranda, Joao Paulo Neto, Alan W Black
    ABSTRACT: In this paper, we present a technique to use the information in multiple parallel speech streams, which are approximate translations of each other, in order to improve performance in a punctuation recovery task. We first build a phrase-level alignment of these multiple streams, using phrase tables to link the phrase pairs together. The information so collected is then used to make it more likely that sentence units are equivalent across streams. We applied this technique to a number of simultaneously interpreted speeches of the European Parliament Committees, for the recovery of the full stop, in four different languages (English, Italian, Portuguese and Spanish). We observed an average improvement in SER of 37% when compared to an existing baseline, in Portuguese and English.
    Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on; 01/2013
  • Source
    INTERSPEECH; 01/2013
  • Proceedings of COLING 2012: Posters; 12/2012
  • Source
    João Miranda, João P Neto, Alan W Black
    ABSTRACT: In a growing number of applications, such as simultaneous interpretation, audio or text may be available conveying the same information in different languages. These different views contain redundant information that can be explored to enhance the performance of speech and language processing applications. We propose a method that directly integrates ASR word graphs or lattices and phrase tables from an SMT system to combine such parallel speech data and improve ASR performance. We apply this technique to speeches from four European Parliament committees and obtain a 16.6% relative improvement (20.8% after a second iteration) in WER, when Portuguese and Spanish interpreted versions are combined with the original English speeches. Our results indicate that further improvements may be possible by including additional languages.
  • Source
    ABSTRACT: One challenge of implementing spoken dialogue systems for long-term interaction is how to adapt the dialogue as user and system become more familiar. We believe this challenge includes evoking and signaling aspects of long-term relationships such as rapport. For tutoring systems, this may additionally require knowing how relationships are signaled among non-adult users. We therefore investigate conversational strategies used by teenagers in peer tutoring dialogues, and how these strategies function differently among friends or strangers. In particular, we use annotated and automatically extracted linguistic devices to predict impoliteness and positivity in the next turn. To take into account the sparse nature of these features in real data we use models including Lasso, ridge estimator, and elastic net. We evaluate the predictive power of our models under various settings, and compare our sparse models with standard non-sparse solutions. Our experiments demonstrate that our models are more accurate than non-sparse models quantitatively, and that teens use unexpected kinds of language to do relationship work such as signaling rapport, but friends and strangers, tutors and tutees, carry out this work in quite different ways from one another.
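    The contrast between sparse and non-sparse models that the abstract relies on can be illustrated with scikit-learn: when most features are irrelevant (as with sparse linguistic devices), the Lasso drives their coefficients to exactly zero, while ridge regression only shrinks them. The synthetic data and penalty strengths below are assumptions for illustration.

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 50))
    true_w = np.zeros(50)
    true_w[:3] = [2.0, -1.5, 1.0]          # only 3 of 50 features matter
    y = X @ true_w + rng.normal(0, 0.1, 200)

    lasso = Lasso(alpha=0.1).fit(X, y)     # L1 penalty: exact zeros
    ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty: small but nonzero

    n_zero_lasso = int(np.sum(lasso.coef_ == 0))
    n_zero_ridge = int(np.sum(ridge.coef_ == 0))
    ```

    The zero coefficients are also what make sparse models interpretable here: the surviving features are the linguistic devices that actually predict impoliteness or positivity.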
  • Source
    ABSTRACT: This paper discusses the efforts in collecting speech databases for Indian languages – Bengali, Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu. We discuss relevant design considerations in collecting these databases, and demonstrate their usage in speech synthesis. By releasing these speech databases in the public domain without any restrictions for non-commercial and commercial purposes, we hope to promote research and developmental activities in building speech synthesis systems in Indian languages.
  • Alan W. Black, Maxine Eskenazi
    ABSTRACT: A spoken dialog system consists of a number of non-trivially interacting components. In order to allow new students, researchers, and developers to meaningfully and relatively rapidly enter the field, it is critical that, despite their complexity, the resources be accessible and easy to use. Everyone should be able to start building new technologies without spending a significant amount of time re-inventing the wheel. There are four levels of support that we believe new entrants should have: 1) a flexible open source system that runs on many different operating systems, is well documented, and supports both simple and complex dialog systems; 2) logs and speech files from a large number of dialogs that enable analysis and training of new systems and techniques; 3) an actual set of real users that speak to the system on a regular basis; and 4) the ability to run studies on complete real user platforms.
    NAACL-HLT Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data; 06/2012
  • ABSTRACT: This paper presents an approach for transfer of speaker intent in speech-to-speech machine translation (S2SMT). Specifically, we describe techniques to retain the prominence patterns of the source language utterance through the translation pipeline and impose this information during speech synthesis in the target language. We first present an analysis of word focus across languages to motivate the problem of transfer. We then propose an approach for training an appropriate transfer function for intonation on a parallel speech corpus in the two languages within which the translation is carried out. We present our analysis and experiments on English↔Portuguese and English↔German language pairs and evaluate the proposed transformation techniques through objective measures.
    Spoken Language Technology Workshop (SLT), 2012 IEEE; 01/2012
  • Source
    ABSTRACT: This paper describes some of the results from the project entitled “New Parameterization for Emotional Speech Synthesis” held at the Summer 2011 JHU CLSP workshop. We describe experiments on how to use articulatory features as a meaningful intermediate representation for speech synthesis. This parameterization not only allows us to reproduce natural sounding speech but also allows us to generate stylistically varying speech.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012
  • Source
    Kishore Prahallad, A.W. Black
    ABSTRACT: One of the issues in using audio books for building a synthetic voice is the segmentation of large speech files. The use of the Viterbi algorithm to obtain phone boundaries on large audio files fails primarily because of huge memory requirements. Earlier works have attempted to resolve this problem by using a large-vocabulary speech recognition system employing a restricted dictionary and language model. In this paper, we propose suitable modifications to the Viterbi algorithm and demonstrate its usefulness for segmentation of large speech files in audio books. The utterances obtained from large speech files in audio books are used to build synthetic voices. We show that synthetic voices built from audio books in the public domain have Mel-cepstral distortion scores in the range of 4-7, which is similar to voices built from studio-quality recordings such as CMU ARCTIC.
    IEEE Transactions on Audio Speech and Language Processing 08/2011; 19(5):1444-1449. DOI:10.1109/TASL.2010.2081980 · 2.63 Impact Factor
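    The Mel-cepstral distortion scores cited above are a standard objective measure for synthetic voices: the average frame-wise Euclidean distance between reference and synthesized mel-cepstra (conventionally excluding c0, the energy term), scaled to decibels. A sketch of the usual formulation, assuming the two sequences are already time-aligned:

    ```python
    import numpy as np

    def mcd(ref, syn):
        """Mel-cepstral distortion in dB.

        ref, syn: (frames, coeffs) mel-cepstra, aligned frame-by-frame;
        column 0 is assumed to be c0 and is excluded.
        """
        diff = ref[:, 1:] - syn[:, 1:]
        k = 10.0 / np.log(10.0) * np.sqrt(2.0)  # dB scaling constant
        return k * np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
    ```

    On this scale, the 4-7 dB range reported for audiobook voices is comparable to voices built from clean studio recordings.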
  • Florian Metze, Alan Black, Tim Polzehl
    ABSTRACT: In this paper, we will discuss state-of-the-art techniques for personality-aware user interfaces, and summarize recent work in automatically recognizing and synthesizing speech with “personality”. We present an overview of personality “metrics”, and show how they can be applied to the perception of voices, not only the description of personally known individuals. We present use cases for personality-aware speech input and/or output, and discuss approaches to defining “personality” in this context. We take a middle-of-the-road approach, i.e., we will not try to uncover all fundamental aspects of personality in speech, but neither will we aim for ad-hoc solutions that serve a single purpose, for example to create a positive attitude in a user, but do not generate transferable knowledge for other interfaces.
    Human-Computer Interaction. Interaction Techniques and Environments - 14th International Conference, HCI International 2011, Orlando, FL, USA, July 9-14, 2011, Proceedings, Part II; 01/2011
  • INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011; 01/2011
  • Source
    ABSTRACT: The Spoken Dialog Challenge 2010 was an exercise to investigate how different spoken dialog systems perform on the same task. The existing Let's Go Pittsburgh Bus Information System was used as a task and four teams provided systems that were first tested in controlled conditions with speech researchers as users. The three most stable systems were then deployed to real callers. This paper presents the results of the live tests, and compares them with the control test results. Results show considerable variation both between systems and between the control and live tests. Interestingly, relatively high task completion for controlled tests did not always predict relatively high task completion for live tests. Moreover, even though the systems were quite different in their designs, we saw very similar correlations between word error rate and task completion for all the systems. The dialog data collected is available to the research community.

Publication Stats

5k Citations
109.42 Total Impact Points


  • 2–2014
    • Carnegie Mellon University
      • Language Technologies Institute
      • Computer Science Department
      Pittsburgh, Pennsylvania, United States
  • 2007
    • Nara Institute of Science and Technology
      • Graduate School of Information Science
      Ikoma, Nara, Japan
  • 2003
    • Phoenix Software International
      El Segundo, California, United States
  • 1996–2003
    • The University of Edinburgh
      • Centre for Speech Technology Research
      • Human Communication Research Centre
      Edinburgh, Scotland, United Kingdom
  • 2002
    • Nagoya Institute of Technology
      • Department of Computer Science and Engineering
      Nagoya, Aichi, Japan