Alan W Black

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

Publications (204) · 122.57 Total impact

  • Wang Ling · Isabel Trancoso · Chris Dyer · Alan W Black
    ABSTRACT: We introduce a neural machine translation model that views the input and output sentences as sequences of characters rather than words. Since word-level information provides a crucial source of bias, our input model composes representations of character sequences into representations of words (as determined by whitespace boundaries), and then these are translated using a joint attention/translation model. In the target language, the translation is modeled as a sequence of word vectors, but each word is generated one character at a time, conditional on the previous character generations in each word. As the representation and generation of words are performed at the character level, our model is capable of interpreting and generating unseen word forms. A secondary benefit of this approach is that it alleviates many of the challenges associated with preprocessing/tokenization of the source and target languages. We show that our model can achieve translation results that are on par with conventional word-based models.
    No preview · Article · Nov 2015
  • ABSTRACT: We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs. Relative to traditional word representation models that have independent vectors for each word type, our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form-function relationship in language, our "composed" word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages (e.g., Turkish).
    Full-text · Article · Aug 2015
  • Wang Ling · Chris Dyer · Alan Black · Isabel Trancoso
    ABSTRACT: We present two simple modifications to the models in the popular Word2Vec tool, in order to generate embeddings more suited to tasks involving syntax. The main issue with the original models is the fact that they are insensitive to word order. While order independence is useful for inducing semantic representations, this leads to suboptimal results when they are used to solve syntax-based problems. We show improvements in part-of-speech tagging and dependency parsing using our proposed models.
    Full-text · Conference Paper · May 2015
  • ABSTRACT: In this paper, we build a corpus of tweets from Twitter annotated with keywords using crowdsourcing methods. We identify key differences between this domain and the work performed on other domains, such as news, which cause existing approaches for automatic keyword extraction to generalize poorly to Twitter datasets. These differences include the small amount of content in each tweet, the frequent usage of lexical variants, and the high variance in the number of keywords present in each tweet. We propose methods for addressing these issues, which lead to solid improvements on this dataset for this task.
    Full-text · Article · Jan 2015
  • ABSTRACT: In spoken dialog systems, dialog state tracking refers to the task of correctly inferring the user's goal at a given turn, given all of the dialog history up to that turn. The Dialog State Tracking Challenge is a research community challenge task that has run for three rounds. The challenge has given rise to a host of new methods for dialog state tracking and also to deeper understanding about the problem itself, including methods for evaluation.
    No preview · Article · Dec 2014 · AI Magazine
  • Prasanna Kumar Muthukumar · Alan W. Black
    ABSTRACT: Nearly all Statistical Parametric Speech Synthesizers today use Mel Cepstral coefficients as the vocal tract parameterization of the speech signal. Mel Cepstral coefficients were never intended to work in a parametric speech synthesis framework, but as yet, there has been little success in creating a better parameterization that is more suited to synthesis. In this paper, we use deep learning algorithms to investigate a data-driven parameterization technique that is designed for the specific requirements of synthesis. We create an invertible, low-dimensional, noise-robust encoding of the Mel Log Spectrum by training a tapered Stacked Denoising Autoencoder (SDA). This SDA is then unwrapped and used as the initialization for a Multi-Layer Perceptron (MLP). The MLP is fine-tuned by training it to reconstruct the input at the output layer. This MLP is then split down the middle to form encoding and decoding networks. These networks produce a parameterization of the Mel Log Spectrum that is intended to better fulfill the requirements of synthesis. Results are reported for experiments conducted using this resulting parameterization with the ClusterGen speech synthesizer.
    Preview · Article · Sep 2014
  • Wang Ling · Luis Marujo · Chris Dyer · Alan W Black · Isabel Trancoso

    No preview · Conference Paper · Jun 2014
  • Prasanna Kumar Muthukumar · Alan W Black
    ABSTRACT: Speech synthesis systems are typically built with speech data and transcriptions. In this paper, we try to build synthesis systems when no transcriptions or knowledge about the language are available. It is usually necessary to at least possess phonetic knowledge about the language. In this paper, we propose an automated way of obtaining phones and phonetic knowledge about the corpus at hand by making use of Articulatory Features (AFs). An Articulatory Feature predictor is trained on a bootstrap corpus in an arbitrary other language using a three-hidden-layer neural network. This neural network is run on the speech corpus to extract AFs. Hierarchical clustering is used to cluster the AFs into categories, i.e., phones. Phonetic information about each of these inferred phones is obtained by computing the mean of the AFs in each cluster. Results of systems built with this framework in multiple languages are reported.
    No preview · Conference Paper · May 2014
  • ABSTRACT: The 15 papers in this special issue focus on statistical parametric speech synthesis.
    Full-text · Article · Apr 2014 · IEEE Journal of Selected Topics in Signal Processing
  • Joao Miranda · Joao Paulo Neto · Alan W Black
    ABSTRACT: In this paper, we present a technique to use the information in multiple parallel speech streams, which are approximate translations of each other, in order to improve performance in a punctuation recovery task. We first build a phrase-level alignment of these multiple streams, using phrase tables to link the phrase pairs together. The information so collected is then used to make it more likely that sentence units are equivalent across streams. We applied this technique to a number of simultaneously interpreted speeches of the European Parliament Committees, for the recovery of the full stop, in four different languages (English, Italian, Portuguese and Spanish). We observed an average improvement in SER of 37% when compared to an existing baseline, in Portuguese and English.
    No preview · Conference Paper · Dec 2013
  • ABSTRACT: Intonational Phonology deals with the systematic way in which speakers effectively use pitch to add appropriate emphasis to the underlying string of words in an utterance. Two widely discussed aspects of pitch are the pitch accents and boundary events. These provide an insight into the sentence type, speaker attitude, linguistic background, and other aspects of prosodic form. The main hurdle, however, is the difficulty in getting annotations of these attributes in "real" speech. Besides being language independent, these attributes are known to be subjective and prone to high inter-annotator disagreements. Our investigations aim to automatically derive phonological aspects of intonation from large speech databases. Recurring and salient patterns in the pitch contours, observed jointly with an underlying linguistic context are automatically detected. Our computational framework unifies complementary paradigms such as the physiological Fujisaki model, Autosegmental Metrical phonology, and elegant pitch stylization, to automatically (i) discover phonologically atomic units to describe the pitch contours and (ii) build inventories of tones and long term trends appropriate for the given speech database, either large multi-speaker or single speaker databases, such as audiobooks. We successfully demonstrate the framework in expressive speech synthesis. There is also immense potential for the approach in speaker, style, and language characterization.
    No preview · Article · Nov 2013 · The Journal of the Acoustical Society of America
  • ABSTRACT: This paper presents an 'Accent Group' based intonation model for statistical parametric speech synthesis. We propose an approach to automatically model phonetic realizations of fundamental frequency (F0) contours as a sequence of intonational events anchored to a group of syllables (an Accent Group). We train an accent grouping model specific to the speaker, using a stochastic context-free grammar and contextual decision trees on the syllables. This model is used to 'parse' an unseen text into its constituent accent groups, over each of which appropriate intonation is predicted. The performance of the model is shown objectively and subjectively on a variety of prosodically diverse tasks: read speech, news broadcast, and audiobooks.
    No preview · Conference Paper · Oct 2013
  • ABSTRACT: In this paper, we present a new approach to F0 transformation that can capture aspects of speaking style. Instead of using the traditional 5 ms frames as units in transformation, we propose a method that looks at longer phonological regions such as metrical feet. We automatically detect metrical feet in the source speech, and for each of the source speaker's feet, we find its phonological correspondence in the target speech. We use a statistical phrase accent model to represent the F0 contour, where the F0 is parameterized with a 4-dimensional TILT representation over each foot region for the source and target speakers. This forms the parallel data that is the training data for our transformation. We transform the phrase component using simple z-score mapping. We use a joint density Gaussian mixture model to transform the accent contours. Our transformation method generates F0 contours that are significantly more correlated with the target speech than a baseline, frame-based method.
    No preview · Conference Paper · Oct 2013
  • J. Miranda · J.P. Neto · A.W. Black
    ABSTRACT: We propose a method to combine audio of a lecture with its supporting slides in order to improve automatic speech recognition performance. We view both the lecture speech and the slides as parallel streams which contain redundant information. We integrate both streams in order to bias the recognizer's language model towards the words in the slides, by first aligning the speech with the slide words, thus correcting errors on the ASR transcripts. We obtain a 5.9% relative WER improvement on a lecture test set, when compared to a speech recognition only system.
    No preview · Conference Paper · Oct 2013
  • ABSTRACT: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others "retweet" translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at
    Full-text · Dataset · Aug 2013
  • ABSTRACT: Speech synthesis technology has reached the stage where, given a well-designed corpus of audio and accurate transcription, an at least understandable synthesizer can be built without necessarily resorting to new innovations. However, many languages do not have a well-defined writing system, but such languages could still greatly benefit from speech systems. In this paper we consider the case where we have a (potentially large) single speaker database but have no transcriptions and no standardized way to write transcriptions. To address this scenario we propose a method that allows us to bootstrap synthetic voices purely from speech data. We use a novel combination of automatic speech recognition and automatic word segmentation for the bootstrapping. Our experimental results on speech corpora in two languages, English and German, show that synthetic voices that are built using this method are close to understandable. Our method is language-independent and can thus be used to build synthetic voices from a speech corpus in any new language.
    Full-text · Conference Paper · May 2013
  • P.K. Muthukumar · A.W. Black · H.T. Bunnell
    ABSTRACT: Every parametric speech synthesizer requires a good excitation model to produce speech that sounds natural. In this paper, we describe efforts toward building one such model using the Liljencrants-Fant (LF) model. We used the Iterative Adaptive Inverse Filtering technique to derive an initial estimate of the glottal flow derivative (GFD). Candidate pitch periods in the estimated GFD were then located and LF model parameters estimated using a gradient descent optimization algorithm. Residual energy in the GFD, after subtracting the fitted LF signal, was then modeled by a 4-term LPC model plus energy term to extend the excitation model and account for source information not captured by the LF model. The ClusterGen speech synthesizer was then trained to predict these excitation parameters from text so that the excitation model could be used for speech synthesis. ClusterGen excitation predictions were further used to reinitialize the excitation fitting process and iteratively improve the fit by including modeled voicing and segmental influences on the LF parameters. The results of all of these methods have been confirmed both using listening tests and objective metrics.
    No preview · Article · Jan 2013

  • No preview · Article · Jan 2013
  • ABSTRACT: This paper uses a crowd-sourced definition of a speech phenomenon we have called "focus". Given sentences, text and speech, in isolation and in context, we asked annotators to identify what we term the "focus" word. We present their consistency in identifying the focused word, when presented with text or speech stimuli. We then build models to show how well we predict that focus word from lexical (and higher) level features. Also, using spectral and prosodic information, we show the differences in these focus words when spoken with and without context. Finally, we show how we can improve speech synthesis of these utterances given focus information.
    Full-text · Conference Paper · Jan 2013
  • ABSTRACT: This paper presents an approach for transfer of speaker intent in speech-to-speech machine translation (S2SMT). Specifically, we describe techniques to retain the prominence patterns of the source language utterance through the translation pipeline and impose this information during speech synthesis in the target language. We first present an analysis of word focus across languages to motivate the problem of transfer. We then propose an approach for training an appropriate transfer function for intonation on a parallel speech corpus in the two languages within which the translation is carried out. We present our analysis and experiments on English↔Portuguese and English↔German language pairs and evaluate the proposed transformation techniques through objective measures.
    No preview · Conference Paper · Dec 2012

Publication Stats

5k Citations
122.57 Total Impact Points


  • 2-2014
    • Carnegie Mellon University
      • • Language Technologies Institute
      • • Computer Science Department
      Pittsburgh, Pennsylvania, United States
  • 2003
    • Phoenix Software International
      El Segundo, California, United States
  • 1996-2003
    • The University of Edinburgh
      • • Centre for Speech Technology Research
      • • Human Communication Research Centre
      Edinburgh, Scotland, United Kingdom
  • 2002
    • Nagoya Institute of Technology
      • Department of Computer Science and Engineering
      Nagoya, Aichi, Japan