A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation.

Journal of Machine Learning Research (Impact Factor: 2.47). 01/2009; 5:607-614.
Source: DBLP


In this paper we present a doubly hierarchical Pitman-Yor process language model. Its bottom layer of hierarchy consists of multiple hierarchical Pitman-Yor process language models, one each for some number of domains. The novel top layer of hierarchy consists of a mechanism to couple together multiple language models such that they share statistical strength. Intuitively this sharing results in the "adaptation" of a latent shared language model to each domain. We introduce a general formalism capable of describing the overall model which we call the graphical Pitman-Yor process and explain how to perform Bayesian inference in it. We present encouraging language model domain adaptation results that both illustrate the potential benefits of our new model and suggest new avenues of inquiry.

  • Source
    • "Experiments on the AP News corpus showed that the novel hierarchical Pitman-Yor process language model produces results superior to hierarchical Dirichlet language models and n-gram LMs smoothed by interpolated Kneser-Ney (IKN), and comparable to those smoothed by modified Kneser-Ney (MKN) [12]. Wood and Teh [35] "
    ABSTRACT: Traditional n-gram language models are widely used in state-of-the-art large vocabulary speech recognition systems. This simple model suffers from some limitations, such as overfitting of maximum-likelihood estimation and the lack of rich contextual knowledge sources. In this paper, we exploit a hierarchical Bayesian interpretation for language modeling, based on a nonparametric prior called Pitman-Yor process. This offers a principled approach to language model smoothing, embedding the power-law distribution for natural language. Experiments on the recognition of conversational speech in multiparty meetings demonstrate that by using hierarchical Bayesian language models, we are able to achieve significant reductions in perplexity and word error rate.
    Preview · Article · Dec 2010 · IEEE Transactions on Audio Speech and Language Processing
  • Source
    • "The general theory of PDPs applies them to arbitrary measurable spaces (Ishwaran and James, 2001), for instance real valued spaces, but in many recent applications, such as language and vision applications, the domain is countable (e.g., "English words") and standard theory requires some modifications. In language domains, PDPs and DPs are proving useful for full probability modelling of various phenomena including n-gram modelling and smoothing (Teh, 2006b; Goldwater et al., 2006; Mochihashi and Sumita, 2008), dependency models for grammar (Johnson et al., 2007; Wallach et al., 2008), and for data compression (Wood et al., 2009). The PDP-based n-gram models correspond well to versions of Kneser-Ney smoothing (Teh, 2006b), the state of the art method in applications."
    ABSTRACT: Exact Bayesian network inference exists for Gaussian and multinomial distributions. For other kinds of distributions, approximations or restrictions on the kind of inference done are needed. In this paper we present generalized networks of Dirichlet distributions, and show how, using the two-parameter Poisson-Dirichlet distribution and Gibbs sampling, one can do approximate inference over them. This involves integrating out the probability vectors but leaving auxiliary discrete count vectors in their place. We illustrate the technique by extending standard topic models to "structured" documents, where the document structure is given by a Bayesian network of Dirichlets.
    Full-text · Article · Sep 2010
  • Source
    • "We used the CRF sampler outlined in Section 5 with the addition of Metropolis-Hastings updates for the discount parameters (Wood & Teh, 2009). The discounts in the collapsed node restaurants are products of subsets of discount parameters making other approaches difficult. "
    ABSTRACT: We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce fragmentation operators necessary to do predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.
    Full-text · Conference Paper · Jan 2009