Conference Paper

Experiments with Non-parametric Topic Models

DOI: 10.1145/2623330.2623691
Conference: 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, USA

ABSTRACT

In topic modelling, various alternative priors have been developed, for instance asymmetric and symmetric priors for the document-topic and topic-word matrices respectively, the hierarchical Dirichlet process prior for the document-topic matrix and the hierarchical Pitman-Yor process prior for the topic-word matrix. For information retrieval, language models exhibiting word burstiness are important. Indeed, this burstiness effect has been shown to help topic models as well, and this requires additional word probability vectors for each document. Here we show how to combine these ideas to develop high-performing non-parametric topic models exhibiting burstiness based on standard Gibbs sampling. Experiments are done to explore the behaviour of the models under different conditions and to compare the algorithms with previously published ones. The full non-parametric topic models with burstiness are only a small factor slower than standard Gibbs sampling for LDA and require double the memory, making them very competitive. We look at the comparative behaviour of different models and present some experimental insights.


Available from: Wray Buntine, Jun 24, 2014
    • "In contrast, if K is too large, some valuable features will be broken down into too many small features. There are some existing works using nonparametric topic modeling technique to infer the number of topics (Teh et al., 2006; Buntine and Mishra, 2014). However, they are also sensitive to the hyperparameter choice. "
    ABSTRACT: In the age of Web 2.0, user generated content (UGC), such as user review and social tag, ubiquitously exists on the Internet. Although there exist different kinds of UGC in recommender systems, the existing works only studied a single kind of UGC in each of their papers. Thus, the previous works lose a chance to uncover the similar effects of different kinds of UGC in recommender systems. In this paper, we propose a unified way to utilize various types of UGC to enhance the recommendation accuracy. We build two novel statistical models, which are based on collaborative filtering and topic modeling. Incorporating UGC text, one model focuses on learning user preferences, and the other model aims to learn user preferences and item aspects jointly. With an effective parameter estimation algorithm, our models can not only acquire prediction values of missing ratings, but also produce interpretable topics. We conducted comprehensive experiments on three real-world datasets. The experimental results demonstrate that our proposed models can achieve large improvements compared to several well-known baseline models.
    Full-text · Article · Jul 2015 · Engineering Applications of Artificial Intelligence
    • "Modelling word burstiness (Buntine and Mishra, 2014) is important since, as shown in Section 6, words in a document are likely to repeat in the document. This is addressed by making topics bursty, so each document only focuses on a subset of words in the topic. "
    ABSTRACT: Bibliographic analysis considers author’s research areas, the citation network and paper content among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents using a non-parametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. We propose a novel and efficient inference algorithm for the model to explore subsets of research publications from CiteSeerX. Our model demonstrates improved performance in both model fitting and a clustering task compared to several baselines.
    No preview · Article · Jan 2014