Conference Paper

Global Models of Document Structure using Latent Permutations.


Abstract

We present a novel Bayesian topic model for learning discourse-level document structure. Our model leverages insights from discourse theory to constrain latent topic assignments in a way that reflects the underlying organization of document topics. We propose a global model in which both topic selection and ordering are biased to be similar across a collection of related documents. We show that this space of orderings can be elegantly represented using a distribution over permutations called the generalized Mallows model. Our structure-aware approach substantially outperforms alternative approaches for cross-document comparison and single-document segmentation.
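The generalized Mallows model referenced above factors a permutation of n topics into position-wise inversion counts relative to a canonical ordering and penalizes each count with its own dispersion parameter. Below is a minimal sketch of that density in its standard inversion-count form (Fligner and Verducci, 1986); the topic names and dispersion values are illustrative, not taken from the paper.

```python
import math

def inversion_counts(perm, canonical):
    # v[j] = how many items that follow canonical[j] in the canonical order
    # appear before it in `perm`; sum(v) is the Kendall tau distance.
    pos = {item: i for i, item in enumerate(perm)}
    n = len(canonical)
    return [sum(1 for later in canonical[j + 1:] if pos[later] < pos[canonical[j]])
            for j in range(n - 1)]

def gmm_log_prob(perm, canonical, rho):
    # log GMM(v; rho) = sum_j ( -rho_j * v_j - log psi_j(rho_j) ),
    # where psi_j(rho) = (1 - exp(-m_j * rho)) / (1 - exp(-rho)) normalizes over
    # the m_j possible values of v_j.
    v = inversion_counts(perm, canonical)
    n = len(canonical)
    logp = 0.0
    for j, (vj, r) in enumerate(zip(v, rho)):
        m = n - j                      # v_j can take values 0 .. m-1
        psi = (1 - math.exp(-m * r)) / (1 - math.exp(-r))
        logp += -r * vj - math.log(psi)
    return logp

# Example: 5 topics; larger rho values concentrate mass near the canonical order.
canonical = ["history", "geography", "economy", "demographics", "culture"]
observed  = ["history", "economy", "geography", "demographics", "culture"]
print(gmm_log_prob(observed, canonical, rho=[1.0, 1.0, 1.0, 1.0]))
```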


... The first challenge is to represent the paragraph-level topic structure comprehensively. Most paragraph-level corpora (e.g., the Cities and Elements corpora (Chen et al., 2009)) follow the sentence-level topic structure representation, treating topic structure as flat and using keywords as topic content, without modeling the macro structure of the document well. Since the basic units (paragraphs) are longer at the paragraph level, keywords or phrases (Koshorek et al., 2018; Arnold et al., 2019) cannot express the richer topic information they contain (Todd, 2016). ...
... Another challenge is to build a large-scale corpus with high quality. Since paragraph-level topic structure annotation requires assigning a topic to each paragraph, which is laborious and time-consuming (Seale and Silverman, 1997; Todd, 2011), the existing high-quality corpora (Chen et al., 2009; Xu et al., 2021) are relatively small. Besides, manual annotation may be subjective and may deviate from the author's intention due to topic ambiguity. ...
... However, paragraph-level topic annotation is relatively tricky, due to the longer basic units and the need to grasp the content from the overall perspective of the document. The manually annotated corpora are relatively small; for example, Cities and Elements (Chen et al., 2009) only contain about 100 documents each. Therefore, recent research has also shifted towards automatic extraction (Liu et al., 2022). ...
Preprint
Full-text available
Topic segmentation and outline generation strive to divide a document into coherent topic sections and generate corresponding subheadings. Such a process unveils the discourse topic structure of a document that benefits quickly grasping and understanding the overall context of the document from a higher level. However, research and applications in this field have been restrained due to the lack of proper paragraph-level topic representations and large-scale, high-quality corpora in Chinese compared to the success achieved in English. Addressing these issues, we introduce a hierarchical paragraph-level topic structure representation with title, subheading, and paragraph that comprehensively models the document discourse topic structure. In addition, we ensure a more holistic representation of topic distribution within the document by using sentences instead of keywords to represent sub-topics. Following this representation, we construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), four times larger than the previously largest one. We also employ a two-stage man-machine collaborative annotation method to ensure the high quality of the corpus both in form and semantics. Finally, we validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation) by several strong baselines, and its efficacy has been preliminarily confirmed on the downstream task: discourse parsing. The representation, corpus, and benchmark we established will provide a solid foundation for future studies.
... Moreover, some datasets (Choi, 2000) were synthesized automatically and thus do not represent the natural distribution of text in documents. Because no large labeled dataset exists, prior work on text segmentation tried to either come up with heuristics for identifying whether two sentences discuss the same topic (Choi, 2000;Glavaš et al., 2016), or to model topics explicitly with methods such as LDA (Blei et al., 2003) that assign a topic to each paragraph or sentence (Chen et al., 2009). ...
... It is a synthetic dataset containing 920 documents, where each document is a concatenation of 10 random passages from the Brown corpus. Glavaš et al. (2016) created a dataset of their own, which consists of 5 manually-segmented political manifestos from the Manifesto project. Chen et al. (2009) also used English Wikipedia documents to evaluate text segmentation. They defined two datasets, one with 100 documents about major cities and one with 118 documents about chemical elements. ...
... Bayesian text segmentation methods (Chen et al., 2009;Riedl and Biemann, 2012) employ a generative probabilistic model for text. In these models, a document is represented as a set of topics, which are sampled from a topic distribution, and each topic imposes a distribution over the vocabulary. ...
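The generative view sketched in this excerpt (a document drawn as a sequence of segments, each tied to a topic that imposes its own distribution over the vocabulary) can be illustrated with a toy sampler. This is a deliberately simplified, LDA-style sketch, not the actual model of Chen et al. (2009) or Riedl and Biemann (2012); all vocabularies and probabilities are made up for illustration.

```python
import random

def sample_document(n_segments, words_per_segment, topic_word_dists, topic_probs):
    # Each segment draws a single topic from the document's topic distribution,
    # then emits words i.i.d. from that topic's distribution over the vocabulary.
    doc = []
    for _ in range(n_segments):
        topic = random.choices(range(len(topic_probs)), weights=topic_probs)[0]
        vocab, word_probs = topic_word_dists[topic]
        segment = random.choices(vocab, weights=word_probs, k=words_per_segment)
        doc.append((topic, segment))
    return doc

# Two toy topics with overlapping vocabularies.
topics = [
    (["river", "bridge", "port", "city"],    [0.4, 0.3, 0.2, 0.1]),
    (["exports", "industry", "port", "tax"], [0.4, 0.3, 0.2, 0.1]),
]
print(sample_document(n_segments=3, words_per_segment=5,
                      topic_word_dists=topics, topic_probs=[0.6, 0.4]))
```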
Article
Text segmentation, the task of dividing a document into contiguous segments based on its semantic structure, is a longstanding challenge in language understanding. Previous work on text segmentation focused on unsupervised methods such as clustering or graph search, due to the paucity in labeled data. In this work, we formulate text segmentation as a supervised learning problem, and present a large new dataset for text segmentation that is automatically extracted and labeled from Wikipedia. Moreover, we develop a segmentation model based on this dataset and show that it generalizes well to unseen natural text.
... Text semantic segmentation necessitates that sentences within the same paragraph cohesively revolve around a central topic, while maintaining minimal semantic overlap between distinct paragraphs. Early unsupervised methodologies identified segmentation boundaries through Bayesian models (Chen et al., 2009;Riedl and Biemann, 2012) or graph-based methods, where sentences were treated as nodes (Glavaš et al., 2016). On the other hand, supervised methods have leveraged pretrained language models (PLMs) derived from extensive corpora, subsequently fine-tuning them on annotated text semantic segmentation datasets. ...
... Text semantic segmentation includes supervised and unsupervised methods. Unsupervised methods, like the Bayesian approach of Chen et al. (2009), use probabilistic generative models for document segmentation; Riedl and Biemann (2012) applied a Bayesian method based on LDA, identifying segmentation boundaries through drops in coherence scores between adjacent sentences; Glavaš et al. (2016) introduced an unsupervised graph method, where sentences are nodes and edges represent semantic similarity, segmenting text by finding the maximum graph between adjacent sentences. Deep learning-based supervised methods have improved text semantic segmentation accuracy. ...
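The "drop in coherence scores between adjacent sentences" criterion mentioned above can be sketched generically: score each adjacent sentence pair by the similarity of their vector representations and place a boundary wherever the score falls below a threshold. The sketch below assumes sentence vectors are supplied by the caller and is illustrative rather than the specific procedure of Riedl and Biemann (2012).

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def segment_by_coherence_drop(sentence_vectors, threshold=0.5):
    # Returns indices i such that a segment boundary is placed after sentence i.
    boundaries = []
    for i in range(len(sentence_vectors) - 1):
        if cosine(sentence_vectors[i], sentence_vectors[i + 1]) < threshold:
            boundaries.append(i)
    return boundaries

# Toy vectors: the third sentence shifts topic, so a boundary appears after index 1.
vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(segment_by_coherence_drop(vecs))  # -> [1]
```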
Preprint
Full-text available
Text semantic segmentation involves partitioning a document into multiple paragraphs with continuous semantics based on the subject matter, contextual information, and document structure. Traditional approaches have typically relied on preprocessing documents into segments to address input length constraints, resulting in the loss of critical semantic information across segments. To address this, we present CrossFormer, a transformer-based model featuring a novel cross-segment fusion module that dynamically models latent semantic dependencies across document segments, substantially elevating segmentation accuracy. Additionally, CrossFormer can replace rule-based chunk methods within the Retrieval-Augmented Generation (RAG) system, producing more semantically coherent chunks that enhance its efficacy. Comprehensive evaluations confirm CrossFormer's state-of-the-art performance on public text semantic segmentation datasets, alongside considerable gains on RAG benchmarks.
... • Cities (Chen et al. 2009) has 100 articles about cities. ...
... • Elements (Chen et al. 2009) has 118 chemical elements articles generated from Wikipedia. ...
Article
Topic segmentation aims to reveal the latent structure of a document and divide it into multiple parts. However, current neural solutions are limited in the context modeling of sentences and feature representation of candidate boundaries. This causes the model to suffer from inefficient sentence context encoding and noise information interference. In this paper, we design a new text segmentation model SegFormer with unidirectional attention blocks to better model sentence representations. To alleviate the problem of noise information interference, SegFormer uses a novel additional context aggregator and a topic classification loss to guide the model to aggregate the information within the appropriate range. In addition, SegFormer applies an iterative prediction algorithm to search for optimal boundaries progressively. We evaluate SegFormer's generalization ability, multilingual ability, and application ability on multiple challenging real-world datasets. Experiments show that our model significantly improves the performance by 7.5% on the benchmark WIKI-SECTION compared to several strong baselines. The application of SegFormer to a real-world dataset to separate normal and advertisement segments in product marketing essays also achieves superior performance in the evaluation with other cutting-edge models.
... These segments can be composed of words, sentences, or topics, where the types of text include blogs, articles, news, video transcripts, etc. Previous work focused on heuristics-based methods [15,39], LDA-based modeling algorithms [6,8], or Bayesian methods [8,67]. Recent developments in natural language processing have produced large models that learn from huge amounts of data in a supervised manner [46,56,62,88]. ...
Preprint
Multimedia summarization with multimodal output can play an essential role in real-world applications, e.g., automatically generating cover images and titles for news articles or providing introductions to online videos. In this work, we propose a multimodal hierarchical multimedia summarization (MHMS) framework that models interactions between the visual and language domains to generate both video and textual summaries. Our MHMS method contains video and textual segmentation and summarization modules. It formulates a cross-domain alignment objective with optimal transport distance which leverages cross-domain interaction to generate the representative keyframe and textual summary. We evaluated MHMS on three recent multimodal datasets and demonstrated the effectiveness of our method in producing high-quality multimodal summaries.
... To the best of our knowledge, the combined task of segmentation and classification has not been approached on the full document level before. There exist a large number of data sets for text segmentation, but most of them do not reflect real-world topic drifts (Choi, 2000;Sehikh et al., 2017), do not include topic labels (Eisenstein and Barzilay, 2008;Jeong and Titov, 2010;Glavaš et al., 2016), or are heavily normalized and too small to be used for training neural networks (Chen et al., 2009). We can utilize a generic segmentation data set derived from Wikipedia that includes headings (Koshorek et al., 2018), but there is also a need in IR and QA for supervised structural topic labels (Agarwal and Yu, 2009;MacAvaney et al., 2018), different languages and more specific domains, such as clinical or biomedical research (Tepper et al., 2012;Tsatsaronis et al., 2012), and news-based TDT (Kumaran and Allan, 2004;Leetaru and Schrodt, 2013). ...
... However, their manually labeled data set only contains a sample of sentences from the documents, so they evaluated sentence classification as an isolated task. Chen et al. (2009) introduced two Wikipedia-based data sets for segmentation, one about large cities, the second about chemical elements. Although these data sets have been used to evaluate word-level and sentence-level segmentation (Koshorek et al., 2018), we are not aware of any topic classification approach on this data set. ...
Article
Full-text available
When searching for information, a human reader first glances over a document, spots relevant sections, and then focuses on a few sentences for resolving her intention. However, the high variance of document structure complicates the identification of the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available data set with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a highest score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR long short-term memory model with Bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 over state-of-the-art CNN classifiers with baseline segmentation.
... Dealing with permutations/rankings is a research area which has gained great interest from the machine learning community in recent years. The reason is that ranking data is ubiquitous nowadays, and applications can be found in several fields such as preference lists [1], voting in elections [2], information retrieval [3], collaborative filtering [4], combinatorial optimization [5], computational biology [6,7], etc. ...
... 2. Experiments to compare the GA with state-of-the-art algorithms. 3. Experiments to test the goodness of the proposed GA when the data do not come from a single distribution. ...
Article
Probabilistic reasoning and learning with permutation data has gained interest in recent years because of its use in different ranking-based real-world applications. Therefore, constructing a model from a given set of permutations or rankings has become a target problem in the machine learning community. In this paper we focus on probabilistic modelling and, concretely, on the use of a well-known permutation-based distribution, the Mallows model. Learning a Mallows model from data requires the estimation of two parameters, a consensus permutation π₀ and a dispersion parameter θ. Since the exact computation of these parameters is an NP-hard problem, it is natural to consider heuristics to tackle this problem. An interesting approach consists in the use of a two-step procedure, first estimating π₀, and then computing θ for the given π₀. This is possible because the optimal π₀ does not depend on θ. When following this approach, computation of π₀ reduces to the rank aggregation problem, which consists in finding the ranking which best represents the dataset. In this paper we propose to use genetic algorithms to tackle this problem, studying their performance with respect to state-of-the-art algorithms, especially in complex cases, that is, when the number of items to rank is large and there is little consensus between the available rankings (which translates into a low value for θ). After a series of experiments involving data of different types, we conclude that our evolutionary approach clearly outperforms the remaining tested algorithms.
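A minimal sketch of the two-step procedure described above: aggregate the rankings into a consensus π₀ (here with a simple Borda-count heuristic rather than the genetic algorithm the paper proposes), then compute the average Kendall distance to that consensus, which is the dispersion statistic from which θ is subsequently estimated.

```python
from itertools import combinations

def borda_consensus(rankings):
    # rankings: permutations of the same items (item listed first = most preferred).
    # Borda score = sum of positions across rankings; lower total = ranked earlier.
    items = rankings[0]
    score = {x: 0 for x in items}
    for r in rankings:
        for pos, x in enumerate(r):
            score[x] += pos
    return sorted(items, key=lambda x: score[x])

def kendall_distance(a, b):
    # Number of item pairs ordered differently in a and b.
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])

rankings = [list("abcd"), list("abdc"), list("bacd")]
pi0 = borda_consensus(rankings)
avg_dist = sum(kendall_distance(r, pi0) for r in rankings) / len(rankings)
print(pi0, avg_dist)   # theta is then fit from this dispersion statistic
```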
... Recently, topic models are increasingly being used for text analysis tasks such as summarisation (Arora and Ravindran, 2008) and segmentation (Misra et al., 2011; Eisenstein and Barzilay, 2008), often replacing earlier semantic techniques such as latent semantic analysis (Deerwester et al., 1990). Topic models can be improved by better modelling the semantic aspects of text, for instance integrating collocations into the model (Johnson, 2010; Hardisty et al., 2010) or encouraging topics to be more semantically coherent (Newman et al., 2011) based on lexical coherence models (Newman et al., 2010), by modelling the structural aspects of documents, for instance modelling a document as a set of segments (Du et al., 2010; Wang et al., 2011; Chen et al., 2009), or by improving the underlying statistical methods (Wallach et al., 2009). Topic models, like statistical parsing methods, are using more sophisticated latent variable methods in order to model different aspects of these problems. ...
... These models operate at a finer level than the segment (e.g., paragraph or section) level we are considering. To make a tool like the HMM work at higher levels, one needs to make stronger assumptions, for instance assigning each sentence a single topic so that topic-specific word models can be used: the hidden topic Markov model (Gruber et al., 2007) that models the transitional topic structure, a global model based on the generalised Mallows model (Chen et al., 2009), and an HMM-based content model (Barzilay and Lee, 2004). Researchers have also considered time series of topics: various kinds of dynamic topic models, following the early work of Blei and Lafferty (2006), represent a collection as a sequence of subcollections in epochs. ...
Conference Paper
Full-text available
Topic models are increasingly being used for text analysis tasks, often replacing earlier semantic techniques such as latent semantic analysis. In this paper, we develop a novel adaptive topic model with the ability to adapt topics from both the previous segment and the parent document. For this proposed model, a Gibbs sampler is developed for posterior inference. Experimental results show that with topic adaptation, our model significantly improves over existing approaches in terms of perplexity, and is able to uncover clear sequential structure in, for example, Herman Melville’s book “Moby Dick”.
... Both data sets used in this work are freely shared for research purposes and the authors do not anticipate any issues with their inclusion in this work. Three other data sets derived from Wikipedia were also considered: the WikiSection (Arnold et al., 2019) and the Cities and Elements data sets (Chen et al., 2009). However, due to the similarity to the Wiki-50 data set they were not included in this work. ...
... To train a classifier, a vast amount of labeled training data is needed. With the advent of segment-annotated datasets [2,3,4,5,6,7], deep neural network-based methods started to emerge. A popular approach is to use a hierarchical structure, where the lower layer projects sentences into an embedding space, and the upper layer then classifies whether a sentence marks a topic change or not [5,8]. ...
Conference Paper
Full-text available
The current shift from in-person to online education, e.g., through video lectures, requires novel techniques for quickly searching for and navigating through media content. At this point, an automatic segmentation of the videos into thematically coherent units can be beneficial. Like in a book, the topics in an educational video are often structured hierarchically. There are larger topics, which in turn are divided into different subtopics. We thus propose a metric that considers the hierarchical levels in the reference segmentation when evaluating segmentation algorithms. In addition, we propose a multilingual, unsupervised topic segmentation approach and evaluate it on three datasets with English, Portuguese and German lecture videos. We achieve WindowDiff scores of up to 0.373 and show the usefulness of our hierarchical metric.
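For reference, the WindowDiff score reported above is computed by sliding a fixed-size window over the reference and hypothesis segmentations and counting the windows in which the two disagree on the number of boundaries (lower is better). Below is a sketch of one common formulation, with segmentations given as 0/1 boundary indicators and an illustrative window size; it is not tied to the authors' evaluation code.

```python
def window_diff(reference, hypothesis, k):
    # reference, hypothesis: lists of 0/1 flags, 1 = boundary after that position.
    # k is typically set to half the mean reference segment length.
    n = len(reference)
    errors = sum(
        1 for i in range(n - k)
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    )
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
hyp = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]   # first boundary off by one position
print(window_diff(ref, hyp, k=3))
```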
... In the literature there are multiple English-language datasets for paragraph segmentation, such as the relatively small Choi [5] or CITIES and ELEMENTS [4], and the much bigger Wiki-727K and its subset Wiki-50 [18]. All datasets are available except the one proposed by Chen, and for that reason we omitted it. ...
Chapter
Full-text available
In this paper, we present paragraph segmentation using cross-lingual knowledge transfer models. In our solution, we investigate the quality of multilingual models, such as mBERT and XLM-RoBERTa, as well as language independent models, LASER and LaBSE. We study the quality of segmentation in 9 different European languages, both for each language separately and for all languages simultaneously. We offer high quality solutions while maintaining language independence. To achieve our goals, we introduced a new multilingual benchmark dataset called Multi-Wiki90k.
... Both supervised and unsupervised learning have been applied to text segmentation. With the lack of large-quantity labels for supervised training (Koshorek et al., 2018), unsupervised modeling based on clustering (Choi, 2000; Chen et al., 2009), Bayesian methods (Du et al., 2013, 2015; Malmasi et al., 2017) and graph methods (Glavaš et al., 2016; Malioutov and Barzilay, 2006) have been proposed. However, with the advancement of self-learning and transfer learning on deep neural networks, more recent supervised modeling approaches have been proposed that aim to predict labeled segment boundaries on smaller datasets. ...
Preprint
This paper proposes a transformer-over-transformer framework, called Transformer², to perform neural text segmentation. It consists of two components: bottom-level sentence encoders using pre-trained transformers, and an upper-level transformer-based segmentation model based on the sentence embeddings. The bottom-level component transfers the pre-trained knowledge learned from large external corpora under both single and pair-wise supervised NLP tasks to model the sentence embeddings for the documents. Given the sentence embeddings, the upper-level transformer is trained to recover the segmentation boundaries as well as the topic labels of each sentence. Equipped with a multi-task loss and the pre-trained knowledge, Transformer² can better capture the semantic coherence within the same segments. Our experiments show that (1) Transformer² manages to surpass state-of-the-art text segmentation models in terms of a commonly-used semantic coherence measure; (2) in most cases, both single and pair-wise pre-trained knowledge contribute to the model performance; (3) bottom-level sentence encoders pre-trained on specific languages yield better performance than those pre-trained on specific domains.
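The upper level of such a two-level architecture can be sketched schematically: a transformer encoder runs over precomputed sentence embeddings (assumed to come from the bottom-level pre-trained encoders) and feeds two per-sentence heads, one for boundaries and one for topic labels, trained with a weighted multi-task loss. This is a generic PyTorch sketch with illustrative dimensions, not the authors' Transformer² implementation.

```python
import torch
import torch.nn as nn

class UpperLevelSegmenter(nn.Module):
    """Upper-level encoder over sentence embeddings with two per-sentence heads."""
    def __init__(self, d_model=768, n_heads=8, n_layers=2, n_topics=30):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.boundary_head = nn.Linear(d_model, 2)       # boundary / no boundary
        self.topic_head = nn.Linear(d_model, n_topics)   # topic label per sentence

    def forward(self, sent_embs):                        # (batch, n_sents, d_model)
        h = self.encoder(sent_embs)
        return self.boundary_head(h), self.topic_head(h)

def multitask_loss(b_logits, t_logits, b_gold, t_gold, alpha=0.5):
    # Weighted sum of boundary and topic cross-entropies (the multi-task objective).
    ce = nn.CrossEntropyLoss()
    return alpha * ce(b_logits.transpose(1, 2), b_gold) + \
           (1 - alpha) * ce(t_logits.transpose(1, 2), t_gold)

model = UpperLevelSegmenter()
embs = torch.randn(1, 12, 768)          # stand-in for pre-trained sentence embeddings
b_logits, t_logits = model(embs)
loss = multitask_loss(b_logits, t_logits,
                      torch.zeros(1, 12, dtype=torch.long),
                      torch.zeros(1, 12, dtype=torch.long))
```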
... CHOI dataset is divided into subsets containing only documents with specific variability of segment lengths (e.g., segments with 3-5 or with 9-11 sentences). Finally, we evaluate the performance of our models on two small datasets, CITIES and ELEMENTS, created by Chen et al. (2009) from Wikipedia pages dedicated to the cities of the world and chemical elements, respectively. ...
Article
Breaking down the structure of long texts into semantically coherent segments makes the texts more readable and supports downstream applications like summarization and retrieval. Starting from an apparent link between text coherence and segmentation, we introduce a novel supervised model for text segmentation with simple but explicit coherence modeling. Our model – a neural architecture consisting of two hierarchically connected Transformer networks – is a multi-task learning model that couples the sentence-level segmentation objective with the coherence objective that differentiates correct sequences of sentences from corrupt ones. The proposed model, dubbed Coherence-Aware Text Segmentation (CATS), yields state-of-the-art segmentation performance on a collection of benchmark datasets. Furthermore, by coupling CATS with cross-lingual word embeddings, we demonstrate its effectiveness in zero-shot language transfer: it can successfully segment texts in languages unseen in training.
... Some approaches approximate the structure of a document via topic and entity sequences using local dependencies such as conditional probabilities (Lapata, 2003; Barzilay and Lapata, 2008) or Hidden Markov Models (Barzilay and Lee, 2004). More recently, global approaches which directly model the permutations of topics in the document have been proposed (Chen et al., 2009b). Following this line of work, one of our models uses the Generalized Mallows Model (Fligner and Verducci, 1986) in its generative process, which allows it to model permutations of complexity levels in the training data. ...
Article
Online discussion forums and community question-answering websites provide one of the primary avenues for online users to share information. In this paper, we propose text mining techniques which help users navigate troubleshooting-oriented data such as questions asked on forums and their suggested solutions. We introduce Bayesian generative models of the troubleshooting data and apply them to two interrelated tasks: (a) predicting the complexity of the solutions (e.g., plugging a keyboard in the computer is easier compared to installing a special driver) and (b) presenting them in a ranked order from least to most complex. Experimental results show that our models are on par with human performance on these tasks, while outperforming baselines based on solution length or readability.
... Given the relatively constant page-level categories that printers use to describe book design, we formulate our problem as a classification task into a set of pre-established categories. An alternative is to take an unsupervised approach, and learn the set of categories empirically from the data; this general problem of book segmentation in its unlabeled form shares functional similarity with other work in general unsupervised topic or discourse segmentation (Hearst, 1997; Utiyama and Isahara, 2001; Chen et al., 2009), most notably the work of Eisenstein and Barzilay (2008) (for whom the section labels may be considered a form of "cue words" akin to discourse markers). Given the amount of data in large-scale book collections, we see this as an interesting path forward (either in a fully unsupervised or semi-supervised setting); an unsupervised approach that includes aspects of metadata such as country of publication or publisher may also be fruitful in accommodating variation in printer's rules as a function of time and geographical location (books by French publishers, for example, often place the table of contents at the back of the book). ...
... There is also work on global coherence, focussing on the structure of a document as a whole [4,13,17,23]. However, not many coherence models represent both local and global coherence, even though those two are connected: local coherence is a prerequisite for global coherence [3], and there is psychological evidence that coherence on both local and global levels is manifested in text comprehension [57,59]. ...
Article
Full-text available
We present two novel models of document coherence and their application to information retrieval (IR). Both models approximate document coherence using discourse entities, e.g. the subject or object of a sentence. Our first model views text as a Markov process generating sequences of discourse entities (entity n-grams); we use the entropy of these entity n-grams to approximate the rate at which new information appears in text, reasoning that as more new words appear, the topic increasingly drifts and text coherence decreases. Our second model extends the work of Guinaudeau & Strube [28] that represents text as a graph of discourse entities, linked by different relations, such as their distance or adjacency in text. We use several graph topology metrics to approximate different aspects of the discourse flow that can indicate coherence, such as the average clustering or betweenness of discourse entities in text. Experiments with several instantiations of these models show that: (i) our models perform on a par with two other well-known models of text coherence even without any parameter tuning, and (ii) reranking retrieval results according to their coherence scores gives notable performance gains, confirming a relation between document coherence and relevance. This work contributes two novel models of document coherence, the application of which to IR complements recent work in the integration of document cohesiveness or comprehensibility to ranking [5, 56].
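The first of the two models above can be illustrated with a small sketch: read off the sequence of discourse entities in a document, collect entity bigrams, and use the empirical entropy of their distribution as a proxy for how quickly new information appears (higher entropy suggesting more drift and lower coherence). The entity sequences below are invented for illustration, and the sketch is not the paper's exact formulation.

```python
import math
from collections import Counter

def entity_bigram_entropy(entities):
    # Empirical Shannon entropy (in bits) of the distribution over entity bigrams.
    bigrams = list(zip(entities, entities[1:]))
    counts = Counter(bigrams)
    total = len(bigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

focused  = ["city", "river", "city", "river", "city", "river"]
drifting = ["city", "river", "mayor", "exports", "climate", "museum"]
print(entity_bigram_entropy(focused), entity_bigram_entropy(drifting))
# The drifting sequence yields higher entropy, i.e. lower estimated coherence.
```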
... For example, a main aspect in some machine learning tasks is how to combine the output of multiple classifiers in order to determine the most adequate class. Other relevant applications include computational biology [1], multi-agent planning [2], voting in elections [3], and information retrieval [4], among other interesting fields. Formally, the aggregation of preferences can be summarized as follows: given orderings over objects/items, where each ranking denotes the preference of a single expert, the goal is to build a consensus (aggregated) ranking taking into account all input rankings. ...
Conference Paper
Aggregating the preferences of multiple experts is a very old problem which remains without an absolute solution. This assertion is supported by Arrow's theorem: there is no aggregation method that simultaneously satisfies three fairness criteria (non-dictatorship, independence of irrelevant alternatives and Pareto efficiency). However, it is possible to find a solution having minimal distance to the consensus, although this involves an NP-hard problem even for only a few experts. This paper presents a model based on Ant Colony Optimization for tackling this problem when the input data are incomplete. This means that our model should build a complete ordering from partial rankings. Besides, we introduce a measure to determine the distance between items. It provides a more complete picture of the aggregated solution. In order to illustrate our contributions we use a real problem concerning Employer Branding issues in Belgium.
... Portions of this thesis are based upon material previously presented in peer-reviewed publications. The content modeling work of Chapter 2 builds from conference [35] and journal [34] publications. A version of the relation discovery work of Chapter 3 ...
Article
Semantic analysis is a core area of natural language understanding that has typically focused on predicting domain-independent representations. However, such representations are unable to fully realize the rich diversity of technical content prevalent in a variety of specialized domains. Taking the standard supervised approach to domain-specific semantic analysis requires expensive annotation effort for each new domain of interest. In this thesis, we study how multiple granularities of semantic analysis can be learned from unlabeled documents within the same domain. By exploiting in-domain regularities in the expression of text at various layers of linguistic phenomena, including lexicography, syntax, and discourse, the statistical approaches we propose induce multiple kinds of structure: relations at the phrase and sentence level, content models at the paragraph and section level, and semantic properties at the document level. Each of our models is formulated in a hierarchical Bayesian framework with the target structure captured as latent variables, allowing them to seamlessly incorporate linguistically-motivated prior and posterior constraints, as well as multiple kinds of observations. Our empirical results demonstrate that the proposed approaches can successfully extract hidden semantic structure over a variety of domains, outperforming multiple competitive baselines.
... One simple way to avoid this problem is to take into account other types of reference behaviour, such as zero anaphora and bridging anaphora, because this type of reference function can often relate distant discourse fragments (e.g. two clauses placed far from each other). In addition, although we focused on exploiting the relationship of discourse entities in terms of anaphoric functions, the (latent) topic transition in a text is another key for capturing text coherence, as discussed by Chen et al. (2009). Therefore, one interesting issue for discourse coherence is how to integrate the above factors into existing coherence models. ...
Conference Paper
Full-text available
We propose a simple and effective metric for automatically evaluating discourse coherence of a text using the outputs of a coreference resolution model. According to the idea that a writer tends to appropriately utilise coreference relations when writing a coherent text, we introduce a metric of discourse coherence based on automatically identified coreference relations. We empirically evaluated our metric by comparing it to the entity grid modelling by Barzilay and Lapata (2008) using Japanese newspaper articles as a target data set. The results indicate that our metric better reflects discourse coherence of texts than the existing model.
... For more on segmentation using topic models, we suggest reading [Eisenstein & Barzilay 2008; Chen et al. 2009; Purver 2011]. ...
... Topic structure and automated topic segmentation aim to break a discourse into a linear sequence of topics, such as the geography of a country, followed by its history, its demographics, its economy, its legal structures, etc. Segmentation is usually done on a sentence-by-sentence basis, with segments not assumed to overlap. Methods for topic segmentation employ semantic, lexical and referential similarity or, more recently, language models (Bestgen, 2006; Chen et al., 2009; Choi et al., 2001; Eisenstein and Barzilay, 2008; Galley et al., 2003; Hearst, 1997; Malioutov and Barzilay, 2006; Purver et al., 2006; Purver, 2011). ...
Conference Paper
Full-text available
The discourse properties of text have long been recognized as critical to language technology, and over the past 40 years, our understanding of and ability to exploit the discourse properties of text has grown in many ways. This essay briefly recounts these developments, the technology they employ, the applications they support, and the new challenges that each subsequent development has raised. We conclude with the challenges faced by our current understanding of discourse, and the applications that meeting these challenges will promote.
... Joty et al. (2011) extended this work by enriching the emission distributions and using additional features such as speaker and position information. An approach to unsupervised discourse modeling that does not use HMMs is the latent permutation model of Chen et al. (2009). This model assumes each segment (e.g. ...
Conference Paper
Recent work has explored the use of hidden Markov models for unsupervised discourse and conversation modeling, where each segment or block of text such as a message in a conversation is associated with a hidden state in a sequence. We extend this approach to allow each block of text to be a mixture of multiple classes. Under our model, the probability of a class in a text block is a log-linear function of the classes in the previous block. We show that this model performs well at predictive tasks on two conversation data sets, improving thread reconstruction accuracy by up to 15 percentage points over a standard HMM. Additionally, we show quantitatively that the induced word clusters correspond to speech acts more closely than baseline models.
... Every culture has stories, and storytelling is one of the key functions of human language. Yet while we have robust, flexible models for the structure of informative documents (for instance Chen et al., 2009; Abu Jbara and Radev, 2011), current approaches have difficulty representing the narrative structure of fictional stories. This causes problems for any task requiring us to model fiction, including summarization and generation of stories; Kazantseva and Szpakowicz (2010) show that state-of-the-art summarizers perform extremely poorly on short fictional texts. ...
Conference Paper
Better representations of plot structure could greatly improve computational methods for summarizing and generating stories. Current representations lack abstraction, focusing too closely on events. We present a kernel for comparing novelistic plots at a higher level, in terms of the cast of characters they depict and the social relationships between them. Our kernel compares the characters of different novels to one another by measuring their frequency of occurrence over time and the descriptive and emotional language associated with them. Given a corpus of 19th-century novels as training data, our method can accurately distinguish held-out novels in their original form from artificially disordered or reversed surrogates, demonstrating its ability to robustly represent important aspects of plot structure.
... Baseline: We use two baselines for this task, both using a clustering algorithm weighted by TF*IDF as implemented by the publicly available CLUTO package (Karypis, 2002), using agglomerative clustering with the cosine similarity distance metric (Chen, Branavan, Barzilay, & Karger, 2009; Chen, Benson, Naseem, & Barzilay, 2011). ...
Article
We present a model for aggregation of product review snippets by joint aspect identification and sentiment analysis. Our model simultaneously identifies an underlying set of ratable aspects presented in the reviews of a product (e.g., sushi and miso for a Japanese restaurant) and determines the corresponding sentiment of each aspect. This approach directly enables discovery of highly-rated or inconsistent aspects of a product. Our generative model admits an efficient variational mean-field inference algorithm. It is also easily extensible, and we describe several modifications and their effects on model structure and inference. We test our model on two tasks, joint aspect identification and sentiment analysis on a set of Yelp reviews and aspect identification alone on a set of medical summaries. We evaluate the performance of the model on aspect identification, sentiment analysis, and per-word labeling accuracy. We demonstrate that our model outperforms applicable baselines by a considerable margin, yielding up to 32% relative error reduction on aspect identification and up to 20% relative error reduction on sentiment analysis.
... More recent improvements to this approach include using different lexical similarity metrics like LSA (Choi et al. 2001;Olney and Cai 2005) and improving feature extraction for supervised methods (Hsueh et al. 2006). It also inspires unsupervised models using bags of words (Purver et al. 2006), language models (Eisenstein and Barzilay 2008), and shared structure across documents (Chen et al. 2009). ...
Article
Full-text available
Identifying influential speakers in multi-party conversations has been the focus of research in communication, sociology, and psychology for decades. It has been long acknowledged qualitatively that controlling the topic of a conversation is a sign of influence. To capture who introduces new topics in conversations, we introduce SITS—Speaker Identity for Topic Segmentation—a nonparametric hierarchical Bayesian model that is capable of discovering (1) the topics used in a set of conversations, (2) how these topics are shared across conversations, (3) when these topics change during conversations, and (4) a speaker-specific measure of “topic control”. We validate the model via evaluations using multiple datasets, including work meetings, online discussions, and political debates. Experimental results confirm the effectiveness of SITS in both intrinsic and extrinsic evaluations.
Article
Understanding text structure, which enables automated systems to parse the structure of long texts, is crucial for various natural language processing applications such as information extraction, summarization, and question answering. Although previous methods have advanced text structure parsing effectively, they face challenges such as not leveraging the abundance of unlabelled data and focusing mainly on content-inferred information. To address this deficiency, this paper introduces a novel Text Structure Language Model (TSLM), an LM pre-training framework that employs ubiquitous HTML documents and considers the text structure among text units. HTML documents are composed by experts, and their hierarchies can reflect the structure of documents. Our learning framework is designed to equip the LM with awareness of two complementary kinds of structure from HTML documents. It encourages the model to learn local structure, which helps in understanding the immediate connection between two units, by reconstructing the structure of the DOM tree, and global structure, which shapes the overall organization and thematic development, by predicting the optimal content-fitting tree. Extensive experiments with structure-related downstream tasks, including text segmentation and table of contents generation, validate the effectiveness of TSLM.
Article
Full-text available
Legislative houses all over the world are adopting tools based on artificial intelligence to support their work. The incorporation of these tools can improve the analysis of the text of the proposed new laws and speed the preparation and discussion of new laws. The performance of artificial intelligence tools for text processing tasks is largely affected by the corpora used, which should ideally be adapted for the specific domain. When dealing with legislative corpora, text segmentation is often necessary due to the distinct purposes of legislative segments within the overall bill structure. While rule-based approaches can be effective in cases where the data follows a consistent format, they fail when inconsistencies arise in the formatting of legislative bills. In this study, we extensively investigate the use of weak supervision and active learning to accurately segment over 100,000 Brazilian federal legislative bills using a sequence tagging approach. The experiments demonstrated that both BERT and LSTM models achieved high statistical performance without the limitations of rule-based systems. In segmenting long documents beyond the limited context window of BERT, we find that simple moving windows suffice because the required context for accurate legislative segmentation is mostly local. We also conducted an analysis of transfer learning from our monolingual models to French, Italian, German, and English (US) legislative texts. According to our experimental results our models present non-trivial zero-shot and effective out-of-distribution fine-tuning performance, suggesting potential avenues for multilingual legislative segmentation without the need for computationally expensive models. The models, data, and code are publicly available at https://github.com/ulysses-camara/ulysses-segmenter.
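The "simple moving windows" finding described above can be made concrete with a sketch that splits a long token sequence into overlapping windows that each fit an encoder's context limit, so per-token segmentation labels can be predicted window by window and merged afterwards. Window and stride sizes below are illustrative, not those used by the authors.

```python
def sliding_windows(tokens, window_size=512, stride=384):
    # Overlapping windows so every token is predicted with some local context;
    # overlapping predictions can later be merged (e.g. keep the more central one).
    windows = []
    start = 0
    while start < len(tokens):
        windows.append((start, tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows

tokens = [f"tok{i}" for i in range(1200)]
for offset, chunk in sliding_windows(tokens):
    print(offset, len(chunk))   # -> 0/512, 384/512, 768/432
```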
Chapter
Topic-aware text segmentation (TATS) involves dividing text into cohesive segments and assigning a corresponding topic label to each segment. The TATS of a document has become increasingly significant for business researchers to obtain comprehensive insights into the behavior of enterprises. However, current models either cannot balance accuracy and generalization or are unable to handle the topic nesting problem, leading to low efficiency in practical applications. This paper proposes a novel Span-based approach for Topic-aware Text Segmentation called STTS, which consists of two components: a sliding window encoder and a span-based NER module. First, we utilize the sliding window encoder to transform the input document into text spans, which are then represented as embeddings using pre-trained language models. Second, we obtain the coherent segments and assign a topic label to each segment based on the span-based NER method called Global Pointer. Experiments on four real-world business datasets demonstrate that STTS achieves state-of-the-art performance on the flat and nested TATS tasks. Consequently, our model provides an effective solution to TATS tasks with lengthy texts and nested topics, which indicates that our solution is highly suitable for large-scale text processing in practice.
Chapter
Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current work on topic segmentation often focuses on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data; (b) training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed topic segmentation approach by experimenting with multiple loss functions, in order to mitigate the effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that the Focal Loss function is a robust alternative to the Cross-Entropy and re-weighted Cross-Entropy loss functions when segmenting unstructured and semi-structured chats.
Preprint
Full-text available
Recent neural supervised topic segmentation models achieve clearly superior effectiveness over unsupervised methods, owing to the availability of large-scale training corpora sampled from Wikipedia. These models may, however, suffer from limited robustness and transferability caused by exploiting simple linguistic cues for prediction while overlooking more important inter-sentential topical consistency. To address this issue, we present a discourse-aware neural topic segmentation model with the injection of above-sentence discourse dependency structures to encourage the model to make topic boundary predictions based more on the topical consistency between sentences. Our empirical study on English evaluation datasets shows that injecting above-sentence discourse structures into a neural topic segmenter with our proposed strategy can substantially improve its performance on intra-domain and out-of-domain data, with little increase in the model's complexity.
Chapter
Segmenting documents or conversation threads into semantically coherent segments has been one of the challenging tasks in natural language processing. In this work, we introduce three new text segmentation models that employ BERT for post-training. Extensive experiments are conducted based on benchmark datasets to demonstrate that our BERT-based models show significant improvements over the state-of-the-art text segmentation algorithms.
Preprint
Topic segmentation is critical in key NLP tasks, and recent works favor highly effective neural supervised approaches. However, current neural solutions are arguably limited in how they model context. In this paper, we enhance a segmenter based on a hierarchical attention BiLSTM network to better model context, by adding a coherence-related auxiliary task and restricted self-attention. Our optimized segmenter outperforms SOTA approaches when trained and tested on three datasets. We also demonstrate the robustness of our proposed model in a domain transfer setting by training a model on a large-scale dataset and testing it on four challenging real-world benchmarks. Furthermore, we apply our proposed strategy to two other languages (German and Chinese), and show its effectiveness in multilingual scenarios.
Article
Text segmentation is a fundamental step in natural language processing (NLP) and information retrieval (IR) tasks. Most existing approaches do not explicitly take into account the facet information of documents for segmentation. Text segmentation and facet annotation are often addressed as separate problems, but they operate in a common input space. This article proposes FTS, which is a novel model for faceted text segmentation via multitask learning (MTL). FTS models faceted text segmentation as an MTL problem with text segmentation and facet annotation. This model employs the bidirectional long short-term memory (Bi-LSTM) network to learn the feature representation of sentences within a document. The feature representation is shared and adjusted with common parameters by MTL, which can help an optimization model learn a better shared and robust feature representation from text segmentation to facet annotation. Moreover, text segmentation is modeled as a sequence tagging task using an LSTM with a conditional random field (CRF) classification layer. Extensive experiments are conducted on five data sets from five domains: data structure, data mining, computer network, solid mechanics, and crystallography. The results indicate that the FTS model outperforms several highly cited and state-of-the-art approaches related to text segmentation and facet annotation.
Article
Building machines that can understand text like humans is an AI-complete problem. A great deal of research has already gone into this, with astounding results, allowing everyday people to converse with their telephones, or have their reading materials analysed and classified by computers. A prerequisite for processing text semantics, common to the above examples, is having some computational representation of text as an abstract object. Operations on this representation practically correspond to making semantic inferences, and by extension simulating understanding of text. The complexity and granularity of semantic processing that can be realised is constrained by the mathematical and computational robustness, expressiveness, and rigour of the tools used. This dissertation contributes a series of such tools, diverse in their mathematical formulation, but common in their application to model semantic inferences when machines process text. These tools are principally expressed in nine distinct models that capture aspects of semantic dependence in highly interpretable and non-complex ways. This dissertation further reflects on present and future problems with the current research paradigm in this area, and makes recommendations on how to overcome them. The amalgamation of the body of work presented in this dissertation advances the complexity and granularity of semantic inferences that can be made automatically by machines.
Conference Paper
Nowadays, the explosion of Web information has led to a boom in massive web documents such as news webpages and online literature. The latent topics behind these documents spread by self-evolution and mutual transition. Understanding how topics in documents evolve and transit is an important and challenging problem. Topic models are a powerful set of tools for modeling document generation and finding underlying topics, but they usually operate at the unigram level, making it difficult to model the relationship between terms and their underlying topics. In this paper, we propose a pairwise topic modeling method that incorporates pairwise relationships into topic modeling. We discover latent topics as well as topic transitions at the same time in a natural way. We show that the pairwise topic model can facilitate the discovery of individual topics as well as topic evolution. The results indicate that our proposed method leads to a significant performance improvement over traditional topic modeling methods, such as Latent Dirichlet Allocation (LDA), in terms of language perplexity. Besides, we conduct a series of empirical studies to show the topic words and topic transitions discovered. From the case studies, we show that with the help of PTM methods, people are able to explicitly understand how topics evolve and transition into one another.
Conference Paper
We propose the use of a genetic algorithm to solve the rank aggregation problem, which consists in, given a dataset of rankings (or permutations) of n objects, finding the ranking that best represents the dataset. Although different probabilistic models have been proposed to tackle this problem (see e.g. [12]), the so-called Mallows model is the one that has received the most attention [1]. Exact computation of the parameters of this model is an NP-hard problem [19], which justifies the use of metaheuristic algorithms for its resolution. In particular, we propose a genetic algorithm for solving this problem and show that, in most cases (especially the most complex ones), we obtain statistically significantly better results than those of state-of-the-art algorithms.
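As a toy illustration of this setup, the sketch below uses Kendall's tau distance to a set of input rankings as the fitness function and a very small genetic algorithm (order crossover plus swap mutation) to search for a consensus permutation; the operators and parameter values are assumptions for illustration, not the configuration from the paper.

# Toy rank aggregation: Kendall tau distance as fitness, plus a tiny genetic
# algorithm over permutations. Parameters and operators are illustrative assumptions.
import random
from itertools import combinations

def kendall_distance(r1, r2):
    """Number of item pairs ordered differently in the two rankings."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(1 for a, b in combinations(r1, 2)
               if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b]))

def fitness(candidate, rankings):
    return -sum(kendall_distance(candidate, r) for r in rankings)

def order_crossover(p1, p2):
    """Keep a slice of p1, fill the rest with the remaining items in p2's order."""
    n = len(p1)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j] = p1[i:j]
    rest = [x for x in p2 if x not in child[i:j]]
    child[:i] = rest[:i]
    child[j:] = rest[i:]
    return child

def genetic_consensus(rankings, pop_size=50, generations=200, p_mut=0.3):
    n = len(rankings[0])
    pop = [random.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, rankings), reverse=True)
        survivors = pop[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            child = order_crossover(*random.sample(survivors, 2))
            if random.random() < p_mut:                    # swap mutation
                a, b = random.sample(range(n), 2)
                child[a], child[b] = child[b], child[a]
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda c: fitness(c, rankings))

rankings = [[0, 1, 2, 3, 4], [0, 2, 1, 3, 4], [1, 0, 2, 4, 3]]
print(genetic_consensus(rankings))    # a consensus permutation close to all inputs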
Article
This paper introduces a new approach for large-scale unsupervised segmentation of bibliographic elements. The problem is to segment a citation, given as an untagged word token sequence, into subsequences so that each subsequence corresponds to a different bibliographic element, e.g., authors, paper title, journal name, publication year, etc. The same bibliographic element should be referred to by contiguous word tokens; this constraint is called the contiguity constraint. The author meets this constraint by using generalized Mallows models, effectively applied to document structure learning by Chen, Branavan, Barzilay, and Karger (2009). However, the method works for this problem only after modification, so the author proposes strategies to make it applicable.
Article
Topic modeling is a powerful tool for modeling documents and finding their underlying topics. However, the unstructured nature of raw text makes it hard to model the semantic relationship between text units, which may be words, phrases, or sentences, and thus even harder to model their corresponding underlying topics. In our work, we examine the pairwise relationship of the underlying topics through relation extraction. We first extract the entity pairs within a relation tuple from the raw text. Then, we model the relationship between the entity pairs by adding dependencies between entities and their corresponding topics. We propose six different versions of the Pairwise Topic Model (PTM) to simultaneously discover the latent topics and their pairwise relationships. Experiments on four datasets (AP news articles, DUC 2004 task 2, Clinical Notes, and Neuroscience Papers) show that the PTM models are better-structured language models than the traditional topic model Latent Dirichlet Allocation (LDA). Empirical results also show that the proposed Pairwise Topic Models (PTMs) can explicitly explain how two topics are related.
Conference Paper
The discourse properties of text have long been recognized as critical to language technology, and over the past 40 years, our understanding of and ability to exploit the discourse properties of text have grown in many ways. This essay briefly recounts these developments, the technology they employ, the applications they support, and the new challenges that each subsequent development has raised. We conclude with the challenges faced by our current understanding of discourse, and the applications that meeting these challenges will promote.
Article
In this paper we propose a novel algorithm for opinion summarization that takes into account informativeness and readability simultaneously. We consider a summary as a sequence of sentences and directly acquire the optimum sequence from multiple review documents by extracting and ordering the sentences. We achieve this with a novel Integer Linear Programming (ILP) formulation. Our proposed formulation is a powerful mixture of the Maximum Coverage Problem and the Traveling Salesman Problem, and is widely applicable to text generation and summarization tasks. We score each candidate sequence according to its informativeness and readability. Since our research goal is to summarize reviews, the informativeness score is measured in terms of opinion information; the readability score is trained on the review document corpus. We evaluate our method using reviews of restaurants and commodities. Our method outperforms existing opinion summarizers as indicated by its ROUGE score. We also show through a human readability evaluation that our proposed method improves the readability of summaries.
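The coverage side of such an ILP can be sketched as follows, here with the PuLP library (an assumption; any ILP solver would do); the readability and ordering (TSP) components and the real informativeness scores are omitted, so this is only a reduced illustration of the formulation.

# Reduced sketch of the maximum-coverage part of an ILP summarizer using PuLP.
# Sentences, "opinion units", weights, and the budget are toy assumptions.
import pulp

sentences = ["the soup was great", "great soup but slow service", "service was slow"]
concepts = {0: {"soup", "great"}, 1: {"soup", "great", "slow", "service"},
            2: {"slow", "service"}}                 # opinion units contained in each sentence
weights = {"soup": 2.0, "great": 1.5, "slow": 1.0, "service": 1.0}
budget = 2                                          # maximum number of sentences in the summary

prob = pulp.LpProblem("opinion_summary", pulp.LpMaximize)
x = {j: pulp.LpVariable(f"x_{j}", cat="Binary") for j in concepts}    # sentence j is selected
y = {c: pulp.LpVariable(f"y_{c}", cat="Binary") for c in weights}     # concept c is covered

prob += pulp.lpSum(weights[c] * y[c] for c in weights)                # informativeness objective
prob += pulp.lpSum(x.values()) <= budget
for c in weights:
    # a concept counts as covered only if some selected sentence contains it
    prob += y[c] <= pulp.lpSum(x[j] for j in concepts if c in concepts[j])

prob.solve()
print([sentences[j] for j in concepts if x[j].value() == 1])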
Article
Full-text available
We investigate some pitfalls regarding the discriminatory power of MT evaluation metrics and the accuracy of statistical significance tests. In a discriminative reranking experiment for phrase-based SMT we show that the NIST metric is more sensitive than BLEU or F-score despite their incorporation of aspects of fluency or meaning adequacy into MT evaluation. In an experimental comparison of two statistical significance tests we show that p-values are estimated more conservatively by approximate randomization than by bootstrap tests, thus increasing the likelihood of type-I error for the latter. We point out a pitfall of randomly assessing significance in multiple pairwise comparisons, and conclude with a recommendation to combine NIST with approximate randomization, at more stringent rejection levels than is currently standard.
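A small sketch of the approximate randomization test discussed above, applied to paired per-example scores from two systems; the score values and number of shuffles are illustrative assumptions.

# Approximate randomization test for paired system scores.
# Scores and trial count are illustrative assumptions.
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for the difference in mean score between systems A and B."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        swapped_a, swapped_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:            # randomly swap the paired outputs
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        diff = abs(sum(swapped_a) / len(swapped_a) - sum(swapped_b) / len(swapped_b))
        if diff >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)

p = approximate_randomization([0.61, 0.58, 0.64, 0.70], [0.55, 0.57, 0.60, 0.66])
print(p)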
Article
Full-text available
We use a reliably annotated corpus to compare metrics of coherence based on Centering Theory with respect to their potential usefulness for text structuring in natural language generation. Previous corpus-based evaluations of the coherence of text according to Centering did not compare the coherence of the chosen text structure with that of the possible alternatives. A corpus-based methodology is presented which distinguishes between Centering-based metrics by taking these alternatives into account, and represents therefore a more appropriate way to evaluate Centering from a text structuring perspective.
Conference Paper
Full-text available
In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample, where K is the number of topics in the model. Our proposed method draws equivalent samples but requires on average significantly fewer than K operations per sample. On real-world corpora FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents, with a required computation time of 6 CPU months for LDA, our speedup of 5.7 can save 5 CPU months of computation.
Conference Paper
Full-text available
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
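For readers who want to try LDA directly, a minimal usage sketch with the gensim library follows; the library choice and the toy corpus are assumptions on my part and are not part of the original paper, which describes the model and its variational inference rather than a toolkit.

# Minimal LDA usage sketch with the gensim library (an assumed toolkit choice).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["topic", "model", "text", "corpus"],
        ["permutation", "ranking", "distance", "model"],
        ["text", "segment", "topic", "boundary"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]           # bag-of-words counts
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

print(lda.print_topics())                     # word distribution of each topic
print(lda.get_document_topics(corpus[0]))     # topic mixture of the first document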
Conference Paper
Full-text available
Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively.
Conference Paper
Full-text available
We present a method for unsupervised topic modelling which adapts methods used in document classification (Blei et al., 2003; Griffiths and Steyvers, 2004) to unsegmented multi-party discourse transcripts. We show how Bayesian inference in this generative model can be used to simultaneously address the problems of topic segmentation and topic identification: automatically segmenting multi-party meetings into topically coherent segments with performance which compares well with previous unsupervised segmentation-only methods (Galley et al., 2003) while simultaneously extracting topics which rate highly when assessed for coherence by human judges. We also show that this method appears robust in the face of off-topic dialogue and speech recognition errors.
Conference Paper
Full-text available
We consider the task of unsupervised lecture segmentation. We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion. Our approach moves beyond localized comparisons and takes into account long-range cohesion dependencies. Our results demonstrate that global analysis improves the segmentation accuracy and is robust in the presence of speech recognition errors.
Article
Full-text available
Algorithms such as Latent Dirichlet Allocation (LDA) have achieved significant progress in modeling word-document relationships. These algorithms assume each word in the document was generated by a hidden topic and explicitly model the word distribution of each topic as well as the prior distribution over topics in the document. Given these parameters, the topics of all words in the same document are assumed to be independent. In this paper, we propose modeling the topics of words in the document as a Markov chain. Specifically, we assume that all words in the same sentence have the same topic, and successive sentences are more likely to have the same topics. Since the topics are hidden, this leads to using the well-known tools of Hidden Markov Models for learning and inference. We show that incorporating this dependency allows us to learn better topics and to disambiguate words that can belong to different topics. Quantitatively, we show that we obtain better perplexity in modeling documents with only a modest increase in learning and inference complexity.
Article
Full-text available
A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
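The collapsed Gibbs sampler at the heart of this approach can be written compactly; the sketch below uses a toy corpus and hyperparameters and makes no attempt at the efficiency tricks of real implementations.

# Compact collapsed Gibbs sampler for LDA. Corpus, hyperparameters, and the
# iteration count are toy assumptions.
import random
random.seed(0)

docs = [[0, 1, 2, 1], [2, 3, 4, 3], [0, 1, 4, 2]]    # word ids per document
V, K, alpha, beta, iters = 5, 2, 0.1, 0.01, 200

z = [[random.randrange(K) for _ in doc] for doc in docs]       # topic of each token
ndk = [[0] * K for _ in docs]                                   # document-topic counts
nkw = [[0] * V for _ in range(K)]                               # topic-word counts
nk = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

for _ in range(iters):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                                         # remove current assignment
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                       for t in range(K)]                       # collapsed conditional
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

print(nkw)    # topic-word counts define the learned topics (after smoothing with beta)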
Article
Full-text available
The Pk evaluation metric, initially proposed by Beeferman, Berger, and Lafferty (1997), is becoming the standard measure for assessing text segmentation algorithms. However, a theoretical analysis of the metric finds several problems: the metric penalizes false negatives more heavily than false positives, overpenalizes near misses, and is affected by variation in segment size distribution. We propose a simple modification to the Pk metric that remedies these problems. This new metric, called WindowDiff, moves a fixed-sized window across the text and penalizes the algorithm whenever the number of boundaries within the window does not match the true number of boundaries for that window of text.
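Both metrics are straightforward to implement. The sketch below encodes a segmentation as 0/1 boundary indicators between adjacent sentences and uses the usual convention of setting the window size to half the mean reference segment length; the example segmentations are toy assumptions.

# Pk and WindowDiff over 0/1 boundary indicators between adjacent sentences.
def pk(ref, hyp, k=None):
    n = len(ref) + 1                                # number of sentences
    if k is None:
        k = max(1, round(n / (sum(ref) + 1) / 2))   # half the mean reference segment length
    errors = 0
    for i in range(n - k):
        same_ref = sum(ref[i:i + k]) == 0           # are sentences i and i+k in the same reference segment?
        same_hyp = sum(hyp[i:i + k]) == 0
        errors += same_ref != same_hyp
    return errors / (n - k)

def windowdiff(ref, hyp, k=None):
    n = len(ref) + 1
    if k is None:
        k = max(1, round(n / (sum(ref) + 1) / 2))
    errors = sum(sum(ref[i:i + k]) != sum(hyp[i:i + k]) for i in range(n - k))
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 1, 0, 0]    # boundaries after the third and sixth sentences
hyp = [0, 1, 0, 0, 0, 1, 0, 0]
print(pk(ref, hyp), windowdiff(ref, hyp))

The difference between the two is visible in the code: Pk only asks whether the two positions fall in the same segment, while WindowDiff compares the exact boundary counts inside the window, which is what softens the near-miss penalty.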
Article
Full-text available
The problem of organizing information for multidocument summarization so that the generated summary is coherent has received relatively little attention. While sentence ordering for single document summarization can be determined from the ordering of sentences in the input article, this is not the case for multidocument summarization where summary sentences may be drawn from different input articles. In this paper, we propose a methodology for studying the properties of ordering information in the news genre and describe experiments done on a corpus of multiple acceptable orderings we developed for the task. Based on these experiments, we implemented a strategy for ordering information that combines constraints from chronological order of events and topical relatedness. Evaluation of our augmented algorithm shows a significant improvement of the ordering over two baseline strategies.
Article
Markov chain sampling methods that automatically adapt to characteristics of the distribution being sampled can be constructed by exploiting the principle that one can sample from a distribution by sampling uniformly from the region under the plot of its density function. A Markov chain that converges to this uniform distribution can be constructed by alternating uniform sampling in the vertical direction with uniform sampling from the horizontal 'slice' defined by the current vertical position, or more generally, with some update that leaves the uniform distribution over this slice invariant. Variations on such 'slice sampling' methods are easily implemented for univariate distributions, and can be used to sample from a multivariate distribution by updating each variable in turn. This approach is often easier to implement than Gibbs sampling, and more efficient than simple Metropolis updates, due to the ability of slice sampling to adaptively choose the magnitude of changes made. It is therefore attractive for routine and automated use. Slice sampling methods that update all variables simultaneously are also possible. These methods can adaptively choose the magnitudes of changes made to each variable, based on the local properties of the density function. More ambitiously, such methods could potentially allow the sampling to adapt to dependencies between variables by constructing local quadratic approximations. Another approach is to improve sampling efficiency by suppressing random walks. This can be done using 'overrelaxed' versions of univariate slice sampling procedures, or by using 'reflective' multivariate slice sampling methods, which bounce off the edges of the slice.
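The univariate stepping-out and shrinkage procedure described above is short enough to sketch directly; the target density (an unnormalized standard normal) and the step width are toy assumptions.

# Univariate slice sampler with stepping out and shrinkage.
# The target density and step width are toy assumptions.
import random
random.seed(0)

def log_density(x):
    return -0.5 * x * x                 # unnormalized log N(0, 1)

def slice_sample_step(x, w=1.0, max_steps=100):
    log_y = log_density(x) - random.expovariate(1.0)    # vertical level under the curve
    left = x - w * random.random()                       # randomly position the initial interval
    right = left + w
    for _ in range(max_steps):                           # step out until both ends fall below the level
        if log_density(left) < log_y:
            break
        left -= w
    for _ in range(max_steps):
        if log_density(right) < log_y:
            break
        right += w
    while True:                                          # shrink until a point is accepted
        x_new = random.uniform(left, right)
        if log_density(x_new) >= log_y:
            return x_new
        if x_new < x:
            left = x_new
        else:
            right = x_new

samples, x = [], 0.0
for _ in range(5000):
    x = slice_sample_step(x)
    samples.append(x)
print(sum(samples) / len(samples))      # should be near 0 for this toy target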
Article
A class of ranking models is proposed for which the probability of a ranking decreases with increasing distance from a modal ranking. Some special distances, namely those associated with Kendall and Cayley, decompose into a sum of independent components under the uniform distribution. These distances lead to multiparameter generalizations whose parameters may be interpreted as information at various stages in a ranking process. Estimation of model parameters is described, and the results are applied to an example of word associations. A censoring argument motivates simple extensions of these models to include partial rankings. The generalized Cayley distance model is illustrated for random arrangements arising from mechanisms other than ranking.
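A small worked example of the decomposition described above: with the identity as the modal ranking, the Kendall distance factors into per-position inversion counts, and a generalized Mallows model scores each count with its own dispersion parameter. The dispersion values below are illustrative assumptions.

# Inversion-count decomposition of the Kendall distance and the resulting
# generalized Mallows log-probability (identity as the modal ranking).
import math

def inversion_counts(perm):
    """v[j] = number of items larger than j that appear before j in perm."""
    pos = {item: i for i, item in enumerate(perm)}
    n = len(perm)
    return [sum(1 for k in range(j + 1, n) if pos[k] < pos[j]) for j in range(n - 1)]

def gmm_log_prob(perm, rho):
    """log P(perm) under a generalized Mallows model with per-position dispersions rho."""
    v = inversion_counts(perm)
    n = len(perm)
    log_p = 0.0
    for j, (vj, rj) in enumerate(zip(v, rho)):
        # v_j ranges over 0..n-1-j, so the normalizer is a finite geometric sum
        log_z = math.log(sum(math.exp(-rj * r) for r in range(n - j)))
        log_p += -rj * vj - log_z
    return log_p

perm = [2, 0, 1, 3]                       # items listed in ranked order
print(inversion_counts(perm))             # [1, 1, 0] -> Kendall distance 2 from the identity
print(gmm_log_prob(perm, rho=[1.0, 0.5, 0.5]))

Summing the inversion counts with equal dispersions recovers the ordinary Mallows model; letting each position have its own parameter is exactly the multiparameter generalization the abstract describes.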
Article
This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
Conference Paper
Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.
Conference Paper
A new approach to ensemble learning is introduced that takes ranking rather than classification as fundamental, leading to models on the symmetric group and its cosets. The approach uses a generalization of the Mallows model on permutations to combine multiple input rankings. Applications include the task of combining the output of multiple search engines and multiclass or multilabel classification, where a set of input classifiers is viewed as generating a ranking of class labels.
Conference Paper
This paper describes a novel Bayesian approach to unsupervised topic segmentation. Unsupervised systems for this task are driven by lexical cohesion: the tendency of well-formed segments to induce a compact and consistent lexical distribution. We show that lexical cohesion can be placed in a Bayesian context by modeling the words in each topic segment as draws from a multinomial language model associated with the segment; maximizing the observation likelihood in such a model yields a lexically-cohesive segmentation. This contrasts with previous approaches, which relied on hand-crafted cohesion metrics. The Bayesian framework provides a principled way to incorporate additional features such as cue phrases, a powerful indicator of discourse structure that has not been previously used in unsupervised segmentation systems. Our model yields consistent improvements over an array of state-of-the-art systems on both text and speech datasets. We also show that both an entropy-based analysis and a well-known previous technique can be derived as special cases of the Bayesian framework.
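The lexical-cohesion objective can be sketched as the log marginal likelihood of a segment's word counts under a symmetric Dirichlet-multinomial language model; the vocabulary size and prior below are illustrative assumptions, and a dynamic program over such scores (sketched after the maximum-probability segmentation abstract further down) recovers the best segmentation.

# Dirichlet-multinomial cohesion score for one candidate segment.
# Vocabulary size and the prior are illustrative assumptions.
from collections import Counter
from math import lgamma

def segment_log_likelihood(words, vocab_size, alpha=0.1):
    """Log marginal likelihood of the segment's words under a symmetric Dirichlet prior."""
    counts = Counter(words)
    n = len(words)
    score = lgamma(vocab_size * alpha) - lgamma(n + vocab_size * alpha)
    score += sum(lgamma(c + alpha) - lgamma(alpha) for c in counts.values())
    return score

# A lexically repetitive segment scores higher than a mixed one of equal length.
print(segment_log_likelihood(["topic", "topic", "model", "topic"], vocab_size=1000))
print(segment_log_likelihood(["topic", "ocean", "piano", "yeast"], vocab_size=1000))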
Conference Paper
The results of the MUC-6 evaluation must be analyzed to determine whether close scores significantly distinguish systems or whether the differences in those scores are a matter of chance. In order to do such an analysis, a method of computer-intensive hypothesis testing was developed by SAIC for the MUC-3 results and has been used for distinguishing MUC scores since that time. The implementation of this method for the MUC evaluations was first described in [1] and later the concepts behind the statistical model were explained in a more understandable manner in [2]. This paper gives the results of the statistical testing for the three MUC-6 tasks where a single metric could be associated with a system's performance.
Conference Paper
NOTE TO READERS: We have recently detected a software bug which affects the results of our standalone entity grid experiments. (The bug was in our syntactic analysis code, which incorrectly failed to label the second object of a conjoint VP; in the phrase "wash the dishes and clean the sink", 'dishes' would be correctly labeled as O but 'sink' mislabeled as X.) This bug happened to have an unfortunate interaction with the "This is preliminary information" preamble mentioned in section 5. The results in table 2 above the line are incorrect; our relaxed entity grid does not outperform the naive grid on the discriminative test. This implies that our argument motivating the relaxed model at the end of section 2 is misguided. The design and performance of the joint model is unaffected. We present a model for discourse coherence which combines the local entity-based approach of (Barzilay and Lapata, 2005) and the HMM-based content model of (Barzilay and Lee, 2004). Unlike the mixture model of (Soricut and Marcu, 2006), we learn local and global features jointly, providing a better theoretical explanation of how they are useful. As the local component of our model we adapt (Barzilay and Lapata, 2005) by relaxing independence assumptions so that it is effective when estimated generatively. Our model performs the ordering task competitively with (Soricut and Marcu, 2006), and significantly better than either of the models it is based on.
Article
Information ordering is a critical task for natural language generation applications.
Article
We present a domain-independent topic segmentation algorithm for multi-party speech. Our feature-based algorithm combines knowledge about content using a text-based algorithm as a feature and about form using linguistic and acoustic cues about topic shifts extracted from speech. This segmentation algorithm uses automatically induced decision rules to combine the different features. The embedded text-based algorithm builds on lexical cohesion and has performance comparable to state-of-the-art algorithms based on lexical information. A significant error reduction is obtained by combining the two knowledge sources.
Article
We consider the problem of modeling the content structure of texts within a specific domain, in terms of the topics the texts address and the order in which these topics appear. We first present an effective knowledge-lean method for learning content models from unannotated documents, utilizing a novel adaptation of algorithms for Hidden Markov Models. We then apply our method to two complementary tasks: information ordering and extractive summarization. Our experiments show that incorporating content models in these applications yields substantial improvement over previously-proposed methods. Publication info: Proceedings of HLT-NAACL 2004 (to appear)
Article
We propose a statistical method that finds the maximum-probability segmentation of a given text. This method does not require training data because it estimates probabilities from the given text. Therefore, it can be applied to any text in any domain. An experiment showed that the method is more accurate than or at least as accurate as a state-of-the-art text segmentation system.
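The maximum-probability segmentation can be recovered with a simple dynamic program over prefix scores, shown below with a toy repetition-based cohesion score; the per-segment penalty and the scorer are illustrative assumptions, and a probabilistic scorer such as the Dirichlet-multinomial one sketched earlier could be plugged in instead.

# Dynamic program for maximum-score text segmentation.
# The toy scorer and the per-segment penalty are illustrative assumptions.
from collections import Counter

def repetition_score(words):
    """Toy cohesion score: rewards word repetition within a segment."""
    return sum(c * c for c in Counter(words).values()) / len(words)

def best_segmentation(sentences, segment_score, penalty=1.0):
    n = len(sentences)
    best = [0.0] + [float("-inf")] * n        # best[i] = best score over the first i sentences
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):                    # last segment spans sentences j..i-1
            words = [w for s in sentences[j:i] for w in s]
            score = best[j] + segment_score(words) - penalty
            if score > best[i]:
                best[i], back[i] = score, j
    segments, i = [], n
    while i > 0:                              # follow back-pointers to recover the segments
        segments.append((back[i], i))
        i = back[i]
    return list(reversed(segments)), best[n]

sentences = [["topic", "model"], ["topic", "inference"], ["soccer", "goal"], ["soccer", "match"]]
print(best_segmentation(sentences, repetition_score))   # expected: ([(0, 2), (2, 4)], 1.0)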
Article
In this paper we present a novel framework for extracting the ratable aspects of objects from online user reviews. Extracting such aspects is an important challenge in automatically mining product opinions from the web and in generating opinion-based summaries of user reviews. Our models are based on extensions to standard topic modeling methods such as LDA and PLSA to induce multi-grain topics. We argue that multi-grain models are more appropriate for our task since standard models tend to produce topics that correspond to global properties of objects (e.g., the brand of a product type) rather than the aspects of an object that tend to be rated by a user. The models we present not only extract ratable aspects, but also cluster them into coherent topics, e.g., 'waitress' and 'bartender' are part of the same topic 'staff' for restaurants. This differentiates it from much of the previous work which extracts aspects through term frequency analysis with minimal clustering. We evaluate the multi-grain models both qualitatively and quantitatively to show that they improve significantly upon standard topic models.