Article

Generating Survey Draft Based on Closeness of Position Distributions of Key Words

Article
Full-text available
Ranking is an important way of retrieving authoritative papers from a large scientific literature database. The current state of the art exploits the flat structure of the heterogeneous academic network to achieve a better ranking of scientific articles, but it ignores the multinomial nature of the multidimensional relationships between different types of academic entities. This paper proposes a novel mutual ranking algorithm based on the multinomial heterogeneous academic hypernetwork, which serves as a generalized model of a scientific literature database. The proposed algorithm is demonstrated to be effective through extensive evaluation against well-known IR metrics on a well-established benchmarking environment based on the ACL Anthology Network.
Article
Full-text available
This paper considers extractive summarisation in a comparative setting: given two or more document groups (e.g., separated by publication time), the goal is to select a small number of documents that are representative of each group, and also maximally distinguishable from other groups. We formulate a set of new objective functions for this problem that connect recent literature on document summarisation, interpretable machine learning, and data subset selection. In particular, by casting the problem as a binary classification amongst different groups, we derive objectives based on the notion of maximum mean discrepancy, as well as a simple yet effective gradient-based optimisation strategy. Our new formulation allows scalable evaluations of comparative summarisation as a classification task, both automatically and via crowd-sourcing. To this end, we evaluate comparative summarisation methods on a newly curated collection of controversial news topics over 13 months. We observe that gradient-based optimisation outperforms discrete and baseline approaches in 15 out of 24 different automatic evaluation settings. In crowd-sourced evaluations, summaries from gradient optimisation elicit 7% more accurate classification from human workers than discrete optimisation. Our result contrasts with recent literature on submodular data subset selection that favours discrete optimisation. We posit that our formulation of comparative summarisation will prove useful in a diverse range of use cases such as comparing content sources, authors, related topics, or distinct viewpoints.
Article
Full-text available
The Semantic Link Network is a semantics modeling method for effective information services. This paper proposes a new text summarization approach that extracts a Semantic Link Network from a scientific paper, consisting of language units of different granularities as nodes and semantic links between the nodes, and then ranks the nodes to select the Top-k sentences that compose the summary. A set of assumptions for reinforcing representative nodes is set to reflect the core of the paper. Then, Semantic Link Networks with different types of nodes and links are constructed with different combinations of the assumptions. Finally, an iterative ranking algorithm is designed for calculating the weight vectors of the nodes in a converged iteration process. The iteration approximately approaches a stable weight vector of sentence nodes, which is ranked to select the Top-k high-rank nodes for composing the summary. We designed six types of ranking models on Semantic Link Networks for evaluation. Both objective assessment and intuitive assessment show that ranking the Semantic Link Network of language units can significantly help identify the representative sentences. This work not only provides a new approach to summarizing text based on the extraction of semantic links from text but also verifies the effectiveness of adopting the Semantic Link Network in rendering the core of a text. The proposed approach can be applied to other summarization applications such as generating an extended abstract, a mind map, or the bullet points for making the slides of a given paper. It can be easily extended by incorporating more semantic links to improve text summarization and other information services.
Article
Full-text available
The related work section of a scientific paper introduces other researchers' relevant work and compares it with the current authors' work. Automatically generating the related work section of a paper being written gives researchers a tool to complete this section efficiently without missing related work. This paper proposes an approach to automatically generating a related work section by comparing the main text of the paper being written with the citations of other papers that cite the same references. Our approach first collects the papers that cite the reference papers of the paper being written and extracts the corresponding citation sentences to form a citation document. It then extracts keywords from the citation document and the paper being written and constructs a graph of the keywords. Once the keywords that discriminate the two documents are determined, the minimum Steiner tree that covers the discriminative keywords and the topic keywords is generated. The summary is generated by extracting the sentences covering the Steiner tree. According to ROUGE evaluations, the experiments show that the citations are suitable for related work generation and that our approach outperforms the three baseline methods MEAD, LexRank, and ReWoS. This work verifies the general summarization method based on connotation and extension through citation.
Article
Full-text available
The rapid growth of scientific literature has made it difficult for the researchers to quickly learn about the developments in their respective fields. Scientific summarization addresses this challenge by providing summaries of the important contributions of scientific papers. We present a framework for scientific summarization which takes advantage of the citations and the scientific discourse structure. Citation texts often lack the evidence and context to support the content of the cited paper and are even sometimes inaccurate. We first address the problem of inaccuracy of the citation texts by finding the relevant context from the cited paper. We propose three approaches for contextualizing citations which are based on query reformulation, word embeddings, and supervised learning. We then train a model to identify the discourse facets for each citation. We finally propose a method for summarizing scientific papers by leveraging the faceted citations and their corresponding contexts. We evaluate our proposed method on two scientific summarization datasets in the biomedical and computational linguistics domains. Extensive evaluation results show that our methods can improve over the state of the art by large margins.
Article
Full-text available
Abstractive summarization is an ideal form of summarization since it can synthesize information from multiple documents to create concise informative summaries. In this work, we aim at developing an abstractive summarizer. First, our proposed approach identifies the most important document in the multi-document set. The sentences in the most important document are aligned to sentences in other documents to generate clusters of similar sentences. Second, we generate K-shortest paths from the sentences in each cluster using a word-graph structure. Finally, we select sentences from the set of shortest paths generated from all the clusters employing a novel integer linear programming (ILP) model with the objective of maximizing information content and readability of the final summary. Our ILP model represents the shortest paths as binary variables and considers the length of the path, information score and linguistic quality score in the objective function. Experimental results on the DUC 2004 and 2005 multi-document summarization datasets show that our proposed approach outperforms all the baselines and state-of-the-art extractive summarizers as measured by the ROUGE scores. Our method also outperforms a recent abstractive summarization technique. In manual evaluation, our approach also achieves promising results on informativeness and readability.
Article
Full-text available
As information is available in abundance for every topic on the internet, condensing the important information in the form of a summary would benefit a number of users. Hence, there is growing interest in the research community in developing new approaches to automatically summarize text. An automatic text summarization system generates a summary, i.e. a short text that includes all the important information of the document. Since the advent of text summarization in the 1950s, researchers have been trying to improve techniques for generating summaries so that machine-generated summaries match human-made summaries. Summaries can be generated through extractive as well as abstractive methods. Abstractive methods are highly complex as they need extensive natural language processing. Therefore, the research community is focusing more on extractive summaries, trying to achieve more coherent and meaningful summaries. Over the past decade, several extractive approaches have been developed for automatic summary generation that implement a number of machine learning and optimization techniques. This paper presents a comprehensive survey of recent extractive text summarization approaches developed in the last decade. Their needs are identified and their advantages and disadvantages are listed in a comparative manner. A few abstractive and multilingual text summarization approaches are also covered. Summary evaluation is another challenging issue in this research field. Therefore, both intrinsic and extrinsic methods of summary evaluation are described in detail along with text summarization evaluation conferences and workshops. Furthermore, evaluation results of extractive summarization approaches are presented on some shared DUC datasets. Finally, this paper concludes with a discussion of useful future directions that can help researchers to identify areas where further research is needed.
Article
Full-text available
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
Conference Paper
Full-text available
We describe and evaluate hidden understanding models, a statistical learning approach to natural language understanding. Given a string of words, hidden understanding models determine the most likely meaning for the string. We discuss 1) the problem of representing meaning in this framework, 2) the structure of the statistical model, 3) the process of training the model, and 4) the process of understanding using the model. Finally, we give experimental results, including results on an ARPA evaluation.
Conference Paper
Full-text available
Given a collection of document groups, a natural question is to identify the differences among these groups. Although traditional document summarization techniques can summarize the content of the document groups one by one, there exists a great necessity to generate a summary of the differences among the document groups. In this article, we study a novel problem of summarizing the differences between document groups. A discriminative sentence selection method is proposed to extract the most discriminative sentences that represent the specific characteristics of each document group. Experiments and case studies on real-world data sets demonstrate the effectiveness of our proposed method.
Conference Paper
Full-text available
We introduce the novel problem of automatic related work summarization. Given multiple articles (e.g., conference/journal papers) as input, a related work summarization system creates a topic-biased summary of related work specific to the target paper. Our prototype Related Work Summarization system, ReWoS, takes in a set of keywords arranged in a hierarchical fashion that describes a target paper's topics, to drive the creation of an extractive summary using two different strategies for locating appropriate sentences for general topics as well as detailed ones. Our initial results show an improvement over generic multi-document summarization baselines in a human evaluation.
Article
Full-text available
In this article we propose a strategy for the summarization of scientific articles that concentrates on the rhetorical status of statements in an article: Material for summaries is selected in such a way that summaries can highlight the new contribution of the source article and situate it with respect to earlier work. We provide a gold standard for summaries of this kind consisting of a substantial corpus of conference articles in computational linguistics annotated with human judgments of the rhetorical status and relevance of each sentence in the articles. We present several experiments measuring our judges' agreement on these annotations. We also present an algorithm that, on the basis of the annotated training material, selects content from unseen articles and classifies it into a fixed set of seven rhetorical categories. The output of this extraction and classification system can be viewed as a single-document summary in its own right; alternatively, it provides starting material for the generation of task-oriented and user-tailored summaries designed to give users an overview of a scientific field.
Article
Full-text available
This paper presents a method for combining query-relevance with information-novelty in the context of text retrieval and summarization. The Maximal Marginal Relevance (MMR) criterion strives to reduce redundancy while maintaining query relevance in re-ranking retrieved documents and in selecting appropriate passages for text summarization. Preliminary results indicate some benefits for MMR diversity ranking in document retrieval and in single document summarization. The latter are borne out by the recent results of the SUMMAC conference in the evaluation of summarization systems. However, the clearest advantage is demonstrated in constructing non-redundant multi-document summaries, where MMR results are clearly superior to non-MMR passage selection.
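The MMR criterion itself is compact enough to show directly. Below is a minimal sketch of greedy MMR re-ranking, assuming candidates and the query are already represented as dense vectors and that cosine similarity stands in for whatever similarity functions are used; the function names and the default lambda are illustrative, not taken from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def mmr_select(query_vec, cand_vecs, k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance selection.

    At each step, pick the candidate maximizing
    lam * sim(candidate, query) - (1 - lam) * max sim(candidate, already selected),
    trading query relevance against redundancy with what is already chosen.
    """
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(cand_vecs[i], query_vec)
            redundancy = max((cosine(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lam close to 1 the selection behaves like pure relevance ranking; lowering lam increasingly penalizes passages similar to those already picked, which is what produces the non-redundant multi-document summaries described above.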
Article
This book explores next-generation artificial intelligence based on the symbiosis between humans, machines and nature, including the rules and emerging patterns of recognition, and the integration and optimization of various flows through cyberspace, physical space and social space. It unveils a reciprocal human-machine-nature symbiotic mechanism together with relevant rules on structuring and evolving reality, and also proposes a multi-dimensional space for modelling reality and managing the methodologies for exploring reality. As such it lays the foundation for the emerging research area cyber-physical-social intelligence. Inspiring researchers and university students to explore the development of intelligence and scientific methodology, it is intended for researchers and broad readers with a basic understanding of computer science and the natural sciences. Next-generation artificial intelligence will extend machine intelligence and human intelligence to cyber-physical-social intelligence rendered by various interactions in cyberspace, physical space and social space. With the transformational development of science and society, a multi-dimensional reality is emerging and evolving, leading to the generation and development of various spaces obeying different principles. A fundamental scientific challenge is uncovering the essential mechanisms and principles that structure and evolve the reality emerging and evolving along various dimensions. Meeting this challenge requires identifying the basic relations between humans, machines and nature in order to reveal the cyber-physical-social principles.
Article
Text representation is a hot topic that supports text classification (TC) tasks and has a substantial impact on TC performance. Although the well-known TF-IDF scheme was designed for information retrieval rather than TC, it is highly useful in TC as a term weighting method for representing text content. Inspired by the IDF part of TF-IDF, which is defined as a logarithmic transformation, we propose several alternative methods in this study to generate unsupervised term weighting schemes that can offset the drawbacks of TF-IDF. Moreover, because TC tasks differ from information retrieval, representing test texts as vectors in an appropriate way is also essential for TC, especially for supervised term weighting approaches (e.g., TF-RF), mainly because these methods need to use category information when weighting the terms. However, most current schemes do not clearly explain how to represent test texts. To explore this problem and seek a reasonable solution, we analyze three typical supervised term weighting methods in depth to illustrate how the test text is represented. To investigate the effectiveness of our work, three sets of experiments are designed to compare their performance. The comparisons show that our proposed methods can indeed enhance the performance of TC and sometimes even outperform existing supervised term weighting methods.
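For reference, the classic TF-IDF baseline that such unsupervised schemes modify can be sketched as follows. This shows only the standard logarithmic IDF transformation mentioned above, not the paper's alternative schemes; the function name is illustrative.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Standard TF-IDF term weighting: tf(t, d) * log(N / df(t)).

    `docs` is a list of token lists; returns one {term: weight} dict per
    document. The logarithmic IDF factor is the component for which the
    paper proposes alternative transformations.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted
```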
Article
The key to document summarization is semantic representation of documents. This paper investigates the role of Semantic Link Network in representing and understanding documents for multi-document summarization. We propose a novel abstractive multi-document summarization framework by first transforming documents into a Semantic Link Network of concepts and events, and then transforming the Semantic Link Network into the summary of the documents by selecting important concepts and events while keeping semantics coherence. Experiments on benchmark datasets show that the proposed approach significantly outperforms relevant state-of-the-art baselines and the Semantic Link Network plays an important role in representing and understanding documents.
Article
In this paper, three methods of extracting a single-document summary by combining supervised learning with unsupervised learning are proposed. The purpose of these three methods is to measure the importance of sentences by combining the statistical features of sentences and the relationships between sentences at the same time. The first method uses a supervised model and a graph model to score sentences separately, and then a linear combination of the scores is used as the final score of each sentence. In the second method, the graph model is used as an independent feature of the supervised model to evaluate the importance of sentences. The third method scores the importance of sentences with the supervised model, uses these scores as prior values of the nodes in the graph model, and finally scores sentences with a biased graph model. On the DUC2001 and DUC2002 data sets, with ROUGE as the evaluation criterion, the three methods achieve good results and are superior to methods that extract summaries using only supervised learning or only unsupervised learning. We also validate that prior knowledge can improve the accuracy of key sentence selection in the graph model.
Chapter
Automatic survey generation for a specific research area can quickly give researchers an overview and help them recognize the technical development trends of that area. As far as we know, the task most relevant to automatic survey generation is automatic related work generation. Almost all existing methods of automatic related work generation extract the important sentences from multiple relevant papers to assemble a related work section. However, the extractive methods are far from satisfactory because of poor coherence and readability. In this paper, we propose a novel abstractive method named Hierarchical Seq2seq model based on Dual Supervision (HSDS) to solve the problems above. Given multiple scientific papers in the same research area as input, the model aims to generate a corresponding survey. Furthermore, we build a large dataset to train and evaluate the HSDS model. Extensive experiments demonstrate that our proposed model performs better than the state-of-the-art baselines.
Conference Paper
We present a novel unsupervised query-focused multi-document summarization approach. To this end, we generate a summary by extracting a subset of sentences using the Cross-Entropy (CE) Method. The proposed approach is generic and requires no domain knowledge. Using an evaluation over DUC 2005-2007 datasets with several other state-of-the-art baseline methods, we demonstrate that, our approach is both effective and efficient.
Article
We present SummaRuNNer, a Recurrent Neural Network (RNN) based sequence model for extractive summarization of documents and show that it achieves performance better than or comparable to state-of-the-art. Our model has the additional advantage of being very interpretable, since it allows visualization of its predictions broken up by abstract features such as information content, salience and novelty. Another novel contribution of our work is abstractive training of our extractive model that can train on human generated reference summaries alone, eliminating the need for sentence-level extractive labels.
Article
In this paper, we present a novel document summarization mechanism called KeyphraseDS that can organize scientific articles into a multi-aspect and informative scientific survey by exploiting keyphrases. Keyphrases describe a text's salience and central focus, and can serve as the components of aspects under a specific topic. KeyphraseDS consists of three steps: keyphrase graph construction, semantic aspect generation, and content selection. Keyphrases are first extracted through a CRF-based model exploiting various features, such as syntactic features, correlation features, etc. Spectral clustering is then performed on the keyphrase graph to generate different aspects, where the semantic relatedness between keyphrases is computed through knowledge-based similarity and topic-based similarity. The proposed semantic relatedness not only utilizes statistical text signals efficiently but also overcomes the data sparsity problem. Significant sentences are then selected with respect to the generated aspects through integer linear programming (ILP), which takes semantic relevance, semantic diversity, and keyphrase salience into consideration. Extensive experiments, measured by automatic evaluation and human evaluation, demonstrate the effectiveness of our mechanism for generating scientific surveys.
Conference Paper
Recent work on search engine ranking functions reports improvements on BM25 and Language Models with Dirichlet Smoothing. In this investigation 9 recent ranking functions (BM25, BM25+, BM25T, BM25-adpt, BM25L, TF_l∘δ∘p×IDF, LM-DS, LM-PYP, and LM-PYP-TFIDF) are compared by training on the INEX 2009 Wikipedia collection and testing on INEX 2010 and 9 TREC collections. We find that once trained (using particle swarm optimization) there is very little difference in performance between these functions, that relevance feedback is effective, that stemming is effective, and that it remains unclear which function is best overall.
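As a point of reference for the functions compared above, a minimal sketch of plain Okapi BM25 scoring is shown below. The k1 and b defaults are the usual textbook values (the cited study instead tunes such parameters with particle swarm optimization), and the variants such as BM25+ and BM25L differ only in details of this formula; all names here are illustrative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """Minimal Okapi BM25 score of one document against a query.

    `doc_freqs` maps each term to its document frequency over the collection;
    `avg_doc_len` is the mean document length in tokens.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freqs.get(term, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Saturating, length-normalized term frequency.
        norm_tf = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm_tf
    return score
```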
Conference Paper
In this paper, we investigate the problem of automatic generation of scientific surveys starting from keywords provided by a user. We present a system that can take a topic query as input and generate a survey of the topic by first selecting a set of relevant documents, and then selecting relevant sentences from those documents. We discuss the issues of robust evaluation of such systems and describe an evaluation corpus we generated by manually extracting factoids, or information units, from 47 gold standard documents (surveys and tutorials) on seven topics in Natural Language Processing. We have manually annotated 2,625 sentences with these factoids (around 375 sentences per topic) to build an evaluation corpus for this task. We present evaluation results for the performance of our system using this annotated data.
Article
TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation that corresponds well to human judgments of the subtopic boundaries of 12 texts. Multi-paragraph subtopic segmentation should be useful for many text analysis tasks, including information retrieval and summarization.
Article
This paper presents an innovative unsupervised method for automatic sentence extraction using graph-based ranking algorithms. We evaluate the method in the context of a text summarization task, and show that the results obtained compare favorably with previously published results on established benchmarks.
Article
This paper describes the different steps which lead to the construction of the LIP6 extractive summarizer. The basic idea behind this system is to expand question and title keywords of each topic with their respective cluster terms. Term clusters are found by unsupervised learning using a classification variant of the well-known EM algorithm. Each sentence is then characterized by 4 features, each of which uses bag-of-words similarities between the expanded topic title or questions and the current sentence. A final score of the sentences is found by manually tuning the weights of a linear combination of these features; these weights are chosen in order to maximize the ROUGE-2 AvF measure on the DUC 2006 corpus.
Conference Paper
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, and their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
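The core ROUGE-N computation reduces to counting overlapping n-grams. A simplified single-reference, recall-only sketch is shown below; the official package additionally handles multiple references, stemming, stopword removal, and the ROUGE-L/W/S variants, none of which are reproduced here.

```python
from collections import Counter

def rouge_n(candidate_tokens, reference_tokens, n=2):
    """ROUGE-N recall: clipped n-gram overlap divided by the number of
    n-grams in the (single) human reference summary."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```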
Article
We apply the Google PageRank algorithm to assess the relative importance of all publications in the Physical Review family of journals from 1893 to 2003. While the Google number and the number of citations for each publication are positively correlated, outliers from this linear relation identify some exceptional papers or “gems” that are universally familiar to physicists.
Article
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
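A minimal power-iteration sketch of PageRank as described above, assuming the link structure is given as a dense adjacency matrix; the damping default and the handling of dangling pages follow the usual formulation and are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-8, max_iter=100):
    """PageRank by power iteration.

    `adjacency[i][j]` is nonzero if page i links to page j. Each row is
    normalized into transition probabilities; the damping factor models the
    random surfer occasionally jumping to a uniformly chosen page.
    """
    a = np.asarray(adjacency, dtype=float)
    n = a.shape[0]
    out_deg = a.sum(axis=1)
    # Dangling pages (no out-links) are treated as linking to every page.
    transition = np.where(out_deg[:, None] > 0,
                          a / np.maximum(out_deg, 1)[:, None],
                          1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * transition.T @ rank
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank
```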
Conference Paper
In most existing retrieval models, documents are scored primarily based on various kinds of term statistics such as within-document frequencies, inverse document frequencies, and document lengths. Intuitively, the proximity of matched query terms in a document can also be exploited to promote scores of documents in which the matched query terms are close to each other. Such a proximity heuristic, however, has been largely under-explored in the literature; it is unclear how we can model proximity and incorporate a proximity measure into an existing retrieval model. In this paper, we systematically explore the query term proximity heuristic. Specifically, we propose and study the effectiveness of five different proximity measures, each modeling proximity from a different perspective. We then design two heuristic constraints and use them to guide us in incorporating the proposed proximity measures into an existing retrieval model. Experiments on five standard TREC test collections show that one of the proposed proximity measures is indeed highly correlated with document relevance, and by incorporating it into the KL-divergence language model and the Okapi BM25 model, we can significantly improve retrieval performance.
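One of the simplest proximity measures of the kind studied here is the minimum pairwise distance between positions of distinct matched query terms. The sketch below illustrates only that measure; how such a measure is folded into the KL-divergence or BM25 score is not shown, and the function name is illustrative.

```python
def min_pair_distance(query_terms, doc_tokens):
    """Minimum pairwise distance between positions of distinct matched query
    terms in a tokenized document (smaller means the query terms occur
    closer together)."""
    positions = {}
    for idx, token in enumerate(doc_tokens):
        if token in query_terms:
            positions.setdefault(token, []).append(idx)
    matched = list(positions)
    if len(matched) < 2:
        return None  # proximity is undefined with fewer than two matched terms
    best = float("inf")
    for i in range(len(matched)):
        for j in range(i + 1, len(matched)):
            for p in positions[matched[i]]:
                for q in positions[matched[j]]:
                    best = min(best, abs(p - q))
    return best
```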
Conference Paper
We design a class of submodular functions meant for document summarization tasks. These functions each combine two terms, one which encourages the summary to be representative of the corpus, and the other which positively rewards diversity. Critically, our functions are monotone nondecreasing and submodular, which means that an efficient scalable greedy optimization scheme has a constant factor guarantee of optimality. When evaluated on DUC 2004-2007 corpora, we obtain better than existing state-of-art results in both generic and query-focused document summarization. Lastly, we show that several well-established methods for document summarization correspond, in fact, to submodular function optimization, adding further evidence that submodular functions are a natural fit for document summarization.
Conference Paper
In citation-based summarization, text written by several researchers is leveraged to identify the important aspects of a target paper. Previous work on this problem focused almost exclusively on its extraction aspect (i.e. selecting a representative set of citation sentences that highlight the contribution of the target paper). Meanwhile, the fluency of the produced summaries has been mostly ignored. For example, diversity, readability, cohesion, and ordering of the sentences included in the summary have not been thoroughly considered. This resulted in noisy and confusing summaries. In this work, we present an approach for producing readable and cohesive citation-based summaries. Our experiments show that the proposed approach outperforms several baselines in terms of both extraction quality and fluency.
Conference Paper
We treat the text summarization problem as maximizing a submodular function under a budget constraint. We show, both theoretically and empirically, that a modified greedy algorithm can efficiently solve the budgeted submodular maximization problem near-optimally, and we derive new approximation bounds in doing so. Experiments on the DUC'04 task show that our approach is superior to the best-performing method from the DUC'04 evaluation on ROUGE-1 scores.
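The modified greedy procedure can be sketched as follows, assuming `objective` is any monotone submodular set function over sentence indices (e.g., coverage of the documents) and `costs` are sentence lengths; the cost-scaled gain and the final comparison against the best single affordable sentence are the two ingredients behind the constant-factor guarantee. The scaling exponent `r` and all names are assumptions of this sketch, not the paper's exact notation.

```python
def budgeted_greedy(n_sentences, objective, costs, budget, r=1.0):
    """Greedy maximization of a monotone submodular objective under a budget.

    At each step the sentence with the largest objective gain per scaled cost
    (gain / cost**r) that still fits the budget is added; sentences that do
    not fit or give no gain are skipped.
    """
    selected, spent = [], 0.0
    remaining = list(range(n_sentences))
    while remaining:
        def density(i):
            gain = objective(selected + [i]) - objective(selected)
            return gain / (costs[i] ** r)
        best = max(remaining, key=density)
        if spent + costs[best] <= budget and density(best) > 0:
            selected.append(best)
            spent += costs[best]
        remaining.remove(best)
    # Safeguard: compare with the best single affordable sentence.
    singles = [i for i in range(n_sentences) if costs[i] <= budget]
    if singles:
        best_single = max(singles, key=lambda i: objective([i]))
        if objective([best_single]) > objective(selected):
            return [best_single]
    return selected
```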
Conference Paper
Topic-focused multi-document summarization aims to produce a summary biased to a given topic or user profile. This paper presents a novel extractive approach based on manifold-ranking of sentences to this summarization task. The manifold-ranking process can naturally make full use of both the relationships among all the sentences in the documents and the relationships between the given topic and the sentences. The ranking score is obtained for each sentence in the manifold-ranking process to denote the biased information richness of the sentence. Then the greedy algorithm is employed to impose a diversity penalty on each sentence. The summary is produced by choosing the sentences with both high biased information richness and high information novelty. Experiments on DUC2003 and DUC2005 are performed and the ROUGE evaluation results show that the proposed approach can significantly outperform existing approaches of the top performing systems in DUC tasks and baseline approaches.
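The manifold-ranking propagation step can be sketched as below, assuming a symmetric sentence/topic affinity matrix and a prior vector marking the topic node; the subsequent greedy diversity penalty described above is omitted, and `alpha` is an illustrative default rather than a value from the paper.

```python
import numpy as np

def manifold_ranking(affinity, prior, alpha=0.6, tol=1e-8, max_iter=200):
    """Manifold-ranking score propagation over a sentence/topic graph.

    `affinity` is a symmetric similarity matrix with a zero diagonal and
    `prior` marks the topic node(s). Scores are spread along the manifold by
    iterating f = alpha * S @ f + (1 - alpha) * prior, where S is the
    symmetrically normalized affinity matrix.
    """
    w = np.asarray(affinity, dtype=float)
    d = w.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    s = d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    y = np.asarray(prior, dtype=float)
    f = y.copy()
    for _ in range(max_iter):
        f_new = alpha * s @ f + (1 - alpha) * y
        if np.abs(f_new - f).sum() < tol:
            return f_new
        f = f_new
    return f
```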
Article
We introduce a stochastic graph-based method for computing the relative importance of textual units for Natural Language Processing. We test the technique on the problem of Text Summarization (TS). Extractive TS relies on the concept of sentence salience to identify the most important sentences in a document or set of documents. Salience is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence. We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. Our system, based on LexRank, ranked first in more than one task in the recent DUC 2004 evaluation. In this paper we present a detailed analysis of our approach and apply it to a larger data set including data from earlier DUC evaluations. We discuss several methods to compute centrality using the similarity graph. The results show that degree-based methods (including LexRank) outperform both centroid-based methods and other systems participating in DUC in most of the cases. Furthermore, the LexRank with threshold method outperforms the other degree-based techniques including continuous LexRank. We also show that our approach is quite insensitive to the noise in the data that may result from an imperfect topical clustering of documents.
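A minimal sketch of the thresholded cosine sentence graph underlying LexRank, scored here with plain degree centrality (one of the degree-based variants discussed above); continuous LexRank instead runs a PageRank-style power iteration, as in the earlier sketch, over the same graph. The input sentence vectors, the threshold default, and the names are assumptions of this sketch.

```python
import numpy as np

def sentence_graph_degree(sentence_vectors, threshold=0.1):
    """Build the thresholded cosine-similarity sentence graph and score each
    sentence by its degree (number of sufficiently similar neighbours)."""
    v = np.asarray(sentence_vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    v = v / np.maximum(norms, 1e-12)
    cosine = v @ v.T                        # pairwise cosine similarities
    adjacency = (cosine >= threshold).astype(float)
    np.fill_diagonal(adjacency, 0.0)        # ignore self-similarity
    return adjacency.sum(axis=1)            # degree centrality per sentence
```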
Article
Models of document indexing and document retrieval have been extensively studied. The integration of these two classes of models has been the goal of several researchers but it is a very difficult problem. We argue that much of the reason for this is the lack of an adequate indexing model. This suggests that perhaps a better indexing model would help solve the problem. However, we feel that making unwarranted parametric assumptions will not lead to better retrieval performance. Furthermore, making prior assumptions about the similarity of documents is not warranted either. Instead, we propose an approach to retrieval based on probabilistic language modeling. We estimate models for each document individually. Our approach to modeling is non-parametric and integrates document indexing and document retrieval into a single model. One advantage of our approach is that collection statistics which are used heuristically in many other retrieval models are an integral part of our model. We have...
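To illustrate the general idea of scoring documents by the probability their language model assigns to the query, here is a sketch using Jelinek-Mercer smoothing against a collection model. Note that this is a common textbook estimator shown only for illustration; the paper's own estimator is non-parametric and differs in its details, and all names and the smoothing weight are assumptions of the sketch.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, collection_counts, collection_len, lam=0.7):
    """Log probability that a smoothed document language model assigns to the
    query, mixing the document's term distribution with the collection's."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_prob = 0.0
    for term in query_terms:
        p_doc = tf[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts.get(term, 0) / collection_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p <= 0:
            return float("-inf")  # term unseen even in the collection
        log_prob += math.log(p)
    return log_prob
```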
Exploring differential topic models for comparative summarization of scientific papers
  • L He
  • W Li
  • H Zhuge
Integrating importance, non-redundancy and coherence in graph-based extractive summarization
  • Parveen