Graph-Based Algorithms for Text Summarization
Khushboo S. Thakkar
Department of Computer Science &
Engineering
G. H. Raisoni College of Engineering,
Nagpur, India
e-mail: khushboo.thakkar86@gmail.com
Dr. R. V. Dharaskar
Professor & Head, Dept. of Computer
Science & Engineering
G. H. Raisoni College of Engineering,
Nagpur, India
e-mail: rajiv.dharaskar@gmail.com
M. B. Chandak
HOD, Dept. of Computer Science &
Engineering
Shri Ramdeobaba Kamla Nehru
Engineering College,
Nagpur, India
e-mail: chandakmb@gmail.com
Abstract- Summarization is a brief and accurate representation of input text such that the output covers the most important concepts of the source in a condensed manner. Text summarization is an emerging technique for understanding the main purpose of any kind of document. For viewing a large text document within a short time and in a small display area, such as a PDA screen, summarization provides great flexibility and convenience. This paper presents innovative unsupervised methods for automatic sentence extraction using graph-based ranking algorithms and a shortest-path algorithm.

Keywords- Text Summarization, ranking algorithm, HITS, PageRank.
I. INTRODUCTION
Due to the rapid growth of the World Wide Web,
information is much easier to disseminate and acquire than
before. Finding useful and favored documents from the
huge text repository creates significant challenges for users.
Typical approaches to resolve such a problem are to employ
information retrieval techniques. Information retrieval relies
on the use of keywords to search for the desired information.
Nevertheless, the amount of information obtained via information retrieval is still far greater than a user can handle and manage. This in turn requires the user to analyze the search results one by one until satisfactory information is found, which is time-consuming and inefficient. It is therefore essential to develop tools that efficiently assist users in identifying the desired documents.
One possible means is automatic text summarization, a text-mining task that extracts essential sentences so as to cover almost all the concepts of a document. Its aim is to reduce the time users spend reading documents without losing the information needed for comprehension. With a document summary available, users can easily decide its relevance to their interests and acquire the desired documents with much less mental load.
II. GRAPH-BASED ALGORITHMS
A. Graph-based Ranking Algorithm
Graph-based ranking algorithms are essentially a way of deciding the importance of a vertex within a graph, based on information drawn from the graph structure. In this section, two graph-based ranking algorithms, previously found to be successful on a range of ranking problems, are presented. These algorithms can be adapted to undirected or weighted graphs, which makes them particularly useful in the context of text-based ranking applications.
HITS
Hyperlink-Induced Topic Search (HITS), also known as hubs and authorities, is a link analysis algorithm developed by Jon Kleinberg that rates Web pages. It determines two values for a page: its authority, which estimates the value of the page's content, and its hub value, which estimates the value of its links to other pages.
In the HITS algorithm, the first step is to retrieve the set
of results to the search query. The computation is
performed only on this result set, not across all Web pages.
Authority and hub values are defined in terms of one
another in a mutual recursion. An authority value is
computed as the sum of the scaled hub values that point to
that page. A hub value is the sum of the scaled authority
values of the pages it points to. Some implementations also
consider the relevance of the linked pages.
The hub score and authority score for each node are calculated with the following algorithm:
1. Start with each node having a hub score and an authority score of 1.
2. Run the authority update rule.
3. Run the hub update rule.
4. Normalize the values by dividing each hub score by the square root of the sum of the squares of all hub scores, and each authority score by the square root of the sum of the squares of all authority scores.
5. Repeat from the second step as necessary.
HITS produces two sets of scores, an "authority" score and a "hub" score:

HITS_A(V_i) = Σ_{V_j ∈ In(V_i)} HITS_H(V_j)    (1)

HITS_H(V_i) = Σ_{V_j ∈ Out(V_i)} HITS_A(V_j)    (2)
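As a rough illustration of these update rules, the following Python sketch (our own, not part of the original paper; the function and variable names are illustrative) iterates the authority and hub updates with the normalization step until the scores stabilize.

```python
import math

def hits(out_links, iterations=50, tol=1.0e-6):
    """Compute authority and hub scores for a directed graph.

    out_links: dict mapping each node to the list of nodes it points to.
    Returns two dicts: (authority, hub).
    """
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum of hub scores of the nodes pointing to each node.
        new_auth = {n: 0.0 for n in nodes}
        for u, targets in out_links.items():
            for v in targets:
                new_auth[v] += hub[u]
        # Hub update: sum of the (updated) authority scores of the nodes pointed to.
        new_hub = {n: sum(new_auth[v] for v in out_links.get(n, [])) for n in nodes}

        # Normalize both score vectors to unit length.
        a_norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
        new_auth = {n: x / a_norm for n, x in new_auth.items()}
        new_hub = {n: x / h_norm for n, x in new_hub.items()}

        delta = sum(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes)
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub
```

The same iteration can be run on a sentence graph by treating sentences as nodes and similarity links as edges.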
PageRank
PageRank is a link analysis algorithm, named after Larry
Page, used by the Google Internet search engine that
assigns a numerical weighting to each element of a
hyperlinked set of documents, such as the World Wide Web,
with the purpose of "measuring" its relative importance
within the set. The algorithm may be applied to any
collection of entities with reciprocal quotations and
references. The numerical weight that it assigns to any
given element E is also called the PageRank of E and
denoted by PR(E).
The name "PageRank" is a trademark of Google, and the
PageRank process has been patented (U.S. Patent
6,285,999). However, the patent is assigned to Stanford
University and not to Google. Google has exclusive license
rights on the patent from Stanford University. The
university received 1.8 million shares of Google in
exchange for use of the patent; the shares were sold in 2005
for $336 million.
Google describes PageRank:
“PageRank relies on the uniquely democratic nature
of the web by using its vast link structure as an indicator
of an individual page's value. In essence, Google
interprets a link from page A to page B as a vote, by
page A, for page B. But, Google looks at more than the
sheer volume of votes, or links a page receives; it also
analyzes the page that casts the vote. Votes cast by
pages that are themselves "important" weigh more
heavily and help to make other pages "important".”
In other words, a PageRank results from a "ballot" among
all the other pages on the World Wide Web about how
important a page is. A hyperlink to a page counts as a vote
of support. The PageRank of a page is defined recursively
and depends on the number and PageRank metric of all
pages that link to it ("incoming links"). A page that is linked
to by many pages with high PageRank receives a high rank
itself. If there are no links to a web page there is no support
for that page.
PageRank is a probability distribution used to represent
the likelihood that a person randomly clicking on links will
arrive at any particular page. PageRank can be calculated
for collections of documents of any size. It is assumed in
several research papers that the distribution is evenly
divided between all documents in the collection at the
beginning of the computational process. The PageRank
computations require several passes, called "iterations",
through the collection to adjust approximate PageRank
values to more closely reflect the theoretical true value.
In the general case, the PageRank value for any page u
can be expressed as:
PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)    (3)
i.e. the PageRank value for a page u is dependent on the
PageRank values for each page v out of the set Bu (this set
contains all pages linking to page u), divided by the number
L(v) of links from page v.
The PageRank theory holds that even an imaginary surfer
who is randomly clicking on links will eventually stop
clicking. The probability, at any step, that the person will continue clicking is the damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor is set to around 0.85.
The damping factor is subtracted from 1 (and in some
variations of the algorithm, the result is divided by the
number of documents in the collection) and this term is then
added to the product of the damping factor and the sum of
the incoming PageRank scores.
That is,
PR(A) = (1 − d) + d · ( PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + … )    (4)
For each of these algorithms, starting from arbitrary
values assigned to each node in the graph, the computation
iterates until convergence below a given threshold is
achieved. After the algorithm has run, a score is associated with each vertex, representing the "importance" or "power" of that vertex within the graph. Note that the final values are not affected by the choice of initial values; only the number of iterations to convergence may differ.
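A minimal power-iteration sketch of equation (4) might look as follows (again our own illustration in Python, not the original authors' code; the uniform initial values and the convergence test are arbitrary choices). The graph is given as a dictionary of out-link lists.

```python
def pagerank(out_links, d=0.85, max_iter=100, tol=1.0e-6):
    """Iteratively compute PR(u) = (1 - d) + d * sum(PR(v) / L(v)) over pages v linking to u."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    pr = {n: 1.0 for n in nodes}          # arbitrary initial values
    for _ in range(max_iter):
        new_pr = {}
        for u in nodes:
            incoming = sum(pr[v] / len(out_links[v])
                           for v in nodes
                           if u in out_links.get(v, []))
            new_pr[u] = (1.0 - d) + d * incoming
        delta = sum(abs(new_pr[n] - pr[n]) for n in nodes)
        pr = new_pr
        if delta < tol:
            break
    return pr
```

For example, pagerank({'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}) converges after a few iterations regardless of the starting values, as noted above.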
Text As Graph
To enable the application of graph-based ranking
algorithms to natural language texts, we have to build a
graph that represents the text, and interconnects words or
other text entities with meaningful relations. Depending on
the application at hand, text units of various sizes and
characteristics can be added as vertices in the graph, e.g.
words, collocations, entire sentences, or others. Similarly, it
is the application that dictates the type of relations that are
used to draw connections between any two such vertices,
e.g. lexical or semantic relations, contextual overlap, etc.
Regardless of the type and characteristics of the elements
added to the graph, the application of graph-based ranking
algorithms to natural language texts consists of the
following main steps:
1. Identify text units that best define the task at hand,
and add them as vertices in the graph.
2. Identify relations that connect such text units, and
use these relations to draw edges between vertices
in the graph. Edges can be directed or undirected,
weighted or unweighted.
3. Iterate the graph-based ranking algorithm until
convergence.
4. Sort vertices based on their final score. Use the values attached to each vertex for ranking/selection decisions.
Sentence Extraction
To apply TextRank, we first need to build a graph
associated with the text, where the graph vertices are
representative for the units to be ranked. For the task of
sentence extraction, the goal is to rank entire sentences, and
therefore a vertex is added to the graph for each sentence in
the text.
Formally, given two sentences S_i and S_j, with a sentence represented by the set of N_i words that appear in it, S_i = {W_i^1, W_i^2, …, W_i^{N_i}}, the similarity of S_i and S_j is defined as:

Similarity(S_i, S_j) = |{W_k | W_k ∈ S_i and W_k ∈ S_j}| / ( log(|S_i|) + log(|S_j|) )    (5)
Other sentence similarity measures, such as string
kernels, cosine similarity, longest common subsequence,
etc. are also possible, and we are currently evaluating their
impact on the summarization performance.
Figure 1. Sample graph built for sentence extraction
The resulting graph is highly connected, with a weight
associated with each edge, indicating the strength of the
connections established between various sentence pairs in
the text. The text is therefore represented as a weighted
graph.
After the ranking algorithm is run on the graph, sentences are sorted in reverse order of their score, and the top-ranked sentences are selected for inclusion in the summary.
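As a rough end-to-end illustration, the sketch below combines the graph-construction steps listed earlier with the similarity measure of equation (5) and a weighted PageRank-style ranking. The crude regular-expression tokenizer and the parameter values are our own simplifications, not details of the original TextRank system.

```python
import math
import re

def sentence_similarity(s_i, s_j):
    """Equation (5): word overlap normalized by the log of the sentence lengths."""
    w_i, w_j = set(s_i), set(s_j)
    if len(w_i) < 2 or len(w_j) < 2:
        return 0.0
    return len(w_i & w_j) / (math.log(len(w_i)) + math.log(len(w_j)))

def textrank_summary(text, top_n=4, d=0.85, iterations=50):
    # Very rough sentence and word splitting; a real system would use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    words = [re.findall(r'\w+', s.lower()) for s in sentences]
    n = len(sentences)

    # Weighted, undirected sentence graph: one vertex per sentence, edge weight = similarity.
    w = [[sentence_similarity(words[i], words[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in w]

    # Weighted PageRank-style ranking over the sentence graph.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1.0 - d) + d * sum(w[j][i] / out_weight[j] * scores[j]
                                      for j in range(n) if out_weight[j] > 0)
                  for i in range(n)]

    # Keep the top-ranked sentences, presented in their original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]
```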
Fig. 1 shows a text sample [6] and the associated weighted graph constructed for this text. The figure also shows sample weights attached to the edges connected to vertex 9, and the final TextRank score computed for each sentence. The sentences with the highest rank are selected for inclusion in the abstract. For this sample article, the sentences with ids 9, 15, 16, and 18 are extracted, resulting in a summary of about 100 words which, according to automatic evaluation measures, ranks second among the summaries produced by 15 other systems.
B. Shortest-path Algorithm
Extraction based summarization normally produces
summaries that are somewhat unappealing to read. There is
a lack of flow in the text, since the extracted parts, usually
sentences, are taken from different parts of the original text.
This can, for instance, lead to very sudden topic shifts. The idea behind the presented method, which extracts sentences that form a path in which each sentence is similar to the previous one, is that the resulting summaries will hopefully have better flow. This quality is, however, quite hard to evaluate. Since the summaries are still extracts, very high-quality summaries should not be expected [3].
Building the graph
When a text is to be summarized, it is first split into
sentences and words. The sentences become the nodes of
the graph. Sentences that are similar to each other have an
edge between them. Here, similarity simply means word
overlap, though other measures could also be used. Thus, if
two sentences have at least one word in common, there will
be an edge between them. Of course, many words are
ambiguous, and having a matching word does not guarantee
any kind of similarity. Since all sentences come from the
same document, and words tend to be less ambiguous in a
single text, this problem is somewhat mitigated.
All sentences also have an edge to the following
sentence. Edges are given costs (or weights). The more
similar two sentences are, the less the cost of the edge. The
further apart the sentences are in the original text, the higher
the cost of the edge. To favor inclusion of “interesting”
sentences, all sentences that are deemed relevant to the
document according to classical summarization methods
have the costs of all the edges leading to them lowered.
The cost of an edge from the node representing sentence
number i in the text, Si, to the node for Sj is calculated as:
cost_{i,j} = (i − j)² / ( overlap_{i,j} · weight_j )    (6)
and the weight of a sentence is calculated as:
weight_j = (1 + overlap_{title,j}) · (1 + Σ_{w ∈ S_j} tf(w)) / (Σ_{w ∈ text} tf(w)) · early(j) · (1 + |edge_j|)    (7)

where overlap_{title,j} is the word overlap between sentence j and the title, tf(w) is the term frequency of word w, early(j) favors sentences that appear early in the text, and |edge_j| is the number of edges connected to sentence j [3].
Since similarity is based on the number of words in
common between two sentences, long sentences have a
greater chance of being similar to other sentences. Favoring
long sentences is often good from a smoothness
perspective. Summaries with many short sentences have a
larger chance for abrupt changes, since there are more
sentence breaks.
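Formulas (6) and (7) translate almost directly into code. The sketch below is one possible reading of them; the choice of early(j) = 1/(j+1), the document-level term frequencies, and the edge criterion (adjacency or word overlap) are assumptions rather than details spelled out in the text.

```python
from collections import Counter

def build_edge_costs(sentences, title_words):
    """Compute sentence weights (eq. 7) and pairwise edge costs (eq. 6).

    sentences: list of word lists, in document order.
    Assumptions: early(j) = 1 / (j + 1), tf(w) counts occurrences in the
    whole document, and edges connect adjacent sentences and sentences
    sharing at least one word.
    """
    doc_tf = Counter(w for s in sentences for w in s)
    total_tf = sum(doc_tf.values()) or 1
    n = len(sentences)

    # An edge exists between adjacent sentences and between overlapping sentences.
    def connected(i, j):
        return abs(i - j) == 1 or bool(set(sentences[i]) & set(sentences[j]))

    degree = [sum(1 for j in range(n) if j != i and connected(i, j)) for i in range(n)]

    weights = []
    for j, s in enumerate(sentences):
        overlap_title = len(set(s) & set(title_words))
        sent_tf = sum(doc_tf[w] for w in set(s))
        early = 1.0 / (j + 1)                      # assumed form of early(j)
        weights.append((1 + overlap_title) * (1 + sent_tf) / total_tf
                       * early * (1 + degree[j]))

    costs = {}
    for i in range(n):
        for j in range(n):
            if i != j and connected(i, j):
                # Guard: adjacency-only edges may have zero word overlap.
                overlap = len(set(sentences[i]) & set(sentences[j])) or 1
                costs[(i, j)] = (i - j) ** 2 / (overlap * weights[j])
    return costs, weights
```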
Constructing the summary
When the graph has been constructed, the summary is
created by taking the shortest path that starts with the first
sentence of the original text and ends with the last sentence.
Since the original text also starts and ends in these
positions, this will hopefully give a smooth but shorter set
of sentences between these two points.
The N shortest paths are found by simply starting at the
start node and adding all paths of length one to a priority
queue, where the priority value is the total cost of a path.
The currently cheapest path is then examined and if it does
not end at the end node, all paths starting with this path and
containing one more edge are also added to the priority
queue. Paths with loops are discarded. Whenever the
currently shortest path ends in the end node, another
shortest path has been found, and the search is continued
until the N shortest paths have been found.
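The search procedure just described maps naturally onto a priority queue of partial paths. The following sketch (our own illustration, reusing the cost dictionary from the previous snippet) returns the N cheapest loop-free paths from the first sentence to the last.

```python
import heapq

def n_shortest_paths(costs, start, end, n_paths=3):
    """Find the n_paths cheapest loop-free paths from start to end.

    costs: dict mapping (i, j) node pairs to edge costs.
    Returns a list of (total_cost, path) tuples, cheapest first.
    """
    # Adjacency list derived from the cost dictionary.
    adjacent = {}
    for (i, j), c in costs.items():
        adjacent.setdefault(i, []).append((j, c))

    queue = [(0.0, [start])]             # priority queue of (total cost, path)
    found = []
    while queue and len(found) < n_paths:
        total, path = heapq.heappop(queue)
        node = path[-1]
        if node == end:
            found.append((total, path))  # the currently cheapest path reaching the end
            continue
        for nxt, cost in adjacent.get(node, []):
            if nxt not in path:          # discard paths with loops
                heapq.heappush(queue, (total + cost, path + [nxt]))
    return found
```

One would then pick among the returned paths, for instance the one whose number of sentences is closest to the desired summary length.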
III. COMPARISON
TextRank works well because it does not rely only on the local context of a text unit (vertex); rather, it takes into account information recursively drawn from the entire text (graph). Through the graphs it builds on texts, TextRank
identifies connections between various entities in a text, and
implements the concept of recommendation. A text unit
recommends other related text units, and the strength of the
recommendation is recursively computed based on the
importance of the units making the recommendation. In the
process of identifying important sentences in a text, a
sentence recommends another sentence that addresses
similar concepts as being useful for the overall
understanding of the text. Sentences that are highly
recommended by other sentences are likely to be more
informative for the given text, and will be therefore given a
higher score.
An important aspect of TextRank is that it does not
require deep linguistic knowledge, nor domain or language
specific annotated corpora, which makes it highly portable
to other domains, genres, or languages.
The shortest-path algorithm is easy to implement and should be relatively language independent, though it has only been evaluated on English texts. The generated summaries are often somewhat "smooth" to read. This smoothness is hard to quantify objectively, though, and the extracts are by no means as smooth as a manually written summary.
When it comes to including the important facts from the
original text, the weighting of sentences using traditional
extraction weighting methods seems to be the most
important part. Taking a path from the first to the last
sentence does give a spread to the summary, making it more
likely that most parts of the original text that are important
will be included and making it unlikely that too much
information is included from only one part of the original
text.
IV. CONCLUSION
Automatic text summarization now refers to a range of techniques that aim to generate summaries of texts, and this area of NLP research is becoming more common in the web and digital-library environment. In simple summarization systems, parts of the text, sentences or paragraphs, are selected automatically based on linguistic and/or statistical criteria to produce the abstract or summary. The shortest-path algorithm is better in that it generates smoother summaries than the ranking algorithms. Taking a path from the first to the last sentence also gives a spread to the summary, making it more likely that most of the important parts of the original text will be included.
V. REFERENCES
[1] Satyajeet Raje, Sanket Tulangekar, Rajshekhar Waghe, Rohit Pathak, Parikshit Mahalle, "Extraction of Key Phrases from Document using Statistical and Linguistic Analysis", 2009.
[2] Md. Nizam Uddin, Shakil Akter Khan, "A Study on Text Summarization Techniques and Implement Few of Them for Bangla Language", 1-4244-1551-9/07, IEEE, 2007.
[3] Jonas Sjöbergh, Kenji Araki, "Extraction based summarization using a shortest path algorithm", Proceedings of the Annual Meeting of the Association for Natural Language Processing, 2006.
[4] Massih R. Amini, Nicolas Usunier, and Patrick Gallinari, "Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms", D.E. Losada and J.M. Fernández-Luna (Eds.): ECIR 2005, LNCS 3408, pp. 142–156, 2005.
[5] Rada Mihalcea, "Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization", The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 170–173, Barcelona, Spain, 2004.
[6] R. Mihalcea and P. Tarau, "TextRank – Bringing Order into Texts", 2004.
[7] R. Mihalcea, P. Tarau, and E. Figa, "PageRank on Semantic Networks, with Application to Word Sense Disambiguation", Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, August 2004.
[8] C.Y. Lin and E.H. Hovy, "The Potential and Limitations of Sentence Extraction for Summarization", Proceedings of the HLT/NAACL Workshop on Automatic Summarization, Edmonton, Canada, May 2003.
[9] Chin-Yew Lin and Eduard Hovy, "Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics", in Udo Hahn and Donna Harman, editors, Proceedings of the 2003 Human Language Technology Conference.
[10] P.J. Herings, G. van der Laan, and D. Talman, "Measuring the Power of Nodes in Digraphs", Technical report, Tinbergen Institute, 2001.
[11] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Computer Networks and ISDN Systems, 1998.