Graph-Based Algorithms for Text Summarization
Khushboo S. Thakkar
Department of Computer Science &
Engineering
G. H. Raisoni College of Engineering,
Nagpur, India
e-mail: khushboo.thakkar86@gmail.com
Dr. R. V. Dharaskar
Professor & Head, Dept. of Computer
Science & Engineering
G. H. Raisoni College of Engineering,
Nagpur, India
e-mail: rajiv.dharaskar@gmail.com
M. B. Chandak
HOD, Dept. of Computer Science &
Engineering
Shri Ramdeobaba Kamla Nehru
Engineering College,
Nagpur, India
e-mail: chandakmb@gmail.com
Abstract- Summarization is a brief and accurate representation of the input text, such that the output covers the most important concepts of the source in a condensed manner. Text summarization is an emerging technique for understanding the main purpose of any kind of document. For viewing a large text document within a short time and on a small display area, such as a PDA screen, summarization provides greater flexibility and convenience. This paper presents innovative unsupervised methods for automatic sentence extraction using graph-based ranking algorithms and a shortest-path algorithm.
Keywords- Text Summarization, ranking algorithm, HITS, PageRank.
I. INTRODUCTION
Due to the rapid growth of the World Wide Web, information is much easier to disseminate and acquire than before. Finding useful and relevant documents in this huge text repository poses significant challenges for users. The typical approach to this problem is to employ information retrieval techniques, which rely on keywords to search for the desired information. Nevertheless, the amount of information obtained via information retrieval is still far greater than a user can handle and manage. The user must then analyze the search results one by one until satisfactory information is found, which is time-consuming and inefficient. It is therefore essential to develop tools that efficiently assist users in identifying the desired documents.
One possible means is automatic text summarization, a text-mining task that extracts essential sentences to cover almost all the concepts of a document. Its aim is to reduce the time users spend reading a document without losing the points needed for comprehension. With a document summary available, users can easily judge its relevance to their interests and acquire the desired documents with much less mental load.
II. GRAPH-BASED ALGORITHMS
A. Graph-based Ranking Algorithm
Graph-based ranking algorithms are essentially a way of deciding the importance of a vertex within a graph based on information drawn from the graph structure. In this section, two graph-based ranking algorithms that have previously been found successful on a range of ranking problems are presented. These algorithms can be adapted to undirected or weighted graphs, which are particularly useful in the context of text-based ranking applications.
HITS
Hyperlink-Induced Topic Search (HITS) (also known as
Hubs and authorities) is a link analysis algorithm that rates
Web pages, developed by Jon Kleinberg. It determines two
values for a page: its authority, which estimates the value of
the content of the page, and its hub value, which estimates
the value of its links to other pages.
In the HITS algorithm, the first step is to retrieve the set
of results to the search query. The computation is
performed only on this result set, not across all Web pages.
Authority and hub values are defined in terms of one
another in a mutual recursion. An authority value is
computed as the sum of the scaled hub values that point to
that page. A hub value is the sum of the scaled authority
values of the pages it points to. Some implementations also
consider the relevance of the linked pages.
The hub score and authority score for a node are calculated with the following algorithm:
1. Start with each node having a hub score and an authority score of 1.
2. Run the Authority Update Rule.
3. Run the Hub Update Rule.
4. Normalize the values by dividing each hub score by the square root of the sum of the squares of all hub scores, and each authority score by the square root of the sum of the squares of all authority scores.
5. Repeat from the second step as necessary.
HITS produces two sets of scores – an “authority” score,
and a “hub” score:
HITS_A(V_i) = \sum_{V_j \in In(V_i)} HITS_H(V_j)    (1)

HITS_H(V_i) = \sum_{V_j \in Out(V_i)} HITS_A(V_j)    (2)
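As a concrete illustration (ours, not taken from the paper), the update rules in Eqs. (1) and (2) can be run as a simple iterative procedure. The sketch below in Python assumes the graph is given as adjacency lists and uses a fixed iteration count; both choices are illustrative.

import math

def hits(out_links, iterations=50):
    """Iteratively compute hub and authority scores per Eqs. (1)-(2).

    out_links: dict mapping each node to the list of nodes it points to.
    Returns (authority, hub) score dictionaries, normalized after each pass.
    """
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update rule: sum of hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for u, vs in out_links.items():
            for v in vs:
                new_auth[v] += hub[u]
        # Hub update rule: sum of authority scores of pages linked to.
        new_hub = {n: sum(new_auth[v] for v in out_links.get(n, ())) for n in nodes}
        # Normalize so that the sum of squares of each score vector is 1.
        a_norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
        auth = {n: x / a_norm for n, x in new_auth.items()}
        hub = {n: x / h_norm for n, x in new_hub.items()}
    return auth, hub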
PageRank
PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web,
with the purpose of "measuring" its relative importance
within the set. The algorithm may be applied to any
collection of entities with reciprocal quotations and
references. The numerical weight that it assigns to any
given element E is also called the PageRank of E and
denoted by PR(E).
The name "PageRank" is a trademark of Google, and the
PageRank process has been patented (U.S. Patent
6,285,999). However, the patent is assigned to Stanford
University and not to Google. Google has exclusive license
rights on the patent from Stanford University. The
university received 1.8 million shares of Google in
exchange for use of the patent; the shares were sold in 2005
for $336 million.
Google describes PageRank:
“PageRank relies on the uniquely democratic nature
of the web by using its vast link structure as an indicator
of an individual page's value. In essence, Google
interprets a link from page A to page B as a vote, by
page A, for page B. But, Google looks at more than the
sheer volume of votes, or links a page receives; it also
analyzes the page that casts the vote. Votes cast by
pages that are themselves "important" weigh more
heavily and help to make other pages "important".”
In other words, a PageRank value results from a "ballot" among
all the other pages on the World Wide Web about how
important a page is. A hyperlink to a page counts as a vote
of support. The PageRank of a page is defined recursively
and depends on the number and PageRank metric of all
pages that link to it ("incoming links"). A page that is linked
to by many pages with high PageRank receives a high rank
itself. If there are no links to a web page there is no support
for that page.
PageRank is a probability distribution used to represent
the likelihood that a person randomly clicking on links will
arrive at any particular page. PageRank can be calculated
for collections of documents of any size. It is assumed in
several research papers that the distribution is evenly
divided between all documents in the collection at the
beginning of the computational process. The PageRank
computations require several passes, called "iterations",
through the collection to adjust approximate PageRank
values to more closely reflect the theoretical true value.
In the general case, the PageRank value for any page u
can be expressed as:
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}    (3)
i.e. the PageRank value for a page u is dependent on the
PageRank values for each page v out of the set Bu (this set
contains all pages linking to page u), divided by the number
L(v) of links from page v.
The PageRank theory holds that even an imaginary surfer
who is randomly clicking on links will eventually stop
clicking. The probability, at any step, that the person will continue clicking is the damping factor d. Various studies have tested
different damping factors, but it is generally assumed that
the damping factor will be set around 0.85.
The damping factor is subtracted from 1 (and in some
variations of the algorithm, the result is divided by the
number of documents in the collection) and this term is then
added to the product of the damping factor and the sum of
the incoming PageRank scores.
That is,

PR(A) = (1 - d) + d \cdot \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right)    (4)
For each of these algorithms, starting from arbitrary
values assigned to each node in the graph, the computation
iterates until convergence below a given threshold is
achieved. After running the algorithm, a score is associated
with each vertex, which represents the “importance” or
“power” of that vertex within the graph. Notice that the
final values are not affected by the choice of the initial
value, only the number of iterations to convergence may be
different.
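For illustration only, the damped update of Eq. (4), iterated until convergence as described above, can be sketched as follows. The damping factor, tolerance, and graph representation are assumed values, not anything prescribed by the paper.

def pagerank(out_links, d=0.85, tol=1.0e-6, max_iter=100):
    """Damped PageRank per Eq. (4), iterated until the scores stabilize.

    out_links: dict mapping each node to the list of nodes it links to.
    """
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    pr = {n: 1.0 for n in nodes}            # arbitrary initial values
    in_links = {n: [] for n in nodes}
    for u, vs in out_links.items():
        for v in vs:
            in_links[v].append(u)
    for _ in range(max_iter):
        new_pr = {}
        for n in nodes:
            incoming = sum(pr[u] / len(out_links[u]) for u in in_links[n])
            new_pr[n] = (1.0 - d) + d * incoming
        converged = max(abs(new_pr[n] - pr[n]) for n in nodes) < tol
        pr = new_pr
        if converged:
            break
    return pr

In line with the observation above, starting from different initial values changes only the number of iterations needed, not the final scores.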
Text As Graph
To enable the application of graph-based ranking
algorithms to natural language texts, we have to build a
graph that represents the text, and interconnects words or
other text entities with meaningful relations. Depending on
the application at hand, text units of various sizes and
characteristics can be added as vertices in the graph, e.g.
words, collocations, entire sentences, or others. Similarly, it
is the application that dictates the type of relations that are
used to draw connections between any two such vertices,
e.g. lexical or semantic relations, contextual overlap, etc.
Regardless of the type and characteristics of the elements
added to the graph, the application of graph-based ranking
algorithms to natural language texts consists of the
following main steps:
1. Identify text units that best define the task at hand,
and add them as vertices in the graph.
2. Identify relations that connect such text units, and
use these relations to draw edges between vertices
in the graph. Edges can be directed or undirected,
weighted or unweighted.
3. Iterate the graph-based ranking algorithm until
convergence.
4. Sort vertices based on their final score. Use the values attached to each vertex for ranking/selection decisions.
Sentence Extraction
To apply TextRank, we first need to build a graph
associated with the text, where the graph vertices are
representative for the units to be ranked. For the task of
sentence extraction, the goal is to rank entire sentences, and
therefore a vertex is added to the graph for each sentence in
the text.
Formally, given two sentences S_i and S_j, with a sentence represented by the set of N_i words that appear in it, S_i = W^i_1, W^i_2, ..., W^i_{N_i}, the similarity of S_i and S_j is defined as:

Similarity(S_i, S_j) = \frac{|\{W_k \mid W_k \in S_i \;\&\; W_k \in S_j\}|}{\log(|S_i|) + \log(|S_j|)}    (5)
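As a rough illustration of Eq. (5) (our own sketch, with tokenization reduced to whitespace splitting), the similarity of two sentences can be computed as their normalized word overlap:

import math

def similarity(s_i, s_j):
    """Word-overlap similarity of two sentences per Eq. (5).

    Sentences are plain strings; tokens are lower-cased whitespace tokens.
    The overlap is normalized by the logarithm of the sentence lengths so
    that long sentences are not favored excessively.
    """
    w_i = s_i.lower().split()
    w_j = s_j.lower().split()
    if len(w_i) < 2 or len(w_j) < 2:
        return 0.0              # avoid a zero denominator for one-word sentences
    common = len(set(w_i) & set(w_j))
    return common / (math.log(len(w_i)) + math.log(len(w_j)))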
Other sentence similarity measures, such as string
kernels, cosine similarity, longest common subsequence,
etc. are also possible, and we are currently evaluating their
impact on the summarization performance.
Figure 1. Sample graph built for sentence extraction
The resulting graph is highly connected, with a weight
associated with each edge, indicating the strength of the
connections established between various sentence pairs in
the text. The text is therefore represented as a weighted
graph.
After the ranking algorithm is run on the graph, sentences
are sorted in reversed order of their score, and the top
ranked sentences are selected for inclusion in the summary.
Fig. 1 shows a text sample [6], and the associated
weighted graph constructed for this text. The figure also
shows sample weights attached to the edges connected to
vertex 9, and the final TextRank score computed for each
sentence. The sentences with the highest rank are selected
for inclusion in the abstract. For this sample article, the
sentences with ids 9, 15, 16, and 18 are extracted, resulting in a summary of about 100 words which, according to automatic evaluation measures, ranks second among the summaries produced by 15 other systems.
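A minimal sentence-extraction pipeline in the spirit of the description above can be sketched as follows; it reuses the similarity function from the previous sketch and runs a weighted PageRank-style update over the sentence graph. The parameter choices and the exact form of the weighted update are our assumptions, intended as an approximation of TextRank rather than as the authors' implementation.

def textrank_summary(sentences, top_n=4, d=0.85, iterations=50):
    """Rank sentences with a weighted PageRank over the similarity graph
    and return the top_n highest-ranked sentences in original text order."""
    n = len(sentences)
    # Edge weights: pairwise similarity per Eq. (5); no self-loops.
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in w]     # total outgoing weight per sentence
    score = [1.0] * n
    for _ in range(iterations):
        score = [(1.0 - d) + d * sum(
                     w[j][i] / out_weight[j] * score[j]
                     for j in range(n)
                     if w[j][i] > 0.0 and out_weight[j] > 0.0)
                 for i in range(n)]
    top = sorted(range(n), key=lambda i: score[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]   # preserve original order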
B. Shortest-path Algorithm
Extraction based summarization normally produces
summaries that are somewhat unappealing to read. There is
a lack of flow in the text, since the extracted parts, usually
sentences, are taken from different parts of the original text.
This can, for instance, lead to very sudden topic shifts. The idea behind the presented method, which extracts sentences that form a path in which each sentence is similar to the previous one, is that the resulting summaries will hopefully have better flow. This quality is, however, quite hard to evaluate.
Since the summaries are still extracts, high quality
summaries should still not be expected [3].
Building the graph
When a text is to be summarized, it is first split into
sentences and words. The sentences become the nodes of
the graph. Sentences that are similar to each other have an
edge between them. Here, similarity simply means word
overlap, though other measures could also be used. Thus, if
two sentences have at least one word in common, there will
be an edge between them. Of course, many words are
ambiguous, and having a matching word does not guarantee
any kind of similarity. Since all sentences come from the
same document, and words tend to be less ambiguous in a
single text, this problem is somewhat mitigated.
All sentences also have an edge to the following
sentence. Edges are given costs (or weights). The more
similar two sentences are, the less the cost of the edge. The
further apart the sentences are in the original text, the higher
the cost of the edge. To favor inclusion of “interesting”
sentences, all sentences that are deemed relevant to the
document according to classical summarization methods
have the costs of all the edges leading to them lowered.
The cost of an edge from the node representing sentence
number i in the text, Si, to the node for Sj is calculated as:
cost_{i,j} = \frac{(i - j)^2}{overlap_{i,j} \cdot weight_j}    (6)

and the weight of a sentence is calculated as:

weight_j = (1 + overlap_{title,j}) \cdot \frac{1 + \sum_{w \in S_j} tf(w)}{\sum_{w \in text} tf(w)} \cdot early(j) \cdot (1 + |edge_j|)    (7)
Since similarity is based on the number of words in
common between two sentences, long sentences have a
greater chance of being similar to other sentences. Favoring
long sentences is often good from a smoothness
perspective. Summaries with many short sentences have a
larger chance for abrupt changes, since there are more
sentence breaks.
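To make the cost model concrete, the following sketch (ours, not the original implementation) assigns each sentence a weight following Eq. (7) and each candidate edge a cost following Eq. (6). The exact form of early(j), the use of raw term frequencies, and the handling of adjacent sentences with no word overlap are assumptions on our part.

from collections import Counter

def build_cost_graph(sentences, title=""):
    """Build a directed graph whose edge (i, j) carries the cost of Eq. (6).

    Sentences and the title are plain strings; similarity is word overlap.
    Returns a dict {(i, j): cost} for sentence pairs with overlap > 0, plus
    an edge from every sentence to the one that follows it.
    """
    words = [set(s.lower().split()) for s in sentences]
    tf = Counter(w for s in sentences for w in s.lower().split())
    total_tf = sum(tf.values()) or 1
    title_words = set(title.lower().split())
    n = len(sentences)

    def weight(j):
        # Eq. (7): title overlap, term-frequency mass, position bonus, degree.
        overlap_title = len(title_words & words[j])
        tf_mass = (1 + sum(tf[w] for w in words[j])) / total_tf
        early = 1.0 / (1 + j)                      # assumed form of early(j)
        degree = sum(1 for k in range(n) if k != j and words[j] & words[k])
        return (1 + overlap_title) * tf_mass * early * (1 + degree)

    w = [weight(j) for j in range(n)]
    costs = {}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            overlap = len(words[i] & words[j])
            if overlap == 0 and j != i + 1:
                continue                   # no edge without overlap or adjacency
            denom = max(overlap, 1) * w[j] # adjacent, non-overlapping pairs count as 1
            costs[(i, j)] = (i - j) ** 2 / denom
    return costs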
Constructing the summary
When the graph has been constructed, the summary is
created by taking the shortest path that starts with the first
sentence of the original text and ends with the last sentence.
Since the original text also starts and ends in these
positions, this will hopefully give a smooth but shorter set
of sentences between these two points.
The N shortest paths are found by simply starting at the
start node and adding all paths of length one to a priority
queue, where the priority value is the total cost of a path.
The currently cheapest path is then examined and if it does
not end at the end node, all paths starting with this path and
containing one more edge are also added to the priority
queue. Paths with loops are discarded. Whenever the
currently shortest path ends in the end node, another
shortest path has been found, and the search is continued
until the N shortest paths have been found.
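The search described above maps naturally onto a priority queue. The sketch below (an illustrative reimplementation, not the authors' code) always expands the cheapest partial path first, discards paths containing loops, and stops once N complete paths from the first sentence to the last have been found.

import heapq

def n_shortest_paths(costs, start, end, n_paths=1):
    """Find up to n_paths loop-free paths of increasing total cost.

    costs: dict {(u, v): edge_cost}, as produced by build_cost_graph.
    Returns a list of (total_cost, path) tuples, cheapest first.
    """
    out_edges = {}
    for (u, v), c in costs.items():
        out_edges.setdefault(u, []).append((v, c))

    queue = [(0.0, [start])]           # priority is the total cost of the partial path
    found = []
    while queue and len(found) < n_paths:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == end:
            found.append((cost, path)) # another shortest path completed
            continue
        for nxt, c in out_edges.get(node, ()):
            if nxt in path:            # discard paths with loops
                continue
            heapq.heappush(queue, (cost + c, path + [nxt]))
    return found

The summary is then the sequence of sentences along the cheapest returned path; with n_paths > 1, the additional candidates can be compared or re-ranked by other criteria.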
III. COMPARISON
TextRank works well because it does not only rely on the
local context of a text unit (vertex), but rather it takes into
account information recursively drawn from the entire text
(graph). Through the graphs it builds on texts, TextRank
identifies connections between various entities in a text, and
implements the concept of recommendation. A text unit
recommends other related text units, and the strength of the
recommendation is recursively computed based on the
importance of the units making the recommendation. In the
process of identifying important sentences in a text, a
sentence recommends another sentence that addresses
similar concepts as being useful for the overall
understanding of the text. Sentences that are highly
recommended by other sentences are likely to be more
informative for the given text, and will be therefore given a
higher score.
An important aspect of TextRank is that it does not
require deep linguistic knowledge, nor domain or language
specific annotated corpora, which makes it highly portable
to other domains, genres, or languages.
The Shortest-path algorithm is easy to implement and
should be relatively language independent, though it was
only evaluated on English texts. The generated summaries are often somewhat "smooth" to read. This smoothness
is hard to quantify objectively, though, and the extracts are
by no means as smooth as a manually written summary.
When it comes to including the important facts from the
original text, the weighting of sentences using traditional
extraction weighting methods seems to be the most
important part. Taking a path from the first to the last
sentence does give a spread to the summary, making it more
likely that most parts of the original text that are important
will be included and making it unlikely that too much
information is included from only one part of the original
text.
IV. CONCLUSION
Automatic text summarization now refers to techniques that aim to generate summaries of texts. This
area of NLP research is becoming more common in the web
and digital library environment. In simple summarization
systems, parts of text – sentences or paragraphs – are
selected automatically based on some linguistic and/or
statistical criteria to produce the abstract or summary.
The shortest-path algorithm is better in this respect because it generates smoother summaries than the ranking algorithms.
Taking a path from the first to the last sentence does give a
spread to the summary, making it more likely that most
parts of the original text that are important will be included.
V. REFERENCES
[1] Satyajeet Raje, Sanket Tulangekar, Rajshekhar Waghe, Rohit Pathak, Parikshit Mahalle, "Extraction of Key Phrases from Document using Statistical and Linguistic analysis", 2009.
[2] Md. Nizam Uddin, Shakil Akter Khan, "A Study on Text Summarization Techniques and Implement Few of Them for Bangla Language", IEEE, 2007.
[3] Jonas Sjöbergh, Kenji Araki, "Extraction based summarization using a shortest path algorithm", Proceedings of the Annual Meeting of the Association for Natural Language Processing, 2006.
[4] Massih R. Amini, Nicolas Usunier, and Patrick Gallinari, "Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms", in D.E. Losada and J.M. Fernández-Luna (Eds.): ECIR 2005, LNCS 3408, pp. 142–156, 2005.
[5] Rada Mihalcea, "Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization", The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 170–173, Barcelona, Spain, 2004.
[6] R. Mihalcea and P. Tarau, "TextRank – bringing order into texts", 2004.
[7] R. Mihalcea, P. Tarau, and E. Figa, "PageRank on semantic networks, with application to word sense disambiguation", in Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, August 2004.
[8] C.Y. Lin and E.H. Hovy, "The potential and limitations of sentence extraction for summarization", in Proceedings of the HLT/NAACL Workshop on Automatic Summarization, Edmonton, Canada, May 2003.
[9] Chin-Yew Lin and Eduard Hovy, "Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics", in Udo Hahn and Donna Harman, editors, Proceedings of the 2003 Human Language Technology Conference, 2003.
[10] P.J. Herings, G. van der Laan, and D. Talman, "Measuring the power of nodes in digraphs", Technical report, Tinbergen Institute, 2001.
[11] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 1998.