Text Summarization and Singular Value Decomposition
Karel Ježek, Josef Steinberger

University of West Bohemia in Pilsen,
Department of Computer Science and Engineering,
Univerzitni 22, 306 14 Plzeň, Czech Republic
{jstein, jezek_ka}@kiv.zcu.cz
Abstract. In this paper we present the usage of singular value decomposition (SVD) in text summarization. Firstly, we mention the taxonomy of generic text summarization methods. Then we describe the principles of the SVD and its possibilities to identify semantically important parts of a text. We propose a modification of the SVD-based summarization which improves the quality of the generated extracts. In the second part we propose two new evaluation methods based on SVD, which measure the content similarity between an original document and its summary. In the evaluation part, our summarization approach is compared with five other available summarizers. To evaluate summary quality we used, apart from a classical content-based evaluator, both newly developed SVD-based evaluators. Finally, we study the influence of the summary length on its quality from the point of view of the three evaluation methods mentioned.
1 Introduction
The current huge amount of electronic information has to be reduced to enable users to handle it more effectively. Short summaries can be presented to users, for example, in place of the full-length documents found by a search engine in response to a user’s query. In section 2 we mention prior approaches to text summarization, and section 3 covers our recent research focus. In section 4 we describe a recently published method based on SVD, which we have further modified and improved. One of the most controversial areas in summarization research is the evaluation process. The next part of the article deals with the possibilities of summary evaluation; there we propose two new evaluation methods based on SVD, which measure the content similarity between an original document and its summary. At the end of the paper we present evaluation results and further research directions.
2 Approaches in Automatic Text Summarization
We will now present a brief overview of prior work in text summarization. We can begin with classical approaches, which include the use of surface-level indicators of informative relevance and corpus statistics that can be applied to unrestricted text. Luhn (1958) developed the first sentence extraction algorithm, which uses term frequencies to measure sentence relevance [7]. Kupiec et al. (1995) implemented a trainable Bayesian classifier that computes the probability that a sentence in a source document should be included in a summary [5]. The next group consists of methods which take text cohesion into account. An example is the lexical chains method, which searches for chains of context words in the text [6]. Ono et al. (1994) and Marcu (1997) made use of Rhetorical Structure Theory, a descriptive theory of text organization, as the basis for text summarization. The approach consists in the construction of a rhetorical tree for a given text [8]. Knowledge-intensive approaches are based on the extensive encoding of world knowledge about specific situations. These methods base the selection of information not on the surface-level properties of the text, but on expected information about a well-known situation. The next approach maps natural language into predefined, structured representations that, when instantiated, represent the key information from the original source (e.g. concept-based abstracting – Jones and Paice, 1992, [9]). While sentence extraction is currently a widespread and useful technique, more research in summarization is now moving towards summarization by generation. Jing and McKeown (2000) proposed a cut-and-paste strategy as a computational process of automatic abstracting and a sentence reduction strategy to produce concise sentences [10]. A relatively new approach to text summarization uses singular value decomposition.
3 Our Previous Summarization Research
Our recent research has focused mainly on the use of inductive machine learning methods for automatic document summarization. We analyzed various approaches to document summarization, using some existing algorithms and combining them with a novel use of itemsets. The resulting summarizer was evaluated by comparing the classification of original documents with that of their automatically generated summaries [3]. We then decided to investigate the possibilities of using singular value decomposition both in creating a summary and in its evaluation.
4 SVD-based Summarization
Yihong Gong and Xin Liu published the idea of using SVD in text summarization in 2001 [1]. The process starts with the creation of a term-by-sentence matrix A = [A_1, A_2, …, A_n], with each column vector A_i representing the weighted term-frequency vector of sentence i in the document under consideration. If there are a total of m terms and n sentences in the document, then we will have an m × n matrix A for the document. Since every word does not normally appear in each sentence, the matrix A is sparse.
Given an m × n matrix A, where without loss of generality m ≥ n, the SVD of A is defined as:

    A = U Σ V^T ,                                            (1)

where U = [u_ij] is an m × n column-orthonormal matrix whose columns are called left singular vectors; Σ = diag(σ_1, σ_2, …, σ_n) is an n × n diagonal matrix whose diagonal elements are non-negative singular values sorted in descending order; and V = [v_ij] is an n × n orthonormal matrix whose columns are called right singular vectors (see figure 1). If rank(A) = r, then (see [4]) Σ satisfies:

    σ_1 ≥ σ_2 ≥ … ≥ σ_r > σ_{r+1} = … = σ_n = 0 .             (2)

Fig. 1. Singular Value Decomposition
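For illustration, the following minimal NumPy sketch (not part of the original system; whitespace tokenization and plain term-frequency weighting are assumed here purely for the example) builds a term-by-sentence matrix A and computes its SVD:

import numpy as np

def term_sentence_matrix(sentences):
    # rows = terms, columns = sentences; entry = (weighted) term frequency
    vocab = sorted({t for s in sentences for t in s.lower().split()})
    row = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for t in s.lower().split():
            A[row[t], j] += 1.0
    return A, vocab

sentences = ["the cat sat on the mat",
             "the dog chased the cat",
             "dogs and cats often play together"]
A, vocab = term_sentence_matrix(sentences)

# A = U . Sigma . V^T; sigma holds the singular values in descending order
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)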
The interpretation of applying the SVD to the term-by-sentence matrix A can be made from two different viewpoints. From the transformation point of view, the SVD derives a mapping between the m-dimensional space spanned by the weighted term-frequency vectors and the r-dimensional singular vector space. From the semantic point of view, the SVD derives the latent semantic structure of the document represented by matrix A. This operation reflects a breakdown of the original document into r linearly independent base vectors or concepts. Each term and sentence from the document is jointly indexed by these base vectors/concepts. A unique SVD feature is that it is capable of capturing and modelling interrelationships among terms, so that it can semantically cluster terms and sentences. Furthermore, as demonstrated in [4], if a word combination pattern is salient and recurring in a document, this pattern will be captured and represented by one of the singular vectors. The magnitude of the corresponding singular value indicates the degree of importance of this pattern within the document. Any sentence containing this word combination pattern will be projected along this singular vector, and the sentence that best represents this pattern will have the largest index value with this vector. As each particular word combination pattern describes a certain topic/concept in the document, the facts described above naturally lead to the hypothesis that each singular vector represents a salient topic/concept of the document, and the magnitude of its corresponding singular value represents the degree of importance of that salient topic/concept.
Based on the above discussion, the authors of [1] proposed a summarization method which uses the matrix V^T. This matrix describes the degree of importance of each topic in each sentence. The summarization process chooses the most informative sentence for each topic: the k’th chosen sentence is the one with the largest index value in the k’th right singular vector of matrix V^T.
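A sketch of this selection rule, continuing the NumPy example above (the use of absolute values and the skipping of already selected sentences are our implementation choices, not specified in [1]):

import numpy as np

def gong_liu_select(Vt, summary_length):
    # the k-th row of Vt is the k-th right singular vector, i.e. the k-th topic
    chosen = []
    for k in range(min(summary_length, Vt.shape[0])):
        # sentence with the largest index value in the k-th right singular vector;
        # abs() copes with the arbitrary sign of singular vectors (our choice)
        for idx in np.argsort(-np.abs(Vt[k])):
            if idx not in chosen:          # do not select a sentence twice
                chosen.append(int(idx))
                break
    return sorted(chosen)                  # keep the original document order

# e.g. summary_sentence_ids = gong_liu_select(Vt, summary_length=2)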
5 Modified SVD-based Summarization
The summarization method described above has two significant disadvantages. First, it is necessary to use the same number of dimensions as the number of sentences we want to choose for the summary. However, the higher the number of dimensions of the reduced space, the less significant are the topics we take into the summary. This disadvantage turns into an advantage only when we know how many different topics the original document has and we choose the same number of sentences for the summary. The second disadvantage is that a sentence with large index values, but not the largest ones (it does not win in any dimension), will not be chosen, although its content would be very suitable for the summary.
In order to remove the discussed disadvantages, we propose the following modification of the SVD-based summarization method. Again we compute the SVD of the term-by-sentence matrix, obtaining the three matrices shown in figure 1. For each sentence vector in matrix V (with its components multiplied by the corresponding singular values) we compute its length. The reason for the multiplication is to favour the index values in matrix V that correspond to the highest singular values (the most significant topics). Formally:

    s_k = √( Σ_{i=1}^{n} v_{k,i}^2 · σ_i^2 ) ,                (3)

where s_k is the length of the vector of the k’th sentence in the modified latent vector space; it is also its significance score for summarization. Here n is the number of dimensions of the new space. This value is independent of the number of summary sentences (it is a parameter of the method). In our experiments we chose the dimensions whose singular values did not fall under half of the highest singular value (but a different strategy can be used). Finally, we put into the summary the sentences with the highest values in vector s.
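The following sketch (again NumPy; variable names are ours) implements the scoring of equation (3) together with the half-of-maximum dimensionality strategy mentioned above:

import numpy as np

def modified_svd_select(sigma, Vt, summary_length):
    # keep only dimensions whose singular value reaches half of the largest one
    r = int(np.sum(sigma >= sigma[0] / 2.0))
    V = Vt.T                               # rows of V are sentence vectors
    # s_k = sqrt( sum_i v_{k,i}^2 * sigma_i^2 ) over the retained dimensions
    scores = np.sqrt(((V[:, :r] ** 2) * (sigma[:r] ** 2)).sum(axis=1))
    best = np.argsort(-scores)[:summary_length]
    return sorted(int(i) for i in best)    # sentence indices in document order

# e.g. summary_sentence_ids = modified_svd_select(sigma, Vt, summary_length=2)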
6 Summary Evaluation Approaches
Evaluating automatic summarization in a standard and inexpensive way is a difficult task. It is an area as important as the summarization process itself, and that is why many evaluation approaches have been developed [2].
Co-selection measures include precision and recall of co-selected sentences. These methods require having a “right extract” at one’s disposal (against which we can compute precision and recall). We can obtain this extract in several ways. The most common way is to obtain several human (manual) extracts and to declare their average as the “ideal (right) extract”. However, obtaining human extracts is usually problematic. Another problem is that two manual summaries of the same input do not, in general, share many identical sentences.
We can overcome the above-discussed weakness of co-selection measures with content-based similarity measures. These methods compute the similarity between two documents at a more fine-grained level than just sentences. The basic method evaluates the similarity between the full-text document and its summary with the cosine similarity measure, computed by the following formula:

    cos(X, Y) = ( Σ_i x_i · y_i ) / √( Σ_i (x_i)^2 · Σ_i (y_i)^2 ) ,       (4)

where X and Y are representations based on the vector space model.
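A minimal sketch of this evaluator, assuming X and Y are term-frequency vectors built over the same (full-text) vocabulary:

import numpy as np

def cosine_similarity(x, y):
    # equation (4): dot product divided by the product of Euclidean norms
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0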
Relevance correlation is a measure for assessing the relative decrease in retrieval performance when indexing summaries instead of full documents [2].
Task-based evaluations measure human performance when using the summaries for a certain task (after the summaries are created). We can, for example, measure the suitability of using summaries instead of full texts for text categorization [3]. This evaluation requires a classified corpus of texts.
7 Using SVD in Summary Evaluation
We classify this new evaluation method in the content-based category because, like the classical cosine content-based approach (see section 6), it evaluates summary quality via the content similarity between a full text and its summary. Our method uses the SVD of the term-by-sentence matrix (see section 4), namely the matrix U. This matrix represents the degree of importance of terms in the salient topics/concepts. In the evaluation we measure the similarity between the matrix U derived from the SVD performed on the original document and the matrix U derived from the SVD performed on the summary. For appraising this similarity we have proposed two measures.
7.1 First Left Singular Vector Similarity
This method compares the first left singular vectors of the full-text SVD (i.e. the SVD performed on the original document) and the summary SVD (i.e. the SVD performed on the summary). These vectors correspond to the most salient word pattern in the full text and in its summary (we can call it the main topic). Then we measure the angle between the two first left singular vectors. They are normalized, so we can use the following formula:

    cos φ = Σ_{i=1}^{n} uf_i · ue_i ,                         (5)

where uf is the first left singular vector of the full-text SVD, ue is the first left singular vector of the summary SVD (its values, which correspond to particular terms, are ordered according to the full-text terms, with zeroes in place of terms missing from the summary), and n is the number of unique terms in the full text.
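A sketch of this measure; it assumes both SVDs were computed on matrices whose rows follow the full-text vocabulary (the summary matrix containing zero rows for terms that do not occur in the summary), so that the two first left singular vectors are directly comparable:

import numpy as np

def first_vector_similarity(U_full, U_summary):
    uf = U_full[:, 0]      # first left singular vector of the full-text SVD
    ue = U_summary[:, 0]   # first left singular vector of the summary SVD
    # both columns are unit vectors, so their dot product is cos(phi);
    # abs() guards against the arbitrary sign of singular vectors (our choice)
    return float(abs(np.dot(uf, ue)))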
7.2 U.Σ-based Similarity
This evaluation method compares a summary with the original document from the angle of the n most salient topics. We propose the following process:

1. Perform the SVD on the document matrix (see section 4).
2. For each term vector in matrix U (with its components multiplied by the corresponding singular values) compute its length. The reason for the multiplication is to favour the index values in matrix U that correspond to the highest singular values (the most significant topics). Formally:

       s_k = √( Σ_{i=1}^{n} u_{k,i}^2 · σ_i^2 ) ,             (6)

   where s_k is the length of the k’th term vector in the modified latent vector space and n is the number of dimensions of the new space. In our experiments we chose the dimensions whose singular values did not fall under half of the highest singular value (by analogy to the summarization method described above).
3. From the lengths of the term vectors (s_k), create a resulting term vector, whose index values hold information about the significance of each term in the modified latent space (see figure 2).
4. Normalize the resulting vector.

This process is performed on the original document and on its summary (using the same number of dimensions, determined by the summary; see figure 2). As a result, we get one vector corresponding to the term vector lengths of the full text and one corresponding to those of its summary. As a similarity measure we again use the angle between the resulting vectors (see 7.1).
Fig. 2. Creation of the resulting term vectors of a full-text and a summary
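A sketch of the whole procedure (NumPy; names are ours); as described above, the number of retained dimensions is derived from the summary SVD with the half-of-maximum rule and then used for both documents, and both U matrices are assumed to be aligned to the full-text vocabulary:

import numpy as np

def term_significance_vector(U, sigma, r):
    # equation (6): length of each scaled term vector, then normalization
    s = np.sqrt(((U[:, :r] ** 2) * (sigma[:r] ** 2)).sum(axis=1))
    return s / np.linalg.norm(s)

def u_sigma_similarity(U_full, sigma_full, U_summary, sigma_summary):
    # retained dimensions chosen according to the summary's singular values
    r = int(np.sum(sigma_summary >= sigma_summary[0] / 2.0))
    sf = term_significance_vector(U_full, sigma_full, r)
    se = term_significance_vector(U_summary, sigma_summary, r)
    return float(np.dot(sf, se))   # cosine of the angle between unit vectors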
This evaluation method has the following advantage over the previous one. Suppose an original document contains two topics of virtually the same significance (their corresponding singular values are almost equal). When the second significant topic outweighs the first one in a summary, the main topic of the summary will not be consistent with the main topic of the original. Taking more singular vectors (rather than just one) into account removes this weakness.
8 Experiments
8.1 Testing Collection
We tested our document summarizer on the Reuters Corpus Volume 1 (RCV1) collection (the first “official” Reuters corpus released to the research community, containing over 800 thousand documents). We prepared a collection by selecting RCV1 documents with a length of at least 20 sentences. The selected documents had to be suitable for the summarization task. Table 1 contains details about our collection.
Table 1. Testing collection – details
Number of documents 127
Minimum number of sentences in document 20
Maximum number of sentences in document 68
Average number of sentences per document 28
Average number of words per document 724
Average number of significant words per document 287
Average number of distinct significant words per document 187
8.2 Results and Discussion
We evaluated the following summarizers:
Gong + Liu SVD summarizer (SVD–G+L)
SVD summarizer based on our approach (SVD–OUR)
RANDOM – evaluation based on the average of 10 random extracts
LEAD – first n sentences
1-ITEMSET – based on the itemsets method [3]
TF.IDF – based on the frequency method [3]
These summarizers were evaluated by the following three evaluation methods:
Cosine similarity – classical content-based method
SVD similarity – First left singular vector similarity
SVD similarity – U.Σ-based similarity
The summarization ratio was set to 20 %. The results are presented in the following table. The values are averages of the cosines of the angles between a full text and its summary.
Table 2. Summary quality evaluation

Evaluator                         SVD-G+L   SVD-OUR   RANDOM   LEAD    1-ITEMSET   TF.IDF
Cosine similarity                 0.761     0.765     0.663    0.753   0.759       0.753
First left sing. vector simil.    0.751     0.787     0.488    0.730   0.764       0.758
U.Σ-based similarity              0.824     0.851     0.542    0.771   0.817       0.803
The classical cosine evaluator shows only small differences between the summarizers (the best summarizer scores 0.77 and the worst (random) 0.65). This is caused by the shallow level of this evaluation method, which takes into account only term counts in the compared documents. The evaluation based on SVD is a more fine-grained approach; it is possible to say that it evaluates a summary via term co-occurrences in sentences.
Figures 3–5 show the dependence of summary quality on the summarization ratio under each of the evaluation methods, for our SVD-based summarizer and the random summarizer.
Fig. 3. Cosine similarity evaluation (summary quality [cos] vs. summarization ratio [%]; SVD-OUR vs. RANDOM)

Fig. 4. First singular vector similarity evaluation (summary quality [cos] vs. summarization ratio [%]; SVD-OUR vs. RANDOM)

Fig. 5. U.Σ-based similarity evaluation (summary quality [cos] vs. summarization ratio [%]; SVD-OUR vs. RANDOM)
In the evaluation by the first left singular vector we observed the disadvantage discussed in section 7.2 (it appeared in 10 % of the documents). The U.Σ-based evaluation removes this weakness. There is also a large difference between the random summarizer and the other summarizers. We further observed that the advantage of our SVD summarizer is most pronounced under the third evaluator (the U.Σ-based similarity). This behaviour was expected.
9 Conclusion
This paper introduced a new approach to automatic text summarization and summary evaluation. The practical tests showed that our summarization method outperforms the other examined methods. Our other experiments showed that SVD is very sensitive to the stoplist and to the lemmatization process. Therefore, we are working on improved versions of lemmatizers for English and Czech. In future research we plan to try other weighting schemes and a normalization of the sentence vectors on the SVD input. Of course, further evaluations are needed, especially on longer texts than the Reuters documents. Our final goal is to integrate our summarizer into a natural language processing system capable of searching and presenting web documents in a concise and coherent form.
This work has been partly supported by grants No. MSM 235200005 and ME494.
References
1. Gong, Y., Liu, X.: Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, United States (2001) 19-25
2. Radev, D., Teufel, S., Saggion, H., Lam, W., Blitzer, J., Qi, H., Celebi, A., Liu, D., Drabek, E.: Evaluation Challenges in Large-scale Document Summarization. Proceedings of the 41st Meeting of the Association for Computational Linguistics, Sapporo, Japan (2003) 375-382
3. Hynek, J., Ježek, K.: Practical Approach to Automatic Text Summarization. Proceedings of the ELPUB ’03 Conference, Guimaraes, Portugal (2003) 378-388
4. Berry, M.W., Dumais, S.T., O’Brien, G.W.: Using Linear Algebra for Intelligent Information Retrieval. SIAM Review (1995)
5. Kupiec, J., Pedersen, J., Chen, F.: A Trainable Document Summarizer. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, United States (1995) 68-73
6. Barzilay, R., Elhadad, M.: Using Lexical Chains for Text Summarization. Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS'97), ACL, Madrid, Spain (1997)
7. Luhn, H.P.: Automatic Creation of Literature Abstracts. IBM Journal of Research and Development 2(2) (1958) 159-165
8. Marcu, D.: From Discourse Structures to Text Summaries. Proceedings of the ACL’97/EACL’97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain (1997) 82-88
9. Jones, P.A., Paice, C.D.: A ‘select and generate’ Approach to Automatic Abstracting. Proceedings of the 14th British Computer Society Information Retrieval Colloquium, Springer Verlag (1992) 151-154
10. Jing, H., McKeown, K.: Cut and Paste Text Summarization. Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, Seattle, Washington, USA (2000) 178-185