IEEE International Conference on Computer, Communication, and Signal Processing (ICCCSP-2017)

A Survey on Extractive Text Summarization

N. Moratanch*, S. Chitrakala†
*Research Scholar, †Associate Professor
*†Department of CSE, Anna University, CEG, Chennai
*tancyanbil@gmail.com, †au.chitras@gmail.com
Abstract-Text summarization is the process of obtaining salient information from an authentic text document. The extracted information is produced as a summarized report and presented to the user as a concise summary. It is very difficult for humans to manually understand and describe the content of large text documents. Text summarization techniques are classified into abstractive and extractive summarization. The extractive summarization technique focuses on choosing the important sentences, paragraphs and other units that reproduce the original document in a more precise form. The importance of sentences is determined based on linguistic and statistical features. In this work, a comprehensive review of the extractive text summarization process and its methods is presented. The paper reviews the major techniques, popular benchmarking datasets and challenges of extractive summarization, with a focus on methods that yield summaries that are less redundant, cohesive, coherent and information rich.

Index Terms-Text Summarization, Unsupervised Learning, Supervised Learning, Sentence Fusion, Extraction Scheme, Sentence Revision, Extractive Summary
I. INTRODUCTION
In recent years, the significance of text summarization [1] has attracted more attention due to data inundation on the web. This information overload creates a strong need for more reliable and capable automatic text summarizers. Text summarization gains its importance from its wide range of applications, such as summaries of books, digests (summaries of stories), stock market and news reports, highlights (meetings, events, sport), abstracts of scientific papers, newspaper articles and magazines. Owing to this tremendous growth, several leading groups, such as the Faculty of Informatics at Masaryk University (Czech Republic), the Semantic Software Lab at Concordia University (Montreal, Canada), the IHR Nexus Lab at Arizona State University (Arizona, USA) and the Topic Maps Lab at Leipzig University (Germany), have been persistently working on its rapid enhancement.
Text summarization has grown into a crucial and convenient tool for supporting and illustrating text content in today's rapidly growing information age. It is very complex for humans to manually summarize oversized text documents. There is a wealth of textual content available on the internet, but the internet usually provides more data than is desired. A twin problem is therefore observed: searching for the relevant documents among an overwhelming number of documents on offer, and absorbing a high volume of important information. The objective of automatic text summarization is to condense the source text into a precise version that preserves
its content and overall meaning. The main advantage of text summarization is that the reading time of the user can be reduced. A good text summarization system should reproduce the diverse themes of the document while keeping repetition to a minimum. Text summarization methods are broadly classified into abstractive and extractive summarization.
An extractive summarization technique consists of selecting vital sentences, paragraphs and other units from the original manuscript and concatenating them into a shorter form. The significance of sentences is strongly based on the statistical and linguistic features of the sentences. This paper summarizes the extensive methodologies adopted, the issues encountered, and exploration and future directions in text summarization. The paper is organized as follows: Section 2 describes the features used for extractive text summarization, Section 3 describes extractive text summarization methods, Section 4 presents the inferences made, Section 5 details the evaluation metrics, Section 6 discusses challenges and future research directions, and the final section concludes the paper.
II. FEATURES FOR EXTRACTIVE TEXT SUMMARIZATION
Earlier techniques assign a score to each sentence based on a set of features that are predefined according to the methodology applied. Both word-level and sentence-level features are employed in the text summarization literature. The features discussed below [2] [3] [4] are used to select the sentences to be included in the summary.
1. WORD LEVEL FEATURES
1.1 Content Word feature
Keywords are essential in identifying the importance of a sentence. A sentence that contains the main keywords is most likely to be included in the final summary. The content (keyword) words are nouns, verbs, adjectives and adverbs, and they are commonly determined based on the tf × idf measure.
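The sketch below illustrates one way the content-word feature can be computed: each sentence is scored by the total tf × idf weight of its words, so sentences that carry many strong keywords score highest. The use of scikit-learn, the English stop-word list and the toy sentences are illustrative assumptions, not part of the original paper.

```python
# Minimal sketch: scoring sentences by the tf-idf weight of their content words.
# Tokenisation and the use of scikit-learn are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def content_word_scores(sentences):
    """Return one tf-idf based keyword score per sentence."""
    vectorizer = TfidfVectorizer(stop_words="english")   # drop function words
    tfidf = vectorizer.fit_transform(sentences)          # rows = sentences
    # The sum of tf-idf weights of a sentence approximates how many strong
    # keywords it contains.
    return np.asarray(tfidf.sum(axis=1)).ravel()

sentences = [
    "Automatic text summarization condenses a document.",
    "The weather was pleasant yesterday.",
    "Extractive summarization selects salient sentences from the document.",
]
for s, score in zip(sentences, content_word_scores(sentences)):
    print(f"{score:.2f}  {s}")
```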
1.2 Title Word feature
Sentences in the original document that contain words mentioned in the title have a greater chance of contributing to the final summary, since such words serve as indicators of the theme of the document.
1.3 Cue phrase feature
Cue phrases are words and phrases that indicate the structure of the document flow, and they are used as a feature in sentence selection. Sentences that contain cue phrases (e.g. "denouement", "because", "this information", "summary", "develop", "desire") are most likely to be included in the final summary.
1.4 Biased word feature
Sentences that contain biased words are more likely to be important. The biased words are a predefined list of words that may be domain specific; they are relatively important words that describe the theme of the document.
1.5 Upper case word feature
Words in uppercase, such as "UNICEF", are considered important words, and sentences containing these words are regarded as important in the context of sentence selection for the final summary.
2. SENTENCE LEVEL FEATURES
2.1 Sentence location feature
Sentences that occur at the beginning and in the conclusion of the document are most likely important, since most documents are hierarchically structured, with important information at the beginning and the end of paragraphs.
2.2 Sentence length feature
Sentence length plays an important role in identifying key sentences. Very short sentences do not convey essential information, and very long sentences also need not be included in the summary. The normalized length of a sentence is calculated as the ratio of the number of words in the sentence to the number of words in the longest sentence in the document.
2.3 Paragraph location feature
Similar to sentence location, paragraph location also plays a crucial role in selecting key sentences. A higher score is assigned to paragraphs in the peripheral sections (the beginning and end paragraphs of the document).
2.4 Sentence-to-sentence cohesion
For every sentence s, the similarity between s and each of the other sentences is calculated; these similarities are summed to obtain a raw value of this feature for s. The feature values are normalized to [0, 1], where a value closer to 1.0 indicates a higher degree of cohesion between sentences.
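As an illustration of the sentence-level features above, the following sketch computes position, normalized length and a cosine-similarity based cohesion score for each sentence. The exact position formula and the use of scikit-learn are assumptions made for the example, not the definitions of the original papers.

```python
# Minimal sketch of three sentence-level features described above:
# position, normalised length, and sentence-to-sentence cohesion.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sentence_level_features(sentences):
    n = len(sentences)
    longest = max(len(s.split()) for s in sentences)

    # Position: sentences near the beginning or the end score higher.
    position = [max(1 - i / n, (i + 1) / n) for i in range(n)]

    # Normalised length: words in the sentence / words in the longest sentence.
    length = [len(s.split()) / longest for s in sentences]

    # Cohesion: sum of similarities to all other sentences, scaled to [0, 1].
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    raw = sim.sum(axis=1) - 1.0            # exclude self-similarity
    cohesion = raw / raw.max() if raw.max() > 0 else raw

    return list(zip(position, length, cohesion))

sents = [
    "Text summarization condenses a document into a short summary.",
    "The summary keeps the most important sentences.",
    "Unrelated filler text about the weather.",
]
print(sentence_level_features(sents))
```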
III. EXTRACTIVE TEXT SUMMARIZATION METHODS
Extractive text summarization methods can be broadly classified into unsupervised learning and supervised learning methods. Recent works rely mostly on unsupervised learning methods for text summarization.
Fig. 1. Overview of extractive summarization
A. UNSUPERVISED LEARNING METHODS
In this section, unsupervised techniques for the sentence extraction task are discussed. Unsupervised approaches do not need human summaries (user input) to decide the important features of the document; instead, they require more sophisticated algorithms to compensate for the lack of human knowledge. Unsupervised summarizers provide a higher level of automation than supervised models and are more suitable for processing big data. Unsupervised learning models have proved successful in the text summarization task.
Fig. 2. Overview of unsupervised learning methods
1. Graph based approach
Graph-based models are extensively used in document summarization, since graphs can efficiently represent the document structure. Extractive text summarization using external knowledge from Wikipedia, incorporated into a bipartite graph framework, has been proposed in [4]. The authors propose an iterative ranking algorithm (a variation of the HITS algorithm [5]) which is efficient in selecting important sentences and also ensures coherency in the final summary. The uniqueness of this work is that it combines both graph-based and concept-based approaches to the summarization task. Another graph-based approach is LexRank [6], where the salience of a sentence is determined by the concept of eigenvector centrality. The sentences of the document are represented as a graph, and the edges between sentences carry weighted cosine similarity values. The sentences are clustered into groups based on their similarity measures and are then ranked based on their LexRank scores, similar to the PageRank algorithm [7], except that the similarity graph is undirected in the LexRank method. The method outperforms earlier lead-based and centroid-based approaches. The performance of the system is evaluated on the DUC dataset [8].
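A minimal LexRank-style sketch is given below: sentences become nodes of an undirected graph weighted by cosine similarity, and PageRank supplies the salience scores. The similarity threshold, the tf-idf representation and the use of networkx/scikit-learn are illustrative assumptions rather than the exact setup of [6].

```python
# Minimal LexRank-style sketch: sentence graph weighted by cosine similarity,
# ranked with PageRank. Threshold and libraries are illustrative assumptions.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_summary(sentences, num_sentences=2, threshold=0.1):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)

    graph = nx.Graph()                      # undirected, as in LexRank
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > threshold:       # keep only sufficiently similar pairs
                graph.add_edge(i, j, weight=sim[i, j])

    scores = nx.pagerank(graph, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(ranked)]   # restore document order

docs = [
    "Text summarization condenses documents.",
    "Summarization keeps the most important sentences of a document.",
    "The weather was sunny and warm.",
    "Important sentences form the extractive summary.",
]
print(lexrank_summary(docs))
```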
2. Fuzzy logic based approach
The fuzzy logic approach mainly contains four components: a fuzzifier, a defuzzifier, a fuzzy knowledge base and an inference engine. The textual characteristics given as input to the fuzzy logic approach are sentence length, sentence similarity and so on, which are later passed to the fuzzy system [9] [10].
TABLE I
SUPERVISED AND UNSUPERVISED LEARNING METHODS FOR TEXT SUMMARIZATION

Supervised learning approaches, Machine learning approach (Bayes rule):
  Concept: the summarization task is modelled as a classification problem.
  Advantages: a large set of training data improves sentence selection for the summary.
  Limitations: human intervention is required for generating the manual summaries.

Supervised learning approaches, Artificial neural network:
  Concept: a trainable summarizer; a neural network is trained, pruned and generalized to filter sentences and classify them as "summary" or "non-summary" sentences.
  Advantages: the network can be trained according to the style of the human reader; the set of features can be altered to reflect the user's needs and requirements.
  Limitations: 1) the neural network is slow in the training phase and also in the application phase; 2) it is difficult to determine how the network makes its decisions; 3) human intervention is required for the training data.

Supervised learning approaches, Conditional Random Fields (CRF):
  Concept: a statistical modelling approach which treats summarization with CRF as a sequence labelling problem.
  Advantages: identifies correct features, provides a better representation of sentences and groups terms appropriately into segments.
  Limitations: 1) domain specific, requiring an external domain-specific corpus for the training step; 2) linguistic features are not considered.

Unsupervised learning approaches, Graph based approach:
  Concept: construction of a graph to capture the relationships between sentences.
  Advantages: 1) captures redundant information; 2) improves coherency.
  Limitations: does not address issues such as the dangling anaphora problem.

Unsupervised learning approaches, Concept oriented approach:
  Concept: the importance of sentences is calculated based on concepts retrieved from an external knowledge base (Wikipedia, HowNet).
  Advantages: incorporation of similarity measures to reduce redundancy.
  Limitations: dangling anaphora and verb referents are not considered.

Unsupervised learning approaches, Fuzzy logic based approach:
  Concept: summarization based on fuzzy rules using various sets of features.
  Advantages: improved quality of the summary by maintaining coherency.
  Limitations: depends on the membership functions and the working of the fuzzy logic system.
Fig. 3. Example of a sentence-concept bipartite graph proposed in [4]
Ladda Suanmali et al. [11] proposed a fuzzy logic approach for automatic text summarization. In the initial step, the text document is pre-processed, followed by feature extraction (title feature, sentence length, sentence position, sentence-to-sentence similarity, term weight, proper nouns and numerical data). The summary is generated by ordering the ranked sentences in the order in which they occur in the original document, so as to maintain coherency. The proposed method shows improvement in the quality of summarization, but issues such as dangling anaphora are not handled.
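To make the idea concrete, the sketch below implements a toy fuzzy inference step in plain Python: two features are fuzzified with triangular membership functions, combined with simple IF-THEN rules, and defuzzified into a single importance score. The membership functions, the rules and the representative output values are illustrative assumptions, not the rule base used in [10] or [11].

```python
# Toy fuzzy inference for sentence scoring: fuzzification, rule evaluation,
# and weighted-average defuzzification. All values are illustrative.

def triangular(x, a, b, c):
    """Triangular membership function with peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_sentence_score(title_overlap, norm_length):
    # Fuzzification of each feature into "low" and "high" sets.
    low_t  = triangular(title_overlap, -0.5, 0.0, 0.5)
    high_t = triangular(title_overlap,  0.5, 1.0, 1.5)
    low_l  = triangular(norm_length,   -0.5, 0.0, 0.5)
    high_l = triangular(norm_length,    0.5, 1.0, 1.5)

    # Rule base (min = AND), e.g. "IF title overlap is high AND length is high
    # THEN importance is high".
    importance_high = min(high_t, high_l)
    importance_med  = min(high_t, low_l)
    importance_low  = min(low_t,  low_l)

    # Weighted-average defuzzification onto representative scores 0.9/0.5/0.1.
    weights = importance_high + importance_med + importance_low
    if weights == 0:
        return 0.0
    return (0.9 * importance_high + 0.5 * importance_med
            + 0.1 * importance_low) / weights

print(fuzzy_sentence_score(title_overlap=0.8, norm_length=0.7))  # high importance
```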
3. Concept-based approach
In the concept-based approach, concepts are extracted from a piece of text using an external knowledge base such as HowNet [12] or Wikipedia [4]. In the methodology proposed in [12], the importance of sentences is calculated based on the concepts retrieved from HowNet instead of words. A conceptual vector model is built to obtain a rough summarization, and similarity measures are calculated between the sentences to reduce redundancy in the final summary. A good summarizer focuses on higher coverage and lower redundancy.
Fig. 4. Overall architecture of text summarization based on the fuzzy logic approach proposed in [10]
Fig. 5. Example of concepts retrieved for a sentence from Wikipedia as proposed in [4]
Ramanathan et al. [4] proposed a Wikipedia-based summarization which utilizes a graph structure to produce summaries. The method uses Wikipedia to obtain concepts for each sentence and builds a sentence-concept bipartite graph, as already mentioned under the graph-based approach. The basic steps in concept-based summarization are: i) retrieve the concepts of a text from an external knowledge base (HowNet, WordNet, Wikipedia); ii) build a conceptual vector or graph model to depict the relationships between concepts and sentences; iii) apply a ranking algorithm to score the sentences; iv) generate summaries based on the ranking scores of the sentences.
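The following sketch covers steps ii)-iv) under simplifying assumptions: a toy dictionary stands in for the concepts retrieved from the external knowledge base, the bipartite graph is built with networkx, and the HITS hub scores of the sentence nodes are used for ranking. None of these implementation choices are taken from [4] itself.

```python
# Minimal sketch: sentence-concept bipartite graph ranked with HITS.
# The toy concept lookup replaces a real knowledge base (Wikipedia/HowNet).
import networkx as nx

def rank_by_concepts(sentence_concepts, num_sentences=1):
    """sentence_concepts: dict mapping each sentence to its retrieved concepts."""
    graph = nx.Graph()
    for sentence, concepts in sentence_concepts.items():
        for concept in concepts:                       # sentence -- concept edges
            graph.add_edge(("S", sentence), ("C", concept))

    hubs, authorities = nx.hits(graph, max_iter=500)   # iterative ranking
    sentence_scores = {node[1]: score for node, score in hubs.items()
                       if node[0] == "S"}
    return sorted(sentence_scores, key=sentence_scores.get,
                  reverse=True)[:num_sentences]

toy = {
    "The central bank raised interest rates.": ["monetary policy", "bank", "economy"],
    "Rates affect mortgages and savings.":     ["economy", "mortgage"],
    "The match ended in a draw.":              ["sport"],
}
print(rank_by_concepts(toy))
```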
4. Latent Semantic Analysis (LSA) method
Latent Semantic Analysis (LSA) [13] [14] is a method that extracts hidden semantic structures of sentences and words, and it is popularly used in the text summarization task. It is an unsupervised learning approach that does not demand any sort of external or training knowledge. LSA captures the text of the input document and extracts information such as which words frequently occur together and which words are commonly seen in different sentences. A high number of common words amongst sentences indicates that the sentences are semantically related. Singular Value Decomposition (SVD) [13] is used to find the interrelations between words and sentences; it also has a noise-reduction capability, which helps to improve accuracy. SVD [15], when applied to document-word matrices, can group documents that are semantically related to one another even when they share no common words. Sets of words that occur in connected text are likewise placed close together in the same latent dimensional space. The LSA technique is applied to extract the topic-related words and the most important content-conveying sentences from a document. The advantage of adopting LSA vectors for summarization over plain word vectors is that conceptual relations, as represented in the human brain, are naturally captured by LSA. Choosing a representative sentence from every latent dimension ensures both the relevance of the selected sentences to the document and non-redundancy. LSA works with the text data alone, and the principal dimensions corresponding to the collection of topics can be located.
Consider the following example of an LSA representation. Example 1: the following three sentences are given to an LSA-based system.
d0: 'The man walked the dog.'
d1: 'The man took the dog to the park in the evening.'
d2: 'The dog went to the park in the evening.'
From Fig. 6 [13] it can be noted that d1 is more closely associated with d2 than with d0, and that the word 'walked' is linked to the word 'man' but is not significant for the word 'park'. These kinds of interpretations can be built using the input data and LSA alone, without the need for any external knowledge.
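A minimal LSA sketch for Example 1 is given below: a term-document count matrix is reduced with truncated SVD and the latent vectors are compared with cosine similarity, so d1 is expected to come out closer to d2 than to d0. The use of scikit-learn and the choice of two latent dimensions are illustrative assumptions.

```python
# Minimal LSA sketch for Example 1: truncated SVD over a term-document matrix,
# then cosine similarity in the latent space. No external knowledge is used.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The man walked the dog.",                            # d0
    "The man took the dog to the park in the evening.",   # d1
    "The dog went to the park in the evening.",           # d2
]

terms = CountVectorizer().fit_transform(docs)          # term-document counts
latent = TruncatedSVD(n_components=2).fit_transform(terms)

sim = cosine_similarity(latent)
print(f"sim(d1, d2) = {sim[1, 2]:.2f}")   # expected to exceed sim(d1, d0)
print(f"sim(d1, d0) = {sim[1, 0]:.2f}")
```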
B. SUPERVISED LEARNING METHODS
Supervised extractive summarization techniques are based on a classification approach at the sentence level, where the system learns from examples to classify sentences as summary or non-summary sentences. The major drawback of the supervised approach is that it requires manually created human summaries to label the sentences of the original training documents as "summary sentence" or "non-summary sentence", and it also requires a large amount of labelled training data for classification.
Fig. 6. Representation of LSA for Example 1 [13]
Fig. 7. Overview of supervised learning methods
1. Machine learning approach based on Bayes rule
A set of training documents, along with their extractive summaries, is fed as input to the training stage. The machine learning approach views text summarization as a classification problem: sentences are classified as summary or non-summary sentences based on the features they possess. The classification probabilities are learned from the training data using the following Bayes rule [16]:

$$P(s \in S \mid f_1, f_2, \ldots, f_N) = \frac{P(f_1, f_2, \ldots, f_N \mid s \in S)\, P(s \in S)}{P(f_1, f_2, \ldots, f_N)}$$

where s represents a sentence of the document, f_1, ..., f_N represent the features used in the classification stage and S represents the set of sentences in the summary. P(s ∈ S | f_1, f_2, ..., f_N) is the probability that sentence s is included in the summary, given the features it possesses.
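A minimal sketch of this idea follows: each sentence is reduced to a small feature vector and a naive Bayes classifier estimates the probability of it being a summary sentence. The three features, the toy training data and the use of scikit-learn are illustrative assumptions, not the feature set of [16].

```python
# Minimal sketch of the Bayes-rule classifier: naive Bayes over sentence
# feature vectors labelled summary / non-summary.
from sklearn.naive_bayes import GaussianNB

# Toy training data: [position score, normalised length, title-word overlap]
X_train = [
    [1.00, 0.90, 0.6],   # opening sentence, long, shares title words
    [0.20, 0.30, 0.0],   # middle sentence, short, off-topic
    [0.95, 0.70, 0.4],
    [0.10, 0.40, 0.1],
]
y_train = [1, 0, 1, 0]   # 1 = summary sentence, 0 = non-summary sentence

model = GaussianNB().fit(X_train, y_train)

X_new = [[0.90, 0.80, 0.5]]
print(model.predict_proba(X_new))   # [P(non-summary), P(summary)] for the sentence
```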
2. Neural network based approach

Fig. 8. Neural network after training (a) and after pruning (b) [17]
The approach proposed in [18] uses the RankNet algorithm with neural networks to automatically identify the important sentences in a document. It uses a two-layer neural network with back-propagation, trained using the RankNet algorithm. The first step involves labelling the training data using a machine learning approach; features of the sentences in both the test and training sets are then extracted and fed to the neural network system to rank the sentences in the document. Another approach [17] uses a three-layered feed-forward neural network which, in the training stage, learns the characteristics of summary and non-summary sentences. The major phase is the feature fusion phase, where the relationships between the features are identified in two stages: 1) eliminating infrequent features and 2) collapsing frequent features, after which sentence ranking is performed to identify the important summary sentences. The neural network of [17] after feature fusion is depicted in Fig. 8. Dharmendra Hingu, Deep Shah and Sandeep S. Udmale proposed an extractive approach [19] for summarizing Wikipedia articles by identifying text features and scoring the sentences with a neural network model [5]. This system takes Wikipedia articles as input, followed by tokenisation and stemming. The pre-processed passage is sent to the feature extraction step, which is based on multiple features of sentences and words. The scores obtained after feature extraction are fed to the neural network, which produces a single output score signifying the importance of each sentence. The usage of the words and sentences is not considered while assigning the weights, which results in lower accuracy.
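The sketch below shows the general flavour of these systems: a small feed-forward network (here a scikit-learn MLP) is trained on sentence feature vectors labelled summary/non-summary and then used to score new sentences. The architecture, features and toy data are assumptions for illustration and do not reproduce the networks of [17]-[19].

```python
# Minimal feed-forward sentence ranker: train an MLP on labelled sentence
# feature vectors, then score candidate sentences for the summary.
from sklearn.neural_network import MLPClassifier

# Toy features per sentence: [position, normalised length, tf-idf keyword score]
X_train = [
    [1.00, 0.85, 0.70],
    [0.15, 0.30, 0.10],
    [0.90, 0.60, 0.55],
    [0.25, 0.45, 0.20],
]
y_train = [1, 0, 1, 0]   # 1 = summary sentence, 0 = non-summary sentence

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)

# Score candidate sentences; the highest-scoring ones go into the summary.
scores = net.predict_proba([[0.80, 0.70, 0.60], [0.10, 0.20, 0.05]])[:, 1]
print(scores)
```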
3. Conditional Random Fields
Conditional Random Fields (CRF) are a statistical modelling approach from machine learning that provides structured prediction. The proposed system overcomes the issues faced by non-negative matrix factorization (NMF) methods by incorporating conditional random fields to identify and extract the correct features for determining the important sentences of a given text. The main advantage of the method is that it identifies correct features, provides a better representation of sentences and groups terms appropriately into segments. The major drawback is that it is domain specific and requires an external domain-specific corpus for the training step; thus the method cannot be applied generically to any document without building a domain corpus, which is a time-consuming task. The approach specified in [20] treats summarization with CRF as a sequence labelling problem; it captures interactions between sentences through the features extracted for each sentence and also incorporates complex features such as LSA scores [21] and HITS scores [22], but its limitation is that linguistic features are not considered.
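Following the sequence-labelling formulation of [20], the sketch below tags every sentence of a document as "S" (summary) or "N" (non-summary) with a linear-chain CRF. The feature templates, the toy document and the use of the sklearn-crfsuite package are illustrative assumptions, not the feature set of the original system.

```python
# Minimal sketch of summarization as sequence labelling with a CRF:
# each document is a sequence of sentences tagged "S" (summary) or "N".
import sklearn_crfsuite

def sentence_features(doc, i):
    """Features for sentence i, including context from the previous sentence."""
    feats = {
        "position": i / len(doc),
        "length": len(doc[i].split()),
        "has_title_word": "summarization" in doc[i].lower(),
    }
    if i > 0:
        feats["prev_length"] = len(doc[i - 1].split())  # interaction between sentences
    return feats

train_doc = [
    "Text summarization condenses documents.",
    "The weather was pleasant.",
    "Extractive summarization selects salient sentences.",
]
X_train = [[sentence_features(train_doc, i) for i in range(len(train_doc))]]
y_train = [["S", "N", "S"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```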
IV. INFERENCES MADE

- Abundant variations of the extractive approach [15] have been explored over the past ten years; however, it is difficult to say how much each analytical improvement (at the sentence or text level) contributes to the final result.
- Without deeper NLP, the produced summary may suffer from a lack of semantics and cohesion. For texts consisting of numerous topics, the generated summary may not be balanced.
- Deciding proper weights for the respective features is vital, since the quality of the final summary depends on it. Feature weights should be given more importance because they play a major role in choosing key sentences.
- In text summarization, the most challenging task is to summarize the content from a number of textual and semi-structured sources, including web pages and databases, in the proper form (size, format, time, language) for a specific user.
- Text summarization software should produce an effective summary with a small amount of redundancy and in little time. Summaries are appraised using intrinsic or extrinsic measures: intrinsic measures attempt to assess summary quality using human evaluation, whereas extrinsic measures assess it through a task-based measure such as an information retrieval oriented task.
V. EVALUATION METRICS

Numerous benchmarking datasets [1] are used for the experimental evaluation of extractive summarization. The Document Understanding Conferences (DUC) collections are the most common benchmarking datasets used for text summarization. There are a number of datasets such as TIPSTER, TREC, TAC, DUC and CNN. They contain documents along with their summaries, which are created automatically, manually or from submitted system summaries [20]. From the papers surveyed in the previous sections, it has been found in the literature that agreement between human summarizers is quite low, both for evaluating and for generating summaries; beyond the form of the summary, it is difficult to judge the summary content.
i) Human evaluation
Human judgment usually shows wide variance on what is considered a "good" summary, which means that making the evaluation process automatic is especially difficult. Manual evaluation can be used; however, it is both time- and labour-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues concern coherence and coverage.
ii) ROUGE
Formally, ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries. ROUGE-N is computed as follows:

$$\mathrm{ROUGE\mbox{-}N} = \frac{\sum_{S \in \mathrm{reference\ summaries}} \; \sum_{gram_N \in S} \mathrm{Count}_{\mathrm{match}}(gram_N)}{\sum_{S \in \mathrm{reference\ summaries}} \; \sum_{gram_N \in S} \mathrm{Count}(gram_N)}$$

where N stands for the length of the n-gram, Count_match(N-gram) is the maximum number of n-grams co-occurring in the candidate summary and the set of reference summaries, and Count(N-gram) is the number of n-grams in the set of reference summaries.

iii) Recall
$$\mathrm{Recall} = \frac{|S_{\mathrm{ref}} \cap S_{\mathrm{cand}}|}{|S_{\mathrm{ref}}|}$$
where |S_ref ∩ S_cand| indicates the number of sentences that co-occur in both the reference and the candidate summary.

iv) Precision
$$\mathrm{Precision} = \frac{|S_{\mathrm{ref}} \cap S_{\mathrm{cand}}|}{|S_{\mathrm{cand}}|}$$

v) F-measure
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

vi) Compression ratio
$$C_r = \frac{S_{len}}{I_{len}}$$
where S_len and I_len are the lengths of the summary and the source text respectively.
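A minimal sketch of these metrics is given below: ROUGE-N is computed from clipped n-gram counts against the reference summaries, and sentence-level precision, recall and F-measure are computed from the overlap between reference and candidate sentence sets. Whitespace tokenisation and lower-casing are simplifying assumptions.

```python
# Minimal sketch of ROUGE-N and sentence-level precision/recall/F-measure.
from collections import Counter

def ngrams(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=1):
    cand = ngrams(candidate, n)
    match, total = 0, 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())
        # Clipped counts: an n-gram is matched at most as often as it occurs
        # in the candidate summary.
        match += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
    return match / total if total else 0.0

def sentence_prf(ref_sentences, cand_sentences):
    overlap = len(set(ref_sentences) & set(cand_sentences))
    recall = overlap / len(ref_sentences)
    precision = overlap / len(cand_sentences)
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

print(rouge_n("the cat sat on the mat", ["the cat lay on the mat"], n=1))
```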
VI. CHALLENGES AND FUTURE RESEARCH DIRECTIONS

Evaluating summaries, either automatically or manually, is a difficult task. The main problem in evaluation comes from the impossibility of building a standard against which the results of the systems can be compared. Further, it is very hard to establish what a correct summary is, because a system may generate a good summary that is quite different from any human summary used as an approximation of the correct output. Content selection [23] is not a solved problem: people are different and subjective, and different authors may select completely different sentences. Two distinct sentences expressed in disparate words can express the same meaning, which is known as paraphrasing; there exists an approach to automatically evaluate summaries using paraphrases (ParaEval). Most text summarization systems perform extractive summarization, selecting and copying salient sentences verbatim from the source documents. Although humans can also cut and paste relevant information from a text, most of the time they rephrase sentences whenever necessary, or they may join different related pieces of information into one sentence. The low inter-annotator agreement figures observed during manual evaluations suggest that the future of this research area depends heavily on the ability to find efficient ways of automatically evaluating systems.
VII. CONCLUSION

This review has presented the assorted mechanisms of the extractive text summarization process. A good extractive summary is highly coherent, less redundant, cohesive and information rich. The aim of this paper has been to give a comprehensive review and comparison of the distinctive approaches and techniques of extractive text summarization. Although research on summarization started long ago, there is still a long way to go. Over time, the focus has drifted from summarizing scientific articles to advertisements, blogs, electronic mail messages and news articles. Simple extraction of sentences has produced satisfactory results in many applications. Some trends in the automatic evaluation of summary systems have also been discussed. However, existing work has not addressed the different challenges of the extractive text summarization process to their full extent in terms of time and space complexity.
REFERENCES

[1] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Circuit, Power and Computing Technologies (ICCPCT), 2016 International Conference on. IEEE, 2016, pp. 1-7.
[2] F. Kiyomarsi and F. R. Esfahani, "Optimizing Persian text summarization based on fuzzy logic approach," in 2011 International Conference on Intelligent Building and Management, 2011.
[3] F. Chen, K. Han, and G. Chen, "An approach to sentence-selection-based text summarization," in TENCON'02. Proceedings. 2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering, vol. 1. IEEE, 2002, pp. 489-493.
[4] Y. Sankarasubramaniam, K. Ramanathan, and S. Ghosh, "Text summarization using Wikipedia," Information Processing & Management, vol. 50, no. 3, pp. 443-461, 2014.
[5] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM (JACM), vol. 46, no. 5, pp. 604-632, 1999.
[6] G. Erkan and D. R. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization," Journal of Artificial Intelligence Research, pp. 457-479, 2004.
[7] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Technical report, Stanford University, Stanford, CA, 1998.
[8] (2004) Document Understanding Conferences dataset. [Online]. Available: http://duc.nist.gov/data.html
[9] F. Kyoomarsi, H. Khosravi, E. Eslami, P. K. Dehkordy, and A. Tajoddin, "Optimizing text summarization based on fuzzy logic," in Seventh IEEE/ACIS International Conference on Computer and Information Science. IEEE, 2008, pp. 347-352.
[10] L. Suanmali, M. S. Binwahlan, and N. Salim, "Sentence features fusion for text summarization using fuzzy logic," in Hybrid Intelligent Systems, 2009. HIS'09. Ninth International Conference on, vol. 1. IEEE, 2009, pp. 142-146.
[11] L. Suanmali, N. Salim, and M. S. Binwahlan, "Fuzzy logic based method for improving text summarization," arXiv preprint arXiv:0906.4690, 2009.
[12] M. Wang, X. Wang, and C. Xu, "An approach to concept oriented text summarization," in Proceedings of ISCIT'05, IEEE International Conference, China, pp. 1290-1293, 2005.
[13] M. G. Ozsoy, F. N. Alpaslan, and I. Cicekli, "Text summarization using latent semantic analysis," Journal of Information Science, vol. 37, no. 4, pp. 405-417, 2011.
[14] I. Mashechkin, M. Petrovskiy, D. Popov, and D. V. Tsarev, "Automatic text summarization using latent semantic analysis," Programming and Computer Software, vol. 37, no. 6, pp. 299-305, 2011.
[15] V. Gupta and G. S. Lehal, "A survey of text summarization extractive techniques," Journal of Emerging Technologies in Web Intelligence, vol. 2, no. 3, pp. 258-268, 2010.
[16] J. L. Neto, A. A. Freitas, and C. A. Kaestner, "Automatic text summarization using a machine learning approach," in Advances in Artificial Intelligence. Springer, 2002, pp. 205-215.
[17] K. Kaikhah, "Automatic text summarization with neural networks," 2004.
[18] K. M. Svore, L. Vanderwende, and C. J. Burges, "Enhancing single-document summarization by combining RankNet and third-party sources," in EMNLP-CoNLL, 2007, pp. 448-457.
[19] D. Hingu, D. Shah, and S. S. Udmale, "Automatic text summarization of Wikipedia articles," in Communication, Information & Computing Technology (ICCICT), 2015 International Conference on. IEEE, 2015, pp. 1-4.
[20] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, "Document summarization using conditional random fields," in IJCAI, vol. 7, 2007, pp. 2862-2867.
[21] Y. Gong and X. Liu, "Generic text summarization using relevance measure and latent semantic analysis," in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2001, pp. 19-25.
[22] R. Mihalcea, "Language independent extractive summarization," in Proceedings of the ACL 2005 on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2005, pp. 49-52.
[23] N. Lalithamani, R. Sukumaran, K. Alagammai, K. K. Sowmya, V. Divyalakshmi, and S. Shanmugapriya, "A mixed-initiative approach for summarizing discussions coupled with sentimental analysis," in Proceedings of the 2014 International Conference on Interdisciplinary Advances in Applied Computing. ACM, 2014, p. 5.