IEEE International Conference on Computer, Communication, and Signal Processing (ICCCSP-2017)
A Survey on Extractive Text Summarization
N. Moratanch*, S. Chitrakala†
*Research Scholar, †Associate Professor
*†Department of CSE, Anna University, CEG, Chennai
*tancyanbil@gmail.com, †au.chitras@gmail.com
Abstract—Text summarization is the process of extracting salient information from a source text document. The extracted information is presented to the user as a concise summary. It is difficult for humans to manually understand and condense the content of large texts. Text summarization techniques are classified into abstractive and extractive summarization. Extractive summarization focuses on selecting important sentences, paragraphs, and other units so that they reproduce the original document in a shorter form. The importance of sentences is determined from linguistic and statistical features. In this work, a comprehensive review of extractive text summarization methods is presented. The paper reviews the main techniques, popular benchmarking datasets, and open challenges of extractive summarization, with emphasis on methods that produce summaries that are less redundant, cohesive, coherent, and information rich.
Index Terms—Text Summarization, Unsupervised Learning, Supervised Learning, Sentence Fusion, Extraction Scheme, Sentence Revision, Extractive Summary
I. INTRODUCTION
In recent years, text summarization [1] has attracted growing attention due to the inundation of data on the web. This information overload creates a strong need for more reliable and capable text summarizers. Text summarization owes its importance to its wide range of applications, such as summaries of books, digests (summaries of stories), stock market and news reports, highlights (meetings, events, sport), abstracts of scientific papers, newspaper articles, and magazines. Owing to this growth, several leading groups, such as the Faculty of Informatics at Masaryk University (Czech Republic), the Semantic Software Lab at Concordia University (Montreal, Canada), the IHR Nexus Lab at Arizona State University (Arizona, USA), and the Topic Maps Lab at Leipzig University (Germany), have been persistently working on its rapid enhancement.
Text summarization has become a crucial tool for supporting and interpreting text content in today's fast-growing information age. It is very hard for humans to manually summarize oversized text documents. There is a wealth of textual content available on the internet, but the internet usually provides more data than is desired. A twin problem therefore arises: searching for relevant documents through an overwhelming number of documents on offer, and absorbing a large volume of important information. The objective of automatic text summarization is to condense the source text into a shorter version that preserves its information content and overall meaning. The main advantage of text summarization is that the reading time of the user is reduced. A good text summarization system should reproduce the diverse themes of the document while keeping redundancy to a minimum. Text summarization methods are broadly classified into abstractive and extractive summarization.
An extractive summarization technique consists of selecting important sentences, paragraphs, and other units from the original document and concatenating them into a shorter form. The importance of sentences is decided based on statistical and linguistic features of the sentences. This paper summarizes the main methodologies employed, the issues raised, and the exploration and future directions of text summarization. The paper is organized as follows. Section II describes the features used for extractive text summarization, Section III describes extractive text summarization methods, Section IV presents the inferences made, Section V details the evaluation metrics, Section VI discusses challenges and future research directions, and the final section concludes the paper.
II. FEATURES FOR EXTRACTIVE TEXT SUMMARIZATION
Earlier techniques assign a score to each sentence based on a set of features that are predefined according to the methodology applied. Both word-level and sentence-level features are employed in the text summarization literature. The features discussed below [2] [3] [4] are used to select the sentences to be included in the summary.

1. WORD LEVEL FEATURES

1.1 Content word feature
Keywords are essential in identifying the importance of a sentence. A sentence that contains the main keywords is most likely to be included in the final summary. Content (keyword) words are nouns, verbs, adjectives, and adverbs, and they are commonly identified based on the tf x idf measure.
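As a minimal sketch of this feature, the snippet below scores sentences by the summed tf-idf weight of their content words. The library (scikit-learn) and the example sentences are illustrative assumptions, not prescribed by the surveyed papers.

```python
# Illustrative sketch: score sentences by the tf-idf weight of their content words.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The cabinet approved the new budget for rural healthcare.",
    "Healthcare spending will rise sharply under the budget.",
    "The weather was pleasant on the day of the announcement.",
]

# Treat each sentence as a "document" so idf reflects how discriminative a word is.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sentences)

# Content-word score of a sentence = sum of the tf-idf weights of its terms.
scores = tfidf.sum(axis=1).A1
for sent, score in sorted(zip(sentences, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {sent}")
```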
1.2 Title word feature
Sentences in the original document that contain words mentioned in the title have a greater chance of contributing to the final summary, since such words serve as indicators of the theme of the document.
1.3 Cue phrase feature
Cue phrases are words and phrases that indicate the structure of the document flow, and they are used as a feature in sentence selection. Sentences that contain cue phrases (e.g. "denouement", "because", "this information", "summary", "develop", "desire") are most likely to be included in the final summary.
1.4 Biased word feature
Sentences that contain biased words are more likely to be important. Biased words come from a predefined list of words that may be domain specific. They are relatively important words that describe the theme of the document.
1.5 Upper case word feature
Words in uppercase, such as "UNICEF", are considered important, and sentences that contain such words are treated as important in the context of sentence selection for the final summary.
2. SENTENCE LEVEL FEATURES

2.1 Sentence location feature
Sentences that occur at the beginning and in the concluding part of the document are most likely important, since most documents are hierarchically structured, with important information at the beginning and the end of paragraphs.
2.2 Sentence length feature
Sentence length plays an important role in identifying key sentences. Very short sentences do not convey essential information, and very long sentences also need not be included in the summary. The normalized length of a sentence is calculated as the ratio of the number of words in the sentence to the number of words in the longest sentence of the document.
2.3 Paragraph location feature
Similar to sentence location, paragraph location also plays a crucial role in selecting key sentences. A higher score is assigned to paragraphs in the peripheral sections (the beginning and end paragraphs of the document).
2.4 Sentence-to-sentence cohesion
For every sentence s, the similarity between s and each other sentence is computed, and these similarities are summed to obtain a raw value of this feature for s. The feature values are then normalized to the range [0, 1], where a value closer to 1.0 indicates a higher degree of cohesion between sentences.
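The sketch below combines the sentence-level features described above (location, normalized length, and cohesion). The location heuristic, the tokenization, and the cosine-similarity choice are illustrative assumptions rather than a definition taken from the surveyed papers.

```python
# Illustrative sketch of sentence-level features: location, normalized length, cohesion.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sentence_features(sentences):
    vec = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(vec)                 # pairwise sentence similarities
    max_len = max(len(s.split()) for s in sentences)
    n = len(sentences)
    feats = []
    for i, s in enumerate(sentences):
        location = 1.0 if i in (0, n - 1) else 1.0 - i / n   # toy heuristic favouring boundaries
        length = len(s.split()) / max_len                     # normalized sentence length
        cohesion_raw = sim[i].sum() - 1.0                     # similarity to all other sentences
        feats.append({"location": location, "length": length, "cohesion_raw": cohesion_raw})
    max_coh = max(f["cohesion_raw"] for f in feats) or 1.0    # normalize cohesion into [0, 1]
    for f in feats:
        f["cohesion"] = f.pop("cohesion_raw") / max_coh
    return feats
```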
III. EXTRACTIVE TEXT SUMMARIZATION METHODS
Extractive text summarization methods can be broadly classified into unsupervised learning and supervised learning methods. Recent works rely mostly on unsupervised learning methods for text summarization.
Fig. 1. Overview of Extractive Summarization
A. UNSUPERVISED LEARNING METHODS
In this section, unsupervised techniques for the sentence extraction task are discussed. Unsupervised approaches do not need human summaries (user input) to decide the important features of the document; instead, they require more sophisticated algorithms to compensate for the lack of human knowledge. Unsupervised summarizers provide a higher level of automation than supervised models and are more suitable for processing big data. Unsupervised learning models have proved successful in text summarization tasks.
Fig. 2. Overview of Unsupervised Learning Methods
1. Graph based approach
Graph-based models are extensively used in document summarization, since graphs can efficiently represent the document structure. Extractive text summarization using external knowledge from Wikipedia, incorporated into a bipartite graph framework, has been proposed [4]. The authors propose an iterative ranking algorithm (a variation of the HITS algorithm [5]) that is efficient in selecting important sentences and also ensures coherence in the final summary. The uniqueness of this work is that it combines a graph-based and a concept-based approach to the summarization task. Another graph-based approach is LexRank [6], where the salience of a sentence is determined by the concept of eigenvector centrality. The sentences of the document are represented as a graph, and the edges between sentences carry weighted cosine similarity values. The sentences are clustered into groups based on their similarity measures and are then ranked by their LexRank scores, similar to the PageRank algorithm [7], except that the similarity graph is undirected in the LexRank method. The method outperforms earlier lead-based and centroid-based approaches. The performance of the system is evaluated on the DUC dataset [8].
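The following simplified LexRank-style sketch builds a thresholded cosine-similarity graph over sentences and ranks the nodes with PageRank via networkx. It is an illustration of the idea rather than the exact algorithm of [6]; the threshold value and libraries are assumptions.

```python
# Illustrative LexRank-style ranking: similarity graph + PageRank-style centrality.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_order(sentences, threshold=0.1):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > threshold:               # keep only sufficiently similar pairs
                graph.add_edge(i, j, weight=sim[i, j])
    scores = nx.pagerank(graph, weight="weight")    # undirected, weighted similarity graph
    return sorted(range(len(sentences)), key=lambda i: -scores[i])
```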
2. Fuzzy logic based approach
The fuzzy logic approach mainly consists of four components: a fuzzifier, a defuzzifier, a fuzzy knowledge base, and an inference engine. The textual characteristics given as input to the fuzzy logic approach, such as sentence length and sentence similarity, are later passed to the fuzzy system [9] [10].
TABLE I
SUPERVISED AND UNSUPERVISED LEARNING METHODS FOR TEXT SUMMARIZATION

Supervised learning - Machine learning approach (Bayes rule)
Concept: The summarization task is modelled as a classification problem.
Advantages: A large set of training data improves sentence selection for the summary.
Limitations: Human intervention is required for generating the manual summaries.

Supervised learning - Artificial neural network
Concept: A trainable summarizer; a neural network is trained, pruned, and generalized to filter sentences and classify them as "summary" or "non-summary" sentences.
Advantages: The network can be trained according to the style of the human reader, and the set of features can be altered to reflect the user's needs and requirements.
Limitations: 1) The neural network is slow in both the training and the application phase. 2) It is difficult to determine how the network makes its decisions. 3) Human intervention is required for the training data.

Supervised learning - Conditional Random Fields (CRF)
Concept: A statistical modelling approach that treats summarization with CRF as a sequence labelling problem.
Advantages: Identifies correct features, provides a better representation of sentences, and groups terms appropriately into segments.
Limitations: 1) Domain specific; an external domain-specific corpus is required for the training step. 2) Linguistic features are not considered.

Unsupervised learning - Graph based approach
Concept: Construction of a graph to capture the relationships between sentences.
Advantages: 1) Captures redundant information. 2) Improves coherence.
Limitations: Does not address issues such as the dangling anaphora problem.

Unsupervised learning - Concept oriented approach
Concept: The importance of sentences is calculated based on concepts retrieved from an external knowledge base (Wikipedia, HowNet).
Advantages: Incorporates similarity measures to reduce redundancy.
Limitations: Dangling anaphora and verb referents are not considered.

Unsupervised learning - Fuzzy logic based approach
Concept: Summarization based on fuzzy rules using various sets of features.
Advantages: Improved summary quality by maintaining coherence.
Limitations: Depends on the membership functions and the working of the fuzzy logic system.
Fig. 3. Example of sentence-concept bipartite graph proposed in [4]
Ladda Suanmali et al. [11] proposed a fuzzy logic approach for automatic text summarization. In the initial step, the text document is pre-processed, followed by feature extraction (title feature, sentence length, sentence position, sentence-to-sentence similarity, term weight, proper nouns, and numerical data). The summary is generated by arranging the ranked sentences in the order in which they occur in the original document, so as to maintain coherence. The proposed method improves the quality of the summaries, but issues such as dangling anaphora are not handled.
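The toy sketch below illustrates the fuzzification-and-rule idea behind such systems. The triangular membership functions, the single rule, and the feature names are illustrative assumptions and are not taken from [9]-[11].

```python
# Toy sketch of fuzzy sentence scoring: fuzzify crisp features, combine with one rule.
def triangular(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_sentence_score(title_overlap, norm_length, position):
    # Fuzzification of each crisp feature into a degree in [0, 1].
    high_title = min(1.0, title_overlap / 0.5)
    good_length = triangular(norm_length, 0.2, 0.6, 1.0)
    near_boundary = max(1.0 - position, position)     # favours document boundaries
    # One hand-written rule: "if title overlap is high AND length is good AND the
    # sentence is near a boundary, then importance is high" (min acts as AND).
    return min(high_title, good_length, near_boundary)

print(fuzzy_sentence_score(title_overlap=0.4, norm_length=0.7, position=0.05))
```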
3. Concept-based approach
In the concept-based approach, concepts are extracted from a piece of text using an external knowledge base such as HowNet [12] or Wikipedia [4]. In the methodology proposed in [12], the importance of sentences is calculated based on the concepts retrieved from HowNet instead of words. A conceptual vector model is built to obtain a rough summarization, and similarity measures between sentences are calculated to reduce redundancy in the final summary. A good summarizer focuses on higher coverage and lower redundancy.
Fig. 4. Overall architecture of text summarization based on fuzzy logic approach proposed in [10]
Fig. 5. Example of concepts retrieved for a sentence from Wikipedia as proposed in [4]
Ramanathan et al. [4] proposed a Wikipedia-based summarization that uses a graph structure to produce summaries. The method uses Wikipedia to obtain concepts for each sentence and builds a sentence-concept bipartite graph, as already mentioned under graph-based summarization. The basic steps in concept-based summarization are: i) retrieve the concepts of a text from an external knowledge base (HowNet, WordNet, Wikipedia); ii) build a conceptual vector or graph model to depict the relationships between concepts and sentences; iii) apply a ranking algorithm to score the sentences; iv) generate the summary based on the ranking scores of the sentences.
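A small sketch of steps ii) and iii) is given below: a sentence-concept bipartite graph ranked with HITS, in the spirit of [4]. The concept lookup is stubbed out with a toy mapping, and networkx's standard HITS is used rather than the authors' modified variant; both are assumptions for illustration.

```python
# Sketch: sentence-concept bipartite graph ranked with HITS (concepts are stubbed).
import networkx as nx

sentence_concepts = {
    "s1": ["economy", "budget"],       # hypothetical concepts from a knowledge base
    "s2": ["budget", "healthcare"],
    "s3": ["weather"],
}

graph = nx.Graph()
for sent, concepts in sentence_concepts.items():
    for concept in concepts:
        graph.add_edge(sent, "concept:" + concept)

# Sentences act as hubs over concept authorities; rank sentences by hub score.
hubs, authorities = nx.hits(graph, max_iter=500)
ranked_sentences = sorted(sentence_concepts, key=lambda s: -hubs[s])
print(ranked_sentences)
```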
4. Latent Semantic Analysis (LSA) Method
Latent Semantic Analysis (LSA) [13] [14] is a method that extracts hidden semantic structures of sentences and words and is popularly used in text summarization tasks. It is an unsupervised learning approach that does not demand any sort of external or training knowledge. LSA takes the text of the input document and extracts information such as which words frequently occur together and which words are commonly seen across different sentences. A high number of common words among sentences indicates that the sentences are semantically related. Singular Value Decomposition (SVD) [13] is used to find the interrelations between words and sentences; it also has a noise-reduction capability that helps to improve accuracy. When applied to document-word matrices, SVD [15] can group documents that are semantically related to one another even when they share no common words. Sets of words that occur in connected text are likewise connected within the same latent dimensional space. The LSA technique is applied to extract the topic-related words and the important content-bearing sentences from the document. The advantage of adopting LSA vectors for summarization over plain word vectors is that conceptual relations, as represented in the human brain, are naturally captured by LSA. Choosing a representative sentence from every dimension of the latent space ensures the relevance of the selected sentences to the document as well as non-redundancy. LSA works with text data, and the principal dimensions corresponding to the collection of topics can be located.
As an example of the LSA representation, consider the following three sentences given to an LSA-based system (Example 1):
d0: 'The man walked the dog.'
d1: 'The man took the dog to the park in the evening.'
d2: 'The dog went to the park in the evening.'
From Fig. 6 [13] it can be noted that d1 is more closely associated with d2 than with d0, and that the word 'walked' is linked to the word 'man' but is not significant for the word 'park'. These kinds of interpretations can be built from the input data and LSA alone, without the need for any external knowledge.
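A minimal LSA sketch for the three example sentences above is shown below, using a term-count matrix and truncated SVD. The library choice (scikit-learn) and the number of latent dimensions are illustrative assumptions.

```python
# Minimal LSA sketch: term matrix + truncated SVD over the three example sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "The man walked the dog.",
    "The man took the dog to the park in the evening.",
    "The dog went to the park in the evening.",
]

term_matrix = CountVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(term_matrix)   # each row: a sentence in latent topic space

# Sentences with similar latent representations (d1 and d2 here) are semantically
# related even when their surface word overlap is limited.
print(doc_topics)
```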
B. SUPERVISED LEARNING METHODS
Supervised extractive summarization techniques are based on a classification approach at the sentence level, in which the system learns from examples to classify sentences as summary or non-summary sentences. The major drawback of the supervised approach is that it requires manually created summaries, produced by humans, to label the sentences of the original training documents as "summary sentence" or "non-summary sentence", and it also requires a large amount of labelled training data for classification.
Fig. 6. Representation of LSA for Example 1 [13]
Fig. 7. Overview of Supervised Learning Methods
1. Machine Learning Approach based on Bayes rule
A set of training documents, along with their extractive summaries, is fed as input to the training stage. The machine learning approach views text summarization as a classification problem: sentences are classified as summary or non-summary sentences based on the features they possess. The classification probabilities are learned from the training data using Bayes rule [16]:

P(s ∈ S | f1, f2, ..., fn) = P(f1, f2, ..., fn | s ∈ S) · P(s ∈ S) / P(f1, f2, ..., fn)

where s represents a sentence of the document, f1, ..., fn represent the features used in the classification stage, and S represents the set of sentences in the summary. P(s ∈ S | f1, f2, ..., fn) is the probability that a sentence is included in the summary, given the features it possesses.
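A small sketch of this classification view follows. Treating the features as independent, as in a naive Bayes classifier, is one common realization and is assumed here for illustration; the feature vectors, labels, and the use of scikit-learn's GaussianNB are placeholders rather than the setup of [16].

```python
# Sketch: sentence classification as summary / non-summary via Bayes rule
# (naive independence assumption), with illustrative feature vectors.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Features per sentence: [content-word score, title overlap, normalized length, position]
X_train = np.array([
    [0.9, 0.6, 0.8, 0.0],
    [0.2, 0.0, 0.3, 0.5],
    [0.7, 0.4, 0.7, 1.0],
    [0.1, 0.0, 0.2, 0.4],
])
y_train = np.array([1, 0, 1, 0])   # 1 = summary sentence, 0 = non-summary sentence

model = GaussianNB().fit(X_train, y_train)
X_new = np.array([[0.8, 0.5, 0.9, 0.1]])
print(model.predict_proba(X_new))  # [P(non-summary), P(summary)] for the new sentence
```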
2. Neural Network based approach

Fig. 8. Neural network after training (a) and after pruning (b) [17]

In the approach proposed in [18], the RankNet algorithm uses neural networks to automatically identify the important sentences in a document. It employs a two-layer neural network with back-propagation, trained using the RankNet algorithm. The first step involves labelling the training data using a machine learning approach; features of the sentences in both the training and test sets are then extracted and fed to the neural network, which ranks the sentences of the document. Another approach [17] uses a three-layered feed-forward neural network that learns the characteristics of summary and non-summary sentences during the training stage. The major phase is the feature fusion phase, in which the relationships between the features are identified in two stages: 1) eliminating infrequent features and 2) collapsing frequent features, after which sentence ranking is performed to identify the important summary sentences. The neural network of [17] after feature fusion is depicted in Fig. 8.
Dharmendra Hingu, Deep Shah, and Sandeep S. Udmale proposed an extractive approach [19] for summarizing Wikipedia articles by identifying text features and scoring the sentences with a neural network model [5]. The system takes Wikipedia articles as input, followed by tokenization and stemming. The pre-processed passage is sent to the feature extraction step, which is based on multiple features of sentences and words. The scores obtained from feature extraction are fed to the neural network, which produces a single output score signifying the importance of each sentence. The usage of the words and sentences is not considered while assigning the weights, which results in lower accuracy.
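For illustration, the sketch below scores sentences with a small feed-forward network over hand-made feature vectors; the architecture, data, and use of scikit-learn's MLPClassifier are assumptions and do not reproduce the RankNet or feature-fusion networks of [17], [18].

```python
# Sketch: a small feed-forward network ranking sentences by P(summary).
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.array([
    [0.9, 0.6, 0.8, 0.0],
    [0.2, 0.0, 0.3, 0.5],
    [0.7, 0.4, 0.7, 1.0],
    [0.1, 0.0, 0.2, 0.4],
])
y_train = np.array([1, 0, 1, 0])   # 1 = summary sentence, 0 = non-summary sentence

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)

# Rank unseen sentences by the network's probability of the "summary" class.
X_test = np.array([[0.8, 0.5, 0.9, 0.1], [0.3, 0.1, 0.4, 0.6]])
scores = net.predict_proba(X_test)[:, 1]
print(scores.argsort()[::-1])      # sentence indices, most summary-worthy first
```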
3. Conditional Random Fields
Conditional Random Fields (CRF) are a statistical modelling approach used in machine learning to provide structured prediction. The proposed system overcomes issues faced by non-negative matrix factorization (NMF) methods by incorporating conditional random fields to identify and extract the correct features for determining the important sentences of a given text. The main advantage of the method is that it identifies correct features, provides a better representation of sentences, and groups terms appropriately into segments. The major drawback is that it is domain specific and requires an external domain-specific corpus for the training step; the method therefore cannot be applied generically to any document without building a domain corpus, which is a time-consuming task. The approach specified in [20] treats summarization with CRF as a sequence labelling problem, captures the interactions between sentences through the features extracted for each sentence, and also incorporates complex features such as LSA scores [21] and HITS scores [22], but its limitation is that linguistic features are not considered.
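The sketch below frames summarization as sequence labelling with a CRF: each document is a sequence of sentences, each sentence a feature dictionary, and the labels mark summary versus non-summary sentences. The sklearn-crfsuite package is used as one possible backend, and the features and labels are illustrative; [20] does not prescribe this setup.

```python
# Sketch: summarization as sequence labelling with a CRF (sklearn-crfsuite backend).
import sklearn_crfsuite

def sent_features(i, n, length, title_overlap):
    return {
        "position": i / max(n - 1, 1),
        "length": length,
        "title_overlap": title_overlap,
        "is_first": float(i == 0),
        "is_last": float(i == n - 1),
    }

# One training document with four sentences (illustrative values).
X_train = [[
    sent_features(0, 4, 0.8, 0.6),
    sent_features(1, 4, 0.3, 0.0),
    sent_features(2, 4, 0.7, 0.4),
    sent_features(3, 4, 0.2, 0.0),
]]
y_train = [["summary", "non-summary", "summary", "non-summary"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))   # labels the sentence sequence jointly, not independently
```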
IV. INFERENCES MADE

- Many variations of the extractive approach [15] have been explored over the past ten years. However, it is difficult to say how much each analytical improvement (at the sentence or text level) contributes to the overall result.
- Without deeper NLP, the resulting summary may suffer from a lack of semantics and cohesion. For texts that contain numerous topics, the generated summary may not be balanced.
- Deciding proper weights for the respective features is vital, as the quality of the final summary depends on it. Feature weights should be given particular attention because they play a major role in choosing the key sentences.
- In text summarization, the most challenging task is to summarize content from a number of textual and semi-structured sources, including web pages and databases, in the proper form (size, format, time, language) for a specific user.
- Text summarization software should produce an effective summary with little redundancy in a short amount of time. Summaries are evaluated using intrinsic or extrinsic measures: intrinsic measures assess summary quality through human evaluation, whereas extrinsic measures assess it through a task-based measure such as an information retrieval oriented task.
V. EVALUATION METRICS

Numerous benchmarking datasets [1] are used for the experimental evaluation of extractive summarization. The Document Understanding Conferences (DUC) collections are the most common benchmarking datasets used for text summarization. There are a number of datasets such as TIPSTER, TREC, TAC, DUC, and CNN. They contain documents along with their summaries, which are created automatically, manually, or from submitted summaries [20]. From the papers surveyed in the previous sections, it has been found that the agreement between human summarizers is rather low, both for evaluating and for generating summaries; beyond the form of the summary, it is difficult to judge the summary content.
i) Human Evaluation
Human judgment usually shows wide variance on what is considered a "good" summary, which means that making the evaluation process automatic is especially difficult. Manual evaluation can be used; however, it is both time- and labour-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues concern coherence and coverage.
ii) ROUGE
Formally, ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries. ROUGE-N is computed as follows:

ROUGE-N = ( Σ_{S ∈ reference summaries} Σ_{n-gram ∈ S} Count_match(n-gram) ) / ( Σ_{S ∈ reference summaries} Σ_{n-gram ∈ S} Count(n-gram) )

where n stands for the length of the n-gram, Count_match(n-gram) is the maximum number of n-grams co-occurring in the candidate summary and the set of reference summaries, and Count(n-gram) is the number of n-grams in the set of reference summaries.

iii) Recall
Recall = |S_ref ∩ S_cand| / |S_ref|
where S_ref ∩ S_cand denotes the sentences that occur in both the reference and the candidate summary.

iv) Precision
Precision = |S_ref ∩ S_cand| / |S_cand|

v) F-measure
F = 2 · (Precision · Recall) / (Precision + Recall)

vi) Compression Ratio
Cr = S_len / T_len
where S_len and T_len are the lengths of the summary and the source text, respectively.
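The following sketch implements the ROUGE-N recall formula above as a plain n-gram count; it is an illustration, not the official ROUGE toolkit, and whitespace tokenization is an assumption.

```python
# Sketch: ROUGE-N recall computed directly from clipped n-gram co-occurrence counts.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, references, n=2):
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    match, total = 0, 0
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        total += sum(ref_counts.values())
        # Clipped co-occurrence: an n-gram is matched at most as often as it
        # appears in the candidate summary.
        match += sum(min(count, cand_counts[g]) for g, count in ref_counts.items())
    return match / total if total else 0.0

print(rouge_n("the cat sat on the mat", ["the cat was on the mat"], n=2))
```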
VI. CHALLENGES AND FUTURE RESEARCH DIRECTIONS
Evaluating summaries, either automatically or manually, is a difficult task. The main problem in evaluation comes from the impossibility of building a standard against which the results of the systems can be compared. Further, it is very hard to establish what a correct summary is, because a system may generate a good summary that is different from any human summary used as an approximation of the correct output. Content selection [23] is not a settled problem: people are different and subjective, and different authors may select completely different sentences. Two distinct sentences expressed in disparate words can convey the same meaning, which is known as paraphrasing; there exists an approach to automatically evaluate summaries using paraphrases (ParaEval). Most text summarization systems perform extractive summarization, selecting and copying whole sentences from the source documents. Although humans can also cut and paste relevant data from a text, most of the time they rephrase sentences where necessary, or they may join different pieces of related data into one sentence. The low inter-annotator agreement figures observed during manual evaluations suggest that the future of this research area depends heavily on the ability to find efficient ways of automatically evaluating these systems.
VII. CONCLUSION

This review has presented the assorted mechanisms of the extractive text summarization process. An extractive summary should be highly coherent, cohesive, non-redundant, and information rich. The aim has been to give a comprehensive review and comparison of the distinctive approaches and techniques of extractive text summarization. Although research on summarization started long ago, there is still a long way to go. Over time, the focus has drifted from summarizing scientific articles to advertisements, blogs, electronic mail messages, and news articles. Simple extraction of sentences has produced satisfactory results in many applications. Some trends in the automatic evaluation of summarization systems have also been discussed. However, existing work has not addressed the different challenges of extractive text summarization to their full extent in terms of time and space complexity.
REFERENCES
[1]
N. Moratanch and S. Chitrakala, "A survey on abstractive text sum-
marization," in Circuit, Power
and
Computing Technologies (ICCPCT),
2016 International Conference on. IEEE, 2016, pp. 1-7.
[2]
F.
Kiyomarsi and
F.
R.
Esfahani, "Optimizing persian text summarization
based on fuzzy logic approach," in 2011 International Conference on
Intelligent Building
and
Management, 2011.
[3]
F.
Chen, K. Han, and G. Chen, "An approach to sentence-selection-
based text summarization," in TENCON'02. Proceedings. 2002 IEEE
Region 10 Conference on Computers, Communications, Control and
Power Engineering, vol.
1.
IEEE, 2002, pp. 489-493.
[4]
Y.
Sankarasubramaniam, K. Ramanathan, and S. Ghosh, "Text sum-
marization using wikipedia," Information Processing & Management,
vol. 50, no. 3, pp. 443-461, 2014.
[5]
J. M. Kleinberg, "Authoritative sources in a hyperlinked environment,"
Journal
of
the
ACM
(JACM), vol. 46, no. 5, pp. 604-632, 1999.
[6]
G. Erkan and D.
R.
Radev, "Lexrank: Graph-based lexical centrality
as salience in text summarization," Journal
of
Artificial Intelligence
Research, pp. 457-479, 2004.
[7]
L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Technical report, Stanford University, Stanford, CA, 1998.
[8]
(2004) Document Understanding Conferences dataset. [Online]. Available: http://duc.nist.gov/data.html
[9]
F.
Kyoomarsi, H. Khosravi, E. Eslami,
P.
K. Dehkordy, and A. Tajoddin,
"Optimizing text summarization based on fuzzy logic," in Seventh
IEEE/ACIS International Conference on Computer and Information
Science. IEEE, 2008, pp. 347-352.
[10] L. Suanmali, M. S. Binwahlan, and
N.
Salim, "Sentence features fusion
for text summarization using fuzzy logic," in Hybrid Intelligent Systems,
2009. HIS'09. Ninth International Conference on, vol.
1.
IEEE, 2009,
pp. 142-146.
[11] L. Suanmali,
N.
Salim, and M. S. Binwahlan, "Fuzzy logic based method
for improving text summarization," arXiv pre print arXiv:0906.4690,
2009.
[12] M. Wang, X. Wang, and C. Xu, "An approach to concept oriented text summarization," in Proceedings of ISCIT'05, IEEE International Conference, China, pp. 1290-1293, 2005.
[13] M. G. Ozsoy,
F.
N. Alpaslan, and
I.
Cicekli, "Text summarization using
latent semantic analysis," Journal
of
Information Science, vol. 37, no. 4,
pp. 405-417, 2011.
[14]
I.
Mashechkin, M. Petrovskiy, D. Popov, and D.
V.
Tsarev, "Automatic
text summarization using latent semantic analysis," Programming and
Computer Software, vol. 37, no. 6, pp. 299-305, 2011.
[15]
V.
Gupta and G. S. Lehal, "A survey
of
text summarization extractive
techniques," Journal
of
emerging technologies in web intelligence, vol. 2,
no. 3, pp. 258-268, 2010.
[16] J. L. Neto, A. A. Freitas, and C. A. Kaestner, "Automatic text summa-
rization using a machine learning approach," in Advances in Artificial
Intelligence. Springer, 2002, pp. 205-215.
[17] K. Kaikhah, "Automatic text summarization with neural networks,"
2004.
[18] K. M. Svore, L. Vanderwende, and C.
J.
Burges, "Enhancing single-
document summarization by combining ranknet and third-party sources."
in EMNLP-CoNLL, 2007, pp. 448-457.
[19] D. Hingu, D. Shah, and S. S. Udmale, "Automatic text summarization
of
wikipedia articles," in Communication, Information & Computing
Technology (ICCICT), 2015 International Conference on.
IEEE, 2015, pp. 1-4.
[20] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, "Document summa-
rization using conditional random fields." in IJCAI, vol. 7, 2007, pp.
2862-2867.
[21]
Y.
Gong and X. Liu, "Generic text summarization using relevance
measure and latent semantic analysis," in Proceedings
of
the 24th annual
international
ACM
SIGIR conference on Research and development in
information retrieval. ACM, 2001, pp. 19-25.
[22]
R.
Mihalcea, "Language independent extractive summarization," in
Proceedings
of
the ACL 2005 on Interactive poster and demonstration
sessions. Association for Computational Linguistics, 2005, pp. 49-52.
[23] N. Lalithamani, R. Sukumaran, K. Alagammai, K. K. Sowmya,
V.
Di-
vyalakshmi, and S. Shanmugapriya,
"A
mixed-initiative approach for
summarizing discussions coupled with sentimental analysis," in Proceed-
ings
of
the 2014 International Conference on Interdisciplinary Advances
in Applied Computing. ACM, 2014, p. 5.