
TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation

Sajad Sotudeh and Nazli Goharian
IR Lab, Georgetown University, Washington DC 20057, USA
{sajad, nazli}
Abstract
Many scientific papers, such as those in the arXiv and PubMed data collections, have abstracts with varying lengths of 50–1000 words and an average length of approximately 200 words, where longer abstracts typically convey more information about the source paper. Until recently, scientific summarization research has typically focused on generating short, abstract-like summaries, following the existing datasets used for scientific summarization. In domains where the source text is relatively long-form, such as scientific documents, such a summary cannot go beyond a general, coarse overview and provide salient information from the source document. The recent interest in tackling this problem motivated the curation of scientific datasets, arXiv-Long and PubMed-Long, containing human-written summaries of 400–600 words, hence providing a venue for research in generating long/extended summaries. Extended summaries facilitate a faster read while providing details beyond coarse information. In this paper, we propose TSTR, an extractive summarizer that utilizes the introductory information of documents as pointers to their salient information. Evaluations on two existing large-scale extended summarization datasets indicate statistically significant improvements in terms of ROUGE and average ROUGE (F1) scores (except in one case) compared to strong baselines and the state of the art. Comprehensive human evaluations favor our generated extended summaries in terms of cohesion and completeness.
1 Introduction
Over the past few years, the summarization task has witnessed a great deal of progress in extractive (Nallapati et al., 2017; Liu and Lapata, 2019; Yuan et al., 2020; Cui et al., 2020; Jia et al., 2020; Feng et al., 2018) and abstractive (See et al., 2017; Cohan et al., 2018; Gehrmann et al., 2018; Zhang et al., 2019; Tian et al., 2019; Zou et al., 2020)
[Introductory] Neural machine translation (@xcite), directly applying a
single neural network to transform the source sentence into the target
sentence, has now reached impressive performance (@xcite). [] Motivated
by recent success in unsupervised cross-lingual embeddings (@xcite), the
models proposed for unsupervised NMT often assume that a pair of
sentences from two different languages can be mapped to a same latent
representation in a shared-latent space (@xcite). [] Although the shared
encoder is vital for mapping sentences from different languages into the
shared-latent space, it is weak in keeping the uniqueness and internal
characteristics of each language, such as the style, terminology and sentence
structure. [] For each language, the encoder and its corresponding
decoder perform an AE, where the encoder generates the latent
representations from the perturbed input sentences and the decoder
reconstructs the sentences from the latent representations. Experimental
results show that the proposed approach consistently achieves great success.
[Non-introductory] [] To further enforce the shared-latent space, we
train a discriminative neural network, referred to as the local discriminator,
to classify between the encoding of source sentences and the encoding of
target sentences. [] the shared encoder is weak in keeping the unique
characteristic of each language. This confirms our intuition that the shared
layers are vital to map the source and target latent representations to a
shared-latent space. [] This shows that the proposed model only trained
with monolingual data effectively learns to use the context information and
the internal structure of each language [] The models proposed recently
for unsupervised NMT use a single encoder to map sentences from different
languages to a shared-latent space. [] The experimental results reveal that
our approach achieves significant improvement and verify our conjecture
that the shared encoder is really a bottleneck for improving the
unsupervised NMT.
Figure 1: A truncated human-written extended summary. Top box: introductory information; bottom box: non-introductory information. Colored spans are pointers from introductory sentences to associated non-introductory detailed sentences.
settings. Many scientific papers, such as those in arXiv and PubMed (Cohan et al., 2018), possess abstracts of varying length, ranging from 50 to 1000 words with an average length of approximately 200 words. While scientific paper summarization has been an active research area, most works in this domain (Cohan et al., 2018; Xiao and Carenini, 2019; Cui and Hu, 2021; Rohde et al., 2021) have focused on generating typical short, abstract-like summaries (Chandrasekaran et al., 2020). Short summaries might be adequate when the source text is short-form, such as in the news domain; however, to summarize longer documents such as scientific papers, an extended summary of 400–600 terms on average, such as those found in the extended summarization datasets arXiv-Long and PubMed-Long, is more appealing as it conveys more detailed information.
Extended summary generation has attracted research interest only very recently. Chandrasekaran et al. (2020) motivated the necessity of generating extended summaries through the LongSumm shared task. Long documents such as scientific papers are usually framed in a specific structure. They start by presenting general introductory information. This introductory information is then followed by supplemental (i.e., non-introductory) information that explains the initial introductory information in more detail. Similarly, as shown in Figure 1, this pattern holds in a human-written extended summary of a long document, where the preceding sentences (top box in Figure 1) are introductory sentences and the succeeding sentences (bottom box in Figure 1) are explanations of the introductory sentences. In this study, we aim to guide the summarization model to utilize this rationale in human-written summaries. We consider introductory sentences to be those that appear in the first section of the paper, with headings such as Introduction, Overview, Motivations, and so forth. All other parts of the paper and their sentences are considered non-introductory (i.e., supplementary). We use these definitions in the remainder of this paper.
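The definition above can be illustrated with a minimal sketch, assuming a paper is already available as an ordered list of (heading, sentences) pairs. The keyword list `INTRO_HEADINGS`, the function name `split_intro`, and the input layout are hypothetical choices made for illustration; they are not taken from the authors' implementation.

```python
# Assumed keyword list; the paper only gives "Introduction, Overview,
# Motivations, and so forth" as example headings.
INTRO_HEADINGS = ("introduction", "overview", "motivation")


def split_intro(sections):
    """Split a paper into introductory and non-introductory sentences.

    sections: list of (heading, [sentences]) in document order.
    Only the first section can be introductory, and only if its heading
    contains one of the assumed keywords.
    """
    intro, non_intro = [], []
    for idx, (heading, sentences) in enumerate(sections):
        is_intro = idx == 0 and any(k in heading.lower() for k in INTRO_HEADINGS)
        (intro if is_intro else non_intro).extend(sentences)
    return intro, non_intro
```

Under this sketch, a later section whose heading happens to contain "overview" is still treated as non-introductory, since only the first section qualifies.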
Herein, we approach the problem of extended summary generation by incorporating the most important introductory information into the summarization model. We hypothesize that incorporating such information guides the model to pick salient, detailed non-introductory information to augment the final extended summary. The importance of the introduction in scientific papers was earlier presented in (Teufel and Moens, 2002; Armağan, 2013; Jirge, 2017), which showed that such information provides clues (i.e., pointers) to the objectives and experiments of studies. Similarly, Boni et al. (2020) conducted a study showing the importance of the introduction part of scientific papers in terms of its relevance to the paper's abstract. To validate our hypothesis, we test the proposed approach on two publicly available large-scale extended summarization datasets, namely arXiv-Long and PubMed-Long. Our experimental results improve over strong baselines and state-of-the-art models. In short, the contributions of this work are as follows:
We use (non-)introductory information and (non-)introductory sentences interchangeably in the rest of this paper.
• A novel multi-tasking approach that incorporates the salient introductory information into the extractive summarizer to guide the model in generating an extended summary of roughly 600 terms for a long document, containing the key detailed information of a scientific paper.
• Intrinsic evaluation that demonstrates statistically significant improvements over strong extractive and abstractive summarization baselines and state-of-the-art models.
• An extensive human evaluation which reveals the advantage of the proposed model in terms of cohesion and completeness.
2 Related Work
Summarizing scientific documents has gained a great deal of attention from researchers, although it has been studied for decades. Neural efforts on scientific text have used specific characteristics of papers, such as discourse structure (Cohan et al., 2018; Xiao and Carenini, 2019) and citation information (Qazvinian and Radev, 2008; Cohan and Goharian, 2015, 2018), to aid the summarization model. While prior work has mostly covered the generation of shorter-form summaries (approx. 200 terms), generating extended summaries of roughly 600 terms for long-form source documents such as scientific papers has been motivated only very recently (Chandrasekaran et al., 2020).
The models proposed for the extended summary generation task include jointly learning to predict sentence importance and sentence section to extract top sentences (Sotudeh et al., 2020); utilizing section-contribution computations to pick sentences from important sections for forming the final summary (Ghosh Roy et al., 2020); identifying salient sections for generating abstractive summaries (Gidiotis et al., 2020); ensembling extraction and abstraction models to form the final summary (Ying et al., 2021); an extractive model using the TextRank algorithm equipped with BM25 as the similarity function (Kaushik et al., 2021); and incorporating sentence embeddings into a graph-based extractive summarizer in an unsupervised manner (Ramirez-Orta and Milios, 2021). Unlike these works, we do not exploit any sectional or citation information. To the best of our knowledge, we are the first to propose utilizing the introductory information of a scientific paper to guide the model in learning to generate a summary from salient and related information.
3 Background: Contextualized language
models for summarization
Contextualized language models such as BERT (Devlin et al., 2019) and ROBERTA (Liu et al., 2019) have achieved state-of-the-art performance on a variety of downstream NLP tasks, including text summarization. Liu and Lapata (2019) were the first to fine-tune a contextualized language model (i.e., BERT) for the summarization task. They proposed BERTSUM, a fine-tuning scheme for text summarization that outputs the sentence representations of the source document (we use the terms source and source document interchangeably, referring to the entire document). The BERTSUMEXT model, which is built on BERTSUM, was proposed for the extractive summarization task. It takes the representations produced by BERTSUM, passes them through a Transformer encoder (Vaswani et al., 2017), and finally uses a linear layer with a Sigmoid function to compute copying probabilities for each input sentence. Formally, let $l_1, l_2, ..., l_n$ be the binary tags over the source sentences $x = \{sent_1, sent_2, ..., sent_n\}$ of a long document, in which $n$ is the number of sentences in the paper. The BERTSUMEXT network runs over the source document as follows (Eq. 1),

$$h_b = \mathrm{BertSum}(x), \quad h = \mathrm{Encoder}(h_b), \quad p = \sigma(W_o h + b_o) \tag{1}$$

where $h_b$ and $h$ are the representations of source sentences encoded by BERTSUM and the Transformer encoder, respectively, $W_o$ and $b_o$ are trainable parameters, and $p$ is the probability distribution over the source sentences, signifying extraction copy likelihood. The goal is to train a network that can identify the positive set of sentences as the summary. To prevent the network from selecting redundant sentences, BERTSUM uses Trigram Blocking (Liu and Lapata, 2019) for sentence selection at inference time. We refer the reader to the main paper for more details.
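The scoring and selection steps described above can be sketched in a few lines of plain Python. This is an illustrative approximation under stated assumptions, not the BERTSUM code: sentence representations are given as plain lists of floats, tokenization is whitespace splitting, and returning the chosen indices in document order is a presentational choice. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score_sentences(reps, w, b):
    """Extraction probability per sentence: sigmoid(w . h_i + b)."""
    return [sigmoid(sum(wj * hj for wj, hj in zip(w, h)) + b) for h in reps]


def trigrams(sentence):
    toks = sentence.lower().split()  # assumed whitespace tokenization
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}


def select_with_trigram_blocking(sentences, probs, max_sents):
    """Pick sentences by descending probability, skipping any sentence that
    shares a trigram with the sentences already selected."""
    order = sorted(range(len(sentences)), key=lambda i: probs[i], reverse=True)
    chosen, seen = [], set()
    for i in order:
        if len(chosen) >= max_sents:
            break
        tri = trigrams(sentences[i])
        if tri & seen:
            continue  # redundant with the summary so far
        chosen.append(i)
        seen |= tri
    return sorted(chosen)  # indices in document order
```

The blocking step is what keeps a high-scoring near-duplicate of an already selected sentence out of the summary.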
4 TSTR: Intro-guided Summarization
In this section, we describe our methodology to tackle the extended summary generation task. Our approach exploits the introductory information of the paper as pointers to salient sentences within it, as shown in Figure 2. It is ultimately expected that the extractive summarizer is guided to pick salient sentences across the entire paper.

3 Introductory information is defined in Section 1.

Figure 2: Our model uses introductory sentences as pointers to the source sentences. It then forms the final extended summary by extracting salient sentences from the source. Highlights in red show the salient parts.
The detailed illustration of our model is shown in Figure 3. To aid the extractive summarization model (i.e., the right-hand box in Figure 3), which takes in the source sentences of a scientific paper, we utilize an additional BERTSUM encoder called the Introductory encoder (left-hand box in Figure 3) that receives $x_{intro} = \{sent_1, sent_2, ..., sent_m\}$, with $m$ being the number of sentences in the introductory section. The aim of adding the second encoder to this framework is to identify the clues in the introductory section which point to the salient supplementary sentences. The network computes the extraction probabilities for introductory sentences as follows (in the same way as in Eq. 1),

$$\tilde{h}_b = \mathrm{BertSum}(x_{intro}), \quad \tilde{h} = \mathrm{Encoder}(\tilde{h}_b), \quad \tilde{p} = \sigma(\tilde{W}_o \tilde{h} + \tilde{b}_o) \tag{2}$$

in which $\tilde{h}_b$ and $\tilde{h}$ are the introductory sentence representations produced by BERTSUM and the Transformer encoder, respectively, $\tilde{p}$ is the introductory sentence extraction probabilities, and $\tilde{W}_o$ and $\tilde{b}_o$ are trainable parameters.

After identifying salient introductory sentences, the representations associated with them are retrieved using a pooling function and further used to guide the first task (i.e., the right-hand side in Figure 3) as follows,

$$h_{top} = \mathrm{Select}(\tilde{h}, \tilde{p}, k), \quad \hat{h} = \mathrm{MLP}_1(h_{top}) \tag{3}$$

4 Supplementary sentences are defined in Section 1.
Figure 3: Detailed illustration of our summarization framework. Task-1 ($t_1$): source sentence extraction (right-hand gray box). Task-2 ($t_2$): introductory sentence extraction (left-hand gray box). As shown, the salient introductory sentences identified at training stages are incorporated into the representations of source sentences by the Select function (orange box) with $k = 3$. The plus sign shows the concatenation layer. The feed-forward neural network is made of one linear layer.
$\mathrm{Select}(\cdot)$ is a function that takes in all introductory sentence representations (i.e., $\tilde{h}$) and the introductory sentence probabilities $\tilde{p}$. It then outputs the representations associated with the top $k$ introductory sentences, sorted by $\tilde{p}$. To extract the top $k$ introductory sentences, we first sort the $\tilde{h}$ vectors based on their computed probabilities $\tilde{p}$ and then pick the top $k$ hidden vectors (i.e., $h_{top}$) that have the highest probability. $\mathrm{MLP}_1$ is a multi-layer perceptron that takes in the concatenated vector of the top introductory sentences and projects it into a new vector called $\hat{h}$.
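The selection step described above amounts to a top-k lookup followed by concatenation. The sketch below assumes representations are plain lists of floats and omits the multi-layer perceptron that projects the concatenated vector; the names `select_top_k` and `concat` are hypothetical and only illustrate the described behavior.

```python
def select_top_k(h_intro, p_intro, k):
    """Return the k introductory representations with the highest
    extraction probability, ordered by descending probability."""
    order = sorted(range(len(p_intro)), key=lambda i: p_intro[i], reverse=True)
    return [h_intro[i] for i in order[:k]]


def concat(vectors):
    """Flatten the selected vectors into the single vector that would be
    fed to the projection layer."""
    return [x for v in vectors for x in v]


# Three introductory sentences with 2-dimensional toy representations.
h_intro = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
p_intro = [0.2, 0.9, 0.5]
h_top = select_top_k(h_intro, p_intro, k=2)  # second sentence, then third
flat = concat(h_top)
```

Because the output is ordered by probability rather than by position, the concatenated vector places the most confident introductory sentence first.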
At the final stage, we concatenate the transformed introductory top sentence representations (i.e., $\hat{h}$) with each source sentence representation from Eq. 1 (i.e., $h_i$, where $i$ denotes the $i$-th paper sentence) and process them to produce a resulting vector $o_i$, which is the intro-aware source sentence hidden representation. After processing the resulting vector through a linear output layer (with $W$ and $b$ as trainable parameters), we obtain the final intro-aware sentence extraction probabilities (i.e., $p_i$) as

$$o_i = \mathrm{MLP}_2([h_i; \hat{h}]), \quad p_i = \sigma(W o_i + b) \tag{4}$$

in which $\mathrm{MLP}_2$ is a multi-layer perceptron that transfers the knowledge from the introductory sentence extraction task (i.e., $t_2$) into the source sentence extraction task (i.e., $t_1$). We train both tasks jointly through our end-to-end system as follows,

$$\ell_{total} = \alpha\,\ell_{t_1} + (1 - \alpha)\,\ell_{t_2} \tag{5}$$

where $\ell_{t_1}$ and $\ell_{t_2}$ are the losses computed for the source sentence extraction and introductory sentence extraction tasks, respectively, $\alpha$ is the regularizing parameter that balances the learning process between the two tasks, and $\ell_{total}$ is the total computed loss that is optimized during training.
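As a small numeric illustration of the weighted objective in Eq. 5, the sketch below combines two per-task losses. It assumes each task loss is a mean binary cross-entropy over sentence labels (the paper states that cross-entropy is used) and that predictions are given as plain probabilities; the function names are hypothetical.

```python
import math


def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def total_loss(loss_t1, loss_t2, alpha=0.5):
    """Weighted sum of the source-sentence loss (t1) and the
    introductory-sentence loss (t2)."""
    return alpha * loss_t1 + (1.0 - alpha) * loss_t2


loss_t1 = bce([0.8, 0.1, 0.6], [1, 0, 1])  # toy source-sentence predictions
loss_t2 = bce([0.7, 0.3], [1, 0])          # toy introductory predictions
combined = total_loss(loss_t1, loss_t2, alpha=0.5)
```

With alpha = 0.5 both tasks contribute equally; setting alpha to 1.0 would reduce the objective to the source sentence extraction loss alone.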
5 Experimental Setup
In this section, we explain the datasets, baselines,
and preprocessing and training parameters.
5.1 Dataset
We use two publicly available scientific extended summarization datasets (Sotudeh et al., 2021).
• arXiv-Long: A set of arXiv scientific papers from various scientific domains such as physics, mathematics, computer science, and quantitative biology. arXiv-Long is intended for the extended summarization task and was filtered from a larger dataset, i.e., arXiv (Cohan et al., 2018), for summaries of more than 350 tokens. The ground-truth summaries (i.e., abstracts) are long, with an average length of 574 tokens. It contains 7816 (train), 1381 (validation), and 1952 (test) papers.
• PubMed-Long: A set of biomedical scientific papers from PubMed with an average summary length of 403 tokens. This dataset contains 79893 (train), 4406 (validation), and 4402 (test) scientific papers.
• LongSumm: The recently proposed LongSumm dataset for a shared task (Chandrasekaran et al., 2020) contains 2236 abstractive and extractive summaries for training and 22 papers for the official test set. We report a comparison with BERTSUMEXTMULTI using this data in Table 2. However, as the official test set is blind, our experimental results in Table 1 do not use this dataset.
We compare our model with two strong non-neural systems and four state-of-the-art neural summarizers. We use all of these baselines for the purpose of extended summary generation, whose documents differ in length, writing style, and discourse structure from documents in other summarization domains.
• LSA (Steinberger and Ježek, 2004): an extractive vector-based model that utilizes Singular Value Decomposition (SVD) to find the semantically important sentences.
• LEXRANK (Erkan and Radev, 2004): a widely adopted extractive summarization baseline that utilizes a graph-based approach based on eigenvector centrality to identify the most salient sentences.
• BERTSUMEXT (Liu and Lapata, 2019): a contextualized summarizer fine-tuned for the summarization task, which encodes input sentence representations and then processes them through a multi-layer Transformer encoder to obtain document-level sentence representations. Finally, a linear output layer with a Sigmoid activation function outputs a probability for each input sentence, denoting how likely it is to be extracted.
• BERTSUMEXT-INTRO (Liu and Lapata, 2019): a BERTSUMEXT model that only runs on the introductory sentences as the input, and extracts the salient introductory sentences as the summary.
• BERTSUMEXTMULTI (Sotudeh et al., 2021): an extension of the BERTSUMEXT model that incorporates an additional linear layer with a Sigmoid classifier to output a probability distribution over a fixed number of pre-defined sections that an input sentence might belong to. The additional network is expected to predict a single section for an input sentence and is trained jointly with the BERTSUMEXT module (i.e., the sentence extractor).
• BART (Lewis et al., 2020): a state-of-the-art abstractive summarization model that makes use of a pretrained encoder and decoder. BART can be thought of as an extension of BERTSUM: in BERTSUM only the encoder is pre-trained while the decoder is trained from scratch, whereas BART pre-trains both. While our model is extractive, we find it valuable to measure the performance of an abstractive model on the extended summary generation task.
5.3 Preprocessing, parameters, labeling, and
implementation details
We used the open implementation of BERTSUMEXT with default parameters. To implement the non-neural baseline models, we utilized the Sumy Python package. The Longformer model (Beltagy et al., 2020) is utilized as our contextualized language model for running all the models due to its efficacy at processing long documents. For our model, the cross-entropy loss function is used for the two tasks (i.e., $t_1$: source sentence extraction and $t_2$: introductory sentence extraction in Figure 3), and the model is optimized through the multi-tasking approach discussed in Section 4. The model with the highest ROUGE-2 on the validation set is selected for inference. Validation is performed every 2k training steps. $\alpha$ (in Eq. 5) is set to 0.5 (empirically determined). Our model includes 474M trainable parameters and was trained on dual GeForce GTX 1080Ti GPUs for approximately a week. We use $k = 5$ for arXiv-Long and $k = 8$ for PubMed-Long (Eq. 3). We make our model implementation as well as sample summaries publicly available to expedite ongoing research in this direction 7.
A two-stage labeling approach was employed to identify ground-truth introductory and non-introductory sentences. In the first stage, we used a greedy labeling approach (Liu and Lapata, 2019) to label sentences within the first section of a given paper (i.e., labeling introductory sentences) with respect to their ROUGE overlap with the ground-truth summary (i.e., abstract). In the second stage, the same greedy approach was applied over the rest of the sentences (i.e., non-introductory) with regard to their ROUGE overlap with the introductory sentences identified in the first stage. Our choice of ROUGE-2 and ROUGE-L for measuring this overlap is based on the fact that these show higher agreement with human judgments (Cohan and Goharian, 2016). We continued the second stage until a fixed summary length was reached. Specifically, the fixed number of positive labels is set to 15 for arXiv-Long and 20 for PubMed-Long, as these achieved the highest oracle ROUGE scores in our experiments.
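The two-stage labeling can be sketched as two calls to one greedy routine. The sketch below makes several assumptions that should be read as simplifications rather than the authors' procedure: a bigram-overlap F1 stands in for the mean of ROUGE-2 and ROUGE-L, tokenization is whitespace splitting, the first stage stops when adding a sentence no longer helps, and `intro_budget` is a hypothetical parameter (the paper only states the total number of positive labels).

```python
from collections import Counter


def bigrams(text):
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))


def overlap_f1(candidate, reference):
    """Bigram-overlap F1: a crude stand-in for the ROUGE scores in the paper."""
    c, r = bigrams(candidate), bigrams(reference)
    hits = sum((c & r).values())
    if hits == 0:
        return 0.0
    prec, rec = hits / sum(c.values()), hits / sum(r.values())
    return 2 * prec * rec / (prec + rec)


def greedy_label(sentences, reference, budget, stop_on_no_gain=True):
    """Repeatedly add the sentence that yields the best overlap with the
    reference, until the budget is reached (or, optionally, no gain)."""
    chosen, best = [], 0.0
    while len(chosen) < budget:
        scored = []
        for i in range(len(sentences)):
            if i in chosen:
                continue
            candidate = " ".join(sentences[j] for j in sorted(chosen + [i]))
            scored.append((overlap_f1(candidate, reference), -i))
        if not scored:
            break
        score, neg_i = max(scored)  # ties go to the earlier sentence
        if stop_on_no_gain and score <= best:
            break
        chosen.append(-neg_i)
        best = score
    return sorted(chosen)


def two_stage_labels(intro, non_intro, abstract, intro_budget, total_budget):
    # Stage 1: label introductory sentences against the abstract.
    intro_ids = greedy_label(intro, abstract, intro_budget)
    # Stage 2: label the remaining sentences against the selected
    # introductory sentences, continuing until the fixed length is reached.
    pointers = " ".join(intro[i] for i in intro_ids)
    rest_ids = greedy_label(non_intro, pointers, total_budget - len(intro_ids),
                            stop_on_no_gain=False)
    return intro_ids, rest_ids
```

The key design point is that the second stage never looks at the abstract directly: non-introductory sentences are scored only against the introductory sentences chosen in the first stage.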
6 Results
6.1 Experimental evaluation
The recent effort in extended summarization and its LongSumm shared task (Chandrasekaran et al., 2020) used average ROUGE (F1) to rank the participating systems, in addition to the commonly used ROUGE-N scores. Table 2 shows the performance of the participating systems on the blind test set. As shown, the BERTSUMEXTMULTI model outperforms the other models by a large margin (i.e., with relative improvements of 6% and 3% on ROUGE-1 and average ROUGE (F1), respectively); hence, we use the best-performing model in terms of F1 (i.e., BERTSUMEXTMULTI) in our experiments.
7 Lab/
8 We used the mean of ROUGE-2 and ROUGE-L.
We assumed that non-introductory sentences occur in sections other than the first section.

Table 1 presents our results on the test sets of the arXiv-Long and PubMed-Long datasets. As observed, our model statistically significantly outperforms the state-of-the-art systems on both datasets across most of the ROUGE variants, except ROUGE-L on PubMed-Long. The improvements gained by our model validate our hypothesis that incorporating the salient introductory sentence representations into the extractive summarizer yields a promising improvement. The two non-neural models (i.e., LSA and LEXRANK) underperform the neural models, as expected. Comparing the abstractive model (i.e., BART) with the extractive neural ones (i.e., BERTSUMEXT and BERTSUMEXTMULTI), we see that while the gap is relatively small in terms of ROUGE-1, it is larger for ROUGE-2 and ROUGE-L. In the case of BART, we found that generating extended summaries is rather challenging for abstractive summarizers. Current abstractive summarizers, including BART, have difficulty in abstracting very detailed information such as numbers and quantities, which hurts the faithfulness of the generated summaries to the source. This behavior has a detrimental effect specifically on ROUGE-2 and ROUGE-L, as their high correlation with human judgments in terms of faithfulness has been shown (Pagnoni et al., 2021). Comparing the extractive BERTSUMEXT and BERTSUMEXTMULTI models, although BERTSUMEXTMULTI is expected to outperform BERTSUMEXT, we observe that they perform almost similarly, with small (i.e., insignificant) differences in metrics. This might be because BERTSUMEXTMULTI works well out of the box when a handful of sentences are sampled from diverse sections to form the oracle summary, as also reported by its authors. However, when labeling oracle sentences in our framework (i.e., intro-guided labeling), there is no guarantee that the final set of oracle sentences is labeled from diverse sections. Overall, our model achieves improvements of about 1.4%, 2.4%, 3.5% (arXiv-Long) and 1.0%, 2.5%, 1.3% (PubMed-Long) across ROUGE score variants, and of 2.2% (arXiv-Long) and 1.4% (PubMed-Long) on F1, compared to the neural baselines (i.e., BERTSUMEXT and BERTSUMEXTMULTI). When comparing our model with BERTSUMEXT-INTRO, we see the vital effect of adding the second encoder for finding supplementary sentences across non-introductory sections, where our model gains relative improvements of 9.62%, 26.26%, 16.09% and 9.40%, 5.27%, 9.99% for ROUGE-1, ROUGE-2, ROUGE-L on arXiv-Long and PubMed-Long, respectively. In fact, the
sentences that are picked as the summary from the introduction section are not comprehensive as such; they are clues to the main points of the paper. The other important sentences are picked from the supplementary (i.e., non-introductory) parts of the paper.

                      arXiv-Long                     PubMed-Long
Model                 R1(%)  R2(%)  RL(%)  F1(%)    R1(%)  R2(%)  RL(%)  F1(%)
ORACLE                53.35  24.40  23.65  33.80    52.11  23.41  25.42  33.65
BERTSUMEXT-INTRO      44.88  15.99  19.14  26.25    45.08  20.08  21.52  28.89
LSA                   43.23  13.47  17.50  24.73    44.47  15.38  19.17  26.34
LEXRANK               43.73  15.01  18.62  25.41    48.63  20.37  22.49  30.50
BERTSUMEXT            48.42  19.71  21.47  29.87    48.82  20.89  23.37  31.03
BERTSUMEXTMULTI       48.52  19.66  21.42  29.87    48.85  20.71  23.29  30.95
BART                  48.12  15.30  20.80  28.07    48.32  17.33  21.42  29.87
TSTR (Ours)           49.20  20.19  22.22  30.54    49.32  21.41  23.67  31.47
Table 1: ROUGE (F1) results of the baseline models and our model on the test sets of the extended summarization datasets (arXiv-Long and PubMed-Long). Statistical significance is assessed with a paired t-test (p < 0.05).

System                    R1     R2     RL     F1(%)
Summaformers (2020)       49.38  16.86  21.38  29.21
IIITBH-IITP (2020)        49.03  15.74  20.46  28.41
Auth-Team (2020)          50.11  15.37  19.59  28.36
CIST_BUPT (2020)          48.99  15.06  20.13  28.06
BERTSUMEXTMULTI (2021)    53.11  16.77  20.34  30.07
Table 2: ROUGE (F1) results of different systems on the blind test set of the LongSumm dataset, containing 22 abstractive summaries.
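The arithmetic behind the reported figures can be checked with a short sketch. It assumes that average ROUGE (F1) is the arithmetic mean of the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores, which reproduces rows such as ORACLE and TSTR on arXiv-Long in Table 1, and that relative improvement is the difference divided by the baseline score. The function names are hypothetical.

```python
def average_rouge(r1, r2, rl):
    """Arithmetic mean of the three ROUGE F1 scores (assumed definition)."""
    return (r1 + r2 + rl) / 3.0


def relative_improvement(ours, baseline):
    """Relative gain over the baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Values taken from Table 1 (arXiv-Long).
oracle_f1 = average_rouge(53.35, 24.40, 23.65)
tstr_f1 = average_rouge(49.20, 20.19, 22.22)
r1_gain_over_intro = relative_improvement(49.20, 44.88)  # TSTR vs. BERTSUMEXT-INTRO
```

Recomputing from the rounded table entries can differ from the reported percentages in the last digit, so small discrepancies are expected.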
6.2 Human evaluation
While our model statistically significantly improves upon the state-of-the-art baselines in terms of ROUGE scores, a few works have reported the low correlation of ROUGE with human judgments (Liu and Liu, 2008; Cohan and Goharian, 2016; Fabbri et al., 2021). To provide insights into why and how our model outperforms the best-performing baselines, we perform a manual analysis of the summaries generated by our system, BERTSUMEXT, and BERTSUMEXTMULTI. For the sake of evaluation, two annotators were asked to manually compare two sets of 40 papers' ground-truth abstracts (40 for arXiv-Long and 40 for PubMed-Long) with their generated extended summaries (the baselines' and ours) to gain insights into the qualities of each model. The annotators were Electrical Engineering and Computer Science PhD students familiar with the principles of reading scientific papers. Samples were randomly selected from the test set, one from each of 40 evenly spaced bins sorted by the difference in ROUGE-L between the two compared systems.
The evaluations were performed according to two metrics: (1) Cohesion: whether the ordering of sentences in the summary is cohesive, namely whether sentences entail each other. (2) Completeness: whether the summary covers all salient information provided in the ground-truth summary. To prevent bias in selecting summaries, the ordering of the system-generated summaries was shuffled such that it could not be guessed by the annotators. Annotators were asked to specify whether the first system-generated summary wins, loses, or ties with the second system-generated summary in terms of the qualitative metrics. Note that since our model is purely extractive, it does not introduce any fact that is unfaithful to the source.
Our human evaluation results, along with Cohen's kappa (Cohen, 1960) inter-rater agreements, are shown in Table 3 (agr. column). As shown, our system's generated summaries improve completeness and cohesion in over 40% of the samples in most cases (6 out of 8 win cases). Specifically, when comparing with BERTSUMEXT, we see that 68%, 80% (arXiv-Long) and 60%, 66% (PubMed-Long) of sampled summaries are at least as good as the corresponding baseline's generated summaries in terms of cohesion and completeness, respectively. Overall, across the two metrics for BERTSUMEXT and BERTSUMEXTMULTI, we gain relative improvements over the baselines: 25.6%, 19.0% (cohesion), and 56.5%,
Win cases are the ones in which our system wins over the baseline(s) in terms of cohesion/completeness.
(a) arXiv-Long
Metric          Win   Tie   Lose  agr.
Our Model vs. BERTSUMEXT baseline
Cohesion        43%   25%   32%   46.5%
Completeness    46%   34%   20%   48.9%
Our Model vs. BERTSUMEXTMULTI baseline
Cohesion        42%   24%   34%   47.2%
Completeness    45%   32%   24%   49.1%

(b) PubMed-Long
Metric          Win   Tie   Lose  agr.
Our Model vs. BERTSUMEXT baseline
Cohesion        39%   21%   30%   52.1%
Completeness    47%   19%   34%   51.3%
Our Model vs. BERTSUMEXTMULTI baseline
Cohesion        37%   21%   32%   48.2%
Completeness    41%   17%   32%   46.3%

Table 3: Results of human evaluations over 40 papers sampled from (a) arXiv-Long's and (b) PubMed-Long's test sets. agr. shows inter-rater agreement.
[Introductory] The objective of the work presented here is to study the mechanism of
radiative line driving and the corresponding properties of the winds of possible
generations of very massive stars at extremely low metallicities and to investigate
the principal influence of these winds on ionizing fluxes and observable ultraviolet
spectra. ["#]The basic new element of this approach, needed in the domain of
extremely low metallicity, is the introduction of depth dependent force multipliers
representing the radiative line acceleration. ["%][…] Because of the depth
dependent force multipliers a new formulation of the critical point equations is
developed and a new iterative solution algorithm for the complete stellar wind
problem is introduced (section 4). ["&]
[Non-introductory] In this section we develop a fast algorithm to calculate stellar
wind structures and mass - loss rates from the equation of motion (eq.[eom1]) using
a radiative line acceleration parametrized in the form of eq.[fmp3]. ["'] After the
new concept to calculate stellar wind structures with variable force multipliers has
been introduced and tested by comparing with the observed wind properties. ["(]
The purpose of this first study is to provide an estimate about the strengths of stellar
winds at very low metallicity for very massive hot stars in a mass range roughly
between 100 to 300 m@xmath3. [")]With our new approach to describe line driven
stellar winds at extremely low metallicity we were able to make first predictions of
stellar wind properties, ionizing fluxes and synthetic spectra of a possible population
of very massive stars in this range of metallicity. ["*][…] We also calculated
synthetic spectra and were able to present for the first time predictions of uv spectra
of very massive stars at extremely low metallicities. ["+]We learned that the
presence of stellar winds leads to observable broad spectral line features, which
might be used for spectral diagnostics, should such an extreme stellar population be
detected at high redshift. [",][…]
Figure 4: (a) Our system's generated summary; (b) sentence graph visualization of our system's generated summary. Green and gray nodes are introductory and non-introductory sentences, respectively. Edge thickness denotes the strength of the ROUGE score between a pair of sentences. The parts from which sentences are sampled are shown inside brackets. The summary is truncated due to space limitations. Ground-truth summary-worthy sentences are underlined, and colored spans show pointers from introductory to non-introductory sentences.
46.7% (completeness) on arXiv-Long; and 23.1%, 13.5% (cohesion), and 27.7%, 21.9%
(completeness) on PubMed-Long. These improvements, qualitatively evaluated by the human
annotators, show the promising capability of our proposed model in generating extended
summaries that are preferable to the baselines’. We observe a similar improvement trend
when comparing our summaries with BERTSUMEXTMULTI, where 66%, 77% (arXiv-Long); and
58%, 58% (PubMed-Long) of our summaries are as good as or better than the baseline’s in
terms of cohesion and completeness. The Cohen’s inter-rater agreement scores fall into
the “moderate” agreement range according to the interpretation of Cohen’s kappa range
(McHugh, 2012).
11 Relative improvement of win rate over lose rate.
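As an illustration of the agreement statistic mentioned above, the following is a minimal sketch of Cohen's kappa for two annotators. The label set ("win", "tie", "lose") and the sample annotations are hypothetical and chosen only for illustration; they are not the paper's annotation data, and the helper name `cohen_kappa` is ours.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' nominal labels over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label distribution.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations (illustrative only, not the paper's data).
rater_1 = ["win", "win", "tie", "lose", "win", "tie", "lose", "win"]
rater_2 = ["win", "tie", "tie", "lose", "win", "win", "lose", "lose"]
kappa = cohen_kappa(rater_1, rater_2)
```

The score corrects the raw agreement rate for the agreement expected by chance, which is why it is usually reported instead of simple percentage agreement.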
6.3 Case study
Figure 4(a) shows an extended summary generated by our model from a sample arXiv-Long
paper. Underlined sentences are oracle (i.e., summary-worthy) sentences, the colored
spans denote the pointers from introductory information to non-introductory information,
and sentence numbers appear in brackets following each sentence. As shown, our system
first identifies salient introductory sentences, and then augments them with important
non-introductory sentences. Figure 4(b) shows the ROUGE scores between pairs of
introductory and non-introductory sentences. The edge thickness signifies the strength
of the ROUGE score between a pair of sentences. For example, one introductory sentence
highly correlates with a non-introductory sentence, as indicated by the thicker edge
between them. More specifically, the introductory sentence has mentions of “radiative
line driving”, “properties of the winds”, “possible generations of very massive stars”,
and “ionizing fluxes”, which map to the non-introductory sentence’s semantically similar
mentions of “line driven stellar winds”, “stellar wind properties”, “possible
generations of very massive stars”, and “ionizing fluxes” 12.
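The sentence graph described above can be approximated with a short sketch. It assumes a simplified ROUGE-1 F1 (clipped unigram overlap, lowercased whitespace tokens, no stemming) as the edge weight; the exact ROUGE variant used for the figure is not stated here, and the function names and sample sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1 over lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def sentence_graph(intro_sents, non_intro_sents):
    """Weighted edges (intro index, non-intro index, score), strongest first."""
    edges = [
        (i, j, rouge1_f1(s, t))
        for i, s in enumerate(intro_sents)
        for j, t in enumerate(non_intro_sents)
    ]
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


# Hypothetical sentences (illustrative only).
intro = ["we predict stellar wind properties and ionizing fluxes of very massive stars"]
non_intro = [
    "we make first predictions of stellar wind properties and ionizing fluxes",
    "the algorithm converges quickly",
]
edges = sentence_graph(intro, non_intro)
```

In this toy input, the first non-introductory sentence shares most of its unigrams with the introductory sentence and therefore receives the heaviest edge, which corresponds to a thicker edge in a visualization such as Figure 4(b).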
7 Error Analysis
To determine the limitations of our model, we further analyze our system’s generated
summaries and report three common defects, along with the percentage of each error among
underperforming cases. We found that (1) our end-to-end system’s performance is highly
dependent on the performance of the introductory sentence extraction task (see Figure 3),
as identification of salient introductory sentences (i.e., oracle introductory sentences)
sets up a firm ground to explore detailed sentences from the non-introductory parts of
the paper. In other words, identification of non-salient introductory sentences leads to
a drift in finding supplemental sentences from the non-introductory parts. Our model
often underperforms when it cannot find important sentences from the introductory part
(65%); (2) in underperforming cases, our model fails to select motivation and objective
sentences from the introductory part, and only identifies the contribution sentences
(i.e., those describing the paper’s contributions), such that the final generated summary
is composed of contribution sentences rather than objective sentences. This hurts the
system in cohesion and completeness (40%); and (3) as discussed, our model matches
introductory sentences with sentences from non-introductory parts of the paper. Given
that two sentences within a scientific paper might convey exactly the same information
while being paraphrases of each other, our model samples both to form the final summary,
as a high semantic correlation exists between them. This phenomenon leads to sampling
two sentences that convey the same information without providing more details; hence,
information redundancy (35%).
12 The entire set of system-generated summaries is publicly available at
Georgetown-IR-Lab/TSTRSum, including 40 human-evaluated cases.
8 Conclusion
In this work, we propose a novel approach to tackle extended summary generation for
scientific documents. Our model is built upon fine-tuned contextualized language models
for text summarization. Our method improves over strong and state-of-the-art
summarization baselines by adding an auxiliary learning component for identifying
salient introductory information of long documents, which is then used as pointers to
guide the summarizer to pick summary-worthy sentences. The extensive intrinsic and human
evaluations show the efficacy of our model in comparison with the state-of-the-art
baselines, using two large-scale extended summarization datasets. Our error analysis
further paves the way for future research.
References
Abdullah Armağan. 2013. How to write an introduction
section of a scientific article? Turkish journal of
urology, 39 Suppl 1:8–9.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020.
Longformer: The long-document transformer. ArXiv,
Odellia Boni, Guy Feigenblat, Doron Cohen, Haggai
Roitman, and David Konopnicki. 2020. A study
of human summaries of scientific articles. ArXiv,
Muthu Kumar Chandrasekaran, Guy Feigenblat, Ed-
uard Hovy, Abhilasha Ravichander, Michal Shmueli-
Scheuer, and Anita de Waard. 2020. Overview and
insights from the shared tasks at scholarly docu-
ment processing 2020: CL-SciSumm, LaySumm and
LongSumm. In Proceedings of the First Workshop
on Scholarly Document Processing, pages 214–224,
Online. Association for Computational Linguistics.
Arman Cohan, Franck Dernoncourt, Doo Soon Kim,
Trung Bui, Seokhwan Kim, Walter Chang, and Nazli
Goharian. 2018. A discourse-aware attention model
for abstractive summarization of long documents. In
Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 615–621, New Orleans,
Louisiana. Association for Computational Linguistics.
Arman Cohan and Nazli Goharian. 2015. Scientific
article summarization using citation-context and arti-
cle’s discourse structure. In Proceedings of the 2015
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 390–400, Lisbon, Portugal.
Association for Computational Linguistics.
Arman Cohan and Nazli Goharian. 2016. Revisiting
summarization evaluation for scientific articles. In
Proceedings of the Tenth International Conference
on Language Resources and Evaluation (LREC’16),
pages 806–813, Portorož, Slovenia. European Lan-
guage Resources Association (ELRA).
Arman Cohan and Nazli Goharian. 2018. Scientific
document summarization via citation contextualization
and scientific discourse. International Journal on
Digital Libraries, 19(2):287–303.
Jacob Cohen. 1960. A coefficient of agreement for
nominal scales. Educational and Psychological
Measurement, 20:37–46.
Peng Cui and Le Hu. 2021. Sliding selector network
with dynamic memory for extractive summarization
of long documents. In NAACL.
Peng Cui, Le Hu, and Yuanchao Liu. 2020. Enhancing
extractive text summarization with topic-aware graph
neural networks. In Proceedings of the 28th Inter-
national Conference on Computational Linguistics,
pages 5360–5371, Barcelona, Spain (Online). Inter-
national Committee on Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
4171–4186, Minneapolis, Minnesota. Association for
Computational Linguistics.
Günes Erkan and Dragomir R. Radev. 2004. Lexrank:
Graph-based lexical centrality as salience in text sum-
marization. J. Artif. Intell. Res., 22:457–479.
A. R. Fabbri, Wojciech Kryscinski, Bryan McCann,
R. Socher, and Dragomir Radev. 2021. Summeval:
Re-evaluating summarization evaluation. Transac-
tions of the Association for Computational Linguis-
tics, 9:391–409.
Chong Feng, Fei Cai, Honghui Chen, and Maarten de Ri-
jke. 2018. Attentive encoder-based extractive text
summarization. In Proceedings of the 27th ACM
International Conference on Information and Knowl-
edge Management, CIKM 2018, Torino, Italy, Octo-
ber 22-26, 2018, pages 1499–1502. ACM.
Sebastian Gehrmann, Yuntian Deng, and Alexander
Rush. 2018. Bottom-up abstractive summarization.
In Proceedings of the 2018 Conference on Empiri-
cal Methods in Natural Language Processing, pages
4098–4109, Brussels, Belgium. Association for Com-
putational Linguistics.
Sayar Ghosh Roy, Nikhil Pinnaparaju, Risubh Jain,
Manish Gupta, and Vasudeva Varma. 2020. Sum-
maformers @ LaySumm 20, LongSumm 20. In Pro-
ceedings of the First Workshop on Scholarly Docu-
ment Processing, pages 336–343, Online. Associa-
tion for Computational Linguistics.
Alexios Gidiotis, Stefanos Stefanidis, and Grigorios
Tsoumakas. 2020. AUTH @ CLSciSumm 20, Lay-
Summ 20, LongSumm 20. In Proceedings of the
First Workshop on Scholarly Document Processing,
pages 251–260, Online. Association for Computa-
tional Linguistics.
Ruipeng Jia, Yanan Cao, Haichao Shi, Fang Fang,
Yanbing Liu, and Jianlong Tan. 2020. Distilsum:
Distilling the knowledge for extractive summarization.
In CIKM ’20: The 29th ACM International Conference
on Information and Knowledge Management,
Virtual Event, Ireland, October 19-23, 2020, pages
2069–2072. ACM.
Padma Rekha Jirge. 2017. Preparing and publishing a
scientific manuscript. Journal of Human Reproductive
Sciences, 10:3–9.
Darsh Kaushik, Abdullah Faiz Ur Rahman Khilji,
Utkarsh Sinha, and Partha Pakray. 2021. CNLP-NITS
@ LongSumm 2021: TextRank variant for
generating long summaries. In Proceedings of the
Second Workshop on Scholarly Document Processing.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. 2020.
BART: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and com-
prehension. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 7871–7880, Online. Association for Computa-
tional Linguistics.
Lei Li, Yang Xie, Wei Liu, Yinan Liu, Yafei Jiang, Siya
Qi, and Xingyuan Li. 2020. CIST@CL-SciSumm
2020, LongSumm 2020: Automatic scientific
document summarization. In Proceedings of the First
Workshop on Scholarly Document Processing, pages
225–234, Online. Association for Computational Linguistics.
Feifan Liu and Yang Liu. 2008. Correlation between
ROUGE and human evaluation of extractive meeting
summaries. In Proceedings of ACL-08: HLT, Short
Papers, pages 201–204, Columbus, Ohio. Associa-
tion for Computational Linguistics.
Yang Liu and Mirella Lapata. 2019. Text summariza-
tion with pretrained encoders. In Proceedings of
the 2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International
Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 3730–3740, Hong Kong,
China. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du,
Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
RoBERTa: A robustly optimized BERT pretraining
approach. CoRR, abs/1907.11692.
M. McHugh. 2012. Interrater reliability: the kappa
statistic. Biochemia Medica, 22:276–282.
Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017.
Summarunner: A recurrent neural network based
sequence model for extractive summarization of doc-
uments. In Proceedings of the Thirty-First AAAI
Conference on Artificial Intelligence, February 4-9,
2017, San Francisco, California, USA, pages 3075–
3081. AAAI Press.
Artidoro Pagnoni, Vidhisha Balachandran, and Yulia
Tsvetkov. 2021. Understanding factuality in abstrac-
tive summarization with frank: A benchmark for
factuality metrics. ArXiv, abs/2104.13346.
Vahed Qazvinian and Dragomir R. Radev. 2008. Sci-
entific paper summarization using citation summary
networks. In Proceedings of the 22nd International
Conference on Computational Linguistics (Coling
2008), pages 689–696, Manchester, UK. Coling 2008
Organizing Committee.
Juan Ramirez-Orta and Evangelos E. Milios. 2021.
Unsupervised document summarization using pre-
trained sentence embeddings and graph centrality. In
Saichethan Reddy, Naveen Saini, Sriparna Saha, and
Pushpak Bhattacharyya. 2020. IIITBH-IITP@CL-
SciSumm20, CL-LaySumm20, LongSumm20. In
Proceedings of the First Workshop on Scholarly Doc-
ument Processing, Online. Association for Computa-
tional Linguistics.
T. Rohde, Xiaoxia Wu, and Yinhan Liu. 2021. Hi-
erarchical learning for generation with long source
sequences. ArXiv, abs/2104.07545.
Abigail See, Peter J. Liu, and Christopher D. Manning.
2017. Get to the point: Summarization with pointer-
generator networks. In Proceedings of the 55th An-
nual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1073–
1083, Vancouver, Canada. Association for Computa-
tional Linguistics.
Sajad Sotudeh, Arman Cohan, and Nazli Goharian.
2020. GUIR @ LongSumm 2020: Learning to gen-
erate long summaries from scientific documents. In
Proceedings of the First Workshop on Scholarly Doc-
ument Processing, pages 356–361, Online. Associa-
tion for Computational Linguistics.
Sajad Sotudeh, Arman Cohan, and Nazli Goharian.
2021. On generating extended summaries of long
documents. The AAAI-21 Workshop on Scientific
Document Understanding (SDU).
Josef Steinberger and Karel Ježek. 2004. Using latent
semantic analysis in text summarization and
summary evaluation. In ISIM.
Simone Teufel and Marc Moens. 2002. Summarizing
scientific articles: Experiments with relevance and
rhetorical status. Computational Linguistics, 28:409–
Yufei Tian, Jianfei Yu, and Jing Jiang. 2019. Aspect
and opinion aware abstractive review summarization
with reinforced hard typed decoder. In Proceedings
of the 28th ACM International Conference on Infor-
mation and Knowledge Management, CIKM 2019,
Beijing, China, November 3-7, 2019, pages 2061–
2064. ACM.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Pro-
cessing Systems 30: Annual Conference on Neural
Information Processing Systems 2017, December 4-9,
2017, Long Beach, CA, USA, pages 5998–6008.
Wen Xiao and Giuseppe Carenini. 2019. Extractive
summarization of long documents by combining
global and local context. In Proceedings of the
2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International
Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 3011–3021, Hong Kong,
China. Association for Computational Linguistics.
Senci Ying, Zheng Yan Zhao, and Wuhe Zou. 2021.
LongSumm 2021: Session based automatic
summarization model for scientific document. In
Proceedings of the Second Workshop on Scholarly Document
Processing.
Ruifeng Yuan, Zili Wang, and Wenjie Li. 2020.
Fact-level extractive summarization with hierarchical
graph mask on BERT. In Proceedings of the 28th
International Conference on Computational Linguistics,
pages 5629–5639, Barcelona, Spain (Online).
International Committee on Computational Linguistics.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and
Peter J. Liu. 2019. Pegasus: Pre-training with ex-
tracted gap-sentences for abstractive summarization.
Yanyan Zou, Xingxing Zhang, Wei Lu, Furu Wei, and
Ming Zhou. 2020. Pre-training for abstractive doc-
ument summarization by reinstating source text. In
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 3646–3660, Online. Association for Computa-
tional Linguistics.