Lift Yourself Up:
Retrieval-augmented Text Generation with Self Memory
Xin Cheng 1, Di Luo 2, Xiuying Chen 3, Lemao Liu 4, Dongyan Zhao 1, Rui Yan 2
1Peking University 2Renmin University of China 3KAUST 4Tencent AI Lab
chengxin1998@stu.pku.edu.cn
Abstract
With direct access to human-written references as memory, retrieval-augmented generation has achieved much progress in a wide range of text generation tasks. Since better memory typically prompts better generation (we define this as the primal problem), previous work has mainly focused on how to retrieve better memory. However, one fundamental limitation exists in the current literature: the memory is retrieved from a fixed corpus and is bounded by the quality of that corpus. Due to the finite retrieval space, bounded memory greatly limits the potential of memory-augmented generation models. In this paper, by exploring the duality of the primal problem, that better generation also prompts better memory, we propose a framework called Selfmem, which iteratively adopts a retrieval-augmented generator itself to produce an unbounded memory pool and uses a memory selector to pick one generated output as memory for the next generation round. By combining the primal and dual problems, a retrieval-augmented generation model can lift itself up with its own output in the infinite generation space. To verify our framework, we conduct extensive experiments across various text generation scenarios, including neural machine translation, abstractive summarization and dialogue generation, over seven datasets, and achieve state-of-the-art results on JRC-Acquis (four directions), XSum (50.3 ROUGE-1) and BigPatent (62.9 ROUGE-1).
1 Introduction
In recent years, retrieval-augmented text generation
has attracted increasing attention in multiple fields
including neural machine translation (NMT) (Gu
et al.,2018;Khandelwal et al.,2021;Cheng et al.,
2022); abstractive text summarization (Cao et al.,
2018;Peng et al.,2019); dialogue response gen-
eration (Song et al.,2016;Cai et al.,2019a;Wu
et al., 2019) and language modeling (Grave et al., 2017; Khandelwal et al., 2020; Zhong et al., 2022).

[Figure 1: scatter plot of hypothesis BLEU against memory similarity.]
Figure 1: Relation between memory and hypothesis on the JRC-Acquis En→De dataset. The hypothesis is generated by a retrieval-augmented translator whose memory is retrieved from the training set. The X-axis represents the similarity between the memory and the reference, and the Y-axis represents the translation quality; both are measured by BLEU.
This new generation paradigm first endows a generation model with access to external memory (typically the training corpus) via some information retrieval technique and then generates text on the basis of the retrieved memory. Instead of generating from scratch, retrieval-augmented generation equips the generation model with more information and potentially alleviates the difficulty of text generation (Li et al., 2022).
In this paradigm, the guiding principle for mem-
ory retrieval is to retrieve one that shares maxi-
mum similarity with the current input (Lewis et al.,
2020a;Khandelwal et al.,2020;Yogatama et al.,
2021), which is also in line with human intuition
that a more similar reference sample would typi-
cally provide more hints. As shown in Figure 1, for
a retrieval-augmented NMT model, regardless of
other factors that may affect the translation qual-
ity (e.g., polysemy, morphology and coreference),
the memory similarity alone possesses a strong
correlation with the final translation quality. We
define this as the primal problem: better memory prompts better generation, so a number of
works focus on how to retrieve better memory:
from sparse retrieval to more expressive dense re-
trieval (Cao and Xiong,2018;Parvez et al.,2021);
from a fixed retriever to a learnable retriever (Lewis
et al.,2020c;Cai et al.,2021); from sentence-level
memory to more fine-grained token-level mem-
ory (Khandelwal et al.,2020,2021).
However, one fundamental limitation exists for
all previous works: the memory is retrieved from a
fixed corpus and is bounded by the quality of the
corpus. Due to the finite retrieval space, bounded
memory would greatly limit the potential of the
memory-augmented generation model. In this pa-
per, by exploring the primal problem's duality, that better generation also prompts better memory, we propose a novel framework called Selfmem, which iteratively adopts a retrieval-augmented generator itself to generate an unbounded memory pool and uses a memory selector to pick one output as memory for the next generation round. By combining the primal and dual problems, a retrieval-augmented generation model can lift itself up with its own output in the infinite generation space (Raffel et al., 2020), dubbed self-memory. The key insight behind Selfmem is that the text most similar in distribution to the data seen at inference time is not the training data (Wang et al., 2022), but the model's own output.
In particular, there are two complementary components in Selfmem: a retrieval-augmented generator and a memory selector. First, we train the generator with retrieved memory and use the outputs of the generator to train a memory selector based on a certain metric. Then, simply by substituting the unbounded generated memory for the retrieved memory, we obtain generation output of higher quality (primal problem), which in turn serves as the memory for the next round after being selected by the memory selector (dual problem).
To verify the effectiveness of Selfmem, we conduct extensive experiments across seven datasets under three text generation scenarios: neural machine translation, abstractive text summarization and dialogue generation. We observe significant improvements over strong baselines and achieve state-of-the-art results on JRC-Acquis (four directions), XSum (50.3 ROUGE-1) and BigPatent (62.9 ROUGE-1). To further understand Selfmem, we also carefully examine each key component and locate the current system bottleneck for future research.
To summarize, our key contributions are:
• We are the first to investigate the problem of bounded memory in the retrieval-augmented literature.
• By combining the primal and dual problems, we propose Selfmem, a retrieval-augmented framework that can lift itself up with its own unbounded output as self-memory.
• We conduct extensive experiments across various text generation scenarios and greatly improve the state-of-the-art performance.
2 Related Work
2.1 Retrieval-augmented Text Generation
Since the world does not stand still once the training corpus is collected, we can never expect an ever-larger model to capture everything in its parameters,
even for LLMs like GPT-3 (Brown et al.,2020),
and it is important to endow the model with access
to an external memory bank to solve different NLP
tasks (Lewis et al.,2020c).
For the translation task, long before machine translation, the localization industry was already using retrieval techniques to help human translators achieve higher productivity and consistency (Yamada, 2011). Early work mainly employed memory for statistical machine translation (SMT) systems (Simard and Isabelle, 2009; Liu et al., 2012). For NMT, Gu et al. (2018) first used a search engine to retrieve memory from the training set and incorporated it with an external memory network. The follow-up
works mainly focus on the different aspects of the
retrieval-augmented NMT including the memory
encoding method (Xia et al.,2019;Xu et al.,2020;
He et al.,2021), the joint training of retriever and
generator with monolingual data (Cai et al.,2021),
the granularity of the memory (Khandelwal et al.,
2021) and the diversity of the memories (Cheng
et al.,2022). On dialogue response generation
task, exemplar/template retrieval as an intermedi-
ate step has been shown beneficial to informative
response generation (Weston et al.,2018;Wu et al.,
2019;Cai et al.,2019a,b). There are also works
on other applications such as abstractive summa-
rization (Wang et al.,2022;Cao et al.,2018;Peng
et al.,2019), code generation (Hashimoto et al.,
2018), paraphrase generation (Kazemnejad et al.,
2020;Su et al.,2021), language modeling (Khan-
delwal et al.,2020;Zhong et al.,2022) and all
knowledge-intensive tasks that can be framed as a
text generation problem (Lewis et al.,2020c).
2.2 Neural Text Reranking
By alleviating the discrepancy between training
and inference (i.e., exposure bias) and directly op-
timizing the desired metrics, two-stage reranking
methods have enabled strong progress in several
branches of text generation tasks. In machine trans-
lation, the seminal work of Shen et al. (2004) and
Och et al. (2004) introduced and popularized dis-
criminative reranking to SMT. For NMT, there are
two lines of work focusing on reranking: generative
reranking (Liu et al.,2018;Imamura and Sumita,
2017;Wang et al.,2017) and discriminative rerank-
ing (Lee et al.,2021;Salazar et al.,2020;Deng
et al., 2020). In syntactic parsing, Collins and Koo (2005) were the first to employ a two-stage reranking method to select from the outputs of a base parser, and Charniak and Johnson (2005) introduced a maximum-entropy reranker. For text
summarization, RefSum (Liu et al.,2021) defines a
second-stage summarization framework that helps
address the problem of the train-test distribution
mismatch. SimCLS (Liu and Liu,2021) utilizes
pair-wise Learning To Rank (LTR) to select the can-
didates with the highest matching score. SummaR-
eranker (Ravaut et al.,2022a) adopts a multi-task
mixture-of-experts framework to utilize different
metrics that capture different aspects of generated
candidates. BRIO (Liu et al.,2022) re-uses the
base model for a second round of fine-tuning with
both the cross-entropy loss and a candidate-level
ranking loss. JGR (Shen et al.,2022) adopts an
alternate training paradigm to train the generator
and reranker.
One limitation of these reranking methods is that they are a one-way process: the selected candidates are the final outputs of the system. In contrast, our framework employs the selected candidates as the memory for the next generation round of a retrieval-augmented generator, which is capable of generating better candidates with better memory.
3 Self Memory
In this section, we start with a motivating experiment on generation as memory in §3.1. Then we introduce Selfmem, a framework consisting of a retrieval-augmented generator (§3.2) and a memory selector (§3.3). The whole framework and algorithm are shown in Figure 2 and Algorithm 1.
3.1 Generation as Memory
The key motivation of our framework is the observation that the memory most similar in distribution to the data at inference time is not the training data (38.89 BLEU in Table 1), but the model's own output (58.58 BLEU) in the unbounded generation space. One interesting exploration is therefore to directly use the generation as memory, with respect to the primal problem: better memory prompts better generation.
We conduct experiments on the JRC-Acquis En→De dataset with a retrieval-augmented generator detailed in §3.2. As shown in Table 1, the first row is the conventional retrieval-augmented setting trained with retrieved memory, which achieves a 58.58 BLEU score on the test set. Directly feeding this beam output back to the generation model as memory (Beam) yields no improvement, even though the beam outputs are much more similar to the references than the retrieved ones. Our conjectures are: (1) the retrieval-augmented generator cannot generalize well in this setting because of the memory distribution shift (from 38.89 to 58.58); (2) the beam memory does not provide any information gain compared with the retrieved memory, even with more overlap with the references.
            Memory   Hypothesis
Retrieval   38.89    58.58
Beam        58.58    58.43
Reference   100      90.43
Random      1.14     49.08
Table 1: Experiments with a fixed retrieval-augmented translator and different types of memory on the JRC-Acquis En→De dataset, measured by BLEU.
To rule out the former conjecture, we explore the best-case and worst-case scenarios by using the reference as memory (Reference) and randomly sampled sentences as memory (Random). Table 1 shows that a retrieval-augmented generator trained with retrieved memory, with fixed parameters, has already learned how to leverage the memory information in both the oracle and the random scenario. For the latter conjecture, we first define the token sets of the reference, the retrieved memory, and the beam memory as R, M, and B, respectively.
[Figure 2: framework diagram. Panel (a), the retrieval-augmented generator, shows the source, retrieved memory, encoder, decoder and NLL loss; panel (b), the memory selector, shows the candidates, the predicted and target distributions and the KL loss; the two are connected by the primal and dual problems.]
Figure 2: Overall framework. There are two components in Selfmem, a retrieval-augmented generator (a) and a
memory selector (b). For the primal problem, (a) takes source and memory as input to generate candidates for (b).
For the dual problem, (b) takes as input source and generated candidates to select memory for (a).
The overlap token set O is defined as the tokens that overlap with the reference and appear in the beam memory but not in the retrieved memory, i.e., O = (R ∩ B) \ (R ∩ M). O is deemed the additional information that the beam memory provides. We calculate the confidence score ψ(·) of a set as:

ψ(·) = (1/|·|) Σ_{yi∈·} p(yi|x, y<i)    (1)
where p(yi|x, y<i) is given by the generation model. ψ(·) measures how confidently the generation model generates the tokens in a set. ψ(R) is 0.58 while ψ(O) is 0.76, which means the generator is already quite confident about generating the tokens in O (Xu et al., 2022; Edunov et al., 2018), so it does not need to resort to external memory for them (Kumar et al., 2016). Because beam search ranks generated candidates by p(y|x), the selected memory falls into the confidence region of the generator and thus provides no information gain. This motivates us to select memory according to metrics other than p(y|x) in the memory selector (§3.3).
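To make Eq. (1) concrete, here is a minimal sketch (ours, not the authors' released code) of how ψ(·) and the overlap set O could be computed, assuming the per-token probabilities p(yi|x, y<i) have already been obtained by teacher-forcing the generator on the reference; all function names are illustrative.

```python
from typing import List, Set

def overlap_set(reference: List[str], retrieved: List[str], beam: List[str]) -> Set[str]:
    """O = (R ∩ B) \\ (R ∩ M): reference tokens found in the beam memory
    but not in the retrieved memory."""
    R, M, B = set(reference), set(retrieved), set(beam)
    return (R & B) - (R & M)

def confidence_score(ref_tokens: List[str], ref_token_probs: List[float],
                     token_set: Set[str]) -> float:
    """Eq. (1): average p(y_i | x, y_<i) over reference positions whose token
    belongs to `token_set`. `ref_token_probs[i]` is assumed to be the
    probability the generator assigns to `ref_tokens[i]` under teacher forcing."""
    picked = [p for tok, p in zip(ref_tokens, ref_token_probs) if tok in token_set]
    return sum(picked) / len(picked) if picked else 0.0
```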
3.2 Retrieval-augmented Generator
Given a text pair (x, y), where x = {x1, ..., x|x|} is the source and y = {y1, ..., y|y|} is the target, the pair could be (document, summary) in summarization or (context, response) in dialogue generation. Retrieval-augmented generation first uses x to retrieve memory m from a datastore D. The generator Gξ(x, m), parameterized by ξ, then takes both x and m as input to generate the target sentence y. In this paper, following standard practice (Cheng et al., 2022; Wang et al., 2022), we choose the training set as D = {(xi, yi)}_{i=1}^{|D|} and only keep the target side of the top-1 retrieval result as memory. For the generator Gξ, we consider two commonly used retrieval-augmented architectures: Joint-Encoder (Guu et al., 2020; Wang et al., 2022; Lewis et al., 2020c) and Dual-Encoder (Xia et al., 2019; Cai et al., 2021; Cheng et al., 2022).
Joint-Encoder  This architecture is the standard encoder-decoder model (Bahdanau et al., 2015; Sutskever et al., 2014; Vaswani et al., 2017). The input is the concatenation of x and m. The encoder first maps the input into hidden states H:

H = Encoder(x [SEP] m)    (2)

and the decoder incorporates H through the attention mechanism (Vaswani et al., 2017) and generates tokens in an auto-regressive manner:

hi = Decoder(CrossAttn(H), y<i)    (3)
PGξ(·|x, y<i) = Softmax(hi)    (4)
Dual-Encoder  Instead of treating x and m as one long sequence, this architecture has two encoders, one for x and one for m. Their outputs are sequentially attended by the decoder with dual cross attention as in Cheng et al. (2022):

Hx = SourceEncoder(x)    (5)
Hm = MemoryEncoder(m)    (6)
hi = Decoder(CrossAttn(Hx, Hm), y<i)    (7)
We use the Transformer (Vaswani et al., 2017) as the building block for both architectures and optimize Gξ with the NLL loss:

Lnll = − Σ_{t=1}^{|y|} log PGξ(yt|x, m, y<t)    (8)
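For concreteness, the following is a minimal sketch of the Joint-Encoder variant built on the Hugging Face Transformers API (an illustrative assumption; it is not the authors' released implementation): the source and retrieved memory are concatenated with a separator token, the model is trained with the NLL loss of Eq. (8), and "candidate mode" corresponds to returning the whole beam.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

# Illustrative backbone choice; the paper uses BART-base for summarization and dialogue.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def nll_loss(source: str, memory: str, target: str) -> torch.Tensor:
    # Joint-Encoder input: x [SEP] m as one sequence (Eq. 2)
    enc = tokenizer(source + tokenizer.sep_token + memory,
                    return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    # The model's cross-entropy over target tokens is the NLL of Eq. (8).
    return model(**enc, labels=labels).loss

def generate_candidates(source: str, memory: str, n: int = 50) -> list:
    # "Candidate mode": beam search that returns the whole beam as the pool C.
    enc = tokenizer(source + tokenizer.sep_token + memory,
                    return_tensors="pt", truncation=True)
    out = model.generate(**enc, num_beams=n, num_return_sequences=n, max_length=128)
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```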
3.3 Memory Selector
The role of the memory selector Sθ(x, c), parameterized by θ, is to select one candidate c from the candidate pool C generated by Gξ, based on a certain metric ∆(·,·). The selected c is then used as memory m for the next generation round of Gξ. According to §3.1, using pGξ(y|x) as ∆(·,·) would fall into the confidence region of Gξ and bring no information gain. Moreover, a larger pGξ(y|x) does not necessarily guarantee better generation quality (Meister et al., 2020). We therefore define ∆(·,·) as a model-free metric that is widely used to measure generation quality (e.g., BLEU for NMT, ROUGE for summarization). Our memory selector takes as input the concatenation of the source x and a candidate ci and outputs a multinomial distribution pSθ(·|x) over C:
pSθ(ci|x) = exp(Sθ(x [SEP] ci)) / Σ_{j=1}^{|C|} exp(Sθ(x [SEP] cj))    (9)
Following Lee et al. (2021), the training objective of Sθ is to minimize the discrepancy between the prediction of Sθ and the score computed by ∆(·,·), where the discrepancy is measured by the KL divergence:

Lkl = − Σ_{i=1}^{|C|} pM(ci) log pSθ(ci|x)    (10)
where pM(ci) is the metric distribution defined as:

pM(ci) = exp(∆(ci, y)/τ) / Σ_{j=1}^{|C|} exp(∆(cj, y)/τ)    (11)

where τ is a temperature that controls the smoothness of the distribution. At inference time, the output of Sθ is arg max_{ci∈C} pSθ(ci|x).
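A minimal PyTorch sketch of the selector objective in Eqs. (9)-(11) is given below; it assumes the raw candidate scores Sθ(x [SEP] ci) and the metric scores ∆(ci, y) have already been computed, and it is not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def selector_loss(candidate_scores: torch.Tensor,  # Sθ(x [SEP] c_i), shape (|C|,)
                  metric_scores: torch.Tensor,     # ∆(c_i, y), shape (|C|,)
                  tau: float = 1.0) -> torch.Tensor:
    # Eq. (9): predicted distribution over the candidate pool C
    log_p_pred = F.log_softmax(candidate_scores, dim=-1)
    # Eq. (11): metric distribution smoothed by temperature τ
    p_metric = F.softmax(metric_scores / tau, dim=-1)
    # Eq. (10): cross-entropy between the two (equal to the KL objective up to a constant)
    return -(p_metric * log_p_pred).sum()

def select_memory(candidates: list, candidate_scores: torch.Tensor) -> str:
    # At inference: arg max over p_Sθ(c_i | x)
    return candidates[int(candidate_scores.argmax())]
```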
3.4 Combine Generator and Selector
With Gξ, we define two generation modes. The first is hypothesis mode, which produces one output per input and is used for system evaluation. The second is candidate mode, which generates N outputs per input and is used for training Sθ and for memory selection. By combining these two modes of Gξ with Sθ, the whole Selfmem framework is shown in Algorithm 1.
Algorithm 1 Selfmem Framework
Require: a dataset D, a retriever R, a memory selection metric ∆(·,·), a retrieval-augmented generator Gξ, and a memory selector Sθ
1: retrieve memory M in D with R
2: train Gξ with D and M
3: use Gξ to generate candidate pool C with M in candidate mode
4: train Sθ on C with ∆(·,·)
5: while not converged do
6:     Sθ selects memory from C as M
7:     Gξ generates candidate pool C with M in candidate mode
8: end while
9: Gξ generates the final hypothesis with M in hypothesis mode
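The iterative procedure of Algorithm 1 can be summarized in the following schematic sketch, where the callables are hypothetical stand-ins for the components described above and a fixed number of rounds replaces the validation-based convergence check.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source, target)

def selfmem_loop(dataset: List[Pair],
                 retrieve: Callable,             # step 1: retriever R over D
                 train_generator: Callable,      # step 2: train Gξ on (D, M)
                 generate_candidates: Callable,  # steps 3/7: candidate mode of Gξ
                 train_selector: Callable,       # step 4: train Sθ with ∆(·,·)
                 select: Callable,               # step 6: Sθ picks memory from C
                 generate_hypothesis: Callable,  # step 9: hypothesis mode of Gξ
                 n_rounds: int = 3) -> List[str]:
    memory = retrieve(dataset)                                        # step 1
    generator = train_generator(dataset, memory)                      # step 2
    candidates = generate_candidates(generator, dataset, memory)      # step 3
    selector = train_selector(dataset, candidates)                    # step 4
    for _ in range(n_rounds):                                         # steps 5-8
        memory = select(selector, dataset, candidates)                # dual problem
        candidates = generate_candidates(generator, dataset, memory)  # primal problem
    return generate_hypothesis(generator, dataset, memory)            # step 9
```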
4 Experimental Setup
4.1 Dataset
We evaluate Selfmem on 3 different tasks with 7 datasets. The data statistics are in Appendix A.
Translation  We evaluate our framework on the JRC-Acquis dataset (Steinberger et al., 2006), a collection of parallel legislative text of European Union law. It is the benchmark dataset used for the translation-memory-augmented NMT task (Gu et al., 2018; Xia et al., 2019; Cai et al., 2021; Cheng et al., 2022). We choose 4 translation directions, namely Spanish↔English (Es↔En) and German↔English (De↔En).
Summarization  We evaluate on 2 summarization datasets: 1) XSum (Narayan et al., 2018), extreme summarization, a single-document summarization dataset with highly abstractive articles from the British Broadcasting Corporation; 2) BigPatent (Sharma et al., 2019), consisting of 1.3 million records of U.S. patent documents along with human-written abstractive summaries.
Dialogue  We experiment on DailyDialog (Li et al., 2017), which contains multi-turn dialogs on daily-life topics and is used by Chen et al. (2022a); Bao et al. (2020); Zhao et al. (2022).
4.2 Implementation Details
We use BM25 (Robertson and Zaragoza, 2009) to conduct retrieval. All experiments are based on the Transformers library (Wolf et al., 2020) and are conducted on 8 NVIDIA A100 GPUs. Candidates are generated by beam search with beam size 50 for all tasks. The number of iterations is determined by performance on the validation set.

System                  Es→En          En→Es          De→En          En→De
                        Dev    Test    Dev    Test    Dev    Test    Dev    Test
None Memory
Bahdanau et al. (2015)  55.02  59.34   50.54  50.48   50.20  49.74   44.94  43.98
Transformer             64.08  64.63   62.02  61.80   60.18  60.16   54.65  55.43
Retrieval Memory
Gu et al. (2018)        60.28  59.34   57.62  57.27   55.63  55.33   49.26  48.80
Zhang et al. (2018)     63.97  64.30   61.50  61.56   60.10  60.26   55.54  55.14
Xia et al. (2019)       66.37  66.21   62.50  62.76   61.85  61.72   57.43  56.88
Cai et al. (2021)       67.73  67.42   64.18  63.86   64.48  64.62   58.77  58.42
Cheng et al. (2022)     67.48  67.76   63.84  64.04   64.22  64.33   58.94  58.69
Transformer-dual        66.87  67.12   63.14  63.54   64.09  63.36   58.69  58.06
Transformer-joint       67.74  67.32   63.93  64.12   64.50  64.40   58.16  58.58
Self Memory
Transformer-dual        68.63  69.20   64.12  64.67   65.06  64.98   59.26  59.49
Transformer-joint       68.26  68.80   66.07  65.94   65.32  65.65   59.88  60.11
Table 2: Results of the translation task on JRC-Acquis measured by BLEU. We compare three kinds of translation systems: the top section is the vanilla sequence-to-sequence model without memory; the second section consists of models equipped with retrieved translation memory; the third section uses self-memory. The Transformer-dual and Transformer-joint rows in the Retrieval Memory and Self Memory sections use the same parameters, respectively, and only differ in the memory given as input. Significance over the baselines (p-value < 0.05) is tested following Koehn (2004).
For translation, following Xu et al. (2020); Cai et al. (2021); Cheng et al. (2022), we use a randomly initialized Transformer-base architecture (Vaswani et al., 2017) as Gξ. The evaluation metrics are BLEU, TER and chrF++ from SacreBLEU (Post, 2018). The backbone of the memory selector Sθ is XLM-R-base (Conneau et al., 2020) with BLEU as ∆(·,·). For summarization, we initialize Gξ with BART-base (Lewis et al., 2020b) for BigPatent, following Wang et al. (2022), and with the state-of-the-art BRIO (Liu et al., 2022) for XSum. The evaluation metric is ROUGE (R-1/2/L) (Lin, 2004). For dialogue generation, we use BART-base as the backbone for Gξ. We evaluate our dialogue system with BLEU (B-1/2) and Distinct (D-1/2) (Li et al., 2016). For both the dialogue and summarization tasks, we follow Liu and Liu (2021); Feng et al. (2022) and adopt RoBERTa-base (Liu et al., 2019) as the backbone for Sθ. We choose the linear combination of B-1/2 as ∆(·,·) for dialogue generation and of R-1/2/L for summarization, following Shen et al. (2022). For more details, we refer to Appendix B and Appendix C. Our code is open-sourced.1
1https://github.com/hannibal046/selfmemory
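As one possible instantiation of ∆(·,·), the sketch below assumes the sacrebleu and rouge_score packages; the paper specifies the metrics (BLEU, ROUGE) but not a particular implementation of the selection metric.

```python
# Hedged example: one way to compute the selection metric ∆(c_i, y).
import sacrebleu
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def delta_translation(candidate: str, reference: str) -> float:
    # Sentence-level BLEU, as used for the NMT selector.
    return sacrebleu.sentence_bleu(candidate, [reference]).score

def delta_summarization(candidate: str, reference: str) -> float:
    # Linear combination (here: mean) of ROUGE-1/2/L F1, as used for summarization.
    scores = _rouge.score(reference, candidate)
    return sum(s.fmeasure for s in scores.values()) / 3.0
```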
5 Experimental Results
5.1 Machine Translation
We choose four translation directions and exper-
iment on two generator architectures (joint and
dual as detailed in §3.2). The baselines consist
of two kinds of translation systems, one is the
vanilla sequence-to-sequence model (Bahdanau
et al.,2015;Vaswani et al.,2017) without memory
augmentation, and the other is retrieval-augmented
translation models focusing on memory encod-
ing (Gu et al.,2018;Xia et al.,2019), mem-
ory construction (Zhang et al.,2018), memory
retrieval (Cai et al.,2021) and memory diver-
sity (Cheng et al.,2022). Based on the experimen-
tal results shown in Table 2, Selfmem significantly boosts the performance of Gξ across four translation datasets and two different architectures, which is remarkable considering that the parameters of Gξ are kept fixed and the only changing variable is the memory given as input. This also complies with the primal problem: better memory typically prompts better generation results.
The dual problem is revealed in Table 3. The self-memory, which is in fact the model's own output, shares more similarity with the ground truth and therefore serves as better memory for producing the final generation. This also highlights one salient difference between Selfmem and previous candidate-reranking work (Lee et al., 2021; Ravaut et al., 2022a): reranking aims to select candidates of higher quality than the beam output, whereas in Selfmem the selected candidates serve as the memory for the retrieval-augmented generator and do not necessarily need to outperform the beam hypotheses. Since a higher BLEU score in this range (>50) cannot safely guarantee a better translation system (Callison-Burch et al., 2006), we also evaluate our system with TER and chrF++. Results in Table 4 show that Selfmem consistently outperforms the baselines on these two metrics as well.
           Retrieval              Self
           memory   hypothesis    memory   hypothesis
En→De      38.89    58.58         57.92    60.11
De→En      42.56    64.40         64.32    65.65
En→Es      40.67    64.12         63.57    65.94
Es→En      43.05    67.32         67.78    68.80
Table 3: Comparison between retrieval memory and self-memory. The quality of memory and hypothesis is measured by n-gram overlap with the reference (BLEU). All experiments are conducted with Transformer-joint on JRC-Acquis.
System              Memory     BLEU   chrF++  TER
Transformer         None       55.43  70.31   36.35
Transformer-dual    Retrieval  58.06  71.58   35.41
Transformer-joint   Retrieval  58.58  72.22   34.39
Transformer-dual    Self       59.49  72.62   34.04
Transformer-joint   Self       60.11  73.25   32.62
Table 4: Evaluation results on JRC-Acquis En→De measured by BLEU, TER and chrF++.
5.2 Summarization
We compare Selfmem with REINA (Wang et al., 2022), a retrieval-augmented framework, as well as PEGASUS (Zhang et al., 2020) and BART (Lewis et al., 2020b). The results are shown in Table 5. First, we observe that memory has different impacts on different datasets. The improvement brought by memory on BigPatent is much larger than on XSum, which can be attributed to the characteristics of the dataset itself: BigPatent is composed of official patent documents that are mutually similar, which greatly boosts summarization quality according to the primal problem. We also find that self-memory greatly improves the performance of BRIO (+1.2 R-1) and BART (+18.5 R-1) and achieves SOTA results on both datasets. We choose these baselines for a fair comparison, in that they share the same base generator. Due to space limitations, we include more comparisons and the confidence region of the SOTA model in Appendix D.

System        Memory     R-1    R-2    R-L
XSum
PEGASUS       None       47.2   24.6   39.3
BRIO          None       49.1   25.6   40.4
REINA (PG)    Retrieval  48.2   26.0   40.2
REINA (B)     Retrieval  43.2   21.0   35.5
REINA (L)     Retrieval  46.5   24.1   38.6
BRIO-dual     Retrieval  48.6   26.1   40.6
BRIO-joint    Retrieval  49.5   26.5   41.2
BRIO-dual     Self       49.2   26.2   40.8
BRIO-joint    Self       50.3   26.7   41.6
BigPatent
PEGASUS       None       53.6   33.2   43.2
BART          None       44.4   21.3   31.0
REINA (B)     Retrieval  59.5   42.6   50.6
REINA (L)     Retrieval  60.7   43.3   51.3
REINA (PG)    Retrieval  44.6   21.5   33.3
BART-dual     Retrieval  57.4   43.3   49.7
BART-joint    Retrieval  59.6   43.4   51.0
BART-dual     Self       61.2   44.6   52.3
BART-joint    Self       62.9   48.1   59.6
Table 5: Results of the summarization task on XSum and BigPatent measured by ROUGE. The dual and joint models with retrieved memory and with self-memory share the same parameters, respectively, and only differ in the memory given as input.
5.3 Dialogue Generation
As shown in Table 6, self-memory greatly boosts the performance of the retrieval-augmented generator on the dialogue generation task. By optimizing the memory with BLEU as ∆(·,·), self-memory improves the B-1/2 scores over retrieved memory by 3.08 B-1 and 0.6 B-2 with BART-joint. Interestingly, although Selfmem outperforms the baselines in terms of B-1/2, it falls short on D-1 and D-2, which can be explained by the trade-off between BLEU and Distinct scores when evaluating a dialogue system (Zheng et al., 2021). To overcome this problem, we choose D-1/2 as ∆(·,·) when optimizing Sθ, denoted as BART-joint (D), and the results in Table 6 show the great flexibility of Selfmem: by directly optimizing the memory, we obtain the desired attribute for diverse and informative response generation.
[Figure 3: (a) hypothesis BLEU over iterations 1-4 for Transformer-joint and Transformer-dual, with memory selected at ranks 1-6; (b) boxplot of candidate-pool BLEU over iterations 1-5.]
Figure 3: (a) shows generation quality during the iteration process with different Sθ for both generator architectures. (b) shows candidate quality during the iteration process with an oracle Sθ and Transformer-joint as Gξ.
System                 Memory     B-1    B-2    D-1   D-2
Vinyals and Le (2015)  None       33.60  26.80  3.00  12.80
Fang et al. (2019)     None       30.90  24.90  2.90  25.00
Bao et al. (2021)      None       34.80  25.12  3.54  25.11
Li et al. (2021b)      None       36.17  27.67  4.56  27.12
BART                   None       20.72  11.36  3.92  19.44
BART-dual              Retrieval  29.50  21.89  4.74  26.01
BART-joint             Retrieval  36.72  31.55  6.13  35.65
BART-dual              Self       33.43  22.85  4.66  26.16
BART-joint             Self       39.80  32.15  5.84  32.16
BART-joint (D)         Self       36.92  32.09  9.12  37.05
Table 6: Results of the dialogue generation task on DailyDialog measured by B-1/2 and D-1/2. The dual and joint models with retrieved memory and with self-memory share the same parameters, respectively, and only differ in the memory given as input. BART-joint (D) denotes that the metric ∆(·,·) for Sθ is the average of D-1 and D-2.
6 Further Analysis
To obtain a further understanding of Selfmem, we carefully investigate how the two key components, Gξ and Sθ, affect the final generation quality. We conduct experiments on the JRC-Acquis En→De dataset.
Tuning Gξ  As discussed in §3.1, we have already demonstrated that a trained retrieval-augmented generator with fixed parameters has learned how to distinguish between "good" and "bad" memory. This explains why we choose to fix the generator in our framework and suggests that Gξ is not the current system bottleneck of Selfmem.
Tuning Sθ  We experiment with different Sθ by directly selecting memory from the candidate pool based on its true ranking. As shown in Figure 3a, with both architectures the model can iteratively improve itself with its own output and surpass the current SOTA performance (60.11 BLEU) by a large margin. This inspires us to explore more powerful selection models for better performance in future work. We also investigate the quality of the candidate pool in this iterative process with an oracle Sθ; the result is shown in Figure 3b. A clear pattern in this boxplot is that the oracle, quartile, average and minimum scores of the candidate pool all improve. These two experiments explain the intuition behind Selfmem by combining the primal and dual problems: a trained retrieval-augmented generator benefits from better memory, which can be selected from its own unbounded output, and the generator with better memory in turn produces a better candidate pool for the next round of selection. In this iterative process, the model lifts itself up with its own output.
7 Conclusion
For the first time, we investigate the fundamental limitation of bounded memory in the current retrieval-augmented generation literature. Based on the key insight that the text most similar in distribution to the inference data is not the training data but the generation model's own unbounded output, we combine the primal and dual problems and propose Selfmem, a general framework for retrieval-augmented text generation that lifts the generation model up with its own output. There are two components in Selfmem: a retrieval-augmented generator and a memory selector. We conduct comprehensive experiments across various text generation tasks, including neural machine translation, abstractive summarization, and dialogue generation. We surpass all baselines and greatly improve the state-of-the-art performance on several datasets. We also carefully examine each key component in our framework and locate the current system bottleneck for future research.
Limitations
We discuss the limitations of our framework as
follows:
(1) Although Selfmem greatly improves generation quality compared with other retrieval-augmented generation models, it requires more computational resources for the memory selection process. For large datasets with long text (e.g., BigPatent), this becomes a more pressing problem given the quadratic time complexity of the Transformer architecture.
(2) This paper proposes a general idea for retrieval-augmented generation, but we only experiment with Transformer-based architectures for both the generator and the memory selector, and the architectures are kept the same across all text generation tasks. We believe that task-specific designs for the model architecture, training objective and generation method in different text generation scenarios would further improve performance.
References
Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava,
Xilun Chen, Luke Zettlemoyer, and Sonal Gupta.
2021. Muppet: Massive multi-task representations
with pre-finetuning. In Proceedings of the 2021
Conference on Empirical Methods in Natural Lan-
guage Processing, EMNLP 2021, Virtual Event /
Punta Cana, Dominican Republic, 7-11 November,
2021, pages 5799–5811. Association for Computa-
tional Linguistics.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-
gio. 2015. Neural machine translation by jointly
learning to align and translate. In 3rd Inter-
national Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
Conference Track Proceedings.
Siqi Bao, Huang He, Fan Wang, Hua Wu, and Haifeng
Wang. 2020. PLATO: pre-trained dialogue genera-
tion model with discrete latent variable. In Proceed-
ings of the 58th Annual Meeting of the Association
for Computational Linguistics, ACL 2020, Online,
July 5-10, 2020, pages 85–96. Association for Com-
putational Linguistics.
Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng
Wang, Wenquan Wu, Zhen Guo, Zhibin Liu, and
Xinchao Xu. 2021. PLATO-2: Towards building
an open-domain chatbot via curriculum learning. In
Findings of the Association for Computational Lin-
guistics: ACL-IJCNLP 2021, pages 2513–2525, On-
line. Association for Computational Linguistics.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen,
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam Mc-
Candlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot learn-
ers. In Advances in Neural Information Processing
Systems 33: Annual Conference on Neural Informa-
tion Processing Systems 2020, NeurIPS 2020, De-
cember 6-12, 2020, virtual.
Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xi-
aojiang Liu, Wai Lam, and Shuming Shi. 2019a.
Skeleton-to-response: Dialogue generation guided
by retrieval memory. In Proceedings of the 2019
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, NAACL-HLT 2019, Min-
neapolis, MN, USA, June 2-7, 2019, Volume 1 (Long
and Short Papers), pages 1219–1228. Association
for Computational Linguistics.
Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiao-
jiang Liu, and Shuming Shi. 2019b. Retrieval-
guided dialogue response generation via a matching-
to-generation framework. In Proceedings of the
2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International
Joint Conference on Natural Language Processing,
EMNLP-IJCNLP 2019, Hong Kong, China, Novem-
ber 3-7, 2019, pages 1866–1875. Association for
Computational Linguistics.
Deng Cai, Yan Wang, Huayang Li, Wai Lam, and
Lemao Liu. 2021. Neural machine translation with
monolingual translation memory. In Proceedings of
the 59th Annual Meeting of the Association for Com-
putational Linguistics and the 11th International
Joint Conference on Natural Language Processing,
ACL/IJCNLP 2021, (Volume 1: Long Papers), Vir-
tual Event, August 1-6, 2021, pages 7307–7318. As-
sociation for Computational Linguistics.
Chris Callison-Burch, Miles Osborne, and Philipp
Koehn. 2006. Re-evaluating the role of Bleu in ma-
chine translation research. In 11th Conference of
the European Chapter of the Association for Com-
putational Linguistics, pages 249–256, Trento, Italy.
Association for Computational Linguistics.
Qian Cao and Deyi Xiong. 2018. Encoding gated trans-
lation memory into neural machine translation. In
Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, Brussels,
Belgium, October 31 - November 4, 2018, pages
3042–3047. Association for Computational Linguis-
tics.
Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei.
2018. Retrieve, rerank and rewrite: Soft template
based neural summarization. In Proceedings of the
56th Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers),
pages 152–161, Melbourne, Australia. Association
for Computational Linguistics.
Eugene Charniak and Mark Johnson. 2005. Coarse-to-
fine n-best parsing and maxent discriminative rerank-
ing. In ACL 2005, 43rd Annual Meeting of the Asso-
ciation for Computational Linguistics, Proceedings
of the Conference, 25-30 June 2005, University of
Michigan, USA, pages 173–180. The Association for
Computer Linguistics.
Wei Chen, Yeyun Gong, Song Wang, Bolun Yao,
Weizhen Qi, Zhongyu Wei, Xiaowu Hu, Bartuer
Zhou, Yi Mao, Weizhu Chen, Biao Cheng, and Nan
Duan. 2022a. Dialogved: A pre-trained latent vari-
able encoder-decoder model for dialog response gen-
eration. In Proceedings of the 60th Annual Meet-
ing of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2022, Dublin, Ire-
land, May 22-27, 2022, pages 4852–4864. Associa-
tion for Computational Linguistics.
Xiuying Chen, Mingzhe Li, Xin Gao, and Xiangliang
Zhang. 2022b. Towards improving faithfulness in
abstractive summarization. In Advances in Neural
Information Processing Systems.
Xin Cheng, Shen Gao, Lemao Liu, Dongyan Zhao,
and Rui Yan. 2022. Neural machine transla-
tion with contrastive translation memories.CoRR,
abs/2212.03140.
Michael Collins and Terry Koo. 2005. Discrimina-
tive reranking for natural language parsing.Comput.
Linguistics, 31(1):25–70.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,
Vishrav Chaudhary, Guillaume Wenzek, Francisco
Guzmán, Edouard Grave, Myle Ott, Luke Zettle-
moyer, and Veselin Stoyanov. 2020. Unsupervised
cross-lingual representation learning at scale. In
Proceedings of the 58th Annual Meeting of the As-
sociation for Computational Linguistics, ACL 2020,
Online, July 5-10, 2020, pages 8440–8451. Associa-
tion for Computational Linguistics.
Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam,
and Marc’Aurelio Ranzato. 2020. Residual energy-
based models for text generation. In 8th Inter-
national Conference on Learning Representations,
ICLR 2020, Addis Ababa, Ethiopia, April 26-30,
2020. OpenReview.net.
Sergey Edunov, Myle Ott, Michael Auli, and David
Grangier. 2018. Understanding back-translation at
scale. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing,
Brussels, Belgium, October 31 - November 4, 2018,
pages 489–500. Association for Computational Lin-
guistics.
Le Fang, Chunyuan Li, Jianfeng Gao, Wen Dong, and
Changyou Chen. 2019. Implicit deep latent vari-
able models for text generation. In Proceedings of
the 2019 Conference on Empirical Methods in Nat-
ural Language Processing and the 9th International
Joint Conference on Natural Language Processing,
EMNLP-IJCNLP 2019, Hong Kong, China, Novem-
ber 3-7, 2019, pages 3944–3954. Association for
Computational Linguistics.
Jiazhan Feng, Chongyang Tao, Zhen Li, Chang Liu,
Tao Shen, and Dongyan Zhao. 2022. Reciprocal
learning of knowledge retriever and response ranker
for knowledge-grounded conversations. In Proceed-
ings of the 29th International Conference on Compu-
tational Linguistics, COLING 2022, Gyeongju, Re-
public of Korea, October 12-17, 2022, pages 389–
399. International Committee on Computational Lin-
guistics.
Tingchen Fu, Xueliang Zhao, Chongyang Tao, Ji-Rong
Wen, and Rui Yan. 2022. There are a thousand
hamlets in a thousand people’s eyes: Enhancing
knowledge-grounded dialogue with personal mem-
ory. In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), ACL 2022, Dublin, Ireland,
May 22-27, 2022, pages 3901–3913. Association for
Computational Linguistics.
Edouard Grave, Armand Joulin, and Nicolas Usunier.
2017. Improving neural language models with a
continuous cache. In 5th International Conference
on Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track Pro-
ceedings. OpenReview.net.
Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor
O. K. Li. 2018. Search engine guided neural ma-
chine translation. In Proceedings of the Thirty-
Second AAAI Conference on Artificial Intelligence,
(AAAI-18), the 30th innovative Applications of Arti-
ficial Intelligence (IAAI-18), and the 8th AAAI Sym-
posium on Educational Advances in Artificial Intel-
ligence (EAAI-18), New Orleans, Louisiana, USA,
February 2-7, 2018, pages 5133–5140. AAAI Press.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pa-
supat, and Ming-Wei Chang. 2020. Retrieval aug-
mented language model pre-training. In Proceed-
ings of the 37th International Conference on Ma-
chine Learning, ICML 2020, 13-18 July 2020, Vir-
tual Event, volume 119 of Proceedings of Machine
Learning Research, pages 3929–3938. PMLR.
Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren,
and Percy Liang. 2018. A retrieve-and-edit frame-
work for predicting structured outputs. In Advances
in Neural Information Processing Systems 31: An-
nual Conference on Neural Information Processing
Systems 2018, NeurIPS 2018, December 3-8, 2018,
Montréal, Canada, pages 10073–10083.
Qiuxiang He, Guoping Huang, Qu Cui, Li Li, and
Lemao Liu. 2021. Fast and accurate neural machine
translation with translation memory. In Proceed-
ings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th Interna-
tional Joint Conference on Natural Language Pro-
cessing, ACL/IJCNLP 2021, (Volume 1: Long Pa-
pers), Virtual Event, August 1-6, 2021, pages 3170–
3180. Association for Computational Linguistics.
Kenji Imamura and Eiichiro Sumita. 2017. Ensemble
and reranking: Using multiple models in the NICT-2
neural machine translation system at WAT2017. In
Proceedings of the 4th Workshop on Asian Transla-
tion, WAT@IJCNLP 2017, Taipei, Taiwan, Novem-
ber 27- December 1, 2017, pages 127–134. Asian
Federation of Natural Language Processing.
Amirhossein Kazemnejad, Mohammadreza Salehi, and
Mahdieh Soleymani Baghshah. 2020. Paraphrase
generation by learning how to edit from samples. In
Proceedings of the 58th Annual Meeting of the As-
sociation for Computational Linguistics, ACL 2020,
Online, July 5-10, 2020, pages 6010–6021. Associa-
tion for Computational Linguistics.
Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke
Zettlemoyer, and Mike Lewis. 2021. Nearest neigh-
bor machine translation. In 9th International Con-
ference on Learning Representations, ICLR 2021,
Virtual Event, Austria, May 3-7, 2021. OpenRe-
view.net.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke
Zettlemoyer, and Mike Lewis. 2020. Generalization
through memorization: Nearest neighbor language
models. In International Conference on Learning
Representations.
Philipp Koehn. 2004. Statistical significance tests for
machine translation evaluation. In Proceedings of
the 2004 Conference on Empirical Methods in Nat-
ural Language Processing , EMNLP 2004, A meet-
ing of SIGDAT, a Special Interest Group of the ACL,
held in conjunction with ACL 2004, 25-26 July 2004,
Barcelona, Spain, pages 388–395. ACL.
Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer,
James Bradbury, Ishaan Gulrajani, Victor Zhong,
Romain Paulus, and Richard Socher. 2016. Ask
me anything: Dynamic memory networks for nat-
ural language processing. In Proceedings of The
33rd International Conference on Machine Learn-
ing, volume 48 of Proceedings of Machine Learning
Research, pages 1378–1387, New York, New York,
USA. PMLR.
Ann Lee, Michael Auli, and Marc’Aurelio Ranzato.
2021. Discriminative reranking for neural machine
translation. In Proceedings of the 59th Annual Meet-
ing of the Association for Computational Linguistics
and the 11th International Joint Conference on Nat-
ural Language Processing, ACL/IJCNLP 2021, (Vol-
ume 1: Long Papers), Virtual Event, August 1-6,
2021, pages 7250–7264. Association for Computa-
tional Linguistics.
Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Ar-
men Aghajanyan, Sida Wang, and Luke Zettlemoyer.
2020a. Pre-training via paraphrasing. In Advances
in Neural Information Processing Systems 33: An-
nual Conference on Neural Information Processing
Systems 2020, NeurIPS 2020, December 6-12, 2020,
virtual.
Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
jan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer.
2020b. BART: denoising sequence-to-sequence pre-
training for natural language generation, translation,
and comprehension. In Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, ACL 2020, Online, July 5-10, 2020,
pages 7871–7880. Association for Computational
Linguistics.
Patrick S. H. Lewis, Ethan Perez, Aleksandra Pik-
tus, Fabio Petroni, Vladimir Karpukhin, Naman
Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih,
Tim Rocktäschel, Sebastian Riedel, and Douwe
Kiela. 2020c. Retrieval-augmented generation for
knowledge-intensive NLP tasks. In Advances in
Neural Information Processing Systems 33: Annual
Conference on Neural Information Processing Sys-
tems 2020, NeurIPS 2020, December 6-12, 2020,
virtual.
Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and
Lemao Liu. 2022. A survey on retrieval-augmented
text generation.CoRR, abs/2202.01110.
Jinpeng Li, Yingce Xia, Rui Yan, Hongda Sun,
Dongyan Zhao, and Tie-Yan Liu. 2021a. Stylized di-
alogue generation with multi-pass dual learning. In
Advances in Neural Information Processing Systems
34: Annual Conference on Neural Information Pro-
cessing Systems 2021, NeurIPS 2021, December 6-
14, 2021, virtual, pages 28470–28481.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao,
and Bill Dolan. 2016. A diversity-promoting ob-
jective function for neural conversation models. In
NAACL HLT 2016, The 2016 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, San Diego California, USA, June 12-17, 2016,
pages 110–119. The Association for Computational
Linguistics.
Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang
Cao, and Shuzi Niu. 2017. Dailydialog: A manu-
ally labelled multi-turn dialogue dataset. In Proceed-
ings of the Eighth International Joint Conference on
Natural Language Processing, IJCNLP 2017, Taipei,
Taiwan, November 27 - December 1, 2017 - Volume
1: Long Papers, pages 986–995. Asian Federation of
Natural Language Processing.
Zekang Li, Jinchao Zhang, Zhengcong Fei, Yang Feng,
and Jie Zhou. 2021b. Conversations are not flat:
Modeling the dynamic information flow across di-
alogue utterances. In Proceedings of the 59th An-
nual Meeting of the Association for Computational
Linguistics and the 11th International Joint Confer-
ence on Natural Language Processing (Volume 1:
Long Papers), pages 128–138, Online. Association
for Computational Linguistics.
Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Text Summariza-
tion Branches Out, pages 74–81, Barcelona, Spain.
Association for Computational Linguistics.
Lemao Liu, Hailong Cao, Taro Watanabe, Tiejun Zhao,
Mo Yu, and Conghui Zhu. 2012. Locally training
the log-linear model for SMT. In Proceedings of
the 2012 Joint Conference on Empirical Methods
in Natural Language Processing and Computational
Natural Language Learning, EMNLP-CoNLL 2012,
July 12-14, 2012, Jeju Island, Korea, pages 402–411.
ACL.
Yang Liu and Mirella Lapata. 2019. Text summariza-
tion with pretrained encoders. In Proceedings of
the 2019 Conference on Empirical Methods in Nat-
ural Language Processing and the 9th International
Joint Conference on Natural Language Processing,
EMNLP-IJCNLP 2019, Hong Kong, China, Novem-
ber 3-7, 2019, pages 3728–3738. Association for
Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized BERT pretraining ap-
proach.CoRR, abs/1907.11692.
Yixin Liu, Zi-Yi Dou, and Pengfei Liu. 2021. Refsum:
Refactoring neural summarization. In Proceedings
of the 2021 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT 2021,
Online, June 6-11, 2021, pages 1437–1448. Associ-
ation for Computational Linguistics.
Yixin Liu and Pengfei Liu. 2021. Simcls: A sim-
ple framework for contrastive learning of abstrac-
tive summarization. In Proceedings of the 59th An-
nual Meeting of the Association for Computational
Linguistics and the 11th International Joint Con-
ference on Natural Language Processing, ACL/IJC-
NLP 2021, (Volume 2: Short Papers), Virtual Event,
August 1-6, 2021, pages 1065–1072. Association for
Computational Linguistics.
Yixin Liu, Pengfei Liu, Dragomir R. Radev, and Gra-
ham Neubig. 2022. BRIO: bringing order to abstrac-
tive summarization. In Proceedings of the 60th An-
nual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), ACL 2022,
Dublin, Ireland, May 22-27, 2022, pages 2890–2903.
Association for Computational Linguistics.
Yuchen Liu, Long Zhou, Yining Wang, Yang Zhao, Ji-
ajun Zhang, and Chengqing Zong. 2018. A com-
parable study on model averaging, ensembling and
reranking in NMT. In Natural Language Process-
ing and Chinese Computing - 7th CCF International
Conference, NLPCC 2018, Hohhot, China, August
26-30, 2018, Proceedings, Part II, volume 11109 of
Lecture Notes in Computer Science, pages 299–308.
Springer.
Clara Meister, Ryan Cotterell, and Tim Vieira. 2020.
If beam search is the answer, what was the question?
In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing, EMNLP
2020, Online, November 16-20, 2020, pages 2173–
2185. Association for Computational Linguistics.
Shashi Narayan, Shay B. Cohen, and Mirella Lapata.
2018. Don’t give me the details, just the summary!
topic-aware convolutional neural networks for ex-
treme summarization. In Proceedings of the 2018
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 1797–1807, Brussels, Bel-
gium. Association for Computational Linguistics.
Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur,
Anoop Sarkar, Kenji Yamada, Alexander M. Fraser,
Shankar Kumar, Libin Shen, David Smith, Kather-
ine Eng, Viren Jain, Zhen Jin, and Dragomir R.
Radev. 2004. A smorgasbord of features for sta-
tistical machine translation. In Human Language
Technology Conference of the North American Chap-
ter of the Association for Computational Linguistics,
HLT-NAACL 2004, Boston, Massachusetts, USA,
May 2-7, 2004, pages 161–168. The Association for
Computational Linguistics.
Md. Rizwan Parvez, Wasi Uddin Ahmad, Saikat
Chakraborty, Baishakhi Ray, and Kai-Wei Chang.
2021. Retrieval augmented code generation and
summarization. In Findings of the Association for
Computational Linguistics: EMNLP 2021, Virtual
Event / Punta Cana, Dominican Republic, 16-20
November, 2021, pages 2719–2734. Association for
Computational Linguistics.
Hao Peng, Ankur P. Parikh, Manaal Faruqui, Bhuwan
Dhingra, and Dipanjan Das. 2019. Text generation
with exemplar-based adaptive decoding. In Proceed-
ings of the 2019 Conference of the North American
Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, NAACL-
HLT 2019, Minneapolis, MN, USA, June 2-7, 2019,
Volume 1 (Long and Short Papers), pages 2555–
2565. Association for Computational Linguistics.
Jonathan Pilault, Raymond Li, Sandeep Subramanian,
and Chris Pal. 2020. On extractive and abstractive
neural document summarization with transformer
language models. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language
Processing, EMNLP 2020, Online, November 16-20,
2020, pages 9308–9319. Association for Computa-
tional Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU
scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers, WMT 2018,
Belgium, Brussels, October 31 - November 1, 2018,
pages 186–191. Association for Computational Lin-
guistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J. Liu. 2020. Exploring the limits
of transfer learning with a unified text-to-text trans-
former.J. Mach. Learn. Res., 21:140:1–140:67.
Mathieu Ravaut, Shafiq R. Joty, and Nancy F. Chen.
2022a. Summareranker: A multi-task mixture-of-
experts re-ranking framework for abstractive sum-
marization. In Proceedings of the 60th Annual Meet-
ing of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2022, Dublin, Ire-
land, May 22-27, 2022, pages 4504–4524. Associa-
tion for Computational Linguistics.
Mathieu Ravaut, Shafiq R. Joty, and Nancy F. Chen.
2022b. Towards summary candidates fusion.CoRR,
abs/2210.08779.
Stephen E. Robertson and Hugo Zaragoza. 2009. The
probabilistic relevance framework: BM25 and be-
yond.Found. Trends Inf. Retr., 3(4):333–389.
Julian Salazar, Davis Liang, Toan Q. Nguyen, and Ka-
trin Kirchhoff. 2020. Masked language model scor-
ing. In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, ACL
2020, Online, July 5-10, 2020, pages 2699–2712.
Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural machine translation of rare words with
subword units. In Proceedings of the 54th Annual
Meeting of the Association for Computational Lin-
guistics, ACL 2016, August 7-12, 2016, Berlin, Ger-
many, Volume 1: Long Papers. The Association for
Computer Linguistics.
Eva Sharma, Chen Li, and Lu Wang. 2019. BIG-
PATENT: A large-scale dataset for abstractive and
coherent summarization. In Proceedings of the 57th
Conference of the Association for Computational
Linguistics, ACL 2019, Florence, Italy, July 28- Au-
gust 2, 2019, Volume 1: Long Papers, pages 2204–
2213. Association for Computational Linguistics.
Noam Shazeer and Mitchell Stern. 2018. Adafac-
tor: Adaptive learning rates with sublinear memory
cost. In Proceedings of the 35th International Con-
ference on Machine Learning, ICML 2018, Stock-
holmsmässan, Stockholm, Sweden, July 10-15, 2018,
volume 80 of Proceedings of Machine Learning Re-
search, pages 4603–4611. PMLR.
Libin Shen, Anoop Sarkar, and Franz Josef Och. 2004.
Discriminative reranking for machine translation.
In Human Language Technology Conference of the
North American Chapter of the Association for Com-
putational Linguistics, HLT-NAACL 2004, Boston,
Massachusetts, USA, May 2-7, 2004, pages 177–184.
The Association for Computational Linguistics.
Weizhou Shen, Yeyun Gong, Yelong Shen, Song Wang,
Xiaojun Quan, Nan Duan, and Weizhu Chen. 2022.
Joint generator-ranker learning for natural language
generation.CoRR, abs/2206.13974.
Michel Simard and Pierre Isabelle. 2009. Phrase-based
machine translation in a computer-assisted transla-
tion environment. In Proceedings of Machine Trans-
lation Summit XII: Papers, MTSummit 2009, Ottawa,
Canada, August 26-30, 2009.
Yiping Song, Rui Yan, Xiang Li, Dongyan Zhao, and
Ming Zhang. 2016. Two are better than one: An
ensemble of retrieval- and generation-based dialog
systems.CoRR, abs/1610.07149.
Ralf Steinberger, Bruno Pouliquen, Anna Widiger,
Camelia Ignat, Tomaz Erjavec, Dan Tufis, and
Dániel Varga. 2006. The JRC-Acquis: A multilin-
gual aligned parallel corpus with 20+ languages.
In Proceedings of the Fifth International Confer-
ence on Language Resources and Evaluation, LREC
2006, Genoa, Italy, May 22-28, 2006, pages 2142–
2147. European Language Resources Association
(ELRA).
Yixuan Su, David Vandyke, Simon Baker, Yan Wang,
and Nigel Collier. 2021. Keep the primary, rewrite
the secondary: A two-stage approach for paraphrase
generation. In Findings of the Association for Com-
putational Linguistics: ACL-IJCNLP 2021, pages
560–569, Online. Association for Computational
Linguistics.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.
Sequence to sequence learning with neural networks.
In Advances in Neural Information Processing Sys-
tems 27: Annual Conference on Neural Informa-
tion Processing Systems 2014, December 8-13 2014,
Montreal, Quebec, Canada, pages 3104–3112.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Pro-
cessing Systems 30: Annual Conference on Neural
Information Processing Systems 2017, December 4-
9, 2017, Long Beach, CA, USA, pages 5998–6008.
Oriol Vinyals and Quoc V. Le. 2015. A neural conver-
sational model. CoRR, abs/1506.05869.
Shuohang Wang, Yichong Xu, Yuwei Fang, Yang
Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, and
Michael Zeng. 2022. Training data is more valu-
able than you think: A simple and effective method
by retrieving from training data. In Proceedings
of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
ACL 2022, Dublin, Ireland, May 22-27, 2022, pages
3170–3179. Association for Computational Linguis-
tics.
Yuguang Wang, Shanbo Cheng, Liyang Jiang, Jiajun
Yang, Wei Chen, Muze Li, Lin Shi, Yanfeng Wang,
and Hongtao Yang. 2017. Sogou neural machine
translation systems for WMT17. In Proceedings
of the Second Conference on Machine Translation,
WMT 2017, Copenhagen, Denmark, September 7-
8, 2017, pages 410–415. Association for Computa-
tional Linguistics.
Jason Weston, Emily Dinan, and Alexander H. Miller.
2018. Retrieve and refine: Improved sequence gen-
eration models for dialogue. In Proceedings of
the 2nd International Workshop on Search-Oriented
Conversational AI, SCAI@EMNLP 2018, Brussels,
Belgium, October 31, 2018, pages 87–92. Associa-
tion for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow-
icz, Joe Davison, Sam Shleifer, Patrick von Platen,
Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,
Teven Le Scao, Sylvain Gugger, Mariama Drame,
Quentin Lhoest, and Alexander M. Rush. 2020.
Transformers: State-of-the-art natural language pro-
cessing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing:
System Demonstrations, pages 38–45, Online. Asso-
ciation for Computational Linguistics.
Wenhao Wu, Wei Li, Xinyan Xiao, Jiachen Liu,
Ziqiang Cao, Sujian Li, Hua Wu, and Haifeng Wang.
2021. BASS: boosting abstractive summarization
with unified semantic graph. In Proceedings of the
59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing, ACL/I-
JCNLP 2021, (Volume 1: Long Papers), Virtual
Event, August 1-6, 2021, pages 6052–6067. Associ-
ation for Computational Linguistics.
Yu Wu, Furu Wei, Shaohan Huang, Yunli Wang, Zhou-
jun Li, and Ming Zhou. 2019. Response gener-
ation by context-aware prototype editing. In The
Thirty-Third AAAI Conference on Artificial Intelli-
gence, AAAI 2019, The Thirty-First Innovative Ap-
plications of Artificial Intelligence Conference, IAAI
2019, The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence, EAAI 2019, Hon-
olulu, Hawaii, USA, January 27 - February 1, 2019,
pages 7281–7288. AAAI Press.
Mengzhou Xia, Guoping Huang, Lemao Liu, and
Shuming Shi. 2019. Graph based translation mem-
ory for neural machine translation. In The Thirty-
Third AAAI Conference on Artificial Intelligence,
AAAI 2019, The Thirty-First Innovative Applications
of Artificial Intelligence Conference, IAAI 2019,
The Ninth AAAI Symposium on Educational Ad-
vances in Artificial Intelligence, EAAI 2019, Hon-
olulu, Hawaii, USA, January 27 - February 1, 2019,
pages 7297–7304. AAAI Press.
Jiahao Xu, Yubin Ruan, Wei Bi, Guoping Huang,
Shuming Shi, Lihui Chen, and Lemao Liu. 2022. On
synthetic data for back translation. In Proceedings
of the 2022 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, NAACL 2022, Seat-
tle, WA, United States, July 10-15, 2022, pages 419–
430. Association for Computational Linguistics.
Jitao Xu, Josep Maria Crego, and Jean Senellart. 2020.
Boosting neural machine translation with similar
translations. In Proceedings of the 58th Annual
Meeting of the Association for Computational Lin-
guistics, ACL 2020, Online, July 5-10, 2020, pages
1580–1590. Association for Computational Linguis-
tics.
Masaru Yamada. 2011. The effect of translation mem-
ory databases on productivity. Translation research
projects, 3:63–73.
Dani Yogatama, Cyprien de Masson d’Autume, and
Lingpeng Kong. 2021. Adaptive semiparametric
language models. Trans. Assoc. Comput. Linguis-
tics, 9:362–373.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava
Dubey, Joshua Ainslie, Chris Alberti, Santiago On-
tanon, Philip Pham, Anirudh Ravula, Qifan Wang,
Li Yang, and Amr Ahmed. 2020. Big bird: Trans-
formers for longer sequences. In Advances in
Neural Information Processing Systems, volume 33,
pages 17283–17297. Curran Associates, Inc.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and
Peter J. Liu. 2020. PEGASUS: pre-training with
extracted gap-sentences for abstractive summariza-
tion. In Proceedings of the 37th International Con-
ference on Machine Learning, ICML 2020, 13-18
July 2020, Virtual Event, volume 119 of Proceedings
of Machine Learning Research, pages 11328–11339.
PMLR.
Jingyi Zhang, Masao Utiyama, Eiichiro Sumita, Gra-
ham Neubig, and Satoshi Nakamura. 2018. Guiding
neural machine translation with retrieved translation
pieces. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, NAACL-HLT 2018, New Orleans, Louisiana,
USA, June 1-6, 2018, Volume 1 (Long Papers), pages
1325–1335. Association for Computational Linguis-
tics.
Xueliang Zhao, Lemao Liu, Tingchen Fu, Shuming Shi,
Dongyan Zhao, and Rui Yan. 2022. Towards effi-
cient dialogue pre-training with transferable and in-
terpretable latent structure. CoRR, abs/2210.12461.
Yinhe Zheng, Zikai Chen, Rongsheng Zhang, Shilei
Huang, Xiaoxi Mao, and Minlie Huang. 2021. Styl-
ized dialogue response generation using stylized un-
paired texts. In Thirty-Fifth AAAI Conference on Ar-
tificial Intelligence, AAAI 2021, Thirty-Third Confer-
ence on Innovative Applications of Artificial Intelli-
gence, IAAI 2021, The Eleventh Symposium on Ed-
ucational Advances in Artificial Intelligence, EAAI
2021, Virtual Event, February 2-9, 2021, pages
14558–14567. AAAI Press.
Zexuan Zhong, Tao Lei, and Danqi Chen. 2022. Train-
ing language models with memory augmentation.
CoRR, abs/2205.12674.
A Dataset Details
Table 7 reports detailed statistics for the datasets used in this paper.
Task            Dataset                 #Train     #Dev     #Test
Translation     JRC-Acquis (En↔De)     663,487    2,454     2,483
                JRC-Acquis (En↔Es)     653,127    2,533     2,596
Summarization   BigPatent            1,207,222   67,068    67,072
                XSum                   204,045   11,332    11,334
Dialogue        DailyDialog             87,170    8,069     7,740

Table 7: Dataset statistics for the three tasks.
B Evaluation Details
Machine Translation. We evaluate our MT system with BLEU, TER, and chrF++ from SacreBLEU² (Post, 2018). The signatures for BLEU, TER, and chrF++ are shown in Table 8.
Metric    Signature
BLEU      nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
TER       nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0
chrF++    nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0

Table 8: Metric signatures from SacreBLEU.
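For reproducibility, a minimal Python sketch of this evaluation step is given below; the hypothesis/reference strings are placeholders, and we assume that the default settings of the sacrebleu library (version 2.x) reproduce the signatures in Table 8.

```python
# Minimal sketch (placeholder data): scoring system outputs against a single
# reference stream with SacreBLEU's BLEU, TER, and chrF++ metrics.
from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = ["the cat sat on the mat"]            # system outputs (placeholders)
references = [["the cat is sitting on the mat"]]   # one reference per hypothesis

bleu = BLEU()                  # default 13a tokenizer, exponential smoothing
ter = TER()                    # tercom-style tokenization
chrf = CHRF(word_order=2)      # word_order=2 gives chrF++ (nw:2)

for metric in (bleu, ter, chrf):
    score = metric.corpus_score(hypotheses, references)
    print(score, "|", metric.get_signature())
```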
Summarization. We evaluate our summarization system with the standard ROUGE (Lin, 2004) Perl package³. Following Liu et al. (2022), we use the PTB tokenizer⁴ for tokenization, and the ROUGE parameters are "-c 95 -r 1000 -n 2 -m".
Dialogue Generation. Following Fu et al. (2022), we evaluate our dialogue system with NLTK BLEU⁵, using whitespace tokenization and smoothing method 1. The Distinct score implementation is from Li et al. (2021a).
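The snippet below is a minimal sketch of this protocol, assuming whitespace tokenization and NLTK's smoothing method 1; the Distinct-n formulation shown (unique n-grams over total n-grams) is a common variant and may differ in minor details from the script of Li et al. (2021a).

```python
# Minimal sketch (placeholder data): corpus-level B-1/2 with whitespace
# tokenization and NLTK smoothing method 1, plus a common Distinct-n variant.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

hyps = ["nice to meet you too"]           # generated responses (placeholders)
refs = [["nice to meet you as well"]]     # list of references per response

hyp_tokens = [h.split() for h in hyps]                    # space tokenizer
ref_tokens = [[r.split() for r in rs] for rs in refs]

smooth = SmoothingFunction().method1
b1 = corpus_bleu(ref_tokens, hyp_tokens, weights=(1.0,), smoothing_function=smooth)
b2 = corpus_bleu(ref_tokens, hyp_tokens, weights=(0.5, 0.5), smoothing_function=smooth)

def distinct_n(sentences, n):
    """Ratio of unique n-grams to all n-grams over the generated responses."""
    ngrams = [tuple(toks[i:i + n]) for toks in sentences
              for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

print(f"B-1 {b1:.4f}  B-2 {b2:.4f}  "
      f"D-1 {distinct_n(hyp_tokens, 1):.4f}  D-2 {distinct_n(hyp_tokens, 2):.4f}")
```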
C Self Memory Details
For machine translation tasks, following Xu et al. (2020); Cai et al. (2021); Cheng et al. (2022), we use a randomly initialized Transformer_base architecture (Vaswani et al., 2017) as Gξ. We use the joint-BPE algorithm (Sennrich et al., 2016) and share parameters between the memory encoder and the source encoder in the dual-encoder architecture.
²https://github.com/mjpost/sacrebleu.git
³https://github.com/summanlp/evaluation/tree/master/ROUGE-RELEASE-1.5.5
⁴https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html
⁵https://www.nltk.org/_modules/nltk/translate/bleu_score.html
The hyper-parameter setting follows Cheng et al. (2022): dropout 0.1, label smoothing 0.1, gradient clipping 1.0, the Adafactor optimizer (Shazeer and Stern, 2018), 4,000 warm-up steps, a maximum learning rate of 4.4e-2, and 30 training epochs in total. The evaluation metrics are BLEU, TER, and chrF++ from SacreBLEU (Post, 2018). The backbone of the memory selector Sθ is XLM-R_base (Conneau et al., 2020) with BLEU as Δ(·,·). The hyper-parameter setting for Sθ follows Lee et al. (2021): τ = 0.5, min-max normalization for candidate ranking, the Adam optimizer with a maximum learning rate of 5e-5 and a polynomial decay scheduler, and classifier dropout 0.2.
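To make the candidate-ranking setup concrete, the sketch below shows one plausible way of turning per-candidate Δ(·,·) scores (e.g., sentence-level BLEU against the reference) into soft ranking targets with min-max normalization and temperature τ = 0.5; the exact objective follows Lee et al. (2021), so this should be read as an illustrative assumption rather than the precise training recipe.

```python
# Illustrative sketch only (not necessarily the exact recipe of Lee et al., 2021):
# turn per-candidate Δ scores into soft ranking targets for the memory selector
# using min-max normalization and a temperature-scaled softmax.
import math

def candidate_targets(delta_scores, tau=0.5):
    """Min-max normalize the Δ scores of one candidate pool, then apply a
    softmax with temperature tau to obtain a target distribution."""
    lo, hi = min(delta_scores), max(delta_scores)
    span = (hi - lo) or 1.0                        # guard against identical scores
    normed = [(s - lo) / span for s in delta_scores]
    exps = [math.exp(v / tau) for v in normed]
    z = sum(exps)
    return [e / z for e in exps]

# e.g. sentence-level BLEU of each generated candidate against the reference
print(candidate_targets([28.4, 35.1, 31.0, 22.7], tau=0.5))
```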
For Summarization, we initialize Gξ with BART_base (Lewis et al., 2020b) for BigPatent, following Wang et al. (2022), and with the state-of-the-art BRIO (Liu et al., 2022) for XSum. Optimization uses Adafactor with a maximum learning rate of 5e-3, 10,000 warm-up steps, and a gradient clipping value of 1.0. The maximum input length is 512 for XSum and 1024 for BigPatent. The evaluation metric is ROUGE (R-1/2/L) (Lin, 2004).
For Dialogue Generation, we use BART_base as the backbone for Gξ on DailyDialog. We tune the learning rate over {5e-3, 1e-3, 4e-4} and set dropout 0.1, batch size 64, label smoothing factor 0.1, and a maximum input length of 120 for DailyDialog. Following Bao et al. (2020); Chen et al. (2022a), we evaluate our dialogue system with BLEU (B-1/2) and Distinct (D-1/2) (Li et al., 2016). For both the Summarization and Dialogue Generation tasks, we follow Liu and Liu (2021); Feng et al. (2022) and adopt RoBERTa_base (Liu et al., 2019) as the backbone for Sθ. Following Shen et al. (2022), we choose a linear combination of B-1/2 as Δ(·,·) for Dialogue Generation and of R-1/2/L for Summarization. We tune τ over {0.08, 0.2, 0.5, 0.8} and the learning rate over {5e-5, 7e-5, 2e-4}. The maximum input length for Sθ is 512; when the concatenated source and candidate exceed this limit, we truncate tokens from the longer of the two.
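As an illustration of how Δ(·,·) can be instantiated for these tasks, the sketch below scores a candidate summary against the reference with an equal-weight linear combination of R-1/2/L F1; the library choice (rouge-score) and the equal weighting are assumptions for illustration only, not a statement of our exact implementation.

```python
# Illustrative sketch: an equal-weight linear combination of ROUGE-1/2/L F1
# as Δ(·,·) for summarization, computed with the rouge-score package.
# The library choice and equal weights are assumptions for illustration.
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def delta_summarization(candidate: str, reference: str) -> float:
    scores = _scorer.score(reference, candidate)        # score(target, prediction)
    return sum(s.fmeasure for s in scores.values()) / 3.0

print(delta_summarization("the cat sat on the mat",
                          "a cat was sitting on the mat"))
```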
D More Baselines
In Table 9, we include more baselines on the benchmark datasets XSum and BigPatent. We also report the 95% confidence intervals of the state-of-the-art models on XSum and BigPatent, as shown in Table 10.
System                        R-1    R-2    R-L

XSum
Liu and Lapata (2019)         38.8   16.5   31.3
Lewis et al. (2020b)          45.1   22.3   37.3
Zhang et al. (2020)           47.2   24.6   39.3
Liu and Liu (2021)            47.6   24.6   39.4
Liu et al. (2022)             49.1   25.6   40.4
Wang et al. (2022) (PG)       48.2   26.0   40.2
Wang et al. (2022) (B)        43.1   21.0   35.5
Wang et al. (2022) (L)        46.5   24.1   38.6
Ravaut et al. (2022a)         48.1   25.0   40.0
Ravaut et al. (2022b)         47.1   24.1   38.8
Chen et al. (2022b)           47.8   25.0   39.7
Selfmem                       50.3   26.7   41.6

BigPatent
Zhang et al. (2020)           53.6   33.1   42.3
Lewis et al. (2020b)          44.4   21.3   31.0
Zaheer et al. (2020)          60.6   42.5   50.0
Pilault et al. (2020)         38.7   12.3   34.1
Wu et al. (2021)              45.0   20.3   39.2
Aghajanyan et al. (2021)      52.3   33.5   42.8
Wang et al. (2022) (B)        59.5   42.6   50.6
Wang et al. (2022) (L)        60.7   43.3   51.3
Wang et al. (2022) (PG)       44.6   21.5   33.3
Selfmem                       62.9   48.1   59.6

Table 9: More baselines on XSum and BigPatent.
System        ROUGE-1/2/L   95% conf. int.

XSum
BRIO_joint    50.3          0.49986 - 0.50602
              26.7          0.26300 - 0.26989
              41.6          0.41231 - 0.41900

BigPatent
BART_joint    62.9          0.62664 - 0.63080
              48.1          0.47783 - 0.48333
              59.6          0.59401 - 0.59847

Table 10: 95% confidence intervals for the state-of-the-art models on XSum and BigPatent.