Abstract

With direct access to human-written references as memory, retrieval-augmented generation has achieved much progress in a wide range of text generation tasks. Since better memory would typically prompt better generation (we define this as the primal problem), previous works mainly focus on how to retrieve better memory. However, one fundamental limitation exists in the current literature: the memory is retrieved from a fixed corpus and is bounded by the quality of that corpus. Because the retrieval space is finite, bounded memory greatly limits the potential of memory-augmented generation models. In this paper, by exploring the dual of the primal problem, namely that better generation also prompts better memory, we propose a framework called Selfmem, which iteratively uses a retrieval-augmented generator itself to generate an unbounded memory pool and uses a memory selector to pick one generated memory for the next generation round. By combining the primal and dual problems, a retrieval-augmented generation model can lift itself up with its own output in the infinite generation space. To verify our framework, we conduct extensive experiments across various text generation scenarios including neural machine translation, abstractive summarization and dialogue generation over seven datasets and achieve state-of-the-art results on JRC-Acquis (four directions), XSum (50.3 ROUGE-1) and BigPatent (62.9 ROUGE-1).
1 Introduction
In recent years, retrieval-augmented text generation has attracted
increasing attention in multiple fields including neural machine
translation (NMT) (Gu et al., 2018; Khandelwal et al., 2021; Cheng et al.,
2022), abstractive text summarization (Cao et al., 2018; Peng et al.,
2019), dialogue response generation (Song et al., 2016; Cai et al., 2019a;
Wu et al., 2019) and language modeling (Grave et al., 2017; Khandelwal
et al., 2020; Zhong et al., 2022). This new generation paradigm first
endows a generation model with access to external memory (typically the
training corpus) via some information retrieval technique and then
generates text on the basis of the retrieved memory. Instead of generating
from scratch, retrieval-augmented generation equips the generation model
with more information and potentially alleviates the difficulty of text
generation (Li et al., 2022).

Figure 1: Relation between memory and hypothesis on the JRC-Acquis En→De
dataset. The hypothesis is generated by a retrieval-augmented translator
whose memory is retrieved from the training set. The X-axis (Memory
Similarity) represents the similarity between memory and the reference and
the Y-axis (Hypothesis BLEU) represents the translation quality. Both are
measured by BLEU.

In this paradigm, the guiding principle for mem-
ory retrieval is to retrieve one that shares maxi-
mum similarity with the current input (Lewis et al.,
2020a;Khandelwal et al.,2020;Yogatama et al.,
2021), which is also in line with human intuition
that a more similar reference sample would typi-
cally provide more hints. As shown in Figure 1, for
a retrieval-augmented NMT model, regardless of
other factors that may affect the translation qual-
ity (e.g., polysemy, morphology and coreference),
the memory similarity alone possesses a strong
correlation with the final translation quality. We
define this as the primal problem: better memory
prompts better generation, so a number of
works focus on how to retrieve better memory:
from sparse retrieval to more expressive dense re-
trieval (Cao and Xiong,2018;Parvez et al.,2021);
from a fixed retriever to a learnable retriever (Lewis
et al.,2020c;Cai et al.,2021); from sentence-level
memory to more fine-grained token-level mem-
ory (Khandelwal et al.,2020,2021).
However, one fundamental limitation exists for
all previous works: the memory is retrieved from a
fixed corpus and is bounded by the quality of the
corpus. Due to the finite retrieval space, bounded
memory would greatly limit the potential of the
memory-augmented generation model. In this paper, by exploring the primal
problem's duality, better generation also prompts better memory, we
propose a novel framework called Selfmem, which iteratively adopts a
retrieval-augmented generator itself to generate an unbounded memory pool
and uses a memory selector to pick one output as memory for the next
generation round. By combining the primal and dual problem, a
retrieval-augmented generation model could lift itself up with its own
output in the infinite generation space (Raffel et al., 2020), dubbed as
self-memory. The key insight of Selfmem is that the text more similar in
distribution to the data in inference is not the training data (Wang
et al., 2022), but the model's own output.
In particular, there are two complementary components in Selfmem: a
retrieval-augmented generator and a memory selector. First, we train the
generator with retrieved memory and use the out-
put from the generator to train a memory selec-
tor based on a certain metric. Then, simply by
substituting unbounded generated memory for the
retrieved memory, we get generation output with
higher quality (primal problem) which serves as
the memory for the next round after being selected
by the memory selector (dual problem).
To verify the effectiveness of Selfmem, we conduct extensive experiments
across seven datasets under three text generation scenarios: neural
machine translation, abstractive text summarization and dialogue
generation. We observe significant improvements over strong baselines and
achieve state-of-the-art results on JRC-Acquis (four directions), XSum
(50.3 ROUGE-1) and BigPatent (62.9 ROUGE-1). To further understand
Selfmem, we also carefully examine each key component and locate the
current system bottleneck for future research.
To summarize, our key contributions are:
• We are the first to investigate the problem of bounded memory in the
  retrieval-augmented literature.
• By combining the primal and dual problem, we propose Selfmem, a
  retrieval-augmented framework that could lift itself up with its own
  unbounded output as self-memory.
• We conduct extensive experiments across various text generation
  scenarios and greatly improve the state-of-the-art performance.
2 Related Work
2.1 Retrieval-augmented Text Generation
Since the world is not a snapshot once the training
corpus is collected, we can never expect an ever-
large model to capture everything in its parameters,
even for LLMs like GPT-3 (Brown et al.,2020),
and it is important to endow the model with access
to an external memory bank to solve different NLP
tasks (Lewis et al.,2020c).
For the translation task, long before machine translation, the
localization industry had been proposing retrieval techniques to help
human translators achieve higher productivity and consistency (Yamada,
2011). Early works on machine translation mainly focused on employing
memory for statistical machine translation (SMT) systems (Simard and
Isabelle, 2009; Liu et al., 2012). As for NMT, Gu et al. (2018) were the
first to use search engines to retrieve memory from the training set and
incorporate it with an external memory network. The follow-up
works mainly focus on the different aspects of the
retrieval-augmented NMT including the memory
encoding method (Xia et al.,2019;Xu et al.,2020;
He et al.,2021), the joint training of retriever and
generator with monolingual data (Cai et al.,2021),
the granularity of the memory (Khandelwal et al.,
2021) and the diversity of the memories (Cheng
et al.,2022). On dialogue response generation
task, exemplar/template retrieval as an intermedi-
ate step has been shown beneficial to informative
response generation (Weston et al.,2018;Wu et al.,
2019;Cai et al.,2019a,b). There are also works
on other applications such as abstractive summa-
rization (Wang et al.,2022;Cao et al.,2018;Peng
et al.,2019), code generation (Hashimoto et al.,
2018), paraphrase generation (Kazemnejad et al.,
2020;Su et al.,2021), language modeling (Khan-
delwal et al.,2020;Zhong et al.,2022) and all
knowledge-intensive tasks that can be framed as a
text generation problem (Lewis et al.,2020c).
2.2 Neural Text Reranking
By alleviating the discrepancy between training
and inference (i.e., exposure bias) and directly op-
timizing the desired metrics, two-stage reranking
methods have enabled strong progress in several
branches of text generation tasks. In machine trans-
lation, the seminal work of Shen et al. (2004) and
Och et al. (2004) introduced and popularized dis-
criminative reranking to SMT. For NMT, there are
two lines of work focusing on reranking: generative
reranking (Liu et al.,2018;Imamura and Sumita,
2017;Wang et al.,2017) and discriminative rerank-
ing (Lee et al.,2021;Salazar et al.,2020;Deng
et al.,2020). In syntactic parsing, Collins and
Koo (2005) was the first to employ a two-stage
reranking method to select from the output from
a base parser and Charniak and Johnson (2005)
introduced maximum entropy reranker. For text
summarization, RefSum (Liu et al.,2021) defines a
second-stage summarization framework that helps
address the problem of the train-test distribution
mismatch. SimCLS (Liu and Liu,2021) utilizes
pair-wise Learning To Rank (LTR) to select the candidates with the highest
matching score. SummaReranker (Ravaut et al., 2022a) adopts a multi-task
mixture-of-experts framework to utilize different metrics that capture
different aspects of generated candidates. BRIO (Liu et al., 2022) re-uses the
base model for a second round of fine-tuning with
both the cross-entropy loss and a candidate-level
ranking loss. JGR (Shen et al.,2022) adopts an
alternate training paradigm to train the generator
and reranker.
One limitation of these reranking methods is that they are a one-way
process in which the selected candidates are the final output of the
system, whereas our framework employs the selected candidates as the
memory for the next generation round of a retrieval-augmented generator,
which is capable of generating better candidates with better memory.
3 Self Memory
In this section, we start with a motivating experiment on generation as
memory in §3.1. Then we introduce Selfmem, a framework consisting of a
retrieval-augmented generator (§3.2) and a memory selector (§3.3). The
whole framework and algorithm are shown in Figure 2 and Algorithm 1.
3.1 Generation as Memory
The key motivation of our framework is based on
the observation that the memory more similar in
distribution to the data in inference is not the train-
ing data (38.89 BLEU in Table 1), but the model’s
own output (58.58 BLEU) in the unbounded gen-
eration space. One interesting exploration is to
directly use the generation as memory in respect
of the primal problem: a better memory prompts
better generation.
We conduct experiments on the JRC-Acquis En→De dataset with a
retrieval-augmented generator detailed in §3.2. As shown in Table 1, the
first row is the conventional retrieval-augmented setting with retrieved
memory, which achieves a 58.58 BLEU score on the test set. Directly
feeding this beam output back to the generation model as memory (Beam)
brings no improvement (58.43 BLEU), even though the beam outputs are much
more similar to the reference than the retrieved ones. Our conjectures
are: (1) the retrieval-augmented generator could not generalize well in
this setting because of the memory distribution shift (from 38.89 to
58.58); (2) the beam memory does not provide any information gain
compared with the retrieved memory, even though it overlaps more with the
references.
             Memory   Hypothesis
Retrieval    38.89    58.58
Beam         58.58    58.43
Reference    100      90.43
Random       1.14     49.08
Table 1: Experiments on a fixed retrieval-augmented translator with
different memory on the JRC-Acquis En→De dataset, measured by BLEU.
To exclude the former one, we explore the best and worst scenarios by
using the reference as memory (Reference) and randomly sampled sentences
as memory (Random). Table 1 shows that a retrieval-augmented generator
trained with retrieved memory, with fixed parameters, has already learned
how to leverage the memory information in both the oracle and the random
scenario. For the latter conjecture, we first define the token sets of the
reference, retrieved memory, and beam memory as $R$, $M$, and $B$. The
overlap token set $O$ is defined as the tokens overlapping with the
reference that are in the beam memory but not in the retrieved memory,
which is $R \cap B - R \cap M$. $O$ is deemed the additional information
that the beam memory provides. We calculate the confidence score
$\psi(\cdot)$ of a set as:

$$\psi(\cdot) = \frac{1}{|\cdot|} \sum_{y_i \in \cdot} p(y_i \mid x, y_{<i}) \tag{1}$$

where $p(y_i \mid x, y_{<i})$ is defined by the generation model.
$\psi(\cdot)$ measures how confident the generation model is when
generating the tokens in the set. $\psi(R)$ is 0.58 while $\psi(O)$ is
0.76, which means the generator is quite confident to generate the tokens
in $O$ (Xu et al., 2022; Edunov et al., 2018), so it does not need to
resort to external memory (Kumar et al., 2016). Because beam search ranks
generated candidates based on $p(y|x)$, the selected memory falls into the
confidence region of the generator and thus provides no information gain.
This motivates us to select memory according to metrics other than
$p(y|x)$ in the memory selector (§3.3).

Figure 2: Overall framework. There are two components in Selfmem, a
retrieval-augmented generator (a) and a memory selector (b). For the
primal problem, (a) takes source and memory as input to generate
candidates for (b). For the dual problem, (b) takes as input source and
generated candidates to select memory for (a). Panel (a), Retrieval-augmented
Generator, is trained with an NLL loss against the target; panel (b),
Memory Selector, is trained with a KL loss between the target distribution
and the predicted distribution.
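
The confidence analysis above can be illustrated with a small Python sketch. It builds the overlap token set (reference tokens that appear in the beam memory but not in the retrieved memory) and averages the generator's token probabilities over a set. The token sets, the probabilities, and the function names are hypothetical toy values, and looking up one probability per token type is a simplifying assumption: in the paper the probability is conditioned on the source and the decoded prefix.

```python
from typing import Dict, Iterable, Set


def overlap_set(reference: Set[str], retrieved: Set[str], beam: Set[str]) -> Set[str]:
    """Reference tokens found in the beam memory but not in the retrieved memory."""
    return (reference & beam) - (reference & retrieved)


def confidence(tokens: Iterable[str], token_prob: Dict[str, float]) -> float:
    """Mean generator probability over the tokens of a set (the psi score)."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("confidence is undefined for an empty set")
    return sum(token_prob[t] for t in tokens) / len(tokens)


# Toy data (assumption): token sets and per-token generator probabilities.
reference = {"the", "council", "adopted", "this", "regulation"}
retrieved = {"the", "commission", "adopted", "a", "decision"}
beam = {"the", "council", "adopted", "this", "decision"}
token_prob = {"the": 0.9, "council": 0.8, "adopted": 0.7, "this": 0.6, "regulation": 0.2}

extra = overlap_set(reference, retrieved, beam)
psi_reference = confidence(reference, token_prob)
psi_extra = confidence(extra, token_prob)
```

In this toy case the extra tokens contributed by the beam memory are exactly the ones the generator already predicts with above-average confidence, which mirrors the argument that beam output adds little information as memory.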
3.2 Retrieval-augmented Generator
Consider a text pair $(x, y)$, where $x = \{x_1, ..., x_{|x|}\}$ is the
source and $y = \{y_1, ..., y_{|y|}\}$ is the target. The pair could be
(document, summary) in summarization or (context, response) in dialogue
generation. Retrieval-augmented generation first uses $x$ to retrieve
memory $m$ from a datastore $D$. Then the generator $G_\xi(x, m)$,
parameterized by $\xi$, takes both $x$ and $m$ as input to generate the
target sentence $y$. In this paper, following standard practice (Cheng
et al., 2022; Wang et al., 2022), we choose the training set as
$D = \{(x_i, y_i)\}_{i=1}^{|D|}$ and only keep the target side of the
top-1 retrieval result as memory. For the generator $G_\xi$, we consider
two commonly used retrieval-augmented architectures: Joint-Encoder (Guu
et al., 2020; Wang et al., 2022; Lewis et al., 2020c) and Dual-Encoder
(Xia et al., 2019; Cai et al., 2021; Cheng et al., 2022).
Joint-Encoder. This architecture is the standard encoder-decoder-based
model (Bahdanau et al., 2015; Sutskever et al., 2014; Vaswani et al.,
2017). The input is the concatenation of $x$ and $m$. The encoder first
maps the input into the hidden states $H$:

$$H = \mathrm{Encoder}(x\ [\mathrm{SEP}]\ m) \tag{2}$$

The decoder then incorporates $H$ through the attention mechanism (Vaswani
et al., 2017) and generates tokens in an auto-regressive manner:

$$h^i = \mathrm{Decoder}(\mathrm{CrossAttn}(H), y_{<i}) \tag{3}$$

$$P_{G_\xi}(\cdot \mid x, y_{<i}) = \mathrm{Softmax}(h^i) \tag{4}$$
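Dual-Encoder. Instead of treating $x$ and $m$ as one long sequence, this
architecture has two encoders, one for $x$ and the other for $m$. Their
outputs are sequentially attended by the decoder with dual cross attention
as in Cheng et al. (2022):

$$H_x = \mathrm{SourceEncoder}(x) \tag{5}$$

$$H_m = \mathrm{MemoryEncoder}(m) \tag{6}$$

$$h^i = \mathrm{Decoder}(\mathrm{CrossAttn}(H_x, H_m), y_{<i}) \tag{7}$$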
We use Transformer (Vaswani et al., 2017) as the building block for both
architectures and optimize $G_\xi$ with the NLL loss:

$$\mathcal{L}_{nll} = -\sum_{t=1}^{|y|} \log P_{G_\xi}(y_t \mid x, m, y_{<t}) \tag{8}$$
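
To make the NLL objective concrete, here is a minimal pure-Python sketch that sums the negative log-probabilities of the target tokens, given one vector of decoder logits per step. The function names and the toy logits are illustrative assumptions; a real generator would produce the logits from the source, the memory and the decoded prefix.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def nll_loss(step_logits: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Sum over target positions of -log P(y_t | x, m, y_<t)."""
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    return -sum(log_softmax(logits)[t] for logits, t in zip(step_logits, target_ids))


# Toy example (assumption): vocabulary of 4 tokens, 3 decoding steps.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
loss_uniform = nll_loss(uniform_logits, [0, 1, 3])

# Logits that favour the correct token at each step give a lower loss.
peaked_logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
loss_peaked = nll_loss(peaked_logits, [0, 1, 3])
```

With uniform logits every token has probability 1/4, so the loss equals three times log 4; sharpening the logits around the target tokens lowers it.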
3.3 Memory Selector
The role of the memory selector $S_\theta(x, c)$, parameterized by
$\theta$, is to select one candidate $c$ from the candidate pool $C$
generated by $G_\xi$ based on a certain metric $\Delta(\cdot,\cdot)$. The
selected candidate $c$ is then used as memory $m$ for the next generation
round of $G_\xi$. According to §3.1, using $p(y|x)$ as the metric
$\Delta(\cdot,\cdot)$ would make the selected memory fall into the
confidence region of $G_\xi$ and bring no information gain. Also, a larger
$p(y|x)$ does not necessarily guarantee better generation quality (Meister
et al., 2020). Thus we define $\Delta(\cdot,\cdot)$ as the model-free
metrics that are widely used to measure generation quality (e.g., BLEU for
NMT, ROUGE for summarization). Our memory selector takes as input the
concatenation of the source $x$ and a candidate $c_i$ and outputs a
multinomial distribution $p_{S_\theta}(\cdot \mid x)$ over $C$:

$$p_{S_\theta}(c_i \mid x) = \frac{\exp(S_\theta(x\ [\mathrm{SEP}]\ c_i))}{\sum_{j=1}^{|C|} \exp(S_\theta(x\ [\mathrm{SEP}]\ c_j))} \tag{9}$$
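Following Lee et al. (2021), the training objective of $S_\theta$ is to
minimize the discrepancy between the prediction of $S_\theta$ and the
score calculated by $\Delta(\cdot,\cdot)$, where the discrepancy is
measured by KL divergence:

$$\mathcal{L}_{kl} = \mathrm{KL}\big(p_M \,\|\, p_{S_\theta}(\cdot \mid x)\big) = \sum_{i=1}^{|C|} p_M(c_i) \log \frac{p_M(c_i)}{p_{S_\theta}(c_i \mid x)} \tag{10}$$

where $p_M$ is the metric distribution defined as:

$$p_M(c_i) = \frac{\exp(\Delta(c_i, y)/\tau)}{\sum_{j=1}^{|C|} \exp(\Delta(c_j, y)/\tau)} \tag{11}$$

$\tau$ is the temperature that controls the smoothness of the
distribution. At inference, the output of $S_\theta$ is
$\arg\max_{c_i \in C} p_{S_\theta}(c_i \mid x)$.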
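
The selector objective can be sketched in a few lines of pure Python: a softmax over selector scores gives the predicted distribution, a temperature-scaled softmax over metric scores gives the target distribution, and the KL divergence between them is the training signal. The scores below and the helper names are hypothetical; the temperature value 0.5 is only an example.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def metric_distribution(metric_scores: Sequence[float], tau: float) -> List[float]:
    """Target distribution: softmax of metric scores divided by temperature tau."""
    return softmax([s / tau for s in metric_scores])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same candidate pool."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def select(selector_scores: Sequence[float]) -> int:
    """Inference: index of the candidate with the highest selector score."""
    return max(range(len(selector_scores)), key=lambda i: selector_scores[i])


# Toy pool of three candidates (assumption): raw selector scores and
# metric scores (e.g. sentence-level BLEU in [0, 1]) against the reference.
selector_scores = [0.2, 1.5, -0.3]
metric_scores = [0.40, 0.55, 0.35]

p_selector = softmax(selector_scores)
p_metric = metric_distribution(metric_scores, tau=0.5)
loss = kl_divergence(p_metric, p_selector)
chosen = select(selector_scores)
```

A smaller temperature sharpens the target distribution around the best-scoring candidate, while a larger one spreads the probability mass more evenly across the pool.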
3.4 Combine Generator and Selector
For the generator $G_\xi$, we define two generation modes. The first is
the hypothesis mode, which produces one output for one input and is used
for system evaluation. The second is the candidate mode, which generates N
outputs for one input and is used for training $S_\theta$ and for memory
selection. By combining these two modes, the whole framework of Selfmem is
shown in Algorithm 1.
Algorithm 1 Selfmem Framework
Require: a dataset $D$, a retriever $R$, a memory selection metric
    $\Delta(\cdot,\cdot)$, a retrieval-augmented generator $G_\xi$, and a
    memory selector $S_\theta$
1: retrieve memory $M$ in $D$ with $R$
2: train $G_\xi$ with $D$ and $M$
3: use $G_\xi$ to generate candidate pool $C$ with $M$ in candidate mode
4: train $S_\theta$ on $C$ with $\Delta(\cdot,\cdot)$
5: while not converged do
6:     $S_\theta$ selects memory from $C$ as $M$
7:     $G_\xi$ generates candidate pool $C$ with $M$ in candidate mode
8: end while
9: $G_\xi$ generates the final hypothesis with $M$ in hypothesis mode
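
The iterative part of Algorithm 1 (steps 5 to 9) can be sketched in Python as follows. This is only an illustration under strong assumptions: the generator and selector are passed in as already-trained callables, convergence is approximated by stopping when a development score no longer improves, and the `toy_*` functions are made-up stand-ins (the toy generator even peeks at the target) whose sole purpose is to make the control flow executable.

```python
from typing import Callable, List, Sequence, Tuple

Generator = Callable[[str, str, int], List[str]]        # (source, memory, n) -> n outputs
Selector = Callable[[str, Sequence[str]], str]          # (source, pool) -> chosen memory
Scorer = Callable[[Sequence[str], Sequence[str]], float]  # (sources, hypotheses) -> score


def selfmem_loop(sources: Sequence[str], retrieved: Sequence[str], generate: Generator,
                 select: Selector, score: Scorer, n_candidates: int = 3,
                 max_rounds: int = 10) -> Tuple[List[str], float]:
    """Alternate candidate generation and memory selection until no dev gain."""
    def hypothesize(memory: Sequence[str]) -> List[str]:
        # Hypothesis mode: one output per input.
        return [generate(x, m, 1)[0] for x, m in zip(sources, memory)]

    memory = list(retrieved)
    best = score(sources, hypothesize(memory))
    for _ in range(max_rounds):
        # Candidate mode: N outputs per input, conditioned on current memory.
        pools = [generate(x, m, n_candidates) for x, m in zip(sources, memory)]
        new_memory = [select(x, pool) for x, pool in zip(sources, pools)]
        new_score = score(sources, hypothesize(new_memory))
        if new_score <= best:
            break
        memory, best = new_memory, new_score
    return hypothesize(memory), best


# ---- Toy stand-ins (assumptions, not the trained models) ----
TARGETS = {"src-1": "a b c d e"}


def toy_generate(source: str, memory: str, n: int) -> List[str]:
    """Returns target prefixes that extend the memory's correct prefix."""
    target, mem = TARGETS[source].split(), memory.split()
    k = 0
    while k < min(len(mem), len(target)) and mem[k] == target[k]:
        k += 1
    return [" ".join(target[: min(len(target), k + i)]) for i in range(1, n + 1)]


def toy_select(source: str, candidates: Sequence[str]) -> str:
    """Picks the longest candidate, standing in for a trained selector."""
    return max(candidates, key=lambda c: len(c.split()))


def toy_score(sources: Sequence[str], hyps: Sequence[str]) -> float:
    """Fraction of target tokens produced, averaged over inputs."""
    ratios = [len(h.split()) / len(TARGETS[s].split()) for s, h in zip(sources, hyps)]
    return sum(ratios) / len(ratios)


hyps, dev = selfmem_loop(["src-1"], [""], toy_generate, toy_select, toy_score)
```

In the toy run, each round's selected candidate is a better memory than the previous one, so the hypothesis improves until the score saturates and the loop stops.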
4 Experimental Setup
4.1 Dataset
We evaluate Selfmem on 3 different tasks with 7 datasets. The data
statistics are in Appendix A.
Translation. We evaluate our framework on the JRC-Acquis datasets
(Steinberger et al., 2006), a collection of parallel legislative text of
European Union Law. It is the benchmark dataset used in the translation
memory-augmented NMT task (Gu et al., 2018; Xia et al., 2019; Cai et al.,
2021; Cheng et al., 2022). We choose 4 translation directions, namely,
Spanish↔English (Es↔En) and German↔English (De↔En).
Summarization. We evaluate on 2 summarization datasets: 1) XSum (Narayan
et al., 2018), extreme summarization, a single-document summarization
dataset with highly abstractive articles from the British Broadcasting
Corporation. 2) BigPatent (Sharma et al., 2019), consisting of 1.3 million
records of U.S. patent documents along with human-written abstractive
summaries.
Dialogue. We experiment on DailyDialog (Li et al., 2017), which contains
multi-turn dialogs on daily life topics and is used by Chen et al.
(2022a); Bao et al. (2020); Zhao et al. (2022).
4.2 Implementation Details
We use BM25 (Robertson and Zaragoza, 2009) to conduct retrieval. All
experiments are based on the Transformers library (Wolf et al., 2020) and
are conducted on 8 NVIDIA A100 GPUs. The candidate generation method is
beam search with beam size 50 for all tasks. The number of iterations is
decided by the performance on the validation set.

System                   Es→En          En→Es          De→En          En→De
                         Dev    Test    Dev    Test    Dev    Test    Dev    Test
None Memory
Bahdanau et al. (2015)   55.02  59.34   50.54  50.48   50.20  49.74   44.94  43.98
Transformer              64.08  64.63   62.02  61.80   60.18  60.16   54.65  55.43
Retrieval Memory
Gu et al. (2018)         60.28  59.34   57.62  57.27   55.63  55.33   49.26  48.80
Zhang et al. (2018)      63.97  64.30   61.50  61.56   60.10  60.26   55.54  55.14
Xia et al. (2019)        66.37  66.21   62.50  62.76   61.85  61.72   57.43  56.88
Cai et al. (2021)        67.73  67.42   64.18  63.86   64.48  64.62   58.77  58.42
Cheng et al. (2022)      67.48  67.76   63.84  64.04   64.22  64.33   58.94  58.69
Transformer_dual⋆        66.87  67.12   63.14  63.54   64.09  63.36   58.69  58.06
Transformer_joint†       67.74  67.32   63.93  64.12   64.50  64.40   58.16  58.58
Self Memory
Table 2: Results of translation task on JRC-Acquis measured by BLEU. We
compare three kinds of translation systems. The top section is the vanilla
sequence-to-sequence model without memory. The second section consists of
models equipped with retrieved translation memory. Models denoted by the
same symbol (⋆ and †) have the same parameters and only differ in memory
as input. The bolded numbers show the SOTA performance and the underlined
numbers show the second-best result. ∗ denotes the system is significantly
better than baselines with p-value < 0.05 tested by Koehn (2004).

For translation, following Xu et al. (2020); Cai et al. (2021); Cheng
et al. (2022), we use a randomly initialized Transformer architecture
(Vaswani et al., 2017) as $G_\xi$. The evaluation metrics are BLEU, TER
and chrF++ from SacreBLEU (Post, 2018). The backbone of the memory
selector $S_\theta$ is XLM-R (Conneau et al., 2020) with BLEU as
$\Delta(\cdot,\cdot)$. For summarization, we init $G_\xi$ with BART (Lewis
et al., 2020b) for BigPatent following Wang et al. (2022) and
state-of-the-art BRIO (Liu et al., 2022) for XSum. The evaluation metric
is ROUGE (R-1/2/L) (Lin, 2004). For dialogue generation, we use BART as
the backbone for $G_\xi$. We evaluate our dialogue system with BLEU
(B-1/2) and Distinct (D-1/2) (Li et al., 2016). For both the dialogue and
summarization tasks, we follow Liu and Liu (2021); Feng et al. (2022) and
adopt RoBERTa (Liu et al., 2019) as the backbone for $S_\theta$. We choose
the linear combination of B-1/2 as $\Delta(\cdot,\cdot)$ for dialogue
generation and of R-1/2/L for summarization, following Shen et al. (2022).
For more details, we refer to Appendix B and Appendix C. Our code is
open-sourced.
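
For readers unfamiliar with BM25-based memory retrieval, the following self-contained Python sketch scores datastore sources against a query with the Okapi BM25 formula and returns the target side of the top-1 hit as memory. The class and function names, the `k1` and `b` defaults, whitespace tokenization, the `exclude` option and the toy sentence pairs are all assumptions made for illustration; the paper does not specify its retrieval implementation beyond naming BM25.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple


class BM25Index:
    """Minimal Okapi BM25 over whitespace-tokenized documents."""

    def __init__(self, docs: Sequence[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.docs = [d.split() for d in docs]
        self.k1, self.b = k1, b
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        self.idf: Dict[str, float] = {
            t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()
        }

    def score(self, query: str, i: int) -> float:
        """BM25 score of document i for the query."""
        tf = self.tfs[i]
        norm = self.k1 * (1 - self.b + self.b * len(self.docs[i]) / self.avgdl)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query.split() if t in tf
        )


def retrieve_memory(query: str, pairs: Sequence[Tuple[str, str]], index: BM25Index,
                    exclude: Optional[int] = None) -> str:
    """Target side of the top-1 source match; `exclude` skips one datastore entry."""
    candidates = [i for i in range(len(pairs)) if i != exclude]
    best = max(candidates, key=lambda i: index.score(query, i))
    return pairs[best][1]


# Toy datastore of (source, target) pairs (assumption).
pairs: List[Tuple[str, str]] = [
    ("the council adopted this regulation", "der Rat hat diese Verordnung erlassen"),
    ("the commission adopted this decision", "die Kommission hat diese Entscheidung erlassen"),
    ("member states shall inform the commission", "die Mitgliedstaaten unterrichten die Kommission"),
]
index = BM25Index([src for src, _ in pairs])
memory = retrieve_memory("the council adopted a regulation", pairs, index)
```

The `exclude` argument is one way to avoid retrieving an example's own target when the datastore is the training set itself.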
5 Experimental Results
5.1 Machine Translation
We choose four translation directions and exper-
iment on two generator architectures (joint and
dual as detailed in §3.2). The baselines consist
of two kinds of translation systems, one is the
vanilla sequence-to-sequence model (Bahdanau
et al.,2015;Vaswani et al.,2017) without memory
augmentation, and the other is retrieval-augmented
translation models focusing on memory encod-
ing (Gu et al.,2018;Xia et al.,2019), mem-
ory construction (Zhang et al.,2018), memory
retrieval (Cai et al., 2021) and memory diversity (Cheng et al., 2022).
Based on the experimental results shown in Table 2, Selfmem, across four
translation datasets and two different architectures, significantly boosts
the performance of $G_\xi$, which is remarkable considering that the
parameters of $G_\xi$ stay fixed and the only changing variable is the
memory given as input. This also complies with the primal problem that
better memory would typically prompt better generation results.
The dual problem is revealed in Table 3. The self-memory, which is
actually the model's own output, shares more similarities with the ground
truth and serves as better memory to produce the final generation. This
also shows one salient difference between Selfmem and previous candidate
reranking works (Lee et al., 2021; Ravaut et al., 2022a). Reranking aims
to select candidates that are of higher quality than the beam output,
while in Selfmem the selected candidates serve as the memory for the
retrieval-augmented generator and do not necessarily need to outperform
the beam output.
Since a higher BLEU score in such a range (> 50) cannot safely guarantee a
better translation system (Callison-Burch et al., 2006), we also evaluate
our system with TER and chrF++. Results in Table 4 show that Selfmem
consistently outperforms baselines in the other two metrics.
         Retrieval               Self
         memory   hypothesis     memory   hypothesis
En-De    38.89    58.58          57.92    60.11
De-En    42.56    64.40          64.32    65.65
En-Es    40.67    64.12          63.57    65.94
Es-En    43.05    67.32          67.78    68.80
Table 3: Comparison between retrieval memory and self-memory. The quality
of memory and hypothesis is measured by the n-gram overlap with the
reference (BLEU). All experiments are conducted with Transformer_joint on
JRC-Acquis.
System               Memory      BLEU    chrF++   TER
Transformer          None        55.43   70.31    36.35
Transformer_dual     Retrieval   58.06   71.58    35.41
Transformer_joint    Retrieval   58.58   72.22    34.39
Transformer_dual     Self        59.49   72.62    34.04
Transformer_joint    Self        60.11   73.25    32.62
Table 4: Evaluation results on JRC-Acquis En→De measured by BLEU, TER and
chrF++.
5.2 Summarization
We compare Selfmem with REINA (Wang et al., 2022), a retrieval-augmented
framework, PEGASUS (Zhang et al., 2020) and BART (Lewis et al., 2020b).
The result is shown in Table 5. First, we observe that memory has
different impacts on different datasets. The improvement brought by memory
in BigPatent is much larger than that in XSum, which could be attributed
to the characteristics of the dataset itself: BigPatent is composed of
official patent documents that are mutually similar, and this greatly
boosts the summarization quality according to the primal problem. We also
find that self-memory greatly improves the performance of BRIO (+1.2 R-1)
and BART (+18.5 R-1) and achieves SOTA results on both datasets. We choose
these baselines for fair comparison in that they share the same base
generator. Due to space limitations, we include more comparisons and the
confidence region of the SOTA model in Appendix D.

System          Memory      R-1    R-2    R-L
XSum
PEGASUS         None        47.2   24.6   39.3
BRIO            None        49.1   25.6   40.4
REINA (PG)      Retrieval   48.2   26.0   40.2
REINA (B)       Retrieval   43.2   21.0   35.5
REINA (L)       Retrieval   46.5   24.1   38.6
BRIO_dual⋆      Retrieval   48.6   26.1   40.6
BRIO_joint†     Retrieval   49.5   26.5   41.2
BRIO_dual⋆      Self        49.2   26.2   40.8
BRIO_joint†     Self        50.3   26.7   41.6
BigPatent
PEGASUS         None        53.6   33.2   43.2
BART            None        44.4   21.3   31.0
REINA (B)       Retrieval   59.5   42.6   50.6
REINA (L)       Retrieval   60.7   43.3   51.3
REINA (PG)      Retrieval   44.6   21.5   33.3
BART_dual⋆      Retrieval   57.4   43.3   49.7
BART_joint†     Retrieval   59.6   43.4   51.0
BART_dual⋆      Self        61.2   44.6   52.3
BART_joint†     Self        62.9   48.1   59.6
Table 5: Results of summarization task on XSum and BigPatent measured by
ROUGE. Models denoted by the same symbol (⋆ and †) have the same
parameters and only differ in memory as input.
5.3 Dialogue Generation
As shown in Table 6, self-memory greatly boosts the performance of the
retrieval-augmented generator for dialogue generation tasks. By optimizing
memory with BLEU as $\Delta(\cdot,\cdot)$, the self-memory improves over
retrieved memory by 3.08 B-1 and 0.6 B-2 on BART_joint. Interestingly,
although Selfmem outperforms baselines in terms of B-1/2, it falls short
on D-1 and D-2, which could be explained by the trade-off between BLEU
score and Distinct score when evaluating a dialogue system (Zheng et al.,
2021). To overcome this problem, we choose D-1,2 as $\Delta(\cdot,\cdot)$
when optimizing $S_\theta$, denoted as BART_joint (D). The result in
Table 6 shows the flexibility of Selfmem: by directly optimizing memory,
it can target the desired attribute for diverse and informative response
generation.
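
As a reference point for the Distinct scores discussed above, here is a minimal Python sketch of Distinct-n (the number of distinct n-grams divided by the total number of n-grams) and of a selection metric that averages D-1 and D-2. Whitespace tokenization, the function names, and applying the average to a single candidate are assumptions for illustration; the paper only states that the metric for BART_joint (D) is the average of D-1 and D-2.

```python
from typing import List, Sequence, Set, Tuple


def distinct_n(responses: Sequence[str], n: int) -> float:
    """Distinct n-grams divided by total n-grams over a set of responses."""
    total = 0
    unique: Set[Tuple[str, ...]] = set()
    for response in responses:
        tokens = response.split()
        grams: List[Tuple[str, ...]] = [
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        ]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


def distinct_metric(candidate: str) -> float:
    """Average of D-1 and D-2 for one candidate (assumed per-candidate usage)."""
    return (distinct_n([candidate], 1) + distinct_n([candidate], 2)) / 2


# Toy responses (assumption).
responses = ["i am fine", "i am here"]
d1 = distinct_n(responses, 1)   # 4 distinct unigrams out of 6
d2 = distinct_n(responses, 2)   # 3 distinct bigrams out of 4
```

Repetitive candidates score lower under this metric, which is why selecting memory by Distinct pushes the generator toward more diverse responses at some cost in BLEU.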
Figure 3: (a) Hypothesis: generation quality (Hypothesis BLEU) in the
iteration process with different $S_\theta$ in both generator
architectures. (b) Candidates: candidate quality (Candidates BLEU) in the
iteration process with an oracle $S_\theta$ and Transformer_joint as
$G_\xi$.
System                   Memory      B-1     B-2     D-1    D-2
Vinyals and Le (2015)    None        33.60   26.80   3.00   12.80
Fang et al. (2019)       None        30.90   24.90   2.90   25.00
Bao et al. (2021)        None        34.80   25.12   3.54   25.11
Li et al. (2021b)        None        36.17   27.67   4.56   27.12
BART                     None        20.72   11.36   3.92   19.44
BART_dual⋆               Retrieval   29.50   21.89   4.74   26.01
BART_joint†              Retrieval   36.72   31.55   6.13   35.65
BART_dual⋆               Self        33.43   22.85   4.66   26.16
BART_joint†              Self        39.80   32.15   5.84   32.16
BART_joint† (D)          Self        36.92   32.09   9.12   37.05
Table 6: Results of dialogue generation task on DailyDialog measured by
B-1/2 and D-1/2. Models denoted by the same symbol (⋆ and †) have the same
parameters and only differ in memory as input. BART_joint (D) denotes that
the metric $\Delta(\cdot,\cdot)$ for $S_\theta$ is the average of D-1 and
D-2.
6 Further Analysis
To obtain a further understanding of Selfmem, we carefully investigate how
its two key components, $G_\xi$ and $S_\theta$, affect the final
generation quality. We conduct experiments on the JRC-Acquis En→De
dataset.
Tuning $G_\xi$. As discussed in §3.1, we have already demonstrated that a
trained retrieval-augmented generator, with fixed parameters, has learned
how to distinguish between "good" and "bad" memory. This explains why we
choose to fix the generator in our framework and suggests that $G_\xi$ is
not the current system bottleneck of Selfmem.
Tuning $S_\theta$. We experiment with different $S_\theta$ by directly
selecting memory from the candidate pool based on their real ranking. As
shown in Figure 3a, in both architectures, the model could iteratively
improve itself with its own output and surpass the current SOTA
performance (60.11 BLEU) by a large margin. This inspires us to explore a
more powerful selection model for better performance in future work. We
also investigate the candidate pool quality in this iterative process with
an oracle $S_\theta$, and the result is shown in Figure 3b. A clear
pattern in this boxplot is that the oracle, quartile, average and minimum
scores of the candidate pool all get boosted. These two experiments
explain the intuition behind Selfmem of combining the primal and dual
problems: a trained retrieval-augmented generator benefits from better
memory, which can be selected from its own unbounded output, and the
generator with better memory produces a better candidate pool for the next
round of selection. So in this iterative process, the model uplifts itself
with its own outputs.
7 Conclusion
For the first time, we investigate the fundamental limitation of bounded
memory in the current retrieval-augmented literature. Based on the key
insight that the text more similar in distribution to the data in
inference is not the training data, but the generation model's unbounded
output, we combine the primal and dual problems and propose Selfmem, a
general framework for retrieval-augmented text generation that uplifts the
generation model with its own output. There are two components in Selfmem:
a retrieval-augmented generator and a memory selector. We conduct
comprehensive experiments across various text generation tasks including
neural machine translation, abstractive summarization, and dialogue
generation. We surpass all baselines and greatly improve the
state-of-the-art performance on several datasets. We also carefully
examine each key component in our framework and locate the current system
bottleneck for future research.
Limitations
We discuss the limitations of our framework as follows:
(1) Although Selfmem greatly improves the generation quality compared with
other retrieval-augmented generation models, it requires more
computational resources for the memory selection process. For large
datasets with long text (e.g., BigPatent), this becomes a more crucial
problem considering the quadratic time complexity of the transformer
architecture.
(2) This paper proposes a general idea for retrieval-augmented generation.
But we only experiment with transformer-based architectures for both the
generator and the memory selector, and the architecture of the generator
and memory selector stays the same across all text generation tasks. We
believe that task-specific design of the model architecture, training
objective and generation methods in different text generation scenarios
would further improve the performance.
A Dataset Details
In Table 7, we show the detailed information about the datasets we use in
this paper.
Task            Dataset        #Train      #Dev     #Test
Translation     JRC (en↔de)    663,487     2,454    2,483
                JRC (en↔es)    653,127     2,533    2,596
Summarization   BigPatent      1,207,222   67,068   67,072
                XSum           204,045     11,332   11,334
Dialogue        DailyDialog    87,170      8,069    7,740
Table 7: Dataset statistics for three tasks.
B Evaluation Details
Machine Translation. We evaluate our MT system with BLEU, TER and chrF++
from SacreBLEU (Post, 2018). The signatures for BLEU, TER and chrF++ are
shown in Table 8.
Table 8: Signature from SacreBLEU.
Summarization. We evaluate our summarization system with the standard
ROUGE (Lin, 2004) Perl package for evaluation. Following Liu et al.
(2022), we use the PTB tokenizer for tokenization. The parameters for
ROUGE are "-c 95 -r 1000 -n 2 -m".
Dialogue Generation. Following Fu et al. (2022), we evaluate our dialogue
system with NLTK BLEU, with space as the tokenizer and smoothing method1.
The Distinct score is from Li et al.
C Self Memory Details
For machine translation tasks, following Xu et al. (2020); Cai et al.
(2021); Cheng et al. (2022), we use a randomly initialized Transformer
architecture (Vaswani et al., 2017) as $G_\xi$. We use the joint-BPE
algorithm (Sennrich et al., 2016) and share the parameters between the
memory encoder and source encoder for the dual encoder architecture.
master/ROUGE-RELEASE- 1.5.5
The hyper-parameter setting follows Cheng et al. (2022) with dropout 0.1,
label smoothing 0.1, gradient clipping 1.0, Adafactor (Shazeer and Stern,
2018), warm-up steps 4000, maximum learning rate 4.4e-2 and 30 training
epochs in total. The evaluation metrics are BLEU, TER and chrF++ from
SacreBLEU (Post, 2018). The backbone of the memory selector $S_\theta$ is
XLM-R (Conneau et al., 2020) with BLEU as $\Delta(\cdot,\cdot)$. The
hyper-parameter setting for $S_\theta$ follows Lee et al. (2021) with
$\tau$ 0.5, minmax normalization for candidate ranking, Adam optimizer
with max learning rate 5e-5 and polynomial decay scheduler, and classifier
dropout 0.2.
For Summarization, we init the $G_\xi$ with BART (Lewis et al., 2020b) for
BigPatent following Wang et al. (2022) and state-of-the-art BRIO (Liu
et al., 2022) for XSum. Optimization is based on Adafactor with a maximum
learning rate of 5e-3, warm-up steps 10000 and gradient clipping value
1.0. The maximum input length is 512 for XSum and 1024 for BigPatent. The
evaluation metric is ROUGE (R-1/2/L) (Lin, 2004).
For Dialogue Generation, we use BART as the backbone for $G_\xi$ on
DailyDialog. We tune the learning rate over {5e-3, 1e-3, 4e-4} and set
dropout 0.1, batch size 64, label smoothing factor 0.1 and maximum input
length 120 for DailyDialog. Following Bao et al. (2020); Chen et al.
(2022a), we evaluate our dialogue system with BLEU (B-1/2) and Distinct
(D-1,2) (Li et al., 2016).
For both the Summarization and Dialogue Generation tasks, we follow Liu
and Liu (2021); Feng et al. (2022) and adopt RoBERTa (Liu et al., 2019) as
the backbone for $S_\theta$. We choose the linear combination of B-1/2 as
$\Delta(\cdot,\cdot)$ for Dialogue Generation and of R-1/2/L for
Summarization, following Shen et al. (2022). We tune the hyper-parameter
$\tau$ over {0.08, 0.2, 0.5, 0.8} and the learning rate over {5e-5, 7e-5,
2e-4}. The maximum input length for $S_\theta$ is 512 and we truncate
tokens from the longer input of source and candidate.
D More Baselines
In Table 9, we include more baselines on the benchmark datasets XSum and
BigPatent. We also report the confidence region of the SOTA model for XSum
and BigPatent as shown in Table 10.
System                       R-1    R-2    R-L
XSum
Liu and Lapata (2019)        38.8   16.5   31.3
Lewis et al. (2020b)         45.1   22.3   37.3
Zhang et al. (2020)          47.2   24.6   39.3
Liu and Liu (2021)           47.6   24.6   39.4
Liu et al. (2022)            49.1   25.6   40.4
Wang et al. (2022) (PG)      48.2   26.0   40.2
Wang et al. (2022) (B)       43.1   21.0   35.5
Wang et al. (2022) (L)       46.5   24.1   38.6
Ravaut et al. (2022a)        48.1   25.0   40.0
Ravaut et al. (2022b)        47.1   24.1   38.8
Chen et al. (2022b)          47.8   25.0   39.7
Selfmem                      50.3   26.7   41.6
BigPatent
Zhang et al. (2020)          53.6   33.1   42.3
Lewis et al. (2020b)         44.4   21.3   31.0
Zaheer et al. (2020)         60.6   42.5   50.0
Pilault et al. (2020)        38.7   12.3   34.1
Wu et al. (2021)             45.0   20.3   39.2
Aghajanyan et al. (2021)     52.3   33.5   42.8
Wang et al. (2022) (B)       59.5   42.6   50.6
Wang et al. (2022) (L)       60.7   43.3   51.3
Wang et al. (2022) (PG)      44.6   21.5   33.3
Selfmem                      62.9   48.1   59.6
Table 9: More baselines on XSum and BigPatent.

Dataset      Metric   Score   Confidence region
XSum         R-1      50.3    0.49986 - 0.50602
             R-2      26.7    0.26300 - 0.26989
             R-L      41.6    0.41231 - 0.41900
BigPatent    R-1      62.9    0.62664 - 0.63080
             R-2      48.1    0.47783 - 0.48333
             R-L      59.6    0.59401 - 0.59847
Table 10: Confidence region for SOTA model in XSum and BigPatent.
