PARROT: Translating During Chat Using Large Language Models
Wenxiang Jiao∗Jen-tse Huang Wenxuan Wang Xing Wang Shuming Shi Zhaopeng Tu
Tencent AI Lab
Abstract
Large language models (LLMs) like ChatGPT
and GPT-4 have exhibited remarkable abili-
ties on a wide range of natural language pro-
cessing (NLP) tasks, including various ma-
chine translation abilities accomplished during
chat. However, these models are only acces-
sible through restricted APIs, which creates
barriers to new research and advancements in
the field. Therefore, we propose the PARROT framework to enhance and regulate the translation abilities during chat based on open-source LLMs (i.e., LLaMA-7b) and human-written translation and evaluation data. Specifically, PARROT reformulates translation data
into the instruction-following style, and intro-
duces a “Hint” field for incorporating extra
requirements to regulate the translation pro-
cess. Accordingly, we propose three instruc-
tion types for finetuning PARROT models, in-
cluding translation instruction, contrastive in-
struction, and error-guided instruction. Ex-
periments on Flores subsets and WMT22 test
sets suggest that translation instruction im-
proves the translation performance of vanilla
LLMs significantly, while error-guided instruction can lead to a further improvement, which demonstrates the importance of learning from low-quality translations annotated by humans. Meanwhile, the PARROT models can also preserve their ability on general tasks with the Alpaca multi-task dataset involved in finetuning.
1 Introduction
Large language models (LLMs) designed to follow instructions, such as ChatGPT and GPT-4 (OpenAI, 2023), have garnered considerable interest due to their remarkable abilities in comprehending instructions and generating human-like responses. These versatile models can efficiently perform a wide range of natural language processing (NLP) tasks within a single architecture,
∗Correspondence: joelwxjiao@tencent.com
1 https://github.com/wxjiao/ParroT
Figure 1: Framework of PARROT. The example shows the instruction "Translate the following sentences to English.", the Chinese source sentence 有些鹦鹉可以用适当的词语回应简单的命令或问题，从而与人类建立一种基本的交流方式。, its English translation "Some parrots can respond to simple commands or questions with the appropriate words, establishing a basic form of communication with humans.", and an optional hint line ("Hint: preferred/no errors/entity alignments/…"). Hints are (optional) extra requirements to regulate the translation process.
including question answering (Omar et al., 2023), text summarization (Yang et al., 2023), grammatical error correction (Wu et al., 2023), and machine translation (Jiao et al., 2023). Consequently, they
represent a significant stride toward the realization
of artificial general intelligence (AGI).
Machine translation, a quintessential NLP task,
faces both challenges and opportunities presented
by the emergence of LLMs. Traditional machine
translation encompasses several sub-tasks (Farhad et al., 2021), such as bilingual translation (Vaswani et al., 2017), multilingual translation (Johnson et al., 2017; Jiao et al., 2022), terminology translation (Wang et al., 2022), quality estimation (Rei et al., 2020), and automatic post-editing (Pal et al., 2016), among others. These tasks are typically
addressed by individual models with limited cross-
task interaction. However, current LLMs have the
potential to revolutionize this inefficient approach
and redefine the machine translation paradigm. On
one hand, LLMs can leverage the benefits of var-
ious sub-tasks and seamlessly transition between
them using only natural language instructions. For
instance, if a user is dissatisfied with a transla-
tion result, they can request the LLM to refine the
translation implicitly (i.e., through automatic post-
editing) or explicitly, by imposing constraints on
specific entities (i.e., terminology translation). On
the other hand, LLMs are expected to enhance the
explainability of machine translation, ultimately
leading to further improvements in translation qual-
ity. For example, users may want LLMs to com-
pare two translations of a sentence (i.e., quality
estimation) and provide an explanation for the dis-
crepancies (i.e., error analysis), which can then
be addressed in a targeted manner by the LLM it-
self. However, superior LLMs like ChatGPT and
GPT-4 are only accessible through restricted APIs,
which creates barriers to new research and advance-
ments in the field. Therefore, developing compre-
hensive and general machine translation abilities
upon open-sourced LLMs has become a critical
and challenging research problem.
In this paper, we propose PARROT to enhance
and regulate the translation abilities of LLMs dur-
ing chat by leveraging existing human-written translation and evaluation data. To be compatible with chat, the PARROT framework reformulates translation data into the instruction-following style (Taori et al., 2023), and introduces a "Hint" field for incorporating extra requirements or constraints to guide
the translation process. Accordingly, we propose three distinct instruction types: (1) Translation Instruction, which asks LLMs to generate translations based on source sentences; (2) Contrastive Instruction, which asks LLMs to generate the translations of two different systems with the preferred one first; and (3) Error-Guided Instruction, which asks LLMs to generate translations with human-annotated errors as the hint. The first instruction guarantees the basic translation ability of LLMs while the latter two regulate the LLMs to align with human feedback (Ouyang et al., 2022; Liu et al., 2023). While we only use the above instruction
types in this paper, it is natural to extend PARROT
to other hints (e.g., entity alignments), which we
leave for future exploration.
We adopt the open-source LLaMA-7b (Touvron et al., 2023) model as our base model, and conduct instruction tuning on previous WMT validation data and Multidimensional Quality Metrics (MQM) human evaluation data. The resulting PARROT models are evaluated on Flores subsets and WMT22 test sets. Our main findings are summarized below:
• Translation instruction, as expected, can improve the translation performance of LLMs significantly, especially for directions from English to other languages.
• Contrastive instruction does not work as expected, which may result from the subtle differences between translations by the competitive WMT systems.
• Error-guided instruction can further improve the performance when asking PARROT to generate translations with no errors, indicating the importance of learning from low-quality translations annotated by humans.
• With the Alpaca multi-task dataset involved, PARROT can also preserve its capability on general tasks, such as question answering and code generation.
2 Instruction Pool
In this section, we introduce our instruction pool based on human-written translation and evaluation data. Specifically, three distinct instruction types are included, namely, translation instruction, contrastive instruction, and error-guided instruction. The first instruction guarantees the basic translation ability of LLMs while the latter two regulate the LLMs to align with human feedback.
2.1 Translation Instruction
As with traditional translation systems, we rely on bilingual sentence pairs to establish the basic translation ability of LLMs. We follow Stanford Alpaca (Taori et al., 2023) to transform bilingual sentence pairs into the instruction-following format, named translation instruction, for finetuning.
Table 1 presents an example of the translation instruction, which includes a preface fixed for all tasks, an "### Instruction:" to describe the translation task (e.g., stating the language pair), an "### Input:" with the source sentence, and a "### Response:" with the target sentence to be generated. To ensure the high quality of sentence pairs, we use human-written translations rather than public training data that could be potentially noisy.
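For concreteness, the construction of a translation instruction can be sketched in Python as follows. The field wording mirrors Table 1; the function name, the preface constant, and the returned dictionary keys are illustrative assumptions rather than the released code.

PREFACE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
)

def make_translation_instruction(src_lang, tgt_lang, src_sent, tgt_sent):
    """Wrap one bilingual sentence pair into the instruction-following format.

    A sketch only: in practice the instruction wording is paraphrased into
    33 variants (see Section 3.1) and may differ from this template.
    """
    prompt = (
        PREFACE
        + "### Instruction:\n"
        + f"Translate the following sentences from {src_lang} to {tgt_lang}.\n\n"
        + "### Input:\n"
        + f"{src_sent}\n\n"
        + "### Response:"
    )
    # The target sentence is what the model is trained to generate.
    return {"prompt": prompt, "completion": tgt_sent}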
2.2 Contrastive Instructions
Besides the basic translation ability, we also want LLMs to understand the relative quality difference between translations. In this way, we may improve the quality of translations by asking LLMs to output the preferred ones. To realize this goal, we need multiple different translations for each source sentence, which can be acquired from the systems submitted to WMT competitions. Meanwhile, the
Table 1: Instruction types for finetuning PARROT models.
Type Instruction
Below is an instruction that describes a task, paired with an input that provides further context. Write
a response that appropriately completes the request.\n\n
Translation ### Instruction:
Translate the following sentences from Chinese to English.\n\n
### Input:
检查情况显示,市场销售的粮油、肉类、水果、蔬菜、蛋奶等生活必需品供应充足,商品
价格基本稳定,未发现严重违法违规行为,市场经营秩序总体平稳。\n\n
### Response:The inspection results showed that there was an adequate supply of daily necessities,
including grain, oil, meat, fruit, vegetable, milk, and eggs in the market and commodity prices
basically remain stable, the administration found no serious offensive and noncompliant conducts,
and the market order remains stable on the whole.
Contrastive ### Instruction:
Translate the following sentences from Chinese to English.\n\n
### Input:
检查情况显示,市场销售的粮油、肉类、水果、蔬菜、蛋奶等生活必需品供应充足,商品
价格基本稳定,未发现严重违法违规行为,市场经营秩序总体平稳。\n\n
### Hint: We prefer to translate it to\n\n
### Response:<p>The inspection results showed that there was an adequate supply of daily necessi-
ties, including grain, oil, meat, fruit, vegetable, milk, and eggs in the market and commodity prices
basically remain stable, the administration found no serious offensive and noncompliant conducts,
and the market order remains stable on the whole.</p> rather than <p>The results of the inspec-
tion indicate the sufficient supply of living necessities on marketing including cereals and oils, meat,
fruits, vegetables, eggs and milk, and the basically stabilized commodity price. The inspection hasn’t
found serious violation of laws and regulations. The market order is stable on an overall basis.</p>
Error-Guided ### Instruction:
We are translating the following sentences from Chinese to English. \n\n
### Input:
检查情况显示,市场销售的粮油、肉类、水果、蔬菜、蛋奶等生活必需品供应充足,商品
价格基本稳定,未发现严重违法违规行为,市场经营秩序总体平稳。\n\n
### Hint: A translation with major accuracy/mistranslation errors could be\n\n
### Response:The results of the inspection indicate the sufficient supply of living necessities <v>on
marketing</v> including cereals and oils, meat, fruits, vegetables, eggs and milk, and the basically
stabilized commodity price. The inspection hasn’t found serious violation of laws and regulations.
The market order is stable on an overall basis.
### Hint: A translation with minor fluency/grammar errors could be\n\n
### Response:The results of the <v>inspection</v> indicate the sufficient supply of living necessi-
ties on marketing including cereals and oils, meat, fruits, vegetables, eggs and milk, and the basically
stabilized commodity price. The inspection hasn’t found serious violation of laws and regulations.
The market order is stable on an overall basis.
human evaluation results of these systems also pro-
vide scores to reflect the quality differences.
As shown in Table 1, we form the response by concatenating two translations (e.g., linked by "rather than"), in which the first translation has a higher quality score. Meanwhile, we indicate that the first translation is preferred in the "### Hint:" field. Essentially, the second translation acts as a negative sample for this sentence pair, hence the name contrastive instruction.
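Under the same assumptions as the earlier sketch, a contrastive instruction might be assembled from two system outputs and their quality scores as follows; the "<p></p>" markers and the "rather than" connective follow the example in Table 1.

PREFACE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
)

def make_contrastive_instruction(src_sent, trans_a, score_a, trans_b, score_b,
                                 src_lang="Chinese", tgt_lang="English"):
    """Form a contrastive instruction: the higher-scored translation comes first."""
    preferred, other = (trans_a, trans_b) if score_a >= score_b else (trans_b, trans_a)
    prompt = (
        PREFACE
        + "### Instruction:\n"
        + f"Translate the following sentences from {src_lang} to {tgt_lang}.\n\n"
        + "### Input:\n"
        + f"{src_sent}\n\n"
        + "### Hint: We prefer to translate it to\n\n"
        + "### Response:"
    )
    completion = f"<p>{preferred}</p> rather than <p>{other}</p>"
    return {"prompt": prompt, "completion": completion}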
2.3 Error-Guided Instructions
The potential problem with contrastive instruction is that it only tells the LLMs that the two translations differ in quality, but does not clarify which kinds of translation errors lead to such differences. However, we want LLMs to learn the correspondence between the errors and the translations. With a deeper understanding of translation errors, we may ask LLMs to produce translations with no errors so as to improve the quality.
Therefore, we propose error-guided instruction. As shown in Table 1, we use the translation with errors annotated by the "<v></v>" span to form the response. Similar to contrastive instruction, we adopt the "### Hint:" field to indicate the error types. These fine-grained error annotations also come from the human evaluation data.
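Again as a sketch under the same assumptions, an error-guided instruction might be formed as below, with the MQM error span kept inside "<v></v>" tags and the error severity and category verbalized in the hint.

def make_error_guided_instruction(src_sent, translation_with_span, severity, category,
                                  src_lang="Chinese", tgt_lang="English"):
    """Form an error-guided instruction from an MQM-annotated translation.

    `translation_with_span` already contains the erroneous span marked as <v>...</v>;
    `severity` is e.g. "major"/"minor", `category` e.g. "accuracy/mistranslation".
    The preface string is the same as in the previous sketches.
    """
    preface = (
        "Below is an instruction that describes a task, paired with an input that "
        "provides further context. Write a response that appropriately completes the request.\n\n"
    )
    prompt = (
        preface
        + "### Instruction:\n"
        + f"We are translating the following sentences from {src_lang} to {tgt_lang}.\n\n"
        + "### Input:\n"
        + f"{src_sent}\n\n"
        + f"### Hint: A translation with {severity} {category} errors could be\n\n"
        + "### Response:"
    )
    return {"prompt": prompt, "completion": translation_with_span}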
3 Experimental Setups
3.1 Training Data
Alpaca Data. This dataset is built by the Stanford Alpaca (Taori et al., 2023) project, which contains 52.0K multi-task instruction-following examples for tuning the LLaMA (Touvron et al., 2023) models. The dataset is a list of dictionaries, each containing three fields: "instruction" describes the task the model should perform, "input" describes optional context or input for the task, and "output" describes the answer to the instruction as generated by text-davinci-003. We call these data general instructions, which enable the LLaMA models to perform well on general tasks.
WMT Validation Data. We use human-written validation data from previous WMT competitions rather than public training data to avoid introducing noise into instruction tuning. In this version, we use the newstest2017-2020 sets of the Chinese⇔English (i.e., Zh⇔En) and German⇔English (i.e., De⇔En) tasks, which consist of 51.2K sentence pairs over the four directions. These sentence pairs are formed into the translation instructions.
MQM Human Evaluation Data. Our human evaluation data comes from the Multidimensional Quality Metrics (MQM) datasets (Freitag et al., 2021), which annotate the translation errors (e.g., major accuracy/mistranslation, minor fluency/grammar) of top WMT systems. Due to its higher reliability than Direct Assessment, MQM was introduced to WMT competitions starting from WMT20 but is only provided for a few language pairs. In this version, we use the MQM data for the WMT20 En⇒De and Zh⇒En submissions. These data are formed into the contrastive instructions based on the quality scores and the error-guided instructions based on the error annotations, respectively. We use 26K instructions from each of the two instruction types to form a 52.0K dataset.
2 https://github.com/tatsu-lab/stanford_alpaca
3 https://github.com/facebookresearch/llama
4 https://github.com/google/wmt-mqm-human-evaluation
Automatically Assessed Data. Although the Direct Assessment (DA) data of WMT systems provide scores for language directions that lack MQM data (i.e., De⇒En, En⇒Zh), we find the DA scores to be very unreliable, as they can be quite different for two similar translations. Instead, we opt for automatic evaluation metrics like COMET to score the translations of WMT systems. We also heuristically determine a rough error level for each translation based on the COMET score, namely, Major Error: [0, 85]; Minor Error: (85, 90]; No Error: (90, 100]. This decision comes in part from the observation that the top commercial translation systems achieve COMET scores of nearly 90 on the Flores subset in Table 3.
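The heuristic above can be expressed as a small function; this is a sketch of the stated thresholds only, assuming COMET scores reported on the 0-100 scale used in this paper.

def comet_to_error_level(comet_score: float) -> str:
    """Map a COMET score (0-100 scale) to a rough error level.

    Thresholds follow the heuristic above: [0, 85] major, (85, 90] minor, (90, 100] none.
    """
    if comet_score <= 85:
        return "major"
    elif comet_score <= 90:
        return "minor"
    else:
        return "no error"

For instance, a translation scoring 87.5 would be labeled as having a minor error.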
Note: To obtain a set of diverse instructions, we use the three instructions in Jiao et al. (2023), including the one in Table 1, as seeds to ask GPT-4 (OpenAI, 2023) to paraphrase them. In total, we have 33 different instructions that are randomly combined with the training examples.
3.2 Evaluation
Test Data. We evaluate the translation performance of LLMs on two sources of test sets:
• Flores Subset: This dataset is a subset of the Flores benchmark, in which 50 sentences are sampled for German, English, Romanian and Chinese, respectively, for evaluating the translation performance of ChatGPT (Jiao et al., 2023).
• WMT22 Test Sets: We also use the test sets from the WMT22 competition (Kocmi et al., 2022), which are constructed based on more recent content from various domains, including news, social, e-commerce, and conversational domains. The numbers of samples for the De⇒En, En⇒De, Zh⇒En and En⇒Zh tasks are 1984, 2037, 1875 and 2037, respectively.
Metrics. For automatic evaluation, we adopt BLEU (Papineni et al., 2002) as implemented in SacreBLEU (Post, 2018), and COMET (Rei et al., 2020) with the Unbabel/wmt22-comet-da checkpoint, which are driven by n-gram similarity and cross-lingual pretrained models, respectively.
5 https://github.com/wxjiao/Is-ChatGPT-A-Good-Translator
6 https://github.com/mjpost/sacrebleu
7 https://github.com/Unbabel/COMET
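For reference, scoring a set of outputs with these two metrics might look like the sketch below, assuming recent versions of the sacrebleu and unbabel-comet Python packages; the exact versions and options used in this paper may differ.

import sacrebleu
from comet import download_model, load_from_checkpoint

hypotheses = ["The inspection results showed that ..."]   # system outputs
references = ["The inspection results showed that ..."]   # human references
sources    = ["检查情况显示，..."]                           # source sentences

# Corpus-level BLEU via SacreBLEU (default tokenization and settings).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# Reference-based COMET with the Unbabel/wmt22-comet-da checkpoint.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
comet_out = model.predict(data, batch_size=8, gpus=1)
print(f"COMET: {comet_out.system_score:.4f}")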
Table 2: Ablation study of key factors on Flores En⇒De subset with Alpaca-7b.

Prompt     Instruct.  Search   BLEU   COMET
no-input   TP1        sample   20.09  80.03
no-input   TP1        beam 4   22.19  79.13
no-input   TP3        sample   19.43  79.00
no-input   TP3        beam 4   21.52  79.08
input      TP1        sample   21.00  79.51
input      TP1        beam 4   23.32  80.56
input      TP3        sample   19.33  78.68
input      TP3        beam 4   20.64  80.07
3.3 Model Training
We conduct our experiments on LLaMA (Touvron et al., 2023) models with HuggingFace Transformers. The hyper-parameters for finetuning are consistent with those used in Stanford Alpaca (Taori et al., 2023). We finetune the LLaMA-7b model into three variants:
• Alpaca-7b, a reimplementation of the Stanford Alpaca model, finetuned only on the Alpaca multi-task dataset.
• PARROT-7b, finetuned on both the Alpaca multi-task dataset and the WMT validation data.
• PARROT-HINT-7b, finetuned on the Alpaca multi-task dataset, the WMT validation data, the MQM human evaluation data and the automatically assessed data.
We finetune the Alpaca-7b and PARROT-7b models for 3 epochs on the corresponding data combinations. For PARROT-HINT-7b, we finetune for 1.5 epochs to maintain a similar number of training steps as PARROT-7b. We conduct finetuning on 32 Nvidia A100 GPUs and utilize DeepSpeed ZeRO stage 3 for model parallelism.
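As a rough sketch of the data mixtures behind the three variants listed above (file names and loading code are hypothetical), each variant differs only in which instruction sets are concatenated before finetuning.

import json

def load_json(path):
    with open(path) as f:
        return json.load(f)

alpaca    = load_json("alpaca_data.json")       # 52K general instructions
wmt_trans = load_json("wmt_translation.json")   # 51.2K translation instructions
mqm_hint  = load_json("mqm_hint.json")          # contrastive + error-guided instructions (MQM)
auto_hint = load_json("auto_assessed.json")     # COMET-scored error-guided instructions

training_mixtures = {
    "Alpaca-7b":      alpaca,
    "PARROT-7b":      alpaca + wmt_trans,
    "PARROT-HINT-7b": alpaca + wmt_trans + mqm_hint + auto_hint,
}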
4 Results
4.1 Ablation Study
Before diving into more experiments, we investigate some factors that may affect the translation performance of LLMs. By default, we conduct the ablation studies on the Flores En⇒De subset with the Alpaca-7b model.
8 https://github.com/huggingface/transformers
9 https://github.com/microsoft/DeepSpeed
Prompt Format. In the Alpaca multi-task dataset, about 60% of the examples contain an empty "### Input:" field, which results in two different prompt formats during finetuning, i.e., prompt-input and prompt-no-input. During inference, Alpaca uses prompt-no-input, which combines the instruction and input to fill the "### Instruction:" field, introducing an inconsistency between finetuning and inference. Therefore, we study whether such an operation causes any performance variation.
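For clarity, the two prompt formats can be sketched as follows; the wording is simplified from Table 1 and the Stanford Alpaca templates, and the example instruction and source are illustrative.

PROMPT_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

instruction = "Translate the following sentences from Chinese to English."
source = "检查情况显示，……"

# prompt-input: the instruction and the source sentence occupy separate fields.
p_input = PROMPT_INPUT.format(instruction=instruction, input=source)
# prompt-no-input: the source sentence is folded into the instruction field.
p_no_input = PROMPT_NO_INPUT.format(instruction=instruction + "\n" + source)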
Instruction Variation. Recent studies (Jiao et al., 2023; Zhang et al., 2023) suggest that LLMs are sensitive to task instructions, which can vary the translation performance considerably. We conduct a brief study of this by comparing the TP1 and TP3 instructions in Jiao et al. (2023). TP1 is the one presented in Table 1, while TP3 is "Please provide the [TGT] translation for the following sentences.", which was demonstrated to be a better choice when tested on ChatGPT.
Search Algorithm. In machine translation, beam search (Sutskever et al., 2014; Freitag and Al-Onaizan, 2017; Vaswani et al., 2017) has been the standard search algorithm for inference, which finds a translation that approximately maximizes the conditional probability given by the translation model. However, beam search requires high computation costs, which can become infeasible with LLMs since it easily induces out-of-memory (OOM) issues. Therefore, more efficient search algorithms such as sampling may have to be the choice. We compare the sampling strategy (Taori et al., 2023) and the beam search strategy with a beam size of 4 for this factor.
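Concretely, the two decoding strategies compared here might be invoked as in the sketch below, assuming a HuggingFace Transformers causal LM; the model path and the generation hyper-parameters other than the beam size are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/alpaca-7b"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda()

prompt = "Translate the following sentences from Chinese to English.\n检查情况显示，……"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling (illustrative settings in the spirit of the Alpaca inference script).
sampled = model.generate(**inputs, do_sample=True, top_p=0.9, temperature=0.7,
                         max_new_tokens=256)

# Beam search with a beam size of 4, no sampling.
beamed = model.generate(**inputs, do_sample=False, num_beams=4, max_new_tokens=256)

print(tokenizer.decode(beamed[0], skip_special_tokens=True))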
Table 2 presents the results of these ablation studies. We have the following observations: (1) The prompt-input format performs slightly better than prompt-no-input, though the gap is marginal. (2) The TP1 instruction works better on Alpaca-7b than TP3, which differs from the finding on ChatGPT. (3) Generally, beam search outperforms sampling significantly, especially in terms of BLEU score. Therefore, we use prompt-input + TP1 + beam search as the default setting for inference.
4.2 Main Results
Table 3 presents the translation performance of LLMs on the two benchmarks, i.e., Flores subsets and WMT22 test sets. For Flores subsets, we also
10 https://chat.openai.com
Table 3: Translation performance of LLMs on Flores subsets and WMT22 test sets.

                        De⇒En           En⇒De           Zh⇒En           En⇒Zh
System                  BLEU  COMET     BLEU  COMET     BLEU  COMET     BLEU  COMET
Flores Subsets
Google                  45.04 88.79     41.16 88.61     31.66 87.71     43.58 88.42
DeepL                   49.23 89.70     41.46 89.03     31.22 87.39     44.31 88.11
ChatGPT                 43.71 89.10     38.87 88.14     24.73 85.81     38.27 86.99
GPT-4                   46.00 89.31     45.73 89.28     28.50 87.42     42.50 88.40
LLaMA-7b                 3.43 60.15      2.41 49.03      1.86 53.73      0.05 47.64
Alpaca-7b               36.66 86.87     23.32 80.56     15.13 81.20      9.81 58.69
PARROT-7b               41.34 87.75     28.58 83.30     19.56 83.17     24.77 79.90
PARROT-HINT-7b          41.07 87.97     30.83 84.38     19.22 83.96     25.84 80.15
 + Infer w/ Prefer.     38.10 87.60     23.07 83.97     18.69 83.10     22.54 80.12
 + Infer w/ No Err.     42.25 88.73     32.10 84.94     21.57 83.72     27.44 81.88
WMT22 Test Set
Alpaca-7b               27.82 82.33     20.17 78.13     14.24 74.00     10.44 62.10
PARROT-7b               26.61 82.57     24.03 80.43     18.10 75.38     27.08 78.45
PARROT-HINT-7b          27.33 82.47     24.68 81.22     18.90 75.26     28.15 79.31
 + Infer w/ No Err.     27.38 82.47     26.14 81.67     20.23 75.90     30.33 80.34
Table 4: Effects of error levels as hints during inference on Flores subsets.

              En⇒De            Zh⇒En
Hint          BLEU   COMET     BLEU   COMET
None          30.83  84.38     19.22  83.96
No Err.       32.10  84.94     21.57  83.72
Minor Err.    28.84  83.60     20.67  82.14
Major Err.    28.57  82.93     19.33  80.55
include the results reported in Jiao et al. (2023) on
ChatGPT and GPT-4.
Instruction tuning exploits the potential of vanilla LLMs for machine translation. Obviously, the vanilla LLaMA-7b model performs badly on the Flores subsets. By inspecting the outputs, we find that the vanilla LLaMA-7b model tends to generate very long sentences (e.g., copying the instruction, continuing text expansion), which makes the generated text neither faithful to the source sentences nor grammatically correct. One reason could be the long-context modeling during pretraining. Another reason is that we use the Alpaca inference format, which is basically a zero-shot setting that provides no guidance for translation.
Tuning LLaMA-7b on the Alpaca multi-task dataset (i.e., Alpaca-7b) can ameliorate the above issue. Basically, Alpaca-7b can produce complete translations that have similar lengths as the references. The translation performance is also boosted noticeably, particularly for language pairs like De⇒En, as English is the dominant language of the Alpaca dataset. Further introducing a small amount of translation instructions (i.e., PARROT-7b) in the four language directions significantly improves the performance, especially for En⇒Zh, for which Chinese was unseen in the pretraining of LLaMA models (Touvron et al., 2023). The findings are also consistent on the WMT22 test sets.
Learning from low-quality translations annotated by humans is also important. While presenting high-quality bilingual pairs to LLMs is important, as discussed above, we argue that low-quality translations annotated by humans also bring benefits. As shown in Table 3, with no hint for inference, PARROT-HINT-7b outperforms PARROT-7b slightly on translation directions from English to other languages (i.e., En⇒De, En⇒Zh). However, when asking PARROT-HINT-7b to generate translations with no errors, the performance can be significantly improved across translation directions and test sets. We speculate that PARROT-HINT-7b does learn the relationship between errors and translations through error-guided instruction, such that it
Table 5: Examples from Flores Zh⇒En subset when translated with varied error levels (COMET scores in parentheses).

SRC: 例如,在卢瓦尔河谷和莱茵河谷参观城堡,或者乘游轮去游览多瑙河沿岸好玩的城市,或者沿伊利运河乘船游玩。
REF: For example visiting castles in the Loire Valley, the Rhine valley or taking a cruise to interesting cites on the Danube or boating along the Erie Canal.
None (81.22): For example, visiting the castles in the Rhine Valley and the Lahn Valley, or taking a cruise along the Rhine River to explore the cities along the river, or taking a boat trip along the Illi River.
No Err. (81.49): For example, visiting the castles in the Rhine Valley and the Ruhr Valley, or taking a cruise to visit the cities along the Rhine River, or taking a boat ride along the Illi River.
Minor Err. (78.12): For example, visiting the castles in the Rhine Valley and the Ruhr Valley, or taking a cruise to visit the cities along the Rhine River, or taking a boat to play in the cities along the Elbe River.
Major Err. (65.39): For example, visiting the fortress in the valley of the River Volga and the River Levan, or taking a cruise to visit the cities along the River Volga, or taking a boat to play in the cities along the River Volga.
Table 6: Effects of preference as hint during inference on Flores subsets.

              En⇒De            Zh⇒En
Hint          BLEU   COMET     BLEU   COMET
None          30.83  84.38     19.22  83.96
Prefer.       23.07  83.97     18.69  83.10
Unprefer.     29.19  83.76     19.68  82.35
can avoid the translation errors as much as possible
when the hint of no error is provided.
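For illustration, inference with the no-error hint might be constructed as in the following sketch; the exact hint wording used at inference time is an assumption modeled on the training hints in Table 1.

def build_hinted_prompt(src_sent, src_lang="Chinese", tgt_lang="English", hint=None):
    """Build an inference prompt, optionally with an error-level hint.

    The no-error hint text in the usage example below is an assumption
    modeled on the training hints shown in Table 1.
    """
    prompt = (
        "Below is an instruction that describes a task, paired with an input that provides "
        "further context. Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n"
        f"Translate the following sentences from {src_lang} to {tgt_lang}.\n\n"
        "### Input:\n"
        f"{src_sent}\n\n"
    )
    if hint is not None:
        prompt += f"### Hint: {hint}\n\n"
    return prompt + "### Response:"

no_error_prompt = build_hinted_prompt(
    "检查情况显示，……",
    hint="A translation with no errors could be",
)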
A bit unexpectedly, when asking PARROT-HINT-7b to generate preferred translations, the performance drops considerably. As stated in Section 2.3, contrastive instruction only indicates that two translations may have quality differences but does not state why, which is difficult for LLMs to identify by themselves. A previous study by Min et al. (2022) also suggests that it is easier for LLMs to learn the instruction formats than the input-response patterns.
4.3 Analysis
Effectiveness of Error-Guided Instruction. To understand how error-guided instruction works, we investigate the behavior of PARROT-HINT-7b when asking it to generate translations with varied error levels as hints. As shown in Table 4, the translation quality becomes worse from no error to minor error to major error, especially in terms of COMET score. The translations generated with no hint are usually comparable with the minor error level. This demonstrates that, regulated by the human annotations, PARROT-HINT-7b places erroneous translations into other regions of the probability space. As a result, PARROT-HINT-7b is more likely to generate high-quality translations with the "no error" hint.
For qualitative analysis, we show an example from the Flores Zh⇒En subset in Table 5, in which we highlight all errors in each translation. Compared to the no-error level, the minor and major error levels tend to produce more over-translations and mistranslations. It is also important to point out that the no-error level does not guarantee that completely correct translations will be generated, especially for named entities, which we attribute to the under-explored translation abilities of current LLMs.
Failure of Contrastive Instruction. We try to understand why contrastive instruction does not work. By examining the responses of PARROT-HINT-7b when asking it to generate preferred translations, we observe significant differences in lexical choices between the "preferred" and "unpreferred" (i.e., the second translation in the response) translations. Surprisingly, as shown in Table 6, the "unpreferred" translations obtain a much higher BLEU score, though this is not the case for the COMET score. This indicates that PARROT-HINT-7b attempted to identify the quality differences between the first and second translations in the contrastive instructions through lexical choices, which is a low-level pattern for reflecting translation quality. One potential reason is that the WMT systems are so competitive with each other that the quality differences between them are too subtle for the LLM to learn effectively. We will investigate more
Table 7: Examples of general tasks.
System Question-Answering
Tell me about alpacas.
Alpaca-7b Alpacas are a species of camelid native to South America. They are domesticated and bred for their
fiber, which is used in the production of knitted and woven items such as clothing, blankets, and rugs.
Alpacas are typically kept as livestock, and can live up to 20 years in captivity. They are typically
brown, white, or grey in color, and can grow up to 12 inches tall at the shoulder.
PARROT-HINT-7b Alpacas are a species of domesticated South American camelids, bred primarily for their fiber. They
are typically kept as pets, for their fleece, or for their meat. Alpacas are very social animals and live
in herds, with a dominant male leader and a number of females and young. They are very intelligent
and can be trained to respond to basic commands. Alpacas require a lot of space and good quality
food in order to thrive.
System Code Generation
Write a Python program that prints the first 10 Fibonacci numbers.
Alpaca-7b
numbers = [0, 1]
for i in range(2, 10):
    numbers.append(numbers[i-2] + numbers[i-1])
print(numbers)
PARROT-HINT-7b
def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
print(fibonacci(10))
about the contrastive instruction in future work.
General Abilities.
In our experiments, we in-
clude the Alpaca multi-task dataset for finetuning
PARROT-HINT-7b to preserve its ability on general
tasks. We present two examples in Table 7, from
question answering and code generation, respec-
tively. As shown, the responses of both Alpaca-7b
and PARROT-HINT-7b are correct, although the
expressions may be different. For a comprehensive comparison, it would be better to perform a quantitative evaluation, which we leave for future work.
5 Conclusion
Superior large language models (LLMs) like Chat-
GPT and GPT-4 are only accessible through re-
stricted APIs, which creates barriers to new re-
search and advancements in the field. In this paper,
we propose the PARROT framework to enhance and regulate the translation abilities during chat based on open-source LLMs (i.e., LLaMA-7b) and human-written translation and evaluation data.
Specifically, PARROT reformulates translation data
into the instruction-following style, and introduces
a “Hint” field for incorporating extra requirements
to regulate the translation process. Accordingly, we
propose three instruction types for finetuning PAR-
ROT models, including translation instruction, con-
trastive instruction, and error-guided instruction.
Experiments on Flores subsets and WMT22 test
sets suggest that translation instruction improves
the translation performance of vanilla LLMs signif-
icantly while error-guided instruction can lead to a
further improvement, which demonstrates the im-
portance of learning from low-quality translations annotated by humans. Meanwhile, the PARROT models can also preserve some abilities on general tasks with the Alpaca multi-task dataset involved in finetuning.
In the future, various aspects can be explored to improve PARROT, for example: (1) taking advantage of other hints like entity alignments; (2) extending the coverage of translation directions; and (3) exploring parameter-efficient finetuning techniques for larger language models.
References
Akhbardeh Farhad, Arkhangorodsky Arkady, Biesialska Magdalena, Bojar Ondřej, Chatterjee Rajen, Chaudhary Vishrav, Marta R Costa-jussa, España-Bonet Cristina, Fan Angela, Federmann Christian, et al. 2021. Findings of the 2021 conference on machine translation (WMT21). In WMT.
Markus Freitag and Yaser Al-Onaizan. 2017. Beam
search strategies for neural machine translation.
ACL.
Markus Freitag, George Foster, David Grangier, Viresh
Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021.
Experts, errors, and context: A large-scale study of
human evaluation for machine translation. TACL.
Wenxiang Jiao, Zhaopeng Tu, Jiarui Li, Wenxuan
Wang, Jen-tse Huang, and Shuming Shi. 2022. Ten-
cent’s multilingual machine translation system for
WMT22 large-scale african languages. In WMT.
Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. ArXiv.
Melvin Johnson, Mike Schuster, Quoc V Le, Maxim
Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat,
Fernanda Viégas, Martin Wattenberg, Greg Corrado,
et al. 2017. Google’s multilingual neural machine
translation system: Enabling zero-shot translation.
TACL.
Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, et al. 2022. Findings of the 2022 conference on machine translation (WMT22). In WMT.
Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. 2023.
Chain of hindsight aligns language models with
feedback. arXiv.
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe,
Mike Lewis, Hannaneh Hajishirzi, and Luke Zettle-
moyer. 2022. Rethinking the role of demonstrations:
What makes in-context learning work? In EMNLP.
Reham Omar, Omij Mangukiya, Panos Kalnis, and Es-
sam Mansour. 2023. ChatGPT versus traditional
question answering for knowledge graphs: Cur-
rent status and future directions towards knowledge
graph chatbots. arXiv.
OpenAI. 2023. GPT-4 technical report. arXiv.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-
roll L Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al.
2022. Training language models to follow instruc-
tions with human feedback. arXiv.
Santanu Pal, Sudip Kumar Naskar, Mihaela Vela, and
Josef van Genabith. 2016. A neural network based
approach to automatic post-editing. In ACL.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic eval-
uation of machine translation. In ACL.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In WMT.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon
Lavie. 2020. COMET: A neural framework for mt
evaluation. In EMNLP.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014.
Sequence to sequence learning with neural networks.
NeurIPS.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro,
Faisal Azhar, et al. 2023. Llama: Open and efficient
foundation language models. arXiv.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. NeurIPS.
Shuo Wang, Peng Li, Zhixing Tan, Zhaopeng Tu,
Maosong Sun, and Yang Liu. 2022. A template-
based method for constrained neural machine trans-
lation. In EMNLP.
Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang
Jiao, and Michael Lyu. 2023. ChatGPT or Gram-
marly? Evaluating ChatGPT on grammatical error
correction benchmark. arXiv.
Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen,
and Wei Cheng. 2023. Exploring the limits of chat-
gpt for query or aspect-based text summarization.
arXiv.
Biao Zhang, Barry Haddow, and Alexandra Birch.
2023. Prompting large language model for machine
translation: A case study. arXiv.