ArticlePDF Available

BLEU: a Method for Automatic Evaluation of Machine Translation


Abstract and Figures

Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused.
Content may be subject to copyright.
BLEU: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
Human evaluations of machine translation
are extensive but expensive. Human eval-
uations can take months to finish and in-
volve human labor that can not be reused.
We propose a method of automatic ma-
chine translation evaluation that is quick,
inexpensive, and language-independent,
that correlates highly with human evalu-
ation, and that has little marginal cost per
run. We present this method as an auto-
mated understudy to skilled human judges
which substitutes for them when there is
need for quick or frequent evaluations.1
1 Introduction
1.1 Rationale
Human evaluations of machine translation (MT)
weigh many aspects of translation, including ade-
quacy,fidelity , and fluency of the translation (Hovy,
1999; White and O’Connell, 1994). A compre-
hensive catalog of MT evaluation techniques and
their rich literature is given by Reeder (2001). For
the most part, these various human evaluation ap-
proaches are quite expensive (Hovy, 1999). More-
over, they can take weeks or months to finish. This is
a big problem because developers of machine trans-
lation systems need to monitor the effect of daily
changes to their systems in order to weed out bad
ideas from good ideas. We believe that MT progress
stems from evaluation and that there is a logjam of
fruitful research ideas waiting to be released from
1So we call our method the bilingual evaluation understudy,
the evaluation bottleneck. Developers would bene-
fit from an inexpensive automatic evaluation that is
quick, language-independent, and correlates highly
with human evaluation. We propose such an evalua-
tion method in this paper.
1.2 Viewpoint
How does one measure translation performance?
The closer a machine translation is to a professional
human translation, the better it is. This is the cen-
tral idea behind our proposal. To judge the quality
of a machine translation, one measures its closeness
to one or more reference human translations accord-
ing to a numerical metric. Thus, our MT evaluation
system requires two ingredients:
1. a numerical “translation closeness” metric
2. a corpus of good quality human reference trans-
We fashion our closeness metric after the highly suc-
cessful word error rate metric used by the speech
recognition community, appropriately modified for
multiple reference translations and allowing for le-
gitimate differences in word choice and word or-
der. The main idea is to use a weighted average of
variable length phrase matches against the reference
translations. This view gives rise to a family of met-
rics using various weighting schemes. We have se-
lected a promising baseline metric from this family.
In Section 2, we describe the baseline metric in
detail. In Section 3, we evaluate the performance of
BLEU. In Section 4, we describe a human evaluation
experiment. In Section 5, we compare our baseline
metric performance with human evaluations.
Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.
Proceedings of the 40th Annual Meeting of the Association for
2 The Baseline BLEU Metric
Typically, there are many “perfect” translations of a
given source sentence. These translations may vary
in word choice or in word order even when they use
the same words. And yet humans can clearly dis-
tinguish a good translation from a bad one. For ex-
ample, consider these two candidate translations of
a Chinese source sentence:
Example 1.
Candidate 1: It is a guide to action which
ensures that the military always obeys
the commands of the party.
Candidate 2: It is to insure the troops
forever hearing the activity guidebook
that party direct.
Although they appear to be on the same subject, they
differ markedly in quality. For comparison, we pro-
vide three reference human translations of the same
sentence below.
Reference 1: It is a guide to action that
ensures that the military will forever
heed Party commands.
Reference 2: It is the guiding principle
which guarantees the military forces
always being under the command of the
Reference 3: It is the practical guide for
the army always to heed the directions
of the party.
It is clear that the good translation, Candidate 1,
shares many words and phrases with these three ref-
erence translations, while Candidate 2 does not. We
will shortly quantify this notion of sharing in Sec-
tion 2.1. But first observe that Candidate 1 shares
"It is a guide to action" with Reference 1,
"which" with Reference 2, "ensures that the
military" with Reference 1, "always" with Ref-
erences 2 and 3, "commands" with Reference 1, and
finally "of the party" with Reference 2 (all ig-
noring capitalization). In contrast, Candidate 2 ex-
hibits far fewer matches, and their extent is less.
It is clear that a program can rank Candidate 1
higher than Candidate 2 simply by comparing n-
gram matches between each candidate translation
and the reference translations. Experiments over
large collections of translations presented in Section
5 show that this ranking ability is a general phe-
nomenon, and not an artifact of a few toy examples.
The primary programming task for a BLEU imple-
mentor is to compare n-grams of the candidate with
the n-grams of the reference translation and count
the number of matches. These matches are position-
independent. The more the matches, the better the
candidate translation is. For simplicity, we first fo-
cus on computing unigram matches.
2.1 Modified n-gram precision
The cornerstone of our metric is the familiar pre-
cision measure. To compute precision, one simply
counts up the number of candidate translation words
(unigrams) which occur in any reference translation
and then divides by the total number of words in
the candidate translation. Unfortunately, MT sys-
tems can overgenerate “reasonable” words, result-
ing in improbable, but high-precision, translations
like that of example 2 below. Intuitively the prob-
lem is clear: a reference word should be considered
exhausted after a matching candidate word is iden-
tified. We formalize this intuition as the modified
unigram precision. To compute this, one first counts
the maximum number of times a word occurs in any
single reference translation. Next, one clips the to-
tal count of each candidate word by its maximum
reference count,2adds these clipped counts up, and
divides by the total (unclipped) number of candidate
Example 2.
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Modified Unigram Precision =2/7.3
In Example 1, Candidate 1 achieves a modified
unigram precision of 17/18; whereas Candidate
2 achieves a modified unigram precision of 8/14.
Similarly, the modified unigram precision in Exam-
ple 2 is 2/7, even though its standard unigram pre-
cision is 7/7.
2Countclip =min(Count,Max Re f Count). In other words,
one truncates each word’s count, if necessary, to not exceed the
largest count observed in any single reference for that word.
3As a guide to the eye, we have underlined the important
words for computing modified precision.
Modified n-gram precision is computed similarly
for any n: all candidate n-gram counts and their
corresponding maximum reference counts are col-
lected. The candidate counts are clipped by their
corresponding reference maximum value, summed,
and divided by the total number of candidate n-
grams. In Example 1, Candidate 1 achieves a mod-
ified bigram precision of 10/17, whereas the lower
quality Candidate 2 achieves a modified bigram pre-
cision of 1/13. In Example 2, the (implausible) can-
didate achieves a modified bigram precision of 0.
This sort of modified n-gram precision scoring cap-
tures two aspects of translation: adequacy and flu-
ency. A translation using the same words (1-grams)
as in the references tends to satisfy adequacy. The
longer n-gram matches account for fluency. 4
2.1.1 Modified n-gram precision on blocks of
How do we compute modified n-gram precision
on a multi-sentence test set? Although one typically
evaluates MT systems on a corpus of entire docu-
ments, our basic unit of evaluation is the sentence.
A source sentence may translate to many target sen-
tences, in which case we abuse terminology and re-
fer to the corresponding target sentences as a “sen-
tence.” We first compute the n-gram matches sen-
tence by sentence. Next, we add the clipped n-gram
counts for all the candidate sentences and divide by
the number of candidate n-grams in the test corpus
to compute a modified precision score, pn, for the
entire test corpus.
4BLEU only needs to match human judgment when averaged
over a test corpus; scores on individual sentences will often vary
from human judgments. For example, a system which produces
the fluent phrase “East Asian economy” is penalized heavily on
the longer n-gram precisions if all the references happen to read
“economy of East Asia.” The key to BLEU’s success is that
all systems are treated similarly and multiple human translators
with different styles are used, so this effect cancels out in com-
parisons between systems.
2.1.2 Ranking systems using only modified
n-gram precision
To verify that modified n-gram precision distin-
guishes between very good translations and bad
translations, we computed the modified precision
numbers on the output of a (good) human transla-
tor and a standard (poor) machine translation system
using 4 reference translations for each of 127 source
sentences. The average precision results are shown
in Figure 1.
Figure 1: Distinguishing Human from Machine
The strong signal differentiating human (high pre-
cision) from machine (low precision) is striking.
The difference becomes stronger as we go from un-
igram precision to 4-gram precision. It appears that
any single n-gram precision score can distinguish
between a good translation and a bad translation.
To be useful, however, the metric must also reliably
distinguish between translations that do not differ so
greatly in quality. Furthermore, it must distinguish
between two human translations of differing quality.
This latter requirement ensures the continued valid-
ity of the metric as MT approaches human transla-
tion quality.
To this end, we obtained a human translation
by someone lacking native proficiency in both the
source (Chinese) and the target language (English).
For comparison, we acquired human translations of
the same documents by a native English speaker. We
also obtained machine translations by three commer-
cial systems. These five “systems” two humans
and three machines are scored against two refer-
ence professional human translations. The average
modified n-gram precision results are shown in Fig-
ure 2.
Each of these n-gram statistics implies the same
Figure 2: Machine and Human Translations
 
ranking: H2 (Human-2) is better than H1 (Human-
1), and there is a big drop in quality between H1 and
S3 (Machine/System-3). S3 appears better than S2
which in turn appears better than S1. Remarkably,
this is the same rank order assigned to these “sys-
tems” by human judges, as we discuss later. While
there seems to be ample signal in any single n-gram
precision, it is more robust to combine all these sig-
nals into a single number metric.
2.1.3 Combining the modified n-gram
How should we combine the modified precisions
for the various n-gram sizes? A weighted linear av-
erage of the modified precisions resulted in encour-
aging results for the 5 systems. However, as can be
seen in Figure 2, the modified n-gram precision de-
cays roughly exponentially with n: the modified un-
igram precision is much larger than the modified bi-
gram precision which in turn is much bigger than the
modified trigram precision. A reasonable averag-
ing scheme must take this exponential decay into ac-
count; a weighted average of the logarithm of modi-
fied precisions satisifies this requirement.
BLEU uses the average logarithm with uniform
weights, which is equivalent to using the geometric
mean of the modified n-gram precisions.5,6Experi-
mentally, we obtain the best correlation with mono-
5The geometric average is harsh if any of the modified pre-
cisions vanish, but this should be an extremely rare event in test
corpora of reasonable size (for Nmax 4).
6Using the geometric average also yields slightly stronger
correlation with human judgments than our best results using
an arithmetic average.
lingual human judgments using a maximum n-gram
order of 4, although 3-grams and 5-grams give com-
parable results.
2.2 Sentence length
A candidate translation should be neither too long
nor too short, and an evaluation metric should en-
force this. To some extent, the n-gram precision al-
ready accomplishes this. N-gram precision penal-
izes spurious words in the candidate that do not ap-
pear in any of the reference translations. Addition-
ally, modified precision is penalized if a word oc-
curs more frequently in a candidate translation than
its maximum reference count. This rewards using
a word as many times as warranted and penalizes
using a word more times than it occurs in any of
the references. However, modified n-gram precision
alone fails to enforce the proper translation length,
as is illustrated in the short, absurd example below.
Example 3:
Candidate: of the
Reference 1: It is a guide to action that
ensures that the military will forever
heed Party commands.
Reference 2: It is the guiding principle
which guarantees the military forces
always being under the command of the
Reference 3: It is the practical guide for
the army always to heed the directions
of the party.
Because this candidate is so short compared to
the proper length, one expects to find inflated pre-
cisions: the modified unigram precision is 2/2, and
the modified bigram precision is 1/1.
2.2.1 The trouble with recall
Traditionally, precision has been paired with
recall to overcome such length-related problems.
However, BLEU considers multiple reference trans-
lations, each of which may use a different word
choice to translate the same source word. Further-
more, a good candidate translation will only use (re-
call) one of these possible choices, but not all. In-
deed, recalling all choices leads to a bad translation.
Here is an example.
Example 4:
Candidate 1: I always invariably perpetu-
ally do.
Candidate 2: I always do.
Reference 1: I always do.
Reference 2: I invariably do.
Reference 3: I perpetually do.
The first candidate recalls more words from the
references, but is obviously a poorer translation than
the second candidate. Thus, na¨ıve recall computed
over the set of all reference words is not a good
measure. Admittedly, one could align the refer-
ence translations to discover synonymous words and
compute recall on concepts rather than words. But,
given that reference translations vary in length and
differ in word order and syntax, such a computation
is complicated.
2.2.2 Sentence brevity penalty
Candidate translations longer than their refer-
ences are already penalized by the modified n-gram
precision measure: there is no need to penalize them
again. Consequently, we introduce a multiplicative
brevity penalty factor. With this brevity penalty in
place, a high-scoring candidate translation must now
match the reference translations in length, in word
choice, and in word order. Note that neither this
brevity penalty nor the modified n-gram precision
length effect directly considers the source length; in-
stead, they consider the range of reference transla-
tion lengths in the target language.
We wish to make the brevity penalty 1.0 when the
candidate’s length is the same as any reference trans-
lation’s length. For example, if there are three ref-
erences with lengths 12, 15, and 17 words and the
candidate translation is a terse 12 words, we want
the brevity penalty to be 1. We call the closest refer-
ence sentence length the “best match length.
One consideration remains: if we computed the
brevity penalty sentence by sentence and averaged
the penalties, then length deviations on short sen-
tences would be punished harshly. Instead, we com-
pute the brevity penalty over the entire corpus to al-
low some freedom at the sentence level. We first
compute the test corpus’ effective reference length,
r, by summing the best match lengths for each can-
didate sentence in the corpus. Wechoose the brevity
penalty to be a decaying exponential in r/c, where c
is the total length of the candidate translation corpus.
2.3 BLEU details
We take the geometric mean of the test corpus’
modified precision scores and then multiply the re-
sult by an exponential brevity penalty factor. Cur-
rently, case folding is the only text normalization
performed before computing the precision.
We first compute the geometric average of the
modified n-gram precisions, pn, using n-grams up to
length Nand positive weights wnsumming to one.
Next, let cbe the length of the candidate transla-
tion and rbe the effective reference corpus length.
We compute the brevity penalty BP,
BP =1 if c>r
e(1r/c)if cr.
BLEU= BP·exp N
n=1wnlog pn!.
The ranking behavior is more immediately apparent
in the log domain,
log BLEU =min(1r
c,0) +
n=1wnlog pn.
In our baseline, we use N=4 and uniform weights
3 The BLEU Evaluation
The BLEU metric ranges from 0 to 1. Few transla-
tions will attain a score of 1 unless they are identi-
cal to a reference translation. For this reason, even
a human translator will not necessarily score 1. It
is important to note that the more reference trans-
lations per sentence there are, the higher the score
is. Thus, one must be cautious making even “rough”
comparisons on evaluations with different numbers
of reference translations: on a test corpus of about
500 sentences (40 general news stories), a human
translator scored 0.3468 against four references and
scored 0.2571 against two references. Table 1 shows
the BLEU scores of the 5 systems against two refer-
ences on this test corpus.
The MT systems S2 and S3 are very close in this
metric. Hence, several questions arise:
Table 1: BLEU on 500 sentences
S1 S2 S3 H1 H2
0.0527 0.0829 0.0930 0.1934 0.2571
Table 2: Paired t-statistics on 20 blocks
S1 S2 S3 H1 H2
Mean 0.051 0.081 0.090 0.192 0.256
StdDev 0.017 0.025 0.020 0.030 0.039
t 6 3.4 24 11
Is the difference in BLEU metric reliable?
What is the variance of the BLEU score?
If we were to pick another random set of 500
sentences, would we still judge S3 to be better
than S2?
To answer these questions, we divided the test cor-
pus into 20 blocks of 25 sentences each, and com-
puted the BLEU metric on these blocks individually.
We thus have 20 samples of the BLEU metric for
each system. We computed the means, variances,
and paired t-statistics which are displayed in Table
2. The t-statistic compares each system with its left
neighbor in the table. For example, t=6 for the pair
S1 and S2.
Note that the numbers in Table 1 are the BLEU
metric on an aggregate of 500 sentences, but the
means in Table 2 are averages of the BLEU metric
on aggregates of 25 sentences. As expected, these
two sets of results are close for each system and dif-
fer only by small finite block size effects. Since a
paired t-statistic of 1.7 or above is 95% significant,
the differences between the systems’ scores are sta-
tistically very significant. The reported variance on
25-sentence blocks serves as an upper bound to the
variance of sizeable test sets like the 500 sentence
How many reference translations do we need?
We simulated a single-reference test corpus by ran-
domly selecting one of the 4 reference translations
as the single reference for each of the 40 stories. In
this way, we ensured a degree of stylistic variation.
The systems maintain the same rank order as with
multiple references. This outcome suggests that we
may use a big test corpus with a single reference
translation, provided that the translations are not all
from the same translator.
4 The Human Evaluation
We had two groups of human judges. The first
group, called the monolingual group, consisted of 10
native speakers of English. The second group, called
the bilingual group, consisted of 10 native speakers
of Chinese who had lived in the United States for
the past several years. None of the human judges
was a professional translator. The humans judged
our 5 standard systems on a Chinese sentence sub-
set extracted at random from our 500 sentence test
corpus. We paired each source sentence with each
of its 5 translations, for a total of 250 pairs of Chi-
nese source and English translations. We prepared a
web page with these translation pairs randomly or-
dered to disperse the five translations of each source
sentence. All judges used this same webpage and
saw the sentence pairs in the same order. They rated
each translation from 1 (very bad) to 5 (very good).
The monolingual group made their judgments based
only on the translations’ readability and fluency.
As must be expected, some judges were more lib-
eral than others. And some sentences were easier
to translate than others. To account for the intrin-
sic difference between judges and the sentences, we
compared each judge’s rating for a sentence across
systems. We performed four pairwise t-test compar-
isons between adjacent systems as ordered by their
aggregate average score.
4.1 Monolingual group pairwise judgments
Figure 3 shows the mean difference between the
scores of two consecutive systems and the 95% con-
fidence interval about the mean. We see that S2 is
quite a bit better than S1 (by a mean opinion score
difference of 0.326 on the 5-point scale), while S3
is judged a little better (by 0.114). Both differences
are significant at the 95% level.7The human H1 is
much better than the best system, though a bit worse
than human H2. This is not surprising given that H1
is not a native speaker of either Chinese or English,
7The 95% confidence interval comes from t-test, assuming
that the data comes from a T-distribution with N degrees of free-
dom. N varied from 350 to 470 as some judges have skipped
some sentences in their evaluation. Thus, the distribution is
close to Gaussian.
whereas H2 is a native English speaker. Again, the
difference between the human translators is signifi-
cant beyond the 95% level.
Figure 3: Monolingual Judgments - pairwise differ-
ential comparison
  
 
 
   
4.2 Bilingual group pairwise judgments
Figure 4 shows the same results for the bilingual
group. They also find that S3 is slightly better than
S2 (at 95% confidence) though they judge that the
human translations are much closer (indistinguish-
able at 95% confidence), suggesting that the bilin-
guals tended to focus more on adequacy than on flu-
Figure 4: Bilingual Judgments - pairwise differential
  
   
5 BLEU vs The Human Evaluation
Figure 5 shows a linear regression of the monolin-
gual group scores as a function of the BLEU score
over two reference translations for the 5 systems.
The high correlation coefficient of 0.99 indicates
that BLEU tracks human judgment well. Particularly
interesting is how well BLEU distinguishes between
S2 and S3 which are quite close. Figure 6 shows
the comparable regression results for the bilingual
group. The correlation coefficient is 0.96.
Figure 5: BLEU predicts Monolingual Judgments
  
   
 
Figure 6: BLEU predicts Bilingual Judgments
  
 
 
We now take the worst system as a reference point
and compare the BLEU scores with the human judg-
ment scores of the remaining systems relative to
the worst system. We took the BLEU, monolingual
group, and bilingual group scores for the 5 systems
and linearly normalized them by their correspond-
ing range (the maximum and minimum score across
the 5 systems). The normalized scores are shown in
Figure 7. This figure illustrates the high correlation
between the BLEU score and the monolingual group.
Of particular interest is the accuracy of BLEU’s esti-
mate of the small difference between S2 and S3 and
the larger difference between S3 and H1. The figure
also highlights the relatively large gap between MT
systems and human translators.8In addition, we sur-
mise that the bilingual group was very forgiving in
judging H1 relative to H2 because the monolingual
group found a rather large difference in the fluency
of their translations.
Figure 7: BLEU vs Bilingual and Monolingual Judg-
    
 
  
6 Conclusion
We believe that BLEU will accelerate the MT R&D
cycle by allowing researchers to rapidly home in on
effective modeling ideas. Our belief is reinforced
by a recent statistical analysis of BLEUs correla-
tion with human judgment for translation into En-
glish from four quite different languages (Arabic,
Chinese, French, Spanish) representing 3 different
language families (Papineni et al., 2002)! BLEU’s
strength is that it correlates highly with human judg-
8Crossing this chasm for Chinese-English translation ap-
pears to be a significant challenge for the current state-of-the-art
ments by averaging out individual sentence judg-
ment errors over a test corpus rather than attempting
to divine the exact human judgment for every sen-
tence: quantity leads to quality.
Finally, since MT and summarization can both be
viewed as natural language generation from a tex-
tual context, we believe BLEU could be adapted to
evaluating summarization or similar NLG tasks.
Acknowledgments This work was partially sup-
ported by the Defense Advanced Research Projects
Agency and monitored by SPAWAR under contract
No. N66001-99-2-8916. The views and findings
contained in this material are those of the authors
and do not necessarily reflect the position of pol-
icy of the Government and no official endorsement
should be inferred.
We gratefully acknowledge comments about the
geometric mean by John Makhoul of BBN and dis-
cussions with George Doddington of NIST. We es-
pecially wish to thank our colleagues who served
in the monolingual and bilingual judge pools for
their perseverance in judging the output of Chinese-
English MT systems.
E.H. Hovy. 1999. Toward finely differentiated evaluation
metrics for machine translation. In Proceedings of the
Eagles Workshop on Standards and Evaluation, Pisa,
Kishore Papineni, Salim Roukos, Todd Ward, John Hen-
derson, and Florence Reeder. 2002. Corpus-based
comprehensive and diagnostic MT evaluation: Initial
Arabic, Chinese, French, and Spanish results. In Pro-
ceedings of Human Language Technology 2002, San
Diego, CA. To appear.
Florence Reeder. 2001. Additional mt-eval references.
Technical report, International Standards for Language
Engineering, Evaluation Working Group. http://issco-
J.S. White and T. O’Connell. 1994. The ARPA MT eval-
uation methodologies: evolution, lessons, and future
approaches. In Proceedings of the First Conference of
the Association for Machine Translation in the Ameri-
cas, pages 193–205, Columbia, Maryland.
... Evaluation metrics measure the accuracy of the generated description with respect to the annotated reference description. Similar to existing works [3,9], we automatically evaluate the change caption performance with four different widely used evaluation metrics, which include BLEU-N (B-N, N = 1, 2, 3, 4) [50], ROUGE-L (R) [51], METEOR (M) [52], and CIDEr-D (C) [53]. ...
Full-text available
Remote sensing image change captioning (RSICC) aims to automatically generate sentences describing the difference in content in remote sensing bitemporal images. Recent works extract the changes between bitemporal features and employ a hierarchical approach to fuse multiple changes of interest, yielding change captions. However, these methods directly aggregate all features, potentially incorporating non-change-focused information from each encoder layer into the change caption decoder, adversely affecting the performance of change captioning. To address this problem, we proposed an Interactive Change-Aware Transformer Network (ICT-Net). ICT-Net is able to extract and incorporate the most critical changes of interest in each encoder layer to improve change description generation. It initially extracts bitemporal visual features from the CNN backbone and employs an Interactive Change-Aware Encoder (ICE) to capture the crucial difference between these features. Specifically, the ICE captures the most change-aware discriminative information between the paired bitemporal features interactively through difference and content attention encoding. A Multi-Layer Adaptive Fusion (MAF) module is proposed to adaptively aggregate the relevant change-aware features in the ICE layers while minimizing the impact of irrelevant visual features. Moreover, we extend the ICE to extract multi-scale changes and introduce a novel Cross Gated-Attention (CGA) module into the change caption decoder to select essential discriminative multi-scale features to improve the change captioning performance. We evaluate our method on two RSICC datasets (e.g., LEVIR-CC and LEVIRCCD), and the experimental results demonstrate that our method achieves a state-of-the-art performance.
... The experiment chose BLEU [37], Rouge [38], Meteor [39], and Distinct [40] to fully evaluate the authenticity, fluency, and diversity of the generated text. Each metric with the average value is used as the final result. ...
Full-text available
The constantly updating big data in the ocean engineering domain has challenged the traditional manner of manually extracting knowledge, thereby underscoring the current absence of a knowledge graph framework in such a special field. This paper proposes a knowledge graph framework to fill the gap in the knowledge management application of the ocean engineering field. Subsequently, we propose an intelligent question-answering framework named OEQA based on an ocean engineering-oriented knowledge graph. Firstly, we define the ontology of ocean engineering and adopt a top-down approach to construct a knowledge graph. Secondly, we collect and analyze the data from databases, websites, and textual reports. Based on these collected data, we implement named entity recognition on the unstructured data and extract corresponding relations between entities. Thirdly, we propose an intent-recognizing-based user question classification method, and according to the classification result, construct and fill corresponding query templates by keyword matching. Finally, we use T5-Pegasus to generate natural answers based on the answer entities queried from the knowledge graph. Experimental results show that the accuracy in finding answers is 89.6%. OEQA achieves in the natural answer generation in the ocean engineering domain significant improvements in relevance (1.0912%), accuracy (4.2817%), and practicability (3.1071%) in comparison to ChatGPT.
Despite the growing capabilities of Large Language Models (LLMs) in various domains, their proficiency in addressing domain-specific high-school physics questions remains an unexplored area. In this study, we present a pioneering data set curated from NCERT exemplar solutions strategically designed to facilitate the use of LLMs to solve school physics questions. Originally comprising 766 questions accompanied by LaTeX representations, the dataset underwent a sophisticated augmentation process that expanded its scope to an impressive 7,983 questions. The augmentation employed innovative techniques which effectively broaden the dataset’s coverage. The dataset, prioritizing text-based questions, is formatted as JSON objects detailing instructions, inputs, and outputs. Post evaluation, we noted significant scores: METEOR at 0.282 and BERTScore F1 at 0.833, indicating a close alignment between generated and reference texts.
Full-text available
Deep neural networks have been integrated into the whole clinical decision procedure which can improve the efficiency of diagnosis and alleviate the heavy workload of physicians. Since most neural networks are supervised, their performance heavily depends on the volume and quality of available labels. However, few such labels exist for rare diseases (e.g., new pandemics). Here we report a medical multimodal large language model (Med-MLLM) for radiograph representation learning, which can learn broad medical knowledge (e.g., image understanding, text semantics, and clinical phenotypes) from unlabelled data. As a result, when encountering a rare disease, our Med-MLLM can be rapidly deployed and easily adapted to them with limited labels. Furthermore, our model supports medical data across visual modality (e.g., chest X-ray and CT) and textual modality (e.g., medical report and free-text clinical note); therefore, it can be used for clinical tasks that involve both visual and textual data. We demonstrate the effectiveness of our Med-MLLM by showing how it would perform using the COVID-19 pandemic “in replay”. In the retrospective setting, we test the model on the early COVID-19 datasets; and in the prospective setting, we test the model on the new variant COVID-19-Omicron. The experiments are conducted on 1) three kinds of input data; 2) three kinds of downstream tasks, including disease reporting, diagnosis, and prognosis; 3) five COVID-19 datasets; and 4) three different languages, including English, Chinese, and Spanish. All experiments show that our model can make accurate and robust COVID-19 decision-support with little labelled data.
Full-text available
We describe two metrics for automatic evaluation of machine trans-lation quality. These metrics, BLEU and NEE, are compared to hu-man judgment of quality of translation of Arabic, Chinese, French, and Spanish documents into English.
Full-text available
The ARPA MT Evaluation methodology effort is intended to provide a basis for measuring and thereby facilitating the progress of MT systems of the ARPA-sponsored research program. The evaluation methodologies have the further goal of being useful for identifying the context of that progress among developed, production MT systems in use today. Since 1991, the evaluations have evolved as we have discovered more about what properties are valuable to measure, what properties are not, and what elements of the tests/evaluations can be adjusted to enhance significance of the results while still remaining relatively portable. This paper describes this evolutionary process, along with measurements of the most recent MT evaluation (January 1994) and the current evaluation process now underway.
The ARPA MT evaluation methodologies: evolution, lessons, and future approaches
  • J S White
  • T O'connell
J.S. White and T. O'Connell. 1994. The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, pages 193-205, Columbia, Maryland.