Comparing Machine Translation and Human Translation: A Case Study
Lars Ahrenberg
Department of Computer and Information Science
Linköping University
lars.ahrenberg@liu.se
Abstract
As machine translation technology improves, comparisons to human performance are often made in quite general and exaggerated terms. Thus, it is important to be able to account for differences accurately. This paper reports a simple,
descriptive scheme for comparing trans-
lations and applies it to two translations
of a British opinion article published in
March, 2017. One is a human translation
(HT) into Swedish, and the other a ma-
chine translation (MT). While the compar-
ison is limited to one text, the results are
indicative of current limitations in MT.
1 Introduction
In the CFP for this workshop it is claimed
that ’Human translation and Machine Translation
(MT) aim to solve the same problem’. This is
doubtful as translation is not one thing but many,
spanning a large space of genres, purposes, and
contexts.
The aim of MT research and development is
often phrased as ’overcoming language barriers’.
To a large extent this aim has been achieved with
many systems producing texts of gisting quality
for hundreds, perhaps even thousands of language
pairs, and (albeit fewer) systems that enable con-
versations between speakers that do not share a
common language. Human translation, however,
often has a more ambitious aim, to produce texts
that satisfy the linguistic norms of a target culture
and are adapted to the assumed knowledge of its
readers. To serve this end of the market, MT in
combination with human post-editing is increas-
ingly being used (O’Brien et al.,2014). The goals
for MT have then also been set higher, to what
is often called quality translation, and new ’in-
teractive’ and/or ’adaptive’ interfaces have been
proposed for post-editing (Green,2015;Vashee,
2017). Thus, when production quality is aimed
for, such as in sub-titling or publication of a news
item or feature article, human involvement is still
a necessity.
Some recent papers claim that MT is now almost ’human-like’ or that it ’gets closer to that of average human translators’ (Wu et al., 2016).
While such claims may be made in the excitement
over substantial observed improvements in a MT
experiment, they raise the question (again!) of
how HT may differ from MT.
Some scholars have argued that MT will never
reach the quality of a professional human trans-
lator. The limitations are not just temporary, but
inherent in the task. These arguments are perhaps
most strongly expressed in (Melby with T. Warner,
1995). More recently, but before the breakthrough
of NMT, Giammarresi and Lapalme (2016) still
consider them valid. As MT can produce human-
like translations in restricted domains and is in-
creasingly being included in CAT-tools, they in-
sist that MT is posing a challenge for Translation
Studies.
In this paper I report a small case study, a close
comparison of a human translation and a machine
translation from a state-of-the-art system of the
same source text. This is done with two purposes
in mind. The first concerns how the differences
should be described, what concepts and tools are
useful to make such a comparison meaningful and
enlightening. The second is to assess the differ-
ences between state-of-the-art MT and HT, with
the caveat, of course, that one pair of translations
cannot claim to be representative of the very large
translation universe.
2 MT and Translation Studies
The two fields of MT and Translation Studies (TS)
have developed separately for almost as long as
they have existed. In the early days of both disci-
plines, some researchers attempted to account for
translation in more or less formal linguistic terms,
potentially forming a foundation for automatiza-
tion, e.g. (Catford,1965). The ’cultural turn’ in
TS moved the field away from linguistic detail and
further apart from MT. The 1990s saw a common interest in empirical data, but while corpora,
and parallel corpora in particular, were collected
and studied in both fields, they were largely used
for different purposes. For example, it seems that
the empirical results generated by TS studies on
translation universals (Baker,1993) did not have
much effect on MT.
A related problem is that MT and TS lack common concepts and terminology.
MT prefers to speak in terms of models, whereas
TS is more comfortable with concepts such as
function and culture. There is a mutual inter-
est in translation quality assessment (TQA), how-
ever, and large-scale projects on MT tend to have
some participation of TS scholars. For example,
one result of the German Verbmobil project is
the volume Machine Translation and Translation
Theory (Hauenschild and Heizmann, 1997), which contains several studies on human translation and
how it can inform MT. It is also true of more re-
cent projects such as QTLaunchPad where eval-
uation of translation quality was in focus, and
CASMACAT where the design of a CAT tool was
informed by translation process research (Koehn
et al.,2015).
Error analysis is an area of common interest. O’Brien (2012) showed that error typologies and weightings were used in all eleven translation companies taking part in her study. It was also
shown that some categories occurred in all or the
large majority of the taxonomies. She concludes, though, that error analysis is insufficient and sometimes outright inappropriate, because it does not take a holistic view of the text and its utility and pays too little attention to aspects such as text type, function, or user requirements. A number of alternative evaluation models are proposed, including usability evaluation, ratings of adequacy and fluency, and readability evaluation.
In the MT context, the merit of error analysis is that it can tell developers where the major problems are, and users what to expect. A taxonomy which has been popular in MT is that of Vilar et al. (2006). To avoid the necessity of calling in human evaluators every time an error analysis is to be performed, there has also been work on automatic error classification (Popović and Burchardt,
2011). While simply counting errors seems less
relevant for comparing machine translation to hu-
man translation, showing what type of errors oc-
cur can be useful. We must recognize then that the
categories could vary with purpose.
Another line of research studies the effects
of tools and processes on translations. This
field is quite underresearched, though see for
instance (Jiménez-Crespo, 2009; Lapshinova-Koltunski, 2013; Besacier and Schwartz, 2015) for
some relevant studies.
2.1 Comparing Translations
The most common standard for comparing trans-
lations is probably quality, a notion that itself re-
quires definition. If we follow the insights of TS,
quality cannot be an absolute notion, but must be
related to purpose and context. For instance, Mateo (2014), referring to (Nord, 1997), defines it as
“appropriateness of a translated text to fulfill a
communicative purpose”. In the field of Trans-
lation Quality Assessment (TQA) the final out-
come of such a comparison will then be a judge-
ment of the kind ’Very good’, ’Satisfactory’, or
’Unacceptable’ where at least some of the criteria
for goodness refer to functional or pragmatic ade-
quacy (Mateo et al.,2017).
In MT evaluation, which is concerned with sys-
tem comparisons based on their produced trans-
lations, the judgements are more often rankings:
’Better-than’ or ’Indistinguishable-from’. One fo-
cus has then been on developing metrics whose
ratings correlate well with human ratings or rank-
ings. This line of research got a boost by Pap-
ineni et al. (2002) and has since been an ongoing
endeavour in the MT community, in particular in
conjunction with the WMT workshops from 2006
onwards. Most measures developed within MT
rely on reference translations and give a kind of
measure of similarity to the references.
While judgements such as Good or Unaccept-
able are of course very relevant in a use context,
a comparison of MT and HT may better focus on
characteristic properties and capabilities instead.
The questions that interest me here are questions
such as: What are the characteristics of a machine-
translated text as compared to a human transla-
tion? What can the human translator do that the
MT system cannot (and vice versa)? What actions
are needed to make it fit for a purpose?
Many works on translation, especially those written for prospective translators, include a
chapter on the options available to a translator,
variously called strategies, methods or procedures.
A translation procedure is a type of solution to a
translation problem. In spite of the term, trans-
lation procedures can be used descriptively, to
characterize and compare translations, and even
to characterize and compare translators, or trans-
lation norms. This is the way they will be used
here, for comparing human translation and ma-
chine translation descriptively. This level of de-
scription seems to me to be underused in MT,
though see (Fomicheva et al.,2015) for an excep-
tion.
Following Newmark (1988), we may distinguish general or global translation methods, such as semantic vs. communicative translation, that apply to a
text as a whole (macro-level) from procedures that
apply at the level of words (micro-level), such as
shifts or transpositions. In this paper the focus is
on the micro-level methods.
3 A Case Study
3.1 The Approach
The analysis covers intrinsic as well as extrinsic
or functional properties of the translations. The
intrinsic part covers basic statistical facts such as
length and type-token ratios, and MT metrics. Its
main focus, however, is on translation procedures
or the different forms of correspondence that can
be found between units. A special consideration is
given to differences in word order as these can be
established less subjectively than categorizations.
The functional part considers purpose and context,
but one translation can in principle be evaluated in
relation to two or more purposes, e.g., post-editing
or gisting.
Catford (1965) introduced the notion of shifts,
meaning a procedure that deviates somehow from
a plain or literal translation. A large catalogue
of translation procedures, or methods, was pro-
vided by Vinay and Darbelnet (1958), summarized
in seven major categories: borrowing, calque, lit-
eral translation, transposition, modulation, equiv-
alence, and adaptation. Newmark (1988) provides a larger set. The most detailed taxonomy for translation procedures is probably that of van Leuven-Zwart (1989), who establishes correspondence on
a semantic basis through what she calls archi-
transemes.
A problem with these taxonomies is how to apply them in practice. For this reason I will only give
counts for coarse top level categories and report
more fine-grained procedures only in rough esti-
mates. At the top level we have a binary distinc-
tion between Shifts and Unshifted, or literal trans-
lations. An unshifted translation is one where only
procedures which are obligatory or standard for
the target language have been used, and content
is judged to be the same. Semantic shifts are as far
as possible noted separately from structural shifts.
Shifts are identified at two levels: sentences and
clausal units. Relations between units are estab-
lished on the basis of position and content. At the
sentence level position in the linear flow of infor-
mation is usually sufficient to infer a relation. At
the clausal level correspondence must take syntac-
tic relations into account in addition to content. As
for content we require only that there is some sort
of describable semantic or pragmatic relation.
3.2 The Data
The source text is an opinion article published by
the Financial Times on March 17, 2017 entitled
Why I left my liberal London tribe and written by
David Goodhart. It is fairly typical of a British
opinion article. Paragraphs are short with only a
few sentences, the length of sentences are quite
varied, and long sentences tend to be built both
syntactically and with insertions, appositions and
parataxis. Table 1 (first column) gives an illustra-
tion.
The human translation appeared in the June is-
sue of the Swedish magazine Axess. It was trans-
lated manually with no other computer aids than
interactive dictionaries and the web for fact check-
ing. No specific brief had been issued with the as-
signment. The translation was published with only
minor edits but under a different title.
The machine translation was produced by
Google Translate in the middle of June, 2017. According to Google’s web information, translation from English into Swedish was then using NMT.¹
English source | Swedish MT | Swedish HT
I referred to changing my mind as though it were a rational process, in which one audits one's beliefs every few years and decides to shift ground on Israel/Palestine or the single market. | Jag hänvisade till att ändra mig som om det var en rationell process, där man reviderar sin tro på några år och bestämmer sig för att flytta marken mot Israel / Palestina eller den inre marknaden. | Jag refererade till ett byte av uppfattning som om det vore en rationell process, där man granskar sina åsikter med några års mellanrum och beslutar sig för att ändra ståndpunkt ifråga om Israel och palestinierna eller EU:s inre marknad
But that's not how it works. | Men det är inte hur det fungerar. | Men det är inte så det fungerar.
If, like most educated people, you place a high value on moral and intellectual coherence, your views tend to fit together into something like an explicable worldview. | Om du, som de mest utbildade personerna lägger högt värde på moralisk och intellektuell sammanhang, har dina åsikter en tendens att passa in i något som en förklarlig världsutsikt. | Om man, som de flesta välutbildade, sätter stort värde på moralisk och intellektuell samstämmighet brukar ens åsikter passa in i något som liknar en förståelig världsåskådning.
And that usually goes along with informal membership of a network of like-minded people. | Och det går oftast med informellt medlemskap i ett nätverk av likasinnade människor. | Och med den följer vanligtvis ett informellt medlemskap i ett nätverk av likasinnade.
Without having to think very hard you know you all broadly favour and oppose the same things. | Utan att behöva tänka väldigt svårt, känner du dig allihopa och motsätter sig samma saker. | Utan att egentligen behöva fundera på saken vet man att alla på det hela taget är för och emot samma saker.

Table 1: A source paragraph and its two translations.

¹ https://cloud.google.com/translate/docs/languages

Even a non-Swedish speaker can observe that the paragraph shown in Table 1 has the same number of sentences as the source in both translations (there are 5), that the sentences correspond one-to-one and are quite similar in length. The flow of information is also very similar; shorter units than sentences, such as clauses and phrases, can be aligned monotonically in both translations with few exceptions.
Source MT HT
Paragraphs 30 30 30
Sentences 86 86 95
Word tokens 2555 2415 2603
Characters 13780 13888 15248
Type-token ratio 2.84 2.56 2.58
Mean Sent.length 29.7 28.1 27.4
Avg length diff. 2.0 3.2
Table 2: Basic statistics for the three texts. The
last line states the average absolute value of length
differences at the sentence level.
4 Results
4.1 Basic Statistics
The visual impression of Table 1 indicates that
the human translation is longer than the machine
translation. This is confirmed when we look at the translations as wholes: the human translation is longer both in number of words and in number of characters. Relative to the source, the character ratio is 1.01 for the MT and 1.11 for the HT. Yet, when the
HT is shorter than the source for a given sentence,
the difference can be large. The HT also has more
sentences, as the human translator has decided to
split eight sentences (roughly 9% of all) into two
or three shorter ones. Basic statistics for all three
texts are shown in Table 2.
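The surface statistics in Table 2 are easy to reproduce. A minimal sketch follows; the tokenization and sentence-splitting rules here are assumptions, since the paper does not specify the tools it used.

```python
import re

def basic_stats(text):
    """Surface statistics of the kind reported in Table 2 (assumed
    tokenizer: alphanumeric sequences; assumed sentence splitter:
    sentence punctuation followed by whitespace)."""
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+\s", text) if s.strip()]
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "characters": len(text),
        # Table 2's "type-token ratio" values exceed 1, so the figure
        # there is evidently tokens per type; computed the same way here.
        "token_type_ratio": len(tokens) / len(set(tokens)),
        "mean_sentence_length": len(tokens) / len(sentences),
    }

sample = ("But that's not how it works. "
          "Why do we change our minds about things?")
stats = basic_stats(sample)
```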
MT metrics are especially valuable for com-
parisons over time. As we only have one ma-
chine translation in this study, we limit ourselves
to reporting BLEU (Papineni et al.,2002) and
TER (Snover et al., 2006). After tokenization and segmentation into clause units of both the MT and the HT translations, using the latter as reference, we obtained the results shown in Table 3.² Following the analysis of clauses into Shifted and Unshifted (see Section 4.4), we also computed these metrics for the two types of segments separately.
Section BLEU Bleu(1) Bleu(2) TER
Unshifted 42.79 69.0 48.7 0.374
Shifted 16.84 48.2 23.6 0.662
All 23.27 59.6 30.7 0.621
Table 3: BLEU and TER scores for different sec-
tions of the MT, using HT as reference.
² Values were computed with the multi-bleu.perl script provided with the Moses system, and tercom.7.25, respectively.
4.2 Monotonicity
By monotonicity we mean information on the or-
der of content in the translation as compared to
the order of corresponding content in the source
text. Both translations are one-to-one as far as
paragraphs are concerned. As noted, the HT is not
altogether one-to-one at the sentence level, but at
the level of clauses, the similarities are greater: the
order is the same with the exception that the HT
has added one clause.
To get a measure of monotonicity, all corresponding word sequences s (from the source text) and t (from the translation) of the form s = a:b and t = Tr(b):Tr(a), i.e., where the two parts are translated in reversed order, are identified. The number of instances per sentence is noted, as well as the number
of words that are part of such a source sequence.
The degree of monotonicity is expressed as a ratio
between the total number of affected source words
and all words in the text. The results are shown in
Table 4.
Word Order MT HT
changes Sents Words Sents Words
0 36 0 15 0
1 40 125 40 197
2 9 72 22 203
3 1 10 5 52
4 - 0 4 110
Total 86 207 86 562
Table 4: Number of sentence segments affected by
a certain number of word order changes.
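If the swapped sequences are represented as word-level alignment links, the measure can be approximated by counting crossing links. This is a sketch under that assumption; the paper itself works with manually identified sequence pairs.

```python
def word_order_changes(links, src_len):
    """Count crossing alignment links and the share of source words
    they affect. A link is a (source_index, target_index) pair; two
    links cross when their source and target orders disagree."""
    links = sorted(links)
    crossings, affected = 0, set()
    for i in range(len(links)):
        for j in range(i + 1, len(links)):
            (s1, t1), (s2, t2) = links[i], links[j]
            if s1 < s2 and t1 > t2:  # translated in reversed order
                crossings += 1
                affected.update([s1, s2])
    return crossings, len(affected) / src_len

# Example (1): "Why(0) do(1) we(2) change(3)" -> "Varför(0) förändrar(1)
# vi(2)"; 'we'->'vi' and 'change'->'förändrar' swap order in the Swedish.
crossings, ratio = word_order_changes([(0, 0), (2, 2), (3, 1)], src_len=4)
```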
A total of 61 changes of word order is observed
in the MT, related to 207 words of the source text,
or 1.5% of all words. Almost all of them are correct; the large majority relate to the V2 property of Swedish main clauses, as in (1), but
there are also successful changes producing cor-
rect word order in subordinate clauses, as in (2), or
a Swedish s-genitive from an English of-genitive
as in (3). While the system thus has a high preci-
sion in its word order changes, there are also cases
where it misses out.
(1) Why do we₁ change₂ our minds about things?
Varför förändrar₂ vi₁ vårt sinne om saker?
(2) and feel that for the first time in my life₁ I₂ ...
och känner att jag₂ för första gången i mitt liv₁ ...
(3) the core beliefs₁ of modern liberalism₂
den moderna liberalismens₂ kärnföreställningar₁
The human translation displays almost twice as
many word order changes, 116, and they affect
longer phrases and cover longer distances. Still
4.1% is not a very large share and confirms the im-
pression that the information order in the human
translation follows the source text closely. The
human translator does more than produce correct grammar, however; he also improves the style of
the text, for instance as regards the placement of
insertions, as in (4), and shifts of prominence, as
in (5).
(4) I have changed₁ my mind, more slowly₂, about..
MT: Jag har förändrat₁ mig, mer långsamt₂, om..
HT: Själv har jag, om än långsammare₂, ändrat₁ inställning..
(5) Instead I met the intolerance₁ of .. for the first time₂
MT: Istället mötte jag den intolerans av den moderna vänster₁ för första gången₂
HT: Istället fick jag för första gången₂ möta den moderna vänsterns intolerans₁
4.3 Purpose-related Analysis
It is obvious, and unsurprising, that the MT does
not achieve publication quality. To get a better
idea of where the problems are, a profiling was
made in terms of the number and types of edits
judged to be necessary to give the translation pub-
lication quality. Any such analysis is of course
subjective, and it has been done by the author, so
the exact numbers are not so important. However,
the total number and distribution of types are in-
dicative of the character of the translation. A sim-
ple taxonomy was used with six types:³
Major edit; substantial edit of garbled output requiring
close reading of the source text
Order edit; a word or phrase needs reordering
Word edit; a content word or phrase must be replaced
to achieve accuracy
Form edit; a form word must be replaced or a content
word changed morphologically
Char edit; change, addition or deletion of a single character, incl. punctuation marks
Missing; a source word that should have been trans-
lated has not been
The distribution of necessary edits over the different types is shown in Table 5. The most frequent
type is ’word edit’ which accounts for more than
half of all edits. In this group we find words that
’merely’ affect style as well as word choices that thwart the message substantially.
³ All categories except ’Major edit’ have counterparts in the taxonomy of Vilar et al. (2006).
Type of edit Frequency
Major edit 13
Order edit 24
Word edit 139
Form edit 66
Char edit 13
Missing 21
Total 276
Table 5: Frequencies of different types of required
editing operations for the MT (one analyst).
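As a sanity check on the conclusion's estimate of about three edits per sentence, the figures follow directly from Table 5 and the sentence count in Table 2:

```python
# Edit counts from Table 5; 86 is the number of MT sentences (Table 2).
edits = {
    "Major edit": 13, "Order edit": 24, "Word edit": 139,
    "Form edit": 66, "Char edit": 13, "Missing": 21,
}
total = sum(edits.values())                   # 276 edits in all
per_sentence = total / 86                     # about 3.2 edits per sentence
word_edit_share = edits["Word edit"] / total  # just over half, as noted
```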
A skilled post-editor could probably perform
this many edits in two hours or less. However,
there is no guarantee that the edits will give the
translation the same quality or reading experi-
ence as the human translation. Minimal edits us-
ing a segment-oriented interface will probably not
achieve that. The style and phrasing of the source
would shine through to an extent that could offend
some readers of the magazine, although most of
the contents may be comprehended without prob-
lems, cf. Besacier and Schwartz (2015) on literary
text. However, for gisting purposes the MT would
be quite adequate.
4.4 Translation Procedures: What the MT
System didn’t do
While the human translator did not deviate that
much in the order of content from the source text,
he used a number of other procedures that seem to
be beyond the reach of the MT system. Altogether
we find more than 50 procedures of this kind in the
HT. The most important of these are:
Sentence splitting. There were eight such
splits, including one case of splitting a source
sentence into three target sentences. This
procedure also comes with the insertion of
new material such as a suitable adverb or
restoring a subject.
Shifts of function and/or category. These
are numerous: non-finite clauses or NPs are translated by a finite clause in Swedish, a
complete clause is reduced by ellipsis, a rela-
tive clause may be rendered by a conjoined
clause, adjectival attributes are rendered as
relative clauses, an adverb is translated by an
adjective or vice versa, and so on.
Explicitation. Names whose referents cannot be assumed to be known by the readers are explained, e.g. ’Russell Group universities’ receives an explanation in parentheses.
Also, at the grammatical level function words
such as och (and), som (relative pronoun),
att (complementizer ’that’) and indefinite ar-
ticles are inserted more often in the HT than
in the MT.
Modulation = change of point of view.
For example, translating ’here’ and ’these
islands’ in the source by ’Storbritannien’
(Great Britain).
Paraphrasing. The semantics is not quite the same but the content is similar enough to preserve the message, e.g. ’move more confidently through the world’ is translated as the terser ’öka ens självsäkerhet’ (increase your confidence).
5 Conclusions and Future Work
Differences between machine translations and hu-
man translations can be revealed by fairly simple
statistical metrics in combination with an analy-
sis based on so-called shifts or translation proce-
dures. In our case, the MT is in many ways, such
as length, information flow, and structure more
similar to the source than the HT. More impor-
tantly, it exhibits a much more restricted reper-
toir of procedures, and its output is estimated to
require about three edits per sentence. Thus, for
publishing purposes it is unacceptable without hu-
man involvement. Post-editing of the MT output
could no doubt produce a readable text, but may
not reach the level of a human translation. In future work I hope to be able to include post-edited text in the comparison.
Another topic for future research is predicting
translation procedures on a par with current shared
tasks predicting post-editing effort and translation
adequacy.
Acknowledgments
I am indebted to the human translator of the ar-
ticle, Martin Peterson, for information on his as-
signment and work process, and to the reviewers
for pointing out an important flaw in the submit-
ted version.
References
Mona Baker. 1993. Corpus linguistics and translation
studies: Implications and applications. In Mona
Baker, Gill Francis, and Elena Tognini-Bonelli, edi-
tors, Text and Technology: In Honour of John Sin-
clair, John Benjamins, Amsterdam and Philadel-
phia.
Laurent Besacier and Lane Schwartz. 2015. Au-
tomated translation of a literary work: A pi-
lot study. In Proceedings of the Fourth Work-
shop on Computational Linguistics for Liter-
ature. Association for Computational Linguis-
tics, Denver, Colorado, USA, pages 114–122.
http://www.aclweb.org/anthology/W15-0713.
John C Catford. 1965. A Linguistic Theory of Transla-
tion. Oxford University Press, London, UK.
Marina Fomicheva, Núria Bel, and Iria da Cunha. 2015. Neutralizing the Effect of Translation Shifts on Automatic Machine Translation Evaluation. Springer International Publishing, pages 596–607. https://doi.org/10.1007/978-3-319-18111-0_45.
Salvatore Giammarresi and Guy Lapalme. 2016. Com-
puter science and translation: Natural languages and
machine translation. In Yves Gambier and Luc van
Doorslaer, editors, Border Crossings: Translation
Studies and other disciplines, John Benjamins, Am-
sterdam/Philadelphia, chapter 8, pages 205–224.
Spence Green. 2015. Beyond post-editing: Advances in interactive translation environments. ATA Chronicle. www.atanet.org/chronicle-on-line/...
Christa Hauenschild and Susanne Heizmann. 1997.
Machine Translation and Translation Theory. De
Gruyter.
Miguel A. Jiménez-Crespo. 2009. Conventions in localisation: a corpus study of original vs. translated
web texts. JoSTrans: The Journal of Specialised
Translation 12:79–102.
Philipp Koehn, Vicent Alabau, Michael Carl, Francisco Casacuberta, Mercedes García-Martínez, Jesús González-Rubio, Frank Keller, Daniel Ortiz-Martínez, Germán Sanchis-Trilles, and Ulrich Germann. 2015. CASMACAT, final public report. http://www.casmacat.eu/uploads/Deliverables/final-public-report.pdf.
Ekaterina Lapshinova-Koltunski. 2013. Vartra:
A comparable corpus for analysis of transla-
tion variation. In Proceedings of the Sixth
Workshop on Building and Using Compara-
ble Corpora. Association for Computational
Linguistics, Sofia, Bulgaria, pages 77–86.
http://www.aclweb.org/anthology/W13-2510.
Roberto Martínez Mateo. 2014. A deeper look into metrics for translation quality assessment (TQA): A case study. Miscelánea: A Journal of English and American Studies 49:73–94.
Roberto Martínez Mateo, Silvia Montero Martínez, and Arsenio Jesús Moya Guijarro. 2017. The modular assessment pack: a new approach to translation quality assessment at the Directorate General for Translation. Perspectives: Studies in Translation Theory and Practice 25:18–48. https://doi.org/10.1080/0907676X.2016.1167923.
Alan Melby with T. Warner. 1995. The Possibility of
Language. John Benjamins, London and New York.
https://doi.org/10.1075/btl.14.
Peter Newmark. 1988. A Textbook of Translation.
Prentice Hall, London and New York.
Christiane Nord. 1997. Translation as a Purposeful
Activity. St Jerome, Manchester, UK.
Sharon O’Brien. 2012. Towards a dynamic quality
evaluation model for translation. The Journal of
Specialized Translation 17:1.
Sharon O’Brien, Laura Winther Balling, Michael Carl,
Michel Simard, and Lucia Specia. 2014. Post-
Editing of Machine Translation: Processes and Ap-
plications. Cambridge Scholars Publishing, New-
castle upon Tyne.
Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. Bleu: a method
for automatic evaluation of machine transla-
tion. In Proceedings of 40th Annual Meeting
of the Association for Computational Linguis-
tics. Association for Computational Linguistics,
Philadelphia, Pennsylvania, USA, pages 311–318.
https://doi.org/10.3115/1073083.1073135.
Maja Popović and Aljoscha Burchardt. 2011. From
human to automatic error classification for machine
translation output. In Proceedings of the 15th Inter-
national Conference of the European Association for
Machine Translation. Leuven, Belgium, pages 265–
272.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin-
nea Micciulla, and John Makhoul. 2006. A study of
translation edit rate with targeted human annotation.
In Proceedings of Association for Machine Transla-
tion in the Americas.
Kitty M. van Leuven-Zwart. 1989. Translation and original: Similarities and dissimilarities, I. Target 1(2):151–181.
Kirti Vashee. 2017. A closer look at SDL's adaptive MT technology. http://kv-emptypages.blogspot.se/2017/01/a-closer-look-at-sdls-adaptive-mt.html.
David Vilar, Jia Xu, Luis Fernando D’Haro, and Her-
mann Ney. 2006. Error analysis of machine trans-
lation output. In LREC06. Genoa, Italy, pages 697–
702.
Jean-Paul Vinay and Jean Darbelnet. 1958. Stylistique Comparée du Français et de l'Anglais. Méthode de Traduction. Didier, Paris.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.
Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
Macherey, Jeff Klingner, Apurva Shah, Melvin
Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan
Gouws, Yoshikiyo Kato, Taku Kudo, Hideto
Kazawa, Keith Stevens, George Kurian, Nishant
Patil, Wei Wang, Cliff Young, Jason Smith, Jason
Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado,
Macduff Hughes, and Jeffrey Dean. 2016. Google’s
neural machine translation system: Bridging the gap
between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.
... When comparing translation varieties, MT is used as a translation variety independent of HT or other translation varieties in some studies (Toral, 2019). Different from the conventional practice of MT evaluation that treats HT as the gold standard, some studies adopt a descriptive approach to comparing MT and HT (Bizzoni et al., 2020;Ahrenberg, 2017;Vanmassenhove et al., 2019). Among these studies, Bizzoni et al. (2020) find that MT shows independent patterns of translationese and it resembles HT only partly. ...
... While the number of studies on comparing translation varieties is much smaller than on the identification of translationese, there are even fewer studies that explore the differences between MT and HT. Ahrenberg (2017) compares MT and HT by means of automatically extracted features and statistics obtained through manual examination. By comparing the shifts (i.e. ...
Conference Paper
Full-text available
By using a trigram model and fine-tuning a pretrained BERT model for sequence classification , we show that machine translation and human translation can be classified with an accuracy above chance level, which suggests that machine translation and human translation are different in a systematic way. The classification accuracy of machine translation is much higher than of human translation. We show that this may be explained by the difference in lexical diversity between machine translation and human translation. If machine translation has independent patterns from human translation, automatic metrics which measure the deviation of machine translation from human translation may conflate difference with quality. Our experiment with two different types of automatic metrics shows correlation with the result of the classification task. Therefore, we suggest the difference in lexical diversity between machine translation and human translation be given more attention in machine translation evaluation.
... Non-literal translations can also cause noisy sentence pairs in parallel corpora, which affect the training of MT systems (Carpuat et al., 2017; Pham et al., 2018; Vyas et al., 2018). On the other hand, non-literal but appropriate translations are difficult to produce (Carl and Schaeffer, 2017), and machines still struggle to match human translators in this respect (Ahrenberg, 2017; Toral and Way, 2018). To inspire the development of MT systems, efforts have been made to analyze language contrasts through alignment discrepancies (Lapshinova-Koltunski and Hardmeier, 2017), and to detect free and fluent translation examples from English-Chinese parallel corpora. ...
... Carl and Schaeffer (2017) investigated the effects of cross-lingual syntactic and semantic distance on translation production times and found that non-literality makes from-scratch translation and post-editing difficult. In a case study comparing human and machine translations, Ahrenberg (2017) suggested that the human translator used several procedures that seem to be beyond the reach of the MT system (such as sentence splitting, shifts of phrase function and/or category, explicitation, modulation and paraphrasing). The project of Fraisse et al. (2019) aims to preserve cultural heritage and language diversity by analyzing the translation adaptations in multilingual corpora of translated literary texts, which is particularly important for low-resource languages. ...
Conference Paper
Full-text available
Human-generated non-literal translations reflect the richness of human languages and are sometimes indispensable to ensure adequacy and fluency. Non-literal translations are difficult to produce even for human translators, especially for foreign language learners, and machine translation still falls short of human translation in this respect. In order to foster the study of appropriate and creative non-literal translations, automatically detecting them in parallel corpora is an important step, which can benefit downstream NLP tasks or help to construct materials to teach translation. This article demonstrates that generic sentence representations produced by a pre-trained cross-lingual language model can be fine-tuned to solve this task. We show that there exists a moderate positive correlation between the predicted probability of being human translation and the proportion of non-literal translations in a sentence. The fine-tuning experiments show an accuracy of 80.16% when predicting the presence of non-literal translations in a sentence and an accuracy of 85.20% when distinguishing literal and non-literal translations at phrase level. We further conduct a linguistic error analysis and propose directions for future work.
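The "moderate positive correlation" reported here is a plain Pearson coefficient between two per-sentence quantities. The sketch below computes it from scratch; the probability and proportion values are invented toy numbers, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example values: a model's probability that a sentence is HT,
# and the proportion of non-literal phrase pairs in that sentence.
p_ht       = [0.2, 0.4, 0.5, 0.7, 0.9]
nonliteral = [0.0, 0.1, 0.3, 0.3, 0.6]
print(f"r = {pearson(p_ht, nonliteral):.2f}")
```

A positive r on data like this would indicate that sentences the model judges more HT-like also contain more non-literal phrase pairs, which is the relationship the abstract describes.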
... To this day, the question of whether MT can reach human-like quality, or even replace HT someday, remains open, and has become even more pressing since the emergence of NMT (cf. Ahrenberg, 2017). So, it can be debated whether HT alone is still the gold standard for high quality. ...
Presentation
This study attempts to investigate the relationship between the attention patterns of different interpreters and their SIMTXT quality, with a special focus on the English-to-Chinese language pair.
... As the environment allows only morphological analysis and synthesis of texts and does not contain tools for complete [...] As machine translation technologies continually improve and can thus be compared to human performance, it is important to be able to account for both their advantages and disadvantages accurately [11]. This paper provides a short but substantial descriptive analysis of both the different machine translation types and machine translation systems. ...
Conference Paper
Full-text available
This write-up critically assesses machine translation in its current phase of development, thus providing a short comparative analysis of the two predominant translator models. The goal of this paper is to dissect the main advantages and disadvantages of both statistical and neural machine translation, which might offer a new perspective on the field in general.
Article
Full-text available
Every field has its own technical terms, referred to as jargon, that distinguish it from other fields. Jargon may not be understood by people who do not belong to a certain field, let alone by a machine translation (henceforth, MT) system like Google Translate (henceforth, GT). This research analyzes jargon in civil engineering texts translated by GT to see whether the translation of a given jargon item conveys the meaning stated in the original text and fulfills its contextual function. Data was collected from different civil engineering texts. The results showed that in most cases GT could not produce an accurate translation of jargon words, phrases, abbreviations and acronyms when they occur in a technical context. These findings underline the importance of instructing GT on the use of jargon vocabulary in its proper context for civil engineering texts in particular, as well as the necessity for further investigation by translators of other technical texts in general. Keywords: Civil Engineering Texts, Google Translate, Jargon, Machine Translation, Technical Texts.
Conference Paper
This paper critically evaluates machine translation in its current phase of development, providing a short comparative analysis of the two predominant machine translation systems. The aim of this paper is to identify the main advantages and disadvantages of statistical and neural machine translators, thereby offering a new perspective on this scientific field.
Conference Paper
Full-text available
The JASM project (Janela aberta sobre o mundo: línguas estrangeiras, criatividade multimodal e inovação pedagógica no ensino superior; 'An open window on the world: foreign languages, multimodal creativity and pedagogical innovation in higher education') consists of an experience of active pedagogy with students of the undergraduate course in Media Studies at the School of Education in Viseu (Portugal). The main objective of JASM is to promote the acquisition of multilingual and multicultural skills and to generate multilingual awareness. In addition to the cognitive dimension, students explore the aesthetic and emotional dimensions of language. Experiences of artistic creativity (media arts, multimedia art, among others) enable multimodal communication in English and French, starting off with information gathering pertaining to the cultural and linguistic diversity of Viseu. After conducting research on the countries of origin of the chosen nationalities as well as the underlying cultures, the students, working in groups, learned about the life stories of migrants on the basis of interviews. An object or a tradition mentioned in the stories told by migrants allowed them to build a fictional story around the said object or tradition. Photos were taken at all stages of this work, and a storyboard of each fictional story was developed. The Korsakow system made it possible to create dynamic documentaries. This learning experience is made public on the project site and through an e-book. The students' language level (written and oral comprehension and expression) was assessed at the start of the project, using tests. Due to the COVID-19 crisis, the intermediate evaluation is qualitative, as is the final evaluation (interview-based, carried out with students and teachers). The progress of the learning process, as well as the involvement of the teachers, could thus be documented.
Article
Full-text available
This paper presents a conceptual proposal for Translation Quality Assessment (TQA) and its practical tool as a remedy to the deficiencies detected in a quantitative quality assessment tool developed at the Directorate General for Translation (DGT) of the European Commission: the Quality Assessment Tool (QAT). The new theoretical model, the Functional-Componential Approach (FCA), takes on a functionalist and holistic quality definition to solve the theoretical shortcomings of the QAT. Thus, it incorporates the complementary top-down view of a qualitative module to build up a quality measurement tool, the aim of which is to increase inter- and intra-rater reliability. Its practical tool, the Modular Assessment Pack (MAP), is tested using a pretest-posttest methodology based on an ad hoc corpus of real assignments translated by professional freelance translators. The results of this experimental pilot study, carried out with the English-Spanish language pair at the Spanish Language Department of the DGT, are described and discussed. This analysis sheds some light on the benefits of adopting a mixed bottom-up top-down approach to quality assessment and reveals some weaknesses of the FCA, which suggest directions for further research. Although small-scale, the findings of this pilot study indicate that improvements can be achieved by remedying its limitations in broader experimental conditions and adjusting the tool for use in other language combinations.
Conference Paper
Full-text available
State-of-the-art automatic Machine Translation [MT] evaluation is based on the idea that the closer MT output is to Human Translation [HT], the higher its quality. Thus, automatic evaluation is typically approached by measuring some sort of similarity between machine and human translations. Most widely used evaluation systems calculate similarity at surface level, for example, by computing the number of shared word n-grams. The correlation between automatic and manual evaluation scores at sentence level is still not satisfactory. One of the main reasons is that metrics under-score acceptable candidate translations due to their inability to tackle lexical and syntactic variation between possible translation options. Acceptable differences between candidate and reference translations are frequently due to optional translation shifts. It is common practice in HT to paraphrase what could be viewed as a close version of the source text in order to adapt it to target language use. When a reference translation contains such changes, using it as the only point of comparison is less informative, as the differences are not indicative of MT errors. To alleviate this problem, we design a paraphrase generation system based on a set of rules that model prototypical optional shifts that may have been applied by human translators. Applying the rules to the available human reference, the system generates additional references in a principled and controlled way. We show how using linguistic rules for the generation of additional references neutralizes the negative effect of optional translation shifts on n-gram-based MT evaluation.
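The mechanism this abstract describes can be sketched in a few lines: score the MT hypothesis against the best of several references (the original human reference plus rule-generated paraphrases). Clipped unigram precision stands in here for a full BLEU-style metric, and all sentences and the active-to-passive "rule" output are invented examples, not the paper's actual rules.

```python
# Sketch of multi-reference n-gram evaluation: take the maximum score over
# the human reference and paraphrased variants of it.
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Clipped unigram precision of a hypothesis against one reference."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum(min(count, ref[word]) for word, count in hyp.items())
    return overlap / sum(hyp.values())

def best_score(hypothesis, references):
    """Score against the closest of several references."""
    return max(unigram_precision(hypothesis, r) for r in references)

hypothesis = "the decision was taken by the committee"
human_ref  = "the committee took the decision"
# A paraphrase rule (active -> passive shift) could generate this variant:
extra_ref  = "the decision was taken by the committee"

single = unigram_precision(hypothesis, human_ref)
multi  = best_score(hypothesis, [human_ref, extra_ref])
print(f"single-reference: {single:.2f}, multi-reference: {multi:.2f}")
```

The hypothesis is penalized against the single reference only because of an optional voice shift; adding the paraphrased reference removes that penalty, which is exactly the effect the paper aims for.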
Conference Paper
Full-text available
Future improvement of machine translation systems requires reliable automatic evaluation and error classification measures to avoid time- and money-consuming human classification. In this article, we propose a new method for automatic error classification and systematically compare its results to those obtained by humans. We show that the proposed automatic measures correlate well with human judgments across different error classes as well as across different translation outputs on four out of five commonly used error classes.
Book
The series serves to propagate investigations into language usage, especially with respect to computational support. This includes all forms of text-handling activity, not only interlingual translations, but also conversions carried out in response to different communicative tasks. Among the major topics are problems of text transfer and the interplay between human and machine activities.
Article
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
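The beam search refinements mentioned at the end of this abstract, length normalization and a coverage penalty, can be sketched as a scoring function. The form below follows the GNMT paper's published formulation as I recall it (length penalty ((5 + |Y|)^α)/((5 + 1)^α) and a coverage term summing log of clipped attention mass); the attention matrix and hyperparameter values are invented toy inputs.

```python
import math

# Sketch of GNMT-style beam scoring: length-normalized log-probability
# plus a coverage penalty that rewards attending to every source word.

def length_penalty(length, alpha=0.6):
    """Normalizer that grows with output length, so longer outputs
    are not unfairly penalized by their lower raw log-probability."""
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def coverage_penalty(attention, beta=0.2):
    """attention[i][j]: attention on source word i at target step j.
    Source words with total attention below 1.0 incur a penalty."""
    return beta * sum(math.log(min(sum(row), 1.0)) for row in attention)

def beam_score(log_prob, length, attention, alpha=0.6, beta=0.2):
    return log_prob / length_penalty(length, alpha) + coverage_penalty(attention, beta)

# Toy example: two source words, two target steps.
attention = [[0.6, 0.3], [0.4, 0.7]]
print(beam_score(log_prob=-3.2, length=2, attention=attention))
```

Hypotheses that leave a source word under-attended get a negative coverage term, nudging the beam toward outputs that cover the whole source sentence.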