
Human Quality Evaluation of Machine-Translated Poetry

S. Seljan*, I. Dunđer* and M. Pavlovski*
* Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of
Zagreb, Zagreb, Croatia
sseljan@ffzg.hr, ivandunder@gmail.com, mpavlovs@ffzg.hr
Abstract - The quality of literary translation has, since the beginning of literacy, been an important factor in publishing and, as a consequence, in research and education. The quality of literary text translation is of utmost significance for researchers and students, especially in higher education. Only complete, high-standard translations are considered suitable for use in the evaluation and study of the style and concepts of a given author or literary genre. This quality verification applies even more to machine translation in general, since such translations are often deemed subpar and unsuitable for further dissemination and examination. The need for human quality evaluation of machine-translated text is therefore strongly emphasised, since human translations are considered the “gold standard” and serve as reference translations in the machine translation process. The aim of this paper is to explore, on the example of a data set consisting of poems written by a relevant contemporary Croatian poet, the effectiveness of applying machine translation to the Croatian-German language pair in the domain of poetry, with regard to human judgment of machine translation quality. Human evaluation in this paper is conducted by taking into account two machine translation quality criteria, adequacy and fluency, after which an inter-rater agreement analysis is performed.
Keywords - automatic machine translation; machine
translation quality evaluation; human evaluation; domain-
specific evaluation; natural language processing
I. INTRODUCTION AND MOTIVATION
The quality of a literary text translation is of great
importance for researchers in the field of higher education,
especially those involved in studies of language and
literature for the precise evaluation of an author’s writing
in terms of style, concepts and elements of a literary work,
such as character, theme, plot, point of view, setting,
conflict and tone.
In recent times, the demand for generating fast and low-cost translations by applying various technological approaches, often at the expense of quality, has become increasingly apparent. The results of employing dominant machine translation approaches, such as statistical and neural machine translation, need to be verified and, most often, post-edited, i.e. manually corrected according to user-specific requirements.
The need to demonstrate the efficacy of technology in
translating literary texts rises with the ever-growing
possibilities of machine translation, which could be an
important tool for a tradition that, from the beginning of
literacy, considered the quality of literary translation an
important factor in various fields, including publishing
and higher education.
The aim of this paper is to evaluate the quality of
machine translations in comparison with human
translations made by two professional and acclaimed
Croatian-German translators, with numerous translated
literary works: Klaus Detlef Olof (Oebisfelde, 1939),
PhD, a professor of Slavic languages and winner of the
1991 Austrian national prize for literary translators; and
Alida Bremer (Split, 1959), PhD, a dedicated promoter of Croatian and other literatures of Southeastern Europe.
This paper explores how machine translations rank
when compared with translations done by professional
human translators, on the example of works of Delimir
Rešicki (Osijek, 1960), one of the most renowned and
awarded contemporary Croatian poets. As of 2019, Delimir Rešicki has published 17 books of poetry, fiction, essays and criticism; his works have been translated into numerous languages, including English, German, French and Italian; and he has received several notable literary prizes, including Kiklop (2005), Vladimir Nazor (2005 and 2015) and the Hubert-Burda-Preis (2008).
Although machine translations can also be assessed
automatically by using quality metrics, in this paper the
authors tried to examine the quality of German-Croatian
machine translations in the domain of poetry with the help
of human evaluators who are native speakers of the
Croatian language. Here, the human quality evaluation is
done with regard to two machine translation quality
evaluation criteria – adequacy and fluency.
Adequacy analyses “how much of the meaning
expressed in the gold-standard translation or the source is
also expressed in the target translation” [1]. Fluency, on the other hand, assesses whether the machine translation
is “one that is well-formed grammatically, contains
correct spellings, adheres to common use of terms, titles
and names, is intuitively acceptable and can be sensibly
interpreted by a native speaker” [1].
The basis of this research derives from various
scientific fields, such as information and communication
sciences, computer science, linguistics, and especially
natural language processing (NLP), and it is not surprising
that the study of machine translation is already
implemented in various courses in higher education that
treat and evaluate different aspects of machine translation
and its resulting quality in particular.
The methods that are used for machine translation
quality assessment are important not only for students in
the humanities, but also in the education of students in the
technical sciences that are taught how machine translation
is accomplished through different phases, what limitations
might occur, what quality can be expected in different
scenarios, what evaluation aspects can be considered, and
so on.
II. RELATED WORK
When discussing human evaluation of machine
translation, [2] states that “evaluation of segment-level
machine translation metrics is currently hampered by: low
inter-annotator agreement levels in human assessments;
lack of an effective mechanism for evaluation of
translations of equal quality; and lack of methods of
significance testing improvements over a baseline”.
Reference [3] concludes that “there is evidence that
machine translation can streamline the translation process
for specific types of texts, such as questions; however, it
does not yet rival the quality of human translations, to
which post-editing is key in this process”.
Comparing statistical machine translation with neural machine translation, [4] notes that “automatic
evaluation results presented for neural machine translation
are very promising, however human evaluations show
mixed results”. The authors report “increases in fluency
but inconsistent results for adequacy and post-editing
effort. Neural machine translation undoubtedly represents
a step forward for the machine translation field, but one
that the community should be careful not to oversell”.
While identifying fluently inadequate output in neural
and statistical machine translation, [5] states that “with the
impressive fluency of modern machine translation output,
systems may produce output that is fluent but not adequate
(fluently inadequate)”.
Reference [6] confirms that “due to the lack of
effective control over the influence from source and target
contexts, conventional neural machine translation tends to
yield fluent but inadequate translations”.
One study [7] notes that “although end-to-end
neural machine translation has achieved remarkable
progress in the past […] years, it suffers from a major
drawback: translations generated by neural machine
translation systems often lack adequacy. It has been
widely observed that neural machine translation tends to
repeatedly translate some source words while mistakenly
ignoring other words”.
When it comes to quality evaluation of Croatian
machine translations, [8] has done extensive research in
both automatic and human evaluation methods.
Various quality aspects of Croatian machine
translations have been analysed in different domains, such
as sociology, philosophy, spirituality [9], business
correspondence [10] or legislation [11]. Online machine
translation services have also been evaluated for the
Croatian language [12-14].
III. RESEARCH
This section describes the experimental data set, the
applied research steps and the human machine translation
quality evaluation approach.
In this paper, the authors decided to manually assess
the quality of machine-translated literary texts obtained
from an earlier research, in which a specific data set was used, consisting of a collection of poems written in Croatian by Delimir Rešicki and their human translations into German.
Rešicki’s poems originally written in Croatian and
their corresponding professional human translations into
German were crawled from a multilingual web platform
called “Lyrikline” (https://www.lyrikline.org), which aims
to attract diverse poets to publish their work online.
In total, 14 of Rešicki’s poems, consisting of 532 verses, i.e. sentences or segments (chunks of text that do not end with a sentence delimiter) per language, comprised the data set, which was analysed, preprocessed and prepared for machine translation.
The analysis and preparation phases were mostly accomplished with regular expressions, Python and Perl, and included various tasks such as applying the appropriate character encoding (UTF-8), stripping text formatting, deleting boilerplate and metadata, removing redundant and unnecessary characters, tokenising, and converting the texts into a 1-1 sentence/verse-based parallel corpus.
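The preparation steps listed above can be sketched in a few lines of Python. This is only an illustrative reconstruction, not the authors’ actual scripts: the function names are invented, and the German line below is an invented placeholder paired with a verse quoted later in the paper.

```python
import re

def preprocess_verse(line: str) -> str:
    """Normalise a single verse: trim edges, drop soft hyphens and
    BOM characters, and collapse runs of whitespace."""
    line = line.strip()
    line = re.sub(r"[\u00ad\ufeff]", "", line)
    line = re.sub(r"\s+", " ", line)
    return line

def build_parallel_corpus(src_lines, tgt_lines):
    """Pair source and target verses 1-1, skipping pairs where
    either side is empty after cleanup."""
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = preprocess_verse(src), preprocess_verse(tgt)
        if src and tgt:
            pairs.append((src, tgt))
    return pairs

hr = ["Pakao je ,  reklo bi se , bezizlazan", ""]
de = ["Die Hoelle ist , so sagt man , ausweglos", ""]
print(build_parallel_corpus(hr, de))
```

A real pipeline would also handle the misalignments discussed below, which a 1-1 zip cannot repair on its own.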
Professional human translations of Rešicki’s poems, which were used as “gold-standard” reference translations in the machine translation trials, were produced by two well-known translators, Klaus Detlef Olof and Alida Bremer.
Before starting with the actual machine translation quality evaluation, a human evaluator was asked to check whether misalignments, orthographic mistakes, various oversights, or missing or faulty characters were present in the original data set consisting of Rešicki’s poems and the corresponding human translations. The evaluator found that some of the segments (verses) were not correctly aligned, evidently due to the artistic license of the professional human translations, which was expected to some degree. Some of the translations were relatively free, while others, although correct, appeared only in later verses and therefore caused misalignments. Nonetheless, for research purposes the identified imperfections were left unfixed. However, some of the smaller segments, i.e. verses, were combined where considered appropriate by the human evaluators.
The machine translations were automatically generated
for the Croatian-German and German-Croatian language
pairs, i.e. for both directions. However, in this research the
authors decided to evaluate only German-Croatian
machine translations. Automatic translations were
obtained by using two freely available machine translation
systems Google Translate (https://translate.google.com)
and Yandex.Translate (https://translate.yandex.com/),
which are both trained on general data, and not
specifically on data from the domain of poetry. Both
systems are based on statistical and neural machine
translation. Once all the machine translations were
acquired, a human machine translation quality evaluation
followed, in which machine translations were compared to
the original human reference translations.
Here the authors decided to analyse the qualitative
aspects of the resulting machine translations. Instead of
utilising automatic machine translation quality metrics, the
authors decided to perform an evaluation with the help of
three human evaluators who are native speakers of the
Croatian language, and who were asked to rate machine translations with respect to the original reference (source) texts by considering two quality criteria – adequacy and fluency.
When it comes to adequacy, the evaluators used a 4-
point scale to rate how much of the meaning is
represented in the translation: 4 (everything), 3 (most), 2
(little), and 1 (none). Fluency was also assessed on a 4-
point scale. Here, the evaluators had to rate the extent to
which the machine translation was well-formed
grammatically, contained correct spellings, adhered to
common use of terms, titles and names, was intuitively
acceptable and could be sensibly interpreted by a native
speaker: 4 (flawless), 3 (good), 2 (disfluent), and 1
(incomprehensible).
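The two 4-point scales can be captured directly in code. A minimal sketch follows; the verbal labels come from the scale definitions above, while the dictionary and function names are illustrative assumptions:

```python
# 4-point scales used in the human evaluation, as defined above.
ADEQUACY = {4: "everything", 3: "most", 2: "little", 1: "none"}
FLUENCY = {4: "flawless", 3: "good", 2: "disfluent", 1: "incomprehensible"}

def label(scale: dict, score: int) -> str:
    """Map a numeric rating to its verbal label, rejecting
    out-of-range values so faulty scores cannot enter the data."""
    if score not in scale:
        raise ValueError(f"rating must be one of {sorted(scale)}, got {score}")
    return scale[score]

print(label(ADEQUACY, 3))  # most
print(label(FLUENCY, 2))   # disfluent
```

Validating ratings at entry time keeps the later statistical analysis free of impossible values.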
Human evaluations were made on the entire machine-
translated data set. After the three evaluators finished
grading all of the 532 machine translations for each machine translation system, the evaluators were asked to
specify their observations during manual machine
translation quality inspection. All quality scores were
statistically analysed with regard to both machine
translation systems, both evaluation criteria (adequacy and
fluency) and with respect to each evaluator. In addition, an
inter-rater agreement analysis was performed.
IV. RESULTS AND DISCUSSION
Table I presents descriptive statistics of the whole data
set calculated on the scores given by the three human
evaluators. Overall, the mean scores for both adequacy
and fluency are higher for Google Translate, and the
standard deviations are smaller. The mean value for
adequacy is lower than the mean value for fluency for
Google Translate, as opposed to Yandex.Translate – here,
the mean adequacy score is slightly higher than the mean
fluency score.
TABLE I. DESCRIPTIVE STATISTICS

System            Criterion   N     Mean    Std. dev.
Google Translate  Adequacy    532   2.644   0.989
Google Translate  Fluency     532   2.744   1.000
Yandex.Translate  Adequacy    532   2.534   1.086
Yandex.Translate  Fluency     532   2.525   1.088
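The figures in Table I are means and standard deviations over the 532 ratings per system and criterion. A short sketch of that computation on toy ratings follows; the study’s raw scores are not published with the paper, and the use of the sample standard deviation here is an assumption:

```python
from statistics import mean, stdev

def describe(ratings):
    """Mean and sample standard deviation of a list of 1-4 ratings,
    rounded to three decimals as in Table I."""
    return round(mean(ratings), 3), round(stdev(ratings), 3)

# Toy adequacy ratings for one system (illustrative only).
adequacy = [3, 2, 4, 2, 3, 3, 1, 4]
m, s = describe(adequacy)
print(f"N={len(adequacy)}  Mean={m}  Std. dev.={s}")
```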
Table II shows average scores given by each evaluator
during the human evaluation phase. Overall, scores for
Google Translate are (in most cases) higher than for
Yandex.Translate. Scores for the adequacy criterion are
higher than for fluency according to two evaluators, for
both machine translation systems. This is surprising, as earlier research (discussed in the Related Work section) found that neural machine translation tends to produce translations that are fluent but oftentimes not adequate.
Furthermore, average scores of fluency and adequacy
given by all three evaluators are higher for Google
Translate.
TABLE II. SCORES PER EVALUATOR

Evaluator                     1            2            3
Criterion                   A     F      A     F      A     F
Google Translate (avg.)   2.77  2.64   2.82  2.74   2.34  2.77
Yandex.Translate (avg.)   2.53  2.27   2.72  2.68   2.35  2.62

Google Translate – A (average for all evaluators): 2.64
Google Translate – F (average for all evaluators): 2.72
Yandex.Translate – A (average for all evaluators): 2.53
Yandex.Translate – F (average for all evaluators): 2.52
Remarks: A = adequacy, F = fluency
Interestingly, although the evaluators compared only
the resulting machine translations with the original source
text of Delimir Rešicki, the results of the human
evaluation process were still skewed towards the right half of the scale, i.e. all average scores were above 2. This was the case
for both fluency and adequacy. This implies that both
machine translation systems proved to be relatively
suitable for translating poetry from German into Croatian,
despite the fact that the selected machine translation
systems were not specially prepared and trained for that
particular domain.
The inter-rater agreement, i.e. the reliability of human
evaluators, was assessed with the use of Cronbach’s alpha.
This statistic estimates the internal consistency among
evaluators. Cronbach’s alpha ranges from 0.00 to 1.00,
where 0.00 indicates absolute absence of agreement, and
1.00 perfect agreement among evaluators, while a score of 0.70 or above is generally considered reliable. A more precise interpretation of Cronbach’s alpha values is as follows:
α<0.5 no agreement (unacceptable),
0.5≤α<0.6 poor agreement,
0.6≤α<0.7 acceptable agreement,
0.7≤α<0.9 good agreement, and
α≥0.9 excellent agreement.
In this research, the alpha values ranged from 0.84 to
0.88 (see Table III), which indicates good internal
consistency in the human quality evaluation of machine
translations.
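Cronbach’s alpha can be computed directly from the evaluators’ score vectors, treating each evaluator as an “item”. A minimal sketch with toy scores follows (illustrative values, not the study’s data):

```python
from statistics import variance

def cronbach_alpha(ratings_by_evaluator):
    """Cronbach's alpha: k/(k-1) * (1 - sum of per-evaluator
    variances / variance of the per-segment total scores)."""
    k = len(ratings_by_evaluator)
    item_vars = sum(variance(r) for r in ratings_by_evaluator)
    totals = [sum(col) for col in zip(*ratings_by_evaluator)]
    return k / (k - 1) * (1 - item_vars / variance(totals))

# Toy scores from three evaluators on six verses (illustrative only).
e1 = [4, 3, 2, 4, 1, 3]
e2 = [4, 3, 3, 4, 2, 3]
e3 = [3, 3, 2, 4, 1, 2]
print(round(cronbach_alpha([e1, e2, e3]), 2))  # 0.95
```

Because the ratio of variances is unaffected by the (n-1) denominator, using sample variances throughout gives the same alpha as population variances would.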
TABLE III. CRONBACH’S ALPHA SCORES
System / Criterion Adequacy Fluency
Google Translate 0.86 0.84
Yandex.Translate 0.88 0.85
When it comes to the evaluators’ observations during
manual inspection, the evaluators felt that in many cases
the machine translations were actually both adequate and
fluent when compared to the human reference translations.
This is an important finding, as this implies that machine
translation can be used for, at least partially, translating
texts in the domain of poetry.
The evaluators commented that in some situations the
machine translations were more fluent, and they felt that
this derived from the fact that German reference
translations of particular verses, which were used as the
input for generating Croatian machine translations, were
more fluent. For instance, the original (source) text “Pakao je , reklo bi se , bezizlazan samo za one koji ne znaju iz njega da se vrate ...” was translated by Google Translate as “Pakao je beznadan samo za one koji se ne znaju vratiti iz njega ...”.
They also pointed out that there were some cases they
felt the professional human translators added a word or
changed the sequence of words in a sentence in
comparison to the original source text. E.g. the original verse “druga vučija” was translated by Google Translate as “drugo vučje stopalo” and by Yandex.Translate as “drugi vuk šapa”. Still, both notions could not be verified
by the human evaluators, as they received only the
original (source) Croatian text and the two machine-
translated texts generated by Google Translate and
Yandex.Translate, but not the (reference) human
translation into German.
The evaluators also observed that the machine
translation systems were not always capable of correctly
translating the grammatical case of a feminine noun.
Examples of this claim are given below:
original: “Za Milanu Vuković Runjić”; error in translation by Google Translate: “Za Milana Vuković Runjić”; correct translation by Yandex.Translate: “Za Milanu Vuković Runjić”.
original: “Za Eriku Bók , zauvijek”; error in translation by Google Translate: “za Erika Bóka , zauvijek”; error in translation by Yandex.Translate: “za Erica boca , zauvijek”.
One of the evaluators denoted that in some sentences
the machine translations of synonyms were not adequate,
e.g.
original: “Ledeni nož u ledenoj torti”,
Google Translate: “Ledeni nož u ledenoj pita”, and
Yandex.Translate: “Ledeni nož u ledenom kolaču”.
On several occasions the evaluators commented that the machine translation of adjectives was neither adequate nor fluent. E.g.
original: “jedna mu je noga bila kozja”,
Google Translate: “stopala su mu bila jedna od koza”, and
Yandex.Translate: “jedna od njegovih nogu bila je kod koze”.
Evaluators also noticed that self-standing sentences were fluent and adequate, but within a stanza they were neither fluent nor adequate and seemed out of context, since several lines (verses) constitute a stanza. For instance,
original: “U to danas vjeruju milijuni/ na šalterima kladionica/ za borbe ljudi i pasa , i jedni i drugi/ na kratkim su lancima .”,
Google Translate: “U to danas vjeruju milijuni/ na šalterima kladionica/ za borbe ljudi i pasa , jedna poput druge/ visi na kratkim lancima .”, and
Yandex.Translate: “Milijuni vjeruju u to danas/ na policama kladionica/ za borbu protiv ljudi i pasa , sami kao i drugi/ visi na kratkim lancima .”.
One of the evaluators in this experiment stated that the
main limitation of the machine translation quality
evaluation approach was that the human original text and
the corresponding machine translation were presented in a
context-free verse-based manner, i.e. only line by line.
In conclusion, according to the evaluators, the generated machine translations were comprehensible in most cases but should undergo a post-editing phase in which all necessary corrections are made. Nevertheless, in most cases the generated machine translations were of relatively decent quality, yet still not suitable for direct publishing.
V. FUTURE RESEARCH AND ADDITIONAL DIRECTIONS
The authors plan to manually correct the verse
alignments in the data set, i.e. to set it to the required 1-1
parallel corpus format, so that the misalignments do not
influence the overall evaluation process. Verification of
the data set alignments could be done through a specially-
built NLP platform [15].
Although the data set was of adequate size for evaluation purposes, it might be useful to augment it with additional data from the domain of poetry. Also, additional evaluators should be engaged to confirm the findings of this paper.
Although the human evaluation results in terms of adequacy and fluency were skewed towards the higher scores, in future research native Croatian-German speakers should also be presented with the reference translations made by the two professional human translators. It would be very interesting to see whether this could also improve human judgment scores.
The authors plan to repeat the human evaluation for
the Croatian-German direction, and on the segment level,
i.e. stanza level, instead of the verse level.
Machine translations should also be evaluated with
regard to various error types and error classification
methods. It might be useful to annotate the data set
beforehand, statistically or linguistically [16]. Applying
word embeddings could reveal interesting concept-related
and semantic relationships between different unigrams
[17] as well. Moreover, a sentiment analysis could detect
the overall affective states in the poetry-related data set
[18].
Analyses of word occurrences and their corresponding
distributions [19], extracted key words [20, 21] and
concordances [22, 23] might also expose author-specific
writing styles and literary elements.
VI. CONCLUSION
With the rise of machine translation possibilities, the
need to demonstrate the effectiveness of technology in
translating literary texts is becoming increasingly evident.
The quality of a translation has always been essential
to the full comprehension of a source text. This applies
even more to the quality of machine translations of
complicated, oftentimes ambiguous, or idiomatic texts,
such as various literary works. Especially when studying poets and other writers in higher education, in order to examine an author’s style, concepts etc., it is very important to ensure that the automatic translations are of sufficient quality.
This can be accomplished by applying human quality
evaluation of a machine-translated text in comparison to a
human translation made by a professional translator.
The aim of this paper was to demonstrate, on the example of a data set containing poems written by a contemporary Croatian poet, the effectiveness of applying machine translation to a specific language pair in the domain of poetry by analysing two machine translation quality evaluation criteria – adequacy and fluency. These criteria were assessed with the help of three human evaluators, and the reliability of the evaluators was measured as well.
This paper performed, in fact, a usability analysis in
order to demonstrate the applicability of poetry-related
machine translation. The authors showed that machine
translation services can, to some extent, be used for
translating texts from various domains. Google Translate in particular generated decent translations, as verified during the human machine translation evaluation process and confirmed by the good inter-rater agreement according to Cronbach’s alpha.
However, the results of this research are to be taken as a first insight, for several reasons. Firstly, translations
from the domain of poetry are very specific and subject to
personal interpretation. Secondly, the neural machine
translation paradigm, which is used by both machine
translation services, has introduced new algorithms when
compared to statistical machine translation, and their
effects on machine translation quality have not yet been fully explored, especially when it comes to inflected languages such as Croatian. Thirdly, the evaluation of machine translation was done on a relatively small data set and only by three human evaluators. Therefore, the results
should be treated as preliminary, and all mentioned
limitations should be taken into consideration in follow-up
research.
The applied research methods can easily be implemented in education. This could be useful for students in the technical sciences for the purpose of learning how to conduct machine translation quality evaluation trials. When it comes to non-technical science students,
machine translations could be used, e.g. for “gisting”
purposes (obtaining the gist, i.e. the key semantic
information about a given text), due to the fact that the
elementary notion of a text can be derived from subpar
machine translations.
REFERENCES
[1] A. Görög, “Quality Evaluation Today: the Dynamic Quality
Framework,” in Proc. of the Translating and The Computer 36
Conference, 2014, pp. 155–164.
[2] Y. Graham, T. Baldwin, and N. Mathur, “Accurate Evaluation of
Segment-level Machine Translation Metrics,” in Proc. of the 2015
Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 2015,
pp. 1183–1191.
[3] J. Gutiérrez-Artacho, M.-D. Olvera-Lobo, and I. Rivera-Trigueros,
“Human Post-editing in Hybrid Machine Translation Systems:
Automatic and Manual Analysis and Evaluation,” in Proc. of
WorldCIST'18: Trends and Advances in Information Systems and
Technologies, 2018, pp 254–263.
[4] S. Castilho, J. Moorkens, F. Gaspari, I. Calixto, J. Tinsley, and A.
Way, “Is Neural Machine Translation the New State of the Art?,”
The Prague Bulletin of Mathematical Linguistics, no. 108, pp.
109–120, June 2017.
[5] M. Martindale, M. Carpuat, K. Duh, and P. McNamee,
“Identifying Fluently Inadequate Output in Neural and Statistical
Machine Translation,” in Proc. of Machine Translation Summit
XVII volume 1: Research Track, 2019, pp. 233–243.
[6] Z. Tu, Y. Liu, Z. Lu, X. Liu, and H. Li, “Context Gates for Neural
Machine Translation,” Transactions of the Association for
Computational Linguistics, vol. 5, pp. 87–99, 2017.
[7] Z. Tu, Y. Liu, L. Shang, X. Liu, and H. Li, “Neural Machine
Translation with Reconstruction,” in Proc. of the Thirty-First
AAAI Conference on Artificial Intelligence (AAAI'17),
arXiv:1611.01874 [cs.CL], 2017, p. 7.
[8] I. Dunđer, Statistical Machine Translation and Computational Domain Adaptation (Sustav za statističko strojno prevođenje i računalna adaptacija domene), doctoral dissertation. Zagreb: University of Zagreb, 2015, p. 281.
[9] S. Seljan, and I. Dunđer, “Automatic Quality Evaluation of
Machine-Translated Output in Sociological-Philosophical-
Spiritual Domain,” in Proc. of the 10th Iberian Conference on
Information Systems and Technologies (CISTI'2015), vol. 2,
2015, pp. 128–131.
[10] S. Seljan, and I. Dunđer, “Combined Automatic Speech
Recognition and Machine Translation in Business Correspondence
Domain for English-Croatian,” in Proc. of the International
Conference on Embedded Systems and Intelligent Technology
(ICESIT 2014) International Journal of Computer, Information,
Systems and Control Engineering, vol. 8, 2014, pp. 1069–1075.
[11] M. Brkić, S. Seljan, and T. Vičić, “Automatic and Human
Evaluation on English-Croatian Legislative Test Set,” in Proc. of
the 4th International Conference (CICLing 2013) “Computational
Linguistics and Intelligent Text Processing”, Part I Intelligent
Text Processing and Computational Linguistics, Lecture Notes in
Computer Science - LNCS, Springer, 2013, pp. 311–317.
[12] Seljan, S., Klasnić, K., Stojanac, M., Pešorda, B., and Mikelić
Preradović, N, “Information Transfer through Online
Summarizing and Translation Technology,” in Proc. of
INFuture2015: e-Institutions Openness, Accessibility, and
Preservation, 2015, pp. 197–210.
[13] S. Seljan, M. Tucaković, and I. Dunđer, “Human Evaluation of
Online Machine Translation Services for English/Russian-
Croatian,” in Proc. of WorldCIST'15 – 3rd World Conference on
Information Systems and Technologies (Advances in Intelligent
Systems and Computing New Contributions in Information
Systems and Technologies), 2015, pp. 1089–1098.
[14] S. Seljan, and I. Dunđer, “Machine Translation and Automatic
Evaluation of English/Russian-Croatian,” in Proc. of the
International Conference “Corpus Linguistics 2015”
(CORPORA 2015), 2015, pp. 72–79.
[15] R. Jaworski, S. Seljan, and I. Dunđer, “Towards educating and
motivating the crowd – a crowdsourcing platform for harvesting
the fruits of NLP students' labour,” in Proc. of the 8th Language &
Technology Conference Human Language Technologies as a
Challenge for Computer Science and Linguistics, 2017, pp. 332–
336.
[16] S. Seljan, I. Dunđer, and A. Gašpar, “From Digitisation Process to
Terminological Digital Resources,” in Proc. of the 36th
International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2013),
2013, pp. 1329–1334.
[17] I. Dunđer, and M. Pavlovski, “Through the Limits of Newspeak:
an Analysis of the Vector Representation of Words in George
Orwell’s 1984,” in Proc. of the 42nd International Convention on
Information and Communication Technology, Electronics and
Microelectronics (MIPRO 2019), 2019, pp. 0691–0696.
[18] I. Dunđer, and M. Pavlovski, “Behind the Dystopian Sentiment: a
Sentiment Analysis of George Orwell’s 1984,” in Proc. of the
42nd International Convention on Information and
Communication Technology, Electronics and Microelectronics
(MIPRO 2019), 2019, pp. 0685–0690.
[19] M. Pavlovski, and I. Dunđer, “Is Big Brother Watching You? A
Computational Analysis of Frequencies of Dystopian Terminology
in George Orwell’s 1984,” in Proc. of the 41st International
Convention on Information and Communication Technology,
Electronics and Microelectronics (MIPRO 2018), 2018, pp. 0638–
0643.
[20] S. Seljan, I. Dunđer, and H. Stančić, “Extracting Terminology by
Language Independent Methods,” in Proc. of the 2nd International
Conference on Translation and Interpreting Studies “Translation
Studies and Translation Practice” (TRANSLATA II) – Peter Lang
series “Forum Translationswissenschaft”, vol. 19, 2014, pp. 141–147.
[21] I. Dunđer, S. Seljan, and H. Stančić, “Koncept automatske
klasifikacije registraturnoga i arhivskoga gradiva (The concept of
the automatic classification of the registry and archival records),”
in Proc. of the 48. savjetovanje hrvatskih arhivista (HAD) / Zaštita
arhivskoga gradiva u nastajanju, 2015, pp. 195–211.
[22] R. Jaworski, I. Dunđer, and S. Seljan, “Usability Analysis of the
Concordia Tool Applying Novel Concordance Searching,” in
Proc. of the 10th International Conference on Natural Language
Processing (HrTAL2016) Springer Lecture Notes in Computer
Science (LNCS), 2016, p. 6, in press.
[23] I. Dunđer, and M. Pavlovski, “Computational Concordance
Analysis of Fictional Literary Work,” in Proc. of the 41st
International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2018),
2018, pp. 0644-0648.
... The previous research to probe poetic translation by human and machine systems was completed (Dunder et al., 2021;Ni & Wang, 2022;Seljan et al., 2020). ...
... There are three findings dealing with poetic translation research (Seljan et al., 2020). First, poetic translations are extremely detailed and exposed to reach the individual interpretation. ...
... Specifically, the mean scores for accuracy, acceptance, and readability were higher for human translators compared to machine systems. Such results proved that the poetic translation quality of a human translator was better than the machine system (Seljan et al., 2020). Furthermore, this created a suggestion that human translators outperformed machine systems in terms of accuracy, acceptance, and readability. ...
Article
Full-text available
The engagement with poetry is a personalized journey that transcends standardized methodologies. Key to this process is complete immersion in the poetic experience, alongside evaluators’ openness to both human and machine-generated translations from Indonesian to English. The overarching goal is to enhance students’ discernment and appreciation of these translated works. The study’s specific objectives involved comparing evaluators’ assessments of poetic translations produced by human translators and by machine systems, and assessing students’ appreciation of poetry through the lens of both translation methods across three different institutions. The research employed a mixed-methods approach, combining quantitative and qualitative descriptive analyses, including descriptive statistics. The dataset comprised two Indonesian poems by Taufik Ismail, rated using the scoring scheme proposed by Nababan. Findings indicate that human translators outperformed machine systems in terms of accuracy, acceptance, and readability. While students from Institutions 1 and 2 preferred human-translated poetry, students from Institution 3 favoured machine-translated versions. This suggests that human translation quality remains superior.
... Consequently, Baidu Translate was selected as the online translation system for our study. Previous work on poetry generation has utilized diverse human evaluation criteria such as fluency, coherence, meaningfulness, poeticness, overall quality, and adequacy (Manurung et al., 2012; Zhang and Lapata, 2014; Yan, 2016; Yi et al., 2017, 2018; Seljan et al., 2020; Refaee, 2023). However, when translating English into Chinese, strict adherence to grammatical rules is not always necessary (Chen, 1993; Owen, 2003), and modern poetry often features discontinuous narration and a flexible combination of words (Awan and Khalida, 2015). ...
... For these reasons, appropriate line-breaking is also included in our evaluation criteria. Drawing from previous studies (Manurung et al., 2012; Zhang and Lapata, 2014; Yan, 2016; Yi et al., 2017, 2018; Seljan et al., 2020; Refaee, 2023) and reflecting the unique elements of modern poetry, we have designed a new human evaluation framework specifically tailored for the translation of modern poetry. We assess candidate translations comprehensively, focusing on eight key aspects, ranging from overall impact to specific details: ...
Preprint
Full-text available
Machine translation (MT) has historically faced significant challenges when applied to literary works, particularly in the domain of poetry translation. The advent of Large Language Models such as ChatGPT holds potential for innovation in this field. This study examines ChatGPT's capabilities in English-Chinese poetry translation tasks, utilizing targeted prompts and small sample scenarios to ascertain optimal performance. Despite promising outcomes, our analysis reveals persistent issues in the translations generated by ChatGPT that warrant attention. To address these shortcomings, we propose an Explanation-Assisted Poetry Machine Translation (EAPMT) method, which leverages monolingual poetry explanation as a guiding information for the translation process. Furthermore, we refine existing evaluation criteria to better suit the nuances of modern poetry translation. We engaged a panel of professional poets for assessments, complemented evaluations by using GPT-4. The results from both human and machine evaluations demonstrate that our EAPMT method outperforms traditional translation methods of ChatGPT and the existing online systems. This paper validates the efficacy of our method and contributes a novel perspective to machine-assisted literary translation.
... Stimulated by advances in neural machine translation, there has emerged a body of empirical research on literary machine translation. Most studies focus on the quality of raw machine translations of poetry (Greene et al., 2010; Genzel et al., 2010; Almahasees, 2017; Humblé, 2019; Dunđer et al., 2020, 2021; Seljan et al., 2020) and prose (Voigt and Jurafsky, 2012; Jones and Irvine, 2013; Way, 2015a, 2015b; Kuzman et al., 2019; Matusov, 2019; Tezcan et al., 2019; Toral et al., 2020; Webster et al., 2020; Jiang and Niu, 2022). In comparison, the post-editing of literary machine translation remains underexplored. ...
Chapter
Despite the advances in the automation of translation processes, several discourses insist that there is still a privileged status for the non-automated human, especially in idealized conceptualisations of literary translation. Empirical data from experiments with 141 students show general awareness of trade-offs between the advantages and disadvantages of post-editing texts by Agatha Christie. Although machine translation was assumed to suffer from excessive literalism, it was also criticized for being deceptively fluent. Comparison of work by post-editors and from-scratch translators shows that the former tends to use risk transfer, trusting machine translation in situations of uncertainty, while the latter are more prone to risk taking. The two strategies lead to errors of different kinds, with the from-scratch translations having significantly fewer errors than the results of post-editing, with the exception of the “deviations” resulting from excessive risk taking.
... Comparative studies analyzing human and machine translations of literary works have gained traction among translation scholars. Seljan et al. (2020) underscored the effectiveness of machine poetry translation for low-resource languages. Dai et al. (2022) discussed the limitations of machine translation for poetry, citing its struggle to convey the beauty of ancient Chinese traditional culture while noting the occasional enrichment of language through figures of speech. ...
Article
Full-text available
Abstract: The article presents a detailed comparative analysis of translations of 12 poems by the great Ukrainian poet Ivan Franko, produced by the translator Percival Cundy and by the GPT-3.5 AI language model. Using various manual and automatic analytical research methods and techniques, we analyzed the translations’ merits and demerits, and 15 essential qualitative and quantitative linguistic and poetic characteristics, to verify the hypothesis that human and GPT-3.5-driven machine translations can be quite comparable in terms of their quality and poetic features. The results obtained sufficiently prove the hypothesis and suggest that developing AI translation potential for poetry translation can help build more capable, diversified, and nuanced large language models. The AI revolutionary breakthrough in translation makes it quite possible to satisfactorily acquaint the wider public with the poetic heritage of the world’s nations, especially those using minor languages, whose poetry is evidently undertranslated. A follow-up study is desirable to assess the progress made by GPT-4.0 and its possible later versions in poetry translation, as compared with GPT-3.5. Keywords: AI translation; human translation; comparative analysis; poetry translation; translation from Ukrainian into English
... The research results show the effectiveness of poetry machine translation in terms of dedicated automatic quality metrics. At the same time, Seljan et al. [28] performed a manual adequacy and fluency analysis of the machine translation results for poetry on the same data set, which demonstrated the effectiveness of applying machine translation to the Croatian-German language pair in the domain of poetry. Gašpar et al. [29] proposed a method for testing the terminological quality of language texts. ...
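Adequacy and fluency judgments of this kind are, as in the abstract above, typically followed by an inter-rater agreement analysis. A minimal sketch of one common agreement statistic, Cohen's kappa, computed over hypothetical ratings (the 1–5 scores and the number of verses below are illustrative only, not data from any of the cited studies):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items on a nominal/ordinal scale."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same score.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal score distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[s] * freq_b[s] for s in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical adequacy ratings (1-5 scale) from two evaluators for ten verses.
rater_1 = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]
rater_2 = [4, 4, 3, 4, 2, 5, 3, 3, 4, 5]
print(round(cohen_kappa(rater_1, rater_2), 3))  # 0.718
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when raters use some scale points far more often than others.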
Article
Full-text available
In practical applications, the accuracy of domain terminology translation is an important criterion for the performance evaluation of domain machine translation models. Aiming at the problem of phrase mismatch and improper translation caused by word-by-word translation of English terminology phrases, this paper constructs a dictionary of terminology phrases in the field of electrical engineering and proposes three schemes to integrate the dictionary knowledge into the translation model. Scheme 1 replaces the terminology phrases of the source language. Scheme 2 uses the residual connection at the encoder end after the terminology phrase is replaced. Scheme 3 uses a segmentation method of combining character segmentation and terminology segmentation for the target language and uses an additional loss module in the training process. The results show that all three schemes are superior to the baseline model in two aspects: BLEU value and correct translation rate of terminology words. In the test set, the highest accuracy of terminology words was 48.3% higher than that of the baseline model. The BLEU value is up to 3.6 higher than the baseline model. The phenomenon is also analyzed and discussed in this paper.
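The BLEU gains reported in this abstract rest on the standard modified n-gram precision formulation. A minimal single-reference, sentence-level sketch, simplified from Papineni et al.'s definition with add-one smoothing (the example sentences are illustrative; the cited study's exact tooling and smoothing are not specified):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of 1..max_n modified n-gram
    precisions, multiplied by a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped matches
        total = max(sum(cand.values()), 1)
        # Add-one smoothing so one empty n-gram order does not zero the score.
        precisions.append((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(round(bleu(hyp, ref), 3))  # 0.489
```

Production evaluations usually rely on a standardised corpus-level implementation (e.g. sacreBLEU) rather than a hand-rolled sentence score, since tokenisation and smoothing choices shift BLEU by whole points.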
... Challenges in building NMT systems in terms of parallel corpora, such as domain mismatches, amount of training data, rare words, size of sentences, word alignments, etc., have been analyzed in [41]. In comparison to the SMT model, the authors reported that NMT had lower quality for out-of-domain translations, especially for the criterion of adequacy, as also confirmed in a study by [42]. NMT systems perform better when large amounts of parallel data are available, i.e., worse for low-resource language pairs. ...
Article
Full-text available
Parallel corpora have been widely used in the fields of natural language processing and translation as they provide crucial multilingual information. They are used to train machine translation systems, compile dictionaries, or generate inter-language word embeddings. There are many corpora available publicly; however, support for some languages is still limited. In this paper, the authors present a framework for collecting, organizing, and storing corpora. The solution was originally designed to obtain data for less-resourced languages, but it proved to work very well for the collection of high-value domain-specific corpora. The scenario is based on the collective work of a group of people who are motivated by the means of gamification. The rules of the game motivate the participants to submit large resources, and a peer-review process ensures quality. More than four million translated segments have been collected so far.
Chapter
Translation technology plays a crucial role in facilitating multilingual communication, requiring specialized tools and expertise. This study explores the application of computer-assisted translation (CAT) tool, which integrates automatic machine translation, translation memories, and terminology databases, to assess their effectiveness compared to traditional translation methods. The research evaluates translation performance across metrics such as time efficiency, error frequency, translation similarity, and terminological accuracy.Two distinct translation processes were conducted, leveraging different tools, resources, and methodologies. A translation memory was built by collecting and aligning parallel corpora, alongside the creation of a term base. Results revealed that the use of CAT tools reduced translation time by 49% on average. Error analysis showed a marked decrease in spelling and grammatical errors when using CAT tools. Additionally, translation similarity with the reference text increased from 79% in traditional translation to 95% with the CAT tool. Terminological analysis highlighted a substantial reduction in terminology errors, although some errors persisted due to limitations in automatic machine translation. The findings emphasize the critical role of a well-maintained term base in achieving consistent terminology.
Article
Full-text available
Automatic machine translation is an increasingly popular research topic in science and in various scientific disciplines, such as information and communication sciences, computer science, computational linguistics, etc. The primary reason is that it today enables indispensable communication and fast transfer of information between different natural languages. This is especially important for less widely spoken languages such as Croatian, for which there is still an insufficient number of software tools and digital resources needed for the development of specialised, high-quality machine translation systems optimised for use in one specific domain. The ever faster growth in the amount of data and the growing needs of various stakeholders in industry, the economy and science, but also in people's everyday lives, motivate the systematic and organised development and subsequent adaptation of automatic machine translation systems for different language pairs. Since machine translations are not perfect, it is important to apply methods for the computational generation of translations of an acceptable level of quality, which depends on the task itself and on the application domain of the machine translation system. This paper analyses a model of an automatic statistical machine translation system, its components, and the role and significance of the individual elements within the model.
Chapter
Full-text available
This paper describes a novel tool for concordance searching, named Concordia. It combines the capabilities of standard concordance searchers with the usability of a translation memory. The tool is described in detail with regard to main applied methods and differences when compared to already existing CAT tools. Concordia uses three data structures, i.e. hashed index, markers array and suffix array, which are loaded into memory to enable fast lookups according to fragments that cover a search pattern. In this new concordancing system, sentences are stored in the index and marked with additional information, such as unique ids, which are then retrieved by the Concordia search algorithm. The usability of the new tool is analysed in an experiment involving two English-Croatian human translation tasks. The paper presents a detailed scheme and methodology of the conducted experiment. Furthermore, an analysis of the experiment results is presented, with special emphasis on the users’ attitudes towards the usefulness and functionalities of Concordia.
Chapter
Full-text available
This study assesses, automatically and manually, the performance of two hybrid machine translation (HMT) systems, via a text corpus of questions in the Spanish and English languages. The results show that human evaluation metrics are more reliable when evaluating HMT performance. Further, there is evidence that MT can streamline the translation process for specific types of texts, such as questions; however, it does not yet rival the quality of human translations, to which post-editing is key in this process.
Poster
Full-text available
The poster presents 3 basic steps used in the system TMrepository used for collection of parallel data through crowdsourcing: - collecting and uploading parallel corpora - review and quality check - get top ranking (gamification)
Conference Paper
Full-text available
This paper presents an idea to bring crowdsourcing to a higher level, for the purpose of acquiring valuable machine translation and natural language processing resources. In the proposed scenario, students are being educated in order to improve the quality and effectiveness of their natural language processing (NLP) related work. Their motivation is ensured by introducing an element of gamification-a ranking is kept, where the best contributing users are decorated with medals. The ranking is available at all times to all users and is always up-to-date, hence the effects of the contributions are immediately visible to the users. This scenario was applied to a group of students enrolled in Natural Language Processing course, who were presented with a task of collecting parallel corpora for less-resourced language pairs, in this case Croatian-English and English-Croatian. The whole experiment was supervised with the help of a custom-made open-source system named TMrepository, developed and maintained by the authors of this paper.
Article
Full-text available
This paper discusses neural machine translation (NMT), a new paradigm in the MT field, comparing the quality of NMT systems with statistical MT by describing three studies using automatic and human evaluation methods. Automatic evaluation results presented for NMT are very promising, however human evaluations show mixed results. We report increases in fluency but inconsistent results for adequacy and post-editing effort. NMT undoubtedly represents a step forward for the MT field, but one that the community should be careful not to oversell.
Conference Paper
Full-text available
The paper presents automatic extraction process from monolingual text performed by three language independent tools, but relying on different principles. The research is conducted on the domain of pharmaceutical documentation. After the digitization process and use of OCR techniques, the automatic extraction process is performed. Results are compared with reference terminology list created by responsible institution and evaluated by measures of recall, precision and F-measure. Results are discussed in the frame of possible integration into the process of digital archiving.
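The recall, precision and F-measure evaluation described in this last abstract amounts to a set comparison between the automatically extracted terms and the reference terminology list. A minimal sketch (the pharmaceutical term lists below are hypothetical, not the paper's data):

```python
def precision_recall_f1(extracted, reference):
    """Set-based precision, recall and F1 of an extracted term list
    against a reference terminology list."""
    extracted, reference = set(extracted), set(reference)
    tp = len(extracted & reference)  # terms present in both lists
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f1

# Hypothetical extracted and reference terms, for illustration only.
extracted = ["dosage", "tablet", "side effect", "patient", "daily"]
reference = ["dosage", "tablet", "side effect", "contraindication"]
p, r, f = precision_recall_f1(extracted, reference)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.6 0.75 0.67
```

Precision penalises spurious extracted terms, recall penalises missed reference terms, and F1 balances the two, which is why all three are reported together in terminology extraction studies.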