Automatic Quality Evaluation of
Machine-Translated Output in
Sociological-Philosophical-Spiritual Domain
Sanja Seljan; Ivan Dunđer
Department of Information and Communication Sciences
Faculty of Humanities and Social Sciences, University of Zagreb
Zagreb, Croatia
sanja.seljan@ffzg.hr; ivandunder@gmail.com
Abstract: Automatic quality evaluation of machine translation systems has become an important issue in the field of natural language processing, due to the growing interest and needs of industry and everyday users. The development of online machine translation systems is also important for less-resourced languages, as such systems enable basic information transfer and communication. Although the quality of free online automatic translation systems is not perfect, it is important to assure an acceptable level of quality. As human evaluation is time-consuming, expensive and subjective, automatic quality evaluation metrics try to approximate human evaluation as closely as possible. In this paper, several automatic quality metrics are applied in order to assess the quality of a specific machine-translated text. The research is performed on the sociological-philosophical-spiritual domain, on material resulting from the digitisation of a scientific publication written in Croatian and English. The quality evaluation results are discussed and further analysis is proposed.
Keywords: automatic quality evaluation; machine translation; BLEU; NIST; METEOR; GTM; English-Croatian; Croatian-English; sociological-philosophical-spiritual domain.
I. INTRODUCTION
Machine translation evaluation has become a topic of great interest to numerous researchers and projects, usually when comparing several machine translation systems on the same test set, or when evaluating one system through different phases, such as automatic evaluation, correlation of automatic evaluation scores with human evaluation, etc. [1].
Lately, extensive evaluations of machine translation quality have been conducted with a focus on online machine translation systems, i.e. commercial or integrated systems applying statistical machine translation, sometimes combined with other machine translation approaches [2].
Statistical machine translation systems rely on huge amounts of parallel data, which are often scarce for less-resourced languages or between typologically different languages; this calls for more detailed quality error analysis and evaluation in order to improve the performance of a machine translation system [3].
Some authors place machine translation evaluation in a wider context, relating quality, purpose and context of use in an attempt to establish a coherent evaluation approach [4]. Automatic evaluation of less-resourced but morphologically rich languages is a topic of interest for numerous researchers and organisations, since the results can be useful to professional translators, the translation industry, researchers and everyday users. The main advantages of automatic evaluation metrics are speed, cost and objectiveness. They always perform in the same way and, besides being tuneable, they can provide meaningful, consistent, correct and reliable information on the level of machine translation quality [5].

Human evaluation, on the other hand, is considered to be the "gold standard", but it is subjective, tedious and more expensive.
II. RELATED WORK
Evaluation of machine translation has mainly been performed in the legislation, technical or general domain, due to the availability of bilingual corpora [6]. Domains such as sociology, philosophy or religion are rarely investigated, as acquiring the necessary corpora and a specific reference translation represents a notable problem.
However, one research paper describes translation in cross-language information access performed by machine translation, supplemented by domain-specific phrase dictionaries automatically mined from Wikipedia in the domain of cultural heritage [7]. Queries were translated from and into English, Spanish and Italian, and then evaluated using human annotations.
Another study evaluated machine translations produced by Google Translate for the English-Swedish language pair on fictional and non-fictional texts, including samples of law documents, commercial company reports, social science texts (religion, welfare, astronomy) and medicine [8]. Evaluation was carried out with the BLEU metric, showing that law texts obtained roughly double the average BLEU score.
Evaluation of machine translation shows scores that are better by about 20% when two reference sets are used, and by up to 29% for three reference sets, with short and long sentences treated separately [9]. Besides sentence length, other problems in the machine translation process were investigated, such as specific terminology, anaphora and ambiguity [8].
Another research paper describes the importance of the translation domain, which influences the quality of machine translation output; therefore, domain knowledge and specific terminology translations were added [10]. That research was conducted with the SYSTRAN translation system, which uses the transfer approach, for the Chinese-English, English-French, French-English and Russian-English language pairs.
Research on machine translation of religious texts by
Google Translate for English-Urdu and Arabic-Urdu has also
been conducted [11]. Evaluation of cross-language information
retrieval using machine translation in the domain of sociology
is presented in [12] for English, French, German and Italian.
III. RESEARCH
The following subsections describe the digitisation process and the data set, and discuss the research methods, quality evaluation metrics and tools used.
A. Digitisation
For the purpose of this research, a book of abstracts from a
scientific conference containing mutual translations in Croatian
and English was digitised with a scanner. Digitisation
represents the systematic recording, storing and processing of
content using digital cameras, scanners and computers [13]. It
is the process of creating a digital representation of an object,
image, document or a signal, and allows them to be stored,
displayed, disseminated and manipulated on a computer. In
order to digitise the mutual bitexts, an HP Scanjet G3110 flatbed
scanner was used, set to 300 dpi and grayscale scanning.
Scanned abstracts were in A5 format and text was written in
Times New Roman font, size 10, standard black font colour on
white paper.
Afterwards, optical character recognition (OCR) was
carried out for extracting, editing, searching and repurposing
data from the scanned book of abstracts. In this research, ABBYY FineReader 8.0.0.677 was used as the OCR software, which
identifies text by analysing the structure of the object that
needs to be digitised, by dividing it into structural elements and
by distinguishing characters through comparison with a set of
pattern images stored in a database and built-in dictionaries.
During optical character recognition, errors are inevitable, and
the induced noise is a serious challenge to subsequent
processes that attempt to make use of such data [14].
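The OCR step itself can be approximated with open-source tools. The following minimal sketch uses the Tesseract engine via pytesseract rather than the ABBYY software employed in this research; the file name and the combined language setting are illustrative assumptions.

    # Minimal OCR sketch using the open-source Tesseract engine (pytesseract)
    # as a stand-in for ABBYY FineReader used in this research. The file name
    # 'scan_page.png' (a 300 dpi grayscale scan) is an assumption.
    from PIL import Image
    import pytesseract

    # Croatian ('hrv') and English ('eng') models are loaded together, since
    # the book of abstracts contains mutual translations in both languages.
    text = pytesseract.image_to_string(Image.open("scan_page.png"), lang="hrv+eng")
    print(text)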
B. Data set
The book of abstracts, which consisted of 41 abstracts, was
digitised and afterwards processed with OCR, then manually
corrected and later on used as the reference set for machine
translation, i.e. the gold standard. The book contained very specific
abstracts of full scientific papers in the fields of sociology,
psychology, theology and philosophy with emphasis on several
topics, such as human dignity, religion, dialogue, freedom,
peace, responsibility, family and community, philosophical and
sociological reflections.
The texts were compiled into a parallel bilingual sentence-
aligned Croatian-English bitext consisting of mutual
translations relating to the sociological-philosophical-spiritual
domain. The process of preparing the data set included the digitisation of the printed material, the application of OCR and the evaluation of its output. All segments were sentence-aligned and a
translation memory was created. The format of such a
translation memory is ideal for further research in statistical
terminology and collocation extraction, evaluation and
analysis.
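As an illustration of the last step, a sentence-aligned bitext can be serialised into a minimal TMX translation memory. The sketch below assumes the segments are already aligned in two parallel lists; the tool name, file name and example segments are invented.

    # Sketch: write aligned Croatian-English segments to a minimal TMX 1.4 file.
    # The input lists and the output file name are illustrative assumptions.
    import xml.etree.ElementTree as ET

    def write_tmx(hr_segments, en_segments, path="abstracts.tmx"):
        tmx = ET.Element("tmx", version="1.4")
        ET.SubElement(tmx, "header", {
            "creationtool": "bitext-sketch", "creationtoolversion": "1.0",
            "segtype": "sentence", "o-tmf": "plaintext", "adminlang": "en",
            "srclang": "hr", "datatype": "plaintext"})
        body = ET.SubElement(tmx, "body")
        for hr, en in zip(hr_segments, en_segments):  # one <tu> per aligned pair
            tu = ET.SubElement(body, "tu")
            for lang, seg in (("hr", hr), ("en", en)):
                tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
                ET.SubElement(tuv, "seg").text = seg
        ET.ElementTree(tmx).write(path, encoding="utf-8", xml_declaration=True)

    write_tmx(["Ljudsko dostojanstvo je tema rada."],
              ["Human dignity is the topic of the paper."])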
The following table shows the data set statistics (Table I).
The data set consisted of 369 segments (sentences) and 107264
characters in total. The longest segment was 80 words long in
Croatian, and 88 words in English, whereas 3677 distinct
words appeared in Croatian, and 2782 in English. On average,
English abstracts were composed of 10.6% more characters,
and 19% more words. Specificity of terminology is also
reflected in the large number of hapax legomena, which also
indicates a variety of different topics in the digitised book of
abstracts.
TABLE I. DATA SET STATISTICS

Data set                                          Croatian         English
No. of characters                                 50631            56633
No. of words                                      7340             9062
No. of segments                                   369              369
No. of abstracts                                  41               41
Max. words per segment                            80               88
Min. words per segment                            1                1
Distinct words                                    3677             2782
Words that appear once (hapax legomena)           2837 (77.16%)    1892 (68.01%)
Words that appear twice (dis legomena)            424 (11.53%)     389 (13.98%)
Words that appear three times (tris legomena)     165 (4.49%)      159 (5.72%)
Words that appear more than three times           251 (6.83%)      342 (12.29%)
Arithmetical mean of characters per abstract      1234.90          1381.29
Arithmetical mean of characters per segment       137.21           153.48
Arithmetical mean of words per abstract           179.02           221.02
Arithmetical mean of words per segment            19.89            24.56
No. of OCR errors in total                        66               67
Arithmetical mean of OCR errors per abstract      1.61             1.63
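Frequency statistics of the kind reported in Table I can be reproduced with a few lines of code. The sketch below assumes plain whitespace tokenisation of one raw text file per language; the file name is hypothetical.

    # Sketch: reproduce Table I-style frequency statistics for one language side.
    # Whitespace tokenisation of the raw, non-lowercased text is an assumption.
    from collections import Counter

    def dataset_stats(path):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        words = text.split()
        freq = Counter(words)
        hapax = sum(1 for c in freq.values() if c == 1)  # words appearing once
        dis = sum(1 for c in freq.values() if c == 2)    # words appearing twice
        tris = sum(1 for c in freq.values() if c == 3)   # words appearing thrice
        return {"characters": len(text), "words": len(words),
                "distinct": len(freq), "hapax": hapax, "dis": dis, "tris": tris,
                "hapax_share": hapax / len(freq)}

    print(dataset_stats("abstracts_hr.txt"))  # hypothetical file name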
In total, 133 OCR errors occurred in the digitisation process. Typical errors during optical character recognition were misrecognitions of characters, missing whitespace characters or apostrophes, various forms of substitution errors, as well as space deletion and insertion errors. The most frequent OCR errors in Croatian were substitution errors (e.g. (l) recognised as (i)) and space deletions, where two words were erroneously unified (e.g. (U postkomunističkim) recognised as (Upostkomunističkim)). The most frequent OCR errors in English were substitution errors (e.g. («) recognised as (() and missing apostrophes ('). OCR errors have an impact on later-stage processing and data usability; therefore, all scanned texts were manually post-edited afterwards.
C. Tools and methods
All machine translations for both directions (Croatian-
English and English-Croatian) were generated by the freely
available online machine translation service, Google Translate
(https://translate.google.com/). Automatic machine translation quality evaluation was performed for both directions by the following metrics: BLEU (BiLingual Evaluation Understudy),
NIST (National Institute of Standards and Technology),
METEOR (Metric for Evaluation of Translation with Explicit
ORdering) and GTM (General Text Matcher).
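The translations in this research were obtained from the free web interface. For a programmatic approximation, the official Google Cloud Translation API could be used, as in the hedged sketch below; it is not identical to the free service, and valid credentials configured in the environment are assumed.

    # Sketch: machine-translate Croatian segments into English with the official
    # Google Cloud Translation API (v2 client library). This approximates, but
    # is not identical to, the free web service used in the paper; Google Cloud
    # credentials are assumed to be configured in the environment.
    from google.cloud import translate_v2 as translate

    client = translate.Client()
    segments = ["Ljudsko dostojanstvo je tema rada."]  # illustrative input
    for seg in segments:
        result = client.translate(seg, source_language="hr", target_language="en")
        print(result["translatedText"])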
The basic idea behind the mentioned metrics is to calculate the degree of matching between automatic and reference translations. The metrics are based on the overlap of identical surface forms, which is not well suited to languages with rich morphology and relatively free word order. Some of the metrics reward fixed word order (METEOR, GTM), which is also unsuitable for languages with relatively free word order, such as Croatian. BLEU is more order-independent, whereas METEOR introduces linguistic knowledge by matching n-grams sharing the same lemma, as well as synonym matches.
GTM and METEOR are based on precision and recall, while BLEU and NIST are based on precision; to compensate for the missing recall, BLEU introduces a brevity penalty [15]. The metrics mainly focus on the evaluation of adequacy, as it indicates to what extent the meaning of the translation is preserved, and they penalise translations with missing words, which affects recall.
The BLEU metric, proposed by IBM, represents a standard for machine translation evaluation [16]. It matches machine translation n-grams with the n-grams of its reference translation and counts the number of matches at the sentence level, typically for n-grams of order 1-4. It assigns the same weight to each n-gram, which is one of the main shortcomings of this metric. The metric operates on identical surface forms, accepting only exact matches, and does not take into account words sharing the same lemma. BLEU also applies a brevity penalty, which is assigned to automatic translations shorter than the reference translation. It allows evaluation against multiple reference translations as well.
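In the standard formulation from [16], BLEU combines the modified n-gram precisions p_n, uniform weights w_n = 1/N (typically N = 4) and the brevity penalty BP, with c and r being the candidate and reference lengths:

    \mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
    \qquad \mathrm{BP} = \min\big(1,\; e^{\,1 - r/c}\big)

A sentence-level score of this kind can be computed, for instance, with NLTK. This is a sketch: the example sentence pair is invented, and smoothing is added because short segments often have no higher-order n-gram matches.

    # Sketch: sentence-level BLEU with NLTK; whitespace tokenisation and the
    # example sentence pair are assumptions, not data from the paper.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "human dignity is the topic of the paper".split()
    hypothesis = "the human dignity is topic of paper".split()

    # Uniform weights over 1-4-grams, as in the default BLEU configuration.
    score = sentence_bleu([reference], hypothesis,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    print(round(score, 4))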
The NIST metric is based on BLEU, with some modifications [17]. While BLEU is based on n-gram precision and assigns an equal weight to each word, NIST calculates an information weight for each word, i.e. higher scores are given to rarer n-grams, which are considered more informative. It also differs from BLEU in the brevity penalty calculation, where small differences in translation length do not impact the overall score. Stemming is significantly beneficial to BLEU and NIST [18].
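NLTK also provides a NIST implementation; the sketch below scores the same kind of invented sentence pair (n = 5 is the customary maximum n-gram order for NIST).

    # Sketch: sentence-level NIST with NLTK; the sentence pair is invented.
    from nltk.translate.nist_score import sentence_nist

    reference = "human dignity is the topic of the paper".split()
    hypothesis = "the human dignity is topic of paper".split()

    # NIST weights n-grams by their information gain, so rarer n-grams
    # contribute more to the score than frequent ones.
    print(sentence_nist([reference], hypothesis, n=5))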
The METEOR metric modifies BLEU by giving more emphasis to recall than to precision [19]. It incorporates linguistic knowledge, taking into account matches of the same lemma and of synonyms, which is suitable for languages with rich morphology [20]. Like GTM, it favours longer matches in the same order. It uses a fragmentation penalty, which reduces the F-measure if there are no bigram or longer matches [21]. METEOR is calculated at the sentence or segment level, while the BLEU metric is usually computed at the corpus level. It cannot combine knowledge from several references into a single score, but instead produces a score against each reference translation.
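A METEOR score can likewise be computed with NLTK, which implements stem- and synonym-aware matching via WordNet; recent NLTK versions expect pre-tokenised input, and the example pair is again invented.

    # Sketch: sentence-level METEOR with NLTK; requires the WordNet data
    # (nltk.download('wordnet')) and, in recent NLTK versions, pre-tokenised
    # references and hypothesis.
    from nltk.translate.meteor_score import meteor_score

    reference = "human dignity is the topic of the paper".split()
    hypothesis = "the human dignity is topic of paper".split()

    # Matches unigrams on surface form, stem and WordNet synonymy, then
    # applies a fragmentation penalty for out-of-order chunks.
    print(round(meteor_score([reference], hypothesis), 4))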
The GTM metric counts correct unigrams and favours longer matches. It is based on precision (the number of correct words divided by the length of the generated machine translation output) and recall (the number of correct words divided by the reference length), from which it calculates the F-measure [22]. The metric counts correct unigram matches with respect to non-repeated words in the output and in the reference translation, and it favours n-grams in the correct order by assigning them higher weights [23].
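GTM is distributed as a standalone tool; a deliberately simplified unigram version of its F-measure can be sketched as follows (it omits the reward for longer contiguous matches that the full metric applies).

    # Simplified GTM-style unigram F-measure: clipped unigram matches divided
    # by output length (precision) and reference length (recall). The full GTM
    # metric additionally rewards longer contiguous runs, which this omits.
    from collections import Counter

    def unigram_f_measure(hypothesis, reference):
        matches = sum((Counter(hypothesis) & Counter(reference)).values())
        if matches == 0:
            return 0.0
        precision = matches / len(hypothesis)
        recall = matches / len(reference)
        return 2 * precision * recall / (precision + recall)

    hyp = "the human dignity is topic of paper".split()  # invented example
    ref = "human dignity is the topic of the paper".split()
    print(round(unigram_f_measure(hyp, ref), 4))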
Apart from the mentioned disadvantages of automatic evaluation metrics, there are numerous other shortcomings, such as the limited aspects of evaluation, difficulties with the interpretation and meaning of scores, ignoring the importance of individual words, not addressing grammatical coherence, etc. [23].
IV. RESULTS AND DISCUSSION
The following table shows the results of automatic machine
translation quality evaluation metrics (Table II). BLEU scores
range from 0 (no overlapping with reference translation) to 1
(perfect overlapping with reference translation), whereas scores
over 0.3 generally reflect understandable translations, and
scores over 0.5 reflect good and fluent translations [24].
METEOR scores are usually higher than BLEU scores and
reflect understandable translation when higher than 0.5, and
good and fluent translation when scored higher than 0.7 [24].
NIST scores can be 0 or higher and have no fixed maximum, whereas GTM scores range from 0 to 1. Different metric
scores provide an overall overview of the machine translation
quality with regard to various aspects of evaluation and can be
correlated.
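As a small illustration, these rules of thumb can be encoded directly. The thresholds come from [24] as quoted above; the label for scores below the lower threshold is an assumption of this sketch.

    # Sketch: qualitative interpretation of BLEU and METEOR scores using the
    # rule-of-thumb thresholds from [24] quoted in the text. The label for
    # below-threshold scores is an assumption, not taken from [24].
    def interpret(metric, score):
        thresholds = {"BLEU": (0.3, 0.5), "METEOR": (0.5, 0.7)}
        understandable, good = thresholds[metric]
        if score >= good:
            return "good and fluent translation"
        if score >= understandable:
            return "understandable translation"
        return "below the understandability threshold"

    print(interpret("BLEU", 0.1656))    # English-Croatian result from Table II
    print(interpret("METEOR", 0.2439))  # Croatian-English result from Table II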
TABLE II. RESULTS OF AUTOMATIC QUALITY EVALUATION METRICS

Machine translation    Automatic quality evaluation metrics (higher is better)
direction              BLEU      NIST      METEOR    GTM
English-Croatian       0.1656    4.6527    0.1976    0.3348
Croatian-English       0.2383    5.8686    0.2439    0.5044
Overall, the results of automatic quality assessment show scores that are better by 20-30% for the Croatian-English direction. This is mainly due to the fact that Croatian is a highly inflected language with rich morphology. Furthermore, metrics that rely on word matching penalise word types that appear in the form of different tokens. As the data set belongs to the sociological-philosophical-spiritual domain, it contains specific terminology for which Google Translate does not provide a correct translation. Namely, such terminology is unlikely to appear in the correct context in the language and translation models of the analysed machine translation system. The fact that the data set contains 77.16% hapax legomena in Croatian and 68.01% in English also points to infrequently used terminology.

The BLEU metric, which ignores word relevance, penalises a machine translation that is shorter than the reference translation and counts only words with the same surface form. In this research, the BLEU score is relatively low for English-Croatian (0.17) when compared to the Croatian-English direction (0.24). Generally, translating from morphologically rich languages into morphologically poorer languages results in better BLEU scores. The NIST metric, which is sensitive to more informative n-grams that occur less frequently, gives the following results: 4.65 for English-Croatian and 5.87 for Croatian-English. The METEOR metric shows scores close to the BLEU metric for English-Croatian (0.20) and for Croatian-English (0.24). Although METEOR counts matches at the stem level, in this research the raw data set was used, which was not lowercased or tokenised. The results of the GTM metric, which computes the F-measure, are as follows: 0.33 for English-Croatian and 0.50 for Croatian-English. The GTM score for English-Croatian is lower than for Croatian-English because morphological variants of the same lemma with different suffixes do not match the reference and therefore lower the score.
V. CONCLUSIONS
In this research, a book of scientific abstracts was digitised with a scanner, subsequently processed with OCR, then post-edited, and eventually used as a gold standard for machine translation. Automatic machine translations were generated by Google Translate and afterwards evaluated by means of several metrics, for both directions (Croatian-English and English-Croatian). The results for translation into Croatian are lower due to specific terminology that is not widely used on the internet, and therefore not available in the correct context in the machine translation models, as well as due to the morphological richness of the Croatian language, long sentences, relatively free word order and grammatical case agreement. Namely, this decreases the scores, since several forms of the same lemma that are not identical to the reference morphological variant count as mismatches. Overall, the results of the automatic machine translation quality evaluation for BLEU, NIST, METEOR and GTM are 20-30% better for the Croatian-English direction. Further research on automatic quality evaluation would include more extensive evaluation applying other metrics, text lemmatisation, lowercasing, tokenisation and enlargement of the data set, as well as the use of multiple reference sentences.
REFERENCES
[1] D. R. Amancio, M. G. V. Nunes, O. N. Oliveira Jr., T. A. S. Pardo, L.
Antiqueira and L. da F. Costa, “Using metrics from complex networks to
evaluate MT,” Physica A: Statistical Mechanics and its Applications,
vol. 390, no. 1, 2011, pp. 131-142.
[2] S. Hampshire and C. Porta Salvia, “Translation and the internet:
Evaluating the quality of free online machine translators,” Quaderns:
revista de traducció, no. 17, 2010, pp. 197-209.
[3] S. Stymne, “Pre- and postprocessing for statistical machine translation
into germanic languages,” Proceedings of the ACL-HLT 2011 Student
Session, 2011, pp. 12-17.
[4] E. Hovy, M. King and A. Popescu-Belis, “Principles of context-based
machine translation evaluation,” Machine Translation, vol. 17, 2002, pp.
43-75.
[5] P. Koehn, “What is a better translation? Reflections on six years of
running evaluation campaigns,” Tralogy 2011, 2011, p. 9, available at:
http://homepages.inf.ed.ac.uk/pkoehn/publications/tralogy11.pdf
[6] C. Kit and T. M. Wong, “Comparative evaluation of online machine
translation systems with legal texts,” Law Library Journal, vol. 100, no.
2, 2008, pp. 299-321.
[7] G. J. F. Jones, F. Fantino, E. Newman, and Y. Zhang, “Domain-specific
query translation for multilingual information access using machine
translation augmented with dictionaries mined from Wikipedia,”
Proceedings of the Second International Workshop on “Cross Lingual
Information Access”, 2008, pp. 34-41.
[8] J. Salimi, “Machine Translation Of Fictional And Non-fictional Texts,”
Stockholm University Library, 2014, p. 16, available at:
http://www.diva-portal.org/smash/get/diva2:737887/FULLTEXT01.pdf
[9] S. Seljan, T. Vičić and M. Brkić, “BLEU evaluation of machine-
translated English-Croatian legislation,” Proceedings of the Eighth
International Conference on Language Resources and Evaluation
(LREC'12), 2012, pp. 2143-2148.
[10] E. D. Lange and J. Yang, “Automatic domain recognition for machine
translation”, Proceedings of the MT Summit VII, 1999, pp. 641-645.
[11] T. T. Soomro, G. Ahmad and M. Usman, “Google Translation service
issues: Religious text perspective,” Journal of Global Research in
Computer Science, vol. 4, no. 8, 2013, pp. 40-43.
[12] M. Braschler, D. Harman, M. Hess, M. Kluck, C. Peters and P.
Schäuble, “The evaluation of systems for cross-language information
retrieval,” Proceedings of the Second International Conference on
Language Resources and Evaluation (LREC-2000), 2000, p. 6.
[13] J. Smolčić and A. Valešić, “Legal contexts of digitization and
preservation of written heritage,” Proceedings of the INFuture2009
Digital Resources and Knowledge Sharing Conference, 2009, pp. 87-94.
[14] D. Lopresti, “Optical character recognition errors and their effects on
natural language processing,” International Journal on Document
Analysis and Recognition, vol. 12, no. 3, 2009, pp. 141-151.
[15] C. Callison-Burch, M. Osborne and P. Koehn, “Re-evaluating the role of
BLEU in machine translation research,” Proceedings of the 11th
Conference of the European Chapter of the Association for
Computational Linguistics, 2006, pp. 249-256.
[16] K. Papineni, S. Roukos, T. Ward and W. J. Zhu, “BLEU: a method for
automatic evaluation of machine translation,” Proceedings of the 40th
Annual Meeting on Association for Computational Linguistics, 2002,
pp. 311-318.
[17] G. Doddington, “Automatic evaluation of machine translation quality
using n-gram co-occurrence statistics,” Proceedings of the Second
Conference on Human Language Technology, 2002, pp. 128-132.
[18] A. Lavie, K. Sagae and S. Jayaraman, “The significance of Recall in
Automatic Metrics for MT Evaluation,” in Machine Translation: From
Real Users to Research, R. E. Frederking and K. B. Taylor, Eds. Berlin,
Heidelberg: Springer, 2004, pp. 134-143.
[19] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT
evaluation with improved correlation with human judgments,”
Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation
Measures for MT and/or Summarization at the 43rd Annual Meeting of
the Association of Computational Linguistics, 2005, pp. 65-72.
[20] M. Denkowski and A. Lavie, “Meteor 1.3: Automatic metric for reliable
optimization and evaluation of machine translation systems,”
Proceedings of the Sixth Workshop on Statistical Machine Translation,
(ACL), 2011, pp. 85-91.
[21] A. Agarwal and A. Lavie, “METEOR, M-BLEU and M-TER:
Evaluation metrics for high-correlation with human rankings of machine
translation output,” Proceedings of the ACL 2008 Workshop on
Statistical Machine Translation, 2008, pp. 115-118.
[22] Automated Community Content Editing PorTal (ACCEPT), “Analysis
of existing metrics and proposal for a task-oriented metric,” European
Community's FP7 project deliverable, 2012, available at:
http://cordis.europa.eu/docs/projects/cnect/9/288769/080/deliverables/00
1-D91Analysisofexistingmetricsandproposalofataskorientedmetric.pdf
[23] J. P. Turian, L. Shen and I. D. Melamed, “Evaluation of machine
translation and its evaluation”, Proceedings of the 9th Machine
Translation Summit, 2003, pp. 386-393.
[24] A. Lavie, “Evaluating the Output of Machine Translation Systems,”
AMTA Tutorial, 2010, p. 86, available at:
http://amta2010.amtaweb.org/AMTA/papers/6-04-
LavieMTEvaluation.pdf