Machine Translation Evaluation Metrics for Quality Assessment of Automatically Simplified Sentences

Maja Popović1 and Sanja Štajner2
1Humboldt University of Berlin, Germany
2Data and Web Science Research Group, University of Mannheim, Germany
1maja.popovic@hu-berlin.de, 2sanja@informatik.uni-mannheim.de
Abstract
We investigate whether it is possible to automatically evaluate the output of automatic text simplification (ATS) systems by using
automatic metrics designed for evaluation of machine translation (MT) outputs. In the first step, we select a set of the most promising
metrics based on Pearson's correlation coefficients between those metrics and human scores for the overall quality of automatically
simplified sentences. Next, we build eight classifiers on the training dataset using the subset of the 13 most promising metrics as features,
and apply the two best classifiers on the test set. Additionally, we apply an attribute selection algorithm to further select the best subset of
features for our classification experiments. Finally, we report on the success of our systems in the shared task and present confusion
matrices which can help to gain better insights into the most challenging problems of this task.
Keywords: text simplification, automatic evaluation, machine translation metrics
1. Introduction
Automatic text simplification (ATS) has gained considerable attention in the last twenty years. Many ATS sys-
tems have been proposed for various languages, e.g. En-
glish (Angrosh et al., 2014; Glavaš and Štajner, 2015),
Portuguese (Specia, 2010), Spanish (Saggion et al., 2015;
Štajner et al., 2015), French (Brouwers et al., 2014), Ital-
ian (Barlacchi and Tonelli, 2013), and Basque (Aranzabe et
al., 2012). The aim of ATS systems is to transform syntac-
tically and lexically complex sentences into their simpler
variants more accessible to wider audiences (non-native
speakers, children, people with learning disabilities, etc.).
The standard way of evaluating ATS systems is by hu-
man assessment of the quality of the generated sentences
in terms of their grammaticality, meaning preservation,
and simplicity (Woodsend and Lapata, 2011; Glavaš and Štajner, 2013; Saggion et al., 2015). The annotators are pre-
sented with pairs of original and automatically simplified
sentences (for meaning preservation), and with automati-
cally simplified sentences (for grammaticality and simplic-
ity), and asked to evaluate them on a 1–3 or 1–5 scale (where
lower score always denotes worse output).
In addition to this, some ATS systems are also evaluated by
readability metrics (on a text level), e.g. (Zhu et al., 2010;
Woodsend and Lapata, 2011; Glavaš and Štajner, 2013;
Saggion et al., 2015), or by machine translation (MT) eval-
uation metrics, such as BLEU (Papineni et al., 2002), NIST
(Doddington, 2002), or TER (Snover et al., 2006) in case of
the MT-based ATS systems (Specia, 2010; Zhu et al., 2010;
Woodsend and Lapata, 2011; Coster and Kauchak, 2011;
Štajner et al., 2015).
As any other human evaluation, the assessment of gram-
maticality (G), meaning preservation (M), and simplicity
(S) is a costly and time-consuming task. Therefore, au-
tomatic methods are needed in order to provide a faster
and more consistent evaluation. Despite this, the task has not attracted much attention so far. To the best of our knowledge, there has been only one work (Štajner et al., 2014) tackling this problem by assessing the potential
of several MT evaluation metrics: BLEU (Papineni et al.,
2002), TER (Snover et al., 2006), METEOR (Denkowski
and Lavie, 2011), and TINE (Rios et al., 2011). They
showed that all of them correlate well with the human
scores for meaning preservation, and some of them (BLEU
and TER) also have a good correlation with the human
scores for grammaticality (they did not investigate the cor-
relation of those metrics with the human scores for simplic-
ity). Štajner et al. (2014) further built several classifiers for
automatic assessment of grammaticality, meaning preserva-
tion, and their combination. One of the main limitations of
their work was the dataset used for training and testing. The
dataset1 was produced by the ATS system which performs syntactic simplification and content reduction (Glavaš and Štajner, 2013) and is, therefore, not the best representative
for most of the ATS systems (that usually perform syntactic
and/or lexical simplification and no content reduction).
In this work, we built upon previous work (Štajner et al.,
2014) on several levels:
1. We significantly extended the list of MT metrics, in-
cluding the MT evaluation metrics based on one of the
two usual approaches: n-gram matching or edit dis-
tance.
2. We calculated Pearson’s correlation of all MT met-
rics with human scores for grammaticality, meaning
preservation and simplicity (not only for grammatical-
ity and meaning preservation).
3. We preselected a subset of most promising MT metrics
based on their Pearson’s correlations with the human
scores.
4. We tested the usefulness of MT metrics on a larger and
more heterogeneous dataset (which also contains lexi-
cal simplifications and simplifications without content
reduction) provided for the shared task.
1 takelab.fer.hr/data/evsimplify/
Example 1 (modification: syntactic + content reduction; G = ok, M = ok, S = ok, O = ok)
Original: Mladic reportedly gave a thumbs-up and clapped to supporters in the court’s public gallery as the trial got under way.
Simple: Mladic gave a thumbs-up. Mladic clapped to supporters. The trial got under way.

Example 2 (modification: content reduction; G = bad, M = bad, S = good, O = bad)
Original: Philippine President Benigno Aquino said he was looking to end the standoff through diplomatic means.
Simple: Philippine President Benigno Aquino said.

Example 3 (modification: lexical; G = good, M = good, S = good, O = good)
Original: Her mother wanted her to leave school and marry, but she rebelled.
Simple: Her mother wanted her to leave school and marry, but she did not.

Example 4 (modification: dropping; G = bad, M = ok, S = ok, O = ok)
Original: The novel received favorable reviews in several major newspapers.
Simple: The received favorable reviews in several major newspapers.

Example 5 (modification: lexical + insertion; G = good, M = good, S = good, O = good)
Original: The place where Waste was executed is now the site of a Roman Catholic church.
Simple: The place where Joan Waste was killed is now the site of a church.

Table 1: Examples from the training dataset (differences between the original and simplified versions are presented in bold)
We calculated a total of 26 metrics on the provided training
dataset, using the original English sentences as references,
and their simplifications as hypotheses. In the next step,
we selected the most promising metrics based on Pearson’s
correlation coefficients between them and the assigned hu-
man scores for the overall quality of the sentences. Finally,
we used that subset of MT metrics to train eight classifiers,
out of which we submitted the two best ones to the shared
task. Additionally, we experimented with one attribute se-
lection algorithm and submitted the best of the two classi-
fiers trained on that subset of features.
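As a rough illustration of this setup, the snippet below scores one simplification from Table 1 against its original sentence, treating the original as the single reference and the simplification as the hypothesis. The sacrebleu package is used here only as a convenient stand-in (the paper does not state which implementations were used), and it covers just three of the 26 metrics.

import sacrebleu  # assumed available; not necessarily the toolkit used in the paper

original = ("Mladic reportedly gave a thumbs-up and clapped to supporters "
            "in the court's public gallery as the trial got under way.")
simplified = ("Mladic gave a thumbs-up. Mladic clapped to supporters. "
              "The trial got under way.")

# The original sentence serves as the single reference,
# the automatic simplification as the hypothesis.
bleu = sacrebleu.sentence_bleu(simplified, [original]).score
chrf = sacrebleu.sentence_chrf(simplified, [original]).score
ter = sacrebleu.sentence_ter(simplified, [original]).score

print(f"BLEU={bleu:.1f}  chrF={chrf:.1f}  TER={ter:.1f}")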
2. Shared Task Description
The shared task participants were provided with a training
dataset of 505 sentence pairs and a test set (without ‘gold
standard’ scores) of 126 sentence pairs from news articles
and Wikipedia articles. The automatically simplified sen-
tences were obtained by various automatic text simplifi-
cation systems and thus cover different simplification phe-
nomena (only lexical simplification, only syntactic simpli-
fication, mixture of lexical and syntactic simplification, con-
tent reduction, etc.).
The training dataset contained a quality label (good, ok, or
bad) for each sentence according to four aspects:
Grammaticality (G)
Meaning preservation (M)
Simplicity (S)
Overall (O)
where the overall score represents a combination of the previous three scores, which rewards meaning preservation and simplicity more than grammaticality.2
Several examples from the training dataset are presented in
Table 1. It can be noted that some of the changes to the orig-
inal sentence affect only one word (lexical changes, drop-
ping, or insertion), while others affect larger portions of the sentence (syntactic changes and content reduction).
The MT evaluation metrics calculate similarity between the
obtained translation and a reference translation. Higher similarity indicates better translation quality. For TS evaluation, which is a similar – but nevertheless distinct – task, we expected that the MT evaluation metrics could be good indicators of grammaticality and meaning preservation. We also expected that the simplicity aspect would be difficult to capture using the overall MT evaluation scores. However,
we thought that certain components, such as insertions or
deletions, might be able to capture relations between text
reduction and quality.
We did not investigate complex MT evaluation metrics
which require language dependent knowledge such as se-
mantic roles, etc. Most of the metrics we used are com-
pletely language independent, while some of them require
part-of-speech taggers and/or lemmatisers for the given lan-
guage.
3. MT Evaluation Metrics
We focused on 26 automatic MT metrics:
N-gram based metrics (9)
2http://qats2016.github.io/shared.html
BLEU – n-gram precision (Papineni et al., 2002);
METEOR – n-gram matching enhanced by using synonyms and stems (Denkowski and Lavie, 2011);
wordF, baseF, morphF, posF, chrF – F1 scores of word, base form, morpheme, POS tag (Popović, 2011b) and character n-grams (Popović, 2015); for word and character n-grams, F3 scores are investigated as well (wordF3, chrF3).
BLEU and METEOR are widely used for MT evalua-
tion, and the other metrics have shown very good cor-
relations with human judgments in recent years, espe-
cially the character n-gram F3 score (chrF3) for mor-
phologically rich languages.
Edit-distance based metrics (4)
WER – Levenshtein (edit) distance (Levenshtein,
1966);
TER – modified edit distance taking into account
shifts of word sequences (Snover et al., 2006);
Serr – sum of word-level error rates provided by Hjerson, an automatic tool for MT error analysis (Popović, 2011a);
bSerr – Hjerson’s sum of block-level error rates.
WER is the basic edit-distance metric, widely used in speech recognition and in the early days of machine translation development. It was later replaced by TER, which shows better correlations with human judgments as it does not penalise small differences as much as WER. Recently, Hjerson's error rates have
also shown good correlations with human judgments.
Components of edit-distance based metrics (13)
WER and TER substitutions, deletions and insertions (wer-sub, ter-sub, wer-del, ter-del, wer-ins, ter-ins);
TER shifts and word shifts (ter-sh, ter-wsh);
Hjerson’s error classes:
Inflectional errors (infl)
Reordering errors (reord)
Missing words (miss)
Extra words (ext)
Lexical errors (lex).
It should be noted that the n-gram based metrics represent
scores, i.e. the higher the value, the more similar the seg-
ments, whereas edit-distance based metrics represent error
rates, i.e. the lower the value, the greater the similarity between the segments.
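To make the two metric families concrete, the sketch below gives minimal reference implementations of a word error rate (word-level Levenshtein distance normalised by reference length) and a simplified character n-gram F-beta score in the spirit of chrF3. These are illustrative re-implementations under simplifying assumptions (e.g. whitespace is stripped before extracting character n-grams), not the exact tools used in the experiments.

from collections import Counter

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

def char_ngram_f(reference, hypothesis, max_n=6, beta=3.0):
    """Simplified character n-gram F-beta score (whitespace removed,
    uniform average of n-gram precisions and recalls over n = 1..max_n)."""
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        overlap = sum((ref_ngrams & hyp_ngrams).values())
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
        recalls.append(overlap / max(sum(ref_ngrams.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    return 0.0 if p + r == 0 else (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(wer("the cat sat on the mat", "the cat sat on mat"))            # one deletion -> ~0.17
print(char_ngram_f("the cat sat on the mat", "the cat sat on mat"))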
4. Correlations with Human Scores
The first step in exploring those 26 MT metrics consisted in
calculating Pearson’s correlation coefficients between each
metric and human scores for four aspects (grammaticality,
meaning preservation, simplicity and overall score), which
are presented in Table 2. The metrics are sorted from best to worst according to their correlation with the human scores for the overall quality of the sentences.
Metric      Overall      G        M        S
posF 0.288 0.386 0.559 -0.043
wer-del -0.265 -0.297 -0.542 0.097
ter-del -0.257 -0.264 -0.496 0.059
miss -0.252 -0.267 -0.529 0.117
BLEU 0.251 0.326 0.582 -0.122
wordF3 0.250 0.333 0.588 -0.124
baseF 0.248 0.340 0.578 -0.111
morphF 0.239 0.332 0.559 -0.104
WER -0.237 -0.302 -0.543 0.104
chrF3 0.231 0.303 0.575 -0.147
TER -0.229 -0.296 -0.539 0.119
METEOR 0.228 0.262 0.527 -0.140
Serr -0.223 -0.276 -0.536 0.127
wordF 0.219 0.316 0.552 -0.129
chrF 0.216 0.299 0.549 -0.140
ter-sh -0.155 -0.261 -0.186 -0.175
ter-wsh -0.151 -0.150 -0.157 0.052
reord -0.142 -0.172 -0.156 -0.124
bSerr -0.097 -0.222 -0.345 0.097
infl -0.033 -0.024 -0.136 0.104
ter-ins -0.025 -0.111 -0.069 -0.034
ter-sub -0.021 -0.052 -0.130 -0.129
wer-sub -0.001 -0.029 -0.149 0.085
wer-ins 0.016 -0.029 0.020 -0.058
lex 0.065 -0.012 -0.119 0.140
ext 0.074 -0.025 0.014 0.013
Table 2: Pearson’s correlation coefficients between auto-
matic metrics and human scores.
First of all, it can be confirmed that the TS evaluation task,
although similar to the MT evaluation task, requires differ-
ent approaches to evaluation of certain aspects. The MT
evaluation metrics show the best correlations for the mean-
ing preservation where a number of metrics have correla-
tions over 0.5. Grammaticality seems to be a more difficult aspect (the MT metrics achieve a maximum correlation of about 0.390). Simplicity is, as intuitively expected, the most difficult aspect. While in MT evaluation the similarity between the reference and the hypothesis should be rewarded, this is not the case for simplicity. The correlations with the overall score are not greater than 0.290; this score is also difficult for the MT metrics since it takes all aspects (including simplicity) into account.
Contrary to expectations, simplicity is not well captured by deletions/omissions. Those metrics, instead, have a rather high correlation with the meaning preservation scores, probably because text reduction can affect the meaning.
For building the classifiers, we selected only those metrics which had a correlation greater than 0.200 with the overall score, i.e. the first 13 rows in Table 2.3
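A minimal sketch of this preselection step is shown below, assuming each metric has already been computed per sentence and that the good/ok/bad overall labels are mapped to numbers (the exact mapping used for the correlations is not stated in the paper; 2/1/0 is an assumption here).

import numpy as np
from scipy.stats import pearsonr

def preselect_metrics(metric_values, overall_labels, threshold=0.2):
    """Keep metrics whose Pearson correlation with the overall human score
    exceeds the threshold in absolute value (error-rate metrics correlate
    negatively), sorted by |r|."""
    label_map = {"bad": 0, "ok": 1, "good": 2}  # assumed numeric encoding of the labels
    y = np.array([label_map[label] for label in overall_labels])
    selected = {}
    for name, values in metric_values.items():
        r, _ = pearsonr(np.asarray(values, dtype=float), y)
        if abs(r) > threshold:
            selected[name] = r
    return dict(sorted(selected.items(), key=lambda kv: abs(kv[1]), reverse=True))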
5. Classification Experiments
After selection of the 13 best correlating features, for each
of the four aspects (G, M, S, and Overall), we trained eight
different classifiers implemented in Weka Experimenter
(Hall et al., 2009):
1. Log – Logistic Regression (le Cessie and van
Houwelingen, 1992)
2. NB – Naïve Bayes (John and Langley, 1995)
3. SVM-n – Support Vector Machines with feature nor-
malisation
4. SVM-s – Support Vector Machines with feature standardisation
5. IBk – K-nearest neighbours (Aha and Kibler, 1991)
6. JRip – a propositional rule learner (Cohen, 1995)
7. J48 – C4.5 decision tree (Quinlan, 1993)
8. RandF – Random Forest (Breiman, 2001)
All experiments were conducted in a 10-fold cross-
validation setup with 10 repetitions, using the provided
training dataset of 505 sentence pairs.4
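The original experiments were run in Weka; the sketch below reproduces an approximate counterpart of the same protocol in scikit-learn (10x10-fold cross-validation, weighted F-score). The classifier choices are rough analogues only (a CART tree stands in for C4.5, and JRip has no direct scikit-learn equivalent), so the numbers would not match Table 3 exactly.

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def evaluate_classifiers(X, y):
    """10x10-fold cross-validation with weighted F-score; X holds the 13
    preselected metric values per sentence, y the good/ok/bad labels."""
    classifiers = {
        "Log": LogisticRegression(max_iter=1000),
        "NB": GaussianNB(),
        "SVM-n": make_pipeline(MinMaxScaler(), SVC()),     # feature normalisation
        "SVM-s": make_pipeline(StandardScaler(), SVC()),   # feature standardisation
        "IBk": KNeighborsClassifier(),                     # k-nearest neighbours
        "J48": DecisionTreeClassifier(),                   # CART stand-in for C4.5
        "RandF": RandomForestClassifier(),
    }
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    return {name: cross_val_score(clf, X, y, cv=cv, scoring="f1_weighted").mean()
            for name, clf in classifiers.items()}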
Classifier      G        S        M        Overall
Log 0.710 0.452 0.641 0.493
NB 0.653 0.416 0.628 0.373
SVM-n 0.652 0.363 0.633 0.424
SVM-s 0.655 0.363 0.653 0.494
IBk 0.737 0.532 0.655 0.530
JRip 0.671 0.458 0.620 0.461
J48 0.718 0.507 0.609 0.470
RandF 0.747 0.519 0.653 0.499
Majority class 0.652 0.363 0.428 0.240
Table 3: Results of the classification experiments (weighted
F-score). The two best results for each aspect (G, S, M, and
Overall) are presented in bold.
5.1. Results on the Training Dataset
The results of the classification experiments are presented
in Table 3. The majority-class baseline was already quite
high for the grammaticality aspect (G). Nevertheless, the two
best classification algorithms (IBk and RandF) significantly
outperformed that baseline, as well as the majority-class
baseline on all other aspects (S, M, and Overall).
3Although wordF and chrF scores also fulfill this criterion,
they are not used in the best set because their F3 versions showed
better performance.
4http://qats2016.github.io/shared.html
5.2. Feature Selection
We further selected a subset of the best features using the CfsSubsetEval attribute selection algorithm (Hall and Smith, 1998) implemented in Weka by applying it to the whole
training dataset (in a 10-fold cross-validation setup). Next,
for each aspect, we trained a classifier (the most successful
one for that aspect according to the results in Table 3) only
on that subset of features. The CfsSubsetEval attribute se-
lection algorithm uses a correlation-based approach to the
feature selection problem, following the idea that “good
feature sets contain features that are highly correlated with
the class, yet uncorrelated with each other” (Hall, 1999).
On small datasets, the CfsSubsetEval gives results similar
to, or better than, those obtained by using a wrapper (Hall,
1999).
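CfsSubsetEval itself is a Weka component; the sketch below only illustrates the underlying merit heuristic from Hall (1999) combined with a simple greedy forward search, assuming a numeric feature matrix X (one column per metric) and numerically encoded labels y. It is a simplified approximation, not a re-implementation of the Weka algorithm.

import numpy as np

def cfs_merit(X, y, subset):
    """Correlation-based merit of a feature subset (after Hall, 1999):
    k * mean|r_cf| / sqrt(k + k*(k-1) * mean|r_ff|)."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    """Greedy forward search: keep adding the feature that most improves the merit."""
    remaining = list(range(X.shape[1]))
    selected, best = [], 0.0
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = merit
    return selected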
The CfsSubsetEval attribute selection algorithm returned
the following subsets of best features:
For Grammaticality (G): {BLEU, METEOR, TER, WER, chrF3, morphF, posF, ter-d, wer-d} – a total of 9 features
For Simplicity (S): {BLEU, chrF3, morphF} – a total of 3 features
For Meaning preservation (M): all except Serr – a total of 12 features
For Overall: {BLEU, WER, chrF3, posF, miss, wer-d} – a total of 6 features
5.3. Results on the Test Dataset
We submitted three runs for each aspect to the shared task.
The first two runs were the two classifiers which led to the
best results of the 10-fold cross-validation experiments on
the training dataset, the IBk and RandF classifiers. In the
third run, we built a classifier – either IBk or RandF, de-
pending on which of them was more successful in the cross-
validation experiments on the training dataset (the IBk for
Overall, M, and S, and RandF for G) – using only the sub-
set of features returned by the CfsSubsetEval attribute se-
lection algorithm (Section 5.2.).
The results of all three runs are presented in Table 4, to-
gether with the majority-class baseline, which turned out to
be a very strong baseline for this task (Štajner et al., 2016).
On the task of predicting the meaning preservation score
(M), all three of our systems outperformed the majority-
class baseline in terms of accuracy and the weighted aver-
age F-score. On the task of predicting the overall score,
all our systems achieved higher weighted F-scores than the
majority-class baseline. On the other two tasks, predicting
grammaticality and simplicity, only some of our systems
(two for grammaticality, and one for simplicity) succeeded
in outperforming the majority-class baseline in terms of
the weighted F-score, and none of the systems achieved a
higher accuracy score than the majority-class baseline.
Among all our systems, the Random Forest classification
algorithm with all 13 preselected features (RandF) achieved
the best results on all four tasks.
System Grammaticality Meaning Simplicity Overall
accuracy weighted-F accuracy weighted-F accuracy weighted-F accuracy weighted-F
Run 1 (IBk) 60.32 0.620 61.90 0.615 38.10 0.381 38.10 0.383
Run 2 (RandF) 72.22 0.675 66.67 0.656 49.21 0.478 39.68 0.398
Run 3 (best) 69.84 0.667 62.70 0.617 37.30 0.377 38.10 0.382
Majority class 76.19 0.659 57.94 0.425 55.56 0.397 43.65 0.265
Table 4: Results on the shared task (performances better than those of the baseline are presented in bold)
5.4. Error Analysis
In order to better understand which sentences pose the most
difficulties in these classification tasks, the confusion ma-
trices for our best systems in each of the four aspects (ac-
cording to Table 4) are presented in Table 5.
Table 5: Confusion matrices (RandF using all 13 features)
Aspect Predicted Actual class
good ok bad
Grammaticality
good 87 9 14
ok 8 4 2
bad 1 1 0
Meaning
good 61 11 7
ok 9 17 7
bad 3 5 6
Simplicity
good 47 22 10
ok 18 12 5
bad 5 4 3
Overall
good 15 13 6
ok 13 23 17
bad 8 19 12
In most misclassification cases (all except those for the ok sentences for the Overall aspect), our systems show a tendency to assign a higher class than the actual one. This is particularly pronounced in classification according to sentence grammaticality and simplicity, where more than half of the sentences that should have been classified as bad were classified as good.
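Confusion matrices and weighted F-scores of this kind can be reproduced directly from the gold and predicted labels; a small sketch with made-up labels is shown below. Note that scikit-learn places actual classes in rows and predicted classes in columns, i.e. the transpose of the layout used in Table 5 (predicted classes in rows).

from sklearn.metrics import confusion_matrix, f1_score

labels = ["good", "ok", "bad"]
# Made-up gold and predicted labels for a handful of test sentences (illustration only).
y_true = ["good", "good", "ok", "bad", "ok", "bad"]
y_pred = ["good", "ok", "good", "ok", "ok", "good"]

print(confusion_matrix(y_true, y_pred, labels=labels))
print("weighted F-score:", round(f1_score(y_true, y_pred, labels=labels, average="weighted"), 3))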
6. Conclusions
Automatic evaluation of the quality of sentences produced
by automatic text simplification (ATS) systems is an im-
portant task which could significantly speed up the evaluation
process and offer a fairer comparison among different ATS
systems. Nevertheless, it has hardly been addressed so far.
In this paper, we reported on results of our classification
systems, which were submitted to the QATS shared task
on quality assessment for text simplification. The proposed
feature sets were based on the use of standard – as well as
some more recent – machine translation (MT) evaluation
metrics.
More importantly, we explored the correlation of 26 MT
evaluation metrics with human scores for grammaticality,
meaning preservation, simplicity, and overall quality of au-
tomatically simplified sentences. The results revealed some
important differences between evaluation tasks in MT and
TS, which may seem very similar at first sight. They indi-
cated that it is necessary to propose different, TS-specific,
features in order to better assess the simplicity of automat-
ically simplified sentences.
7. References
Aha, D. and Kibler, D. (1991). Instance-based learning al-
gorithms. Machine Learning, 6:37–66.
Angrosh, M., Nomoto, T., and Siddharthan, A. (2014).
Lexico-syntactic text simplification and compression
with typed dependencies. In Proceedings of the 25th
International Conference on Computational Linguistics
(COLING): Technical Papers, pages 1996–2006, Dublin,
Ireland. ACL.
Aranzabe, M. J., Díaz De Ilarraza, A., and González, I.
(2012). First Approach to Automatic Text Simplification
in Basque. In Proceedings of the first Natural Language
Processing for Improving Textual Accessibility Workshop
(NLP4ITA).
Barlacchi, G. and Tonelli, S. (2013). ERNESTA: A Sen-
tence Simplification Tool for Children’s Stories in Ital-
ian. In Computational Linguistics and Intelligent Text
Processing, LNCS 7817, pages 476–487.
Breiman, L. (2001). Random Forests. Machine Learning,
45(1):5–32.
Brouwers, L., Bernhard, D., Ligozat, A.-L., and François, T. (2014). Syntactic sentence simplification for French.
In Proceedings of the 3rd Workshop on Predicting and
Improving Text Readability for Target Reader Popula-
tions (PITR), pages 47–56.
Cohen, W. W. (1995). Fast Effective Rule Induction. In
Proceedings of the Twelfth International Conference on
Machine Learning, pages 115–123.
Coster, W. and Kauchak, D. (2011). Learning to Simplify
Sentences Using Wikipedia. In Proceedings of the 49th
Annual Meeting of the Association for Computational
Linguistics (ACL), pages 1–9. ACL.
Denkowski, M. and Lavie, A. (2011). Meteor 1.3: Auto-
matic Metric for Reliable Optimization and Evaluation
of Machine Translation Systems. In Proceedings of the
EMNLP Workshop on Statistical Machine Translation,
pages 85–91.
Doddington, G. (2002). Automatic evaluation of machine
translation quality using n-gram coocurrence statistics.
In Proceedings of the second international conference on
Human Language Technology Research, pages 138–145.
Morgan Kaufmann Publishers Inc.
Glavaš, G. and Štajner, S. (2013). Event-Centered Simplification of News Stories. In Proceedings of the Student
cation of News Stories. In Proceedings of the Student
Workshop held in conjunction with RANLP 2013, Hissar,
Bulgaria, pages 71–78.
Glavaš, G. and Štajner, S. (2015). Simplifying Lexical
Simplification: Do We Need Simplified Corpora? In
Proceedings of the 53rd Annual Meeting of the Associ-
ation for Computational Linguistics and the 7th Interna-
tional Joint Conference on Natural Language Processing
(Volume 2: Short Papers), pages 63–68. ACL.
Hall, M. A. and Smith, L. A. (1998). Practical feature
subset selection for machine learning. In Proceedings
of the 21st Australasian Computer Science Conference
(ACSC), pages 181–191. Berlin: Springer.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reute-
mann, P., and Witten, I. H. (2009). The WEKA data
mining software: an update. SIGKDD Explor. Newsl.,
11:10–18.
Hall, M. A. (1999). Correlation-based Feature Selection
for Machine Learning. Ph.D. thesis, The University of
Waikato. Hamilton, New Zealand.
John, G. H. and Langley, P. (1995). Estimating Continuous
Distributions in Bayesian Classifiers. In Proceedings of
the Eleventh Conference on Uncertainty in Artificial In-
telligence, pages 338–345.
le Cessie, S. and van Houwelingen, J. (1992). Ridge
Estimators in Logistic Regression. Applied Statistics,
41(1):191–201.
Levenshtein, V. I. (1966). Binary Codes Capable of
Correcting Deletions, Insertions and Reversals. Soviet
Physics Doklady, 10(8):707–710.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
BLEU: a Method for Automatic Evaluation of Machine
Translation. In Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics (ACL-
02), Philadelphia, PA.
Popović, M. (2011a). Hjerson: An Open Source Tool for
Automatic Error Classification of Machine Translation
Output. The Prague Bulletin of Mathematical Linguis-
tics, (96):59–68.
Popović, M. (2011b). Morphemes and POS tags for
n-gram based evaluation metrics. In Proceedings of
the Sixth Workshop on Statistical Machine Translation
(WMT 2011), pages 104–107, Edinburgh, Scotland, July.
Popović, M. (2015). chrF: character n-gram F-score for au-
tomatic MT evaluation. In Proceedings of the 10th Work-
shop on Statistical Machine Translation, pages 392–395,
Lisbon, Portugal, September. Association for Computa-
tional Linguistics.
Quinlan, R. (1993). C4.5: Programs for Machine Learn-
ing. Morgan Kaufmann Publishers, San Mateo, CA.
Rios, M., Aziz, W., and Specia, L. (2011). TINE: A met-
ric to assess MT adequacy. In Proceedings of the Sixth
Workshop on Statistical Machine Translation (WMT-
2011), Edinburgh, UK, pages 116–122.
Saggion, H., Štajner, S., Bott, S., Mille, S., Rello, L., and
Drndarevic, B. (2015). Making It Simplext: Implemen-
tation and Evaluation of a Text Simplification System for
Spanish. ACM Transactions on Accessible Computing,
6(4):14:1–14:36.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and
Makhoul, J. (2006). A Study of Translation Error Rate
with Targeted Human Annotation. In Proceedings of the
7th Conference of the Association for Machine Transla-
tion in the Americas (AMTA-06), Boston, MA.
Specia, L. (2010). Translating from complex to simplified
sentences. In Proceedings of the 9th international con-
ference on Computational Processing of the Portuguese
Language (PROPOR), volume 6001 of Lecture Notes in
Computer Science, pages 30–39. Springer.
Štajner, S., Mitkov, R., and Saggion, H. (2014). One Step
Closer to Automatic Evaluation of Text Simplification
Systems. In Proceedings of the 3rd Workshop on Pre-
dicting and Improving Text Readability for Target Reader
Populations (PITR) at EACL.
Štajner, S., Calixto, I., and Saggion, H. (2015). Automatic
Text Simplification for Spanish: Comparative Evaluation
of Various Simplification Strategies. In Proceedings of
the International Conference Recent Advances in Natu-
ral Language Processing, pages 618–626, Hissar, Bul-
garia.
Štajner, S., Popović, M., Saggion, H., Specia, L., and Fishel, M. (2016). Shared Task on Quality Assessment for Text Simplification. In Proceedings of the LREC
Workshop on Quality Assessment for Text Simplification
(QATS).
Woodsend, K. and Lapata, M. (2011). Learning to Sim-
plify Sentences with Quasi-Synchronous Grammar and
Integer Programming. In Proceedings of the 2011 Con-
ference on Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 409–420.
Zhu, Z., Bernhard, D., and Gurevych, I. (2010). A
Monolingual Tree-based Translation Model for Sen-
tence Simplification. In Proceedings of the 23rd Inter-
national Conference on Computational Linguistics (Col-
ing), pages 1353–1361.
    This paper describes Meteor 1.3, our submission to the 2011 EMNLP Workshop on Statistical Machine Translation automatic evaluation metric tasks. New metric features include improved text normalization, higher-precision paraphrase matching, and discrimination between content and function words. We include Ranking and Adequacy versions of the metric shown to have high correlation with human judgments of translation quality as well as a more balanced Tuning version shown to outperform BLEU in minimum error rate training for a phrase-based Urdu-English system.