Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 96–106,
Edinburgh, Scotland, UK, July 27–31, 2011. © 2011 Association for Computational Linguistics
A Generate and Rank Approach to Sentence Paraphrasing
Prodromos Malakasiotis∗ and Ion Androutsopoulos∗+
∗Department of Informatics, Athens University of Economics and Business, Greece
+Digital Curation Unit – IMIS, Research Centre “Athena”, Greece
Abstract
We present a method that paraphrases a given
sentence by first generating candidate para-
phrases and then ranking (or classifying)
them. The candidates are generated by ap-
plying existing paraphrasing rules extracted
from parallel corpora. The ranking compo-
nent considers not only the overall quality of
the rules that produced each candidate, but
also the extent to which they preserve gram-
maticality and meaning in the particular con-
text of the input sentence, as well as the de-
gree to which the candidate differs from the
input. We experimented with both a Maximum
Entropy classifier and an SVR ranker.
Experimental results show that incorporating
features from an existing paraphrase recog-
nizer in the ranking component improves per-
formance, and that our overall method com-
pares well against a state of the art paraphrase
generator, when paraphrasing rules apply to
the input sentences. We also propose a new
methodology to evaluate the ranking compo-
nents of generate-and-rank paraphrase gener-
ators, which evaluates them across different
combinations of weights for grammaticality,
meaning preservation, and diversity. The pa-
per is accompanied by a paraphrasing dataset
we constructed for evaluations of this kind.
1 Introduction
In recent years, significant effort has been devoted
to research on paraphrasing (Androutsopoulos and
Malakasiotis, 2010; Madnani and Dorr, 2010). The
methods that have been proposed can be roughly
classified into three categories: (i) recognition meth-
ods, i.e., methods that detect whether or not two in-
put sentences or other texts are paraphrases; (ii) gen-
eration methods, where the aim is to produce para-
phrases of a given input sentence; and (iii) extraction
methods, which aim to extract paraphrasing rules
(e.g., “X wrote Y” ↔ “Y was authored by X”) or
similar patterns from corpora. Most of the methods
that have been proposed belong in the first category,
possibly because of the thrust provided by related
research on textual entailment recognition (Dagan et
al., 2009), where the goal is to decide whether or not
the information of a given text is entailed by that of
another. Significant progress has also been made in
paraphrase extraction, where most recent methods
produce large numbers of paraphrasing rules from
multilingual parallel corpora (Bannard and Callison-
Burch, 2005; Callison-Burch, 2008; Zhao et al.,
2008; Zhao et al., 2009a; Zhao et al., 2009b; Kok
and Brockett, 2010). In this paper, we are concerned
with paraphrase generation, which has received less
attention than the other two categories.
There are currently two main approaches to para-
phrase generation. The first one treats paraphrase
generation as a machine translation problem, with
the peculiarity that the target language is the same as
the source one. To bypass the lack of large monolin-
gual parallel corpora, which are needed to train sta-
tistical machine translation (SMT) systems for para-
phrasing, monolingual clusters of news articles re-
ferring to the same event (Quirk et al., 2004) or
other similar monolingual comparable corpora can
be used, though sentence alignment methods for par-
allel corpora may perform poorly on comparable
corpora (Nelken and Shieber, 2006); alternatively,
large collections of paraphrasing rules obtained via
paraphrase extraction from multilingual parallel cor-
pora can be used as a monolingual phrase table in a
phrase-based SMT system (Zhao et al., 2008; Zhao
et al., 2009a); in both cases, paraphrases can then
be generated by invoking an SMT system’s decoder
(Koehn, 2009). A second paraphrase generation ap-
proach is to treat existing machine translation en-
gines as black boxes, and translate each input sen-
tence to a pivot language and then back to the orig-
inal language (Duboue and Chu-Carroll, 2006). An
extension of this approach uses multiple translation
engines and pivot languages (Zhao et al., 2010).
In this paper, we investigate a different paraphrase
generation approach, which does not produce para-
phrases by invoking machine translation system(s).
We use an existing collection of monolingual para-
phrasing rules extracted from multilingual parallel
corpora (Zhao et al., 2009b); each rule is accompa-
nied by one or more scores, intended to indicate the
rule’s overall quality without considering particular
contexts where the rule may be applied. Instead of
using the rules as a monolingual phrase table and in-
voking an SMT system’s decoder, we follow a generate and rank approach, which is increasingly common in several language processing tasks.¹ Given
an input sentence, we use the paraphrasing rules to
generate a large number of candidate paraphrases.
The candidates are then represented as feature vec-
tors, and a ranker (or classifier) selects the best ones;
we experimented with a Maximum Entropy classi-
fier and a Support Vector Regression (SVR) ranker.
The vector of each candidate paraphrase includes
features indicating the overall quality of the rules
that produced the candidate, the extent to which the
rules preserve grammaticality and meaning in the
particular context of the input sentence, and the de-
gree to which the candidate’s surface form differs
from that of the input; we call the latter factor di-
versity. The intuition is that a good paraphrase is
grammatical, preserves the meaning of the original
sentence, while also being as different as possible.
Experimental results show that including in the
ranking (or classification) component features from
an existing paraphrase recognizer leads to improved
results. We also propose a new methodology to evaluate the ranking components of generate-and-rank paraphrase generators, which evaluates them across different combinations of weights for grammaticality, meaning preservation, and diversity. The paper is accompanied by a new publicly available paraphrasing dataset we constructed for evaluations of this kind. Further experiments indicate that when paraphrasing rules apply to the input sentences, our paraphrasing method is competitive with a state of the art paraphrase generator that uses multiple translation engines and pivot languages (Zhao et al., 2010).

¹ See, for example, Collins and Koo (2005).
We note that paraphrase generation is useful in
several language processing tasks. In question an-
swering, for example, paraphrase generators can be
used to paraphrase the user’s queries (Duboue and
Chu-Carroll, 2006; Riezler and Liu, 2010); and
in machine translation, paraphrase generation can
help improve the translations (Callison-Burch et al.,
2006; Marton et al., 2009; Mirkin et al., 2009; Mad-
nani et al., 2007), or it can be used when evaluat-
ing machine translation systems (Lepage and De-
noual, 2005; Zhou et al., 2006; Kauchak and Barzi-
lay, 2006; Padó et al., 2009).
The remainder of this paper is structured as fol-
lows: Section 2 explains how our method gener-
ates candidate paraphrases; Section 3 introduces the
dataset we constructed, which is also used in sub-
sequent sections; Section 4 discusses how candi-
date paraphrases are ranked; Section 5 compares our
overall method to a state of the art paraphrase gen-
erator; and Section 6 concludes.
2 Generating candidate paraphrases
We use the approximately one million English paraphrasing rules of Zhao et al. (2009b). Roughly speaking, the rules were extracted from a parallel English-Chinese corpus, based on the assumption that two English phrases e1 and e2 that are often aligned to the same Chinese phrase c are likely to be paraphrases and, hence, they can be treated as a paraphrasing rule e1 ↔ e2.² Zhao et al.’s method actually operates on slotted English phrases, obtained from parse trees, where slots correspond to part of speech (POS) tags. Hence, rules like the following three may be obtained, where NNi indicates a noun slot and NNPi a proper name slot.

² This pivot-based paraphrase extraction approach was first proposed by Bannard and Callison-Burch (2005). It underlies several other paraphrase extraction methods (Riezler et al., 2007; Callison-Burch, 2008; Kok and Brockett, 2010).
(1) a lot of NN1 ↔ plenty of NN1
(2) NNP1 area ↔ NNP1 region
(3) NNP1 wrote NNP2 ↔ NNP2 was written by NNP1
In the basic form of their method, called Model 1, Zhao et al. (2009b) use a log-linear ranker to assign scores to candidate English paraphrase pairs ⟨e1, e2⟩; the ranker uses the alignment probabilities P(c|e1) and P(e2|c) as features, along with features that assess the quality of the corresponding alignments. In an extension of their method, Model 2, Zhao et al. consider two English phrases e1 and e2 as paraphrases, if they are often aligned to two Chinese phrases c1 and c2, which are themselves paraphrases according to Model 1 (with English used as the pivot language). Again, a log-linear ranker assigns a score to each ⟨e1, e2⟩ pair, now with P(c1|e1), P(c2|c1), and P(e2|c1) as features, along with similar features for alignment quality. In a further extension, Model 3, all the candidate phrase pairs ⟨e1, e2⟩ are collectively treated as a monolingual parallel corpus. The phrases of the corpus are aligned, as when aligning a bilingual parallel corpus, and additional features, based on the alignment, are added to the log-linear ranker, which again assigns a score to each ⟨e1, e2⟩.
The resulting paraphrasing rules e1 ↔ e2 typically contain short phrases (up to four or five words excluding slots) on each side; hence, they can be used to rewrite only parts of longer sentences. Given an input (source) sentence S, we generate candidate paraphrases by applying rules whose left or right hand side matches any part of S. For example, rule (1) matches the source sentence (4); hence, (4) can be rewritten as the candidate paraphrase (5).³

(4) S: He had a lot of [NN1 admiration] for his job.
(5) C: He had plenty of [NN1 admiration] for his job.
Several rules may apply to S; for example, they may rewrite different parts of S, or they may replace the same parts of S by different phrases. We allow all possible combinations of applicable rules to apply to S, excluding combinations that include rules rewriting overlapping parts of S.⁴ To avoid generating too many candidates (C), we use only the 20 rules (that apply to S) with the highest scores. Zhao et al. actually associate each rule with three scores. The first one, hereafter called r1, is the Model 1 score, and the other two, r2 and r3, are the forward and backward alignment probabilities of Model 3; see Zhao et al. (2009b) for details. We use the average of the three scores, hereafter r4, when generating candidates.

³ We use Stanford’s POS tagger, MaxEnt classifier, and dependency parser; see http://nlp.stanford.edu/.
⁴ A possible extension, which we have not explored, would be to recursively apply the same process to the resulting Cs.
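To make the generation step concrete, the following minimal sketch (in Python) enumerates candidate paraphrases from a source sentence and a set of rules. It is only an illustration under simplifying assumptions: rule sides are plain token sequences, slot handling such as NN1 or NNP1 is omitted, and the exhaustive enumeration of rule combinations is not optimized.

    from itertools import combinations

    def find_matches(tokens, rules):
        """Return (start, end, replacement, r4) spans where a rule's left-hand
        side matches a contiguous part of the token list. Rule sides are plain
        token tuples here; slot handling (NN1, NNP1, ...) is omitted."""
        matches = []
        for lhs, rhs, r4 in rules:
            n = len(lhs)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == lhs:
                    matches.append((i, i + n, rhs, r4))
        return matches

    def generate_candidates(tokens, rules, max_rules=20):
        """Apply all combinations of non-overlapping rule matches, keeping only
        the 20 applicable matches with the highest average score r4."""
        matches = sorted(find_matches(tokens, rules), key=lambda m: -m[3])[:max_rules]
        candidates = set()
        for k in range(1, len(matches) + 1):
            for combo in combinations(matches, k):
                spans = sorted(combo)
                # skip combinations whose spans rewrite overlapping parts of S
                if any(spans[j][1] > spans[j + 1][0] for j in range(len(spans) - 1)):
                    continue
                out, prev = [], 0
                for start, end, rhs, _ in spans:
                    out.extend(tokens[prev:start])
                    out.extend(rhs)
                    prev = end
                out.extend(tokens[prev:])
                candidates.add(" ".join(out))
        return candidates

    # toy usage with rule (1): "a lot of NN1" -> "plenty of NN1"
    rules = [(("a", "lot", "of"), ("plenty", "of"), 0.9)]
    print(generate_candidates("He had a lot of admiration for his job .".split(), rules))

With the toy rule above, the sketch turns sentence (4) into candidate (5).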
Unfortunately, Zhao et al.’s scores reflect the overall quality of each rule, without considering the context of the particular S where the rule is applied. Szpektor et al. (2008) point out that, for example, a rule like “X acquire Y” ↔ “X buy Y” may work well in many contexts, but not in “Children acquire language quickly”. Similarly, “X charged Y with” ↔ “X accused Y of” should not be applied to sentences about charging batteries. Szpektor et al. propose, roughly speaking, to associate each rule with a model of the contexts where the rule is applicable, as well as models of the expressions that typically fill its slots, in order to be able to assess the applicability of each rule in specific contexts. The rules that we use do not have associated models of this kind, but we follow Szpektor et al.’s idea of assessing the applicability of each rule in each particular context, when ranking candidates, as discussed below.
3 A dataset of candidate paraphrases
Our generate and rank method relies on existing large collections of paraphrasing rules to generate candidate paraphrases. Our main contribution is in the ranking of the candidates. To be able to evaluate the performance of different rankers in the task we are concerned with, we first constructed an evaluation dataset that contains pairs ⟨S, C⟩ of source (input) sentences and candidate paraphrases, and we asked human judges to assess the degree to which the C of each pair was a good paraphrase of S.
We randomly selected 75 source (S) sentences from the AQUAINT corpus, such that at least one of the paraphrasing rules applied to each S.⁵ For each S, we generated candidate Cs using Zhao et al.’s rules, as discussed in Section 2. This led to 1,935 ⟨S, C⟩ pairs, approx. 26 pairs for each S. The pairs were given to 13 judges other than the authors.⁶ Each judge evaluated approx. 148 (different) ⟨S, C⟩ pairs; each of the 1,935 pairs was evaluated by one judge. The judges were asked to provide grammaticality, meaning preservation, and overall paraphrase quality scores for each ⟨S, C⟩ pair, each score on a 1–4 scale (1 for totally unacceptable, 4 for perfect); guidelines and examples were also provided.

⁵ The corpus is available from the LDC (LDC2002T31).
⁶ The judges were fluent, but not native English speakers.

Figure 1: Distribution of overall quality scores in the evaluation dataset (1 = totally unacceptable, 4 = perfect).
Figure 1 shows the distribution of the overall quality scores in the 1,935 ⟨S, C⟩ pairs of the evaluation dataset; the distributions of the grammaticality and meaning preservation scores are similar. Notice that although we used only the 20 applicable paraphrasing rules with the highest scores to generate the ⟨S, C⟩ pairs, less than half of the candidate paraphrases (C) were considered good, and only approximately 20% perfect. In other words, applying paraphrasing rules (even only those with the 20 best scores) to each input sentence S and randomly picking one of the resulting candidate paraphrases C, without any further filtering (or ranking) of the candidates, would on average produce unacceptable paraphrases more frequently than acceptable ones. Hence, the role of the ranking component is crucial.
We also measured inter-annotator agreement by constructing, in the same way, 100 additional ⟨S, C⟩ pairs (other than the 1,935) and asking 3 of the 13 judges to evaluate all of them. We measured the mean absolute error, i.e., the mean absolute difference in the judges’ scores (averaged over all pairs of judges), and the mean (over all pairs of judges) K statistic (Carletta, 1996). In the overall scores, K was 0.64, which is in the range often taken to indicate substantial agreement (0.61–0.80).⁷ Agreement was higher for grammaticality (K = 0.81), and lower (K = 0.59) for meaning preservation. Table 1 shows that the mean absolute difference in the annotators’ scores was 1/5 to 1/4 of a point.

⁷ It is also close to 0.67, which is sometimes taken to be a cutoff for substantial agreement in computational linguistics.

                       mean abs. diff.   K statistic
    grammaticality          0.20            0.81
    meaning preserv.        0.26            0.59
    overall quality         0.22            0.64

Table 1: Inter-annotator agreement when manually evaluating candidate paraphrases.
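The agreement figures above are easy to reproduce for similar datasets. The sketch below is our own illustration, assuming scikit-learn’s cohen_kappa_score as one instantiation of the K statistic and toy scores in place of the judges’ real annotations; it computes the mean absolute difference and the mean kappa over all pairs of judges.

    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    def pairwise_agreement(scores_by_judge):
        """scores_by_judge maps a judge id to a list of 1-4 scores over the
        same items; returns the mean absolute difference and the mean kappa,
        both averaged over all pairs of judges."""
        diffs, kappas = [], []
        for j1, j2 in combinations(sorted(scores_by_judge), 2):
            a, b = scores_by_judge[j1], scores_by_judge[j2]
            diffs.append(sum(abs(x - y) for x, y in zip(a, b)) / len(a))
            kappas.append(cohen_kappa_score(a, b))
        return sum(diffs) / len(diffs), sum(kappas) / len(kappas)

    judges = {"A": [4, 3, 2, 4, 1], "B": [4, 2, 2, 3, 1], "C": [3, 3, 2, 4, 2]}
    mad, kappa = pairwise_agreement(judges)
    print(f"mean abs. diff. = {mad:.2f}, mean K = {kappa:.2f}")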
Several judges commented that they had trouble
deciding to what extent the overall quality score
should reflect grammaticality or meaning preserva-
tion. They also wondered if it was fair to consider as
perfect candidate paraphrases that differed in only
one or two words from the source sentences, i.e.,
candidates with low diversity. These comments led
us to ignore the judges’ overall quality scores in
some experiments, and to use a weighted average
of grammaticality, meaning preservation, and (auto-
matically measured) diversity instead, with different
weight combinations corresponding to different ap-
plication requirements, as discussed further below.
In the same way, 1,500 more ⟨S, C⟩ pairs (other than the 1,935 and the 100, not involving previously seen Ss) were constructed, and they were evaluated by the first author. The 1,500 pairs were used as a training dataset in experiments discussed below. Both the 1,500 training and the 1,935 evaluation (test) pairs are publicly available.⁸ We occasionally refer to the training and evaluation datasets as a single dataset, but they are clearly separated.

⁸ See the paper’s supplementary material.
4 Ranking candidate paraphrases
We now discuss the ranking component of our
method, which assesses the candidate paraphrases.
4.1 Features of the ranking component
Each ⟨S, C⟩ pair is represented as a feature vector. To allow the ranking component to assess the degree to which a candidate C is grammatical, or at least as grammatical as the source S, we include in the feature vectors the language model scores of S and C, and the difference between the two scores. We use a 3-gram language model trained on approximately 6.5 million sentences of the AQUAINT corpus.⁹ To allow the ranker to consider the (context-insensitive) quality scores of the rules that generated C from S, we also include as features the highest, lowest, and average r1, r2, r3, and r4 scores (Section 2) of these rules, 12 features in total.
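A minimal sketch of this first feature block follows; the language model scores are assumed to come from the 3-gram model (e.g., via SRILM, footnote 9) and the rule scores from the applied rules, so the helper below only assembles the 15 features.

    def rule_and_lm_features(lm_score_S, lm_score_C, rule_scores):
        """The 15 features of this block: the language model scores of S and C
        and their difference (3 features), plus the highest, lowest, and average
        r1, r2, r3, r4 of the applied rules (12 features); rule_scores is a list
        of (r1, r2, r3, r4) tuples, one per applied rule."""
        feats = [lm_score_S, lm_score_C, lm_score_C - lm_score_S]
        for k in range(4):                      # r1 .. r4
            values = [scores[k] for scores in rule_scores]
            feats += [max(values), min(values), sum(values) / len(values)]
        return feats

    # toy usage: two applied rules with made-up scores
    print(rule_and_lm_features(-42.7, -41.9, [(0.6, 0.5, 0.7, 0.60), (0.8, 0.4, 0.5, 0.57)]))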
The features discussed so far are similar to those employed by Zhao et al. (2009a) in the only comparable paraphrase generation method we are aware of that uses paraphrasing rules. That method, hereafter called ZHAO-RUL, uses the language model score of C and scores similar to r1, r2, r3 in a log-linear model.¹⁰ The log-linear model of ZHAO-RUL is used by an SMT-like decoder to identify the transformations (applications of rules) that produce the (hopefully) best paraphrase. By contrast, we first generate a large number of candidates using the paraphrasing rules, and we then rank them. Unfortunately, we did not have access to an implementation of ZHAO-RUL to compare against, but below we compare against another paraphraser proposed by Zhao et al. (2010), hereafter called ZHAO-ENG, which uses multiple machine translation engines and pivot languages, instead of paraphrasing rules, and which Zhao et al. found to outperform ZHAO-RUL.
To further help the ranking component assess the degree to which C preserves the meaning of S, we also optionally include in the vectors of the ⟨S, C⟩ pairs the features of an existing paraphrase recognizer (Malakasiotis, 2009) that obtained the best published results (Androutsopoulos and Malakasiotis, 2010) on the widely used MSR paraphrasing corpus.¹¹ Most of the recognizer’s features are computed by using nine similarity measures: Levenshtein, Jaro-Winkler, Manhattan, Euclidean, and n-gram (n = 3) distance, cosine similarity, Dice, Jaccard, and matching coefficients, all computed on tokens; consult Malakasiotis (2009) for details.
⁹ We use SRILM; see http://www-speech.sri.com/.
¹⁰ Application-specific features are also included, which can be used, for example, to favor paraphrases that are shorter than the input in sentence compression (Knight and Marcu, 2002; Clarke and Lapata, 2008). Similar features could also be added to application-specific versions of our method.
¹¹ The MSR corpus contains pairs that are paraphrases or not. It is a benchmark for paraphrase recognizers, not generators. It provides only one paraphrase (true or false) of each source, and few of the true paraphrases can be obtained by the rules we use.
For each ⟨S, C⟩ pair, the nine similarity measures are applied to ten different forms ⟨s1, c1⟩, ..., ⟨s10, c10⟩ of ⟨S, C⟩, described below, leading to 90 features; a minimal sketch of some of these form transformations is given after the list.
⟨s1, c1⟩: The original forms of S and C.
⟨s2, c2⟩: S and C, with tokens replaced by stems.
⟨s3, c3⟩: S and C, with tokens replaced by POS tags.
⟨s4, c4⟩: S and C, with tokens replaced by Soundex codes.¹²
⟨s5, c5⟩: S and C, but having removed non-nouns.
⟨s6, c6⟩: As previously, but nouns replaced by stems.
⟨s7, c7⟩: As previously, nouns replaced by Soundex codes.
⟨s8, c8⟩: S and C, but having removed non-verbs.
⟨s9, c9⟩: As previously, but verbs replaced by stems.
⟨s10, c10⟩: As previously, verbs replaced by Soundex codes.
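The sketch below illustrates a few of these form transformations. The tool choices (NLTK’s Porter stemmer, the jellyfish package for Soundex codes) and the toy POS tags are our own assumptions for illustration, not the implementations used in the paper.

    from nltk.stem import PorterStemmer
    import jellyfish  # assumed here only to obtain Soundex codes

    stemmer = PorterStemmer()

    def forms(tokens, tags):
        """A few of the ten forms of one sentence; tags are POS tags for the
        tokens, produced by any tagger (footnote 3 mentions Stanford's)."""
        return {
            "s1": tokens,                                   # original tokens
            "s2": [stemmer.stem(w) for w in tokens],        # tokens -> stems
            "s3": tags,                                     # tokens -> POS tags
            "s4": [jellyfish.soundex(w) for w in tokens],   # tokens -> Soundex codes
            "s5": [w for w, t in zip(tokens, tags) if t.startswith("NN")],  # nouns only
        }

    tokens = "He had plenty of admiration for his job".split()
    tags = ["PRP", "VBD", "NN", "IN", "NN", "IN", "PRP$", "NN"]
    # each of the nine similarity measures is then computed between the
    # corresponding forms of S and C
    print(forms(tokens, tags))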
When constructing all ten forms ⟨si, ci⟩ of ⟨S, C⟩, synonyms (in any WordNet synset) are treated as identical words. Additional variants of some of the 90 features compare a sliding window of some of the si forms to the corresponding ci forms (or vice versa), adding 40 more features; see Malakasiotis (2009). Two more Boolean features indicate the existence or absence of negation in S or C, respectively; and another feature computes the ratio of the lengths of S and C, measured in tokens. Finally, three additional features compare the dependency trees of S and C:

    RS = |common dependencies of S and C| / |dependencies of S|
    RC = |common dependencies of S and C| / |dependencies of C|
    Fβ=1 = 2 · RS · RC / (RS + RC)
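These three features are straightforward to compute once the dependencies of S and C are available as sets; a minimal sketch follows, with dependencies represented as (head, relation, dependent) triples (the representation is an assumption made here for illustration).

    def dependency_features(deps_S, deps_C):
        """R_S, R_C, and their harmonic mean F_{beta=1}; deps_S and deps_C are
        sets of (head, relation, dependent) triples."""
        common = len(deps_S & deps_C)
        r_s = common / len(deps_S) if deps_S else 0.0
        r_c = common / len(deps_C) if deps_C else 0.0
        f1 = 2 * r_s * r_c / (r_s + r_c) if r_s + r_c else 0.0
        return r_s, r_c, f1

    deps_S = {("had", "nsubj", "He"), ("had", "dobj", "lot"), ("lot", "prep_of", "admiration")}
    deps_C = {("had", "nsubj", "He"), ("had", "dobj", "plenty"), ("plenty", "prep_of", "admiration")}
    print(dependency_features(deps_S, deps_C))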
The recognizer’s features are 136 in total.¹³ Hence, the full feature set of our paraphraser’s ranking component comprises 151 features.
¹² The Soundex algorithm maps English words to alphanumeric codes, so that words with the same pronunciations receive the same codes, despite spelling differences; see http://en.wikipedia.org/wiki/Soundex.
¹³ Malakasiotis (2009) shows that although there is a lot of redundancy in the recognizer’s feature set, the full feature set still leads to better paraphrase recognition results, compared to subsets constructed via feature selection with hill-climbing or beam search. The same paper reports that the recognizer performs almost as well without the last three features, which may not be available in languages with no reliable dependency parsers. Notice, also, that the recognizer does not use paraphrasing rules.
4.2 Learning rate with a MaxEnt classifier
To obtain a first indication of whether or not a ranking component equipped with the features discussed above could learn to distinguish good from bad candidate paraphrases, and to investigate if our training dataset is sufficiently large, we initially experimented with a Maximum Entropy classifier (with the 151 features) as the ranking component. This initial version of the ranking component, called ME-REC, was trained on increasingly larger parts of the training dataset of Section 3, and it was always evaluated on the entire test dataset of that section. For simplicity, we used only the judges’ overall quality scores in these experiments, and we treated the problem as one of binary classification; overall quality scores of 1 and 2 were conflated to a negative category, and scores of 3 and 4 to a positive category.
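A minimal sketch of this setup is shown below, with scikit-learn’s logistic regression standing in for a Maximum Entropy classifier and random placeholder data in place of the real feature vectors and judgments; it only illustrates the binarization and training steps.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def binarize(overall_scores):
        """Conflate 1-4 overall quality scores into binary labels:
        1-2 -> negative (0), 3-4 -> positive (1)."""
        return np.array([1 if s >= 3 else 0 for s in overall_scores])

    # X_train: one 151-dimensional feature vector per <S, C> pair, scaled to
    # [-1, 1]; y_train: overall quality scores (placeholders used here).
    X_train = np.random.uniform(-1, 1, size=(1500, 151))
    y_train = np.random.randint(1, 5, size=1500)

    maxent = LogisticRegression(max_iter=1000)   # MaxEnt ~ (multinomial) logistic regression
    maxent.fit(X_train, binarize(y_train))
    print(maxent.predict(X_train[:5]))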
Figure 2 plots the error rate of ME-REC, computed both on the test set and on the encountered training subset. The error rate on the training instances a learner has encountered is typically lower than the error rate on the test set (unseen instances); hence, the former error rate can be seen as a lower bound of the latter. ME-REC shows signs of having reached its lower bound when the entire training dataset is used, suggesting that the training dataset is sufficiently large. The baseline (BASE) of Figure 2 uses only a threshold on the average r4 (Section 2) of the rules that turned S into C. If the average r4 is higher than the threshold, the ⟨S, C⟩ pair is classified in the positive class, otherwise in the negative one. The threshold was tuned by experimenting on a separate tuning dataset. Clearly, ME-REC outperforms the baseline, which uses only the average (context-insensitive) scores of the applied paraphrasing rules.

[Figure 2 here: error rate (15%–50%) on the y-axis versus number of training instances used (75–1,500) on the x-axis, with curves ME-REC.TRAIN, ME-REC.TEST, and BASE.]
Figure 2: Learning curves of a Maximum Entropy classifier used as the ranking component of our method.
4.3 Experiments with an SVR ranker
As already noted, when our dataset was constructed the judges felt it was not always clear to what extent the overall quality scores should reflect meaning preservation or grammaticality; and they also wondered if the overall quality scores should have also taken into consideration diversity. To address these concerns, in the experiments described in this section (and the remainder of the paper) we ignored the judges’ overall scores, and we used a weighted average of the grammaticality, meaning preservation,
and diversity scores instead; the grammaticality and meaning preservation scores were those provided by the judges, while diversity was automatically computed as the edit distance (Levenshtein, computed on tokens) between S and C. Stated otherwise, the correct score y(xi) of each training or test instance xi (i.e., of each feature vector of an ⟨S, C⟩ pair) was taken to be a linear combination of the grammaticality score g(xi), the meaning preservation score m(xi), and the diversity d(xi), as in Equation (6), where λ3 = 1 − λ1 − λ2.

    y(xi) = λ1 · g(xi) + λ2 · m(xi) + λ3 · d(xi)    (6)
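The target score of Equation (6) is simple to compute; the sketch below implements the token-level Levenshtein distance used for diversity and combines the three components (in practice the components would be normalized to a common range before being combined, which is omitted here).

    def token_edit_distance(a, b):
        """Levenshtein distance on token lists (insert, delete, replace cost 1)."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    def target_score(g, m, S_tokens, C_tokens, lam1, lam2):
        """Equation (6): y = lam1*g + lam2*m + lam3*d, with lam3 = 1 - lam1 - lam2;
        g and m are the judges' scores, d the token-level edit distance."""
        lam3 = 1.0 - lam1 - lam2
        return lam1 * g + lam2 * m + lam3 * token_edit_distance(S_tokens, C_tokens)

    S = "He had a lot of admiration for his job".split()
    C = "He had plenty of admiration for his job".split()
    print(target_score(g=4, m=4, S_tokens=S, C_tokens=C, lam1=1/3, lam2=1/3))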
We believe that the λi weights should in practice be application-dependent. For example, when paraphrasing user queries to a search engine that turns them into bags of words, diversity and meaning preservation may be more important than grammaticality; by contrast, when paraphrasing the sentences of a generated text to avoid repeating the same expressions, grammaticality is very important. Hence, generic paraphrase generators, like ours, intended to be useful in many different applications, should be evaluated for many different combinations of the λi weights. Consequently, in the experiments of this section we trained and evaluated the ranking component of our method (on the training and evaluation part, respectively, of the dataset of Section 3) several times, each time with a different combination of λ1, λ2, λ3 values, with the values of each λi ranging from 0 to 1 with a step of 0.2.
[Figure 3 here (radar plot): each spoke is a (λ1, λ2) combination, with λ3 = 1 − λ1 − λ2; the distance from the center is the ρ² (0%–70%) obtained by SVR-REC and SVR-BASE for that combination.]
Figure 3: Performance of our method’s SVR ranking component with (SVR-REC) and without (SVR-BASE) the additional features of the paraphrase recognizer.

We employed a Support Vector Regression (SVR) model in the experiments of this section, instead of
a classifier, given that the y(xi) scores that we want to predict are real values.¹⁴ An SVR is very similar to a Support Vector Machine (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000; Joachims, 2002), but it is trained on examples of the form ⟨xi, y(xi)⟩, where xi ∈ R^n and y(xi) ∈ R, and it learns a ranking function f: R^n → R that is intended to return f(xi) values as close as possible to the correct ones y(xi), given feature vectors xi. In our case, the correct y(xi) values were those of Equation (6). We call SVR-REC the SVR ranker with all the 151 features of Section 4.2, and SVR-BASE the SVR ranker without the 136 features of the paraphrase recognizer.
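A minimal sketch of this setup follows, using scikit-learn’s SVR with an RBF kernel (the paper’s own experiments use LIBSVM, footnote 14) and placeholder data; the assumption that the first 15 feature columns are the language model and rule-score features is ours, made only so that SVR-BASE can be illustrated by dropping the remaining 136 recognizer features.

    import numpy as np
    from sklearn.svm import SVR

    # X: one 151-dimensional feature vector per <S, C> pair, scaled to [-1, 1];
    # y: the target scores of Equation (6) for a fixed (lam1, lam2, lam3).
    X = np.random.uniform(-1, 1, size=(1500, 151))   # placeholder training vectors
    y = np.random.uniform(0, 1, size=1500)           # placeholder target scores

    svr_rec = SVR(kernel="rbf")                      # SVR-REC: all 151 features
    svr_rec.fit(X, y)

    svr_base = SVR(kernel="rbf")                     # SVR-BASE: only the 15 LM and rule-score features
    svr_base.fit(X[:, :15], y)

    # candidates of a source sentence are then ranked by predicted score
    print(svr_rec.predict(X[:5]))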
We used the squared correlation coefficient ρ² to evaluate SVR-REC against SVR-BASE.¹⁵ The ρ² coefficient shows how well the scores returned by the SVR are correlated with the desired scores y(xi); the higher the ρ², the higher the agreement.
¹⁴ Additional experiments confirmed that the SVR performs better than ME-REC as the ranking component. We use the SVR implementation of LIBSVM, available from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, with an RBF kernel and default settings. All the features are normalized in [−1, 1], when using SVR or ME-REC.
¹⁵ If n is the number of test pairs, f(xi) the score returned by the SVR for the i-th pair, and y(xi) the correct score, then ρ² is [n Σ f(xi)·y(xi) − Σ f(xi) · Σ y(xi)]² / ([n Σ f(xi)² − (Σ f(xi))²] · [n Σ y(xi)² − (Σ y(xi))²]), where all sums range over i = 1, ..., n.
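The quantity of footnote 15 is the square of the Pearson correlation and can be computed directly; the small sketch below mirrors that formula (it is our own illustration, not part of the original evaluation scripts).

    def squared_correlation(f, y):
        """rho^2 between predicted scores f and correct scores y (footnote 15)."""
        n = len(f)
        sf, sy = sum(f), sum(y)
        sff = sum(v * v for v in f)
        syy = sum(v * v for v in y)
        sfy = sum(a * b for a, b in zip(f, y))
        return (n * sfy - sf * sy) ** 2 / ((n * sff - sf ** 2) * (n * syy - sy ** 2))

    print(squared_correlation([0.9, 0.4, 0.7, 0.2], [1.0, 0.5, 0.6, 0.1]))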
Figure 3 shows the experimental results. Each line from the diagram’s center represents a different experimental setting, i.e., a different combination of λ1 and λ2; recall that λ3 = 1 − λ1 − λ2. The distance of a method’s curve from the center is the method’s ρ² for that setting. The farther a point is from the center, the higher ρ² is; hence, methods whose curves are closer to the diagram’s outermost perimeter are better. Clearly, SVR-REC (which includes the recognizer’s features) outperforms SVR-BASE (which relies only on the language model and the scores of the rules).
The two peaks of SVR-REC’s curve occur when λ3 is very high (1 or 0.8), i.e., when y(xi) is dominated by the diversity score; in these cases, SVR-REC is at a clear advantage, since it includes features for surface string similarity (e.g., Levenshtein distance measured on ⟨s1, c1⟩), which in effect measure diversity, unlike SVR-BASE. Even when λ1 is very high (1 or 0.8), i.e., when all or most of the weight is placed on grammaticality, SVR-REC outperforms SVR-BASE, indicating that the extra features in SVR-REC also contribute towards assessing grammaticality; by contrast, SVR-BASE relies exclusively on the language model for grammaticality. Unfortunately, when λ2 is very high (1 or 0.8), i.e., when all or most of the weight is placed on meaning preservation, there is no or very small difference between SVR-REC and SVR-BASE, suggesting that the extra features of the paraphrase recognizer are not as useful to the SVR, when assessing meaning preservation, as we would have hoped. Nevertheless, SVR-REC is overall better than SVR-BASE.
We believe that the dataset of Section 3 and the evaluation methodology summarized by Figure 3 will prove useful to other researchers, who may wish to evaluate other ranking components of generate-and-rank paraphrasing methods against ours, for example with different ranking algorithms or features. Similar datasets of candidate paraphrases can also be created using different collections of paraphrasing rules.¹⁶ The same methodology can then be used to evaluate ranking components on those datasets.

¹⁶ See Androutsopoulos and Malakasiotis (2010) for pointers.
5 Comparison to the state of the art
Having established that SVR-REC is a better configuration of our method’s ranker than SVR-BASE, we proceed to investigate how well our overall generate-and-rank method (with SVR-REC) compares against a state of the art paraphrase generator.
As already mentioned, Zhao et al. (2010) recently presented a method (we call it ZHAO-ENG) that outperforms their previous method (Zhao et al., 2009a), which used paraphrasing rules and an SMT-like decoder (we call that previous method ZHAO-RUL). Given an input sentence S, ZHAO-ENG produces candidate paraphrases by translating S to 6 pivot languages via 3 different commercial machine translation engines (treated as black boxes) and then back to the original language, again via 3 machine translation engines (54 combinations). Roughly speaking, ZHAO-ENG then ranks the candidate paraphrases by their average distance from all the other candidates, selecting the candidate(s) with the smallest distance; distance is measured as BLEU score (Papineni et al., 2002).¹⁷ Hence, ZHAO-ENG is also, in effect, a generate-and-rank paraphraser, but the candidates are generated by invoking multiple machine translation engines instead of applying paraphrasing rules, and they are ranked by the average distance measure rather than using an SVR.
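The selection step of ZHAO-ENG can be approximated as in the sketch below, which scores each candidate by its average sentence-level BLEU against all other candidates and keeps the candidate with the highest average, i.e., the smallest average distance; NLTK’s BLEU with simple smoothing is our own stand-in, and the exact BLEU configuration of Zhao et al. (2010) is not reproduced here.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def select_candidate(candidates):
        """Return the candidate closest, on average, to all other candidates,
        where closeness is sentence-level BLEU (higher BLEU = smaller distance)."""
        smooth = SmoothingFunction().method1
        best, best_score = None, -1.0
        for i, cand in enumerate(candidates):
            others = [c.split() for j, c in enumerate(candidates) if j != i]
            score = sum(sentence_bleu([o], cand.split(), smoothing_function=smooth)
                        for o in others) / len(others)
            if score > best_score:
                best, best_score = cand, score
        return best

    candidates = [
        "He had plenty of admiration for his job .",
        "He had a great deal of admiration for his job .",
        "He admired his job a lot .",
    ]
    print(select_candidate(candidates))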
An obvious practical advantage of ZHAO-ENG is that it exploits the vast resources of existing commercial machine translation engines when generating candidate paraphrases, which allows it to always obtain large numbers of candidate paraphrases. By contrast, the collection of paraphrasing rules that we currently use does not manage to produce any candidate paraphrases for 40% of the sentences of the New York Times part of AQUAINT, because no rule applies. Hence, in terms of ability to always paraphrase the input, ZHAO-ENG is clearly better, though it should be possible to improve our method’s performance in that respect by using larger collections of paraphrasing rules.¹⁸ A further interesting question, however, is how good the paraphrases of the two methods are, when both methods manage to paraphrase the input, i.e., when at least one paraphrasing rule applies to S. This scenario can be seen as an emulation of the case where the collection of paraphrasing rules is sufficiently large to guarantee that at least one rule applies to any source sentence.

¹⁷ We use the version of ZHAO-ENG that Zhao et al. (2010) call “selection-based”, since they reported it performs overall better than an alternative decoding-based version.
¹⁸ Recall that the paraphrasing rules we use were extracted from an English-Chinese parallel corpus. Additional rules could be extracted from other parallel corpora, like Europarl (http://www.statmt.org/europarl/).
To answer the latter question, we re-implemented ZHAO-ENG, with the same machine translation engines and languages used by Zhao et al. (2010). We also trained our paraphraser (with SVR-REC) on the training part of the dataset of Section 3. We then selected 300 random source sentences S from AQUAINT that matched at least one of the paraphrasing rules, excluding sentences that had been used before. Then, for each one of the 300 S sentences, we kept the single best candidate paraphrase C1 and C2, respectively, returned by our paraphraser and ZHAO-ENG. The resulting ⟨S, C1⟩ and ⟨S, C2⟩ pairs were given to 10 human judges. This time the judges assigned only grammaticality and meaning preservation scores (on a 1–4 scale); diversity was again computed as edit distance. Each pair was evaluated by one judge, who was given an equal number of pairs from the two methods, without knowing which method each pair came from. The same judge never rated two pairs with the same S. Since we had no way to make ZHAO-ENG sensitive to λ1, λ2, λ3, we trained SVR-REC with λ1 = λ2 = 1/3, as the most neutral combination of weights.
Table 2 lists the average grammaticality, meaning preservation, and diversity scores of the two methods. All scores were normalized in [0, 1], but the reader should keep in mind that diversity was computed as edit distance, whereas the other two scores were provided by human judges on a 1–4 scale. The grammaticality score of our method was better than ZHAO-ENG’s, and the difference was statistically significant.¹⁹ In meaning preservation, ZHAO-ENG was slightly better, but the difference was not statistically significant. The difference in diversity was larger and statistically significant, with the diversity scores indicating that it takes approximately twice as many edit operations (insert, delete, replace) to turn each source sentence into ZHAO-ENG’s paraphrase, compared to the paraphrase of our method.
¹⁹ We used Analysis of Variance (ANOVA) (Fisher, 1925), followed by post-hoc Tukey tests to check whether the scores of the two methods differ significantly (p < 0.05).
    score (%)           our method   ZHAO-ENG
    grammaticality         90.89       85.33
    meaning preserv.       76.67       78.56
    diversity               6.50       14.58

Table 2: Evaluation of our paraphrasing method (with SVR-REC) against ZHAO-ENG, using human judges. The grammaticality and diversity differences are statistically significant.
We note that our method can be tuned, by adjusting the λi weights, to produce paraphrases with higher grammaticality, meaning preservation, or diversity scores; for example, we could increase λ3 and decrease λ1 to obtain higher diversity at the cost of lower grammaticality in the results of Table 2.²⁰ It is unclear how ZHAO-ENG could be tuned that way.

²⁰ Additional application-specific experiments confirm that this tuning is possible (Malakasiotis, 2011).
Overall, our method seems to perform well against ZHAO-ENG, despite the vastly larger resources of ZHAO-ENG, provided of course that we limit ourselves to source sentences to which paraphrasing rules apply. It would be interesting to investigate in future work if our method’s coverage (sentences it can paraphrase) can increase to ZHAO-ENG’s level by using larger collections of paraphrasing rules. It would also be interesting to combine the two methods, perhaps by using SVR-REC (without features for the quality scores of the rules) to rank candidate paraphrases generated by ZHAO-ENG.
6 Conclusions and future work
We presented a generate-and-rank method to para-
phrase sentences. The method first produces can-
didate paraphrases by applying existing paraphras-
ing rules extracted from parallel corpora, and it then
ranks (or classifies) the candidates to keep the best
ones. The ranking component considers not only the
context-insensitive quality scores of the paraphras-
ing rules that produced each candidate, but also fea-
tures intended to measure the extent to which the
rule applications preserve grammaticality and mean-
ing in the particular context of the input sentence, as
well as the degree to which the resulting candidate
differs from the input sentence (diversity).
Initial experiments with a Maximum Entropy classifier confirmed that the features we use can help a ranking component select better candidate paraphrases than a baseline ranker that considers only the average context-insensitive quality scores of the applied rules. Further experiments with an SVR
ranker indicated that our full feature set, which in-
cludes features from an existing paraphrase recog-
nizer, leads to improved performance, compared to
a smaller feature set that includes only the context-
insensitive scores of the rules and language model-
ing scores. We also propose a new methodology to
evaluate the ranking components of generate-and-
rank paraphrase generators, which evaluates them
across different combinations of weights for gram-
maticality, meaning preservation, and diversity. The
paper is accompanied by a paraphrasing dataset we
constructed for evaluations of this kind.
Finally, we evaluated our overall method against
a state of the art sentence paraphraser, which
generates candidates by using several commercial
machine translation systems and pivot languages.
Overall, our method performed well, despite the vast
resources of the machine translation systems em-
ployed by the system we compared against. Our
method performed better in terms of grammaticality,
equally well in meaning preservation, and worse in
diversity, but it could be tuned to obtain higher diver-
sity at the cost of lower grammaticality, whereas it
is unclear how the system we compare against could
be tuned this way. On the other hand, an advantage
of the paraphraser we compared against is that it al-
ways produces paraphrases; by contrast, our system
does not produce paraphrases when no paraphrasing
rule applies to the source sentence. Larger collec-
tions of paraphrasing rules would be needed to im-
prove our method in that respect.
Apart from obtaining and experimenting with
larger collections of paraphrasing rules, it would be
interesting to evaluate our method in vivo, for ex-
ample by embedding it in question answering sys-
tems (to paraphrase the questions), in information
extraction systems (to paraphrase extraction tem-
plates), or in natural language generators (to para-
phrase template-like sentence plans). We also plan
to investigate the possibility of embedding our SVR
ranker in the sentence paraphraser we compared
against, i.e., to rank candidates produced by using
several machine translation systems and pivot lan-
guages, as in ZHAO-ENG.
Acknowledgments
This work was partly carried out during INDIGO, an FP6 IST project funded by the European Union, with additional funding from the Greek General Secretariat of Research and Technology.²¹

²¹ Consult http://www.ics.forth.gr/indigo/.
References
I. Androutsopoulos and P. Malakasiotis. 2010. A survey
of paraphrasing and textual entailment methods. Jour-
nal of Artificial Intelligence Research, 38:135–187.
C. Bannard and C. Callison-Burch. 2005. Paraphrasing
with bilingual parallel corpora. In Proc. of the 43rd
ACL, pages 597–604, Ann Arbor, MI.
C. Callison-Burch, P. Koehn, and M. Osborne. 2006.
Improved statistical machine translation using para-
phrases. In Proc. of HLT-NAACL, pages 17–24, New
York, NY.
C. Callison-Burch. 2008. Syntactic constraints on para-
phrases extracted from parallel corpora. In Proc. of
EMNLP, pages 196–205, Honolulu, HI, October.
J. Carletta. 1996. Assessing agreement on classification
tasks: The kappa statistic. Computational Linguistics,
22:249–254.
J. Clarke and M. Lapata. 2008. Global inference for
sentence compression: An integer linear programming
approach. Journal of Artificial Intelligence Research,
1(31):399–429.
M. Collins and T. Koo. 2005. Discriminative reranking
for natural language parsing. Computational Linguis-
tics, 31(1):25–69.
N. Cristianini and J. Shawe-Taylor. 2000. An In-
troduction to Support Vector Machines and Other
Kernel-based Learning Methods. Cambridge Univer-
sity Press.
I. Dagan, B. Dolan, B. Magnini, and D. Roth. 2009. Rec-
ognizing textual entailment: Rational, evaluation and
approaches. Natural Lang. Engineering, 15(4):i–xvii.
Editorial of the special issue on Textual Entailment.
P. A. Duboue and J. Chu-Carroll. 2006. Answering the
question you wish they had asked: The impact of para-
phrasing for question answering. In Proc. of HLT-
NAACL, pages 33–36, New York, NY.
Ronald A. Fisher. 1925. Statistical Methods for Re-
search Workers. Oliver and Boyd.
T. Joachims. 2002. Learning to Classify Text Using Sup-
port Vector Machines: Methods, Theory, Algorithms.
Kluwer.
D. Kauchak and R. Barzilay. 2006. Paraphrasing for
automatic evaluation. In Proc. of HLT-NAACL, pages
455–462, New York, NY.
K. Knight and D. Marcu. 2002. Summarization be-
yond sentence extraction: A probabilistic approach to
sentence compression. Artif. Intelligence, 139(1):91–
107.
P. Koehn. 2009. Statistical Machine Translation. Cam-
bridge University Press.
S. Kok and C. Brockett. 2010. Hitting the right para-
phrases in good time. In Proc. of HLT-NAACL, pages
145–153, Los Angeles, CA.
Y. Lepage and E. Denoual. 2005. Automatic genera-
tion of paraphrases to be used as translation references
in objective evaluation measures of machine transla-
tion. In Proc. of the 3rd Int. Workshop on Paraphras-
ing, pages 57–64, Jeju Island, Korea.
N. Madnani and B.J. Dorr. 2010. Generating phrasal and
sentential paraphrases: A survey of data-driven meth-
ods. Computational Linguistics, 36(3):341–387.
N. Madnani, F. Ayan, P. Resnik, and B. J. Dorr. 2007.
Using paraphrases for parameter tuning in statistical
machine translation. In Proc. of 2nd Workshop on Sta-
tistical Machine Translation, pages 120–127, Prague,
Czech Republic.
P. Malakasiotis. 2009. Paraphrase recognition us-
ing machine learning to combine similarity measures.
In Proc. of the Student Research Workshop of ACL-AFNLP, Singapore.
P. Malakasiotis. 2011. Paraphrase and Textual Entail-
ment Recognition and Generation. Ph.D. thesis, De-
partment of Informatics, Athens University of Eco-
nomics and Business, Greece.
Y. Marton, C. Callison-Burch, and P. Resnik. 2009.
Improved statistical machine translation using
monolingually-derived paraphrases. In Proc. of
EMNLP, pages 381–390, Singapore.
S. Mirkin, L. Specia, N. Cancedda, I. Dagan, M. Dymet-
man, and I. Szpektor. 2009. Source-language en-
tailment modeling for translating unknown terms. In
Proc. of ACL-AFNLP, pages 791–799, Singapore.
R. Nelken and S. M. Shieber. 2006. Towards robust
context-sensitive sentence alignment for monolingual
corpora. In Proc. of the 11th EACL, pages 161–168,
Trento, Italy.
S. Padó, M. Galley, D. Jurafsky, and C. D. Manning.
2009. Robust machine translation evaluation with en-
tailment features. In Proc. of ACL-AFNLP, pages 297–
305, Singapore.
K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. 2002.
BLEU: a method for automatic evaluation of machine
translation. In Proc. of the 40th ACL, pages 311–318,
Philadelphia, PA.
C. Quirk, C. Brockett, and W. B. Dolan. 2004. Mono-
lingual machine translation for paraphrase generation.
In Proc. of the Conf. on EMNLP, pages 142–149,
Barcelona, Spain.
S. Riezler and Y. Liu. 2010. Query rewriting using
monolingual statistical machine translation. Compu-
tational Linguistics, 36(3):569–582.
S. Riezler, A. Vasserman, I. Tsochantaridis, V. Mittal, and
Y. Liu. 2007. Statistical machine translation for query
expansion in answer retrieval. In Proc. of the 45th
ACL, pages 464–471, Prague, Czech Republic.
I. Szpektor, I. Dagan, R. Bar-Haim, and J. Goldberger.
2008. Contextual preferences. In Proc. of ACL-HLT,
pages 683–691, Columbus, OH.
V. Vapnik. 1998. Statistical learning theory. John Wiley.
S. Zhao, H. Wang, T. Liu, and S. Li. 2008. Pivot ap-
proach for extracting paraphrase patterns from bilin-
gual corpora. In Proc. of ACL-HLT, pages 780–788,
Columbus, OH.
S. Zhao, X. Lan, T. Liu, and S. Li. 2009a. Application-
driven statistical paraphrase generation. In Proc. of
ACL-AFNLP, pages 834–842, Singapore.
S. Zhao, H. Wang, T. Liu, and S. Li. 2009b. Extract-
ing paraphrase patterns from bilingual parallel cor-
pora. Natural Language Engineering, 15(4):503–526.
S. Zhao, H. Wang, X. Lan, and T. Liu. 2010. Leverag-
ing multiple MT engines for paraphrase generation. In
Proceedings of the 23rd COLING, pages 1326–1334,
Beijing, China.
L. Zhou, C.-Y. Lin, and Eduard Hovy. 2006. Re-
evaluating machine translation results with paraphrase
support. In Proc. of the Conf. on EMNLP, pages 77–84.