
Corpus Annotation with Paraphrase Types: New Annotation Scheme and Inter-annotator Agreement Measures

Abstract

Paraphrase corpora annotated with the types of paraphrases they contain constitute an essential resource for the understanding of the phenomenon of paraphrasing and the improvement of paraphrase-related systems in Natural Language Processing. In this article, a new annotation scheme for paraphrase-type annotation is set out, together with newly created measures for the computation of inter-annotator agreement. Three corpora different in nature and in two languages have been annotated using this infrastructure. The annotation results and the inter-annotator agreement scores for these corpora are proof of the adequacy and robustness of our proposal.
Is This a Paraphrase? What Kind?
Paraphrase Boundaries and Typology
Marta Vilaᵃ, M. Antònia Martíᵃ, Horacio Rodríguezᵇ
ᵃ CLiC, Departament de Lingüística, Universitat de Barcelona. Gran Via de les Corts
Catalanes 585. 08007 Barcelona
ᵇ TALP, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de
Catalunya. Jordi Girona Salgado 1-3. 08034 Barcelona
Abstract
A precise and commonly accepted definition of paraphrasing does not exist.
This is one of the reasons why computational linguistics has not achieved
real success when dealing with this phenomenon in its systems and
applications. With the aim of helping to overcome this difficulty, this
article provides new insights into paraphrase characterization. We first
review what has been said about paraphrasing in linguistics and the new
light shed on the phenomenon by computational linguistics. In light of the
shortcomings observed, the paraphrase phenomenon is studied from two
different perspectives. On the one hand, insights into paraphrase
boundaries are set out by analyzing borderline paraphrase cases and the
interaction of paraphrasing with related linguistic phenomena. On the other
hand, a new paraphrase typology is presented. It goes beyond a simple list of
types and is embedded in a linguistically based hierarchical structure. This
typology has been empirically validated through corpus annotation and its
application in the plagiarism-detection domain.
Keywords: Paraphrasing, Paraphrase boundaries, Paraphrase typology
1. Introduction
Although the computational linguistics¹ community has been working
on paraphrasing over the last decades, it continues to be a challenging and
unresolved issue. One of the main reasons is found in the multifaceted and
boundless nature of the phenomenon, which makes its automatic treatment
complicated.
¹ We use the terms computational linguistics and natural language processing
interchangeably in this article.
Preprint submitted to Lingua April 15, 2013
Computational linguists have looked for precise and computationally-
treatable knowledge on paraphrasing in the linguistics field without reaching
a definitive solution. This has led researchers to rely on vague definitions
of paraphrasing, such as “expressing one thing in other words” (Shinyama
et al., 2002), “alternative ways to convey the same information” (Barzilay,
2003), or “sentences or phrases that convey approximately the same mean-
ing using different surface words” (Bhagat, 2009), and to develop techniques
based on workable paraphrase notions that are partial and ad-hoc.
In this scenario, our aim is to go a step further in the linguistic
characterization of paraphrasing in order to provide Natural Language Processing (NLP) with
more solid grounds for the development of methods and systems dealing with
paraphrasing. We adhere to Wintner (2009), who calls for the return of
linguistics to computational linguistics: “what makes our systems special is
the fact that they manipulate natural languages, and the only scientific field
that can inform our work is linguistics”.
Specifically, we review what has been said about paraphrasing in
linguistics, how computational linguistics has used this knowledge as a basis
for its systems, and what new insights into paraphrasing derive from them.
In light of the shortcomings observed, our proposal on paraphrase
characterization is set out. It aims to help answer two questions that reflect
two different approaches to the phenomenon: “is this a paraphrase?”, which
raises the issue of where paraphrase boundaries should be drawn, and “what
kind?”, which asks what the linguistic manifestations of paraphrasing are,
made concrete in a typology.
Our work is not tied to any specific theoretical framework. Moreover, it
has been empirically validated by annotating, with our typology, more
than 5,700 paraphrase pairs from three paraphrase corpora, which differ
in nature and cover two languages. This annotation gave rise to the
Paraphrase for Plagiarism (P4P), the Microsoft Research Paraphrase-Annotated
(MSRP-A), and the Wikipedia-based Relational Paraphrase Acquisition-Annotated
(WRPA) corpora (Vila et al., Submitted).² [Note: the websites and references
of the source corpora are not given here, so as not to overload the text.]
² Annotated paraphrase corpora and the annotation guidelines used are available at
http://clic.ub.edu/corpus/en/paraphrases-en.
In Section 2, the state of the art on paraphrasing from linguistics and
computational linguistics is set out. Section 3 presents our proposals on
paraphrase boundaries and typology. Finally, conclusions and future work
are presented in Section 4.
2. What Has Been Said About Paraphrasing?
Paraphrasing has been conceived and apprehended from different angles
in linguistics and computational linguistics. The variety of views of
paraphrasing is even larger if we consider fields like discourse analysis or
psycholinguistics, which have also addressed the phenomenon. This variety is
enlarged again if we adopt a diachronic view, including disciplines such as
rhetoric or biblical exegesis. The broad and multifaceted nature of
paraphrasing is thus directly reflected in the literature.
In what follows, we focus on how paraphrasing has been understood in
linguistics (Section 2.1) and computational linguistics (Section 2.2).³
2.1. In Linguistics
In the field of linguistics, paraphrasing is at the core of two theories
that set out language models focussing on production: Meaning-Text Theory
(MTT) and Systemic-Functional Grammar (SFG). Their proposals are
substantially different in essence, but their approaches to paraphrasing are
similar: both see language production as a system of choices or alternatives,
which can give rise to paraphrases.
MTT gives rise to Meaning-Text Models (MTMs). Such models incorporate
a grammar organized in seven levels of representation, with semantics
and phonetics at the two ends, comprising six components, which contain the
rules for going from one level of representation to the next. The
second constituent in MTMs is the Explanatory Combinatorial Dictionary
(ECD), which governs the whole process. Lexical Functions (LF), which
identify recurrent patterns of semantic-syntactic correspondence, are a
fundamental part of the ECD. Within this framework, two paraphrase
mechanisms can be identified. First, paraphrases can be produced in the transition
between levels of representation: representations at one level can be projected
onto two or more representations at the next one. Second, paraphrases can be
established through equivalence rules between representations at the same
level. Paraphrasing at the deep syntax level was first described by Žolkovskij
and Mel’čuk (1965), who built a paraphrasing system comprising lexical and
syntactic paraphrasing rules;⁴ paraphrasing at the semantic level was more
recently described by Milićević (2007a,b). The axiomatic foundations and
formal complexity of MTT prevent its straightforward exploitation outside
the MTT framework and lead to a costly computational implementation.
Nevertheless, the ECD and LF in particular are useful in themselves, as they
encode most of the paraphrase potential in the model.
³ See Fuchs (1994), Chapters 1 and 2, for a diachronic overview of approaches to
paraphrasing from linguistics and discourse analysis.
Although in a less explicit way, paraphrasing is also at the base of SFG:
“the systemic theory is a theory of meaning as a choice, by which a language,
or any other semiotic system, is interpreted as networks of interlocking op-
tions” (Halliday, 1994). In this framework, paraphrases are the result of mak-
ing alternative choices. Obviously, not all alternants are meaning-preserving
and, therefore, not all of them give rise to paraphrases.
Transformations, which are at the core of Harris (1957)’s proposal and
Chomsky (1965)’s Generative Grammar, have been used as a way to represent
and enumerate formal relations between sentences. Some of these transforma-
tions are paraphrastic as they preserve the meaning of sentences. Transfor-
mations take place between surface structures in Harris’s approach; in Chom-
sky’s, in contrast, they take place from deep to surface syntax structures. In
the latter case, different surface representations derived from the same deep
structure can be understood as paraphrases. Following Hiż (1964)’s ideas,
Smaby (1971) describes a paraphrase transformational grammar which maps
equivalent structures. The main interest of this work is the effort to formalize
paraphrasing; nevertheless, it only deals with those paraphrases which can
be formally apprehended.
With the emergence of generative semantics (Lakoff, 1971), there was a
movement towards a semantically based framework. Since, in this case, the deep
structure is purely semantic, generative semantics appears to be a suitable
means for describing paraphrasing.⁵ Diathesis alternations, which stand for
those alternate structures that are admitted by the same predicate, can also
be viewed as paraphrases. Levin (1993) provides diathesis alternations for
English; some of them, such as the active/passive or causative/inchoative
alternations, are of general application, while others are specific to English.
⁴ For a more recent reference in English, see Mel’čuk (1992).
⁵ See Bagha (2011) to read more about this topic.
There are also works that analyze and discuss the linguistic nature
of paraphrasing. Martin (1976) defines linguistic paraphrasing as logical
equivalence. He also describes two mechanisms of linguistic paraphrasing:
first, “semic content” identity and “actantial pattern” correspondence, which
roughly correspond to structural reorganizations, and, second, “actantial
pattern” identity and “semic content” correspondence, which mainly
correspond to synonymy substitutions. Fuchs (1994), in turn, describes
paraphrasing in discourse and in language from a diachronic perspective.
Moreover, she argues for the enunciative dimension of paraphrasing: it cannot
be reduced to closed equivalence; instead, it consists of a dynamic and
approximate relationship. Milićević (2007a), in line with proposals within the
MTT framework, analyzes paraphrasing as a multifaceted and variable
phenomenon, focussing on the different paraphrase dimensions. Some specific
aspects discussed by these authors are taken up in subsequent sections of
this article.
Some of the works mentioned above include lists of paraphrase types. Mel’čuk
(1992) enumerates 54 lexical and 29 syntactic paraphrasing rules within
MTT. Milićević (2007a) defines a set of MTT semantic-paraphrase rules and
also classifies paraphrases from five different perspectives, such as accuracy
of the paraphrase link (exact and approximate) or paraphrase-relationship
depth (semantic, lexico-syntactic, syntactic, and morphological paraphrases).
Lists of transformations (Harris, 1957) or diathesis alternations (Levin, 1993)
can also be seen as typologies of potential paraphrases. The latter sets out
around 60 diatheses organized in 8 main classes. Martin (1976), in turn, sets
out varied paraphrase mechanisms, focussing on paraphrasing by connotative
variation, double-negation or double-inversion paraphrasing, and
paraphrasing by synonymy substitution.
2.2. In Computational Linguistics
We analyze paraphrase characterization in computational linguistics
from two different perspectives. In Section 2.2.1, we analyze the notions of
paraphrase which underlie NLP paraphrase techniques. In Section 2.2.2, we
review the paraphrase typologies built in this field.
2.2.1. Paraphrase Notions Underlying NLP Methods
While linguistic analysis approaches paraphrasing with the aim of
exploring, explaining, and formalizing it, NLP researchers focus on developing
methods and techniques to deal with the phenomenon in their systems and
applications.⁶ Each method applied subsumes a way of understanding
paraphrasing, and the paraphrases addressed with such a technique are of a
particular nature. Sometimes these methods have their roots in linguistics; on
other occasions, they were born within NLP.
A number of authors have applied MTT proposals. Boyer and Lapalme
(1985) developed a paraphrase generation system based on the ECD and the
lexical transformations of the model. Lareau (2002), in turn, presents an
automatic text synthesis prototype system, Sentence Garden, which aims
to produce not just one sentence, but all possible sentences that express a
given meaning (although the prototype only implements the semantics-deep
syntax interface).
The idea of transformation between surface structures has also been used
in NLP. McKeown (1983), for example, sets out a paraphrase component for
a question-answering system, where a transformational grammar is used to
generate paraphrases. Romano et al. (2006) use transformation rules in their
paraphrase-based approach to relation extraction.
Harris (1954)’s distributional hypothesis, which states that words
occurring in the same contexts tend to have similar meanings, has been widely
applied, directly or indirectly, more or less strictly, and under different forms:
“sentences which appear in similar contexts are paraphrases” (Barzilay and
McKeown, 2001), “if two paths [in a dependency tree] tend to occur in
similar contexts, the meanings of the paths tend to be similar” (Lin and Pantel,
2001),⁷ “named entities are preserved across paraphrases” (Shinyama et al.,
2002), “the meaning of the text around the source and target entities [in a
concrete relation] will be similar throughout their different occurrences” (Vila
et al., submitted).
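The distributional hypothesis can be illustrated with a minimal sketch. It is not the method of any of the works cited; it assumes whitespace tokenization, raw co-occurrence counts within a fixed window, and cosine similarity, and the function names `context_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Toy corpus (invented for illustration).
corpus = [
    "the pilot was ready to take off",
    "the commander was ready to take off",
    "the beach was empty",
]
vecs = context_vectors(corpus)
same_context = cosine(vecs["pilot"], vecs["commander"])
other_context = cosine(vecs["pilot"], vecs["beach"])
```

In this toy corpus, pilot and commander occur in identical contexts and therefore receive a higher similarity than pilot and beach; real systems rely on far larger corpora and on weighting schemes that this sketch omits.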
Other authors establish the paraphrase link through a third vertex. In
Rinaldi et al. (2003)’s question-answering system, paraphrases are those
linguistic units mapping to the same logical representation. Bannard and
Callison-Burch (2005), in turn, start out from the assumption of similar meaning
when multiple phrases map onto a single foreign language phrase. The third
vertex is a logical meaning representation in the first case and a phrase in
another language in the second.
⁶ See surveys by Androutsopoulos and Malakasiotis (2010) and Madnani and Dorr
(2010) for a complete overview of paraphrase methods in NLP.
⁷ This work and Kouylekov and Magnini (2005) below focus on entailment relations,
which include paraphrases. See Section 3.1.
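The third-vertex idea can be sketched as grouping expressions by the vertex they map to. The sketch below is only an illustration under simplifying assumptions: the alignments are given as ready-made pairs, the pivot may be any hashable object (a logical form or a foreign phrase), and the function name `paraphrases_via_pivot` and the data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def paraphrases_via_pivot(alignments):
    """Return candidate paraphrase pairs: phrases that share the same pivot.

    `alignments` is an iterable of (phrase, pivot) tuples.
    """
    by_pivot = defaultdict(set)
    for phrase, pivot in alignments:
        by_pivot[pivot].add(phrase)
    pairs = set()
    for phrases in by_pivot.values():
        # Every two distinct phrases mapped to one pivot form a candidate pair.
        for a, b in combinations(sorted(phrases), 2):
            pairs.add((a, b))
    return pairs


# Toy alignments (invented for illustration).
toy_alignments = [
    ("under control", "unter kontrolle"),
    ("in check", "unter kontrolle"),
    ("the beach", "der strand"),
]
candidates = paraphrases_via_pivot(toy_alignments)
```

The sketch treats every shared pivot as equally reliable; an actual system would also need to score the candidate pairs, which is left out here.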
Similarity measures have also been used to address paraphrasing in NLP.
In this framework, paraphrases are those text snippets with a high level of
overlap or a low distance. Similarity can be calculated at the word level
using, for example, string edit distance or n-gram overlap (Dolan and
Brockett, 2005); at the syntax level, applying tree edit distance (Kouylekov and
Magnini, 2005); and at the semantic level, taking advantage of semantic roles
in PropBank or FrameNet frames, using a semantic space such as WordNet
or Wikipedia, or using distributed representations of co-occurrences, usually
vector-based (Baroni and Lenci, 2010).⁸ Semantic similarity has also been
addressed in the Semantic Textual Similarity task at SemEval 2012, where
paraphrases are ranked according to their similarity level.⁹
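As a minimal sketch of the word-level measures mentioned above, the following code computes an n-gram overlap and a word-level edit distance. It is not the procedure of the cited works: it assumes whitespace tokenization, lowercasing, and a Jaccard-style normalization of the overlap, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(a, b, n=1):
    """Shared n-grams divided by all n-grams of the two sentences (Jaccard-style)."""
    ga, gb = ngrams(a.lower().split(), n), ngrams(b.lower().split(), n)
    shared = sum((ga & gb).values())  # multiset intersection
    total = sum((ga | gb).values())   # multiset union
    return shared / total if total else 0.0


def edit_distance(a, b):
    """Word-level Levenshtein distance, computed row by row."""
    s, t = a.lower().split(), b.lower().split()
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        curr = [i]
        for j, y in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


s1 = "Yesterday I went to the beach"
s2 = "Yesterday I went to my favorite beach"
overlap = ngram_overlap(s1, s2)   # 5 shared unigrams out of 8 -> 0.625
distance = edit_distance(s1, s2)  # one substitution plus one insertion -> 2
```

Whether a given overlap or distance is enough to count a pair as a paraphrase is a threshold decision that depends on the task, so no cut-off is hard-coded in the sketch.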
To conclude, each NLP technique applied addresses a concrete paraphrase
facet, which is generally partial and ad-hoc. In this regard, a major dis-
tinction can be made. In methods relying on the formal mapping of the
paraphrase members (transformations and formal similarity measures), para-
phrases addressed must be similar in form. This is not the case of those meth-
ods where no formal mapping is necessarily assumed (MTT, distributional
hypothesis, semantic similarity measures, and third vertex).
2.2.2. Paraphrase Typologies
Many NLP researchers have found in typology building a way to
apprehend paraphrasing. Early works on paraphrase typologies are Culicover
(1968) and Honeck (1971). They set out similar typologies in the sense that
both divide their paraphrase types into formalizable and non-formalizable
ones, leaving the latter group outside the scope of their work. This has
been a general tendency in NLP, and paraphrases where no formal
mapping can be established have hardly been addressed. Specifically, Culicover
(1968) presents a paraphrase typology of five types (transformational,
attenuated, lexical, derivational, and real-world) and attempts a
formalization through the definition of some structural and semantic conditions
to be fulfilled by each of the paraphrase types. He makes a division between
computationally “accessible” and “inaccessible paraphrase relationships” and
focusses on the accessible ones, leaving the inaccessible ones (most real-world
paraphrases) under-explained. Honeck (1971), in the psychology field, offers
a taxonomy of three types of paraphrases, including transformational,
lexical, and formalexic (a combination of the two); however, he isolates two extra
types of paraphrases that are outside the scope of his study: parasyntactic
(unavailable for formal treatment) and syndetic (a combination of the
other types), where no formal correspondences can be established.
⁸ See Androutsopoulos and Malakasiotis (2010) for further reading on this topic.
⁹ http://www.cs.york.ac.uk/semeval-2012/task6/
More recently, some typologies in NLP consist of lists of the most common
types found in a corpus (Barzilay et al., 1999; Dutrey et al., 2011; Dolan et al.,
2004), lists of the paraphrases they address (Dorr et al., 2004; Kozlowski
et al., 2003; Boonthum, 2004), or simply lists of typical paraphrases for
illustrative purposes (Rinaldi et al., 2003). In general, they are oriented
to a specific piece of work and far from being comprehensive.
Sometimes paraphrasing is classified in a very generic way, setting out only
a few types, such as in Shimohata (2004, pp. 15–18) or Barreiro (2008, pp.
29–33). These types generally stand for the type of linguistic units or the level
of language where paraphrases take place. There also exist typologies that
focus on specific paraphrase cases, such as paraphrases involving
support-verb constructions (Barreiro, 2008, pp. 73–81), and typologies that come
from paraphrase-related fields, such as text reuse (Clough, 2003, p. 100) or
editing (Faigley and Witte, 1981).
There also exist exhaustive paraphrase typologies focussing on specific
paraphrase facets, such as syntactic (Dras, 1999) or lexical mechanisms
(Bhagat, 2009), or covering paraphrasing in a more comprehensive way (Fujita,
2005). More specifically, Dras (1999) sets out 54 types expressed in terms of
syntactic transformations and groups them into five classes standing for
paraphrase effects: change of perspective, change of emphasis, change of relation,
deletion, and clause movement. Bhagat (2009), in turn, classifies paraphrases
according to the lexical changes involved (e.g. actor/action substitution or
noun/adjective conversion) and links each of these types to the structural
modifications accompanying them (substitution, addition/deletion, and/or
permutation). Finally, Fujita (2005) presents a general classification of lexical
and structural paraphrases¹⁰ setting out 24 paraphrase types grouped into six
classes including paraphrases of single content words, function-expressional
paraphrases, paraphrases of compound expressions, clause-structural
paraphrases, multi-clausal paraphrases, and paraphrases of idiosyncratic
expressions.
¹⁰ This work is based on the Japanese language; English and other examples can be found
at http://paraphrasing.org/paraphrase.html. See also Atsushi Fujita’s slides for
the invited talk at CBA 2010 at
http://paraphrasing.org/~fujita/publications/fujita-CBA2010-slides.pdf [broken link]
Approaches to paraphrase characterization from NLP are generally
partial and ad-hoc, but have opened new windows onto the understanding of the
paraphrase phenomenon. In this section, we have shown how
computational linguistics can “shed new light on phenomena that traditional
approaches fail to account for [and] bring refreshing insights and new points of
view to all branches of linguistics” (Wintner, 2009).
3. Paraphrase Characterization
As shown in Section 2, a commonly accepted and, at the same time,
precise definition of paraphrasing does not exist. From the perspective of
linguistics and computational linguistics, the definition of “approximate sameness
of meaning” is generally assumed, but it is vague (to what extent can it be
“approximate”?) and actually shifts the problem to another location (what
stands for “meaning”?).
In this article, we adopt a different approach to paraphrase characteriza-
tion. Instead of focussing on the definition of paraphrasing itself, we address
the questions of where to draw the boundaries between paraphrases and
non-paraphrases (Section 3.1) and what phenomena fall under paraphrasing
(Section 3.2). Although we are conscious that paraphrase fuzziness is also
present in both boundary drawing and typology building, and that they are
simply another approach to the same problem, they allow us to be more
precise without giving up a general perspective on paraphrasing.
3.1. Paraphrase Boundaries
Meaning preservation has been widely discussed in the linguistics
literature. In lexical semantics, Cruse (1986) defines absolute synonymy as
an unexpected and just transitory relationship. Sameness of meaning is also
denied in the paraphrase literature; Fuchs (1988) rejects the idea of
paraphrasing as pure and simple identity: “the synonymy-identity myth has only
given rise to sterile arguments.” Therefore, paraphrasing must be situated in
the field of approximation, opening the path to different degrees of semantic
similarity or paraphrasability. Paraphrasing takes place in a continuum
that goes from absolute identity to the absence of semantic similarity.
In this scenario, a question arises: where to draw the boundaries between
paraphrases and non-paraphrases?
We consider that fixed and precise paraphrase boundaries do not exist;
instead, they depend on the task and objectives: sometimes a broad
understanding of paraphrasing will be required; on other occasions, a more
restrictive view will be necessary. Fuchs (1994) points out that a linguistic unit
is a paraphrase of another one if the latter can be considered within the
bounds of acceptable deformability or “distortion threshold” with respect to
the former. This threshold is variable as “it depends on different parameters
constituting the discursive activity: tolerance to deformation is greater or
lesser depending on the subjects and situations.”
In this section, we set out three cases of borderline paraphrases: loss of
content, pragmatic knowledge, and changes in some grammatical features.
They are the result of our analysis of the state of the art of paraphrasing
and related areas and our experience in paraphrase-type annotation.
Borderline paraphrases are placed in the continuum between paraphrases and
non-paraphrases, in which authors can position their own paraphrase border
according to their objectives. Moreover, for each of these cases, we state
the approach we have taken, which is reflected in our typology (Section
3.2). The section closes with a comparison between paraphrasing and two
related phenomena, namely coreference and textual entailment, which are
often confused with it in NLP.
Content Loss. Many cases on the boundaries of paraphrasing are due to
some kind of content loss. Content loss may be due to deletion [my favorite
in (1)] or generalization [from pilot to commander in (2)].¹¹
(1) a. Yesterday I went to the beach
b. Yesterday I went to my favorite beach
(2) a. The pilot was having breakfast
b. The commander was having breakfast
Depending on the quantity and relevance of the lost content, different degrees
of paraphrasability are possible. In this sense, the level of paraphrasability
of the sentences in (3) is lower than that of those in (1).
(3) a. Yesterday I went to the beach
b. Yesterday I went to the beach which used to be my favorite when
I was a child
¹¹ Unless explicitly mentioned, examples in this article are extracted/adapted from our
annotated corpora and the state-of-the-art works, or are our own.
Moreover, the lost content can sometimes be recovered by means of
implicit lexical knowledge in the context. The Generative Lexicon (Pustejovsky,
1995), although not addressing paraphrasing directly, offers useful insights in
this regard. Starting from the idea that the meaning of words reflects the
deeper conceptual structures in the cognitive system, the qualia structure
specifies four aspects of word meanings: formal (distinction within a larger
domain), constitutive (relation between an object and its constituent parts),
telic (purpose and function), and agentive (factors involved in its origin).
In (4), the information contained in the telic quale of book allows for the
recoverability of the deleted content (reading). In contrast, in (1), we have
no means to recover the lost content. Therefore, the level of paraphrasability
is higher in (4). Moreover, the pair in (5) shows a higher degree of
paraphrasability than the pair in (2), as the context of taking off in the former
clarifies that this commander is, actually, a pilot. In (2), we only rely on the
hypernym relationship between pilot and commander.
(4) a. John began reading a book
b. John began a book
(5) a. The pilot was ready to take off
b. The commander was ready to take off
Depending on the task and objectives, the above examples may or may not
need to be considered paraphrases. Many types in our typology
involve semantic loss to different degrees.¹² The addition/deletion type,
exemplified in Table 2, is a clear example of this. Although the lost
content cannot always be recovered in our types, this is sometimes possible: in
“light/generic element addition/deletion” within the synthetic/analytic
substitution type (Table 3), the content of the deleted element is
embedded in the one that remains, as the latter is a hyponym of the former. As
shown in Vila et al. (Submitted), addition/deletion is one of the most
frequent types in the annotated corpora, showing the high accessibility of
this mechanism when paraphrasing.
¹² Dras (1999, pp. 79–86) addresses the loss of meaning in paraphrasing regarding the
paraphrase classes in his typology.
Pragmatic Knowledge. Examples like the ones in (6) to (10) are treated
by several authors, both in linguistics and computational linguistics, as special
types of paraphrases that go beyond pure semantic similarity and enter
the field of pragmatics.
(6) a. Close the door please
b. There is air flow
(7) a. Penelope was waiting for Ulysses’ return
b. The Ithaca queen was waiting for Ulysses’ return
(8) a. Here, life is good
b. In Paris, life is good
(9) a. They got married last year.
b. They got married in 2004.
(10) a. The U.S.-led invasion of Iraq
b. The U.S.-led liberation of Iraq
Martin (1976) opposes “linguistic” to “pragmatic paraphrases”, the
latter standing for pairs that, in a given situation, refer to the same intention
(6) or refer to the same facts and events (7).¹³ Milićević (2007a), in turn,
opposes “language” to “cognitive paraphrases”, the latter comprising
paraphrases exploiting pragmatic data, such as (6), (8), and (9), and paraphrases
exploiting encyclopedic knowledge, such as (7).¹⁴ Fujita (2005) talks about
“pragmatic paraphrases” (6) and “referential paraphrases” (9). Dorr et al.
(2004) mention “viewpoint variation paraphrases” (10), also cited by Hirst
(2003). Finally, Fuchs (1994) considers cases like the one in (7) to be outside
paraphrase boundaries.
The way of presenting and conceptualizing all these examples varies according
to the author, but all of them show that paraphrasing may rely on
something beyond pure semantic similarity. We distinguish two main types of
knowledge that can give rise to pragmatic paraphrases, namely encyclopedic
knowledge [(7) and (10)] and situational knowledge (the remaining
examples). As Milićević (2007a) points out, we can also draw a continuum here:
“between those clear and unambiguous cases, there is a gray area populated
by paraphrases that can be called quasi-linguistic.”
¹³ Martin (1976) presents a third type of pragmatic paraphrase relying on implication
and coreference. We address coreference in the last part of this section.
¹⁴ Milićević (2007a) includes a third type of cognitive paraphrases called paraphrases
exploiting logic capacities, which also involves encyclopedic knowledge.
If we stick to the paraphrase definition of sameness of meaning, these
examples should be outside paraphrase limits. However, under certain
circumstances, it may be necessary to consider these cases as a special type of
paraphrase linked to the situational context. Insofar as our typology relies on
semantic content, those cases fall outside our proposal.
Grammatical Features. With the generic concept of “grammatical
features”, we refer to changes in person, number, and tense. They generally lead
to deep changes in meaning; however, on occasions, they may give rise to
paraphrases.
The example in (11) is clearly closer to paraphrasing than (12), as, in (11),
the first person plural includes the first person singular. In (13), the change
in number is not relevant: street does not refer to a specific one, but to the
general sense of ‘outdoors’; in (14), the change in number gains relevance as
we move from the idea of ‘liking a specific cake’ to ‘liking cakes in general’.
In (15), both tenses are in the present and highly overlap, which is not the
case in (16), where the tenses stand for different moments in time.
(11) a. We love flowers
b. I love flowers
(12) a. She is my collaborator
b. He is my collaborator
(13) a. I got lost in the street
b. I got lost in the streets
(14) a. I like the cake
b. I like cakes
(15) a. The plane takes off
b. The plane is taking off
(16) a. She lives in Barcelona
b. She had lived in Barcelona
Only examples (11), (13), and (15) are considered to be paraphrases in our
approach. They are included in the inflectional change type in our
typology (Table 1). Contrary to content loss and pragmatic knowledge, which
are language independent, this group refers to phenomena that are closely
related to how languages encode morpho-semantic content. In English, this
is reflected in inflection.
Paraphrase, Coreference, and Textual Entailment. Paraphrasing over-
laps with coreference and textual entailment leading to recurrent confusions.
In what follows, the main difference and similarities between these two phe-
nomena and paraphrasing are presented.
Paraphrasing and coreference overlap considerabiliy, but they notabiliy
differ in essence: paraphrasing concerns meaning, whereas coreference is
about discourse referents (Recasens and Vila, 2010). In the example (17),
a paraphrase relationship exists between salesmen and seller ; nevertheless,
the former acts as a nominal predicate, which is not referential and cannot
be part of coreference relationships. In contrast, in (18), we can establish a
coreference relationship between the noun phrases in italics, but they do not
hold the same meaning and, therefore, are not paraphrases. Finally, in (19),
paraphrase and coreference overlap in the coast/the seashore.
(17) She is a salesmen in that shop, but the seller that assisted me was
not her.
(18) – Are you a family member of the patient in room 235?
– Yes, my cousin is in that room.
(19) Yesterday I was walking along the coast. The seashore is what I
really love in this area.
Paraphrases can also be seen as bidirectional entailment relations: “text
A is a paraphrase of text B if and only if A entails B and B entails A” (Rus
et al., 2009). Limiting paraphrasing to bidirectional entailment reduces it to
only a few cases and, therefore, some unidirectional-entailment cases are gen-
erally considered to be paraphrases. Dorr et al. (2004), for example, present
“inference” as a paraphrase type. Once again, we situate paraphrasing
on a continuum with strict bidirectional entailment at one extreme and strict
unidirectional entailment at the other. Where to put the boundaries between
paraphrases and non-paraphrases depends again on the task and objectives.
The relationship between textual entailment and paraphrasing is inti-
mately linked to the question of content loss mentioned above, as all para-
phrases exhibiting content loss are cases of unidirectional entailment. In our
typology, addition/deletion (Table 2) illustrates this. Moreover, our ty-
pology includes types categorized as “paraphrase extremes” including iden-
tical and non-paraphrase, which stand for clear paraphrase limits, and
entailment, that is, those cases of non-paraphrase which are closer to the
paraphrase domain (Table 2). In the annotation task, it is worth isolating
these cases of entailment for researchers interested in broadening the scope
of their work (Vila et al., Submitted).
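The continuum between bidirectional and unidirectional entailment can be illustrated with a minimal sketch. It rests on a strong simplifying assumption that is not part of the article: each text is reduced to a set of atomic propositions, and A entails B when every proposition of B is also asserted by A. The names `Relation`, `entails`, and `relation`, as well as the proposition labels, are hypothetical.

```python
from enum import Enum


class Relation(Enum):
    BIDIRECTIONAL = "bidirectional entailment"
    UNIDIRECTIONAL = "unidirectional entailment"
    NONE = "no entailment"


def entails(a, b):
    # Toy assumption: a text is a set of atomic propositions, and A entails B
    # when every proposition of B is also asserted by A.
    return frozenset(b) <= frozenset(a)


def relation(a, b):
    # Check entailment in both directions and name the resulting relation.
    forward, backward = entails(a, b), entails(b, a)
    if forward and backward:
        return Relation.BIDIRECTIONAL
    if forward or backward:
        return Relation.UNIDIRECTIONAL
    return Relation.NONE


# Hypothetical proposition labels: `reduced` loses one proposition of `full`,
# which mimics a pair exhibiting content loss.
full = {"p", "q", "r"}
reduced = {"p", "q"}
```

Under this toy model, `relation(full, reduced)` yields unidirectional entailment, which mirrors the observation that paraphrases exhibiting content loss are cases of unidirectional entailment. Whether such a pair counts as a paraphrase remains a task-dependent decision that the sketch does not make.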
3.2. Paraphrase Typology
In this section, we focus on the characterization of paraphrasing through
the description of its possible linguistic manifestations or types. Our
typology was not built from scratch: it draws on state-of-the-art typologies,
which have provided ours with insights on structure and types. In fact, our
typology aims to cover all the phenomena described in them.15
A set of characteristics takes our typology a step beyond the state of
the art. First, it is a comprehensive typology of paraphrasing that focusses
on general paraphrase phenomena, leaving fine-grained linguistic mechanisms
in the background. Second, it goes beyond a simple list of types: it has a
hierarchical structure, which is linguistically based and uniform throughout,
and it is accompanied by a linguistic reflection describing and justifying its
nature. Finally, as already mentioned, it has been empirically validated on
paraphrase corpora.
The typology is displayed in Tables 1 and 2. It consists of a three-level
typology of 24 paraphrase types (third column) grouped in 5 classes (first
column), two of them having two sub-classes each (second column).16 In
what follows, an overview of our typology is set out. Specifically, we describe
(i) its scope, (ii) the type of units it classifies, (iii) its structure, and (iv) its
types.
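The three-level structure displayed in Tables 1 and 2 can be sketched as a nested mapping. This is only an illustrative encoding, not part of the annotation infrastructure: the names `TYPOLOGY`, `all_types`, and `locate` are hypothetical, the key `None` is an assumed convention for classes without sub-classes, and the placement of subordination-and-nesting changes under the syntax-based sub-class follows our reading of the continued table.

```python
# Class -> sub-class -> types, with names taken from Tables 1 and 2.
# `None` marks classes that have no sub-classes (an assumed convention).
TYPOLOGY = {
    "Morpholexicon-based changes": {
        "Morphology-based": [
            "Inflectional changes",
            "Modal-verb changes",
            "Derivational changes",
        ],
        "Lexicon-based": [
            "Spelling changes",
            "Same-polarity substitutions",
            "Synthetic/analytic substitutions",
            "Opposite-polarity substitutions",
            "Converse substitutions",
        ],
    },
    "Structure-based changes": {
        "Syntax-based": [
            "Diathesis alternations",
            "Negation switching",
            "Ellipsis",
            "Coordination changes",
            "Subordination-and-nesting changes",
        ],
        "Discourse-based": [
            "Punctuation changes",
            "Direct/indirect-style alternations",
            "Sentence-modality changes",
            "Syntax/discourse-structure changes",
        ],
    },
    "Semantics-based changes": {None: ["Semantics-based changes"]},
    "Miscellaneous changes": {
        None: ["Change of format", "Change of order", "Addition/deletion"],
    },
    "Paraphrase extremes": {None: ["Identical", "Entailment", "Non-paraphrase"]},
}


def all_types():
    # Flatten the hierarchy into a list of type names.
    return [
        type_name
        for subclasses in TYPOLOGY.values()
        for types in subclasses.values()
        for type_name in types
    ]


def locate(type_name):
    # Return the (class, sub-class) pair that a type belongs to.
    for class_name, subclasses in TYPOLOGY.items():
        for subclass_name, types in subclasses.items():
            if type_name in types:
                return class_name, subclass_name
    raise KeyError(type_name)
```

With this encoding, `len(all_types())` and `len(TYPOLOGY)` reproduce the counts of 24 types and 5 classes stated above, and `locate("Ellipsis")` returns the structure-based class and its syntax-based sub-class.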
Scope of the typology. It is a general typology of paraphrasing in the sense
that it comprehends the paraphrase phenomenon as a whole and covers all
its possible manifestations, from elementary modifications like the inflec-
tional change type in Table 1 to deep reorganizations like semantics-
15 See Section 2 in this article and the appendix in the annotation guidelines (footnote
2) for a complete list of the consulted typologies.
16 The typology was first presented (with some slight differences) in Barrón-Cedeño et al.
(2013, to appear). The present article focusses on the nature and structure of the typology;
Barrón-Cedeño et al. (2013, to appear), in contrast, focusses on the definition of each type.
Class | Subclass | Type | Example (a) | Example (b)
Morpholexicon-based changes | Morphology-based | Inflectional changes | it was with difficulty that the course of streets could be followed | You couldn’t even follow the path of the street
Morpholexicon-based changes | Morphology-based | Modal-verb changes | I [...] was still lost in conjectures who they might be | I was pondering who they could be
Morpholexicon-based changes | Morphology-based | Derivational changes | I have heard many accounts of him [...] all differing from each other | I have heard many different things about him
Morpholexicon-based changes | Lexicon-based | Spelling changes | the foodservice pie business doesn’t fit the company’s long-term growth strategy | The foodservice pie business does not fit our long-term growth strategy
Morpholexicon-based changes | Lexicon-based | Same-polarity substitutions | a teaspoonful of vanilla | very little vanilla
Morpholexicon-based changes | Lexicon-based | Synthetic/analytic substitutions | A sequence of ideas | ideas
Morpholexicon-based changes | Lexicon-based | Opposite-polarity substitutions | Leicester [...] failed in both enterprises | he did not succeed in either case
Morpholexicon-based changes | Lexicon-based | Converse substitutions | the Geological Society of London in 1855 awarded to him the Wollaston medal | resulted in him receiving the Wollaston medal from the Geological Society in London in 1855
Structure-based changes | Syntax-based | Diathesis alternations | the guide drew our attention to a gloomy little dungeon | ou[r] attention was drawn by our guide to a little dungeon
Structure-based changes | Syntax-based | Negation switching | In order to move us, it needs no reference to any recognized original | One does not need to recognize a tangible object to be moved by its artistic representation
Structure-based changes | Syntax-based | Ellipsis | In the scenes with Iago he equaled Salvini, yet did not in any one point surpass him | He equaled Salvini, in the scenes with Iago, but he did not in any point surpass him or imitate him
Structure-based changes | Syntax-based | Coordination changes | It is estimated that he spent nearly £10,000 on these works. In addition he published a large number of separate papers | Altogether these works cost him almost £10,000 and he wrote a lot of small papers as well
Table 1: Paraphrase typology (1). Classes appear in the first column, subclasses in the
second, and types in the third. Most of the examples come from the P4P corpus and
also appear in Barrón-Cedeño et al. (2013, to appear). Spelling, punctuation, format, and
paraphrase extremes are extracted from the MSRP-A corpus.
Class | Subclass | Type | Example (a) | Example (b)
Structure-based changes (cont.) | Syntax-based (cont.) | Subordination-and-nesting changes | the Russian law, which limits the percentage of Jewish pupils in any school, barred his admission | the Russian law had limits for Jewish students so they barred his admission
Structure-based changes (cont.) | Discourse-based | Punctuation changes | Swartz repaid it in full, with interest, according to his lawyer, Charles Stillman | Swartz fully repaid it with interest, according to his lawyer, Charles Stillman
Structure-based changes (cont.) | Discourse-based | Direct/indirect-style alternations | “She is mine,” said the Great Spirit | The Great Spirit said that she is her[s]
Structure-based changes (cont.) | Discourse-based | Sentence-modality changes | The real question is, will it pay? will it please Theophilus P. Polk or vex Harriman Q. Kunz? | He do it just for earning money or to please Theophilus P. Polk or vex Hariman Q. Kunz
Structure-based changes (cont.) | Discourse-based | Syntax/discourse-structure changes | How he would stare! | He would surely stare!
Semantics-based changes | - | Semantics-based changes | The scenery was altogether more tropical | which added to the tropical appearance
Miscellaneous changes | - | Change of format | fell 1.5% | fell 1.5 percent
Miscellaneous changes | - | Change of order | First we came to the tall palm trees | We got to some rather biggish palm trees first
Miscellaneous changes | - | Addition/deletion | One day she took a hot flat-iron, removed my clothes, and held it on my naked back until I howled with pain | As a proof of bed treatment, she took a hot flat-iron and put it on my back after removing my clothes
Paraphrase extremes | - | Identical | But he added group performance would improve in the second half of the year and beyond | De Sole said in the results statement that group performance would improve in the second half of the year and beyond
Paraphrase extremes | - | Entailment | [...] it was acquiring the “intellectual property and technology assets” of GeCAD | [...] it intends to acquire the intellectual property and technology assets of Romanian antivirus firm GeCAD Software Srl
Paraphrase extremes | - | Non-paraphrase | The report was found Oct. 23, tucked inside an old three-ring binder not related to the investigation | The report was found last week tucked inside a training manual that belonged to Hicks
Table 2: Paraphrase typology (2)
based changes in Table 2. Also, it covers paraphrases from the word to
the discourse level. It should be noted that, insofar as our typology relies on
semantic content, pragmatic paraphrases fall outside our proposal (Section
3.1).
Unit of classification. The units classified according to our typology
are what we call atomic paraphrase phenomena (henceforth, paraphrase
phenomena), that is, autonomous paraphrase reorganizations consisting of a set of
dependent linguistic mechanisms. The derivational change in Table 1,
for example, comprises a change from a verb to an adjective form, as well as
an involved structural modification. Among the dependent linguistic mech-
anisms, one of them is the trigger. In the previous example, it is the change
of category or derivational change. As can be seen, paraphrase-type names
stand for the linguistic mechanism triggering the paraphrase phenomenon.
Paraphrase phenomena can occur in isolation or in combination, giving rise to
complex paraphrase pairs. In the pair containing a derivational change
mentioned above, other paraphrase phenomena can be observed, such as a
same-polarity substitution (or synonymy substitution) between things
and accounts.
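The notions of atomic paraphrase phenomenon, trigger, and complex paraphrase pair can be made concrete with a small data-structure sketch. The class and field names (`Phenomenon`, `ParaphrasePair`, `trigger`, `dependents`) are hypothetical, and the wording of the trigger and dependent mechanisms is our own reading of the derivational-change pair in Table 1, not an annotation taken from the corpora.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Phenomenon:
    """One atomic paraphrase phenomenon, named after its trigger mechanism."""
    type_name: str          # paraphrase type from the typology
    trigger: str            # the linguistic mechanism that triggers the change
    dependents: tuple = ()  # other mechanisms that depend on the trigger


@dataclass
class ParaphrasePair:
    """A pair of snippets together with the phenomena observed between them."""
    text_a: str
    text_b: str
    phenomena: list = field(default_factory=list)

    def is_complex(self):
        # A pair is complex when several phenomena are combined in it.
        return len(self.phenomena) > 1


pair = ParaphrasePair(
    text_a="I have heard many accounts of him [...] all differing from each other",
    text_b="I have heard many different things about him",
    phenomena=[
        Phenomenon(
            type_name="Derivational changes",
            trigger="change of category (verb 'differing' to adjective 'different')",
            dependents=("structural modification around the adjective",),
        ),
        Phenomenon(
            type_name="Same-polarity substitutions",
            trigger="substitution of 'accounts' by 'things'",
        ),
    ],
)
```

Here `pair.is_complex()` is true because two phenomena co-occur, while each `Phenomenon` keeps its own trigger and dependent mechanisms apart from the others.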
Typology structure: classes, subclasses, and types. Types are grouped
in classes according to the nature of the trigger linguistic mechanism: (i) The
morpholexicon-based change class comprises those types where the para-
phrase phenomenon is triggered at the morpholexicon level; (ii) the structure-
based change class comprises those types that are the result of a different
structural organization; and (iii) the semantics-based change class contains
those types arising at the semantic level. An example of (i) is derivational
changes, where the trigger consists of the change of category, which
implies structural reorganizations. Regarding (ii), a diathesis alternation
like the one in Table 1 involves a change of voice of the verb, among
other changes, but the trigger is syntactic. Finally, paraphrases in the
semantics class (iii) are based on a different distribution of semantic content
across the lexical units involving multiple and varied formal changes (Table
2).
There are two more classes in our typology: miscellaneous changes and
paraphrase extremes (Table 2). The former comprises types not directly
related to one single language level. The latter comprises those phenomena
that are at the limits or outside the limits of paraphrasing (Section 3.1).
Finally, the sub-classes follow the classical organization in formal linguistic
levels from morphology to discourse and simply establish an intermediate
grouping between some classes and their types.
Two main kinds of paraphrase structural reorganizations can be inferred
from the previous explanation: those that are triggered by a lexical substitution
(morpholexicon-based changes), and those that are not (structure-based
changes). The idea of a lexical trigger is based on the lexical projection
rules stated by Chomsky (1986) and their further reformulations.
This organization in classes and the idea of trigger determined the method-
ology applied to annotate the scope in Vila et al. (Submitted).
The types.17 Types in our typology correspond to general and contrastive
categories: they stand for coarse-grained categories of paraphrase phenomena
that are substantially different from one another, e.g., same-polarity
substitution vs. punctuation change. Even types closer in nature clearly
contrast. For example, linguistic mechanisms involved in opposite-polarity
and converse substitutions are similar (both can involve a change in
the order of the arguments); however, the linguistic mechanism triggering
the paraphrase phenomenon (the opposite-polarity or converse lexical unit)
makes them different.
An important point regarding the nomenclature used for the types should
be noted. Some paraphrase-type names refer to paraphrase
relationships by default, e.g., all derivational changes give rise
to paraphrase relationships as changes of category do not affect the core
meaning of the sentence. Other paraphrase-type names refer to linguistic
mechanisms that do not necessarily give rise to paraphrases, e.g., inflec-
tional changes may change the core meaning of the sentences. There-
fore, cases like the inflectional change type have to be understood as
meaning-preserving changes in inflection, and not as changes in inflection as
a whole (Section 3.1).
Each type is realized by a set of more fine-grained prototypes, that is,
those patterns that characterize the linguistic mechanisms underlying the
paraphrase. Defining a complete list of prototypes for each type is not the
objective of this work. Nevertheless, without the aim of being exhaustive,
we exemplify prototypes taking synthetic/analytic substitutions as
17 See Barrón-Cedeño et al. (2013, to appear) for a detailed description and
exemplification of each type.
an example.18 In this case, we identified the five prototypes shown in Table
3: (i) compounding/decomposition, (ii) alternations affecting genitives and
possessives, (iii) synthetic/analytic-superlative alternation, (iv) light/generic
element addition/deletion, and (v) specifier addition/deletion.
Martin (1976) analyzes in detail what he calls “double-negation” and
“double inversion paraphrasing”, which roughly correspond to our opposite-polarity
and converse substitutions. The equivalence rules he defines
for French can be seen as a list of prototypes for these types. The typology of
support verb constructions in Barreiro (2008, pp. 73–81) and, at a smaller scale,
the noun-compound and genitive paraphrases in Peñas and Ovchinnikova (2012,
pp. 399–400) can also be seen as potential lists of prototypes for the type
synthetic/analytic substitutions.
Types and prototypes differ in that types are stable and prototypes are
an open class. Types stand for general paraphrase phenomena covering para-
phrasing as a whole. Their comprehensiveness has been tested through cor-
pus annotation in two languages (English and Spanish). Prototypes, in con-
trast, stand for concrete linguistic mechanisms or patterns of realization for
which a complete list is not necessarily provided in this work. They are more
language dependent than types.
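The contrast between stable types and an open class of prototypes can be sketched as a fixed set plus an extensible registry. This is an illustrative assumption about how one might store them, not a description of the annotation tooling: `TYPES` is only an excerpt of the typology, and the names `PROTOTYPES` and `add_prototype` are hypothetical. The five initial prototypes are those listed for synthetic/analytic substitutions.

```python
# Excerpt of the closed set of types (the full typology has 24).
TYPES = frozenset({
    "Synthetic/analytic substitutions",
    "Opposite-polarity substitutions",
    "Converse substitutions",
})

# Open registry: prototypes can be added per type as new patterns are described.
PROTOTYPES = {
    "Synthetic/analytic substitutions": [
        "compounding/decomposition",
        "alternations affecting genitives and possessives",
        "synthetic/analytic-superlative alternation",
        "light/generic element addition/deletion",
        "specifier addition/deletion",
    ],
}


def add_prototype(type_name, prototype):
    # Types are fixed, so unknown type names are rejected;
    # prototypes are open, so any new pattern is accepted.
    if type_name not in TYPES:
        raise KeyError(f"unknown paraphrase type: {type_name}")
    PROTOTYPES.setdefault(type_name, []).append(prototype)
```

The design choice reflects the distinction drawn above: extending the registry never changes the set of types, so language-specific patterns can be added without touching the typology itself.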
4. Conclusions and Future Work
This article has offered an overview of what has been said about paraphrasing
in linguistics, how computational linguistics has used this knowledge
as a basis for its systems, and what new insights into paraphrase characterization
have been derived from computational-linguistics methods. This analysis
has shown that, given the vague and multifaceted nature of paraphrasing, a precise
and commonly accepted definition of the phenomenon does not exist. This
has complicated paraphrase tasks in NLP on many occasions: “the difficulty
when working with paraphrases lies on its own definition” (Herrera et al.,
2007).
The aim of this article has been to move forward in paraphrase character-
ization in order to provide NLP with more rigorous paraphrase knowledge.
We addressed this problem from two directions. First, based on the idea that
paraphrase boundaries are not fixed and depend on the task and objectives, we
18 Examples of prototypes for different types can be seen in our annotation guidelines.
See footnote 2.
Prototype | No. | Example (a) | Example (b)
Compounding/decomposition | (1) | wildlife television documentaries | television documentaries about wildlife
Compounding/decomposition | (2) | chemical life-cycles | life-cycles for chemistry
Compounding/decomposition | (3) | physiography | physical geography
Alternations affecting genitives and possessives | (1) | Tina’s birthday | the birthday of Tina
Alternations affecting genitives and possessives | (2) | his reflection | the reflection of his own features
Alternations affecting genitives and possessives | (3) | the Met show | the Met’s show
Alternations affecting genitives and possessives | (4) | Russia’s Foreign Ministry | the Russian Foreign Ministry
Synthetic/analytic superlative alternation | (1) | smarter than everybody else | the smartest
Light/generic element addition/deletion | (1) | boast | speak boastfully
Light/generic element addition/deletion | (2) | cheerfully | in a cheerful way
Specifier addition/deletion | (1) | fog | wall of fog
Specifier addition/deletion | (2) | 5 | 5 o’clock
Table 3: Prototypes for synthetic-analytic substitutions.
have presented three areas where borderline paraphrases are located. Second,
paraphrase characterization has been addressed with the construction of a
new paraphrase typology. Types in our typology are comprehensive, general,
and stable. The prototypes they contain, in contrast, constitute an open
and flexible group where new linguistic mechanisms can be described. This
typology has been empirically validated through the annotation of more than
5,700 paraphrase pairs from three corpora different in nature and in two
languages (Vila et al., Submitted). Moreover, our typology proposal has already
been tested in the automatic plagiarism detection field with promising
results (Barrón-Cedeño et al., 2013, to appear).
Finally, this article opens a number of lines for future research, such as
(i) further analyzing paraphrase boundaries: unseen borderline areas may
be defined, and (ii) studying in depth the idea of prototype and prototype
definition.
References
Androutsopoulos, I., Malakasiotis, P., 2010. A survey of paraphrasing and
textual entailment methods. Journal of Artificial Intelligence Research 38,
135–187.
Bagha, K. N., 2011. Generative semantics. English Language Teaching 4 (3),
223–231.
Bannard, C., Callison-Burch, C., 2005. Paraphrasing with bilingual parallel
corpora. In: Proceedings of ACL 2005. pp. 597–604.
URL http://www.aclweb.org/anthology/P05-1074
Baroni, M., Lenci, A., 2010. Distributional memory: a general framework for
corpus-based semantics. Computational Linguistics 36 (4), 673–721.
Barreiro, A., 2008. Make it simple with paraphrases. Automated paraphrasing
for authoring aids and machine translation. Ph.D. thesis, Universidade do
Porto, Porto.
Barrón-Cedeño, A., Vila, M., Martí, M., Rosso, P., 2013, to appear. Plagiarism
meets paraphrasing: Insights for the next generation in automatic
plagiarism detection. Computational Linguistics 39 (4), DOI:
10.1162/COLI_a_00153.
Barzilay, R., 2003. Information fusion for multidocument summarization:
Paraphrasing and generation. Ph.D. thesis, Columbia University.
Barzilay, R., McKeown, K., 2001. Extracting paraphrases from a parallel
corpus. In: Proceedings of the ACL 2001. Toulouse, pp. 50–57.
Barzilay, R., McKeown, K., Elhadad, M., 1999. Information fusion in the
context of multi-document summarization. In: Proceedings of the ACL
1999. pp. 550–557.
Bès, G. G., Fuchs, C., 1988. Introduction. In: Lexique et paraphrase. Presses
Universitaires de Lille, pp. 7–11.
Bhagat, R., 2009. Learning paraphrases from text. Ph.D. thesis, University
of Southern California.
Boonthum, C., 2004. iSTART: Paraphrase recognition. In: Proceedings of the
Fifth ACL Workshop on Student Research. pp. 55–60.
Boyer, M., Lapalme, G., 1985. Generating paraphrases from meaning-text
semantic networks. Computational Intelligence 3-4 (1), 103–117.
Chomsky, N., 1965. Aspects of the Theory of Syntax. MIT Press, Cambridge.
Chomsky, N., 1986. Knowledge of Language: Its Nature, Origin, and Use.
Praeger Publishers.
Clough, P., 2003. Measuring text reuse. Ph.D. thesis, University of Sheffield.
Cruse, D. A., 1986. Lexical Semantics. Cambridge University Press.
Culicover, P., 1968. Paraphrase generation and information retrieval
from stored text. Mechanical Translation and Computational Linguistics
11 (1,2), 78–88.
Dolan, B., Quirk, C., Brockett, C., 2004. Unsupervised construction of large
paraphrase corpora: Exploiting massively parallel news sources. In: Pro-
ceedings of COLING 2004. pp. 350–356.
Dolan, W. B., Brockett, C., 2005. Automatically constructing a corpus of
sentential paraphrases. In: Proceedings of the 3rd International Workshop
on Paraphrasing (IWP 2005). Jeju Island, pp. 9–16.
Dorr, B. J., Green, R., Levin, L., Rambow, O., Farwell, D., Habash, N.,
Helmreich, S., Hovy, E., Miller, K. J., Mitamura, T., Reeder, F., Sid-
dharthan, A., 2004. Semantic annotation and lexico-syntactic paraphrase.
In: Proceedings of the Workshop on Building Lexical Resources from Se-
mantically Annotated Corpora, LREC 2004. Lisbon, Portugal.
Dras, M., 1999. Tree adjoining grammar and the reluctant paraphrasing of
text. Ph.D. thesis, Macquarie University, Australia.
Dutrey, C., Bernhard, D., Bouamor, H., Max, A., 2011. Local modifications
and paraphrases in Wikipedia’s revision history. Procesamiento del
Lenguaje Natural 46, 51–58.
Faigley, L., Witte, S., 1981. Analyzing revision. College Composition and
Communication 32 (4), 400–414.
Fuchs, C., 1988. Paraphrases prédicatives et contraintes énonciatives. In:
Bès, G. G., Fuchs, C. (Eds.), Lexique et Paraphrase. No. 6 in Lexique.
Presses Universitaires de Lille, Villeneuve d’Ascq, pp. 157–171.
Fuchs, C., 1994. Paraphrase et énonciation. Ophrys, Paris.
Fujita, A., 2005. Automatic generation of syntactically well-formed and se-
mantically appropriate paraphrases. Ph.D. thesis, Nara Institute of Science
and Technology.
Halliday, M., 1994. An Introduction to Functional Grammar, 2nd Edition.
Edward Arnold, New York.
Harris, Z., 1954. Distributional structure. Word 10 (23), 146–162.
Harris, Z., 1957. Co-occurrence and transformation in linguistic structure.
Language 33 (3), 283–340.
Herrera, J., Peñas, A., Verdejo, F., 2007. Paraphrase extraction from validated
question answering corpora in Spanish. Procesamiento del Lenguaje
Natural 39, 37–44.
Hirst, G., 2003. Paraphrasing paraphrased. Keynote address for The Second
International Workshop on Paraphrasing: Paraphrase Acquisition and Ap-
plications.
Hiż, H., 1964. The role of paraphrase in grammar. Monograph series on
language and linguistics 17, 97–104.
Honeck, R. P., 1971. A study of paraphrases. Journal of Verbal Learning and
Verbal Behavior 10, 367–381.
Kouylekov, M., Magnini, B., 2005. Recognizing textual entailment with tree
edit distance. In: Proceedings of the PASCAL RTE Challenge. pp. 17–20.
Kozlowski, R., McCoy, K. F., Shanker, V. K., 2003. Generation of single-
sentence paraphrases from predicate/argument structure using lexico-
grammatical resources. In: Proceedings of IWP 2003. pp. 1–8.
Lakoff, G., 1971. On generative semantics. In: Steinberg, D. D., Jakobovits,
L. A. (Eds.), Semantics: An interdisciplinary reader in philosophy, linguis-
tics and psychology. Cambridge University Press, pp. 232–296.
Lareau, F., 2002. La synthèse automatique de paraphrases comme outil de
vérification des dictionnaires et grammaires de type sens-texte. Master’s
thesis, Université de Montréal.
Levin, B., 1993. English Verb Classes and Alternations: A Preliminary In-
vestigation. University of Chicago Press.
Lin, D., Pantel, P., 2001. DIRT-Discovery of Inference Rules from Text. In:
Proceedings of the KDD 2001. pp. 323–328.
Madnani, N., Dorr, B. J., 2010. Generating phrasal and sentential para-
phrases: A survey of data-driven methods. Computational Linguistics
36 (3), 341–387.
Martin, R., 1976. Inférence, antonymie et paraphrase. Librairie C. Klincksieck.
McKeown, K., 1983. Paraphrasing questions using given and new informa-
tion. American Journal of Computational Linguistics 9 (1).
Mel’čuk, I. A., 1992. Paraphrase et lexique: la théorie Sens-Texte et le
Dictionnaire Explicatif et Combinatoire. In: Mel’čuk, I. A., Arbatchewsky-Jumarie,
N., Clas, A., Mantha, S., Polguère, A. (Eds.), Dictionnaire Explicatif
et Combinatoire du Français Contemporain. Recherches Lexico-Sémantiques
III. Les Presses de l’Université de Montréal, pp. 9–59.
Milićević, J., 2007a. La Paraphrase. Peter Lang, Berne.
Milićević, J., 2007b. Semantic equivalence rules in meaning-text paraphrasing.
In: Wanner, L. (Ed.), Selected lexical and grammatical issues in
the Meaning-Text Theory. John Benjamins, Amsterdam/Philadelphia, pp.
267–296.
Peñas, A., Ovchinnikova, E., 2012. Unsupervised acquisition of axioms to
paraphrase noun compounds and genitives. In: Gelbukh, A. (Ed.), Computational
Linguistics and Intelligent Text Processing. Springer Berlin Heidelberg,
pp. 388–401.
Pustejovsky, J., 1995. The Generative Lexicon. Massachusetts Institute of
Technology.
Recasens, M., Vila, M., 2010. On paraphrase and coreference. Computational
Linguistics 36 (4), 639–647.
Rinaldi, F., Dowdall, J., Kaljurand, K., Hess, M., Mollá, D., 2003. Exploiting
paraphrases in a question answering system. In: Proceedings of IWP 2003.
pp. 25–32.
Romano, L., Kouylekov, M., Szpektor, I., Dagan, I., Lavelli, A., 2006. Inves-
tigating a generic paraphrase-based approach for relations extraction. In:
Proceedings of EACL 2006. pp. 409–416.
Rus, V., McCarthy, P. M., Graesser, A. C., McNamara, D. S., 2009. Identification
of sentence-to-sentence relations using a textual entailer. Research on
Language and Computation 7 (2–4), 209–229.
Shimohata, M., 2004. Acquiring paraphrases from corpora and its applica-
tion to machine translation. Ph.D. thesis, Nara Institute of Science and
Technology.
Shinyama, Y., Sekine, S., Sudo, K., 2002. Automatic paraphrase acquisition
from news articles. In: Proceedings of the second international conference
on Human Language Technology Research. Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA, pp. 313–318.
Smaby, R. M., 1971. Paraphrase Grammars. Vol. 2 of Formal Linguistics
Series. D. Reidel Publishing Company.
Vila, M., Martí, M. A., Rodríguez, H., Submitted. Corpus annotation with
paraphrase types. A new annotation scheme.
Vila, M., Rodríguez, H., Martí, M. A., Submitted. Relational paraphrase
acquisition from Wikipedia. The WRPA method and corpus.
Žolkovskij, A., Mel’čuk, I., 1965. O vozmožnom metode i instrumentax
semantičeskogo sinteza. Naučno-texničeskaja informacija 5, 23–28.
Wintner, S., 2009. What science underlies Natural Language Engineering?
Computational Linguistics 35 (4), 641–644.