The Second PASCAL Recognising Textual Entailment Challenge
Roy Bar-Haim, Ido Dagan, Bill Dolan∗∗, Lisa Ferro†, Danilo Giampiccolo‡, Bernardo Magnini⋆, Idan Szpektor
Computer Science Department, Bar-Ilan University, Ramat Gan 52900, Israel
∗∗Microsoft Research, Redmond, WA 98052, USA
†The MITRE Corporation, 202 Burlington Rd., Bedford, MA 01730, USA
‡CELCT, Via dei Solteri 38, Trento, Italy
⋆ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, 38050 Povo, Trento, Italy
Abstract
This paper describes the Second PASCAL Recognising Textual Entailment Challenge (RTE-2).¹ We describe the RTE-2 dataset and give an overview of the submissions to the challenge. One of the main goals for this year’s dataset was to provide more “realistic” text-hypothesis examples, based mostly on outputs of actual systems. The 23 submissions to the challenge present diverse approaches and research directions, and the best results achieved this year are considerably higher than last year’s state of the art.

¹ http://www.pascal-network.org/Challenges/RTE2
1 Introduction
1.1 Textual entailment recognition
Textual entailment recognition is the task of deciding, given two text fragments, whether the meaning of one text is entailed (can be inferred) from another text (see section 2.2 for the specific operational definition of textual entailment assumed in the challenge). This task, introduced by Dagan and Glickman (2004), captures generically a broad range of inferences that are relevant for multiple applications. For example, a Question Answering (QA) system has to identify texts that entail the expected answer. Given the question “Who is John Lennon’s widow?”, the text “Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to complete the official renaming of England’s Liverpool Airport as Liverpool John Lennon Airport” entails the expected answer “Yoko Ono is John Lennon’s widow”. Similarly, the semantic inference needs of other text understanding applications such as Information Retrieval (IR), Information Extraction (IE), and Machine Translation (MT) evaluation can be cast as entailment recognition (Dagan et al., 2006).
Textual entailment may serve as a unifying generic framework for modeling semantic inference, which so far has been addressed independently by separate application-specific research communities. Its formulation as a mapping between texts makes it independent of concrete semantic interpretations, which then become a means rather than the goal. For example, in word sense disambiguation it is not always easy to define explicitly the right set of senses to choose from. In practice, however, it is sufficient for most applications to determine whether a word meaning in a given context lexically entails another word. For instance, the occurrence of the word “chair” in the sentence “IKEA announced a new comfort chair” entails “furniture”, while its occurrence in the sentence “MIT announced a new CS chair position” does not. Thus, proper modeling of lexical entailment in context may alleviate the need to interpret word occurrences into explicitly stipulated senses.

Eventually, we hope that research on textual entailment will lead to the development of entailment “engines”, which will be used as a standard module in many applications (similar to the role of part-of-speech taggers and syntactic parsers in current NLP applications).
1.2 The First RTE Challenge
The first PASCAL Recognising Textual Entailment Challenge (RTE-1) (Dagan et al., 2006) introduced the first benchmark for the entailment recognition task. The RTE-1 dataset consists of manually collected text fragment pairs, termed text (t) (1-2 sentences) and hypothesis (h) (one sentence). The participating systems were required to judge for each pair whether t entails h. The pairs represented success and failure settings of inferences in various application types (termed “tasks”), including, among others, the QA, IE, IR, and MT evaluation applications described above.

The challenge attracted noticeable attention in the research community, with 17 submissions from diverse groups. The relatively low accuracy achieved by the participating systems suggested that the entailment task is indeed a challenging one, with ample room for improvement. It was followed by an ACL 2005 Workshop on Empirical Modeling of Semantic Equivalence and Entailment. The challenge and its dataset motivated further research on empirical entailment, which resulted in a number of publications in recent main conferences.²
1.3 Goals for the second challenge
Following the success and impact of RTE-1, the main goal of the second challenge was to support the continuation of research on textual entailment. Our main focus in creating the RTE-2 dataset was to provide more “realistic” text-hypothesis examples, based mostly on outputs of actual systems. As in the previous challenge, the main task is judging whether a hypothesis h is entailed by a text t. The examples represent different levels of entailment reasoning, such as lexical, syntactic, morphological and logical. Data collection and annotation processes were improved this year, including cross-annotation of the examples across the organizing sites (most of the pairs were triply annotated). The data collection and annotation guidelines were revised and expanded. In order to make the challenge data more accessible, we also provided some pre-processing for the examples, including sentence splitting and dependency parsing.

² A list of related publications can be found at http://www.pascal-network.org/Challenges/RTE2/Introduction
2 The RTE-2 Dataset
2.1 Overview
The RTE-2 dataset consists of 1600 text-hypothesis pairs, divided into a development set and a test set, each containing 800 pairs. We followed the basic setting of RTE-1: the texts consist of 1-2 sentences, while the hypotheses are one sentence (usually shorter).

We chose to focus on four out of the seven applications that were present in RTE-1: Information Retrieval (IR), Information Extraction (IE), Question Answering (QA), and multi-document summarization (SUM).³ Within each application setting the annotators selected positive entailment examples (annotated YES), where t does entail h, as well as negative examples (annotated NO), where entailment does not hold (50%-50% split, as in RTE-1). In total, 200 pairs were collected for each application in each dataset. Each pair was annotated with its related task (IE/IR/QA/SUM) and entailment judgment (YES/NO, released only in the development set). Some of the pairs in the development set are listed in Table 1.

The examples in the dataset are based mostly on outputs (both correct and incorrect) of Web-based systems, while most of the input was sampled from existing application-specific benchmarks.⁴ Thus, the examples give some sense of how existing systems could benefit from an entailment engine post-processing their output. The data collection procedure for each task is described in sections 2.3 through 2.6.
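For concreteness, each released pair bundles an ID, a task label (IE/IR/QA/SUM), a text, a hypothesis and, for the development set, the gold judgment. The sketch below shows one way such pairs could be loaded; it assumes an XML-style distribution with <pair> elements, and the file name and attribute names are illustrative rather than the exact released schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def load_rte_pairs(path):
    """Load text-hypothesis pairs from an RTE-style XML file.

    Assumes a hypothetical layout such as
      <pair id="77" task="SUM" entailment="YES"><t>...</t><h>...</h></pair>;
    the actual released files may use different attribute names.
    """
    pairs = []
    for pair in ET.parse(path).getroot().iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "task": pair.get("task"),            # IE / IR / QA / SUM
            "judgment": pair.get("entailment"),  # YES / NO (development set only)
            "text": (pair.findtext("t") or "").strip(),
            "hypothesis": (pair.findtext("h") or "").strip(),
        })
    return pairs

if __name__ == "__main__":
    dev = load_rte_pairs("RTE2_dev.xml")        # hypothetical file name
    print(len(dev))                             # expected: 800 pairs
    print(Counter(p["task"] for p in dev))      # expected: 200 pairs per task
```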
2.2 Defining and judging entailment
We consider an applied notion of textual entailment, defined as a directional relation between two text fragments, termed t (the entailing text) and h (the entailed text). We say that t entails h if, typically, a human reading t would infer that h is most likely true. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge. Textual entailment recognition is the task of deciding, given t and h, whether t entails h.

³ Equivalent to the CD task in RTE-1.
⁴ See the Acknowledgments section for a complete list of systems and benchmarks used.
Pair 77 (SUM, YES)
Text: Google and NASA announced a working agreement, Wednesday, that could result in the Internet giant building a complex of up to 1 million square feet on NASA-owned property, adjacent to Moffett Field, near Mountain View.
Hypothesis: Google may build a campus on NASA property.

Pair 110 (IR, NO)
Text: Drew Walker, NHS Tayside’s public health director, said: “It is important to stress that this is not a confirmed case of rabies.”
Hypothesis: A case of rabies was confirmed.

Pair 294 (IE, YES)
Text: Meanwhile, in an exclusive interview with a TIME journalist, the first one-on-one session given to a Western print publication since his election as president of Iran earlier this year, Ahmadinejad attacked the “threat” to bring the issue of Iran’s nuclear activity to the UN Security Council by the US, France, Britain and Germany.
Hypothesis: Ahmadinejad is a citizen of Iran.

Pair 387 (QA, YES)
Text: About two weeks before the trial started, I was in Shapiro’s office in Century City.
Hypothesis: Shapiro works in Century City.

Pair 415 (IR, YES)
Text: The drugs that slow down or halt Alzheimer’s disease work best the earlier you administer them.
Hypothesis: Alzheimer’s disease is treated using drugs.

Pair 691 (QA, NO)
Text: Arabic, for example, is used densely across North Africa and from the Eastern Mediterranean to the Philippines, as the key language of the Arab world and the primary vehicle of Islam.
Hypothesis: Arabic is the primary language of the Philippines.

Table 1: Examples of text-hypothesis pairs, taken from the RTE-2 development set.
Some additional judgment criteria and guidelines are listed below (examples are taken from Table 1):

• Entailment is a directional relation. The hypothesis must be entailed from the given text, but the text need not be entailed from the hypothesis.

• The hypothesis must be fully entailed by the text. Judgment would be NO if the hypothesis includes parts that cannot be inferred from the text.

• Cases in which inference is very probable (but not completely certain) are judged as YES. For instance, in pair #387 one could claim that although Shapiro’s office is in Century City, he actually never arrives at his office, and works elsewhere. However, this interpretation of t is very unlikely, and so the entailment holds with high probability. On the other hand, annotators were guided to avoid vague examples for which inference has some positive probability which is not clearly very high.

• Our definition of entailment allows presupposition of common knowledge, such as: a company has a CEO, a CEO is an employee of the company, an employee is a person, etc. For instance, in pair #294, the entailment depends on knowing that the president of a country is also a citizen of that country.
2.3 Collecting IE pairs
This task is inspired by the Information Extraction (and Relation Extraction) application, adapting the setting to pairs of texts rather than a text and a structured template. The pairs were generated using four different approaches. In the first approach, ACE-2004 relations (the relations tested in the ACE-2004 RDR task) were taken as templates for hypotheses. Relevant news articles were collected as texts (t). These collected articles were then given to actual IE systems for extraction of ACE relation instances. The system outputs were used as hypotheses, generating both positive examples (from correct outputs) and negative examples (from incorrect outputs). In the second approach, the output of IE systems on the dataset of the MUC-4 TST3 task (in which the events are acts of terrorism) was similarly used to create entailment pairs. In the third approach, additional entailment pairs were manually generated from both the annotated MUC-4 dataset and the news articles collected for the ACE relations. For example, given the ACE relation “X work for Y” and the text “An Afghan interpreter, employed by the United States, was also wounded.” (t), a hypothesis “An interpreter worked for Afghanistan” is created, producing a non-entailing (negative) pair. In the fourth approach, hypotheses which correspond to new types of semantic relations (not found in the ACE and MUC datasets) were manually generated for sentences in collected news articles. These relations were taken from various semantic fields, such as sports, entertainment and science. All these processes simulate the need of IE systems to recognize that the given text indeed entails the semantic relation that is expected to hold between the candidate template slot fillers.
2.4 Collecting IR pairs
In this application setting, the hypotheses are propositional IR queries, which specify some statement, e.g. “Alzheimer’s disease is treated using drugs”. The hypotheses were adapted and simplified from standard IR evaluation datasets (TREC and CLEF). Texts (t) that do or do not entail the hypothesis were selected from documents retrieved by different search engines (e.g. Google, Yahoo and MSN) for each hypothesis. In this application setting it is assumed that relevant documents (from an IR perspective) should entail the given propositional hypothesis.
2.5 Collecting QA pairs
Annotators were given questions, taken from the TREC-QA and QA@CLEF datasets, and the corresponding answers extracted from the Web by QA systems. Transforming a question-answer pair into a text-hypothesis pair consisted of the following stages: First, the annotators picked from the answer passage an answer term of the expected answer type, either a correct or an incorrect one. Then, the annotators turned the question into an affirmative sentence with the answer term “plugged in”. These affirmative sentences serve as the hypotheses (h), and the original answer passage serves as the text (t). For example (pair #575 in the development set), given the question “How many inhabitants does Slovenia have?” and an answer text “In other words, with its 2 million inhabitants, Slovenia has only 5.5 thousand professional soldiers” (t), the annotators picked “2 million inhabitants” as the (correct) answer term, which was used to turn the question into the statement “Slovenia has 2 million inhabitants” (h), producing a positive entailment pair. This process simulates the need of a QA system to verify that the retrieved passage text indeed entails the provided answer.
2.6 Collecting SUM pairs
In this setting t and h are sentences taken from a news document cluster, a collection of news articles that describe the same news item. Annotators were given output of multi-document summarization systems, including the document clusters and the summary generated for each cluster. The annotators picked sentence pairs with high lexical overlap, preferably where at least one of the sentences was taken from the summary (this sentence usually played the role of t). For positive examples, the hypothesis was simplified by removing sentence parts, until it was fully entailed by t. Negative examples were simplified in the same manner. This process simulates the need of a summarization system to identify information redundancy, which should be avoided in the summary, and may also increase the assessed importance of such repeated information.
2.7 Creating the final dataset
Cross-annotation of the collected pairs was done between the organizing sites. Each pair was judged by at least two annotators, and most of the pairs (75% of the pairs in the development set, and all of the test set) were triply judged. As in RTE-1, we filtered out pairs on which the annotators disagreed. The average agreement on the test set (between each pair of annotators who shared at least 100 examples) was 89.2%, with an average Kappa level of 0.78, which corresponds to “substantial agreement” (Landis and Koch, 1977). 18.2% of the pairs were removed from the test set due to disagreement. The following situations often caused disagreement:
• t gives approximate numbers and h gives exact numbers.

• t states an asserted claim made by some entity, and h drops the assertion and just states the claim. For example:
t: “Scientists say that global warming is made worse by human beings.”
h: “Global warming is made worse by human beings.”

• t makes a weak statement, and h makes a slightly stronger statement about the same thing.
Additional filtering was done by two of the organizers, who discarded pairs that seemed controversial, too difficult, or redundant (too similar to other pairs). In this phase, 25.5% of the (original) pairs were removed from the test set.

We allowed only minimal correction of texts extracted from the web, e.g. fixing spelling and punctuation but not style; therefore the English of some of the pairs is less than perfect. In addition to the corrections made by the annotators, a final proofreading pass over the dataset was performed by one of the annotators.
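The agreement figures reported above (89.2% observed agreement, Kappa of 0.78) follow the standard two-annotator Kappa computation. The following sketch is a generic illustration of that computation for YES/NO judgments, not the organizers’ actual script; the example judgments are made up.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators' YES/NO entailment judgments."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of both annotators'
    # marginal frequencies for that label.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[lab] / n) * (freq_b[lab] / n)
                   for lab in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Toy example with made-up judgments (not actual RTE-2 annotations):
a = ["YES", "YES", "NO", "NO", "YES", "NO"]
b = ["YES", "NO",  "NO", "NO", "YES", "NO"]
print(round(cohen_kappa(a, b), 2))  # 0.67
```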
3 The RTE-2 Challenge
3.1 Submission instructions and evaluation measures

The main task in the RTE-2 challenge was classification – entailment judgment for each pair in the test set. The evaluation criterion for this task was accuracy – the percentage of pairs correctly judged.

A secondary task was ranking the pairs according to their entailment confidence. In this ranking, the first pair is the one for which the entailment is most certain, and the last pair is the one for which the entailment is least likely (i.e. the one for which the judgment as “NO” is the most certain). A perfect ranking would place all the positive pairs (for which the entailment holds) before all the negative pairs. This task was evaluated using the average precision measure, which is a common evaluation measure for ranking (e.g. in information retrieval), and is computed as the average of the system’s precision values at all points in the ranked list in which recall increases, that is, at all points in the ranked list for which the gold standard annotation is YES (Voorhees and Harman, 1999). More formally, it can be written as follows:
\[
\frac{1}{R}\sum_{i=1}^{n}\frac{E(i)\times \#\mathit{CorrectUpToPair}(i)}{i} \qquad (1)
\]
where n is the number of pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, #CorrectUpToPair(i) is the number of positive pairs ranked at position i or higher, and i ranges over the pairs, ordered by their ranking (note the difference between this measure and the Confidence Weighted Score used in the first challenge).
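As a minimal sketch (not the official evaluation script), equation (1) can be computed directly from the gold labels of the test pairs listed in the system’s ranking order, most confident entailment first:

```python
def average_precision(ranked_gold_labels):
    """Equation (1): ranked_gold_labels holds the gold YES/NO judgment of each
    test pair, in the order ranked by the system (most confident first)."""
    R = sum(label == "YES" for label in ranked_gold_labels)
    correct_up_to_pair = 0
    total = 0.0
    for i, label in enumerate(ranked_gold_labels, start=1):
        if label == "YES":                        # E(i) = 1
            correct_up_to_pair += 1               # #CorrectUpToPair(i)
            total += correct_up_to_pair / i       # precision at a recall point
    return total / R

# A perfect ranking places all positive pairs first and scores 1.0:
print(average_precision(["YES", "YES", "NO", "NO"]))   # 1.0
print(average_precision(["NO", "YES", "NO", "YES"]))   # (1/2 + 2/4) / 2 = 0.5
```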
Participating teams were allowed to submit results of up to two systems, and many of the participants made use of this option and provided the results of two runs.
3.2 Submitted systems
Twenty-three teams participated in the challenge, a 35% growth compared to last year. Table 2 lists for each submitted run the methods used and the obtained results. These methods include lexical overlapping, based on lexicons such as WordNet (Miller, 1995) and automatically acquired resources which are based on statistical measures; n-gram matching and subsequence overlapping between t and h; syntactic matching, e.g. relation matching and tree edit distance algorithms; semantic annotation induced using resources such as FrameNet (Baker et al., 1998); logical inference using logic provers; statistics computed from local corpora or the Web (including statistical measures available for lexical resources such as WordNet); usage of background knowledge, including inference rules and paraphrase templates, and acquisition (automatic and manual) of additional entailment corpora. Many of the systems derive multiple similarity measures, based on different levels of analysis (lexical, syntactic, logical), and subsequently use them as features for a classifier that makes the final decision.

Table 2: Submission results and system description. Systems for which no component is indicated used lexical overlap.
Overall, the common criteria for entailment recognition were similarity between t and h, or the coverage of h by t (in lexical and lexical-syntactic methods), and the ability to infer h from t (in the logical approach). Zanzotto et al. also measured the similarity between different (t, h) pairs (cross-pair similarity). Some groups tried to detect non-entailment by looking for various kinds of mismatch between the text and the hypothesis. This approach is related to an earlier observation in (Vanderwende et al., 2005), which suggested that it is often easier to detect false entailment.
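As a rough illustration of the feature-combination scheme described above (and not of any particular submitted system), the sketch below computes a few toy similarity features between t and h and trains a scikit-learn classifier on development-set pairs; real systems would substitute much richer lexical, syntactic and logical scores.

```python
from sklearn.linear_model import LogisticRegression

def toy_features(text, hypothesis):
    """Stand-ins for the per-level similarity scores used by real systems."""
    t_tokens, h_tokens = text.lower().split(), hypothesis.lower().split()
    unigram_coverage = len(set(h_tokens) & set(t_tokens)) / max(len(set(h_tokens)), 1)
    t_bigrams = set(zip(t_tokens, t_tokens[1:]))
    h_bigrams = set(zip(h_tokens, h_tokens[1:]))
    bigram_coverage = len(h_bigrams & t_bigrams) / max(len(h_bigrams), 1)
    length_ratio = len(h_tokens) / max(len(t_tokens), 1)
    return [unigram_coverage, bigram_coverage, length_ratio]

def train_entailment_classifier(dev_pairs):
    """dev_pairs: iterable of (text, hypothesis, judgment) with judgment YES/NO."""
    X = [toy_features(t, h) for t, h, _ in dev_pairs]
    y = [int(judgment == "YES") for _, _, judgment in dev_pairs]
    return LogisticRegression().fit(X, y)

# clf = train_entailment_classifier(dev_pairs)
# prediction = "YES" if clf.predict([toy_features(t, h)])[0] else "NO"
```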
3.3 Results
The accuracy achieved by the participating systems ranges from 53% to 75% (considering the best run of each group), while most of the systems obtained 55%-61%. Two submissions, Hickl et al. (accuracy 75.4%, average precision 80.8%) and Tatu et al. (accuracy 73.8%, average precision 71.3%), stand out as about 10% higher than the other systems. The top accuracies are considerably higher than the best results achieved in RTE-1 (around 60%).

The results show, for the first time, that systems that rely on deep analysis such as syntactic matching and logical inference can considerably outperform lexical systems, which were shown to achieve around 60% on the RTE datasets. In the RTE-1 challenge, one of the two best performing systems was based on lexical statistics from the web (Glickman et al., 2006). Zanzotto et al. experimented with baseline lexical systems, applied to both the RTE-1 and RTE-2 datasets. For RTE-1 they found that even a simple statistical lexical system, based on an IDF measure, gets close to 60% accuracy. Bar-Haim et al. (2005) also showed, by manually analyzing the RTE-1 dataset, that lexical systems are expected to achieve up to around 60% if we require that h is fully lexically entailed by (covered by) t. For the RTE-2 test set, Zanzotto et al. found that simple lexical overlapping achieves an accuracy of 60%, better than any other sophisticated lexical method they tested (Katrenko and Adriaans report 57% for a slightly different baseline).
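The simple lexical-overlap baseline discussed above can be stated in a few lines: judge YES when the hypothesis words are (mostly) covered by the text words, with the coverage threshold tuned on the development set. The sketch below is an illustrative reconstruction of such a baseline, not the exact systems evaluated by Zanzotto et al. or Katrenko and Adriaans; the 0.75 threshold is an assumed placeholder.

```python
def coverage(text, hypothesis):
    """Fraction of hypothesis word types that also occur in the text."""
    t_words = set(text.lower().split())
    h_words = set(hypothesis.lower().split())
    return len(h_words & t_words) / max(len(h_words), 1)

def judge(text, hypothesis, threshold=0.75):
    """YES when h is (mostly) lexically covered by t; the threshold would be
    tuned on the development set."""
    return "YES" if coverage(text, hypothesis) >= threshold else "NO"

def accuracy(pairs, threshold=0.75):
    """pairs: iterable of (text, hypothesis, gold) triples with gold in {YES, NO}."""
    pairs = list(pairs)
    correct = sum(judge(t, h, threshold) == gold for t, h, gold in pairs)
    return correct / len(pairs)
```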
3.4 The contribution of knowledge and training data

Although it is clear that deeper analysis is a must for achieving high accuracy, most of the systems participating in RTE-2 that employed deep analysis did not improve significantly over the 60% baseline of lexical matching. The participants’ reports point out two main reasons for the shortcomings of current systems: the size of the training corpus (the RTE-2 development set and the RTE-1 datasets together contain less than 2,200 pairs), and the lack of linguistic and background knowledge.

It seems that the best performing systems were those which better coped with these issues. Hickl et al. utilized a very large entailment corpus, automatically collected from the web, following (Burger and Ferro, 2005). In addition, they manually annotated a corpus of lexical entailment, which was used to bootstrap automatic annotation of a larger lexical entailment corpus. These corpora contributed 10% to the overall accuracy they achieved. Tatu et al. developed an entailment system based on logical inference, which relies on extensive linguistic and background knowledge from various sources.

The success of these systems suggests that perhaps the most important factors for deep entailment systems are the amount of linguistic and background knowledge, and the size of the training corpora, rather than the exact method for modeling t and h and the exact inference mechanism.
3.5 Per-task analysis
Per-task analysis shows that systems scored considerably higher on the multi-document summarization task (SUM). The same trend was observed in RTE-1 for the comparable documents (CD) task, which was similar to the RTE-2 summarization task. For most systems, the lowest accuracy was obtained for the IE task. Katrenko and Adriaans report that simple lexical overlapping was able to correctly predict entailment for 67% of the SUM pairs, but only for 47% of the IE pairs.

Some of the participants took into account such inter-task differences, and tuned the parameters of their models separately for each task. Given the observed differences among the tasks, it seems that a better understanding of how entailment in each task differs might improve the performance of future systems.
3.6 Additional observations
Some participants tested their systems on both the RTE-1 and RTE-2 datasets. Some systems performed better on RTE-1 while others performed better on RTE-2, and the results were usually quite close, with up to 5% difference in either direction. This indicates a similar level of difficulty for both datasets. However, simple lexical overlap systems were found to perform better on the RTE-2 test set than on the RTE-1 test set – 60% on RTE-2 vs. 53.9% on RTE-1, as reported by Zanzotto et al. (although for the RTE-1 development set they obtained 57.1%). Interestingly, de Marneffe et al. and Zanzotto et al. report that adding the RTE-1 data to the RTE-2 training set reduced the results, which indicates the variance between the two datasets (notice that the RTE-1 datasets include three tasks not present in RTE-2; Inkpen et al. showed that the results somewhat improve if only the compatible tasks in RTE-1 are considered). Schilder and Thomson McInnes found that classification using only the lengths of t and h as features could give an accuracy of 57.4%.

In the RTE-2 dataset (both the development set and the test set), multiple IR pairs were created for a single IR query (where t was extracted from different retrieved documents), and similarly, multiple QA pairs were created for a single question (where t was extracted from different answer passages). Some of the groups (de Marneffe et al., Nicholson et al.) noted that these dependencies between the pairs could potentially have a negative effect on the learning, and somewhat bias the evaluation on the test set. In practice, however, there was no evidence that systems perform significantly worse on the RTE-2 test set than on the RTE-1 test set (using the RTE-2/RTE-1 development sets, respectively, for training), and, as described above, similar scores were obtained for both datasets.
4 Conclusion and future work
The submissions for the Second PASCAL Recognising Textual Entailment Challenge show growing interest in this applied framework. The considerable improvement in performance achieved within only one year is very encouraging, and the diversity of new approaches and research directions introduced this year seems very promising for further research. While the setting for the entailment recognition task in RTE-2 followed the same setting as RTE-1, we expect that the next RTE challenges will introduce new settings. One possible direction is to provide wider contexts, i.e. expanding t from 1-2 sentences to paragraphs or even complete documents.
Acknowledgments

The following sources were used in the preparation of the data:

• AnswerBus question answering system, provided by Zhiping Zheng, Computational Linguistics Department, Saarland University. http://answerbus.coli.uni-sb.de/

• PowerAnswer question answering system, from Language Computer Corporation, provided by Dan Moldovan, Abraham Fowler, Christine Clark, Arthur Dexter and Justin Larrabee. http://www.languagecomputer.com/solutions/question answering/power answer/

• Columbia NewsBlaster multi-document summarization system, from the Natural Language Processing group at Columbia University’s Department of Computer Science. http://newsblaster.cs.columbia.edu/

• NewsInEssence multi-document summarization system, provided by Dragomir R. Radev and Jahna Otterbacher from the Computational Linguistics And Information Retrieval research group, University of Michigan. http://www.newsinessence.com/

• IBM’s information extraction system, provided by Salim Roukos and Nanda Kambhatla, I.B.M. T.J. Watson Research Center.

• New York University’s information extraction system, provided by Ralph Grishman, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University.

• ITC-irst’s information extraction system, provided by Lorenza Romano, Cognitive and Communication Technologies (TCC) division, ITC-irst, Trento, Italy.

• MUC-4 information extraction dataset, from the National Institute of Standards and Technology (NIST). http://www.itl.nist.gov/iaui/894.02/related projects/muc/

• ACE 2004 information extraction templates, from the National Institute of Standards and Technology (NIST). http://www.nist.gov/speech/tests/ace/

• TREC IR queries and TREC-QA question collections, from the National Institute of Standards and Technology (NIST). http://trec.nist.gov/

• CLEF IR queries and CLEF-QA question collections, from the DELOS Network of Excellence for Digital Libraries. http://www.clef-campaign.org/, http://clef-qa.itc.it/

We would like to thank the people and organizations that made these sources available for the challenge. In addition, we thank Oren Glickman and Dan Roth for their assistance and advice.

We would also like to acknowledge the people and organizations involved in creating and annotating the data: Malky Rabinowitz, Dana Mills, Ruthie Mandel, Errol Hayman, Vanessa Sandrini, Allesandro Valin, Elizabeth Lima, Jeff Stevenson, Amy Muia and the Butler Hill Group.

This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views. We wish to thank the managers of the PASCAL challenges program, Michele Sebag, Florence d’Alche-Buc, and Steve Gunn, and the PASCAL Challenges Workshop Chair, Rodolfo Delmonte, for their efforts and support, which made this challenge possible.
References

C. Baker, C. Fillmore, and J. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING-ACL, Montreal, Canada.

Roy Bar-Haim, Idan Szpektor, and Oren Glickman. 2005. Definition and analysis of intermediate entailment levels. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 55–60, Ann Arbor, Michigan, June. Association for Computational Linguistics.

John Burger and Lisa Ferro. 2005. Generating an entailment corpus from news headlines. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 49–54, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Ido Dagan and Oren Glickman. 2004. Probabilistic textual entailment: Generic applied modeling of language variability. PASCAL Workshop on Text Understanding and Mining.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. In Quiñonero-Candela et al., editors, MLCW 2005, LNAI Volume 3944, pages 177–190. Springer-Verlag.

Oren Glickman, Ido Dagan, and Moshe Koppel. 2006. Web based probabilistic textual entailment. In Quiñonero-Candela et al., editors, MLCW 2005, LNAI Volume 3944, pages 287–298. Springer-Verlag.

J. R. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33:159–174.

G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, pages 39–41, November.

Lucy Vanderwende, Deborah Coughlin, and Bill Dolan. 2005. What syntax can contribute in the entailment task. In Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment (and forthcoming LNAI book chapter).

Ellen M. Voorhees and Donna Harman. 1999. Overview of the seventh text retrieval conference. In Proceedings of the Seventh Text REtrieval Conference (TREC-7). NIST Special Publication.