Conference Paper

A Proposal on Evaluation Measures for RTE

Abstract

We outline problems with the interpretation of accuracy in the presence of bias, arguing that the issue is a particularly pressing concern for RTE evaluation. Furthermore, we argue that average precision scores are unsuitable for RTE, and should not be reported. We advocate mutual information as a new evaluation measure that should be reported in addition to accuracy and confidence-weighted score.
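As a rough illustration of the proposed measure (a sketch, not code from the paper), the mutual information between the gold labels and a system's YES/NO output can be estimated directly from contingency counts. The toy data below are invented to show why a biased always-YES baseline earns non-trivial accuracy yet zero mutual information.

```python
from collections import Counter
from math import log2

def mutual_information(gold, pred):
    """Mutual information (in bits) between gold labels and system predictions."""
    n = len(gold)
    joint = Counter(zip(gold, pred))
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    mi = 0.0
    for (g, p), c in joint.items():
        # P(g,p) * log2( P(g,p) / (P(g) * P(p)) ), written with raw counts
        mi += (c / n) * log2(c * n / (gold_counts[g] * pred_counts[p]))
    return mi

# Invented toy data: on a 60/40 biased test set, an always-YES baseline
# reaches 60% accuracy, yet its output carries no information about the
# gold labels, so mutual information is exactly 0.
gold = ["YES"] * 60 + ["NO"] * 40
always_yes = ["YES"] * 100
print(mutual_information(gold, always_yes))  # 0.0
```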

... Some of the material in this chapter was previously published (Bergmair 2009). 3 see Bentivogli et al. (2009), Giampiccolo et al. (2008), Voorhees (2008), Giampiccolo et al. (2007), Bar-Haim et al. (2006), Dagan et al. (2005) ...
Thesis
Full-text available
This thesis develops several pieces of theory and computational techniques which can be deployed for the purpose of allowing a computer to analyze short pieces of text (e.g. "Socrates is a man and every man is mortal.") and, on the basis of such an analysis, to decide yes/no questions about the text ("Is Socrates mortal?"). More particularly, the problem is seen as a logical inferencing task. The computer must decide whether or not a logical consequence relation "therefore" holds between the two pieces of text. ("Socrates is a man and every man is mortal, therefore Socrates is mortal.") This problem is a pervasive theme in logic and semantics but has also been subject over the last five years to a wave of renewed attention in computational linguistics, sparked by the Recognizing Textual Entailment (RTE) challenge. A critical reevaluation of this line of work is presented here which demonstrates several problems concerning the empirical methodology used at RTE and the results derived from it. This thesis is thus more theory-driven, but nevertheless inspired by RTE in that it addresses problems raised by RTE which have not previously received sufficient attention from a theoretical viewpoint, such as the problem of robustness. With this goal in mind, two of the results on Natural Language Reasoning (NLR) established here become particularly important: (1) Assuming the syllogism as a benchmark fragment of NLR, the model theory which underlies NLR is not necessarily a two-valued logic, but can be the many-valued Łukasiewicz logic. (2) Despite the fact that the syllogism is a logical language of less expressive power than natural language as a whole, a good approximation to NLR can still be obtained by using the method outlined here for rewriting natural language text into syllogistic premises. These two properties of NLR enable the approach to robust inference and logical pattern processing called Monte Carlo semantics, which, in turn, demonstrates that a single logically based theory can account both for the semantic informativity of deep techniques using theorem proving and for the robustness of bag-of-words shallow inference.
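For readers unfamiliar with the many-valued logic mentioned in result (1), the Łukasiewicz connectives over truth values in [0, 1] are easy to state. The sketch below is a generic illustration of those connectives, not the thesis's implementation.

```python
def luk_not(a: float) -> float:
    # Łukasiewicz negation
    return 1.0 - a

def luk_and(a: float, b: float) -> float:
    # Łukasiewicz strong conjunction (t-norm)
    return max(0.0, a + b - 1.0)

def luk_implies(a: float, b: float) -> float:
    # Łukasiewicz implication; reduces to classical implication on {0, 1}
    return min(1.0, 1.0 - a + b)

# On crisp values the connectives behave classically ...
assert luk_implies(1.0, 0.0) == 0.0
# ... while intermediate truth values degrade gracefully, which is what
# makes a many-valued logic attractive for robust inference over noisy analyses.
print(round(luk_implies(0.9, 0.8), 2))  # 0.9
```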
Article
In the last few years, a number of NLP researchers have developed and participated in the task of Recognizing Textual Entailment (RTE). This task encapsulates Natural Language Understanding capabilities within a very simple interface: recognizing when the meaning of a text snippet is contained in the meaning of a second piece of text. This simple abstraction of an exceedingly complex problem has broad appeal partly because it can also be conceived as a component in other NLP applications, from Machine Translation to Semantic Search to Information Extraction. It also avoids commitment to any specific meaning representation and reasoning framework, broadening its appeal within the research community. This level of abstraction also facilitates evaluation, a crucial component of any technological advancement program. This book explains the RTE task formulation adopted by the NLP research community, and gives a clear overview of research in this area. It draws out commonalities in this research, detailing the intuitions behind dominant approaches and their theoretical underpinnings. This book has been written with a wide audience in mind, but is intended to inform all readers about the state of the art in this fascinating field, to give a clear understanding of the principles underlying RTE research to date, and to highlight the short- and long-term research goals that will advance this technology. Table of Contents: List of Figures / List of Tables / Preface / Acknowledgments / Textual Entailment / Architectures and Approaches / Alignment, Classification, and Learning / Case Studies / Knowledge Acquisition for Textual Entailment / Research Directions in RTE / Bibliography / Authors' Biographies
Article
Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true. Paraphrasing can be seen as bidirectional textual entailment and methods from the two areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of natural language processing applications, including question answering, summarization, text generation, and machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources. (Technical Report, Natural Language Processing Group, Department of Informatics, Athens University of Economics and Business, Greece, 2010.)
Conference Paper
Full-text available
This paper describes the PASCAL Network of Excellence Recognising Textual Entailment (RTE) Challenge benchmark. The RTE task is defined as recognizing, given two text fragments, whether the meaning of one text can be inferred (entailed) from the other. This application-independent task is suggested as capturing major inferences about the variability of semantic expression which are commonly needed across multiple applications. The Challenge has raised noticeable attention in the research community, attracting 17 submissions from diverse groups, suggesting the generic relevance of the task.
Conference Paper
Full-text available
The third PASCAL Recognizing Textual Entailment Challenge (RTE-3) contained an optional task that extended the main entailment task by requiring a system to make three-way entailment decisions (entails, contradicts, neither) and to justify its response. Contradiction was rare in the RTE-3 test set, occurring in only about 10% of the cases, and systems found accurately detecting it difficult. Subsequent analysis of the results shows a test set must contain many more entailment pairs for the three-way decision task than the traditional two-way task to have equal confidence in system comparisons. Each of six human judges representing eventual end users rated the quality of a justification by assigning "understandability" and "correctness" scores. Ratings of the same justification across judges differed significantly, signaling the need for a better characterization of the justification task.
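To see intuitively why the three-way task demands larger test sets, consider a back-of-the-envelope confidence calculation (illustrative only, not the paper's analysis): with contradiction at roughly 10% of the pairs, the per-class estimate rests on far fewer examples, so its confidence interval is much wider than the one for overall two-way accuracy.

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% confidence half-width for a proportion p
    estimated from n test pairs (worst case at p = 0.5)."""
    return z * math.sqrt(p * (1 - p) / n)

# With contradiction at roughly 10% of an illustrative 800-pair test set,
# only ~80 pairs bear on that class, so the per-class estimate is far less
# certain than the overall two-way accuracy estimate.
print(round(ci_half_width(0.5, 800), 3))  # overall: about +/- 0.035
print(round(ci_half_width(0.5, 80), 3))   # contradiction subset: about +/- 0.110
```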
Article
Full-text available
In recent years, the kappa coefficient of agreement has become the de facto standard for evaluating intercoder agreement for tagging tasks. In this squib, we highlight issues that affect κ and that the community has largely neglected. First, we discuss the assumptions underlying different computations of the expected agreement component of κ. Second, we discuss how prevalence and bias affect the κ measure.
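A small worked example of the prevalence effect discussed in the squib (invented numbers, using Cohen's formulation of expected agreement): two coders can agree on 90% of items yet obtain a negative κ when one class dominates both coders' marginals.

```python
from collections import Counter

def cohen_kappa(coder1, coder2):
    """Cohen's kappa, with chance agreement computed from each coder's
    own marginal distribution (one of the variants the squib compares)."""
    n = len(coder1)
    a_o = sum(x == y for x, y in zip(coder1, coder2)) / n
    m1, m2 = Counter(coder1), Counter(coder2)
    a_e = sum((m1[label] / n) * (m2[label] / n) for label in set(m1) | set(m2))
    return (a_o - a_e) / (1 - a_e)

# Prevalence effect: 90% raw agreement, but "YES" dominates both coders'
# marginals, so chance agreement is 0.905 and kappa comes out slightly negative.
pairs = [("YES", "YES")] * 90 + [("YES", "NO")] * 5 + [("NO", "YES")] * 5
c1, c2 = zip(*pairs)
print(round(cohen_kappa(c1, c2), 3))  # approximately -0.053
```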
Article
This paper presents the Third PASCAL Recognising Textual Entailment Challenge (RTE-3), providing an overview of the dataset creation methodology and the submitted systems. In creating this year's dataset, a number of longer texts were introduced to make the challenge more oriented to realistic scenarios. Additionally, a pool of resources was offered so that the participants could share common tools. A pilot task was also set up, aimed at differentiating unknown entailments from identified contradictions and providing justifications for overall system decisions. 26 participants submitted 44 runs, using different approaches and generally presenting new entailment models and achieving higher scores than in the previous challenges.
Ron Artstein and Massimo Poesio. 2005. Kappa3 = alpha (or beta). Technical Report CSM-437, University of Essex Department of Computer Science.

Marie-Catherine de Marneffe and Christopher Manning. 2007. Contradiction annotation. http://nlp.stanford.edu/RTE3-pilot/contradictions.pdf.

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The Second PASCAL Recognising Textual Entailment Challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment (RTE-2).

Danilo Giampiccolo, Hoa Trang Dang, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2008. The Fourth PASCAL Recognizing Textual Entailment Challenge. In Preproceedings of the Text Analysis Conference (TAC).

TAC. 2009. TAC2009 RTE-5 Main Task Guidelines. http://www.nist.gov/tac/2009/RTE/RTE5 Main Guidelines.pdf.