About
282 Publications
30,288 Reads
15,605 Citations
Publications (282)
Understanding discourse relies to a great extent on correctly interpreting relations holding between the eventualities and facts mentioned in discourse. These discourse relations, such as causal, contrastive and temporal relations, can be expressed explicitly or implicitly in the discourse, and are the subject of annotation in the Penn Discourse Tr...
The Penn Discourse Treebank (PDTB) was released to the public in 2008. It remains the largest manually annotated corpus of discourse relations to date. Its focus on discourse relations that are either lexically-grounded in explicit discourse connectives or associated with sentential adjacency has not only facilitated its use in language technology...
Recursion is a key aspect of formal theories of language. In this chapter we have presented a new perspective on recursion in language, claiming that recursion in language is indirectly constrained. After discussing some aspects of the relation between linguistic theory and mathematical linguistics, we show that this perspective arises from the pro...
This paper considers the problem of checking whether an organization conforms to a body of regulation. Conformance is cast as a trace checking question – the regulation is represented in a logic that is evaluated against an abstract trace or run representing the operations of an organization. We focus on a problem in designing a logic to represent...
In this paper we present our experiments with different annotation workflows for annotating discourse relations in the Hindi Discourse Relation Bank (HDRB). In view of the growing interest in the development of discourse data-banks based on the PDTB framework and the complexities associated with the discourse annotation, it is important to study and...
Recently, Kuhlmann (2007, Dependency Structures and Lexicalized Grammars. PhD Thesis, Saarland University) and collaborators have shown how the derivations of generative grammars can be recast as dependency structures. This connection between the generative and dependency traditions opens the door to a fresh perspective on how to formally character...
The discourse properties of text have long been recognized as critical to language technology, and over the past 40 years, our understanding of and ability to exploit the discourse properties of text has grown in many ways. This essay briefly recounts these developments, the technology they employ, the applications they support, and the new cha...
The discourse properties of text have long been recognized as critical to language technology, and over the past 40 years, our understanding of and ability to exploit the discourse properties of text has grown in many ways. This essay briefly recounts these developments, the technology they employ, the applications they support, and the new challen...
Multi-word expressions (MWEs) account for a large portion of the language used in day-to-day interactions. A formal system that is flexible enough to model these large and often syntactically-rich non-compositional chunks as single units in naturally occurring text could considerably simplify large-scale semantic annotation projects, in which it wo...
Identification of discourse relations, such as causal and contrastive relations, between situations mentioned in text is an important task for biomedical text-mining. A biomedical text corpus annotated with discourse relations would be very useful for developing and evaluating methods for biomedical discourse processing. However, little effort has...
Formal languages for policy have been developed for access control and conformance checking. In this paper, we describe a formalism that combines features that have been developed for each application. From access control, we adopt the use of a saying operator. From conformance checking, we adopt the use of operators for obligation and permission....
The computation of logical form has been proposed as an intermediate step in the translation of sentences to logic. Logical form encodes the resolution of scope ambiguities. In this paper, we describe experiments on a modest-sized corpus of regulation annotated with a novel variant of logical form, called abstract syntax trees (ASTs). The main step...
Studies of discourse relations have not, in the past, attempted to characterize what serves as evidence for them, beyond lists of frozen expressions, or markers, drawn from a few well-defined syntactic classes. In this paper, we describe how the lexicalized discourse relation annotations of the Penn Discourse Treebank (PDTB) led to the discovery of...
We present analyses aimed at eliciting which specific aspects of discourse provide the strongest indication for text importance. In the context of content selection for single document summarization of news, we examine the benefits of both the graph structure of text provided by discourse relations and the semantic sense of these relations....
We report results on predicting the sense of implicit discourse relations between adjacent sentences in text. Our investigation concentrates on the association between discourse relations and properties of the referring expressions that appear in the related sentences. The properties of interest include coreference information, grammatical...
This paper describes the question generation system developed at UPenn for QGSTEC, 2010. The system uses predicate argument structures of sentences along with semantic roles for the question generation task from paragraphs. The semantic role labels are used to identify relevant parts of text before forming questions over them. The generated quest...
In this paper, we make a qualitative and quantitative analysis of discourse relations within the LUNA conversational spoken dialog corpus. In particular, we describe the adaptation of the Penn Discourse Treebank (PDTB) annotation scheme to the LUNA dialogs. We discuss similarities and differences between our approach and the PDTB paradigm and point...
We present an approach to automatically identifying the arguments of discourse connectives based on data from the Penn Discourse Treebank. Of the two arguments of connectives, called Arg1 and Arg2, we focus on Arg1, which has proven more challenging to identify. Our approach employs a sentence-based representation of arguments, and distinguishes in...
Within generative approaches to grammar, characterizing the complexity of natural language has traditionally been couched in terms of formal language theory. Recently, Kuhlmann (2007) and collaborators have shown how derivations of generative grammars can be alternately represented as dependency graphs. The properties of such structures provide a...
Investigations into employing statistical approaches with linguistically motivated representations and their impact on natural language processing tasks. © 2010 Massachusetts Institute of Technology. All rights reserved.
We describe the Hindi Discourse Relation Bank project, aimed at developing a large corpus annotated with discourse relations. We adopt the lexically grounded approach of the Penn Discourse Treebank, and describe our classification of Hindi discourse connectives, our modifications to the sense classification of discourse relations, and some cross-li...
In the Hindi Discourse Relation Bank (HDRB) project, we are developing a large corpus annotated with discourse relations, such as causal, temporal, contrastive and conjunctive relations. Adopting the lexically grounded approach of the Penn Discourse Treebank (PDTB), we annotate the argument structure of both explicit and implicit discourse relat...
While many advances have been made in Natural Language Generation (NLG), the scope of the field has been somewhat restricted because of the lack of annotated corpora from which properties of texts can be automatically acquired and applied towards the development of generation systems. In this paper, we describe how the Penn Discourse TreeBank (PDT...
This paper considers the problem of checking whether an organization conforms to a body of regulation. Conformance is cast as a trace checking question – the regulation is represented in a logic that is evaluated against an abstract trace or run representing the operations of an organization. We focus on a problem in designing a logic to represent...
We introduce LTAG-spinal, a novel variant of traditional Lexicalized Tree Adjoining Grammar (LTAG) with desirable linguistic, computational and statistical properties. Unlike in traditional LTAG, subcategorization frames and the argument–adjunct distinction are left underspecified in LTAG-spinal. LTAG-spinal with adjunction constraints is weakly eq...
The goal of the Penn Discourse Treebank (PDTB) project is to develop a large-scale corpus, annotated with coherence relations marked by discourse connectives. Currently, the primary application of the PDTB annotation has been to news articles. In this study, we tested whether the PDTB guidelines can be adapted to a different genre. We annotated d...
We present a corpus study of local discourse relations based on the Penn Discourse Tree Bank, a large manually annotated corpus of explicitly or implicitly realized relations. We show that while there is a large degree of ambiguity in temporal explicit discourse connectives, overall connectives are mostly unambiguous and allow high-accu...
This paper considers the problem of checking whether an organization conforms to a body of regulation. Conformance is cast as a trace checking question – the regulation is represented in a logic that is evaluated against an abstract trace or run representing the operations of an organization. We focus on a problem in designing a logic to represent...
We consider the problem of checking whether the operations of an organization conform to a body of regulation. The immediate motivation comes from the analysis of the U.S. Food and Drug Administration regulations that apply to bloodbanks - organizations that collect, process, store, and use donations of blood and blood components. Statements in suc...
An important aspect of discourse understanding and generation involves the recognition and processing of discourse relations. These are conveyed by discourse connectives, i.e., lexical items like because and as a result or implicit connectives expressing an inferred discourse relation. The Penn Discourse TreeBank (PDTB) provides annotations of...
We present the second version of the Penn Discourse Treebank, PDTB-2.0, describing its lexically-grounded annotations of discourse relations and their two abstract object arguments over the 1 million word Wall Street Journal corpus. We describe all aspects of the annotation, including (a) the argument structure of discourse relations, (b) the sense...
In this paper, we first introduce a new architecture for parsing, bidirectional incremental parsing. We propose a novel algorithm for incremental construction, which can be applied to many structure learning problems in NLP. We apply this algorithm to LTAG dependency parsing, and achieve significant improvement on accuracy over the previous bes...
The term discourse structure is used to denote any structure of a text above that of the sentence. Trees have often been posited as a good abstraction when discourse is taken to have a hierarchical structure (Mann and Thompson 1987; Webber et al. 2003; Marcu 2000; Egg and Redeker 2008). Nevertheless, periodically researchers have commented on the n...
We address the subtask of generating why-questions from texts and propose the use of causal relations annotated in the Penn Discourse TreeBank for evaluating content-selection methods for why-question generation. Our initial experiments show that 71% of an independently developed data set of why-questions can be correlated with causal relations a...
We describe our initial efforts towards developing a large-scale corpus of Hindi texts annotated with discourse relations. Adopting the lexically grounded approach of the Penn Discourse Treebank (PDTB), we present a preliminary analysis of discourse connectives in a small corpus. We describe how discourse connectives are represented in the sentence...
The overall goal is to discuss some issues concerning the dependencies at the discourse level and at the sentence level. However, first I will briefly describe the Penn Discourse Treebank (PDTB)*, a corpus in which we annotate the discourse connectives (explicit and implicit) and their arguments together with “attributions” of the arguments and the...
In this paper, we describe an approach to formally assess whether an organization conforms to a body of regulation. Conformance is cast as a model checking question where the regulation is represented in a logic that is evaluated against an abstract model representing the operations of an organization. Regulatory bases are large and complex, and th...
Unlike homopolymers, biopolymers are composed of specific sequences of different types of monomers. In proteins and RNA molecules, one-dimensional sequence information encodes a three-dimensional fold, leading to a corresponding molecular function. Such folded structures are not treated adequately through traditional methods of polymer statistical...
Recent work identifies two properties that appear particularly relevant to the characterization of graph-based dependency models of syntactic structure: the absence of interleaving substructures (well-nestedness) and a bound on a type of discontinuity (gap-degree ≤ 1) successfully describe more than 99% of the structures in two dependency treebanks...
Discriminative approaches for word alignment have gained popularity in recent years because of the flexibility that they offer for using a large variety of features and combining information from various sources. But, the models proposed in the past have not been able to make much use of features that capture the likelihood of an alignment structu...
Despite much study, biomolecule folding cooperativity is not well understood. There are quantitative models for helix-coil transitions and for coil-to-globule transitions, but no accurate models yet treat both chain collapse and secondary structure formation together. We develop here a dynamic programming approach to statistical mechanical partitio...
This report contains the guidelines for the annotation of discourse relations in the Penn Discourse Treebank (http://www.seas.upenn.edu/~pdtb), PDTB. Discourse relations in the PDTB are annotated in a bottom up fashion, and capture both lexically realized relations as well as implicit relations. Guidelines in this report are provided for all aspect...
In this paper, we propose guided learning, a new learning framework for bidirectional sequence classification. The tasks of learning the order of inference and training the local classifier are dynamically incorporated into a single Perceptron-like learning algorithm. We apply this novel learning algorithm to POS tagging. It obtains an error...
In this paper we explore the use of selectional preferences for detecting noncompositional verb-object combinations. To characterise the arguments in a given grammatical relationship we experiment with three models of selectional preference. Two use WordNet and one uses the entries from a distributional thesaurus as classes for representation. In p...
Measuring the relative compositionality of Multi-word expressions (MWEs) is crucial to Natural Language Processing. Hindi contains a rich set of Noun+Verb MWEs and hence, it is very important to handle them. Very limited work was done previously towards characterizing the MWEs in Hindi of Noun+Verb type. Also, various statistical measures which are...
In cooperative man-machine interaction, it is necessary but not sufficient for a system to respond truthfully and informatively to a user's question. In particular, if the system has reason to believe that its planned response might mislead the user, then it must block that conclusion by modifying its response. This paper focuses on identifying and...
We describe the problem of anaphora resolution and discuss approaches to modeling this problem. Centering Theory (CT), which is an approach to modeling certain aspects of local coherence in discourse, includes within it the component that models anaphora resolution. However, CT itself is not a theory of anaphora resolution. It was developed as part...
An important puzzle in structural biology is the question of how proteins are able to fold so quickly into their unique native structures. There is much evidence that protein folding is hierarchic. In that case, folding routes are not linear, but have a tree structure. Trees are commonly used to represent the grammatical structure of natural langua...
It is well known that multi-word expressions are problematic in natural language processing. In previous literature, it has been suggested that information about their degree of compositionality can be helpful in various applications but it has not been proven empirically. In this paper, we propose a framework in which information about the multi...
We present an initial investigation into the use of a metagrammar for explicitly sharing abstract grammatical specifications among languages. We define a single class hierarchy for a metagrammar which allows us to automatically generate grammars for different languages from a single compact metagrammar hierarchy. We use as our linguistic example...
A syntax-directed translator first parses the source-language input into a parse-tree, and then recursively converts the tree into a string in the target-language. We model this conversion by an extended tree-to-string transducer that has multi-level trees on the source-side, which gives our system more expressive power and flexibility. We also d...
Since the first application of context-free grammars to RNA secondary structures in 1988, many researchers have used both ad hoc and formal methods from computational linguistics to model RNA and protein structure. We show how nearly all of these methods are based on the same core principles and can be converted into equivalent approaches in the fr...
How can proteins fold so quickly into their unique native structures? We show here that there is a natural analogy between parsing and the protein folding problem, and demonstrate that CKY can find the native structures of a simplified lattice model of proteins with high accuracy.
An emerging task in text understanding and generation is to categorize information as fact or opinion and to further attribute it to the appropriate source. Corpus annotation schemes aim to encode such distinctions for NLP applications concerned with such tasks, such as information extraction, question answering, summarization, and generation. We d...
In syntax-directed translation, the source-language input is first parsed into a parse-tree, which is then recursively converted into a string in the target-language. We model this conversion by an extended tree-to-string transducer that has multi-level trees on the source-side, which gives our system more expressive power and flexibility. We...
We present an initial investigation into the use of a metagrammar for explicitly sharing abstract grammatical specifications among languages. We define a single class hierarchy for a metagrammar which allows us to automatically generate grammars for different languages from a single compact metagrammar hierarchy. We use as our linguistic example th...
Polymers, including biomolecules such as proteins, have two particularly important types of single-molecule transitions: a helix-coil transition, driven by interactions that are local in the chain, and a collapse transition, driven by nonlocal interactions. A long-standing challenge of polymer statistical mechanics has been to deal with both types...
Recognition of Multi-word Expressions (MWEs) and their relative compositionality are crucial to Natural Language Processing. Various statistical techniques have been proposed to recognize MWEs. In this paper, we integrate all the existing statistical features and investigate a range of classifiers for their suitability for recognizing the non-com...
The annotations of the Penn Discourse Treebank (PDTB) include (1) discourse connectives and their arguments, and (2) attribution of each argument of each connective and of the relation it denotes. Because the PDTB covers the same text as the Penn TreeBank WSJ corpus, syntactic and discourse annotation can be compared. This has revealed signific...
D-LTAG is a discourse-level extension of lexicalized tree-adjoining grammar (LTAG), in which discourse syntax is projected by different types of discourse connectives and discourse interpretation is a product of compositional rules, anaphora resolution, and inference. In this paper, we present a D-LTAG extension of ongoing work on an LTAG syntax-se...
We present a very efficient statistical incremental parser for LTAG-spinal, a variant of LTAG. The parser supports the full adjoining operation, dynamic predicate coordination, and non-projective dependencies, with a formalism of provably stronger generative capacity as compared to CFG. Using gold standard POS tags as input, on section 23 o...
Measuring the relative compositionality of Multi-word Expressions (MWEs) is crucial to Natural Language Processing. Various collocation based measures have been proposed to compute the relative compositionality of MWEs. In this paper, we define novel measures (both collocation based and context based measures) to measure the relative compositiona...
In this paper, we describe an annotation scheme for the attribution of abstract objects (propositions, facts, and eventualities) associated with discourse relations and their arguments annotated in the Penn Discourse TreeBank. The scheme aims to capture both the source and degrees of factuality of the abstract objects through the annotation of text...
Discourse connectives can be analysed as encoding predicate-argument relations whose arguments derive from the interpretation of discourse units. These arguments can be anaphoric or structural. Although structural arguments can be encoded in a parse tree, anaphoric arguments must be resolved by other means. A study of nine connectives, annotating t...
In setting up a formal system to specify a grammar formalism, the conventional (mathematical) wisdom is to start with primitives (basic primitive structures) as simple as possible, and then introduce various operations for constructing more complex structures. An alternate approach is to start with complex (more complicated) primitives, which direc...
The Penn Discourse TreeBank (PDTB) is a new resource built on top of the Penn Wall Street Journal corpus, in which discourse connectives are annotated along with their arguments. Its use of standoff annotation allows integration with a stand-off version of the Penn TreeBank (syntactic structure) and PropBank (verbs and their arguments), which adds...
In this paper, we propose novel EM algorithms for LTAG treebank induction, and present inside-outside algorithms on LTAG derivation shared forest. We illustrate our approach by showing how to use richer resources for this induction, in particular, the Penn Treebank, Propbank, and XTAG English Grammar.
This work is inspired by the so-called reranking tasks in natural language processing. In this paper, we first study the ranking, reranking, and ordinal regression algorithms proposed recently in the context of ranks and margins. Then we propose a general framework for ranking and reranking, and introduce a series of variants of the perceptron algo...
Perceptron like large margin algorithms are introduced for the experiments with various margin selections. Compared to the previous perceptron reranking algorithms, the new algorithms use full pairwise samples and allow us to search for margins in a larger space. Our experimental results on the data set of [1] show that a perceptron like ordinal re...
The Penn Discourse TreeBank (PDTB) is a new resource built on top of the complete Penn Wall Street Journal corpus, in which discourse connectives are annotated along with their arguments. Its use of stand-off annotation allows integration with a standoff version of the Penn TreeBank (syntactic structure) and PropBank (verbs and their arguments), w...
A grammar formalism specifies a domain of locality, i.e., a domain over which various dependencies (syntactic and semantic) can be specified. This issue is related to the use of constrained formal/computational systems just adequate for modeling various aspects of language. It leads to some novel ways of describing locality of structures and brings...
Discourse connectives can be analyzed as encoding predicate-argument relations whose arguments derive from the interpretation of discourse units. These arguments can be anaphoric or structural. Although structural arguments can be encoded in a parse tree, anaphoric arguments must be resolved by other means. A study of nine connectives, annotating t...
This paper describes a new, large scale discourse-level annotation project -- the Penn Discourse TreeBank (PDTB). We present an approach to annotating a level of discourse structure that is based on identifying discourse connectives and their arguments. The PDTB is being built directly on top of the Penn TreeBank and Propbank, thus supporting the e...
This paper describes a new discourse-level annotation project -- the Penn Discourse Treebank (PDTB) -- that aims to produce a large-scale corpus in which discourse connectives are annotated, along with their arguments, thus exposing a clearly defined level of discourse structure.
In this paper we outline a broad and integrated approach to creating behaviors for realtime 3D embodied agents. We start with a brief summary of the sorts of instructions we wish to accommodate and the architecture we have designed and implemented to interpret and execute instructions in context. The architecture includes a parameterized action dictio...
Objects in Discourse. Kluwer, Boston.
On the basis of data from Old Georgian (Boeder 1995), Michaelis and Kracht (1997) argue against treating semilinearity as a syntactic invariant. They claim that Suffixaufnahme in Old Georgian noun phrases is responsible for making Old Georgian a non-semilinear-growth language. We show that Michaelis and Kracht (a) draw an incorrect inference from t...
this paper is as follows. In Section 7.2 we will present a short introduction to LTAG, pointing out specifically how LTAG arises in the natural process of lexicalization of context-free grammars (CFG). The resulting system is, however, more powerful than CFGs, both in terms of weak generative capacity (string sets) and strong generative capacity (in...
This paper introduces a novel Support Vector Machines (SVMs) based voting algorithm for reranking, which provides a way to solve the sequential models indirectly. We have presented a risk formulation under the PAC framework for this voting algorithm. We have applied this algorithm to the parse reranking problem, and achieved labeled recall and pre...
We propose the use of Lexicalized Tree Adjoining Grammar (LTAG) as a source of features that are useful for reranking the output of a statistical parser. In this paper, we extend the notion of a tree kernel over arbitrary sub-trees of the parse to the derivation trees and derived trees provided by the LTAG formalism, and in addition, we extend the...
Supertagging is the tagging process of assigning the correct elementary tree of LTAG, or the correct supertag, to each word of an input sentence. In this paper we propose to use supertags to expose syntactic dependencies which are unavailable with POS tags. We first propose a novel method of applying Sparse Network of Winnow (SNoW) to sequential m...
For the specification of formal systems for a grammar formalism, conventional mathematical wisdom dictates that we start with primitives (basic primitive structures or building blocks) as simple as possible and then introduce various operations for constructing more complex structures. Alternatively, we can start with complex (more complicated) pri...
This paper is a note on some relationships between the strong and weak generative powers of formal systems, in particular, from the point of view of squeezing more strong power out of a formal system without increasing its weak generative power. We will comment on some old and new results from this perspective. Our main goal of this note is to comm...
This paper addresses the problem of constraints for relative quantifier scope, in particular in inverse linking readings where certain scope orders are excluded. We show how to account for such restrictions in the Tree Adjoining Grammar (TAG) framework by adopting a notion of flexible composition. In the semantics we use for TAG we introduce quanti...
We present an implementation of a discourse parsing system for a lexicalized Tree-Adjoining Grammar for discourse, specifying the integration of sentence and discourse level processing. Our system is based on the assumption that the compositional aspects of semantics at the discourse-level parallel those at the sentence-level. This coupling is achie...
In this paper we present a general parsing strategy that arose from the development of an Earley-type parsing algorithm for TAGs (Schabes and Joshi 1988) and from recent linguistic work in TAGs (Abeille 1988).
In this paper we show that an account for coordination can be constructed using the derivation structures in a lexicalized Tree Adjoining Grammar (LTAG). We present a notion of derivation in LTAGs that preserves the notion of fixed constituency in the LTAG lexicon while providing the flexibility needed for coordination phenomena. We also discus...
This paper describes our research on producing justifications. ("U" refers to the user, "S" to the system.) U: Is John taking four courses? Si: No. John can't take any courses: he's not a student
Lexicalized Tree Adjoining Grammar (LTAG) is an attractive formalism for linguistic description mainly because of its extended domain of locality and its factoring recursion out from the domain of local dependencies (Joshi, 1985, Kroch and Joshi, 1985, Abeillé, 1988). LTAG's extended domain of locality enables one to localize syntactic dependencie...
We have argued extensively in prior work that discourse connectives can be analyzed as encoding predicate-argument relations whose arguments derive from the interpretation of discourse units. All adverbial connectives we have analyzed to date have expressed binary relations. But they are special in taking one of their two arguments structurally...
MILDLY CONTEXT-SENSITIVE GRAMMARS and languages (MCSG, MCSL) arose out of the study of formal grammars adequate to model natural language structures, which are as restricted as possible in their formal power when compared to the unrestricted grammars which are equivalent to Turing machines. In the early 1980's, a grammatical formalism called general...
We here explore a "fully" lexicalized Tree-Adjoining Grammar for discourse that takes the basic elements of a (monologic) discourse to be not simply clauses, but larger structures that are anchored on variously realized discourse cues. This link with intra-sentential grammar suggests an account for different patterns of discourse cues, while the d...
In this paper, we present a method for comparing Lexicalized Tree Adjoining Grammars extracted from annotated corpora for three languages: English, Chinese and Korean. This method makes it possible to do a quantitative comparison between the syntactic structures of each language, thereby providing a way of testing the Universal Grammar Hypothes...
In this paper we pursue the idea that by making the descriptions of primitive items (lexical items in the linguistic context) more complex, we can make the computation of linguistic structure more local. The idea is that by making the descriptions of primitives more complex, we can not only make more complex constraints operate more locally but al...