About
266
Publications
32,626
Reads
13,105
Citations
Publications (266)
The Termolator is an open-source high-performing terminology extraction system, available on GitHub. The Termolator combines several different approaches to get superior coverage and precision. The in-line term component identifies potential instances of terminology using a chunking procedure, similar to noun group chunking, but favoring chunks tha...
This is the first time New York University (NYU) has participated in the event nugget (EN) evaluation of the Text Analysis Conference (TAC). We developed EN systems for both subtasks of event nugget, i.e., EN Task 1: Event Nugget Detection and EN Task 2: Event Nugget Detection and Coreference. The systems are mainly based on our recent research on deep...
The task of Named Entity Linking is to link entity mentions in the document to their correct entries in a knowledge base and to cluster NIL mentions. Ambiguous, misspelled, and incomplete entity mention names are the main challenges in the linking process. We propose a novel approach that combines two state-of-the-art models -- for entity disambiguat...
The last decade has witnessed the success of the traditional feature-based method in exploiting discrete structures such as words or lexical patterns to extract relations from text. Recently, convolutional and recurrent neural networks have provided very effective mechanisms to capture the hidden structures within sentences via continuous repres...
The goal of paraphrase identification is to decide whether two given text fragments have the same meaning. Of particular interest in this area is the identification of paraphrases among short texts, such as SMS and Twitter. In this paper, we present idiomatic expressions as a new domain for short-text paraphrase identification. We propose a techni...
The task of Named Entity Disambiguation is to map entity mentions in the document to their correct entries in some knowledge base. We present a novel graph-based disambiguation approach based on Personalized PageRank (PPR) that combines local and global evidence for disambiguation and effectively filters out noise introduced by incorrect candidat...
New York University (NYU) participated in three tracks of the 2014 TAC-KBP evaluation: English Slot Filling, Cold Start and Entity Discovery and Linking. While this year is the first time we have participated in entity discovery and linking (EDL) and the second time in cold start, we have been working on the slot filling task for several years...
When an information extraction system is applied to a new task or domain, we must specify the classes of entities and relations to be extracted. This is best done by a subject matter expert, who may have little training in NLP. To meet this need, we have developed a toolset which is able to analyze a corpus and aid the user in building the specific...
Relation extraction suffers from a performance loss when a model is applied to out-of-domain data. This has fostered the development of domain adaptation techniques for relation extraction. This paper evaluates word embeddings and clustering on adapting feature-based relation extraction systems. We systematically explore various ways to apply word...
Distant supervision usually utilizes only unlabeled data and existing knowledge bases to learn relation extraction models. However, in some cases a small amount of human-labeled data is available. In this paper, we demonstrate how a state-of-the-art multi-instance multi-label model can be modified to make use of these reliable sentence-level labels...
Distant supervision has attracted recent interest for training information extraction systems because it does not require any human annotation but rather employs existing knowledge bases to heuristically label a training corpus. However, previous work has failed to address the problem of false negative training examples mislabeled due to the incompleteness of knowledge bases. To tackle this problem, we propose a simple yet nove...
Information Extraction (IE) and Summarization share the same goal of extracting and presenting the relevant information of a document. While IE was a primary element of early abstractive summarization systems, it's been left out in more recent extractive systems. However, extracting facts, recognizing entities and events should provide useful infor...
Relation extraction is the process of identifying instances of specified types of semantic relations in text; relation type extension involves extending a relation extraction system to recognize a new type of relation. We present LGCo-Testing, an active learning system for relation type extension based on local and global views of relation instance...
The Web brings an open-ended set of semantic relations. Discovering the significant types is very challenging. Unsupervised algorithms have been developed to extract relations from a corpus without knowing the relation types in advance, but most rely on tagging arguments of predefined types. One recently reported system is able to jointly extract r...
A precondition for extracting information from large text corpora is discovering the information structures underlying the text. Progress in this direction is being made in the form of unsupervised information extraction (IE). We describe recent work in unsupervised relation extraction and compare its goals to those of grammar discovery for science...
The well-studied supervised Relation Extraction algorithms require training data that is accurate and has good coverage. To obtain such a gold standard, the common practice is to do independent double annotation followed by adjudication. This takes significantly more human effort than annotation done by a single annotator. We do a detailed analysis...
We present initial investigation into the task of paraphrasing language while targeting a particular writing style. The plays of William Shakespeare and their modern translations are used as a testbed for evaluating paraphrase systems targeting a specific style of writing. We show that even with a relatively small amount of parallel training data,...
In this paper we give an overview of the Knowledge Base Population (KBP) track at the 2010 Text Analysis Conference. The main goal of KBP is to promote research in discovering facts about entities and augmenting a knowledge base (KB) with these facts. This is done through two tasks, Entity Linking -- linking names in context to entities in the KB -...
We present a simple semi-supervised relation extraction system with large-scale word clustering. We focus on systematically exploring the effectiveness of different cluster-based features. We also propose several statistical methods for selecting clusters at an appropriate level of granularity. When training on different sizes of data, our semi-sup...
Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstr...
We propose a novel way of incorporating dependency parse and word co-occurrence information into a state-of-the-art web-scale n-gram model for spelling correction. The syntactic and distributional information provides extra evidence in addition to that provided by a web-scale n-gram corpus and especially helps with data sparsity problems. Experimen...
Event extraction is a particularly challenging type of information extraction (IE) that may require inferences from the whole article. However, most current event extraction systems rely on local information at the phrase or sentence level, and do not consider the article as a whole, thus limiting extraction performance. Moreover, most annotated co...
We propose a general cross-domain bootstrapping algorithm for domain adaptation in the task of named entity recognition. We first generalize the lexical features of the source domain model with word clusters generated from a joint corpus. We then select target domain instances based on multiple criteria during the bootstrapping process. Without usi...
Much of the world's knowledge is recorded in natural language text, but making effective use of it in this form poses a major challenge. Information extraction converts this knowledge to a structured form suitable for computer manipulation, opening up many possibilities for using it. In this review, the author describes the processing pipeline of i...
The term "event extraction" covers a wide range of information extraction tasks, and methods developed and evaluated for one task may prove quite unsuitable for another. Understanding these task differences is essential to making broad progress in event extraction. We look back at the MUC and ACE tasks in terms of one characteristic, the breadth of...
We describe a utility evaluation to determine whether cross-document information extraction (IE) techniques measurably improve user performance in news summary writing. Two groups of subjects were asked to perform the same time-restricted summary writing tasks, reading news under different conditions: with no IE results at all, with traditi...
Event extraction is a particularly challenging type of information extraction (IE). Most current event extraction systems rely on local information at the phrase or sentence level. However, this local context may be insufficient to resolve ambiguities in identifying particular types of events; information from a wider scope can serve to resolve som...
Several researchers have proposed semi-supervised learning methods for adapting event extraction systems to new event types. This paper investigates two kinds of bootstrapping methods used for event extraction: the document-centric and similarity-centric approaches, and proposes a filtered ranking method that combines the advantages of the two. We...
We present a simple algorithm for clustering semantic patterns based on distributional similarity and use cluster memberships to guide semi-supervised pattern discovery. We apply this approach to the task of relation extraction. The evaluation results demonstrate that our novel bootstrapping procedure significantly outperforms a standard bootstrapp...
Entity extraction is the task of identifying names and nominal phrases ('mentions') in a text and linking coreferring mentions. We propose the use of a new source of data for improving entity extraction: the information gleaned from large bitexts and captured by a statistical, phrase-based machine translation system. We translate the individual men...
This paper describes NomBank, a project that will provide argument structure for instances of common nouns in the Penn Treebank II corpus. NomBank is part of a larger effort to add additional layers of annotation to the Penn Treebank II corpus. The University of Pennsylvania's PropBank, NomBank and other annotation projects taken together shoul...
To move beyond current keyword-based approaches to document retrieval, we need to provide the user with a range of technologies for obtaining information and answering questions. One of these is information extraction (IE), which has the potential, for a limited range of requests, to provide more focused and comprehensive responses. IE has been con...
Name translation is important well beyond the relative frequency of names in a text: a correctly translated passage, but with the wrong name, may lose most of its value. The Nightingale team has built a name translation component which operates in tandem with a conventional phrase-based statistical MT system, identifying names in the source...
We describe and analyze inference strategies for combining outputs from multiple question answering systems each of which was developed independently. Specifically, we address the DARPA-funded GALE information distillation Year 3 task of finding answers to the 5-Wh questions (who, what, when, where, and why) for each given sentence. The approach we...
In this paper, we propose an event-based approach for Chinese sentence compression without using any training corpus. We enhance the linguistically-motivated heuristics by exploiting event word significance and event information density. This is shown to improve the preservation of important information and the tolerance of POS and parsing errors,...
Cross-lingual spoken sentence retrieval (CLSSR) remains a challenge, especially for queries including OOV words such as person names. This paper proposes a simple method of fuzzy matching between query names and phones of candidate audio segments. This approach has the advantage of avoiding some word decoding errors in automatic speech recognition...
Cross-lingual tasks are especially difficult due to the compounding effect of errors in language processing and errors in machine translation (MT). In this paper, we present an error analysis of a new cross-lingual task: the 5W task, a sentence-level understanding task which seeks to return the English 5W's (Who, What, When, Where and Why) correspo...
For many NLP tasks, including named entity tagging, semi-supervised learning has been proposed as a reasonable alternative to methods that require annotating large amounts of training data. In this paper, we address the problem of analyzing new data given a semi-supervised NE tagger trained on data from an earlier time period. We will show that upd...
Named entities (NEs) are the expressions in human languages that explicitly link notations in languages to the entities in the real world. They play an important role in cross-language information retrieval (CLIR) because most users' requests have been found to have NEs, and the majority of out-of-vocabulary terms are NEs. Therefore, missing their transla...
Recent availability of commercial online machine translation (MT) systems makes it possible for layman Web users to utilize the MT capability for cross-language information retrieval (CLIR). To study the effectiveness of using MT for query translation, we conducted a set of experiments using Google Translate, an online MT system provided by Google,...
Progress in both speech and language processing has spurred efforts to support applications that rely on spoken rather than written language input. A key challenge in moving from text-based documents to such spoken documents is that spoken language lacks explicit punctuation and formatting, which can be crucial for good performance. This article de...
This paper studies the effect of automatic sentence boundary detection and comma prediction on entity and relation extraction in speech. We show that punctuating the machine generated transcript according to maximum F-measure of period and comma annotation results in suboptimal information extraction. Precisely, period and comma decision thresholds...
This paper focuses on the influence of changing the text time frame on the performance of a named entity tagger. We followed a twofold approach to investigate this subject: on the one hand, we analyzed a corpus that spans 8 years, and, on the other hand, we assessed the performance of a name tagger trained and tested on that corpus. We created 8 s...
We apply the hypothesis of "One Sense Per Discourse" (Yarowsky, 1995) to information extraction (IE), and extend the scope of "discourse" from one single document to a cluster of topically-related documents. We employ a similar approach to propagate consistent event arguments across sentences and documents. Combining global evidence from related...
This work looks at the impact of automatically predicted commas on part-of-speech (POS) and name tagging of speech recognition transcripts of Mandarin broadcast news. There is a significant gain in both POS and name tagging accuracy due to using automatically predicted commas over sentence boundary prediction alone. One difference between Mandarin...
This paper addresses the task of providing extended responses to questions regarding specialized topics. This task is an amalgam of information retrieval, topical summarization, and Information Extraction (IE). We present an approach which draws on methods from each of these areas, and compare the effectiveness of this approach with a query...
This talk will look at some current issues in natural language processing from the vantage point of information extraction (IE), and so give some perspective on what is needed to make IE more successful. By IE we mean the identification of important types of relations and events in unstructured text. IE provides a nice reference point because it is...
We present two semi-supervised learning techniques to improve a state-of-the-art multi-lingual name tagger. For English and Chinese, the overall system obtains a 1.7%-2.1% improvement in F-measure, representing a 13.5%-17.4% relative reduction in the spurious, missing, and incorrect tags. We also conclude that simply relying upon large corpora is...
Name tagging is a critical early stage in many natural language processing pipelines. In this paper we analyze the types of errors produced by a tagger, distinguishing name classification and various types of name identification errors. We present a joint inference model to improve Chinese name tagging by incorporating feedback from subsequ...
Integrating information from different stages of an NLP processing pipeline can yield significant error reduction. We demonstrate how re-ranking can improve name tagging in a Chinese information extraction system by incorporating information from relation extraction, event extraction, and coreference. We evaluate three state-of-the-art re-ranking...
We present a novel mechanism for improving reference resolution by using the output of a relation tagger to rescore coreference hypotheses. Experiments show that this new framework can improve performance on two quite different languages -- English and Chinese.
Information extraction systems incorporate multiple stages of linguistic analysis. Although errors are typically compounded from stage to stage, it is possible to reduce the errors in one stage by harnessing the results of the other stages. We demonstrate this by using the results of coreference analysis and relation extraction to reduce th...
Entity relation detection is a form of information extraction that finds predefined relations between pairs of entities in text. This paper describes a relation detection approach that combines clues from different levels of syntactic processing using kernel methods. Information from three different levels of processing is considered: tokeniz...
In this report we present the overall architecture for the NYU English ACE 2005 system. We focus on two components which were evaluated this year: reference resolution, where we experimented with features for relating anaphor and antecedent, and event recognition, where we sought to take advantage of a rich combination of logical grammatical struct...
Most traditional information extraction approaches are generative models that assume events exist in text in certain patterns and these patterns can be regenerated in various ways. These assumptions limited the syntactic clues being considered for finding an event and confined these approaches to a particular syntactic level. This paper presents a...
In this paper, we discuss the performance of crosslingual information extraction systems employing an automatic pattern acquisition module. This module, which creates extraction patterns starting from a user's narrative task description, allows rapid customization to new extraction tasks. We compare two approaches: (1) acquiring patterns in the sou...
Discovering the significant relations embedded in documents would be very useful not only for information retrieval but also for question answering and summarization. Prior methods for relation discovery, however, needed large annotated corpora which cost a great deal of time and effort. We propose an unsupervised method for relation discovery from...
This paper does not necessarily reflect the position or the policy of the U.S. Government
When complete, NomBank will provide annotation of noun arguments in Penn Treebank II (PTB). In PropBank, University of Pennsylvania annotators provide similar information for verbs. Given nominalization/verb mappings, the combination of NomBank and PropBank allows for generalization of arguments across parts of speech. This paper describes our an...
This paper presents new Information Extraction scenarios which are linguistically and structurally more challenging than the traditional MUC scenarios. Traditional views on event structure and template design are not adequate for the more complex scenarios.
We developed a cross-lingual, question-answering (CLQA) system for Hindi and English. It accepts questions in English, finds candidate answers in Hindi newspapers, and translates the answer candidates into English along with the context surrounding each answer. The system was developed as part of the surprise language exercise (SLE) within the TIDE...
We present an algorithm for unsupervised learning and semantic classification of names and terms. Given a small number of seed examples and an unlabeled training corpus, the algorithm learns patterns that identify more examples, in a bootstrapping cycle. Multiple classes are learned simultaneously, including negative classes that serve to provide...
This paper introduces GLARF, a framework for predicate argument structure.
Several approaches have been described for the automatic unsupervised acquisition of patterns for information extraction.
This chapter will describe our experience developing specifications and tools for building a Syntactically Annotated Corpus (SAC) for Spanish newspaper texts. The initial corpus consists of 1,500 sentences extracted from El País Digital and Compra Maestra, with a total of 22,695 words. The paper will address several of the relevant topics for any SA...
Transfer-based Machine Translation systems require a procedure for choosing the set of transfer rules for generating a target language translation from a given source language sentence. In an MT system with many competing transfer rules, choosing the best set of transfer rules for translation may involve the evaluation of an explosive number...
This report is based upon work supported by the Defense Advanced Research Projects Agency under Grant N00014-90-J-1851 from the Office of Naval Research and by the National Science Foundation under Grant IRI-89-02304
en by semantic expectations, ignoring intervening text not matching these expectations [1]; this is robust but can lead to serious errors. Another approach has been to identify "interesting" words and attempt only partial sentence parses around those words [2]. As an alternative, we have explored the use of full syntactic analysis of the input, co...
Rather than ignore such limitations, we should use them as a motivation for identifying manageable components of this domain-specific knowledge. Such considerations are especially important if we are aiming to construct portable systems -- systems which can be readily moved from one domain to another. What properties should such a component have? It sho...
A large corpus (about 100 MB of text) was selected and examples of 750 frequently occurring verbs were tagged with their complement class as defined by a large computational syntactic dictionary, COMLEX Syntax. This tagging task led to the refinement of already existing classes and to the addition of classes that had previously not been defined....
Several approaches have been described for the automatic unsupervised acquisition of patterns for information extraction. Each approach is based on a particular model for the patterns to be acquired, such as a predicate-argument structure or a dependency chain. The effect of these alternative models has not been previously studied. In this paper, w...
This paper describes how NOMLEX, a dictionary of nominalizations, can be used in Information Extraction (IE). This paper details a procedure which maps syntactic and semantic information designed for writing an IE pattern for an active clause (IBM appointed Alice Smith as vice president) into a set of patterns for nominalizations (e.g., IBM's...
Research in example-based machine translation (EBMT) has been hampered by the lack of efficient tree alignment algorithms for bilingual corpora. This paper describes an alignment algorithm for EBMT whose running time is quadratic in the size of the input parse trees. The algorithm uses dynamic programming to score all possible matching nodes betwee...
We have introduced information extraction techniques, such as named entity tagging and pattern discovery, into a summarization system based on sentence extraction, and evaluated its performance in the Document Understanding Conference 2001 (DUC-2001). We participated in the Single Document Summarization task in DUC-2001 and achieved one of the...
It is now generally accepted that a text corpus plays an important role in the production of hard-copy dictionaries. In this paper, we discuss the influence a corpus can have on the creation of lexical resources for computer use. In the creation of COMLEX Syntax and NOMLEX, two on-line lexicons produced by the authors at New York University, we u...
At the first conference on Language Resources and Evaluation, Granada 1998, Charles Fillmore, Nancy Ide, Daniel Jurafsky, and Catherine Macleod proposed creating an American National Corpus (ANC) that would compare with the British National Corpus (BNC) both in balance and in size (one hundred million words). This paper reports on the progress made...
We describe a system for creating and automatically updating a data base of information on infectious disease outbreaks. A web crawler is used to retrieve current news stories; potentially relevant stories are fed to an information extraction engine, whose output is used to update the data base. A web-based browser allows users to examine the data...
according to most parsing precision/recall measures. However, their level of detail is limited by the hand-annotated treebanks from which they derive their grammars. In contrast, some parsers based on hand-coded grammars rank lower on precision/recall measures while providing more detailed syntactic analyses, which greatly enhance NLP applications...
This paper reports on criteria for distinguishing complements from adjuncts in the development of COMLEX Syntax, a large on-line syntactic lexicon of English. Correct, or at least consistent, criteria are crucial for lexicography and natural language processing. Complement/adjunct criteria from linguistics and lexicography leave a gray area---opti...
While initial treebanks and treebank parsers primarily involved surface analysis, recent work focuses on predicate argument (PA) structure. PA structure provides means to regularize variants (e.g., actives/passives) of sentences so that individual patterns may have better coverage (in MT, QA, IE, etc.), offsetting the sparse data problem. We encode...
This paper discusses an efficient algorithm for aligning parse trees, and its application to the automatic acquisition of transfer rules for machine translation. Although the general problem of finding an optimal tree alignment is NP-complete, the problem becomes tractable if we consider only alignments that are restricted to preserve a dominance r...
This paper reports on a set of criteria for consistently distinguishing complements from adjuncts in the development of COMLEX Syntax, a large on-line syntactic lexicon of English. A correct, or at least consistent, set of criteria is crucial both for lexicography and for various natural language processing (NLP) applications, especially the accura...