Article

Semantic role labeling for open information extraction

Abstract

Open Information Extraction is a recent paradigm for machine reading from arbitrary text. In contrast to existing techniques, which have used only shallow syntactic features, we investigate the use of semantic features (semantic roles) for the task of Open IE. We compare TextRunner (Banko et al., 2007), a state-of-the-art open extractor, with our novel extractor SRL-IE, which is based on UIUC's SRL system (Punyakanok et al., 2008). We find that SRL-IE is robust to noisy heterogeneous Web data and outperforms TextRunner on extraction quality. On the other hand, TextRunner runs over two orders of magnitude faster and achieves good precision on extractions with high locality and high redundancy. These observations enable the construction of hybrid extractors that output higher-quality results than TextRunner and results of similar quality to SRL-IE in much less time.
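
The hybrid idea in the abstract lends itself to a simple budgeted policy: run a cheap shallow extractor everywhere and spend a fixed budget of expensive SRL-based calls on the sentences it handles worst. Below is a minimal, illustrative sketch in that spirit, not the authors' system; every function is a toy stand-in invented for illustration.

# Toy stand-in for a TextRunner-style shallow extractor: naively split a
# sentence into (arg1, relation, arg2) at the first verb-like token.
def fast_extract(sentence):
    words = sentence.split()
    for i, w in enumerate(words):
        if w.endswith(("s", "ed", "ing")) and 0 < i < len(words) - 1:
            return [(" ".join(words[:i]), w, " ".join(words[i + 1:]))]
    return []

# Toy stand-in for an SRL-IE-style extractor; a real system would invoke a
# full semantic role labeler here, at far higher cost.
def slow_srl_extract(sentence):
    return fast_extract(sentence)

def hybrid_extract(sentences, srl_budget):
    """Cheap pass everywhere; expensive pass on the hardest sentences."""
    triples = [t for s in sentences for t in fast_extract(s)]
    # Heuristic difficulty proxy: longer sentences tend to defeat shallow features.
    hardest = sorted(sentences, key=len, reverse=True)[:srl_budget]
    for s in hardest:
        triples.extend(slow_srl_extract(s))
    return list(dict.fromkeys(triples))  # dedupe overlapping extractions

print(hybrid_extract(["Edison invented the phonograph in 1877"], srl_budget=1))
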


... The meaning of the labels will be described in detail in Section 2.2. SRL has been used in many natural language processing (NLP) applications such as question answering [1], machine translation [2], document summarization [3] and information extraction [4]. Therefore, SRL is an important task in NLP. ...
... The tagset of the treebank has 38 syntactic labels (18 part-of-speech tags, 17 syntactic category tags, 3 empty categories) and 17 function tags. For details, please refer to [20]. The meanings of some common tags are listed in Table 1. ...
... First, using the annotation rule for Arg0, the phrase having the syntactic function SUB or ... (footnote: all the resources are available at the website of the VLSP project). ...
Preprint
In this paper, we study semantic role labelling (SRL), a subtask of semantic parsing of natural language sentences, and its application to the Vietnamese language. We present our effort in building Vietnamese PropBank, the first Vietnamese SRL corpus, and a software system for labelling semantic roles of Vietnamese texts. In particular, we present a novel constituent extraction algorithm in the argument candidate identification step which is more suitable and more accurate than the common node-mapping method. In the machine learning part, our system integrates distributed word features produced by two recent unsupervised learning models in two learned statistical classifiers and makes use of an integer linear programming inference procedure to improve the accuracy. The system is evaluated in a series of experiments and achieves a good result, an F1 score of 74.77%. Our system, including corpus and software, is available as an open source project for free research and we believe that it is a good baseline for the development of future Vietnamese SRL systems.
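
The integer linear programming inference mentioned in this abstract is typically used to enforce global constraints over per-argument classifier scores. A minimal sketch with PuLP, as our own illustration: the candidates, roles and scores below are made up, and the constraints shown (one role per candidate, core roles at most once) are only the standard ones from the SRL literature.

import pulp

candidates = ["c1", "c2", "c3"]
roles = ["Arg0", "Arg1", "ArgM_TMP", "NONE"]
score = {("c1", "Arg0"): 0.9, ("c1", "Arg1"): 0.3, ("c2", "Arg1"): 0.8,
         ("c2", "Arg0"): 0.6, ("c3", "ArgM_TMP"): 0.7}

prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)
x = {(c, r): pulp.LpVariable(f"x_{c}_{r}", cat="Binary")
     for c in candidates for r in roles}
# Maximize the total classifier score of the chosen role assignment.
prob += pulp.lpSum(score.get((c, r), 0.0) * x[c, r] for c in candidates for r in roles)
for c in candidates:                      # exactly one role per candidate
    prob += pulp.lpSum(x[c, r] for r in roles) == 1
for r in ("Arg0", "Arg1"):                # core roles occur at most once
    prob += pulp.lpSum(x[c, r] for c in candidates) <= 1
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({c: r for c in candidates for r in roles if x[c, r].value() == 1})
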
... There are many definitions of Legal Informatics to be found in the literature, some of which define Legal Informatics as "the utilization of the Information & Communication Technologies (ICTs) within the legal environment context" [2,10-12], while others pinpoint the result(s) of utilizing ICTs in the domain (e.g. decision making, problem solving) [13-15]. ...
... During the last three decades and according to these major areas, several online consultation services and knowledge systems have appeared in order to make services more open and promote access to legal resources. However, in the era of the World Wide Web, and particularly the Semantic Web, and the immense availability of data (and information), there is a new awareness of citizens' demands for greater transparency, and a belief that Open Data, particularly the reuse of data, has the potential for a great impact on the economy and society [10]. The same awareness applies to legal documents, in order to transform them into Open Legal Data. ...
... Although many of them are not really ontologies in the sense that they describe the universe of discourse of the world or domain the law is working on (e.g. taxes, crime, traffic, immigration) rather than the typical legal vocabulary [10]. ...
... This task lies somewhere in the middle between syntax and semantics: it is more semantic than tasks such as part-of-speech tagging or syntactic parsing, but less semantic than tasks such as information extraction or question answering. Previous work, e.g., (Shen & Lapata, 2007; Christensen et al., 2010), has shown that using the output of an SRL system improves performance for a variety of these higher-level tasks. ...
... The following is a list of applications of SRL to higher-level tasks: Christensen et al. (2010) build a system based on SRL for open-domain information extraction (Open IE), the task of extracting factual relationships from a text corpus without using a prespecified list of relations. Their SRL-based system obtains higher-quality extractions than a state-of-the-art Open IE system. ...
... The tuple includes a relational phrase and a pair or more of argument phrases, which are semantically connected by the relational phrase. For the collection of extraction patterns, some studies use hand-crafted rules,3,16,17 whereas others learn from automatically labeled training datasets.1,2,4 Additionally, a number of studies improved the accuracy of Open IE by transforming complex sentences containing several clauses into a collection of simplified independent clauses. ...
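
The SRL-to-Open-IE conversion described in the excerpts above can be pictured as reading an extraction tuple off a labelled frame: the predicate and its core arguments become the relation and arguments, and modifier roles become extra arguments. A small sketch under assumed data structures; the frame format and example are ours, not taken from the paper.

def frame_to_tuple(frame):
    """frame: dict mapping PropBank-style role labels to text spans."""
    arg0, arg1 = frame.get("A0"), frame.get("A1")
    if not (arg0 and arg1):
        return None  # SRL-based extractors typically need both core arguments
    extras = [v for k, v in frame.items() if k.startswith("AM-")]
    return (arg0, frame["V"], arg1, *extras)

frame = {"A0": "Edison", "V": "invented", "A1": "the phonograph", "AM-TMP": "in 1877"}
print(frame_to_tuple(frame))  # ('Edison', 'invented', 'the phonograph', 'in 1877')
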
Article
This paper presents a method to extract tuples from plain text by adding an inception network and dependency path embedding to existing neural network methods for Open Information Extraction (Open IE). Inception networks are used in computer vision, and dependency path embedding in text processing, but neither has been reported for Open IE. Performance was measured on benchmark datasets using two existing Open IE deep learning methods, one using bidirectional long short-term memory and BIO tagging (RnnOIE-verb), and another using a span-based model (SpanOIE). RnnOIE-verb was compared with RnnOIE-verb plus an inception network and/or dependency path embedding. SpanOIE was compared with SpanOIE plus an inception network. Performance slightly increased with the addition of the inception network to RnnOIE-verb (before: AUC 0.45, F1 0.59; after: AUC 0.46, F1 0.60) and of the inception network to SpanOIE (before: AUC 0.63, F1 0.748; after: AUC 0.64, F1 0.764). The performance gain was minor but potentially relevant to an iterative process of improvement.
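
The BIO tagging scheme that RnnOIE-style systems use casts Open IE as sequence labeling: each token is tagged as beginning or continuing an argument or predicate span, or as outside any span. The tags and decoder below are an illustrative reconstruction, not the paper's code.

tokens = ["Edison", "invented", "the", "phonograph", "in", "1877"]
tags   = ["B-A0",   "B-V",      "B-A1", "I-A1",      "B-A2", "I-A2"]

def decode_bio(tokens, tags):
    """Group contiguous B-X/I-X tags back into labelled spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((label, " ".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(tok)
    if current:
        spans.append((label, " ".join(current)))
    return spans

print(decode_bio(tokens, tags))
# [('A0', 'Edison'), ('V', 'invented'), ('A1', 'the phonograph'), ('A2', 'in 1877')]
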
... OntoILPER is currently based on the shallow syntactic parsing of sentences, which does not take into account semantic aspects relating entities to verbs. Accordingly, we intend to integrate further BK (semantic resources) into the OntoILPER preprocessing stage, such as synonyms, hypernyms/hyponyms, SRL [Christensen et al., 2010], and word sense disambiguation [Ciaramita and Altun, 2006], since these semantic resources have recently been proven to improve performance in many IE applications [Dou et al., 2015]. In this case, the distinctive feature of ILP would allow incremental BK to be put to the test. ...
Preprint
Full-text available
Relation Extraction (RE), the task of detecting and characterizing semantic relations between entities in text, has gained much importance in the last two decades, mainly in the biomedical domain. Many papers have been published on Relation Extraction using supervised machine learning techniques. Most of these techniques rely on statistical methods, such as feature-based and tree-kernel-based methods. Such statistical learning techniques are usually based on a propositional hypothesis space for representing examples, i.e., they employ an attribute-value representation of features. This kind of representation has some drawbacks, particularly in the extraction of complex relations which demand more contextual information about the instances involved, i.e., it is not able to effectively capture structural information from parse trees without loss of information. In this work, we present OntoILPER, a logic-based relational learning approach to Relation Extraction that uses Inductive Logic Programming for generating extraction models in the form of symbolic extraction rules. OntoILPER takes advantage of a rich relational representation of examples, which can alleviate the aforementioned drawbacks. The proposed relational approach seems to be more suitable for Relation Extraction than statistical ones for several reasons that we discuss. Moreover, OntoILPER uses a domain ontology that guides the background knowledge generation process and is used for storing the extracted relation instances. The induced extraction rules were evaluated on three protein-protein interaction datasets from the biomedical domain. The performance of OntoILPER extraction models was compared with other state-of-the-art RE systems. The encouraging results seem to demonstrate the effectiveness of the proposed solution.
... Despite OntoILPER's encouraging results, there is still room for improvement: (i) OntoILPER currently relies on shallow syntactic parsing, which does not take into account deeper semantic aspects of the sentences; (ii) the strategy of generating negative examples in OntoILPER can produce unbalanced distributions of positive and negative training examples, which may hamper performance, as pointed out in [37]. To address the aforementioned shortcomings, we plan to: (i) integrate further BK into the preprocessing step, such as synonyms and hypernyms/hyponyms from WordNet, semantic role labeling [11], and word sense disambiguation [12], since these semantic resources have been proven to improve performance in many IE applications [14], and (ii) investigate the impact of undersampling techniques, which would speed up the learning task by reducing the number of negative examples [37]. ...
Article
Full-text available
Named entity recognition (NER) and relation extraction (RE) are two important subtasks in information extraction (IE). Most of the current learning methods for NER and RE rely on supervised machine learning techniques, with more accurate results for NER than RE. This paper presents OntoILPER, a system for extracting entity and relation instances from unstructured texts using an ontology and inductive logic programming, a symbolic machine learning technique. OntoILPER uses the domain ontology and takes advantage of a more expressive relational hypothesis space for representing examples whose structure is relevant to IE. It induces extraction rules that subsume examples of entities and relation instances from a specific graph-based model of sentence representation. Furthermore, OntoILPER enables the exploitation of the domain ontology and further background knowledge in the form of relational features. To evaluate OntoILPER, several experiments over the TREC corpus for both NER and RE tasks were conducted, and the results demonstrate its effectiveness in both tasks. This paper also provides a comparative assessment of OntoILPER against other NER and RE systems, showing that OntoILPER is very competitive on NER and outperforms the selected systems on RE.
... Generally, the verb/noun and the semantically labeled arguments correspond to OIE propositions and, therefore, the two tasks are considered similar. Systems like SRL-IE (Christensen et al., 2010) explore whether these techniques can be used for OIE. However, while OIE aims to identify the relation/predicate between a pair of arguments, frame-based techniques aim to identify arguments and their roles with respect to a predicate. ...
... Unlike syntactic-level surface cases (i.e., dependency labels such as subject and object), semantic roles can be regarded as a deep case representation for predicates. Because of its ability to abstract the meaning of a sentence, SRL has been applied to many NLP applications, including information extraction (Christensen et al., 2010), question answering (Pizzato and Mollá, 2008) and machine translation (Liu and Gildea, 2010). Semantically annotated corpora, such as FrameNet (Fillmore et al., 2001) and PropBank (Kingsbury and Palmer, 2002), make this type of automatic semantic structure analysis feasible by using supervised machine learning methods. ...
... Semantic role labeling has become a key module for many language processing applications and its importance is growing in fields like question answering (Shen and Lapata, 2007), information extraction (Christensen et al., 2010), sentiment analysis (Johansson and Moschitti, 2011), and machine translation (Liu and Gildea, 2010; Wu et al., 2011). To build an unrestricted semantic role labeler, the first step is to develop a comprehensive proposition bank. ...
... In particular, it answers the question Who did What to Whom, When, Where, Why? A simple Vietnamese sentence, Nam giúp Huy học bài vào hôm qua (Nam helped Huy to do homework yesterday), is given in the accompanying figure. To assign semantic roles for the sentence above, we must analyse and label the propositions concerning the predicate giúp (helped) of the sentence. Figure 1.2 (Semantic roles for the example sentence) shows a result of the SRL for this example, where the meaning of the labels is described in detail in Chapter 4. SRL has been used in many natural language processing (NLP) applications such as question answering [20], machine translation [11], document summarization [1] and information extraction [7]. Therefore, SRL is an important task in NLP. ...
Thesis
Full-text available
Semantic role labelling (SRL) is a task in natural language processing which detects and classifies the semantic arguments associated with the predicates of a sentence. It is an important step towards understanding the meaning of a natural language. There exist SRL systems for well-studied languages like English, Chinese or Japanese, but there is no such system for the Vietnamese language. In this thesis, we present the first SRL system for Vietnamese with encouraging accuracy. We first demonstrate that a simple application of SRL techniques developed for English could not give a good accuracy for Vietnamese. We then introduce a new algorithm for extracting candidate syntactic constituents, which is much more accurate than the common node-mapping algorithm usually used in the identification step. Finally, in the classification step, in addition to the common linguistic features, we propose novel and useful features for use in SRL. Our SRL system achieves an F1 score of 73.53% on the Vietnamese PropBank corpus. This system, including software and corpus, is available as an open source project and we believe that it is a good baseline for the development of future Vietnamese SRL systems.
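
For concreteness, the example sentence quoted in the excerpt above might receive PropBank-style labels roughly as follows; the role assignments here are our own illustrative guess, not taken from the thesis.

# Illustrative labelling of the example sentence (our guess at the roles):
sentence = "Nam giúp Huy học bài vào hôm qua"   # "Nam helped Huy to do homework yesterday"
roles = {
    "Arg0": "Nam",             # the helper
    "V": "giúp",               # the predicate, "helped"
    "Arg1": "Huy",             # the person helped
    "Arg2": "học bài",         # the action helped with, "do homework"
    "ArgM-TMP": "vào hôm qua", # temporal modifier, "yesterday"
}
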
... Semantic Role Labeling (SRL) is a kind of shallow semantic parsing task and its goal is to recognize some related phrases and assign a joint structure (WHO did WHAT to WHOM, WHEN, WHERE, WHY, HOW) to each predicate of a sentence [5]. Because of its ability to encode semantic information, SRL has been applied in many tasks of NLP, such as question answering [13], information extraction [3, 15] and machine translation [9, 21, 23, 28]. Since the release of FrameNet [1] and PropBank [7, 24], there has been a large amount of work on SRL [5, 10, 11, 18, 20, 22, 25, 26, 29]. ...
Article
Full-text available
The predicate and its semantic roles compose a unified entity that conveys the semantics of a given sentence. A standard pipeline of current approaches to semantic role labeling (SRL) is that, for a given predicate in a sentence, we extract features for each candidate argument and then perform role classification through a classifier. However, this process totally ignores the integrality of the predicate and its semantic roles. To address this problem, we present a global generative model in which a novel concept called Predicate-Arguments-Coalition (PAC) is proposed to encode the relations among individual arguments. Owing to PAC, our model can effectively mine the inherent properties of predicates and obtain a globally consistent solution for SRL. We conduct experiments on a standard benchmark: the Chinese PropBank. Experimental results on a single syntactic tree show that our model outperforms the state-of-the-art methods.
... [word list omitted: lexical units from a FrameNet table, 'accurate' through 'woolly'] The semantic information in FrameNet (FN) is broadly useful for problems such as entailment (Ellsworth and Janin, 2007; Aharon et al., 2010) and knowledge base population (Mohit and Narayanan, 2003; Christensen et al., 2010; Gregory et al., 2011), and is of general enough interest to language understanding that substantial effort has focused on building parsers to map natural language onto FrameNet frames (Gildea and Jurafsky, 2002; Das and Smith, 2012). In practice, however, FrameNet's usefulness is limited by its size. ...
Article
Full-text available
We increase the lexical coverage of FrameNet through automatic paraphrasing. We use crowdsourcing to manually filter out bad paraphrases in order to ensure a high-precision resource. Our expanded FrameNet contains an additional 22K lexical units, a 3-fold increase over the current FrameNet, and achieves 40% better coverage when evaluated in a practical setting on New York Times data.
... Semantic role labeling (SRL) is the task of identifying semantic arguments of predicates in text. It is an important step in text analysis and has applications in information extraction (Christensen et al., 2010), question answering (Shen and Lapata, 2007; Moreda et al., 2011) and machine translation (Wu and Fung, 2009; Xiong et al., 2012). A large body of work exists on algorithms for SRL (Gildea and Jurafsky, 2002; Srikumar and Roth, 2011). ...
Article
Full-text available
Tenders are powerful means of investment of public funds and represent a strategic development resource. Despite the efforts made so far by governments at national and international levels to digitalise documents related to the Public Administration sector, most of the information is still available in an unstructured format only. With the aim of bridging this gap, we present OIE4PA, our latest study on extracting and classifying relations from tenders of the Public Administration. Our work focuses on the Italian language, where the availability of linguistic resources to perform Natural Language Processing tasks is considerably limited. Nevertheless, OIE4PA adopts a multilingual approach, so it can be applied to several languages by providing appropriate training data. Rather than purely training a classifier on a portion of the extracted relations, the backbone idea of our learning strategy is to put a supervised method based on self-training to the test and to assess whether or not it improves the performance of the classifier. For evaluation purposes, we built a dataset composed of 2,000 triples which have been manually annotated by two human experts. The in-vitro evaluation shows that OIE4PA achieves a Macro-F1 equal to 0.89 and a 91% accuracy. In addition, OIE4PA was used as the pillar of a prototype search engine, which was evaluated through an in-vivo experiment with positive feedback from 32 final users, obtaining a SUS score equal to 83.98.
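
Self-training of the kind this abstract describes follows a standard loop: train on labelled relations, pseudo-label the most confidently classified unlabelled ones, and retrain. A generic sketch with scikit-learn; the features, threshold and toy data are our own assumptions, and OIE4PA's actual classifier may differ.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def self_train(labelled, unlabelled, rounds=3, threshold=0.9):
    texts, labels = zip(*labelled)
    texts, labels = list(texts), list(labels)
    for _ in range(rounds):
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(texts, labels)
        still_unlabelled = []
        for t in unlabelled:
            probs = clf.predict_proba([t])[0]
            if probs.max() >= threshold:   # confident: pseudo-label and keep it
                texts.append(t)
                labels.append(clf.classes_[probs.argmax()])
            else:
                still_unlabelled.append(t)
        unlabelled = still_unlabelled
    return clf

labelled = [("company X acquires company Y", "acquisition"),
            ("person P works for company X", "employment")]
unlabelled = ["company Z acquires company W", "person Q works for company Z"]
model = self_train(labelled, unlabelled)
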
Article
Question generation aims to generate meaningful and fluent questions, which can address the lack of question-answer annotated corpora by augmenting the available data. Using unannotated text with optional answers as input, question generation can be divided into two types based on whether answers are provided: answer-aware and answer-agnostic. While generating questions with answers provided is challenging, generating high-quality questions without provided answers is even more difficult, for both humans and machines. In order to address this issue, we propose a novel end-to-end model called QGAE, which is able to transform answer-agnostic question generation into answer-aware question generation by directly extracting candidate answers. This approach effectively utilizes unlabeled data for generating high-quality question-answer pairs, and its end-to-end design makes it more convenient than a multi-stage method that requires at least two pre-trained models. Moreover, our model achieves better average scores and greater diversity. Our experiments show that QGAE achieves significant improvements in generating question-answer pairs, making it a promising approach for question generation.
Article
The rapid proliferation of text data has led to an increase in the use of Information Extraction (IE) techniques to automatically extract key information in a fast and effective manner. Relation Extraction (RE), a sub-task of IE, focuses on extracting semantic relations from free natural language text and is crucial for further applications including Question Answering, Information Retrieval, Knowledge Base construction, Text Summarization, etc. The literature shows that supervised learning approaches have been widely used in RE. However, the performance of supervised methodologies depends on the availability of domain-specific annotated datasets, which is not viable for many domains including legal, financial, insurance, etc. In recent times, Open Information Extraction (OIE) techniques address this issue by facilitating domain-independent extraction of relations from large text corpora with no demand for domain-specific tagged data and predefined relation classes. Even though OIE systems are fast and simple to implement, they are less effective in handling complex sentences, and often produce redundant extractions. This paper proposes an efficient RE system to extract domain-specific relations from natural language text, consisting of knowledge-based and semi-supervised learning systems integrated with a domain ontology. We evaluated the performance of the proposed work on the judicial domain as a use case and found that it overcomes the flaws and limitations of existing RE approaches, achieving better results in terms of precision and recall. On further analysis, we found that the proposed system outperforms existing cutting-edge OIE systems on varying sentence length and complexity.
Article
Full-text available
As one of the most important research topics in the field of natural language processing, open information extraction has achieved gratifying research findings in recent years. Even though much effort has been put into open information extraction, existing systems still have many shortcomings and great room for improvement. The traditional open information extraction task relies heavily on artificially defined extraction paradigms, which produce error accumulation and propagation. End-to-end models rely on large amounts of training data and are hard to re-train as the model grows. To cope with the difficulty of updating the parameters of large neural network models, in this paper we propose a solution based on the meta-learning framework: we design a neural network-based converter module, which effectively combines the learned model parameters with the new model parameters, and then update the parameters of the original open information extraction model using the parameters calculated by the converter. This can not only avoid the error-propagation problem of traditional models but also effectively deal with the iterative updating of open information extraction models. We employ a large, public Open IE benchmark to demonstrate the performance of our approach. The experimental results show that our model can achieve better performance than existing baselines, and compared with re-training, our strategy not only greatly shortens the update time of the model but also loses none of the performance of a model completely re-trained with all the training data.
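
Our loose reading of the converter idea, sketched in PyTorch: a small network maps the concatenation of old and newly learned parameter vectors to updated parameters, so the base Open IE model need not be re-trained from scratch. The shapes and architecture below are invented for illustration and are not the paper's design.

import torch
import torch.nn as nn

class Converter(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Takes the concatenation of old and new parameter vectors.
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, old_params, new_params):
        return self.net(torch.cat([old_params, new_params], dim=-1))

dim = 8
converter = Converter(dim)
old_p, new_p = torch.randn(dim), torch.randn(dim)
updated = converter(old_p, new_p)   # parameters to load back into the base model
print(updated.shape)                # torch.Size([8])
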
Article
Open information extraction (Open IE) is a core task of natural language processing (NLP). Even though many efforts have been made in this area, there are still many problems that need to be tackled. Conventional Open IE approaches use a set of handcrafted patterns to extract relational tuples from the corpus. Moreover, many NLP tools are employed in their procedure; therefore, they face error propagation. To address these problems, and inspired by the recent success of Generative Adversarial Networks (GANs), we employ an adversarial training architecture and name it Adversarial-OIE. In Adversarial-OIE, the training of the Open IE model is assisted by a discriminator, which is a Convolutional Neural Network (CNN) model. The goal of the discriminator is to differentiate the extraction result generated by the Open IE model from the training data. The goal of the Open IE model is to produce high-quality triples to cheat the discriminator. A policy gradient method is leveraged to co-train the Open IE model and the discriminator. In particular, due to insufficient training, the discriminator usually leads to instability of GAN training. We use the distant supervision method to generate training data for the Adversarial-OIE model to solve this problem. To demonstrate our approach, an empirical study on two large benchmark datasets shows that our approach significantly outperforms many existing baselines.
Article
Full-text available
Open information extraction (Open IE), as one of the essential applications in the area of Natural Language Processing (NLP), has gained great attention in recent years. As a critical technology for building Knowledge Bases (KBs), it converts unstructured natural language sentences into structured representations, usually expressed in the form of triples. Most conventional open information extraction approaches leverage a series of manually pre-defined extraction patterns or learn patterns from labeled training examples, which requires a large amount of human effort. Additionally, many Natural Language Processing tools are involved, which leads to error accumulation and propagation. With the rapid development of neural networks, neural-based models can minimize the error propagation problem, but they also face the data-hunger problem of supervised learning. In particular, they leverage existing Open IE tools to generate training data, which causes data quality issues. In this paper, we employ a distant supervision learning approach to improve the Open IE task. We conduct extensive experiments by employing two popular sequence-to-sequence models (RNN and Transformer) and a large benchmark dataset to demonstrate the performance of our approach.
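
Distant supervision, as used here, pairs a knowledge base with raw text: any sentence that mentions both arguments of a known triple becomes a (noisy) training example. A minimal sketch with toy data of our own, not the paper's pipeline:

kb = [("Paris", "capital_of", "France")]
corpus = ["Paris is the capital of France.", "Paris hosted the games."]

def distant_examples(kb, corpus):
    # A sentence containing both arguments is assumed to express the relation.
    for arg1, rel, arg2 in kb:
        for sent in corpus:
            if arg1 in sent and arg2 in sent:
                yield sent, (arg1, rel, arg2)

print(list(distant_examples(kb, corpus)))
# [('Paris is the capital of France.', ('Paris', 'capital_of', 'France'))]
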
Article
Research on Open Information Extraction (Open IE) has made great progress in recent years; it is the task of detecting a group of structured, machine-readable statements, usually represented as triples or n-ary relation statements. Open IE is among the core areas of Natural Language Processing (NLP), and these extractions decompose grammatically complex sentences in a corpus into the relationships they represent, which can be leveraged for various downstream tasks. Even though a lot of work has been done in this direction, there are still many issues with the existing strategies. Most of the previous Open IE systems employ a group of artificially constructed patterns to detect and extract relational tuples from a sentence in a corpus, and these patterns are either automatically learned from annotated training examples or hand-crafted. Such an approach faces some issues: the first is that it requires a lot of manpower; secondly, many NLP tools are used, so error accumulation in the procedure can negatively impact the results. In this paper, we propose an Open IE approach based on the Transformer architecture. To verify our approach, we conduct a study using a large, public benchmark dataset, and the experimental results show that our model achieves better performance than many existing baselines.
Article
There is a large amount of heterogeneous data distributed across various sources in the upstream of PetroChina. These data can be valuable assets if we can fully use them. Meanwhile, the knowledge graph, as a newly emerging technique, provides a way to integrate multi-source heterogeneous data. In this paper, we present one application of the knowledge graph in the upstream of PetroChina. Specifically, we first construct a knowledge graph from both structured and unstructured data with multiple NLP (natural language processing) methods. Then, we introduce two typical knowledge graph powered applications and show the benefit that the knowledge graph brings to them: compared with the traditional machine learning approach, the well log interpretation method powered by the knowledge graph shows more than a 7.69% improvement in accuracy.
Article
In this paper, we study domain adaptation for semantic role classification. Most systems use supervised methods for semantic role classification, but these methods often suffer severe performance drops on out-of-domain test data. The reason for the performance drops is that there are large feature differences between the source and target domains. This paper proposes a framework called Adversarial Domain Adaption Network (ADAN) to address domain adaptation for semantic role classification. The idea behind our method is that the proposed framework can derive domain-invariant features via adversarial learning and narrow down the gap between the source and target feature spaces. To evaluate our method, we conduct experiments on the English portion of the CoNLL 2009 shared task. Experimental results show that our method can largely reduce the performance drop on out-of-domain test data.
Chapter
Semantic Role Labeling is the task of automatically detecting the semantic role played by words or phrases in a sentence. There is a small number of studies dedicated to Semantic Role Labeling in the Portuguese language, and the performance obtained is far from that of the English language. In this article, we propose an end-to-end semantic role labeler for the Portuguese language, which relies on a deep bidirectional long short-term memory neural network architecture. The predictions are used as inputs to an inference stage that employs a global recursive neural parsing algorithm tailored for the task. We also provide a detailed analysis of the effects of word embedding dimensionality and network depth on the overall performance of the proposed model. The proposed approach outperforms the state-of-the-art approach on the PropBank-Br corpus, while reducing the relative error by approximately 8.74%.
Article
Full-text available
When knowledge develops fast, as is so often the case nowadays, one of the main difficulties in initiating new research in any field is to identify the domain's specific state of the art and trends. In this context, to evaluate the potential of a research niche by assisting the literature review process and to add a new, modern, large-scale and automated dimension to it, the paper proposes a methodology that uses "Latent Semantic Analysis" (LSA) for identifying trends, focused within the knowledge space created at the intersection of three sustainability-related methodologies/concepts: "virtual Quality Management" (vQM), "Industry 4.0", and "Product Life-Cycle" (PLC). The LSA was applied to a significant number of scientific papers published around these concepts to generate ontology charts that describe the knowledge structure of each by the frequency, position, and causal relation of associated notions. These notions are combined to define the common high-density knowledge zone from which new technological solutions are expected to emerge throughout the PLC. The authors propose the concept of the knowledge space, which is characterized through specific descriptors with their own evaluation scales, obtained by processing the emerging information as identified by a combination of classic and innovative techniques. The results are validated through an investigation that surveys a relevant number of general managers, specialists, and consultants in the field of quality in the automotive sector from Romania. This practical demonstration follows each step of the theoretical approach and yields results that prove the capability of the method to contribute to the understanding and elucidation of the scientific area to which it is applied. Once validated, the method could be transferred to fields with similar characteristics.
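
The LSA step this methodology builds on is a low-rank factorization of a term-document matrix. A compact sketch with scikit-learn; the corpus and dimensions are toy placeholders, not the paper's data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["quality management in industry 4.0",
        "virtual quality management tools",
        "product life-cycle assessment"]
X = TfidfVectorizer().fit_transform(docs)          # term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(lsa.shape)  # (3, 2): each document as a point in the latent semantic space
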
Conference Paper
Full-text available
Recent knowledge extraction methods are moving towards ternary and higher-arity relations to capture more information about binary facts. An example is to include the time, the location, and the duration of a specific fact. These relations can be even more complex to extract in advanced domains such as news, where events typically come with different facets including reasons, consequences, purposes, involved parties, and related events. The main challenge consists in first finding the set of facets related to each fact, and second tagging those facets with the relevant category. In this paper, we tackle the above problems by proposing StuffIE, a fine-grained information extraction approach which is facet-centric. We exploit Stanford dependency parsing enhanced by lexical databases such as WordNet to extract nested triple relations. Then, we exploit the syntactic dependencies to semantically tag facets using distant learning based on the Oxford dictionary. We have tested the accuracy of the extracted facets and their semantic tags using the DUC'04 dataset. The results show the high accuracy and coverage of our approach with respect to ClausIE, OLLIE, SEMAFOR SRL and Illinois SRL.
Article
Full-text available
A central challenge in enhancing the use of natural language text is that most sentences rarely show useful information clearly, whether for relation extraction, event extraction, or other tasks. However, information from multiple sentences or even multiple documents, bound together, may produce better results. We introduce action extraction based on Open IE results to extract actions as meta-results. It not only presents what an entity did and will do, but also provides an important foundation for producing conclusive results by using statistics and deduction. By using action extraction, we generate a vast number of mention pairs that we call weak relations, meaning that the pair of mentions occurs in the same action. This paper focuses on constructing a knowledge base not only with relations, but also with actions, so that we can do more work on it in the future.
Article
Full-text available
This paper presents a method for improving semantic role labeling (SRL) using a large amount of automatically acquired knowledge. We acquire two varieties of knowledge, which we call surface case frames and deep case frames. Although the surface case frames are compiled from syntactic parses and can be used as rich syntactic knowledge, they have limited capability for resolving semantic ambiguity. To compensate for the deficiency of the surface case frames, we compile deep case frames from automatic semantic roles. We also consider quality management for both types of knowledge in order to get rid of the noise introduced by the automatic analyses. The experimental results show that Chinese SRL can be improved using automatically acquired knowledge and that the quality management has a positive effect on this task.
Article
Full-text available
Named entity relation extraction is an important subject in the field of information extraction. Although many English extractors have achieved reasonable performance, an effective system for Chinese relation extraction remains undeveloped due to the lack of Chinese annotated corpora and the specificity of Chinese linguistics. Here, we summarize three kinds of unique but common phenomena in Chinese linguistics. In this article, we investigate unsupervised linguistics-based Chinese open relation extraction (ORE), which can automatically discover arbitrary relations without any manually labeled datasets, and research the establishment of a large-scale corpus. By mapping entity relations into dependency trees and considering the unique Chinese linguistic characteristics, we propose a novel unsupervised Chinese ORE model based on Dependency Semantic Normal Forms (DSNFs). This model imposes no restrictions on the relative positions among entities and relationships and achieves a high yield by extracting relations mediated by verbs or nouns and processing parallel clauses. Empirical results from our model demonstrate the effectiveness of this method, which obtains stable performance on four heterogeneous datasets and achieves better precision and recall in comparison with several Chinese ORE systems. Furthermore, a large-scale knowledge base of entities and relations, called COER, is established and published by applying our method to web text, which overcomes the lack of Chinese corpora.
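
The dependency-based extraction idea behind DSNF-style systems can be pictured, in much simplified form, as reading verb-mediated triples off dependency edges. A sketch with spaCy on English, using Universal Dependencies labels rather than the paper's formalism; it assumes the en_core_web_sm model is installed and is our own simplification.

import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, for illustration only

def verb_triples(text):
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "VERB":
            subs = [c for c in token.children if c.dep_ == "nsubj"]
            objs = [c for c in token.children if c.dep_ in ("dobj", "obj")]
            for s in subs:
                for o in objs:
                    yield (s.text, token.lemma_, o.text)

print(list(verb_triples("Edison invented the phonograph.")))
# [('Edison', 'invent', 'phonograph')]
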
Article
The problem of cross-document person profiling aims at identifying and linking person entities across Web pages and extracting their relevant structured information. In this paper, we specifically focus on the core task of the person profiling problem, namely the attribute extraction task. For attribute extraction, the existing approaches face several challenges, two important ones being (i) syntactic and structural variation, and (ii) cross-sentence and cross-document information extraction. To alleviate these deficiencies and improve the performance of existing methods, we propose a semantic attribute extraction approach relying on probabilistic reasoning. Our approach produces structured, meaningful profiles in which the resulting textual facts are linked to their possible actual meaning in a distant ontology. We evaluate our approach on standard profile extraction datasets. Experimental results demonstrate that our approach achieves better results when compared with several baselines and state-of-the-art counterparts. The results justify that our approach is a promising solution to the problem of person profiling.
Conference Paper
Full-text available
The vast amount of text published daily over the internet poses an opportunity to build unsupervised text mining models with performance better than or comparable to existing models. In this paper, we investigate the problem of relation extraction and generation from text using an unsupervised model learned from news published online. We propose a clustering-based method to build a dataset of relation examples. News articles are clustered, and once a cluster of sentences for each event in each piece of news is formed, relations between important entities in each event cluster are extracted and considered as examples of relations. The relation examples are used to build extraction templates in order to extract and generate readable relation summaries from new instances of news. The proposed unsupervised relation extraction and generation method is evaluated against multiple methods for relation extraction over different datasets, where it has shown comparable performance.
Article
Full-text available
In this work, we report large-scale semantic role annotation of arguments in the Turkish dependency treebank, and present the first comprehensive Turkish semantic role labeling (SRL) resource: the Turkish Proposition Bank (PropBank). We present our annotation workflow that harnesses crowd intelligence, and discuss the procedures for ensuring annotation consistency and quality control. Our discussion focuses on syntactic variations in the realization of predicate-argument structures, and the large-lexicon problem caused by complex derivational morphology. We describe our approach that exploits framesets of root verbs to abstract away from syntax and increase the self-consistency of the Turkish PropBank. The issues that arise in the annotation of verbs derived via valency-changing morphemes, verbal nominals, and nominal verbs are explored, and evaluation results for inter-annotator agreement are provided. Furthermore, the semantic layer described here is aligned with a universal dependency (UD) compliant treebank and released to enable more researchers to work on the problem. Finally, we use the PropBank to establish a baseline score of 79.10 F1 for Turkish SRL using the mate-tool (an open-source SRL tool based on supervised machine learning) enhanced with basic morphological features. The Turkish PropBank and the extended SRL system are made publicly available.
Article
Discovering the interactions between the persons mentioned in a set of topic documents can help readers construct the background of the topic and facilitate document comprehension. To discover person interactions, we need a detection method that can identify text segments containing information about the interactions. Information extraction algorithms then analyze the segments to extract interaction tuples and construct a network of person interaction. In this paper, we define interaction detection as a classification problem. The proposed interaction detection method, called FISER, exploits 19 features covering syntactic, context-dependent, and semantic information in text to detect intra-clausal and inter-clausal interactive segments in topic documents. Empirical evaluations demonstrate that FISER outperformed many well-known relation extraction and protein-protein interaction detection methods on identifying interactive segments in topic documents. In addition, the precision, recall and F1-score of the best feature combination are 72.9%, 55.8%, and 63.2% respectively.
Article
Open Information Extraction (OIE) systems focus on identifying and extracting general relations from text. Most OIE systems utilize simple linguistic structure, such as part-of-speech or dependency features, to extract relations and arguments from a sentence. These approaches are simple and fast to implement, but suffer from two main drawbacks: i) they are less effective at handling complex sentences with multiple relations and shared arguments, and ii) they tend to extract overly specific relations. This paper proposes an approach to Information Extraction called SemIE, which addresses both drawbacks. SemIE identifies significant relations from domain-specific text by utilizing a semantic structure that describes the domain of discourse. SemIE exploits the predicate-argument structure of a text, which allows it to handle complex sentences. The semantics of the arguments are explicitly specified by mapping them to relevant concepts in the semantic structure. SemIE uses a semi-supervised learning approach to bootstrap training examples that cover all relations expressed in the semantic structure. SemIE takes pairs of structured documents as input and uses a Greedy Mapping module to bootstrap a full set of training examples. The training examples are then used to learn the extraction and mapping rules. We evaluated the performance of SemIE by comparing it with OLLIE, a state-of-the-art OIE system. We tested SemIE and OLLIE on the task of extracting relations from text in the "movie" domain and found that, on average, SemIE outperforms OLLIE. Furthermore, we also examined how the performance varies with sentence complexity and sentence length. The results prove the effectiveness of SemIE in handling complex sentences.
Conference Paper
According to the efficient market hypothesis, financial prices are unpredictable. However, meaningful advances have been achieved in anticipating market movements using machine learning techniques. In this work, we propose a novel method to represent the input for a stock price forecaster. The forecaster is able to predict stock prices from time series and additional information from web pages. Such information is extracted as structured events and represented in a compressed concept space. By using such a representation with scalable forecasters, we reduced prediction error by about 10% compared to traditional autoregressive models.
Article
This article makes an effort to improve Semantic Role Labeling (SRL) through learning generalized features. The SRL task is usually treated as a supervised problem; therefore, a huge set of features is crucial to the performance of SRL systems. But these features often lack generalization power when predicting an unseen argument. This article proposes a simple approach to relieve the issue. A strong intuition is that arguments occurring in similar syntactic positions are likely to bear the same semantic role, and, analogously, arguments that are lexically similar are likely to represent the same semantic role. Therefore, it will be informative to SRL if syntactically or lexically similar arguments can activate the same feature. Inspired by this, we embed the information of lexicalization and syntax into a feature vector for each argument and then use K-means to cluster all feature vectors of the training set. An unseen argument to be predicted will belong to the same cluster as the similar arguments of the training set; the clusters can therefore be thought of as a kind of generalized feature. We evaluate our method on several benchmarks. The experimental results show that our approach can significantly improve SRL performance.
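
The clustering idea above reduces to: embed each argument, cluster the training-set vectors, and use the cluster id of an unseen argument's vector as a generalized feature. A minimal sketch with scikit-learn; random vectors stand in for the lexical-plus-syntactic embeddings the article describes, and the dimensions are arbitrary.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_vecs = rng.normal(size=(1000, 50))      # one vector per training argument
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(train_vecs)

unseen_arg_vec = rng.normal(size=(1, 50))     # embedding of an unseen argument
cluster_feature = kmeans.predict(unseen_arg_vec)[0]
print(f"generalized feature: cluster_{cluster_feature}")
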
Article
Knowledge graphs have gained increasing popularity in the past couple of years, thanks to their adoption in everyday search engines. Typically, they consist of fairly static and encyclopedic facts about persons and organizations (e.g. a celebrity's birth date, occupation and family members) obtained from large repositories such as Freebase or Wikipedia. In this paper, we present a method and tools to automatically build knowledge graphs from news articles. As news articles describe changes in the world through the events they report, we present an approach to create Event-Centric Knowledge Graphs (ECKGs) using state-of-the-art natural language processing and semantic web techniques. Such ECKGs capture long-term developments and histories on hundreds of thousands of entities and are complementary to the static encyclopedic information in traditional knowledge graphs. We describe our event-centric representation schema, the challenges in extracting event information from news, our open source pipeline, and the knowledge graphs we have extracted from four different news corpora: general news (Wikinews), the FIFA World Cup, the global automotive industry, and Airbus A380 airplanes. Furthermore, we present an assessment of the accuracy of the pipeline in extracting the triples of the knowledge graphs. Moreover, through an event-centered browser and visualization tool, we show how approaching information from news in an event-centric manner can increase the user's understanding of the domain, facilitate the reconstruction of news story lines, and enable exploratory investigation of facts hidden in the news.
Article
The objective of this study is to evaluate the performance of five entity extraction methods for the task of identifying entities from scientific publications, including two vocabulary-based methods (a keyword-based and a Wikipedia-based) and three model-based methods (conditional random fields (CRF), CRF with keyword-based dictionary, and CRF with Wikipedia-based dictionary). These methods are applied to an annotated test set of publications in computer science. Precision, recall, accuracy, area under the ROC curve, and area under the precision-recall curve are employed as the evaluative indicators. Results show that the model-based methods outperform the vocabulary-based ones, among which CRF with keyword-based dictionary has the best performance. Between the two vocabulary-based methods, the keyword-based one has a higher recall and the Wikipedia-based one has a higher precision. The findings of this study help inform the understanding of informetric research at a more granular level.
Conference Paper
Document representation is a fundamental problem for text mining. Many efforts have been made to generate concise yet semantic representations, such as bag-of-words, phrase, sentence and topic-level descriptions. Nevertheless, most existing techniques encounter difficulties in handling a monolingual comparable corpus, which is a collection of monolingual documents conveying the same topic. In this paper, we propose the use of the frame, a high-level semantic unit, and construct frame-based representations to semantically describe documents by bags of frames, using an information network approach. One major challenge in this representation is that semantically similar frames may take different forms. For example, "radiation leaked" in one news article can appear as "the level of radiation increased" in another article. To tackle this problem, a text-based information network is constructed among frames and words, and a link-based similarity measure called SynRank is proposed to calculate similarity between frames. As a result, different variations of semantically similar frames are merged into a single descriptive frame using clustering, and a document can then be represented as a bag of representative frames. It turns out that frame-based document representation is not only more interpretable, but can also effectively facilitate other text analysis tasks such as event tracking. We conduct both qualitative and quantitative experiments on three comparable news corpora to study the effectiveness of frame-based document representation and the similarity measure SynRank, and demonstrate the superior performance of frame-based document representation in different real-world applications.
Conference Paper
Full-text available
The pervasive ambiguity of language allows sentences that differ in just one lexical item to have rather different inference patterns. This would be no problem if the different lexical items fell into clearly definable and easy-to-represent classes. But this is not the case. To draw the correct inferences we need to look at how the referents of the lexical items in the sentence (or broader context) interact in the described situation. Given that the knowledge our systems have of the represented situation will typically be incomplete, the classifications we come up with can only be probabilistic. We illustrate this problem with an investigation of various inference patterns associated with predications of the form 'Verb from X to Y', especially 'go from X to Y'. We characterize the various readings and make an initial proposal about how to create the lexical classes that will allow us to draw the correct inferences in the different cases.
Article
Full-text available
This paper examines the Stanford typed dependencies representation, which was designed to provide a straightforward description of grammatical relations for any user who could benefit from automatic text understanding. For such purposes, we argue that dependency schemes must follow a simple design and provide semantically contentful information, as well as offer an automatic procedure to extract the relations. We consider the underlying design principles of the Stanford scheme from this perspective, and compare it to the GR and PARC representations. Finally, we address the question of the suitability of the Stanford scheme for parser evaluation.
Article
Full-text available
This note describes a logical system based on concepts and contexts, the system TIL for textual inference logic. The system TIL is the beginnings of a logic that is a kind of "contexted" description logic, designed to support local linguistic inferences. This logic pays attention to the intensionality of linguistic constructs and to the need for tractability of inference in knowledge representation formalisms.
Conference Paper
Full-text available
We consider semi-supervised learning of information extraction methods, especially for extracting instances of noun categories (e.g., 'athlete,' 'team') and relations (e.g., 'playsForTeam(athlete,team)'). Semi-supervised approaches using a small number of labeled examples together with many unlabeled examples are often unreliable as they frequently produce an internally consistent, but nevertheless incorrect set of extractions. We propose that this problem can be overcome by simultaneously learning classifiers for many different categories and relations in the presence of an ontology defining constraints that couple the training of these classifiers. Experimental results show that simultaneously learning a coupled collection of classifiers for 30 categories and relations results in much more accurate extractions than training classifiers individually.
Article
Full-text available
We present results for a system designed to perform Open Knowledge Extraction, based on a tradition of compositional language processing, as applied to a large collection of text derived from the Web. Evaluation through manual assessment shows that well-formed propositions of reasonable quality, representing general world knowledge, given in a logical form potentially useable for inference, may be extracted in high volume from arbitrary input sentences. We compare these results with those obtained in recent work on Open Information Extraction, indicating with some examples the quite different kinds of output obtained by the two approaches. Finally, we observe that portions of the extracted knowledge are comparable to results of recent work on class attribute extraction.
Conference Paper
Full-text available
A key question regarding the future of the semantic web is “how will we acquire structured information to populate the semantic web on a vast scale?” One approach is to enter this information manually. A second approach is to take advantage of pre-existing databases, and to develop common ontologies, publishing standards, and reward systems to make this data widely accessible. We consider here a third approach: developing software that automatically extracts structured information from unstructured text present on the web. We also describe preliminary results demonstrating that machine learning algorithms can learn to extract tens of thousands of facts to populate a diverse ontology, with imperfect but reasonably good accuracy.
Conference Paper
Full-text available
This paper concerns learning information by reading natural language texts. The major aim is to develop representations that are understandable by a reasoning engine and can be used to answer questions. We use abduction to map natural language sentences into concise and specific underlying theories. Techniques for automatically generating usable data representations are discussed. New techniques are proposed to obtain semantically correct and precise logical representations from natural language, in particular in cases where its syntactic complexity results in fragmented logical forms.
Conference Paper
Full-text available
Almost all automatic semantic role labeling (SRL) systems rely on a preliminary parsing step that derives a syntactic structure from the sentence being analyzed. This makes the choice of syntactic representation an essential design decision. In this paper, we study the influence of syntactic representation on the performance of SRL systems. Specifically, we compare constituent-based and dependency-based representations for SRL of English in the FrameNet paradigm. Contrary to previous claims, our results demonstrate that the systems based on dependencies perform roughly as well as those based on constituents: for the argument classification task, dependency-based systems perform slightly higher on average, while the opposite holds for the argument identification task. This is remarkable because dependency parsers are still in their infancy while constituent parsing is more mature. Furthermore, the results show that dependency-based semantic role classifiers rely less on lexicalized features, which makes them more robust to domain changes and makes them learn more efficiently with respect to the amount of training data.
Conference Paper
Full-text available
The ITP System for MUC3 is diagrammed in Figure 1. The three major modules handle different units of processing: the Message Handler processes a message unit; the ITP NLU Module processes a sentence and builds a Cognitive Model of the message; and the MUC3 Template Reasoning Module processes a segment of discourse.
Article
Full-text available
Kylin, a system that uses self-supervised learning to train relation-specific open information extractors from Wikipedia infoboxes, is studied. Open Information Extraction (IE) differs from traditional methods in its input, in its target schema (the relations of interest are discovered automatically), and in its computational complexity. For each target relation within an infobox, Kylin issues queries to identify and verify relevant sentences that mention the attribute's value. Improvement in Kylin's integration abilities through the use of probabilistic CFGs is also expected, which may provide a cleaner way to integrate page and sentence classification with extraction and also enables joint inference. Shrinkage and retraining allow Kylin to improve extractor robustness, and these extractors can successfully mine tuples from a broader set of web pages. Kylin primarily uses the relational approach, and the best option for jointly optimizing extraction generality, precision, and recall will be to further combine relational and structural approaches.
Article
Full-text available
The availability of large scale data sets of manually annotated predicate-argument structures has recently favored the use of machine learning approaches to the design of automated semantic role labeling (SRL) systems. The main research in this area relates to the design choices for feature representation and for effective decompositions of the task in different learning models. Regarding the former choice, structural properties of full syntactic parses are largely employed as they represent ways to encode different principles suggested by the linking theory between syntax and semantics. The latter choice relates to several learning schemes over global views of the parses. For example, re-ranking stages operating over alternative predicate-argument sequences of the same sentence have shown to be very effective. In this article, we propose several kernel functions to model parse tree properties in kernel-based machines, for example, perceptrons or support vector machines. In particular, we define different kinds of tree kernels as general approaches to feature engineering in SRL. Moreover, we extensively experiment with such kernels to investigate their contribution to individual stages of an SRL architecture both in isolation and in combination with other traditional manually coded features. The results for boundary recognition, classification, and re-ranking stages provide systematic evidence about the significant impact of tree kernels on the overall accuracy, especially when the amount of training data is small. As a conclusive result, tree kernels allow for a general and easily portable feature engineering method which is applicable to a large family of natural language processing tasks.
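For readers unfamiliar with tree kernels, the recursion below is a compact version of the classic subset-tree kernel often used in such SRL architectures: it counts (decayed) common tree fragments of two parses. Trees are encoded as nested tuples; this is a generic sketch, not the authors' exact kernel engineering.

```python
# Subset-tree kernel sketch (Collins & Duffy style): K(T1, T2) sums, over
# all node pairs, the weighted number of common fragments rooted there.
LAMBDA = 0.4  # decay factor penalising large fragments

def production(node):
    """A node's grammar production: its label plus its children's labels."""
    return (node[0], tuple(child[0] for child in node[1:]))

def delta(n1, n2):
    if production(n1) != production(n2):
        return 0.0
    if len(n1) == 1:                     # matching leaves / pre-terminals
        return LAMBDA
    score = LAMBDA
    for c1, c2 in zip(n1[1:], n2[1:]):   # same arity, since productions match
        score *= 1.0 + delta(c1, c2)
    return score

def nodes(tree):
    yield tree
    for child in tree[1:]:
        yield from nodes(child)

def tree_kernel(t1, t2):
    return sum(delta(a, b) for a in nodes(t1) for b in nodes(t2))

t1 = ("S", ("NP", ("N",)), ("VP", ("V",), ("NP", ("N",))))
t2 = ("S", ("NP", ("N",)), ("VP", ("V",), ("NP", ("D",), ("N",))))
print(tree_kernel(t1, t2))   # larger value = more shared structure
```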
Article
Full-text available
We present a general framework for semantic role labeling. The framework combines a machine-learning technique with an integer linear programming-based inference procedure, which incorporates linguistic and structural constraints into a global decision process. Within this framework, we study the role of syntactic parsing information in semantic role labeling. We show that full syntactic parsing information is, by far, most relevant in identifying the argument, especially, in the very first stage—the pruning stage. Surprisingly, the quality of the pruning stage cannot be solely determined based on its recall and precision. Instead, it depends on the characteristics of the output candidates that determine the difficulty of the downstream problems. Motivated by this observation, we propose an effective and simple approach of combining different semantic role labeling systems through joint inference, which significantly improves its performance. Our system has been evaluated in the CoNLL-2005 shared task on semantic role labeling, and achieves the highest F1 score among 19 participants.
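To make the inference step concrete, here is a small sketch of ILP-based role assignment using the PuLP library: choose one label per candidate span to maximize classifier scores, subject to structural constraints such as non-overlapping arguments and unique core roles. The scores, spans, and constraint set are invented for illustration and are not the paper's actual formulation.

```python
# ILP inference for SRL (requires: pip install pulp). Maximise the summed
# classifier scores of (span, role) assignments under global constraints.
import pulp

candidates = {  # span -> {role: classifier score}; all values made up
    (0, 1): {"Arg0": 0.9, "Arg1": 0.2, "NONE": 0.1},
    (2, 4): {"Arg0": 0.3, "Arg1": 0.8, "NONE": 0.1},
    (3, 4): {"Arg1": 0.6, "ArgM-LOC": 0.5, "NONE": 0.2},
}

prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)
x = {(s, r): pulp.LpVariable(f"x_{s[0]}_{s[1]}_{r}".replace("-", "_"),
                             cat="Binary")
     for s, roles in candidates.items() for r in roles}

# Objective: total score of the selected assignments.
prob += pulp.lpSum(candidates[s][r] * x[s, r] for s, r in x)

# Each span receives exactly one label (possibly NONE).
for s, roles in candidates.items():
    prob += pulp.lpSum(x[s, r] for r in roles) == 1

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

# Overlapping spans cannot both be real arguments.
spans = list(candidates)
for i, a in enumerate(spans):
    for b in spans[i + 1:]:
        if overlaps(a, b):
            prob += (pulp.lpSum(x[a, r] for r in candidates[a] if r != "NONE")
                     + pulp.lpSum(x[b, r] for r in candidates[b] if r != "NONE")
                     <= 1)

# Each core argument appears at most once.
for core in ("Arg0", "Arg1"):
    prob += pulp.lpSum(x[s, r] for s, r in x if r == core) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([(s, r) for (s, r), v in x.items() if v.value() == 1 and r != "NONE"])
```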
Article
Full-text available
We present a model for semantic role labeling that effectively captures the linguistic intuition that a semantic argument frame is a joint structure, with strong dependencies among the arguments. We show how to incorporate these strong dependencies in a statistical joint model with a rich set of features over multiple argument phrases. The proposed model substantially outperforms a similar state-of-the-art local model that does not include dependencies among different arguments. We evaluate the gains from incorporating this joint information on the Propbank corpus, when using correct syntactic parse trees as input, and when using automatically derived parse trees. The gains amount to 24.1% error reduction on all arguments and 36.8% on core arguments for gold-standard parse trees on Propbank. For automatic parse trees, the error reductions are 8.3% and 10.3% on all and core arguments, respectively. We also present results on the CoNLL 2005 shared task data set. Additionally, we explore considering multiple syntactic analyses to cope with parser noise and uncertainty.
Article
Full-text available
We demonstrate that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence assumptions latent in a vanilla treebank grammar.
Article
Full-text available
The selectional preferences of verbal predicates are an important component of a computational lexicon. They have frequently been cited as being useful for word sense disambiguation (WSD), alongside other sources of knowledge. We evaluate automatically acquired selectional preferences on the level playing field provided by Senseval to examine to what extent they help in WSD.
Article
One of the main challenges in question-answering is the potential mismatch between the expressions in questions and the expressions in texts. While humans appear to use inference rules such as "X writes Y" implies "X is the author of Y" in answering questions, such rules are generally unavailable to question-answering systems due to the inherent difficulty in constructing them. In this paper, we present an unsupervised algorithm for discovering inference rules from text. Our algorithm is based on an extended version of Harris' Distributional Hypothesis, which states that words that occurred in the same contexts tend to be similar. Instead of using this hypothesis on words, we apply it to paths in the dependency trees of a parsed corpus. Essentially, if two paths tend to link the same set of words, we hypothesize that their meanings are similar. We use examples to show that our system discovers many inference rules easily missed by humans.
Article
This paper describes FOIL, a system that learns Horn clauses from data expressed as relations. FOIL is based on ideas that have proved effective in attribute-value learning systems, but extends them to a first-order formalism. This new system has been applied successfully to several tasks taken from the machine learning literature.
Article
The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
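The heart of ID3 is the entropy-based information-gain criterion for choosing which attribute to split on; the sketch below reproduces that computation on a made-up attribute-value dataset (the standard textbook formulation, not Quinlan's original code).

```python
# Entropy and information gain, the attribute-selection core of ID3.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Expected reduction in label entropy from splitting on `attribute`."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [l for row, l in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain", "windy": True},
    {"outlook": "overcast", "windy": False},
]
labels = ["no", "no", "no", "yes"]

# ID3 greedily splits on the attribute with the highest gain at each node.
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
print(best, round(information_gain(rows, labels, best), 3))  # outlook 0.811
```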
Article
The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer understandable knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information, and promote new uses of the Web to support knowledge-based inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs. The first is an ontology that defines the classes (e.g., company, person, employee, product) and relations (e.g., employed_by, produced_by) of interest when creating the knowledge base. The second is a set of training data consisting of labeled regions of hypertext that represent instances of these classes and relations. Given these inputs, the system learns to extract information from other pages and hyperlinks on the Web. This article describes our general approach, several machine learning algorithms for this task, and promising initial results with a prototype system that has created a knowledge base describing university people, courses, and research projects.
Article
A theory of analogy must describe how the meaning of an analogy is derived from the meanings of its parts. In the structure‐mapping theory, the interpretation rules are characterized as implicit rules for mapping knowledge about a base domain into a target domain. Two important features of the theory are (a) the rules depend only on syntactic properties of the knowledge representation, and not on the specific content of the domains; and (b) the theoretical framework allows analogies to be distinguished cleanly from literal similarity statements, applications of abstractions, and other kinds of comparisons. Two mapping principles are described: (a) Relations between objects, rather than attributes of objects, are mapped from base to target; and (b) The particular relations mapped are determined by systematicity, as defined by the existence of higher‐order relations.
Conference Paper
We present the first unsupervised approach to the problem of learning a semantic parser, using Markov logic. Our USP system transforms dependency trees into quasi-logical forms, recursively induces lambda forms from these, and clusters them to abstract away syntactic variations of the same meaning. The MAP semantic parse of a sentence is obtained by recursively assigning its parts to lambda-form clusters and composing them. We evaluate our approach by using it to extract a knowledge base from biomedical abstracts and answer questions. USP substantially outperforms TextRunner, DIRT and an informed baseline on both precision and recall on this task.
Conference Paper
Many AI tasks, in particular natural language processing, require a large amount of world knowledge to create expectations, assess plausibility, and guide disambiguation. However, acquiring this world knowledge remains a formidable challenge. Building on ideas by Schubert, we have developed a system called DART (Discovery and Aggregation of Relations in Text) that extracts simple, semi-formal statements of world knowledge (e.g., "airplanes can fly", "people can drive cars") from text by abstracting from a parser's output, and we have used it to create a database of 23 million propositions of this kind. An evaluation of the DART database on two language processing tasks (parsing and textual entailment) shows that it improves performance, and a human evaluation shows that over half the facts in it are considered true or partially true, rising to 70% for facts seen with high frequency. The significance of this work is two-fold: first, it has created a new, publicly available knowledge resource for language processing and other data interpretation tasks, and second, it provides empirical evidence of the utility of this type of knowledge, going beyond Schubert et al.'s earlier evaluations, which were based solely on human inspection of its contents.
Conference Paper
Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier. Our algorithm combines the advantages of supervised IE (combining 400,000 noisy pattern features in a probabilistic classifier) and unsupervised IE (extracting large numbers of relations from large corpora of any domain). Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%. We also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distant in their expression.
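The distant-supervision heuristic is simple enough to show in a few lines: every sentence containing an entity pair known to stand in a KB relation is treated as a (noisy) training example for that relation. The KB, corpus, and feature template below are toy stand-ins, not Freebase or the paper's actual feature set.

```python
# Distant supervision in miniature: align KB triples to sentences, extract
# lexical features between the mentions, and train a relation classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

kb = {("Barack Obama", "Hawaii"): "born_in",
      ("Sergey Brin", "Google"): "founder_of"}

corpus = ["Barack Obama was born in Hawaii in 1961 .",
          "Sergey Brin founded Google with Larry Page ."]

def pair_features(sentence, e1, e2):
    """Words between the two entity mentions (a crude feature template)."""
    words = sentence.split()
    i = words.index(e1.split()[-1])   # last token of entity 1
    j = words.index(e2.split()[0])    # first token of entity 2
    return {f"between={w}": True for w in words[min(i, j) + 1:max(i, j)]}

X, y = [], []
for (e1, e2), rel in kb.items():
    for sent in corpus:
        if e1 in sent and e2 in sent:        # the distant-supervision step
            X.append(pair_features(sent, e1, e2))
            y.append(rel)

clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
test = "Larry Page founded Google ."
print(clf.predict([pair_features(test, "Larry Page", "Google")]))
# expected: ['founder_of']
```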
Conference Paper
In this paper, we present Espresso, a weakly-supervised, general-purpose, and accurate algorithm for harvesting semantic relations. The main contributions are: i) a method for exploiting generic patterns by filtering incorrect instances using the Web; and ii) a principled measure of pattern and instance reliability enabling the filtering algorithm. We present an empirical comparison of Espresso with various state of the art systems, on different size and genre corpora, on extracting various general and specific relations. Experimental results show that our exploitation of generic patterns substantially increases system recall with small effect on overall precision.
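The reliability measure is the load-bearing part of Espresso's bootstrapping loop; the sketch below follows the general shape of the published measure (a pattern is reliable to the extent that it has high, PMI-weighted association with already-reliable instances), with invented counts and without the paper's exact normalizations.

```python
# Pattern reliability in the spirit of Espresso: PMI with known instances,
# normalised by the maximum PMI and weighted by instance reliability.
# Counts and seed scores below are made up for illustration.
import math

counts = {("Italy:country", "Y such as X"): 8,
          ("Italy:country", "X is a Y"): 3,
          ("apple:fruit", "Y such as X"): 5}
instance_rel = {"Italy:country": 1.0, "apple:fruit": 0.8}  # seed scores

def pmi(i, p):
    n = sum(counts.values())
    n_ip = counts.get((i, p), 0)
    n_i = sum(c for (a, _), c in counts.items() if a == i)
    n_p = sum(c for (_, b), c in counts.items() if b == p)
    return math.log(n_ip * n / (n_i * n_p)) if n_ip else 0.0

def pattern_reliability(p):
    max_pmi = max(abs(pmi(i, q)) for i, q in counts) or 1.0
    return sum(pmi(i, p) / max_pmi * r
               for i, r in instance_rel.items()) / len(instance_rel)

for p in ("Y such as X", "X is a Y"):
    print(p, round(pattern_reliability(p), 3))
```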
Conference Paper
Traditional Information Extraction (IE) takes a relation name and hand-tagged examples of that relation as input. Open IE is a relation-independent extraction paradigm that is tailored to massive and heterogeneous corpora such as the Web. An Open IE system extracts a diverse set of relational tuples from text without any relation-specific input. How is Open IE possible? We analyze a sample of English sentences to demonstrate that numerous relationships are expressed using a compact set of relation-independent lexico-syntactic patterns, which can be learned by an Open IE system. What are the tradeoffs between Open IE and traditional IE? We consider this question in the context of two tasks. First, when the number of relations is massive, and the relations themselves are not pre-specified, we argue that Open IE is necessary. We then present a new model for Open IE called O-CRF and show that it achieves increased precision and nearly double the recall of the model employed by TEXTRUNNER, the previous state-of-the-art Open IE system. Second, when the number of target relations is small, and their names are known in advance, we show that O-CRF is able to match the precision of a traditional extraction system, though at substantially lower recall. Finally, we show how to combine the two types of systems into a hybrid that achieves higher precision than a traditional extractor, with comparable recall.
Conference Paper
We present a novel approach to relation extraction, based on the observation that the information required to assert a relationship between two named entities in the same sentence is typically captured by the shortest path between the two entities in the dependency graph. Experiments on extracting top-level relations from the ACE (Automated Content Extraction) newspaper corpus show that the new shortest path dependency kernel outperforms a recent approach based on dependency tree kernels.
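The core observation translates directly into a few lines of graph code: treat the dependency parse as an undirected graph and take the shortest path between the two entity nodes as the relation's representation. The hand-written toy parse below (using networkx) is illustrative, not taken from the paper's ACE experiments.

```python
# Shortest dependency path between two entities (pip install networkx).
import networkx as nx

# Hand-written dependency edges for:
# "protesters seized several stations in Baghdad"
edges = [("seized", "protesters"), ("seized", "stations"),
         ("stations", "several"), ("stations", "in"), ("in", "Baghdad")]

g = nx.Graph(edges)
path = nx.shortest_path(g, source="protesters", target="Baghdad")
print(" - ".join(path))
# protesters - seized - stations - in - Baghdad
# A kernel then compares two relation instances by counting common
# features (words, POS tags, directions) along such paths.
```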
Conference Paper
We are trying to extend the boundary of Information Extraction (IE) systems. Existing IE systems require a lot of time and human effort to tune for a new scenario. Preemptive Information Extraction is an attempt to automatically create all feasible IE systems in advance without human intervention. We propose a technique called Unrestricted Relation Discovery that discovers all possible relations from texts and presents them as tables. We present a preliminary system that obtains reasonably good results.
Conference Paper
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
Article
We present an application of kernel methods to extracting relations from unstructured natural language sources. We introduce kernels defined over shallow parse representations of text, and design efficient algorithms for computing the kernels. We use the devised kernels in conjunction with Support Vector Machine and Voted Perceptron learning algorithms for the task of extracting person-affiliation and organization-location relations from text. We experimentally evaluate the proposed methods and compare them with feature-based learning algorithms, with promising results.
Article
In this rich reference work, Beth Levin classifies over 3,000 English verbs according to shared meaning and behavior. Levin starts with the hypothesis that a verb's meaning influences its syntactic behavior and develops it into a powerful tool for studying the English verb lexicon. She shows how identifying verbs with similar syntactic behavior provides an effective means of distinguishing semantically coherent verb classes, and isolates these classes by examining verb behavior with respect to a wide range of syntactic alternations that reflect verb meaning. The first part of the book sets out alternate ways in which verbs can express their arguments. The second presents classes of verbs that share a kernel of meaning and explores in detail the behavior of each class, drawing on the alternations in the first part. Levin's discussion of each class and alternation includes lists of relevant verbs, illustrative examples, comments on noteworthy properties, and bibliographic references. The result is an original, systematic picture of the organization of the verb inventory. Easy to use, English Verb Classes and Alternations sets the stage for further explorations of the interface between lexical semantics and syntax. It will prove indispensable for theoretical and computational linguists, psycholinguists, cognitive scientists, lexicographers, and teachers of English as a second language.
Article
In this paper, we propose an unsupervised method for discovering inference rules from text, such as "X is author of Y ≈ X wrote Y", "X solved Y ≈ X found a solution to Y", and "X caused Y ≈ Y is triggered by X". Inference rules are extremely important in many fields such as natural language processing, information retrieval, and artificial intelligence in general. Our algorithm is based on an extended version of Harris's Distributional Hypothesis, which states that words that occurred in the same contexts tend to be similar. Instead of using this hypothesis on words, we apply it to paths in the dependency trees of a parsed corpus.
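In miniature, the path-based version of the Distributional Hypothesis can be pictured as follows: each dependency path is represented by the fillers of its X and Y slots, and two paths are scored as similar when those filler sets overlap. The Jaccard overlap below is a simple stand-in for the paper's mutual-information-based measure, and the filler sets are invented.

```python
# DIRT-style path similarity via slot-filler overlap (illustrative data).
paths = {
    "X wrote Y":        {"X": {"Austen", "Orwell", "Twain"},
                         "Y": {"Emma", "1984", "novels"}},
    "X is author of Y": {"X": {"Austen", "Orwell", "King"},
                         "Y": {"Emma", "novels", "Misery"}},
    "X visited Y":      {"X": {"Obama", "tourists"},
                         "Y": {"Paris", "museum"}},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def path_similarity(p, q):
    """Geometric mean of the X-slot and Y-slot filler overlaps."""
    sx = jaccard(paths[p]["X"], paths[q]["X"])
    sy = jaccard(paths[p]["Y"], paths[q]["Y"])
    return (sx * sy) ** 0.5

print(round(path_similarity("X wrote Y", "X is author of Y"), 3))  # high
print(round(path_similarity("X wrote Y", "X visited Y"), 3))       # 0.0
```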
Article
FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a) a commitment to corpus evidence for semantic and syntactic generalizations, and (b) the representation of the valences of its target words (mostly nouns, adjectives, and verbs) in which the semantic portion makes use of frame semantics. The resulting database will contain (a) descriptions of the semantic frames underlying the meanings of the words described, and (b) the valence representation (semantic and syntactic) of several thousand words and phrases, each accompanied by (c) a representative collection of annotated corpus attestations, which jointly exemplify the observed linkings between "frame elements" and their syntactic realizations (e.g. grammatical function, phrase type, and other syntactic traits). This report will present the project's goals and workflow, and information about the computational tools that have been adapted or created in-house for this work.