All content in this area was uploaded by Vasant Honavar on Dec 28, 2013
Ontology-guided extraction of structured information from unstructured text:
Identifying and capturing complex relationships
by
Sushain Pandit
A thesis submitted to the graduate faculty
in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
Major: Computer Science
Program of Study Committee:
Vasant Honavar, Major Professor
Dimitris Margaritis
Samik Basu
Iowa State University
Ames, Iowa
2010
Copyright © Sushain Pandit, 2010. All rights reserved.
DEDICATION
Dedicated to Supreme God, my parents and Sachin Tendulkar’s world record double-hundred.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
ABSTRACT
CHAPTER 1. OVERVIEW AND MOTIVATION
1.1 Information Extraction
1.2 Extracting Domain-specific Semantic Information from Text
1.3 Open Problems and Challenges
1.3.1 Issues in NLP-based Relationship and Entity Extraction
1.3.2 Issues in Semantic Mapping and Validation
1.3.3 Issues in Generic Representation
1.4 Motivation for Creating a Novel Information Extraction Framework
1.5 Goals
1.6 Thesis Outline
CHAPTER 2. PRELIMINARIES AND RELATED WORK
2.1 Derivative Structures: Parse Trees and Dependency Graphs
2.1.1 Parse Tree
2.1.2 Dependency Graph
2.2 Formal Specifications for Validation and Representation
2.2.1 Domain Ontology, Instances and Knowledge Base
2.2.2 Validation Process
2.3 Generic Representation
2.3.1 Resource Description Framework (RDF)
2.4 Problem Definition
2.5 Related Work
2.5.1 Different Flavors of Information Extraction from Text
2.5.2 Complex Relationship Extraction from Text
CHAPTER 3. A NOVEL FRAMEWORK FOR EXTRACTING INFORMATION FROM TEXT
3.1 Approach Overview
3.2 Text Processing and Generation of Derivative Structures
3.3 Composite Rule Framework for Entity and Relationship Extraction
3.3.1 Terminology
3.3.2 Identifying and Defining the Relationship Types
3.3.3 Formulating Rules for Complex Relationship and Entity Extraction
3.3.4 Formulating Rules for Simple Relationships
3.4 Semantic Validation Framework
3.4.1 Performing Validation against the Domain Ontology Model
3.4.2 Discussion on Basic Validation
3.4.3 Performing Enrichments
3.5 Representation Framework
3.5.1 Transformation of Information Constructs into Graph Formalism
3.5.2 Primitive Transformation for Representing Simple Information
3.5.3 Transformations for Representing Complex Information
3.5.4 Existential Claims based on our Information Extraction Framework
3.6 Discussion on Algorithms
3.6.1 Pronoun Resolution in Algorithm 1
CHAPTER 4. SEMANTIXS: SYSTEM ARCHITECTURE AND OVERVIEW
4.1 SEMANTIXS Architectural Overview
4.1.1 SEMANTIXS Component Interaction
4.2 Design and Implementation Details
4.2.1 Service Request/Response Framework
4.2.2 Core Text Processing and Information Extraction Framework
4.2.3 User Interface, Visualization and Analysis Framework
CHAPTER 5. EMPIRICAL EVALUATION AND ANALYSIS USING SEMANTIXS
5.1 Evaluation Scenario
5.2 Experimental Setup: Text, Ontology and Instances
5.2.1 Text
5.2.2 Ontology and Instance Data
5.3 Results and Interpretation
5.3.1 Interpretation
5.3.2 Discussion on Errors
5.4 Querying the Graph
5.4.1 Formulating Complex Questions
5.4.2 Formulating a Query-plan
CHAPTER 6. CONCLUSION
6.1 Summary
6.2 Contributions
6.3 Further Work
APPENDIX A. EXPERIMENTAL TEXTS AND EXTRACTED INFORMATION
A.0.1 Sample Text Fragment
A.0.2 Sample Generated RDF Sub-graph
A.0.3 RDF Serialization for Graph 3.11
A.0.4 RDF Serialization for Graph 3.12
A.0.5 RDF Serialization for Graph 3.13
LIST OF TABLES
Table 5.1 Overview of Experimental Texts: Counts of Positive and Negative Instances
Table 5.2 Correctly Classified Information: Counts of Positive and Negative Instances
Table 5.3 C.M. for Simple
Table 5.4 C.M. for Type 1
Table 5.5 C.M. for Type 2
Table 5.6 C.M. for References
Table 5.7 C.M. for Type 3
Table 5.8 Results: Precision, Recall and F-measure
Table 5.9 Answers to Complex Questions
LIST OF FIGURES
Figure 2.1 A Simple Dependency Graph
Figure 3.1 Dependency Graph with Application of a Simple Extraction Rule
Figure 3.2 Example for a Relationship with Internal Clauses
Figure 3.3 Example Illustrating the Intuitive Motivation behind Case 2
Figure 3.4 Example for a Relationship with a Qualifying Modifier
Figure 3.5 Example with a Simple Placement of Conjunction
Figure 3.6 Example with a Complex Placement of Conjunction
Figure 3.7 Dependency Graph for the Sentence in Figure 3.2
Figure 3.8 Extraction Rule for a Relationship with Internal Clauses
Figure 3.9 Dependency Graph for the Sentence in Figure 3.4
Figure 3.10 Extraction Rule Application for a Relationship with Qualifying Modifiers
Figure 3.11 RDF representation of Simple Information
Figure 3.12 RDF representation of Reified Internal Clauses
Figure 3.13 RDF representation of Qualified Relationships
Figure 4.1 SEMANTIXS Architectural Diagram
Figure 4.2 UML Representation of Service Request Delegation Framework
Figure 4.3 UML Representation of Information Extraction Framework
Figure 4.4 Text Processing View
Figure 4.5 Semantic Graph View
Figure 5.1 RDF Sub-graph for the Entity “Dow Jones”
ACKNOWLEDGEMENTS
I would like to take this opportunity to express my gratitude to everyone who helped me
in conducting the research leading to this thesis. Firstly, I would like to thank my advisor,
Dr. Vasant Honavar for his guidance throughout the period of this research work. I would also
like to thank my committee members, Dr. Dimitris Margaritis and Dr. Samik Basu for their
thoughtful insights and pointers. Further, I would like to acknowledge useful discussions with
Neeraj Koul, Heyong Wang and Ganesh Ram Santhanam, members of the Artificial Intelligence
Research Laboratory. I would also like to warmly thank my parents for all those intangible
bits of reassurance given at key moments. Finally, I would like to express my gratitude to
the Center for Integrated Animal Genomics and the Center for Computational Intelligence,
Learning, and Discovery for the research assistantship that provided partial support for me
during my graduate studies at Iowa State University.
ABSTRACT
Many applications call for methods to enable automatic extraction of structured infor-
mation from unstructured natural language text. Due to the inherent challenges of natural
language processing, most of the existing methods for information extraction from text tend
to be domain specific. This thesis explores a modular ontology-based approach to informa-
tion extraction that decouples domain-specific knowledge from the rules used for information
extraction. Specifically, the thesis describes:
1. A framework for ontology-driven extraction of a subset of nested complex relationships
(e.g., Joe reports that Jim is a reliable employee) from free text. The extracted relation-
ships are semantically represented in the form of RDF (resource description framework)
graphs, which can be stored in RDF knowledge bases and queried using query languages
for RDF.
2. An open source implementation of SEMANTIXS, a system for ontology-guided extraction
and semantic representation of structured information from unstructured text.
3. Results of experiments that offer some evidence of the utility of the proposed ontology-
based approach to extract complex relationships from text.
CHAPTER 1. OVERVIEW AND MOTIVATION
This chapter provides an introduction to the main topics and the motivation behind the
thesis. A brief overview of our contributions and an outline of the overall structure of the
thesis are also provided.
1.1 Information Extraction
Information-driven AI problems, as well as semantic computing issues dealing with linked
data and the semantic web, have long created a need for domain-specific, structured and
consumable information. In keeping with the philosophical point of view of Stoll and
Schubert, “Data is not information, Information is not knowledge, Knowledge is not
understanding, Understanding is not wisdom”, it is fairly easy to draw contrasts [1] between
such domain-specific, consumable information and raw data, which is often openly available
and constantly generated at a rate of over a terabyte per day. In essence, the constantly
growing data becomes rather useless if we are unable to extract meaningful, relevant and
consumable information out of it.
This overarching need for extracting structured information from raw data has motivated
various systematic processes for information extraction. For a while such processes were largely
manual, with domain experts semi-automatically extracting relevant information constructs
from data sources (web, text, images, publicly available structured data, etc.), validating them
against their predefined set of domain-specific rules, and organizing them into a useful formal
representation.
Recently, however, considerable attention has been given to building systems that try to
automatically (semi or fully) extract information from free text, validate it against a domain
description and produce a coherent representation of it. Our work in this thesis focuses on
this particular paradigm, and we describe it in more detail in the following sections.
1.2 Extracting Domain-specific Semantic Information from Text
The general problem of interpreting natural language text is very difficult [12]; however,
there have been significant improvements that directly relate to the feasibility of such
interpretation: (i) Natural Language Processing (NLP) techniques that enhance parsing and
sentence-specific semantic interpretation; and (ii) formal representations to capture domain
knowledge, such as ontologies, taxonomies, etc.
Advances in NLP have enabled us to extract relationships and entities, which are essential
building blocks of any sort of information that is extracted from text. Further, the growth
and acceptance of open World Wide Web Consortium (W3C) [47] standards for encoding and
representing knowledge, such as Web Ontology Language (OWL) [43] and resource descrip-
tion framework (RDF) [40], have made it easier to express domain-specific information in a
consumable form.
Overall, structured information extraction encompasses three fundamental steps: (i) NLP-
based relationship and entity extraction, (ii) semantic mapping and validation of the extracted
constructs, and (iii) representation of the extracted constructs in a generic formalism. However,
each of these steps has numerous open issues and inherent challenges, and we discuss some of
them in the next section, followed by our motivations to address them.
1.3 Open Problems and Challenges
1.3.1 Issues in NLP-based Relationship and Entity Extraction
Generally, information extraction algorithms extract relations as: (i) simple verbs based on
part-of-speech tagging [7], (ii) complex associations based on dependency parses [2], or
(iii) induced relations through term co-occurrence in large text corpora. Similarly, entities
are generally extracted as: (i) simple nouns, or (ii) modified, complex nouns. These algorithms
make use of certain extraction patterns, which may be statically encoded in the form of rules,
or dynamically induced by various machine learning approaches on large text corpora.
Some of the challenges encountered by conventional algorithms utilizing static rules include:
(i) over-dependence on parse trees, resulting in an inability to extract indirect, implicit
and complex relationships, and (ii) the need for named entities to guide relationship
extraction. On the other hand, approaches based on rule induction require an enormous amount
of textual resources across a fairly wide range of text corpora to be able to induce generic rules.
1.3.2 Issues in Semantic Mapping and Validation
A domain-specific information extraction process needs to make sure that the extracted
relationships and entities (information constructs) are semantically mapped and validated
against a domain description. This domain description is often specified in the form of an
ontology [46], capturing the underlying relations and concepts occurring within the domain.
This step raises a number of challenges, some of which are: (i) operating with an incomplete
description of the domain, viz., the case when the ontology does not capture all the relations
and concepts, and (ii) enriching the ontology with newly found knowledge without introducing
inconsistencies.
1.3.3 Issues in Generic Representation
Finally, we need to represent the extracted information using a specification that is widely
accepted and consumable. This step is critical in the overall process of information extrac-
tion since in its absence, the extracted information will exist in an independent world, which
is highly undesirable, considering the amount of recent attention that linked data [42] and
knowledge integration [10], [11] have received.
One of the most common mechanisms for representing information in machine-consumable
form is the W3C family of specifications: RDF, RDF Schema, and OWL. Thus, it becomes
an interesting issue to be able to represent the extracted information constructs within the
existing W3C specifications.
With these issues in mind, we motivate the need for our work in the following section.
1.4 Motivation for Creating a Novel Information Extraction Framework
1. Our main overarching motivation stems from the need to identify, extract, validate and
represent complex relationships (e.g., Joe reports that Jim is a reliable employee) in order
to facilitate comprehensive information extraction from unstructured text.
Such relationships often occur in a non-standard form where one part of the sentence
refers to the other part through an inherent dependency relationship, or the underlying
information in the sentence is not suitable for a straightforward extraction task. The
issue with such complex relationships is that at least one of the three steps in the process
of information extraction becomes problematic. In our experiments, we saw one or more
of the following problem patterns: (i) the first step required special rules to be able to
identify and extract the relationships, (ii) the validation step required enrichment of the
existing ontology since it made sense to capture the information construct correspond-
ing to the extracted relationship, however, the existing definitions in the given domain
ontology were not sufficient to do that, and (iii) the representation step required certain
finesse to express the complex relationship within the given set of specifications.
In Chapter 3, we describe some interesting examples that motivated the need for capturing
these complex relationships in spite of the inherent difficulty of the task, and clearly
identify the subset of complex nested relationships that we intend to handle.
Encapsulated within this main motivation, there are several ideas that we elaborate
below:
• A common approach for extracting new relationship patterns is to utilize rule-
induction algorithms [30] that induce rules in a semi-supervised manner to capture
newly found relationships. However, we believe that such automatic rule induction
is not the best way to deal with complex relationships, because of their complexity
and uniqueness, and because of the amount of text and time required to determine
whether an induced rule is generic enough to capture the corresponding complex
relationship pattern across a sufficiently large collection of texts. Instead, we try to
experimentally arrive at generic rules to capture our subset of complex relationship
patterns.
• Similarly, one of the more common approaches to handling information constructs
that cannot be validated against a given domain ontology is to simply ignore
them. However, we argue that there are simple ontological extensions that can be
made, without losing consistency, to capture such cases.
• Finally, there are many highly expressive logics, such as description logic [48]
(OWL-DL [43], OWL-Full [43]) and F-logic [44], which may be utilized to express complex
extracted information. However, we argue that such formalisms are overkill for
expressing free text and thus, to the extent possible, we try to capture
complex information while retaining the simplicity of the RDF specification.
2. Our second motivation follows from the first one in that we hope to formulate the set of
rules that would work best to extract certain complex relationship types. In order to do
this, it is necessary to have a suitable information extraction system that allows us to
run experiments and test the performance on large text documents.
3. Our third motivation is to enable complex queries in order to answer certain questions
about the information expressed within unstructured text. Since we intend to extract
complex relationships using our proposed framework, it becomes interesting to experi-
mentally analyze and determine the kind of questions that can be answered once we have
extracted a large set of linked information through these relationships and entities.
1.5 Goals
Against this background of challenges and motivations, we formulate various frameworks
and algorithms in this thesis to address some of these concerns, as briefly described
below:
1. Complex Relationships: We identify a subset of complex relations (e.g., Joe reports
that Jim is a reliable employee) that are especially interesting from the perspective of
structured information extraction (out of the many complicated sentential structures
that occur in free text) and have an above average likelihood of occurrence in a random
sample of free text. We clearly define their scope with respect to what we intend to cover
within our extraction framework.
2. Entity and Relationship Extraction: Having clearly defined the type and nature of
complex relationship patterns that we intend to extract, we precisely formulate the sets
of extraction rules that can identify and extract them. We also formulate an algorithm
that utilizes these rules to extract relationships and entities (information constructs)
from unstructured text.
3. Semantic Validation: Further, we propose a methodology to validate and semantically
associate the extracted information constructs against a given domain description cap-
tured in an ontology. We also include ways to perform ontology enrichment with newly
found relationships, whenever possible.
4. Representation: We formulate a methodology to represent these extracted and validated
information constructs using the existing (non-extended) Resource Description Framework
(RDF) specification. In particular, we introduce a mechanism to represent the complex
relationships in a way that enables complex querying of the extracted
information.
5. Implementation and Empirical Evaluation: Based on our analysis and algorithms,
we provide an implementation of an information extraction system, called SEMANTIXS
(System for Extraction of doMAin-specific iNformation from Text Including compleX
Structures), which can be utilized to extract and semantically represent structured
information from free text. We describe the architectural and design details of the system. We
also report results of some experiments that offer evidence of the utility of the proposed
ontology-based approach to extraction of complex relationships from text. The system
implementation is open-source (under the GNU General Public License), and is available at
http://sourceforge.net/projects/semantixs/.
1.6 Thesis Outline
The remainder of this thesis is organized as follows:
• In Chapter 2, general terminology related to the steps in structured information
extraction is introduced. Preliminary concepts for domain ontologies, knowledge bases, the RDF
specification, etc. are also provided. A brief survey of related work is also given.
• In Chapter 3, we present our framework for information extraction from text along with
the complex relations that we intend to extract. We also present the corresponding
rules to extract those relations. Further, we talk about the frameworks that take care
of validation and enrichment, and the RDF representation that we utilize for expressing
extracted information. We conclude the chapter with a discussion on the algorithms
incorporating these frameworks.
• In Chapter 4, we describe the system architecture, design and other details of SEMAN-
TIXS, an information extraction system implemented based on the above analysis and
algorithms.
• In Chapter 5, we describe our experiments and results achieved using SEMANTIXS and
free text processing.
• We conclude the thesis work in Chapter 6. We summarize the work done in the thesis
along with the main contributions. We also provide a number of interesting future threads
of investigation that directly relate to this research.
CHAPTER 2. PRELIMINARIES AND RELATED WORK
This chapter provides preliminaries, definitions and problem formulations for this thesis.
We also give an overview of the related work in the area of information extraction from text.
2.1 Derivative Structures: Parse Trees and Dependency Graphs
Due to the complete lack of semantics in unstructured text, any task involving information
extraction from text must work with an intermediate representation of the text.
For this purpose, such tasks employ linguistic parsers to perform syntactic analysis of the text
and determine its grammatical structure with respect to a formal grammar for the language
that the task operates in (see footnote 1).
The parsing process results in the generation of parse trees and optionally, dependency
graphs that are data structures derived from the text itself, which capture the implicit structure
within the tokens in the input text. In the remainder of this section, we give a brief overview
of the terminology related to these data structures. We limit our discussion to the scope of
this thesis.
2.1.1 Parse Tree
A parse tree is an ordered and rooted tree that represents the syntactic structure of a
sentence. In this section, we describe the Penn Treebank Notation utilized by most parsers
for tagging the sentence before generating the parse tree. These tags are often utilized in
information extraction systems to formulate rules based on which they perform extraction.
Footnote 1: In this work, we always assume that the text is in the English language; more
specifically, all that is captured by the englishPCFG.ser provided with the Stanford Parser package.
2.1.1.1 Penn Treebank Notation
These are simplified forms of the definitions found in the Penn Treebank manual (see footnote 2).
• S: Simple declarative clause
• NP: Noun Phrase. Phrasal category that includes all constituents that depend on a head
noun.
• VP: Verb Phrase. Phrasal category headed by a verb.
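As a rough illustration of how these phrasal tags label a sentence, the sketch below encodes a hand-written parse tree for the sentence "Heart attack causes reduced average lifespan" as nested tuples and collects the words under each phrasal category. The tree shape, the word-level tags (NN, VBZ, VBN, JJ, which are standard Penn Treebank part-of-speech tags) and the helper names are assumptions made for illustration; this is not output from the Stanford Parser.

```python
# A parse tree as nested tuples: (label, child, child, ...).
# A leaf is a (tag, word) pair. The tree is hand-written for illustration.
TREE = (
    "S",
    ("NP", ("NN", "Heart"), ("NN", "attack")),
    ("VP",
     ("VBZ", "causes"),
     ("NP", ("VBN", "reduced"), ("JJ", "average"), ("NN", "lifespan"))),
)


def is_leaf(node):
    """A leaf pairs a word-level tag with the word itself."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Return the words under a node, in sentence order."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def phrases(node, label):
    """Return the word sequence of every subtree carrying the given label."""
    if is_leaf(node):
        return []
    found = [leaves(node)] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found


print(phrases(TREE, "NP"))
print(phrases(TREE, "VP"))
```

Rules that refer to NP or VP constituents can then be phrased as lookups over such a tree.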
2.1.2 Dependency Graph
A dependency graph is a data structure that captures the implicit dependencies (sentential
semantics) between tokens (words) in a sentence. In this section, we give a formal definition
of dependency graph that is useful for our analysis, and define various dependency relations
that we use in our work.
Definition: (Dependency Graph) Given an English language sentence T comprising a word-set
W representing the words in the sentence, a dependency graph G of the sentence T is
defined as a directed graph with node-set W and a labeled edge-set connecting the nodes in W
s.t. for any two connected nodes, the label on the edge represents the dependency relationship
between the words represented by the nodes.
Figure 2.1 shows a dependency graph for the sentence - “Heart attack causes reduced
average lifespan”.
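The definition above maps directly onto a small data structure. The sketch below is a minimal illustration: the class and field names are hypothetical, and the edge labels for the example sentence are an assumed analysis using Stanford typed dependency names, written by hand rather than taken from parser output or copied from Figure 2.1. For simplicity, nodes are the word strings themselves, which assumes that no word repeats within the sentence.

```python
from collections import namedtuple

# One labeled, directed edge: governor --relation--> dependent.
Edge = namedtuple("Edge", ["governor", "relation", "dependent"])


class DependencyGraph:
    """Directed graph over the word-set W of one sentence, with labeled edges."""

    def __init__(self, words, edges):
        self.words = set(words)  # node-set W
        self.edges = []
        for edge in edges:
            # Every edge must connect two nodes of W.
            if edge.governor not in self.words or edge.dependent not in self.words:
                raise ValueError("edge endpoint not in word-set: %r" % (edge,))
            self.edges.append(edge)

    def dependents(self, governor, relation=None):
        """Words depending on `governor`, optionally filtered by edge label."""
        return [
            e.dependent
            for e in self.edges
            if e.governor == governor and (relation is None or e.relation == relation)
        ]


sentence = "Heart attack causes reduced average lifespan".split()
# Assumed edges, hand-written for illustration (not parser output).
graph = DependencyGraph(
    sentence,
    [
        Edge("causes", "nsubj", "attack"),
        Edge("attack", "nn", "Heart"),
        Edge("causes", "dobj", "lifespan"),
        Edge("lifespan", "amod", "reduced"),
        Edge("lifespan", "amod", "average"),
    ],
)
print(graph.dependents("causes"))            # ['attack', 'lifespan']
print(graph.dependents("lifespan", "amod"))  # ['reduced', 'average']
```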
For purposes of our work, we describe the following set of dependencies that are necessary
for understanding our rule framework in Chapter 3. These are simplified forms of the definitions
found in the Stanford typed dependencies manual (see footnote 3).
• amod: adjectival modifier - An adjectival modifier of an NP is any adjectival phrase
that serves to modify the meaning of the NP.
Footnote 2: Refer to ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/notation.tex for a complete list.
Footnote 3: Refer to http://nlp.stanford.edu/software/dependencies_manual.pdf for a complete list.
Figure 2.1 A Simple Dependency Graph
• ccomp: clausal complement - A clausal complement of a VP or an ADJP is a clause
with internal subject which functions like an object of the verb or of the adjective;
a clausal complement of a clause is the clausal complement of the VP or of the ADJP
which is the predicate of that clause. Such clausal complements are usually finite (though
there are occasional remnant English subjunctives).
• conj: conjunct - A conjunct is the relation between two elements connected by a co-
ordinating conjunction, such as “and”, “or”, etc. We treat conjunctions asymmetrically:
The head of the relation is the first conjunct and other conjunctions depend on it via the
conj relation.
• dobj : direct object - The direct object of a VP is the noun phrase which is the
(accusative) object of the verb; the direct object of a clause is the direct object of the
VP which is the predicate of that clause.
• neg: negation modifier - The negation modifier is the relation between a negation
word and the word it modifies.
• nn: noun compound modifier - A noun compound modifier of an NP is any noun
that serves to modify the head noun.
• nsubj : nominal subject - A nominal subject is a noun phrase which is the syntactic
subject of a clause. The governor of this relation might not always be a verb: when the
verb is a copular verb, the root of the clause is the complement of the copular verb.
• parataxis: parataxis - The parataxis relation (from Greek for “place side by side”)
is a relation between the main verb of a clause and other sentential elements, such as a
sentential parenthetical, a clause after a “:” or a “;”.
• pobj : object of a preposition - The object of a preposition is the head of a noun
phrase following the preposition. (The preposition in turn may be modifying a noun,
verb, etc.)
• prep: prepositional modifier - A prepositional modifier of a verb, adjective, or noun
is any prepositional phrase that serves to modify the meaning of the verb, adjective, or
noun.
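To show how these relation labels can be used, the sketch below reads a subject-relation-object triple off a list of typed dependencies by following the nsubj and dobj edges of a verb and expanding each head noun with its nn and amod modifiers. It is only an illustrative sketch, not the rule framework of Chapter 3; the function names, the tuple layout and the hand-written dependencies for the sentence "Heart attack causes reduced average lifespan" are assumptions.

```python
# Each dependency is (relation, governor, dependent). Words carry their position
# in the sentence so that modifiers can be put back in surface order.
# Hand-written analysis (an assumption for illustration, not parser output).
DEPS = [
    ("nn", (2, "attack"), (1, "Heart")),
    ("nsubj", (3, "causes"), (2, "attack")),
    ("amod", (6, "lifespan"), (4, "reduced")),
    ("amod", (6, "lifespan"), (5, "average")),
    ("dobj", (3, "causes"), (6, "lifespan")),
]


def expand(head, deps):
    """Return the head noun with its nn/amod modifiers, in sentence order."""
    tokens = [head] + [d for rel, g, d in deps if g == head and rel in ("nn", "amod")]
    return " ".join(word for _, word in sorted(tokens))


def simple_triples(deps):
    """Yield (subject, verb, object) for every verb with both nsubj and dobj."""
    subjects = {g: d for rel, g, d in deps if rel == "nsubj"}
    objects = {g: d for rel, g, d in deps if rel == "dobj"}
    for verb in sorted(subjects.keys() & objects.keys()):
        yield expand(subjects[verb], deps), verb[1], expand(objects[verb], deps)


print(list(simple_triples(DEPS)))
# [('Heart attack', 'causes', 'reduced average lifespan')]
```

Relations such as ccomp, conj and neg would need additional handling, which is where rules for complex relationships come in.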
2.2 Formal Specifications for Validation and Representation
As briefly mentioned in the previous chapter, a structured information extraction task re-
quires a formal specification in order to be able to make sense of the extracted constructs
(entities and relationships). There are various representation mechanisms utilized for
expressing a domain description; however, one of the most widely used is an ontology specification.
We give a brief overview of this next.
2.2.1 Domain Ontology, Instances and Knowledge Base
Definition 1: (Ontology) An ontology is a structure O = (R, C) such that:
• The sets R and C are disjoint and their elements are called relations and concepts re-
spectively.
• The elements of R induce a strict partial order ≺ on the elements in C. This ordering, of
the form c_i ≺ c_j with c_i, c_j ∈ C, is called the concept hierarchy.
Since the ontology merely defines the concepts and relationships in the domain, for a
process intending to extract domain-specific information, it is highly desirable that there be a
specification of assertions that may occur in the domain.
As an example, consider a sentence within the Sports domain - Sachin Tendulkar scored
200 runs. An ontology for Sports may have a concept Cricketer and a relation scoredRuns
with domain Cricketer and range Numeric. However, unless there is another specification that
asserts Sachin Tendulkar as a Cricketer, an information extraction system cannot correctly
extract this as a domain-specific construct.
With this motivation, we define the notion of a domain ontology with instances (a.k.a.
knowledge base), which is a combination of a domain description (in the form of an ontology
with concepts and relations) and certain concrete assertions (instances of those concepts) about
it.
Definition 2: (Domain Ontology with Instances) A domain ontology with instances is defined
with respect to an ontology O, as a structure DOI = (O, I) such that:
• I is a set, whose elements are called instances.
• There exists a function h : I −→ P(C), where P(C) is the power-set of the set of concepts
for the ontology O.
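Definitions 1 and 2 can be made concrete with a minimal Python sketch. The class names, the encoding of the concept hierarchy as (sub-concept, super-concept) pairs, and the extra concept Person are illustrative assumptions, not part of the thesis; only Cricketer, Numeric, scoredRuns and the instance Sachin Tendulkar come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """O = (R, C): disjoint sets of relation names and concept names."""
    relations: set
    concepts: set
    # Assumed encoding: direct (sub-concept, super-concept) pairs of the hierarchy.
    hierarchy: set = field(default_factory=set)

    def __post_init__(self):
        if self.relations & self.concepts:
            raise ValueError("R and C must be disjoint")

    def is_subconcept(self, ci, cj):
        """True if ci precedes cj in the transitive closure of the hierarchy pairs."""
        frontier, seen = [ci], set()
        while frontier:
            current = frontier.pop()
            for sub, sup in self.hierarchy:
                if sub == current and sup not in seen:
                    if sup == cj:
                        return True
                    seen.add(sup)
                    frontier.append(sup)
        return False


@dataclass
class DomainOntologyWithInstances:
    """DOI = (O, I); the function h : I -> P(C) is stored as a dict of sets."""
    ontology: Ontology
    h: dict  # instance name -> set of concept names

    def __post_init__(self):
        for instance, concepts in self.h.items():
            if not concepts <= self.ontology.concepts:
                raise ValueError(f"{instance} is mapped outside C")

    @property
    def instances(self):
        """The instance set I is the domain of h."""
        return set(self.h)


# "Person" is a hypothetical super-concept added only to show the hierarchy.
sports = Ontology(
    relations={"scoredRuns"},
    concepts={"Person", "Cricketer", "Numeric"},
    hierarchy={("Cricketer", "Person")},
)
kb = DomainOntologyWithInstances(sports, {"Sachin Tendulkar": {"Cricketer"}})
```

Storing h as a dictionary makes the instance set I implicit (its keys), which keeps the two parts of Definition 2 consistent by construction.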
2.2.2 Validation Process
Given a sentence as input, the process of information extraction extracts certain words
from the sentence, which it identifies as possible information candidates. In this thesis, we
always refer to them as candidate information constructs. These candidates are then validated
against the input domain description specified as an ontology. Upon validation, they become
information constructs ready to be represented using a formalism. We describe the validation
process in detail when we present our frameworks in Chapter 3.
2.3 Generic Representation
Information extraction systems utilize various formalisms to represent the information con-
structs. In this section, we present a brief overview of one of the popular representation
mechanisms, specified by W3C.
2.3.1 Resource Description Framework (RDF)
RDF utilizes the idea of representing statements in subject-predicate-object format. Each
such realization is called an RDF triple. As an example, let us reconsider the
example from Section 2.2.1. Once the information extraction process extracts and validates the
constructs from the sentence - Sachin Tendulkar scored 200 runs, this extracted information
can be simply represented in the triple format as - Sachin Tendulkar-scoredRuns-200.
RDF also provides advanced capability to make assertions about statements instead of
entities. This process is called RDF Reification [41]. We make use of this capability to
represent our complex relationships described in Chapter 3.
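The triple and its reified form can be sketched with plain Python tuples. The sketch below uses the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object); the namespace http://example.org/sports# and the names stmt1, reportedBy and SomeSource are hypothetical and serve only as illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/sports#"  # hypothetical namespace for the example


def triple(subject, predicate, obj):
    """A statement in subject-predicate-object form."""
    return (subject, predicate, obj)


def reify(statement, statement_id):
    """Describe a statement with the four triples of the RDF reification vocabulary."""
    subject, predicate, obj = statement
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]


# Sachin Tendulkar-scoredRuns-200 as a triple.
fact = triple(EX + "SachinTendulkar", EX + "scoredRuns", "200")

# The reified statement can itself be the subject of another triple.
reified = reify(fact, EX + "stmt1")
# reportedBy and SomeSource are made-up names, not taken from the thesis.
about_fact = triple(EX + "stmt1", EX + "reportedBy", EX + "SomeSource")
```

Because the reified statement has its own identifier, an assertion such as a source or an opinion holder can be attached to the statement rather than to either of its entities.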
2.4 Problem Definition
With this background, we can now define the problem of structured information extraction
from free text in a somewhat concrete manner as follows.
Definition 3: Given a text fragment, $Z$, consisting of sentences $\{T_i\}$ with word-sets $\{W_i\}$,
a domain description captured in an ontology $O(R, C)$, a set of instances $Y$, and a function
$h : Y \longrightarrow P(C)$, where $P(C)$ is the power-set of $C$, structured information extraction on $Z$ is
defined as a process that must do the following:
1. Determine a set $T_{CTR}$ of candidate triples using an extraction algorithm.
2. Validate $T_{CTR}$ to find a set $K = \{(s_i, p_i, o_i)\}$ of triples with respect to $O(R, C)$, $Y$ and $h$.
3. Represent triples in K using a suitable representation mechanism.
This process is a hybrid of informalism (free text, NLP, extraction algorithms, etc.) and
formalism (ontology-based validation, etc.). In the next chapter, we discuss our formulation of
each of these steps and make this hybrid notion more clear. We now give a brief overview of
related work.
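Step 2 of Definition 3 can be illustrated with a deliberately simplified sketch. The acceptance test used here (the predicate is in R and the subject is a known instance of the predicate's domain concept) and the signature table are assumptions made for illustration; the validation process actually used in this work is described in Chapter 3. The range concept is not checked in this sketch.

```python
def validate(candidates, relations, h, signature):
    """Keep candidate triples whose predicate is in R and whose subject is an
    instance of the predicate's domain concept (an assumed, simplified test)."""
    valid = []
    for subject, predicate, obj in candidates:
        if predicate not in relations or predicate not in signature:
            continue
        domain, _range = signature[predicate]  # range is ignored in this sketch
        if domain in h.get(subject, set()):
            valid.append((subject, predicate, obj))
    return valid


relations = {"scoredRuns"}
# Hypothetical table: relation -> (domain concept, range concept).
signature = {"scoredRuns": ("Cricketer", "Numeric")}
h = {"Sachin Tendulkar": {"Cricketer"}}

candidates = [
    ("Sachin Tendulkar", "scoredRuns", "200"),
    ("Joe", "scoredRuns", "10"),            # Joe is not a known instance
    ("Sachin Tendulkar", "likes", "tea"),   # "likes" is not in R
]
validated = validate(candidates, relations, h, signature)
```

Only the first candidate survives, which mirrors the observation in Section 2.2.1 that the assertion of Sachin Tendulkar as a Cricketer is what makes the extraction domain-specific.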
2.5 Related Work
There have been varied efforts around the problem of information extraction from text.
We now discuss different approaches to the information extraction task and
distinguish our work in the context of the present state of the art. We also present an overview
of representative approaches to complex relationship extraction from text, since that is our main
motivation for formulating a novel framework for information extraction from text.
2.5.1 Different Flavors of Information Extraction from Text
There are two main approaches to information extraction distinguished by whether they rely
on manual engineering to discover extraction rules by inspection of a corpus, or use statistical
methods to learn rules from annotated corpora. The key advantage of the former approach is
that it does not require large amounts of training corpora (which is often expensive to acquire)
and that the best performing systems have been known to be hand-crafted [35]. On the flip
side, the rule formation process is often laborious and in case of domain-specific rules (for
instance, rules that are tightly coupled with the ontology specification), domain adaptation
may require significant re-configuration. In the case of the latter, domain portability (in principle)
is relatively straightforward due to automatic rule induction; however, training data may be
hard to acquire, and changes in domain specification may require re-annotation of the entire
training data.
Due to the availability of well-represented descriptions of concepts, relations and instances
in certain domains, like Biomedicine [39], [38], there has been significant effort on extracting
domain-specific information from text. Recently, there have also been large-scale efforts [37]
to extract entity mentions, facts and/or events from text. Although statistical approaches
to information extraction have shown decent results for entity mentions, such as identifying
gene names in Biomedical literature [14], they are not as effective in case of relationship
identification due to lack of annotated test corpora. This problem is further aggravated in
case of complex relationships (e.g., Joe reports that Jim is a reliable employee) that occur
in implicit and non-standard form as we shall see in the next chapter. As a result, there
has been considerable focus on rule-based techniques that, despite being labor-intensive
and requiring a lot of manual formulation, prove to be more effective and transparent in
capturing the semantic criteria. Further, systems that utilize rule-based approaches, in a
domain-independent manner, are easier to extend using a modular design.
As indicated above, rule-based approaches can be broadly divided into two genres based
on whether the rule-formulation is tightly coupled with the domain description or not. The author
of [18] presented an approach to build a dictionary of extraction patterns, called concepts or
conceptnodes. However, the extraction patterns are triggered by domain-specific triggering
words. Similarly, [19] and [20] generate multi-slot extraction rules and learn extraction pat-
terns respectively, however both require a specification of at least one domain-specific word.
Although [21] allows more expressive power than [20], it still relies on exact word matches
to some extent. Besides inducing rules coupled with the domain in some manner, all these
approaches operate at a different level of granularity than our approach in that we are inter-
ested in extracting a subset of nested complex relationships (e.g., Joe reports that Jim is a
reliable employee) that are completely independent of a domain and can be entirely identified
by looking at a dependency graph or parse tree of a sentence. In some cases, we use trigger
words; however, they are language constructs (like prepositions, conjunctions, etc.) and not
specific to a domain.
Authors in [13] proposed an ontology-based approach to extract relationships and com-
pound entities from Biomedical text using rules that operate on parse trees. They further
suggested an unsupervised approach [16] to joint extraction of compound entities and relation-
ships using information theoretic measures. Although the structured information extraction
framework that we have proposed in our work is similar to [13] in that we also utilize a rule-
based algorithm for entity and relationship identification and extraction, followed by validation
against an ontology, the purpose of our work is quite different due to (i) the focus on complex
relationship structures (e.g., Joe reports that Jim is a reliable employee), and (ii) formulation of
rules in an entirely domain-independent manner. Similarly, [15] proposed a rule-based system
to extract regulatory mechanisms from Biomedicine literature and [17] presents an ontology-
based information extraction tool that extracts data with the aid of context words defined in
the ontology. There are many similar efforts, such as [22], [23] that focus on domain-specific in-
formation extraction. Although these approaches capture many relationship forms and suggest
a general paradigm of relationship extraction using rules, all of them are broadly motivated
by extraction purposes specific to their domain of discourse. In contrast, our rule-framework
is meant to capture complex relationships of very general form that have correlations with
linguistic structures and not with any specific domain of discourse.
2.5.2 Complex Relationship Extraction from Text
In this section, we give an overview of representative approaches within the field of complex
relationship extraction from text. A comprehensive survey of relation extraction mechanisms is
given in [36]. In line with the approaches described above for the larger problem of information
extraction, some of the common supervised approaches to relationship extraction formulate the
problem as that of classification in a discriminative framework (naturally requiring positive and
negative training examples). A key issue with these approaches is that they are difficult to extend
to new relationship types (especially complex ones) due to the lack of labeled data.
As an alternative, semi-supervised approaches to the task of relationship extraction
from text are often based on pattern induction (or slot filling), with the implicit assumption
that terms belonging to a common linguistic context would occur in relationships with certain
common semantics. For example, [24], [30] make use of this paradigm to learn taxonomic rela-
tions and induce patterns respectively. Other popular systems using semi-supervised approach
to relationship extraction include [33], [34]. Further, there are rule-based systems that fall
within the genre of semi-supervised approaches (as mentioned in the previous subsection) that
make use of extraction-rules on intermediate syntactic structures, like dependency graphs and
parse trees, to achieve similar goals.
For complex relationships, the general statistical (or machine learning) approach is to factor
them into binary relationships, train binary classifiers for capturing relatedness of entities
and then form complex relationships using related entities and the binary relationships they
participate in. For example, a representative approach described in [31] utilizes this approach
to identify binary relationships in news text. [32] utilizes similar notions and a scoring metric
on maximal cliques to discover complex relationship instances.
In most of these approaches, the working definition of a complex relationship has been taken
as any n-ary relation in which some of the arguments may be unspecified. However, as we
illustrate in the next chapter, not all candidates in our identified set of complex relationships
conform to this definition. We intend to capture such cases when, on surface, we may have a
case of binary relationship, however its arguments are not simple or compound entities, but
another relationship instance. Modeling this problem within the discriminative framework
raises intricate issues, such as non-uniform feature spaces in case of an entity being related
to a fact (combination of entities). Although there has been work on designing kernels to
capture non-linear feature spaces, we still run into the same issue of manually labeling corpora
for appropriate test-set creation. Thus, to handle this flavor of complexity, we resorted to the
rule-based approaches based on intermediate syntactic structures.
It may be noted that, apart from the differences pointed out above, our work can be distinguished
from the general task of relationship extraction from text in that we are formulating
a structured information extraction framework, and relationship extraction is one of its many
parts (entity and relationship extraction, semantic validation and representation).
CHAPTER 3. A NOVEL FRAMEWORK FOR EXTRACTING
INFORMATION FROM TEXT
In the previous chapter, we introduced the necessary background to enable us to begin
discussion of our proposed framework for information extraction from text. In this chapter, we
describe our work in detail. We start with an overview of our approach, and proceed towards
an elaboration of each of the individual steps in the overall process.
3.1 Approach Overview
As per the problem definition in Section 2.4 (Definition 3), our overall methodology for
information extraction from text is divided into three main parts: (i) processing the text
sentence-by-sentence, generating the parse
tree and dependency graph for each sentence and applying extraction rules to identify and ex-
tract candidate relationships and entities, (ii) validating these candidates against the domain
ontology model and optionally enriching the model by new concepts and relationships that
capture these candidates in a consistent manner, and (iii) representing the validated candidates
in RDF notation leveraging the existing W3C syntax specification. We now address each
of these parts in detail.
3.2 Text Processing and Generation of Derivative Structures
As indicated in the previous chapter, we intend to utilize both dependency graphs and
parse trees in our approach to information extraction. The dependency graphs would help in
extraction of complex, nested and implicit relationships, and parse trees can be leveraged for
simpler relationships or whenever a finer analysis of individual entities is necessary.
For these purposes, we wanted to utilize a third-party Parsing library, which could provide
both these structures out-of-the-box. After a brief evaluation of some of the popular parsing
libraries [28], [26], [29], we decided to go ahead with Stanford Parser, because of its flexibility,
ease of usage, speed and precision.
As a first step, we utilize the Stanford Parser to get the dependency graph and parse tree
representations of a candidate sentence. These structures are then consumed by the extraction
algorithm to extract candidate relationships and entities in accordance with the rule framework
described in the next section.
Figure 3.1 Dependency Graph with Application of a Simple Extraction
Rule
3.3 Composite Rule Framework for Entity and Relationship Extraction
An extraction rule is a simple statement encapsulating a set of premises and consequents.
By definition, a rule warrants the execution of actions defined in the consequents whenever the
conditions defined in the premises hold. For simplicity of explanation, we will use action and
consequent interchangeably. Similarly, we will use condition and premise interchangeably.
For instance, an informally specified extraction rule for the dependency graph in Figure 2.1
can be - If “labels nsubj, dobj occur along a path in the graph”, then “extract that path as an
information construct”. This rule, when utilized by an information extraction algorithm, would
result in the extraction of - “Causes-reduced-lifespan” as a candidate information construct as
illustrated in Figure 3.1. With this intuitive notion in mind, we formally define the terminology
that we will be using throughout the chapter.
3.3.1 Terminology
We utilize the following notations to describe the framework and algorithms that we have
formulated.
$p_i$ : $i$-th condition or premise for a rule (defined below).
$c_j$ : $j$-th action or consequent for a rule, corresponding to a set of premises $\{p_i\}$.
$G(V, E)$ : A dependency graph with vertex-set $V$ and edge-set $E$.
$G_S(V')$ : Subgraph of $G$ induced by the vertex-set $V'$.
$D$ : A set of labels denoting the typed dependency relations defined for the English language
(refer to Section 2.1.2).
$l : E \longrightarrow D$ : A label function that assigns a specific label from the set $D$ to each edge of the
graph $G$.
$P_{\{e_i\}}$ : A labeled path in the graph $G$, comprising the edge-set $\{e_i\}$.
The information extraction rules, utilized by the algorithms that we will be eventually
describing, are of the following form:
Definition 4: (Extraction Rule) For a dependency graph $G$, we define an extraction rule as
$r_i : \{p_i\} \longrightarrow \{c_i\}$, meaning: if $\{p_i\}$ holds, perform $\{c_i\}$.
This generic definition of extraction rule can have various realizations based on the usage
scenario and complexity of the relationships encountered in the information extraction task
being performed, one of which is described below for illustration.
$r_i : \{\exists P_{\{e_i\}} \mid \{l(e_i)\} = D',\ D' \subset D\} \longrightarrow \{\text{Extract all the vertices associated with the edge-set } \{e_i\}\}$
Here, $r_i$ encodes the rule that if there exists a specific set of dependency labels (defined by
the set $D'$, without considering their order) along some path in the dependency graph of
the given sentence, then the sequence of nodes (which represent words in the corresponding
sentence) forms an information construct and thus, the algorithm is recommended to extract
them.
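The rule realization above can be sketched in Python as a search for a path whose edge labels, taken as a set, equal the premise set D'. The edge list is a hand-written toy graph (not the output of a parser and not the graph of Figure 3.1), edges are traversed in either direction, and each label is assumed to occur at most once along the path; all three are simplifying assumptions.

```python
def find_labeled_path(edges, wanted):
    """Return the vertices along the first simple path whose edge labels,
    taken as a set (order ignored), equal `wanted`; None if no path exists.

    `edges` holds (governor, dependent, label) tuples and is treated as
    undirected for the purpose of walking a path.
    """
    adjacency = {}
    for governor, dependent, label in edges:
        adjacency.setdefault(governor, []).append((dependent, label))
        adjacency.setdefault(dependent, []).append((governor, label))

    def extend(path, labels):
        if len(labels) == len(wanted) and set(labels) == wanted:
            return path
        if len(labels) >= len(wanted):
            return None
        for neighbour, label in adjacency[path[-1]]:
            # Only follow edges that contribute a still-missing wanted label.
            if neighbour not in path and label in wanted and label not in labels:
                found = extend(path + [neighbour], labels + [label])
                if found:
                    return found
        return None

    for start in adjacency:
        found = extend([start], [])
        if found:
            return found
    return None


# Toy dependency edges for "Sachin Tendulkar scored 200 runs" (written by hand).
edges = [
    ("scored", "Tendulkar", "nsubj"),
    ("scored", "runs", "dobj"),
    ("Tendulkar", "Sachin", "nn"),
    ("runs", "200", "num"),
]
construct = find_labeled_path(edges, {"nsubj", "dobj"})
candidate = "-".join(construct) if construct else None
```

The consequent of the rule corresponds to returning the vertices of the matching path, which are then joined into a candidate information construct.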
3.3.2 Identifying and Defining the Relationship Types
In this section, we try to clearly identify and define the subset of complex relationships
that we intend to capture by our rules. Since relationships cannot exist without the entities
that they are associating, we also give a description of the entities that we expect to identify
and extract.
Given an English language sentence, t, we define the complexity of the relationship(s)
expressed between the entities in t, by analyzing the type, nature and number of dependencies
that exist within the words of t.
3.3.2.1 Simple Relationships
Intuitively, a simple relationship is expected to have a single verb connecting two entities,
which may or may not have a modifier. It is not expected to have internal clauses, implicit
dependencies, multiple subjects or objects. Formally, we say that a sentence contains a simple
relationship if all of the following conditions hold for the dependency graph G generated from
the sentence T :
• It contains no more than one subject and object each. This further implies that it has
at most one dependency of type nsubj and one from the set {dobj, pobj}.
• It does not contain any clause-level dependencies, conjunctions, or a clausal subject.
Further, it can only contain noun-compound, or adjectival modifiers. In terms of Stan-
ford dependencies, this implies that it does not contain any dependencies from the set
{ccomp, xcomp, acomp, compl, conj, csubj, csubjpass}. Also, it can only have {amod,
quantmod, nn} as modifiers.
It is conceded that due to the extremely diverse nature of English language sentence
construction, there may be many sentences that satisfy our intuitive notion of simple
relationships and yet, they may not be captured by the conditions stated above. We are
indeed cognizant of this fact and thus, do not claim the above conditions to be exhaustive
in nature. We arrived at these conditions following an experimental approach in which
we manually analyzed the dependency graphs emerging from varied sentences and tried
to determine common patterns. However, for the purposes of this thesis, we limit the
scope of definition of simple relations to the conditions above.
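The two conditions for a simple relationship can be checked directly on the multiset of dependency labels of a sentence. In the sketch below the label sets are copied from the conditions above; the heuristic for deciding which labels count as modifiers (any label ending in "mod", plus nn) is an assumption introduced only for this illustration.

```python
from collections import Counter

# Label sets copied from the conditions stated above.
CLAUSAL = {"ccomp", "xcomp", "acomp", "compl", "conj", "csubj", "csubjpass"}
OBJECTS = {"dobj", "pobj"}
ALLOWED_MODIFIERS = {"amod", "quantmod", "nn"}


def is_simple_relationship(labels):
    """Check the simple-relationship conditions on a list of edge labels."""
    counts = Counter(labels)
    if counts["nsubj"] > 1:                           # at most one subject
        return False
    if sum(counts[label] for label in OBJECTS) > 1:   # at most one object
        return False
    if CLAUSAL & set(counts):                         # no clause-level dependencies
        return False
    # Assumed heuristic: a label is a modifier if it ends in "mod" or is "nn".
    modifiers = {label for label in counts if label.endswith("mod") or label == "nn"}
    return modifiers <= ALLOWED_MODIFIERS


simple = is_simple_relationship(["nsubj", "dobj", "nn", "num"])
nested = is_simple_relationship(["nsubj", "ccomp", "nsubj"])
adverbial = is_simple_relationship(["nsubj", "dobj", "advmod"])
```

As noted above, these conditions are not exhaustive, so a sentence rejected by such a check is not necessarily complex in the intuitive sense.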
3.3.2.2 Complex Relationships
Intuitively, the set of complex relationships should be expected to capture every relation
that is not simple. However, for purposes of this thesis, we limit the scope of complex rela-
tions to the following specific cases. For ease of understanding, we illustrate these cases with
descriptive examples.
Figure 3.2 Example for a Relationship with Internal Clauses
1. Case when the relationship has internal clauses
This type of relationship is distinguished by the fact that it has a main subject that
refers to an internal clause through a verb. Such an internal clause is often interpreted
as the object of the verb that it is dependent upon. These relationships make for very
interesting candidates as far as information extraction from text is concerned because
they implicitly encapsulate relationships between entities and facts, instead of two simple
entities. We illustrate the essence of this notion through the example shown in Figure
3.2. The sentence, taken from an online article¹, makes an assertion about Macintosh PCs
(Macs). An information extraction system, operating in the PC and Technology domain,
may capture this assertion as an information construct about the entity, Macintosh PC.
However, it is clear that the assertion is an opinion, expressed by another Technology
vendor (Microsoft). For a generic information extraction system, it is important to
capture this inter-clausal dependency and thus, we selected this relationship type in our
subset of complex relationships.
¹ http://www.macworld.com/article/139691/2009/03/
Before we go further into our analysis using the example above, it is important to note
that it captures the general structure of this relationship type comprehensively. This is
since the verb relating the main clause with the inner clause (in this case, says) may as
well be any other verb, or modified verb. Similarly, the main clause (in this case the
main subject) may as well be any other noun, modified noun, or a pronoun (we perform
simple pronoun resolution in our algorithm). Further, we do not constrain the inner
clause in any way. Thus, this example, although specific in nature, captures any such
generic sentence with an inter-clausal relationship.
One of the most troubling issues in handling natural language texts is its varying nature
leading to different representations that all imply the same fact. For instance, here are a
few other representations of the sentence in Figure 3.2, in the order of increasing difficulty
with respect to information extraction task:
(a) “That Macs are too cool for its customers, says Microsoft ad.”
(b) “Microsoft ad says: Macs are too cool for its customers.”
(c) “Macs are too cool for its customers –Microsoft ad.”
(d) “Hey, listen to what Microsoft ad says.. Macs are too cool for its customers !...”
(e) “Macs are too cool for its customers.. I do not say this, Microsoft ad does.. see for
yourself.”
These instances made us quickly realize that it would be highly improbable to formulate
rules to identify complex relationships out of sentences that contain extraneous words,
which are not directly related to the main theme captured in the sentence (c, d and e
above). This is because, unlike simple relationships, where we can base our analysis
on the main (unique) verb, complex relationships can have multiple verbs, subjects and
objects with complex and implicit dependencies that can be hard to identify and extract
when there are extraneous words, since these extraneous words introduce
dependencies that are not directly related to the underlying information captured in the
sentence.
Thus, we specifically restrict the scope for our rule-coverage to the forms captured by
cases a and b above, apart from the main sentential form illustrated in Figure 3.2. The
other motivation for the restriction of scope is that we were able to identify suitable
Stanford dependency relations for each of these cases, which formed the basis of our rule
formulation.
2. Case when modifiers implicitly qualify the meaning of the relationship
In this type of relationship, there is a main clause whose meaning is either being qualified
by a prepositional modifier, or the presence of the modifier gives an entirely different
context to the main clause. These relationships are again interesting candidates for
information extraction since they contain qualified information, which may or may not be
true on its own at the time of extraction, however their extraction can lead to interesting
new discoveries at a later point. This is because there may be a point in the future when the
previously qualified piece of information becomes true without the qualification and it
may be possible to assign a certain credibility to the source that had originally claimed
the qualified information (the source claiming the qualified information could have been
captured using the extraction rules for the relationship type 1).
Figure 3.3 Example Illustrating the Intuitive Motivation behind Case 2
As an example, let us consider the sentence - AnonymousSportingNewsChannel claims that
Sachin Tendulkar may score a double-hundred with high probability. It may be observed
that this sentence contains a main clause (AnonymousSportingNewsChannel claims that)
referring to an internal clause. The internal clause can be captured by extraction rules for
relationships of case 1. Further, if we are able to identify and extract the internal clause
(Sachin Tendulkar may score a double-hundred) with the qualification (high probability),
we would be able to develop interesting insights about the source
(AnonymousSportingNewsChannel) at a later point in time as discussed above. This idea is concisely
illustrated in Figure 3.3.
Figure 3.4 Example for a Relationship with a Qualifying Modifier
With this motivating example in mind, we decided that it is important to capture the
qualified structure expressed in such relationships, and thus, we selected this relationship
type as the second case for our subset of complex relations. In the rest of this section, we
will be referring to the sentence shown in Figure 3.4.
Before we go further into our analysis using the example above, we need to make sure that
the example captures the general structure of this relationship type. Unfortunately, this
case is not as straightforward as the previous one and we need to consider some variations.
First of all, we would need to generalize the connector for the modifying qualifier (in this
example, with). The most obvious choice is to consider the set of all common prepositions.
However, we need to only consider those that can (i) have an associated adjective
that can act as the value of the qualification (in this case, high), (ii) be relevant in
the sense of capturing a relational modification, and (iii) be supported by consistent Stanford
typed dependency labels. With these observations and for the sake of reduced overall
complexity, we restricted our set to the following prepositions - {with, of, in, for}.
Further, after performing some experiments to understand the effect of verb-clause (in
this example, may score) variation, we found that there are subtle variations in the de-
pendency graphs generated from very similar sentences. For instance, all of the following
sentences lead to slightly varying dependency graphs when compared with our original
sentence of Figure 3.4:
(a) “There is a high probability that Sachin Tendulkar may score a double-hundred.”
(b) “Sachin Tendulkar visits USA with his entire family.”
(c) “Sachin Tendulkar scores a hundred with absolute masterclass.”
In contrast, if we keep the verb-clause unchanged, we get similar dependency graphs,
even though we may vary the overall representation of the main sentence. For instance,
all of the following sentences lead to similar dependency graphs:
(a) “With high probability, Sachin Tendulkar may score a double-hundred.”
(b) “Sachin Tendulkar, in superb form, led the charge.”
(c) “Sachin Tendulkar, with high probability, may score a double hundred.”
Since we utilize dependency graphs to formulate our rules, the case with same verb-clause
presents no real challenge. In fact, it results in better sentential coverage for our rules.
However, to handle the case of varying dependency graphs, we had to slightly generalize
our rules so that they could accommodate subtle variations in dependency graphs. We
elaborate on this in the following subsection when we discuss the rule construction.
3. Case where multiple relations are formed by coordinating conjunctions
This type of relationship represents all the sentences with at least one coordinating
conjunction (and, but, or, yet, for, nor, so). For reducing the complexity of the representation step, we only
handle the case with the and-conjunction.
Figure 3.5 Example with a Simple Placement of Conjunction
Figure 3.6 Example with a Complex Placement of Conjunction
The simplest case of this relationship occurs when the conjunction separates two simple
or complex clauses. In this case, the extraction task simply requires the separation of
the sentence about the conjunction, followed by usual processes required for individual
simple/complex clauses (refer Figure 3.5). However, in many other cases, the second
clause is dependent on the subject or predicate of the first clause. This is illustrated
in Figure 3.6. The latter case requires special handling with respect to the information
extraction task. Further, these relationships generally contain two or more of the other
relationship types described in this section and thus, we thought it to be relevant to
include them in our set of complex relationships.
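The qualifier pattern of case 2 (a preposition from {with, of, in, for} whose object carries an adjective acting as the value of the qualification) can be sketched as a chain of prep, pobj and amod edges. This is only an illustration of the idea: the edge list is written by hand rather than produced by a parser, words are used directly as node identifiers (so repeated words would collide), and the function is not the rule formulated in this work.

```python
QUALIFYING_PREPOSITIONS = {"with", "of", "in", "for"}


def find_qualifiers(edges):
    """Return (modified word, preposition, adjective value, qualifier noun) tuples
    for every prep -> pobj -> amod chain headed by a qualifying preposition."""
    children = {}
    for governor, dependent, label in edges:
        children.setdefault(governor, []).append((dependent, label))

    results = []
    for governor, preposition, label in edges:
        if label != "prep" or preposition not in QUALIFYING_PREPOSITIONS:
            continue
        for noun, noun_label in children.get(preposition, []):
            if noun_label != "pobj":
                continue
            for adjective, adjective_label in children.get(noun, []):
                if adjective_label == "amod":
                    results.append((governor, preposition, adjective, noun))
    return results


# Hand-written edges for
# "Sachin Tendulkar may score a double-hundred with high probability".
edges = [
    ("score", "Tendulkar", "nsubj"),
    ("Tendulkar", "Sachin", "nn"),
    ("score", "may", "aux"),
    ("score", "double-hundred", "dobj"),
    ("double-hundred", "a", "det"),
    ("score", "with", "prep"),
    ("with", "probability", "pobj"),
    ("probability", "high", "amod"),
]
qualifiers = find_qualifiers(edges)
```

For the example sentence, the adjective high is recovered as the value of the qualification probability attached to the verb score.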
Complex relationship types 1 and 2 are often encountered in complex literature (such as
Biomedicine) and an effective approach for such domains is to interpret the internal clause
as a single complex object in a modified form. Although this may work for certain highly
specialized domains, as we saw through the examples above, it can easily overlook some inter-
esting information constructs that our approach would be able to extract. We make this more
concrete in chapter 5, where we do some evaluations and report interesting results achieved
from utilizing our approach.
Having clearly defined the type and nature of relations that we intend to extract, we now
turn our attention to the rules that can identify and extract them. For ease of analysis and
discussion, we logically group the rule formulation by complex and simple relationship types.
3.3.3 Formulating Rules for Complex Relationship and Entity Extraction
In 3.3.2.2, we observed that most of the complex relationships are characterized by implicit
and explicit dependencies between parts of the sentence. Since such dependencies are nicely
captured by the dependency graph (refer to Chapter 2 for an overview of dependency graphs), we
focus on it to formulate our rules. For this, we follow the methodology described below.
1. Refer to the dependencies described in section 2.1.2 and choose the ones that fit the
structure for each type of complex relationship described in previous subsection.
2. Form conditions based on whether a dependency label (or a sequence of labels) is found
along a set of edges in the dependency graph.
3. Form actions by deciding which nodes to extract as constructs.
4. Express premises and consequents to form the extraction rule.
We now apply these steps for each of the complex relationship types.
3.3.3.1 Rule Formulation for Case 1
Figure 3.7 Dependency Graph for the Sentence in Figure 3.2
1. For this case, we base our rule on two main clausal dependencies from the overall set of
Stanford dependency relations - Clausal complement (ccomp) and Parataxis (parataxis)
(for a definition of these dependencies, refer to Chapter 2, Section 2.1.2). Clausal complement
captures the structure of the main relationship form illustrated in Figure 3.2 (as well as
form (a) of case 1), and Parataxis captures form (b) of case 1.
2. For understanding the conditions based on which we would extract constructs, let us observe
the dependency graph for the sentence in Figure 3.2, as illustrated in Figure 3.7.
The main observation to be made is that ccomp nicely captures the clause-level depen-
dency between the verbs of the sentence.
Figure 3.8 Extraction Rule Application for a Relationship with Internal
Clauses
3. Further, we want to capture the information that the main subject (Microsoft ad) has
a relation (says) with the composite object, which is an inner clause and thus, has its
own subject (Macs), relation (coolness), and object (customers). In case of the main
subject, we extract the modifiers as well as the main subject. This is easily achieved by
looking for the noun compound modifier dependency (nn) or quantitative phrase modifier
(quantmod) in our extraction rule.
4. Thus, we would like to extract the following constructs for this case:
$pred_1$ = {Node with two outgoing edges with labels “nsubj” and “ccomp”}
$sub_1$ = {$Node_1$ = Node that is connected to the $pred_1$ node by an edge with label “nsubj”,
Node connected to $Node_1$ by an edge with label “nn” or “quantmod”}
In a similar manner, we would extract the constructs when the parataxis dependency appears
in the graph instead of ccomp. We do not illustrate this with an example since it generates
a dependency graph very similar to the one shown in Figure 3.7, except that
ccomp is replaced by parataxis.
We leave out the constructs within the inner clause for now since it only comprises a
simple relationship, which can be easily extracted using the rules for simple relationships
described later in this section.
Expressing this formally using first-order logic notation, the general rule-set for this case
is described as follows.
Extraction Rule 1: (Extraction Rule for Relationships with Internal Clauses) Given a
dependency graph G(V, E) with a label function l, for an English-language sentence T ,
the information extraction rule-set to identify and extract the complex relationship with
internal clauses as described in case 1, is given as,
• $r^{RIC}_1 : \{\exists u, v, w \in V, \exists e_1(u, v), e_2(v, w) \in E \mid l(e_1) = \text{“nsubj”} \cap (l(e_2) = \text{“ccomp”} \cup l(e_2) = \text{“parataxis”})\} \longrightarrow \{pred_1 = \{v\}, sub_1 = \{u\}\}$
• $r^{RIC}_2 : \{\exists u, v, w, t \in V, \exists e_1(u, v), e_2(v, w), e_3(u, t) \in E \mid l(e_1) = \text{“nsubj”} \cap (l(e_2) = \text{“ccomp”} \cup l(e_2) = \text{“parataxis”}) \cap l(e_3) \in \{\text{“nn”}, \text{“quantmod”}\}\} \longrightarrow \{sub_1 = sub_1 \cup \{t\}\}$
The result of applying this rule to the sentence in Figure 3.2 is shown in Figure 3.8.
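The rule-set for relationships with internal clauses can be sketched in Python as follows. This is only an illustrative sketch: the function name `rule_internal_clause` and the edge list are assumptions, and the edges are hand-built and loosely modeled on the constructs named in this case (Microsoft ad, says); they are not the actual parser output behind Figure 3.7. Edges are (u, v, label) triples following the e(u, v) convention of the formal rules.

```python
CLAUSAL = ("ccomp", "parataxis")
MODIFIERS = ("nn", "quantmod")

def rule_internal_clause(edges):
    """Apply the two rules for internal clauses to (u, v, label) triples."""
    results = []
    for (u, v, label) in edges:
        if label != "nsubj":
            continue
        # First rule: v must also have an outgoing ccomp or parataxis edge.
        if not any(s == v and l in CLAUSAL for (s, _, l) in edges):
            continue
        sub = [u]
        # Second rule: add modifiers attached to the subject head u.
        sub += [t for (s, t, l) in edges if s == u and l in MODIFIERS]
        results.append({"pred": v, "sub": sub})
    return results

# Hypothetical, hand-built edges (assumption, not real parser output).
edges = [
    ("ad", "says", "nsubj"),
    ("ad", "Microsoft", "nn"),
    ("says", "coolness", "ccomp"),
    ("Macs", "coolness", "nsubj"),
]
constructs = rule_internal_clause(edges)
```

Only the outer clause matches here; the inner clause has no clausal edge of its own and is left to the rules for simple relationships, as described above.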
3.3.3.2 Rule Formulation for Case 2
1. For this case, we base our rule on two modifier dependencies: Prepositional modifier
(prep) and Adjectival modifier (amod) (for a definition of these dependencies, refer to
Chapter 2, Section 2.1.2). Prepositional and Adjectival modifiers capture the modifying
qualifier and value of qualification respectively.
Figure 3.9 Dependency Graph for the Sentence in Figure 3.4
2. To understand the conditions based on which we would extract the information
constructs, we again observe the dependency graph for the sentence in Figure 3.4, as
illustrated in Figure 3.9. The main observation to be made is that prep identifies a
qualification associated with the main clause and amod identifies the value (in this case,
the degree).
Recall that we need to take care of subtle differences in dependency graphs with respect
to verb variations. This variation concerns the placement of the edge labeled prep,
viz. it may be connected to any of the three head-nodes in the main clause (subject,
predicate or object). To account for this, we simply ignore the placement of prep.
We observe that as long as we have identified a prep–amod pattern in the graph, we can
get all the qualifying information that we need to be able to perform extraction of the
qualified relationship.
3. Now, for reasons that will be clear when we get to the validation and enhancement
steps, we intend to capture the following information - “There is a qualified relationship
with the subject Sachin Tendulkar, relationship may score, object double century, and
probability high”. Similar to the previous case, we extract the modifiers as well as the
main subject/object by looking for any modifier dependency (nn, quantmod, etc.) in our
extraction rule.
4. In all, we would like to extract the following constructs for this case:
$pred_1$ = {Node with two outgoing edges with labels “nsubj” and “dobj”}
$sub_1$ = {$Node_1$ = Node that is connected to the $pred_1$ node by an edge with label “nsubj”,
Node connected to $Node_1$ by an edge with label “nn” or “quantmod”}
$obj_1$ = {$Node_1$ = Node that is connected to the $pred_1$ node by an edge with label “dobj”,
Node connected to $Node_1$ by an edge with label “nn” or “quantmod”}
$qual_1$ = {Node with two edges with labels “prep” and “amod”}
$val_1$ = {Node that is connected to $qual_1$ by an edge with label “amod”}
Figure 3.10 Extraction Rule Application for a Relationship with Qualifying
Modifiers
Extraction Rule 2: (Extraction Rule for Relationships with Qualifying Modifiers) Given
a dependency graph G(V, E) with a label function l, for an English-language sentence T ,
the extraction rule-set to identify and extract the complex relationship with qualifying
modifiers as described in case 2, is given as,
• $r^{RIC}_1 : \{\exists u, v, w, x, y \in V, \exists e_1(u, v), e_2(v, w), e_3(w, x), e_4(x, y) \in E \mid l(e_1) = \text{“nsubj”} \cap l(e_2) = \text{“dobj”} \cap l(e_3) = \text{“prep”} \cap l(e_4) = \text{“amod”}\} \longrightarrow \{pred_1 = \{v\}, sub_1 = \{u\}, obj_1 = \{w\}, qual_1 = \{x\}, val_1 = \{y\}\}$
• $r^{RIC}_2 : \{\exists u, v, w, x, y, t \in V, \exists e_1(u, v), e_2(v, w), e_3(w, x), e_4(x, y), e_5(u, t) \in E \mid l(e_1) = \text{“nsubj”} \cap l(e_2) = \text{“dobj”} \cap l(e_3) = \text{“prep”} \cap l(e_4) = \text{“amod”} \cap l(e_5) \in \{\text{“nn”}, \text{“quantmod”}\}\} \longrightarrow \{sub_1 = sub_1 \cup \{t\}\}$
• $r^{RIC}_3 : \{\exists u, v, w, x, y, t \in V, \exists e_1(u, v), e_2(v, w), e_3(w, x), e_4(x, y), e_5(w, t) \in E \mid l(e_1) = \text{“nsubj”} \cap l(e_2) = \text{“dobj”} \cap l(e_3) = \text{“prep”} \cap l(e_4) = \text{“amod”} \cap l(e_5) \in \{\text{“nn”}, \text{“quantmod”}\}\} \longrightarrow \{obj_1 = obj_1 \cup \{t\}\}$
The result of applying this rule to the sentence in Figure 3.4 is shown in Figure 3.10.
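A minimal Python sketch of the rule-set for qualified relationships is given below. It is a sketch under stated assumptions: the function names and the edge list are hypothetical, and the edges are hand-built around the Sachin Tendulkar example rather than taken from the parser output of Figure 3.9. The sketch follows the relaxation described in item 2, so the prep edge may leave any of the three head nodes, whereas the formal rule writes it as $e_3(w, x)$.

```python
MODIFIERS = ("nn", "quantmod")

def targets(edges, src, labels):
    """Nodes reached from `src` by an edge whose label is in `labels`."""
    return [d for (s, d, l) in edges if s == src and l in labels]

def rule_qualified(edges):
    """Extract (pred, sub, obj, qual, val) from (u, v, label) triples."""
    out = []
    for (u, v, label) in edges:
        if label != "nsubj":
            continue
        for w in targets(edges, v, ("dobj",)):
            # Placement of prep is ignored: try subject, predicate, object.
            for head in (u, v, w):
                for x in targets(edges, head, ("prep",)):
                    for y in targets(edges, x, ("amod",)):
                        out.append({
                            "pred": v,
                            "sub": [u] + targets(edges, u, MODIFIERS),
                            "obj": [w] + targets(edges, w, MODIFIERS),
                            "qual": x,
                            "val": y,
                        })
    return out

# Hypothetical, hand-built edges (assumption, not real parser output).
edges = [
    ("Tendulkar", "score", "nsubj"),
    ("Tendulkar", "Sachin", "nn"),
    ("score", "may", "aux"),
    ("score", "century", "dobj"),
    ("century", "double", "nn"),
    ("score", "probability", "prep"),
    ("probability", "high", "amod"),
]
qualified = rule_qualified(edges)
```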
3.3.3.3 Rule Formulation for Case 3
In the case of relations with conjunctions, we do not formulate extraction rules based on
dependency graphs. This is because conjunctions are words that connect parts of a sentence and
thus do not necessarily have an implicit or explicit dependency on any but their immediate
successor and predecessor in the sentence. In such a scenario, use of a dependency graph can
be avoided.
Instead, we take care of this as an auxiliary case in our extraction algorithm itself by
analyzing the structure of the parse tree representation. We look at the parse of the
part of the sentence to the right of the conjunction, and apply the rules described below.
Note that these are not similar to the extraction rules that we have been using, since in this
case we refer to the already extracted constructs from the left clause and utilize
them directly for extracting information from the right part. In this process, if we find that
the right part does not have any references from the left, we simply utilize our pre-existing
rule-framework for extracting information from it as if it were a distinct sentence, and thus we
do not require any extraction rules for this case at all.
The rules we follow for analyzing conjunctions are as follows:
• If the parse tree for the right part contains a simple declarative clause (S ), treat it as a
distinct sentence and utilize the pre-existing rule-framework for extracting information.
• If the parse tree contains a verb phrase (VP) and noun phrase (NP), append the extracted
subject of the left part to the right part and treat it as a distinct sentence and utilize
the pre-existing rule-framework for extracting information.
• If the parse tree contains only a noun phrase (NP), append the extracted subject and
the predicate of the left part to the right part and treat it as a distinct sentence and
utilize the pre-existing rule-framework for extracting information.
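The three rules above can be sketched as a small dispatch function in Python. The function name, its parameters, and the example sentences are hypothetical and serve only as an illustration. The sketch assumes that the set of phrase labels found in the parse of the right part is already available, and it places the borrowed subject and predicate in front of the right part so that the result reads as a sentence; that placement is an assumption of this sketch.

```python
def complete_right_part(right_tokens, right_labels, cached_sub, cached_pred):
    """Turn the part right of a conjunction into a stand-alone sentence.

    right_labels: set of phrase labels in the parse tree of the right part.
    cached_sub, cached_pred: constructs already extracted from the left part.
    """
    if "S" in right_labels:
        # Already a declarative clause: process it as a distinct sentence.
        return list(right_tokens)
    if "VP" in right_labels and "NP" in right_labels:
        # Verb phrase without a subject: borrow the subject of the left part.
        return list(cached_sub) + list(right_tokens)
    if "NP" in right_labels:
        # Bare noun phrase: borrow both the subject and the predicate.
        return list(cached_sub) + list(cached_pred) + list(right_tokens)
    return None  # no rule applies

# "John bought apples and sold oranges" (hypothetical example)
vp_case = complete_right_part(["sold", "oranges"], {"VP", "NP"}, ["John"], ["bought"])
# "John bought apples and oranges" (hypothetical example)
np_case = complete_right_part(["oranges"], {"NP"}, ["John"], ["bought"])
```

The order of the checks matters, since a verb phrase usually contains a noun phrase and the bare noun phrase case must therefore be tested last.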
3.3.4 Formulating Rules for Simple Relationships
In case of simple relationships, we extract the following constructs:
$pred_1$ = {Node with two outgoing edges with labels “nsubj” and “dobj”}
$sub_1$ = {$Node_1$ = Node that is connected to the $pred_1$ node by an edge with label “nsubj”,
Node connected to $Node_1$ by an edge with label “nn” or “*mod”}
$obj_1$ = {$Node_1$ = Node that is connected to the $pred_1$ node by an edge with label “dobj”,
Node connected to $Node_1$ by an edge with label “nn” or “*mod”}
Here, “*mod” is a shorthand notation, used only for this case, to denote any modifier type
dependency.
Expressing these rules formally using first-order logic notation, the general rule-set for this
case is described as follows.
Extraction Rule 3: (Extraction Rule for Simple Relationships) Given a dependency graph
G(V, E) with a label function l, for an English-language sentence T , the extraction rule-set to
identify and extract the simple relationship as described in section 3.3.2.1, is given as,
Algorithm 1 Extracting Candidate Information Constructs
procedure ExtractConstructs(p, G, R, l)
    rawConstructs = CALL ExecuteRules(G, R)
    if flag_clausal is true then
        Break p about pred_1 into p_l with node-set V_1 and p_r with V_2
        G_S(V_2, E_2) = Subgraph of G induced by V_2
        innerConstructs = CALL ExtractConstructs(p_r, G_S, R, l)
        for all element in innerConstructs do
            Form an outerConstruct using (sub_1, pred_1, element)
            Add outerConstruct to the list of outerConstructs
        end for
        Cache first element from outerConstructs
        return outerConstructs
    else if flag_conj is true then
        Break p about conj_and into p_f with node-set V_1 and p_s with V_2
        G_S1(V_1, E_1) = Subgraph of G induced by V_1
        G_S2(V_2, E_2) = Subgraph of G induced by V_2
        outerConstructs_1 = CALL ExtractConstructs(p_f, G_S1, R, l)
        Cache first element from outerConstructs_1
        if p_s contains ‘S’ then
            outerConstructs_2 = CALL ExtractConstructs(p_s, G_S2, R, l)
        else if p_s contains ‘VP’ and ‘NP’ then
            Append sub from the Cache to p_s
            outerConstructs_2 = CALL ExtractConstructs(p_s, G_S2, R, l)
        else if p_s contains ‘NP’ then
            Append sub and pred from the Cache to p_s
            outerConstructs_2 = CALL ExtractConstructs(p_s, G_S2, R, l)
        end if
        Add outerConstructs_1 and outerConstructs_2 to outerConstructs
        return outerConstructs
    else if flag_enrich is true then
        outerConstructs = CALL PerformEnrichments(rawConstructs, ‘qualify’, l)
        return outerConstructs
    else
        Create a construct using (sub_1, pred_1, obj_1) and Cache to resolve any pronouns
        Add the construct to outerConstructs
        return outerConstructs
    end if
• $r^{RIC}_1 : \{\exists u, v, w \in V, \exists e_1(u, v), e_2(v, w) \in E \mid l(e_1) = \text{“nsubj”} \cap l(e_2) = \text{“dobj”}\} \longrightarrow \{pred_1 = \{v\}, sub_1 = \{u\}, obj_1 = \{w\}\}$
• $r^{RIC}_2 : \{\exists u, v, w, t \in V, \exists e_1(u, v), e_2(v, w), e_3(u, t) \in E \mid l(e_1) = \text{“nsubj”} \cap l(e_2) = \text{“dobj”} \cap l(e_3) \in \{\text{“nn”}, \text{“*mod”}\}\} \longrightarrow \{sub_1 = sub_1 \cup \{t\}\}$
• $r^{RIC}_3 : \{\exists u, v, w, t \in V, \exists e_1(u, v), e_2(v, w), e_3(w, t) \in E \mid l(e_1) = \text{“nsubj”} \cap l(e_2) = \text{“dobj”} \cap l(e_3) \in \{\text{“nn”}, \text{“*mod”}\}\} \longrightarrow \{obj_1 = obj_1 \cup \{t\}\}$
An example of the application of only $r^{RIC}_1$ is shown in Figure 3.1.
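The rule-set for simple relationships, including the “*mod” shorthand, can be sketched in Python as follows. The names and the example edges are hypothetical, and the sketch assumes that “*mod” can be read as "any dependency label ending in mod" (for example amod or quantmod); this reading of the shorthand is an assumption of the sketch.

```python
def is_modifier(label):
    """Treat "nn" and any label ending in "mod" as a modifier dependency."""
    return label == "nn" or label.endswith("mod")

def rule_simple(edges):
    """Extract (pred, sub, obj) from (u, v, label) triples."""
    out = []
    for (u, v, l1) in edges:
        if l1 != "nsubj":
            continue
        for (s, w, l2) in edges:
            if s == v and l2 == "dobj":
                # Subject and object heads, followed by their modifiers.
                sub = [u] + [t for (a, t, l) in edges if a == u and is_modifier(l)]
                obj = [w] + [t for (a, t, l) in edges if a == w and is_modifier(l)]
                out.append({"pred": v, "sub": sub, "obj": obj})
    return out

# Hypothetical, hand-built edges for "Young engineers build electric cars".
edges = [
    ("engineers", "build", "nsubj"),
    ("engineers", "Young", "amod"),
    ("build", "cars", "dobj"),
    ("cars", "electric", "amod"),
]
simple = rule_simple(edges)
```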
For the sake of simplicity in the validation and representation framework, we do not extract and
represent relationships with negation modifiers in this work. We simply perform a negation check
by looking for the presence of the neg dependency and, if it is found, we exclude the sentence
from further processing.
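The negation check amounts to a simple filter over the edge labels of each dependency graph. A minimal Python sketch, with hypothetical function names and hand-built example graphs given as (u, v, label) triples, is:

```python
def has_negation(edges):
    """True if any edge of the dependency graph carries the neg label."""
    return any(label == "neg" for (_, _, label) in edges)

def graphs_to_process(graphs):
    """Keep only the graphs of sentences that pass the negation check."""
    return [g for g in graphs if not has_negation(g)]

# Hypothetical, hand-built graphs (assumption, not real parser output).
plain = [("dogs", "chase", "nsubj"), ("chase", "cats", "dobj")]
negated = plain + [("chase", "not", "neg")]
kept = graphs_to_process([plain, negated])
```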
The rule framework described in this section is used by Algorithm 1 when it calls
ExecuteRules. Next, we discuss our validation framework.
3.4 Semantic Validation Framework
For this section, we split our discussion into two parts. In the first, we detail
the steps taken to validate the extracted information constructs using the
domain ontology model. In the second, we describe the special case where we handle
enrichments for information constructs that were extracted from sentences comprising certain
special relationship cases (such as qualified relationships).
3.4.1 Performing Validation against the Domain Ontology Model
Our basic approach (for the advanced approach, refer to Section 3.4.3) for validating candidate
information constructs (subject, predicate, object) against a domain description captured in an
ontology model and a set of instances is as follows:
1. Find an instance match for the subject and the object. For determining these matches, we
perform simple syntactic comparisons sequentially on the entire set of subject (similarly
object) candidates starting with the extracted head $sub_1$ or $obj_1$.
2. If a match for subject and object is found, find a matching property for the predicate
in the domain ontology model. For this, we perform syntactic comparisons on domain
ontology relationships and the predicate.
3. If we are able to find these matches, we check whether the class concepts to which the
instances for the subject and the object are asserted lie, respectively, in the domain and
range of the matched property.
If all these conditions hold, we add the construct (instance for subject, property, instance
for object) to the set of validated constructs. The above conditions are captured formally in
the validation criteria specified below.
Validation Rule: For a text fragment $Z$ consisting of sentences $\{T_i\}$ with word-sets $\{W_i\}$, a
set $T_{CTR}$ of candidate constructs extracted by an extraction algorithm, a domain description
captured in an ontology $O(R, C)$, a set of instances $Y$, a function $h : Y \longrightarrow P(C)$, and a
mapping $F$ from the set $W$ to $R \cup Y$ that is able to type the words in the sentence to an
instance in $Y$ or a relationship in $R$ (whenever such a mapping is intuitive based on the domain
of discourse), the validation process must result in a set $K = \{(s_i, p_i, o_i)\}$ of 3-tuples
$(s_i, p_i, o_i)$ (validated constructs) s.t. the following holds:
$\{\exists y_{1i}, y_{2i} \in Y, \exists r_i \in R \mid (y_{1i}, r_i, y_{2i}) \in K \iff (\exists w_{1i}, w_{2i}, w_{3i} \in W_i, c_{1i}, c_{2i} \in C \mid \{w_{1i}, w_{3i}, w_{2i}\} \in T_{CTR} \cap (w_{1i}, y_{1i}) \in F \cap (w_{2i}, y_{2i}) \in F \cap (w_{3i}, r_i) \in F \cap c_{1i} \in h(y_{1i}) \cap c_{2i} \in h(y_{2i}) \cap c_{1i} \in Domain(r_i) \cap c_{2i} \in Range(r_i))\}$
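The basic validation steps can be sketched in Python against a toy ontology. Everything named here is an assumption made for illustration: the `validate` function, the dictionary-based ontology, and the reading of "simple syntactic comparison" as case-insensitive string equality. The dictionary `instances` plays the role of $h$ (instance to set of classes), and `properties` maps each relationship to its domain and range class sets.

```python
def validate(construct, instances, properties):
    """Return a validated (subject, property, object) triple, or None.

    construct: (subject_candidates, predicate, object_candidates), with the
    extracted head first in each candidate list.
    """
    subs, pred, objs = construct

    def first_match(candidates):
        # Step 1: compare candidates sequentially, starting with the head.
        return next((c.lower() for c in candidates if c.lower() in instances), None)

    s, o = first_match(subs), first_match(objs)
    if s is None or o is None:
        return None
    # Step 2: find a matching property for the predicate.
    p = pred.lower()
    if p not in properties:
        return None
    # Step 3: some class of the subject must lie in the domain and some
    # class of the object in the range of the matched property.
    domain, range_ = properties[p]
    if instances[s] & domain and instances[o] & range_:
        return (s, p, o)
    return None

# Hypothetical toy ontology (assumption).
instances = {"tendulkar": {"Cricketer"}, "century": {"Score"}}
properties = {"score": ({"Cricketer"}, {"Score"})}
ok = validate((["Tendulkar", "Sachin"], "score", ["century", "double"]),
              instances, properties)
```

Swapping subject and object makes step 3 fail, so the construct is rejected even though all three matches of steps 1 and 2 are found.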
3.4.2 Discussion on Basic Validation
In our experiments, we found that the condition in step 3 causes the rejection of many triples,
even in cases where the extracted construct actually made sense as per the domain. The reason
for this is an incomplete description of the domain as captured by the ontology. While in most
cases it is best to reject potential constructs based on this condition, there may be a need to
behave opportunistically and have some flexibility in these criteria, especially in cases where
we find a matching property but do not find a match for the subject or the object. Similarly,
there may be cases where we find matching instances for the subject and the object, but not the
predicate. We consider these cases along with some other enrichments in the next subsection.
3.4.3 Performing Enrichments
In this section, we describe the process² that we undertake for handling the constructs
extracted from relationships with qualifying modifiers as well as other enrichments that help
improve the performance of our extraction algorithm.
1. In the case of qualifications, we create new definitions in the ontology model to account for
the qualifying relationships. In this case, Algorithm 1 invokes the enrichment
module before the validation is performed on the constructs.
2. We also handle a few selective cases where one of the matches in the basic validation
approach described in 3.4.1 does not work. For these cases, the enrichment module is
invoked after the validation process determines that an enrichment is required.
Thus, overall, the enrichment process needs to ensure that it (i) invokes the validation
module to validate the constructs passed to it depending on the case it is handling (ii) enriches
² This part of the validation framework is yet to be included within our SEMANTIXS system (refer to Chapter
4) at the time of writing this thesis.
Algorithm 2 Performing Validation and Enrichments