Master thesis on Cognitive Systems and Interactive Media
Universitat Pompeu Fabra
The Language of Art and Architecture
José Pablo Umaña
Supervisor: Leo Wanner
Co-Supervisor: Alexander Shvets
July 2021
Contents

1 Introduction 1
    1.1 Fundamentals 1
    1.2 Problem Statement 3
    1.3 State of the Art 4
        1.3.1 Pattern-based approach 4
        1.3.2 Statistical/Distributional approaches 6
        1.3.3 Neural approaches 7
    1.4 Concluding Remarks 10
2 Methods and Experiments 11
    2.1 Research Question 11
    2.2 Hypothesis 12
    2.3 General Methodology Applied 12
    2.4 Relations to be tackled 14
    2.5 Experimental design and set-up 15
        2.5.1 Matching the blanks methodology 15
        2.5.2 Model implementation and Architecture 17
    2.6 Procedures used to obtain data and results 18
        2.6.1 Entity pairs sets 18
        2.6.2 Generation of datasets for training, testing and fine-tuning 19
        2.6.3 Data Split and Shuffle 22
3 Results 24
    3.1 Overview 24
    3.2 Training with Gigaword 24
    3.3 Fine tuning with Dezeen 28
4 Discussion and Conclusions 31
    4.1 Discussion of the Results 31
    4.2 Limitations 34
    4.3 Future work 35
List of Figures 36
List of Tables 37
Bibliography 38
Abstract
Relation Extraction (RE) is considered one of the most promising areas of study in
Natural Language Processing because of its potential in the automation of labeling
and classification of entities in texts. Different approaches have been proposed in
academia over the past thirty years, mostly relying on lexico-syntactic patterns,
distributional information and statistics. Recently, works using Neural Networks
and Transformer Models have proven to outperform most traditional techniques,
making them the recommended state-of-the-art when it comes to RE.
The research presented here aims at the conceptualization and evaluation of a Re-
lation Extraction model in the domain of Arts and Architecture, a domain with
a rapidly growing amount of publications online. The proposed methodology
attempts to identify and characterize the relations between those entities that typically
describe the properties of objects commonly found in texts under this domain. It
does so by making use of a novel RE strategy developed by Google Research,
whose implementation is trained, fine-tuned and assessed on self-generated
datasets.
The results and analysis at the end of this document reveal enormous potential
for developing a tool that could lead to the creation of a unified
taxonomy for the language of Arts and Architecture.
Keywords: Lexical Semantics; Computational Semantics; Relation Extraction;
Relation Classification; Lexical Ontology; Taxonomy; Design; Architecture.
Chapter 1
Introduction
This section provides a brief description of the problem that this thesis will aim to
undertake and covers the state of the art in algorithms and techniques that have
recently attempted to solve similar issues like the one that concerns this project.
1.1 Fundamentals
Text is probably the most widespread content on the Web. Every day, a considerable
amount of content is published in the form of digital text on platforms like social
media, blogs, research papers, news articles and online magazines. Vast volumes
of text remain today freely available on the web as unstructured data, meaning
the knowledge contained within has yet to be analyzed and properly extracted by
computational techniques capable of converting its human-readable format into a
more structured, machine-readable format.
With the advent of new voice recognition technologies, chat bots and virtual assis-
tants, it has become increasingly important to develop more and better approaches
to process natural language, which can be considerably challenging because human
languages are complex. One major problem is ambiguity, present in every language,
and which humans are much better at comprehending than computers. It is also
especially difficult for computers to capture contextual information, essential for
language understanding, at which humans again perform better than machines.
The work developed in this thesis lies at the core of Lexical Semantics,
where the study of relations between words takes place. Lexemes, the basic lexical
units of language, can relate to each other in meaning, and these relations can be
culturally influenced, varying across languages.
Relation Extraction (RE) is a mechanism in Computational Semantics that aims to
automate the process of identification and classification of relations between entities
in texts. A relation implies the existence of a well-defined relationship between two
or more concepts [1]. In the sentence A big blue ring surrounds the chamber it
is possible to identify [big]_SIZE, [blue]_COLOR, [ring]_OBJECT and [chamber]_PLACE
as concepts. The process of identifying and classifying these entities is known as
Concept Extraction.
A relation is defined in the form of a tuple t = (e_1, e_2, ..., e_n), where the
entities e_i are in a well-defined relationship r [2]. In the previous sentence,
several possible relations can be identified. Two attribute relations,
Is-of-color(blue, ring) and Is-of-size(big, ring), can be extracted. A relation
of functionality can also be established between 'surround' and 'ring' in the
form of Purpose-of(ring, surround). Other examples of possible relations in
texts include Located-At, Feature-Of, Made-Of and Mother-Of.
There is a vast number of relations and they vary from domain to domain. When a
relation tuple involves only two entities, t = (e_1, e_2), the relation is binary,
although there can be non-binary relations, often more challenging to study. In
general lexical organization, relations can be classified as Hierarchical,
Non-hierarchical and Congruence relations [3]. Hierarchical relations describe
associations in terms of proportionality and usually describe attributes and
taxonomies, as in the case of the Is-A and Part-Of relations, where entities are
linked by syntactic patterns like is a type of, is a kind of or is a way of.
Non-hierarchical relations, like antonymy, generally encapsulate opposition,
except in the case of synonymy, which establishes significantly similar semantic
content between entities. Finally, Congruence relations group those
involving inclusion, overlapping, disjunction and identity [3].
Relation Extraction is a component in systems for question answering, sentiment
analysis and ontology construction. The answer to the question What color is the
ring surrounding the chamber? can only be processed by having previously com-
puted the relation between blue and ring in the sentence. Biomedicine is one domain
that has taken advantage of modern NLP techniques with recent approaches in gene-
disease associations and protein interaction that heavily rely on content and relation
extraction of medical texts.
In the business intelligence domain, professionals often need to seek specific infor-
mation from news articles to help making their everyday decisions. Another possible
application is in the intelligence area, where analysts review large amounts of text to
search for information, such as on people involved in terrorism events, the weapons
used and the targets of the attacks.
1.2 Problem Statement
This thesis focuses on Relation Extraction in the domain of Art and Architecture.
With massive amounts of textual data online in these fields, there is a good op-
portunity to work on strategies to extract relevant information for both artists and
non-artists and in that way, set a precedent in knowledge modeling for this domain.
The main goal is to create a path to a unified taxonomy in Arts and Architecture
and lead the way in the construction of an ontology that encloses pertinent lexical
semantic information, an effort that registers no previous attempts in this particular
domain.
A scenario where such an ontology would be of practical use is the case of students
or recent graduates in Furniture Design who are beginning to conceptualize their
first projects and who might need a creative workbench to access organized sets
of materials, pieces, finishes or whole objects that are highly interrelated and that
can cohesively and faithfully convey the idea behind their work. Words like
wood, steel and concrete could be labeled under a Materials category with an
association to different objects in construction, such as wall. And so, inferences
like concrete is a material used in walls could be a fair suggestion available
to whoever may be looking for potential materials to design a wall.
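The kind of inference sketched above can be illustrated as a lookup over subject-predicate-object triples. The triples and the predicate name below are invented for illustration; they are not extracted data:

```python
# A minimal sketch of the kind of lookup such an ontology could support.
# The triples and the "is_material_of" predicate are illustrative examples.
triples = [
    ("concrete", "is_material_of", "wall"),
    ("steel", "is_material_of", "beam"),
    ("wood", "is_material_of", "wall"),
]

def materials_for(obj):
    """Return all materials linked to a given object, alphabetically sorted."""
    return sorted(s for s, p, o in triples if p == "is_material_of" and o == obj)

print(materials_for("wall"))  # ['concrete', 'wood']
```

A real ontology would of course hold many more predicates (parts, shapes, colors) and support chained queries, but the access pattern stays the same.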
A possible – and also expected – result of this is the ultimate comprehension of
the compositionality of the domain. Studying the relatedness of textual entities is
expected to lead to a clearer and better idea of how objects relate to each other.
Finally, a secondary goal of this project is to get closer to a better understanding
of how artists and architects are writing about their work online. This includes
the language they employ in their texts, the form, how figurative it is, or how
metaphorical it is.
1.3 State of the Art
Relation Extraction is generally considered a classification problem, typically
addressed via pattern matching, statistical methods and neural models. A summary
of these strategies and their most recent developments is presented below.
1.3.1 Pattern-based approach
In this approach, the main idea is to use sentence analysis tools to identify syntactic
elements in a text, then automatically construct pattern rules from these elements.
"Hearst Patterns" [4] were a pioneering effort at extracting relations using this
approach and have been one of the most influential approaches in RE. Six main
patterns were identified, corresponding to:
• NP_H such as {NP,}* {and | or} NP
• such NP_H as {NP,}* {or | and} NP
• NP {, NP}* , or other NP_H
• NP {, NP}* , and other NP_H
• NP_H , including {NP,}* {or | and} NP
• NP_H , especially {NP,}* {or | and} NP
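As an illustration, the first of these patterns can be roughly approximated with a regular expression over raw text. This is only a sketch: real implementations match parsed noun phrases, whereas here single words stand in for NPs, which misses multi-word phrases:

```python
import re

# Rough approximation of the "NP_H such as {NP,}* {and|or} NP" pattern,
# with single words standing in for noun phrases.
PATTERN = re.compile(
    r"(\w+)\s+such as\s+((?:\w+,\s*)*\w+(?:\s+(?:and|or)\s+\w+)?)"
)

def hearst_hypernyms(text):
    """Yield (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        hyponyms = re.split(r",\s*|\s+(?:and|or)\s+", m.group(2))
        pairs.extend((h, hypernym) for h in hyponyms if h)
    return pairs

print(hearst_hypernyms("materials such as wood, steel and concrete"))
# [('wood', 'materials'), ('steel', 'materials'), ('concrete', 'materials')]
```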
While this approach has been successful at detecting some examples of relations
like hypernymy, it is generally a tedious process and it is limited by the number
of patterns typically in place. Also, because of the limited amount of patterns for
consideration, it is very difficult to cover other syntactical ways generally employed
to express the same relation in a single sentence, resulting in an unwanted reduction
of recall in prediction.
Since the publication of Hearst patterns, researchers have made important efforts
to automate the labeling of such semantic relations and minimize the resource-
consuming labor of manually processing syntactic patterns by using, for instance,
dependency paths to represent patterns [5].
Dependency parsing provides both lexical information in a sentence and its syntactic
structure, giving a more complete piece of input for processing. In [6], Hearst
patterns were reformulated as dependency paths to accomplish a better match of
complex or ambiguous sentences using a combination of automated parsing in a
domain-specific corpus and a manual analysis of parsed sentences to finally designate
corresponding dependency patterns.
Pattern-based approaches are often implemented these days as a semi-supervised
method, with an unlabeled corpus and a few "seed" instances of the relations of
interest [1]. A bootstrapping algorithm is then run under two assumptions: given a
good set of patterns, a good set of tuples (entity pairs following a certain relation
type) can be found; and given a good set of tuples, a good set of patterns can be
learned.
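The two assumptions above can be sketched as an alternating loop. The corpus, the seed tuple, and the pattern-induction rule (taking the literal text between the two entities) are deliberately naive placeholders for illustration:

```python
def bootstrap(corpus, seed_tuples, rounds=2):
    """Alternate between inducing patterns from tuples and tuples from patterns.

    `corpus` is a list of sentences; a pattern is simply the literal text found
    between the two entities of a known tuple (a deliberately naive choice,
    using substring matching with no filtering of noisy candidates).
    """
    tuples, patterns = set(seed_tuples), set()
    for _ in range(rounds):
        # Assumption 1: a good set of tuples yields candidate patterns.
        for e1, e2 in list(tuples):
            for sent in corpus:
                if e1 in sent and e2 in sent:
                    i, j = sent.index(e1) + len(e1), sent.index(e2)
                    if i < j:
                        patterns.add(sent[i:j])
        # Assumption 2: a good set of patterns yields new tuples.
        for sent in corpus:
            for pat in patterns:
                if pat in sent:
                    left, _, right = sent.partition(pat)
                    e1, e2 = left.split()[-1], right.split()[0]
                    tuples.add((e1, e2))
    return tuples

corpus = ["a wheel is part of a car", "a wall is part of a house"]
print(bootstrap(corpus, {("wheel", "car")}))
# {('wheel', 'car'), ('wall', 'house')}
```

Real systems add confidence scoring of patterns and tuples at each round to keep the noisy candidates from drifting the relation ("semantic drift").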
A pattern-based bootstrapping framework for the bio-medicine domain was recently
published in [7].
1.3.2 Statistical/Distributional approaches
Unlike pattern-based methods, statistical methods are designed to involve less
human intervention and reduce the costs of pattern matching. A good number of
works have been published in recent years with a variety of techniques that differ in form
and strategy, some involving a combination of revisited older techniques with more
modern approaches like neural models.
Feature-based methods design lexical, syntactic and semantic features
for entity pairs and their corresponding context, and then feed these features into
relation classifiers. For instance, in [8], lexical, syntactic and semantic features are
described, such as word-based features, mention level, overlapping level and depen-
dency features. Relation classification among entity pairs using sub-trees mined
from syntactic structures as features of text was developed in [9].
Kernel-based methods have been widely studied as an alternative to improve per-
formance over feature-based ones. Kernel methods are non-parametric density es-
timation techniques that compute a kernel function between data instances, where
a kernel function can be thought of as a similarity measure. Given a set of labeled
instances, kernel methods determine the label of a novel instance by comparing it
to the labeled training instances using this kernel function [10]. A thorough exami-
nation of dependency paths as kernel, for RE, can be found in [11].
Unsupervised methods rely only on an NER tagger to identify entities and apply
context similarity computation to establish potential associations based on
co-occurrence, which is precisely what clustering-based techniques build on. This
strategy works under the Distributional Hypothesis, which states that words that
co-occur in the same context tend to present similar meanings [12]. Context
similarities among entity pairs are computed and grouped into clusters according
to the similarity values obtained, with each cluster representing a potential
relation between the
entities that it groups.
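The context-similarity computation at the heart of this strategy can be sketched with bag-of-words contexts and cosine similarity; the contexts below are toy examples rather than corpus counts:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy contexts for three entity pairs; a real system would aggregate these
# from every sentence in which the pair co-occurs.
contexts = {
    ("wheel", "car"): Counter("is a part of the".split()),
    ("wall", "house"): Counter("is a part of every".split()),
    ("blue", "ring"): Counter("the color of the big".split()),
}

sim = cosine(contexts[("wheel", "car")], contexts[("wall", "house")])
print(round(sim, 2))  # 0.8
```

Pairs with high mutual similarity, like the first two above, would end up in the same cluster, signalling a shared (here, part-whole) relation.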
The mechanisms to perform the computations for clustering differ from study to
study. A recent paper covers the extraction of relations in Food-Drug interactions
texts by grouping relations sharing similar representation into clusters [13]. Another
interesting unsupervised effort for hypernymy detection using distributional vector
embeddings can be found in [14].
In distant-supervised methods a large semantic database is used to automatically
generate training corpora, combining the best of supervised and unsupervised ap-
proaches. One major shortcoming is the incompleteness and wrong labeling problem
that can result out of automated generation of datasets, leading to false negatives
in results [15]. Current approaches are focused on alleviating this problem by using
Embedding-based strategies [16] or mention-level extraction, as in the case of Noise
Reduction methods, which assume that “If two entities participate in a relation, at
least one sentence that mentions those two entities might express that relation.”
[17].
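The distant-supervision labeling step itself can be illustrated with a small routine; the knowledge-base pairs and sentences below are invented for the example:

```python
# Distant supervision: any sentence mentioning both entities of a known
# knowledge-base pair is (noisily) labeled with that pair's relation.
kb = {("wheel", "car"): "Part-Of", ("blue", "ring"): "Color"}

def distant_label(sentences):
    """Return (sentence, e1, e2, relation) tuples for KB pairs found together."""
    labeled = []
    for sent in sentences:
        words = sent.lower().split()
        for (e1, e2), rel in kb.items():
            if e1 in words and e2 in words:
                labeled.append((sent, e1, e2, rel))
    return labeled

sents = ["The wheel of the car broke", "A big blue ring surrounds the chamber"]
print(distant_label(sents))
```

The noise discussed above is visible even here: a sentence could mention both entities without expressing the relation, which is exactly what noise-reduction methods try to compensate for.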
For training purposes, distant supervision techniques require large text corpora.
ClueWeb09¹ and ClueWeb12² are among the largest web datasets available today
with over 1 billion crawled web pages. Wikipedia pages have been used for automatic
data generation as seen in [18]. Other sources employed in previous works include
Freebase, The New York Times and CNN.
1.3.3 Neural approaches
Neural networks are considered to be the most promising state-of-the-art approach
in RE due to their significant and impressive results, as seen in studies performed
over the past five years.
Convolutional Neural Networks (CNN) are a type of feed-forward artificial neural
networks whose layers are formed by a convolution operation followed by a pooling
operation [19]. CNNs have been applied before in NLP tasks such as semantic pars-
ing [20], and in the case of RE, a combination of approaches can be found. In [21],
a distantly supervised approach for RE of causal information was developed using
¹ http://lemurproject.org/clueweb09/index.php
² http://lemurproject.org/clueweb12/index.php
a CNN of two channels, one for representing human-prior knowledge like linguistic
cues and causal relationships, and another for collected data. Another recent study
[22] used pre-trained CNNs for the distributional representation of words.
Attention-Based Neural Models mimic cognitive attention by
devoting computational power to specific important data. They have proven to be a
good complement to CNNs for relation extraction in recent publications [23]. Attention
mechanisms are often used in RE to alleviate the inner-sentence noise by performing
soft selections of words independently, as explained in [24].
Graph Neural Networks (GNN) are neural models that operate directly on graph
structures. Every node in the graph is associated with a label, and the goal is to
predict the labels of the nodes that have no data associated. GNNs are a suitable
way of representing data with complex relations and interdependencies.
Variants of GNNs such as graph convolutional networks (GCN), graph attention
networks (GAT) and graph recurrent networks (GRN) have demonstrated ground-
breaking performances in several learning tasks [25]. In [26], dependency trees are
used as input for a GNN that automatically learns how to selectively attend to the
relevant sub-structures useful for a RE task.
Transformer models consist of an encoder and a decoder (as in other
deep neural network approaches). Their basic architecture was first presented
in [27]. The encoder typically contains a set of multi-head attention layers that
give the model the ability to combine information from different representation
sub-spaces. Its outputs are fed either into other encoders or into decoders, depending
on the architecture. Transformers are designed to handle sequential input data,
such as natural language, for tasks like machine translation and text summarization.
However, they do not require that the sequential data be processed in order.
Recently, Transformer models have been combined with GNNs to create special
attention-guided layers for RE as seen in [28]. An example of a typical transformer
model taken from [27] can be found in Figure 1. This architecture is based solely on
attention mechanisms, dispensing with recurrence and convolutions entirely and has
proven to be remarkably good at translation tasks. In [29], a Transformer-based
relation extraction method was developed, replacing explicit linguistic features re-
quired by previous methods, with implicit features captured in pre-trained language
representations.
Figure 1: Basic Transformer Model Architecture proposed in [27].
Pre-trained Language Models (PTLM) are an effective strategy to learn the parame-
ters of neural networks, which are then fine-tuned on downstream tasks. Since most
NLP tasks are beyond word-level, it is natural to pre-train the neural encoders on
sentence-level or higher. This process provides a better model initialization, which
usually leads to a better generalization performance and speeds up convergence on
the target task [30]. It also helps to overcome the expenses of constructing large-
scale labeled datasets for training. Examples of PTLMs include CoVe [31], which
uses machine translation as a pre-training task, ELMo [32], GPT [33] and, most
notably, Google's BERT [34].
1.4 Concluding Remarks
Efforts in developing new approaches to RE have progressed from the traditional
lexico-syntactic pattern-based approaches through statistical methods to neural
approaches. Neural Models have proved to be a promising strategy for achieving
higher performance, and coupled with newer frameworks like Transformers and
pre-trained Language Models, the possibilities are wide open for better and more
accurate relation extraction tools. This trend is expected to continue in the
upcoming years.
Additionally, by choosing a novel approach that involves both Neural Models and
Transformers, as explained in the next chapter, the present study is on the right
path in the pursuit of a capable RE model for the language of Arts and
Architecture.
Chapter 2
Methods and Experiments
This chapter covers the research hypothesis, methodology, design and development
criteria and the experimental set-up of this thesis.
2.1 Research Question
Relation Extraction has been applied to several domains in research, but this has
not been the case for the Arts or Architecture domain.
The main idea behind an implementation of relation extraction in this domain is
to develop a first approach to the problem of understanding how artists and
designers are writing about their work in online publications.
This thesis aims at studying the classification of word relations in texts in this
domain as a primary approach to the stated goal. The relations that are part
of this study are: Meronymy, Hypernymy, Color, Material and Shape. A
detailed explanation of each of these relations is provided in Section 2.4.
By making use of a model that is capable of classifying relations between entities
commonly found in this domain of interest, this thesis aims at setting a new
framework for characterizing Arts and Architecture texts and potentially
describing how authors refer to the composition and specific features of objects.
2.2 Hypothesis
The following proposition was assumed as a premise for the present work:
It is possible to develop a neural model for content relation extraction in the language
of Arts and Architecture, and those relations can express how artists and architects
talk about the composition and specific features of objects.
By putting a novel model in RE under training, fine tuning and evaluation, this
thesis will potentially serve as a first step in the path to a unified taxonomy in Arts
and Architecture and ideally will lead to an eventual implementation of a tool that
could be harnessed by creators, designers and architects.
2.3 General Methodology Applied
The model on which this work is based is Matching The Blanks (MTB), developed
by Google Research [35]. This is a distributional implementation that approaches
RE via Transformer Networks and posits a new method for learning relation rep-
resentations directly from text. A careful review of this implementation and the
reasons why it was chosen for this project can be found in Section 2.5.
The work performed on the model can be summarized into the following steps:
•Model Pre-training.
•Model Training.
•Model Fine Tuning.
•Model Testing.
Pre-training. This was done using a CNN data dump instead of the Wiki
dump used in the original MTB implementation, also used in all phases of their
study. For this first stage of the methodology, the data dump was unlabeled and
did not contain any entity markers. The idea behind this is to have the Language
Model familiarized with terms and sentences and make it capable of outputting a
representation of any tokenized sentence later during training and fine-tuning.
Training. The second step consisted of training the model using encoded relations
in the domain of Arts and Architecture. Because of the lack of available datasets
in the relations that concern this project, a distantly-supervised technique for auto-
generating training datasets was applied for this stage, based on Gigaword¹. The
specifics of this technique and the characteristics of the resulting datasets are pro-
vided in Section 2.6.
Fine Tuning. Domain Adaptation through fine tuning of the model was carried
on as a third step to produce more relation classes. This was achieved by using a
smaller dataset taken from Dezeen², a well-known online magazine about design
and architecture. This dataset was provided by the project's supervisors, as it
is part of an ongoing research project.
A linear classifier on top of the Transformer Model handles the relation
classification task. This classifier is fed with relation representations; the
inner products between the unlabeled output representation from the Transformer
and those of the labeled representations from the annotated corpus are then
compared, and the relation class with the highest inner product is used as the
final prediction.
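This nearest-class lookup can be sketched with plain vectors; the labeled representations below are toy stand-ins for Transformer outputs:

```python
# Predict the relation class whose labeled representation has the highest
# inner product with the unlabeled representation. Vectors are toy values
# standing in for Transformer output representations.
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

labeled = {
    "Meronymy": [0.9, 0.1, 0.0],
    "Hypernymy": [0.1, 0.8, 0.2],
    "Color": [0.0, 0.2, 0.9],
}

def predict(representation):
    """Return the relation class maximizing the inner product."""
    return max(labeled, key=lambda rel: inner(labeled[rel], representation))

print(predict([0.85, 0.2, 0.1]))  # Meronymy
```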
Testing. An extra module was added to the selected MTB implementation to
automate inference validation. This code receives a training dataset, previously
generated using the already mentioned distantly-supervised technique, performs
inferences on each sentence and stores the results. Then a classification report
is run to compare the predicted relations against the actual relations. This
report includes information such as precision, recall and F1-score, among others.
This information was used to generate the plots found in Chapter 3.
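The per-class metrics in such a report can be computed directly from the predicted and actual labels, as in this sketch with invented labels:

```python
# Precision, recall and F1 for a single relation class, computed from
# predicted vs. actual labels as in the validation module described above.
def prf(actual, predicted, label):
    tp = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    fp = sum(1 for a, p in zip(actual, predicted) if a != label and p == label)
    fn = sum(1 for a, p in zip(actual, predicted) if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

actual = ["Color", "Meronymy", "Color", "Shape"]
predicted = ["Color", "Color", "Color", "Shape"]
print(prf(actual, predicted, "Color"))  # (0.666..., 1.0, 0.8)
```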
The pre-training and training stages have the intention of making the Language
Model capable of outputting relation representations of any kind of relations, and
¹ https://catalog.ldc.upenn.edu/LDC2011T07
² http://www.dezeen.com
in that way, provide the linear classifier with a broader set of representations that
will likely include the subset of relations that concern this thesis project.
2.4 Relations to be tackled
The relation extraction procedure to be implemented in this thesis solely focuses on
binary relations (those established between two entities) in the English language,
and targets the following relations:
Meronymy relation. Also known as the Part-Of relation, as it encapsulates the
differentiation of parts with respect to a whole between two entities. This
differentiation can be physical, functional, or take any other form or degree.
For instance, the word wheel can be linked as a part of the word car,
establishing a part-whole relation between the two items.
Hypernymy relation. Denotes a relation between a super-type entity and a sub-
type entity. Furniture has a hypernymy, or Supertype-Of, relation with sofa and
cupboard. This superordinate relation implies the existence of a subordinate one,
called hyponymy, which is not covered in this thesis.
Attribute relation. This is a set of different relations concerning specific and concrete
properties of an entity. It can be any qualitative feature that can be used to describe
the object in question. For the purpose of this thesis, only the following attributes
are considered:
•Color.
•Shape.
•Material.
Table 1 contains example sentences for each of these relations and their
corresponding RDF notation, for a better understanding of how entities relate
to each other in those cases.
Sentence | Relation | RDF Notation
Leaning listlessly against the wall of the house. | Meronymy | wall <is a part of> house
Several artists, including sculptors, will be at the exhibition. | Hypernymy | artist <is a super-type of> sculptor
Details like black shadows in the design add up to the proposed theme. | Color | shadow <is of color> black
Revamped with new rounded corners, the device was launched last fall. | Shape | corner <is shaped> rounded
Guestrooms include marble floors and king-sized beds. | Material | floor <is made of> marble

Table 1: Sentence examples for the five relations addressed in this project.
2.5 Experimental design and set-up
2.5.1 Matching the blanks methodology
Conceived as an extension of both Harris' distributional hypothesis to relations [12]
and recent projects on learning text representations, the MTB methodology
presented in [35] does an exceptional job in the challenging task of generating
relation representations solely from entity-linked text.
Put simply, the model's goal is to learn mappings from relation statements to
relation representations. In a sequence of tokens x = [x_0 ... x_n], a pair of span
markers s_1 and s_2 are defined, delimiting entity mentions in the sequence. Given
two relation statements r_1 and r_2 within the text, where r_i = (x, s_1, s_2), the
formulated hypothesis states that if both r_1 and r_2 contain the same entity pair
(s_1 and s_2), they should express the same s_1-s_2 relation.
The ultimate goal is to learn a function h_r = f_θ(r) that maps relation statements
to a fixed-length vector representing the relation expressed in x between the
entities marked by s_1 and s_2. The authors of the MTB approach make a bold
assertion by claiming that this function can be learned from widely available
distant supervision in the form of entity-linked text, a statement this thesis aims
to put under evaluation.
The MTB methodology includes several input/output strategies. The chosen
implementation, detailed in the next section, is based on the Entity Markers
strategy. The basic flow consists of augmenting the sequence of tokens x with the
introduction of four special tokens [E1start], [E1end], [E2start], [E2end],
resulting in:

x = [x_0 ... [E1start] x_i ... x_{j-1} [E1end] ... [E2start] x_k ... x_{l-1} [E2end] ... x_n]

where s_1 = (i, j), s_2 = (k, l), and 0 < i < j - 1, j < k, k ≤ l - 1, l ≤ n.

The augmented sequence x is used as input to the Transformer Model, with the
entity indices updated to s_1 = (i + 1, j + 1) and s_2 = (k + 3, l + 3) to account
for the four inserted tokens. In all six input/output schemes in the methodology,
the output representation from the Transformer model is fed into a fully connected
layer that either contains a linear activation or performs layer normalization on
the representation.
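The marker insertion and index update can be sketched directly on a token list, assuming s_1 precedes s_2 and that spans use exclusive end indices, as above:

```python
# Insert entity marker tokens around spans s1=(i, j) and s2=(k, l), assuming
# s1 precedes s2, and return the updated span indices: s1 shifts by the one
# marker inserted before it, s2 by three.
def add_entity_markers(tokens, s1, s2):
    (i, j), (k, l) = s1, s2
    out = (tokens[:i] + ["[E1start]"] + tokens[i:j] + ["[E1end]"]
           + tokens[j:k] + ["[E2start]"] + tokens[k:l] + ["[E2end]"]
           + tokens[l:])
    return out, (i + 1, j + 1), (k + 3, l + 3)

tokens = "a big blue ring surrounds the chamber".split()
marked, s1, s2 = add_entity_markers(tokens, (2, 3), (3, 4))
print(marked)
# ['a', 'big', '[E1start]', 'blue', '[E1end]', '[E2start]', 'ring',
#  '[E2end]', 'surrounds', 'the', 'chamber']
```

In the actual model the four marker strings would also be added to the tokenizer vocabulary so they receive their own embeddings.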
According to the authors, the MTB methodology achieves state-of-the-art results on
three relation extraction tasks (KBP-37, SemEval 2010, TACRED) and surpasses
human efficiency on few-shot relation matching. It also claims to be particularly
effective in low-resource architectures and to significantly reduce the amount of
human effort required to create relation extractors.
2.5.2 Model implementation and Architecture
The authors of the original MTB implementation did not make their source code
available; however, there are a few third-party implementations available on
GitHub, one of which has been chosen as the basis of the experimental design of
this thesis.³ The reasons for this decision have to do with the completeness of the
application and the input/output strategy implemented. An online article⁴ by the
application's developer explains in detail the procedures followed to produce the code.
Much like the original implementation, the proposed set-up consists of a Transformer
Language Model (BERT) as main component and a linear classifier stacked on top
of its output hidden layers, as seen in Figure 2. It makes use of the Entity
Markers input/output strategy, so in an input sequence x, [E1] and [E2] markers
are employed to mark the positions of their respective entities in order for BERT
to know exactly which ones will be tackled. The output hidden states of BERT at
the [E1] and [E2] token positions are joined as the final output representation of
x, which is then used, along with those from other relation statements, for loss
calculation, such that the output representations of two relation statements with
the same entity pair should have a high inner product.
Supporting the paper's hypothesis that a function hr = fθ(r), learned to output a
representation of any relation statement r, can then be reused for any downstream
task, the code stacks a linear classifier on top of the Transformer model. Formally,
this classification layer is a matrix W ∈ R^(K×H), where H is the size of the relation
representation and K is the number of relation types. The classification loss is the
standard cross entropy of the softmax of hrW^T with respect to the true relation
type.
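The classification head described above amounts to one dot product per relation type followed by a softmax cross entropy. A minimal sketch in plain Python (illustrative only; the real implementation operates on BERT hidden states, and the dimensions and numbers below are made up):

```python
import math

def classify(h_r, W):
    """Score a relation representation h_r (length H) against K relation types.

    W is a K x H matrix, so the logits h_r . W^T are one dot product per row.
    """
    return [sum(h * w for h, w in zip(h_r, row)) for row in W]

def softmax_cross_entropy(logits, true_idx):
    """Standard cross entropy of the softmax w.r.t. the true relation type."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by max for stability
    log_probs = [z - m - math.log(sum(exps)) for z in logits]
    return -log_probs[true_idx]

# Toy sizes H = 4, K = 3 (hypothetical values, for illustration only).
h_r = [0.5, -1.0, 2.0, 0.1]
W = [[0.2, 0.0, 0.1, 0.0],
     [1.0, 0.5, 0.3, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
logits = classify(h_r, W)          # ~[0.3, 0.6, 2.0]
loss = softmax_cross_entropy(logits, true_idx=2)
```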
For the correct operation of the model, all text files for training and pre-training
must follow a specific format, as shown in the example sentences contained in
Listing 2.2.
3https://github.com/plkmo/BERT-Relation-Extraction
4https://towardsdatascience.com/bert-s-for-relation-extraction-in-nlp-2c7c3ab487c4
Figure 2: Proposed Architecture in chosen implementation: BERT as Transformer
Model and a stacked linear classifier for relation classification.
2.6 Procedures used to obtain data and results
2.6.1 Entity pairs sets
In the original MTB implementation [35], a large, open-domain dataset was used
to train the BERT model. Such training can be computationally expensive because
of the dimensions of the set; for that reason, this project made use of smaller
hand-picked sets of entity pairs that encapsulate the relations of interest:
• For meronymy, a set5 of 49.848 Part-Of pairs in CSV format.
• For hypernymy, a set6 of 120.339 hypernymy pairs in CSV format.
• For Color, a set of 35 colors in JSON format. See Listing 2.1.
• For Shape, a set of 45 shapes in JSON format.
• For Material, a set of 114 material names in JSON format.
5https://allenai.org/data/haspartkb
6https://www.kaggle.com/duketemon/hypernyms-wordnet
{
  "Color": ["aquamarine", "beige", "black", "blue", "brown", "burgundy",
    "celadon", "crimson", "cyan", "fuchsia", "gray", "green", "grey",
    "khaki", "lilac", "magenta", "maroon", "mauve", "ochre", "orange",
    "pink", "purple", "red", "russet", "scarlet", "teal", "terracotta",
    "turquoise", "ultramarine", "umber", "verdigris", "vermilion",
    "violet", "white", "yellow"]
}
Listing 2.1: Entity dataset of Colors.
2.6.2 Generation of datasets for training, testing and fine-tuning
After pre-training, the model was trained using encoded relations in order to achieve
some level of learning of the relations described in 2.3. To do so, the sets of entity
pairs for each relation served as input for a Python script that implemented a
distantly supervised data-generation technique, executing a series of queries against
a compiled version of English Gigaword7 to extract valid sentences. This online
compilation8 was provided by UPF's TALN Group.
The basic flow of the algorithm goes as follows:
1. For every pair of entities e1 and e2, a query to Gigaword of the form http://
clasificador-taln.upf.edu/index/english_giga/select?q=text:%22{e1}%
20{e2}%22~{i}&rows=100 is made. Its output most likely includes texts
containing the queried entities.
2. The value of i is set incrementally, starting from zero and ending at 6.
This number represents the distance span in tokens that is covered
during query execution. It also means that for every pair, the query is executed
up to six times.
7https://catalog.ldc.upenn.edu/LDC2011T07
8http://clasificador-taln.upf.edu/knn/index.jsp
3. For every text value found, or valid sentence, the algorithm checks that it
contains no more than 150 white spaces, in order to avoid excessively
long sentences.
4. A validation for duplicates is also included to make sure sentences are unique.
5. Passing sentences are assigned an id, and all entity pairs contained in them are
then replaced by the special markers <e1>, </e1> and <e2>, </e2>.
6. Sentences are finally written out to the resulting text file, along with the
relation name in the format relationName(e1, e2).
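The steps above can be sketched as follows (a simplified illustration, not the actual script; `fetch` stands in for the HTTP call to the Gigaword index and is assumed to return a list of raw texts, and the sequential ids are hypothetical):

```python
QUERY_TEMPLATE = (
    "http://clasificador-taln.upf.edu/index/english_giga/select"
    "?q=text:%22{e1}%20{e2}%22~{i}&rows=100"
)

def harvest(pairs, relation, fetch, max_spaces=150):
    """Distantly supervised sentence harvesting for one relation.

    `fetch(url)` is assumed to return the list of texts matching the query;
    the real script performs the HTTP request against the Gigaword index.
    """
    seen, records = set(), []
    for e1, e2 in pairs:
        for i in range(0, 7):  # distance span i = 0 .. 6
            for text in fetch(QUERY_TEMPLATE.format(e1=e1, e2=e2, i=i)):
                if text.count(" ") > max_spaces or text in seen:
                    continue  # skip overly long sentences and duplicates
                seen.add(text)
                tagged = (text.replace(e1, f"<e1>{e1}</e1>")
                              .replace(e2, f"<e2>{e2}</e2>"))
                records.append((len(records), tagged,
                                f"{relation}(e1, e2)", i))
    return records
```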
889 Moler pointed out the 2-inch-deep trench left by the <e2>crocodile</e2>'s <e1>tail</e1> when she moved back into the water.
PartOf(e1, e2)
Comment: distance used = 3
170 I've seen penguins from far away, a <e2>seal</e2> <e1>close</e1>-up and lots of skuas.
Hypernymy(e1, e2)
Comment: distance used = 5
298 The accord also bans the dumping at sea and on land and the discharge into the air of radioactive <e1>material</e1> or <e2>waste</e2> in the region.
Hypernymy(e1, e2)
Comment: distance used = 1
652 There is a tropical themed, <e1>aquamarine</e1> <e2>hotel</e2>, the Key West Inn, plopped unceremoniously in a cotton field, miles away from the Mississippi River.
Color(e1, e2)
Comment: distance used = 0
Listing 2.2: Sentences obtained after applying a distantly-supervised algorithm on
Gigaword.
The algorithm described above covers only the generation of datasets from entity
pairs. In the case of Color, Shape and Material, however, a different strategy had
to be implemented, because attributes do not come in the form of pairs but rather
as single tokens. For those, the query had to be limited to one entity only, and a
validation over the list of inflected forms in the text then had to be performed in
order to find words in the sentence's span that are nouns. As there can be
several of these in the text for a single entity, the validation was constrained so that
only nouns linked to the entity in question via a dependency relation of
adverbial modifier or compound are accounted for. In this way, pairs consisting of one
attribute entity and a noun are used as a filter to extract the valid sentences
from the query's output.
The following is an example of the mechanism described above.
Input token = violet.
Query string = http://clasificador-taln.upf.edu/index/english_giga/select?
q=text:%22violet%22~{i}&rows=100
First adverbial modifier or compound in the list of dependencies = violet|compound|show
Validation of the selected dependency as an inflected form = show|NOUN
Because all criteria are met, the text is returned and written to file:
Hong Kong to Stage African Violet Show in Easter The 2000 African Violet
Show is sponsored by the Leisure and Cultural Services Department and jointly
organized by the African Violet Association of Hong Kong Ltd and the Society
of Horticulture, Hong Kong.
It is important to note that capitalized pairs of attributes and/or nouns were not
included, in order to avoid proper names such as organizations or brands (e.g. "Red
Bull"). As in the case of meronyms and hypernyms, the query string for attributes
was also executed six times, for each value of i starting from zero.
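The attribute filter described above can be sketched as follows (a minimal illustration under the assumption that dependencies arrive as `token|relation|head` triples, as in the example; the dependency-label names and the `is_noun` validation are stand-ins, not the thesis code):

```python
# Dependency labels assumed (UD-style) for the modifier/compound filter.
ATTRIBUTE_DEPS = {"advmod", "amod", "compound"}

def attribute_noun_pairs(entity, dependencies, is_noun):
    """Pair a single-token attribute entity with the nouns it modifies.

    `dependencies` holds "token|relation|head" triples, as in the
    violet|compound|show example; `is_noun(word)` stands in for the
    inflected-form validation and is assumed to be supplied elsewhere.
    """
    pairs = []
    for dep in dependencies:
        token, relation, head = dep.split("|")
        if token == entity and relation in ATTRIBUTE_DEPS and is_noun(head):
            pairs.append((entity, head))
    return pairs

deps = ["violet|compound|show", "the|det|show"]
pairs = attribute_noun_pairs("violet", deps, is_noun=lambda w: w == "show")
# yields the single pair ("violet", "show") used to accept the sentence
```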
22 Chapter 2. Methods and Experiments
2.6.3 Data Split and Shuffle
The techniques described were successful at generating a dataset of sentences for each
of the five targeted relations. To make the most out of them, each dataset was
split in an 80-10-10 fashion: 80% for training, 10% for testing
and 10% for evaluation. Once this segmentation was done, the last step consisted
of merging all training sentences and shuffling them to obtain a final training
dataset. The same procedure was applied to the testing and evaluation datasets,
resulting in three final sets of sentences, as illustrated in Figure 3.
Figure 3: Flow of the split, merge and shuffling applied to obtain final datasets for
training, testing and validation on Gigaword data.
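The split-merge-shuffle flow can be sketched as follows (a simplified illustration; the real datasets are files of tagged sentences, and the fixed seed is an assumption made here for reproducibility):

```python
import random

def split_80_10_10(sentences):
    """80/10/10 split of one relation's sentence list."""
    n = len(sentences)
    a, b = int(n * 0.8), int(n * 0.9)
    return sentences[:a], sentences[a:b], sentences[b:]

def merge_and_shuffle(per_relation, seed=0):
    """Split each relation's dataset, then merge and shuffle each part.

    `per_relation` maps a relation name to its list of sentences.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    train, test, val = [], [], []
    for sentences in per_relation.values():
        tr, te, va = split_80_10_10(sentences)
        train += tr
        test += te
        val += va
    for part in (train, test, val):
        rng.shuffle(part)
    return train, test, val
```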
It is important to remember that this process was only necessary for Gigaword data
extraction and handling. In the case of Dezeen data, all datasets were assembled
and provided directly by this project’s supervisors.
A summary of all datasets employed in the methodology is provided below:
• Gigaword:
– Training set of 249.881 valid sentences9.
– Testing set of 30.971 valid sentences.
– Validation set of 31.336 valid sentences.
• Dezeen:
– Training set of 174.695 valid sentences.
– Testing set of 26.700 valid sentences.
– Validation set of 24.026 valid sentences.
9Valid sentences are those that met all criteria under the MTB methodology, as explained in
Section 2.5
Chapter 3
Results
Resulting data and plots obtained after following the methodology developed in
Chapter 2 are reported in this chapter.
3.1 Overview
The information presented here is grouped by the two main stages in the method-
ology: training and fine tuning. For each targeted relation, plots for precision and
f1-score are presented, as well as overall averages for classification reports. Data for
usage - and no usage - of pre-trained blanks in fine tuning is also provided. It is
important to mention that, for stages, the model was set to run a maximum of 20
epochs for all the scenarios considered in the methodology.
3.2 Training with Gigaword
Results after training the proposed model with the Gigaword-generated datasets are
shown in Figure 4. This corresponds to the first stage of training, which means the
model was already pre-trained but did not make use of any of that pre-trained
data during the process. This configuration was set manually before training
and reverted once training was finished.
Under these settings, precision results varied across all relations, with a noticeable
increase starting at epoch 15, with the exception of the meronymy relation, which
underperformed in most epochs.
Figure 4: Precision results after training without pre-trained blanks. Gigaword
dataset.
Relation Precision Recall F1 Score
Meronymy 0.811 0.842 0.822
Hypernymy 0.865 0.530 0.647
Color 0.884 0.505 0.652
Material 0.851 0.557 0.672
Shape 0.851 0.937 0.880
Overall Average 0.8524 0.755 0.734
Table 2: Average Values for Training Classification Report without pre-trained
blanks. Gigaword dataset.
Figure 5: F1-Score results after training without pre-trained blanks. Gigaword
dataset.
Despite its low precision performance, the meronymy relation reported a high
average recall, which understandably resulted in a high average F1-score for this
relation as well. Table 2 contains a summary of the classification report for this
data. The Shape relation reported the highest F1-score with 0.880, while the Color
relation had the best precision result with 0.884.
When pre-trained blanks were added, almost all relations reported a significant
increase in overall averages, starting as early as epoch 2 and followed by a slight but
steady decrease from epoch 10. F1-scores improved for the Material relation (+0.235),
Color (+0.193) and Hypernymy (+0.177), as can be seen in Figure 7.
Similarly to the scenario without pre-trained blanks, the meronymy relation under-
performed compared to the other relations; moreover, its average precision
decreased by 0.016. The attribute relations were again observed to outperform both
meronymy and hypernymy, particularly the Color relation, which registered
the highest value (0.884 averaged), although all relations ultimately converged to very
close margins in the higher epochs, right after epoch 15. This can be seen in Figure 7.
Figure 6: Precision results after training using pre-trained blanks. Gigaword dataset
Figure 7: F1-Score results after training using pre-trained blanks. Gigaword dataset.
Overall averages with pre-trained blanks improved for Precision (+0.049), Recall
(+0.096) and F1-Score (+0.133). These values are visible in Table 3.
Relation Precision Recall F1 Score
Meronymy 0.795 0.875 0.821
Hypernymy 0.906 0.754 0.824
Color 0.968 0.751 0.845
Material 0.902 0.904 0.907
Shape 0.939 0.974 0.940
Overall Average 0.902 0.851 0.867
Table 3: Average Values for Training Classification Report using pre-trained blanks
on Gigaword extracted data.
3.3 Fine tuning with Dezeen
This stage of the methodology refers to the process of applying domain adaptation
to the model. Having it trained already with a more open and general dataset, the
goal here is to run a series of epochs with another dataset that is ideally heavier in
entities and relations concerning the domain of interest. As previously explained in
Chapter 2, a set of data from Dezeen magazine was used for this purpose. Because
this stage is not strictly a training stage, all epochs were run using pre-trained
blanks.
Precision results for up to 20 epochs can be seen in Figure 8. Similarly to what was
observed with Gigaword, the attribute relations yielded the highest precision values,
with the Color relation reaching a peak of 0.89 before epoch 5. On the other hand,
hypernyms and meronyms registered overall decreases of 0.115 and 0.009, respectively.
Values for F1-score are plotted in Figure 9. The Shape relation outperforms all others
with a peak of 0.93 right after epoch 10. Hypernymy, Meronymy and Color find a
close point of intersection at 0.77 on epoch 5.
Figure 8: Precision results after fine tuning using pre-trained blanks. Dezeen
dataset.
Figure 9: F1-Score results after fine tuning using pre-trained blanks. Dezeen dataset.
Relation Precision Recall F1 Score
Meronymy 0.804 0.732 0.768
Hypernymy 0.791 0.754 0.771
Color 0.892 0.750 0.815
Material 0.873 0.903 0.888
Shape 0.876 0.973 0.922
Overall Average 0.847 0.822 0.832
Table 4: Average Values for Classification Report using pre-trained blanks for fine
tuning. Dezeen dataset.
Overall averages after fine tuning are shown in Table 4. Compared with the
previous training stage, there is an observed decrease of 0.06 in Precision, 0.033
in Recall and 0.035 in F1-Score. It is worth noting that the Shape relation
registered the highest F1-score values both under Gigaword and Dezeen, a
pattern that was somewhat expected given the relatively good performance seen in
attribute relations in almost all scenarios.
An analysis and discussion of the results presented here is provided in Chapter 4.
Chapter 4
Discussion and Conclusions
This chapter covers the critical discussion of results obtained, their relevance with
respect to state of the art, limitations found during execution and ideas for extending
the scope of this project as future work.
4.1 Discussion of the Results
In order to do a complete evaluation of the results, it is important to begin with an
assessment of the data obtained in the process that preceded both the training and
fine tuning stages, and on which a big part of the methodology relied. Many
papers in RE base their training methodologies on standardized datasets that
suit the goals of their area of study, usually targeting general/open domains. Because
of the scope of this thesis, framed within the Arts and Architecture domain,
such an approach was not possible.
Distantly supervised techniques like the one employed to generate training datasets
for this project are far from perfect and can be very time consuming. The three
resulting sets of data obtained after applying the technique described in Section
2.6 had to undergo a curation process that included removing stray blank lines,
incomplete sentences and invalid sentences, the latter being those that did not
match the MTB criteria, mostly sentences with incorrect or missing tags. Table 5
shows the losses in the sentence sets for training, testing and validation.
Dataset Total Valid Loss (%)
Training 347.631 249.881 32.7%
Testing 43.728 30.971 34.1%
Validation 42.985 31.336 31.3%
Table 5: Total sentences vs Valid sentences on extracted Gigaword data.
It is clear that more than a quarter of the total data generated had to be discarded
due to problematic sentences, which is a very high percentage. A possible
explanation is the complexity involved in the search for acceptable sentences
in the algorithm, particularly in the case of the attribute relations, whose input sets
were not in the form of paired entities but rather single entities, forcing the algorithm
to look for dependencies in the sentence in order to match the validation criteria. Such
complexity may have increased the chances of introducing incomplete sentences and
mislabelled entities, ultimately making them unsuitable for evaluation under the
MTB criteria. It is also worth mentioning that the original implementation, meant
to address open-domain RE, was trained on a general Wikipedia data dump, fine
tuned with FewRel [36] and evaluated on standard datasets such as SemEval-2010
Task 8 [37].
Results from training the model with no pre-trained blanks show relatively low F1-
scores for all relations except Meronymy and Shape, a consequence of the low recall
numbers in the other relations. This means the model, even after 20 epochs, was
able to retrieve barely half of the true instances of the Color, Material and Hypernymy
relations. In other words, without pre-trained blanks, the model missed nearly
half of the instances of relations that are particularly crucial in the domain
concerning this thesis.
When pre-trained blanks were added, significant increases in all numbers were ob-
tained, especially for the Material relation (+0.235). An interesting outcome to note
here is the behaviour observed in precision for all relations after epoch 10, where
values start to decrease slightly, which could be a sign of overfitting in the model.
This is consistent with what is seen for F1-score, where numbers tend to decrease
as epochs increase.
Qualitatively, this makes the model a good prediction tool, with an averaged F1-score
of 0.867 even without fine tuning, which is remarkable. It also shows that adding
pre-trained blanks improved the model's overall performance for all relations.
Domain adaptation using Dezeen data, however, did not result in significant im-
provements in model performance. With an averaged F1-score of 0.832, it is evident
that predictions did not benefit from adding domain-specific datasets, with a
difference of only 0.035 compared to the training results with pre-trained blanks. This
may be a consequence of the possible overfitting observed during training, especially
evident in the numbers registered for the Material and Shape relations, whose
F1-scores exceed 0.85 in both the training and fine tuning stages.
That being said, the fine tuning results are still favorable and make the model suitable
for relation prediction in the domain of Arts and Architecture, because the
attribute relations, which are very important in the characterization of objects com-
monly found in this domain, had the best results in the classification report. Simply
put, the model was successful at predicting the correct association between objects
and their attributes in 81% of the sentences with a Color relation, 88% of those
with a Material relation and 92% of those with a Shape relation.
In the original MTB implementation [35], F1-scores after evaluation on SemEval-
2010 Task 8, KBP37 and TACRED are provided; however, because these are stan-
dardized open-domain datasets, and the study focused on learning agnostic relation
representations rather than fine tuning on domain-specific relations, such scores can-
not be used as a comparative metric for the results obtained in this thesis. Never-
theless, it is important to note that the original implementation did outperform
previous work in supervised RE, with an F1-score of 0.89 on SemEval-2010 and 0.88
on FewRel.
The results obtained show that the MTB implementation proposed in this thesis was
successful as a tool for predicting relations in the language of Art and Architecture.
It harnesses the efficiency of the original architecture and delivers promising
results even without any kind of fine tuning, a behavior that matches what was
observed in the original implementation.
By achieving promising predictive results, the methodology completed in this
project represents a notable first approach on the path towards unifying
a taxonomy in Arts and Architecture. To the best of our knowledge, no other study
has approached RE in this domain targeting the relations studied in this thesis,
making these results all the more noteworthy for future consideration.
4.2 Limitations
One of the initial constraints in the development of this thesis was finding model
implementations suitable for the targeted relations and domain: not only
because Arts and Architecture is not a typical area explored in RE, but also because
a considerable number of state-of-the-art models do not have their implementations
available to the public.
At the same time, the process of reviewing the literature, revising architectures and
searching for appropriate datasets proved to be more time consuming than expected,
which led to important delays in the start of the project. Moreover, the time required
to generate all the datasets necessary to begin training and fine tuning the model
was higher than anticipated, mostly because of the complexity of the generation
techniques explained in Section 2.6.
Finally, it is important to mention the difficulty of finding solid comparative base-
lines for evaluating the results. While the numbers obtained seem promising, no
previous study was found that could be used for direct comparison. Furthermore, no
existing approaches in Arts and Architecture were found that could serve as a
state-of-the-art baseline for this thesis, which in itself represents an important con-
straint.
4.3 Future work
The work presented here could be improved in several ways. As a contribution to
the discourse analysis of artists and architects, it could be useful to include one
additional relation in the methodology: Purpose. This was part of the initial set of
relations to be tackled in this thesis, but had to be dropped due to time constraints.
Purpose, or functionality, is a relation that encapsulates how objects are meant to
be used and what they are meant to serve. Such information could be potentially
helpful in deciphering what artists and designers are writing about in their
publications.
The strategy employed for data generation in this project could be applied to other
sources and thereby broaden the corpora available for training. Moreover, other
state-of-the-art models in RE could be trained on those datasets, enabling a
comparative analysis with the results obtained in this thesis.
Adding more computational time for both the training and fine tuning stages could
also be beneficial when employing an alternative model, especially if results do not
initially seem favorable. The 20-epoch mark used here as a baseline could easily
be increased if computational resources allow.
List of Figures
1 Basic Transformer Model Architecture proposed in [27]. . . . . . . . . 9
2 Proposed Architecture in chosen implementation: BERT as Trans-
former Model and a stacked linear classifier for relation classification. 18
3 Flow of the split, merge and shuffling applied to obtain final datasets
for training, testing and validation on Gigaword data. . . . . . . . . . 22
4 Precision results after training without pre-trained blanks. Gigaword
dataset. .................................. 25
5 F1-Score results after training without pre-trained blanks. Gigaword
dataset. .................................. 26
6 Precision results after training using pre-trained blanks. Gigaword
dataset................................... 27
7 F1-Score results after training using pre-trained blanks. Gigaword
dataset. .................................. 27
8 Precision results after fine tuning using pre-trained blanks. Dezeen
dataset. .................................. 29
9 F1-Score results after fine tuning using pre-trained blanks. Dezeen
dataset. .................................. 29
List of Tables
1 Sentence examples for the five relations addressed in this project. . . 15
2 Average Values for Training Classification Report without pre-trained
blanks. Gigaword dataset. . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Average Values for Training Classification Report using pre-trained
blanks on Gigaword extracted data. . . . . . . . . . . . . . . . . . . . 28
4 Average Values for Classification Report using pre-trained blanks for
fine tuning. Dezeen dataset. . . . . . . . . . . . . . . . . . . . . . . . 30
5 Total sentences vs Valid sentences on extracted Gigaword data. . . . 32
Bibliography
[1] Pawar, S., Palshikar, G. K. & Bhattacharyya, P. Relation extraction: A survey.
arXiv preprint arXiv:1712.05191 (2017).
[2] Bach, N. & Badaskar, S. A review of relation extraction. In A Review of
Relation Extraction (2007).
[3] Cruse, D. A. Lexical semantics (Cambridge University Press, 1986).
[4] Auger, A. & Barrière, C. Pattern-based approaches to semantic relation extraction: A state-of-the-art. Terminology 14, 1 (2008).
[5] Snow, R., Jurafsky, D. & Ng, A. Y. Learning syntactic patterns for automatic
hypernym discovery. Advances in Neural Information Processing Systems 17
(2004).
[6] Aldine, A. I. A., Harzallah, M., Berio, G., Béchet, N. & Faour, A. DHPs: Dependency Hearst's patterns for hypernym relation extraction. In International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management, 228–244 (Springer, 2018).
[7] Deepika, S. & Geetha, T. Pattern-based bootstrapping framework for biomedical relation extraction. Engineering Applications of Artificial Intelligence 99, 104130 (2021).
[8] Kambhatla, N. Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, 178–181 (2004).
[9] Nguyen, D. P., Matsuo, Y. & Ishizuka, M. Relation extraction from wikipedia
using subtree mining. In Proceedings of the National Conference on Artificial
Intelligence, vol. 22, 1414 (Menlo Park, CA; Cambridge, MA; London; AAAI
Press; MIT Press; 1999, 2007).
[10] Culotta, A. & Sorensen, J. Dependency tree kernels for relation extraction. In
Proceedings of the 42nd Annual Meeting of the Association for Computational
Linguistics (ACL-04), 423–429 (2004).
[11] Wang, M. A re-examination of dependency path kernels for relation extraction.
In Proceedings of the Third International Joint Conference on Natural Language
Processing: Volume-II (2008).
[12] Harris, Z. S. Distributional structure. WORD 10, 146–162 (1954).
[13] Randriatsitohaina, T. & Hamon, T. Extracting food-drug interactions from
scientific literature: Tackling unspecified relation. In Conference on Artificial
Intelligence in Medicine in Europe, 275–280 (Springer, 2019).
[14] Chang, H.-S., Wang, Z., Vilnis, L. & McCallum, A. Unsupervised hypernym detection by distributional inclusion vector embedding. arXiv preprint arXiv:1710.00880 (2017).
[15] Smirnova, A. & Cudré-Mauroux, P. Relation extraction using distant supervision: A survey. ACM Computing Surveys (CSUR) 51, 1–35 (2018).
[16] Riedel, S., Yao, L., McCallum, A. & Marlin, B. M. Relation extraction with
matrix factorization and universal schemas. In Proceedings of the 2013 Con-
ference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, 74–84 (2013).
[17] Riedel, S., Yao, L. & McCallum, A. Modeling relations and their mentions
without labeled text. In Joint European Conference on Machine Learning and
Knowledge Discovery in Databases, 148–163 (Springer, 2010).
[18] Takamatsu, S., Sato, I. & Nakagawa, H. Reducing wrong labels in distant
supervision for relation extraction. In Proceedings of the 50th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers),
721–729 (2012).
[19] Kalchbrenner, N., Grefenstette, E. & Blunsom, P. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188 (2014).
[20] Yang, B., Yih, W.-t., He, X., Gao, J. & Deng, L. Embedding entities
and relations for learning and inference in knowledge bases. arXiv preprint
arXiv:1412.6575 (2014).
[21] Li, P. & Mao, K. Knowledge-oriented convolutional neural network for causal
relation extraction from natural language texts. Expert Systems with Applica-
tions 115, 512–523 (2019).
[22] Li, Q., Li, L., Wang, W., Li, Q. & Zhong, J. A comprehensive exploration of
semantic relation extraction via pre-trained cnns. Knowledge-Based Systems
194, 105488 (2020).
[23] Nayak, T., Majumder, N., Goyal, P. & Poria, S. Deep neural approaches to relation triplets extraction: A comprehensive survey. arXiv preprint arXiv:2103.16929 (2021).
[24] Yu, B. et al. Beyond word attention: Using segment attention in neural relation
extraction. In IJCAI, 5401–5407 (2019).
[25] Zhou, J. et al. Graph neural networks: A review of methods and applications.
AI Open 1, 57–81 (2020).
[26] Guo, Z., Zhang, Y. & Lu, W. Attention guided graph convolutional networks
for relation extraction. arXiv preprint arXiv:1906.07510 (2019).
[27] Vaswani, A. et al. Attention is all you need. arXiv preprint arXiv:1706.03762
(2017).
[28] Zhang, Y., Guo, Z. & Lu, W. Attention guided graph convolutional networks
for relation extraction. arXiv preprint arXiv:1906.07510 (2019).
[29] Alt, C., Hübner, M. & Hennig, L. Improving relation extraction by pre-trained
language representations. arXiv preprint arXiv:1906.03088 (2019).
[30] Qiu, X. et al. Pre-trained models for natural language processing: A survey.
Science China Technological Sciences 1–26 (2020).
[31] McCann, B., Bradbury, J., Xiong, C. & Socher, R. Learned in translation:
Contextualized word vectors. arXiv preprint arXiv:1708.00107 (2017).
[32] Peters, M. E. et al. Deep contextualized word representations. arXiv preprint
arXiv:1802.05365 (2018).
[33] Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language
understanding by generative pre-training (2018).
[34] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805 (2018).
[35] Soares, L. B., FitzGerald, N., Ling, J. & Kwiatkowski, T. Matching the blanks:
Distributional similarity for relation learning. arXiv preprint arXiv:1906.03158
(2019).
[36] Han, X. et al. Fewrel: A large-scale supervised few-shot relation classification
dataset with state-of-the-art evaluation (2018). 1810.10147.
[37] Hendrickx, I. et al. Semeval-2010 task 8: Multi-way classification of semantic
relations between pairs of nominals (2019). 1911.10422.