Specializing Distributional Vectors of All Words for Lexical Entailment
Aishwarya Kamath1, Jonas Pfeiffer2, Edoardo M. Ponti3, Goran Glavaš4, Ivan Vulić3
1Oracle Labs
2Ubiquitous Knowledge Processing Lab (UKP-TUDA), TU Darmstadt
3Language Technology Lab, TAL, University of Cambridge
4Data and Web Science Group, University of Mannheim
1aishwarya.kamath@oracle.com
2pfeiffer@ukp.informatik.tu-darmstadt.de
3{ep490,iv250}@cam.ac.uk
4goran@informatik.uni-mannheim.de
Abstract
Semantic specialization methods fine-tune dis-
tributional word vectors using lexical knowl-
edge from external resources (e.g., WordNet)
to accentuate a particular relation between
words. However, such post-processing meth-
ods suffer from limited coverage as they af-
fect only vectors of words seen in the ex-
ternal resources. We present the first post-
processing method that specializes vectors of
all vocabulary words – including those un-
seen in the resources – for the asymmetric rela-
tion of lexical entailment (LE) (i.e., hyponymy-
hypernymy relation). Leveraging a partially
LE-specialized distributional space, our POS-
TLE (i.e., post-specialization for LE) model
learns an explicit global specialization func-
tion, allowing for specialization of vectors
of unseen words, as well as word vectors
from other languages via cross-lingual trans-
fer. We capture the function as a deep feed-
forward neural network: its objective re-scales
vector norms to reflect the concept hierarchy
while simultaneously attracting hyponymy-
hypernymy pairs to better reflect semantic sim-
ilarity. An extended model variant augments
the basic architecture with an adversarial dis-
criminator. We demonstrate the usefulness and
versatility of POSTLE models with different in-
put distributional spaces in different scenarios
(monolingual LE and zero-shot cross-lingual
LE transfer) and tasks (binary and graded LE).
We report consistent gains over state-of-the-art
LE-specialization methods, and successfully
LE-specialize word vectors for languages with-
out any external lexical knowledge.
1 Introduction
Word-level lexical entailment (LE), also known as
the TYPE-OF or hyponymy-hypernymy relation, is
a fundamental asymmetric lexico-semantic relation
(Collins and Quillian,1972;Beckwith et al.,1991).
Both authors contributed equally to this work.
The set of these relations constitutes a hierarchi-
cal structure that forms the backbone of semantic
networks such as WordNet (Fellbaum,1998). Au-
tomatic reasoning about word-level LE benefits a
plethora of tasks such as natural language inference
(Dagan et al.,2013;Bowman et al.,2015;Williams
et al.,2018), text generation (Biran and McKeown,
2013), metaphor detection (Mohler et al.,2013),
and automatic taxonomy creation (Snow et al.,
2006;Navigli et al.,2011;Gupta et al.,2017).
However, standard techniques for inducing word
embeddings (Mikolov et al.,2013;Pennington
et al.,2014;Melamud et al.,2016;Bojanowski
et al.,2017;Peters et al.,2018,inter alia) are un-
able to effectively capture LE. Due to their crucial
dependence on contextual information and the dis-
tributional hypothesis (Harris,1954), they display
a clear tendency towards conflating different rela-
tionships such as synonymy, antonymy, meronymy
and LE and broader topical relatedness (Schwartz et al., 2015; Mrkšić et al., 2017).
To mitigate this deficiency, a standard solution
is a post-processing step: distributional vectors are
gradually refined to satisfy linguistic constraints
extracted from external resources such as Word-
Net (Fellbaum,1998) or BabelNet (Navigli and
Ponzetto,2012). This process, termed retrofitting
or semantic specialization, is beneficial to language
understanding tasks (Faruqui, 2016; Glavaš and Vulić, 2018) and is extremely versatile as it can be
applied on top of any input distributional space.
Retrofitting methods, however, have a major
weakness: they only locally update vectors of
words seen in the external resources, while leaving
vectors of all other unseen words unchanged, as
illustrated in Figure 1. Recent work (Glavaš and Vulić, 2018; Ponti et al., 2018) has demonstrated
how to specialize the full distributional space for
the symmetric relation of semantic (dis)similarity.
The so-called post-specialization model learns a
global and explicit specialization function that imitates the transformation from the distributional space to the retrofitted space, and applies it to the large subspace of unseen words' vectors.

[Figure 1 here: (1) initial LE specialization of distributional word vectors; (2a) POSTLE LE specialization of all words in the source language; (2b) POSTLE LE specialization of all words in the target language.]
Figure 1: High-level overview of (a) the POSTLE full-vocabulary specialization process and (b) zero-shot cross-lingual specialization for LE. This relies on an initial shared cross-lingual word embedding space (see §2).
In this work, we present POSTLE, an all-words
post-specialization model for the asymmetric LE
relation. This model propagates the signal on the
hierarchical organization of concepts to the ones
unseen in external resources, resulting in a word
vector space which is fully specialized for the LE re-
lation. Previous LE specialization methods simply
integrated available LE knowledge into the input
distributional space (Vulić and Mrkšić, 2018), or
provided means to learn dense word embeddings
of the external resource only (Nickel and Kiela,
2017,2018;Ganea et al.,2018;Sala et al.,2018).
In contrast, we show that our POSTLE method can
combine distributional and external lexical knowl-
edge and generalize over unseen concepts.
The main contribution of POSTLE is a novel
global transformation function that re-scales vector
norms to reflect the concept hierarchy while simul-
taneously attracting hyponymy-hypernymy word
pairs to reflect their semantic similarity in the spe-
cialized space. We propose and evaluate two vari-
ants of this idea. The first variant learns the global
function through a deep non-linear feed-forward
network. The extended variant leverages the deep
feed-forward net as the generator component of an
adversarial model. The role of the accompanying
discriminator is then to distinguish between origi-
nal LE-specialized vectors (produced by any initial
post-processor) from vectors produced by trans-
forming distributional vectors with the generator.
We demonstrate that the proposed POSTLE meth-
ods yield considerable gains over state-of-the-art
LE-specialization models (Nickel and Kiela,2017;
Vulić and Mrkšić, 2018), with the adversarial vari-
ant having an edge over the other. The gains are
observed with different input distributional spaces
in several LE-related tasks such as hypernymy de-
tection and directionality, and graded lexical entail-
ment. What is more, the highest gains are reported
for resource-lean data scenarios where a high per-
centage of words in the datasets is unseen.
Finally, we show how to LE-specialize distribu-
tional spaces for target languages that lack external
lexical knowledge. POSTLE can be coupled with
any model for inducing cross-lingual embedding
spaces (Conneau et al.,2018;Artetxe et al.,2018;
Smith et al.,2017). If this model is unsupervised,
the procedure effectively yields a zero-shot LE spe-
cialization transfer, and holds promise to support
the construction of hierarchical semantic networks
for resource-lean languages in future work.
2 Post-Specialization for LE
Our post-specialization starts with the Lexical En-
tailment Attract-Repel (LEAR) model (Vulić and Mrkšić, 2018), a state-of-the-art retrofitting model
for LE, summarized in §2.1. While we opt for LEAR
because of its strong performance and ease of use, it
is important to note that our POSTLE models (§2.2
and §2.3) are not in any way bound to LEAR and
can be applied on top of any LE retrofitting model.
2.1 Initial LE Specialization: L EAR
LEAR fine-tunes the vectors of words observed in a set of external linguistic constraints $C = S \cup A \cup L$, consisting of synonymy pairs $S$ such as (clever, intelligent), antonymy pairs $A$ such as (war, peace), and lexical entailment (i.e., hyponymy-hypernymy) pairs $L$ such as (dog, animal). For the $L$ pairs, the order of words is important: we assume that the left word always refers to the hyponym.
Extending the ATTRACT-REPEL model for sym-
metric similarity specialization (Mrkšić et al.,
2017), LEAR defines two types of objectives: 1)
the ATTRACT (Att) objective aims to bring closer
together in the vector space words that are se-
mantically similar (i.e., synonyms and hyponym-
hypernym pairs); 2) the REPEL (Rep) objective
pushes further apart vectors of dissimilar words
(i.e., antonyms). Let $\mathcal{B} = \{(x_l^{(k)}, x_r^{(k)})\}_{k=1}^{K}$ be the set of $K$ word pairs for which the Att or Rep score is to be computed – these are the positive examples. The set of corresponding negative examples $\mathcal{T}$ is created by coupling each positive ATTRACT example $(x_l, x_r)$ with a negative example pair $(t_l, t_r)$, where $t_l$ is the vector closest (in terms of cosine similarity, within the batch) to $x_l$ and $t_r$ the vector closest to $x_r$. The Att objective for a batch of ATTRACT constraints $\mathcal{B}_A$ is then given as:

$$Att(\mathcal{B}_A, \mathcal{T}_A) = \sum_{k=1}^{K} \Big[ \tau\big(\delta_{att} + \cos(x_l^{(k)}, t_l^{(k)}) - \cos(x_l^{(k)}, x_r^{(k)})\big) + \tau\big(\delta_{att} + \cos(x_r^{(k)}, t_r^{(k)}) - \cos(x_l^{(k)}, x_r^{(k)})\big) \Big] \quad (1)$$
$\tau(x) = \max(0, x)$ is the hinge loss and $\delta_{att}$ is the similarity margin imposed between the negative and positive vector pairs. In contrast, for each positive REPEL example, the negative example $(t_l, t_r)$ couples the vector $t_l$ that is most distant from $x_l$ and $t_r$, most distant from $x_r$. The Rep objective for a batch of REPEL word pairs $\mathcal{B}_R$ is then:

$$Rep(\mathcal{B}_R, \mathcal{T}_R) = \sum_{k=1}^{K} \Big[ \tau\big(\delta_{rep} + \cos(x_l^{(k)}, x_r^{(k)}) - \cos(x_l^{(k)}, t_l^{(k)})\big) + \tau\big(\delta_{rep} + \cos(x_l^{(k)}, x_r^{(k)}) - \cos(x_r^{(k)}, t_r^{(k)})\big) \Big] \quad (2)$$
LEAR additionally defines a regularization term
in order to preserve the useful semantic informa-
tion from the original distributional space. With $V(\mathcal{B})$ as the set of distinct words in a constraint batch $\mathcal{B}$, the regularization term is $Reg(\mathcal{B}) = \lambda_{reg} \sum_{x \in V(\mathcal{B})} \lVert y - x \rVert_2$, where $y$ is the LEAR-specialization of the distributional vector $x$, and $\lambda_{reg}$ is the regularization factor.
Crucially, LEAR forces specialized vectors to
reflect the asymmetry of the LE relation with an
asymmetric distance-based objective. The goal is
to preserve the cosine distances in the specialized
space while steering vectors of more general con-
cepts (those found higher in the concept hierarchy)
to take larger norms.1 Vulić and Mrkšić (2018) test several asymmetric objectives, and we adopt the one reported to be the most robust:

$$LE(\mathcal{B}_L) = \sum_{k=1}^{K} \frac{\lVert x_l^{(k)} \rVert - \lVert x_r^{(k)} \rVert}{\lVert x_l^{(k)} \rVert + \lVert x_r^{(k)} \rVert} \quad (3)$$

$\mathcal{B}_L$ denotes a batch of LE constraints. The full LEAR objective is then defined as:

$$J = Att(\mathcal{B}_S, \mathcal{T}_S) + Rep(\mathcal{B}_A, \mathcal{T}_A) + Att(\mathcal{B}_L, \mathcal{T}_L) + LE(\mathcal{B}_L) + Reg(\mathcal{B}_S, \mathcal{B}_A, \mathcal{B}_L) \quad (4)$$

1 E.g., while dog and animal should be close in the LE-specialized space in terms of cosine distance, the vector norm of animal should be larger than that of dog.
In summary, LEAR pulls words from synonymy and LE pairs closer together ($Att(\mathcal{B}_S, \mathcal{T}_S)$ and $Att(\mathcal{B}_L, \mathcal{T}_L)$), while simultaneously pushing vectors of antonyms further apart ($Rep(\mathcal{B}_A, \mathcal{T}_A)$) and enforcing asymmetric distances for hyponymy-hypernymy pairs ($LE(\mathcal{B}_L)$).
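To make the objectives above concrete, the following is a minimal PyTorch sketch of the four LEAR loss terms (Eqs. (1)–(4)), assuming pre-batched tensors of positive and negative pairs; negative-example mining, batching, and the training loop are omitted, and all function and variable names are ours, not the authors' released code.

```python
# Minimal PyTorch sketch of the LEAR batch objectives (Eqs. 1-4).
# x_l, x_r: vectors of the left/right words of the positive pairs, shape (K, d);
# t_l, t_r: vectors of the corresponding in-batch negative examples, shape (K, d).
import torch
import torch.nn.functional as F

def attract(x_l, x_r, t_l, t_r, delta_att=0.6):
    """Eq. (1): pull positive pairs closer together than their negatives."""
    pos = F.cosine_similarity(x_l, x_r)
    return (torch.relu(delta_att + F.cosine_similarity(x_l, t_l) - pos)
            + torch.relu(delta_att + F.cosine_similarity(x_r, t_r) - pos)).sum()

def repel(x_l, x_r, t_l, t_r, delta_rep=0.0):
    """Eq. (2): push antonym pairs further apart than their negatives."""
    pos = F.cosine_similarity(x_l, x_r)
    return (torch.relu(delta_rep + pos - F.cosine_similarity(x_l, t_l))
            + torch.relu(delta_rep + pos - F.cosine_similarity(x_r, t_r))).sum()

def le_asymmetry(x_l, x_r):
    """Eq. (3): steer hypernyms (right words) towards larger norms."""
    n_l, n_r = x_l.norm(dim=1), x_r.norm(dim=1)
    return ((n_l - n_r) / (n_l + n_r)).sum()

def regularize(y, x, lambda_reg=1e-9):
    """Keep the specialized vectors y close to the original distributional x."""
    return lambda_reg * (y - x).norm(dim=1).sum()
```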
2.2 Post-Specialization Model
The retrofitting model (LEAR) specializes vectors
only for a subset of the full vocabulary: the words
it has seen in the external lexical resource. Such
resources are still fairly incomplete, even for ma-
jor languages (e.g., WordNet for English), and fail
to cover a large portion of the distributional vo-
cabulary (referred to as unseen words). The trans-
formation of the seen subspace, however, provides
evidence on the desired effects of LE-specialization.
We seek a post-specialization procedure for LE
(termed POSTLE) that propagates this useful signal
to the subspace of unseen words and LE-specializes
the entire distributional space (see Figure 1).
Let $X_s$ be the subset of the distributional space containing vectors of words seen in lexical constraints and let $Y_s$ denote LE-specialized vectors of those words produced by the initial LE specialization model. For seen words, we pair their original distributional vectors $x_s \in X_s$ with corresponding LEAR-specialized vectors $y_s$: post-specialization then directly uses pairs $(x_s, y_s)$ as training instances for learning a global specialization function, which is then applied to LE-specialize the remainder of the distributional space, i.e., the specialization function learned from $(X_s, Y_s)$ is applied to the subspace of unseen words' vectors $X_u$.
Let $G(x_i; \theta_G): \mathbb{R}^d \to \mathbb{R}^d$ (with $d$ as the dimensionality of the vector space) be the specialization function we are trying to learn using pairs of distributional and LEAR-specialized vectors as training instances. We first instantiate the post-specialization model $G(x_i; \theta_G)$ as a deep fully-connected feed-forward network (DFFN) with $H$ hidden layers and $m$ units per layer. The mapping of the j-th hidden layer is given as:

$$x^{(j)} = activ\big(x^{(j-1)} W^{(j)} + b^{(j)}\big) \quad (5)$$

$activ$ refers to a non-linear activation function,2 $x^{(j-1)}$ is the output of the previous layer ($x^{(0)}$ is the input distributional vector), and $(W^{(j)}, b^{(j)})$, $j \in \{1, \ldots, H\}$ are the model's parameters $\theta_G$.

2 As discussed by Vulić et al. (2018); Ponti et al. (2018), non-linear transformations yield better results: linear transformations cannot fully capture the subtle fine-tuning done by the retrofitting process, guided by millions of pairwise constraints. We also verify that linear transformations yield poorer performance, but we do not report these results for brevity.
The aim is to obtain predictions $G(x_s; \theta_G)$ that are as close as possible to the corresponding LEAR-specializations $y_s$. For symmetric similarity-based post-specialization, prior work relied on cosine distance to measure discrepancy between the predicted and expected specialization (Vulić et al., 2018). Since we are specializing vectors for the asymmetric LE relation, the predicted vector $G(x_s; \theta_G)$ has to match $y_s$ not only in direction (as captured by cosine distance) but also in size (i.e., the vector norm). Therefore, the POSTLE objective augments cosine distance $d_{cos}$ with the absolute difference of the $G(x_s; \theta_G)$ and $y_s$ norms:3

$$L_S = d_{cos}\big(G(x_s; \theta_G), y_s\big) + \delta_n \big| \lVert G(x_s; \theta_G) \rVert - \lVert y_s \rVert \big| \quad (6)$$

The hyperparameter $\delta_n$ determines the contribution of the norm difference to the overall loss.
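The sketch below illustrates one plausible instantiation of the DFFN specialization function $G$ (Eq. (5)) and the POSTLE loss $L_S$ (Eq. (6)) in PyTorch, using the layer sizes from §3; the final projection back to $d$ dimensions and all names are our assumptions, not the authors' released code.

```python
# Sketch of the POSTLE specialization function G (a DFFN, Eq. 5) and the loss
# L_S (Eq. 6). Hidden sizes follow §3; the output projection back to the input
# dimensionality is our assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpecializationDFFN(nn.Module):
    def __init__(self, dim, hidden=1536, num_hidden=4):
        super().__init__()
        layers, in_dim = [], dim
        for _ in range(num_hidden):
            layers += [nn.Linear(in_dim, hidden), nn.SiLU()]  # SiLU == Swish
            in_dim = hidden
        layers.append(nn.Linear(in_dim, dim))  # map back to the embedding size
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, d) distributional vectors
        return self.net(x)         # predicted LE-specialized vectors

def postle_loss(pred, y, delta_n=0.1):
    """Eq. (6): cosine distance plus a weighted absolute norm difference."""
    cos_dist = 1.0 - F.cosine_similarity(pred, y)
    norm_diff = (pred.norm(dim=1) - y.norm(dim=1)).abs()
    return (cos_dist + delta_n * norm_diff).mean()
```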
2.3 Adversarial LE Post-Specialization
We next extend the DFFN post-specialization model
with an adversarial architecture (ADV), following
Ponti et al. (2018) who demonstrated its useful-
ness for similarity-based specialization. The intu-
ition behind the adversarial extension is as follows:
the specialization function
G(xs;θG)
should not
only produce vectors that have high cosine simi-
larity and similar norms with corresponding LEAR-
specialized vectors
ys
, but should also ensure that
these vectors seem “natural”, that is, as if they were
indeed sampled from
Ys
. We can force the post-
specialized vectors
G(xs;θG)
to be legitimate sam-
ples from the
Ys
distribution by introducing an ad-
versary that learns to discriminate whether a given
vector has been generated by the specialization
function or directly sampled from
Ys
. Such adver-
saries prevent the generation of unrealistic outputs,
as demonstrated in computer vision (Pathak et al.,
2016;Ledig et al.,2017;Odena et al.,2017).
The DFFN function $G(x; \theta_G)$ from §2.2 can be seen as the generator component. We couple the generator with the discriminator $D(x; \theta_D)$, also instantiated as a DFFN. The discriminator performs binary classification: presented with a word vector, it predicts whether it has been produced by $G$ or sampled from the LEAR-specialized subspace $Y_s$. On the other hand, the generator tries to produce vectors which the discriminator would misclassify as sampled from $Y_s$. The discriminator's loss is defined via negative log-likelihood over two sets of inputs: generator-produced vectors $G(x_s; \theta_G)$ and LEAR specializations $y_s$:

$$L_D = - \sum_{s=1}^{N} \log P\big(spec = 0 \mid G(x_s; \theta_G); \theta_D\big) - \sum_{s=1}^{M} \log P\big(spec = 1 \mid y_s; \theta_D\big) \quad (7)$$

3 Simply minimizing Euclidean distance also aligns vectors in terms of both direction and size. However, we consistently obtained better results by the objective function from Eq. (6).
Besides minimizing the similarity-based loss $L_S$, the generator has the additional task of confusing the discriminator: it thus perceives the discriminator's correct predictions as its additional loss $L_G$:

$$L_G = - \sum_{s=1}^{N} \log P\big(spec = 1 \mid G(x_s; \theta_G); \theta_D\big) - \sum_{s=1}^{M} \log P\big(spec = 0 \mid y_s; \theta_D\big) \quad (8)$$

We learn $G$'s and $D$'s parameters with stochastic gradient descent – to reduce the covariance shift and make training more robust, each batch contains examples of the same class (either only predicted vectors or only LEAR vectors). Moreover, for each update step of $L_G$ we alternate between $s_D$ update steps for $L_D$ and $s_S$ update steps for $L_S$.
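Below is a simplified sketch of one adversarial update: the generator is the DFFN above and the discriminator is a smaller DFFN with a single logit output. The second term of Eq. (8) does not affect the generator's parameters (since $y_s$ is independent of $G$) and is dropped here; the $s_S$/$s_D$ scheduling and the alternation with $L_S$ (§3) are also left out. All names are ours.

```python
# Simplified sketch of one adversarial update (Section 2.3). G is the DFFN above,
# D a two-layer discriminator with a single "is this a real LEAR vector?" logit.
import torch
import torch.nn as nn

def make_discriminator(dim, hidden=1024, num_hidden=2):
    layers, in_dim = [], dim
    for _ in range(num_hidden):
        layers += [nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2)]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 1))  # logit; 1 = sampled from Y_s
    return nn.Sequential(*layers)

def adversarial_step(G, D, opt_G, opt_D, x_s, y_s):
    bce = nn.BCEWithLogitsLoss()

    # Discriminator (Eq. 7): label generated vectors 0, real LEAR vectors 1.
    opt_D.zero_grad()
    loss_D = (bce(D(G(x_s).detach()), torch.zeros(len(x_s), 1))
              + bce(D(y_s), torch.ones(len(y_s), 1)))
    loss_D.backward()
    opt_D.step()

    # Generator (Eq. 8, first term): try to make D label generated vectors 1.
    opt_G.zero_grad()
    loss_G = bce(D(G(x_s)), torch.ones(len(x_s), 1))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```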
2.4 Cross-Lingual LE Specialization Transfer
The POSTLE models enable LE specialization of
vectors of words unseen in lexical constraints. Con-
ceptually, this also allows for a LE-specialization of
a distributional space of another language (possibly
without any external constraints), provided a shared
bilingual distributional word vector space. To this
end, we can resort to any of the methods for induc-
ing shared cross-lingual vector spaces (Ruder et al.,
2018). What is more, most recent methods success-
fully learn the shared space without any bilingual
signal (Conneau et al.,2018;Artetxe et al.,2018;
Chen and Cardie,2018;Hoshen and Wolf,2018).
Let $X_t$ be the distributional space of some target language for which we have no external lexical constraints and let $P(x; \theta_P): \mathbb{R}^{d_t} \mapsto \mathbb{R}^{d_s}$ be the (linear) function projecting vectors $x_t \in X_t$ to the distributional space $X_s$ of the source language with available lexical constraints, for which we trained the post-specialization model. We then simply obtain the LE-specialized space $Y_t$ of the target language by composing the projection $P$ with the post-specialization $G$ (see Figure 1):

$$Y_t = G\big(P(X_t; \theta_P); \theta_G\big) \quad (9)$$
In §4.3 we report on language transfer experiments with three different linear projection models $P$ in order to verify the robustness of the cross-lingual LE-specialization transfer.4
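Eq. (9) amounts to a simple composition of a pre-trained linear cross-lingual projection with the trained POSTLE network. A sketch, assuming the projection is given as a matrix W_P and G is the PyTorch module sketched above (names are ours):

```python
# Sketch of Eq. (9): project target-language vectors into the source space with
# a pre-trained linear map W_P, then apply the trained POSTLE network G.
import numpy as np
import torch

def le_specialize_target(X_t, W_P, G):
    """X_t: (n, d_t) target-language vectors; W_P: (d_t, d_s) projection matrix;
    G: trained SpecializationDFFN. Returns the LE-specialized space Y_t."""
    X_proj = X_t @ W_P                                     # P(X_t; theta_P)
    with torch.no_grad():
        Y_t = G(torch.from_numpy(X_proj).float()).numpy()  # G(.; theta_G)
    return Y_t
```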
3 Experimental Setup
Distributional Vectors.
To test the robustness of
the POSTLE approach, we experiment with two
pre-trained English word vector spaces: (1) vec-
tors trained by Levy and Goldberg (2014) on the
Polyglot Wikipedia (Al-Rfou et al.,2013) using
Skip-Gram with Negative Sampling (SGNS-BOW2)
(Mikolov et al.,2013) and (2) GLOVE embed-
dings trained on the Common Crawl (Penning-
ton et al.,2014). In the cross-lingual transfer ex-
periments (§4.3), we use English, Spanish, and
French FASTTEXT embeddings trained on respec-
tive Wikipedias (Bojanowski et al.,2017).
Linguistic Constraints.
We use the same set of
constraints as LEAR in prior work (Vulić and Mrkšić, 2018): synonymy and antonymy con-
straints from (Zhang et al.,2014;Ono et al.,2015)
are extracted from WordNet and Roget’s Thesaurus
(Kipfer,2009). As in other work on LE specializa-
tion (Nguyen et al.,2017;Nickel and Kiela,2017),
asymmetric LE constraints are extracted from Word-
Net, and we collect both direct and indirect LE
pairs (i.e., (parrot, bird),(bird, animal), and (par-
rot, animal) are in the LE set). In total, we work
with 1,023,082 pairs of synonyms, 380,873 pairs
of antonyms, and 1,545,630 LE pairs.
Training Configurations.
For LEAR, we adopt the hyperparameter setting reported in the original paper: $\delta_{att} = 0.6$, $\delta_{rep} = 0$, $\lambda_{reg} = 10^{-9}$. For POSTLE, we fine-tune the hyperparameters via random search on the validation set: 1) DFFN uses $H = 4$ hidden layers, each with 1,536 units and Swish as the activation function (Ramachandran et al.,
2018); 2) ADV relies on $H = 4$ hidden layers, each with $m = 2{,}048$ units and Leaky ReLU (slope 0.2) (Maas et al., 2014) for the generator. The discriminator uses $H = 2$ layers with 1,024 units and Leaky ReLU. For each update based on the generator loss ($L_G$), we perform $s_S = 3$ updates based on the similarity loss ($L_S$) and $s_D = 5$ updates based on the discriminator loss ($L_D$). The value for the norm-difference contribution in $L_S$ is set to $\delta_n = 0.1$ (see Eq. (6)) for both POSTLE variants. We train POSTLE models using SGD with batch size 32, an initial learning rate of 0.1, and a decay rate of 0.98 applied every 1M examples.

4 We experiment with unsupervised and weakly supervised models for inducing cross-lingual embedding spaces. However, we stress that the POSTLE specialization transfer is equally applicable on top of any method for inducing cross-lingual word vectors, some of which may require more bilingual supervision (Upadhyay et al., 2016; Ruder et al., 2018).
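For reference, the hyperparameters listed above can be collected into a single configuration object; this is merely a convenience sketch, with key names of our choosing:

```python
# The POSTLE hyperparameters reported in Section 3, as a plain config dict.
POSTLE_CONFIG = {
    "dffn": {"hidden_layers": 4, "hidden_units": 1536, "activation": "swish"},
    "adv_generator": {"hidden_layers": 4, "hidden_units": 2048,
                      "activation": "leaky_relu", "slope": 0.2},
    "adv_discriminator": {"hidden_layers": 2, "hidden_units": 1024,
                          "activation": "leaky_relu", "slope": 0.2},
    "updates_per_generator_step": {"s_S": 3, "s_D": 5},
    "delta_n": 0.1,          # norm-difference weight in Eq. (6)
    "optimizer": {"name": "sgd", "batch_size": 32, "lr": 0.1,
                  "lr_decay": 0.98, "decay_every_examples": 1_000_000},
}
```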
Asymmetric LE Distance.
The distance that mea-
sures the strength of the LE relation in the special-
ized space reflects both the cosine distance between
the vectors as well as the asymmetric difference
between their norms (Vulić and Mrkšić, 2018):

$$I_{LE}(x, y) = d_{cos}(x, y) + \frac{\lVert x \rVert - \lVert y \rVert}{\lVert x \rVert + \lVert y \rVert} \quad (10)$$

LE-specialized vectors of general concepts obtain larger norms than vectors of specific concepts. True LE pairs should display both a small cosine distance and a negative norm difference. Therefore, in different LE tasks we can rank the candidate pairs in the ascending order of their asymmetric LE distance $I_{LE}$. The LE distances are trivially transformed into binary LE predictions, using a binarization threshold $t$: if $I_{LE}(x, y) < t$, we predict that LE holds between words $x$ and $y$ with vectors $\mathbf{x}$ and $\mathbf{y}$.
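A small NumPy sketch of the asymmetric distance in Eq. (10) and its binarization with a threshold t (function names are ours):

```python
# Sketch of the asymmetric LE distance (Eq. 10) and its binarization.
import numpy as np

def le_distance(x, y):
    """I_LE(x, y): cosine distance plus the normalized norm difference."""
    cos_dist = 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    norm_term = (np.linalg.norm(x) - np.linalg.norm(y)) / (
        np.linalg.norm(x) + np.linalg.norm(y))
    return cos_dist + norm_term

def predict_le(x, y, t):
    """Binary decision: predict that x entails y if I_LE(x, y) < t."""
    return le_distance(x, y) < t
```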
4 Evaluation and Results
We extensively evaluate the proposed POSTLE mod-
els on two fundamental LE tasks: 1) predicting
graded LE and 2) LE detection (and directionality),
in monolingual and cross-lingual transfer settings.
4.1 Predicting Graded LE
The asymmetric distance $I_{LE}$ can be directly used to make fine-grained graded assertions about the hierarchical relationships between concepts. Following previous work (Nickel and Kiela, 2017; Vulić and Mrkšić, 2018), we evaluate graded LE on the standard HyperLex dataset (Vulić et al., 2017).5 HyperLex contains 2,616 word pairs (2,163 noun pairs, the rest are verb pairs) rated by humans by estimating on a [0, 6] scale the degree to which the first concept is a type of the second concept.

5 Graded LE is a phenomenon deeply rooted in cognitive science and linguistics: it captures the notions of concept prototypicality (Rosch, 1973; Medin et al., 1984) and category vagueness (Kamp and Partee, 1995; Hampton, 2007). We refer the reader to the original paper for a more detailed discussion.

[Figure 2 here: two panels, (a) SGNS-BOW2 and (b) GLOVE, plotting Spearman's ρ correlation (y-axis, 0.25–0.65) against the percentage of seen HyperLex words (x-axis, 0–100%) for LEAR, DFFN, and ADV.]
Figure 2: Spearman's ρ correlation scores for two input distributional spaces on the noun portion of HyperLex (2,163 concept pairs) conditioned on the number of test words covered (i.e., seen) in the external lexical resource. Similar patterns are observed on the full HyperLex dataset. Two other baseline models report the following scores on the noun portion of HyperLex in the 100% setting: 0.512 (Nickel and Kiela, 2017); 0.540 (Nguyen et al., 2017).
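A sketch of the graded LE evaluation protocol, assuming the le_distance helper from the sketch in §3: pairs are scored with the negated asymmetric distance (smaller I_LE means stronger LE) and correlated with the human ratings via Spearman's ρ. Names are ours.

```python
# Sketch of the graded LE evaluation on HyperLex (reuses le_distance above).
from scipy.stats import spearmanr

def evaluate_hyperlex(pairs, human_scores, vectors):
    """pairs: list of (hyponym, hypernym) word pairs; human_scores: gold ratings;
    vectors: dict mapping each word to its (LE-specialized) vector."""
    model_scores = [-le_distance(vectors[w1], vectors[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```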
Results and Discussion.
We evaluate the perfor-
mance of LE specialization models in a deliberately
controlled setup: we (randomly) select a percentage
of HyperLex words (0%, 30%, 50%, 70%, 90% and
100%) which are allowed to be seen in the external
constraints, and discard the constraints containing
other HyperLex words, making them effectively
unseen by the initial LEAR model. In the 0% set-
ting all constraints containing any of the HyperLex
words have been removed, whereas in the 100% set-
ting, all available constraints are used. The scores
are summarized in Figure 2.
The 0% setting is especially indicative of POS-
TLE performance: we notice large gains in perfor-
mance without seeing a single word from HyperLex
in the external resource. This result verifies that the
POSTLE models can generalize well to words un-
seen in the resources. Intuitively, the gap between
POSTLE and LEAR is reduced in the settings where
LEAR “sees” more words. In the 100% setting we
report the same scores for LEAR and POSTLE: this
is an artefact of the HyperLex dataset construction
as all HyperLex word pairs were sampled from
WordNet (i.e., the coverage of test words is 100%).
Another finding is that in the resource-leaner 0%
and 30% settings POSTLE outperforms two other
baselines (Nguyen et al.,2017;Nickel and Kiela,
2017), despite the fact that the two baselines have
“seen” all HyperLex words. The results further in-
dicate that POSTLE yields gains on top of different
initial distributional spaces. As expected, the scores
are higher with the more sophisticated ADV variant.
4.2 LE Detection
Detection and Directionality Tasks.
We now
evaluate POSTLE models on three binary classi-
fication datasets commonly used for evaluating LE
models (Roller et al.,2014;Shwartz et al.,2017;
Nguyen et al.,2017), compiled into an integrated
benchmark by Kiela et al. (2015).6
The first task, LE directionality, is evaluated
on 1,337 true LE pairs (DBLESS) extracted from
BLESS (Baroni and Lenci,2011). The task tests the
models’ ability to predict which word in the LE pair
is the hypernym. This is simply achieved by taking
the word with a larger word vector norm as the
hypernym. The second task, LE detection, is evalu-
ated on the WBLESS dataset (Weeds et al.,2014),
comprising 1,668 word pairs standing in one of
several lexical relations (LE, meronymy-holonymy,
co-hyponymy, reverse LE, and no relation). The
models have to distinguish true LE pairs from pairs
that stand in other relations (including the reverse
LE). We score all pairs using the $I_{LE}$ distance. Following Nguyen et al. (2017), we find the threshold $t$ via cross-validation.7 Finally, we evaluate LE detection and directionality simultaneously on BIBLESS, a relabeled variant of WBLESS. The task is to detect true LE pairs (including the reverse LE pairs), and also to determine the relation directionality. We again use $I_{LE}$ to detect LE pairs, and then compare the vector norms to select the hypernym.
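The following sketch illustrates the decision rules used here: directionality by comparing vector norms (DBLESS) and detection by thresholding I_LE, with the repeated 2%/98% threshold-tuning protocol of footnote 7 approximated by a simple search over candidate thresholds. All names and the candidate-threshold heuristic are ours.

```python
# Sketch of the *BLESS decision rules: directionality by vector norm (DBLESS)
# and detection by thresholding I_LE with repeated 2%/98% threshold tuning.
import numpy as np

def predict_hypernym(x, y):
    """DBLESS directionality: the word with the larger norm is the hypernym."""
    return "right" if np.linalg.norm(y) > np.linalg.norm(x) else "left"

def detection_accuracy(distances, labels, iterations=1000, tune_frac=0.02, seed=0):
    """WBLESS-style detection: per iteration, tune the threshold t on a small
    sample and measure accuracy on the rest; return the averaged accuracy.
    distances: I_LE per pair; labels: True for genuine LE pairs."""
    distances, labels = np.asarray(distances), np.asarray(labels, dtype=bool)
    rng, accs = np.random.default_rng(seed), []
    for _ in range(iterations):
        idx = rng.permutation(len(distances))
        n_tune = max(1, int(tune_frac * len(distances)))
        tune, test = idx[:n_tune], idx[n_tune:]
        # candidate thresholds: the tuning-set distances themselves (a heuristic)
        cand = distances[tune]
        tune_acc = [((distances[tune] < c) == labels[tune]).mean() for c in cand]
        t = cand[int(np.argmax(tune_acc))]
        accs.append(((distances[test] < t) == labels[test]).mean())
    return float(np.mean(accs))
```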
For all three tasks, we consider two evaluation settings: 1) in the FULL setting we use all available lexical constraints (see §3) for the initial LEAR specialization; 2) in the DISJOINT setting, we remove all constraints that contain any of the test words, making all test words effectively unseen by LEAR.

6 http://www.cl.cam.ac.uk/dk427/generality.html
7 In each of the 1,000 iterations, 2% of the pairs are sampled for threshold tuning, and the remaining 98% are used for testing. The reported numbers are therefore averaged scores.
                              Setup: FULL                               Setup: DISJOINT
                              DBLESS       WBLESS       BIBLESS         DBLESS       WBLESS       BIBLESS
                              SG    GL     SG    GL     SG    GL        SG    GL     SG    GL     SG    GL
LEAR (Vulić et al., 2018)     .957  .955   .905  .910   .872  .875      .528  .531   .555  .529   .381  .389
POSTLE DFFN                   .957  .955   .905  .910   .872  .875      .898  .825   .754  .746   .696  .677
POSTLE ADV                    .957  .955   .905  .910   .872  .875      .942  .888   .832  .766   .757  .690

Table 1: Accuracy of POSTLE models on *BLESS datasets, for two different sets of English distributional vectors: Skip-Gram (SG) and GloVe (GL). LEAR reports the highest scores on *BLESS datasets in the literature.
                    Target: SPANISH          Target: FRENCH
Random              .498                     .515
Distributional      .362                     .387
                    Ar     Co     Sm         Ar     Co     Sm
POSTLE DFFN         .798   .740   .728       .688   .735   .742
POSTLE ADV          .768   .790   .782       .746   .770   .786

Table 2: Average precision (AP) of POSTLE models in cross-lingual transfer. Results are shown for both POSTLE models (DFFN and ADV), two target languages (Spanish and French) and three methods for inducing bilingual vector spaces: Ar (Artetxe et al., 2018), Co (Conneau et al., 2018), and Sm (Smith et al., 2017).
Results and Discussion.
The accuracy scores on
*BLESS test sets are provided in Table 1.8 Our
POSTLE models display exactly the same perfor-
mance as LEAR in the FULL setting: this is simply
because all words found in *BLESS datasets are
covered by the lexical constraints, and POSTLE
does not generalize the initial LEAR transforma-
tion to unseen test words. In the DISJOINT setting,
however, LEAR is “blind” and has not seen a sin-
gle test word in the constraints: it leaves distribu-
tional vectors of *BLESS test words identical. In
this setting, LEAR performance is equivalent to the
original distributional space. In contrast, learning
to generalize the LE specialization function from
LEAR-specializations of other words, POSTLE mod-
els are able to successfully LE-specialize vectors
of test *BLESS words. As in the graded LE, the
adversarial POSTLE architecture outperforms the
simpler DFFN model.
4.3 Cross-Lingual Transfer
Finally, we evaluate cross-lingual transfer of LE
specialization. We train POSTLE models using dis-
tributional (FASTTEXT) English (EN) vectors as
input. Afterwards, we apply those models to the
distributional vector spaces of two other languages,
French (FR) and Spanish (ES), after mapping them into the same space as English as described in §2.4.

8 We have evaluated the prediction performance also in terms of F1 and, in the ranking formulation, in terms of average precision (AP), and observed the same trends in results.
We experiment with several methods to induce
cross-lingual word embeddings: 1) MUSE, an adversarial unsupervised model fine-tuned with the closed-form Procrustes solution (Conneau et al.,
2018); 2) an unsupervised self-learning algorithm
that iteratively bootstraps new bilingual seeds, ini-
tialized according to structural similarities of the
monolingual spaces (Artetxe et al.,2018); 3) an or-
thogonal linear mapping with inverse softmax, su-
pervised by 5K bilingual seeds (Smith et al.,2017).
We test POSTLE-specialized Spanish and French
word vectors on WN-Hy-ES and WN-Hy-FR, two
equally sized datasets (148K word pairs) created by
Glavaš and Ponzetto (2017) using the ES WordNet
(Gonzalez-Agirre et al.,2012) and the FR WordNet
(Sagot and Fišer,2008). We perform a ranking
evaluation: the aim is to rank LE pairs above pairs
standing in other relations (meronyms, synonyms,
antonyms, and reverse LE). We rank word pairs in
the ascending order based on $I_{LE}$, see Eq. (10).
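A sketch of the ranking metric used here: pairs are sorted by ascending I_LE and average precision is computed with true LE pairs as the positive class (names are ours).

```python
# Sketch of the ranking evaluation: sort pairs by ascending I_LE and compute
# average precision with true LE pairs as positives.
import numpy as np

def average_precision(distances, is_le):
    """distances: I_LE per candidate pair; is_le: True for genuine LE pairs."""
    order = np.argsort(distances)          # smallest distance ranked first
    hits = np.asarray(is_le, dtype=bool)[order]
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float(precision_at_k[hits].mean())
```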
Results and Discussion.
The average precision
(AP) ranking scores achieved via cross-lingual
transfer of POSTLE are shown in Table 2. We report
AP scores using three methods for cross-lingual
word embedding induction, and compare their per-
formance to two baselines: 1) random word pair
scoring, and 2) the original (FASTTEXT) vectors.
The results uncover the inability of distributional
vectors to capture LE – they yield lower perfor-
mance than the random baseline, which strongly
emphasizes the need for the LE-specialization. The
transferred POSTLE yields an immense improve-
ment over the distributional baselines (up to +0.428,
i.e. +118%). Again, the adversarial architecture sur-
passes DFFN across the board, with the single excep-
tion of EN-ES transfer coupled with Artetxe et al.
(2018)’s cross-lingual model. Furthermore, trans-
fers with unsupervised (Ar, Co) and supervised
bilingual mapping (Sm) yield comparable perfor-
mance. This implies that a robust LE-specialization
of distributional vectors for languages with no
lexico-semantic resources is possible even without
any bilingual signal or translation effort.
5 Related Work
Vector Space Specialization.
In general, lexical
specialization models fall into two categories: 1)
joint optimization models and 2) post-processing or
retrofitting models. Joint models integrate external
constraints directly into the distributional objective
of embedding algorithms such as Skip-Gram and
CBOW (Mikolov et al.,2013), or Canonical Corre-
lation Analysis (Dhillon et al.,2015). They either
modify the prior or regularization of the objective
(Yu and Dredze,2014;Xu et al.,2014;Kiela et al.,
2015) or augment it with factors reflecting exter-
nal lexical knowledge (Liu et al.,2015;Ono et al.,
2015;Osborne et al.,2016;Nguyen et al.,2017).
Each joint model is tightly coupled to a specific dis-
tributional objective: any change to the underlying
distributional model requires a modification of the
whole joint model and expensive retraining.
In contrast, retrofitting models (Faruqui et al.,
2015;Rothe and Schütze,2015;Wieting et al.,
2015;Jauhar et al.,2015;Nguyen et al.,2016;
Mrkšić et al., 2016; Mrkšić et al., 2017; Vulić and Mrkšić, 2018) use external constraints to post-
hoc fine-tune distributional spaces. Effectively, this
makes them applicable to any input distributional
space, but they modify only vectors of words seen
in the external resource. Nonetheless, retrofitting
models tend to outperform joint models in the con-
text of both similarity-based (Mrkšić et al., 2016) and LE specialization (Vulić and Mrkšić, 2018).
The recent post-specialization paradigm has
been so far applied only to the symmetric semantic
similarity relation. Vulić et al. (2018) generalize over the retrofitting ATTRACT-REPEL (AR) model (Mrkšić et al., 2017) by learning a global similarity-focused specialization function implemented as a DFFN. Ponti et al. (2018) further propose an adver-
sarial post-specialization architecture. In this work,
we show that post-specialization represents a vi-
able methodology for specializing all distributional
word vectors for the LE relation as well.
Modeling Lexical Entailment.
Extensive re-
search effort in lexical semantics has been dedi-
cated to automatic detection of the fundamental tax-
onomic LE relation. Early approaches (Weeds et al.,
2004;Clarke,2009;Kotlerman et al.,2010;Lenci
and Benotto,2012,inter alia) detected LE word
pairs by means of asymmetric direction-aware
mechanisms such as distributional inclusion hy-
pothesis (Geffet and Dagan,2005), and concept
informativeness and generality (Herbelot and Gane-
salingam,2013;Santus et al.,2014;Shwartz et al.,
2017), but were surpassed by more recent methods
that leverage word embeddings.
Embedding-based methods either 1) induce LE-
oriented vector spaces using text (Vilnis and Mc-
Callum,2015;Yu et al.,2015;Vendrov et al.,2016;
Henderson and Popa,2016;Nguyen et al.,2017;
Chang et al., 2018; Vulić and Mrkšić, 2018) and/or
external hierarchies (Nickel and Kiela,2017,2018;
Sala et al.,2018) or 2) use distributional vectors as
features for supervised LE detection models (Ba-
roni et al.,2012;Tuan et al.,2016;Shwartz et al.,
2016;Glavaš and Ponzetto,2017;Rei et al.,2018).
Our POSTLE method belongs to the first group.
Vulić and Mrkšić (2018) proposed LEAR, a
retrofitting LE model which displays performance
gains on a spectrum of graded and ungraded LE
evaluations compared to joint specialization mod-
els (Nguyen et al., 2017). However, LEAR still spe-
cializes only the vectors of words seen in external
resources. The same limitation holds for a family
of recent models that embed concept hierarchies
(i.e., trees or directed acyclic graphs) in hyperbolic
spaces (Nickel and Kiela,2017;Chamberlain et al.,
2017;Nickel and Kiela,2018;Sala et al.,2018;
Ganea et al.,2018). Although hyperbolic spaces are
arguably more suitable for embedding hierarchies
than the Euclidean space, the “Euclidean-based”
LEAR has been proven to outperform the hyper-
bolic embedding of the WordNet hierarchy across
a range of LE tasks (Vulić and Mrkšić, 2018).
The proposed POSTLE framework 1) mitigates
the limited coverage issue of retrofitting LE-
specialization models, and 2) removes the problem
of dependence on distributional objective in joint
models. Unlike retrofitting models, PO STLE LE-
specializes vectors of all vocabulary words, and un-
like joint models, it is computationally inexpensive
and applicable to any distributional vector space.
6 Conclusion
We have presented POSTLE, a novel neural post-
specialization framework that specializes distribu-
tional vectors of all words – including the ones
unseen in external lexical resources – to accentu-
ate the hierarchical asymmetric lexical entailment
(LE or hyponymy-hypernymy) relation. The ben-
efits of our two all-words POSTLE model variants
have been shown across a range of graded and bi-
nary LE detection tasks on standard benchmarks.
What is more, we have indicated the usefulness of
the POSTLE paradigm for zero-shot cross-lingual
LE specialization of word vectors in target lan-
guages, even without having any external lexical
knowledge in the target. In future work, we will
experiment with more sophisticated neural archi-
tectures, other resource-lean languages, and boot-
strapping approaches to LE specialization. Code
and POSTLE-specialized vectors are available at:
[https://github.com/ashkamath/POSTLE].
Acknowledgments
EMP and IV are supported by the ERC Consolida-
tor Grant LEXICAL (648909). The authors would
like to thank the anonymous reviewers for their
helpful suggestions.
References
Rami Al-Rfou, Bryan Perozzi, and Steven Skiena.
2013. Polyglot: Distributed word representations for
multilingual NLP. In Proceedings of CoNLL, pages
183–192.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018.
A robust self-learning method for fully unsupervised
cross-lingual mappings of word embeddings. In Pro-
ceedings of ACL, pages 789–798.
Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do,
and Chung-chieh Shan. 2012. Entailment above the
word level in distributional semantics. In Proceed-
ings of EACL, pages 23–32.
Marco Baroni and Alessandro Lenci. 2011. How we
BLESSed distributional semantic evaluation. In Pro-
ceedings of the GEMS 2011 Workshop, pages 1–10.
Richard Beckwith, Christiane Fellbaum, Derek Gross,
and George A. Miller. 1991. WordNet: A lexical
database organized on psycholinguistic principles.
Lexical acquisition: Exploiting on-line resources to
build a lexicon, pages 211–231.
Or Biran and Kathleen McKeown. 2013. Classifying
taxonomic relations between pairs of Wikipedia arti-
cles. In Proceedings of IJCNLP, pages 788–794.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and
Tomas Mikolov. 2017. Enriching word vectors with
subword information.Transactions of the ACL,
5:135–146.
Samuel R. Bowman, Gabor Angeli, Christopher Potts,
and Christopher D. Manning. 2015. A large anno-
tated corpus for learning natural language inference.
In Proceedings of EMNLP, pages 632–642.
Benjamin Paul Chamberlain, James Clough, and
Marc Peter Deisenroth. 2017. Neural embeddings of
graphs in hyperbolic space.CoRR, abs/1705.10359.
Haw-Shiuan Chang, Ziyun Wang, Luke Vilnis, and An-
drew McCallum. 2018. Distributional inclusion vec-
tor embedding for unsupervised hypernymy detec-
tion. In Proceedings of NAACL-HLT, pages 485–
495.
Xilun Chen and Claire Cardie. 2018. Unsupervised
multilingual word embeddings. In Proceedings of
EMNLP, pages 261–270.
Daoud Clarke. 2009. Context-theoretic semantics for
natural language: An overview. In Proceedings of
the Workshop on Geometrical Models of Natural
Language Semantics (GEMS), pages 112–119.
Allan M. Collins and Ross M. Quillian. 1972. Exper-
iments on semantic memory and language compre-
hension. Cognition in Learning and Memory.
Alexis Conneau, Guillaume Lample, Marc’Aurelio
Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018.
Word translation without parallel data. In Proceed-
ings of ICLR (Conference Track).
Ido Dagan, Dan Roth, Mark Sammons, and Fabio Mas-
simo Zanzotto. 2013. Recognizing textual entail-
ment: Models and applications. Synthesis Lectures
on Human Language Technologies, 6(4):1–220.
Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Un-
gar. 2015. Eigenwords: Spectral word embeddings.
Journal of Machine Learning Research, 16:3035–
3078.
Manaal Faruqui. 2016. Diverse Context for Learning
Word Representations. Ph.D. thesis, Carnegie Mel-
lon University.
Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar,
Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015.
Retrofitting word vectors to semantic lexicons. In
Proceedings of NAACL-HLT, pages 1606–1615.
Christiane Fellbaum. 1998. WordNet. MIT Press.
Octavian-Eugen Ganea, Gary Bécigneul, and Thomas
Hofmann. 2018. Hyperbolic entailment cones for
learning hierarchical embeddings. In Proceedings
of ICML, pages 1632–1641.
Maayan Geffet and Ido Dagan. 2005. The distribu-
tional inclusion hypotheses and lexical entailment.
In Proceedings of ACL, pages 107–114.
Goran Glavaš and Ivan Vulić. 2018. Explicit
retrofitting of distributional word vectors. In Pro-
ceedings of ACL, pages 34–45.
Goran Glavaš and Simone Paolo Ponzetto. 2017.
Dual tensor model for detecting asymmetric lexico-
semantic relations. In Proceedings of EMNLP,
pages 1758–1768.
Aitor Gonzalez-Agirre, Egoitz Laparra, and German
Rigau. 2012. Multilingual central repository version
3.0. In LREC, pages 2525–2529.
Amit Gupta, Rémi Lebret, Hamza Harkous, and Karl
Aberer. 2017. Taxonomy induction using hyper-
nym subsequences. In Proceedings of CIKM, pages
1329–1338.
James A. Hampton. 2007. Typicality, graded member-
ship, and vagueness. Cognitive Science, 31(3):355–
384.
Zellig S. Harris. 1954. Distributional structure. Word,
10(23):146–162.
James Henderson and Diana Popa. 2016. A vector
space for distributional semantics for entailment. In
Proceedings of ACL, pages 2052–2062.
Aurélie Herbelot and Mohan Ganesalingam. 2013.
Measuring semantic content in distributional vectors.
In Proceedings of ACL, pages 440–445.
Yedid Hoshen and Lior Wolf. 2018. Non-adversarial
unsupervised word translation. In Proceedings of
EMNLP, pages 469–478.
Sujay Kumar Jauhar, Chris Dyer, and Eduard H. Hovy.
2015. Ontologically grounded multi-sense represen-
tation learning for semantic vector space models. In
Proceedings of NAACL, pages 683–693.
Hans Kamp and Barbara Partee. 1995. Prototype the-
ory and compositionality.Cognition, 57(2):129–
191.
Douwe Kiela, Laura Rimell, Ivan Vulić, and Stephen
Clark. 2015. Exploiting image generality for lexical
entailment detection. In Proceedings of ACL, pages
119–124.
Barbara Ann Kipfer. 2009. Roget’s 21st Century The-
saurus (3rd Edition). Philip Lief Group.
Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan
Zhitomirsky-Geffet. 2010. Directional distribu-
tional similarity for lexical inference.Natural Lan-
guage Engineering, 16(4):359–389.
Christian Ledig, Lucas Theis, Ferenc Huszar, Jose
Caballero, Andrew Cunningham, Alejandro Acosta,
Andrew Aitken, Alykhan Tejani, Johannes Totz, Ze-
han Wang, et al. 2017. Photo-realistic single image
super-resolution using a generative adversarial net-
work. In Proceedings of CVPR, pages 4681–4690.
Alessandro Lenci and Giulia Benotto. 2012. Identify-
ing hypernyms in distributional semantic spaces. In
Proceedings of *SEM, pages 75–79.
Omer Levy and Yoav Goldberg. 2014. Dependency-
based word embeddings. In Proceedings of ACL,
pages 302–308.
Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and
Yu Hu. 2015. Learning semantic word embeddings
based on ordinal knowledge constraints. In Proceed-
ings of ACL, pages 1501–1511.
Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng.
2014. Rectifier nonlinearities improve neural net-
work acoustic models. In Proceedings of ICML.
Douglas L. Medin, Mark W. Altom, and Timothy D.
Murphy. 1984. Given versus induced category repre-
sentations: Use of prototype and exemplar informa-
tion in classification.Journal of Experimental Psy-
chology, 10(3):333–352.
Oren Melamud, Jacob Goldberger, and Ido Dagan.
2016. Context2vec: Learning generic context em-
bedding with bidirectional LSTM. In Proceedings
of CoNLL, pages 51–61.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S.
Corrado, and Jeffrey Dean. 2013. Distributed rep-
resentations of words and phrases and their compo-
sitionality. In Proceedings of NIPS, pages 3111–
3119.
Michael Mohler, David Bracewell, Marc Tomlinson,
and David Hinote. 2013. Semantic signatures for
example-based linguistic metaphor detection. In
Proceedings of the First Workshop on Metaphor in
NLP, pages 27–35.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Maria Rojas-Barahona, Pei-
Hao Su, David Vandyke, Tsung-Hsien Wen, and
Steve Young. 2016. Counter-fitting word vectors
to linguistic constraints. In Proceedings of NAACL-
HLT, pages 142–148.
Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korho-
nen, and Steve Young. 2017. Semantic specialisa-
tion of distributional word vector spaces using mono-
lingual and cross-lingual constraints.Transactions
of the ACL, 5:309–324.
Roberto Navigli and Simone Paolo Ponzetto. 2012. Ba-
belNet: The automatic construction, evaluation and
application of a wide-coverage multilingual seman-
tic network.Artificial Intelligence, 193:217–250.
Roberto Navigli, Paola Velardi, and Stefano Faralli.
2011. A graph-based algorithm for inducing lexical
taxonomies from scratch. In Proceedings of IJCAI,
pages 1872–1877.
Kim Anh Nguyen, Maximilian Köper, Sabine
Schulte im Walde, and Ngoc Thang Vu. 2017.
Hierarchical embeddings for hypernymy detection
and directionality. In Proceedings of EMNLP,
pages 233–243.
Kim Anh Nguyen, Sabine Schulte im Walde, and
Ngoc Thang Vu. 2016. Integrating distributional
lexical contrast into word embeddings for antonym-
synonym distinction. In Proceedings of ACL, pages
454–459.
Maximilian Nickel and Douwe Kiela. 2017. Poincaré
embeddings for learning hierarchical representa-
tions. In Proceedings of NIPS, pages 6341–6350.
Maximilian Nickel and Douwe Kiela. 2018. Learning
continuous hierarchies in the Lorentz model of hy-
perbolic geometry. In Proceedings of ICML, pages
3776–3785.
Augustus Odena, Christopher Olah, and Jonathon
Shlens. 2017. Conditional image synthesis with aux-
iliary classifier gans. In Proceedings of ICML, pages
2642–2651.
Masataka Ono, Makoto Miwa, and Yutaka Sasaki.
2015. Word Embedding-based Antonym Detection
using Thesauri and Distributional Information. In
Proceedings of NAACL, pages 984–989.
Dominique Osborne, Shashi Narayan, and Shay Cohen.
2016. Encoding prior knowledge with eigenword
embeddings.Transactions of the ACL, 4:417–430.
Deepak Pathak, Philipp Krähenbühl, Jeff Donahue,
Trevor Darrell, and Alexei A. Efros. 2016. Context
encoders: Feature learning by inpainting. In Pro-
ceedings of CVPR, pages 2536–2544.
Jeffrey Pennington, Richard Socher, and Christopher
Manning. 2014. Glove: Global vectors for word rep-
resentation. In Proceedings of EMNLP, pages 1532–
1543.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word rep-
resentations. In Proceedings of NAACL-HLT, pages
2227–2237.
Edoardo Maria Ponti, Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018. Adversar-
ial propagation and zero-shot cross-lingual transfer
of word vector specialization. In Proceedings of
EMNLP, pages 282–293.
Prajit Ramachandran, Barret Zoph, and Quoc V Le.
2018. Searching for activation functions. In Pro-
ceedings of ICML.
Marek Rei, Daniela Gerz, and Ivan Vulić. 2018. Scor-
ing lexical entailment with a supervised directional
similarity network. In Proceedings of ACL, pages
638–643.
Stephen Roller, Katrin Erk, and Gemma Boleda. 2014.
Inclusive yet selective: Supervised distributional hy-
pernymy detection. In Proceedings of COLING,
pages 1025–1036.
Eleanor H. Rosch. 1973. Natural categories.Cognitive
Psychology, 4(3):328–350.
Sascha Rothe and Hinrich Schütze. 2015. AutoEx-
tend: Extending word embeddings to embeddings
for synsets and lexemes. In Proceedings of ACL,
pages 1793–1803.
Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2018.
A survey of cross-lingual embedding models.Jour-
nal of Artificial Intelligence Research.
Benoît Sagot and Darja Fišer. 2008. Building a free
french wordnet from multilingual resources. In
Proccedings of the OntoLex Workshop.
Frederic Sala, Christopher De Sa, Albert Gu, and
Christopher Ré. 2018. Representation tradeoffs for
hyperbolic embeddings. In Proceedings of ICML,
pages 4457–4466.
Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine
Schulte im Walde. 2014. Chasing hypernyms in vec-
tor spaces with entropy. In Proceedings of EACL,
pages 38–42.
Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015.
Symmetric pattern based word embeddings for im-
proved word similarity prediction. In Proceedings
of CoNLL, pages 258–267.
Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016.
Improving hypernymy detection with an integrated
path-based and distributional method. In Proceed-
ings of ACL, pages 2389–2398.
Vered Shwartz, Enrico Santus, and Dominik
Schlechtweg. 2017. Hypernyms under siege:
Linguistically-motivated artillery for hypernymy
detection. In Proceedings of EACL, pages 65–75.
Samuel L. Smith, David H.P. Turban, Steven Ham-
blin, and Nils Y. Hammerla. 2017. Offline bilin-
gual word vectors, orthogonal transformations and
the inverted softmax. In Proceedings of ICLR (Con-
ference Track).
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006.
Semantic taxonomy induction from heterogenous ev-
idence. In Proceedings of ACL, pages 801–808.
Luu Anh Tuan, Yi Tay, Siu Cheung Hui, and See Kiong
Ng. 2016. Learning term embeddings for taxonomic
relation identification using dynamic weighting neu-
ral network. In Proceedings of EMNLP, pages 403–
413.
Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and
Dan Roth. 2016. Cross-lingual models of word em-
beddings: An empirical comparison. In Proceedings
of ACL, pages 1661–1670.
Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Ur-
tasun. 2016. Order-embeddings of images and lan-
guage. In Proceedings of ICLR (Conference Track).
Luke Vilnis and Andrew McCallum. 2015. Word repre-
sentations via Gaussian embedding. In Proceedings
of ICLR (Conference Track).
Ivan Vulić, Daniela Gerz, Douwe Kiela, Felix Hill, and
Anna Korhonen. 2017. Hyperlex: A large-scale eval-
uation of graded lexical entailment.Computational
Linguistics, 43(4):781–835.
Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna
Korhonen. 2018. Post-specialisation: Retrofitting
vectors of words unseen in lexical resources. In Pro-
ceedings of NAACL-HLT, pages 516–527.
Ivan Vulić and Nikola Mrkšić. 2018. Specialising word
vectors for lexical entailment. In Proceedings of
NAACL-HLT, pages 1134–1145.
Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir,
and Bill Keller. 2014. Learning to distinguish hy-
pernyms and co-hyponyms. In Proceedings of COL-
ING, pages 2249–2259.
Julie Weeds, David Weir, and Diana McCarthy. 2004.
Characterising measures of lexical distributional
similarity. In Proceedings of COLING, pages 1015–
1021.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen
Livescu. 2015. From paraphrase database to compo-
sitional paraphrase model and back.Transactions of
the ACL, 3:345–358.
Adina Williams, Nikita Nangia, and Samuel Bowman.
2018. A broad-coverage challenge corpus for sen-
tence understanding through inference. In Proceed-
ings of NAACL-HLT, pages 1112–1122.
Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang
Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC-
NET: A general framework for incorporating knowl-
edge into word representations. In Proceedings of
CIKM, pages 1219–1228.
Mo Yu and Mark Dredze. 2014. Improving lexical em-
beddings with semantic knowledge. In Proceedings
of ACL, pages 545–550.
Zheng Yu, Haixun Wang, Xuemin Lin, and Min Wang.
2015. Learning term embeddings for hypernymy
identification. In Proceedings of IJCAI, pages 1390–
1397.
Jingwei Zhang, Jeremy Salwen, Michael Glass, and Al-
fio Gliozzo. 2014. Word semantic representations
using bayesian probabilistic tensor factorization. In
Proceedings of EMNLP, pages 1522–1531.
... In the former (Glavaš and Vulić, 2018b, 2019) a global specialisation function (i.e., a deep non-linear feed-forward network) is learned using the lexical constraints as training examples to transform the entire embedding space. In the latter (Biesialska et al., 2020;Kamath et al., 2019;Ponti et al., 2018;, a general mapping function is learned based on the transformations undergone by the words in the initial specialisation (i.e., by predicting the specialised vectors from the original ones) so as to propagate the external lexical-semantic signal to the entire vocabulary. What is crucial, the method can be used to port structured knowledge from one language to another, even completely lacking lexical resources. ...
... Representation spaces induced through self-supervised objectives from large corpora, be it the word embedding spaces (Bojanowski et al., 2017;Mikolov et al., 2013b) or those spanned by LM-pretrained Transformers (Devlin et al., 2019;Liu et al., 2019d), encode only distributional knowledge, i.e., knowledge obtainable from large corpora. A large body of work focused on semantic specialisation (i.e., refinement) of such distributional spaces by means of injecting lexical-semantic knowledge from external resources such as WordNet (Fellbaum, 1998), BabelNet (Navigli and Ponzetto, 2010) or ConceptNet (Liu and Singh, 2004) expressed in the form of lexical constraints Glavaš and Vulić, 2018b;Kamath et al., 2019;Lauscher et al., 2020b;Mrkšić et al., 2017, inter alia). ...
Thesis
Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where they tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in linguistic knowledge captured in their parameters. The objective of this thesis is to address these challenges focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs. To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge leveraging native speakers’ intuitions about verb meaning to support development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on their deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external manually curated information about verbs’ lexical properties can support data-driven models in tasks where accurate verb processing is key. Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting their linguistic capacity.
... Fine-tuning generic word vectors using external knowledge such as WordNet (Miller 1995) has improved performance on a range of language understanding tasks (Glavaš and Vulić 2018). To extend this method to unseen words, Kamath et al. (2019) introduced POSTLE (post-specialization for LE), a model that learns an explicit global specialization function captured with feed-forward neural networks. ...
Article
Full-text available
We present novel methods for detecting lexical entailment in a fully rule-based and explainable fashion, by automatic construction of semantic graphs, in any language for which a crowd-sourced dictionary with sufficient coverage and a dependency parser of sufficient accuracy are available. We experiment and evaluate on both the SemEval-2020 lexical entailment task (Glavaš et al. (2020). Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 24-35) and the SherLIiC lexical inference dataset of typed predicates (Schmitt and Schütze (2019). Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 902-914). Combined with top-performing systems, our method achieves improvements over the previous state-of-the-art on both benchmarks. As a standalone system, it offers a fully interpretable model of lexical entailment that makes detailed error analysis possible, uncovering future directions for improving both the semantic parsing method and the inference process on semantic graphs. We release all components of our system as open source software.
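As a deliberately simplified illustration of the graph-based view taken in this abstract, lexical entailment can be approximated as reachability over directed hypernymy edges. The cited system builds much richer graphs automatically from dictionary definitions and parses, so the hand-made graph and helper below are only a toy stand-in.

```python
# Toy illustration of lexical entailment as inference over a semantic graph:
# a word entails another if hypernymy edges connect them. The tiny hand-made
# graph is a stand-in for automatically constructed semantic graphs.
import networkx as nx

graph = nx.DiGraph()
graph.add_edges_from([
    ("poodle", "dog"),
    ("dog", "mammal"),
    ("mammal", "animal"),
])

def entails(g, hypo, hyper):
    # Directed reachability: hypo -> ... -> hyper
    return g.has_node(hypo) and g.has_node(hyper) and nx.has_path(g, hypo, hyper)

print(entails(graph, "poodle", "animal"))  # True
print(entails(graph, "animal", "poodle"))  # False
```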
... Representation spaces induced through self-supervised objectives from large corpora, be it the word embedding spaces (Mikolov et al., 2013; Bojanowski et al., 2017) or those spanned by LM-pretrained Transformers (Devlin et al., 2019; Liu et al., 2019b), encode only distributional knowledge, i.e., knowledge obtainable from large corpora. A large body of work focused on semantic specialisation (i.e., refinement) of such distributional spaces by means of injecting lexico-semantic knowledge from external resources such as WordNet (Fellbaum, 1998), BabelNet (Navigli and Ponzetto, 2010) or ConceptNet (Liu and Singh, 2004), expressed in the form of lexical constraints (Faruqui et al., 2015; Glavaš and Vulić, 2018c; Kamath et al., 2019; Lauscher et al., 2020b, inter alia). ...
Preprint
In parallel to their overwhelming success across NLP tasks, the language ability of deep Transformer networks pretrained via language modeling (LM) objectives has undergone extensive scrutiny. While probing revealed that these models encode a range of syntactic and semantic properties of a language, they are still prone to fall back on superficial cues and simple heuristics to solve downstream tasks, rather than leverage deeper linguistic knowledge. In this paper, we target one such area of their deficiency, verbal reasoning. We investigate whether injecting explicit information on verbs' semantic-syntactic behaviour improves the performance of LM-pretrained Transformers in event extraction tasks -- downstream tasks for which accurate verb processing is paramount. Concretely, we impart the verb knowledge from curated lexical resources into dedicated adapter modules (dubbed verb adapters), allowing it to complement, in downstream tasks, the language knowledge obtained during LM-pretraining. We first demonstrate that injecting verb knowledge leads to performance gains in English event extraction. We then explore the utility of verb adapters for event extraction in other languages: we investigate (1) zero-shot language transfer with multilingual Transformers as well as (2) transfer via (noisy automatic) translation of English verb-based lexical constraints. Our results show that the benefits of verb knowledge injection indeed extend to other languages, even when verb adapters are trained on noisily translated constraints.
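The abstract above injects verb knowledge through dedicated adapter modules. The excerpt does not specify the exact adapter variant, so the snippet below sketches a generic bottleneck adapter of the kind typically inserted into Transformer layers; the layer sizes and activation are assumptions.

```python
# Generic bottleneck adapter: a small trainable module added inside each
# Transformer layer, with a residual connection so the pretrained
# representations pass through unchanged when the adapter contributes little.
# Hidden and bottleneck sizes are illustrative.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 48):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.activation = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Only the adapter parameters are updated on the knowledge-injection
        # objective; the surrounding Transformer weights stay frozen.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

adapter = BottleneckAdapter()
states = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
print(adapter(states).shape)       # torch.Size([2, 16, 768])
```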
... Another line of retrofitting methods, i.e., adjusting distributional vectors to satisfy external linguistic constraints, has been applied to hypernymy detection. However, such methods strictly require additional resources, e.g., synonyms and antonyms, to achieve better performance (Kamath et al., 2019). To the best of our knowledge, we are the first to propose complementing the two lines of approaches to cover every word in a simple yet efficient way, with extensive analysis of the framework's potential and an evaluation of its performance. ...
Preprint
Full-text available
We address hypernymy detection, i.e., whether an is-a relationship exists between words (x, y), with the help of large textual corpora. Most conventional approaches to this task have been categorized as either pattern-based or distributional. Recent studies suggest that pattern-based ones are superior, if large-scale Hearst pairs are extracted and fed, with the sparsity of unseen (x, y) pairs relieved. However, they become invalid in some specific sparsity cases, where x or y is not involved in any pattern. For the first time, this paper quantifies the non-negligible existence of those specific cases. We also demonstrate that distributional methods are ideal to make up for pattern-based ones in such cases. We devise a complementary framework, under which a pattern-based and a distributional model collaborate seamlessly in cases which they each prefer. On several benchmark datasets, our framework achieves competitive improvements and the case study shows its better interpretability.
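A minimal sketch of the complementary set-up described above: rely on pattern-based evidence when a Hearst-style pattern fires, and back off to a distributional score otherwise. The two patterns, the threshold, and the stand-in scorer are invented for illustration and are far simpler than the cited framework.

```python
# Toy complementary framework: pattern-based evidence first, distributional
# back-off when no matched pattern involves the word pair.
# Patterns, threshold, and the distributional scorer are illustrative only.
import re

HEARST_TEMPLATES = [
    r"{hypo}s?\s+(?:is|are)\s+an?\s+{hyper}",   # "a dog is a mammal"
    r"{hyper}s?\s+such\s+as\s+{hypo}s?",        # "mammals such as dogs"
]

def pattern_hits(hypo, hyper, sentences):
    hits = 0
    for template in HEARST_TEMPLATES:
        pattern = template.format(hypo=re.escape(hypo), hyper=re.escape(hyper))
        hits += sum(bool(re.search(pattern, s, re.IGNORECASE)) for s in sentences)
    return hits

def is_a(hypo, hyper, sentences, distributional_score, threshold=0.5):
    if pattern_hits(hypo, hyper, sentences) > 0:            # pattern evidence available
        return True
    return distributional_score(hypo, hyper) > threshold    # distributional back-off

corpus = ["Mammals such as dogs are popular pets."]
print(is_a("dog", "mammal", corpus, lambda x, y: 0.0))   # True (pattern-based)
print(is_a("dog", "vehicle", corpus, lambda x, y: 0.1))  # False (back-off)
```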
... Retrofitting methods incorporate these task-specific graphs either by directly translating the embeddings (standard retrofitting) [7,26,32,10] or by learning a neural network to do the same (explicit retrofitting) [9,16]. Both standard and explicit retrofitting represent new relationships between entities observed in the task-specific graph; however, it is important to consider their impact on unobserved entities because most task-specific graphs are characteristically incomplete (Figure 1). ...
Preprint
Pretrained (language) embeddings are versatile, task-agnostic feature representations of entities, like words, that are central to many machine learning applications. These representations can be enriched through retrofitting, a class of methods that incorporate task-specific domain knowledge encoded as a graph over a subset of these entities. However, existing retrofitting algorithms face two limitations: they overfit the observed graph by failing to represent relationships with missing entities; and they underfit the observed graph by only learning embeddings in Euclidean manifolds, which cannot faithfully represent even simple tree-structured or cyclic graphs. We address these problems with two key contributions: (i) we propose a novel regularizer, a conformality regularizer, that preserves local geometry from the pretrained embeddings---enabling generalization to missing entities and (ii) a new Riemannian feedforward layer that learns to map pre-trained embeddings onto a non-Euclidean manifold that can better represent the entire graph. Through experiments on WordNet, we demonstrate that the conformality regularizer prevents even existing (Euclidean-only) methods from overfitting on link prediction for missing entities, and---together with the Riemannian feedforward layer---learns non-Euclidean embeddings that outperform them.
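For reference, the "standard retrofitting" baseline that the excerpt and abstract above build on is usually implemented as an iterative update that pulls each vector towards both its pretrained value and the mean of its graph neighbours (in the spirit of Faruqui et al., 2015). The weights, toy graph, and random vectors below are illustrative choices, not taken from any cited system.

```python
# Sketch of standard retrofitting: sweep over graph nodes, moving each vector
# towards its pretrained value and its neighbours in the knowledge graph.
# alpha/beta weights and the toy graph are illustrative.
import numpy as np

def retrofit(pretrained, neighbours, iterations=10, alpha=1.0, beta=1.0):
    vectors = {word: vec.copy() for word, vec in pretrained.items()}
    for _ in range(iterations):
        for word, nbrs in neighbours.items():
            nbrs = [n for n in nbrs if n in vectors]
            if not nbrs:
                continue  # not observed in the graph: vector stays distributional
            numerator = alpha * pretrained[word] + beta * sum(vectors[n] for n in nbrs)
            vectors[word] = numerator / (alpha + beta * len(nbrs))
    return vectors

rng = np.random.default_rng(0)
pretrained = {w: rng.standard_normal(50) for w in ["car", "automobile", "truck"]}
graph = {"car": ["automobile"], "automobile": ["car"], "truck": []}
retrofitted = retrofit(pretrained, graph)
print(np.linalg.norm(retrofitted["car"] - retrofitted["automobile"]))
```

The limitation raised in the excerpt corresponds to the `continue` branch: entities with no neighbours in the task-specific graph are simply left with their pretrained vectors.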
Article
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity data sets. Due to its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised cross-lingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions - the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning - available via a website which will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
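Benchmarks of this kind are typically consumed by scoring each word pair with the model under evaluation and reporting Spearman's rank correlation against the human ratings. The word pairs, ratings, and vectors below are invented purely to show the mechanics and are not taken from Multi-SimLex.

```python
# Typical use of a similarity benchmark: score each pair (here, cosine
# similarity of random stand-in vectors) and correlate with human ratings.
# Pairs and ratings are made up for illustration.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
vectors = {w: rng.standard_normal(300)
           for w in ["cup", "mug", "car", "bicycle", "dog", "cat"]}
pairs = [("cup", "mug", 5.4), ("car", "bicycle", 2.9),
         ("dog", "cat", 4.1), ("cup", "car", 0.6)]  # (word1, word2, human rating)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman's rho: {rho:.3f}")
```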
Article
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
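Many of the projection-based models covered by such surveys reduce to learning an orthogonal map between two monolingual spaces over a seed dictionary, which has a closed-form SVD solution (orthogonal Procrustes). The snippet below sketches that common building block with random stand-in matrices in place of real embeddings.

```python
# Common projection-based building block for cross-lingual word embeddings:
# given row-aligned source/target vectors for a seed dictionary, the best
# orthogonal map W (minimising ||XW - Y||_F) comes from the SVD of X^T Y.
# Matrices here are random stand-ins for real embeddings.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 300))  # source-language vectors of dictionary pairs
Y = rng.standard_normal((1000, 300))  # corresponding target-language vectors

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt                            # orthogonal Procrustes solution
mapped = X @ W                        # source space projected into the target space
```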