Content uploaded by Ivana Kajic

Author content

All content in this area was uploaded by Ivana Kajic on Aug 24, 2018

Content may be subject to copyright.

Evaluating the psychological plausibility of word2vec and GloVe distributional

semantic models

Ivana Kaji´

c, Chris Eliasmith

Centre for Theoretical Neuroscience, University of Waterloo

Waterloo, ON, Canada N2L 3G1

{i2kajic, celiasmith}@uwaterloo.ca

Abstract

The representation of semantic knowledge poses a central

modelling decision in many models of cognitive phenomena.

However, not all such representations reﬂect properties ob-

served in human semantic networks. Here, we evaluate the

psychological plausibility of two distributional semantic mod-

els widely used in natural language processing: word2vec and

GloVe. We use these models to construct directed and undi-

rected semantic networks and compare them to networks of hu-

man association norms using a set of graph-theoretic analyses.

Our results show that all such networks display small-world

characteristics, while only undirected networks show similar

degree distributions to those in the human semantic network.

Directed networks also exhibit a hierarchical organization that

is reminiscent of the human semantic network.

Keywords: semantic spaces, distributional semantic models,

free association norms, network analysis

Introduction

The representation of semantic knowledge is instrumental to

many models of linguistic processing in cognitive modelling

and machine learning. In particular, the decision of how to

represent such knowledge entails the selection of a vocabu-

lary and a computational representation of vocabulary items.

Many computational models studying human semantic

memory and related processes have been relying on The Uni-

versity of South Florida Free Association Norms (Nelson,

McEvoy, & Schreiber, 2004, USF Norms). Because it is a

psychologically plausible representation of a semantic net-

work, the USF Norms have been successfully used to repro-

duce human-level performance on tasks such as verbal se-

mantic search (Abbott, Austerweil, & Grifﬁths, 2015; Kaji´

c

et al., 2017), and recognition memory and recall (Steyvers,

Shiffrin, & Nelson, 2004).

Another common choice for the representation of semantic

knowledge is models derived from co-occurrence and word

frequency data. Such models learn vector representations of

words from large linguistic corpora and are often referred to

as spatial or distributed semantic models (DSMs).

In the domain of natural language processing,

word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean,

2013) and GloVe (Pennington, Socher, & Manning, 2014)

have been two widely used DSMs. They are shown to

achieve high accuracy on a variety of lexical semantic tasks

such as word analogy and named entity recognition. Their

capacity to perform well on such tasks also makes them

attractive candidates for semantic representations in cognitive

models. Yet, it remains unclear which, if any, aspects of such

representations are psychologically plausible.

This study evaluates psychological plausibility of GloVe

and word2vec models by analyzing semantic networks con-

structed from those models and comparing them to semantic

networks constructed from the USF Norms.

In particular, we evaluate networks in terms of their small-

world characteristics, degree distributions and hierarchical or-

ganization. We characterize networks that capture properties

of human association networks, and identify differences that

might have important implications for modelling of human

semantic memory.

Semantic Spaces

Vector-based word representations are generated to capture

statistical regularities observed in natural language. Often, a

high-dimensional co-occurrence matrix is created by count-

ing word occurrences in a set of texts or contexts. Then,

by applying a dimensionality reduction method such a ma-

trix is factorized into components that can be used to re-

construct low-dimensional vectors representing individual

words. DSMs using this approach have been known as count-

based models, with Latent Semantic Analysis (Deerwester,

Dumais, Furnas, Landauer, & Harshman, 1990, LSA) be-

ing one such prominent example. GloVe vectors used in this

work are derived from a count-based model (Pennington et

al., 2014) that is based on methods similar to LSA.

Although networks created from LSA vectors have been

criticized as unable to reproduce connectivity patterns be-

tween words as observed in the USF Norms (Steyvers &

Tenenbaum, 2005), this has been challenged by more re-

cent work demonstrating that some DSMs indeed produce

degree distributions that resemble that of human association

norms (Utsumi, 2015).

In contrast to count-based models such as LSA, more

recently developed predictive models use iterative training

procedures in complex neural networks to learn word vec-

tors based on the contexts in which those words occur.

Word2vec (Mikolov et al., 2013) is a popular predictive

model that computes vectors from a large corpus of text by

maximizing the probability of a target, which can either be a

single word or a set of context words.

In this work, we use pre-trained GloVe and word2vec vec-

tors. GloVe vectors were trained using the Common Crawl

dataset containing approximately 840 billion word tokens.

Word2vec vectors were trained on the Google News dataset

containing about 100 billion words. In both cases, the result-

ing vectors used here contain 300 dimensions. To allow for

a comparable analyses, we restrict the size of vocabularies of

these two datasets to that of the USF Norms, which contains

5018 words.

Constructing Semantic Networks

We generate undirected and directed semantic networks

from word2vec and GloVe semantic models, and com-

pare them to corresponding networks constructed from

the USF Norms.1The cosine angle is used as a mea-

sure of similarity between two word vectors. We have

also tested inner angle as a similarity measure, but found

no major differences between the two and therefore re-

port results of analyses based on cosine similarity. The

data processing and analysis code is available online at

https://github.com/ctn-archive/kajic-cogsci2018.

Undirected Networks

We create three undirected networks: one from the human

free association norms (USF Norms), one from the word2vec

vectors (word2vec) and one from the GloVe vectors (glove).

In every network, a node represents a word and an edge be-

tween two nodes corresponds to an associative relationship

between two words. To construct the undirected version of

the USF network, we place an edge between two nodes rep-

resenting words w1and w2if an association pair (w1,w2) or

(w2,w1) occurs in the USF word association database.

To construct undirected networks from DSMs, we com-

pute the similarity between all word pairs in the correspond-

ing vocabulary. An edge is placed between two nodes in

a network if the similarity between words represented by

those nodes exceeds a certain threshold τ. To select a thresh-

old for each network, we ﬁrst test a sequence of uniformly

distributed thresholds (with two decimal places) in different

ranges. Then, we select the threshold that produces a network

with the average node degree <k>that is closest to that

of the USF network. Overall, increasing the threshold had

the effect of shifting the degree distribution from the ”right”

(the regime where many nodes have many connections) to the

”left” (very sparse connectivity with most nodes having only

few connections). For the word2vec network the threshold is

τ=0.38 and for the glove network it is τ=0.53.

The degree distribution of the word2vec network was

strongly correlated (r=0.74,p<0.001) with the degree dis-

tribution of the USF network, while a moderate correlation

(r=0.49,p<0.001) was observed with the glove network.

Directed Networks

A directed network is a more accurate representation of hu-

man word associations, as it captures the directionality of as-

sociative relationship between a cue word and a target word.

Only 26.4% of all association pairs in the USF Norms are

1We will refer to all networks constructed from word2vec and

GloVe models as synthetic semantic networks, to differentiate them

from the experimentally derived USF network.

reciprocal.2To construct the directed version of the USF net-

work, a directed edge is placed between two nodes w1and w2

only if w2was an associate of a cue word w1.

We adopt two different methods to construct directed net-

works from DSMs: the k-nn method (Steyvers & Tenenbaum,

2005) and the cs-method (Utsumi, 2015). In both cases, the

local neighborhood of a node iis determined by placing out-

going edges to kother nodes that represent words most sim-

ilar to the word represented by the node i. The k-nn method

determines the number kfor each node as the number of as-

sociates of that word in the USF Norms, resulting in a net-

work that has the same out-degree distribution as the directed

USF network. The cs-method ﬁnds the smallest number kfor

which a certain empirical threshold Ris exceeded that pro-

duces the same average degree connectivity <k>as in the

directed USF network. We refer to the networks constructed

by the k-nn method as word2vec-knn and glove-knn, and those

constructed with the cs-method as word2vec-cs and glove-cs.

We observe strong and signiﬁcant correlations between

degree distributions of the directed networks word2vec-knn

(r=0.69), glove-knn (r=0.83), word2vec-cs (r=0.75),

glove-cs (r=0.88,p<0.001 in all cases), and the directed

USF network.

Network Statistics

Previous research (Steyvers et al., 2004; Utsumi, 2015;

Morais, Olsson, & Schooler, 2013) has identiﬁed that hu-

man association networks can be characterized in terms of

their small-world properties: although the networks are very

sparse (i.e., a node in the network is on average connected to

only a small subset of all nodes in the network), they have a

small average shortest path length L,L=1

n(n−1)∑i,j∈Gd(i,j),

where nis the number of nodes in the network, iand jare two

different nodes and d(i,j)is the shortest distance between the

two nodes measured as the number of edges between them.

In addition, the average clustering coefﬁcients of such net-

works are higher than clustering coefﬁcients of random net-

works of the same size that have the same probability of

a connection between any two nodes. The average cluster-

ing coefﬁcient Cfor a network with nnodes is computed as

C=1

n∑i∈Gciwhere ciis a local clustering coefﬁcient of a

node i, given by: ci=2ti

ki(ki−1).tiis the number of triangles in

the neighborhood of the node i. A triangle is a connectivity

pattern where a node iis connected to two other nodes jand

k, and at the same time the nodes jand kare also connected.

The denominator in the equation for ciis the number of possi-

ble connections in a neighborhood of a node with the degree

ki:ki(ki−1)/2. In the context of semantic networks, such

small-world structure is important for the efﬁcient search and

retrieval of items from memory.

To test whether networks constructed from word2vec and

GloVe models exhibit small-world characteristics, we run dif-

ferent graph-theoretic analyses. The sparsity sof a network is

2A reciprocal association is one where the word w1is an asso-

ciate of a word w2and vice versa.

Table 1: Graph-theoretic statistics of networks derived from USF Norms, word2vec and GloVe vectors. The results of undi-

rected networks are presented in the ﬁrst three rows. Abbreviations: L=the average shortest path length, k=average node

degree, C=the average clustering coefﬁcient, Ck=connectivity, the number of nodes in the largest connected component

(expressed in %), D=the network diameter, m=the number of edges, n=the number of nodes, Lrnd =the average shortest

path of a randomly connected network of similar size, Crnd =the average clustering coefﬁcient of a random network, s=the

sparsity of the network (expressed in %).

L<k>C CkD m n Lrnd Crnd s

USF undirected 3.04 22.0 0.186 100.00 5 55,236 5,018 3.03 0.004 0.44

word2vec 4.24 21.3 0.325 99.84 12 52,317 4,902 3.04 0.004 0.44

glove 4.61 22.1 0.373 98.88 12 51,244 4,632 2.99 0.005 0.48

USF directed 4.26 12.7 0.187 96.51 10 63,619 5,018 3.62 0.005 0.25

word2vec-cs 4.81 12.5 0.237 99.28 11 62,328 4,977 3.64 0.005 0.25

word2vec-knn 4.77 12.7 0.232 99.32 12 63,165 4,977 3.64 0.005 0.26

glove-cs 5.06 12.3 0.266 97.21 12 61,470 4,988 3.65 0.005 0.25

glove-knn 5.03 12.7 0.259 97.91 13 63,262 4,988 3.62 0.005 0.25

computed by dividing the average node degree <k>with the

total number of edges in the network. Other measures such as

the clustering coefﬁcient C, the average shortest path length L

and the diameter Dhave been performed on the largest con-

nected component of each network.

The results of analyses are summarized in Table 1. Our

results for the two USF networks are consistent with the pre-

vious reports (Steyvers et al., 2004; Utsumi, 2015). Due to

the methods used to construct the networks, all synthetic net-

works have sparsity that is comparable to the sparsity of the

human association network. Also, their average shortest path

lengths are consistently higher, but still comparable to those

of the human association networks. However, in undirected

networks, the diameter of synthetic networks is more than

twice as long as that of the USF network, meaning that the

distance between the two farthest words is longer in the syn-

thetic networks than it is in the association network. Further-

more, all synthetic networks also exhibit a degree of cluster-

ing that is higher than that of the USF network. This effect is

more pronounced in the undirected versions of the network.

Degree Distributions

To obtain the distribution of degrees in a network, we count

the number of nodes with kdegrees, where kranges from

one to kmax. The kmax value denotes the highest node degree

and it is different for different networks. The distribution of

in-degrees of the directed association network is known to

follow a truncated power-law distribution P(k)∼eλk−α, or,

in some cases, a pure power-law P(k)∼k−α(Utsumi, 2015;

Morais et al., 2013). The power-law predicts that most nodes

in the network have a few connections, while a small number

of nodes, regarded as hubs, have a rich local neighborhood.

In our analyses of degree distributions, we ﬁrst test the

plausibility of a power-law behavior using the goodness-of-

ﬁt test. Then we test whether other heavy-tailed distributions

provide a better ﬁt using the loglikelihood-ratio (LR) test. To

ﬁt and evaluate different models, we use the Python powerlaw

package (Alstott, Bullmore, & Plenz, 2014).

First, we ﬁt the empirical degree distribution to a power-

law model using the maximum likelihood estimation for the

parameter α. The ﬁt is performed for values of k>kmin ,

where kmin was determined such that the Kolmogorov-

Smirnov (KS) distance between the empirical distribution and

the model distribution for values greater than kmin is mini-

mized. Given a model, the goodness-of-ﬁt test uses the KS

distance between the model and the empirical distribution,

as well as the model and thousands of distributions sampled

from the model, to evaluate its plausibility. It produces a

p-value that is a fraction of sampled distributions that have

a greater KS distance than the empirical distribution. Large

p-values denote that sampled distributions are more distant

than the empirical distribution, in which case the model is

regarded as a plausible ﬁt to the empirical data.

The LR-test is a comparative test that evaluates which of

the two distributions is more likely to generate samples from

the empirical data based on maximum likelihood functions

of each distribution. The resulting Rvalue is positive if the

ﬁrst distribution is more likely, and negative otherwise. The

alternative heavy-tailed distributions we tested are: truncated

power-law, (discretized) lognormal and exponential.

Results of our analyses are summarized in Table 2. As

consistent with previous research, we ﬁnd that the truncated

power-law, rather than a pure power-law, is a better descrip-

tion for the distribution of degrees for the direct USF net-

work (Morais et al., 2013; Utsumi, 2015). In addition, our

results indicate that the lognormal distribution is a plausible

model for the directed USF network, as it is not possible to

distinguish between the lognormal and truncated power law

distributions (R=−0.43,p=0.67). Pure power-law is ex-

cluded as a plausible model for all our networks as p-values

for the goodness-of-ﬁt test are all close to 0.

We also ﬁnd that the truncated power-law is a plausible

model for the undirected USF network and both undirected

versions of the synthetic networks. It is important to notice

Table 2: Goodness-of-ﬁt test for the power-law distribution and loglikelihood ratio tests evaluating plausibility of the power-law

versus other heavy-tailed distributions. The results of undirected networks are presented in the ﬁrst three rows. Abbreviations:

KS =Kolmogorov-Smirnov statistic, LR =loglikelihood ratio.

Power Law Power Law vs.

Truncated Power Law

Power Law vs.

Lognormal

Power Law vs.

Exponential

KS pLR pLR pLR p

USF undirected 0.014 0.01 -1.80 0.02 -0.91 0.37 7.46 0.00

glove 0.035 0.00 -7.29 0.00 -5.03 0.00 7.82 0.00

word2vec 0.064 0.00 -7.72 0.00 -5.31 0.00 -5.89 0.00

USF directed 0.055 0.00 -2.54 0.00 -2.36 0.02 0.27 0.79

glove-cs 0.016 0.00 -1.14 0.17 -0.67 0.50 3.52 0.00

glove-knn 0.020 0.00 -1.05 0.16 -0.82 0.41 1.08 0.28

word2vec-cs 0.032 0.00 -0.57 0.40 -0.51 0.61 0.29 0.77

word2vec-knn 0.028 0.00 -0.13 0.82 -0.12 0.90 0.69 0.49

that the exponential distribution, as well as the lognormal dis-

tribution cannot be ruled out for the undirected word2vec net-

work. However, due to the high p-values of the LR tests it is

not possible to reach similar conclusions for the degree dis-

tributions of the directed synthetic networks.

To better understand these numerical results, we plot the

empirical data and model ﬁts on a semi-log scale in Fig-

ure 1. Degree distributions are expressed as complementary

cumulative distribution functions, and ﬁts for the power-law,

truncated power-law, lognormal, and exponential models are

shown. The scarcity of nodes with high degrees (>80) in cer-

tain variants of synthetic graphs such as glove-knn, word2vec-

cs and word2vec-knn are likely to contribute to large p-values

in LR tests in Table 2. While the distribution of degrees of

USF networks is bounded from above by the power-law dis-

tribution and from below by the exponential distribution, this

is only somewhat the case for the glove networks and less

so for the word2vec networks, indicating differences between

degree distributions of human and synthetic word networks.

Hierarchical Topology

Human association networks have been shown to exhibit a hi-

erarchical organization resulting from the high modularity of

the network (Utsumi, 2015). Such modules, or clusters, are

highly interconnected groups of nodes that form only a few

connections to nodes that are not part of the group. The pres-

ence of such clusters indicates that there are features shared

among nodes in the network, such as semantic or lexical re-

latedness.

While the average clustering coefﬁcients are reported in

Table 1, to investigate the presence of hierarchical struc-

ture, we consider the relationship between a node degree and

local clustering coefﬁcients ci(Ravasz & Barab´

asi, 2003).

In networks that exhibit hierarchical organization, the lo-

cal clustering coefﬁcient is dependent on the node degree

and has been observed to follow a scaling law of the form

C(k)∼k−γ(Ravasz & Barab´

asi, 2003). While many hierar-

chical networks have been observed to have γ=1, hierarchi-

cal structure has also been observed in networks with γ<1.

To investigate whether the tendency for clustering is depen-

dent on the node degree, we implement methods proposed

by Utsumi (2015). We ﬁrst compute local clustering coef-

ﬁcients for all nodes in the largest connected component in

each network. Then, we compute the average clustering coef-

ﬁcient for each neighborhood size kand connect those values

to form a line. Finally, we use linear regression in the log-

arithmic space to determine the slope of the regression line

and the correlation coefﬁcient.

The results are shown in Figure 2. First, we conﬁrm that

the directed USF network exhibits hierarchical organization

with γ=0.75 (r=−0.97). We also found strong negative

correlation between the size of a neighborhood and the av-

erage clustering coefﬁcient for the undirected USF network

(γ=0.76,r=−0.97). For undirected versions of synthetic

networks, we ﬁnd a small positive slope γ=−0.05 and a pos-

itive correlation (r=0.27) for the glove network, and simi-

larly γ=−0.10 (r=0.49) for the word2vec network. There-

fore, there is no dependency between the local clustering co-

efﬁcients and the node degree in the undirected versions of

synthetic semantic networks.

In contrast, some of the directed semantic networks ex-

hibit higher levels of hierarchical organization. Directed net-

works constructed with the k-nn method have negative slopes

with strong correlations: γ=0.49 (r=−0.96) for glove-knn

and γ=0.39 (r=−0.91) for word2vec-knn. The hierarchi-

cal relationship is less apparent in networks constructed with

the cs-method (glove-cs:γ=0.32,r=−0.88, word2vec-cs:

γ=0.17,r=−0.54).

Discussion

The goal of the present study is to evaluate the psychological

plausibility of semantic networks constructed from the widely

used word2vec and GloVe distributional semantic models.

To this end, a number of graph-theoretic analyses were per-

Figure 1: Complementary cumulative distributions and model ﬁts for the USF Norms,glove and word2vec networks. Distribu-

tions and ﬁts for undirected networks are shown in the ﬁrst three plots in the ﬁrst row. The x-axis for each network is bounded

by kmin and kmax that are unique for each network (see text for details).

formed that compared undirected and directed versions of

networks with semantic networks constructed from human

association norms.

We found that all networks exhibit the small-world prop-

erty, characterized by short path lengths and high clustering

coefﬁcients. In other words, it is possible to efﬁciently search

in such networks as any two words in a network are only a

few words apart. These results are consistent with previous

studies that demonstrated small-world structure in different

DSMs (Steyvers & Tenenbaum, 2005; Utsumi, 2015).

Degree distribution analyses based on a goodness-of-ﬁt test

revealed that the power-law is not a plausible model in any

of the networks. This ﬁnding may not be as surprising con-

sidering that some semantic networks have distributions that

can be described well with alternative heavy-tailed distribu-

tions (Morais et al., 2013; Utsumi, 2015). We contribute to

the existing research by adding that the truncated power-law

is a plausible explanation of the degree distribution for the

undirected USF network. The synthetic undirected networks

were also explained best by the truncated power-law. How-

ever, the lognormal and the exponential distribution are also

a plausible ﬁt for the undirected word2vec network.

Truncated power-law behavior could not be inferred for the

directed word2vec and glove networks. Analyzing the tails of

distributions in Fig. 1 provides more insight as to why it is

difﬁcult to obtain a clear ﬁt in those cases. Directed networks

have only very few nodes with a high number of connections.

For example, there are only four nodes with k>80 for glove-

knn and less than ten nodes with k>55 for both word2vec

networks. What distinguishes different heavy-tailed distribu-

tions are nodes ”contained” in the tail of a distribution, and

in this case it is possible that the LR test did not have enough

data to reliably discriminate between different distributions.

We found that directed networks, more speciﬁcally those

constructed with the k-nn method, exhibit a moderate level

of hierarchical organization that is reminiscent of cluster-

ing observed in the human association network. The di-

rected glove-cs and glove-knn networks contain nodes with

high connectivity that exhibit ﬂuctuations in clustering coefﬁ-

cients, as observed with the network created from association

norms. Such nodes are hubs, some of which have higher clus-

tering coefﬁcients since they are embedded in clusters. Hubs

with lower clustering coefﬁcients act as intermediaries in the

network by connecting different modules of the network that

have less connections.

Overall, these results indicate that different semantic net-

works constructed from word2vec and GloVe models are ca-

pable of capturing some aspects of human association net-

works. However, for most synthetic datasets there are clear

differences with the empirical networks. The glove-knn net-

Figure 2: Scatter plots of local clustering coefﬁcients. Every blue dot represents a local clustering coefﬁcient of a node with the

degree k. The red line connects averages. The ﬁrst three plots in the ﬁrst row are obtained from undirected networks.

work exhibits properties that are most similar to those of the

USF Norms, but future work should address methods that

yield a greater number of nodes with rich neighborhoods.

Acknowledgments

The authors would like to thank Terry Stewart for useful com-

ments and discussions. This work has been supported by

AFOSR, grant number FA9550-17-1-002.

References

Abbott, J. T., Austerweil, J. L., & Grifﬁths, T. L. (2015).

Random walks on semantic networks can resemble optimal

foraging. Psyc. Rev.,122(3), 558.

Alstott, J., Bullmore, E., & Plenz, D. (2014). powerlaw: a

python package for analysis of heavy-tailed distributions.

PloS One,9(1), e85777.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K.,

& Harshman, R. (1990). Indexing by latent semantic anal-

ysis. Journal of the Am. soc.: for inf. sci.,41(6), 391.

Kaji´

c, I., Gosmann, J., Komer, B., Orr, R. W., Stewart, T. C.,

& Eliasmith, C. (2017). A biologically constrained model

of semantic memory search. In Proceedings of the 39th

Annual Conference of the Cognitive Science Society (pp.

631–636). Austin, TX: Cognitive Science Society.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean,

J. (2013). Distributed representations of words and phrases

and their compositionality. In Advances in Neural Infor-

mation Processing Systems 26 (pp. 3111–3119). Curran

Associates, Inc.

Morais, A. S., Olsson, H., & Schooler, L. J. (2013). Mapping

the structure of semantic memory. Cog. Sci.,37(1), 125–

145.

Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004).

The University of South Florida free association, rhyme,

and word fragment norms. Behavior Research Methods,

Instruments, & Computers,36(3), 402-407.

Pennington, J., Socher, R., & Manning, C. (2014). Glove:

Global vectors for word representation. In Proceedings of

the 2014 conference on empirical methods in natural lan-

guage processing (EMNLP) (pp. 1532–1543).

Ravasz, E., & Barab´

asi, A.-L. (2003). Hierarchical organiza-

tion in complex networks. Phys. Rev. E,67(2), 026112.

Steyvers, M., Shiffrin, R. M., & Nelson, D. L. (2004). Word

association spaces for predicting semantic similarity effects

in episodic memory. In Experimental Cognitive Psychol-

ogy and its Applications (pp. 237–249). American Psycho-

logical Association.

Steyvers, M., & Tenenbaum, J. B. (2005). The large-scale

structure of semantic networks: Statistical analyses and a

model of semantic growth. Cog. Sci.,29(1), 41–78.

Utsumi, A. (2015). A complex network approach to distribu-

tional semantic models. PloS One,10(8), e0136277.