Sense Tree: Discovery of New Word Senses
with Graph-based Scoring
Jan Ehmüller¹, Lasse Kohlmeyer¹, Holly McKee¹, Daniel Paeschke¹,
Tim Repke², Ralf Krestel², and Felix Naumann²
Hasso Plattner Institute, University of Potsdam, Germany
¹first.last@student.hpi.uni-potsdam.de, ²first.last@hpi.uni-potsdam.de
Abstract. Language is dynamic and constantly evolving: both the usage context and the meaning of words change over time. Identifying words that acquired new meanings and the point in time at which new word senses emerged is elementary for word sense disambiguation and entity linking in historical texts. For example, cloud once stood mostly for the weather phenomenon and only recently gained the new sense of cloud computing. We propose a clustering-based approach that computes sense trees, showing how meanings of words change over time. The produced results are easy to interpret and explain using a drill-down mechanism. We evaluate our approach qualitatively on the Corpus of Historical American English (COHA), which spans two hundred years.
1 The Evolution of Language
As language evolves, words develop new senses. Detecting these new senses over
time is useful for several applications, such as word sense disambiguation and entity
linking in historic texts. Current machine learning models that represent and
disambiguate word senses or link entities are trained on recent datasets. There-
fore, these models are not aware of how language changed over the last decades
or even centuries. When disambiguating words in historic texts, it is not possi-
ble for such a model to know which senses a word could have had at that time.
For instance, the word cloud was once used mostly in the newspaper section
of weather forecasts. Nowadays, it is increasingly used in the context of cloud
computing. Hence, when disambiguating cloud in texts from the 19th century
with a model trained on current data, the model would not be aware that the
meaning of cloud computing did not yet exist at that point in time.
New word senses enter colloquial language from many aspects of human life,
such as popular culture, new technologies, world events, and social movements.
In 2019 alone, over 1,100 new words and meanings were added to the Merriam-
Webster Dictionary. Because the usage of context words differs over time, we
can discover word senses by taking advantage of this temporal aspect. Discover-
ing new or different senses of a word is known as word sense induction (WSI).
An example for WSI can be found in the context of depression. Depression is
commonly known in the sense of a mental health disorder, but also in the sense
of an economic crisis, as in the term great depression. The aim of WSI is to find out
which senses the word depression has. Word sense disambiguation (WSD), on the
other hand, is the automated disambiguation of a word sense within a
text [15]. While WSI aims to discover different senses for one word, WSD aims
to decide which sense a word has in a specific context, such as in a sentence.
We propose a WSI method that detects new senses by creating multiple
co-occurrence graphs over time and extracting word senses based on so-called
ego-networks. For a given word, it uses graph clustering to extract word senses
from the word's ego-network. These word senses are matched over time to create
a forest of so-called "sense trees". This forest can be explored to find out if
and when a word gained new senses over time. With the help of linguists, we
annotated a list of 112 words for evaluation¹ and tested our approach using
the Corpus of Historical American English (COHA), with 400 million words the
largest corpus of its type [4].
2 Related Work
Research on word sense induction and word sense disambiguation addresses how
to improve information retrieval in search engines, information extraction for
specific domains, entity linking over time, machine translation, and lexicogra-
phy [15]. Our research focuses on WSI and aims to discover the emergence of
new word meanings over time. Approaches can be divided roughly into vector
space models, including word embeddings [5], and graph clustering techniques [19],
which cover word clustering [5] and co-occurrence graphs [14].
Word embeddings are vector representations of words in a semantic space
that allow for word meanings to be induced by context. Word embeddings are
trained by a neural network that learns to predict a word based on its context
words or vice versa. Hence, words with similar contexts have a similar vector
and are thus in close proximity to each other. The initially proposed word2vec
model [13] computes only one vector and thus, allows for only one meaning per
word. Natural language is ambiguous and most words are polysemous, i.e., they
have multiple senses. Sense embeddings, as proposed by Song et al. [17], allow
for multiple senses by representing each sense of a word as its own vector.
However, neither traditional word embeddings, like word2vec, nor sense em-
beddings consider the temporal aspect and assume that words are static across
time. This creates a challenge in the face of dynamic and changing natural lan-
guage [1]. Word embeddings can be used for temporal tasks if multiple embed-
dings are trained separately for separate time periods. As those embeddings are
trained separately, they do not lie in the same semantic embedding space [10].
To ensure that they are in the same space and thus are comparable, they need
to be aligned with each other [1]. To resolve such alignment issues, dynamic or
¹https://hpi.de/naumann/s/language-evolution
diachronic word embeddings were introduced [10]. Kim et al. solve this by train-
ing their model on the earliest time periods first. Using the obtained weights
as the initial state for the next training phase, they move through subsequent
periods, allowing the embeddings to gain complexity and pick up new senses [9].
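To make this concrete, the following sketch shows the incremental training idea of Kim et al. [9] using gensim's Word2Vec; the loader `sentences_for_decade` is a hypothetical placeholder, and the hyperparameters are illustrative rather than the settings of any cited work.

```python
# Sketch of incremental diachronic training in the style of Kim et al. [9]:
# each decade's model starts from the previous decade's weights, so the
# resulting embedding spaces stay roughly aligned across time.
import copy
from gensim.models import Word2Vec

def sentences_for_decade(decade):
    """Hypothetical: return a list of tokenized sentences for one time slice."""
    raise NotImplementedError

decades = list(range(1810, 2010, 10))
model = Word2Vec(sentences_for_decade(decades[0]),
                 vector_size=100, window=5, min_count=10)
snapshots = {decades[0]: copy.deepcopy(model)}

for decade in decades[1:]:
    sentences = sentences_for_decade(decade)
    # update=True extends the vocabulary with words new to this decade;
    # training then continues from the previously learned weights.
    model.build_vocab(sentences, update=True)
    model.train(sentences, total_examples=len(sentences), epochs=model.epochs)
    snapshots[decade] = copy.deepcopy(model)
```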
Yao et al. present another idea, called dynamic word embeddings, where align-
ment is enforced by simultaneously training the word embeddings for different
time periods [21]. The disadvantage of these approaches is that each addresses
either only the temporal semantic shift or only the multi-sense aspect of words.
In contrast, our approach takes both aspects into consideration.
Another method for extracting word senses is through the use of graph
clustering or community detection. This method builds a graph based on co-occurrence
features from a corpus. Such graphs represent the relation of words
to each other and allow for the extraction of sense clusters through graph clustering.
Tahmasebi and Risse apply automatic word sense change detection based on
curvature clustering to help understand the different senses in historic archives [18];
their manual evaluation considered 23 terms with known sense changes (e.g., "gay").
Hope and Keller introduce the soft clustering algorithm MaxMax [7]. They identify
word senses by transforming the co-occurrence graph around a given word into an
unweighted, directed graph.
Mitra et al. present an approach that builds graphs based on distributional
thesauri for separate time periods [14]. From those graphs they extract so-called
“ego-networks”. With the randomized graph-clustering algorithm Chinese Whis-
pers, sense clusters are induced for specific words from their ego-network [2]. A
similar, more recent approach by Ustalov et al. performs the clustering
step with the meta-algorithm Watset, which uses hard clustering algorithms,
such as Louvain or Chinese Whispers, to perform a soft clustering [19].
Hard clustering algorithms assign nodes to exactly one cluster, whereas soft
clustering produces a probability distribution of cluster assignments.
Besides embeddings and graph models, topic modeling has been applied to
WSI as well. Lau et al. model the senses of a word as its topics, applying
latent Dirichlet allocation (LDA) as well as non-parametric hierarchical Dirichlet
processes (HDP). They also applied their approach to novel sense detection,
comparing the induced senses across the two time periods of a diachronic corpus
for a self-developed dataset of 10 words [11].
Jatowt and Duh provide a framework for discovering and visualizing semantic
changes with respect to individual words, word pairs, and word sentiment. They use n-gram
frequencies, positional information, and Latent Semantic Analysis to construct
word vectors for each decade of Google Books and COHA. To derive changes of
word senses, they calculate the cosine similarity of vector representations of the
same word at different time points. The authors show results of a case study on
semantic changes of single words, inter-decade word similarity, and contrasting
word pairs in 16 experiments with mostly different words [8].
Our approach is similar to that of Mitra et al. [14], but differs in that we build
a simpler co-occurrence graph and compare three different clustering algorithms.
Additionally, our approach interprets the results of comparing sense clusters
automatically to rank a word according to the likelihood of having gained a
new sense over time. Furthermore, we use more fine-grained time slices to
narrow in more precisely on the point in time at which a new sense emerges, whereas
Mitra et al. compare only two time slices at a time. This point can be visualized
by our drill down, which shows the forest of sense trees built from ego-networks.
This ability makes our approach valuable as an exploration tool in the fields of
historical and diachronic linguistics, specifically for hypothesis testing.
3 A Forest of Word Senses
Our approach can be divided into the three parts visualized in Figure 1 using
a fictitious example: (i) construction of a weighted co-occurrence graph, (ii) ex-
traction of word senses from a word’s ego-network in the form of sense clusters,
and (iii) matching them over time to create a forest of sense trees. We separate
the corpus into $n$ time slices $C_t$, which are handled as subcorpora. Similar to
Mitra et al. [14], our analysis is limited to nouns. For each subcorpus, we construct
a filtered co-occurrence graph and extract sense clusters for each word.
Finally, these clusters are connected across time slices to form sense trees.
3.1 Building a Co-Occurrence Graph
To build a weighted co-occurrence graph for time slice $C_t$, we need to count how
often words appear in the same context in $C_t$. Two nouns co-occur if they are
part of the same sentence and the word distance between them is at most $n_{window}$.
The parameter $n_{window}$ therefore controls the complexity of the resulting co-occurrence
graph: a smaller window size results in fewer co-occurrence edges
and hence in a sparser graph. Having computed the co-occurrences, we can
create the graph $G_t = (V_t, E_t)$ for each time slice $C_t$. The sets of nodes and
edges of $G_t$ for time slice $C_t$ are defined as

$$V_t = \{u \in C_t \mid u \text{ is a noun} \wedge tf(u) \geq \alpha_{tf}\}$$
$$E_t = \{\{u, v\} \mid u, v \in V_t \wedge u \neq v \wedge cooc(u, v) \geq \alpha_{cooc}\} \qquad (1)$$
where $tf(u)$ is the term frequency of $u$ in the given time slice and $cooc(u, v)$ is
the number of times words $u$ and $v$ appear within the same window. We use
the threshold parameters $\alpha_{tf}$ and $\alpha_{cooc}$ to exclude rarely occurring words, as,
depending on the window size $n_{window}$, the number of edges in this graph would
otherwise increase rapidly. Oftentimes, a corpus has an unbalanced distribution of data
across its time slices; as a result, some slices have much higher raw co-occurrence
values. This especially affects frequently occurring words with generic meanings,
such as man and time, which our initial filtering does not account for. We use the
point-wise mutual information (PMI) [20], $pmi(u, v) = cooc(u, v) / (tf(u) \cdot tf(v))$,
to reduce the importance of frequent words. After filtering these edges and removing
any nodes left unconnected, the final co-occurrence graph $G'_t = (V'_t, E'_t)$ for
time slice $C_t$, depicted as one column in Figure 1a, is defined by

$$V'_t = \bigcup E'_t \qquad E'_t = \{\{u, v\} \in E_t \mid pmi(u, v) \geq \alpha_{pmi}\} \qquad (2)$$

where $\alpha_{pmi}$ is a threshold parameter to remove the most loosely associated words
in the given time slice.

Fig. 1: Example for the construction of a simple sense forest of the word mouse: (a) co-occurrence graph, (b) ego-network for the word mouse, (c) forest of sense trees.
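A minimal sketch of this construction, assuming sentences have already been tokenized and POS-filtered so that each sentence is a list of its nouns; networkx is used for the graph, the threshold values are placeholders, and, for simplicity, window distance is measured over the noun sequence rather than the full sentence.

```python
# Sketch: build the filtered co-occurrence graph G'_t for one time slice.
from collections import Counter
from itertools import combinations
import networkx as nx

def build_graph(sentences, n_window=5, a_tf=10, a_cooc=3, a_pmi=0.01):
    """sentences: lists of nouns per sentence (pre-filtered by POS tag)."""
    tf, cooc = Counter(), Counter()
    for nouns in sentences:
        tf.update(nouns)
        for i, u in enumerate(nouns):
            for v in nouns[i + 1 : i + 1 + n_window]:  # distance <= n_window
                if u != v:
                    cooc[frozenset((u, v))] += 1
    g = nx.Graph()
    for pair, count in cooc.items():
        u, v = tuple(pair)
        # frequency and co-occurrence thresholds (Eq. 1)
        if tf[u] >= a_tf and tf[v] >= a_tf and count >= a_cooc:
            pmi = count / (tf[u] * tf[v])
            if pmi >= a_pmi:                           # PMI filter (Eq. 2)
                g.add_edge(u, v, weight=count, pmi=pmi)
    return g  # only endpoints of kept edges are added, i.e. V' = union of E'
```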
3.2 Word Sense Extraction
We use $G'_t$ to extract information about word contexts in time slice $C_t$. We
hypothesize that the context of a word is an indicator for its senses, as suggested
by Lindén, and hence use clustering to extract those senses [12]. The context
of word $w$ can be extracted in the form of an ego-network, which contains all
neighbors of $w$ and the edges among them, but not $w$ itself. Figure 1b shows such a
network for the word mouse. The different colors indicate the different clusters
of a time slice that were produced by a graph-clustering algorithm. Formally, we
define the ego-network $\hat{G}_w = (\hat{V}_w, \hat{E}_w)$ of word $w$ in time slice $C_t$ as

$$\hat{V}_w = \{u \in V'_t \mid \{u, w\} \in E'_t\} \qquad \hat{E}_w = \{\{u, v\} \in E'_t \mid u \in \hat{V}_w \wedge v \in \hat{V}_w\} \qquad (3)$$

To extract the different senses of $w$, we cluster the nodes of its ego-network $\hat{G}_w$.
In Section 4 we compare different clustering strategies. Each of these clustering
algorithms produces a set of $p$ disjoint clusters $S_{wt} = \{c_1, \ldots, c_p\}$ from the ego-network
of word $w$ in time slice $C_t$. Following Mitra et al. [14], we assume that
each of the resulting clusters represents a "sense cluster". Ideally, each meaning
of $w$ in $C_t$ is represented by exactly one sense cluster. By relaxing this condition
and allowing more than one sense cluster per sense, the clustering algorithms
yield better and more fine-grained results.
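A minimal sketch of this extraction and clustering step, assuming the filtered graph $G'_t$ is a weighted networkx graph; Louvain communities (available in networkx since version 2.8) stand in here for the clustering algorithms compared in Section 4.

```python
# Sketch: ego-network of a word (Eq. 3) and its sense clusters S_wt.
import networkx as nx

def sense_clusters(graph, word):
    # All neighbors of `word` and the edges among them; center=False
    # drops the ego node itself, as required by the definition.
    ego = nx.ego_graph(graph, word, center=False)
    if ego.number_of_nodes() == 0:
        return []
    # Each detected community is taken as one sense cluster.
    communities = nx.community.louvain_communities(ego, weight="weight")
    return [frozenset(c) for c in communities]
```

Applied once per time slice, this yields the cluster sets $S_{wt}$ used in the matching step below.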
3.3 Matching Word Senses Over Time
In the last step, sense clusters are matched across time slices. We use the Jaccard
similarity to compare sets of words, which are given by the clusters of word $w$
from two neighboring time slices $C_t$ and $C_{t-1}$. Let $c_i \in S_{wt}$ be a sense cluster for
word $w$ in time slice $C_t$. After a pairwise comparison between all sense clusters
across two time slices, we use a greedy approach to iteratively match those with
the highest score. However, if there is no cluster $c'_j \in S_{wt-1}$ that shares any
words with $c_i \in S_{wt}$, it remains unmatched. In this case, it becomes the
root cluster of a new sense tree.

A disadvantage of this matching strategy is that a word sense might simply
not occur in some time slices and thus interrupt the lineage of that word sense.
This may happen with sense clusters whose words have low frequencies, such as
words that appear in specific scientific literature. Because they cannot always be
matched between neighboring time slices, we match them across a longer time
span: sense clusters in $S_{wt}$ are matched not only to the sense clusters in $S_{wt-1}$,
but also to any previous sense cluster in $S_{w1}, \ldots, S_{wt-2}$ that remained
unmatched. We call this matching strategy leaf-matching.
$$S'_{wt-1} = S_{wt-1} \cup \bigcup_{i=1}^{t-2} \Big\{ c' \in S_{wi} \;\Big|\; c' \text{ not matched to any } c'' \in \bigcup_{j=i+1}^{t-1} S'_{wj} \Big\} \qquad (4)$$
We perform this matching for each time slice and produce a forest of sense trees $F_w = (V_w, E_w)$:

$$V_w = \bigcup_{i=1}^{n} S_{wi}; \qquad E_w = \{(c, c') \mid c \in S_{wi} \wedge c' \in S_{wj} \wedge i < j \wedge c \text{ matched } c'\} \qquad (5)$$
$F_w$ contains sense clusters without incoming edges. These clusters are root
clusters and represent the beginning of a sense tree. Given a root cluster $r$, the
respective sense tree $F_{wr} = (V_{wr}, E_{wr})$ is defined as follows:

$$V_{wr} = \{c \in V_w \mid \exists\, p = ((r, c'_1), \ldots, (c'_k, c))\} \qquad E_{wr} = \{(c, c') \in E_w \mid c, c' \in V_{wr}\} \qquad (6)$$
Such a sense tree represents a distinct sense of the word $w$. For example,
in Figure 1c, the matched clusters make up three sense trees, referring to three
different meanings of the word mouse.
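The greedy matching can be sketched as follows, with sense clusters represented as frozensets of words (as returned by `sense_clusters` above); leaf-matching would simply extend `previous` with still-unmatched clusters from earlier slices, per Eq. 4.

```python
# Sketch: greedily match sense clusters of word w between time slices
# C_{t-1} and C_t by Jaccard similarity.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def match_clusters(previous, current):
    """previous, current: lists of frozensets of words; returns matched pairs."""
    pairs = sorted(
        ((jaccard(p, c), p, c) for p in previous for c in current),
        key=lambda x: x[0],
        reverse=True,
    )
    matches, used_prev, used_cur = [], set(), set()
    for score, p, c in pairs:
        if score == 0:   # no shared words: such clusters stay unmatched
            break
        if p not in used_prev and c not in used_cur:
            matches.append((p, c))
            used_prev.add(p)
            used_cur.add(c)
    return matches  # unmatched current clusters become roots of new sense trees
```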
4 Comparison of Algorithms for Sense Clustering
In this section we introduce the following algorithms for the clustering step in
Section 3.2: Chinese Whispers [2], Girvan-Newman [6], and Louvain [3]. We also
describe the insights gained from experiments in which we used our drill down to
inspect the clusters produced by these algorithms.
Chinese Whispers is an agglomerative clustering algorithm introduced by Biemann
[2]. It does not have a fixed number of clusters, a property that
suits word senses, whose number is not known a priori. Its only parameter is the
number of iterations. Depending on the size of the graph, a small number of
iterations might never produce larger sense clusters. We chose 1,000 iterations
to ensure that the algorithm converges, as suggested by Biemann [2].
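The update rule is compact enough to sketch; this is a simplified rendering of Biemann's algorithm [2] for a weighted networkx graph, not a reference implementation.

```python
# Sketch of Chinese Whispers: each node repeatedly adopts the label that is
# strongest (by summed edge weight) among its neighbors; the final label
# groups form the clusters.
import random
from collections import defaultdict

def chinese_whispers(graph, iterations=1000):
    labels = {node: node for node in graph}  # every node starts as its own class
    nodes = list(graph)
    for _ in range(iterations):
        random.shuffle(nodes)                # randomized processing order
        for node in nodes:
            scores = defaultdict(float)
            for neighbor in graph[node]:
                w = graph[node][neighbor].get("weight", 1.0)
                scores[labels[neighbor]] += w
            if scores:
                labels[node] = max(scores, key=scores.get)
    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return list(clusters.values())
```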
Inspecting the computed clusters, we discovered that Chinese Whispers pro-
duces one very large sense cluster that contains nearly every word in the ego-
network, along with very few other small sense clusters. Since the large sense
cluster contains more than one meaning, the results do not fit our use case.
Girvan-Newman is a community detection algorithm named after its authors [6].
The algorithm is a hierarchical method that iteratively removes edges from the
graph to find communities. It always removes the edge with the highest between-
ness centrality or a custom metric, such as the co-occurrence frequency. Hence,
the initially computed sense cluster contains all nodes of the ego-network. With
each iteration, it produces more detailed sense clusters. This algorithm also
produces a variable number of clusters and hence fits the task of identifying
an unknown number of word senses well. However, in our configuration, the computational
costs of Girvan-Newman are around 25-30 times higher than those of
Louvain and Chinese Whispers. Due to this high computational cost, we chose
only three iterations. Because the granularity of the hierarchy depends on
the number of iterations, Girvan-Newman effectively produces one large cluster
that contains most words of an ego-network, in addition to many one-node
sense clusters. As with Chinese Whispers, these results do not fit the use case of
identifying multiple meanings of a word.
Louvain is a hierarchical community detection algorithm proposed by Blondel
et al. [3]. It uses the Louvain modularity, which measures the difference of edge
density inside communities to the edge density outside communities, to iden-
tify communities in a graph. Like the previous algorithms, it does not fix the
number of detected clusters. From the computed hierarchy it greedily chooses
the graph partition that optimizes the algorithm's modularity measure. While
investigating the computed sense clusters, we found that this partitioning produced
sense clusters of fairly balanced size. The results fit our use case, since in most
cases, sense clusters can be assigned to exactly one meaning of a word. However,
in the later time slices, with an increasing number of documents and thus co-occurrences,
the computed sense clusters are not partitioned as well; in some cases the algorithm
produces quite large sense clusters that contain multiple meanings of a word.
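For reference, Girvan-Newman and Louvain ship with networkx, so the three algorithms can be run on the same ego-network; `chinese_whispers` refers to the sketch above, and taking the third Girvan-Newman partition is an illustrative stand-in for our choice of three iterations.

```python
# Sketch: cluster one ego-network with each of the three algorithms.
from itertools import islice
import networkx as nx

def compare_clusterings(ego):
    results = {"chinese_whispers": chinese_whispers(ego)}
    # girvan_newman yields successively finer partitions of the graph.
    partitions = nx.community.girvan_newman(ego)
    results["girvan_newman"] = [set(c) for c in next(islice(partitions, 2, None))]
    # louvain_communities picks the partition that maximizes modularity.
    results["louvain"] = [
        set(c) for c in nx.community.louvain_communities(ego, weight="weight")
    ]
    return results
```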
5 Evaluation
The evaluation of WSI approaches is an open challenge, as there are no standard-
ized experimental settings or gold standards. Kutuzov et al. address the need
for standardized, robust test sets of semantic shifts in their 2018 survey [10].
Other researchers created manually selected word lists, which can vary widely
and make it difficult to compare the accuracy of approaches [14,21]. In this sec-
tion, we introduce our evaluation data, which we compiled by aggregating the
different approaches used in related work and annotated with the help of expert
linguists. We also discuss the impact of the hyperparameters of our approach
and demonstrate the effectiveness qualitatively.
5.1 Compiled Word List for Evaluation
The word list compiled by Yarowsky [22] is used in several publications on word
sense disambiguation [16]. Yarowsky developed an algorithm that disambiguates
the following twelve words with two distinct meanings each: axes, bass, crane, drug,
duty, motion, palm, plant, poach, sake, space, and tank. We extended this list by
adding new dictionary entries, novel technical terms, and words based on related
work [15]. It also contains words that did not gain new meanings in the last 200
years. The words were annotated with respect to new sense gain using word entries
from the Oxford English and Merriam-Webster Dictionaries, which include
references to the first known use of words with a particular meaning. WordNet²,
a commonly used lexical resource for computational linguistics, groups words
with regard to word form as well as meaning. To merge the results obtained from
the two different approaches, we use a logical OR: a word is
considered to have gained a new sense if it is labeled positively by either of
our two approaches.
In total, we compiled a set of 112 words that have either a single sense, multiple
but stable senses, or gained at least one new sense. Each candidate
word was annotated by 15 linguists (C1 and C2 level) as "Gained an additional
sense since 1800" or "No additional sense since 1800". We measure an overall
agreement of 61% and a free-marginal Fleiss kappa of 0.42 (fixed-marginal: 0.22).
Based on a simple majority vote, 42 words gained an additional sense, whereas
70 did not. For some words, such as android, beef, or pot, we saw a high annotator
agreement of over 75%. When using this as a threshold, 14 words gained an
additional sense, whereas 31 did not. The linguists were particularly undecided
on the words cat, honey, power, and state. The annotations are available on our website¹.
5.2 Corpus of Historical American English (COHA)
For demonstrating our proposed approach, we use the Corpus of Historical American
English (COHA)³. It is one of the largest temporal corpora covering one of the
longest time ranges [4]. COHA spans texts from the years 1810 to 2009 and contains
more than 100,000 individual texts from fiction, popular magazines, newspapers,
and non-fiction books, with a total of 400 million words.

²https://wordnet.princeton.edu/
³https://www.english-corpora.org/coha/

Table 1: Comparison of graph clustering algorithms

Algorithm         | Precision@5 | Precision@10 | Precision@20
Chinese Whispers  | 0.4         | 0.4          | 0.35
Girvan-Newman     | 0.2         | 0.5          | 0.55
Louvain           | 0.6         | 0.6          | 0.4
We split the corpus by decade to generate the time slices, since vastly different
vocabularies would result in different word contexts. Ideally, the vocabulary
and word distribution is relatively stable across all time slices. Since the number
of tokens increases with each decade, we measure the vocabulary overlap. In our
measurements (not shown due to space constraints), neighboring decades share 20-30%
of their vocabulary, while decades that are further apart share only 5-15% for both
measures. Very high cosine similarity values highlight that frequently occurring
words appear in most decades. We conclude that using a frequency- and
co-occurrence-based approach to extract information about changes of word senses
over time is feasible. For a deeper statistical analysis of the corpus, we refer
readers to the work by Jatowt and Duh [8].
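The overlap measurement itself is simple; a sketch, assuming one set of word types per decade subcorpus:

```python
# Sketch: pairwise vocabulary overlap (Jaccard) between decade subcorpora.
def vocabulary_overlap(vocabs):
    """vocabs: dict mapping decade -> set of word types in that subcorpus."""
    decades = sorted(vocabs)
    overlap = {}
    for i, d1 in enumerate(decades):
        for d2 in decades[i + 1:]:
            inter = vocabs[d1] & vocabs[d2]
            union = vocabs[d1] | vocabs[d2]
            overlap[(d1, d2)] = len(inter) / len(union)
    return overlap  # neighboring decades should show the highest values
```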
5.3 Hyperparameter Evaluation
For evaluation, we derive a score of how likely it is that a word gained a new
sense over time. To this end, we count the number of sense trees for a word and
consider their distribution over time; sense trees of length 1 are ignored. This
count is used to rank words, such that words that gained a new sense are at the
top. Using our annotated data, we calculate the precision@k with k ∈ {5, 10, 20}.
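A sketch of this scoring and evaluation, where `sense_forest(word)` is a hypothetical accessor for the forest from Section 3.3 and `gold` maps each annotated word to whether it gained a sense:

```python
# Sketch: score words by their number of non-trivial sense trees and
# evaluate the ranking with precision@k against the annotated word list.
def sense_gain_score(forest):
    # Ignore sense trees of length 1, i.e. clusters never matched over time.
    return sum(1 for tree in forest if len(tree) > 1)

def precision_at_k(ranked_words, gold, k):
    """ranked_words: words sorted by descending score; gold: word -> bool."""
    return sum(gold[w] for w in ranked_words[:k]) / k

# ranking = sorted(words, key=lambda w: sense_gain_score(sense_forest(w)),
#                  reverse=True)
# print(precision_at_k(ranking, gold, k=10))
```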
The following parameter settings were found by optimizing our approach
on the annotated word list: we set the pointwise mutual information (PMI)
threshold parameter $\alpha_{pmi} = 0.01$, used the Jaccard similarity as similarity
measure, and used leaf-matching to match sense clusters across time slices.
We compare the three graph clustering algorithms introduced in Section 4:
Chinese Whispers, Girvan-Newman, and Louvain. Table 1 presents the results
with these algorithms; the highest precision values are highlighted in bold.

Louvain outperforms the other two algorithms at k = 5 and k = 10. Girvan-Newman
has a much lower precision at k = 5, but performs much better at k = 10
and k = 20. Chinese Whispers does not perform as well. We suggest
Louvain as the clustering algorithm, because it best fits our requirement that a
sense cluster should represent only a single meaning.
Fig. 2: Sense tree for the word acid across the decades 1810-2000. Green boxes represent the initial cluster for a sense; matched clusters are connected by straight lines. (The clusters contain words such as cell, tissue, nucleic, gland, and, later, thiamin, vitamin, protein, diethylamide.)
5.4 Qualitative Evaluation
To qualitatively evaluate our approach, we take a closer look at the drill down
example for the word monitor. Our approach produced five sense trees, each of
which can be matched to a single meaning of the word: hall monitor starting
in the 1830s, the warship of that name starting in the 1860s (two sense trees),
monitor in the technological sense of a screen starting in the 1920s, and the
surveillance sense starting in the 1960s. The time periods at which new senses
emerged are accurate. For example, the warship was created in the 1860s and
that is also the time slice in which we detect that sense. One weakness of the
approach is that although the earliest sense tree can be interpreted easily as the
sense of hall monitor, a closer look reveals that the clusters are only matched
by the word master. This shows that the matching can be influenced easily by
just a few words. Another weakness is that in fact two sense trees represent the
meaning of warship. The later sense trees represent distinct senses of monitor:
the technological sense and the surveillance sense. However, the branches of these
sense trees are not separated sufficiently.
Our graph representation of a text corpus can be used to visually explore
linguistic features. Figure 2 shows the forest of sense trees for the word acid.
Each of the boxes represents a set of words $c_i \in S_{wt}$ as defined before. Each
sense tree begins with a green box; subsequent clusters matched across time
slices are connected by straight lines. Interestingly, we can see the
discovery of DNA in the late 19th century and LSD in the 1960s.
5.5 Limitations
Although our approach is able to identify different senses of some words, there is
still room for improvement: better graph clustering algorithms and matching strategies,
an evaluation of different window sizes used during co-occurrence graph creation,
and a survey to obtain a word dataset that can serve as a gold standard for
evaluating temporal WSI approaches. Features that would make our approach more
useful as an exploration tool include adjustments to the underlying language
models to handle historic language, a quantitative comparison to word embeddings,
verification of the stability of our approach, custom slicing of time periods,
sensitivity to differently spelled variants of the same word, and detecting not only
the birth of new word senses but also their death. Currently, our approach struggles
to create well-partitioned sense clusters for different senses, which also affects the
matching strategies. Additionally, these strategies do not prevent sense drift, so
sense trees may change their meaning significantly over multiple time slices.
6 Conclusion
We proposed an approach to identify words that gained new meanings over time.
Additionally, our approach is interpretable and produces intermediate results
that can be used to investigate and understand how the sense gain score for
a specific word was constructed. We presented a drill down into specific words
with two different visualizations that allow key components of our approach to
be easily understood. It enables seeing both the created sense clusters and sense
trees, and thus allows one to find the point in time at which new senses of a
word emerged.
We applied our approach to COHA, which spans 200 years and is the largest
corpus of its kind. We showed anecdotal evidence for the functioning of our
approach by manually annotating and evaluating 109 words. We also evaluated
our approach qualitatively by using our drill down to inspect the intermediate
results of the word monitor. We found that our approach was able to successfully
identify word senses in sense trees.
References
1. Bamler, R., Mandt, S.: Dynamic word embeddings. In: Proceedings of the Inter-
national Conference on Machine Learning (ICML). pp. 380–389. JMLR Inc. and
Microtome Publishing (2017)
2. Biemann, C.: Chinese whispers: an efficient graph clustering algorithm and its ap-
plication to natural language processing problems. In: Proceedings of the Workshop
on Graph-based Methods for NLP. pp. 73–80. ACL (2006)
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of
communities in large networks. Journal of Statistical Mechanics: Theory and Ex-
periment 2008(10) (2008)
4. Davies, M.: Expanding horizons in historical linguistics with the 400-million word
corpus of Historical American English. Corpora 7(2), 121-157 (2012)
5. Feuerbach, T., Riedl, M., Biemann, C.: Distributional semantics for resolving bridg-
ing mentions. In: Proceedings of the International Conference on Recent Advances
in Natural Language Processing (RANLP). pp. 192–199. ACL (2015)
6. Girvan, M., Newman, M.E.: Community structure in social and biological networks.
Proceedings of the National Academy of Sciences 99(12), 7821–7826 (2002)
7. Hope, D., Keller, B.: MaxMax: A graph-based soft clustering algorithm applied
to word sense induction. In: Proceedings of the International Conference on Com-
putational Linguistics and Intelligent Text Processing (CICLing). pp. 368–381.
Springer-Verlag (2013)
8. Jatowt, A., Duh, K.: A framework for analyzing semantic change of words across
time. In: Proceedings of the IEEE/ACM Joint Conference on Digital Libraries
(JCDL). pp. 229–238 (2014)
9. Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D., Petrov, S.: Temporal analysis of lan-
guage through neural language models. In: Proceedings of the Workshop on Lan-
guage Technologies and Computational Social Science. pp. 61–65. ACL (2014)
10. Kutuzov, A., Øvrelid, L., Szymanski, T., Velldal, E.: Diachronic word embeddings
and semantic shifts: a survey. In: Proceedings of the International Conference on
Computational Linguistics (COLING). pp. 1384–1397. ACL (2018)
11. Lau, J.H., Cook, P., McCarthy, D., Newman, D., Baldwin, T.: Word sense induc-
tion for novel sense detection. In: Proceedings of the Conference of the European
Chapter of the Association for Computational Linguistics (EACL). p. 591–601.
ACL (2012)
12. Lindén, K.: Evaluation of linguistic features for word sense disambiguation with
self-organized document maps. Computers and the Humanities 38, 417-435 (2004)
13. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word rep-
resentations in vector space. In: Proceedings of the International Conference on
Learning Representations (ICLR). pp. 1–12 (2013)
14. Mitra, S., Mitra, R., Maity, S.K., Riedl, M., Biemann, C., Goyal, P., Mukherjee,
A.: An automatic approach to identify word sense changes in text media across
timescales. Natural Language Engineering 21(5), 773-798 (2015)
15. Navigli, R.: Word sense disambiguation: A survey. ACM Computing Surveys 41(2),
1–69 (2009)
16. Rapp, R.: Word sense discovery based on sense descriptor dissimilarity. In: Pro-
ceedings of Machine Translation Summit (MTSummit). pp. 315–322. European
Association for Machine Translation (2003)
17. Song, L., Wang, Z., Mi, H., Gildea, D.: Sense embedding learning for word sense
induction. In: Proceedings of the Joint Conference on Lexical and Computational
Semantics (*SEM). The *SEM Organizing Committee (2016)
18. Tahmasebi, N., Risse, T.: On the uses of word sense change for research in the
digital humanities. In: Proceedings of the International Conference on Theory and
Practice of Digital Libraries (TPDL). pp. 246–257. Springer-Verlag (2017)
19. Ustalov, D., Panchenko, A., Biemann, C., Ponzetto, S.P.: Watset: Local-global
graph clustering with applications in sense and frame induction. Computational
Linguistics 45(3), 423–479 (2019)
20. Yang, H., Callan, J.: A metric-based framework for automatic taxonomy induc-
tion. In: Proceedings of the International Joint Conference on Natural Language
Processing (IJCNLP). pp. 271–279. ACL (2009)
21. Yao, Z., Sun, Y., Ding, W., Rao, N., Xiong, H.: Dynamic word embeddings for
evolving semantic discovery. In: Proceedings of the ACM International Conference
on Web Search and Data Mining (WSDM). pp. 673–681. ACM (2018)
22. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised meth-
ods. In: Proceedings of the Annual Meeting of the Association for Computational
Linguistics (ACL). pp. 189–196. ACL (1995)