Ant Colony Algorithm for the Unsupervised Word Sense
Disambiguation of Texts: Comparison and Evaluation
Didier SCHWAB, Jérôme GOULIAN,
Andon TCHECHMEDJIEV, Hervé BLANCHON
LIG-Laboratory of Informatics of Grenoble
GETALP-Study Group for Machine Translation and Automated Processing of Languages and Speech
Univ. Grenoble-Alpes, France
{Didier.Schwab,Jerome.Goulian,Andon.Tchechmedjiev,Herve.Blanchon}@imag.fr
ABSTRACT
Brute-force word sense disambiguation (WSD) algorithms based on semantic relatedness are
really time consuming. We study how to perform WSD faster and better on the span of a text.
Several stochastic algorithms can be used to perform Global WSD. We focus here on an Ant
Colony Algorithm and compare it to two other methods (Genetic and Simulated Annealing
Algorithms) in order to evaluate them on the Semeval 2007 Task 7. A comparison of the
algorithms shows that the Ant Colony Algorithm is faster than the two others, and yields better
results. Furthermore, the Ant Colony Algorithm coupled with a majority vote strategy reaches
the level of the first sense baseline and among other systems evaluated on the same task rivals
the lower performing supervised algorithms.
TITLE AND ABSTRACT IN FRENCH
Algorithme à colonie de fourmis pour la désambiguïsation
lexicale non supervisée de textes : comparaison et évaluation
Les algorithmes exhaustifs de désambiguïsation lexicale ont une complexité exponentielle et le
contexte qu’il est calculatoirement possible d’utiliser s’en trouve réduit. Il ne s’agit donc pas
d’une solution viable. Nous étudions comment réaliser de la désambiguïsation lexicale plus
rapidement et plus efficacement à l’échelle du texte. Nous nous intéressons ainsi à l’adaptation
d’un algorithme à colonies de fourmis et nous le confrontons à d’autres méthodes issues de
l’état de l’art, un algorithme génétique et un recuit simulé en les évaluant sur la tâche 7 de
Semeval 2007. Une comparaison des algorithmes montre que l’algorithme à colonies de fourmis
est plus rapide que les deux autres et obtient de meilleurs résultats. De plus, cet algorithme,
couplé avec un vote majoritaire atteint le niveau de la référence premier sens et rivalise avec
les moins bons algorithmes supervisés sur cette tâche.
KEYWORDS: Unsupervised Word Sense Disambiguation, Semantic Relatedness, Ant Colony Algorithms, Stochastic optimization algorithms.
KEYWORDS IN FRENCH: Désambiguïsation lexicale non-supervisée, proximité sémantique, algorithmes à colonies de fourmis, algorithmes stochastiques d'optimisation.
1 Introduction
Word Sense Disambiguation (WSD) is a core problem in Natural Language Processing (NLP), as
it may improve many of its applications, such as multilingual information extraction, automatic
summarization, or machine translation. More specifically, the aim of WSD is to find the
appropriate sense(s) of each word of a text among a pre-defined sense inventory. For example, in "The mouse is eating cheese.", for the word mouse, the WSD algorithm should choose the sense that corresponds to the animal rather than the computer device. There exist many
methods to perform WSD, among which, one can distinguish between supervised methods and
unsupervised methods. The former are based on machine learning techniques that use a set of
(manually) labelled training data, whereas the latter do not.
This article focusses on an unsupervised knowledge-based approach for WSD, derived from the
method proposed by (Lesk, 1986). This approach uses a similarity measure that corresponds to
the number of overlapping words between the definitions of two word senses. With this metric,
one can select, for a given word of a text, the sense that yields the highest relatedness to a
certain number of its neighbouring words (with a fixed window size). Works such as (Pedersen
et al., 2005) use a brute-force (BF) global algorithm that evaluates the relatedness between
each word sense and all the senses of the other words within the considered context window.
The execution time is exponential in the size of the input, thus reducing the maximum possible
width of the window. The problem can become intractable even on the span of short sentences:
a linguistically motivated context, such as a paragraph for instance, can not be handled. Thus,
such approaches can not be used for applications where real time is a necessary constraint
(image retrieval, machine translation, augmented reality).
In order to overcome this problem and to perform WSD faster, we are interested in other
methods. In this paper, we focus on three methods that globally propagate a local algorithm
based on semantic relatedness to the span of a whole text. We consider two unsupervised
algorithms from the state of the art, a Genetic Algorithm (GA) (Gelbukh et al., 2003) and a
Simulated Annealing (SA) algorithm (Cowie et al., 1992), as well as an adaptation of an Ant
Colony Algorithm (ACA) (Schwab et al., 2011). Our aim is to provide an empirical comparison
of the ACA with the two other unsupervised algorithms, using the Semeval 2007 Task 7 coarse-grained corpus (Navigli et al., 2007), both in terms of quality and execution time. Furthermore,
we also evaluate the results after applying a majority vote strategy.
After a brief review of the state-of-the-art of WSD, the algorithms are described. Subsequently,
their implementations are discussed, as well as the estimation of the best parameters and the
evaluation of the tested algorithms. Finally, an analysis of the results is presented as well as a
comparison to other systems on Semeval 2007 Task 7. Then, we conclude and propose some
perspectives for future work.
2 Brief State of the art of Word Sense Disambiguation
In simple terms, WSD consists in choosing the best sense among all possible senses for all words
in a text. There exist many methods to perform WSD. The reader can refer to (Ide and Véronis,
1998) for works before 1998 and (Agirre and Edmonds, 2006) or (Navigli, 2009) for a complete
state of the art.
Supervised WSD algorithms are based on the use of a large set of hand-labelled training data
to build a classifier which can determine what are the right sense(s) for a given word in a
given context. Most classical supervised learning algorithms have been applied to WSD, and
even though they tend to yield better results (on English) than unsupervised approaches, their
main disadvantage is that hand-labelled examples are rare and expensive resources (knowledge
acquisition bottleneck) that must be created for each sense inventory, each language and even
each specialized domain. We share the opinion of (Navigli and Lapata, 2010) that unsupervised
methods would be better in order to overcome these obstacles in the short term.
Among unsupervised WSD methods some use raw corpora to build word vectors or co-
occurrence graphs while others use external knowledge sources (dictionaries, thesauri, lexical
databases, . . . ). The latter are based on the use of semantic similarity metrics that assign a
score representing how related or close two word senses are. Many such measures exist and can
be classified in three main categories: taxonomic distance in a lexical graph; taxonomic distance
weighted by information content; feature-based similarity. More recent efforts go towards
hybrid measures that combine two or more of the above. Readers may consult (Pedersen et al.,
2005), (Cramer et al., 2010), (Tchechmedjiev, 2012) or (Navigli, 2009) for a more complete
overview.
Commonly, similarity measures use WordNet (Fellbaum, 1998), a lexical database for English
widely used in the context of WSD. WordNet is organised around the notion of "synonym sets" (synsets), which represent a word, its class (noun, verb, ...) and its connections to all
semantically related words (synonyms, antonyms, hyponyms,...), as well as a textual definition
for each corresponding synset. The current version, WordNet 3.0, contains over 155000 words
for 117000 synsets.
3 Local and global WSD Algorithms
3.1 Our local algorithm: A variant of the Lesk Algorithm
Our local algorithm is a variant of the Lesk Algorithm (Lesk, 1986). Proposed more than 25
years ago, it is simple, only requires a dictionary and no training. The score given to a sense
pair is the number of common words (space-separated strings) in the definitions of the senses, without taking into account either the word order in the definitions (bag-of-words approach) or any syntactic or morphological information. Variants of this algorithm are still today among
the best on English-language texts (Ponzetto and Navigli, 2010).
Our local algorithm exploits the links provided by WordNet: it considers not only the definition of a sense but also the definitions of the linked senses (using all the semantic relations from WordNet), following (Banerjee and Pedersen, 2002); it is henceforth referred to as ExtLesk (all dictionaries and Java implementations of all the algorithms of this article can be found on our WSD page: http://getalp.imag.fr/xwiki/bin/view/WSD/). Contrary to Banerjee, however, we do not consider the sum of squared sub-string overlaps, but merely a bag-of-words overlap. This allows us to generate a dictionary from WordNet, where each word contained in any of the word sense definitions is indexed by a unique integer and where each resulting definition is sorted. Thus we are able to lower the computational complexity from $O(mn)$ to $O(m)$, $m > n$, where $m$ and $n$ are the respective lengths of the two definitions. For example, for the definition "Some kind of evergreen tree", if we say that Some is indexed by 123, kind by 14, evergreen by 34, and tree by 90, then the indexed representation is {14, 34, 90, 123}.
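To make this concrete, here is a minimal Python sketch (an illustration, not the authors' Java implementation) of the bag-of-words overlap computed by merging two definitions that are already indexed and sorted as described above; the function name and inputs are assumptions made for the example.

```python
# Illustrative sketch: overlap between two definitions pre-indexed as sorted
# integer lists. A single merge pass costs O(m) with m > n, instead of the
# O(mn) pairwise comparison of raw word lists.
def overlap(def_a, def_b):
    """Number of common word indices between two sorted integer lists."""
    i = j = score = 0
    while i < len(def_a) and j < len(def_b):
        if def_a[i] == def_b[j]:
            score += 1
            i += 1
            j += 1
        elif def_a[i] < def_b[j]:
            i += 1
        else:
            j += 1
    return score

# "Some kind of evergreen tree" -> {14, 34, 90, 123}, as in the example above.
print(overlap([14, 34, 90, 123], [14, 90, 200]))  # 2 shared word indices
```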
3.2 Global algorithms
A global algorithm is a method that propagates a local measure to a whole text in order to assign a sense label to each word. The simplest approach is the exhaustive evaluation of sense combinations (BF), used for example in (Banerjee and Pedersen, 2002), that assigns a score to each word sense combination in a given context (window or whole text) and selects the one with the highest score. The main issue with this approach is that it leads to a combinatorial explosion in the length of the context window or text: $\prod_{i=1}^{|T|} |s(w_i)|$ combinations, where $s(w_i)$ is the set of possible senses of word $i$ of a text $T$. For this reason it is very difficult to use the BF approach in real-life scenarios as well as on analysis windows of more than a few words.
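To give a feel for this growth, the following snippet computes the size of the BF search space for a hypothetical eight-word context; the sense counts are invented for the illustration.

```python
# Illustration of the BF search-space size: the product of the number of
# senses of each word in the context (hypothetical sense counts).
from math import prod

sense_counts = [4, 6, 2, 9, 5, 3, 7, 6]   # |s(w_i)| for an 8-word context
print(prod(sense_counts))                  # 272160 configurations to score
```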
Several approximation methods can be used in order to overcome the combinatorial explosion
problem. On the one hand, complete approaches try to reduce dimensionality using pruning
techniques and sense selection heuristics. Some examples include: (Hirst and St-Onge, 1998),
based on lexical chains that restrict the possible sense combinations by imposing constraints on
the succession of relations in a taxonomy (e.g. WordNet); or (Gelbukh et al., 2005) that review
general pruning techniques for Lesk-based algorithms; or yet (Brody and Lapata, 2008).
On the other hand, incomplete approaches generally use stochastic sampling techniques to reach
a local maximum by exploring as little as necessary of the search space. Our present work
focuses on such approaches. Furthermore, we can distinguish two possible variants:
- local neighbourhood-based approaches (new configurations are created from existing configurations), among which are some approaches from artificial intelligence such as genetic algorithms or optimization methods such as simulated annealing;
- constructive approaches (new configurations are generated by iteratively adding new elements of solutions to the configuration under construction), among which are for example ant colony algorithms.
3.3 Context of our work
The aim of this paper is to compare our Ant Colony Algorithm (incomplete and constructive
approach) to other incomplete approaches. We choose to first compare our algorithm with two
classical neighbourhood-based approaches that have been used in the context of unsupervised
WSD: genetic algorithms (Gelbukh et al., 2003) and simulated annealing (Cowie et al., 1992).
The underlying assumption to our work is that the span of the analysis context should be the
whole text (similarly to (Cowie et al., 1992), (Gelbukh et al., 2003) and more recently (Navigli
and Lapata, 2010)), rather than a smaller context window (like many other methods do for
computational reasons). Indeed, in our opinion, using a context window smaller than that of
the whole text raises two main issues: no guarantee on the consistency between two selected
senses; contradictory sense assignments outside of the window range.
For example in the following sentence, considering a window of 6 words: "The two planes were
parallel to each other. The pilot had parked them meticulously.", plane may be disambiguated
wrongly due to pilot being outside the window of plane. Furthermore, it can be detrimental to the semantic unity of the disambiguation given that, as (Gale et al., 1992) or (Hirst and St-Onge, 1998) pointed out, a word used several times in the same context tends to keep the same sense. Therefore, some algorithms that are similar to our Ant Colony Algorithm but
that use a context window have not been studied here (notably the adaptation (Mihalcea et al.,
2004) of PageRank (Brin and Page, 1998) to WSD).
Moreover, we are not interested in comparing these incomplete algorithms to the optimal disambiguation (Brute Force), which cannot pragmatically be used in a real-life context: even with a reduced context window and weeks of execution time we were only able to achieve a 77% coverage of the corpus with BF, as detailed in (Schwab et al., 2011).
4 Global stochastic algorithms for Word Sense Disambiguation
The aim of these algorithms is to assign to each ambiguous word $w_i$ in a text of $m$ words the most appropriate of its senses $w_{i,j}$ given the context. The definition of sense $j$ of word $i$ is noted $d(w_{i,j})$. The search-space corresponds to all the possible sense combinations for the text being processed. Therefore, a configuration $C$ of the problem can be represented as an array of integers such that $j = C[i]$ is the selected sense $j$ of $w_i$.
4.1 Problem configuration and global score
The algorithms require some fitness measure to evaluate how good a configuration is. With
this in mind, the score of the selected sense of a word can be expressed as the sum of the
local scores between that sense and the selected senses of all the other words of the text.
Hence, in order to obtain a fitness value (global score) for the whole configuration, it is
possible to simply sum the scores for all selected senses of the words of the text:
$$Score(C) = \sum_{i=1}^{m} \sum_{j=i}^{m} ExtLesk(w_{i,C[i]}, w_{j,C[j]})$$
The complexity of this algorithm is hence $O(m^2)$, where $m$ is the number of words in the text.
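A minimal sketch of this global score follows; the `ext_lesk` callable (returning the local relatedness between the selected senses of two words) and its signature are assumptions made for the illustration, not the authors' implementation.

```python
# Illustrative sketch of the global fitness of Section 4.1: a configuration C
# selects one sense index per word, and its score sums the local measure over
# all pairs of selected senses.
def global_score(C, ext_lesk):
    m = len(C)
    total = 0
    for i in range(m):
        for j in range(i, m):          # O(m^2) local-measure evaluations
            total += ext_lesk(i, C[i], j, C[j])
    return total
```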
4.2 Genetic algorithm for Word Sense Disambiguation
The Genetic Algorithm (GA) based on (Gelbukh et al., 2003) can be divided into five phases:
initialisation, selection, crossover, mutation and evaluation. During each iteration, the algorithm
goes through each phase but the initialisation.
The initialisation phase consists in the generation of a random initial population of λ individuals (λ configurations of the problem).
During the selection phase, the score of each individual of the current population is computed.
A crossover ratio (CR) is used to determine which individuals of the current population are to
be selected for crossover. The probability of an individual being selected is CR weighted by the
ratio of the score of the current individual over that of the best individual. Individuals who are
not selected for crossover are merely cloned (copied) into the new population. Additionally
the best individual is systematically kept. After each iteration, the size of each subsequent
population is a constant λ.
During the crossover phase individuals are sorted according to their global score. If the number
of individuals is odd, the individual with the lowest score is unselected and cloned into the
new population as it cannot serve for crossover. The crossover operator is then applied on
the individuals two by two in decreasing order of their score: the resulting configurations are
swapped around two random pivots (everything but what is between the pivots is swapped).
During the mutation phase, each individual has a probability of mutating (parameter MR, for
Mutation Rate). A mutation corresponds to MN random changes in the configuration. Thus,
after the mutation phase, we obtain a modified configuration $C'_c$.
The evaluation phase corresponds to the test of the termination criteria: convergence of the
score of the best individual. In other words if the score of the best individual remains the same
for a number of generations (STH), the algorithm terminates.
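A heavily simplified Python sketch of this loop is given below; it mirrors the phases described above rather than the authors' tuned implementation. The `senses` list (number of senses per word), the `score` function and the default parameter values are illustrative assumptions (the values actually used are reported in Section 6.2), and the score-weighted selection is condensed into a constant crossover rate.

```python
# Illustrative sketch of the GA of Section 4.2 (selection details condensed).
import random

def genetic_wsd(senses, score, lam=50, CR=0.9, MR=0.15, MN=5, STH=20):
    pop = [[random.randrange(n) for n in senses] for _ in range(lam)]
    best, stable = max(pop, key=score), 0
    while stable < STH:                       # best score stable for STH generations
        pop.sort(key=score, reverse=True)
        children = [best[:]]                  # the best individual is always kept
        for a, b in zip(pop[0::2], pop[1::2]):
            if random.random() < CR:          # crossover around two random pivots
                p, q = sorted(random.sample(range(len(a)), 2))
                a, b = a[:p] + b[p:q] + a[q:], b[:p] + a[p:q] + b[q:]
            children += [a[:], b[:]]
        pop = children[:lam]
        for C in pop[1:]:                     # mutation: MN random changes
            if random.random() < MR:
                for _ in range(MN):
                    i = random.randrange(len(C))
                    C[i] = random.randrange(senses[i])
        gen_best = max(pop, key=score)
        stable = stable + 1 if score(gen_best) <= score(best) else 0
        best = max(best, gen_best, key=score)
    return best
```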
4.3 Simulated annealing for Word Sense Disambiguation
The simulated annealing approach as described in (Cowie et al., 1992) is based on the physical
phenomenon of metal cooling.
Simulated annealing works with the same configuration representation as the genetic algorithm,
however it uses a single randomly initialised configuration. The algorithm is organised in
cycles and in iterations, each cycle being composed of IN iterations. The other parameters are the initial temperature $T_0$ and the cooling rate $ClR \in [0;1]$. At each iteration a random change is made to the current configuration $C_c$, which results in a new configuration $C'_c$. Given $\Delta E = Score(C_c) - Score(C'_c)$, the probability $P(A)$ of acceptance (the probability of replacing $C_c$) of configuration $C'_c$ is:
$$P(A) = \begin{cases} 1 & \text{if } \Delta E < 0 \\ e^{-\frac{\Delta E}{T}} & \text{otherwise} \end{cases}$$
The reason why configurations with lower scores have a chance to be accepted, is to prevent the
algorithm from converging on a local maximum. Lower score configurations allow to explore
other parts of the search-space which may contain the global maximum.
At the end of each cycle, if the current configuration is the same as the configuration at the
end of the previous cycle, the algorithm terminates. Otherwise, the temperature is lowered to
T·ClR
. In other words, the more cycles it takes for the algorithm to converge, the lower the
probability to accept lower score configurations: this guarantees an eventual convergence. The
configuration with the highest score is saved after each iteration and will be taken as a result
regardless of the convergence configuration.
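A compact sketch of this procedure, under the same illustrative assumptions as the GA sketch (a `senses` list and a `score` function), is given below; it is not the authors' implementation.

```python
# Illustrative sketch of the simulated annealing of Section 4.3.
import math
import random

def simulated_annealing_wsd(senses, score, T0=100.0, ClR=0.9, IN=100):
    C = [random.randrange(n) for n in senses]     # single random initial configuration
    best, T = C[:], T0
    while True:
        start = C[:]
        for _ in range(IN):                       # one cycle = IN iterations
            Cp = C[:]
            i = random.randrange(len(Cp))
            Cp[i] = random.randrange(senses[i])   # random change
            dE = score(C) - score(Cp)
            if dE < 0 or random.random() < math.exp(-dE / T):
                C = Cp                            # accept the new configuration
            if score(C) > score(best):
                best = C[:]                       # keep the best configuration seen
        if C == start:                            # no change over a full cycle
            return best
        T *= ClR                                  # cooling
```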
5 Global Ant Colony Algorithm
5.1 Ant Colony Algorithm
Ant colony algorithms (ACA) come from biology and from observations of ant social behavior.
Indeed, these insects have the ability to collectively find the shortest path between their nest
and a source of food (energy). It has been demonstrated that cooperation inside an ant colony
is self-organised and emerges from interactions between individuals. These interactions are
often very simple and allow the colony to solve complex problems. This phenomenon is called
swarm intelligence (Bonabeau and Théraulaz, 2000) and is increasingly popular in computer
science where centralised control systems are often successfully replaced by other types of
control based on interactions between simple elements.
Artificial ants have first been used for solving the Traveling Salesman Problem (Dorigo and
Gambardella, 1997). In these algorithms, the environment is usually represented by a graph, in
which virtual ants exploit pheromone trails deposited by others, or pseudo-randomly explore
the graph.
These algorithms are a good alternative for the resolution of problems modeled with graphs.
They allow a fast and efficient exploration close to other search methods. Their main advantage
is their high adaptivity to changing environments. Readers can refer to (Dorigo and Stützle,
2004), (Monmarche et al., 2009) or (Guinand and Lafourcade, 2010) for a state of the art.
5.2 Ant colony Algorithm for Word Sense Disambiguation
5.2.1 Principle
The environment of the ant colony algorithm is a graph that can be linguistic, a morphological
lattice (Rouquet et al., 2010), morpho-syntactic (Schwab and Lafourcade, 2007), or simply
organised following the structure of the text (Guinand and Lafourcade, 2010).
Depending on the environment chosen, the results of the algorithm differ. We are currently
investigating this aspect, but as the focus of our article is to make a comparison between ACA
and the two other methods presented earlier, we will use a simple graph following the structure
of the text (see Fig. 1) that uses no external linguistic information (no morpho-syntactic links
within a sentence for example).
Figure 1: The environment for our experiment: text, sentences and words correspond to
common nodes (1-10) and word senses to nests (11-19).
In this graph, we distinguish two types of nodes: nests and plain nodes. Following (Schwab,
2005) or (Guinand and Lafourcade, 2010), each possible word sense is associated to a nest.
Nests produce ants that move in the graph in order to find energy and to bring it back to their
mother nest: the more energy is deposited by ants, the more ants can be produced by the nest
in turn. Ants carry an odour (array) that contains the words of the definition of the sense of their mother nest. From the point of view of an ant, a node can be: (1) its mother nest, where it was
born; (2) an enemy nest that corresponds to another sense of the same word; (3) a potential
friend nest: any other nest; (4) a plain node: any node that is not a nest. Furthermore, to
each plain node is also associated an odour vector of a fixed length that is initially empty. For
example, in Fig. 1, for an ant born in nest 19: nest 18 is an enemy (as they are linked to the same word node, 10), its potential friend nests are nests 11 to 17, and the common nodes are nodes 1 to 10.
Ant movements depend on the scores given by the local algorithm (cf. Section 3.1), on the presence of energy, on the passage of other ants (when passing over an edge, ants leave a pheromone trail that evaporates over time) and on the nodes' odour vectors (ants deposit a part of their odour on the nodes they go through). When an ant arrives on a nest of another
term (that corresponds to a sense thereof), it can either continue its exploration or depending
on the score between this nest and its mother nest, decide to build a bridge between them
and to follow it home. Bridges behave like normal edges except that if at any given time the
concentration of pheromone reaches 0, the bridge collapses.
Depending on the lexical information present and the structure of the graph, ants will favour
following bridges between more closely related senses. Thus, the more closely related the senses
of the nests are, the more a bridge between them will contribute to their mutual reinforcement
and to the sharing of resources between them (thus forming meta-nests); while the bridges
between more distant senses will tend to fade away. We are thus able to build (constructive approach) interpretative paths (possible interpretations of the text) through emergent behaviour and to suppress the need to use a complete graph that includes all the links between the senses from the start (as is usually the case with classical graph-based optimisation approaches).
5.2.2 Implementation details
In this section we first present the notations used (Table 1) as well as the parameters of the Ant
Colony Algorithm and their typical value ranges (Table 2), followed by a detailed description of
the different steps of the algorithm.
Notation | Description
F_A | Nest that corresponds to sense A
f_A | Ant born in nest F_A
V(X) | Odour vector associated with X (ant or node)
E(X) | Energy on/carried by X (ant or node)
Eval_f(N) | Evaluation of a node N by an ant f
Eval_f(A) | Evaluation of an edge A (quantity of pheromone) by an ant f
ϕ_t(A), ϕ_c(A) | Quantity of pheromone on edge A at a given moment t or cycle c

Table 1: Main notations for the Ant Colony Algorithm
Notation | Description | Value
E_a | Energy taken by an ant when it arrives on a node | 1-30
E_max | Maximum quantity of energy an ant can carry | 1-60
δ | Evaporation rate of the pheromone between two cycles | 0.0-1.0
E_0 | Initial quantity of energy on each node | 5-60
ω | Ant life-span | 1-30 (cycles)
L_V | Odour vector length | 20-200
δ_V | Percentage of the odour vector components (words) deposited by an ant when it arrives on a node | 0-100%
cac | Number of cycles of the simulation | 1-500

Table 2: Parameters of the Ant Colony Algorithm and their typical value ranges
5.2.3 Simulation
The execution of the algorithm is a potentially infinite succession of cycles. After each cycle, the
state of the environment can be observed and used to generate a solution. A cycle is composed
of the following actions: (1) eliminate dead ants and bridges with no pheromone; (2) for each
nest, potentially produce an ant; (3) for each ant: determine its mode (energy search or return);
make it move; potentially create an interpretative bridge; (4) update the environment (energy
levels of nodes, pheromone and odour vectors).
Ant production, death and energy model
Initially, we assign a fixed quantity of energy $E_0$ to each node of the environment. At the beginning of each cycle, each nest node $N$ has an opportunity to produce an ant $A$ using 1 unit of energy, with a probability $P(N_A)$. In accordance with (Schwab and Lafourcade, 2007) or (Guinand and Lafourcade, 2010), we define it as the following sigmoid function (often used with artificial neural networks (Lafourcade, 2011)):
$$P(N_A) = \frac{\arctan(E(N))}{\pi} + 0.5$$
When created, an ant has a lifespan of
ω
cycles (see Table 2). When the life of an ant reaches
zero, the ant is deleted at the beginning of the next cycle and the energy it carried is deposited
on the node where it died. By thus doing, we ensure that the global energy equilibrium of the
system is preserved, which plays a fundamental role in the convergence (monopolization of
resources by certain nests) to a solution.
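As a small illustration of the production decision, the sigmoid above can be sketched as follows; the function name is hypothetical.

```python
# Illustrative sketch: a nest with more energy is more likely to produce an ant.
import math
import random

def maybe_produce_ant(nest_energy):
    """Return True when the nest produces an ant this cycle."""
    p = math.atan(nest_energy) / math.pi + 0.5
    return random.random() < p

print(math.atan(30) / math.pi + 0.5)   # ~0.99 for a well-provisioned nest
```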
Ant movements
The ants' movements are random, but influenced by the environment. When an ant is on a node, it assigns a transition probability to the edges leading to all neighbouring nodes. The probability to cross an edge $A_j$ in order to reach a node $N_i$ is
$$P(N_i, A_j) = \frac{Eval_f(N_i, A_j)}{\sum_{k=1}^{n}\sum_{l=1}^{m} Eval_f(N_k, A_l)}$$
where $Eval_f(N, A) = Eval_f(N) + Eval_f(A)$ is the evaluation function of a node $N$ when coming from an edge $A$.
A newborn ant seeks food. It is attracted by the nodes which carry the most energy, $Eval_f(N) = \frac{E(N)}{\sum_{i=0}^{m} E(N_i)}$, but avoids going through edges with a lot of pheromone, $Eval_f(A) = 1 - \varphi_t(A)$, in order to favour a greater exploration of the search space. The ant collects as much energy as possible until it decides to bring it back home (return mode) with the probability $P(return) = \frac{E(f)}{E_{max}}$ (consequently, when the ant reaches its carrying capacity, the probability to switch to return mode is 1). Then, it moves while following (statistically) the edges that contain the most pheromone, $Eval_f(A) = \varphi_t(A)$, and leading to nodes with an odour close to its own, $Eval_f(N) = \frac{ExtLesk(V(N), V(f_A))}{\sum_{i=1}^{k} ExtLesk(V(N_i), V(f_A))}$.
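The roulette-wheel style choice implied by these formulas can be sketched as follows; the weights and variable names are illustrative, and in the real system they are derived from energy, pheromone and odour as described above.

```python
# Illustrative sketch of the probabilistic edge choice: each candidate move has
# a weight Eval_f(N) + Eval_f(A); one neighbour is drawn proportionally.
import random

def choose_move(weights):
    total = sum(weights)
    if total <= 0:                          # degenerate case: uniform choice
        return random.randrange(len(weights))
    r, acc = random.uniform(0, total), 0.0
    for idx, w in enumerate(weights):
        acc += w
        if r <= acc:
            return idx
    return len(weights) - 1

# Energy-seeking ant: attracted by node energy, repelled by pheromone (1 - phi).
energies, pheromones = [3.0, 1.0, 6.0], [0.2, 0.8, 0.1]
weights = [e / sum(energies) + (1 - p) for e, p in zip(energies, pheromones)]
print(choose_move(weights))
```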
Creation, deletion and types of bridges
When an ant arrives on a node adjacent to a
potential friend nest (i.e. that corresponds to a sense of a word), it has to decide between
taking any of the possible paths or going onto that nest node. As such, we are dealing with a particular case of the ant path selection algorithm presented above in Section 5.2.3, with $Eval_f(A) = 0$ (the pheromone on the edge is ignored). The only difference is that if the ant
chooses to go on the potential friend nest, a bridge between that nest and the ant’s home nest
is built and the ant follows it to go home. Bridges behave like regular edges, except that if the
concentration of pheromone on them reaches 0, they collapse and are removed.
Pheromone Model
Ants have two types of behaviours: they are either looking to gather
energy or to return to their mother nest. When they move in the graph, they leave pheromone
trails on the edges they pass through. The pheromone density on an edge influences the
movements of the ants: they prefer to avoid edges with a lot of pheromone when they are
seeking energy and to follow them when they want to bring the energy back to their mother
nest.
When passing over an edge $A$, ants leave a trail by depositing a quantity of pheromone $\theta \in \mathbb{R}^+$ such that $\varphi_{t+1}(A) = \varphi_t(A) + \theta$.
Furthermore, at each cycle, there is a slight (linear) evaporation of the pheromone (penalizing little-frequented paths). Thus, $\varphi_{t+1}(A) = \varphi_t(A) \times (1 - \delta)$, where $\delta$ is the pheromone evaporation rate.
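A minimal sketch of this deposit/evaporation cycle, with a plain dictionary standing in for the edge set (an illustrative data structure, not the authors' implementation):

```python
# Illustrative sketch of the pheromone model: each passing ant adds theta to an
# edge; at the end of a cycle every trail evaporates linearly at rate delta.
def deposit(phero, edge, theta=1.0):
    phero[edge] = phero.get(edge, 0.0) + theta

def evaporate(phero, delta=0.9):
    for edge in list(phero):
        phero[edge] *= (1.0 - delta)

phero = {}
deposit(phero, ("node3", "nest12"))
evaporate(phero, delta=0.9)
print(phero)   # {('node3', 'nest12'): 0.1}
```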
Odour
The odour of a nest is the numerical sense vector (as introduced in Section 3.1) and
corresponds to the definition of the sense associated to the nest. All ants born in the same nest
have the same odour vector. When an ant arrives on a common node $N$, it deposits some of the components of its odour vector (following a uniform distribution), which will be added to or will replace existing components of the node's vector $V(N)$. The odour of nest nodes, on the other hand, is never modified.
This mechanism allows ants to find their way back to their mother nest. Indeed, the closer a
node is to a given nest, the more ants from that nest will have passed through and deposited
odour components. Thus, the odour of that node will reflect its nest neighbourhood and allow
ants to find their way by computing the score between their odour (that of their mother nest)
and the surrounding nodes and by choosing to go on the node yielding the highest score. This
process leaves some room for error (such as an ant arriving on a nest other than its own), which
is beneficial as it leads ants to build more bridges (see Section 5.2.3).
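A possible sketch of the deposit step, assuming odour vectors are plain lists of word indices as in Section 3.1; the exact update rule below is an illustrative simplification, not the authors' implementation.

```python
# Illustrative sketch: an ant overwrites a fraction delta_V of a plain node's
# odour components with components of its own (mother-nest) odour.
import random

def deposit_odour(node_odour, ant_odour, delta_V=0.5):
    k = int(len(node_odour) * delta_V)
    for i in random.sample(range(len(node_odour)), k):
        node_odour[i] = random.choice(ant_odour)

node = [0] * 10                    # plain-node odour vector, length L_V = 10
ant = [14, 34, 90, 123]            # word indices of the mother nest's definition
deposit_odour(node, ant)
print(node)
```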
5.3 Global Evaluation
At the end of each cycle, we build the current problem configuration from the graph: for
each word, we take the sense corresponding to the nest with the highest quantity of energy.
Subsequently, we compute the global score of the configuration (see Section 4.1). Over the
execution of the algorithm we keep the configuration with the highest absolute score, which
will be used at the end to generate the solution.
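A minimal sketch of this extraction step, assuming the per-nest energy levels are available as nested lists (an illustrative representation):

```python
# Illustrative sketch of the global evaluation: keep, for each word, the sense
# whose nest holds the most energy; the result is then scored as in Section 4.1.
def extract_configuration(nest_energy):
    """nest_energy[i][j]: energy of the nest of sense j of word i."""
    return [max(range(len(word)), key=word.__getitem__) for word in nest_energy]

print(extract_configuration([[0.5, 3.2], [1.0, 0.2, 4.7]]))   # [1, 2]
```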
5.4 Main parameters
Here we present a short characterisation of the influence of the parameters on the emergent
phenomena in the system:
- The maximum amount of energy an ant can carry, $E_{max}$, influences how much an ant explores the environment. Ants cannot go back through an edge they have just crossed and have to make circuits to come back to their nest (if the ant does not die before that). The size of the circuits depends on the moment the ants switch to return mode, hence on $E_{max}$.
- The evaporation rate of the pheromone between cycles ($\delta$) is one of the memories of the system. The higher the rate, the less importance is given to the trails left by previous ants and the faster interpretative paths have to be confirmed (passed on) by new ants in order not to be forgotten by the system.
- The initial amount of energy per node ($E_0$) and the ant life-span ($\omega$) influence the number of ants that can be produced and therefore the probability of reinforcing less likely paths.
- The odour vector length ($L_V$) and the proportion of odour components deposited by an ant on a plain node ($\delta_V$) are two dependent parameters that influence the global system memory. The longer the vector, the longer the memory of the passage of an ant is kept. On the other hand, the proportion of odour components deposited has the inverse effect.
Given the lack of an analytical way of determining the optimal parameters of both the Ant
Colony Algorithm and the other algorithms presented, they have to be estimated experimentally,
which is detailed in Section 6.
6 Empirical Evaluation
In this section we first describe the evaluation task we used to evaluate the three systems,
followed by the methodology we used for the estimation of the parameters, and then the exper-
imental protocol for the empirical quantitative comparison of the algorithms and subsequently,
the interpretation of the results. Finally, we briefly compare the number of evaluations of
the semantic similarity score function (ExtLesk) and discuss the positioning of our system relative to the other systems that are evaluated on Semeval 2007 Task 7 (participating systems and more recent advances).
6.1 Evaluation Campaign Task
We evaluated the algorithms with the SemEval 2007 coarse-grained English all-words task 7
corpus (Navigli et al., 2007). Composed of 5 texts from various domains (journalism, book review, travel, computer science, biography), the task consists in annotating 2269 words with one of their possible senses from WordNet, with an average degree of polysemy of 6.19. The evaluation of the output of the algorithm is done considering coarse-grained sense distinctions, i.e. close senses are counted as equivalent (e.g. snow/precipitation and snow/cover).
A Perl script that evaluates the quality of the solutions is provided with the task files and allows us to compute the Precision, Recall, and F1 score (the harmonic mean of P and R; when 100% of the corpus is annotated, P = R = F1), which are the standard measures for evaluating WSD algorithms (Navigli, 2009).
6.2 Parameter estimation
The algorithms we are interested in have a certain number of parameters that need tuning in
order to obtain the best possible score on the evaluation corpus. There are three approaches:
- Make an educated guess about the value ranges based on a priori knowledge about the dynamics of the algorithm;
- Test manually (or semi-manually) several combinations of parameters that appear promising and determine the influence of making small adjustments to the values;
- Use a learning algorithm on a subset of the evaluation corpus, for example with SA or GA, to find the parameters that lead to the best score.
Both GA and SA, as presented in (Gelbukh et al., 2003) and (Cowie et al., 1992) respectively,
use the Lesk semantic similarity measure as a score metric. We reimplemented them with the
ExtLesk measure and used the optimal parameters provided. However, the similarity values are higher than with the standard Lesk algorithm and we had to adapt the parameters to reflect that difference. We made one parameter vary at a time over 10 executions, in order to maximise the F1 measure.
For our ACA, given that an execution of the algorithm is very fast (between 15 and 40 seconds for the first text of the corpus, depending on the parameters), it was possible to use a simplified simulated annealing approach for the automated estimation of the parameters. For each parameter combination we ran the algorithm 50 times and considered the means coupled with the p-values from a one-way ANOVA. We still needed to use our a priori knowledge to set likely parameter intervals and discrete steps for each of them (a supervised approach to parameter tuning, which does not affect the unsupervised nature of the algorithm itself).
The best parameters we found are:
- For GA: λ = 500, CR = 0.9, MR = 0.15, MN = 80;
- For SA: T0 = 1000, ClR = 0.9, IN = 1000;
- For ACA: ω = 25, Ea = 16, Emax = 56, E0 = 30, δV = 0.9, δ = 0.9, LV = 100.
6.3 Experimental Protocol
The objective of the experiments is to compare the three algorithms according to different criteria: first, the F1 score obtained on Task 7 of Semeval 2007, and then the execution time.
Furthermore, given that they use the same similarity measure and that it is the computational
bottleneck, we also measure the average number of similarity computations between word
senses.
Since the algorithms are stochastic, we need to have a representation of the distribution of
solutions as accurate as possible in order to make statistically significant comparisons and thus
we ran the algorithms 100 times each.
The first step in the evaluation of the significance of the results is the choice of an appropriate
statistical tool. In this case we are using a one-way ANOVA analysis (Miller, 1997), coupled with a Tukey HSD post-hoc pairwise analysis (Hsu, 1996). These two techniques rely on three principal assumptions: independence of the groups, normal distribution of the samples, and within-group homogeneity of variances.
After the direct comparison of the results, we apply a majority voting strategy for each word among ten consecutive executions so as to obtain 100 overlapping vote results. The same evaluation methodology is applied. In both cases, the baseline for our comparison is the first-sense (FS) baseline. Let us now check the ANOVA assumptions and analyse the results.
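For reference, the vote itself is a simple per-word majority over the answers of ten runs, as in the following illustrative sketch (the data layout is an assumption made for the example).

```python
# Illustrative sketch of the majority vote: for each word, the sense chosen
# most often across the runs becomes the voted answer.
from collections import Counter

def majority_vote(runs):
    """runs: one configuration (list of sense indices) per execution."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*runs)]

runs = [[0, 2, 1], [0, 1, 1], [3, 1, 1]]      # three toy runs over three words
print(majority_vote(runs))                     # [0, 1, 1]
```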
6.4 Quantitative results
In order to check the normality assumption for ANOVA, we computed the correlation between
the theoretical (normal distribution) and the empirical quantiles. For all metrics and algorithms
there was always a correlation above 0.99. Furthermore we used Levene’s variance homogeneity
test and found a minimum significance level of $10^{-6}$ between all algorithms and metrics.
Algorithm | F1 (%) | σ F1 | Time (s) | σ Time | Sim. Eval. | σ (S.Ev.)
F.S. Baseline | 77.59 | N/A | N/A | N/A | N/A | N/A
G.A. | 73.98 (74.53)† | 0.0052 | 4,537.6† | 963.2 | 137,158,739† | 13,784.43
S.A. | 74.23 (75.18)† | 0.0028 | 1,436.6† | 167.3 | 4,405,304† | 50,805.27
A.C.A. | 76.41 (77.50)† | 0.0048 | 65.46† | 0.199 | 1,559,049† | 17,482.45

Table 3: Comparison of the F1 scores (score after vote in brackets), execution times and similarity measure evaluations of the algorithms († p < 0.01), over texts 2 through 5.
Table 3 presents the results, for the three algorithms with the
F1
scores, execution time and
number of evaluations of the similarity measure along with their respective standard deviations.
For all three metrics and between all three algorithms, the difference in the means was systematically significant with p < 0.01. Since the first text was used to train the ACA parameters, the F1 scores presented in this section are calculated over the next 4 texts in order to remove any bias.
Figure 2: Boxplots of the F1 scores compared to the first sense baseline (dashed line); (a) normal execution, (b) majority vote.
Figures 2a and 2b present boxplots of the distributions of F1 scores for all three algorithms, respectively without and with vote, compared to the first sense baseline.
In terms of
F1
score, SA and GA obtain similar results, even though SA is slightly better and
shows a lower variability in the score distribution. As for ACA, the scores are on average
+1.61% better than SA and +1.76% better than GA, with a variability similar to that of GA. All
three algorithms are below the FS baseline, even though the maximum of the ACA distribution
is close. After applying a majority vote on the answer files 10 by 10, there is a slight improvement of the scores for SA and GA (p < 0.01), respectively +0.17% and +0.46%, and a larger improvement for ACA (p < 0.01) of +1.17% (despite a certain number of lower-bound outliers). ACA tends to converge on solutions close to the FS baseline. After the vote, the distribution is practically centred around the latter as far as the score is concerned.
In terms of execution times there are huge differences between the algorithms, the slowest being GA, which on average runs for over 1.5 h (± 16 min). SA is much faster and takes on average about 24 min (± 2.8 min), but still remains much slower than ACA, which converges on average in 65 s (± 190 ms). As one would expect, the number of evaluations of the similarity measure is directly correlated with the execution times of the algorithms (Pearson correlation of 0.9969).
6.5 Comparison to other Task 7 systems
For the comparison to other systems, we restricted ourselves to those that disambiguate the
whole text and not only a subset, in order to make a fair comparison. Furthermore, given that
the results are over the 5 texts, we will consider the ACA results for the 5 texts as well, contrary to the previous section. Vis-à-vis the original participants, our algorithm is ahead of all other unsupervised systems, and beats the weakest supervised system by getting very close to the first sense baseline. However, most supervised systems are still ahead. If we add more recent results from the experiments of (Navigli and Ponzetto, 2012) using Babelnet, a multilingual database which adds multilingual and monolingual links to WordNet, the Degree algorithm reaches a score almost as good as ACA (only WordNet), at -0.63%, followed by PageRank (Mihalcea et al., 2004) (adapted to Babelnet) at -5.43%. When a voting strategy is used, ACA is ahead by 1.75% compared to Degree. However, it is important to note that when looking at the scores per part of speech, Degree exhibits notably higher results for nouns (85% versus 76.35%), while ACA performs much better for adverbs, adjectives and verbs (83.98%, 82.44%, 74.16%).
System | A | P | R | F1
UoR-SSI† | 100 | 83.21 | 83.21 | 83.21
NUS-PT† | 100 | 82.5 | 82.5 | 82.5
NUS-ML† | 100 | 81.58 | 81.58 | 81.58
LCC-WSD† | 100 | 81.45 | 81.45 | 81.45
GPLSI† | 100 | 79.55 | 79.55 | 79.55
ACA Maj. Vote (Wn) | 100 | 78.76 | 78.76 | 78.76
UPV-WSD† | 100 | 78.63 | 78.63 | 78.63
ACA (Wn) | 100 | 77.64 | 77.64 | 77.64
Degree (Babelnet) | 100 | 77.01 | 77.01 | 77.01
Page Rank (Babelnet) | 100 | 72.60 | 72.60 | 72.60
TKB-UO | 100 | 70.21 | 70.21 | 70.21
RACAI-SYNWSD | 100 | 65.71 | 65.7 | 65.7
FS Baseline | 100 | 78.89 | 78.89 | 78.89

Table 4: Comparison with 100% coverage systems evaluated on Semeval 2007 Task 7 († supervised system)
Conclusions and perspectives
In this paper we have presented three stochastic algorithms for knowledge-based unsupervised Word Sense Disambiguation: a genetic algorithm and a simulated annealing algorithm, both from the state of the art, and our own ant colony algorithm. All
three algorithms belong to the family of incomplete approaches, which allows us to consider
the whole text as the disambiguation context and thus to go further towards ensuring the
global coherence of the disambiguation of the text. We have estimated the best parameter
values and then evaluated and compared the three algorithms empirically in terms of the F1 score on Semeval 2007 Task 7, the execution time, as well as the number of evaluations of the similarity measure. We found that the ACA is notably better both in terms of score and execution time. Furthermore, the number of evaluations of the similarity measure is directly correlated with the computation time. Then, we applied a majority vote strategy, which led to
only small improvements for SA and GA and more substantial improvements for ACA. The vote
strategy allowed ACA to reach the level of the first sense baseline, to beat the state-of-the-art
unsupervised systems and the lowest performing supervised systems.
However, some open-ended questions remain. The three methods rely on parameters that
are not (a priori) linguistically grounded and that have an influence both on the results and
the computation time. The estimation of the parameters, whether manual or through an automated learning algorithm, prevents these algorithms from being entirely unsupervised.
However, the degree of supervision remains far below supervised approaches that use training
corpora approximately 1000 times larger. Our work is currently focussed on the study of these
parameters for ACA. Their values seem to depend mostly on the structure of the text and on its
consequences on the environment graph of ACA, as we have outlined in the paper. Moreover,
we are also interested in determining the degree to which the similarity measure and the lexical resources it uses influence the parameters. We are currently adapting our ExtLesk to use Babelnet in order to investigate the matter further as well as to enable us to perform
Multilingual Word Sense Disambiguation.
References
Agirre, E. and Edmonds, P. (2006). Word Sense Disambiguation: Algorithms and Applications
(Text, Speech and LT). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Banerjee, S. and Pedersen, T. (2002). An adapted lesk algorithm for word sense disambiguation
using wordnet. In CICLing 2002, Mexico City.
Bonabeau, É. and Théraulaz, G. (2000). L’intelligence en essaim. Pour la science, (271):66–73.
Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.
In Proceedings of the seventh international conference on World Wide Web 7, WWW7, pages
107–117, Amsterdam, The Netherlands, The Netherlands. Elsevier Science Publishers B. V.
Brody, S. and Lapata, M. (2008). Good neighbors make good senses: Exploiting distributional
similarity for unsupervised WSD. In Proceedings of the 22nd International Conference on
Computational Linguistics (Coling 2008), pages 65–72, Manchester, UK.
Cowie, J., Guthrie, J., and Guthrie, L. (1992). Lexical disambiguation using simulated
annealing. In COLING 1992, volume 1, pages 359–365, Nantes, France.
Cramer, I., Wandmacher, T., and Waltinger, U. (2010). WordNet: An electronic lexical database,
chapter Modeling, Learning and Processing of Text Technological Data Structures. Springer.
Dorigo and Stützle (2004). Ant Colony Optimization. MIT-Press.
Dorigo, M. and Gambardella, L. (1997). Ant colony system: A cooperative learning approach
to the traveling salesman problem. IEEE Transactions on Evolutionary Computation, 1:53–66.
Fellbaum, C. (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Commu-
nication). The MIT Press.
Gale, W., Church, K., and Yarowsky, D. (1992). One sense per discourse. In Fifth DARPA Speech
and Natural Language Workshop, pages 233–237, Harriman, New-York, États-Unis.
Gelbukh, A., Sidorov, G., and Han, S. (2005). On some optimization heuristics for lesk-like
wsd algorithms. In 10th International Conference on Applications of Natural Languages to
Information Systems, NLDB-2005, pages 402–405, Alicante, Spain.
Gelbukh, A., Sidorov, G., and Han, S. Y. (2003). Evolutionary approach to natural language wsd
through global coherence optimization. WSEAS Transactions on Communications, 2(1):11–19.
Guinand, F. and Lafourcade, M. (2010). Artificial Ants. From Collective Intelligence to Real-life
Optimization and Beyond, chapter 20 - Artificial ants for NLP, pages 455–492. Lavoisier.
Hirst, G. and St-Onge, D. D. (1998). Lexical chains as representations of context for the
detection and correction of malapropisms. WordNet: An electronic Lexical Database. C. Fellbaum.
Ed. MIT Press. Cambridge. MA, pages 305–332. Ed. MIT Press.
Hsu, J. (1996). Multiple Comparisons: Theory and Methods. Chapman and Hall.
Ide, N. and Véronis, J. (1998). Word sense disambiguation: the state of the art. Computational Linguistics, 28(1):1–41.
Lafourcade, M. (2011). Lexique et analyse sémantique de textes – structures, acquisitions,
calculs, et jeux de mots. HDR de l’Université Montpellier II.
Lesk, M. (1986). Automatic sense disambiguation using mrd: how to tell a pine cone from an
ice cream cone. In Proceedings of SIGDOC ’86, pages 24–26, New York, NY, USA. ACM.
Mihalcea, R., Tarau, P., and Figa, E. (2004). Pagerank on semantic networks, with application to
word sense disambiguation. In Proceedings of the 20th international conference on Computational
Linguistics, COLING ’04, Stroudsburg, PA, USA. Association for Computational Linguistics.
Miller, R. G. (1997). Beyond ANOVA: Basics of Applied Statistics (Texts in Statistical Science
Series). Chapman & Hall/CRC.
Monmarche, N., Guinand, F., and Siarry, P., editors (2009). Fourmis Artificielles et Traitement
de la Langue Naturelle. Lavoisier, Prague, Czech Republic.
Navigli, R. (2009). Word sense disambiguation: a survey. ACM Computing Surveys, 41(2):1–69.
Navigli, R. and Lapata, M. (2010). An experimental study of graph connectivity for unsuper-
vised word sense disambiguation. IEEE Trans. Pattern Anal. Mach. Intell., 32:678–692.
Navigli, R., Litkowski, K. C., and Hargraves, O. (2007). Semeval-2007 task 07: Coarse-grained
english all-words task. In SemEval-2007, pages 30–35, Prague, Czech Republic.
Navigli, R. and Ponzetto, S. P. (2012). Babelnet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. http://dx.doi.org/10.1016/j.artint.2012.07.004.
Pedersen, T., Banerjee, S., and Patwardhan, S. (2005). Maximizing Semantic Relatedness to
Perform WSD. Research report, University of Minnesota Supercomputing Institute.
Ponzetto, S. P. and Navigli, R. (2010). Knowledge-rich word sense disambiguation rivaling su-
pervised systems. In Proceedings of the 48th Annual Meeting of the Association for Computational
Linguistics, pages 1522–1531.
Rouquet, D., Falaise, A., Schwab, D., Boitet, C., Bellynck, V., Nguyen, H.-T., Mangeot, M., and
Guilbaud, J.-P. (2010). Rapport final de synthèse, passage à l’échelle et implémentation : Ex-
traction de contenu sémantique dans des masses de données textuelles multilingues. Technical
report, Agence Nationale de la Recherche.
Schwab, D. (2005). Approche hybride pour la modélisation, la détection et l’exploitation des
fonctions lexicales en vue de l’analyse sémantique de texte. PhD thesis, Université Montpellier 2.
Schwab, D., Goulian, J., and Guillaume, N. (2011). Désambiguïsation lexicale par propagation de mesures sémantiques locales par algorithmes à colonies de fourmis. In Traitement Automatique des Langues Naturelles (TALN), Montpellier, France.
Schwab, D. and Lafourcade, M. (2007). Lexical functions for ants based semantic analysis. In
ICAI’07- The 2007 International Conference on Artificial Intelligence, Las Vegas, Nevada, USA.
Tchechmedjiev, A. (2012). État de l'art : mesures de similarité sémantique et algorithmes globaux pour la désambiguïsation lexicale à base de connaissances. In RECITAL 2012, Grenoble. ATALA.
... Among these, supervised methods have reached the best disambiguation results , but their main disadvantage is that they need large amounts of labeled examples for the supervised learning stage. Since large annotated corpora are difficult to obtain, many researchers have turned their focus on developping unsupervised learning or knowledge-based approaches (Agirre et al., 2014;Basile et al., 2014;Bhingardive et al., 2015;Chen et al., 2014;Dongsuk et al., 2018;Moro et al., 2014;Schwab et al., 2013aSchwab et al., , 2012Schwab et al., , 2013bTripodi and Pelillo, 2017;Vial et al., 2017). ...
... Rather more generally, a global WSD algorithm aims at choosing the appropriate sense for each ambiguous word in an entire text document. The obvious solution for global WSD is the exhaustive evaluation of all sense combinations (configurations) (Patwardhan et al., 2003), but the time complexity grows exponentially along with the number of words in the document, as also noted by Schwab et al. (2013aSchwab et al. ( , 2012. For example, in the sentence "You have a good sense of humor.", ...
... This example reveals that the brute-force (BF) solution quickly becomes impractical for windows of more than a few words. Therefore, several approximation methods (Schwab et al., 2013a(Schwab et al., , 2012 have been proposed for the global WSD task in order to overcome the exponential growth of the search space. ShotgunWSD is conceived to perform global WSD by combining multiple local sense configurations that are obtained using BF search, thus avoiding BF search on the whole text document. ...
Thesis
Full-text available
This thesis studies multiple natural language processing tasks, presenting approaches on applications such as information retrieval, polarity detection, dialect identification, automatic essay scoring, and methods that can help other systems to understand documents better. Part of the described approaches from this thesis are employing kernel methods for the dialect identification task and complex word identification task. It also describes two novel approaches that can enhance the performance of string kernel methods. This thesis also describes a novel word sense disambiguation algorithm, named ShotgunWSD, that proved to work very well in various setups. Besides this, it introduces the Moldavian versus Romanian Corpus and presents the first application of the Squeeze-and-Excitation block in a natural language processing task. A voting scheme based on a string kernel model and two deep learning modes is presented in the thesis. Having inspiration from computer vision, two novel approaches that can represent documents using a finite number of features are defined and presented in this thesis. Both proved to work very well, producing state-of-the-art results on several datasets.
... Among these, supervised methods have reached the best disambiguation results [10], [11], butatheiramain disadvantageais thatathey VOLUME 4, 2016 needalargeaamounts of labeled examples for the supervised learning stage. Since large annotated corpora are difficult to obtain [12], [13], many researchers have turned their focus on developing unsupervised learning or knowledge-based WSD methods [14]- [24]. ...
... Rather more generally, a globaliWSD algorithm aims at choosing the right senseifor eachaambiguous wordain an entire textadocument. The obvious solution for global WSD isathe exhaustiveaevaluation ofaall sense combinations (configurations)i [33], butathe timeacomplexity grows exponentially along withathe number ofawords inathe document, asaalso notedaby Schwabaet al. [14], [15]. For example, inithe sentence "You have a good sense of humor.", ...
... This example reveals that the brute-forcei(BF) solutionaquickly becomes impracticalifor windowsiof moreithan aifew words. Therefore, severaliapproximation approaches [14], [15] have beenaproposed forathe globalaWSD taskain orderato overcome the exponentialagrowth ofathe searchaspace. Shotgun-WSD is conceived toaperform globalaWSD byacombining multiplealocal senseaconfigurations thataare obtained us- 1 The open source implementation of ShotgunWSD is provided for download at https://github.com/butnaruandrei/ShotgunWSD. ...
Article
Full-text available
ShotgunWSD is a recent unsupervised and knowledge-based algorithm for global word sense disambiguation (WSD). The algorithm is inspired by the Shotgun sequencing technique, which is a broadly-used whole genome sequencing approach. ShotgunWSD performs WSD at the document level based on three phases. The first phase consists of applying a brute-force WSD algorithm on short context windows selected from the document, in order to generate a short list of likely sense configurations for each window. The second phase consists of assembling the local sense configurations into longer composite configurations by prefix and suffix matching. In the third phase, the resulting configurations are ranked by their length, and the sense of each word is chosen based on a majority voting scheme that considers only the top configurations in which the respective word appears. In this paper, we present an improved version (2.0) of ShotgunWSD which is based on a different approach for computing the relatedness score between two word senses, a step that stays at the core of building better local sense configurations. For each sense, we collect all the words from the corresponding WordNet synset, gloss and related synsets, into a sense bag. We embed the collected words from all the sense bags in the entire document into a vector space using a common word embedding framework. The word vectors are then clustered using k-means to form clusters of semantically related words. At this stage, we consider that clusters with fewer samples (with respect to a given threshold) represent outliers and we eliminate these clusters altogether. Words from the eliminated clusters are also removed from each and every sense bag. Finally, we compute the median of all the remaining word embeddings in a given sense bag to obtain a sense embedding for the corresponding word sense. We compare the improved ShotgunWSD algorithm (version 2.0) with its previous version (1.0) as well as several state-of-the-art unsupervised WSD algorithms on six benchmarks: SemEval 2007, Senseval-2, Senseval-3, SemEval 2013, SemEval 2015, and overall (unified). We demonstrate that ShotgunWSD 2.0 yields better performance than ShotgunWSD 1.0 and some other recent unsupervised or knowledge-based approaches. We also performed paired McNemar’s significance tests, showing that the improvements of ShotgunWSD 2.0 over ShotgunWSD 1.0 are in most cases statistically significant, with a confidence interval of 0.01.
... Schwab and co-authors [41,42] proposed the use of ACO to solve WSD using several variants of the Lesk algorithm in which glosses are expanded using the relations in WordNet. Their result outperforms both GA and SA, but the convergence time is uncertain [43]. ...
... The authors used the SemEval 2007 coarse-grained English all-words task corpus for evaluation and comparison, and their approach shows better results than the brute-force approach. In [42], another ACO approach for WSD is proposed and compared to GA and SA on SemEval 2007 Task 7. The comparison shows that this approach is faster than the other approaches and obtains good results. ...
Article
Word Sense Disambiguation (WSD) is a key step for many natural language processing tasks such as information search, automatic translation, and sentiment analysis. WSD is the process that identifies the appropriate senses of ambiguous words in a text. With the increasing number of words to be disambiguated in large amounts of text data, WSD becomes very challenging, which is why an exhaustive search for the best set of senses may be impractical. Recently, several metaheuristic approaches have been proposed for different complex optimization problems and have achieved good results. Therefore, in order to improve the WSD process, in this paper the WSD problem is modeled as a combinatorial optimization problem, and the Discrete Student Psychology-Based Optimization (DSPBO) metaheuristic is proposed and used to select appropriate senses. A DSPBO-based WSD approach is proposed to disambiguate several ambiguous words together according to their contexts in the target text, and a Lesk-based fitness function is used to guide the DSPBO metaheuristic to optimize the overall semantic similarity of the selected senses. The proposed approach is evaluated and compared to several recent WSD approaches on the well-known corpora SensEval-2, SensEval-3, SemEval-2007, SemEval-13, and SemEval-15. The comparison is made in terms of F-measure, precision, and recall. Experiments show a significant improvement both over existing knowledge lexicon-based approaches and metaheuristic-based approaches, with a higher F-measure of 84.21%, 83.33%, 87.5%, 77.58%, and 81.08% on SensEval-2, SensEval-3, SemEval-2007, SemEval-13, and SemEval-15, respectively.
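A Lesk-based fitness function of the kind mentioned above can be sketched as a sum of pairwise gloss overlaps over an entire sense configuration. The minimal version below uses NLTK's WordNet glosses and examples and omits the gloss expansions real systems typically add; it is the objective that a metaheuristic such as DSPBO (or GA, SA, ACO) would maximize, not the exact function of the cited paper.

    # Hedged sketch of a Lesk-style fitness over a whole sense configuration.
    from nltk.corpus import wordnet as wn

    def gloss_bag(synset):
        """Bag of words from the synset gloss and its usage examples."""
        words = synset.definition().split()
        for ex in synset.examples():
            words.extend(ex.split())
        return set(w.lower() for w in words)

    def configuration_fitness(config):
        """Sum of pairwise gloss overlaps between all chosen senses."""
        bags = [gloss_bag(s) for s in config if s is not None]
        return sum(len(bags[i] & bags[j])
                   for i in range(len(bags))
                   for j in range(i + 1, len(bags)))

    # Example: score one candidate configuration for ("bank", "money").
    config = (wn.synsets("bank")[0], wn.synsets("money")[0])
    print(configuration_fitness(config))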
... The LIG-GETALP system is a similarity-based system. At the local level (between two senses), we use a Lesk similarity measure, and at the global level we use our own ant colony algorithm [12][13][14]. ...
... In this graph we distinguish two types of nodes: nests and plain nodes. Following [13], each possible word sense is associated with a nest. Nests produce ants that move in the graph in order to find energy and bring it back to their mother nest: the more energy is brought back by the ants, the more ants the nest can produce in turn. ...
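A heavily simplified sketch of this nest/energy mechanism is given below. It keeps only the production/foraging loop and omits the pheromone trails and interpretative bridges of the full algorithm; all constants (initial energy, ant cost, foraging probability) and the sense keys are illustrative.

    # Simplified nest/energy loop: productive nests (senses) accumulate energy
    # and can therefore emit more ants on later cycles.
    import random

    class Nest:
        def __init__(self, sense, energy=5.0):
            self.sense = sense
            self.energy = energy

    def simulate(nests, plain_node_energy, cycles=50, ant_cost=1.0, find_prob=0.4):
        for _ in range(cycles):
            for nest in nests:
                # A nest produces an ant only if it can pay the production cost.
                if nest.energy >= ant_cost:
                    nest.energy -= ant_cost
                    # The ant forages on plain nodes; on success it brings energy home.
                    if random.random() < find_prob and plain_node_energy > 0:
                        found = min(2.0, plain_node_energy)
                        plain_node_energy -= found
                        nest.energy += found
        # The nest (sense) with the most accumulated energy wins.
        return max(nests, key=lambda n: n.energy).sense

    nests = [Nest("bank%1:14:00"), Nest("bank%1:17:01")]  # hypothetical sense keys
    print(simulate(nests, plain_node_energy=100.0))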
... The second category relies on supervised machine learning and the use of annotated text corpora gathering examples of disambiguated word instances (Bakx et al., 2006; Navigli, 2009). The last category concerns unsupervised systems that use knowledge drawn from semantic networks (Schwab et al., 2012; Navigli, 2009). ...
Thesis
In recent years, information in the broad sense has become the cornerstone of digital transformation projects, provided it can be exploited intelligently in order to reap all of its benefits. The computerization of textual data concerns several sectors of activity, in particular the medical domain. Today, modern medicine has become almost inconceivable without the use of digital data, which has strongly affected the scientific understanding of diseases. Moreover, in recent years medical data have become increasingly complex because of their exponential growth. This strong growth generates an amount of data so large that a complete human reading within a reasonable time is no longer possible. Health professionals therefore recognize the importance of computational tools for identifying informative or predictive models through the automatic processing and analysis of medical data. This thesis is part of the ConSoRe project and aims to build cohorts of patients who are resistant to anticancer treatments. Identifying these resistances makes it possible to build models that predict the risks that could appear during patient treatment, and facilitates the individualization and reinforcement of prevention according to the estimated level of risk. This approach falls within the framework of precision medicine, which makes it possible to propose new therapeutic solutions adapted both to the characteristics of the disease (cancer) and to the profiles of the identified patients. To address these issues, we present our contributions in this manuscript. Our first contribution is a sequential approach that handles the various problems related to the pre-processing and preparation of textual data. The complexity of these tasks lies essentially in the quality and nature of the texts, and is closely linked to the particularities of the medical reports being processed. Beyond standard linguistic operations such as tokenization or sentence segmentation, we present a fairly broad arsenal of techniques for data preparation and cleaning. Our second contribution is an approach for the automatic classification of sentences extracted from medical reports. This approach essentially consists of two steps. The first trains word vectors to represent the texts so as to extract as many features as possible. The second is an automatic classification of sentences according to their semantic information. To this end, we study the different (classical and deep) machine learning algorithms that provide the best performance on our data, and we present our best algorithm. Our third and last major contribution is devoted to our approach for modeling resistance to oncology treatments. For this purpose, we present two data-structuring models. The first model structures the information identified at the level of each document (report). The second model is introduced at the patient level and makes it possible, from the information extracted from several reports of the same patient, to reconstruct the patient's neoplastic history.
This structuring makes it possible to identify treatment responses and toxicities, which are elementary components of our approach for modeling resistance to oncology treatments.
... To determine the correct sense in a particular context, Bakhouche et al. used, at the local level, a similarity measure between word senses and, at the global level, a combinatorial optimization algorithm [27,28]. The authors adopted an ant colony algorithm and compared it to simulated annealing and genetic algorithms. ...
Article
In the field of natural language processing, the semantic disambiguation of words is beneficial to several applications; it helps to identify the correct meaning of a word or a sequence of words according to the given context. It can be formulated as a combinatorial optimization problem where the goal is to find the set of meanings that best improves the semantic relationship between target words. The Crow Search Algorithm (CSA) is a nature-inspired algorithm. It mimics the food-foraging behavior of crows and their social interaction. CSA can deal with both continuous and discrete optimization problems. In this paper, Word Sense Disambiguation (WSD) is modeled as a combinatorial optimization problem, which is by nature a discrete problem. For this purpose, a discrete version of CSA has been adapted to solve the WSD problem, and a DCSA-based WSD approach, called ADCSA-WSD, is proposed. The proposed approach has been evaluated and compared with state-of-the-art approaches using three well-known benchmark datasets (SemCor 3.0, SensEval-02, SensEval-03). Experimental results show that the ADCSA-WSD approach performs better than the other approaches. Keywords: word sense disambiguation, swarm-based optimization, discrete crow search algorithm, natural language processing, combinatorial optimization, swarm-based intelligence
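As a rough, generic illustration of how one discrete crow-search update could look for WSD (this is not the ADCSA-WSD update rule from the cited paper; the awareness probability, the component-wise copy, and the fitness function are all assumptions), consider the following sketch where each crow's position is a vector of sense indices, one per ambiguous word.

    # Generic, hedged sketch of one discrete crow-search iteration for WSD.
    import random

    def crow_step(positions, memories, sense_counts, fitness, awareness=0.1):
        n = len(positions)
        for i in range(n):
            j = random.randrange(n)  # crow i picks a random crow j to follow
            if random.random() >= awareness:
                # Copy a random subset of sense choices from crow j's memory.
                new_pos = [mj if random.random() < 0.5 else pi
                           for pi, mj in zip(positions[i], memories[j])]
            else:
                # Crow j is "aware": crow i is fooled and jumps randomly.
                new_pos = [random.randrange(c) for c in sense_counts]
            positions[i] = new_pos
            # Memory keeps the best configuration each crow has found so far.
            if fitness(new_pos) > fitness(memories[i]):
                memories[i] = new_pos
        return positions, memories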
... Such algorithms have already been successfully applied to natural language processing tasks, including Word Sense Disambiguation [21]. The work presented here aims at extending this paradigm to automatic speech recognition. ...
... - RNN-SST-POS-H2: addition of morpho-syntactic annotation at the second hidden layer. - GETALP: unsupervised WSD system proposed by [Schwab et al. 2012], based on an ant colony algorithm. - The best performances are those of the models combining simple projection and RNN, which shows the complementarity of these two approaches. ...
Thesis
This thesis deals with the automatic construction of tools and resources for the linguistic analysis of texts in under-resourced languages. We propose an approach based on recurrent neural networks (RNN) that only requires a parallel or multi-parallel corpus between a well-resourced source language and one or more less-resourced or under-resourced target languages. This parallel or multi-parallel corpus is used to build a multilingual representation of the words of the source and target languages. We used this multilingual representation to train our neural models and explored two neural architectures: simple RNNs and bidirectional RNNs. We also proposed several RNN variants to take low-level linguistic information (morpho-syntactic information) into account during the construction of higher-level linguistic annotators (SuperSenses and syntactic dependencies). We demonstrated the genericity of our approach on several languages as well as on several linguistic annotation tasks. We built three types of multilingual linguistic annotators: morpho-syntactic annotators, SuperSense annotators, and syntactic dependency annotators, with very satisfactory performance. Our approach has the following advantages: (a) it does not use any word alignment information, (b) no prior knowledge about the target languages is required (our only assumption is that the source and target languages do not diverge too much syntactically), which makes it applicable to a very wide range of under-resourced languages, (c) it allows the construction of truly multilingual annotators (one annotator for N languages).
... The global approach, in turn, propagates these local measures to higher levels (phrases, sentences, paragraphs, or even the whole text, depending on the chosen algorithm) in order to disambiguate the entire text. Among global algorithms one finds, among others, genetic algorithms [Gelbukh et al., 2003], simulated annealing [Cowie et al., 1992], ant colony algorithms [Schwab et al., 2011] [Schwab et al., 2012], and, more recently, bee colony algorithms [Abualhaija and Zimmermann, 2016] and cuckoo search algorithms [Vial et al., 2016]. ...
Thesis
In this thesis we study the task of word sense disambiguation, a central task in natural language processing that can improve several applications such as machine translation or information extraction. Research in word sense disambiguation mainly concerns English, since most other languages lack a standard lexical reference for corpus annotation and lack sense-annotated corpora for evaluation and, more importantly, for building word sense disambiguation systems. For English, the WordNet lexical database is a long-standing de facto standard used in most annotated corpora and in most evaluation campaigns. Our contribution covers several axes. First, we present a method for the automatic creation of sense-annotated corpora for any language, by taking advantage of the large quantity of English corpora annotated with WordNet senses and by using a machine translation system. This method is applied to Arabic and is evaluated on the only Arabic corpus that, to our knowledge, is manually annotated with WordNet senses: the Arabic OntoNotes 5.0, which we enriched semi-automatically. Its evaluation is carried out through the implementation of two supervised systems (SVM, LSTM) trained on the corpora produced with our method. Through this work, we thus propose a solid baseline for the evaluation of future Arabic word sense disambiguation systems, in addition to the sense-annotated Arabic corpora that we provide as a freely available resource. Second, we propose an in vivo evaluation of our Arabic disambiguation system by measuring its contribution to the performance of the machine translation task.
... • MFS SemEval 2013: the most frequent sense baseline provided by SemEval 2013 for Task 12; this is a strong baseline, obtained by using an external resource (the WordNet most frequent sense). • GETALP: a fully unsupervised WSD system proposed by Schwab et al. (2012) based on an ant colony algorithm. ...
Article
Full-text available
This work focuses on the rapid development of linguistic annotation tools for low-resource languages (languages that have no labeled training data). We experiment with several cross-lingual annotation projection methods using recurrent neural network (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between source and target languages. More precisely, our approach has the following characteristics: (a) it does not use word alignment information, (b) it does not assume any knowledge about target languages (one requirement is that the two languages (source and target) are not too syntactically divergent), which makes it applicable to a wide range of low-resource languages, (c) it provides authentic multilingual taggers (one tagger for N languages). We investigate both uni- and bidirectional RNN models and propose a method to include external information (for instance, low-level information from part-of-speech tags) in the RNN to train higher-level taggers (for instance, Super Sense taggers). We demonstrate the validity and genericity of our model by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual part-of-speech and Super Sense taggers. We also use our approach in a weakly supervised context, and it shows an excellent potential for very low-resource settings (less than 1k training utterances).
Article
Full-text available
It is well-known that there are polysemous words like sentence whose "meaning" or "sense" depends on the context of use. We have recently reported on two new word-sense disambiguation systems, one trained on bilingual material (the Canadian Hansards) and the other trained on monolingual material (Roget's Thesaurus and Grolier's Encyclopedia). As this work was nearing completion, we observed a very strong discourse effect. That is, if a polysemous word such as sentence appears two or more times in a well-written discourse, it is extremely likely that they will all share the same sense. This paper describes an experiment which confirmed this hypothesis and found that the tendency to share sense in the same discourse is extremely strong (98%). This result can be used as an additional source of constraint for improving the performance of the word-sense disambiguation algorithm. In addition, it could also be used to help evaluate disambiguation algorithms that did not make use of the discourse constraint.
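Used as a post-processing constraint, this observation can be implemented by propagating, within each discourse, the sense most often assigned to a lemma to all of its occurrences. A minimal sketch (the sense labels here are placeholders):

    # "One sense per discourse" applied as a post-processing step: every
    # occurrence of a lemma is re-labeled with its majority sense.
    from collections import Counter, defaultdict

    def apply_discourse_constraint(assignments):
        """assignments: list of (lemma, sense) pairs from one discourse."""
        votes = defaultdict(Counter)
        for lemma, sense in assignments:
            votes[lemma][sense] += 1
        majority = {lemma: counts.most_common(1)[0][0] for lemma, counts in votes.items()}
        return [(lemma, majority[lemma]) for lemma, _ in assignments]

    print(apply_discourse_constraint([("sentence", "s1"), ("sentence", "s2"), ("sentence", "s1")]))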
Thesis
Full-text available
The semantic analysis of texts requires beforehand the building of objects related to lexical semantics. Idea vectors and lexical networks seem to be adequate for such a purpose and are complementary. However, one should still be able to construct them in practice. Vectors can be computed with definition corpora extracted from dictionaries, with thesauri, or with plain texts. They can be derived as conceptual vectors, anonymous vectors, or lexical vectors, each of those being a particular balance between precision, coverage, and practicality. Concerning lexical networks, they can be efficiently constructed through serious games, which is precisely the goal of the JeuxDeMots project. The semantic analysis can be tackled from the thematic analysis, and can serve as a computing means for idea vectors. We can model the analysis problem as activations and propagations. The numerous criteria occurring in the semantic analysis, and the difficulties related to the proper definition of a control function, lead us to explore metaheuristics inspired by nature. More precisely, we introduce an analysis model based on artificial ant colonies. From a given text, the analysis aims at building a graph holding objects of the text (words, phrases, sentences, etc.), highlighting objects considered relevant (phrases and concepts) as well as typed and weighted relations between those objects.
Book
with a preface by George Miller WordNet, an electronic lexical database, is considered to be the most important resource available to researchers in computational linguistics, text analysis, and many related areas. Its design is inspired by current psycholinguistic and computational theories of human lexical memory. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexicalized concept. Different relations link the synonym sets. The purpose of this volume is twofold. First, it discusses the design of WordNet and the theoretical motivations behind it. Second, it provides a survey of representative applications, including word sense identification, information retrieval, selectional preferences of verbs, and lexical chains. Contributors: Reem Al-Halimi, Robert C. Berwick, J. F. M. Burg, Martin Chodorow, Christiane Fellbaum, Joachim Grabowski, Sanda Harabagiu, Marti A. Hearst, Graeme Hirst, Douglas A. Jones, Rick Kazman, Karen T. Kohl, Shari Landes, Claudia Leacock, George A. Miller, Katherine J. Miller, Dan Moldovan, Naoyuki Nomura, Uta Priss, Philip Resnik, David St-Onge, Randee Tengi, Reind P. van de Riet, Ellen Voorhees.
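The synset structure described here (synonym sets, glosses, and relations between sets) can be queried, for instance, through NLTK's WordNet interface, assuming the WordNet data has been downloaded:

    # Short illustration of the WordNet structure: synsets, glosses, relations.
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("sense", pos=wn.NOUN)[:3]:
        print(synset.name(), "-", synset.definition())
        print("  synonyms :", [l.name() for l in synset.lemmas()])
        print("  hypernyms:", [h.name() for h in synset.hypernyms()])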
Article
Multiple Comparisons introduces simultaneous statistical inference and covers the theory and techniques for all-pairwise comparisons, multiple comparisons with the best, and multiple comparisons with a control. The author describes confidence interval methods and stepwise methods, exposes abuses and misconceptions, and guides readers to the correct method for each problem. Discussions also include the connections with bioequivalence, drug stability, and toxicity studies. Real data sets analyzed with computer software packages illustrate the applications presented.
Article
We present an automatic approach to the construction of BabelNet, a very large, wide-coverage multilingual semantic network. Key to our approach is the integration of lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition, Machine Translation is applied to enrich the resource with lexical information for all languages. We first conduct in vitro experiments on new and existing gold-standard datasets to show the high quality and coverage of BabelNet. We then show that our lexical resource can be used successfully to perform both monolingual and cross-lingual Word Sense Disambiguation: thanks to its wide lexical coverage and novel semantic relations, we are able to achieve state-of-the-art results on three different SemEval evaluation tasks.