Spectral Text Similarity Measures
Tim vor der Brück and Marc Pouly
School of Information Technology
Lucerne University of Applied Sciences and Arts
Switzerland
{tim.vorderbrueck,marc.pouly}@hslu.ch
Abstract. Estimating semantic similarity between texts is of vital importance in many areas of natural language processing, like information retrieval, question answering, text reuse, or plagiarism detection.
Prevalent semantic similarity estimates based on word embeddings are noise-sensitive: small individual term similarities can in aggregate have a considerable influence on the total estimation value. In contrast, the methods proposed here exploit the spectrum of the product of embedding matrices, which leads to increased robustness when compared with conventional methods.
We apply these estimates to two tasks: assigning people to the best-matching marketing target group and finding the correct match between sentences belonging to two independent translations of the same novel. The evaluation revealed that our proposed methods could increase accuracy in both scenarios.
1 Introduction
Estimating semantic document similarity is of vital importance in many different areas, like plagiarism detection, information retrieval, or text summarization. One drawback of current state-of-the-art similarity estimates based on word embeddings is that small term similarities can sum up to a considerable amount, making these estimates vulnerable to noise in the data. Therefore, we propose two estimates that are based on the spectrum of the product $F$ of the embedding matrices belonging to the two documents to compare. In particular, we propose the spectral radius and the spectral norm of $F$, where the first denotes $F$'s largest absolute eigenvalue and the second its largest singular value. Eigenvalue- and singular-value-oriented methods for dimensionality reduction aiming to reduce noise in the data have a long tradition in natural language processing. For instance, principal component analysis is based on eigenvalues and can be used to increase the quality of word embeddings [8]. In contrast, Latent Semantic Analysis [11], a technique known from information retrieval for improving search results in term-document matrices, focuses on the largest singular values.
Furthermore, we investigate several properties of our proposed measures that are crucial for qualifying as proper similarity estimates, considering both unsupervised and supervised learning.
Finally, we applied both estimates to two natural language processing scenarios. In the first scenario, we distribute participants of an online contest into several target groups by exploiting short text snippets they were asked to provide. In the second scenario, we aim to find the correct matching between sentences originating from two independent translations of a novel by Edgar Allan Poe.
The evaluation revealed that our novel estimators performed superior to several baseline methods in both scenarios.
The remainder of the paper is organized as follows. In the next section, we look into several state-of-the-art methods for estimating semantic similarity. Sect. 3 reviews several concepts that are vital for the remainder of the paper and build the foundation of our theoretical results. In Sect. 4, we describe in detail how the spectral radius can be employed for estimating semantic similarity. Some drawbacks and shortcomings of such an approach, as well as an alternative method that elegantly solves all of these issues by exploiting the spectral norm, are discussed in Sect. 5. The two application scenarios for our proposed semantic similarity estimates are given in Sect. 6. Sect. 7 describes the conducted evaluation, in which we compare our approach with several baseline methods. The results of the evaluation are discussed in Sect. 8. So far, we covered only unsupervised learning; in Sect. 9 we investigate how our proposed estimates can be employed in a supervised setting. Finally, this paper concludes with Sect. 10, which summarizes the obtained results.
2 Related Work
Until recently, similarity estimates were predominantly based either on ontologies [4] or on typical information retrieval techniques like Latent Semantic Analysis. In the last couple of years, however, so-called word and sentence embeddings became state-of-the-art.
The prevalent approach to document similarity estimation based on word embeddings consists of measuring similarity between vector representations of the two documents derived as follows:
1. The word embeddings (often weighted by the tf-idf coefficients of the associated words [3]) are looked up in a hash table for all the words in the two documents to compare. These embeddings are determined beforehand on a very large corpus, typically using either the skip-gram or the continuous-bag-of-words variant of the Word2Vec model [15]. The skip-gram method aims to predict the textual surroundings of a given word by means of an artificial neural network. The weights connecting the one-hot-encoded input word to the nodes of the hidden layer constitute the embedding vector. For the so-called continuous-bag-of-words method, it is just the opposite, i.e., the center word is predicted from the words in its surroundings.
2. The centroid over all word embeddings belonging to the same document is calculated to obtain its vector representation.
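As an illustration, a minimal sketch of these two steps follows, assuming gensim's KeyedVectors and a pre-trained Word2Vec model; the file name, the whitespace tokenization, and the omission of tf-idf weighting are simplifications on our part:

```python
# Minimal sketch of the centroid approach, assuming gensim and a
# pre-trained Word2Vec model; "embeddings.bin" is a placeholder path.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def centroid(doc: str) -> np.ndarray:
    # Step 1: look up the embedding of every in-vocabulary word.
    embs = [vectors[w] for w in doc.split() if w in vectors]
    # Step 2: average the embeddings to obtain the document vector.
    return np.mean(embs, axis=0)

def w2vc_similarity(doc1: str, doc2: str) -> float:
    # Cosine measure between the two document centroids.
    c1, c2 = centroid(doc1), centroid(doc2)
    return float(c1 @ c2 / (np.linalg.norm(c1) * np.linalg.norm(c2)))
```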
Alternatives to Word2Vec are GloVe [17], which is based on aggregated global word co-occurrence statistics, and Explicit Semantic Analysis (ESA) [6], in which each word is represented by its column vector in the tf-idf matrix over Wikipedia.
The idea of Word2Vec can be transferred to the level of sentences as well. In particular, the so-called Skip-Thought Vector (STV) model [10] derives a vector representation of the current sentence by predicting the surrounding sentences.
If vector representations of the two documents to compare have been successfully established, a similarity estimate can be obtained by applying the cosine measure to the two vectors. [18] propose an alternative approach for ESA word embeddings that establishes a bipartite graph consisting of the best-matching vector components by solving a linear optimization problem. The similarity estimate for the documents is then given by the global optimum of the objective function. However, this method is only useful for sparse vector representations. In the case of dense vectors, [14] suggested applying the Frobenius kernel to the embedding matrices, which contain the embedding vectors for all document components (usually either sentences or words, cf. also [9]). However, crucial limitations are that the Frobenius kernel is only applicable if the numbers of words (or sentences, respectively) in the compared documents coincide and that a word from the first document is only compared with its counterpart from the second document. Thus, an optimal matching has to be established beforehand. In contrast, the approach presented here applies to arbitrary embedding matrices. Since it compares all words of the two documents with each other, there is also no need for any matching method.
Before going into more detail, we want to review some concepts that are crucial for the remainder of this paper.
3 Similarity Measure / Matrix norms
According to [2], a similarity measure on some set $X$ is an upper-bounded, exhaustive and total function $s: X \times X \to I \subset \mathbb{R}$ with $|I| > 1$ (therefore $I$ is upper bounded and $\sup I$ exists). Additionally, a similarity measure should fulfill the properties of reflexivity (the supremum is reached if an item is compared to itself) and symmetry. We call such a measure normalized if the supremum equals 1 [1]. Note that an asymmetric similarity measure can easily be converted into a symmetric one by taking the geometric or arithmetic mean of the asymmetric measure applied twice to the same arguments in switched order.
A norm is a function $f: V \to \mathbb{R}$ over some vector space $V$ that is absolutely homogeneous, positive definite, and fulfills the triangle inequality. It is called a matrix norm if its domain is a set of matrices and if it is sub-multiplicative, i.e., $\|AB\| \le \|A\| \cdot \|B\|$. An example of a matrix norm is the spectral norm, which denotes the largest singular value of a matrix. Alternatively, one can define this norm as $\|A\|_2 := \sqrt{\rho(A^\top A)}$, where the function $\rho$ returns the largest absolute eigenvalue of the argument matrix.
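The equivalence of the two definitions can be checked numerically, e.g., with NumPy (our illustration, not part of the original text):

```python
# Check numerically that the largest singular value of A equals
# sqrt(rho(A^T A)), where rho is the largest absolute eigenvalue.
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))
sigma_max = np.linalg.svd(A, compute_uv=False)[0]   # largest singular value
rho = np.max(np.abs(np.linalg.eigvals(A.T @ A)))    # spectral radius of A^T A
assert np.isclose(sigma_max, np.sqrt(rho))
```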
4 Document Similarity Measure based on the Spectral Radius
For an arbitrary document $t$ we define the embedding matrix $E(t)$ as follows: $E(t)_{ij}$ is the $i$-th component of the normalized embedding vector belonging to the $j$-th word of the document $t$. Let $t, u$ be two arbitrary documents; then the entry $(i, j)$ of the product $F := E(t)^\top E(u)$ specifies the result of the cosine measure estimating the semantic similarity between word $i$ of document $t$ and word $j$ of document $u$.
The larger the matrix entries of $F$ are, the higher is usually the semantic similarity of the associated texts. A straightforward way to measure the magnitude of the matrix is just to sum up all absolute matrix elements, which is called the $L_{1,1}$ norm. However, this approach has the disadvantage that small cosine measure values are also included in the sum, which can in aggregate have a considerable impact on the total similarity estimate, making such an approach vulnerable to noise in the data. Therefore, we propose instead to apply an operator that is more robust than the $L_{1,1}$ norm: the spectral radius.
This radius denotes the largest absolute eigenvalue of the input matrix and constitutes a lower bound of all matrix norms. It also governs the convergence of the matrix power series $\lim_{n\to\infty} F^n$: the series converges if and only if the spectral radius does not exceed the value of one.
Since the vector components obtained by Word2Vec can be negative, the cosine measure between two word vectors can also assume negative values (though rather rarely in practice). Akin to zeros, negative cosine values indicate unrelated words as well. Because the spectral radius treats negative and positive matrix entries alike (the spectral radii of a matrix $A$ and of its negation coincide), we replace all negative values in the matrix by zero. Finally, since our measure should be restricted to values from zero to one, we have to normalize it. Formally, we define our similarity measure as follows:
\[
sn(t, u) := \frac{\rho(R(E(t)^\top E(u)))}{\sqrt{\rho(R(E(t)^\top E(t))) \cdot \rho(R(E(u)^\top E(u)))}}
\]
where $E(t)$ is the embedding matrix belonging to document $t$, in which all embedding column vectors are normalized, and $R(M)$ is the matrix in which all negative entries are replaced by zero, i.e., $R(M)_{ij} = \max\{0, M_{ij}\}$.
In contrast to matrix norms, which can be applied to arbitrary matrices, eigenvalues only exist for square matrices. However, the matrix $F^* := R(E(t)^\top E(u))$ that we use as the basis for our similarity measures is usually non-square. In particular, this matrix is square if and only if the numbers of terms in the two documents $t$ and $u$ coincide. Thus, we have to fill up the embedding matrix of the smaller of the two texts with additional embedding vectors. A quite straightforward choice, which we followed here, is to just use the centroid vector for this. An alternative approach would be to sample the missing vectors.
A further issue is that eigenvalues are not invariant under row and column permutations. The columns of the embedding matrices just represent the words appearing in the texts. However, the word order can be arbitrary for the texts representing the marketing target groups (see Sect. 6.1 for details). Since a similarity measure should not depend on some random ordering, we need to bring the similarity matrix $F^*$ into a normalized format. A quite natural choice would be to enforce the ordering that maximizes the absolute value of the largest eigenvalue (which is actually our target value). Let us formalize this. We denote by $F^*_{P,Q}$ the matrix obtained from $F^*$ by applying the permutation $P$ to the rows and the permutation $Q$ to the columns. Thus, we can define our similarity measure as follows:
\[
sim_{sr}(t, u) = \max_{P,Q} \rho(F^*_{P,Q}) \tag{1}
\]
However, solving this optimization problem is quite time-consuming. Let us assume the matrix $F^*$ has $m$ rows and columns. Then we would have to iterate over $m! \cdot m!$ different possibilities. Hence, such an approach would be infeasible already for medium-sized texts. Therefore, we instead select the permutations that maximize the absolute value of the arithmetic mean over all eigenvalues, which is a lower bound of the maximum absolute eigenvalue.
Let $\lambda_i(M)$ be the $i$-th eigenvalue of a matrix $M$. With this, we can formalize our optimization problem as follows:
\[
\widetilde{sim}_{sr}(t, u) = \rho(F^*_{\tilde P, \tilde Q}), \qquad
(\tilde P, \tilde Q) = \operatorname*{arg\,max}_{P,Q} \left| \sum_{i=1}^{m} \lambda_i(F^*_{P,Q}) \right| \tag{2}
\]
The sum over all eigenvalues is just the trace of the matrix. Thus,
\[
(\tilde P, \tilde Q) = \operatorname*{arg\,max}_{P,Q} \left| \mathrm{tr}(F^*_{P,Q}) \right| \tag{3}
\]
which is just the sum over all diagonal elements. Since we constructed our matrix $F^*$ in such a way that it contains no negative entries, we can get rid of the absolute value operator:
\[
(\tilde P, \tilde Q) = \operatorname*{arg\,max}_{P,Q} \mathrm{tr}(F^*_{P,Q}) \tag{4}
\]
Because the sum is commutative, the order of the individual summands is irrelevant. Therefore, we can leave either the row or the column ordering constant and only permute the other one:
\[
\widetilde{sim}_{sr}(t, u) = \rho(F^*_{\tilde P, \mathrm{id}}), \qquad
\tilde P = \operatorname*{arg\,max}_{P} \mathrm{tr}(F^*_{P,\mathrm{id}}) \tag{5}
\]
$\tilde P$ can be found by solving a binary linear programming problem in the following way. Let $X$ be the set of decision variables and let $X_{ij} \in X$ be one if and only if row $i$ is moved to row $j$ in the reordered matrix and zero otherwise. Then the objective function is given by $\max_X \sum_{i=1}^{m} \sum_{j=1}^{m} X_{ji} F^*_{ji}$. A permutation is a 1:1 mapping, i.e.,
\[
\sum_{i=1}^{m} X_{ij} = 1 \quad \forall j = 1, \dots, m, \qquad
\sum_{j=1}^{m} X_{ij} = 1 \quad \forall i = 1, \dots, m, \qquad
X_{ij} \in \{0, 1\} \quad \forall i, j = 1, \dots, m \tag{6}
\]
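Problem (6) is an instance of the linear assignment problem, so instead of a generic integer programming solver one can, for example, use SciPy's assignment solver; a sketch under that assumption:

```python
# Sketch: solve the trace maximization (6) as a linear assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

def trace_maximizing_reorder(F: np.ndarray) -> np.ndarray:
    # Find the row/column matching with maximal total weight.
    rows, cols = linear_sum_assignment(F, maximize=True)
    perm = np.empty_like(rows)
    perm[cols] = rows            # row perm[j] ends up in row j
    return F[perm, :]            # reordered matrix with maximal trace

F = np.array([[0.1, 0.9],
              [0.8, 0.2]])
print(np.trace(trace_maximizing_reorder(F)))   # 1.7 instead of 0.3
```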
5 Spectral Norm
The similarity estimate described above has several drawbacks:
– The boundedness condition is violated in some cases. Therefore, this estimate does not qualify as a normalized similarity estimate according to the definition in Sect. 3.
– The largest eigenvalue of a matrix depends on the row and column ordering. However, this ordering is arbitrary for our proposed description of target groups by keywords (cf. Sect. 6.1 for details). To ensure a unique eigenvalue, we apply linear optimization, which is an expensive approach in terms of runtime.
– Eigenvalues are only defined for square matrices. Therefore, we need to fill up the smaller of the embedding matrices to meet this requirement.
An alternative to the spectral radius is the spectral norm, which is defined as the largest singular value of a matrix. Formally, the spectral-norm-based estimate is given as:
\[
sn_2(t, u) := \frac{\|R(E(t)^\top E(u))\|_2}{\sqrt{\|R(E(t)^\top E(t))\|_2 \cdot \|R(E(u)^\top E(u))\|_2}}
\]
where $\|A\|_2 = \sqrt{\rho(A^\top A)}$.
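Since NumPy's matrix 2-norm is exactly the largest singular value, the whole estimate reduces to a few lines; a sketch (with `Et`, `Eu` again denoting embedding matrices with one normalized embedding per column):

```python
# Sketch of the spectral norm estimate sn2; no padding or permutation
# normalization is needed.
import numpy as np

def sn2(Et: np.ndarray, Eu: np.ndarray) -> float:
    def clamped_norm2(M):
        # ||R(M)||_2: clamp negative cosine values, take the largest
        # singular value.
        return np.linalg.norm(np.maximum(M, 0.0), ord=2)
    return clamped_norm2(Et.T @ Eu) / np.sqrt(
        clamped_norm2(Et.T @ Et) * clamped_norm2(Eu.T @ Eu))
```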
By using the spectral norm instead of the spectral radius, all of the issues mentioned above are solved. The spectral norm is not only invariant to column and row permutations, it can also be applied to arbitrary rectangular matrices. Furthermore, boundedness is guaranteed as long as no negative cosine values occur, as stated in the following proposition.
Proposition 1. If the cosine similarity values between all embedding vectors of words occurring in any of the documents are non-negative, i.e., if $R(E(t)^\top E(u)) = E(t)^\top E(u)$ for all document pairs $(t, u)$, then $sn_2$ is a normalized similarity measure.
Symmetry
Proof. At first, we focus on the symmetry condition. Let $A := E(t)$ and $B := E(u)$, where $t$ and $u$ are arbitrary documents. Symmetry directly follows if we can show that
\[
\|Z\|_2 = \|Z^\top\|_2
\]
for arbitrary matrices $Z$, since with this property we have
\[
sn_2(t, u) = \frac{\|A^\top B\|_2}{\sqrt{\|A^\top A\|_2 \cdot \|B^\top B\|_2}}
= \frac{\|(B^\top A)^\top\|_2}{\sqrt{\|B^\top B\|_2 \cdot \|A^\top A\|_2}}
= \frac{\|B^\top A\|_2}{\sqrt{\|B^\top B\|_2 \cdot \|A^\top A\|_2}}
= sn_2(u, t) \tag{7}
\]
Let $M$ and $N$ be arbitrary matrices such that $MN$ and $NM$ are both defined and square; then (see [5])
\[
\rho(MN) = \rho(NM) \tag{8}
\]
where $\rho(X)$ denotes the largest absolute eigenvalue of a square matrix $X$. Using identity (8), one can easily infer that:
\[
\|Z\|_2 = \sqrt{\rho(Z^\top Z)} = \sqrt{\rho(Z Z^\top)} = \|Z^\top\|_2 \tag{9}
\]
Boundedness
Proof. The following property needs to be shown:
\[
\frac{\|A^\top B\|_2}{\sqrt{\|A^\top A\|_2 \cdot \|B^\top B\|_2}} \le 1 \tag{10}
\]
In the proof, we exploit the fact that for every positive-semidefinite matrix $X$, the following equation holds:
\[
\rho(X^2) = \rho(X)^2 \tag{11}
\]
We observe for the denominator that
\[
\begin{aligned}
\|A^\top A\|_2 \cdot \|B^\top B\|_2
&= \sqrt{\rho((A^\top A)^\top A^\top A)} \, \sqrt{\rho((B^\top B)^\top B^\top B)} \\
&= \sqrt{\rho((A^\top A)^\top (A^\top A)^\top)} \, \sqrt{\rho((B^\top B)^\top (B^\top B)^\top)} \\
&= \sqrt{\rho([(A^\top A)^\top]^2)} \, \sqrt{\rho([(B^\top B)^\top]^2)} \\
&\overset{(11)}{=} \sqrt{\rho((A^\top A)^\top)^2} \, \sqrt{\rho((B^\top B)^\top)^2} \\
&= \rho((A^\top A)^\top) \, \rho((B^\top B)^\top) \\
&\overset{(9)}{=} \|A\|_2^2 \cdot \|B\|_2^2
\end{aligned} \tag{12}
\]
Putting things together, we finally obtain
\[
\frac{\|A^\top B\|_2}{\sqrt{\|A^\top A\|_2 \, \|B^\top B\|_2}}
\overset{\text{submult.}}{\le} \frac{\|A^\top\|_2 \cdot \|B\|_2}{\sqrt{\|A^\top A\|_2 \, \|B^\top B\|_2}}
\overset{(9)}{=} \frac{\|A\|_2 \cdot \|B\|_2}{\sqrt{\|A^\top A\|_2 \, \|B^\top B\|_2}}
\overset{(12)}{=} \frac{\|A\|_2 \cdot \|B\|_2}{\sqrt{\|A\|_2^2 \cdot \|B\|_2^2}} = 1 \tag{13}
\]
The question remains how the similarity measure induced by matrix norms performs in comparison with the usual centroid method. General statements about the spectral-norm-based similarity measure are difficult, but we can draw some conclusions if we restrict ourselves to the case where $A^\top B$ is a square diagonal matrix. In this case, each word of the first text is very similar to exactly one word of the second text and very dissimilar to all remaining words. The similarity estimate is then given by the largest eigenvalue (the spectral radius) of $A^\top B$, which equals the largest cosine measure value. Noise in the form of small matrix entries is completely ignored.
6 Application Scenarios
We applied our semantic similarity estimates to the following two scenarios:
6.1 Market Segmentation
Market segmentation is one of the key tasks of a marketer. Usually, it is accomplished by clustering over behaviors as well as demographic, geographic, and psychographic variables [12]. In this paper, we describe an alternative approach based on unsupervised natural language processing. In particular, our business partner operates a commercial youth platform for the Swiss market, where registered members get access to third-party offers such as discounts and special events like concerts or castings. Several hundred online contests per year are launched over this platform, sponsored by other firms; an increasing number of them require the members to write short free-text snippets, e.g., to elaborate on a perfect holiday at a destination of their choice in the case of a contest sponsored by a travel agency. Based on the results of a broad survey, the platform provider's marketers assume five different target groups (called milieus) to be present among the platform members: progressive postmodern youth (people primarily interested in culture and arts), young performers (people striving for a high salary with a strong affinity to luxury goods), freestyle action sportsmen, hedonists (rather poorly educated people who enjoy partying and disco music), and conservative youth (traditional people with a strong concern for security). A sixth milieu called special groups comprises all those who cannot be assigned to one of the five milieus above. For each milieu (with the exception of special groups), a keyword list was manually created describing its main characteristics. For triggering marketing campaigns, an algorithm shall be developed that automatically assigns each contest answer to the most likely target group: we propose, as the best match for a contest answer, the youth milieu for which the estimated semantic similarity between the associated keyword list and the user answer is maximal. In case the highest similarity estimate falls below the 10 percent quantile of the distribution of highest estimates, the special groups milieu is selected.
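A minimal sketch of this decision rule follows; the helper names, the `similarity` callback, and the keyword list dictionary are hypothetical:

```python
# Sketch of the milieu assignment rule with the 10% quantile fallback.
import numpy as np

def special_groups_threshold(best_scores):
    # 10 percent quantile of the highest similarity estimates,
    # computed beforehand over all contest answers.
    return np.quantile(best_scores, 0.10)

def assign_milieu(answer, milieu_keywords, similarity, threshold):
    # Pick the milieu whose keyword list is most similar to the answer.
    scores = {m: similarity(answer, kw) for m, kw in milieu_keywords.items()}
    best = max(scores, key=scores.get)
    # Fall back to "special groups" below the quantile threshold.
    return best if scores[best] >= threshold else "special groups"
```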
Since the keyword lists typically consist of nouns (capitalized in German) and the user contest answers might contain many adjectives and verbs as well, which do not match well with nouns in the Word2Vec vector representation, we actually conduct two comparisons for our Word2Vec-based measures: one with the unchanged user contest answers and one with every word capitalized beforehand. The final similarity estimate is then given as the maximum of the two individual estimates.
6.2 Translation Matching
The novel The Purloined Letter by Edgar Allan Poe was independently translated into German by two translators¹. We aim to match each sentence from the first translation to the associated sentence of the second by looking for the assignment with the highest semantic relatedness, disregarding the sentence order. To guarantee a 1:1 sentence mapping, periods were partly replaced by semicolons.
¹ This corpus can be obtained under the URL https://www.researchgate.net/publication/332072718_alignmentPurloinedLettertar
7 Evaluation
For evaluation, we selected three online contests (language: German), in which people elaborated on their favorite travel destination (contest 1, see Appendix A for an example), speculated about potential experiences with a pair of fancy sneakers (contest 2), and explained why they emotionally prefer a certain product out of four available candidates (contest 3). In order to provide a gold standard, three professional marketers from different youth marketing companies independently annotated the best-matching youth milieu for every contest answer. We determined for each annotator individually his/her average inter-annotator agreement with the others (Cohen's kappa). The minimum and maximum of these average agreement values are given in Table 2. Since for contest 2 and contest 3 some of the annotators annotated only the first 50 entries (or the last 50 entries, respectively), we specified min/max average kappa values for both parts. We further compared the youth milieus proposed by our unsupervised matching algorithm with the majority votes over the human experts' answers (see Table 3) and computed its average inter-annotator agreement with the human annotators (see again Table 2). The obtained accuracy values for the second scenario (matching translated sentences) are given in Table 4.
Fig. 1: Scatter plots of the cosine between centroids of Word2Vec embeddings (W2VC) vs. similarity estimates induced by different spectral measures: (a) W2VC / spectral radius; (b) W2VC / spectral norm.
The Word2Vec word embeddings were trained on the German Wikipedia (dump originating from 20 February 2017) merged with a Frankfurter Rundschau newspaper
Table 1: Corpus sizes measured by number of words.

Corpus                  | # Words
German Wikipedia        | 651 880 623
Frankfurter Rundschau   |  34 325 073
News journal 20 Minutes |   8 629 955
Table 2: Minimum and maximum average inter-annotator agreements (Cohen's kappa) / average inter-annotator agreement values for our automated matching method.

Method                | Contest 1 | Contest 2   | Contest 3
Min. kappa            | 0.123     | 0.295/0.030 | 0.110/0.101
Max. kappa            | 0.178     | 0.345/0.149 | 0.114/0.209
Kappa (spectral norm) | 0.128     | 0.049/0.065 | 0.060/0.064
# Entries             | 1544      | 100         | 100
Table 3: Obtained accuracy values for similarity measures induced by different matrix norms and for five baseline methods. (W)W2VC = cosine between Word2Vec embedding centroids (weighted by tf-idf for WW2VC).

Method               | Contest 1 | Contest 2 | Contest 3 | All
Random               | 0.167     | 0.167     | 0.167     | 0.167
ESA                  | 0.357     | 0.254     | 0.288     | 0.335
ESA2                 | 0.355     | 0.284     | 0.227     | 0.330
W2VC                 | 0.347     | 0.328     | 0.227     | 0.330
WW2VC                | 0.347     | 0.299     | 0.197     | 0.322
Skip-Thought Vectors | 0.162     | 0.284     | 0.273     | 0.189
Spectral Norm        | 0.370     | 0.299     | 0.288     | 0.350
Spectral Radius      | 0.353     | 0.313     | 0.182     | 0.326
Spectral Radius+W2VC | 0.357     | 0.299     | 0.212     | 0.334
corpus and 34 249 articles of the news journal 20 minutes², where the latter is targeted to the Swiss market and freely available at various Swiss train stations (see Table 1 for a comparison of corpus sizes). By employing articles from 20 minutes, we want to ensure the reliability of word vectors for certain Switzerland-specific expressions like Velo or Glace, which are underrepresented in the German Wikipedia and the Frankfurter Rundschau corpus.
² http://www.20min.ch
ESA is usually trained on Wikipedia, since the authors of the original ESA paper suggest that the articles of the training corpus should represent disjoint concepts, which is only guaranteed for encyclopedias. However, Stein and Anderka [7] challenged this hypothesis and demonstrated that promising results can be obtained by applying ESA to other types of corpora, like the popular Reuters newspaper corpus, as well. Unfortunately, the implementation we use (Wikiprep-ESA³) expects its training data to be a Wikipedia dump. Furthermore, Wikiprep-ESA only indexes words that are connected by hyperlinks, which are usually lacking in ordinary newspaper articles. So we could train ESA on Wikipedia only, but we have meanwhile developed a version of ESA that can be applied to arbitrary corpora and which was trained on the full corpus (Wikipedia + Frankfurter Rundschau + 20 minutes). In the following, we refer to this implementation as ESA2.
³ https://github.com/faraday/wikiprepesa
The STVs (Skip-Thought Vectors) were trained on the same corpus as our estimates and the Word2Vec embedding centroids (W2VC). The actual document similarity estimation is accomplished by the usual centroid approach. An issue we are faced with in the first evaluation scenario of market segmentation (see Sect. 6.1) is that STVs are not bag-of-words models but actually take the sequence of the words into account; therefore, the obtained similarity estimate between a milieu keyword list and a contest answer would depend on the keyword ordering. However, this order could have been chosen arbitrarily by the marketers and might be completely random. A possible solution is to compare the contest answers with all possible permutations of the keywords and determine the maximum value over all those comparisons. However, such an approach would be infeasible already for medium keyword list sizes. Therefore, for this scenario we apply a beam search that extends the keyword list iteratively while keeping only the n best-performing permutations.
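The beam search can be sketched as follows, with `stv_similarity` as a hypothetical scorer that encodes a keyword sequence as one STV input and compares it with the contest answer:

```python
# Sketch of the beam search over keyword orderings (beam width n).
def best_keyword_ordering(keywords, answer, stv_similarity, n=5):
    beams = [((), 0.0)]                       # (partial ordering, score)
    for _ in keywords:
        candidates = []
        for seq, _ in beams:
            for kw in keywords:
                if kw in seq:                 # each keyword used once
                    continue
                new_seq = seq + (kw,)
                candidates.append((new_seq, stv_similarity(answer, new_seq)))
        # Keep only the n best-performing partial orderings.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
    return beams[0]                           # best full ordering and score
```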
Table 4: Accuracy values obtained for matching a sentence of the first translation to the associated sentence of the second translation (based on the first 200 sentences of both translations).

Method          | Accuracy
ESA             | 0.672
STV             | 0.716
Spectral Radius | 0.721
W2VC            | 0.726
Spectral Norm   | 0.731
8 Discussion
The evaluation showed that the inter-annotator agreement values vary strongly for contest 2, part 2 (minimum average annotator agreement according to Cohen's kappa of 0.030, while the maximum is 0.149; see Table 2). On this contest part, our spectral-norm-based matching obtains a considerably higher average agreement than one of the annotators. Regarding baseline systems, the most relevant comparison is naturally the one with W2VC, since it employs the same type of data. The similarity estimate induced by the spectral norm performs quite stably over both scenarios and clearly outperforms the W2VC approach. In contrast, however, the performance of the spectral-radius-based estimate is rather mixed. While it performs well on the first contest, its performance on the third contest is quite poor and lags behind the Word2Vec centroids there. Only the average of both measures (Spectral Radius+W2VC) performs reasonably well on all three contests. One major issue of this measure is its unboundedness. The typical normalization with the geometric mean of comparing the documents with themselves results in values exceeding the desired upper limit of one in 1.8% of the cases (determined on the largest contest, contest 1). So some research is still needed to come up with a better normalization.
Finally, we created scatter plots (see Fig. 1), plotting the values of the spectral similarity estimates against W2VC. While the spectral norm is quite strongly correlated with W2VC, the spectral radius behaves much more irregularly and nonlinearly. In addition, its values several times exceed the desired upper limit of 1, which is a result of its unboundedness. Furthermore, both of the spectral similarity estimates tend to assume larger values than W2VC, which is a result of their higher robustness against noise in the data.
Note that a downside of both approaches in relation to the usual Word2Vec centroid method is the increased runtime, since they require the pairwise comparison of all words contained in the input documents. In our scenario, with rather short text snippets and keyword lists, this was not much of an issue. However, for large documents, such a comprehensive comparison could soon become infeasible. This issue can be mitigated, for example, by constructing the embedding matrices not on the basis of individual words but of entire sentences, for instance by employing the Skip-Thought Vector representation.
9 Supervised Learning
So far, our two proposed similarity measures were only applied in an unsupervised setting. However, supervised learning methods usually obtain superior accuracy. For that, we could use our two similarity estimates as kernels for a support vector machine [19] (SVM for short), potentially combined with an RBF kernel applied to an ordinary feature representation consisting of tf-idf weights of word forms or lemmas (not yet evaluated, however). One issue here is to investigate whether our proposed similarity estimates are positive semidefinite and qualify as regular kernels. In the case of non-positive-semidefiniteness, the SVM training process can get stuck in a local minimum, failing to reach the global minimum of the hinge loss.
The estimate induced by the spectral radius, and also the one induced by the spectral norm in the case of negative cosine measure values between word embedding vectors, can violate the boundedness constraint and therefore cannot constitute a positive-semidefinite kernel. To see this, let us consider the kernel matrix $K$. According to Mercer's theorem [13, 16], an SVM kernel is positive semidefinite exactly when, for any possible set of inputs, the associated kernel matrices are positive semidefinite. So we must show that there is at least one kernel matrix that is not positive semidefinite. Let us select a kernel matrix $K$ with at least one violation of boundedness. We can assume that $K$ is symmetric, since symmetry is a prerequisite for positive semidefiniteness.
Since our normalization procedure guarantees reflexivity, a text compared with itself always yields an estimated similarity of one. Therefore, the value of one can only be exceeded for off-diagonal elements. Let us assume the entry $K_{ij} = K_{ji}$ with $i < j$ of the kernel matrix equals $1 + \epsilon$ for some $\epsilon > 0$. Consider a vector $v$ with $v_i = 1$, $v_j = -1$, and all other components equal to zero. Let $w := v^\top K$ and $q := v^\top K v = wv$; then $w_i = 1 - (1 + \epsilon) = -\epsilon$ and $w_j = 1 + \epsilon - 1 = \epsilon$. With this, it follows that $q = -\epsilon - \epsilon = -2\epsilon$, and therefore $K$ cannot be positive semidefinite.
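The argument can be verified numerically with a concrete ε, for instance:

```python
# Numerical check: a symmetric 2x2 "kernel" matrix with an off-diagonal
# entry 1 + eps has the negative eigenvalue -eps, so it is not PSD.
import numpy as np

eps = 0.2
K = np.array([[1.0, 1.0 + eps],
              [1.0 + eps, 1.0]])
v = np.array([1.0, -1.0])
print(v @ K @ v)              # -2*eps = -0.4 < 0
print(np.linalg.eigvalsh(K))  # eigenvalues [-0.2, 2.2]
```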
Note that $sn_2$ can be a proper kernel in certain situations. Consider the case that all of the investigated texts are so dissimilar that the kernel matrices are diagonally dominant for all possible sets of inputs. Since diagonally dominant matrices with non-negative diagonal elements are positive semidefinite, the kernel is positive semidefinite as well. It is still an open question whether this kernel can also be positive semidefinite if not all of the kernel matrices are diagonally dominant.
10 Conclusion
We proposed two novel similarity estimates based on the spectrum of the product of embedding matrices. These estimates were evaluated on two tasks, i.e., assigning users to the best-matching marketing target groups and matching sentences of one translation of a novel with their counterparts from an independent second translation. Hereby, we obtained superior results compared to the usual centroid-of-Word2Vec-vectors (W2VC) method. Furthermore, we investigated several properties of our estimates concerning boundedness and positive semidefiniteness.
A Example Contest Answer
The following snippet is an example user answer for the travel contest (contest 1):
1. Jordanien: Ritt durch die Wüste und Petra im Morgengrauen bestaunen, bevor die Touristenbusse kommen
2. Cook Island: Schnorcheln mit Walhaien und die Seele baumeln lassen
2. Cook Island: Schnorcheln mit Walhaien und die Seele baumeln lassen
3. USA: Eine abgespaceste Woche am Burning Man Festival erleben
English translation:
1. Jordan: Ride through the desert and marvel at Petra at dawn before the arrival of the tourist buses
2. Cook Island: Snorkeling with whale sharks and relaxing
3. USA: Experience an awesome week at the Burning Man Festival
Acknowledgement
Hereby we thank the Jaywalker GmbH as well as the Jaywalker Digital AG for their support regarding this publication and especially for annotating the contest data with the best-fitting youth milieus.
References
1. Attig, A., Perner, P.: The problem of normalization and a normalized similarity measure by online data. Transactions on Case-Based Reasoning 4(1) (2011)
2. Belanche, L.A., Orozco, J.: Things to know about a (dis)similarity measure. In: Proceedings of the 15th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems. Karlsruhe, Germany (2011)
3. Brokos, G.-I., Malakasiotis, P., Androutsopoulos, I.: Using centroids of word embeddings and word mover's distance for biomedical document retrieval in question answering. In: Proceedings of the 15th Workshop on Biomedical Natural Language Processing. pp. 114–118. Berlin, Germany (2016)
4. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of semantic relatedness. Computational Linguistics 32(1) (2006)
5. Chatelin, F.: Eigenvalues of Matrices, Revised Edition. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania (1993)
6. Gabrilovich, E., Markovitch, S.: Wikipedia-based semantic interpretation for natural language processing. Journal of Artificial Intelligence Research 34 (2009)
7. Gottron, T., Anderka, M., Stein, B.: Insights into explicit semantic analysis. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. pp. 1961–1964. Glasgow, UK (2011)
8. Gupta, V.: Improving Word Embeddings Using Kernel Principal Component Analysis. Master's thesis, Bonn-Aachen International Center for Information Technology (B-IT) (2018)
9. Hong, K.J., Lee, G.H., Kim, H.J.: Enhanced document clustering using Wikipedia-based document representation. In: Proceedings of the 2015 International Conference on Applied System Innovation (ICASI) (2015)
10. Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R.S., Torralba, A., Urtasun, R., Fidler, S.: Skip-thought vectors. In: Proceedings of the Conference on Neural Information Processing Systems (NIPS). Montréal, Canada (2015)
11. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25, 259–284 (1998)
12. Lynn, M.: Segmenting and targeting your market: Strategies and limitations. Tech. rep., Cornell University (2011), online: http://scholorship.sha.cornell.edu/articles/243
13. Mercer, J.: Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A 209, 441–458 (1909)
14. Mijangos, V., Sierra, G., Montes, A.: Sentence level matrix representation for document spectral clustering. Pattern Recognition Letters 85 (2017)
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the Conference on Neural Information Processing Systems (NIPS). pp. 3111–3119. Lake Tahoe, Nevada (2013)
16. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press (2012)
17. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar (2014)
18. Song, Y., Roth, D.: Unsupervised sparse vector densification for short text similarity. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). Denver, Colorado (2015)
19. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons, Inc., New York (1998)