
On the Weaknesses of Correlation Measures used for

Search Engines’ Results

(Unsupervised Comparison of Search Engine Rankings)

Paolo D’Alberto

Yahoo! Inc.

Sunnyvale, CA, USA

pdalbert@yahoo-inc.com

Ali Dasdan

Knowledge Discovery Consulting

San Jose, CA, USA

ali_dasdan@yahoo.com

ABSTRACT

The correlation of the result lists provided by search engines is fundamental and has deep, multidisciplinary ramifications. Here, we present automatic and unsupervised methods to assess whether or not search engines provide results that are comparable or correlated. We have two main contributions. First, we provide evidence that for more than 80% of the input queries, independently of their frequency, the two major search engines share only three or fewer URLs in their search results, leading to an increasing divergence. In this scenario (divergence), we show that even the most robust measures based on comparing lists are useless to apply; that is, the small contribution of too few common items provides no confidence. Second, to overcome this problem, we propose the first content-based measures, i.e., direct comparison of the contents from search results; these measures are based on the Jaccard ratio and distribution similarity measures (CDF measures). We show that they are orthogonal to each other (i.e., Jaccard and distribution) and extend the discriminative power w.r.t. list-based measures. Our approach stems from the real need to compare search-engine results; it is automatic from the query selection to the final evaluation, and it applies to any geographical market. It is thus designed to scale and to serve as a first (necessary) filter of query selection for supervised methods.

1. INTRODUCTION

Today users have access to many search engines providing services for their web search needs, but the top three search engines attract almost all user queries, and the top search engines provide service to more than two-thirds of the search traffic (as of today, 95%). What is the reason for this situation? Attempting to answer this question and other similar questions prompted us to study the metrics for comparing search engines. Many such metrics are already available, such as relevance, coverage, and presentation (e.g., see the tutorial [8]). Independent of the metric, we would expect


that, given the same query, if two different search engines return results that are similar in both contents and order, then the users' satisfaction should be similar. In this work, we argue that the hypothesis (i.e., similar results) can be measured; the conclusion (i.e., user satisfaction) is more subjective, and we show that it requires a supervised approach.

We also show that a leading search engine is not always (and should not always be considered) the ultimate reference for users' satisfaction or quality.

Thus, how can we judge the similarity of two sets of search results? By representing URLs as sets or lists, we can take advantage of existing measures. For example, we can use the Jaccard ratio for set similarity (without confidence level), and Spearman's footrule and Kendall's tau for list similarity (with confidence level, for lists that are permutations and without weights). However, different search engines provide results that are never permutations and are, at best, sparse lists; moreover, the URLs should not be treated equally because users pay attention only to the top results (they pay little attention to the bottom results, skip the successive result pages, and just refine the query). These measures, in combination with adaptations for sparse lists, are still the state-of-the-art measures and they are the first we used.

As we show in this work, for more than 80% of the queries the overlap between two sets of search results is less than 30%. Unfortunately, this observation implies that the top search engine does not subsume the results returned by the next major search engine and that URL-based measures are insufficient for comparing different search engines with so little overlap. But why does this small overlap affect the quality of URL-based measures? Intuitively and in practice, these measures work well on the common URLs, quantifying their difference, but the non-common URLs dilute the measures, making them less and less sensitive.

We show in this work that when the overlap is low between the results of two search engines, the relative quality (users' satisfaction) between search engines varies widely. We looked at the correlation between the URL overlap (Jaccard) and the quality of the search results measured by the discounted cumulative gain (DCG) [18] (which is a supervised measure because it is an editorial, human measure). We have found that the results vary widely in quality, especially when the overlap is low: this implies that any search engine can return better or worse results depending on the query and it is difficult to estimate the outcome reliably.

1 If a leading search engine, given a query, provides a set of URLs, we do not suggest that every other engine should provide the same set.

arXiv:1107.2691v1 [stat.CO] 13 Jul 2011

But, once more, why does this small overlap affect the quality of URL-based measures? Most of the queries will provide uncorrelated values: we must instead use precious human resources to distinguish the queries that provide different results (i.e., if we could measure the queries that provide similar results, we could infer similar users' satisfaction).

We show in this work that content-based similarity measures provide more discriminating conclusions than URL-based similarity measures. A URL is nothing more than a pointer to where the information is. The contents must be interpreted and quantified, as we summarize in the following paragraph.

We propose to use the contents of the landing pages of search results for computing similarity. In particular, we represent the contents by a set of terms as well as by a distribution of terms, and we adapt the Jaccard ratio and many distribution-similarity measures from [6] (we present results for the extension of the φ measure [21] in particular, to compute the similarity of free-format documents). Ultimately, content-based measures outperform list-based measures when applied in an unsupervised fashion.

As practitioners of pairwise correlation measures for search engine comparison and similarity computation, we are aware that rank correlation of search engines is used as a common example or flagship for the application of list-based correlation measures. We want to make the community aware that there are more sophisticated measures.

The rest of the paper is organized as follows. We intro-

duce the related work in § 2 and a theory of similarity in § 3.

In § 4, we present how the theory is applied in practice to

our choice of similarity measures and their parameters. We

present the experimental methodology in § 5 and the exper-

imental results and our observations in § 6. We conclude in

§ 7.

2. RELATED WORK

In the following, we attempt to present a representative though limited set of related works in the fields of list correlation, coverage, and similarity measures (the three components of our method). We introduce previous results in the context of our work so as to present the main differences and to give useful references for a deeper investigation.

Correlation measures have a long history and are interdisciplinary by nature. We can start with the contributions by Gauss, Laplace, and Bravais; however, the first reference to (and introduction of) the term correlation is by Galton [14], who crystallized the idea that the variation of two organs is due to common causes and proposed a reversion coefficient, as also discussed by Pearson [23].

Spearman proposed the footrule in 1906 [28] with its dis-

tribution in a psychology journal, but he turned his atten-

tion to rank correlation (comparable rankings for addition

and pitch).

Concurrently, the Jaccard ratio was introduced in 1901 [16] and used for the species-to-genus ratio [17], as described in a historical note by [19]. The ratio was used as a measure of variety. No probability concept or confidence was introduced. Here, we use the ratio in a similar spirit and without a probability distribution.

Kendall in 1938 introduced a new measure of rank correlation [20], based on the count of how many swaps of adjacent elements are necessary to reduce one list to another, as in the bubble sort algorithm. Since then, different versions of correlation measures (with and without weights) have been used and presented (e.g., see [29] for a short survey). For example, a weighted Kendall's tau has been proposed by Sievers [27].

Rank correlation aims at measuring disarray/concordance, especially of short permutations. Its applications span many different fields: medicine, psychology, wherever data is incomplete, capturing trends, and rank aggregation (e.g., see the reviews in [11, 24]).

The literature on rank correlation measures and their comparison is quite large; among the recent publications we may cite [4] and [30], where the authors introduce a new measure starting from Kendall's coefficient for the information retrieval field.

Closer to our research is the comparison of search engine rankings by Bar-Ilan et al. [1]: the idea is to fix a small set of queries and monitor the search engines' rankings over time. The query set has a relatively high intersection in the result lists (common results at least between Google and Yahoo!). In contrast, we show that our query corpus is large and has wider variety.

We conclude this section by citing the work by Fagin et

al. [13, 12], where they present various distance measures

for unweighted partial lists. These papers are excellent ref-

erences for partial list similarity measures, their various gen-

eralizations, their equivalence, and some results on the com-

parison of search engines. In a different work [9], our proof

of the equivalence for the weighted generalizations has the

same spirit as the results in these papers.

The coverage and overlap of search engines is a new problem, where one of the first attempts to measure such a difference was proposed in 1998 [2]. The same paper needed a few tools for the similarity of documents, such as shingles, that we still use today. As for similarity measures of documents, the literature is as large and as old as that for the correlation measures, and it is multifaceted: an arbitrary classification is by signature comparison and by contents. By signature, two documents are compared by summaries or signatures only (e.g., see [5, 3]). We use the Jaccard ratio of the signatures because, first, it is common in the field in which the authors work (e.g., see [7] for another use) and, second, it is more a literal comparison than a semantic comparison. We actually use a signature of up to 1000 items (shingles), thus performing more a contents comparison than a probabilistic comparison and reducing false positives to zero. By contents, we could use any bag-of-words measure (e.g., word-count histograms) and thus use stochastic measures; for example, one of the first measures was proposed by Kolmogorov in 1933, but for a recent survey see [6].

For each of these metrics, and especially for the relevance metrics, the rank of a search result plays an important role. The reason is that users expect to find the answer among the top search results, and the probability of a click (i.e., the user takes a look at the page) drops quite drastically as the rank increases. In parallel with our work (i.e., they cited this work), Kumar and Vassilvitskii [22] present measures that take into account the relevance of a document in conjunction with its rank. Of course, relevance is (currently) a supervised feature.


3. A THEORY OF SIMILARITY

In this section, we provide a mathematical overview of comparing sets, lists, and distributions. Because the subject has an almost century-old history, our discussion is necessarily focused on the measures that we use in this study. In the case of list similarity, our contribution is a weighted generalization of Spearman's footrule and Kendall's tau together with a proof of their equivalence for permutations and partial lists, which we present separately [9]. For lists with little overlap, we introduce novel metrics.

3.1 Set Similarity

Given two sets $U_\sigma$ and $U_\pi$, their union and intersection are defined as

$$U_\sigma \cup U_\pi = \{x \mid x \in U_\sigma \text{ or } x \in U_\pi\} \qquad (1)$$

and

$$U_\sigma \cap U_\pi = \{x \mid x \in U_\sigma \text{ and } x \in U_\pi\}, \qquad (2)$$

where elements are included without repetition.

There are many measures in the literature to compute the similarity between these two sets. Among them, the Jaccard ratio is commonly used. The Jaccard ratio is defined as

$$J(U_\sigma, U_\pi) = \frac{|U_\sigma \cap U_\pi|}{|U_\sigma \cup U_\pi|}, \qquad (3)$$

which maps to [0,1], i.e., 1 if the sets are identical and 0 if the sets have no common elements.

Example. Given $U_\sigma = \{a,b,d\}$ and $U_\pi = \{b,e,f\}$, we have $U_\sigma \cup U_\pi = \{a,b,d,e,f\}$ and $U_\sigma \cap U_\pi = \{b\}$, and thus $J(U_\sigma, U_\pi) = \frac{1}{5} = 0.2$.
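As an illustration only, the Jaccard ratio of Equation (3) can be sketched in a few lines of Python. The function name `jaccard` and the convention of returning 1.0 when both sets are empty are assumptions of this sketch, not part of the definition above.

```python
def jaccard(u_sigma, u_pi):
    """Jaccard ratio |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(u_sigma), set(u_pi)
    union = a | b
    if not union:
        # Both sets are empty: the ratio is undefined, so we return 1.0
        # (two identical sets) as a convention of this sketch.
        return 1.0
    return len(a & b) / len(union)


# The worked example from the text: one common element out of five.
print(jaccard({"a", "b", "d"}, {"b", "e", "f"}))  # 0.2
```

Because the inputs are converted to sets, repeated URLs in a result list are counted once, which matches the "without repetition" convention of Equations (1) and (2).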

3.2 List Similarity

As in the measures for comparing sets, there are many

measures in the literature to compute the similarity between

two lists. Among them, Spearman’s footrule and Kendall’s

tau are commonly used. In this paper, we generalize these

measures to include weights and also to work for partial lists

as well as permutations. By also proving the equivalence

of these two measures, we justify our choice of Spearman’s

footrule for our list comparison measure.

3.2.1 Rank Assignment

Given two lists σ and π, define $\sigma^c = \sigma - \sigma \cap \pi$ and $\pi^c = \pi - \sigma \cap \pi$, and keep the relative order of the remaining elements in $\sigma^c$ and $\pi^c$ the same as in the original lists σ and π, respectively. Note that $\sigma^c$ and $\pi^c$ carry information only when σ and π are partial lists, because they are empty otherwise (i.e., when σ and π are permutations).

If σ and π are permutations of length n, the rank of an element i is well defined and equal to σ(i) and π(i). If these lists are partial lists, the rank of an element is determined as follows: if an element i is in σ but missing from π, then let $\pi(i) = n + \sigma^c(i) - 1$; that is, we append the missing items at the end of the list so as to minimize their displacement. Similarly, if an element i is in π but missing from σ, then let $\sigma(i) = n + \pi^c(i) - 1$. Now the rank functions σ() and π() induce two lists that are permutations of each other. Note that if the lists are of different lengths, we can always restate the definition so that if an element i is in π but missing from σ, then $\sigma(i) = |\sigma| + \pi^c(i) - 1$. In either case, the resulting lists are permutations of each other and thus have the same length.

Of course, this rank extension is arbitrary and relative to the pair of lists. In fact, we extend the rank of an element that does not exist in a list (unknown rank) using its rank in the other list (partially known rank). This provides an optimistic ordering that should bias the permutation-based correlation metrics towards positive correlation. This way of inferring unknown ranks is common when comparing top-k lists [13]. Notice also that we have increased the list size; depending on the size of this increase, we may have made the most common correlation measures less sensitive.

Example. Given σ = (a,b,d) and π = (b,e,f), we have σ* = (a,b,d,e,f) and π* = (b,e,f,a,d); that is, the extended lists. Now, without loss of generality, we can replace the letters with numbers, i.e., ranks. We take σ* as the reference or original permutation: σ* = (a,b,d,e,f) ∼ (1,2,3,4,5), and thus we can rewrite π* = (b,e,f,a,d) as (2,4,5,1,3). All measures introduced in this paper are symmetric; thus the result is independent of whether we take σ* or π* as the starting permutation.
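The extension used in this example can be sketched as follows. This is a minimal illustration that follows the worked example (each list receives the items it is missing, appended at its end in the relative order they have in the other list); the function names `extend_to_permutations` and `as_ranks` are hypothetical and are not taken from the paper.

```python
def extend_to_permutations(sigma, pi):
    """Append to each list the items it is missing, keeping the
    relative order those items have in the other list."""
    in_sigma, in_pi = set(sigma), set(pi)
    sigma_c = [x for x in sigma if x not in in_pi]   # items only in sigma
    pi_c = [x for x in pi if x not in in_sigma]      # items only in pi
    return list(sigma) + pi_c, list(pi) + sigma_c


def as_ranks(reference, other):
    """Relabel items by their 1-based position in `reference`,
    then rewrite `other` with those labels."""
    label = {item: k for k, item in enumerate(reference, start=1)}
    return [label[x] for x in other]


sigma_ext, pi_ext = extend_to_permutations(["a", "b", "d"], ["b", "e", "f"])
print(sigma_ext)                    # ['a', 'b', 'd', 'e', 'f']
print(pi_ext)                       # ['b', 'e', 'f', 'a', 'd']
print(as_ranks(sigma_ext, pi_ext))  # [2, 4, 5, 1, 3]
```

Both extended lists contain the same five items, so they are permutations of each other regardless of how small the original overlap was.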

3.2.2 Weighted Spearman's Footrule

The weighted Spearman's footrule [28, 10] for partial lists of length n is defined as

$$S_w(\sigma,\pi) = \sum_{i \in \sigma \cup \pi} w(i)\,|\sigma(i) - \pi(i)|, \qquad (4)$$

where w(i) returns a positive number as the weight of the element i and the ranks are defined as in § 3.2.1.

The measure $S_w$ can be normalized to the interval [−1,1] as

$$s_w(\sigma,\pi) = 1 - \frac{2 S_w(\sigma,\pi)}{\sum_{i \in \sigma \cup \pi} w(i)\,|i - (n - i + 1)|}, \qquad (5)$$

where the denominator corresponds to the case in which both lists are sorted but in opposite orders.

Both of these equations are also valid if the input lists are permutations.

Example. Given σ = (a,b,d) and π = (b,e,f), we have σ* = (a,b,d,e,f) ∼ (1,2,3,4,5) and π* = (b,e,f,a,d) ∼ (2,4,5,1,3) (i.e., we transformed the lists into permutations as described in the previous example). Then

$$S_w = w(1)|1-4| + w(2)|2-1| + w(3)|3-5| + w(4)|4-2| + w(5)|5-3| = 10w \qquad (6)$$

if we consider w(i) to be a constant w, and the normalized value is

$$s_w = 1 - \frac{2 \cdot 10w}{12w} \approx -0.67.$$

As we can see, the denominator grows as $n^2$.
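The computation in this example can be reproduced with a short Python sketch of Equations (4) and (5). It assumes that σ is the identity and that π is given as the list of labels in π order, as in the example; the function name `weighted_footrule` and the default unit weight are assumptions made for illustration.

```python
def weighted_footrule(pi_labels, w=lambda i: 1.0):
    """Return (S_w, s_w) for sigma = identity and pi given as a list of
    labels 1..n in pi order, e.g. [2, 4, 5, 1, 3]."""
    n = len(pi_labels)
    # rank_in_pi[i] is pi(i): the 1-based position of label i in pi.
    rank_in_pi = {label: pos for pos, label in enumerate(pi_labels, start=1)}
    raw = sum(w(i) * abs(i - rank_in_pi[i]) for i in range(1, n + 1))
    # Denominator of Eq. (5): identity against its reversal.
    worst = sum(w(i) * abs(i - (n - i + 1)) for i in range(1, n + 1))
    normalized = 1.0 - 2.0 * raw / worst if worst else 1.0
    return raw, normalized


raw, normalized = weighted_footrule([2, 4, 5, 1, 3])
print(raw, round(normalized, 2))  # 10.0 -0.67
```

A different weighting function can be passed through the `w` argument; with a constant weight the normalized value does not depend on the constant, as in the example.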

3.2.3 Weighted Kendall's Tau

In context, the unweighted Kendall's tau is the number of swaps we would perform during a bubble sort to reduce one permutation to the other. Given the ranks of the extended lists (§ 3.2.1), we can always assume that the first list σ is the identity (increasing from 1 to n), and what we need to compute is the number of swaps needed to sort the permutation π back to the identity permutation (increasing). Here, a weight is associated with each swap.

The weighted Kendall's tau [20, 27] for partial lists of length n is defined as

$$K_w(\sigma = \iota, \pi) = \sum_{1 \le i < j \le n} \frac{w(i) + w(j)}{2}\,[\pi(i) > \pi(j)], \qquad (7)$$

where [x] is equal to 1 if the condition x is true and 0 otherwise; also, we identify the permutation 1,2,...,n simply as ι. In practice, if we sort the permutation π in increasing order using a bubble sort algorithm, then $K_w(\sigma = \iota, \pi)$ is the total cost of the swaps.

The measure $K_w$ can be normalized to the interval [−1,1] as

$$k_w = 1 - \frac{2 K_w(\sigma,\pi)}{\sum_{\{i,j \in \sigma \cup \pi \,:\, i < j\}} \frac{w(i) + w(j)}{2}}, \qquad (8)$$

where the value of the denominator is exactly the maximum value that $K_w$ can reach, attained when both lists are sorted but in opposite orders.

Note that both of these equations are computed over all i and j in σ ∪ π such that i < j. They are also valid if the input lists are permutations.

It is important to note that the weighted version of Kendall's tau can be defined in different ways (e.g., in [25, 26] the weights are multiplied, as w(i) ∗ w(j), rather than added). The reason for our definition is to preserve the equivalence between these two measures, as we prove in a different work [9].

Example. Given σ = (a,b,d) and π = (b,e,f), we have σ* = (a,b,d,e,f) ∼ (1,2,3,4,5) and π* = (b,e,f,a,d) ∼ (2,4,5,1,3). Then

$$K_w = 5w$$

if we consider w(i) to be a constant w, and the normalized value is

$$k_w = 1 - \frac{2 \cdot 5}{5 \cdot 4 / 2} = 0.$$

Notice that $K_w \le S_w \le 2K_w$ because 5w ≤ 10w ≤ 10w. We show this is true in general [9].
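As a companion illustration, Equations (7) and (8) can be sketched in Python under the same assumptions as the example (σ is the identity and π is the list of labels in π order). The name `weighted_kendall` and the quadratic pair enumeration are choices of this sketch, not the authors' implementation.

```python
from itertools import combinations


def weighted_kendall(pi_labels, w=lambda i: 1.0):
    """Return (K_w, k_w) for sigma = identity and pi given as a list of
    labels 1..n in pi order, e.g. [2, 4, 5, 1, 3]."""
    n = len(pi_labels)
    # rank_in_pi[i] is pi(i): the 1-based position of label i in pi.
    rank_in_pi = {label: pos for pos, label in enumerate(pi_labels, start=1)}
    raw = 0.0
    worst = 0.0
    for i, j in combinations(range(1, n + 1), 2):  # all pairs with i < j
        pair_weight = (w(i) + w(j)) / 2.0
        worst += pair_weight                 # denominator of Eq. (8)
        if rank_in_pi[i] > rank_in_pi[j]:    # the pair is out of order
            raw += pair_weight
    normalized = 1.0 - 2.0 * raw / worst if worst else 1.0
    return raw, normalized


pi = [2, 4, 5, 1, 3]
raw_k, norm_k = weighted_kendall(pi)
print(raw_k, norm_k)  # 5.0 0.0

# Unweighted footrule of the same permutation, to check K <= S <= 2K.
footrule = sum(abs(label - pos) for pos, label in enumerate(pi, start=1))
print(raw_k <= footrule <= 2 * raw_k)  # True
```

The pair enumeration costs O(n²) operations, which is negligible for result lists of ten or twenty URLs.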

3.2.4 The weighting function w()

In this section, we show preliminary evidence that the choice of w() requires a supervised approach and is thus beyond the scope of this paper. Hence, in this paper, we choose the weighting function w() = 1.

We address two questions in this section. First, will a weighted measure be useful for comparing sparse lists (search engine results)? Second, what should the weighting function be? Weighted measures are useful because they provide a way to measure the importance of the common items in the result lists and thus to compensate for the missing information about the lists we compare. For example, if we have a query and two search engine results (10 URLs in each list), and we find that there are only four common results, we can estimate a measure of disarray/concordance if we assign a heavier weight to the higher-ranked URLs (at the top of the list).

Here we choose two weighting functions that we identify as dcgw and iota. We have iota(i) = 1 for every i; that is, all list URLs are equally important. Instead, we have

$$\mathrm{dcgw}(i) = \frac{\log_{10}(1+i)}{2^{i}},$$

which is inspired by the discounted cumulative gain (DCG) measure; that is, we can imagine that the tenth URL in the result list is about $2^9$ times less important than, relatively speaking, the first one.

We created this test: we take two lists a = (1,2,..,10) and b = (11,12,..,20), and we consider these lists as composed of the symbols '1' through '20', and thus with nothing in common. We then create lists b* with an increasing intersection with a. First, lists with one common item: b* = (1,12,..,20), b* = (11,2,..,20), up to b* = (11,12,..,10). Then, lists with two consecutive common items: b* = (1,2,..,20), b* = (11,2,3,..,20), up to b* = (11,12,..,9,10). Eventually, the list with 10 common consecutive items, b* = a.
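The construction of these test lists can be sketched as follows. This is our reading of the description: a window of k consecutive positions of b is overwritten with the corresponding items of a, and the window slides from the top of the list to the bottom. The function name `overlap_lists` is hypothetical.

```python
def overlap_lists(k, n=10):
    """Return a = (1..n) and every variant of b = (n+1..2n) in which k
    consecutive positions are replaced by the items of a at the same
    positions."""
    a = list(range(1, n + 1))
    b = list(range(n + 1, 2 * n + 1))
    variants = []
    for start in range(n - k + 1):
        b_star = b.copy()
        b_star[start:start + k] = a[start:start + k]
        variants.append(b_star)
    return a, variants


a, one_common = overlap_lists(1)
print(one_common[0])   # [1, 12, 13, 14, 15, 16, 17, 18, 19, 20]
print(one_common[-1])  # [11, 12, 13, 14, 15, 16, 17, 18, 19, 10]
print(len(overlap_lists(10)[1]))  # 1
```

With k common items there are n − k + 1 variants, so ten common items yield a single list, which is consistent with the single point mentioned for the thickest line.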

In Figure 1, we present the concordance measure results using Spearman's footrule and Kendall's tau. In red, we show the list concordance measures using weights and in green without weights. The width of the lines represents the number of common items in the lists: the thinnest lines represent lists with only one item in common, the thickest 10 (one point). Consider the thinnest lines: Spearman's footrule has the largest difference between the weighted and unweighted versions; when the two lists have only the first item in common (the minimum rank is 1), the weighted measure is about 0.62 and the unweighted is 0.18. Both measures decrease when we choose as common element the second item and on towards the ninth element. The weighted measures are more sensitive for sparse lists and for high correlation in the high ranks.

In Figure 2, we show the same analysis but, instead of creating similar lists, we create anti-correlated lists. The weighted measures are less sensitive in capturing anti-correlation. Even though this is an example where the same weighting function achieves contrasting results, it shows that the choice of the function must rely on the context in which the function is applied. In this work, we are actually interested in finding a measure that can capture both properties.

At this time, we believe that the solution must rely on a supervised method where a third party (a crowd-based similarity measure) or a feedback system can be deployed to tune the weighting function knowing the context. In a different work, we prove that the weighted Spearman's footrule and Kendall's tau for partial lists (as described here) respect the Diaconis-Graham inequality; thus they are equivalent in discriminative power and we can choose either one (we choose the footrule because of its computational simplicity) [9].

3.3 Distribution Similarity

As in the measures for comparing sets and lists, there are many measures in the literature to compute the similarity between distributions, i.e., stochastic distances. For example, a document can be represented as a word-count histogram (which can be normalized naturally to a distribution), and this idea can be easily extended to a set of documents. Among distribution measures, we present and use here the φ measure from [21], which is identified in [6] as one of the best performing measures. The φ measure extends the well-known Kolmogorov-Smirnov measure and is defined as

$$\phi(F_\sigma, F_\pi) = \max_i \frac{|F_\sigma(i) - F_\pi(i)|}{\sqrt{\min\left(\frac{F_\sigma(i) + F_\pi(i)}{2},\; 1 - \frac{F_\sigma(i) + F_\pi(i)}{2}\right)}} \qquad (9)$$

where $F_\sigma$ and $F_\pi$ are the cumulative distribution functions and $F_\sigma(i)$ and $F_\pi(i)$ are the values for the element i from these functions. This measure is symmetric and its value ranges in [0,2], where the result is zero when the two input distributions are identical. In practice, we can use stochastic distances to compare the contents of search engine results.
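To make the use of a stochastic distance on documents concrete, the following Python sketch computes a φ-style distance between two short texts. Several choices here are assumptions of the sketch and are not specified in the text: terms are whitespace-separated lowercase tokens, the CDFs are built over the alphabetically sorted joint vocabulary, the denominator is taken in its square-root form, and points where the denominator vanishes (where the numerator vanishes as well) are skipped. The names `cdf` and `phi_distance` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cdf(counts, vocabulary):
    """Cumulative distribution of a word-count histogram over a fixed,
    ordered vocabulary."""
    total = sum(counts.values())
    running, values = 0.0, []
    for term in vocabulary:
        running += counts.get(term, 0) / total
        values.append(running)
    return values


def phi_distance(doc_sigma, doc_pi):
    """phi-style distance between the term distributions of two texts."""
    counts_sigma = Counter(doc_sigma.lower().split())
    counts_pi = Counter(doc_pi.lower().split())
    if not counts_sigma or not counts_pi:
        raise ValueError("both documents must contain at least one term")
    # Assumption: terms are ordered alphabetically to define the CDFs.
    vocabulary = sorted(set(counts_sigma) | set(counts_pi))
    f_sigma = cdf(counts_sigma, vocabulary)
    f_pi = cdf(counts_pi, vocabulary)
    best = 0.0
    for fs, fp in zip(f_sigma, f_pi):
        mean = (fs + fp) / 2.0
        scale = min(mean, 1.0 - mean)
        if scale > 1e-12:  # skip points where the ratio is 0/0
            best = max(best, abs(fs - fp) / sqrt(scale))
    return best


print(phi_distance("web search engine", "web search engine"))  # 0.0
print(phi_distance("a a b", "b b c") > 0.0)                    # True
```

The value depends on the ordering chosen for the vocabulary, so in practice the same ordering should be used for every pair of documents being compared.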


Figure 1: In red, we show the list concordance measures using weights and in green without weights. The width of the lines represents the number of common items in the lists: the thinnest lines represent lists with only one item in common, the thickest 10 (one point). Consider the thinnest lines: Spearman's footrule has the largest difference between the weighted and unweighted versions; when the two lists have only the first item in common (the minimum rank is 1), the weighted measure is about 0.62 and the unweighted is 0.18. Both measures decrease when we choose as common element the second item and on towards the ninth element. The weighted measures are more sensitive for sparse lists and for high correlation in the high ranks.

Figure 2: In contrast with Figure 1, the weighted measure is less sensitive in capturing anti-correlation, especially for sparse lists (fewer than 4 common URLs) and for anti-correlation in the high ranks. Notice that the unweighted Spearman's footrule finds our lists anti-correlated (thinnest lines and values close to -1), whereas Kendall's tau suggests no correlation (values close to 0).