SoftRank: Optimizing Non-Smooth Rank Metrics
Michael Taylor, John Guiver, Stephen Robertson and Tom Minka
Microsoft Research Cambridge
{mitaylor,joguiver,ser,minka}@microsoft.com
ABSTRACT
We address the problem of learning large complex rank-
ing functions. Most IR applications use evaluation metrics
that depend only upon the ranks of documents. However,
most ranking functions generate document scores, which are
sorted to produce a ranking. Hence IR metrics are innately
non-smooth with respect to the scores, due to the sort. Un-
fortunately, many machine learning algorithms require the
gradient of a training objective in order to perform the op-
timization of the model parameters, and because IR met-
rics are non-smooth, we need to find a smooth proxy ob-
jective that can be used for training. We present a new
family of training objectives that are derived from the rank
distributions of documents, induced by smoothed scores.
We call this approach SoftRank. We focus on a smoothed
approximation to Normalized Discounted Cumulative Gain
(NDCG), called SoftNDCG and we compare it with three
other training objectives in the recent literature. We present
two main results. First, SoftRank yields a very good way
of optimizing NDCG. Second, we show that it is possible to
achieve state of the art test set NDCG results by optimizing
a soft NDCG objective on the training set with a different
discount function.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]
General Terms
Algorithms, Experimentation
Keywords
learning, ranking, metrics, optimization, gradient descent
1. INTRODUCTION
There is a clear trend among both IR researchers and
practitioners towards using ever more complex ranking func-
tions. Until quite recently it has been common to use models
with only a handful of free parameters. For example, BM25
in its most widely adopted form has only two. Such simple
models have advantages: they are easy to tune for a given
corpus, requiring few training queries and little computation
to find reasonable parameter settings. Often, they work well
“out-of-the-box” on new corpora with parameters reported
in the literature. In short, they are robust and generalize
well.
However, increasingly we see richer models appearing that
set out to harness an ever expanding set of more powerful
features. For example, there is much activity surrounding
term proximity where work is beginning to show benefits in
going beyond the bag-of-words models [13]. In the area of
document structure there has been progress too. Improve-
ments have been reported exploiting both a simple field-
based flat structure, where for example, term occurrences
are handled differently in titles than body text [15], and also
more complex hierarchical structures in the area of XML re-
trieval. Much work has also been published on combining
content with external cues such as link-graph features like
PageRank and HITS, and more recently, usage features [1].
As these features all appear to aid retrieval effectiveness,
it follows that competitive IR systems need to be able to ex-
ploit them in an efficient and reliable way. However, as the
number of features increases, so does the number of parame-
ters necessary in the ranking function. This paper addresses
the problem of learning the parameters for such complex
ranking functions.
The fundamental issue when formulating such a machine
learning problem is the choice of objective function to be op-
timized. In IR there are very many existing metrics (NDCG,
MAP, RPrec etc.) that all share the property that they
are rank-dependent, placing more emphasis on performance
at the top of a list of documents, thus reflecting the end-
user experience. While these metrics are ideal for evaluating
trained systems, their use as objective functions for training
is problematic.
1.1 IR Metrics are Not Smooth
Given a representative set of queries and relevance judg-
ments, we seek to learn a ranking function (or model) that
takes a set of document-query match feature values and gen-
erates a score. At test time, this score is used to sort doc-
uments to produce a ranked list, which is to be evaluated
using an IR metric.
Following [4, 3] in this paper we choose to model the map-
ping from features to score using a 2-layer neural net model.
Neural nets represent a tried-and-tested machine learning
technique that scales well with large amounts of training
data. The optimization process used is gradient based, and
so this learning approach depends upon the availability of a
gradient of the training objective.
Typical IR metrics only depend on the ranks (and not the
scores). If we make small changes to the model parameters,
the scores will typically change smoothly, but the ranks of
documents will not change until one document’s score passes
another, at which point the IR metric will make a discontin-
uous change (see Figure 1). In other words, the IR metrics
are non-smooth with respect to model parameters: they are
everywhere either flat (with zero gradient) or discontinuous.
Section 2 describes our chosen test metric NDCG, and
then gives a brief description of previous work in construct-
ing useful training utilities that have smooth gradients and
also attempt to approximate the desired non-smooth rank-
dependent IR objective. Section 3 presents a new training
objective called SoftNDCG that is a smoothed approxima-
tion to NDCG.
2. RANKING UTILITIES
This section first describes the test metric that we adopt in
this paper, called Normalized Discounted Cumulative Gain
(NDCG). The approach we adopt in this paper is to seek
an objective function that makes a good proxy for a rank-
based metric (e.g. NDCG), but that is also differentiable
with respect to the parameters of the ranking function. Such
a function will be more easily optimized in the neural net
framework. We next describe several training utilities that
have been previously proposed, concentrating on RankNet
[4] and LambdaRank [3] as they are used as baselines in this
paper.
For a given training query, we will assume we have N
documents, each with a known human-defined rating. We
denote an individual document indexed by $j$ as doc$_j$. Let us
assume that we have a ranking function that takes in document features $x_j$ and produces a score $s_j$. In this paper
we consider the class of non-linear ranking functions (models) represented by the 2-layer neural net $f$ with weights
(parameters) $w$. We denote the score:

$$s_j = f(w, x_j). \quad (1)$$
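As a concrete illustration, here is a minimal sketch of such a scoring function in Python with NumPy. The weight-dict layout (W1, b1, w2, b2) and the tanh activation are our own illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def score(w, x):
    """Score s_j = f(w, x_j) from a 2-layer net, as in (1).

    The dict layout (W1, b1, w2, b2) and tanh activation are
    illustrative assumptions, not details from the paper.
    """
    h = np.tanh(w["W1"] @ x + w["b1"])    # hidden layer
    return float(w["w2"] @ h + w["b2"])   # scalar output score
```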
2.1 NDCG
We adopt the IR evaluation metric called NDCG [10] be-
cause it is a reasonable way of dealing with multiple rele-
vance levels in our datasets. It is often truncated at a rank
position $R$ (indexed from 0) and is defined as:

$$G_R = G_{R,\max}^{-1} \sum_{r=0}^{R-1} g(r)\, D(r) \quad (2)$$

where the gain $g(r)$ of the document at rank $r$ is usually
an exponential function $g(r) = 2^{l(r)}$ of the label $l(r)$ (or
rating) of the document at rank $r$. The labels typically take
values from 0 (bad) to 4 (perfect). A popular choice for the
rank discount is $D(r) = 1/\log(2 + r)$, and $G_{R,\max}$ is the maximum value of $\sum_{r=0}^{R-1} g(r) D(r)$, obtained when the documents
are optimally ordered by decreasing label value. Where no
subscript is defined, it should be assumed that $R = N$.
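For reference, a small sketch of this computation (our own illustrative Python, using the gain and discount choices just given):

```python
import numpy as np

def ndcg(labels_by_rank, R=None):
    """NDCG truncated at R as in (2), with g(r) = 2^l(r), D(r) = 1/log(2+r).

    labels_by_rank: relevance labels in model-ranked order, rank 0 first.
    """
    l = np.asarray(labels_by_rank, dtype=float)
    R = len(l) if R is None else min(R, len(l))
    D = 1.0 / np.log(2.0 + np.arange(R))     # rank discount D(r)
    dcg = np.sum(2.0 ** l[:R] * D)
    ideal = np.sort(l)[::-1]                 # optimal ordering by label
    g_max = np.sum(2.0 ** ideal[:R] * D)     # G_{R,max}
    return dcg / g_max if g_max > 0 else 0.0
```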
2.2 Pointwise
The simplest class of smooth training objective is what
we call a pointwise objective because it can be computed
for a single document. If we are given a target label for each
document, then the most straightforward pointwise objec-
tive is the mean squared error (MSE) between the target
and predicted label. We can write the objective for a docu-
ment/label pair (in this case it is actually a cost) as:
$$U_{\mathrm{mse}}(s_j) = (s_j - l_j)^2 \quad (3)$$
and the total cost is obtained by computing the mean over
all training documents. This is the first baseline objective
that we will use in our experiments. Another option would
be to treat the NDCG gains as the targets for the regres-
sion, instead of the labels. This made little difference in our
experiments and we do not consider it further.
There are of course more sophisticated approaches to point-
wise metrics. Rankprop [6] alternates between a MSE ap-
proach described above, and a phase that iteratively ad-
justs the targets themselves. An alternative approach is the
framework of ordinal regression, which is closer to classifi-
cation in spirit. An example of this approach is PRank [8];
for a Bayesian treatment see [7].
2.3 Pairwise
Because ranking metrics only require the recovery of rela-
tive relevance levels, the motivation of this approach is that
pairwise preferences in the labels may well be more eas-
ily modelled than any available absolute value of relevance.
Herbrich et al. [9] were the first to use pairwise prefer-
ence labels to learn ranking functions. They took an ordinal
regression approach using an SVM model. This model is
well known in the IR literature as the RankingSVM [11],
where it was used in a scenario where only preference la-
bels were available in the training data, derived from click-
through logs.
RankNet [4] is a probabilistic model for pairwise preferences. The algorithm assumes that it is provided with a
training set consisting of pairs of documents doc$_1$, doc$_2$ together with a target probability $\bar{P}_{12}$ that doc$_1$ is to be ranked
higher than doc$_2$. The authors define a ranking function $f$
as we have specified in (1).
The map from the outputs $s_j$ to probabilities is modeled
using a logistic function $P_{12} \equiv e^{-Y}/(1 + e^{-Y})$ where $Y \equiv s_2 - s_1$, and $P_{12}$ is the probability that doc$_1$ is ranked higher
than doc$_2$. They then invoke the cross-entropy error function
to penalize pair ordering errors:

$$U_{\mathrm{rn}}(Y) = -\bar{P}_{12} \log P_{12} - (1 - \bar{P}_{12}) \log(1 - P_{12}). \quad (4)$$
This is a very general cost function that affords the use
of any available uncertainty we may have concerning the
pairwise ratings. Following [4], in our implementation of
RankNet, we take the pair ordering data as certain (ignoring ties), and so for us the $\bar{P}_{12}$ are always one. With this
simplification, the RankNet cost for a pair becomes:

$$U_{\mathrm{rn}} = \log\left(1 + e^{s_2 - s_1}\right) \quad (5)$$

where the score difference is positive if the documents are
in the wrong order. The RankNet cost for a pair in the
wrong order tends to a linear function of the score difference.
When the scores are correctly ordered, $U_{\mathrm{rn}}$ asymptotically
approaches zero as the score difference increases. Thus the
gradient of $U_{\mathrm{rn}}$ not only encourages the pair to be in
the right order, but encourages them to have at least some
separation in their scores. If $f$ is differentiable with respect
to the parameters, then so too is $U_{\mathrm{rn}}$.
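A hedged sketch of this cost and its score gradient (illustrative Python; the function names are ours):

```python
import numpy as np

def ranknet_cost(s1, s2):
    """RankNet pair cost (5), where doc1 should rank above doc2.

    With target probability P12 = 1, the cross-entropy (4) reduces to
    log(1 + exp(s2 - s1)); logaddexp is a numerically stable form.
    """
    return np.logaddexp(0.0, s2 - s1)

def ranknet_grad_s1(s1, s2):
    """Gradient of (5) w.r.t. s1; the gradient w.r.t. s2 is its negative."""
    return -1.0 / (1.0 + np.exp(s1 - s2))
```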
2.4 Rank Dependent
The smooth training utilities described so far do not take
into account the rank of a document in the set of documents
scored for a training query. Hence, Burges et al. [3] argue
that there is a possibility that a model might be prone to
waste capacity in improving the order of documents at low
(poor) ranks at the expense of documents at the top of the
ranking. Their approach, called LambdaRank, argues for a
training objective that is closer to NDCG, in that it should
care more about the top of the ranking than the bottom.
Consequently, it needs to incorporate rank information into
its training objective.
However, rank information is only directly available via
a sort, and the sort operation is inherently undifferentiable.
Thus, any sort operation makes the objective non-smooth.
They get round this problem by defining a virtual gradient
on each document after the sort. In other words, LambdaRank defines a gradient of an implicit objective function
that is itself never actually defined. These gradients are only
defined once the current model has produced a ranked list
for the query. For example, consider a query with just two
relevant documents doc$_1$, doc$_2$, and at some stage in training the model places doc$_1$ near the top of the ranked list
with score $s_1$ and doc$_2$ near the bottom with score $s_2$. The
intuition concerning limited capacity can be encoded as:

$$\frac{\partial U}{\partial s_1} \gg \frac{\partial U}{\partial s_2}, \quad (6)$$
or in words, we would like the rate of change of the objective
with respect to the high-ranking relevant document's score
to be very much greater than that for the low-ranking relevant document. Note that $U$ is the implicit objective that
is not defined. Only the gradients of $U$ are defined, given
a sorted list of documents at a particular point in training.
In their paper, they report that they tried many different
gradient functions satisfying the capacity constraint of (6),
and the one that worked best on web retrieval experiments
was as follows. Given a pair of documents for a training
query, they define the gradient (lambda function) to be the
gradient of the RankNet cost (5) scaled by the difference
in NDCG found by swapping the two documents in ques-
tion. Using this pairwise “force”, the total gradient for a
document jcan be obtained by summing all such pairwise
interactions, giving:
$$\lambda_j \equiv \frac{\partial U_{\mathrm{lr}}}{\partial s_j} = G_{\max}^{-1} \sum_i \frac{1}{1 + e^{s_i - s_j}}\, (g_i - g_j) \left| D_i - D_j \right| \quad (7)$$

where $g_i$ is the gain of the label of doc$_i$ and $D_i$ is the discount
at the rank of doc$_i$, $D(r_i)$. The authors report that a 2-layer
neural net model trained using this LambdaRank objective
outperforms the same model using the RankNet cost.
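For illustration, a sketch of how such lambdas might be computed (our own Python; we assume the $|D_i - D_j|$ reading of the swap term in (7), and keep the $O(N^2)$ double loop for clarity over speed):

```python
import numpy as np

def lambdarank_lambdas(s, gains, discounts, g_max):
    """Sketch of the lambda gradients of (7).

    s: current scores; gains: g_i per document; discounts: D(r_i) at each
    document's current rank; g_max: ideal DCG normalizer. Assumes the
    |D_i - D_j| reading of the NDCG swap term.
    """
    N = len(s)
    lam = np.zeros(N)
    for j in range(N):
        for i in range(N):
            if i == j:
                continue
            pair = 1.0 / (1.0 + np.exp(s[i] - s[j]))   # RankNet-cost factor
            lam[j] += pair * (gains[i] - gains[j]) * abs(discounts[i] - discounts[j])
    return lam / g_max
```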
A more recent approach [5] sets out to define a probability
distribution over rankings: in other words, the event space
consists of document permutations. To contrast that work
with this paper, here we only set out to elicit distributions
for the ranks of individual documents: an event in our space
is that of an individual document having a particular rank.
The probability of such an event is obtainable from a full
ranking distribution by summing up the probabilities of the
rankings for which this event occurs; however, in our work,
we deal directly with the simpler event space.
3. SOFTRANK
The approach taken by LambdaRank was to abandon the
attempt to define an explicit smooth objective, and instead
only work with an implicit objective via the definition of
gradient functions with intuitively desirable properties.
The main idea that motivates the SoftRank approach is
the observation that if we consider the scores to be smoothed
by treating them as random variables, then it should be pos-
sible to propagate that noise through to a rank-dependent
IR metric. In particular, in this paper we use this idea to
create a smoothed approximation to NDCG (referred to as
SoftNDCG), but the approach could equally be applied to
other rank-based metrics. This section shows under what
assumptions we can write down an analytic expression for
the SoftNDCG and its derivative with respect to the model
parameters w. The process is summarized as a factor graph
in Figure 4, the components of which are discussed in the
following sections.
In order to make an objective dependent upon the ranks
of documents, it is natural to assume that a sort is required
at some stage. However, this would render the objective
non-differentiable, so our approach is based on the idea that
we need to avoid sorting.
At a high level, the approach we adopt is as follows. We
consider $N$ labeled documents for a single query. This means
we have $N$ score distributions (Section 3.1). We show how
we can map from score distributions to a rank distribution
for each document (see Section 3.2) without performing an
explicit sort. Armed with a rank distribution for each docu-
ment, we investigate under what conditions we can compute
an expected NDCG (Section 3.3). The expected smoothed
NDCG is what we call SoftNDCG. Finally, because none of
these steps necessarily involves a sorting operation, we show
that the expression for SoftNDCG can be differentiated with
respect to the model parameters, and thus show how it can
be optimized using gradient ascent (Section 3.4).
We conclude the description of SoftRank by discussing
how the single degree of freedom represented by the global
scale of the scores is handled (Section 3.5), and finally dis-
cuss computational issues (Section 3.6).
3.1 Smoothing Scores
Rather than representing scores as deterministic values,
we will treat them as smoothed score distributions (see Fig-
ure 2). The simplest way to do this, and the approach we
adopt for the remainder of this paper, is to give every score
the same smoothing using equal variance Gaussian distri-
butions. Hence the deterministic score $s_j$ in (1) becomes
the mean of a Gaussian score distribution¹, with a shared
smoothing variance $\sigma_s^2$:

$$p(s_j) = \mathcal{N}(s_j \mid \bar{s}_j, \sigma_s^2) = \mathcal{N}(s_j \mid f(w, x_j), \sigma_s^2). \quad (8)$$
An alternative motivation would be to consider the source of
noise as an inherent uncertainty in the model parameters w,
arising from inconsistency between the ranking model and
the training data (see the left side of Figure 4). This would
be the natural result of a Bayesian approach to the learning
task. Mapping parameter distributions through the non-
linearities of a 2-layer neural net that we use in this paper
is not straightforward, so we do not pursue this idea here.
¹Using $\mathcal{N}(x \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp[-(x-\mu)^2/2\sigma^2]$.
Figure 1: Deterministic scores and ranks: Three document scores as point (deterministic) values and their corresponding rank distributions. The lowest scoring document $s_1$ is certain to be ranked in the lowest position 2.
3.2 From Score to Rank Distributions
When we have deterministic scores, we have determinis-
tic rank distributions, as shown in Figure 1. This section
presents a way of quantifying what happens to the rank
distributions when we add noise to (smooth) the score dis-
tributions, shown in Figure 2.
The rank distributions shown in Figure 2 may be simu-
lated by the following exact generative process: a) sample a
vector of $N$ scores, one from each score distribution, b) sort
the score samples and c) accumulate histograms of the re-
sulting ranks for each document. However, we wish to avoid
the sort so we can perform gradient based optimization, and
so the remainder of this section presents an approximate al-
gorithm for generating the rank distributions that avoids an
explicit sort.
For a given doc$_j$, consider the probability that another
doc$_i$ will rank above doc$_j$. Denoting $S_j$ as a draw from
$p(s_j)$, we require the probability that $S_i > S_j$, or equivalently $\Pr(S_i - S_j > 0)$. The difference of two Gaussian random variables is itself Gaussian [14], so the required probability is the integral of a Gaussian density over the positive half-line; the probability that document $i$ beats document $j$, which
we will henceforth refer to as $\pi_{ij}$, is:

$$\pi_{ij} \equiv \Pr(S_i - S_j > 0) = \int_0^\infty \mathcal{N}(s \mid \bar{s}_i - \bar{s}_j, 2\sigma_s^2)\, ds. \quad (9)$$
This quantity represents the fractional number of times we
would expect doc$_i$ to rank higher than doc$_j$ on repeated
pairwise samplings from the two Gaussian score distributions. For example, in Figure 2, we would expect $\pi_{32} > \pi_{12}$.
In other words, if we were to draw two pairs $\{S_1, S_2\}$ and
$\{S_3, S_2\}$, $S_3$ is more likely to win its pairwise contest against
$S_2$ than $S_1$ is in its contest.
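A short sketch of this computation (our illustrative Python; the closed form uses the standard Gaussian CDF via erf, which follows directly from (9), and the second function computes the expected-rank sum of (10) below):

```python
import numpy as np
from scipy.special import erf

def pairwise_win_probs(s_mean, sigma_s):
    """Pairwise contest probabilities pi_ij of (9).

    Integrating N(s | s_i - s_j, 2*sigma_s^2) over s > 0 gives a Gaussian
    CDF: pi_ij = 0.5 * (1 + erf((s_i - s_j) / (2 * sigma_s))).
    """
    diff = s_mean[:, None] - s_mean[None, :]   # all pairwise mean differences
    return 0.5 * (1.0 + erf(diff / (2.0 * sigma_s)))

def expected_ranks(pi):
    """Expected rank of each document j: the sum over i != j of pi_ij,
    as in (10) below."""
    return pi.sum(axis=0) - np.diag(pi)        # drop the self-contest term
```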
Now we use these pairwise probabilities to generate ranks.
We argue intuitively that if we were to add up the proba-
bilities of a document being beaten by each of the other
documents, then we would have a quantity that is related to
the expected rank of the document being beaten. In other
words, if a document is never beaten, its rank will be 0, the
best rank. More generally, using the pairwise contest trick,
we can write down an expression for the expected rank $r_j$
of document $j$ as:

$$\mathrm{E}[r_j] = \sum_{i=1,\, i \neq j}^{N} \pi_{ij} \quad (10)$$

which can be easily computed using (9). As an example,
Figure 2 shows what happens to the rank distributions when
we smooth the scores: the expected rank of document 3 is
between 0 (best) and 1, and documents 1 and 2 have
expected ranks between 1 and 2 (worst).

Figure 2: From score to rank distributions: Smoothed scores for 3 documents and the resulting 3 rank distributions.
The actual distribution of the rank $r_j$ of a document $j$
under the pairwise contest approximation is obtained by
considering the rank $r_j$ as a Binomial-like random variable,
equal to the number of successes of $N-1$ Bernoulli trials,
where the probability of success is the probability that document $j$ is beaten by another document $i$, namely $\pi_{ij}$. If $i$
beats $j$ then $r_j$ goes up by one.
However, because the probability of success is different for
each trial, it is a more complex discrete distribution than the
Binomial: we call it the Rank-Binomial distribution. Like
the Binomial, it has a combinatoric flavour: there are few
ways that a document can end up with top (and bottom)
rank, and many ways of ranking in the middle. Unlike the
Binomial, it does not have an analytic form. However, it
can be computed using a standard result from basic probability theory, that the probability density function (pdf) of
a sum of independent random variables is the convolution
of the individual pdfs [14]. In this case we have a sum of
$N-1$ independent Bernoulli (coin-flip) distributions, each with
a probability of success $\pi_{ij}$. This yields an exact recursive
computation for the distribution of ranks as follows.
If we define the initial rank distribution for document $j$ as
$p_j^{(1)}(r)$, where we have just the document $j$, then the rank
can only have value zero (the best rank) with probability
one:

$$p_j^{(1)}(r) = \delta(r) \quad (11)$$

where $\delta(x) = 1$ only when $x = 0$ and zero otherwise. Now
we have $N-1$ other documents that contribute to the rank
distribution, which we will index with $i = 2..N$. Each time we
add a new document $i$, the event space of the rank distribution gets one larger, taking the $r$ variable to a maximum
of $N-1$ on the last iteration. The new distribution over
the ranks is updated by applying the convolution process
described above, giving the following recursive relation:

$$p_j^{(i)}(r) = p_j^{(i-1)}(r-1)\,\pi_{ij} + p_j^{(i-1)}(r)\,(1 - \pi_{ij}). \quad (12)$$
This can be interpreted in the following more intuitive manner. If we add document $i$, we can write the probability
of rank $r_j$ as a sum of two parts corresponding to the new
document $i$ beating document $j$ or not. If $i$ beats $j$ then the
probability of being in rank $r$ at this iteration is equal to the
probability of being in rank $r-1$ on the previous iteration,
and we have the situation covered by the first term on the
right of (12). Conversely, if the new document leaves the
rank of $j$ unchanged (it loses), the probability of being in
rank $r$ is the same as it was in the last iteration, corresponding to the second term on the right of (12).
We note that we need to define $p_j^{(i)}(r) = 0$ if $r < 0$. We
define the final rank distribution $p_j(r) \equiv p_j^{(N)}(r)$. Figure 2
shows these distributions for the simple 3 score case.
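A compact sketch of this recursion (illustrative Python; `pi[i, j]` holds $\pi_{ij}$ from (9)):

```python
import numpy as np

def rank_distribution(pi, j):
    """Rank-Binomial p_j(r) via the recursion (11)-(12).

    pi: N x N matrix with pi[i, j] the probability that doc i beats doc j.
    """
    N = pi.shape[0]
    p = np.zeros(N)
    p[0] = 1.0                           # (11): alone, doc j has rank 0
    for i in range(N):
        if i == j:
            continue
        p_new = p * (1.0 - pi[i, j])     # doc i loses: rank of j unchanged
        p_new[1:] += p[:-1] * pi[i, j]   # doc i wins: rank of j shifts up one
        p = p_new
    return p
```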
To conclude this section, we note that the pairwise contest
trick yields Rank-Binomial rank distributions, which are an
approximation to the true rank distributions. Their com-
putation does not require an explicit sort. Simulations have
shown that this gives similar rank distributions to the true
generative process. We can improve these approximations
further by performing a sequence of column and row op-
erations on the [pj(r)] matrix: divide each column by the
column sums, then divide each row of the resulting matrix
by the row sums, and iterate to convergence. This process is
known as Sinkhorn scaling, its purpose being to convert the
original matrix to a doubly-stochastic matrix. The solution
can be shown to minimize the Kullback-Leibler distance of
the scaled matrix from the original matrix [2]. We will show
later in our results that we can successfully optimize NDCG
using these approximate rank distributions, which further
justifies the pairwise independence approximation and the
Sinkhorn post-processing.
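A minimal sketch of the Sinkhorn step (our illustrative Python; a fixed iteration count stands in for a proper convergence test):

```python
import numpy as np

def sinkhorn(P, iters=50):
    """Alternate column and row normalization of the [p_j(r)] matrix,
    driving it towards a doubly-stochastic matrix. A fixed iteration
    count stands in for a proper convergence test."""
    P = P.copy()
    for _ in range(iters):
        P /= P.sum(axis=0, keepdims=True)   # divide each column by its sum
        P /= P.sum(axis=1, keepdims=True)   # divide each row by its sum
    return P
```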
3.3 SoftNDCG
This section shows how we can use rank distributions to
smooth traditional IR metrics. The general approach is to
take the expectation of the IR metric with respect to the
rank distribution. As an example, we will now examine the
specific case of NDCG.
The expression for deterministic NDCG was given in (2)
as $G = G_{\max}^{-1} \sum_{r=0}^{N-1} g(r) D(r)$. We set out to compute the
expected NDCG given the rank distributions described in
the last section. Rewriting NDCG as a sum over document
indices rather than document ranks we get:

$$G = G_{\max}^{-1} \sum_{j=1}^{N} g(j)\, D(r_j). \quad (13)$$

With reference to Figure 3 we replace the deterministic discount $D(r)$ with the expected discount. Thus we define soft
NDCG $G$ as:

$$G = G_{\max}^{-1} \sum_{j=1}^{N} g(j)\, \mathrm{E}[D(r_j)]. \quad (14)$$

The expected discount is obtained by mapping the rank
distribution through the non-linear deterministic discount
function (again, see Figure 3) to give:

$$G = G_{\max}^{-1} \sum_{j=1}^{N} g(j) \sum_{r=0}^{N-1} D(r)\, p_j(r) \quad (15)$$

where the rank distribution $p_j(r)$ is given in (12).
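In code, the expectation in (15) is a small matrix computation; a sketch (our illustrative Python, with `P[j, r]` = $p_j(r)$):

```python
import numpy as np

def soft_ndcg(P, gains, discounts, g_max):
    """SoftNDCG of (15): the expected NDCG under the rank distributions.

    P: N x N matrix with P[j, r] = p_j(r); gains: g(j) per document;
    discounts: D(r) per rank; g_max: ideal DCG normalizer.
    """
    expected_discount = P @ discounts          # E[D(r_j)] for each document
    return gains @ expected_discount / g_max
```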
Figure 3: From rank to discount distribution. The rank distribution is mapped through the non-linear discount function $D$, to give a discrete distribution over discounts $p(d)$ whose expectation we substitute for the deterministic discount to obtain SoftNDCG.
3.4 Gradient of SoftNDCG
Having derived an expression for a SoftNDCG, we now
differentiate it with respect to the weight vector. The deriv-
ative with respect to the weight vector with $K$ elements is:

$$\frac{\partial G}{\partial w} = \begin{bmatrix} \frac{\partial s_1}{\partial w_1} & \cdots & \frac{\partial s_N}{\partial w_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial s_1}{\partial w_K} & \cdots & \frac{\partial s_N}{\partial w_K} \end{bmatrix} \begin{bmatrix} \frac{\partial G}{\partial \bar{s}_1} \\ \vdots \\ \frac{\partial G}{\partial \bar{s}_N} \end{bmatrix}. \quad (16)$$
The first matrix is defined by the neural net model and
is computed via backpropagation [12]. The second vector
is the gradient of our objective with respect to the score
means. As with LambdaRank above, our task is to define
this gradient vector for each document in a training query.
Taking a single element of this gradient vector corresponding to a document with index $m$ ($1 \le m \le N$), we can
differentiate (15) to obtain:

$$\frac{\partial G}{\partial \bar{s}_m} = G_{\max}^{-1} \sum_{j=1}^{N} g(j) \sum_{r=0}^{N-1} D(r)\, \frac{\partial p_j(r)}{\partial \bar{s}_m}. \quad (17)$$

Intuitively, this says that changing score $\bar{s}_m$ affects $G$ via
potentially all the rank distributions, as moving a score will
affect every document's rank distribution. The resultant
change in each rank distribution will induce a change in the
expected gain for each document determined by the non-linear discount function $D(r)$.
Figure 4: A factor graph of the distributions for a query. In our development of SoftRank to date, we have only worked with Gaussian scores $s_j$. These map to Bernoulli vectors $\pi_j$ which provide the success probabilities for the computation of the Rank-Binomials over ranks $r_j$ for each document $1..N$ (see Section 3.2). Then the rank distributions get mapped in a non-linear way through the discount function $D(r)$ to give a distribution over discounts $d_j$ (see Section 3.3). Finally, combining the expected discount with the gain of the label over all documents, we arrive at the expected SoftNDCG.
Hence we need a parallel recursive computation to obtain
the required derivative of $p_j(r)$. Denoting $\psi_{m,j}^{(i)}(r) = \frac{\partial p_j^{(i)}(r)}{\partial \bar{s}_m}$,
it is easy to show from (12) that:

$$\psi_{m,j}^{(1)}(0) = 0$$
$$\psi_{m,j}^{(i)}(r) = \psi_{m,j}^{(i-1)}(r-1)\,\pi_{ij} + \psi_{m,j}^{(i-1)}(r)\,(1 - \pi_{ij}) + \left[ p_j^{(i-1)}(r-1) - p_j^{(i-1)}(r) \right] \frac{\partial \pi_{ij}}{\partial \bar{s}_m} \quad (18)$$
where again the recursive process runs $i = 1..N$. Considering now the last term on the right of (18), differentiating
$\pi_{ij}$ with respect to $\bar{s}_m$ using (9) yields three different cases
(given we know already that $i \neq j$, the case $m = i = j$ is not
possible). Using the fact that

$$\frac{\partial}{\partial \mu} \int_0^\infty \mathcal{N}(x \mid \mu, \sigma^2)\, dx = \mathcal{N}(0 \mid \mu, \sigma^2) \quad (19)$$

it can easily be shown from (9) that:

$$\frac{\partial \pi_{ij}}{\partial \bar{s}_m} = \begin{cases} \mathcal{N}(0 \mid \bar{s}_m - \bar{s}_j, 2\sigma_s^2) & m = i,\ m \neq j \\ -\mathcal{N}(0 \mid \bar{s}_i - \bar{s}_m, 2\sigma_s^2) & m \neq i,\ m = j \\ 0 & m \neq i,\ m \neq j \end{cases} \quad (20)$$
and so substituting (20) in (18), we can now run the recursion for the derivatives. We define the result of this computation as the $N$-vector over ranks:

$$\frac{\partial p_j(r)}{\partial \bar{s}_m} \equiv \psi_{m,j} = \left[ \psi_{m,j}^{(N)}(0), \ldots, \psi_{m,j}^{(N)}(N-1) \right]. \quad (21)$$

Using this matrix notation we substitute the result in (17):

$$\frac{\partial G}{\partial \bar{s}_m} = \frac{1}{G_{\max}} \left[ g_1, \ldots, g_N \right] \begin{bmatrix} \psi_{m,1} \\ \vdots \\ \psi_{m,N} \end{bmatrix} \begin{bmatrix} d_0 \\ \vdots \\ d_{N-1} \end{bmatrix}. \quad (22)$$
We now define the gain vector $\mathbf{g}$ (by document), the discount
vector $\mathbf{d}$ (by rank) and the $N \times N$ square matrix $\Psi_m$ whose
rows are the rank distribution derivatives implied above:

$$\frac{\partial G}{\partial \bar{s}_m} = \frac{1}{G_{\max}}\, \mathbf{g}^T \Psi_m \mathbf{d}. \quad (23)$$

So to compute the $N$-vector gradient of $G$, which we define
as $\nabla G = \left[ \frac{\partial G}{\partial \bar{s}_1}, \ldots, \frac{\partial G}{\partial \bar{s}_N} \right]$, we need to compute $\Psi_m$ for each
document.
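A sketch of the joint recursion for $p_j(r)$ and $\psi$ (our illustrative Python; it implements (12), (18) and the cases of (20) for a single pair $(j, m)$):

```python
import numpy as np

def d_pi(s_mean, sigma_s, i, j, m):
    """Derivative of pi_ij w.r.t. mean score m: the three cases of (20)."""
    var = 2.0 * sigma_s ** 2
    gauss = np.exp(-(s_mean[i] - s_mean[j]) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    if m == i:
        return gauss
    if m == j:
        return -gauss
    return 0.0

def rank_dist_and_grad(pi, s_mean, sigma_s, j, m):
    """Run (12) and (18) together for document j and score mean m."""
    N = pi.shape[0]
    p = np.zeros(N)
    p[0] = 1.0                           # (11)
    psi = np.zeros(N)                    # derivatives start at zero
    for i in range(N):
        if i == j:
            continue
        dp = d_pi(s_mean, sigma_s, i, j, m)
        # (18): shift-by-one (win) and stay (lose) terms, plus the
        # [p(r-1) - p(r)] * d(pi)/d(s_m) correction.
        psi_new = psi * (1.0 - pi[i, j]) - p * dp
        psi_new[1:] += psi[:-1] * pi[i, j] + p[:-1] * dp
        p_new = p * (1.0 - pi[i, j])
        p_new[1:] += p[:-1] * pi[i, j]
        p, psi = p_new, psi_new
    return p, psi
```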
3.5 Scale Optimizations
Any optimization of an objective that is based on ranks of
documents should consider carefully the degree of freedom
that corresponds to an arbitrary scaling of the scores. Mul-
tiplying all the scores by a common factor does not affect
the ranks, and if it is not handled properly, could lead to an
undesirable degeneracy in the optimization process.
With SoftRank, this global scale factor is equivalent to
the score variance $\sigma_s^2$: if we multiply all the scores by some
common factor, it is the same thing as dividing $\sigma_s$ by that
factor. In the experiments described in this paper, $\sigma_s$ is
set to some initial value, and is not changed during optimization. Hence the scale factor is controlled by $\nabla G \cdot \bar{\mathbf{s}}$, the
component of the SoftNDCG gradient parallel to the current
mean score vector $\bar{\mathbf{s}}$.
As a simple example, imagine a query with 2 documents
that were stubbornly in the wrong order, say the relevant
document was always below the irrelevant. In this situation,
SoftRank would steadily decrease the scale which would
have the effect of making the Gaussians overlap more, thus
increasing the probability that the relevant document could
beat the irrelevant above it. Conversely, if the documents
were correctly ordered, the scale would increase, effectively
reducing the score variance.
Following this rather intuitive argument, we conclude that
SoftRank does control the scale of the scores in a sensible,
non-degenerate way. In effect it overloads the global scale
degree of freedom to implement an annealing schedule. It
is the subject of ongoing study as to whether this default
schedule can be improved upon.
3.6 Computational Considerations
For a given query of $N$ documents, calculation of the $\pi_{ij}$ is
$O(N^2)$, calculation of all the $p_j(r)$ is $O(N^3)$, and calculation
of the SoftNDCG is $O(N^2)$. Similar complexity arises for the
gradient calculations. So the calculations are dominated by
the recursions in (12) and (18).
A substantial computational saving can be made by approximating all but a few of the Rank-Binomial distributions. The motivation for this is that a true binomial distribution, with $N$ samples and probability of success $\pi$, can be
approximated by a normal distribution with mean $N\pi$ and
variance $N\pi(1-\pi)$ when $N\pi$ is large enough. For the rank
binomial distribution, $\pi$ is not constant, but simulations
confirm that it can be approximated similarly, for a given $j$,
by a normal distribution with mean equal to the expected
rank $\sum_{i=1,\, i \neq j}^{N} \pi_{ij}$ and variance equal to $\sum_{i=1,\, i \neq j}^{N} \pi_{ij}(1 - \pi_{ij})$. As the approximation is an explicit function of the
$\pi_{ij}$, we can easily calculate the gradients of the approximated $p_j(r)$ with respect to $\pi_{ij}$, and therefore with respect
to the $\bar{s}_m$. Using this approximation allows us to restrict
the expensive recursive calculations to a few documents at
the top and bottom of the ranking.
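A sketch of this approximation (our illustrative Python; we evaluate the Gaussian density at integer ranks and renormalize, which is one simple way to realize the substitution):

```python
import numpy as np

def approx_rank_distribution(pi, j):
    """Normal approximation to the Rank-Binomial for document j.

    Mean and variance are the Bernoulli sums described above; the density
    is evaluated at integer ranks and renormalized.
    """
    N = pi.shape[0]
    others = np.arange(N) != j
    mean = pi[others, j].sum()                            # expected rank, (10)
    var = (pi[others, j] * (1.0 - pi[others, j])).sum()   # sum of trial variances
    r = np.arange(N)
    p = np.exp(-(r - mean) ** 2 / (2.0 * max(var, 1e-12)))
    return p / p.sum()
```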
4. EXPERIMENTS
We performed experiments on three corpora: queries from
the TREC .GOV corpus, on enterprise search data and queries
from a commercial web search engine. For all corpora, we
used only those documents containing all the query terms
(the AND set) for both training and testing. This simpli-
fication is consistent with common practice for large scale
commercial search engines.
4.1 Training Set Subsampling
To make training tractable, it is necessary to subsample
the irrelevant documents to some degree. If we do not do
this, the data we need to process is dominated unnecessar-
ily by uninformative irrelevant documents. In this paper, we
adopted a very simple approach. We took all the judged doc-
uments, whatever the relevance label. We then augmented
this set with documents drawn at random from the AND
set of unlabeled documents, which we assume to be irrel-
evant, until we have drawn a number equal to the size of
the judged set, or we have drawn 30, whichever happens
first. This threshold is admittedly rather arbitrary, and it is
the subject of ongoing experiments as to how sensitive each
objective is to this number.
4.2 .GOV
For the TREC .GOV corpus, we took the following sets
of topics: name/home page 2003 (300 queries), topic dis-
tillation 2003 (50 queries) and all queries from 2004 (225
queries). As mentioned above, we chose to work with AND
set documents only, so we dropped all queries with no rel-
evant documents in the AND set, and this left us with 508
queries. This number would have been greater if we had per-
formed stemming. The queries had binary relevance judg-
ments.
The data was partitioned for 5-fold (3:1:1) cross valida-
tion, giving 5 runs, each with roughly 300 training queries,
100 validation and 100 test. It is possible that a different
ratio of train/validate/test queries might give better results,
and again this is an area of current investigation.
The index of our crawl of .GOV has 3 structural fields:
body, title and anchor. We used BM25F, which in this
case has 6 tuneable parameters that can be learned using
back-propagation following the process described in [16]. In
addition, we used PageRank as a further feature.
These two features were the input to a 2-layer net with
3 hidden nodes. This number was obtained by trying a few
values for the LambdaRank baseline. No further effort was
made to find the best number for each individual objective.
No linear model (single layer net) results are reported here,
as it is well-established in the literature now [3] that they
do not perform as well as non-linear models. The training
queries were subsampled as described in section 4.1. Com-
plete AND sets were used for validation and test.
4.3 Enterprise Search
This data came from the Intranet of a large corporation
with about 15 million documents. Queries were sampled
from the existing search service log. Judges were asked to
choose queries they understood and assign one of 4 rele-
vance levels. We used 761 queries and performed 5-fold cross
validation here with the same train/validate/test ratios as
above. The index had 6 fields, including author and url, in
addition to the body, title and anchor used for .GOV. We
used the 12-parameter BM25F, alongside about 8 query in-
dependent features, including file type and url length. We
used a 2-layer net with 4 hidden nodes. Training, validation
and test queries were sub-sampled as described in section
4.1.
4.4 Web Search Data
This data came from a large commercial web search en-
gine. We used 4096 training, 2651 validation and 2560 test
queries, sampled from the live query log, with 5 relevance
levels. In this experiment we used about 380 features. Be-
cause we had more features and training queries, we used
a more complex 2-layer model with 10 hidden nodes. The
training queries were sub-sampled as described in section
4.1. While complete AND sets were not used for validation
and test, these splits still had about 20 times as many ran-
domly sampled documents from the AND set, which were
assumed to be irrelevant.
4.5 Optimization
For each run we initialize the weights to a random setting
near zero, so that all nodes in the neural net start off un-
saturated. Following standard neural net practice, all input
features are normalized so that they are zero mean and unit
standard deviation over the training set queries. We call a
cycle through all training queries an epoch. If the training
set NDCG@10 ($G_{10}$) goes for 16 epochs without increasing,
we reinitialize the weights. Our runs were for 128 epochs.
Like [3] we adopted a stochastic gradient descent approach,
where the weights of the model were updated in a batch
mode after each query. This optimization technique is very
simple. There is one parameter that corresponds to the dis-
tance taken along the gradient vector at each weight update,
called the learning rate. We set this rate to an initial value,
and we reduce it by a factor of 0.8 each time the end-of-
epoch training set G10 does not improve.
Initial values for both the learning rate (all utilities) and
the SoftRank smoothing σsaffect the results significantly.
Therefore, all experiments need to set these values using
the validation set. Multiple training runs are performed
from different initial settings, and the run/epoch that per-
formed best on the validation set is the one used in the final
evaluation on the test set. We tried learning rates from $10^{-1}$
to $10^{-7}$ and initial smoothing from $10^{0}$ to $10^{-4}$.
For consistency across these experiments, we have only
made use of gradient information. However, the SoftRank
approach supports future incorporation of rank-based met-
rics directly into the optimization process which opens up
the possibility of using more sophisticated optimizers which
do not rely on gradient alone, such as BFGS. This is the
subject of future work.
4.6 Initial Results
We investigated 4 utilities on the three corpora: MSE (3),
RankNet (RN) (5), LambdaRank (LR) (7) and SoftRank
(SR) (23). In this set of initial experiments we used the
same discount function for the training objectives (SR and
LR) as that used in the test NDCG defined in Section 2.1.
The mean NDCG over all test set queries, at cut-offs 3
and 10 ($G_3$, $G_{10}$), is shown in Table 1. We use the paired
t-test at the 5% significance level on the $G_{10}$ results only,
as this was the objective used for model selection on the
validation set.
              .GOV           Enterprise      Web
              G3     G10     G3     G10     G3     G10
MSE           56.0   59.5    59.0   62.0    60.7   65.8
RankNet       65.4   67.7    58.1   61.9    60.6   65.4
LambdaRank    65.9   68.1    59.5   62.5    61.4   66.4
SoftRank      66.9   68.9    59.3   62.6    60.4   65.6

Table 1: Test set NDCG@3 and NDCG@10 for the four utilities and three corpora.
For .GOV, the MSE objective performed surprisingly badly,
and significantly worse than the others. Taking the RN result as the baseline, LR did not do better ($p = 55\%$) and
SR was significantly better ($p = 4\%$). SR was close to being
significantly better than LR ($p = 8\%$).
For the Enterprise corpus, the MSE objective performed
surprisingly well, being equivalent to RN. LR was not significantly better than MSE, though nearly so at $p = 7\%$,
and SR > MSE at $p = 2\%$. Finally, SR was not significantly
better than LR.
For the web corpus, characterized by a very complex model
with several thousand parameters, we found that Lamb-
daRank was the best on the test set, significantly beating
all other utilities. MSE did surprisingly well. MSE, RN and
SR were all not significantly different.
              Training G10
MSE           67.9
RankNet       67.6
LambdaRank    69.2
SoftRank      70.6

Table 2: Training set NDCG@10 for the four objectives on the web corpus.
Training Set G10
Table 2 compares the training set $G_{10}$ values. As expected,
we observe that LR and SR are both much better than MSE
and RN. We also note that SR yields a consistently and sig-
nificantly better fit to the training data. This effect was
observed on all three corpora. We conclude that gradient
ascent on SoftNDCG represents a more effective NDCG op-
timization algorithm than the other objectives.
We find it encouraging that, using the same models and
gradient-based optimizer, SoftRank consistently finds better
training set G10 values. In this, SoftNDCG has fulfilled the
goal of creating a smoothed approximation of a rank-based
metric such as NDCG, and optimizing it directly. It seems
therefore that the Rank-Binomial approximation is a good
one.
4.7 Generalization Study on Web Corpus
However, further comment is needed on test set performance.
The training set $G_{10}$ values lead us to ask how SR can
give a much better fit to the training data while at the same
time giving worse test set performance, given that the model
structures are identical.
Simpler Linear Model
Our first thought was that we might just be overfitting:
maybe LambdaRank has better natural regularization prop-
erties than SoftNDCG, and the use of early stopping on the
validation set is not sufficient for SR. To test this hypothe-
sis we tried a drastically simpler linear model, reducing the
number of model parameters by a factor of 10 from about
4000 to 400. This simple model could not realistically be
overfitted with 4K training queries.
Model        Train G10   Test G10
2-layer LR   69.2        66.4
2-layer SR   70.6        65.7
Linear LR    67.2        65.2
Linear SR    67.5        64.6

Table 3: Results on a much simpler model: the simple linear model still shows a gain for LambdaRank.
The results of this experiment are shown in Table 3. As
we would expect, the training set NDCGs are worse for the
linear model than for the non-linear model as the model
is less flexible. More interestingly, we still observe that LR
statistically significantly outperforms SR in the linear model
on the test set. Now given we are very unlikely to be under-
regularized for the reasons stated earlier, this leads us to
conclude that we are not overfitting in the classical sense.
Alternative training discount functions
Another possible explanation is that by using the NDCG
discount function for training, we have allowed the model
to concentrate too much on high-ranking documents in the
training set. In other words, perhaps there is useful infor-
mation in the ordering of documents beyond the top 10 or so
relevant documents that SoftNDCG is effectively ignoring,
which the baseline models do not. Therefore, maybe the
use of a less severe training discount function would allow
SoftRank to generalize better to new queries, by exploiting
more of the training data.
To test this hypothesis we have investigated a variety of
training discount functions that do not decay as quickly as
the regular NDCG discount function. These are shown in
Figure 5, ranging from convex (super-linear with rank, denoted $\alpha = -1$ in Figure 6), through linear ($\alpha = 0$), to concave (sub-linear, like the regular NDCG discount, $\alpha = 1$).
We used these new, deeper discount functions for both train-
ing (in SoftNDCG) and validation NDCG, but retaining the
original discount for the test set.
The results of this experiment are shown in Figure 6. For
$\alpha = 1$ we saw a slightly (not significantly) better $G_{10}$, as we
are using a validation NDCG with a longer tail. The test
set performance improved to be as good as LambdaRank
for the range $-0.4 < \alpha < 0.4$, reaching an optimum at
$\alpha = 0.0$.
5. CONCLUSION
We have introduced the idea of using rank distributions
to smooth traditional IR metrics. We have shown how to
compute an approximation to these rank distributions in a
way that involves no explicit sort, and is therefore differen-
tiable with respect to the model parameters, and so suitable
for memory-efficient non-linear machine learning approaches
such as the multi-layer perceptron. We have shown that the
Rank-Binomial is a good and useful approximation to the
true rank distributions by demonstrating that it can reliably find better training set NDCGs than state-of-the-art
algorithms designed for that purpose.

Figure 5: Shallower discount functions: We tried a range of discount functions ranging from super-linear (top, $\alpha = -1$) through linear ($\alpha = 0$) to sub-linear, as used in the test NDCG (bottom, $\alpha = 1$).
We have shown also that SoftRank does not, in some
cases, generalize to new queries as well as LambdaRank.
We have shown that this is not due to lack of regularization,
but rather a tendency to focus too much on the top ranks.
We fixed this problem by trying a range of training discount
functions that are less top-heavy, and found that it was pos-
sible to do as well as LambdaRank on the test set but, as
yet, not significantly better.
We believe that SoftRank represents a general and pow-
erful new approach for direct optimization of non-smooth
ranking metrics. Future work will focus on characterizing
better the conditions under which SoftNDCG fails to gener-
alize, and exploring other soft IR metrics.
Acknowledgments
Thanks to Chris Burges, Markus Svensen and Martin Szum-
mer for useful discussions and to Chris Burges for his neural
net code.
6. REFERENCES
[1] E. Agichtein, E. Brill, and S. Dumais. Improving web
search ranking by incorporating user behavior
information. In SIGIR, 2006.
[2] H. Balakrishnan, I. Hwang, and C. Tomlin.
Polynomial approximation algorithms for belief matrix
maintenance in identity management. In IEEE
Conference on Decision and Control, 2004.
[3] C. Burges, R. Ragno, and Q. V. Le. Learning to
rank with nonsmooth cost functions. In NIPS, 2006.
[4] C. Burges, T. Shaked, E. Renshaw, A. Lazier,
M. Deeds, N. Hamilton, and G. Hullender. Learning
to rank using gradient descent. In ICML, 2005.
Figure 6: Test set performance with different discount functions. Test set $G_{10}$ on web data against the discount function parameter $\alpha$ as defined in Figure 5. The horizontal line is the LR baseline. SR does as well as LR for $-0.4 < \alpha < 0.4$.
[5] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li.
Learning to rank: From pairwise approach to listwise
approach. In ICML, 2007.
[6] R. Caruana, S. Baluja, and T. Mitchell. Using the
future to “sort out” the present: Rankprop and
multitask learning for medical risk evaluation. In
NIPS 8, 1996.
[7] W. Chu and Z. Ghahramani. Gaussian processes for
ordinal regression. Journal of Machine Learning
Research, 6:1019–1041, 2005.
[8] K. Crammer and Y. Singer. Pranking with ranking. In
NIPS 14, 2002.
[9] R. Herbrich, T. Graepel, and K. Obermayer. Large
margin rank boundaries for ordinal regression. In
Advances in Large Margin Classifiers, pages 115–132.
MIT Press, 2000.
[10] K. Järvelin and J. Kekäläinen. IR evaluation methods for
retrieving highly relevant documents. In SIGIR, 2000.
[11] T. Joachims. Optimizing search engines using
clickthrough data. In Proceedings of Knowledge
Discovery in Databases, 2002.
[12] Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller.
Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.
[13] D. Metzler, T. Strohman, and W. Croft. Indri at TREC
2006: Lessons learned from three terabyte tracks. In
Online Proceedings of the Text REtrieval Conference, 2006.
[14] A. Papoulis. Probability, Random Variables and
Stochastic Processes, Third Edition. McGraw-Hill, 1991.
[15] S. Robertson, H. Zaragoza, and M. Taylor. A simple
BM25 extension to multiple weighted fields. In
CIKM, pages 42–49, 2004.
[16] M. Taylor, H. Zaragoza, N. Craswell, S. Robertson,
and C. Burges. Optimisation methods for ranking
functions with multiple parameters. In CIKM, 2006.