SoftRank: Optimizing Non-Smooth Rank Metrics
Michael Taylor, John Guiver, Stephen Robertson and Tom Minka
Microsoft Research Cambridge
{mitaylor,joguiver,ser,minka}@microsoft.com
ABSTRACT
We address the problem of learning large complex rank-
ing functions. Most IR applications use evaluation metrics
that depend only upon the ranks of documents. However,
most ranking functions generate document scores, which are
sorted to produce a ranking. Hence IR metrics are innately
non-smooth with respect to the scores, due to the sort. Un-
fortunately, many machine learning algorithms require the
gradient of a training objective in order to perform the op-
timization of the model parameters, and because IR met-
rics are non-smooth, we need to find a smooth proxy ob-
jective that can be used for training. We present a new
family of training objectives that are derived from the rank
distributions of documents, induced by smoothed scores.
We call this approach SoftRank. We focus on a smoothed
approximation to Normalized Discounted Cumulative Gain
(NDCG), called SoftNDCG and we compare it with three
other training objectives in the recent literature. We present
two main results. First, SoftRank yields a very good way
of optimizing NDCG. Second, we show that it is possible to
achieve state of the art test set NDCG results by optimizing
a soft NDCG objective on the training set with a different
discount function.
Categories and Subject Descriptors
H.3.3 [Information Systems Applications]:
General Terms
Algorithms, Experimentation
Keywords
learning, ranking, metrics, optimization, gradient descent
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WSDM’08, February 11–12, Palo Alto, California, USA.
Copyright 2008 ACM 978-1-59593-927-9/08/0002 ...$5.00.
1. INTRODUCTION
There is a clear trend among both IR researchers and practitioners towards using ever more complex ranking functions. Until quite recently it has been common to use models with only a handful of free parameters. For example, BM25 in its most widely adopted form has only two. Such simple models have advantages: they are easy to tune for a given corpus, requiring few training queries and little computation to find reasonable parameter settings. Often, they work well “out-of-the-box” on new corpora with parameters reported in the literature. In short, they are robust and generalize well.
However, increasingly we see richer models appearing that
set out to harness an ever expanding set of more powerful
features. For example, there is much activity surrounding
term proximity where work is beginning to show benefits in
going beyond the bag-of-words models [13]. In the area of
document structure there has been progress too. Improve-
ments have been reported exploiting both a simple field-
based flat structure, where for example, term occurrences
are handled differently in titles than body text [15], and also
more complex hierarchical structures in the area of XML re-
trieval. Much work has also been published on combining
content with external cues such as link-graph features like
PageRank and HITS, and more recently, usage features [1].
As these features all appear to aid retrieval effectiveness,
it follows that competitive IR systems need to be able to ex-
ploit them in an efficient and reliable way. However, as the
number of features increases, so does the number of parame-
ters necessary in the ranking function. This paper addresses
the problem of learning the parameters for such complex
ranking functions.
The fundamental issue when formulating such a machine
learning problem is the choice of objective function to be op-
timized. In IR there are very many existing metrics (NDCG,
MAP, RPrec etc.) that all share the property that they
are rank-dependent, placing more emphasis on performance
at the top of a list of documents, thus reflecting the end-
user experience. While these metrics are ideal for evaluating
trained systems, their use as objective functions for training
is problematic.
1.1 IR Metrics are Not Smooth
Given a representative set of queries and relevance judg-
ments, we seek to learn a ranking function (or model) that
takes a set of document-query match feature values and gen-
erates a score. At test time, this score is used to sort doc-
uments to produce a ranked list, which is to be evaluated
using an IR metric.
Following [4, 3] in this paper we choose to model the map-
ping from features to score using a 2-layer neural net model.
Neural nets represent a tried-and-tested machine learning
technique that scales well with large amounts of training
data. The optimization process used is gradient based, and
so this learning approach depends upon the availability of a
gradient of the training objective.
Typical IR metrics only depend on the ranks (and not the
scores). If we make small changes to the model parameters,
the scores will typically change smoothly, but the ranks of
documents will not change until one document’s score passes
another, at which point the IR metric will make a discontin-
uous change (see Figure 1). In other words, the IR metrics
are non-smooth with respect to model parameters: they are
everywhere either flat (with zero gradient) or discontinuous.
Section 2 describes our chosen test metric NDCG, and
then gives a brief description of previous work in construct-
ing useful training utilities that have smooth gradients and
also attempt to approximate the desired non-smooth rank-
dependent IR objective. Section 3 presents a new training
objective called SoftNDCG that is a smoothed approxima-
tion to NDCG.
2. RANKING UTILITIES
This section first describes the test metric that we adopt in
this paper, called Normalized Discounted Cumulative Gain
(NDCG). The approach we adopt in this paper is to seek
an objective function that makes a good proxy for a rank-
based metric (e.g. NDCG), but that is also differentiable
with respect to the parameters of the ranking function. Such
a function will be more easily optimized in the neural net
framework. We next describe several training utilities that
have been previously proposed, concentrating on RankNet
[4] and LambdaRank [3] as they are used as baselines in this
paper.
For a given training query, we will assume we have $N$ documents, each with a known human-defined rating. We denote an individual document indexed by $j$ as $\mathrm{doc}_j$. Let us assume that we have a ranking function that takes in document features $x_j$ and produces a score $s_j$. In this paper we consider the class of non-linear ranking functions (models) represented by the 2-layer neural net $f$ with weights (parameters) $w$. We denote the score:
$$s_j = f(w, x_j). \tag{1}$$
2.1 NDCG
We adopt the IR evaluation metric called NDCG [10] because it is a reasonable way of dealing with multiple relevance levels in our datasets. It is often truncated at a rank position $R$ (indexed from 0) and is defined as:
$$G_R = G_{R,\max}^{-1} \sum_{r=0}^{R-1} g(r)\, D(r) \tag{2}$$
where the gain $g(r)$ of the document at rank $r$ is usually an exponential function $g(r) = 2^{l(r)}$ of the label $l(r)$ (or rating) of the document at rank $r$. The labels typically take values from 0 (bad) to 4 (perfect). A popular choice for the rank discount is $D(r) = 1/\log(2 + r)$, and $G_{R,\max}$ is the maximum value of $\sum_{r=0}^{R-1} g(r)\, D(r)$, obtained when the documents are optimally ordered by decreasing label value. Where no subscript is defined, it should be assumed that $R = N$.
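As an illustration of definition (2), the following Python sketch computes NDCG from scores and labels using an explicit sort. The function names and the `truncation` argument are illustrative choices, not part of the paper, and the natural logarithm is assumed for the discount (the base cancels in the normalization).

```python
import math

def gain(label):
    # Exponential gain g = 2^label, as stated in the text.
    return 2.0 ** label

def discount(rank):
    # D(r) = 1 / log(2 + r), with ranks indexed from 0.
    return 1.0 / math.log(2 + rank)

def dcg(labels_in_rank_order, truncation=None):
    # Sum of gain times discount over the top R positions.
    top = labels_in_rank_order[:truncation]
    return sum(gain(l) * discount(r) for r, l in enumerate(top))

def ndcg(scores, labels, truncation=None):
    # Sort documents by decreasing score to obtain the ranking.
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranked_labels = [labels[j] for j in order]
    # The normalizer is the DCG of the ideal (label-sorted) ordering.
    ideal_labels = sorted(labels, reverse=True)
    return dcg(ranked_labels, truncation) / dcg(ideal_labels, truncation)
```

Because the value depends on the scores only through the sort, a small change in a score that does not swap two documents leaves the result unchanged, which is the non-smoothness discussed in Section 1.1.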
2.2 Pointwise
The simplest class of smooth training objective is what
we call a pointwise objective because it can be computed
for a single document. If we are given a target label for each
document, then the most straightforward pointwise objec-
tive is the mean squared error (MSE) between the target
and predicted label. We can write the objective for a docu-
ment/label pair (in this case it is actually a cost) as:
$$U_{\mathrm{mse}}(s_j) = (s_j - l_j)^2 \tag{3}$$
and the total cost is obtained by computing the mean over
all training documents. This is the first baseline objective
that we will use in our experiments. Another option would
be to treat the NDCG gains as the targets for the regres-
sion, instead of the labels. This made little difference in our
experiments and we do not consider it further.
There are of course more sophisticated approaches to pointwise metrics. Rankprop [6] alternates between the MSE approach described above and a phase that iteratively adjusts the targets themselves. An alternative approach is the
framework of ordinal regression, which is closer to classifi-
cation in spirit. An example of this approach is PRank [8];
for a Bayesian treatment see [7].
2.3 Pairwise
Ranking metrics only require the recovery of relative relevance levels, so the motivation for the pairwise approach is that pairwise preferences in the labels may well be more easily modelled than any available absolute value of relevance. Herbrich et al. [9] were the first to use pairwise preference labels to learn ranking functions. They took an ordinal
regression approach using an SVM model. This model is
well known in the IR literature as the RankingSVM [11],
where it was used in a scenario where only preference la-
bels were available in the training data, derived from click-
through logs.
RankNet [4] is a probabilistic model for pairwise preferences. The algorithm assumes that it is provided with a training set consisting of pairs of documents $\mathrm{doc}_1, \mathrm{doc}_2$ together with a target probability $\bar{P}_{12}$ that $\mathrm{doc}_1$ is to be ranked higher than $\mathrm{doc}_2$. The authors define a ranking function $f$ as we have specified in (1).
The map from the outputs $s_j$ to probabilities is modeled using a logistic function $P_{12} \equiv e^{-Y}/(1 + e^{-Y})$ where $Y \equiv s_2 - s_1$, and $P_{12}$ is the probability that $\mathrm{doc}_1$ is ranked higher than $\mathrm{doc}_2$. They then invoke the cross-entropy error function to penalize pair ordering errors:
$$U_{\mathrm{rn}}(Y) = -\bar{P}_{12} \log P_{12} - (1 - \bar{P}_{12}) \log(1 - P_{12}). \tag{4}$$
This is a very general cost function that affords the use of any available uncertainty we may have concerning the pairwise ratings. Following [4], in our implementation of RankNet, we take the pair ordering data as certain (ignoring ties), and so for us, the $\bar{P}_{12}$ are always one. With this simplification, the RankNet cost for a pair becomes:
$$U_{\mathrm{rn}} = \log\left(1 + e^{s_2 - s_1}\right) \tag{5}$$
where the score difference is positive if the documents are in the wrong order. The RankNet cost for a pair in the wrong order tends to a linear function of the score difference. When the scores are correctly ordered, $U_{\mathrm{rn}}$ asymptotically approaches zero as the score difference increases. Thus the gradient of $U_{\mathrm{rn}}$ not only encourages the pair to be in the right order, but also encourages them to have at least some separation in their scores. If $f$ is differentiable with respect to the parameters, then so is $U_{\mathrm{rn}}$.
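The behaviour of the simplified cost (5) can be checked numerically. The sketch below is a minimal illustration, assuming doc1 is the document that should rank higher; the function names are hypothetical and the numerically stable form of log(1 + e^y) is an implementation choice rather than something taken from [4].

```python
import math

def ranknet_cost(s1, s2):
    """Cost (5) for a pair in which doc1 should rank above doc2."""
    y = s2 - s1  # positive when the pair is in the wrong order
    # Stable evaluation of log(1 + exp(y)).
    return max(y, 0.0) + math.log1p(math.exp(-abs(y)))

def ranknet_gradient(s1, s2):
    """Partial derivatives (dU/ds1, dU/ds2) of cost (5)."""
    y = s2 - s1
    # Logistic function of y, evaluated without overflow.
    if y >= 0:
        sig = 1.0 / (1.0 + math.exp(-y))
    else:
        sig = math.exp(y) / (1.0 + math.exp(y))
    return (-sig, sig)
```

For a badly mis-ordered pair the cost is close to the score difference itself (the linear regime), while for a well-separated, correctly ordered pair both the cost and its gradient are close to zero.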
2.4 Rank Dependent
The smooth training utilities described so far do not take
into account the rank of a document in the set of documents
scored for a training query. Hence, Burges et al. [3] argue
that there is a possibility that a model might be prone to
waste capacity in improving the order of documents at low
(poor) ranks at the expense of documents at the top of the
ranking. Their approach, called LambdaRank, argues for a
training objective that is closer to NDCG, in that it should
care more about the top of the ranking than the bottom.
Consequently, it needs to incorporate rank information into
its training objective.
However, rank information is only directly available via
a sort, and the sort operation is inherently undifferentiable.
Thus, any sort operation makes the objective non-smooth.
They get round this problem by defining a virtual gradient on each document after the sort. In other words, LambdaRank defines a gradient of an implicit objective function that is itself never actually defined. These gradients are only defined once the current model has produced a ranked list for the query. For example, consider a query with just two relevant documents $\mathrm{doc}_1, \mathrm{doc}_2$, and at some stage in training the model places $\mathrm{doc}_1$ near the top of the ranked list with score $s_1$ and $\mathrm{doc}_2$ near the bottom with score $s_2$. The intuition concerning limited capacity can be encoded as:
$$\frac{\partial U}{\partial s_1} \gg \frac{\partial U}{\partial s_2}, \tag{6}$$
or in words, we would like the rate of change of the objective with respect to the high-ranking relevant document’s score to be very much greater than that for the low-ranking relevant document. Note that $U$ is the implicit objective that is not defined. Only the gradients of $U$ are defined, given a sorted list of documents at a particular point in training.
In their paper, they report that they tried many different gradient functions satisfying the capacity constraint of (6), and the one that worked best on web retrieval experiments was as follows. Given a pair of documents for a training query, they define the gradient (lambda function) to be the gradient of the RankNet cost (5) scaled by the difference in NDCG found by swapping the two documents in question. Using this pairwise “force”, the total gradient for a document $j$ can be obtained by summing all such pairwise interactions, giving:
$$\lambda_j \equiv \frac{\partial U_{\mathrm{lr}}}{\partial s_j} = G_{\max}^{-1} \sum_i \left( \frac{1}{1 + e^{s_i - s_j}} \right) (g_i - g_j)(D_i - D_j) \tag{7}$$
where $g_i$ is the gain of the label of $\mathrm{doc}_i$ and $D_i$ is the discount at the rank of $\mathrm{doc}_i$, $D(r_i)$. The authors report that a 2-layer neural net model trained using this LambdaRank objective outperforms the same model using the RankNet cost.
A more recent approach [5] sets out to define a probability
distribution over rankings: in other words, the event space
consists of document permutations. To contrast that work
with this paper, here we only set out to elicit distributions
for the ranks of individual documents: an event in our space
is that of an individual document having a particular rank.
The probability of such an event is obtainable from a full
ranking distribution by summing up the probabilities of the
rankings for which this event occurs; however, in our work,
we deal directly with the simpler event space.
3. SOFTRANK
The approach taken by LambdaRank was to abandon the
attempt to define an explicit smooth objective, and instead
only work with an implicit objective via the definition of
gradient functions with intuitively desirable properties.
The main idea that motivates the SoftRank approach is
the observation that if we consider the scores to be smoothed
by treating them as random variables, then it should be pos-
sible to propagate that noise through to a rank-dependent
IR metric. In particular, in this paper we use this idea to
create a smoothed approximation to NDCG (referred to as
SoftNDCG), but the approach could equally be applied to
other rank-based metrics. This section shows under what
assumptions we can write down an analytic expression for
the SoftNDCG and its derivative with respect to the model
parameters w. The process is summarized as a factor graph
in Figure 4, the components of which are discussed in the
following sections.
In order to make an objective dependent upon the ranks
of documents, it is natural to assume that a sort is required
at some stage. However, this would render the objective
non-differentiable, so our approach is based on the idea that
we need to avoid sorting.
At a high level, the approach we adopt is as follows. We
consider $N$ labeled documents for a single query. This means
we have $N$ score distributions (Section 3.1). We show how
we can map from score distributions to a rank distribution
for each document (see Section 3.2) without performing an
explicit sort. Armed with a rank distribution for each docu-
ment, we investigate under what conditions we can compute
an expected NDCG (Section 3.3). The expected smoothed
NDCG is what we call SoftNDCG. Finally, because none of
these steps necessarily involves a sorting operation, we show
that the expression for SoftNDCG can be differentiated with
respect to the model parameters, and thus show how it can
be optimized using gradient ascent (Section 3.4).
We conclude the description of SoftRank by discussing
how the single degree of freedom represented by the global
scale of the scores is handled (Section 3.5), and finally dis-
cuss computational issues (Section 3.6).
3.1 Smoothing Scores
Rather than representing scores as deterministic values, we will treat them as smoothed score distributions (see Figure 2). The simplest way to do this, and the approach we adopt for the remainder of this paper, is to give every score the same smoothing using equal-variance Gaussian distributions. Hence the deterministic score $s_j$ in (1) becomes the mean of a Gaussian score distribution¹, with a shared smoothing variance $\sigma_s^2$:
$$p(s_j) = \mathcal{N}(s_j \mid \bar{s}_j, \sigma_s^2) = \mathcal{N}(s_j \mid f(w, x_j), \sigma_s^2). \tag{8}$$
An alternative motivation would be to consider the source of
noise as an inherent uncertainty in the model parameters w,
arising from inconsistency between the ranking model and
the training data (see the left side of Figure 4). This would
be the natural result of a Bayesian approach to the learning
task. Mapping parameter distributions through the non-
linearities of a 2-layer neural net that we use in this paper
is not straightforward, so we do not pursue this idea here.
¹ Using $\mathcal{N}(x \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-0.5} \exp[-(x - \mu)^2 / 2\sigma^2]$.
[Figure 1 panels: p(s) against Score s for s1, s2, s3; p(r1) against Rank r1; p(r2) against Rank r2; p(r3) against Rank r3.]
Figure 1: Deterministic scores and ranks: Three document scores as point (deterministic) values and their corresponding rank distributions. The lowest scoring document $s_1$ is certain to be ranked in the lowest position 2.
3.2 From Score to Rank Distributions
When we have deterministic scores, we have determinis-
tic rank distributions, as shown in Figure 1. This section
presents a way of quantifying what happens to the rank
distributions when we add noise to (smooth) the score dis-
tributions, shown in Figure 2.
The rank distributions shown in Figure 2 may be simu-
lated by the following exact generative process: a) sample a
vector of $N$ scores, one from each score distribution, b) sort
the score samples and c) accumulate histograms of the re-
sulting ranks for each document. However, we wish to avoid
the sort so we can perform gradient based optimization, and
so the remainder of this section presents an approximate al-
gorithm for generating the rank distributions that avoids an
explicit sort.
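The generative process (a) to (c) above can be simulated directly. The following Python sketch is one possible rendering, assuming Gaussian score distributions with a shared standard deviation `sigma_s`; the function name, the sample count, and the fixed seed are illustrative assumptions.

```python
import random

def sampled_rank_distributions(mean_scores, sigma_s, num_samples=20000, seed=0):
    """Estimate p_j(r) by sampling scores, sorting, and counting ranks."""
    rng = random.Random(seed)
    n = len(mean_scores)
    counts = [[0] * n for _ in range(n)]  # counts[j][r]
    for _ in range(num_samples):
        # (a) one score draw per document
        draws = [rng.gauss(mean, sigma_s) for mean in mean_scores]
        # (b) sort by decreasing score; rank 0 is the best rank
        order = sorted(range(n), key=lambda j: -draws[j])
        # (c) accumulate the rank histogram of each document
        for rank, j in enumerate(order):
            counts[j][rank] += 1
    return [[c / num_samples for c in row] for row in counts]
```

Every sample is a full permutation, so each row (document) and each column (rank) of the estimated matrix sums to one. This simulation uses the sort that the rest of the section sets out to avoid; it is useful only as a reference against which an approximation can be compared.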
For a given $\mathrm{doc}_j$, consider the probability that another $\mathrm{doc}_i$ will rank above $\mathrm{doc}_j$. Denoting $S_j$ as a draw from $p(s_j)$, we require the probability that $S_i > S_j$, or equivalently $\Pr(S_i - S_j > 0)$. The difference of two Gaussian random variables is itself Gaussian [14], so the required probability is simply the integral of that Gaussian over the positive half-line. The probability that document $i$ beats document $j$, which we will henceforth refer to as $\pi_{ij}$, is therefore:
$$\pi_{ij} \equiv \Pr(S_i - S_j > 0) = \int_0^\infty \mathcal{N}(s \mid \bar{s}_i - \bar{s}_j, 2\sigma_s^2)\, ds. \tag{9}$$
This quantity represents the fraction of times we would expect $\mathrm{doc}_i$ to rank higher than $\mathrm{doc}_j$ on repeated pairwise samplings from the two Gaussian score distributions. For example, in Figure 2, we would expect $\pi_{32} > \pi_{12}$. In other words, if we were to draw two pairs $\{S_1, S_2\}$ and $\{S_3, S_2\}$, $S_3$ is more likely to win its pairwise contest against $S_2$ than $S_1$ is in its contest.
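The integral in (9) is a Gaussian tail probability and so has a closed form in terms of the error function. The sketch below assumes `sigma_s` denotes the shared standard deviation (the square root of the smoothing variance); the function names are illustrative.

```python
import math

def pairwise_win_probability(mean_i, mean_j, sigma_s):
    """pi_ij = Pr(S_i - S_j > 0), with S_i - S_j ~ N(mean_i - mean_j, 2 * sigma_s^2)."""
    # Phi(delta / (sqrt(2) * sigma_s)) written with erf: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
    return 0.5 * (1.0 + math.erf((mean_i - mean_j) / (2.0 * sigma_s)))

def pairwise_matrix(mean_scores, sigma_s):
    """Matrix with entry [i][j] = pi_ij; the diagonal is unused and set to 0."""
    n = len(mean_scores)
    return [[pairwise_win_probability(mean_scores[i], mean_scores[j], sigma_s)
             if i != j else 0.0
             for j in range(n)]
            for i in range(n)]
```

Two useful sanity checks follow from the definition: equal means give a probability of one half, and the two directions of a contest are complementary, so that pi_ij + pi_ji = 1.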
Now we use these pairwise probabilities to generate ranks. We argue intuitively that if we were to add up the probabilities of a document being beaten by each of the other documents, then we would have a quantity that is related to the expected rank of the document being beaten. In other words, if a document is never beaten, its rank will be 0, the best rank. More generally, using the pairwise contest trick, we can write down an expression for the expected rank $r_j$ of document $j$ as:
$$E[r_j] = \sum_{i=1, i \neq j}^{N} \pi_{ij} \tag{10}$$
which can be easily computed using (9). As an example, Figure 2 shows what happens to the rank distributions when we smooth the scores: the expected rank of document 3 is between 0 (best) and 1, and documents 1 and 2 have an expected rank between 1 and 2 (worst).
[Figure 2 panels: p(s) against Score s for s1, s2, s3; p(r1) against Rank r1; p(r2) against Rank r2; p(r3) against Rank r3.]
Figure 2: From score to rank distributions: Smoothed scores for 3 documents and the resulting 3 rank distributions.
The actual distribution of the rank $r_j$ of a document $j$ under the pairwise contest approximation is obtained by considering the rank $r_j$ as a Binomial-like random variable, equal to the number of successes of $N-1$ Bernoulli trials, where the probability of success is the probability that document $j$ is beaten by another document $i$, namely $\pi_{ij}$. If $i$ beats $j$ then $r_j$ goes up by one.
However, because the probability of success is different for each trial, it is a more complex discrete distribution than the Binomial: we call it the Rank-Binomial distribution. Like the Binomial, it has a combinatoric flavour: there are few ways that a document can end up with top (and bottom) rank, and many ways of ranking in the middle. Unlike the Binomial, it does not have an analytic form. However, it can be computed using a standard result from basic probability theory, that the probability density function (pdf) of a sum of independent random variables is the convolution of the individual pdfs [14]. In this case we have a sum of $N-1$ independent Bernoulli (coin-flip) distributions, each with a probability of success $\pi_{ij}$. This yields an exact recursive computation for the distribution of ranks as follows.
If we define the initial rank distribution for document $j$ as $p_j^{(1)}(r)$, where we have just the document $j$, then the rank can only have value zero (the best rank) with probability one:
$$p_j^{(1)}(r) = \delta(r) \tag{11}$$
where $\delta(x) = 1$ only when $x = 0$ and zero otherwise. Now we have $N-1$ other documents that contribute to the rank distribution that we will index with $i = 2..N$. Each time we add a new document $i$, the event space of the rank distribution gets one larger, taking the $r$ variable to a maximum of $N-1$ on the last iteration. The new distribution over the ranks is updated by applying the convolution process described above, giving the following recursive relation:
$$p_j^{(i)}(r) = p_j^{(i-1)}(r-1)\,\pi_{ij} + p_j^{(i-1)}(r)\,(1 - \pi_{ij}). \tag{12}$$
This can be interpreted in the following more intuitive manner. If we add document $i$, we can write the probability of rank $r_j$ as a sum of two parts corresponding to the new document $i$ beating document $j$ or not. If $i$ beats $j$ then the probability of being in rank $r$ at this iteration is equal to the probability of being in rank $r-1$ on the previous iteration, and we have the situation covered by the first term on the right of (12). Conversely, if the new document leaves the rank of $j$ unchanged (it loses), the probability of being in rank $r$ is the same as it was in the last iteration, corresponding to the second term on the right of (12).
We note that we need to define $p_j^{(i)}(r) = 0$ if $r < 0$. We define the final rank distribution $p_j(r) \equiv p_j^{(N)}(r)$. Figure 2 shows these distributions for the simple 3 score case.
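The recursion (11) and (12) translates into a few lines of code. In the sketch below, `beat_probs` is assumed to hold the N-1 values of pi_ij for a fixed document j, in any order; the function name is illustrative.

```python
def rank_binomial(beat_probs):
    """Rank distribution p_j(r), r = 0..N-1, from the N-1 probabilities pi_ij."""
    dist = [1.0]  # equation (11): with only document j present, its rank is 0
    for pi in beat_probs:
        new_dist = [0.0] * (len(dist) + 1)
        for r, p in enumerate(dist):
            new_dist[r + 1] += p * pi       # document i wins: rank of j goes up by one
            new_dist[r] += p * (1.0 - pi)   # document i loses: rank of j is unchanged
        dist = new_dist
    return dist
```

With two opponents that each win with probability one half, the result is the ordinary Binomial [0.25, 0.5, 0.25]. For any input, the mean of the returned distribution equals the sum of the beat probabilities, which is the expected rank of equation (10).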
To conclude this section, we note that the pairwise contest
trick yields Rank-Binomial rank distributions, which are an
approximation to the true rank distributions. Their com-
putation does not require an explicit sort. Simulations have
shown that this gives similar rank distributions to the true
generative process. We can improve these approximations
further by performing a sequence of column and row operations on the $[p_j(r)]$ matrix: divide each column by the
column sums, then divide each row of the resulting matrix
by the row sums, and iterate to convergence. This process is
known as Sinkhorn scaling, its purpose being to convert the
original matrix to a doubly-stochastic matrix. The solution
can be shown to minimize the Kullback-Leibler distance of
the scaled matrix from the original matrix [2]. We will show
later in our results that we can successfully optimize NDCG
using these approximate rank distributions, which further
justifies the pairwise independence approximation and the
Sinkhorn post-processing.
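The alternating normalization described above can be sketched as follows. This is a minimal illustration that assumes a square matrix with strictly positive entries (so that no row or column sum is zero); the function name, iteration cap, and tolerance are hypothetical choices.

```python
def sinkhorn_scale(matrix, max_iters=1000, tol=1e-10):
    """Alternately normalize columns and rows until the matrix is (nearly) doubly stochastic."""
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(max_iters):
        # Divide each column by its column sum.
        col_sums = [sum(m[j][r] for j in range(n)) for r in range(n)]
        m = [[m[j][r] / col_sums[r] for r in range(n)] for j in range(n)]
        # Divide each row by its row sum.
        row_sums = [sum(row) for row in m]
        m = [[value / row_sums[j] for value in m[j]] for j in range(n)]
        # Stop once the columns also sum to one within the tolerance.
        col_sums = [sum(m[j][r] for j in range(n)) for r in range(n)]
        if max(abs(c - 1.0) for c in col_sums) < tol:
            break
    return m
```

Applied to the matrix whose row j is the Rank-Binomial distribution of document j, this enforces the property that each rank is occupied by exactly one document in expectation, which the independent pairwise contests do not guarantee on their own.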
3.3 SoftNDCG
This section shows how we can use rank distributions to
smooth traditional IR metrics. The general approach is to
take the expectation of the IR metric with respect to the
rank distribution. As an example, we will now examine the
specific case of NDCG.
The expression for deterministic NDCG was given in (2) as $G = G_{\max}^{-1} \sum_{r=0}^{N-1} g(r)\, D(r)$. We set out to compute the expected NDCG given the rank distributions described in the last section. Rewriting NDCG as a sum over document indices rather than document ranks we get:
$$G = G_{\max}^{-1} \sum_{j=1}^{N} g(j)\, D(r_j). \tag{13}$$
Here $g(j)$ denotes the gain of document $j$ and $r_j$ its rank. With reference to Figure 3 we replace the deterministic discount $D(r_j)$ with the expected discount $E[D(r_j)]$. Thus we define soft NDCG $\mathcal{G}$ as:
$$\mathcal{G} = G_{\max}^{-1} \sum_{j=1}^{N} g(j)\, E[D(r_j)]. \tag{14}$$
The expected discount is obtained by mapping the rank distribution through the non-linear deterministic discount function (again, see Figure 3) to give:
$$\mathcal{G} = G_{\max}^{-1} \sum_{j=1}^{N} g(j) \sum_{r=0}^{N-1} D(r)\, p_j(r) \tag{15}$$
where the rank distribution $p_j(r)$ is given in (12).
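Putting (9), (12) and (15) together gives a direct way to evaluate SoftNDCG without a sort over scores. The sketch below is an illustration under stated assumptions: `sigma_s` is the shared standard deviation, gains are 2^label, the discount is 1/log(2 + r) with the natural logarithm, and no Sinkhorn scaling is applied. The function names are hypothetical.

```python
import math

def beat_probability(mean_i, mean_j, sigma_s):
    # Equation (9) in closed form.
    return 0.5 * (1.0 + math.erf((mean_i - mean_j) / (2.0 * sigma_s)))

def rank_distribution(j, mean_scores, sigma_s):
    # Recursion (11)-(12) for document j.
    dist = [1.0]
    for i in range(len(mean_scores)):
        if i == j:
            continue
        pi = beat_probability(mean_scores[i], mean_scores[j], sigma_s)
        new_dist = [0.0] * (len(dist) + 1)
        for r, p in enumerate(dist):
            new_dist[r + 1] += p * pi
            new_dist[r] += p * (1.0 - pi)
        dist = new_dist
    return dist

def soft_ndcg(mean_scores, labels, sigma_s):
    n = len(mean_scores)
    gains = [2.0 ** label for label in labels]
    discounts = [1.0 / math.log(2 + r) for r in range(n)]
    # The only sort is over the fixed labels, to compute the normalizer G_max.
    g_max = sum(g * d for g, d in zip(sorted(gains, reverse=True), discounts))
    total = 0.0
    for j in range(n):
        p_j = rank_distribution(j, mean_scores, sigma_s)
        expected_discount = sum(discounts[r] * p_j[r] for r in range(n))  # E[D(r_j)]
        total += gains[j] * expected_discount
    return total / g_max
```

With a very small `sigma_s` the rank distributions become deterministic and the value coincides with ordinary NDCG; with a larger `sigma_s` the value responds smoothly to small changes in any score.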
[Figure 3 panels: rank distribution p(r) against rank r (0 to 9); discount distribution p(d) against discount d (0 to 1); the two are linked by the discount function D(r).]
Figure 3: From rank to discount distribution. The rank distribution is mapped through the non-linear discount function $D$, to give a discrete distribution over discounts $p(d)$ whose expectation we substitute for the deterministic discount to obtain SoftNDCG.
3.4 Gradient of SoftNDCG
Having derived an expression for SoftNDCG, we now differentiate it with respect to the weight vector. The derivative with respect to the weight vector with $K$ elements is:
$$\frac{\partial \mathcal{G}}{\partial w} = \begin{bmatrix} \frac{\partial \bar{s}_1}{\partial w_1} & \cdots & \frac{\partial \bar{s}_N}{\partial w_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial \bar{s}_1}{\partial w_K} & \cdots & \frac{\partial \bar{s}_N}{\partial w_K} \end{bmatrix} \begin{bmatrix} \frac{\partial \mathcal{G}}{\partial \bar{s}_1} \\ \vdots \\ \frac{\partial \mathcal{G}}{\partial \bar{s}_N} \end{bmatrix}. \tag{16}$$
The first matrix is defined by the neural net model and is computed via backpropagation [12]. The second vector is the gradient of our objective with respect to the score means. As with LambdaRank above, our task is to define this gradient vector for each document in a training query. Taking a single element of this gradient vector corresponding to a document with index $m$ ($1 \le m \le N$), we can differentiate (15) to obtain:
$$\frac{\partial \mathcal{G}}{\partial \bar{s}_m} = G_{\max}^{-1} \sum_{j=1}^{N} g(j) \sum_{r=0}^{N-1} D(r)\, \frac{\partial p_j(r)}{\partial \bar{s}_m}. \tag{17}$$
Intuitively, this says that changing score $\bar{s}_m$ affects $\mathcal{G}$ via potentially all the rank distributions, as moving a score will affect every document’s rank distribution. The resultant change in each rank distribution will induce a change in the expected gain for each document determined by the non-linear discount function $D(r)$.
[Figure 4 factor graph, for documents j = 1,…,N: weights $w$ and features $x_j$ enter $f(x_j, w)$ to give scores $s_j$; Pairwise Contests give $\pi_j$; Rank-Binomial gives ranks $r_j$; Discount $D(r)$ gives $d_j$; these combine with the gains $g_j$ in the NDCG factor to give $\mathcal{G}$.]
Figure 4: A factor graph of the distributions for a query. In our development of SoftRank to date, we have only worked with Gaussian scores $s_j$. These map to Bernoulli vectors $\pi_j$ which provide the success probabilities for the computation of the Rank-Binomials over ranks $r_j$ for each document $1..N$ (see Section 3.2). Then the rank distributions get mapped in a non-linear way through the discount function $D(r)$ to give a distribution over discounts $d_j$ (see Section 3.3). Finally, combining the expected discount with the gain of the label over all documents, we arrive at the expected SoftNDCG.
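Hence we need a parallel recursive computation to obtain the required derivative of $p_j(r)$. Denoting $\psi_{m,j}^{(i)}(r) = \frac{\partial p_j^{(i)}(r)}{\partial \bar{s}_m}$, it is easy to show from (12) that:
$$\begin{aligned} \psi_{m,j}^{(1)}(0) &= 0 \\ \psi_{m,j}^{(i)}(r) &= \psi_{m,j}^{(i-1)}(r-1)\,\pi_{ij} + \psi_{m,j}^{(i-1)}(r)\,(1 - \pi_{ij}) + \left( p_j^{(i-1)}(r-1) - p_j^{(i-1)}(r) \right) \frac{\partial \pi_{ij}}{\partial \bar{s}_m} \end{aligned} \tag{18}$$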
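where again the recursive process runs $i = 2..N$. Considering now the last term on the right of (18), differentiating $\pi_{ij}$ with respect to $\bar{s}_m$ using (9) yields three different cases (given that we already know $i \neq j$, so $m = i = j$ is not possible). Using the fact that
$$\frac{\partial}{\partial \mu} \int_0^\infty \mathcal{N}(x \mid \mu, \sigma^2)\, dx = \mathcal{N}(0 \mid \mu, \sigma^2) \tag{19}$$
it can easily be shown from (9) that:
$$\frac{\partial \pi_{ij}}{\partial \bar{s}_m} = \begin{cases} \mathcal{N}(0 \mid \bar{s}_m - \bar{s}_j, 2\sigma_s^2) & m = i,\ m \neq j \\ -\mathcal{N}(0 \mid \bar{s}_i - \bar{s}_m, 2\sigma_s^2) & m \neq i,\ m = j \\ 0 & m \neq i,\ m \neq j \end{cases} \tag{20}$$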
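and so substituting (20) in (18), we can now run the recursion for the derivatives. We define the result of this computation as the $N$-vector over ranks:
$$\frac{\partial p_j(r)}{\partial \bar{s}_m} \equiv \psi_{m,j} = \left[ \psi_{m,j}^{(N)}(0), \ldots, \psi_{m,j}^{(N)}(N-1) \right]. \tag{21}$$
Using this matrix notation we substitute the result in (17):
$$\frac{\partial \mathcal{G}}{\partial \bar{s}_m} = \frac{1}{G_{\max}} \begin{bmatrix} g_1, \ldots, g_N \end{bmatrix} \begin{bmatrix} \psi_{m,1} \\ \vdots \\ \psi_{m,N} \end{bmatrix} \begin{bmatrix} d_0 \\ \vdots \\ d_{N-1} \end{bmatrix}. \tag{22}$$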
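We now define the gain vector $g$ (by document), the discount vector $d$ (by rank) and the $N \times N$ square matrix $\Psi_m$ whose rows are the rank distribution derivatives implied above:
$$\frac{\partial \mathcal{G}}{\partial \bar{s}_m} = \frac{1}{G_{\max}}\, g^T \Psi_m d. \tag{23}$$
So to compute the $N$-vector gradient of $\mathcal{G}$, which we define as $\nabla \mathcal{G} = \left[ \frac{\partial \mathcal{G}}{\partial \bar{s}_1}, \ldots, \frac{\partial \mathcal{G}}{\partial \bar{s}_N} \right]$, we need to compute $\Psi_m$ for each document.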
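The derivative recursion can be checked against finite differences. The sketch below runs (12) and (18) side by side for every pair (j, m) and then applies (17). It is written for clarity rather than speed (the naive double loop costs more than an implementation that shares work across m), and it assumes `sigma_s` is the shared standard deviation, gains of 2^label, and a natural-log discount. All function names are illustrative.

```python
import math

def pi_ij(s_i, s_j, sigma_s):
    # Equation (9) in closed form via the error function.
    return 0.5 * (1.0 + math.erf((s_i - s_j) / (2.0 * sigma_s)))

def dpi_ij(i, j, m, s, sigma_s):
    # Equation (20): derivative of pi_ij with respect to the mean score of document m.
    if m != i and m != j:
        return 0.0
    mu, var = s[i] - s[j], 2.0 * sigma_s ** 2
    density_at_zero = math.exp(-mu * mu / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return density_at_zero if m == i else -density_at_zero

def rank_dist_and_derivative(j, m, s, sigma_s):
    # p holds p_j(r); psi holds its derivative with respect to the mean score of m.
    p, psi = [1.0], [0.0]
    for i in range(len(s)):
        if i == j:
            continue
        pi, dpi = pi_ij(s[i], s[j], sigma_s), dpi_ij(i, j, m, s, sigma_s)
        new_p = [0.0] * (len(p) + 1)
        new_psi = [0.0] * (len(p) + 1)
        for r in range(len(p)):
            new_p[r] += p[r] * (1.0 - pi)
            new_p[r + 1] += p[r] * pi
            # Product rule applied to each term of (12) gives (18).
            new_psi[r] += psi[r] * (1.0 - pi) - p[r] * dpi
            new_psi[r + 1] += psi[r] * pi + p[r] * dpi
        p, psi = new_p, new_psi
    return p, psi

def soft_ndcg_and_gradient(s, labels, sigma_s):
    n = len(s)
    g = [2.0 ** label for label in labels]
    d = [1.0 / math.log(2 + r) for r in range(n)]
    g_max = sum(gi * di for gi, di in zip(sorted(g, reverse=True), d))
    value, grad = 0.0, [0.0] * n
    for j in range(n):
        for m in range(n):
            p, psi = rank_dist_and_derivative(j, m, s, sigma_s)
            grad[m] += g[j] * sum(d[r] * psi[r] for r in range(n)) / g_max  # (17)
        value += g[j] * sum(d[r] * p[r] for r in range(n)) / g_max  # (15)
    return value, grad
```

Because the pairwise probabilities depend only on score differences, shifting every score by the same amount leaves SoftNDCG unchanged, so the components of the gradient sum to zero. This gives a quick consistency check in addition to the finite-difference comparison.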
3.5 Scale Optimizations
Any optimization of an objective that is based on ranks of
documents should consider carefully the degree of freedom
that corresponds to an arbitrary scaling of the scores. Mul-
tiplying all the scores by a common factor does not affect
the ranks, and if it is not handled properly, could lead to an
undesirable degeneracy in the optimization process.
With SoftRank, this global scale factor is equivalent to the score variance $\sigma_s^2$: if we multiply all the scores by some common factor, it is the same thing as dividing $\sigma_s$ by that factor. In the experiments described in this paper, $\sigma_s$ is set to some initial value, and is not changed during optimization. Hence the scale factor is controlled by $\nabla \mathcal{G} \cdot \bar{s}$, the component of the SoftNDCG gradient parallel to the current mean score vector $\bar{s}$.
As a simple example, imagine a query with 2 documents
that were stubbornly in the wrong order, say the relevant
document was always below the irrelevant. In this situation,
SoftRank would steadily decrease the scale which would
have the effect of making the Gaussians overlap more, thus
increasing the probability that the relevant document could
beat the irrelevant above it. Conversely, if the documents
were correctly ordered, the scale would increase, effectively
reducing the score variance.
Following this rather intuitive argument, we conclude that
SoftRank does control the scale of the scores in a sensible,
non-degenerate way. In effect it overloads the global scale
degree of freedom to implement an annealing schedule. It
is the subject of ongoing study as to whether this default
schedule can be improved upon.
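The equivalence between the global score scale and the smoothing width is easy to verify on the pairwise probabilities of (9), from which everything else in SoftRank is built. The sketch below is illustrative only; the function names are hypothetical and `sigma_s` is taken to be the shared standard deviation.

```python
import math

def pairwise_probs(mean_scores, sigma_s):
    """Matrix of pi_ij from equation (9); diagonal entries equal 0.5 and are unused."""
    n = len(mean_scores)
    return [[0.5 * (1.0 + math.erf((mean_scores[i] - mean_scores[j]) / (2.0 * sigma_s)))
             for j in range(n)]
            for i in range(n)]

def scaled(mean_scores, factor):
    """Multiply every mean score by a common factor."""
    return [factor * s for s in mean_scores]

# Two-document example from the text: the relevant document (index 0)
# sits below the irrelevant one (index 1).
stubborn = [-1.0, 1.0]
p_full_scale = pairwise_probs(stubborn, 1.0)[0][1]
p_small_scale = pairwise_probs(scaled(stubborn, 0.1), 1.0)[0][1]
```

Shrinking the scale moves the probability that the relevant document wins closer to one half, which is the increased overlap of the Gaussians described above.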
3.6 Computational Considerations
For a given query of N documents, calculation of the $\pi_{ij}$ is
$O(N^2)$, calculation of all the $p_j(r)$ is $O(N^3)$, and calculation
of the SoftNDCG is $O(N^2)$. Similar complexity arises for the
gradient calculations. So the calculations are dominated by
the recursions in (12) and (18).
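To make the cubic cost concrete, here is a minimal sketch of a rank distribution recursion. It assumes that recursion (12) is the usual update that adds one competing document at a time, with ranks counted from zero; the function name is illustrative.

```python
def rank_distribution(pi_j):
    """Distribution of the zero-based rank of document j.

    pi_j : probabilities that each *other* document beats j (length N - 1).
    Returns p with p[r] = Pr(exactly r documents beat j), for r = 0 .. N - 1.
    """
    p = [1.0]  # with no competitors, j sits at rank 0 with certainty
    for pi in pi_j:
        nxt = [0.0] * (len(p) + 1)
        for r, mass in enumerate(p):
            nxt[r] += mass * (1.0 - pi)  # competitor loses: rank unchanged
            nxt[r + 1] += mass * pi      # competitor wins: j drops one rank
        p = nxt
    return p
```

Each call performs O(N^2) work (N - 1 competitors, up to N ranks each), and it is called once per document, which gives the O(N^3) total quoted above.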
A substantial computational saving can be made by approximating
all but a few of the Rank-Binomial distributions. The motivation
for this is that a true binomial distribution, with N samples and
probability of success $\pi$, can be approximated by a normal
distribution with mean $N\pi$ and variance $N\pi(1 - \pi)$ when $N\pi$
is large enough. For the rank binomial distribution, $\pi$ is not
constant, but simulations confirm that it can be approximated
similarly, for a given j, by a normal distribution with mean equal
to the expected rank $\sum_{i=1, i \neq j}^{N} \pi_{ij}$ and variance equal to
$\sum_{i=1, i \neq j}^{N} \pi_{ij}(1 - \pi_{ij})$. As the approximation is an explicit
function of the $\pi_{ij}$, we can easily calculate the gradients of
the approximated $p_j(r)$ with respect to $\pi_{ij}$, and therefore with
respect to the $\bar{s}_m$. Using this approximation allows us to restrict
the expensive recursive calculations to a few documents at
the top and bottom of the ranking.
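A minimal sketch of the moment-matched normal approximation follows. The paper does not state how the continuous density is turned into probabilities over integer ranks; the half-unit bins and the final renormalisation used here are assumptions, as is the function name.

```python
import math


def normal_rank_approximation(pi_j):
    """Approximate p_j(r) by a discretised normal with matched mean and variance.

    pi_j : probabilities that each other document beats j (length N - 1).
    Returns a list of N probabilities over zero-based ranks.
    """
    n_ranks = len(pi_j) + 1
    mean = sum(pi_j)                          # expected rank
    var = sum(p * (1.0 - p) for p in pi_j)    # rank variance

    if var == 0.0:
        # Degenerate case: every pi is 0 or 1, so the rank is known exactly.
        point = [0.0] * n_ranks
        point[int(round(mean))] = 1.0
        return point

    std = math.sqrt(var)

    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

    # Assumed discretisation: rank r receives the normal mass on [r - 0.5, r + 0.5].
    approx = [cdf(r + 0.5) - cdf(r - 0.5) for r in range(n_ranks)]
    total = sum(approx)
    return [a / total for a in approx]
```

This costs O(N) per document instead of O(N^2), which is where the saving comes from when it replaces the exact recursion for documents in the middle of the ranking.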
4. EXPERIMENTS
We performed experiments on three corpora: queries from
the TREC .GOV corpus, enterprise search data, and queries
from a commercial web search engine. For all corpora, we
used only those documents containing all the query terms
(the AND set) for both training and testing. This simpli-
fication is consistent with common practice for large scale
commercial search engines.
4.1 Training Set Subsampling
To make training tractable, it is necessary to subsample
the irrelevant documents to some degree. If we do not do
this, the data we need to process is dominated unnecessar-
ily by uninformative irrelevant documents. In this paper, we
adopted a very simple approach. We took all the judged doc-
uments, whatever the relevance label. We then augmented
this set with documents drawn at random from the AND
set of unlabeled documents, which we assume to be irrel-
evant, until we have drawn a number equal to the size of
the judged set, or we have drawn 30, whichever happens
first. This threshold is admittedly rather arbitrary, and it is
the subject of ongoing experiments as to how sensitive each
objective is to this number.
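The subsampling rule can be sketched as follows. The data layout (pairs of document id and label), the function name and the handling of AND sets that contain fewer unlabeled documents than requested are assumptions made for the example.

```python
import random


def subsample_training_docs(judged, unlabeled_and_set, cap=30, rng=None):
    """Keep all judged documents and add randomly drawn unlabeled ones.

    judged            : list of (doc_id, label) pairs, kept whatever the label
    unlabeled_and_set : list of doc_ids from the AND set with no judgment
    cap               : maximum number of unlabeled documents to draw (30 in the text)
    """
    rng = rng or random.Random()
    # Draw as many as there are judged documents, but never more than the cap
    # (or more than are available, an edge case the text does not discuss).
    n_extra = min(len(judged), cap, len(unlabeled_and_set))
    extra = rng.sample(list(unlabeled_and_set), n_extra)
    # Unlabeled documents are assumed to be irrelevant, encoded here as label 0.
    return list(judged) + [(doc_id, 0) for doc_id in extra]
```

For instance, a query with 5 judged documents yields 10 training documents, while a query with 50 judged documents yields 80.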
4.2 .GOV
For the TREC .GOV corpus, we took the following sets
of topics: name/home page 2003 (300 queries), topic dis-
tillation 2003 (50 queries) and all queries from 2004 (225
queries). As mentioned above, we chose to work with AND
set documents only, so we dropped all queries with no rel-
evant documents in the AND set, and this left us with 508
queries. This number would have been greater if we had per-
formed stemming. The queries had binary relevance judg-
ments.
The data was partitioned for 5-fold (3:1:1) cross valida-
tion, giving 5 runs, each with roughly 300 training queries,
100 validation and 100 test. It is possible that a different
ratio of train/validate/test queries might give better results,
and again this is an area of current investigation.
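One way to obtain five 3:1:1 runs is to cut the queries into five folds and rotate the roles of the folds. The paper does not say how folds were assigned to roles, so the rotation below is an assumption and the function name is illustrative.

```python
def five_fold_splits(query_ids):
    """Yield (train, validation, test) lists for a 5-fold 3:1:1 rotation."""
    folds = [query_ids[k::5] for k in range(5)]
    for run in range(5):
        test = folds[run]
        validation = folds[(run + 1) % 5]
        # The remaining three folds form the training set.
        train = [q for k in range(2, 5) for q in folds[(run + k) % 5]]
        yield train, validation, test
```

With 508 queries each fold holds 101 or 102 queries, so every run has roughly 300 training, 100 validation and 100 test queries, and each query is used for testing exactly once.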
The index of our crawl of .GOV has 3 structural fields:
body, title and anchor. We used BM25F, which in this
case has 6 tuneable parameters that can be learned using
back-propagation following the process described in [16]. In
addition, we used PageRank as a further feature.
These two features were the input to a 2-layer net with
3 hidden nodes. This number was obtained by trying a few
values for the LambdaRank baseline. No further effort was
made to find the best number for each individual objective.
No linear model (single layer net) results are reported here,
as it is well-established in the literature now [3] that they
do not perform as well as non-linear models. The training
queries were subsampled as described in section 4.1. Com-
plete AND sets were used for validation and test.
4.3 Enterprise Search
This data came from the Intranet of a large corporation
with about 15 million documents. Queries were sampled
from the existing search service log. Judges were asked to
choose queries they understood and assign one of 4 rele-
vance levels. We used 761 queries and performed 5-fold cross
validation here with the same train/validate/test ratios as
above. The index had 6 fields, including author and url, in
addition to the body, title and anchor used for .GOV. We
used the 12-parameter BM25F, alongside about 8 query in-
dependent features, including file type and url length. We
used a 2-layer net with 4 hidden nodes. Training, validation
and test queries were sub-sampled as described in section
4.1.
4.4 Web Search Data
This data came from a large commercial web search engine.
We used 4096 training, 2651 validation and 2560 test
queries, sampled from the live query log, with 5 relevance
levels. In this experiment we used about 380 features. Because
we had more features and training queries, we used
a more complex 2-layer model with 10 hidden nodes. The
training queries were sub-sampled as described in section
4.1. While complete AND sets were not used for validation
and test, these splits still had about 20 times as many randomly
sampled documents from the AND set, which were
assumed to be irrelevant.
4.5 Optimization
For each run we initialize the weights to a random setting
near zero, so that all nodes in the neural net start off un-
saturated. Following standard neural net practice, all input
features are normalized so that they are zero mean and unit
standard deviation over the training set queries. We call a
cycle through all training queries an epoch. If the training
set NDCG@10 (G10) goes for 16 epochs without increasing,
we reinitialize the weights. Our runs were for 128 epochs.
Like [3] we adopted a stochastic gradient descent approach,
where the weights of the model were updated in a batch
mode after each query. This optimization technique is very
simple. There is one parameter that corresponds to the dis-
tance taken along the gradient vector at each weight update,
called the learning rate. We set this rate to an initial value,
and we reduce it by a factor of 0.8 each time the end-of-
epoch training set G10 does not improve.
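The end-of-epoch bookkeeping described above can be sketched as a small helper. Two readings are assumed here and are not confirmed by the text: "does not improve" is taken to mean "does not exceed the best value seen since the last initialization", and "reduce by a factor of 0.8" is taken to mean multiplying the rate by 0.8. The function name is hypothetical.

```python
def update_schedule(g10_history, learning_rate, decay=0.8, patience=16):
    """Return (new_learning_rate, reinitialise) after an epoch.

    g10_history : end-of-epoch training-set G10 values since the last
                  (re)initialisation, most recent last
    """
    current = g10_history[-1]
    previous_best = max(g10_history[:-1], default=float("-inf"))
    if current > previous_best:
        return learning_rate, False

    # No improvement this epoch: shrink the step size (assumed multiplicative).
    learning_rate *= decay
    best_epoch = g10_history.index(max(g10_history))
    epochs_since_best = len(g10_history) - 1 - best_epoch
    # Reinitialise the weights after `patience` epochs without an increase.
    return learning_rate, epochs_since_best >= patience
```

A training loop would call this once per epoch and, when the second return value is true, draw fresh near-zero weights and clear the history.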
Initial values for both the learning rate (all utilities) and
the SoftRank smoothing $\sigma_s$ affect the results significantly.
Therefore, all experiments need to set these values using
the validation set. Multiple training runs are performed
from different initial settings, and the run/epoch that performed
best on the validation set is the one used in the final
evaluation on the test set. We tried learning rates from $10^{-1}$
to $10^{-7}$ and initial smoothing from $10^{0}$ to $10^{-4}$.
For consistency across these experiments, we have only
made use of gradient information. However, the SoftRank
approach supports future incorporation of rank-based met-
rics directly into the optimization process which opens up
the possibility of using more sophisticated optimizers which
do not rely on gradient alone, such as BFGS. This is the
subject of future work.
4.6 Initial Results
We investigated 4 utilities on the three corpora: MSE (3),
RankNet (RN) (5), LambdaRank (LR) (7) and SoftRank
(SR) (23). In this set of initial experiments we used the
same discount function for the training objectives (SR and
LR) as that used in the test NDCG defined in Section 2.1.
The mean NDCG over all test set queries, at cut-offs 3
and 10 (G3, G10), is shown in Table 1. We use the paired
t-test at the 5% significance level on the G10 results only,
as this was the objective used for model selection on the
validation set.
              .GOV          Enterprise    Web
              G3     G10    G3     G10    G3     G10
MSE           56.0   59.5   59.0   62.0   60.7   65.8
RankNet       65.4   67.7   58.1   61.9   60.6   65.4
LambdaRank    65.9   68.1   59.5   62.5   61.4   66.4
SoftRank      66.9   68.9   59.3   62.6   60.4   65.6
Table 1: Test set NDCG@10 and NDCG@3 for the
four utilities and three corpora.
For .GOV, the MSE objective performed surprisingly badly,
and significantly worse than the others. Taking the RN result
as the baseline, LR did not do better (p = 55%) and
SR was significantly better (p = 4%). SR was close to being
significantly better than LR (p = 8%).
For the Enterprise corpus, the MSE objective performed
surprisingly well, being equivalent to RN. LR was not significantly
better than MSE, though nearly so at p = 7%,
and SR > MSE at p = 2%. Finally, SR was not significantly
better than LR.
For the web corpus, characterized by a very complex model
with several thousand parameters, we found that Lamb-
daRank was the best on the test set, significantly beating
all other utilities. MSE did surprisingly well. MSE, RN and
SR were all not significantly different.
              Training G10
MSE           67.9
RankNet       67.6
LambdaRank    69.2
SoftRank      70.6
Table 2: Training set NDCG@10 for the four objectives
on the web corpus.
Training Set G10
Table 2 compares the training set G10 values. As expected,
we observe that LR and SR are both much better than MSE
and RN. We also note that SR yields a consistently and sig-
nificantly better fit to the training data. This effect was
observed on all three corpora. We conclude that gradient
ascent on SoftNDCG represents a more effective NDCG op-
timization algorithm than the other objectives.
We find it encouraging that, using the same models and
gradient-based optimizer, SoftRank consistently finds better
training set G10 values. In this, SoftNDCG has fulfilled the
goal of creating a smoothed approximation of a rank-based
metric such as NDCG, and optimizing it directly. It seems
therefore that the Rank-Binomial approximation is a good
one.
4.7 Generalization Study on Web Corpus
However, further comment is needed on test set performance.
The training set G10 values lead us to ponder how SR can
give a much better fit to the training data at the same time
as worse test set performance given that the model struc-
tures are identical.
Simpler Linear Model
Our first thought was that we might just be overfitting:
maybe LambdaRank has better natural regularization prop-
erties than SoftNDCG, and the use of early stopping on the
validation set is not sufficient for SR. To test this hypothe-
sis we tried a drastically simpler linear model, reducing the
number of model parameters by a factor of 10 from about
4000 to 400. This simple model could not realistically be
overfitted with 4K training queries.
Model         Train G10   Test G10
2-layer LR    69.2        66.4
2-layer SR    70.6        65.7
Linear LR     67.2        65.2
Linear SR     67.5        64.6
Table 3: Results on a much simpler model: the simple
linear model still shows a gain for LambdaRank.
The results of this experiment are shown in Table 3. As
we would expect, the training set NDCGs are worse for the
linear model than for the non-linear model as the model
is less flexible. More interestingly, we still observe that LR
statistically significantly outperforms SR in the linear model
on the test set. Now given we are very unlikely to be under-
regularized for the reasons stated earlier, this leads us to
conclude that we are not overfitting in the classical sense.
Alternative training discount functions
Another possible explanation is that by using the NDCG
discount function for training, we have allowed the model
to concentrate too much on high-ranking documents in the
training set. In other words, perhaps there is useful infor-
mation in the ordering of documents beyond the top 10 or so
relevant documents that SoftNDCG is effectively ignoring,
which the baseline models do not. Therefore, maybe the
use of a less severe training discount function would allow
SoftRank to generalize better to new queries, by exploiting
more of the training data.
To test this hypothesis we have investigated a variety of
training discount functions that do not decay as quickly as
the regular NDCG discount function. These are shown in
Figure 5, ranging from convex (super-linear with rank, and
denoted α = −1 in Figure 6), through linear (α = 0), to concave
(sub-linear, like the regular NDCG discount; α = 1).
We used these new, deeper discount functions for both training
(in SoftNDCG) and validation NDCG, but retained the
original discount for the test set.
The results of this experiment are shown in Figure 6. For
α = 1 we saw a slightly (not significantly) better G10, as we
are using a validation NDCG with a longer tail. The test
set performance improved to be as good as LambdaRank
for the range −0.4 < α < 0.4, reaching an optimum when
α = 0.0.
[Figure 5: plot of discount D(r), from 0 to 1, against rank r, from 0 to 60.]
Figure 5: Shallower discount functions: We tried a
range of discount functions ranging from super-linear (top,
α = −1) through linear (α = 0) to sub-linear, as used in the
test NDCG (bottom, α = 1).
5. CONCLUSION
We have introduced the idea of using rank distributions
to smooth traditional IR metrics. We have shown how to
compute an approximation to these rank distributions in a
way that involves no explicit sort, and is therefore differentiable
with respect to the model parameters, and so suitable
for memory efficient non-linear machine learning approaches
such as the multi-layer perceptron. We have shown that the
Rank-Binomial is a good and useful approximation to the
true rank distributions by demonstrating that it can reliably
find better training set NDCGs than state-of-the-art
algorithms designed for that purpose.
We have shown also that SoftRank does not, in some
cases, generalize to new queries as well as LambdaRank.
We have shown that this is not due to lack of regularization,
but rather a tendency to focus too much on the top ranks.
We fixed this problem by trying a range of training discount
functions that are less top-heavy, and found that it was pos-
sible to do as well as LambdaRank on the test set but, as
yet, not significantly better.
We believe that SoftRank represents a general and pow-
erful new approach for direct optimization of non-smooth
ranking metrics. Future work will focus on characterizing
better the conditions under which SoftNDCG fails to gener-
alize, and exploring other soft IR metrics.
Acknowledgments
Thanks to Chris Burges, Markus Svensen and Martin Szum-
mer for useful discussions and to Chris Burges for his neural
net code.
[Figure 6: plot of test G10, from 65.5 to 67, against discount function parameter α, from −1 to 1.]
Figure 6: Test set performance with different discount
functions. Test set G10 on web data against the
discount function parameter α as defined in Figure 5. The
horizontal line is the LR baseline. SR does as well as LR
for −0.4 < α < 0.4.
6. REFERENCES
[1] E. Agichtein, E. Brill, and S. Dumais. Improving web
search ranking by incorporating user behavior
information. In SIGIR, 2006.
[2] H. Balakrishnan, I. Hwang, and C. Tomlin.
Polynomial approximation algorithms for belief matrix
maintenance in identity management. In IEEE
Conference on Decision and Control, 2004.
[3] C. Burges, R. Ragno, and Q. V. L. Le. Learning to
rank with nonsmooth cost functions. In NIPS, 2006.
[4] C. Burges, T. Shaked, E. Renshaw, A. Lazier,
M. Deeds, N. Hamilton, and G. Hullender. Learning
to rank using gradient descent. In ICML, 2005.
[5] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li.
Learning to rank: From pairwise approach to listwise
approach. In ICML, 2007.
[6] R. Caruana, S. Baluja, and T. Mitchell. Using the
future to “sort out” the present: Rankprop and
multitask learning for medical risk evaluation. In
NIPS 8, 1996.
[7] W. Chu and Z. Ghahramani. Gaussian processes for
ordinal regression. Journal of Machine Learning
Research, 6:1019–1041, 2005.
[8] K. Crammer and Y. Singer. Pranking with ranking. In
NIPS 14, 2002.
[9] R. Herbrich, T. Graepel, and K. Obermayer. Large
margin rank boundaries for ordinal regression. In
Advances in Large Margin Classifiers, pages 115–132.
MIT Press, 2000.
[10] K. Järvelin and J. Kekäläinen. IR evaluation methods for
retrieving highly relevant documents. In SIGIR, 2000.
[11] T. Joachims. Optimizing search engines using
clickthrough data. In Proceedings of Knowledge
Discovery in Databases, 2002.
[12] Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller.
Efficient backprop, 1998.
[13] D. Metzler, T. Strohman, and W. Croft. Indri at trec
2006: Lessons learned from three terabyte tracks. In
online Proceedings of Text REtrieval Conference, 2005.
[14] A. Papoulis. Probability, Random Variables and
Stochastic Processes, Third Edition. McGraw-Hill,
1991.
[15] S. Robertson, H. Zaragoza, and M. Taylor. A simple
BM25 extension to multiple weighted fields. In
CIKM, pages 42–49, 2004.
[16] M. Taylor, H. Zaragoza, N. Craswell, S. Robertson,
and C. Burges. Optimisation methods for ranking
functions with multiple parameters. In CIKM, 2006.
... Differentiable proxies facilitate the end-to-end training using ranking, with many applications including learning-torank models [44] and neural network-based k-nearest neighbor classifiers [14,49]. Researchers have proposed a variety of differentiable ranking operators, for instance, the SoftRank approach that using random perturbation technique [45], and the method that utilizing the pairwise difference matrix [39]. Among them, the Sinkhorn ranking operator [14] has gained increasing attention due to its broad applicability in recent years. ...
... Ranking is a fundamental and important operation used extensively in various areas, such as machine learning, statistics, and information science [8,14,39,44,45,49]. ...
... It has been proved in [49] that the Sinkhorn ranking operator R h,ε (x, a; y, b) is differentiable with respect to x. Such a differentiable ranking operator is useful in information retrieval [32,39,45]. In this case, the optimal solution to (3) is ...
Preprint
In [Q. Liao et al., Commun. Math. Sci., 20(2022)], a linear-time Sinkhorn algorithm is developed based on dynamic programming, which significantly reduces the computational complexity involved in solving optimal transport problems. However, this algorithm is specifically designed for the Wasserstein-1 metric. We are curious whether the preceding dynamic programming framework can be extended to tackle optimal transport problems with different transport costs. Notably, two special kinds of optimal transport problems, the Sinkhorn ranking and the far-field reflector and refractor problems, are closely associated with the log-type transport costs. Interestingly, by employing series rearrangement and dynamic programming techniques, it is feasible to perform the matrix-vector multiplication within the Sinkhorn iteration in linear time for this type of cost. This paper provides a detailed exposition of its implementation and applications, with numerical simulations demonstrating the effectiveness and efficiency of our methods.
... Instead, the pairwise approach is focused on predicting the relative order between documents [17]- [19]. Finally, the listwise methods attempt to optimize a given performance measure directly on the full list of documents [20]- [22], or propose a loss function on the predicted and the ground truth lists [23], [24]. Different from ranking in IR, our main interest in this work is label ranking which generalizes the basic binary classification problem to multiclass, multilabel, and even hierarchical classification, see [25] for a survey. ...
... Taking into account the minus in front of the min in (20) and the definition of a, we finally recover the softmax loss L(y, f (x)) = log 1 + j =y exp(f j (x) − f y (x)) . ...
Preprint
Top-k error is currently a popular performance measure on large scale image classification benchmarks such as ImageNet and Places. Despite its wide acceptance, our understanding of this metric is limited as most of the previous research is focused on its special case, the top-1 error. In this work, we explore two directions that shed more light on the top-k error. First, we provide an in-depth analysis of established and recently proposed single-label multiclass methods along with a detailed account of efficient optimization algorithms for them. Our results indicate that the softmax loss and the smooth multiclass SVM are surprisingly competitive in top-k error uniformly across all k, which can be explained by our analysis of multiclass top-k calibration. Further improvements for a specific k are possible with a number of proposed top-k loss functions. Second, we use the top-k methods to explore the transition from multiclass to multilabel learning. In particular, we find that it is possible to obtain effective multilabel classifiers on Pascal VOC using a single label per image for training, while the gap between multiclass and multilabel methods on MS COCO is more significant. Finally, our contribution of efficient algorithms for training with the considered top-k and multilabel loss functions is of independent interest.
... To incorporate sorting operations into the backpropagation framework, differentiable approximations, known as soft sorting, have been explored. Examples include smoothed rank operators by adding Gaussian noise [47] and by using sigmoid surrogate functions [39], parameterizing permutations in terms of a differentiable relaxation [32], and relaxing the permutation matrices to be only row-wise stochastic [9]. Of note, Cuturi et al. [5] propose a differentiable proxy by viewing sorting as an optimal assignment problem and relaxing it to an optimal transport problem from the input values to an auxiliary probability measure supported on an increasing family of target values. ...
Preprint
Full-text available
While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces double stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications.
... Our goal is to avoid sorting and use gradient-based optimization techniques instead. Inspired by Softrank by Taylor et al. [44], we propose an approximate algorithm to generate rank distributions without explicit sorting. For a given node, node i , we aim to estimate the probability that node i ranks higher than another node, node j . ...
Article
Full-text available
Adversarial attacks in network security are a growing concern, prompting the need for innovative strategies to enhance both attack and defense mechanisms. This paper explores ways to improve adversarial attacks on the fairness and goodness algorithm (FGA) and review to reviewer (REV2), focusing on predicting trust within signed graphs. Unlike traditional time-based models, FGA and REV2 rely on iterative processes for trust propagation. By analyzing network structures, we identify strong ties and weak ties within FGA and discover preferential paths in REV2 that significantly impact information spread during algorithm iterations. Based on these insights, we propose a new approach called the vicinage attack, which enhances adversarial attacks by strategically targeting edges along these critical pathways. Our work highlights adversarial perturbation patterns that affect trust prediction on signed graphs and emphasizes their wide-reaching impact. These findings not only advance adversarial attack techniques but also deepen our understanding of trust propagation patterns. By clarifying the propagation bias in FGA and REV2, this research provides valuable insights for improving network security and developing better adversarial mitigation techniques in trust prediction.
... Listwise approaches provide the opportunity to directly optimize ranking performance criteria [17]. Representative algorithms include SoftRank [79], SVM map [80], and RankGP [81]. Another subset of listwise approaches choose to optimize listwise ranking losses. ...
Preprint
Objective assessment of image quality is fundamentally important in many image processing tasks. In this work, we focus on learning blind image quality assessment (BIQA) models which predict the quality of a digital image with no access to its original pristine-quality counterpart as reference. One of the biggest challenges in learning BIQA models is the conflict between the gigantic image space (which is in the dimension of the number of image pixels) and the extremely limited reliable ground truth data for training. Such data are typically collected via subjective testing, which is cumbersome, slow, and expensive. Here we first show that a vast amount of reliable training data in the form of quality-discriminable image pairs (DIP) can be obtained automatically at low cost by exploiting large-scale databases with diverse image content. We then learn an opinion-unaware BIQA (OU-BIQA, meaning that no subjective opinions are used for training) model using RankNet, a pairwise learning-to-rank (L2R) algorithm, from millions of DIPs, each associated with a perceptual uncertainty level, leading to a DIP inferred quality (dipIQ) index. Extensive experiments on four benchmark IQA databases demonstrate that dipIQ outperforms state-of-the-art OU-BIQA models. The robustness of dipIQ is also significantly improved as confirmed by the group MAximum Differentiation (gMAD) competition method. Furthermore, we extend the proposed framework by learning models with ListNet (a listwise L2R algorithm) on quality-discriminable image lists (DIL). The resulting DIL Inferred Quality (dilIQ) index achieves an additional performance gain.
... However, the non-continuous and non-differentiable nature of ranking metrics presents an obstacle for optimization algorithms. To circumvent this issue, surrogate objective functions [20,21,22,15,73,91] are utilized. These are continuous and differentiable functions derived from ranking metrics. ...
Preprint
Developing increasingly efficient and accurate algorithms for approximate nearest neighbor search is a paramount goal in modern information retrieval. A primary approach to addressing this question is clustering, which involves partitioning the dataset into distinct groups, with each group characterized by a representative data point. By this method, retrieving the top-k data points for a query requires identifying the most relevant clusters based on their representatives -- a routing step -- and then conducting a nearest neighbor search within these clusters only, drastically reducing the search space. The objective of this thesis is not only to provide a comprehensive explanation of clustering-based approximate nearest neighbor search but also to introduce and delve into every aspect of our novel state-of-the-art method, which originated from a natural observation: The routing function solves a ranking problem, making the function amenable to learning-to-rank. The development of this intuition and applying it to maximum inner product search has led us to demonstrate that learning cluster representatives using a simple linear function significantly boosts the accuracy of clustering-based approximate nearest neighbor search.
... The pairwise approaches (Burges et al., 2005(Burges et al., , 2006 measure the pairwise preferences between item pairs, being reportedly more effective than the pointwise method by capturing the relative importance of the items. Later, the training subjects were extended to a list of items, and the loss was defined over the entire item list (Cao et al., 2007;Xia et al., 2008;Taylor et al., 2008), allowing to obtain more fine-grained relative importance among the items. Recent studies (Nogueira et al., 2020;Zhuang et al., 2023b;Pradeep et al., 2023a,b) have applied pre-trained language models for passage reranking and observed significant performance gains. ...
Preprint
Full-text available
This survey examines the evolution of model architectures in information retrieval (IR), focusing on two key aspects: backbone models for feature extraction and end-to-end system architectures for relevance estimation. The review intentionally separates architectural considerations from training methodologies to provide a focused analysis of structural innovations in IR systems.We trace the development from traditional term-based methods to modern neural approaches, particularly highlighting the impact of transformer-based models and subsequent large language models (LLMs). We conclude by discussing emerging challenges and future directions, including architectural optimizations for performance and scalability, handling of multimodal, multilingual data, and adaptation to novel application domains beyond traditional search paradigms.
Article
Full-text available
In this paper, we introduce principled stochastic algorithms to efficiently optimize Normalized Discounted Cumulative Gain (NDCG) and its top-K variant for deep models. To this end, we first propose novel compositional and bilevel compositional objectives for optimizing NDCG and top-K NDCG, respectively. We then develop two stochastic algorithms to tackle these non-convex objectives, achieving an iteration complexity of O(ϵ4)\mathcal {O}(\epsilon ^{-4}) for reaching an ϵ\epsilon -stationary point. Our methods employ moving average estimators to track the crucial inner functions for gradient computation, effectively reducing approximation errors. Besides, we introduce practical strategies such as initial warm-up and stop-gradient techniques to enhance performance in deep learning. Despite the advancements, the iteration complexity of these two algorithms does not meet the optimal O(ϵ3)\mathcal {O}(\epsilon ^{-3}) for smooth non-convex optimization. To address this issue, we incorporate variance reduction techniques in our framework to more finely estimate the key functions, design new algorithmic mechanisms for solving multiple lower-level problems with parallel speed-up, and propose two types of algorithms. The first type directly tracks these functions with the variance reduced estimators, while the second treats these functions as solutions to minimization problems and employs variance reduced estimators to construct gradient estimators for solving these problems. We manage to establish the optimal O(ϵ3)\mathcal {O}(\epsilon ^{-3}) complexity for both types of algorithms. It is important to highlight that our algorithmic frameworks are versatile and can optimize a wide spectrum of metrics, including Precision@K/Recall@K, Average Precision (AP), mean Average Precision (mAP), and their top-K variants. We further present efficient stochastic algorithms for optimizing these metrics with convergence guarantees. 
We conduct comprehensive experiments on multiple ranking tasks to verify the effectiveness of our proposed algorithms, which consistently surpass existing strong baselines.
Conference Paper
Full-text available
The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions.
Conference Paper
Optimising the parameters of ranking functions with respect to standard IR rank-dependent cost functions has eluded satisfactory analytical treatment. We build on recent advances in alternative differentiable pairwise cost functions, and show that these techniques can be successfully applied to tuning the parameters of an existing family of IR scoring functions (BM25), in the sense that we cannot do better using sensible search heuristics that directly optimize the rank-based cost function NDCG. We also demonstrate how the size of training set affects the number of parameters we can hope to tune this way.
Conference Paper
This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g. title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.
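The difference between the two schemes can be sketched for a single query term. This is a simplified illustration, not the paper's formula: the saturation `tf / (k1 + tf)` omits document-length normalisation and IDF, and the function names, field weights, and `k1 = 1.2` are assumptions chosen for the example.

```python
def saturate(tf, k1=1.2):
    # Simplified BM25-style term-frequency saturation; bounded above by 1.
    return tf / (k1 + tf)

def score_combine_after(field_tfs, weights, k1=1.2):
    # Per-field scheme: saturate each field separately, then combine linearly.
    return sum(weights[f] * saturate(tf, k1) for f, tf in field_tfs.items())

def score_weight_before(field_tfs, weights, k1=1.2):
    # Proposed scheme: weight the term frequencies first, then saturate once.
    pooled_tf = sum(weights[f] * tf for f, tf in field_tfs.items())
    return saturate(pooled_tf, k1)

weights = {"title": 2.0, "body": 1.0}
doc = {"title": 1, "body": 3}     # term occurs once in the title, three times in the body

after = score_combine_after(doc, weights)
before = score_weight_before(doc, weights)

# A title weight of two behaves like an unstructured document with the
# title repeated twice: pooled term frequency 2 * 1 + 3 = 5.
unstructured = saturate(5)

print(after, before, unstructured)
```

In this toy case the linear combination exceeds the saturation bound of 1 that a single unstructured document can never pass, while weighting before saturation stays within it.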
Conference Paper
We investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce RankNet, an implementation of these ideas using a neural network to model the underlying ranking function. We present test results on toy data and on data from a commercial internet search engine.
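The abstract does not spell out the cost function. One common way to write a pairwise probabilistic cost, assumed here for illustration, is the cross-entropy between a target probability that document i should rank above document j and a logistic function of the score difference. The function name `pairwise_cost` and its parameters are hypothetical.

```python
import math

def pairwise_cost(s_i, s_j, target=1.0):
    """Cross-entropy between `target` (probability that i should rank above j)
    and the logistic estimate sigmoid(s_i - s_j). Assumed form, for illustration."""
    diff = s_i - s_j
    # Numerically stable log(sigmoid(diff)).
    if diff >= 0:
        log_p = -math.log1p(math.exp(-diff))
    else:
        log_p = diff - math.log1p(math.exp(diff))
    log_not_p = log_p - diff          # log(1 - sigmoid(diff))
    return -(target * log_p + (1.0 - target) * log_not_p)

tied = pairwise_cost(0.0, 0.0)        # equal scores: cost is log 2
correct = pairwise_cost(2.0, 0.0)     # preferred document scored higher
wrong = pairwise_cost(0.0, 2.0)       # preferred document scored lower

print(tied, correct, wrong)
```

Unlike a rank-based metric, this cost changes smoothly with the scores, so its gradient is defined everywhere and can drive gradient descent.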
Article
This report describes the lessons learned using the Indri search system during the 2004-2006 TREC Terabyte Tracks. We provide an overview of Indri, and, for the ad hoc and named page finding tasks, discuss our general approach to the problem, what worked, what did not work, and what could possibly work in the future.