
Mach Learn (2010) 80: 189–211

DOI 10.1007/s10994-010-5176-9

Preference-based learning to rank

Nir Ailon · Mehryar Mohri

Received: 15 March 2009 / Accepted: 1 November 2009 / Published online: 29 April 2010

© The Author(s) 2010

Abstract This paper presents an efficient preference-based ranking algorithm running in

two stages. In the first stage, the algorithm learns a preference function defined over pairs,

as in a standard binary classification problem. In the second stage, it makes use of that

preference function to produce an accurate ranking, thereby reducing the learning problem

of ranking to binary classification. This reduction is based on the familiar QuickSort and

guarantees an expected pairwise misranking loss of at most twice that of the binary classifier

derived in the first stage. Furthermore, in the important special case of bipartite ranking,

the factor of two in loss is reduced to one. This improved bound also applies to the regret

achieved by our ranking and that of the binary classifier obtained.

Our algorithm is randomized, but we prove a lower bound for any deterministic reduction

of ranking to binary classification showing that randomization is necessary to achieve our

guarantees. This, and a recent result by Balcan et al., who show a regret bound of two for

a deterministic algorithm in the bipartite case, suggest a trade-off between achieving low

regret and determinism in this context.

Our reduction also admits an improved running time guarantee with respect to that de-

terministic algorithm. In particular, the number of calls to the preference function in the

reduction is improved from ?(n2) to O(nlogn). In addition, when the top k ranked ele-

ments only are required (k ? n), as in many applications in information extraction or search

enginedesign,thetimecomplexityofouralgorithmcanbefurtherreducedtoO(klogk+n).

Our algorithm is thus practical for realistic applications where the number of points to rank

exceeds several thousand.

Keywords Learning to rank · Machine learning reductions · ROC

Editors: Sham Kakade and Ping Li.

N. Ailon (✉) · M. Mohri

Computer Science Faculty, Technion – Israel Institute of Technology, Haifa 32000, Israel

e-mail: nailon@cs.technion.ac.il

M. Mohri

Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012, USA


1 Introduction

The learning problem of ranking arises in many modern applications, including the design

of search engines, information extraction, and movie recommendation systems. In these

applications, the ordering of the documents or movies returned is a critical aspect of the

system.

The problem has been formulated within two distinct settings. In the score-based set-

ting, the learning algorithm receives a labeled sample of pairwise preferences and returns

a scoring function f :U → R which induces a linear ordering of the points in the set U.

Test points are simply ranked according to the values of f for those points. Several rank-

ing algorithms, including RankBoost (Freund et al. 2003; Rudin et al. 2005), SVM-type

ranking (Joachims 2002), and other algorithms such as PRank (Crammer and Singer 2001;

Agarwal and Niyogi 2005), were designed for this setting. Generalization bounds have been

given in this setting for the pairwise misranking error (Freund et al. 2003; Agarwal et al.

2005), including margin-based bounds (Rudin et al. 2005). Stability-based generalization

bounds have also been given in this setting for wide classes of ranking algorithms both in

the case of bipartite ranking (Agarwal and Niyogi 2005) and the general case (Cortes et al.

2007a, 2007b).

A somewhat different two-stage scenario was considered in other publications starting

with (Cohen et al. 1999), and later (Balcan et al. 2007, 2008), which we will refer to as the

preference-based setting. In the first stage of that setting, a preference function h : U × U →

[0,1] is learned, where values of h(u,v) closer to one indicate that u is ranked above v and

values closer to zero the opposite. The preference function h is typically assumed to be

the output of a classification algorithm trained on a sample of labeled pairs, and can be for

example a convex combination of simpler preference functions as in Cohen et al. (1999).

A crucial difference with the score-based setting is that, in general, the preference function

h may not induce a linear ordering. The relation it induces may be non-transitive, thus we

may have for example h(u,v) = h(v,w) = h(w,u) = 1 for three distinct points u, v, and w.

To rank a test subset V ⊆ U, in the second stage, the algorithm orders the points in V by

making use of the preference function h learned in the first stage. The subset ranking set-up

examined by Cossock and Zhang (2006), though distinct, also bears some resemblance with

this setting.

This paper deals with the preference-based ranking setting just described. The advantage

of this setting is that the learning algorithm is not required to return a linear ordering of all

points in U, which may be impossible to achieve faultlessly in accordance with a general

possibly non-transitive pairwise preference labeling. This is more likely to be achievable

exactly or with a better approximation when the algorithm is instead requested to supply a

linear ordering only for limited subsets V ⊆ U.

When the preference function is obtained as the output of a binary classification algo-

rithm, the preference-based setting can be viewed as a reduction of ranking to classification.

The second stage specifies how the ranking is obtained using the preference function.

Cohen et al. (1999) showed that in the second stage of the preference-based setting,

the general problem of finding a linear ordering with as few pairwise misrankings as pos-

sible with respect to the preference function h is NP-complete. The authors presented a

greedy algorithm based on the tournament degree, that is, for a given element u, the dif-

ference between the number of elements it is preferred to versus the number of those pre-

ferred to u. The bound proven by the authors, formulated in terms of the pairwise dis-

agreement loss l with respect to the preference function h, can be written as l(σgreedy,h) ≤

1/2 + l(σoptimal,h)/2, where l(σgreedy,h) is the loss achieved by the permutation σgreedy

returned by their algorithm and l(σoptimal,h) the one achieved by the optimal permutation


σoptimal with respect to the preference function h. Here, the loss l is normalized to be in the

range [0,1]. This bound was given for the general case of ranking, but, in the particular

case of bipartite ranking, a random ordering can achieve a pairwise disagreement loss of

1/2 and thus the bound is not informative. Note also that the algorithm can be viewed as a

derandomization technique.

More recently, Balcan et al. (2008) studied the bipartite ranking problem. In this partic-

ular case, the loss of an output ranking is measured by counting pairs of ranked elements,

one of which is positive and the other negative, based on some ground truth—a more formal

definition of the bipartite loss function is given later. Each time a negative element appears

higher in the output compared to a positive element, a penalty is added to the loss. They

showed that sorting the elements of V according to the same tournament degree used by

Cohen et al. (1999) guarantees a loss of at most twice the loss of the binary classifier used.

They also showed that this guarantees a regret of at most 2r for a binary classifier with re-

gret r.¹ However, due to the quadratic nature of the definition of the tournament degree, their

algorithm requires Ω(n²) calls to the preference function h, where n = |V| is the number of

objects to rank. The advantage of their algorithm is that it is deterministic.
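As a sketch only (function names are illustrative, not the authors' pseudocode), sorting by tournament degree can be implemented as follows, assuming a {0,1}-valued preference function h; the double loop makes the quadratic number of calls to h explicit:

```python
def sort_by_degree(V, h):
    """Rank V by tournament degree: for each u, the number of elements
    u is preferred to minus the number of elements preferred to u.
    Makes Theta(n^2) calls to h, matching the cost noted in the text."""
    degree = {u: 0 for u in V}
    for u in V:
        for v in V:
            if u != v:
                # h(u, v) = 1 means u is preferred to v
                degree[u] += h(u, v) - h(v, u)
    # Higher tournament degree means ranked earlier (more preferred)
    return sorted(V, key=lambda u: -degree[u])
```

With a transitive preference function (e.g., on integers, u preferred to v iff u < v) this recovers the usual sorted order; its interest lies in remaining well defined when h is non-transitive.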

We describe an efficient randomized algorithm for the second stage of the preference-

based setting and thus for reducing the learning problem of ranking to binary classification.

We guarantee, for the bipartite ranking case, an expected pairwise misranking regret of at

most r using a binary classifier with regret r, thereby bringing down the regret factor from

two to one. Compared to Balcan et al. (2008), we offer a tradeoff of low (expected) regret

versus determinism. Our reduction applies, with different constants, to a broader class of

ranking loss functions, admits a simple proof, and the expected running time complexity of

our algorithm in terms of number of calls to a classifier or preference function is improved

from Ω(n²) to O(n log n). Furthermore, when the top k ranked elements only are required

(k ≪ n), as in many applications in information extraction or search engines, the time com-

plexity of our algorithm can be further reduced to O(k log k + n). Our reduction and algo-

rithm are thus practical for realistic applications where the number of points to rank exceeds

several thousand. The price paid for this improvement is resorting to randomization, but

we prove a lower bound for any deterministic reduction of ranking to binary classification

showing that randomization is necessary to achieve our guarantees: we give a simple proof

of a lower bound of 2r for any deterministic reduction of ranking to binary classification

with classification regret r. This result generalizes to all deterministic reductions a lower

bound given by Balcan et al. (2008) for a specific algorithm.

To understand the low regret versus determinism tradeoff, it is interesting to consider the

case of a binary classifier with an error rate of just 25%, which is quite reasonable in many

applications. Assume that the Bayes error is close to zero for the classification problem and,

similarly, that for the ranking problem the regret and loss approximately coincide. Then, the

bound of Balcan et al. (2008) guarantees for the ranking algorithm a worst-case pairwise

misranking error of at most 50%, which is the pairwise misranking error of random ranking.

In this work, we give an algorithm that provably achieves an expected pairwise misranking

error of at most 25%. Of course, since the algorithm is randomized, the actual error may

reach 50% with some probability.

Most of our results also extend beyond the bipartite case already discussed to the general

case of ranking. However, these extended results apply only to the loss and not to the regret

and at the cost of a worse factor of two, instead of one. A by-product of this extension also

allows us to compare our results to those given by Cohen et al. (1999).

¹ Balcan et al.'s (2008) definition of regret is based on a specific assumption that we shall discuss later.


The algorithm used by Balcan et al. (2008) to produce a ranking based on the preference

function is known as sort-by-degree and has been recently used in the context of minimiz-

ing the feedback arc set in tournaments (Coppersmith et al. 2006). Here, we use a different

algorithm, QuickSort, which has also been recently used for minimizing the feedback arc set

in tournaments (Ailon et al. 2005; Ailon 2007). The techniques presented build upon earlier

work by Ailon et al. (2005) and Ailon (2007) on combinatorial optimization problems over

rankings and clustering.
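A minimal sketch of the second stage as a QuickSort driven by a preference function h, under the convention described later in the paper that an element is placed before a random pivot p with probability h(u, p); names and details here are illustrative, not the authors' exact pseudocode:

```python
import random

def quicksort_rank(V, h, rng=random):
    """Randomized QuickSort using a pairwise preference function
    h: U x U -> [0,1] with h(u,v) + h(v,u) = 1. Element u is placed
    before a uniformly random pivot p with probability h(u, p); for a
    {0,1}-valued h this reduces to ordinary comparison-based QuickSort.
    The expected number of calls to h is O(n log n)."""
    items = list(V)
    if len(items) <= 1:
        return items
    pivot = items.pop(rng.randrange(len(items)))
    left, right = [], []
    for u in items:
        if rng.random() < h(u, pivot):
            left.append(u)   # u is ranked above (before) the pivot
        else:
            right.append(u)
    return quicksort_rank(left, h, rng) + [pivot] + quicksort_rank(right, h, rng)
```

Note that no transitivity is assumed: even when h contains cycles, the recursion always terminates and outputs some linear ordering of V.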

The results just mentioned were already included in an earlier version of this paper (Ailon

and Mohri 2008). Here, we further generalize all of these results to the case where the

preference function h takes values in [0,1] and generalize the algorithm used in the second

stage, QuickSort, to make use of these real-valued scores instead of just their rounded binary

values. Most classification algorithms return a real-valued score that is subsequently used

to determine the labels. These scores can often be interpreted as confidence scores and thus

be more informative than their rounded or quantized counterparts. This is clear in the case

of generative models such as logistic regression, but the scores output by discriminative

learning algorithms are also often normalized and used as confidence scores. We show that

the same guarantees can be provided for the loss and regret bounds in both the bipartite

and the general ranking cases with the loss and regret for the preference functions defined

with respect to the real-values of h and thus taking into account the confidence scores of the

algorithm that produced h. This gives a strict generalization of our previous results (Ailon

and Mohri 2008) since the case where the scores are rounded is a special case corresponding

to a preference function h̃ : U × U → {0,1}. This generalization in fact helps simplify some of

our proofs. Note that our results also generalize over those of Balcan et al. (2008) which

hold for the case of a preference function taking values in {0,1}.

The remainder of the paper is structured as follows. In Sect. 2, we introduce the defin-

itions and notation used in future sections and introduce a general family of loss functions

for ranking. Section 3 describes a simple and efficient algorithm for reducing ranking to bi-

nary classification, proves several bounds guaranteeing the quality of the ranking produced

by the algorithm, and analyzes the running-time complexity of our algorithm. In Sect. 4, we

derive a lower bound for any deterministic reduction of ranking to binary classification. In

Sect. 5, we discuss the relationship of the algorithm and its proof with previous related work

in combinatorial optimization, and discuss some key assumptions related to the notion of

regret in this context.

2 Preliminaries

This section introduces several preliminary definitions necessary for the presentation of our

results. In what follows, U will denote a universe of elements, e.g., the collection of all

possible query-result pairs returned by a web search task, and V ⊆ U will denote a small

subset thereof, e.g., a preliminary list of relevant results for a given query. For simplicity

of notation we will assume that U is a set of integers, so that we are always able to choose

a minimal canonical element in a finite subset, as we shall do in (12) below. This arbitrary

ordering should not be confused with the ranking problem we are considering.

2.1 General definitions and notation

We first briefly discuss the learning setting and assumptions made here and compare them

with those of Balcan et al. (2008) and Cohen et al. (1999).


In what follows, V ⊆ U represents a finite subset extracted from some arbitrary uni-

verse U, which is the set we wish to rank at each round. The notation S(V) denotes the set

of rankings on V, that is the set of injections from V to [n] = {1,...,n}, where n = |V|.

If σ ∈ S(V) is such a ranking, then σ(u) is the rank of an element u ∈ V, where lower

ranks are interpreted as preferable ones. More precisely, we say that u is preferred over

v with respect to σ if σ(u) < σ(v). For convenience, and abusing notation, we also write

σ(u,v) = 1 if σ(u) < σ(v) and σ(u,v) = 0 otherwise. We let (V choose k) denote the collection of

all subsets of size exactly k of V. We denote by Ea,b,...[·] the expectation taken with respect

to the variables a,b,.... To distinguish between functions taking ordered versus unordered

arguments, we use the notation Fu1u2...uk to denote k unordered arguments for a function F

defined on (V choose k), and F(u1,u2,...,uk) for ordered arguments.

2.2 Preference function

As with both (Cohen et al. 1999) and (Balcan et al. 2008), we assume that a preference

function h: U ×U → [0,1] is learned in the first learning stage. The convention is that the

higher is h(u,v), the more our belief that u should be preferred to v. The function h satisfies

pairwise consistency: h(u,v)+h(v,u) = 1, but needs not even be transitive on three tuples

(cycles may be induced). The second stage uses h to output a proper ranking σ, as we shall

further discuss below. The running time complexity of the second stage is measured with

respect to the number of calls to h.
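For instance, a preference function can satisfy pairwise consistency while still inducing a cycle. A purely illustrative three-element example:

```python
def h(u, v):
    """Toy preference on {'a', 'b', 'c'}: a beats b, b beats c, and
    c beats a. Pairwise consistent (h(u,v) + h(v,u) == 1) yet
    non-transitive, so no linear ordering agrees with every pair."""
    beats = {('a', 'b'), ('b', 'c'), ('c', 'a')}
    if (u, v) in beats:
        return 1.0
    if (v, u) in beats:
        return 0.0
    return 0.5  # u == v: no preference either way
```

Any ranking of {a, b, c} must disagree with at least one of the three pairwise preferences, which is why the second stage can only aim at a low, rather than zero, disagreement with h.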

2.3 Output of learning algorithm and loss function

The cost of the permutation σ output by the second stage of the algorithm is measured dif-

ferently in Balcan et al. (2008) and Cohen et al. (1999). In the former, it is measured against

the unknown ground truth and compared to the cost of the preference function h against the

ground truth. In Cohen et al. (1999), σ is measured against the preference function h, and

compared to the theoretically best ordering, that is the ordering that is the closest to h. Thus,

here, h plays the role of the ground truth.

Most of our work follows the approach of Balcan et al. (2008): at each round, there

is an underlying unknown ground truth which we wish the output of the learning algo-

rithm to agree with as much as possible. The ground truth is a ranking which we denote by

σ∗∈ S(V), equipped with a function ω assigning different importance weights to pairs of

positions. The combination (σ∗,ω) we allow in this work, as we shall see below, is very

expressive. It can encode in particular the standard average pairwise misranking or AUC

loss assumed by Balcan et al. (2008) in a bipartite setting, but also more sophisticated ones

capturing misrankings among the top k, and other losses that are close but distinct from

those considered by Clémençon and Vayatis (2007).

The following general loss function Lω measures the quality of a ranking σ with respect

to a desired one σ∗ using a symmetric weight function ω described below:

Lω(σ,σ∗) = (n choose 2)⁻¹ Σ_{u≠v} σ(u,v) σ∗(v,u) ω(σ∗(v),σ∗(u)).
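As an illustrative sketch (names hypothetical), this loss can be computed directly from the definition, representing each ranking as a dict mapping elements to their positions:

```python
from itertools import permutations
from math import comb

def pairwise_loss(sigma, sigma_star, omega):
    """L_omega(sigma, sigma_star): normalized sum over ordered pairs
    u != v of sigma(u,v) * sigma_star(v,u) * omega(...), i.e. each
    pair ranked u-above-v by sigma but v-above-u by sigma_star
    contributes the weight omega(sigma_star(v), sigma_star(u))."""
    n = len(sigma)
    total = 0.0
    for u, v in permutations(sigma, 2):  # all ordered pairs u != v
        if sigma[u] < sigma[v] and sigma_star[v] < sigma_star[u]:
            total += omega(sigma_star[v], sigma_star[u])
    return total / comb(n, 2)
```

With the constant weight ω ≡ 1 this is the standard normalized pairwise misranking (AUC-type) loss; other choices of ω recover the weighted variants discussed in the text.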

The sum is over all pairs u,v in the domain V of the rankings σ,σ∗. It counts the number of

inverted pairs u,v ∈ V weighed by ω, which assigns importance coefficients to pairs, based

on their positions in the ground truth σ∗. The function ω must satisfy the following three

natural axioms, which will be necessary in our analysis: