Mach Learn (2010) 80: 189–211
DOI 10.1007/s10994-010-5176-9
Preference-based learning to rank
Nir Ailon · Mehryar Mohri
Received: 15 March 2009 / Accepted: 1 November 2009 / Published online: 29 April 2010
© The Author(s) 2010
Abstract This paper presents an efficient preference-based ranking algorithm running in
two stages. In the first stage, the algorithm learns a preference function defined over pairs,
as in a standard binary classification problem. In the second stage, it makes use of that
preference function to produce an accurate ranking, thereby reducing the learning problem
of ranking to binary classification. This reduction is based on the familiar QuickSort and
guarantees an expected pairwise misranking loss of at most twice that of the binary classifier
derived in the first stage. Furthermore, in the important special case of bipartite ranking,
the factor of two in loss is reduced to one. This improved bound also applies to the regret
achieved by our ranking and that of the binary classifier obtained.
Our algorithm is randomized, but we prove a lower bound for any deterministic reduction
of ranking to binary classification showing that randomization is necessary to achieve our
guarantees. This, and a recent result by Balcan et al., who show a regret bound of two for
a deterministic algorithm in the bipartite case, suggest a trade-off between achieving low
regret and determinism in this context.
Our reduction also admits an improved running time guarantee with respect to that deterministic algorithm. In particular, the number of calls to the preference function in the reduction is improved from Ω(n²) to O(n log n). In addition, when only the top k ranked elements are required (k ≪ n), as in many applications in information extraction or search engine design, the time complexity of our algorithm can be further reduced to O(k log k + n).
Our algorithm is thus practical for realistic applications where the number of points to rank
exceeds several thousand.
Keywords Learning to rank · Machine learning reductions · ROC
Editors: Sham Kakade and Ping Li.
N. Ailon (✉) · M. Mohri
Computer Science Faculty, Technion – Israel Institute of Technology, Haifa 32000, Israel
e-mail: nailon@cs.technion.ac.il
M. Mohri
Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012, USA
1 Introduction
The learning problem of ranking arises in many modern applications, including the design
of search engines, information extraction, and movie recommendation systems. In these
applications, the ordering of the documents or movies returned is a critical aspect of the
system.
The problem has been formulated within two distinct settings. In the score-based set-
ting, the learning algorithm receives a labeled sample of pairwise preferences and returns
a scoring function f : U → R which induces a linear ordering of the points in the set U.
Test points are simply ranked according to the values of f for those points. Several rank-
ing algorithms, including RankBoost (Freund et al. 2003; Rudin et al. 2005), SVM-type
ranking (Joachims 2002), and other algorithms such as PRank (Crammer and Singer 2001;
Agarwal and Niyogi 2005), were designed for this setting. Generalization bounds have been
given in this setting for the pairwise misranking error (Freund et al. 2003; Agarwal et al.
2005), including margin-based bounds (Rudin et al. 2005). Stability-based generalization
bounds have also been given in this setting for wide classes of ranking algorithms both in
the case of bipartite ranking (Agarwal and Niyogi 2005) and the general case (Cortes et al.
2007a, 2007b).
A somewhat different two-stage scenario was considered in other publications starting
with (Cohen et al. 1999), and later (Balcan et al. 2007, 2008), which we will refer to as the
preference-based setting. In the first stage of that setting, a preference function h : U × U → [0,1] is learned, where values of h(u,v) closer to one indicate that u is ranked above v and
values closer to zero the opposite. The preference function h is typically assumed to be
the output of a classification algorithm trained on a sample of labeled pairs, and can be for
example a convex combination of simpler preference functions as in Cohen et al. (1999).
A crucial difference with the score-based setting is that, in general, the preference function
h may not induce a linear ordering. The relation it induces may be non-transitive; thus we may have, for example, h(u,v) = h(v,w) = h(w,u) = 1 for three distinct points u, v, and w.
To rank a test subset V ⊆ U, in the second stage, the algorithm orders the points in V by
making use of the preference function h learned in the first stage. The subset ranking set-up
examined by Cossock and Zhang (2006), though distinct, also bears some resemblance with
this setting.
This paper deals with the preference-based ranking setting just described. The advantage
of this setting is that the learning algorithm is not required to return a linear ordering of all
points in U, which may be impossible to achieve faultlessly in accordance with a general
possibly non-transitive pairwise preference labeling. This is more likely to be achievable
exactly or with a better approximation when the algorithm is instead requested to supply a linear ordering only for limited subsets V ⊆ U.
When the preference function is obtained as the output of a binary classification algo-
rithm, the preference-based setting can be viewed as a reduction of ranking to classification.
The second stage specifies how the ranking is obtained using the preference function.
Cohen et al. (1999) showed that in the second stage of the preference-based setting,
the general problem of finding a linear ordering with as few pairwise misrankings as pos-
sible with respect to the preference function h is NP-complete. The authors presented a
greedy algorithm based on the tournament degree, that is, for a given element u, the difference between the number of elements it is preferred to versus the number of those preferred to u. The bound proven by the authors, formulated in terms of the pairwise disagreement loss l with respect to the preference function h, can be written as l(σ_greedy, h) ≤ 1/2 + l(σ_optimal, h)/2, where l(σ_greedy, h) is the loss achieved by the permutation σ_greedy returned by their algorithm and l(σ_optimal, h) the one achieved by the optimal permutation
σ_optimal with respect to the preference function h. Here, the loss l is normalized to be in the
range [0,1]. This bound was given for the general case of ranking, but, in the particular
case of bipartite ranking, a random ordering can achieve a pairwise disagreement loss of
1/2 and thus the bound is not informative. Note also that the algorithm can be viewed as a
derandomization technique.
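The pairwise disagreement loss used above can be made concrete with a short sketch. The function below is our own illustration, not the paper's code: the name `pairwise_loss` and the representation of a ranking as a Python list (highest-ranked first) are assumptions. It averages, over all pairs placed by the ranking, the preference mass that h assigns to the opposite order.

```python
from itertools import combinations

def pairwise_loss(sigma, h):
    """Normalized pairwise disagreement of a ranking against a preference
    function.  sigma is a list with the highest-ranked element first;
    h(u, v) in [0, 1] is the preference mass for placing u above v.
    For each pair (u above v) in sigma, the disagreeing mass is h(v, u);
    the result is the average over all pairs, a loss in [0, 1]."""
    pairs = list(combinations(range(len(sigma)), 2))
    if not pairs:
        return 0.0
    disagreement = sum(h(sigma[j], sigma[i]) for i, j in pairs)
    return disagreement / len(pairs)
```

For instance, with the cyclic binary preference h(u,v) = h(v,w) = h(w,u) = 1 mentioned earlier, no ordering of the three points achieves a loss below 1/3, which is why a faultless linear ordering may be impossible.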
More recently, Balcan et al. (2008) studied the bipartite ranking problem. In this partic-
ular case, the loss of an output ranking is measured by counting pairs of ranked elements,
one of which is positive and the other negative, based on some ground truth—a more formal
definition of the bipartite loss function is given later. Each time a negative element appears
higher in the output compared to a positive element, a penalty is added to the loss. They
showed that sorting the elements of V according to the same tournament degree used by
Cohen et al. (1999) guarantees a loss of at most twice the loss of the binary classifier used.
They also showed that this guarantees a regret of at most 2r for a binary classifier with regret r.¹ However, due to the quadratic nature of the definition of the tournament degree, their algorithm requires Ω(n²) calls to the preference function h, where n = |V| is the number
objects to rank. The advantage of their algorithm is that it is deterministic.
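The sort-by-degree procedure used by Cohen et al. (1999) and Balcan et al. (2008) can be sketched as follows; the function name and tie-breaking are our own illustrative choices. Computing every tournament degree is exactly what forces the Ω(n²) calls to h noted above.

```python
def sort_by_degree(V, h):
    """Rank V by tournament degree: for each u, sum how strongly u is
    preferred over the other elements minus how strongly they are
    preferred over u, then sort in decreasing order of that score.
    Makes Theta(n^2) calls to the preference function h."""
    degree = {u: sum(h(u, v) - h(v, u) for v in V if v != u) for u in V}
    # Ties are broken arbitrarily by Python's stable sort.
    return sorted(V, key=lambda u: degree[u], reverse=True)
```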
We describe an efficient randomized algorithm for the second stage of the preference-
based setting and thus for reducing the learning problem of ranking to binary classification.
We guarantee, for the bipartite ranking case, an expected pairwise misranking regret of at
most r using a binary classifier with regret r, thereby bringing down the regret factor from
two to one. Compared to Balcan et al. (2008), we offer a tradeoff of low (expected) regret
versus determinism. Our reduction applies, with different constants, to a broader class of
ranking loss functions, admits a simple proof, and the expected running time complexity of
our algorithm in terms of number of calls to a classifier or preference function is improved
from Ω(n²) to O(n log n). Furthermore, when only the top k ranked elements are required (k ≪ n), as in many applications in information extraction or search engines, the time complexity of our algorithm can be further reduced to O(k log k + n). Our reduction and algorithm are thus practical for realistic applications where the number of points to rank exceeds several thousand. The price paid for this improvement is in resorting to randomization, but
we prove a lower bound for any deterministic reduction of ranking to binary classification
showing that randomization is necessary to achieve our guarantees: we give a simple proof
of a lower bound of 2r for any deterministic reduction of ranking to binary classification
with classification regret r. This result generalizes to all deterministic reductions a lower
bound given by Balcan et al. (2008) for a specific algorithm.
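The O(k log k + n) top-k claim can be illustrated with a QuickSelect-style sketch that recurses fully only on the side known to contain the top k elements. This is our own rendering under the simplifying assumption that h is rounded to a binary comparison; it is not the paper's exact procedure.

```python
import random

def top_k(V, h, k):
    """Return the k highest-ranked elements of V, in order, using a
    QuickSort-style partition that fully sorts only the portion
    containing the top k; the expected number of calls to h is
    roughly O(n + k log k)."""
    if k <= 0 or not V:
        return []
    if len(V) == 1:
        return list(V)
    pivot = random.choice(V)
    above = [u for u in V if u != pivot and h(u, pivot) >= 0.5]  # ranked above pivot
    below = [u for u in V if u != pivot and h(u, pivot) < 0.5]
    if k <= len(above):
        return top_k(above, h, k)  # the top k lie entirely above the pivot
    return top_k(above, h, len(above)) + [pivot] + top_k(below, h, k - len(above) - 1)
```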
To understand the low regret versus determinism tradeoff, it is interesting to consider the
case of a binary classifier with an error rate of just 25%, which is quite reasonable in many
applications. Assume that the Bayes error is close to zero for the classification problem and,
similarly, that for the ranking problem the regret and loss approximately coincide. Then, the
bound of Balcan et al. (2008) guarantees for the ranking algorithm a worst-case pairwise
misranking error of at most 50%, which is the pairwise misranking error of random ranking.
In this work, we give an algorithm that provably achieves an expected pairwise misranking
error of at most 25%. Of course, since the algorithm is randomized, the actual error may
reach 50% with some probability.
Most of our results also extend beyond the bipartite case already discussed to the general case of ranking. However, these extended results apply only to the loss and not to the regret, and come at the cost of a worse factor of two instead of one. A by-product of this extension also
allows us to compare our results to those given by Cohen et al. (1999).
¹ Balcan et al.'s (2008) definition of regret is based on a specific assumption that we shall discuss later.
The algorithm used by Balcan et al. (2008) to produce a ranking based on the preference
function is known as sort-by-degree and has been recently used in the context of minimizing the feedback arc set in tournaments (Coppersmith et al. 2006). Here, we use a different algorithm, QuickSort, which has also been recently used for minimizing the feedback arc set in tournaments (Ailon et al. 2005; Ailon 2007). The techniques presented build upon earlier
work by Ailon et al. (2005) and Ailon (2007) on combinatorial optimization problems over
rankings and clustering.
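The QuickSort-based second stage can be sketched as follows, here with the preference values rounded to binary comparisons; the function name and list representation are our own illustrative choices. Even when h is non-transitive, the recursion always terminates and outputs some linear order, and the paper's analysis bounds its expected misranking loss.

```python
import random

def quicksort_rank(V, h):
    """Order V with randomized QuickSort, using the learned preference
    function h as the comparator (rounded to a binary decision here).
    Expected number of calls to h is O(n log n)."""
    if len(V) <= 1:
        return list(V)
    pivot = random.choice(V)
    above = [u for u in V if u != pivot and h(u, pivot) >= 0.5]  # u before pivot
    below = [u for u in V if u != pivot and h(u, pivot) < 0.5]   # pivot before u
    return quicksort_rank(above, h) + [pivot] + quicksort_rank(below, h)
```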
The results just mentioned were already included in an earlier version of this paper (Ailon and Mohri 2008). Here, we further generalize all of these results to the case where the
preference function h takes values in [0,1] and generalize the algorithm used in the second
stage, QuickSort, to make use of these real-valued scores instead of just their rounded binary
values. Most classification algorithms return a real-valued score that is subsequently used
to determine the labels. These scores can often be interpreted as confidence scores and thus
be more informative than their rounded or quantized counterparts. This is clear in the case
of generative models such as logistic regression, but the scores output by discriminative
learning algorithms are also often normalized and used as confidence scores. We show that
the same guarantees can be provided for the loss and regret bounds in both the bipartite
and the general ranking cases with the loss and regret for the preference functions defined with respect to the real values of h, thus taking into account the confidence scores of the algorithm that produced h. This gives a strict generalization of our previous results (Ailon and Mohri 2008), since the case where the scores are rounded is a special case corresponding to a preference function h̃ : U × U → {0,1}. This generalization in fact helps simplify some of our proofs. Note that our results also generalize those of Balcan et al. (2008), which hold for the case of a preference function taking values in {0,1}.
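A natural way to use the real-valued scores directly is to resolve each comparison against the pivot by a coin flip with bias h(u, pivot). The sketch below is our own rendering of this idea and may differ in detail from the paper's generalized QuickSort.

```python
import random

def quicksort_rank_soft(V, h, rng=random):
    """QuickSort where each comparison with the pivot is resolved at
    random: u is placed above the pivot with probability h(u, pivot).
    This uses the classifier's confidence scores directly instead of
    rounding them to binary values."""
    if len(V) <= 1:
        return list(V)
    pivot = rng.choice(V)
    above, below = [], []
    for u in V:
        if u == pivot:
            continue
        (above if rng.random() < h(u, pivot) else below).append(u)
    return quicksort_rank_soft(above, h, rng) + [pivot] + quicksort_rank_soft(below, h, rng)
```

When h happens to take only the values 0 and 1, every coin flip is deterministic and this reduces to the rounded binary comparator.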
The remainder of the paper is structured as follows. In Sect. 2, we introduce the defin-
itions and notation used in future sections and introduce a general family of loss functions
for ranking. Section 3 describes a simple and efficient algorithm for reducing ranking to bi-
nary classification, proves several bounds guaranteeing the quality of the ranking produced
by the algorithm, and analyzes the running-time complexity of our algorithm. In Sect. 4, we
derive a lower bound for any deterministic reduction of ranking to binary classification. In
Sect. 5, we discuss the relationship of the algorithm and its proof with previous related work
in combinatorial optimization, and discuss some key assumptions related to the notion of
regret in this context.
2 Preliminaries
This section introduces several preliminary definitions necessary for the presentation of our
results. In what follows, U will denote a universe of elements, e.g., the collection of all
possible query-result pairs returned by a web search task, and V ⊆ U will denote a small
subset thereof, e.g., a preliminary list of relevant results for a given query. For simplicity
of notation we will assume that U is a set of integers, so that we are always able to choose
a minimal canonical element in a finite subset, as we shall do in (12) below. This arbitrary
ordering should not be confused with the ranking problem we are considering.
2.1 General definitions and notation
We first briefly discuss the learning setting and assumptions made here and compare them
with those of Balcan et al. (2008) and Cohen et al. (1999).
Δ(H∗, H_σ) ≤ Δ(G, H∗) + Δ(G, H_σ) ≤ (1+ε)Δ(G, H) + Δ(G, H_σ)
≤ (2+ε)Δ(G, H_σ).
Thus, this work also adds a significant contribution to Ailon et al. (2005), Ailon (2007) and
Kenyon-Mathieu and Schudy (2007).
5.4 Weak vs. strong regret functions
For the proof of the regret bound of Theorem 2 we used the fact that the minimizer h̃ in the definition (8)–(9) of R_class could be determined independently for each pair u,v ∈ U, using (12). This could also be done for strong regrets if the distribution D on V, τ∗ satisfied the following pairwise IIA condition.
Definition 1 A distribution D on subsets V ⊆ U and generalized rankings τ∗ on V satisfies pairwise independence on irrelevant alternatives (pairwise IIA) if for all u,v ∈ U and for any two subsets V1, V2 ⊇ {u,v},

E_{τ∗|V1}[τ∗(u,v)] = E_{τ∗|V2}[τ∗(u,v)].
Note: we chose the terminology IIA in agreement with its use in Arrow’s seminal work
(Arrow 1950) for a similar notion.
When pairwise IIA holds, the average ground truth relation between u and v, conditioned
on u,v included in V, is independent of V.
Recall that a bipartite τ∗ is derived from a pair σ∗, ω, where ω is defined using a term 1/(m−·m+) for compatibility with the definition of the AUC. The numbers m+ and m− depend on the underlying sizes of the positive and negative sets partitioning V and therefore cannot be inferred from (u,v) alone. Thus, in the standard bipartite case, the pairwise IIA assumption
is not natural. If, however, we replaced our definitions in the bipartite case and used

ω(i,j) = 1 if (i ≤ m+) ∧ (j > m+),
         1 if (j ≤ m+) ∧ (i > m+),          (47)
         0 otherwise,

instead of (3), then it would be reasonable to believe that pairwise IIA does hold in the bipartite case. In fact, it would be reasonable to make the stronger assumption that for any fixed u,v ∈ U and V1, V2 ⊇ {u,v}, the distributions of the random variables τ∗(u,v)|V1 and τ∗(u,v)|V2 are equal. This corresponds to the intuition that, when comparing a pair (u,v) in the context of a set V containing them, human labelers are not as influenced by the irrelevant information V \ {u,v} as they would be by V \ {u} when asked to evaluate single elements u. The irrelevant information in V is often referred to as an anchor in experimental psychology and economics (Ariely et al. 2008).
Our regret bounds would still hold if we used (47), but we chose (3) to present our results in terms of the familiar average pairwise misranking error or AUC loss.
Another possible assumption allowing the usage of strong regrets is to let the preference function learned in the first stage depend on V. This is, in fact, the assumption made by Balcan et al. (2008).
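Definition (47) translates directly into code. The sketch below assumes, as in the text, that elements are indexed so that positions 1..m+ are the positive ones; the function name is our own.

```python
def omega(i, j, m_plus):
    """Bipartite pair weight from (47): a pair (i, j) gets weight 1 iff
    exactly one of the two positions is positive (i.e., at most m_plus),
    and 0 otherwise.  Unlike (3), no 1/(m_minus * m_plus) normalization
    is applied, so the value depends only on the pair itself."""
    return 1 if (i <= m_plus) != (j <= m_plus) else 0
```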
6 Conclusion
We described a reduction of the learning problem of ranking to classification. The efficiency
of this reduction makes it practical for large-scale information extraction and search engine
applications. A finer analysis of QuickSort is likely to further improve our reduction bound
by providing a concentration inequality for the algorithm’s deviation from its expected be-
havior using the confidence scores output by the classifier. Our reduction leads to a com-
petitive ranking algorithm that can be viewed as an alternative to the algorithms previously
designed for the score-based setting.
Acknowledgements We thank Alina Beygelzimer and John Langford for helpful discussions. Mehryar Mohri's work was partially funded by the New York State Office of Science Technology and Academic Research (NYSTAR).
References
Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., & Roth, D. (2005). Generalization bounds for the area
under the ROC curve. Journal of Machine Learning Research, 6, 393–425.
Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. In COLT
(pp. 32–47).
Ailon, N. (2007). Aggregation of partial rankings, p-ratings and top-m lists. In SODA.
Ailon, N., Charikar, M., & Newman, A. (2005). Aggregating inconsistent information: ranking and clustering.
In Proceedings of the 37th annual ACM symposium on theory of computing (pp. 684–693). Baltimore,
MD, USA, May 22–24, 2005. New York: ACM.
Ailon, N., & Mohri, M. (2008). An efficient reduction of ranking to classification. In Proceedings of the 21st
annual conference on learning theory (COLT 2008). Helsinki: Omnipress.
Alon, N. (2006). Ranking tournaments. SIAM Journal Discrete Mathematics, 20, 137–142.
Ariely, D., Loewenstein, G., & Prelec, D. (2008). Coherent arbitrariness: stable demand curves without stable
preferences. The Quarterly Journal of Economics, 118, 73–105.
Arrow, K. J. (1950). A difficulty in the concept of social welfare. Journal of Political Economy, 58, 328–346.
Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., & Sorkin, G. B. (2007). Robust
reductions from ranking to classification. In COLT (pp. 604–619). Berlin: Springer.
Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., & Sorkin, G. B. (2008). Robust
reductions from ranking to classification. Machine Learning Journal, 72, 139–153.
Clémençon, S., & Vayatis, N. (2007). Ranking the best instances. Journal of Machine Learning Research, 8,
2671–2699.
Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. The Journal of Artificial
Intelligence Research, 10, 243–270.
Coppersmith, D., Fleischer, L., & Rudra, A. (2006). Ordering by weighted number of wins gives a good
ranking for weighted tournaments. In Proceedings of the 17th annual ACM-SIAM symposium on discrete
algorithms (SODA).
Cortes, C., Mohri, M., & Rastogi, A. (2007a). An alternative ranking problem for search engines. In Proceed-
ings of the 6th workshop on experimental algorithms (WEA 2007) (pp. 1–21). Heidelberg: Springer-
Verlag.
Cortes, C., Mohri, M., & Rastogi, A. (2007b). Magnitude-preserving ranking algorithms. In Proceedings of
the twenty-fourth international conference on machine learning (ICML 2007). Oregon State University,
Corvallis, OR.
Cossock, D., & Zhang, T. (2006). Subset ranking using regression. In COLT (pp. 605–619).
Crammer, K., & Singer, Y. (2001). Pranking with ranking. In Advances in neural information processing
systems: Vol. 14. Neural information processing systems: natural and synthetic, NIPS 2001 (pp. 641–
647). Vancouver, British Columbia, Canada, December 3–8, 2001. Cambridge: MIT Press.
Freund, Y., Iyer, R. D., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining
preferences. Journal of Machine Learning Research, 4, 933–969.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating character-
istic (ROC) curve. Radiology, 143, 29–36.
Hedge, R., Jain, K., Williamson, D. P., & van Zuylen, A. (2007). Deterministic pivoting algorithms for con-
strained ranking and clustering problems. In Proceedings of the ACM-SIAM symposium on discrete
algorithms (SODA).
Hoare, C. (1961). Quicksort: Algorithm 64. Communications of the ACM, 4, 321–322.
Joachims, T. (2002). Optimizing search engines using clickthrough data. In KDD ’02: Proceedings of the
eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 133–142).
New York: ACM Press.
Kenyon-Mathieu, C., & Schudy, W. (2007). How to rank with few errors. In STOC ’07: Proceedings of the
thirty-ninth annual ACM symposium on theory of computing (pp. 95–103). New York: ACM Press.
Lehmann, E. L. (1975). Nonparametrics: statistical methods based on ranks. San Francisco: Holden-Day.
Montague, M. H., & Aslam, J. A. (2002). Condorcet fusion for improved retrieval. In Proceedings of the
2002 ACM CIKM international conference on information and knowledge management (pp. 538–548).
McLean, VA, USA, November 4–9, 2002. New York: ACM.
Rudin, C., Cortes, C., Mohri, M., & Schapire, R. E. (2005). Margin-based ranking meets boosting in the mid-
dle. In Learning theory, 18th annual conference on learning theory, COLT 2005 Proceedings (pp. 63–
78). Bertinoro, Italy, June 27–30, 2005. Berlin: Springer.
Williamson, D. P., & van Zuylen, A. (2007). Deterministic algorithms for rank aggregation and other ranking
and clustering problems. In Proceedings of the 5th workshop on approximation and online algorithms
(WAOA).