
A Computationally Intensive Ranking System for Paired Comparison Data


Abstract

In this paper, we introduce a new ranking system where the data are preferences resulting from paired comparisons. When direct preferences are missing or unclear, then preferences are determined through indirect comparisons. Given that a ranking of n subjects implies $\binom{n}{2}$ paired preferences, the resultant computational problem is the determination of an optimal ranking where the agreement between the implied preferences via the ranking and the data preferences is maximized. Comparisons are carried out via simulation studies where the proposed rankings outperform Bradley–Terry in a particular predictive comparison.
Operations Research Perspectives 5 (2018) 105–112
David Beaudoin a,1, Tim Swartz b,1,*
a Département Opérations et Systèmes de Décision, Faculté des Sciences de l’Administration, Université Laval, Pavillon Palasis-Prince, Bureau 2439, Québec (Québec) G1V0A6, Canada
b Department of Statistics and Actuarial Science, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A1S6, Canada
Article info
Article history: Received 26 December 2017; Revised 7 March 2018; Accepted 24 March 2018; Available online 27 March 2018
Keywords: Nonparametric methods; NCAA basketball; Simulated annealing; Statistical computing
© 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license. ISSN 2214-7160.
1. Introduction
The problem of ranking can be simply stated and has an extensive literature in the statistical sciences. Given data on n subjects, the objective is to determine a permutation (ranking) $R = (i_1, \ldots, i_n)$ where the interpretation is that subject $i_j$ is preferable to subject $i_k$ whenever $i_j < i_k$. The term “preferable” depends on the application and the methods used to determine the ranking depend on aspects of the data structure.
In sport, ranking is an important problem. For example, in National Collegiate Athletic Association (NCAA) basketball, there are over 300 teams competing in Division I where a typical team plays only a subset (approximately 25) of the other teams during a season. At the end of the season, the NCAA Selection Committee is set with the task of creating a tournament structure known as “March Madness” involving 68 of these teams. In determining the invitees, team rankings (in terms of team quality) form part of the decision making process.
* Corresponding author: T. Swartz.
1 Both authors have been partially supported by the Natural Sciences and Engineering Research Council of Canada. This research was enabled in part by support provided by Calcul Québec and Compute Canada. The authors thank three reviewers who carefully read the paper and whose comments have improved the manuscript.

Similarly, in NCAA football, various team rankings are regu-
larly reported during the regular season (e.g. Associated Press, FCS
Coaches’ Poll, Sagarin, etc.). Although such rankings are no longer
used for determining Bowl bids (i.e. identifying pairs of teams that
compete in prestigious holiday matches), the rankings receive con-
siderable media attention and are available to the selection com-
mittee. Part of the intrigue involving the determination of the
rankings is that there are not many crossover matches involving
teams from different conferences.
Ranking also occurs in non-sporting contexts. For example,
universities rank students, employers rank job candidates, there
are rankings corresponding to the quality of journals, and so on.
Clearly, the type of data used to inform the rankings varies greatly
depending on the application.
In this paper, we focus on the ranking problem associated with NCAA basketball. More specifically, we consider the ranking of n Division I teams (n = 351 in 2015/2016). The data used to inform our ranking are the result of paired comparisons. Sometimes a comparison is explicit (e.g. based on the result of one team playing another team). In other instances, the comparison between two teams is determined by considering the results of matches involving common opponents. Our approach searches for an optimal ranking $R = (i_1, \ldots, i_n)$ which has maximal agreement between the $\binom{n}{2}$ implied paired preferences via the ranking and the observed data preferences. The approach is appealing in its simplicity and its lack of assumptions. It may be regarded as nonparametric in the sense that there is no underlying probability model. However, the approach provides computational challenges. For example, a simple search amongst rankings is not possible since there are $n! \approx 10^{742}$ potential rankings.
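
The size of this search space is easy to verify numerically. The following is a minimal Python sketch (the function name `log10_factorial` is ours and does not come from the paper) that evaluates log10(n!) through the log-gamma function, so the factorial itself is never formed.

```python
import math

def log10_factorial(n: int) -> float:
    """Return log10(n!) via the log-gamma function: ln(n!) = lgamma(n + 1)."""
    return math.lgamma(n + 1) / math.log(10)

n_teams = 351  # number of Division I teams in 2015/2016, as stated in the text
exponent = log10_factorial(n_teams)
print(f"351! is roughly 10^{exponent:.1f}")
```

The exponent falls between 742 and 743, in line with the order of magnitude quoted in the text.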
Ranking methods based on paired comparison data originate from the work of Thurstone [26] and Bradley and Terry [4]. The approach suggested by Park and Newman [19] is most closely related to our approach in the sense that it is also nonparametric and extends comparisons to indirect matchups between teams. Park and Newman [19] rank teams according to team wins w minus team losses l in both direct and discounted indirect matches. The statistics w and l correspond to a matrix-based network centrality measure involving adjacency matrices.
From the seminal work by Thurstone [26] and Bradley and Terry [4], the statistical literature on methods for paired comparison data has flourished. For example, many extensions to the original models have been considered such as the provision for ties in paired comparisons [9], multiple comparisons [20], Bayesian implementations [6,8,17] and dynamic ranking models [12,13] which have been used in chess. The treatment of the margin of victory in paired comparison settings has also led to various models and methods. For example, Harville [14,15] considers linear models where truncations are imposed on large margins of victory. A central idea is that teams should not have an incentive for running up the score. Mease [18] considers a model based on normal likelihoods and penalty terms that attempts to correspond to human judgments. A general review of the literature related to paired comparison methods is given by Cattelan [7]. Rotou et al. [22] review methods that are primarily concerned with dynamic rankings where data are frequently generated such as in the gaming industry.
In Section 2 , we describe our approach which is intuitive and
simple to describe. However, the method gives rise to challenging
computational hurdles for which we propose a stochastic search
algorithm. For example, we demonstrate how time savings can
be achieved in the calculation of our metric which measures the
agreement between the implied ranking preferences and the data
preferences. The algorithm implements a simulated annealing pro-
cedure which optimizes over the n ! candidate rankings. Section 3
assesses the proposed ranking procedure by forecasting matches
based on established ranks. We first investigate our procedure in
the context of real data from previous NCAA basketball seasons.
We compare our rankings with rankings obtained by other popu-
lar procedures. Our second example is based on simulated NCAA
basketball data where the underlying strengths of the teams are
specified. This allows us to compare forecasts against the truth.
The final forecasting example is based on data from the 2016/2017
English Premier League season. This is a substantially different
dataset in that we have a much smaller number of teams ( n = 20 ).
In Section 4 , we consider various nuances related to our approach.
In particular, we compare our procedure to the Bradley–Terry ap-
proach where we observe the proposed method places more im-
portance on individual matchups than Bradley-Terry. We conclude
with a short discussion in Section 5 .
2. Approach
Our approach is based on data arising from paired comparisons.
In basketball, this represents a data reduction since each team
scores a specific number of points in a game. However, sometimes
the actual number of points scored can be misleading. For exam-
ple, in “blowouts”, teams often “empty their benches” near the end
of a game, meaning that regular players are replaced by players
who do not typically play in competitive matches. In such cases,
margins of victory may not be representative of true quality. Inter-
estingly, a requirement of the computer rankings used in the for-
mer BCS (Bowl Championship Series) for NCAA football was that
the computer rankings should not take into account margin of vic-
tory [10] .
With respect to paired comparison data, it is straightforward to
determine the preference when one team plays another team in
a single game. We let h denote the number of points correspond-
ing to the home court advantage in NCAA basketball. If the home
team defeats the road team by more than h points, then the home
team is the preferred team in the paired comparison. Otherwise,
the road team is the preferred team.
Our ranking procedure requires the specification of the home team advantage h. Since our approach does not contain a statistical model, the estimation of h must be done outside of the procedure. In NCAA basketball, Bessire [3] provided an average home team advantage of h = 3.7 points. The value h = 3.7 roughly agrees with Gandar et al. [11] who estimated a home court advantage of 4.0 points in the National Basketball Association (NBA). Since an NBA game is 48 min in duration and a college game is only 40 min in duration, the mapping from h = 4.0 in the NBA to college is 4.0(40/48) = 3.3. In an independent calculation, we studied pairs of NCAA basketball teams during the 2006/2007 through 2015/2016 seasons. Suppose team A played at home and defeated team B by $h_A$ points ($h_A$ is negative for a loss). Similarly, suppose team B then played at home and defeated team A by $h_B$ points ($h_B$ is negative for a loss). In this matchup, home advantage is estimated by $(h_A + h_B)/2$, and h is obtained by averaging these terms over all matchups. In the 26,206 matchups where pairs of teams played more than once with both home and away matches, we estimated the home court advantage as h = 3.4 points. Therefore, although there is a range of estimates for the home team advantage h in NCAA basketball, it appears safe to assume that $h \in (3.0, 4.0)$. The reason why this is important is that the determination of binary preferences in paired comparisons (the statistic used in our ranking procedure) is insensitive to $h \in (3.0, 4.0)$. Therefore, we arbitrarily set the home team advantage at h = 3.5 points. When a game is played at a neutral site, then h = 0. Whereas there is frequent discussion of differential home team advantages for individual teams (as opposed to an overall home court advantage h for all teams), we are inclined to believe that differential advantages are primarily the manifestation of multiple comparison issues [24] and unbalanced schedules.
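
To illustrate the home-and-away estimate, consider a simple additive reading (our assumption, not a model stated in the paper): if team A is better than team B by some amount on a neutral court, the two home margins are roughly h plus that amount and h minus that amount, so their average isolates h. The Python sketch below uses a hypothetical helper `estimate_home_advantage` and made-up margins purely for illustration.

```python
def estimate_home_advantage(matchups):
    """Average (h_A + h_B) / 2 over home-and-away matchups.

    Each matchup is a pair (h_a, h_b): the home team's margin when A hosted B
    and the home team's margin when B hosted A (negative for a home loss).
    """
    per_matchup = [(h_a + h_b) / 2 for h_a, h_b in matchups]
    return sum(per_matchup) / len(per_matchup)

# Hypothetical margins, for illustration only (not data from the paper).
example = [(10, 2), (-3, 8)]
print(estimate_home_advantage(example))  # (6.0 + 2.5) / 2 = 4.25

# Rescaling the NBA estimate of 4.0 points from 48 to 40 minutes of play.
nba_scaled = 4.0 * 40 / 48
print(round(nba_scaled, 1))  # 3.3
```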
More generally, suppose that two teams have played each other more than once. Let $p_{Ai}$ and $p_{Bi}$ be the points scored by Team A and Team B respectively in the i-th game. Then from Team A’s perspective, define the differential

$$d_i = \begin{cases} p_{Ai} - p_{Bi} - h & \text{if Team A is the home team} \\ p_{Ai} - p_{Bi} + h & \text{if Team B is the home team} \end{cases} \qquad (1)$$

In this case, Team A is the preferred team in the particular paired comparison if the average $\bar{d}$ of its $d_i$ values is positive.
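
The differential in (1) and the resulting direct preference can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the `venue` labels and the encoding of preferences as 1.0, 0.0 and 0.5 are our assumptions, with the 0.5 value following the tie convention the paper adopts later in this section.

```python
from statistics import mean

HOME_ADVANTAGE = 3.5  # value of h adopted in the text

def differential(points_a, points_b, venue, h=HOME_ADVANTAGE):
    """Differential from Team A's perspective for one game, following (1).

    venue is "A" if Team A is at home, "B" if Team B is at home,
    and "neutral" for a neutral site (where h = 0).
    """
    if venue == "A":
        return points_a - points_b - h
    if venue == "B":
        return points_a - points_b + h
    if venue == "neutral":
        return points_a - points_b
    raise ValueError(f"unknown venue: {venue!r}")

def level1_preference(games, h=HOME_ADVANTAGE):
    """Return 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.

    games is a list of (points_a, points_b, venue) tuples.
    """
    d_bar = mean(differential(pa, pb, v, h) for pa, pb, v in games)
    if d_bar > 0:
        return 1.0
    if d_bar < 0:
        return 0.0
    return 0.5

# A wins by 5 at home (d = 1.5), then loses by 6 on the road (d = -2.5).
print(level1_preference([(70, 65, "A"), (60, 66, "B")]))  # 0.0, B preferred
```

Because margins of victory are whole numbers, any h strictly between 3 and 4 gives the same preference in a single game, which is one way to read the insensitivity remark made earlier.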
When two teams have played each other directly, then we use (1) to determine the preference, and we refer to this as a level $L_1$ preference. With n = 351 NCAA basketball teams, there are $\binom{n}{2} = 61{,}425$ potential paired comparisons. Based on the 5948 matches that took place in 2015/2016, 3918 level $L_1$ preferences were observed. The level $L_1$ preferences represent only 6.38% of the potential $\binom{n}{2}$ paired comparisons.
We now consider cases where Team A and Team B have not directly played against each other. Our approach for determining preferences in these situations borrows ideas from the RPI (Rating Percentage Index) where strength of schedule is considered; see [2] for a definition of RPI. Specifically, suppose that Team A and Team B have a common opponent Team C. Then (1) can be used to obtain an average differential $\bar{d}_{AC}$ from the point of view of Team A versus Team C. Similarly, (1) can be used to obtain an average differential $\bar{d}_{BC}$ from the point of view of Team B versus Team C. If $\bar{d}_{AC} > \bar{d}_{BC}$, then Team A is the preferred team in the paired comparison of Team A versus Team B. Now, suppose that Team A and Team B have multiple common opponents $C_i$. In this case, if $\sum_i \bar{d}_{AC_i} > \sum_i \bar{d}_{BC_i}$, then Team A is preferred to Team B. When two teams do not play one another directly but have common opponents, then we refer to the resulting preference as a level $L_2$ preference. In the 2015/2016 dataset, $L_2$ preferences represent 54.95% of the potential $\binom{n}{2}$ paired comparisons.
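
The common-opponent rule can be illustrated with a short Python sketch. It assumes that the average differentials of each team against its opponents are already available as dictionaries, and that the comparison over several common opponents is made on summed differentials; the function name, the dictionary layout and the example values are hypothetical.

```python
def level2_preference(dbar_a, dbar_b):
    """Indirect preference of Team A versus Team B through common opponents.

    dbar_a maps opponent -> average differential of A versus that opponent,
    and dbar_b does the same for B. Returns 1.0 if A is preferred, 0.0 if B
    is preferred, 0.5 for a tie, and None if there is no common opponent.
    """
    common = set(dbar_a) & set(dbar_b)
    if not common:
        return None
    total_a = sum(dbar_a[c] for c in common)
    total_b = sum(dbar_b[c] for c in common)
    if total_a > total_b:
        return 1.0
    if total_a < total_b:
        return 0.0
    return 0.5

# Made-up differentials: C1 and C2 are common opponents, X and Y are not.
dbar_a = {"C1": 4.5, "C2": -1.0, "X": 10.0}
dbar_b = {"C1": 2.0, "C2": 0.5, "Y": -3.0}
print(level2_preference(dbar_a, dbar_b))  # 3.5 versus 2.5, so A is preferred
```

Since both sums run over the same set of common opponents, comparing sums or comparing averages yields the same preference.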
We extend the preference definition so that the data can be used to further determine preferences. Suppose now that Team A and Team B do not play each other directly and that they have no common opponents. However, imagine that Team A has an opponent and that Team B has an opponent who have a common opponent. For example, suppose Team A plays Team C, Team B plays Team D, Team C plays Team E and that Team D plays Team E. Without going into the notational details and using a similar approach as previously described, a differential $\bar{d}_{AE}$ can be determined via the AC and CE matchups. Similarly, a differential $\bar{d}_{BE}$ can be determined via the BD and DE matchups. Then $\bar{d}_{AE}$ can be compared with $\bar{d}_{BE}$ to determine the data preference between Team A and Team B. We refer to preferences of this type as level $L_3$ preferences. In the 2015/2016 dataset, $L_3$ preferences represent 38.67% of the potential $\binom{n}{2}$ paired comparisons. We therefore see that 6.38% + 54.95% + 38.67% = 100% of the potential $\binom{n}{2}$ paired comparisons are either of levels $L_1$, $L_2$ or $L_3$. Referring to the popular 1993 movie “Six Degrees of Separation” starring Will Smith, we observe three (rather than six) degrees of separation in the 2015/2016 NCAA basketball season.
We now make a small adjustment in the definition of preferences. Occasionally, there are “ties” in the preferences. For example, in the 2015/2016 NCAA basketball season, there were 18 cases out of the 3,918 level $L_1$ preferences where a tie occurred. This was the result of two teams playing each other twice, one game on each team’s home court. In both games, the home team won by the same margin leading to $\bar{d} = 0$. There are various ways of breaking the tie to determine the preference. For example, one might set the preference according to the most recent match. Our approach which we use throughout the remainder of the paper is to set the preference for both teams equal to 0.5.
Having defined level $L_1$, $L_2$ and $L_3$ preferences using the NCAA basketball data, we note that the data preferences are not necessarily transitive. For example, it is possible that Team A is preferred to Team B, Team B is preferred to Team C, and yet Team C is preferred to Team A. If transitivity were present, then the ranking of teams would be trivial. In the absence of transitivity, what is a good ranking? Recall that a ranking $R = (i_1, \ldots, i_n)$ has an implicit set of preferences whereby Team $i_j$ is preferred to Team $i_k$ whenever $i_j < i_k$. We let $L_k^R$ denote the number of times that the implied preferences based on the ranking R agree with the level $L_k$ data preferences. In this sense, $L_k^R$ is the number of “correct” preferences in R compared to the level $L_k$ preferences determined by the data. We then define

$$C(R) = L_1^R + L_2^R + L_3^R \qquad (2)$$

as the number of correct preferences. An optimal ranking $R^*$ is one which maximizes C(R) in (2) over the space of the n! permutations. Although we considered assigning varying weights to the terms in (2), we were unable to determine weights having a theoretical justification.
2.1. Computation
The first computational problem involves the calculation of the number of correct preferences C(R) for a given ranking R. A naive approach in calculating C(R) involves going through all of the $L_1$, $L_2$ and $L_3$ preferences and counting the number that agree with the implied preferences given by R. On an ordinary laptop computer, such a calculation requires over one hour of computation for a single ranking R in the NCAA basketball dataset. Since our optimization problem involves searching over the space of permutations R, a more efficient way of calculating C(R) is required.
To calculate C(R) for a given ranking R, we pre-process the data by creating three matrices corresponding to preferences at levels $L_1$, $L_2$ and $L_3$. In the n × n matrix $D^{(k)} = (\bar{d}^{(k)}_{ij})$, k = 1, 2, 3, we have the average differential $\bar{d}^{(k)}_{ij}$ from the point of view of Team i versus Team j based on a level $L_k$ paired comparison. Once these three matrices are constructed, it is easy to calculate $C(R) = C((i_1, \ldots, i_n))$ in (2) via

$$L_k^R = \sum_{j=1}^{n-1} \sum_{l=j+1}^{n} \left( I\big(\bar{d}^{(k)}_{i_j i_l} > 0\big) + 0.5\, I\big(\bar{d}^{(k)}_{i_j i_l} = 0\big) \right)$$

where I is the indicator function and the second term takes ties into account. With the pre-processing, the calculation of C(R) for a new R now takes roughly one second of computation.
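
The matrix-based count can be sketched as follows in plain Python. This is a minimal illustration under our own conventions: teams are indexed from 0, a ranking lists team indices from best to worst, and a matrix entry of `None` marks a pair that has no preference at that level (the paper does not say how such pairs are stored).

```python
def count_correct(ranking, d_matrices):
    """Number of correct preferences C(R) for a ranking (best team first).

    d_matrices is a list of n x n nested lists, one per level, where entry
    [i][j] is the average differential of Team i versus Team j, or None if
    the pair has no preference at that level.
    """
    total = 0.0
    n = len(ranking)
    for d in d_matrices:
        for j in range(n - 1):
            for l in range(j + 1, n):
                value = d[ranking[j]][ranking[l]]
                if value is None:
                    continue
                if value > 0:
                    total += 1.0   # ranking agrees with the data preference
                elif value == 0:
                    total += 0.5   # tie in the data preference
    return total

# Three teams with an intransitive cycle: 0 beats 1, 1 beats 2, 2 beats 0.
d1 = [[None, 2.0, -3.0],
      [-2.0, None, 1.0],
      [3.0, -1.0, None]]
print(count_correct([0, 1, 2], [d1]))  # 2.0
print(count_correct([2, 1, 0], [d1]))  # 1.0
```

In the cyclic example no ranking can agree with all three data preferences, which is the situation the optimization is designed to handle.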
Recall that there are $n! \approx 10^{742}$ rankings R in the NCAA dataset, and therefore calculation of C(R) for all rankings is impossible. To maximize C(R) with respect to R over the space of the n! rankings, we implemented a version of the simulated annealing algorithm [16]. Simulated annealing is a stochastic search algorithm that explores the vast combinatorial space, spending more time in regions corresponding to promising rankings. In this problem, we begin with an initial ranking $R_0$. In the i-th step of the algorithm, a candidate ranking $R_{\text{new}}$ is generated in a neighbourhood of the ranking $R_{i-1}$ from step i − 1. If $C(R_{\text{new}}) > C(R_{i-1})$, then the ranking $R_i = R_{\text{new}}$ is accepted as the current state. In the case where $C(R_{\text{new}}) \le C(R_{i-1})$, then $R_i = R_{\text{new}}$ if a randomly generated uniform(0,1) variate $u < \exp\{(C(R_{\text{new}}) - C(R_{i-1}))/t_i\}$ where $t_i > 0$ is a parameter often referred to as the temperature. Otherwise the current ranking $R_i = R_{i-1}$ is set at the previous ranking. The algorithm iterates according to a sequence of non-increasing temperatures $t_i \to 0$. The states (rankings) $R_0, R_1, \ldots$ form a Markov chain. The algorithm terminates after a fixed number of iterations or when state changes occur infrequently. Under a ‘suitable’ neighbourhood structure, asymptotic results suggest that the final state will be nearly optimal.
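
The acceptance rule and the iteration can be sketched generically in Python. This is a minimal sketch, not the authors' implementation: `accept` follows the rule described above, while `toy_score`, `toy_propose` and the temperature sequence are stand-ins we introduce so that the example runs on a six-item ranking.

```python
import math
import random

def accept(c_new, c_old, temperature, rng=random):
    """Always accept an improvement; otherwise accept with probability
    exp((c_new - c_old) / temperature)."""
    if c_new > c_old:
        return True
    return rng.random() < math.exp((c_new - c_old) / temperature)

def anneal(initial, score, propose, temperatures, rng=random):
    """Run one annealing pass; `temperatures` supplies t_i for each step."""
    current = list(initial)
    c_current = score(current)
    for t in temperatures:
        candidate = propose(current, rng)
        c_candidate = score(candidate)
        if accept(c_candidate, c_current, t, rng):
            current, c_current = candidate, c_candidate
    return current, c_current

# Toy stand-ins (assumptions): the score counts pairs in ascending order and
# the neighbourhood swaps two randomly chosen positions.
def toy_score(r):
    return sum(r[a] < r[b] for a in range(len(r)) for b in range(a + 1, len(r)))

def toy_propose(r, rng):
    a, b = rng.sample(range(len(r)), 2)
    out = list(r)
    out[a], out[b] = out[b], out[a]
    return out

rng = random.Random(0)
temps = [2.0 * 0.99 ** i for i in range(500)]  # non-increasing, illustrative
final, c_final = anneal([5, 4, 3, 2, 1, 0], toy_score, toy_propose, temps, rng)
print(final, c_final)
```

Note that a candidate with the same score as the current state is always accepted under this rule, since exp(0) = 1.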
Success of the simulated annealing algorithm depends greatly on fine tuning of the algorithm. In particular, the user must specify the cooling schedule (i.e. the temperatures $t_i$) and also the neighbourhood structure for generating successive states from a given state. Aarts and Korst [1] discuss fine tuning of the algorithm.
Our implementation of simulated annealing begins with the recognition that our problem shares similarities with the well-studied travelling salesman problem. For example, like our problem, the state space in the travelling salesman problem consists of permutations, namely permutations of the cities that are visited by the salesman. Also, in the same way that an interchange in the order of two adjacent cities in a permutation should not greatly affect the total travelling distance for the salesman, an interchange in the order of two adjacent teams in a permutation (ranking) should not greatly affect the number of correct preferences C(R). Accordingly, our implementation of simulated annealing uses an exponential cooling schedule in early stages defined by a sequence of temperature plateaux; this approach has been successfully used in the travelling salesman problem [1].
After extensive experimentation, we have tuned our algorithm and we propose an optimization schedule that is suited to the NCAA basketball seasons under consideration. Specifically, we consider m = 1, ..., 10 blocks (procedures) where the first eight blocks correspond to simulated annealing. In simulated annealing, the Markov chain consists of $B_m$ iterations in the m-th block with temperature $t_m$. The temperatures decrease exponentially from one block to the next according to $t_m = 20(0.82)^{m-1}$ where m = 1, ..., 5. Therefore, it is more difficult to accept downward moves (i.e. when $C(R_{\text{new}}) < C(R_{i-1})$) in the final blocks. In the first m = 5 blocks of simulated annealing, we refer to generation of candidate rankings as the “Permutation” procedure. Specifically, within the m-th block, consider the previous state $R_{i-1} = (i_1, \ldots, i_n)$ and generate a discrete uniform variable l on $(1, \ldots, n - k_m + 1)$ where the parameter $k_m$ is user-specified. We then randomly permute $(i_l, i_{l+1}, \ldots, i_{l+k_m-1})$ yielding $(j_l, j_{l+1}, \ldots, j_{l+k_m-1})$. The candidate state in the algorithm is then given by $R_{\text{new}} = (i_1, \ldots, i_{l-1}, j_l, \ldots, j_{l+k_m-1}, i_{l+k_m}, \ldots, i_n)$. In the application, $k_m$ is the number of consecutive teams in the previous ranking that are permuted. Once permuted, a candidate ranking is obtained. In keeping with the heuristic that state changes should be “smaller” as simulated annealing proceeds, we propose a schedule where the tuning parameter $k_m$ decreases as m increases.

Table 1
Schedule for the optimization algorithm. For the m-th block in the Permutation procedure we provide the block size $B_m$ and the number of consecutive teams $k_m$ in the permutation. For the non-greedy procedures, we also provide the temperature $t_m$.

m    Procedure      B_m      k_m    t_m
1    Permutation    2000     65     20.00
2    Permutation    3000     60     16.40
3    Permutation    4000     55     13.45
4    Permutation    5000     45     11.03
5    Permutation    6000     40     9.04
6    NGShuffle      25000    -      3.00
7    NGShuffle      25000    -      2.00
8    NGShuffle      25000    -      1.00
9    GShuffle       75000    -      -
10   Housekeeping   -        -      -
When the first five blocks of the algorithm have completed, we carry out a procedure referred to as “Shuffle” in blocks m = 6, 7, 8, 9. The idea behind Shuffle is that whereas the Permutation procedure can lead to candidate rankings that differ considerably from the current ranking, Shuffle produces new rankings where only one “misplaced” team shuffles from its current position. Specifically, given the previous ranking $R_{i-1}$, Shuffle proceeds by generating a discrete uniform random variable l on (1, ..., n). Then another discrete uniform random variable j is generated on (max(1, l − 50), ..., min(l + 50, n)). Shuffle updates from $R_{i-1}$ to $R_{\text{new}}$ if $R_{\text{new}}$ is accepted and where $R_{\text{new}}$ has the same ordering as $R_{i-1}$ except that the team ranked l is moved to position j. Therefore, teams as far apart as 100 places could potentially switch places; we did not want to switch teams further apart as such switches are unlikely to yield an improved ranking. In blocks m = 6, 7, 8, Shuffle is more accurately described as Non-Greedy Shuffle (NGShuffle) where temperatures $t_6$, $t_7$, $t_8$ are specified as part of the simulated annealing procedure. In block m = 9, the Shuffle procedure is modified as Greedy Shuffle (GShuffle). A greedy procedure is one where only non-negative moves towards the maximum are allowed (i.e. $C(R_{\text{new}}) \ge C(R_{i-1})$). The motivation is that when the algorithm nears termination, we only want to be moving in directions which provide improvements.
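
A Shuffle move and the greedy acceptance rule can be sketched in Python as follows. The helper names are ours, indices are zero-based, and the sketch reads "moved to position j" as removing the team from position l and reinserting it at position j, with the teams in between shifting by one place.

```python
import random

def move_team(ranking, l, j):
    """Return a copy of `ranking` with the team at index l moved to index j."""
    candidate = list(ranking)
    team = candidate.pop(l)
    candidate.insert(j, team)
    return candidate

def shuffle_proposal(ranking, rng=random, max_shift=50):
    """Pick a team uniformly and move it at most `max_shift` places."""
    n = len(ranking)
    l = rng.randrange(n)
    j = rng.randint(max(0, l - max_shift), min(l + max_shift, n - 1))
    return move_team(ranking, l, j)

def greedy_accept(c_new, c_old):
    """Greedy Shuffle only accepts moves that do not decrease C(R)."""
    return c_new >= c_old

print(move_team(list("abcde"), 0, 3))  # ['b', 'c', 'd', 'a', 'e']
```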
Finally, in block m = 10 of the algorithm, we carry out another greedy procedure which we refer to as “Housekeeping”. Housekeeping investigates the effect of even smaller changes to the ranking R following the GShuffle procedure (i.e. block m = 9). Specifically, we take $R = (i_1, i_2, \ldots, i_n)$ and we sweep through the solution by inspecting quintuples $(i_j, i_{j+1}, i_{j+2}, i_{j+3}, i_{j+4})$ beginning with j = 1 and ending with j = n − 4. For each quintuple, we calculate C(R) for the 120 permutations of the quintuple to see if any of the potential rankings lead to an improved solution. Whenever an improved permutation is detected, the ranking is updated accordingly.
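
The Housekeeping sweep can be sketched in Python with `itertools.permutations`. This is a minimal illustration under our own assumptions: the score function is passed in as an argument, and the toy score `concordant_pairs` (the number of pairs in ascending order) stands in for C(R) so that the example is self-contained.

```python
from itertools import permutations

def housekeeping(ranking, score, width=5):
    """Sweep windows of `width` consecutive teams and keep any reordering
    of a window that strictly improves the score."""
    best = list(ranking)
    best_score = score(best)
    for j in range(len(best) - width + 1):
        window = best[j:j + width]
        for perm in permutations(window):   # 120 orderings when width = 5
            candidate = best[:j] + list(perm) + best[j + width:]
            c = score(candidate)
            if c > best_score:
                best, best_score = candidate, c
    return best, best_score

def concordant_pairs(r):
    """Toy score: number of pairs (a, b) with a ahead of b and r[a] < r[b]."""
    return sum(r[a] < r[b] for a in range(len(r)) for b in range(a + 1, len(r)))

print(housekeeping([1, 0, 3, 2, 4, 6, 5], concordant_pairs))
```

Each sweep evaluates the score 120 times per window, that is, about 120(n − 4) evaluations in total.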
Table 1 summarizes the schedule for the optimization algo-
rithm. In the NCAA basketball example, one run of the optimiza-
tion procedure takes approximately 36 h of computation. This is
not onerous for a task that might be expected to be carried out
once per week.
Fig. 1. A plot of the number of correct preferences $C(R_i)$ versus the iteration number i in the optimization algorithm.
Fig. 1 provides a plot of the optimization algorithm corresponding to the preferences obtained in the 2015/2016 NCAA basketball season. We see that the algorithm moves quickly towards an optimal ranking and then slowly improves. The algorithm was initiated from a promising ranking $R_0$ (the 2015/2016 season ending RPI rankings). However, the algorithm works equally well using less promising initial states.
The simulated annealing algorithm provides guarantees of con-
vergence to a global maximum. However, in practical computing
times, it may be the case that our proposed algorithm gets stuck
in a local mode and only gets “close” to a maximum. It is a general
drawback of stochastic search algorithms that there is no definitive
rule for terminating algorithms. Generally, when changes stop tak-
ing place, this is a signal to stop. In our work, we have been mind-
ful of this, and have added an extra layer of insurance by carrying
out multiple runs of the algorithm. We choose to run the algorithm
M = 20 times which does not take any extra time because we are
able to submit our job to a cluster colony of processors. We then
choose the ranking $R^*$ which corresponds to the maximum value
of C(R) from the M runs.
The multiple runs also provide us with some confidence that our resultant ranking $R^*$ yields $C^* = C(R^*)$ which is close to the global maximum. From the M = 20 runs, we have observed that the resultant maxima $C_1, \ldots, C_M$ are roughly symmetric. To gain some insight, we therefore make the assumption that the maxima are approximately normally distributed with mean $\bar{C}$ and standard deviation given by the sample standard deviation $s_C$. In extreme value theory, the probability density function of $C^*$, the M-th order statistic of $C_1, \ldots, C_M$, is therefore approximately

$$f(C^*) = \frac{M}{s_C}\, \phi\!\left(\frac{C^* - \bar{C}}{s_C}\right) \left[\Phi\!\left(\frac{C^* - \bar{C}}{s_C}\right)\right]^{M-1} \qquad (3)$$

where φ and Φ are the density function and the distribution function of the standard normal distribution, respectively. In Fig. 2, we plot the density function of $C^*$ given by (3) based on the observed maxima $C_1, \ldots, C_M$ from the M = 20 runs. Based on the observed value $C^* = 53388.5$, the plot suggests that we can be confident that we are close to the global maximum. In particular, it looks unlikely that $C^* = 53388.5$ could be off from the global maximum by much more than 6.0.

Fig. 2. A plot of the density function (3) of the largest order statistic $C^*$ based on the optimal C(R) values $C_1, \ldots, C_M$ from the M = 20 runs of the optimization algorithm using the 2015/2016 NCAA basketball dataset.
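
The density of the largest of M independent normal variates can be evaluated with the standard library alone. The sketch below is a minimal Python illustration with hypothetical function names; it uses a standardized example (mean 0, standard deviation 1) rather than the paper's observed maxima, which are not listed in the text.

```python
import math

def std_normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_order_stat_density(c, c_bar, s_c, m):
    """Density of the maximum of m i.i.d. Normal(c_bar, s_c^2) variates."""
    z = (c - c_bar) / s_c
    return (m / s_c) * std_normal_pdf(z) * std_normal_cdf(z) ** (m - 1)

# Numerical sanity check: the density should integrate to about 1.
step = 0.01
grid = [-10.0 + step * i for i in range(2001)]
total = sum(max_order_stat_density(z, 0.0, 1.0, 20) for z in grid) * step
print(round(total, 4))
```

With m = 1 the expression reduces to the ordinary normal density, and for m = 20 most of the mass sits well above the mean of the individual maxima.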
3. Forecasting
3.1. NCAA basketball data
We now compare our proposed ranking procedure with four
widely reported ranking systems used in NCAA basketball (Bihl,
Massey, Pomeroy and RPI).
We consider rankings that have been published over five seasons (2011/2012 through 2015/2016) where we note that the Pomeroy rankings were unavailable in the 2012/2013 and 2013/2014 seasons. For each season, we consider 7 time points $t_1, \ldots, t_7$ where rankings are reported. The time periods roughly correspond to mid December, early January, mid January, early February, mid February, early March and mid March. For each ranking system and for each time period $(t_i, t_{i+1})$, our evaluation considers matches played in the time period and the ranking based at time $t_i$. Except for the last time period coinciding with March Madness, there are approximately 500 matches played in each time period per year. In a given match, the outcome is categorized as correct if the home team wins by more than h = 3.4 points and the home team has the higher ranking. The match outcome is also categorized as correct if the road team wins or loses by less than h = 3.4 points and the road team has the higher ranking. On a neutral court, the match outcome is considered correct if the higher ranked team wins.
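
The categorization rule can be written as a small Python function. This is a sketch under our own conventions: the margin is the home team's points minus the road team's points (or first-listed minus second-listed team on a neutral court), and a smaller rank number means a higher ranked team.

```python
def prediction_correct(margin, rank_home, rank_road, neutral=False, h=3.4):
    """Return True if the ranking 'predicted' the adjusted match outcome.

    margin is home points minus road points; on a neutral court the home
    advantage is taken to be zero and the 'home' team is simply the first
    listed team.
    """
    threshold = 0.0 if neutral else h
    home_preferred = margin > threshold
    home_ranked_higher = rank_home < rank_road
    return home_preferred == home_ranked_higher

print(prediction_correct(5, 1, 2))   # home wins by 5, home ranked higher: True
print(prediction_correct(2, 1, 2))   # home wins by only 2: False
print(prediction_correct(2, 2, 1))   # road team ranked higher, loses by 2: True
```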
Over all the predictions made during the five year period, we
calculated the percentage of correct predictions by each of the
ranking systems. In order of the highest percentages, we observed
Pomeroy (69.6%), Massey (69.4%), our proposed method (69.2%),
Bihl (68.7%) and RPI (67.8%). Although the percentages are reason-
ably close, the five methods exhibited fairly consistent orderings
on a yearly basis. It is interesting that the RPI approach exhibited
the lowest percentage, yet RPI is used by the NCAA Selection Com-
mittee in their March Madness deliberations.
Table 2 provides the top five ranked teams using the five rank-
ing methods at the end of the 2015/2016 NCAA basketball sea-
son. We observe a lot of agreement in the sets of rankings. How-
ever, our proposed approach is interesting in that it provides a no-
table difference from the other rankings. In particular, Kansas is
excluded from the top five whereas Xavier is included. This per-
spective is interesting as Kansas had a good season (33 wins ver-
sus 5 losses). However, their five losses came against strong teams
(Michigan St, West Virginia, Oklahoma State, Iowa State and Vil-
lanova), all top 20 AP (Associated Press) teams except for Okla-
homa State. This highlights the importance of the head-to-head
matchups which is discussed in Section 4.1 . We note that our rank-
ing had Kansas in the seventh position. On the other hand, Xavier
(not a traditional powerhouse school) had a strong 28-6 record and
may have been overlooked by some of the other ranking methods.
3.2. Simulated NCAA basketball data
Although the previous example using actual NCAA basketball
data was instructive, it did not allow us to make comparisons with
the “truth” since the correct rankings based on team strengths in
actual seasons are always unknown. In this example, we consider
simulated data sets where we can initially set team strengths so
that the true rankings are known to us.
Therefore, in the context of NCAA basketball, we consider n =
351 teams where the team rankings are set according to alphabet-
ical order. For example, Team 1 is the best team and its schedule
is determined by the 2015/2016 schedule for Abilene Christian (al-
phabetically first). Team 351 is the weakest team and its schedule
is determined by the 2015/2016 schedule for Youngstown State
(alphabetically last). For a match between Team i and Team j on a
neutral court, the observed point differential in favour of Team i is
modeled according to the Normal$(\mu_i - \mu_j, \sigma^2)$ distribution, where
the normal distribution is a common assumption for NCAA basketball [23],
and we set $\sigma = 9.3$, which is consistent with [25]. If the
normal variate is greater (less) than zero, then Team i is the win-
ning (losing) team. For home and road matches, winners and losers
are determined by using the same procedure as if the match was
played on a neutral court.
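The match-generation step can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: it assumes the mean of the differential is the strength difference between the two teams (consistent with the expected margins quoted for the two strength scenarios), and the function name `simulate_match` and the seed are hypothetical.

```python
import random

SIGMA = 9.3  # standard deviation of the point differential, as stated in the text


def simulate_match(mu_i, mu_j, rng, sigma=SIGMA):
    """Simulate one neutral-court match between Team i and Team j.

    The differential in favour of Team i is drawn from a normal
    distribution with mean mu_i - mu_j (assumed) and standard
    deviation sigma. Team i wins when the differential is positive.
    """
    diff = rng.gauss(mu_i - mu_j, sigma)
    return diff, diff > 0


rng = random.Random(2016)  # hypothetical seed, used only for reproducibility
n_games = 10_000
wins = sum(simulate_match(35.0, 0.0, rng)[1] for _ in range(n_games))
print(wins / n_games)  # fraction of matches won by the stronger team
```

Only the sign of the simulated differential is used afterwards, since the ranking method works with win/loss preferences.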
We consider two team strength scenarios. In the first case, we
set team strengths according to $\mu_i = 35.1 - (0.1)i$ such that Team
1 has strength 35.0 and Team 351 has strength 0.0. This implies, for
example, that the strongest team is expected to defeat the weakest
team by 35 points on a neutral court. In the second case, we
set team strengths according to $\mu_i = 52.65 - (0.15)i$, which implies
that the strongest team is expected to defeat the weakest team by
52.5 points on a neutral court.
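To make the two scenarios concrete, the following sketch builds both strength vectors and evaluates the implied neutral-court win probability, which under the normal model is the standard normal distribution function evaluated at the strength difference divided by 9.3. The helper names `strengths` and `win_prob` are hypothetical, and the linear form (intercept minus slope times i) is the one implied by the endpoint values quoted above.

```python
import math

SIGMA = 9.3  # standard deviation of the point differential
N_TEAMS = 351


def strengths(intercept, slope, n=N_TEAMS):
    """Linear team strengths mu_i = intercept - slope * i for i = 1..n."""
    return [intercept - slope * i for i in range(1, n + 1)]


def win_prob(mu_i, mu_j, sigma=SIGMA):
    """P(Team i beats Team j) = Phi((mu_i - mu_j) / sigma)."""
    z = (mu_i - mu_j) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


case1 = strengths(35.1, 0.1)
case2 = strengths(52.65, 0.15)

for name, mu in (("Case 1", case1), ("Case 2", case2)):
    print(name,
          "best vs worst:", round(win_prob(mu[0], mu[-1]), 5),
          "adjacent teams:", round(win_prob(mu[0], mu[1]), 4))
```

Adjacent teams are close to evenly matched in both cases, which is why a single season of results cannot be expected to recover the true ordering exactly.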
Our comparison via simulation proceeds by generating M = 10
seasons of matches according to the above description, where h = 3.5
is set as the home team advantage. In the jth season, we take
the resultant ranking $R^{(j)} = (r_1^{(j)}, \ldots, r_n^{(j)})$, where $r_i^{(j)}$ denotes
the rank assigned to Team i, and compare it to the true
ranking $(1, \ldots, n)$. We do this using two comparison metrics,
$C_1^{(j)} = \frac{1}{n}\sum_{i=1}^{n} |r_i^{(j)} - i|$ and
$C_2^{(j)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (r_i^{(j)} - i)^2}$. We repeat
the procedure over the M = 10 seasons to obtain the overall
comparison metrics $C_1$ and $C_2$.
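A short sketch of how such comparison metrics might be computed is given below. It assumes that the first metric is the mean absolute deviation and the second the root mean squared deviation between assigned and true ranks (a reading consistent with the later interpretation of the metrics as an average deviation), and that the overall metrics are season averages with the usual standard error. The function names are hypothetical.

```python
import math
import statistics


def c1(ranking):
    """Mean absolute deviation; ranking[i-1] is the rank given to Team i."""
    n = len(ranking)
    return sum(abs(r - i) for i, r in enumerate(ranking, start=1)) / n


def c2(ranking):
    """Root mean squared deviation between assigned and true ranks."""
    n = len(ranking)
    return math.sqrt(sum((r - i) ** 2 for i, r in enumerate(ranking, start=1)) / n)


def summarise(values):
    """Average over seasons and standard error (sample sd / sqrt(M), assumed)."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


# Toy example (hypothetical): two seasons, four teams.
seasons = [[1, 3, 2, 4], [2, 1, 3, 4]]
print(summarise([c1(r) for r in seasons]))
print(summarise([c2(r) for r in seasons]))
```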
Our simulation involves a comparison of our proposed ranking
method with the Bradley–Terry approach which is considered the
benchmark procedure for paired comparison data. Bradley–Terry

Table 2
Final rankings of the top five teams at the end of the 2015/2016 season.

| Method | First | Second | Third | Fourth | Fifth |
|---|---|---|---|---|---|
| Pomeroy | Villanova | N Carolina | Virginia | Kansas | Michigan St |
| Massey | Villanova | Kansas | N Carolina | Virginia | Oklahoma |
| Proposed | Villanova | Michigan St | N Carolina | Xavier | Virginia |
| Bihl | Kansas | Villanova | N Carolina | Oklahoma | Virginia |
| RPI | Kansas | Villanova | Virginia | Oregon | N Carolina |

Table 3
Comparison metrics with standard errors in parentheses for two ranking
systems (Proposed versus Bradley–Terry) studied under two simulation cases.

| Case | Metric | Proposed procedure | Bradley–Terry procedure |
|---|---|---|---|
| Case 1: $\mu_i = 35.1 - (0.1)i$ | $C_1$ | 12.7 (0.39) | 13.2 (0.85) |
| | $C_2$ | 15.4 (0.44) | 16.9 (0.99) |
| Case 2: $\mu_i = 52.65 - (0.15)i$ | $C_1$ | 9.1 (0.25) | 11.7 (0.44) |
| | $C_2$ | 11.0 (0.34) | 15.0 (0.47) |

estimation procedure fails when there is more than one winless
team. For this reason, we assign 0.5 wins in those rare cases where
there are winless teams. We note that Bayesian implementations of
Bradley–Terry as mentioned in the Introduction do not suffer from
this drawback. We are unable to make comparisons with some of
the systems that are frequently reported in NCAA basketball (e.g.
Sagarin, Pomeroy, Massey or RPI) since the systems are proprietary
and the code is unavailable.
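For readers unfamiliar with the benchmark, a minimal Bradley–Terry fit can be sketched with the classical iterative update, in which each team's strength is set to its total wins divided by a sum over opponents of games played over combined strengths. This is an illustrative sketch, not the implementation used in the simulation. The function names and toy data are hypothetical, and the code assumes every team has at least one win (or has already been given fractional wins as described above) and that the teams are connected through wins and losses.

```python
def fit_bradley_terry(wins, n_iter=2000):
    """Estimate Bradley-Terry strengths by the classical iterative update.

    wins[i][j] is the number of times team i beat team j (fractional
    values are allowed). Returns strengths scaled to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Games against each opponent, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        scale = sum(updated)
        p = [x / scale for x in updated]
    return p


def expected_wins(p, wins, i):
    """Wins the fitted model expects for team i over its actual schedule."""
    return sum(
        (wins[i][j] + wins[j][i]) * p[i] / (p[i] + p[j])
        for j in range(len(p))
        if j != i
    )


# Toy data (hypothetical): three teams, each pair meeting four times.
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
p = fit_bradley_terry(wins)
ranking = sorted(range(len(p)), key=lambda i: -p[i])
print(ranking, [round(x, 3) for x in p])
```

At convergence the expected number of wins for each team matches its observed total. A team with zero wins would therefore be driven to zero strength, which illustrates why winless teams need special handling.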
Table 3 reports the results of the simulation procedure. The
metrics are interesting as they may be interpreted as the aver-
age deviation between a team’s ranking and its true ranking. We
observe that both ranking procedures are improved in the second
simulation case. This makes sense as there is more variability be-
tween teams in Case 2 than in Case 1, and it is therefore more
likely for a ranking method to differentiate between teams. Note
that any two teams will have a greater mean difference $|\mu_i - \mu_j|$
(actual difference in strength) in Case 2 than in Case 1. Further,
we observe that, in both simulation cases and using both metrics,
the proposed ranking procedure gives better rankings than
Bradley–Terry. The reported standard errors suggest that the
improvements are statistically significant in the second simulation
case. We believe that Case 2 is more realistic than Case 1 in de-
scribing a wider range in quality between NCAA teams.
It would be interesting to repeat the simulation where team
strengths were not linear but followed a Gaussian specification. For
example, one could generate the $\mu_i$ from a Normal(0, 36) distribution
and then sort the $\mu_i$ such that $\mu_1$ is the largest and $\mu_{351}$ is the
smallest. This may be a more realistic description of a population
of team strengths.
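The suggested Gaussian specification could be generated as in the sketch below. It assumes that Normal(0, 36) denotes a variance of 36 (standard deviation 6), following the mean-and-variance convention used for the point differential. The function name and seed are hypothetical.

```python
import random


def gaussian_strengths(n=351, variance=36.0, seed=None):
    """Draw n strengths from Normal(0, variance) and sort them so that
    the first entry (Team 1) is the strongest and the last is the weakest."""
    rng = random.Random(seed)
    sd = variance ** 0.5
    draws = [rng.gauss(0.0, sd) for _ in range(n)]
    return sorted(draws, reverse=True)


mu = gaussian_strengths(seed=351)  # hypothetical seed
print(round(mu[0] - mu[-1], 1))  # spread between strongest and weakest team
```

Unlike the linear scenarios, this places most teams close together in the middle of the strength range, with larger gaps among the strongest and weakest teams.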
3.3. English Premier League data
Whereas NCAA basketball consisted of n = 351 teams in
2015/2016, the English Premier League (EPL) is a much smaller
league with n = 20 teams. Therefore, the EPL provides a different
type of challenge for our ranking procedure.
In the EPL, each team plays both a home and a road game
against every other team for a total of 38 matches in a season. We
begin by setting two dates during the 2016/2017 EPL season where
ranks based on our procedure are determined. These dates roughly
correspond to weeks 19 and 27 of the season. We chose not to
extend the dates to the latter part of the season as unusual playing
behaviours sometimes occur. For example, in the latter portion
of the 2016/2017 season, Manchester United was more focused on
their Europa League matches than on their EPL matches. It was believed
that they had a greater chance of Champions League qualification
via the Europa League route. Based on these dates, each
team had played every other team at least once, and therefore
team comparisons were based entirely on level $L_1$ preferences. The
data matrix $D^{(1)}$ was constructed for each of the two dates where
the home field advantage was set at h = 0.5 [21]. Recall that the
calculation of paired comparison preferences is insensitive to the
choice of h in the wide interval $h \in (0, 1)$.
We note that the full strength of the optimization algorithm de-
scribed in Table 1 was not required since we have fewer teams.
We instead initiated the procedure beginning in block m = 6. We
also modified the Shuffle procedure where we now generate
independent uniform variates l and j on (1, ..., 20). Under modified
Shuffle, the candidate ranking $R_{\text{new}}$ has the same ordering as $R_{i-1}$
except that team l is inserted into position j. The modified Shuffle
procedure allows all possible pairs of teams to switch places.
In this case, the optimization procedure was carried out in roughly
45 s of computing for each of the two time periods.
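The modified Shuffle move can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the EPL analysis: a ranking is represented as a list of team labels ordered from best to worst, l is read as a team label, j as a 1-based position, and the function names are hypothetical.

```python
import random


def insert_team(ranking, team, position):
    """Return a copy of ranking with `team` moved to 1-based `position`;
    the relative order of all other teams is unchanged."""
    others = [t for t in ranking if t != team]
    others.insert(position - 1, team)
    return others


def modified_shuffle(ranking, rng):
    """Propose a candidate ranking: draw l and j independently and
    uniformly on 1..n, then insert team l into position j."""
    n = len(ranking)
    l = rng.randint(1, n)  # team to move (labels assumed to be 1..n)
    j = rng.randint(1, n)  # target position
    return insert_team(ranking, l, j)


rng = random.Random(20)  # hypothetical seed
current = list(range(1, 21))  # 20 EPL teams in some current order
candidate = modified_shuffle(current, rng)
print(candidate)
```

Because l and j range over all 20 values, any team can be moved to any position in a single step, which suits a small league where long-range moves are cheap to evaluate.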
An advantage of working with a smaller league is the increased
confidence that optimal rankings are obtained. Multiple runs of
the algorithm based on different initial rankings typically gave the
same value of C(R). However, we did discover that the rankings
were not unique. We found three optimal rankings at the first date
and two optimal rankings at the second date.
Table 4 provides both the EPL table (standings) and the opti-
mal rankings at the two dates. We observe some meaningful dif-
ferences between the tables and the ranks. On the Jan 1/17 date,
the largest discrepancies between the table and the optimal ranks
involve Middlesbrough, Arsenal and Watford. The optimal rank-
ings suggest that Middlesbrough is stronger (9 placings), Arsenal
is weaker (6 placings) and Watford is weaker (6 placings) than the
table indicates. Middlesbrough’s strength (according to our rank-
ing) was aided by “wins” (i.e. taking into account home team ad-
vantage) over Manchester City, Arsenal and West Brom. We also
observe that the three optimal rankings on Jan
1/17 are similar; the only differences involve the top three sides
Chelsea, Liverpool and Manchester United. On the Mar 6/17 date,
the largest discrepancies between the table and the optimal rankings
involve Manchester City (7 places lower according to one of the
optimal rankings), Leicester City (6 places higher according to one of
the optimal rankings) and Sunderland (6 places higher according to
both optimal rankings).
Having observed some of the large discrepancies between the
standings and the optimal rankings in Table 4, it is difficult to
assess which lists are more sensible as measures of team strength.
Perhaps large discrepancies indicate to gamblers that there is
something interesting about such teams, that there may be a par-
tial explanation for their standings at a given point in time.
It is also interesting to compare the optimal rankings in Table 4.
The Jan 1/17 optimal rankings are the same except for the ordering
of the teams in the first three positions of the table. The Mar
6/17 optimal rankings have stability in the bottom half of the
orderings but contain more overall variability. For example, we
observe Manchester City in 10th place according to one optimal ranking
and in 5th place according to the other. At that stage of the season,
Manchester City
was doing well pointwise, sitting 3rd in the table. However, they
had suffered four of their five losses to “bigger” teams such as Tot-