Operations Research Perspectives 5 (2018) 105–112
Contents lists available at ScienceDirect
Operations Research Perspectives
journal homepage: www.elsevier.com/locate/orp
A computationally intensive ranking system for paired comparison data

David Beaudoin^a,1, Tim Swartz^b,1,*

^a Département Opérations et Systèmes de Décision, Faculté des Sciences de l'Administration, Université Laval, Pavillon Palasis-Prince, Bureau 2439, Québec (Québec) G1V 0A6, Canada
^b Department of Statistics and Actuarial Science, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
Article info

Article history:
Received 26 December 2017
Revised 7 March 2018
Accepted 24 March 2018
Available online 27 March 2018

Keywords:
Nonparametric methods
NCAA basketball
Ranking
Simulated annealing
Statistical computing
Abstract

In this paper, we introduce a new ranking system where the data are preferences resulting from paired comparisons. When direct preferences are missing or unclear, then preferences are determined through indirect comparisons. Given that a ranking of n subjects implies \binom{n}{2} paired preferences, the resultant computational problem is the determination of an optimal ranking where the agreement between the implied preferences via the ranking and the data preferences is maximized. Comparisons are carried out via simulation studies where the proposed rankings outperform Bradley–Terry in a particular predictive comparison.

© 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
The problem of ranking can be simply stated and has an extensive literature in the statistical sciences. Given data on n subjects, the objective is to determine a permutation (ranking) R = (i_1, ..., i_n) where the interpretation is that subject i_j is preferable to subject i_k whenever i_j < i_k. The term "preferable" depends on the application, and the methods used to determine the ranking depend on aspects of the data structure.
In sport, ranking is an important problem. For example, in National Collegiate Athletic Association (NCAA) basketball, there are over 300 teams competing in Division I, where a typical team plays only a subset (∼25) of the other teams during a season. At the end of the season, the NCAA Selection Committee is tasked with creating a tournament structure known as "March Madness" involving 68 of these teams. In determining the invitees, team rankings (in terms of team quality) form part of the decision-making process.
* Corresponding author.
E-mail address: tswartz@sfu.ca (T. Swartz).
1 Both authors have been partially supported by the Natural Sciences and Engineering Research Council of Canada. This research was enabled in part by support provided by Calcul Québec (http://www.calculquebec.ca/en/) and Compute Canada (www.computecanada.ca). The authors thank three reviewers who carefully read the paper and whose comments have improved the manuscript.
Similarly, in NCAA football, various team rankings are regularly reported during the regular season (e.g. Associated Press, FCS Coaches' Poll, Sagarin, etc.). Although such rankings are no longer used for determining Bowl bids (i.e. identifying pairs of teams that compete in prestigious holiday matches), the rankings receive considerable media attention and are available to the selection committee. Part of the intrigue in determining the rankings is that there are not many crossover matches involving teams from different conferences.

Ranking also occurs in non-sporting contexts. For example, universities rank students, employers rank job candidates, there are rankings corresponding to the quality of journals, and so on. Clearly, the type of data used to inform the rankings varies greatly with the application.
In this paper, we focus on the ranking problem associated with NCAA basketball. More specifically, we consider the ranking of n Division I teams (n = 351 in 2015/2016). The data used to inform our ranking are the result of paired comparisons. Sometimes a comparison is explicit (e.g. based on the result of one team playing another team). In other instances, the comparison between two teams is determined by considering the results of matches involving common opponents. Our approach searches for an optimal ranking R = (i_1, ..., i_n) which has maximal agreement between the \binom{n}{2} implied paired preferences via the ranking and the observed data preferences. The approach is appealing in its simplicity and its lack of assumptions. It may be regarded as nonparametric in the sense that there is no underlying probability model. However, the approach provides computational challenges. For example, a simple search amongst rankings is not possible since there are n! ≈ 10^742 potential rankings.

https://doi.org/10.1016/j.orp.2018.03.002
Ranking methods based on paired comparison data originate from the work of Thurstone [26] and Bradley and Terry [4]. The approach suggested by Park and Newman [19] is most closely related to our approach in the sense that it is also nonparametric and extends comparisons to indirect matchups between teams. Park and Newman [19] rank teams according to team wins w minus team losses l in both direct and discounted indirect matches. The statistics w and l correspond to a matrix-based network centrality measure involving adjacency matrices.
From the seminal work by Thurstone [26] and Bradley and Terry [4], the statistical literature on methods for paired comparison data has flourished. For example, many extensions to the original models have been considered, such as the provision for ties in paired comparisons [9], multiple comparisons [20], Bayesian implementations [6,8,17] and dynamic ranking models [12,13] which have been used in chess. The treatment of the margin of victory in paired comparison settings has also led to various models and methods. For example, Harville [14,15] considers linear models where truncations are imposed on large margins of victory. A central idea is that teams should not have an incentive for running up the score. Mease [18] considers a model based on normal likelihoods and penalty terms that attempts to correspond to human judgments. A general review of the literature related to paired comparison methods is given by Cattelan [7]. Rotou et al. [22] review methods that are primarily concerned with dynamic rankings where data are frequently generated, such as in the gaming industry.
In Section 2, we describe our approach, which is intuitive and simple to describe. However, the method gives rise to challenging computational hurdles for which we propose a stochastic search algorithm. For example, we demonstrate how time savings can be achieved in the calculation of our metric which measures the agreement between the implied ranking preferences and the data preferences. The algorithm implements a simulated annealing procedure which optimizes over the n! candidate rankings. Section 3 assesses the proposed ranking procedure by forecasting matches based on established ranks. We first investigate our procedure in the context of real data from previous NCAA basketball seasons. We compare our rankings with rankings obtained by other popular procedures. Our second example is based on simulated NCAA basketball data where the underlying strengths of the teams are specified. This allows us to compare forecasts against the truth. The final forecasting example is based on data from the 2016/2017 English Premier League season. This is a substantially different dataset in that we have a much smaller number of teams (n = 20). In Section 4, we consider various nuances related to our approach. In particular, we compare our procedure to the Bradley–Terry approach, where we observe that the proposed method places more importance on individual matchups than Bradley–Terry. We conclude with a short discussion in Section 5.
2. Approach
Our approach is based on data arising from paired comparisons. In basketball, this represents a data reduction since each team scores a specific number of points in a game. However, sometimes the actual number of points scored can be misleading. For example, in "blowouts", teams often "empty their benches" near the end of a game, meaning that regular players are replaced by players who do not typically play in competitive matches. In such cases, margins of victory may not be representative of true quality. Interestingly, a requirement of the computer rankings used in the former BCS (Bowl Championship Series) for NCAA football was that the computer rankings should not take into account margin of victory [10].
With respect to paired comparison data, it is straightforward to determine the preference when one team plays another team in a single game. We let h denote the number of points corresponding to the home court advantage in NCAA basketball. If the home team defeats the road team by more than h points, then the home team is the preferred team in the paired comparison. Otherwise, the road team is the preferred team.
Our ranking procedure requires the specification of the home team advantage h. And since our approach does not contain a statistical model, the estimation of h must be done outside of the procedure. In NCAA basketball, Bessire [3] provided an average home team advantage of h = 3.7 points. The value h = 3.7 roughly agrees with Gandar et al. [11] who estimated a home court advantage of 4.0 points in the National Basketball Association (NBA). Since an NBA game is 48 min in duration and a college game is only 40 min in duration, the mapping from h = 4.0 in the NBA to college is 4.0(40/48) = 3.3. In an independent calculation, we studied pairs of NCAA basketball teams during the 2006/2007 through 2015/2016 seasons. Suppose team A played at home and defeated team B by h_A points (h_A is negative for a loss). And similarly, suppose team B then played at home and defeated team A by h_B points (h_B is negative for a loss). In this matchup, home advantage is estimated by (h_A + h_B)/2, and h is obtained by averaging these terms over all matchups. In the 26,206 matchups where pairs of teams played more than once with both home and away matches, we estimated the home court advantage as h = 3.4 points. Therefore, although there is a range of estimates for the home team advantage h in NCAA basketball, it appears safe to assume that h ∈ (3.0, 4.0). The reason why this is important is that the determination of binary preferences in paired comparisons (the statistic used in our ranking procedure) is insensitive to h ∈ (3.0, 4.0). Therefore, we arbitrarily set the home team advantage at h = 3.5 points. When a game is played at a neutral site, then h = 0. Whereas there is frequent discussion of differential home team advantages for individual teams (as opposed to an overall home court advantage h for all teams), we are inclined to believe that differential advantages are primarily the manifestation of multiple comparison issues [24] and unbalanced schedules.
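The averaging calculation described above is easy to express in code. The following Python sketch (an illustrative helper; the function and variable names are ours, not the authors') averages (h_A + h_B)/2 over home-and-home matchups:

```python
def estimate_home_advantage(matchups):
    """Estimate the home-court advantage h from home-and-home matchups.

    Each matchup is a pair (h_A, h_B): the home team's margin of victory
    in each leg of the series (negative for a home loss).  The per-matchup
    estimate is (h_A + h_B) / 2, and h is the average over all matchups.
    """
    per_matchup = [(h_A + h_B) / 2 for (h_A, h_B) in matchups]
    return sum(per_matchup) / len(per_matchup)
```

Applied to the 26,206 home-and-home matchups in the 2006/2007 through 2015/2016 seasons, this calculation yields h = 3.4 points.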
More generally, suppose that two teams have played each other more than once. Let p_{Ai} and p_{Bi} be the points scored by Team A and Team B respectively in the ith game. Then from Team A's perspective, define the differential

d_i = { p_{Ai} − p_{Bi} − h   if Team A is the home team
      { p_{Ai} − p_{Bi} + h   if Team B is the home team        (1)

In this case, Team A is the preferred team in the particular paired comparison if the average of its d_i values is positive.
When two teams have played each other directly, then we use (1) to determine the preference, and we refer to this as a level L_1 preference. With n = 351 NCAA basketball teams, there are \binom{n}{2} = 61,425 potential paired comparisons. Based on the 5948 matches that took place in 2015/2016, 3918 level L_1 preferences were observed. The level L_1 preferences represent only 6.38% of the potential \binom{n}{2} paired comparisons.
We now consider cases where Team A and Team B have not directly played against each other. Our approach for determining preferences in these situations borrows on ideas from the RPI (Ratings Percentage Index) where strength of schedule is considered; see [2] for a definition of RPI. Specifically, suppose that Team A and Team B have a common opponent Team C. Then (1) can be used to obtain an average differential \bar{d}_{AC} from the point of view of Team A versus Team C. Similarly, (1) can be used to obtain an average differential \bar{d}_{BC} from the point of view of Team B versus Team C. If \bar{d}_{AC} > \bar{d}_{BC}, then Team A is the preferred team in the paired comparison of Team A versus Team B. Now, suppose that Team A and Team B have multiple common opponents C_i. In this case, if \sum_i \bar{d}_{AC_i} > \sum_i \bar{d}_{BC_i}, then Team A is preferred to Team B. When two teams do not play one another directly but have common opponents, then we refer to the resulting preference as a level L_2 preference. In the 2015/2016 dataset, L_2 preferences represent 54.95% of the potential \binom{n}{2} paired comparisons.
We extend the preference definition so that the data can be used to further determine preferences. Suppose now that Team A and Team B do not play each other directly and that they have no common opponents. However, imagine that Team A has an opponent and Team B has an opponent who have a common opponent. For example, suppose Team A plays Team C, Team B plays Team D, Team C plays Team E and Team D plays Team E. Without going into the notational details and using a similar approach as previously described, a differential \bar{d}_{AE} can be determined via the AC and CE matchups. Similarly, a differential \bar{d}_{BE} can be determined via the BD and DE matchups. Then \bar{d}_{AE} can be compared with \bar{d}_{BE} to determine the data preference between Team A and Team B. We refer to preferences of this type as level L_3 preferences. In the 2015/2016 dataset, L_3 preferences represent 38.67% of the potential \binom{n}{2} paired comparisons. We therefore see that 6.38% + 54.95% + 38.67% = 100% of the potential \binom{n}{2} paired comparisons are either of levels L_1, L_2 or L_3. Referring to the popular 1993 movie "Six Degrees of Separation" starring Will Smith, we observe three (rather than six) degrees of separation in the 2015/2016 NCAA basketball season.
We now make a small adjustment in the definition of preferences. Occasionally, there are "ties" in the preferences. For example, in the 2015/2016 NCAA basketball season, there were 18 cases out of the 3918 level L_1 preferences where a tie occurred. This was the result of two teams playing each other twice, one game on each team's home court. In both games, the home team won by the same margin, leading to \bar{d} = 0. There are various ways of breaking the tie to determine the preference. For example, one might set the preference according to the most recent match. Our approach, which we use throughout the remainder of the paper, is to set the preference for both teams equal to 0.5.
Having defined level L_1, L_2 and L_3 preferences using the NCAA basketball data, we note that the data preferences are not necessarily transitive. For example, it is possible that Team A is preferred to Team B, Team B is preferred to Team C, and yet Team C is preferred to Team A. If transitivity were present, then the ranking of teams would be trivial. In the absence of transitivity, what is a good ranking? Recall that a ranking R = (i_1, ..., i_n) has an implicit set of preferences whereby Team i_j is preferred to Team i_k whenever i_j < i_k. We let L^C_i denote the number of times that the implied preferences based on the ranking R agree with the level L_i data preferences. In this sense, L^C_i is the number of "correct" preferences in R compared to the level L_i preferences determined by the data. We then define

C(R) = L^C_1 + L^C_2 + L^C_3        (2)

as the number of correct preferences. An optimal ranking R* is one which maximizes C(R) in (2) over the space of the n! permutations. Although we considered assigning varying weights to the terms in (2), we were unable to determine weights having a theoretical justification.
2.1. Computation
The first computational problem involves the calculation of the correct number of preferences C(R) for a given ranking R. A naive approach in calculating C(R) involves going through all of the L_1, L_2 and L_3 preferences and counting the number that agree with the implied preferences given by R. On an ordinary laptop computer, such a calculation requires over one hour of computation for a single ranking R in the NCAA basketball dataset. Since our optimization problem involves searching over the space of permutations R, a more efficient way of calculating C(R) is required.
To calculate C(R) for a given ranking R, we preprocess the data by creating three matrices corresponding to preferences at levels L_1, L_2 and L_3. In the n × n matrix D^{(k)} = (\bar{d}^{(k)}_{ij}), k = 1, 2, 3, we have the average differential \bar{d}^{(k)}_{ij} from the point of view of Team i versus Team j based on a level L_k paired comparison. Once these three matrices are constructed, it is easy to calculate C(R) = C((i_1, ..., i_n)) in (2) via

L^C_k = \sum_{j=1}^{n−1} \sum_{l=j+1}^{n} [ I(\bar{d}^{(k)}_{i_j i_l} > 0) + 0.5 · I(\bar{d}^{(k)}_{i_j i_l} = 0) ]

where I is the indicator function and the second term takes ties into account. With the preprocessing, the calculation of C(R) for a new R now takes roughly one second of computation.
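With the matrices in hand, the count L^C_k reduces to a sum over the upper triangle of a reordered matrix. A possible vectorized sketch in Python (illustrative only; it assumes missing level-k comparisons are stored as NaN, which compare false against both "> 0" and "= 0" and so contribute nothing):

```python
import numpy as np

def correct_preferences(R, D_list):
    """Compute C(R) = L^C_1 + L^C_2 + L^C_3 from preprocessed matrices.

    R is a ranking of 0-based team indices (best team first); D_list holds
    the n x n matrices D^(k) of average differentials d-bar^(k)_{ij}, with
    NaN marking pairs that have no level-k comparison.  A positive entry
    counts 1, a tie (exactly zero) counts 0.5, per the formula above.
    """
    R = np.asarray(R)
    total = 0.0
    for D in D_list:
        # Reorder rows/columns so entry (j, l) is the differential between
        # the teams ranked j-th and l-th; only j < l (upper triangle) counts.
        P = D[np.ix_(R, R)]
        upper = P[np.triu_indices(len(R), k=1)]
        total += np.sum(upper > 0) + 0.5 * np.sum(upper == 0)
    return total
```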
Recall that there are n! ≈ 10^742 rankings R in the NCAA dataset, and therefore calculation of C(R) for all rankings is impossible. To maximize C(R) with respect to R over the space of the n! rankings, we implemented a version of the simulated annealing algorithm [16]. Simulated annealing is a stochastic search algorithm that explores the vast combinatorial space, spending more time in regions corresponding to promising rankings. In this problem, we begin with an initial ranking R_0. In the ith step of the algorithm, a candidate ranking R_new is generated in a neighbourhood of the ranking R_{i−1} from step i − 1. If C(R_new) > C(R_{i−1}), then the ranking R_i = R_new is accepted as the current state. In the case where C(R_new) ≤ C(R_{i−1}), then R_i = R_new if a randomly generated uniform(0,1) variate u < exp{(C(R_new) − C(R_{i−1}))/t_i} where t_i > 0 is a parameter often referred to as the temperature. Otherwise, the current ranking R_i = R_{i−1} is set at the previous ranking. The algorithm iterates according to a sequence of nonincreasing temperatures t_i → 0. The states (rankings) R_0, R_1, ... form a Markov chain. The algorithm terminates after a fixed number of iterations or when state changes occur infrequently. Under a 'suitable' neighbourhood structure, asymptotic results suggest that the final state will be nearly optimal.
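The acceptance rule in the annealing step can be sketched as follows (an illustrative helper, not the authors' implementation):

```python
import math
import random

def accept(C_new, C_old, t):
    """Simulated-annealing acceptance rule: always keep an improvement;
    otherwise accept a downward move with probability
    exp((C_new - C_old) / t), which shrinks as the temperature t falls."""
    if C_new > C_old:
        return True
    return random.random() < math.exp((C_new - C_old) / t)
```

At high temperatures most downward moves are accepted, allowing the chain to escape local modes; as t → 0 the rule becomes nearly greedy.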
Success of the simulated annealing algorithm depends greatly on fine tuning of the algorithm. In particular, the user must specify the cooling schedule (i.e. the temperatures t_i) and also the neighbourhood structure for generating successive states from a given state. Aarts and Korst [1] discuss fine tuning of the algorithm.
Our implementation of simulated annealing begins with the recognition that our problem shares similarities with the well-studied travelling salesman problem. For example, like our problem, the state space in the travelling salesman problem consists of permutations, permutations of cities that are visited by the salesman. Also, in the same way that an interchange in the order of two adjacent cities in a permutation should not greatly affect the total travelling distance for the salesman, an interchange in the order of two adjacent teams in a permutation (ranking) should not greatly affect the expected number of correct preferences C(R). Accordingly, our implementation of simulated annealing uses an exponential cooling schedule in early stages defined by a sequence of temperature plateaux; this approach has been successfully used in the travelling salesman problem [1].
After extensive experimentation, we have tuned our algorithm and we propose an optimization schedule that is suited to the NCAA basketball seasons under consideration. Specifically, we consider m = 1, ..., 10 blocks (procedures) where the first eight blocks correspond to simulated annealing. In simulated annealing, the Markov chain consists of B_m iterations in the mth block with temperature t_m. The temperatures decrease exponentially from one block to the next according to t_m = 20(0.82)^{m−1}
Table 1
Schedule for the optimization algorithm. For the mth block in the Permutation procedure we provide the block size B_m and the number of consecutive teams k_m in the permutation. For the non-greedy procedures, we also provide the temperature t_m.

m   Procedure      B_m     k_m   t_m
1   Permutation    2000    65    20.00
2   Permutation    3000    60    16.40
3   Permutation    4000    55    13.45
4   Permutation    5000    45    11.03
5   Permutation    6000    40    9.04
6   NG-Shuffle     25000         3.00
7   NG-Shuffle     25000         2.00
8   NG-Shuffle     25000         1.00
9   G-Shuffle      75000
10  Housekeeping
where m = 1, ..., 5. Therefore, it is more difficult to accept downward moves (i.e. when C(R_new) < C(R_{i−1})) in the final blocks. In the first m = 5 blocks of simulated annealing, we refer to the generation of candidate rankings as the "Permutation" procedure. Specifically, within the mth block, consider the previous state R_{i−1} = (i_1, ..., i_n) and generate a discrete uniform variable l on (1, ..., n − k_m + 1) where the parameter k_m is user-specified. We then randomly permute (i_l, i_{l+1}, ..., i_{l+k_m−1}) yielding (j_l, j_{l+1}, ..., j_{l+k_m−1}). The candidate state in the algorithm is then given by R_new = (i_1, ..., i_{l−1}, j_l, ..., j_{l+k_m−1}, i_{l+k_m}, ..., i_n). In the application, k_m is the number of consecutive teams in the previous ranking that are permuted. Once permuted, a candidate ranking is obtained. In keeping with the heuristic that state changes should be "smaller" as simulated annealing proceeds, we propose a schedule where the tuning parameter k_m decreases as m increases.
When the first five blocks of the algorithm have completed, we carry out a procedure referred to as "Shuffle" in blocks m = 6, 7, 8, 9. The idea behind Shuffle is that whereas the Permutation procedure can lead to candidate rankings that differ considerably from the current ranking, Shuffle produces new rankings where only one "misplaced" team shuffles from its current position. Specifically, given the previous ranking R_{i−1}, Shuffle proceeds by generating a discrete uniform random variable l on (1, ..., n). Then another discrete uniform random variable j is generated on (max(1, l − 50), ..., min(l + 50, n)). Shuffle updates from R_{i−1} to R_new if R_new is accepted, where R_new has the same ordering as R_{i−1} except that the team ranked l is moved to position j. Therefore, teams as far apart as 100 places could potentially switch places; we did not want to switch teams further apart as such switches are unlikely to yield an improved ranking. In blocks m = 6, 7, 8, Shuffle is more accurately described as Non-Greedy Shuffle (NG-Shuffle) where temperatures t_6, t_7, t_8 are specified as part of the simulated annealing procedure. In block m = 9, the Shuffle procedure is modified as Greedy Shuffle (G-Shuffle). A greedy procedure is one where only non-negative moves towards the maximum are allowed (i.e. C(R_i) ≥ C(R_{i−1})). The motivation is that when the algorithm nears termination, we only want to be moving in directions which provide improvements.
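A sketch of the Shuffle move (0-based positions; an illustrative helper):

```python
import random

def shuffle_move(R, width=50):
    """Candidate generator for the 'Shuffle' blocks: remove the team in a
    random position l of the ranking R (a list) and reinsert it at a
    position j within `width` places of l."""
    n = len(R)
    l = random.randrange(n)                             # 0-based position
    j = random.randint(max(0, l - width), min(l + width, n - 1))
    team = R[l]
    rest = R[:l] + R[l + 1:]                            # remove the team
    return rest[:j] + [team] + rest[j:]                 # reinsert at j
```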
Finally, in block m = 10 of the algorithm, we carry out another greedy procedure which we refer to as "Housekeeping". Housekeeping investigates the effect of even smaller changes to the ranking R following the G-Shuffle procedure (i.e. block m = 9). Specifically, we take R = (i_1, i_2, ..., i_n) and we sweep through the solution by inspecting quintuples (i_j, i_{j+1}, i_{j+2}, i_{j+3}, i_{j+4}) beginning with j = 1 and ending with j = n − 4. For each quintuple, we calculate C(R) for the 120 permutations of the quintuple to see if any of the potential rankings lead to an improved solution. Whenever an improved permutation is detected, the ranking is updated accordingly.
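The Housekeeping sweep might look as follows, where C is any function returning the number of correct preferences for a ranking (an illustrative sketch):

```python
from itertools import permutations

def housekeeping(R, C):
    """Greedy 'Housekeeping' sweep: for each window of five consecutive
    teams, try all 120 orderings of the window and keep any ordering
    that strictly improves the objective C(R)."""
    R = list(R)
    for j in range(len(R) - 4):
        best = R
        for perm in permutations(R[j:j + 5]):
            cand = R[:j] + list(perm) + R[j + 5:]
            if C(cand) > C(best):
                best = cand
        R = best
    return R
```

For example, with a toy objective rewarding closeness to the identity ranking, a reversed block of five teams is repaired in a single window.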
Table 1 summarizes the schedule for the optimization algorithm. In the NCAA basketball example, one run of the optimization procedure takes approximately 36 h of computation. This is not onerous for a task that might be expected to be carried out once per week.
Fig. 1. A plot of the number of correct preferences C(R_i) versus the iteration number i in the optimization algorithm.
Fig. 1 provides a plot of the optimization algorithm corresponding to the preferences obtained in the 2015/2016 NCAA basketball season. We see that the algorithm moves quickly towards an optimal ranking and then slowly improves. The algorithm was initiated from a promising ranking R_0 (the 2015/2016 season-ending RPI rankings). However, the algorithm works equally well using less promising initial states.

The simulated annealing algorithm provides guarantees of convergence to a global maximum. However, in practical computing times, it may be the case that our proposed algorithm gets stuck in a local mode and only gets "close" to a maximum. It is a general drawback of stochastic search algorithms that there is no definitive rule for terminating algorithms. Generally, when changes stop taking place, this is a signal to stop. In our work, we have been mindful of this, and have added an extra layer of insurance by carrying out multiple runs of the algorithm. We choose to run the algorithm M = 20 times, which does not take any extra time because we are able to submit our job to a cluster of processors. We then choose the ranking R* which corresponds to the maximum value of C(R) from the M runs.
Fig. 2. A plot of the density function (3) of the largest order statistic C* based on the optimal C(R) values C_1, ..., C_M from the M = 20 runs of the optimization algorithm using the 2015/2016 NCAA basketball dataset.

The multiple runs also provide us with some confidence that our resultant ranking R* yields C* = C(R*) which is close to the global maximum. From the M = 20 runs, we have observed that the resultant maxima C_1, ..., C_M are roughly symmetric. To gain some insight, we therefore make the assumption that the maxima are approximately normally distributed with mean \bar{C} and standard deviation given by the sample standard deviation s_C. In extreme value theory, the probability density function of C*, the Mth order statistic of C_1, ..., C_M, is therefore approximately

f(C*) = M (1/s_C) φ((C* − \bar{C})/s_C) [Φ((C* − \bar{C})/s_C)]^{M−1}        (3)
where φ and Φ are the density function and the distribution function of the standard normal distribution, respectively. In Fig. 2, we plot the density function of C* given by (3) based on the observed maxima C_1, ..., C_M from the M = 20 runs. Based on the observed value C* = 53388.5, the plot suggests that we can be confident that we are close to the global maximum. In particular, it looks unlikely that C* = 53388.5 could be off from the global maximum by much more than 6.0.
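Density (3) is straightforward to evaluate. The sketch below (illustrative only, with hand-rolled standard normal pdf and cdf) computes f(C*) under the stated normality assumption:

```python
import math

def phi(z):
    """Standard normal density function."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def density_max(c, C_bar, s_C, M=20):
    """Approximate density (3) of the largest of M runs' maxima,
    assuming the maxima are roughly N(C_bar, s_C^2)."""
    z = (c - C_bar) / s_C
    return M * phi(z) / s_C * Phi(z) ** (M - 1)
```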
3. Forecasting
3.1. NCAA basketball data
We now compare our proposed ranking procedure with four widely reported ranking systems used in NCAA basketball (Bihl, Massey, Pomeroy and RPI).

We consider rankings that have been published over five seasons (2011/2012 through 2015/2016) where we note that the Pomeroy rankings were unavailable in the 2012/2013 and 2013/2014 seasons. For each season, we consider seven time points t_1, ..., t_7 where rankings are reported. The time periods roughly correspond to mid-December, early January, mid-January, early February, mid-February, early March and mid-March. For each ranking system and for each time period (t_i, t_{i+1}), our evaluation considers matches played in the time period and the ranking based at time t_i. Except for the last time period coinciding with March Madness, there are approximately 500 matches played in each time period per year. In a given match, the outcome is categorized as correct if the home team wins by more than h = 3.4 points and the home team has the higher ranking. The match outcome is also categorized as correct if the road team wins or loses by less than h = 3.4 points and the road team has the higher ranking. On a neutral court, the match outcome is considered correct if the higher ranked team wins.
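The categorization rule above can be sketched as follows (a hypothetical helper; a lower rank number means a higher ranking):

```python
def prediction_correct(margin_home, home_rank, road_rank,
                       neutral=False, h=3.4):
    """Categorize one match outcome as 'correct' per the rule above:
    margin_home is the home team's winning margin (negative for a loss)."""
    if neutral:
        # On a neutral court, correct iff the higher ranked team wins.
        return (margin_home > 0) == (home_rank < road_rank)
    if margin_home > h:
        # Home team covers the home advantage: home side is the 'winner'.
        return home_rank < road_rank
    # Road team wins, or loses by less than h: road side is the 'winner'.
    return road_rank < home_rank
```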
Over all the predictions made during the five-year period, we calculated the percentage of correct predictions by each of the ranking systems. In order of the highest percentages, we observed Pomeroy (69.6%), Massey (69.4%), our proposed method (69.2%), Bihl (68.7%) and RPI (67.8%). Although the percentages are reasonably close, the five methods exhibited fairly consistent orderings on a yearly basis. It is interesting that the RPI approach exhibited the lowest percentage, yet RPI is used by the NCAA Selection Committee in their March Madness deliberations.
Table 2 provides the top five ranked teams using the five ranking methods at the end of the 2015/2016 NCAA basketball season. We observe a lot of agreement in the sets of rankings. However, our proposed approach is interesting in that it provides a notable difference from the other rankings. In particular, Kansas is excluded from the top five whereas Xavier is included. This perspective is interesting as Kansas had a good season (33 wins versus 5 losses). However, their five losses came against strong teams (Michigan St, West Virginia, Oklahoma State, Iowa State and Villanova), all top 20 AP (Associated Press) teams except for Oklahoma State. This highlights the importance of the head-to-head matchups which is discussed in Section 4.1. We note that our ranking had Kansas in the seventh position. On the other hand, Xavier (not a traditional powerhouse school) had a strong 28-6 record and may have been overlooked by some of the other ranking methods.
3.2. Simulated NCAA basketball data
Although the previous example using actual NCAA basketball
data was instructive, it did not allow us to make comparisons with
the “truth” since the correct rankings based on team strengths in
actual seasons are always unknown. In this example, we consider
simulated data sets where we can initially set team strengths so
that the true rankings are known to us.
Therefore, in the context of NCAA basketball, we consider n = 351 teams where the team rankings are set according to alphabetical order. For example, Team 1 is the best team and its schedule is determined by the 2015/2016 schedule for Abilene Christian (alphabetically first). Team 351 is the weakest team and its schedule is determined by the 2015/2016 schedule for Youngstown State (alphabetically last). For a match between Team i and Team j on a neutral court, the observed point differential in favour of Team i is modeled according to the Normal(μi − μj, σ²) distribution, where the normal distribution is a common assumption for NCAA basketball [23], and we set σ = 9.3 which is consistent with [25]. If the normal variate is greater (less) than zero, then Team i is the winning (losing) team. For home and road matches, winners and losers are determined by using the same procedure as if the match was played on a neutral court.
We consider two team strength scenarios. In the first case, we set team strengths according to μi = 35.1 − (0.1)i such that Team 1 has strength 35.0 and Team 351 has strength 0.0. This implies, for example, that the strongest team is expected to defeat the weakest team by 35 points on a neutral court. In the second case, we set team strengths according to μi = 52.65 − (0.15)i, which implies that the strongest team is expected to defeat the weakest team by 52.5 points on a neutral court.
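As a small illustration, the match-generation model above (Case 2 strengths, σ = 9.3 and home advantage h = 3.5, as in the text) can be sketched in a few lines. The function names are ours, and folding h into the mean of the point differential is our reading of how the home advantage enters the simulation:

```python
import random

SIGMA = 9.3  # point-differential standard deviation (as in the text)
H = 3.5      # home-team advantage in points (as in the text)

def team_strengths(n=351, slope=0.15):
    # Case 2 of the text: mu_i = 52.65 - 0.15*i, so Team 1 has
    # strength 52.5 and Team 351 has strength 0.0.
    return [52.65 - slope * i for i in range(1, n + 1)]

def home_team_wins(mu_home, mu_away, rng):
    # Point differential in favour of the home team, drawn from
    # Normal(mu_home - mu_away + H, SIGMA^2); a positive draw is a
    # home win.  Adding H to the mean is our assumption.
    return rng.gauss(mu_home - mu_away + H, SIGMA) > 0

rng = random.Random(1)
mu = team_strengths()
# Strongest team at home against the weakest team: with a mean
# differential of 56 points against sigma = 9.3, the home side wins
# essentially every simulated match.
wins = sum(home_team_wins(mu[0], mu[-1], rng) for _ in range(1000))
print(wins)  # almost surely 1000 or very close to it
```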
Our comparison via simulation proceeds by generating M = 10 seasons of matches according to the above description where h = 3.5 is set as the home team advantage. In the jth season, we take the resultant ranking R_j = (j_1, …, j_n) and compare it to the true ranking R_true = (1, …, n). We do this using two comparison metrics,

    C(1)_j = (1/n) Σ_{i=1}^{n} |j_i − i|   and   C(2)_j = [ (1/n) Σ_{i=1}^{n} (j_i − i)² ]^{1/2}.

We repeat the procedure over the M = 10 seasons to obtain the overall comparison metrics

    C(1) = (1/M) Σ_{j=1}^{M} C(1)_j   and   C(2) = (1/M) Σ_{j=1}^{M} C(2)_j.
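In code, the per-season metrics are straightforward. The function names are ours; we read C(2) as a root-mean-square deviation, which is consistent with the magnitudes reported in Table 3:

```python
def c1(rank, true_rank):
    # Mean absolute deviation between a season's ranking and the truth.
    n = len(rank)
    return sum(abs(r - t) for r, t in zip(rank, true_rank)) / n

def c2(rank, true_rank):
    # Root-mean-square deviation; our reading of the second metric.
    n = len(rank)
    return (sum((r - t) ** 2 for r, t in zip(rank, true_rank)) / n) ** 0.5

true = [1, 2, 3, 4, 5]
est = [2, 1, 3, 5, 4]   # two adjacent transpositions
print(c1(est, true))    # 0.8
print(c2(est, true))    # 0.894... (= sqrt(0.8))
```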
Our simulation involves a comparison of our proposed ranking
method with the Bradley–Terry approach which is considered the
benchmark procedure for paired comparison data. Bradley–Terry
110 D. Beaudoin, T. Swartz / Operations Research Perspectives 5 (2018) 105–112
Table 2
Final rankings of the top five teams at the end of the 2015/2016 season.

Method    First      Second       Third       Fourth    Fifth
Pomeroy   Villanova  N Carolina   Virginia    Kansas    Michigan St
Massey    Villanova  Kansas       N Carolina  Virginia  Oklahoma
Proposed  Villanova  Michigan St  N Carolina  Xavier    Virginia
Bihl      Kansas     Villanova    N Carolina  Oklahoma  Virginia
RPI       Kansas     Villanova    Virginia    Oregon    N Carolina
Table 3
Comparison metrics with standard errors in parentheses for two ranking systems (Proposed versus Bradley–Terry) studied under two simulation cases.

                                Proposed Procedure    Bradley–Terry Procedure
Case 1: μi = 35.1 − (0.1)i      C(1) = 12.7 (0.39)    C(1) = 13.2 (0.85)
                                C(2) = 15.4 (0.44)    C(2) = 16.9 (0.99)
Case 2: μi = 52.65 − (0.15)i    C(1) =  9.1 (0.25)    C(1) = 11.7 (0.44)
                                C(2) = 11.0 (0.34)    C(2) = 15.0 (0.47)
estimation procedure fails when there is more than one winless
team. For this reason, we assign 0.5 wins in those rare cases where
there are winless teams. We note that Bayesian implementations of
Bradley–Terry as mentioned in the Introduction do not suffer from
this drawback. We are unable to make comparisons with some of
the systems that are frequently reported in NCAA basketball (e.g.
Sagarin, Pomeroy, Massey or RPI) since the systems are proprietary
and the code is unavailable.
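A minimal sketch of the Bradley–Terry fit with the half-win adjustment, using the standard minorization–maximization (MM) iteration for Bradley–Terry maximum-likelihood estimation. How the 0.5 wins are allocated to a winless team is not specified in the text, so spreading them evenly over its opponents is our own choice:

```python
def bradley_terry(wins, iters=200):
    # wins[i][j] = number of times team i beat team j.
    n = len(wins)
    w = [row[:] for row in wins]
    # Half-win fix from the text: a winless team receives 0.5 wins,
    # which we spread evenly over the opponents it actually played
    # (our own allocation choice).
    for i in range(n):
        if sum(w[i]) == 0:
            opps = [j for j in range(n)
                    if j != i and (wins[i][j] + wins[j][i]) > 0]
            if opps:
                for j in opps:
                    w[i][j] += 0.5 / len(opps)
    p = [1.0] * n
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
        # where W_i is team i's win total and n_ij its match count vs j.
        new_p = []
        for i in range(n):
            W = sum(w[i])
            denom = sum((w[i][j] + w[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(W / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x * n / s for x in new_p]  # normalize for identifiability
    return p

# Three teams: 0 beats 1 twice and loses once; 1 beats 2 twice;
# team 2 is winless, so it receives the 0.5-win adjustment.
wins = [[0, 2, 0], [1, 0, 2], [0, 0, 0]]
p = bradley_terry(wins)
print(p)  # strengths ordered p[0] > p[1] > p[2]
```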
Table 3 reports the results of the simulation procedure. The metrics are interesting as they may be interpreted as the average deviation between a team's ranking and its true ranking. We observe that both ranking procedures are improved in the second simulation case. This makes sense as there is more variability between teams in Case 2 than in Case 1, and it is therefore more likely for a ranking method to differentiate between teams. Note that any two teams will have a greater mean difference |μi − μj| (actual difference in strength) in Case 2 than in Case 1. Further, we observe in both simulation cases and using both metrics that the proposed ranking procedure gives better rankings than Bradley–Terry. The reported standard errors suggest that the improvements are statistically significant in the second simulation case. We believe that Case 2 is more realistic than Case 1 in describing a wider range in quality between NCAA teams.
It would be interesting to repeat the simulation where team strengths were not linear but followed a Gaussian specification. For example, one could generate μi from a Normal(0, 36) distribution and then sort the μi such that μ1 is the largest and μ351 is the smallest. This may be a more realistic description of a population of team strengths.
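This alternative specification amounts to a one-liner. Reading Normal(0, 36) as variance 36 (i.e. standard deviation 6) is our interpretation:

```python
import random

rng = random.Random(0)
# Draw 351 team strengths from Normal(0, 36) -- read here as
# variance 36, so standard deviation 6 -- and sort in decreasing
# order so that mu[0] belongs to the strongest team (Team 1).
mu = sorted((rng.gauss(0, 6) for _ in range(351)), reverse=True)
print(mu[0] > mu[-1])  # True: Team 1 strongest, Team 351 weakest
```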
3.3. English Premier League data
Whereas NCAA basketball consisted of n = 351 teams in
2015/2016, the English Premier League (EPL) is a much smaller
league with n = 20 teams. Therefore, the EPL provides a different
type of challenge for our ranking procedure.
In the EPL, each team plays both a home and a road game against every other team for a total of 38 matches in a season. We begin by setting two dates during the 2016/2017 EPL season where ranks based on our procedure are determined. These dates roughly correspond to weeks 19 and 27 of the season. We chose not to extend the dates to the latter part of the season as unusual playing behaviours sometimes occur. For example, in the latter portion of the 2016/2017 season, Manchester United was more focused on their Europa League matches than on their EPL matches. It was believed that they had a greater chance of Champions League qualification via the Europa League route. Based on these dates, each team had played every other team at least once, and therefore team comparisons were based entirely on level L1 preferences. The data matrix D(1) was constructed for each of the two dates where the home field advantage was set at h = 0.5 [21]. Recall that the calculation of paired comparison preferences is insensitive to the choice of h in the wide interval h ∈ (0, 1).
We note that the full strength of the optimization algorithm described in Table 1 was not required since we have fewer teams. We instead initiated the procedure beginning in block m = 6. We also modified the Shuffle procedure where we now generate independent uniform variates l and j on (1, …, 20). Under modified Shuffle, the candidate ranking R_new has the same ordering as R_{i−1} except that team l is inserted into position j. The modified Shuffle procedure allows all possible pairs of teams to switch places. In this case, the optimization procedure was carried out in roughly 45 s of computing for each of the two time periods.
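The modified Shuffle move can be sketched as follows; treating l as a position in the current ranking (rather than a team label) and using remove-then-insert semantics are our assumptions:

```python
import random

def modified_shuffle(rank, rng):
    # Candidate ranking: draw positions l and j uniformly and
    # independently, remove the team at position l, and re-insert it
    # at position j.  All other teams keep their relative order.
    n = len(rank)
    l = rng.randrange(n)
    j = rng.randrange(n)
    cand = rank[:]
    team = cand.pop(l)
    cand.insert(j, team)
    return cand

rng = random.Random(42)
r = list(range(1, 21))  # a current ranking of the 20 EPL teams
print(modified_shuffle(r, rng))  # a permutation of the same 20 teams
```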
An advantage of working with a smaller league is the increased
conﬁdence that optimal rankings are obtained. Multiple runs of
the algorithm based on different initial rankings typically gave the
the algorithm based on different initial rankings typically gave the same value of C(R). However, we did discover that the rankings
were not unique. We found three optimal rankings at the ﬁrst date
and two optimal rankings at the second date.
Table 4 provides both the EPL table (standings) and the optimal rankings at the two dates. We observe some meaningful differences between the tables and the ranks. On the Jan 1/17 date, the largest discrepancies between the table and the optimal ranks involve Middlesbrough, Arsenal and Watford. The optimal rankings suggest that Middlesbrough is stronger (9 placings), Arsenal is weaker (6 placings) and Watford is weaker (6 placings) than the table indicates. Middlesbrough's strength (according to our ranking) was aided by "wins" (i.e. taking into account home team advantage) over Manchester City, Arsenal and West Brom. We also observe that the three optimal rankings R*1, R*2 and R*3 on Jan 1/17 are similar; the only differences involve the top three sides Chelsea, Liverpool and Manchester United. On the Mar 6/17 date, the largest discrepancies between the table and the optimal rankings involve Manchester City (7 places lower according to R*1), Leicester City (6 places higher according to R*1) and Sunderland (6 places higher according to both R*1 and R*2).
Having observed some of the large discrepancies between the standings and the optimal rankings in Table 4, it is difficult to assess which lists are more sensible as measures of team strength. Perhaps large discrepancies indicate to gamblers that there is something interesting about such teams, and that there may be a partial explanation for their standings at a given point in time.
It is also interesting to compare the optimal rankings in Table 4. The Jan 1/17 optimal rankings are the same except for the ordering of the teams in the first three positions of the table. The Mar 6/17 optimal rankings have stability in the bottom half of the orderings but contain more overall variability. For example, we observe Manchester City in 10th place according to R*1 and in 5th place according to R*2. At that stage of the season, Manchester City was doing well pointwise, sitting 3rd in the table. However, they had suffered four of their five losses to "bigger" teams such as Tot-