arXiv:1612.02567v1 [q-fin.ST] 8 Dec 2016
Order statistics of horse racing and the randomly broken stick
Peter A. Bebbington (1,2) and Julius Bonart (3,4)
December 9, 2016
1: Department of Physics and Astronomy, University College London, London WC1E 6BT
2: Trium Capital LLP, 60 Gresham St, London EC2V 7BB
3: Financial Computing & Analytics, Department of Computer Science, University College London, London WC1E 6BT
4: CFM–Imperial Institute of Quantitative Finance, Department of Mathematics, Imperial College, London SW7 2AZ
Abstract
We find a remarkable agreement between the statistics of a randomly divided interval and
the observed statistical patterns and distributions found in horse racing betting markets. We
compare the distribution of implied winning odds, the average true winning probabilities, the
implied odds conditional on a win, and the average implied odds of the winning horse with
the corresponding quantities from the “randomly broken stick problem”. We observe that
the market is at least to some degree informationally efficient. From the mapping between
exponential random variables and the statistics of the random division we conclude that horses’
true winning abilities are exponentially distributed.
1 Introduction
From time to time nature has a taste for simplicity. It can then be promising to treat unknown variables as
purely random, using statistics that are compatible with the constraints, symmetries, or boundary conditions
of the given problem, but otherwise as simple as possible. Heavy nuclei are an example of such a system; they
are seemingly hopelessly complex, yet the spacings between their energy levels follow well-known statistics of
random matrix eigenvalues [1, 2]. More recently, one such statistic, the Marchenko-Pastur distribution,
has been found in fluctuations of financial covariance matrices [3], despite the strong non-Gaussian dependencies
observed in real financial time series [4]. The latter example underscores the success of econophysics:
Socio-economic human systems are highly non-linear [5, 6, 7, 8] and chaotic [9], but methods borrowed from
statistical physics can still be successful in describing bulk statistics of these systems.
Traditionally, econophysics has somewhat neglected one type of financial market: betting markets.
This is perhaps surprising because economists, on the contrary, have studied betting markets extensively,
considering them as a controlled experiment for market efficiency [10, 11, 12], a key concept in financial
economics. Because the outcome of the bet – win or lose – is definitely known after a certain time, it is
straightforward to draw conclusions from the discrepancy between the implied market odds¹ and the true
winning probability. If the difference is large, the market is regarded as inefficient, because its participants
are not able to “price” the bet correctly. If the difference is small the market is regarded as efficient. How
are the implied market odds calculated? Consider, for example, a horse race with n = 3 horses. Assume that
after betting $X on the first horse you would get $3X if this horse wins. The “price” of a bet on the first
horse is thus 3. If the price of a bet on the second horse is 2 and the price of a bet on the third horse is 6,
the second horse is the favourite because the market would pay the smallest ratio of the gain (including the
original stake) to the stake itself if this horse wins: its implied winning odds are 1/2, so that the gambler’s
average payoff is zero.
¹ We use “implied odds” here in the sense of “implied probability”.
The sum of all the implied odds must be roughly one. We know that this is not always exactly true, for
example because bookmakers charge the gamblers a small fee, but for the purpose of this study these tiny
deviations are not important. Each horse’s implied odds thus represent a segment of the unit interval. We
do not know much about horses, but we can guess the simplest statistics for these segments: Draw n − 1
numbers from the uniform distribution and cut the unit interval at each of these numbers. We thus break a
stick of unit length randomly into n pieces, each of which represents the winning odds of an individual horse
participating in a race with n horses. The favourite’s odds correspond to the largest segment, the second
favourite’s odds to the second largest segment, and so on.
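The construction just described can be sketched in a few lines of Python. This is only an illustration under our own naming (the function `broken_stick` and the seed are assumptions, not part of the original analysis): it draws n − 1 uniform cut points, cuts the unit interval there, and returns the pieces ranked from the largest (the favourite) to the smallest (the longshot).

```python
import random

def broken_stick(n, rng=random):
    """Break the unit interval at n - 1 uniform points.

    Returns the n segment lengths sorted from largest to smallest,
    so index 0 plays the role of the favourite and index n - 1 the longshot.
    """
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]
    pieces = [right - left for left, right in zip(edges, edges[1:])]
    return sorted(pieces, reverse=True)

# One simulated "race" with 8 horses (the seed is arbitrary).
odds = broken_stick(8, random.Random(0))
```

By construction the segments sum to one, which mirrors the requirement that the implied odds of all horses in a race sum to roughly one.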
This letter reports striking similarities between the empirical distribution of implied odds observed in
horse racing betting markets and the order statistics of the random division of the unit interval. Moreover,
we find that conditional expectations of the true winning probabilities² closely follow their corresponding
values from the “broken stick problem”, as well. We therefore conclude that the true winning probabilities
of horses behave like random divisions of the unit interval, and that the market follows these statistics in
the implied odds. Finally, from the well known mapping between exponential random variables and the
statistics of the random division [13] we make the somewhat vague statement that horses’ true “abilities”
are exponentially distributed: The probability that a horse with ability $X_i$ wins against $n - 1$ other horses
is then
$$P_i = \frac{X_i}{\sum_{j=1}^{n} X_j}, \tag{1}$$
where $P_i$ follows the statistics of the broken stick problem if and only if the $X_i$ are exponentially distributed.
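The mapping behind Eq. 1 can be checked numerically. The sketch below is an illustration only: the function names, the unit rate of the exponential distribution, the trial count and the seeds are all our assumptions. It compares the mean ranked segment lengths obtained from uniform cuts with those obtained by normalising exponential "abilities".

```python
import random

def stick_from_uniform_cuts(n, rng):
    """Ranked segment lengths from n - 1 uniform cuts of the unit interval."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]
    return sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)

def stick_from_abilities(n, rng):
    """Ranked winning probabilities X_i / sum_j X_j for exponential abilities."""
    abilities = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(abilities)
    return sorted((x / total for x in abilities), reverse=True)

def mean_ranked_lengths(sampler, n, trials, seed):
    """Monte Carlo average of the k-th largest value, for k = 1..n."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(trials):
        for k, length in enumerate(sampler(n, rng)):
            sums[k] += length
    return [s / trials for s in sums]

n = 6
by_cuts = mean_ranked_lengths(stick_from_uniform_cuts, n, 20000, seed=1)
by_abilities = mean_ranked_lengths(stick_from_abilities, n, 20000, seed=2)
```

Up to sampling noise the two lists should coincide rank by rank. Agreement of the means is of course only a necessary check, not a proof of the distributional identity.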
Consider the interval $[0, 1]$ and divide it randomly into $n$ sub-intervals. The length of the $k$-th largest
sub-interval, which we denote here by $z_{(k)}$, has the distribution [13]
$$P[z_{(k)} > x \mid n] = \sum_{j=1}^{k-1} \binom{n}{j} \sum_{\ell=0}^{n-j} (-1)^{\ell-1} \binom{n-j}{\ell} \left[1 - (j+\ell)x\right]_+^{n-1} + \sum_{\ell=1}^{n} (-1)^{\ell-1} \binom{n}{\ell} \left[1 - \ell x\right]_+^{n-1}, \tag{2}$$
with $a_+ = \max[a, 0]$. We want to compare $P[z_{(k)} > x]$ to the empirical distribution of implied winning odds
of horse racing betting markets.
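As a sanity check, Eq. 2 can be evaluated directly and compared with a simulation of the broken stick. The following is a minimal sketch: the function names, the test point (n = 5, k = 2, x = 0.2), the trial count and the seed are our choices for illustration. The alternating sums may lose floating-point accuracy for large n, so it should be read as a check for small n only.

```python
import random
from math import comb

def ccdf_kth_largest(x, k, n):
    """P[z_(k) > x | n] as written in Eq. 2, with [a]_+ = max(a, 0)."""
    def pos(a):
        return max(a, 0.0)

    # Second sum of Eq. 2: probability that the largest segment exceeds x.
    total = sum((-1) ** (ell - 1) * comb(n, ell) * pos(1 - ell * x) ** (n - 1)
                for ell in range(1, n + 1))
    # First (double) sum of Eq. 2: correction terms for j = 1, ..., k - 1.
    for j in range(1, k):
        inner = sum((-1) ** (ell - 1) * comb(n - j, ell)
                    * pos(1 - (j + ell) * x) ** (n - 1)
                    for ell in range(0, n - j + 1))
        total += comb(n, j) * inner
    return total

def ccdf_monte_carlo(x, k, n, trials, rng):
    """Fraction of simulated sticks whose k-th largest piece exceeds x."""
    hits = 0
    for _ in range(trials):
        cuts = sorted(rng.random() for _ in range(n - 1))
        edges = [0.0] + cuts + [1.0]
        pieces = sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)
        hits += pieces[k - 1] > x
    return hits / trials

exact = ccdf_kth_largest(0.2, k=2, n=5)
estimate = ccdf_monte_carlo(0.2, k=2, n=5, trials=20000, rng=random.Random(0))
```

For n = 2 and k = 2 the formula reduces to 1 − 2x for x < 1/2, the distribution of the shorter of two pieces, which gives a quick analytical cross-check.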
We use data collected through Betfair on 12736 races occurring across the British Isles in the period from
31/12/2011 to 15/12/2012. The average number of horses per race is 8.95. We consider only races with at
least 5 horses which reduces the total number of races in our dataset to 11925. Gamblers exchange bets on
horses in a limit order book. Sell orders match buy limit orders specified by volumes and lay decimal odds.
Buy orders match sell limit orders specified by volumes and back decimal odds. Decimal odds quote the ratio
of the payout amount, including the original stake, to the stake itself. The highest back quote is larger than
the lowest lay quote³. The implied winning odds are defined as the reciprocal of the last matched quote
before the race starts.
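The definition of the implied odds and the ranking of horses can be illustrated with the three-horse example from the introduction. This is a hedged sketch: the function names are ours, and the prices 3, 2 and 6 are the illustrative values used above, not Betfair data.

```python
def implied_odds(decimal_odds):
    """Implied winning odds (probabilities): reciprocals of the decimal odds."""
    return [1.0 / d for d in decimal_odds]

def rank_by_favouritism(decimal_odds):
    """Horse indices ordered from the favourite (k = 1) to the longshot (k = n)."""
    q = implied_odds(decimal_odds)
    return sorted(range(len(q)), key=lambda i: q[i], reverse=True)

prices = [3.0, 2.0, 6.0]   # decimal odds of the three-horse example
q = implied_odds(prices)   # 1/3, 1/2, 1/6
ranking = rank_by_favouritism(prices)
```

In this idealised example the implied odds sum exactly to one. With real quotes the sum is only approximately one, and no renormalisation is applied in this sketch.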
Consider now the implied odds of the $k$-th favourite horse, which we denote by $Q_{(k)}$. Fig. 1 compares
the empirical complementary cumulative distribution function (ECCDF) $P[Q_{(k)} > x]$ with the theoretical
prediction $P[z_{(k)} > x]$ for the favourite, 2nd favourite, 3rd
favourite, 4th favourite, and the horse with the lowest implied winning odds (the “longshot”), averaged over
the number of horses in each race. The agreement is striking and calls for further investigation.
To compare the true winning probabilities to the order statistics of the broken stick problem we need
to calculate average quantities. Table 1 shows the expected length of the $k$-th largest segment, the average
empirical implied odds of the $k$-th favourite, and the average observed true winning probabilities of the $k$-th
favourite, denoted by $P_{(k)}$, for all races in our dataset and for three subgroups containing roughly an equal
number of horses: races with $5 \le n \le 7$ horses, races with $8 \le n \le 10$ horses, and races with $n \ge 11$
horses. The theoretical expectations of the segment lengths are calculated by taking the first moment of
Eq. 2 (given analytically in Eq. 4 below) and averaging over the empirical distribution of $n$. Not only do
the empirical implied odds correspond to the expected segment lengths, but the average observed winning
probabilities also follow the order statistics of the random division accurately for all horses, with the exception
of the longshot. Note that our theoretical estimates of the winning odds based on the expected segment
lengths are parameter-free.
Figure 1: (Dashed) ECCDF of the four favourite and longshot horses’ implied odds (red: favourite
horse, blue: second favourite, green: third favourite, violet: fourth favourite, orange: longshot) and
(solid) the cumulative distribution of the corresponding segments of the division of the unit interval,
displayed in (main) linear and (inset) double logarithmic scale. Note that there are no free fitting
parameters.
² Of course we cannot observe an empirical distribution of true winning probabilities but only aggregate statistics,
such as, for example, the average winning probability of the favourite horse.
³ Note that here buy orders match at lower quotes than sell orders.
We observe significant discrepancies for the longshot, but the differences between its implied odds,
winning probability and segment length are small for races with $5 \le n \le 7$ horses and larger for races with
more horses. This suggests that gamblers are not able to rank the horses precisely enough when the number
of horses is large. Remember that we define the rank of the horse according to the observed implied odds.
Therefore, the smallest segment may describe a horse that the market has not recognised as the weakest one.
In this case the market’s longshot is in reality a slightly stronger horse. This is consistent with the fact that
both implied odds and winning probabilities of the longshot are larger than suggested by Eq. 2.
We also calculate the implied odds of the $k$-th favourite given that this horse wins. This quantity
is naturally larger than the unconditional implied odds of the $k$-th favourite. To find the corresponding
theoretical prediction consider the indicator function $I_{(k)}$, which is one if a random point in the interval
$[0, 1]$ lies in the $k$-th largest segment and zero otherwise. Then:
$$P[z_{(k)} = x \mid I_{(k)} = 1] = \frac{P[I_{(k)} = 1 \mid z_{(k)} = x]\, P[z_{(k)} = x]}{P[I_{(k)} = 1]} = \frac{x\, P[z_{(k)} = x]}{\bar{z}_{(k)}},$$
and
$$E[z_{(k)} \mid I_{(k)} = 1] = \frac{\overline{z^2_{(k)}}}{\bar{z}_{(k)}}, \tag{3}$$
where $\bar{z}_{(k)}$ and $\overline{z^2_{(k)}}$ denote the first and second moments of $z_{(k)}$.
Eq. 3 is the theoretical prediction of the $k$-th favourite’s odds given that this horse wins.
                                    favourite  2nd favourite  3rd favourite  4th favourite  longshot
                                    k = 1      k = 2          k = 3          k = 4          k = n
$E[Q_{(k)} \mid n \ge 5]$           0.3208     0.2001         0.1420         0.1037         0.0210
$E[P_{(k)} \mid n \ge 5]$           0.3358     0.1976         0.1345         0.0998         0.0253
$E[z_{(k)} \mid n \ge 5]$           0.3237     0.2046         0.1451         0.1054         0.0157
$E[Q_{(k)} \mid 5 \le n \le 7]$     0.3996     0.2399         0.1578         0.1024         0.0336
$E[P_{(k)} \mid 5 \le n \le 7]$     0.4165     0.2276         0.1503         0.0981         0.0339
$E[z_{(k)} \mid 5 \le n \le 7]$     0.4081     0.2407         0.1570         0.1012         0.0285
$E[Q_{(k)} \mid 8 \le n \le 10]$    0.3184     0.1985         0.1438         0.1078         0.0182
$E[P_{(k)} \mid 8 \le n \le 10]$    0.3327     0.2081         0.1362         0.1031         0.0233
$E[z_{(k)} \mid 8 \le n \le 10]$    0.3166     0.2041         0.1478         0.1103         0.0128
$E[Q_{(k)} \mid n \ge 11]$          0.2470     0.1631         0.1247         0.1004         0.0119
$E[P_{(k)} \mid n \ge 11]$          0.2614     0.1564         0.1172         0.0977         0.0193
$E[z_{(k)} \mid n \ge 11]$          0.2500     0.1703         0.1305         0.1039         0.0065
Table 1: Average implied odds and winning probabilities, and expected segment lengths for (from
above) all races in our dataset, all races with $5 \le n \le 7$, with $8 \le n \le 10$, and with $n \ge 11$. The
theoretical expectations of the segment lengths are calculated by averaging the first moment of Eq.
2 over the empirical distribution of $n$.
                                    favourite  2nd favourite  3rd favourite  4th favourite  longshot
                                    k = 1      k = 2          k = 3          k = 4          k = n
$E[Q_{(k)} \mid \text{win}]$        0.3735     0.2148         0.1542         0.1139         0.0886
$E[z_{(k)} \mid I_{(k)} = 1]$       0.3622     0.2196         0.1549         0.1145         0.0383
Table 2: Average implied odds given that the horse wins and expected segment lengths given that
it contains a random point for all races in our dataset (with $n \ge 5$). The theoretical expectations
of the segment lengths are calculated by averaging Eq. 3 over the empirical distribution of $n$.
By using well-known binomial identities we find from Eq. 2, after a somewhat lengthy calculation⁴, that
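$$\bar{z}_{(k)} = \frac{1}{n} \sum_{j=k}^{n} \frac{1}{j} = \frac{1}{n} H_{n,k}, \tag{4}$$
with the partial harmonic number $H_{n,k} \equiv \sum_{j=k}^{n} j^{-1}$ and
$$\overline{z^2_{(k)}} = \frac{2}{n(n+1)} \sum_{j=k}^{n} \frac{H_{n,j}}{j} = \frac{2}{n+1} \sum_{j=k}^{n} \frac{\bar{z}_{(j)}}{j}. \tag{5}$$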
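Eqs. 4 and 5, together with the ratio in Eq. 3, are simple enough to evaluate exactly. The sketch below uses rational arithmetic. The function names are ours, and the example value n = 9 is an arbitrary choice close to the average field size, not a reproduction of the averaging over the empirical distribution of n used for the tables.

```python
from fractions import Fraction

def partial_harmonic(n, k):
    """H_{n,k} = sum_{j=k}^{n} 1/j."""
    return sum(Fraction(1, j) for j in range(k, n + 1))

def mean_length(n, k):
    """Eq. 4: expected length of the k-th largest of n segments."""
    return partial_harmonic(n, k) / n

def second_moment(n, k):
    """Eq. 5: second moment of the k-th largest of n segments."""
    return Fraction(2, n * (n + 1)) * sum(
        partial_harmonic(n, j) / j for j in range(k, n + 1))

def mean_length_given_hit(n, k):
    """Eq. 3: expected length of the k-th largest segment,
    given that it contains the random point."""
    return second_moment(n, k) / mean_length(n, k)

n = 9
favourite_mean = float(mean_length(n, 1))
favourite_given_win = float(mean_length_given_hit(n, 1))
```

Two consistency checks follow directly from the formulas: the means sum to one over k, and the second moments sum to 2/(n + 1). For n = 2 the largest piece has mean 3/4 and second moment 7/12, as expected for a variable uniform on [1/2, 1].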
Table 2 compares the implied odds of the k-th favourite given that it wins with the average length of the
k-th largest segment given that it contains a random point, see Eq. 3. We observe again a good agreement
between the empirical odds and the theoretical prediction (except for the longshot, see discussion above).
Finally, we calculate the average initial odds of the winning horse, which is 0.2148. Its theoretical
prediction follows from Eq. 3,
$$\text{Average length of the segment containing the random point} = \sum_{k=1}^{n} E[z_{(k)} \mid I_{(k)} = 1]\, P[I_{(k)} = 1] = \sum_{k=1}^{n} \overline{z^2_{(k)}} = \frac{2}{n+1}, \tag{6}$$
which, after averaging over $n$, yields 0.2107, again very close to the empirical value.
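Eq. 6 can also be checked by simulation: break the stick, drop a uniformly distributed point on it (the "winner" is the segment that contains the point), and record the length of that segment. The sketch below is illustrative only; the function names, n = 8, the trial count and the seed are our assumptions.

```python
import bisect
import random

def length_of_hit_segment(n, rng):
    """Length of the segment of a randomly broken stick that contains a random point."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]
    point = rng.random()                    # uniform on [0, 1)
    i = bisect.bisect_right(edges, point)   # edges[i - 1] <= point < edges[i]
    return edges[i] - edges[i - 1]

def average_hit_length(n, trials, seed=0):
    """Monte Carlo estimate of the left-hand side of Eq. 6 for fixed n."""
    rng = random.Random(seed)
    return sum(length_of_hit_segment(n, rng) for _ in range(trials)) / trials

n = 8
simulated = average_hit_length(n, 20000)
predicted = 2 / (n + 1)
```

The simulated average should agree with 2/(n + 1) up to sampling noise. It is larger than the unconditional mean segment length 1/n because a random point is more likely to fall into a long segment.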
To summarise, we have found a remarkable agreement between the order statistics of the randomly broken
stick and the statistical properties of horse racing betting markets. We also observe that the empirical values
of the implied odds and true winning probabilities are close and therefore conclude that this betting market
is informationally efficient at least to some degree. Some discrepancies are found for the longshot, because
gamblers probably fail to rank the horses accurately when their number is large. Assuming that the implied
odds reflect to a large extent the true winning probabilities, we conclude that the “ability” of a horse can be
defined in such a way that its winning probability is the ratio of its “ability” to the sum of the abilities of all
horses in the race, provided “ability” is exponentially distributed.
Acknowledgements: Julius Bonart thanks Jean-Philippe Bouchaud, Jonathan Donier and Tomaso Aste
for interesting discussions. We would also like to give warm thanks to Peter A. Bebbington’s PhD supervisors
I. J. Ford and F. M. C. Witte, the funding body EPSRC and the Centre for Doctoral Training in Financial
Computing & Analytics.
⁴ The identity for $\bar{z}_{(k)}$ is reported in [14], p. 153, but the authors have no knowledge of a previous appearance of
Eq. 5.
References
[1] E. P. Wigner. Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics,
62:548–564, 1955.
[2] T. A. Brody, J. Flores, J. B. French, P. A. Mello, A. Pandey, and S. S. M. Wong. Random-matrix
physics: spectrum and strength fluctuations. Reviews of Modern Physics, 53:385–480, 1981.
[3] L. Laloux, P. Cizeau, J.-P. Bouchaud, and M. Potters. Noise dressing of financial correlation matrices.
Physical Review Letters, 83:1467–1470, 1999.
[4] J-P Bouchaud and M Potters. Theory of financial risk and derivative pricing. Cambridge, 2009.
[5] B. Tóth, Y. Lempérière, C. Deremble, J. De Lataillade, J. Kockelkoren, and J. P. Bouchaud. Anomalous
price impact and the critical nature of liquidity in financial markets. Physical Review X, 1(2):021006,
2011.
[6] J. Donier, J. Bonart, I. Mastromatteo, and J.-P. Bouchaud. A fully consistent, minimal model for
non-linear market impact. Quantitative Finance, 15:1109–1121, 2015.
[7] J. Donier and J. Bonart. A million metaorder analysis of market impact on the bitcoin.
http://papers.ssrn.com/sol3/Papers.cfm?abstract_id=2536001, 2014.
[8] Tiziana di Matteo. Multi-scaling in finance. Quantitative Finance, 7:21–36, 2005.
[9] F. Patzelt and K. Pawelzik. An inherent instability of efficient markets. Scientific Reports, 3:2784, 2013.
[10] L. V. Williams. Information efficiency in betting markets: A survey. Bulletin of Economic Research,
51:1–39, 1999.
[11] S. Figlewski. Subjective information and market efficiency in a betting market. The Journal of Political
Economy, 87:75–88, 1979.
[12] P. Divos, S. del Baño Rollin, Z. Bihary, and T. Aste. Risk-neutral pricing and hedging of in-play football
bets. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2598767, 2014.
[13] L. Holst. On the length of the pieces of a stick broken at random. Journal of Applied Probability,
17:623–634, 1980.
[14] H. A. David and H. N. Nagaraja. Order Statistics. John Wiley & Sons, Inc., third edition, 2003.