Order statistics of horse racing and the randomly broken stick
Peter A. Bebbington^{1,2} and Julius Bonart^{3,4}
December 9, 2016
1: Department of Physics and Astronomy, University College London, London WC1E 6BT; 2: Trium Capital LLP, 60 Gresham St, London EC2V 7BB; 3: Financial Computing & Analytics, Department of Computer Science, University College London, London WC1E 6BT; 4: CFM-Imperial Institute of Quantitative Finance, Department of Mathematics, Imperial College, London SW7 2AZ
Abstract
We find a remarkable agreement between the statistics of a randomly divided interval and
the observed statistical patterns and distributions found in horse racing betting markets. We
compare the distribution of implied winning odds, the average true winning probabilities, the
implied odds conditional on a win, and the average implied odds of the winning horse with
the corresponding quantities from the “randomly broken stick problem”. We observe that
the market is at least to some degree informationally efficient. From the mapping between
exponential random variables and the statistics of the random division we conclude that horses’
true winning abilities are exponentially distributed.
1 Introduction
From time to time nature has a taste for simplicity. It can then be promising to treat unknown variables as
purely random, using statistics that are compatible with the constraints, symmetries, or boundary conditions
of the given problem, but otherwise as simple as possible. Heavy nuclei are an example of such a system; they
are seemingly hopelessly complex, yet the spacings between their energy levels follow well-known statistics of
random matrix eigenvalues [1, 2]. More recently, one such statistic, the Marchenko-Pastur distribution, has been found in the fluctuations of financial covariance matrices [3], despite the strong non-Gaussian dependencies observed in real financial time series [4]. The latter example underscores the success of econophysics: socio-economic human systems are highly non-linear [5, 6, 7, 8] and chaotic [9], but methods borrowed from statistical physics can still be successful in describing the bulk statistics of these systems.
Traditionally, econophysics has somewhat neglected one type of financial market: betting markets. This is perhaps surprising because economists, on the contrary, have studied betting markets extensively, considering them a controlled experiment for market efficiency [10, 11, 12], a key concept in financial
economics. Because the outcome of the bet (win or lose) is known with certainty after a certain time, it is straightforward to draw conclusions from the discrepancy between the implied market odds (we use "implied odds" here in the sense of "implied probability") and the true winning probability. If the difference is large, the market is regarded as inefficient, because its participants are not able to "price" the bet correctly. If the difference is small, the market is regarded as efficient. How are the implied market odds calculated? Consider for example a horse race with n = 3 horses. Assume that after betting $X on the first horse you would get $3X if this horse wins. The "price" of a bet on the first horse is thus 3. If the price of a bet on the second horse is 2 and the price of a bet on the third horse is 6,
the second horse is the favourite because the market would pay the smallest ratio of the gain (including the original stake) to the stake itself if this horse wins: its implied winning odds, the reciprocal of its price, are 1/2, so that the gambler's average payoff is zero.
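As a quick numerical illustration (ours, not part of the original paper), the implied odds of this three-horse example are just the reciprocals of the quoted prices:

```python
# Minimal sketch: decimal prices quote the total payout (stake included)
# per unit stake, so the implied winning odds are their reciprocals.
prices = [3.0, 2.0, 6.0]               # "prices" of the three bets above
implied = [1.0 / p for p in prices]    # implied odds: 1/3, 1/2, 1/6

print(implied)       # [0.333..., 0.5, 0.1666...]
print(sum(implied))  # 1.0 -- an idealised book with no bookmaker fee
```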
The sum of all the implied odds must be roughly one. We know that this is not always exactly true, for
example because bookmakers charge the gamblers a small fee, but for the purpose of this study these tiny
deviations are not important. Each horse’s implied odds thus represent a segment of the unit interval. We
do not know much about horses, but we can guess the simplest statistics for these segments: draw n - 1 numbers from the uniform distribution and cut the unit interval at each of these numbers. We thus break a stick of unit length randomly into n pieces, each of which represents the winning odds of an individual horse participating in a race with n horses. The favourite's odds correspond to the largest segment, the second favourite's odds to the second largest segment, and so on.
This letter reports striking similarities between the empirical distribution of implied odds observed in horse racing betting markets and the order statistics of the random division of the unit interval. Moreover, we find that conditional expectations of the true winning probabilities closely follow their corresponding values from the "broken stick problem" as well. (Of course, we cannot observe an empirical distribution of true winning probabilities, only aggregate statistics such as, for example, the average winning probability of the favourite horse.) We therefore conclude that the true winning probabilities of horses behave like random divisions of the unit interval, and that the market follows these statistics in
the implied odds. Finally, from the well-known mapping between exponential random variables and the statistics of the random division [13] we make the somewhat vague statement that horses' true "abilities" are exponentially distributed: the probability that a horse with ability X_i wins against n - 1 other horses is then

P_i = \frac{X_i}{\sum_{j=1}^{n} X_j},    (1)

in the sense that P_i follows the statistics of the broken stick problem if and only if the X_i are exponentially distributed.
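This equivalence is easy to check numerically. The following Monte Carlo sketch (our own illustration, with an arbitrary choice of n = 8) compares the mean ordered segments of the randomly broken stick with the mean ordered values of normalised exponential "abilities" as in Eq. 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_races = 8, 200_000          # illustrative parameters

# Broken stick: cut [0, 1] at n - 1 uniform points and take the segments.
cuts = np.sort(rng.random((n_races, n - 1)), axis=1)
stick = np.diff(cuts, prepend=0.0, append=1.0)
stick = -np.sort(-stick, axis=1)                  # largest segment first

# Exponential "abilities" X_i, normalised as in Eq. 1.
X = rng.exponential(size=(n_races, n))
P = -np.sort(-X / X.sum(axis=1, keepdims=True), axis=1)

# The two sets of order statistics agree up to Monte Carlo noise.
print(stick.mean(axis=0))
print(P.mean(axis=0))
```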
Consider the interval [0, 1] and divide it randomly into n sub-intervals. The length of the k-th largest sub-interval, which we denote here by z_{(k)}, has the distribution [13]

P[z_{(k)} > x \mid n] = \sum_{j=1}^{k-1} \binom{n}{j} \sum_{\ell=0}^{n-j} (-1)^{\ell-1} \binom{n-j}{\ell} \left[1 - (j+\ell)x\right]_+^{n-1} + \sum_{\ell=1}^{n} (-1)^{\ell-1} \binom{n}{\ell} \left[1 - \ell x\right]_+^{n-1},    (2)

with a_+ = \max[a, 0]. We want to compare P[z_{(k)} > x] to the empirical distribution of implied winning odds
of horse racing betting markets.
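Eq. 2 is straightforward to evaluate numerically and to check against a direct simulation of the broken stick. Below is a small sketch (ours, with illustrative parameters n = 8, k = 2, x = 0.25):

```python
import numpy as np
from math import comb

def p_zk_gt_x(k: int, x: float, n: int) -> float:
    """Eq. 2: P[z_(k) > x | n], the CCDF of the k-th largest segment."""
    first = sum(comb(n, j)
                * sum((-1) ** (l - 1) * comb(n - j, l)
                      * max(1.0 - (j + l) * x, 0.0) ** (n - 1)
                      for l in range(0, n - j + 1))
                for j in range(1, k))
    second = sum((-1) ** (l - 1) * comb(n, l)
                 * max(1.0 - l * x, 0.0) ** (n - 1)
                 for l in range(1, n + 1))
    return first + second

# Monte Carlo check: simulate broken sticks and compare.
rng = np.random.default_rng(1)
n, k, x = 8, 2, 0.25
seg = np.diff(np.sort(rng.random((100_000, n - 1)), axis=1),
              prepend=0.0, append=1.0)
zk = np.sort(seg, axis=1)[:, -k]            # k-th largest segment
print(p_zk_gt_x(k, x, n), (zk > x).mean())  # the two should agree closely
```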
We use data collected through Betfair on 12736 races occurring across the British Isles in the period from 31/12/2011 to 15/12/2012. The average number of horses per race is 8.95. We consider only races with at least 5 horses, which reduces the total number of races in our dataset to 11925. Gamblers exchange bets on horses in a limit order book. Sell orders match buy limit orders specified by volumes and lay decimal odds; buy orders match sell limit orders specified by volumes and back decimal odds. Decimal odds quote the ratio of the payout amount, including the original stake, to the stake itself. The highest back quote is larger than the lowest lay quote (note that here buy orders match at lower quotes than sell orders). The implied winning odds are defined as the reciprocal of the last matched quote before the race starts.
Consider now the implied odds of the k-th favourite horse, which we denote by Q_{(k)}. Fig. 1 compares the empirical complementary cumulative distribution function (ECCDF) P[Q_{(k)} > x] with the theoretical prediction P[z_{(k)} > x] for the favourite, 2nd favourite, 3rd favourite, 4th favourite, and the horse with the smallest implied winning odds (the "longshot"), averaged over the number of horses in each race. The agreement is striking and calls for further investigation.
Figure 1: (Dashed) ECCDF of the implied odds of the four most favoured horses and the longshot (red: favourite, blue: second favourite, green: third favourite, violet: fourth favourite, orange: longshot) and (solid) the cumulative distribution of the corresponding segments of the division of the unit interval, displayed on (main) linear and (inset) double-logarithmic scales. Note that there are no free fitting parameters.

To compare the true winning probabilities to the order statistics of the broken stick problem we need to calculate average quantities. Table 1 shows the expected length of the k-th largest segment, the average empirical implied odds of the k-th favourite, and the average observed true winning probabilities of the k-th favourite, denoted by P_{(k)}, for all races in our dataset and for three subgroups containing roughly an equal number of horses: races with 5 ≤ n ≤ 7 horses, races with 8 ≤ n ≤ 10 horses, and races with n ≥ 11 horses. The theoretical expectations of the segment lengths are calculated by taking the first moment of Eq. 2 (given analytically in Eq. 4 below) and averaging over the empirical distribution of n. Not only do the empirical implied odds correspond to the expected segment lengths, but the average observed winning probabilities also follow the order statistics of the random division accurately for all horses, with the exception of the longshot. Note that our theoretical estimates of the winning odds based on the expected segment lengths are parameter-free.

                       favourite   2nd favourite   3rd favourite   4th favourite   longshot
                       k = 1       k = 2           k = 3           k = 4           k = n
E[Q_(k) | n ≥ 5]       0.3208      0.2001          0.1420          0.1037          0.0210
E[P_(k) | n ≥ 5]       0.3358      0.1976          0.1345          0.0998          0.0253
E[z_(k) | n ≥ 5]       0.3237      0.2046          0.1451          0.1054          0.0157

E[Q_(k) | 5 ≤ n ≤ 7]   0.3996      0.2399          0.1578          0.1024          0.0336
E[P_(k) | 5 ≤ n ≤ 7]   0.4165      0.2276          0.1503          0.0981          0.0339
E[z_(k) | 5 ≤ n ≤ 7]   0.4081      0.2407          0.1570          0.1012          0.0285

E[Q_(k) | 8 ≤ n ≤ 10]  0.3184      0.1985          0.1438          0.1078          0.0182
E[P_(k) | 8 ≤ n ≤ 10]  0.3327      0.2081          0.1362          0.1031          0.0233
E[z_(k) | 8 ≤ n ≤ 10]  0.3166      0.2041          0.1478          0.1103          0.0128

E[Q_(k) | n ≥ 11]      0.2470      0.1631          0.1247          0.1004          0.0119
E[P_(k) | n ≥ 11]      0.2614      0.1564          0.1172          0.0977          0.0193
E[z_(k) | n ≥ 11]      0.2500      0.1703          0.1305          0.1039          0.0065

Table 1: Average implied odds and winning probabilities, and expected segment lengths, for (from top) all races in our dataset (n ≥ 5), races with 5 ≤ n ≤ 7, with 8 ≤ n ≤ 10, and with n ≥ 11. The theoretical expectations of the segment lengths are calculated by averaging the first moment of Eq. 2 over the empirical distribution of n.
We observe significant discrepancies for the longshot: the differences between its implied odds, winning probability, and segment length are small for races with 5 ≤ n ≤ 7 horses and larger for races with more horses. This suggests that gamblers are not able to rank the horses precisely enough when the number of horses is large. Remember that we define the rank of a horse according to the observed implied odds. Therefore, the smallest segment may describe a horse that the market has not recognised as the weakest one. In this case the market's longshot is in reality a slightly stronger horse. This is consistent with the fact that both the implied odds and the winning probabilities of the longshot are larger than suggested by Eq. 2.
We also calculate the implied odds of the k-th favourite given that this horse wins. This quantity is naturally larger than the unconditional implied odds of the k-th favourite. To find the corresponding theoretical prediction, consider the indicator function I_{(k)}, which is one if a random point in the interval [0, 1] lies in the k-th largest segment and zero otherwise. Then:

P[z_{(k)} = x \mid I_{(k)} = 1] = \frac{P[I_{(k)} = 1 \mid z_{(k)} = x] \, P[z_{(k)} = x]}{P[I_{(k)} = 1]} = \frac{x \, P[z_{(k)} = x]}{\bar{z}_{(k)}},

and

E[z_{(k)} \mid I_{(k)} = 1] = \frac{\overline{z^2_{(k)}}}{\bar{z}_{(k)}},    (3)

where \bar{z}_{(k)} and \overline{z^2_{(k)}} denote the first and second moments of z_{(k)}.
Eq. 3 is the theoretical prediction of the k-th favourite's odds given that this horse wins.
Using well-known binomial identities, we find from Eq. 2, after a somewhat lengthy calculation (the identity for \bar{z}_{(k)} is reported in [14], p. 153, but we are not aware of a previous appearance of Eq. 5), that

\bar{z}_{(k)} = \frac{1}{n} \sum_{j=k}^{n} \frac{1}{j} = \frac{1}{n} H_{n,k},    (4)

with the partial harmonic number H_{n,k} \equiv \sum_{j=k}^{n} j^{-1}, and

\overline{z^2_{(k)}} = \frac{2}{n(n+1)} \sum_{j=k}^{n} \frac{H_{n,j}}{j} = \frac{2}{n+1} \sum_{j=k}^{n} \frac{\bar{z}_{(j)}}{j}.    (5)
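Eqs. 4 and 5 are simple to implement; the following sketch (ours, with an illustrative choice of n) checks both moments against a direct simulation of the broken stick:

```python
import numpy as np

def H(n: int, k: int) -> float:
    """Partial harmonic number H_{n,k} = sum_{j=k}^{n} 1/j."""
    return sum(1.0 / j for j in range(k, n + 1))

def zbar(k: int, n: int) -> float:
    """Eq. 4: mean length of the k-th largest segment."""
    return H(n, k) / n

def z2bar(k: int, n: int) -> float:
    """Eq. 5: mean squared length of the k-th largest segment."""
    return 2.0 / (n * (n + 1)) * sum(H(n, j) / j for j in range(k, n + 1))

rng = np.random.default_rng(2)
n = 8
seg = np.diff(np.sort(rng.random((200_000, n - 1)), axis=1),
              prepend=0.0, append=1.0)
seg = -np.sort(-seg, axis=1)                         # largest segment first
for k in (1, 2, n):
    print(zbar(k, n), seg[:, k - 1].mean())          # Eq. 4 vs simulation
    print(z2bar(k, n), (seg[:, k - 1] ** 2).mean())  # Eq. 5 vs simulation
```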
Table 2 compares the implied odds of the k-th favourite given that it wins with the average length of the k-th largest segment given that it contains a random point; see Eq. 3. We again observe good agreement between the empirical odds and the theoretical prediction (except for the longshot; see the discussion above).

                       favourite   2nd favourite   3rd favourite   4th favourite   longshot
                       k = 1       k = 2           k = 3           k = 4           k = n
E[Q_(k) | win]         0.3735      0.2148          0.1542          0.1139          0.0886
E[z_(k) | I_(k) = 1]   0.3622      0.2196          0.1549          0.1145          0.0383

Table 2: Average implied odds given that the horse wins, and expected segment lengths given that the segment contains a random point, for all races in our dataset (n ≥ 5). The theoretical expectations of the segment lengths are calculated by averaging Eq. 3 over the empirical distribution of n.
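The conditional expectation in Eq. 3 can likewise be verified by simulation: drop a uniform random point (the "winner") on each stick and average the length of the k-th largest segment over the sticks in which the point landed in that segment. A self-contained sketch (ours, with illustrative parameters):

```python
import numpy as np

def H(n, k):                     # partial harmonic number, as above
    return sum(1.0 / j for j in range(k, n + 1))

def zbar(k, n):                  # Eq. 4
    return H(n, k) / n

def z2bar(k, n):                 # Eq. 5
    return 2.0 / (n * (n + 1)) * sum(H(n, j) / j for j in range(k, n + 1))

rng = np.random.default_rng(3)
n, n_races, k = 8, 200_000, 2    # illustrative parameters

seg = np.diff(np.sort(rng.random((n_races, n - 1)), axis=1),
              prepend=0.0, append=1.0)
seg = -np.sort(-seg, axis=1)     # segments ranked, largest first

# Rank of the segment containing a uniform random point (the "winner").
u = rng.random(n_races)
rank = (u[:, None] < seg.cumsum(axis=1)).argmax(axis=1)

hit = rank == k - 1
print(seg[hit, k - 1].mean())    # empirical E[z_(k) | I_(k) = 1]
print(z2bar(k, n) / zbar(k, n))  # theoretical value from Eq. 3
```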
Finally, we calculate the average implied odds of the winning horse, which are 0.2148. The theoretical prediction follows from Eq. 3:

E[\text{length of the segment containing the random point}] = \sum_{k=1}^{n} E[z_{(k)} \mid I_{(k)} = 1] \, P[I_{(k)} = 1] = \sum_{k=1}^{n} \overline{z^2_{(k)}} = \frac{2}{n+1},    (6)

which after averaging over the empirical distribution of n yields 0.2107, again very close to the empirical value.
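As a quick consistency check of Eq. 6 (ours, reusing the Eq. 5 helper from the sketches above), the size-biased segment lengths indeed sum to 2/(n + 1):

```python
def H(n, k):
    return sum(1.0 / j for j in range(k, n + 1))

def z2bar(k, n):            # Eq. 5
    return 2.0 / (n * (n + 1)) * sum(H(n, j) / j for j in range(k, n + 1))

n = 8                       # illustrative number of horses
print(sum(z2bar(k, n) for k in range(1, n + 1)))  # 0.2222...
print(2.0 / (n + 1))                              # = 2/9, matching Eq. 6
```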
To summarise, we have found a remarkable agreement between the order statistics of the randomly broken stick and the statistical properties of horse racing betting markets. We also observe that the empirical values of the implied odds and the true winning probabilities are close, and we therefore conclude that this betting market is informationally efficient at least to some degree. Some discrepancies are found for the longshot, probably because gamblers fail to rank the horses accurately when their number is large. Assuming that the implied odds reflect to a large extent the true winning probabilities, we conclude that the "ability" of a horse can be defined in such a way that its winning probability is the ratio of its "ability" to the sum of the abilities of all horses in the race (Eq. 1), provided "ability" is exponentially distributed.
Acknowledgements: Julius Bonart thanks Jean-Philippe Bouchaud, Jonathan Donier and Tomaso Aste
for interesting discussions. We would also like to give warm thanks to Peter A. Bebbington’s PhD supervisors
I. J. Ford and F. M. C. Witte, the funding body EPSRC and the Centre for Doctoral Training in Financial
Computing & Analytics.
References
[1] E. P. Wigner. Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, 62:548–564, 1955.
[2] T. A. Brody, J. Flores, J. B. French, P. A. Mello, A. Pandey, and S. S. M. Wong. Random-matrix physics: spectrum and strength fluctuations. Reviews of Modern Physics, 53:385–480, 1981.
[3] L. Laloux, P. Cizeau, J.-P. Bouchaud, and M. Potters. Noise dressing of financial correlation matrices.
Physical Review Letters, 83:1467–1470, 1999.
[4] J.-P. Bouchaud and M. Potters. Theory of Financial Risk and Derivative Pricing. Cambridge University Press, 2009.
[5] B. Tóth, Y. Lempérière, C. Deremble, J. de Lataillade, J. Kockelkoren, and J.-P. Bouchaud. Anomalous price impact and the critical nature of liquidity in financial markets. Physical Review X, 1(2):021006, 2011.
[6] J. Donier, J. Bonart, I. Mastromatteo, and J.-P. Bouchaud. A fully consistent, minimal model for
non-linear market impact. Quantitative Finance, 15:1109–1121, 2015.
[7] J. Donier and J. Bonart. A million metaorder analysis of market impact on the bitcoin.
http://papers.ssrn.com/sol3/Papers.cfm?abstract_id=2536001, 2014.
[8] T. Di Matteo. Multi-scaling in finance. Quantitative Finance, 7:21–36, 2005.
[9] F. Patzelt and K. Pawelzik. An inherent instability of efficient markets. Scientific Reports, 3:2784, 2013.
[10] L. V. Williams. Information efficiency in betting markets: A survey. Bulletin of Economic Research,
51:1–39, 1999.
[11] S. Figlewski. Subjective information and market efficiency in a betting market. The Journal of Political Economy, 87:75–88, 1979.
[12] P. Divos, S. del Baño Rollin, Z. Bihary, and T. Aste. Risk-neutral pricing and hedging of in-play football bets. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2598767, 2014.
[13] L. Holst. On the length of the pieces of a stick broken at random. Journal of Applied Probability,
17:623–634, 1980.
[14] H. A. David and H. N. Nagaraja. Order Statistics. John Wiley & Sons, Inc., third edition, 2003.