The Haka Network: Evaluating Rugby Team Performance with Dynamic Graph Analysis


Paolo Cintia¹, Michele Coscia² and Luca Pappalardo¹
¹ KDDLab - ISTI CNR, Via G. Moruzzi 1 - 56124 Pisa, Italy Email: {name.surname}
² Naxys - University of Namur, Rempart de la Vierge 8, 5000 Namur Belgium Email:
Abstract—Real world events are intrinsically dynamic and
analytic techniques have to take into account this dynamism. This
aspect is particularly important in complex network analysis
when relations are channels for interaction events between
actors. Sensing technologies open the possibility of doing so
for sport networks, enabling the analysis of team performance
in a standardized environment with fixed rules. Useful applications are
directly related to improving playing quality, but can also shed
light on all forms of team efforts that are relevant for work
teams, large firms with coordination and collaboration issues
and, as a consequence, economic development. In this paper, we
consider dynamics over networks representing the interaction
between rugby players during a match. We build a pass network
and we introduce the concept of disruption network, building
a multilayer structure. We perform both a global and a micro-
level analysis on game sequences. When deploying our dynamic
graph analysis framework on data from 18 rugby matches, we
discover that structural features that make networks resilient to
disruptions are a good predictor of a team’s performance, both at
the global and at the local level. Using our features, we are able
to predict the outcome of the match with a precision comparable
to state of the art bookmaking.
Mining dynamics on graphs is a challenging, complex
and useful problem [4]. Many networks are representations
of evolving phenomena, thus understanding graph dynamics
brings us a step closer to an accurate description of reality.
Sensing technologies open the possibility of performing dy-
namic graph analysis in an ever expanding set of contexts.
One of them is competitive events. Sports analysis is particularly
interesting because it involves a setting where both
environment and rules are standardized, thus providing us with an
objective measure of players’ contributions.
These possibilities power a number of potentially useful
applications. The direct one is related to the improvement
of a team’s performance. Once the most important factors
contributing to or preventing victories are identified, the team
can work on strategies to steer its collective effort toward
the best practices. However, there are non-sport related
externalities too. A sport team is nothing more than a group of
people with different skills that is trying to achieve a goal in
the most efficient way possible. This description can be applied
to any other form of team [15]: a start-up enterprise, a large
manufacturing firm, a group of scientists writing a scientific
paper. If we are able to shed light on how group dynamics
affect sport teams, we can try to universalize collaboration best
practices to enhance productivity in many different scenarios.
IEEE/ACM ASONAM 2016, August 18-21, 2016, San Francisco, CA, USA
978-1-5090-2846-7/16/$31.00 © 2016 IEEE
Fig. 1: The pass and tackle networks superimposed on the same
structure for one match in our dataset: Italy (blue nodes, home
team) versus New Zealand (black nodes, away team). The
number of the node refers to the role of the player. Green
edges are the pass network and red edges are the tackle
network. Node layout is determined by the classical rugby
player positioning on the pitch during a scrum.
It comes as no surprise that the analysis of sport events is a
fast-growing topic in the data science literature. Previous works involve the
analysis of different sports: from soccer [14] [9] to basketball
[27], from American football [20] to cycling [8]. In this paper
we focus on a sport that was not analyzed before: rugby.
The reason is that rugby has some distinctive features that
make it an ideal sport to consider. Like American football,
it is a sport where there is a very clear and simple success
measure: the number of meters gained in territory. Again like
American football, it contains a disruption network: tackles.
Unlike American football, performance is also related to how
well a team can weave its own pass structure, involving all
the players in the field in the construction of uninterrupted
sequences that can lead to a score.
One of the main contributions of the paper is to use both
the pass and disruption networks at the same time, creating a
multilayer network analysis [13] [3] of dynamics on graphs.
Figure 1 depicts an example of such a structure. To the best
of our knowledge, no such attempt has been made elsewhere
in the literature. We also perform a multiscale analysis: we first
focus on the multilayer network as a whole (as the result of
all interactions during a match, as in [9]) and we then perform
a micro-level analysis on each match sequence.
Our data comes from the 2012 Tri-Nations championship,
2012 New Zealand Europe tour, 2012 Irish tour to New
Zealand and 2011 Churchill Cup (only USA matches), for
a total of 18 matches. The data was collected by Opta¹, using
field positioning and semi-automatically annotated events.
We find that there are some features of the pass networks
that make teams more successful in their quest for territorial
gain. In particular, the connectivity of the network seems to
play an important role. Being able to avoid structurally crucial
players, who would leave part of the team isolated if they were
removed from the network by a tackle, is associated with the
highest amounts of meters made. This is confirmed both when
we analyze the global network resulting from the entire match
and when we mine the patterns of each single
sequence of the game. In the latter case, the signal is harder
to disentangle from noise, and most analyzed features did not
yield any significant result. This highlights how much rugby
is a game dependent on a grand match strategy, rather than on
just a sequence-by-sequence short term tactical one.
When applied to a real world prediction task, our framework
fares well in comparison with state of the art attempts. The
power in predicting the winner of a match is comparable with
the one of bookmakers, who have access to the full history
of teams and of the players actually performing on the pitch.
In particular, our framework is less susceptible to reputation
bias: the algorithm is not afraid to designate New Zealand as a
loser in what bookmakers saw as a great upset result – its loss
to England in the December 1st, 2012 Twickenham clash.
In the last decade, sports analytics has increased its per-
vasiveness as large-scale performance data became available
[23]. Researchers from different disciplines started to analyze
massive datasets of players’ and teams’ performance collected
from monitoring devices. The enormous potential of sports
data is affecting both individual and team sports, providing
a valid tool to verify existing sports theories and develop
new ones. As an example in individual sports, Cintia et
al. [8], [6] develop a first large scale data-driven study on
cyclists’ performance by analyzing the workouts of 30,000
amateur cyclists. The analysis reveals that cyclists’ perfor-
mances follow precise patterns, thus discovering an efficient
training program learned from data. In tennis, Yucesoy and
Barabási develop a predictive model that relies on a tennis
player’s performance in tournaments to predict her popularity
[28]. In the NBA basketball league, the performance efficiency
rating introduced by Hollinger [12] is a stable and widely
used measure to assess players’ performance by combining
the many types of data gathered during every game (passes
completed, shots made, etc.). Vaz de Melo et al. [27]
introduced network analysis to the mix. In baseball, Rosales
and Spratt propose a new methodology to quantify the credit
for whether a pitch is called a ball or strike among the catcher,
the pitcher, the batter, and the umpire involved [21]. Smith et
al. propose a Bayesian classifier to predict baseball awards
in the US Major League Baseball, reaching an accuracy of
80% in the predictions [25]. In soccer, networks are a widely
used tool to determine the interactions between players on
the field, where soccer players are nodes of a network and
a pass between two players represents a link between the
respective nodes. For example, Cintia et al. [9], [7] exploit
passing networks to detect the winner of a game based on the
passing behavior of the teams. The authors of [14] shared this aim, without
using networks. They discovered that, while the strategy of
the majority of successful teams is based on maximizing
ball possession, another successful strategy is to maximize a
defense/attack efficiency score. Dynamic graph analysis has
also been applied for ranking purposes [24], [20], [18]. Other
examples of the wider applicability of team-based success research
come from the analysis of citation [22] and social [15] networks.
Methodologically, our paper is indebted to the vast field
of the analysis of dynamics on networks [4]. This field is usually approached
both from a statistical [26] and a mining perspective
[5]. We adopt the latter approach. In mining network dynam-
ics, the aim is to find regularities in the evolution of networks
[2]. There are many applications for these techniques, beyond
sports analytics: epidemiology [17], mobility [10] and genetics
[16]. In this work, we focus on a narrower part of this field,
since our networks are multilayer: nodes can be connected
with edges belonging to multiple types. A good survey on
multilayer networks, both modeling and analysis, can be found
in [13]. The specific multilayer model we adopt is the one of
multidimensional networks first presented in [3].
The data has been collected by Opta and made available in
2013 as part of the AIG Rugby Innovation Challenge². Opta
performs a semi-automatic data collection that happens as the
game unfolds. Sensors feed a team composed of two or more
human annotators, who record the various actions of the match. Once
the event has ended, the data is double-checked for consistency
and then serialized as an XML file.
Fig. 2: The relationship between meters made on a carry and
final score in our dataset (axis label: Meters Made). We exclude outliers for teams who
did not score in a match (two occurrences in our data).
A pass $p$ is defined by an action that was coded as a successful
pass in the data. This means we drop forward, intercepted or
otherwise erroneous passes that directly result in the team
losing possession. A pass is composed of a pair of players:
the pass originator and the pass receiver. Both players have to
be part of the same team. We refer to $P_{t,g}$ as the set of passes
made by team $t$ in game $g$, i.e. what we call the “pass network”.
$|P_{t,g}|$ is the total number of passes made, which is equivalent
to the sum of the edge weights of the pass network.
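
The construction of the weighted pass network can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `passes` list and its `(passer, receiver)` layout are assumptions made here, and unsuccessful passes are assumed to have been filtered out beforehand.

```python
from collections import Counter

def build_pass_network(passes):
    """Return {(passer, receiver): weight} for one team in one game.

    `passes` is a hypothetical list of (passer, receiver) pairs that
    contains only successful passes.
    """
    return Counter(passes)

# Toy data (illustrative jersey numbers, not taken from the dataset).
passes = [(9, 10), (10, 12), (9, 10), (12, 13), (10, 9)]
network = build_pass_network(passes)

# |P_{t,g}| equals the sum of the edge weights of the pass network.
total_passes = sum(network.values())
print(network[(9, 10)], total_passes)
```

Storing the network as an edge-to-weight mapping keeps the direction of each pass, which the directed features defined later rely on.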
A disruption $d$, also called a tackle, is defined by an action
that was coded as a successful tackle in the data. This means we
keep all the different tackle types recorded by the data provider
(chases, line, guard, etc.), but we drop the ones whose result
was not a clean tackle, meaning that the defender conceded
a penalty or allowed the attackers to continue the offense. A
disruption is composed of a pair of players: the tackler and the
tackled. The players have to belong to different teams, making this
a bipartite network. Similarly to the pass notation, $D_{t,g}$ and
$|D_{t,g}|$ denote the set and number of tackles made, respectively.
For convenience, we define an aggregate of the disruption
action: $d_{u,t,g}$ is the number of disruptions targeted at player
$u$ by team $t$ in game $g$, divided by the number of all disruptions made by $t$ in $g$.
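
As a small sketch of the aggregate $d_{u,t,g}$, the relative share of tackles received by each player can be computed as below. The `tackles` list and its `(tackler, tackled)` layout are hypothetical; only the ratio itself comes from the definition above.

```python
from collections import Counter

def relative_disruptions(tackles):
    """Map each tackled player u to d_{u,t,g}.

    `tackles` is a hypothetical list of (tackler, tackled) pairs holding
    the successful tackles made by team t in game g.
    """
    counts = Counter(tackled for _, tackled in tackles)
    total = len(tackles)
    return {u: c / total for u, c in counts.items()}

# Toy data: team t tackled opponent 10 twice, 12 once and 9 once.
tackles = [(7, 10), (6, 10), (7, 12), (8, 9)]
d = relative_disruptions(tackles)
print(d)
```

By construction the shares sum to one over the tackled players, so they can be used directly as weights.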
A sequence $s$ is a list of passes and disruptions. Roughly,
we define a sequence as a phase of the game going from the
start of a possession to its end. In rugby terminology, this
means defining a sequence as a phase of the game going from
one interruption of the game to the next. Interruptions are tries,
penalties, drop goals, scrums and lineouts. Note that clean
changes in possession (intercepts, ruck steals and others) are
considered interruptions too. The union of the pass and tackle
networks results in a structure of the type depicted in Figure 1.
In Section V, the quality measure used to distinguish
between successful and unsuccessful teams is the number of
meters made with a carry. Rugby is a very territory-oriented
game, and it is largely won by gaining more meters with the ball
in hand. There is a substantial positive correlation between meters made
and score/victory, and Figure 2 depicts this relationship (in the
figure, the Pearson correlation is 0.58). As a consequence,
we use the information recorded in the metres attribute of
each actionrow element in the data as our success measure.
TABLE I: The notation used in the paper.

Symbol        Meaning
$g$           Single game (match)
$d$           Disruption (tackle)
$d_{u,t,g}$   Relative number of disruptions targeted at $u$ by $t$ in $g$
$s$           Sequence, a set of passes and tackles
$P_{t,g}$     Passes made in game $g$ by team $t$
$D_{t,g}$     Disruptions made in game $g$ by team $t$
$m_{s,t}$     Meters made in sequence $s$ by team $t$
$M_{g,t}$     Meters made in game $g$ by team $t$

Fig. 3: A toy model of a directed graph.
We refer to the meters gained in a sequence $s$ by team $t$ as $m_{s,t}$,
and we use it in the sequence analysis (Section V-B). In the
case of the structural analysis (Section V-A), we consider the
game $g$ as a whole, and we define the overall success as all
the meters made, i.e. $M_{g,t} = \sum_{s \in g} m_{s,t}$.
Table I sums up the notation used in the paper.
In this section we consider the multilayer network of passes
and disruptions as a whole, as it results from all interactions
between and within the two teams over the entire length of the
match. We define a set of features characterizing the network.
We aim at predicting the number of meters made on a carry
by the team over the match. The features are team-dependent,
thus for each match we have two observations: the features for
the home team and for the away team. Each team has two sets
of features: the pass features and the disruption features.
A. Pass Features
The pass features are calculated over the team’s
pass network $P_{t,g}$. We consider the topology of $P_{t,g}$ in
isolation, as a directed graph that is not interacting with
other external events. The features we define are purely
topological: connectivity ($\gamma_{t,g}(P_{t,g})$), assortativity
($\rho_{t,g}(P_{t,g})$), number of strongly connected components
($\sigma_{t,g}(P_{t,g})$), and clustering ($\Delta_{t,g}(P_{t,g})$).
Connectivity is defined as follows. Given two nodes, their local
connectivity is the minimum number of nodes that must be removed to
break all paths between the two nodes in $P_{t,g}$ [1]. For instance,
in Figure 3, to separate nodes 2 and 4 you only need to remove
one node (3). With $\gamma_{t,g}(P_{t,g})$ we refer to the average of the local
node connectivity over all pairs of nodes of $P_{t,g}$.
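
A minimal sketch of this measure, assuming the standard reduction of local node connectivity to a unit-capacity maximum flow on a node-split graph. The edge list is a hypothetical reconstruction that is merely consistent with the textual description of Figure 3, not the figure's actual edges. Counting a direct edge between adjacent nodes as one path, and averaging over ordered pairs, are conventions chosen here.

```python
from collections import deque
from itertools import permutations

def local_node_connectivity(nodes, edges, s, t):
    """Maximum number of node-disjoint directed paths from s to t."""
    cap = {}

    def add_arc(a, b):
        cap[(a, b)] = 1
        cap.setdefault((b, a), 0)  # residual arc

    # Split every intermediate node into "in" and "out" halves (capacity 1).
    for v in nodes:
        if v not in (s, t):
            add_arc((v, "in"), (v, "out"))
    for u, v in edges:
        if u != v:
            add_arc((u, "out"), (v, "in"))
    adj = {}
    for a, b in cap:
        adj.setdefault(a, []).append(b)

    source, sink = (s, "out"), (t, "in")
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            a = queue.popleft()
            for b in adj.get(a, []):
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if sink not in parent:
            return flow
        b = sink
        while parent[b] is not None:
            a = parent[b]
            cap[(a, b)] -= 1
            cap[(b, a)] += 1
            b = a
        flow += 1

def average_node_connectivity(nodes, edges):
    """Average local node connectivity over all ordered pairs of nodes."""
    pairs = list(permutations(nodes, 2))
    return sum(local_node_connectivity(nodes, edges, s, t) for s, t in pairs) / len(pairs)

# Hypothetical directed toy graph consistent with the description of Figure 3.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (3, 5), (4, 5)]
print(local_node_connectivity(nodes, edges, 2, 4))
print(average_node_connectivity(nodes, edges))
```

On this toy graph every path from node 2 to node 4 passes through node 3, so the local connectivity of that pair is one, matching the example in the text.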
Assortativity ($\rho_{t,g}(P_{t,g})$) is the Pearson correlation coefficient
of the degrees of all pairs of nodes connected by an
edge [19]. A positive assortativity means that, in the network,
nodes with high degree tend to connect with other nodes with
high degree and vice versa. Figure 3 represents an assortative
network, as its degree correlation coefficient is 0.15.
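
The degree correlation can be sketched in a few lines. The text does not say which degrees are paired on a directed graph, so this sketch assumes the out-degree of the source and the in-degree of the target of each edge; that choice is an assumption made here, as is the toy edge list.

```python
from math import sqrt

def degree_assortativity(edges):
    """Pearson correlation between source out-degree and target in-degree.

    Raises ZeroDivisionError if either degree sequence has zero variance.
    """
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    xs = [out_deg[u] for u, _ in edges]
    ys = [in_deg[v] for _, v in edges]
    n = len(edges)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy graph: node 1 passes to 2 and 3, node 2 passes to 3.
toy_edges = [(1, 2), (1, 3), (2, 3)]
r = degree_assortativity(toy_edges)
print(r)
```

A negative value, as in this toy example, indicates that high-degree sources tend to connect to low-degree targets.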
A strongly connected component in a network is a set of
nodes for which there is a path from any node of the component
to any other node in the component following the directed
edges of the graph. A directed network might
contain one or more strongly connected components.
$\sigma_{t,g}(P_{t,g})$ counts the number of strongly connected
components in $P_{t,g}$. Figure 3 depicts a directed graph with
three strongly connected components: one composed of nodes
1, 2 and 3; and two other components composed of the single
nodes 4 and 5, since they are connected by a single directed
edge, which does not allow node 4 to be reached from node 5.
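
For networks of about 15 nodes, strongly connected components can be counted directly from the definition: two nodes share a component when each is reachable from the other. The sketch below takes this simple route rather than a linear-time algorithm, and uses a hypothetical edge list that is only consistent with the description of Figure 3.

```python
def reachable(adj, start):
    """Set of nodes reachable from `start` following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        for v in adj.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def count_strong_components(nodes, edges):
    """Number of strongly connected components of a directed graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    reach = {u: reachable(adj, u) for u in nodes}
    # The component of u holds every v that u reaches and that reaches u.
    components = {frozenset(v for v in reach[u] if u in reach[v]) for u in nodes}
    return len(components)

# Hypothetical directed toy graph consistent with the description of Figure 3.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (3, 5), (4, 5)]
n_components = count_strong_components(nodes, edges)
print(n_components)
```

The three components found are {1, 2, 3}, {4} and {5}, as in the example above.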
The clustering of a node $u$ is the fraction of possible
triangles through that node that exist:
$$c_u = \frac{2 T_u}{k_u (k_u - 1)},$$
where $T_u$ is the number of triangles through node $u$, and
$k_u > 1$ is its degree (for $k_u \leq 1$ the convention is to fix
$c_u = 0$). The $\Delta_{t,g}(P_{t,g})$ feature is the mean clustering, or:
$$\Delta_{t,g}(P_{t,g}) = \frac{1}{n} \sum_{u \in P_{t,g}} c_u,$$
where $n$ is the number of nodes in $P_{t,g}$ (for rugby usually
$n = 15$, because there are 15 players in a rugby team, although
in some cases a player might never receive a pass, setting
$n = 14$). Note that clustering is defined for undirected graphs.
Thus, in this case, $P_{t,g}$ is projected onto a derived structure in
which we ignore edge direction. This is the only one of the
four measures for which we have to perform this projection.
The graph in Figure 3 has a high clustering coefficient
($\approx 0.87$), because nodes 1, 2, 4 and 5 all have $c_u = 1$ (they are
all part of the only triangle they could be part of) while node
3 is part of only two of its six possible triangles.
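
The mean clustering on the undirected projection can be checked with a short sketch. The edge list is again a hypothetical reconstruction consistent with the description of Figure 3 (two triangles sharing node 3); only its undirected structure matters here.

```python
from itertools import combinations

def mean_clustering(nodes, edges):
    """Mean local clustering of the undirected projection of a graph."""
    nbrs = {u: set() for u in nodes}
    for u, v in edges:
        if u != v:  # ignore self-loops
            nbrs[u].add(v)
            nbrs[v].add(u)
    total = 0.0
    for u in nodes:
        k = len(nbrs[u])
        if k > 1:  # c_u = 0 by convention when k_u <= 1
            triangles = sum(1 for a, b in combinations(nbrs[u], 2) if b in nbrs[a])
            total += 2 * triangles / (k * (k - 1))
    return total / len(nodes)

# Hypothetical toy graph consistent with the description of Figure 3.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (3, 5), (4, 5)]
clustering = mean_clustering(nodes, edges)
print(round(clustering, 2))
```

Nodes 1, 2, 4 and 5 contribute 1 each and node 3 contributes 2/6, giving (4 + 1/3) / 5, which rounds to the 0.87 quoted in the text.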
B. Disruption Features
The disruption features exploit the multilayer nature of the
conjunction between $P_{t,g}$ and $D_{t,g}$. We consider how different
the features of $P_{t,g}$ become when one of its nodes gets disabled
by a tackle. We compare these features with the version of $P_{t,g}$
where no node gets removed and we weight this difference by
the relative number of times that the disruption has been made.
For each disruption feature we use the same notation as the
pass feature, with an overline. We define $P^{u}_{t,g}$ as the pass
network $P_{t,g}$ when deprived of node $u$.
For connectivity, our definition is as follows:
$$\overline{\gamma}_{t_1,g}(P_{t_1,g}) = \sum_{u \in t_1} d_{u,t_2,g} \times \left(\gamma_{t_1,g}(P^{u}_{t_1,g}) - \gamma_{t_1,g}(P_{t_1,g})\right).$$
In practice, the $\gamma_{t_1,g}(P^{u}_{t_1,g}) - \gamma_{t_1,g}(P_{t_1,g})$ term calculates
what happens to the connectivity of $P_{t_1,g}$ when removing $u$.
A negative number means that the connectivity decreases, or
$\gamma_{t_1,g}(P^{u}_{t_1,g}) < \gamma_{t_1,g}(P_{t_1,g})$: you need to remove fewer nodes
to disconnect pairs of nodes in the network. A positive value
means that the connectivity increases. The $d_{u,t_2,g}$ term weighs
this connectivity change by the relative number of times $t_2$
was able to disable player $u$ with a successful disruption. The
sum aggregates the measure over all players representing
team $t_1$, the team receiving the tackles.
To have an intuition of this operation, consider the tackle
features as an expression of the resilience level of the pass
network: they estimate how much the network is resistant to
disruptions. The higher the value, the more resilient the pass
network is.
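The other disruption features are defined in the same way:
$$\overline{\rho}_{t_1,g}(P_{t_1,g}) = \sum_{u \in t_1} d_{u,t_2,g} \times \left(\rho_{t_1,g}(P^{u}_{t_1,g}) - \rho_{t_1,g}(P_{t_1,g})\right),$$
$$\overline{\sigma}_{t_1,g}(P_{t_1,g}) = \sum_{u \in t_1} d_{u,t_2,g} \times \left(\sigma_{t_1,g}(P^{u}_{t_1,g}) - \sigma_{t_1,g}(P_{t_1,g})\right),$$
$$\overline{\Delta}_{t_1,g}(P_{t_1,g}) = \sum_{u \in t_1} d_{u,t_2,g} \times \left(\Delta_{t_1,g}(P^{u}_{t_1,g}) - \Delta_{t_1,g}(P_{t_1,g})\right),$$
for assortativity, number of strongly connected components,
and clustering, respectively.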
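
Since all four definitions share one template, a single helper parameterized by the feature function can illustrate them. In this sketch `disruption_feature`, the tackle weights `d` and the toy edge list are hypothetical; the feature used for the demonstration is a compact strongly-connected-component counter.

```python
def count_strong_components(nodes, edges):
    """Number of strongly connected components (mutual reachability)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            for v in adj.get(stack.pop(), ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    reach = {u: reachable(u) for u in nodes}
    return len({frozenset(v for v in reach[u] if u in reach[v]) for u in nodes})

def disruption_feature(nodes, edges, d, feature):
    """Sum over tackled players u of d[u] * (feature(P without u) - feature(P))."""
    base = feature(nodes, edges)
    total = 0.0
    for u, weight in d.items():
        kept_nodes = [v for v in nodes if v != u]
        kept_edges = [(a, b) for a, b in edges if u not in (a, b)]
        total += weight * (feature(kept_nodes, kept_edges) - base)
    return total

# Hypothetical pass network of team t1 and tackle shares made by team t2.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (3, 5), (4, 5)]
d = {3: 0.75, 4: 0.25}
sigma_bar = disruption_feature(nodes, edges, d, count_strong_components)
print(sigma_bar)
```

Removing node 3 raises the component count from 3 to 4, while removing node 4 lowers it to 2, so the weighted change is 0.75 × 1 + 0.25 × (−1) = 0.5. Players who were never tackled have zero weight and can be skipped.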
In the case of disruptions, we have an additional feature.
For each player $u$ in a $(t, g)$ pair we can calculate a centrality
value, answering the question: how central was player $u$ for
team $t$ in game $g$? Thus, $\beta_{u,t,g}$ is defined as $u$’s closeness
centrality [11] in $P_{t,g}$. The tackle centrality disruption is then
defined as:
$$\overline{\beta}_{t_1,g}(P_{t_1,g}) = \sum_{u \in t_1} (d_{u,t_2,g} \times \beta_{u,t_1,g}).$$
It represents the weighted average closeness centrality of the
players of $t_1$ tackled by $t_2$.
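
A rough sketch of the tackle centrality follows. Closeness centrality has several variants for directed and disconnected graphs, and the text does not say which one is used; this sketch assumes outgoing shortest-path distances and divides the number of reachable nodes by the sum of their distances. The edge list and the weights `d` are hypothetical.

```python
from collections import deque

def closeness(adj, u):
    """Reachable-node count divided by total distance from u (0 if isolated)."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        a = queue.popleft()
        for b in adj.get(a, ()):
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def tackle_centrality(edges, d):
    """Weighted average closeness of the tackled players, weights d[u]."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
    return sum(weight * closeness(adj, u) for u, weight in d.items())

# Hypothetical pass network and tackle shares.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (3, 5), (4, 5)]
d = {3: 0.5, 5: 0.5}
beta_bar = tackle_centrality(edges, d)
print(beta_bar)
```

Node 3 reaches four nodes at total distance 5 (closeness 0.8) and node 5 reaches none (closeness 0), so the weighted value is 0.4.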
In this section, we perform the network analysis to estimate
the meters gained by a rugby team using its network features.
We start by looking at the global pass and disruption features,
calculated over the entire aggregated match (Section V-A). We
then focus on an analysis considering each match sequence as
a single observation (Section V-B).
A. Structural Analysis
In this section we calculate the network features presented in
the previous section over the pass and disruption networks. We
use these features as independent variables of a simple OLS
model. The dependent variable of the model is the number of
meters made carrying the ball, as described in Section III. We
log transform the dependent variable.
TABLE II: The regression results predicting meters made
using pass network features.

Dependent variable: log of meters made

                         (1)         (2)         (3)         (4)
$h$                      0.058       0.160∗      0.081       0.077
                        (0.096)     (0.084)     (0.096)     (0.084)
$\log |P_{t,g}|$         0.366∗      0.218       0.379∗∗     0.159
                        (0.201)     (0.217)     (0.158)     (0.165)
$\Delta_{t,g}$           0.451
$\gamma_{t,g}$                       0.285∗∗∗
$\rho_{t,g}$                                     0.471
$\sigma_{t,g}$                                               −0.073∗∗∗
Constant                 5.118∗∗∗    7.547∗∗∗    5.451∗∗∗    6.609∗∗∗
                        (0.724)     (0.891)     (0.764)     (0.816)
Observations             36          36          36          36
R²                       0.247       0.474       0.277       0.409
Adjusted R²              0.176       0.424       0.210       0.353
Residual Std. Error      0.273       0.229       0.268       0.242
F Statistic              3.500∗∗     9.602∗∗∗    4.096∗∗     7.375∗∗∗

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Note that in the regression we control for the home factor
with a binary variable h. This is done because home advantage
is very strong in rugby³, and we want to make sure it does not
affect our results. A second control we impose is the number
of passes made. There is an expected correlation between
how many meters a team will advance and the time it has
possession of the ball. The number of passes made is a good
proxy for this information.
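
Written out, each column of the pass feature regressions appears to correspond to a specification of the following form. The coefficient names $\alpha$, $\beta_h$, $\beta_P$, $\beta_f$ and the error term $\varepsilon_{t,g}$ are notation introduced here for illustration and do not appear in the original text:

```latex
\log M_{g,t} = \alpha + \beta_h \, h_{t,g} + \beta_P \, \log |P_{t,g}|
             + \beta_f \, f_{t,g}(P_{t,g}) + \varepsilon_{t,g},
\qquad f \in \{\Delta, \gamma, \rho, \sigma\},
```

where $h_{t,g}$ is 1 when team $t$ plays game $g$ at home and 0 otherwise, and exactly one feature $f$ enters each model alongside the two controls.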
We first check the correlations between the pass features
and the number of meters made. Table II reports the results
of our models. We check one feature at a time, always
including our control factors. Two of the four features
do not exhibit a significant correlation with the number of
meters made: clustering ($\Delta_{t,g}$) and assortativity ($\rho_{t,g}$). This
suggests that, when crafting their own pass network, teams do
not need to encourage or discourage triadic closure (a
receiver of a pass passing to the passer that originated the
action) or assortativity (passing the ball to a player with
a team connectivity similar to their own).
More interesting are the significant associations with con-
nectivity (γt,g) and number of strongly connected components
(σt,g). These two measures are related and it is no surprise they
have opposite signs: the higher the connectivity of a network
the fewer components it has. We interpret these coefficients
as follows: a rugby team’s pass network should ensure a
strong connectivity, likely establishing that there are multiple
and reciprocal pathways for the ball to reach all players.
Redundancy and structural strength are important in a rugby team.
These two factors are able to explain away the simple control
on quantity of possession, represented by the number of
passes made. Note that it is remarkable to obtain a significance
of $p < .01$ with a very small sample size ($N = 36$). We do
not show a model containing all features at once, because
the collinearity between $\gamma_{t,g}$ and $\sigma_{t,g}$ makes them both not significant.
³ As of 2016, a weak team like Italy won only 12 out of 85 matches in the
Six Nations, and 11 of them were in Italy.

TABLE III: The regression results predicting meters made
using tackling network features.

Dependent variable: log of meters made

                              (1)         (2)         (3)         (4)         (5)
$h$                           0.162∗      0.195∗      0.183∗∗     0.225∗∗     0.128
                             (0.091)     (0.103)     (0.085)     (0.109)     (0.092)
$\log |D_{t,g}|$             −0.203       0.263∗      0.175       0.190       0.082
                             (0.129)     (0.149)     (0.121)     (0.152)     (0.131)
$\overline{\beta}_{t,g}$      1.601∗∗∗
$\overline{\Delta}_{t,g}$                 1.379∗∗
$\overline{\gamma}_{t,g}$                             0.243∗∗∗
$\overline{\rho}_{t,g}$                                           0.805∗∗
$\overline{\sigma}_{t,g}$                                                     −0.082∗∗∗
Constant                      7.206∗∗∗    7.360∗∗∗    7.401∗∗∗    8.077∗∗∗    7.747∗∗∗
                             (0.564)     (0.642)     (0.519)     (0.646)     (0.544)
Observations                  36          36          36          36          36
R²                            0.415       0.250       0.483       0.191       0.414
Adjusted R²                   0.360       0.179       0.434       0.115       0.359
Residual Std. Error           0.241       0.273       0.227       0.283       0.241
F Statistic                   7.563∗∗∗    3.548∗∗     9.956∗∗∗    2.517∗      7.537∗∗∗

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

We now turn our attention to the disruption features. In
this case, instead of controlling for the number of passes, we
control for the number of disruptions. The control whether a
team played home or away is still in place. Table III reports
the results of these models.
Differently from the pass features, all disruption features are
significantly correlated with the number of meters made. The
result for the average tackle centrality $\overline{\beta}_{t,g}$ seems surprising:
the higher the centrality of the most targeted players, the more
meters the team will make. However, this is only an apparent
surprise: the most central players are the scrum half and fly
half (numbers 9 and 10 in Figure 1), who do
not usually cover a lot of ground. On the other hand, the wings
(numbers 11 and 14 in Figure 1) are very peripheral but also
expected to cover most of the ground. This result is therefore
expected and not very interesting.
More interesting is that the more a team can maintain a
high value of clustering, connectivity and assortativity after
being disrupted by the average opponent tackle, the more it
can advance on the pitch. As expected, the coefficient of the number of strong
components is still negative: if after disruption the team still
does not have many isolated components, it is expected to be
able to advance. The advice for a defending team would be to target
its tackles at the players who are the most responsible for
keeping the opponent's pass network connected, but not the most central.
The significance levels of $\overline{\gamma}_{t,g}$ and $\overline{\sigma}_{t,g}$ are also higher in this case.
Note that connectivity is the single most important feature
discovered. In both Table II and Table III the associated R² is
the highest. Together with our controls, $\gamma_{t,g}$ and $\overline{\gamma}_{t,g}$ are able
to explain 47-48% of the variance in the number of meters made.
Fig. 4: The decision tree for predicting the number of meters
made in a match using the pass/disruption features. The tree
node’s font size is proportional to the number of observations
with the given characteristics. Leaves report the expected
number of meters made for the branch. Edge labels report the value
of the node’s feature. (Split nodes: Tackle Connectivity, Tackle Strong Comps,
Tackle Connectivity; edge labels: Very Low, Low, High; leaves: 2008±160,
1236±296 and 965±151 meters.)
Is it true that connectivity is the most important feature to
predict a rugby team’s performance? We conclude this section
by showing how we can use these features in a data mining
framework to predict the number of meters a team will cover
ball in hand. We apply a simple decision tree technique⁴,
where the target variable is the number of meters made and
the predictors are all the features we discussed so far.
Figure 4 depicts the result. The most important split variable
the algorithm found was indeed a connectivity measure, the
one resulting from tackles. It is such an important feature
that it has been selected twice at two different tree levels –
note that we pruned the tree to avoid overfitting. A very low
tackle connectivity means that the team, as a result of the
opponent’s disruptions, lost most of its original connectivity.
This is associated with the poorest performances in meters
gained: a very low tackle connectivity resulted in less than
1,000 expected meters made. This is as little as half the
expectation for a team with a high connectivity retention,
plus the ability of not having its own strongly connected
components broken apart. In this case, the team is expected
to advance 2,000 meters.
B. Sequence Analysis
⁴ Implementation obtained from
Fig. 5: The distribution of sequence statistics across all the
observed matches. (a) Duration in minutes of sequences.
(b) Number of events per sequence. (c) Number of passes per sequence.
(d) Number of tackles per sequence.
A deeper evaluation of a team's performance can be obtained
by analyzing how team networks are built, action after action.
To do that, we split each game into a sequence of possession
phases, i.e. time intervals during which a team is controlling the ball.
The split into possession phases is made by selecting all the
events between two events that identify a start of possession. In
particular, we sort all the events according to their timestamps,
then we select all the sequences between possession events of
the two teams playing. The possession of a team lasts until
the opposing team performs a possession event. The possession
events we consider to split a game into possession phases are
the following: 50m Restart, 22m Restart, Free Kick, Turnover
Won, Lineout, Lineout Steal, Scrum, Scrum Steal, Ruck Won,
Maul Won, Penalty, Pass intercepted.
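
The splitting rule can be sketched as follows. The `(timestamp, team, event_type)` tuple layout, the team codes and the non-possession event names in the toy data are assumptions made for illustration; only the list of possession events comes from the text.

```python
POSSESSION_EVENTS = {
    "50m Restart", "22m Restart", "Free Kick", "Turnover Won", "Lineout",
    "Lineout Steal", "Scrum", "Scrum Steal", "Ruck Won", "Maul Won",
    "Penalty", "Pass intercepted",
}

def split_into_phases(events):
    """Group events into possession phases.

    `events` is a hypothetical list of (timestamp, team, event_type) tuples.
    A phase starts when a team performs a possession event and lasts until
    the opposing team performs one. Events before the first possession
    event are dropped.
    """
    phases = []
    for timestamp, team, kind in sorted(events):
        starts_possession = kind in POSSESSION_EVENTS
        if starts_possession and (not phases or phases[-1]["team"] != team):
            phases.append({"team": team, "events": []})
        if phases:
            phases[-1]["events"].append((timestamp, team, kind))
    return phases

# Toy event log, deliberately out of order to exercise the sort.
events = [
    (12.0, "NZL", "Pass"), (10.0, "NZL", "Scrum"), (15.0, "ITA", "Tackle"),
    (18.0, "NZL", "Ruck Won"), (25.0, "ITA", "Turnover Won"), (27.0, "ITA", "Pass"),
]
phases = split_into_phases(events)
print([(p["team"], len(p["events"])) for p in phases])
```

A possession event by the team already in possession, such as the ruck won at 18.0, extends the current phase instead of opening a new one.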
Figure 5 depicts the distribution of various features across
game sequences. The overall trend suggests that these are all
broad distributions: the number of observations does not allow
us to test for a power-law hypothesis, but we can conclude
that there are heavy tails. In particular, Figure 5(a) shows
that usually game sequences are very brief, but some can
last for multiple minutes, when a team performs prolonged
phases of possession – for instance attempting to score a try.
This in turn implies that the number of events per action also
distributes broadly (Figure 5(b)) even when broken down in
passes (Figure 5(c)) and tackles (Figure 5(d)).
The goal of such an analysis is to understand the dynamics
of a game relying on the network features already described
in the previous sections. Here we are not primarily interested in
predicting the result of a game: the focus is on the analysis of
the single features in a dynamic context, in order to evaluate
the additional knowledge provided by the network features we defined.
Once a game is subdivided into possession phases, we can
observe how the pass network (i.e. the passing interactions
between players) grows over time. As a performance measure,
we use the quantity of meters gained by a team, from which
we subtract the meters gained by its opposition. We analyze
the average value of each feature over time and we compare
it to the performance of the team, for each of the 18 games we
have in our dataset. Among the features we are interested in
(connectivity, assortativity, strongly connected components,
clustering), we observe a significant negative correlation
between the average number of strongly connected components
and the meters gained (minus the opposition gains). Figure
6 highlights this correlation ($\rho = -0.49$).
Fig. 6: The correlation between strongly connected components
and meters gained w.r.t. the opposition.
This confirms the global analysis: if a team breaks down its
effort in many isolated components it is unlikely to be able to
gain additional meters. The fact that this is the only relevant
feature – and that no tackle feature was found significant –
suggests two additional insights. First, rugby is a game
fundamentally different from soccer, where the literature has
shown that single sequence features are more relevant
for evaluating team success [9]. Second, since these
features were relevant when calculated over the entire match
pass network (as Section V-A shows), rugby appears to
have peculiar dynamics. Our evidence suggests that, in rugby,
each action might matter not in isolation, but as part
of a grand match strategy that can only be appreciated by
analyzing the whole pass network.
In this section, we test if our model based on pass and
disruption network features is able to accurately predict the
result of the game. We build a cross validation framework
where we train our model on 17 matches, leaving one out,
and then we predict the result of the match left out using
the model trained on the other 17 matches. We repeat this
procedure for all matches in the dataset. Since Section V-B
showed that sequence features are not significant, we build
our model using exclusively global pass network features.
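The leave-one-out procedure can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the learning algorithm, so a nearest-centroid classifier stands in for the model, and the function names, the two feature columns, and the toy values are all hypothetical.

```python
def leave_one_out(features, labels, fit, predict):
    """Return one held-out prediction per match."""
    predictions = []
    for i in range(len(features)):
        # Train on every match except match i.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, features[i]))
    return predictions


def fit_centroids(xs, ys):
    """Stand-in model (an assumption): mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(centroids, x):
    """Assign x to the class with the closest centroid."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))


# Toy global features per match (placeholders): [connectivity, SCC count].
# Label 1 means the reference team gained more meters than its opposition.
features = [[0.90, 1.0], [0.80, 2.0], [0.85, 1.5],
            [0.30, 5.0], [0.20, 6.0], [0.25, 5.5]]
labels = [1, 1, 1, 0, 0, 0]
held_out = leave_one_out(features, labels, fit_centroids, predict_centroid)
```

Each entry of `held_out` comes from a model that never saw the corresponding match, which is what makes the accuracy figures an out-of-sample estimate despite the small dataset.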
We perform two prediction tasks, which differ in the target
variable of interest. In the first task, we aim at predicting
which of the two teams will gain more meters during the match.
With our model, we obtain the correct answer for 15 out of 18
matches, i.e. an accuracy of 83%.
Since in rugby the number of meters gained is highly
correlated with both score and odds of winning, we can also
use our model to predict who is going to win the match: we
say that the team predicted to gain more meters is going to
win. In this case, we make the correct prediction for 14 out
of 18 matches, i.e. an accuracy of 77%. Note that the
reduced accuracy is due to the fact that one of the matches in
the dataset ended in a draw. This is a very rare occurrence
in rugby, and it was not encoded in the model.⁵
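The step from the meters task to the winner task, and the effect of a draw on accuracy, can be illustrated with a short sketch. The function names, the `DRAW` label, and the placeholder teams are our own assumptions for illustration; only the counts 15/18 and 14/18 come from the text.

```python
DRAW = "draw"  # hypothetical label for a drawn match


def winner_from_meters(team_a, team_b, predicted_net_meters_a):
    """Name team_a the winner if it is predicted to out-gain team_b."""
    return team_a if predicted_net_meters_a > 0 else team_b


def accuracy(predicted, actual):
    """Fraction of matches where the predicted label equals the outcome."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Toy example with placeholder teams: the second match is a draw, so
# any named winner counts as an error because the rule never says "draw".
predicted = [winner_from_meters("A", "B", 120.0),
             winner_from_meters("C", "D", 35.0),
             winner_from_meters("E", "F", -60.0)]
actual = ["A", DRAW, "F"]
toy_accuracy = accuracy(predicted, actual)

# Figures reported in the text.
meters_task = 15 / 18  # about 0.833, i.e. 83%
winner_task = 14 / 18  # about 0.778, reported as 77% in the text
```

Because the rule always names one of the two teams, a drawn match is a guaranteed miss, which accounts for the one-match gap between the two tasks.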
How good is our prediction? A random predictor would flip
a coin and get the right answer 50% of the time. However,
rugby matches tend to be predictable, given enough infor-
mation about past performances of teams and players. These
performances are recorded by the World Rugby organization,
which publishes weekly updated national rankings of teams.
It is reasonable to assume that the higher ranked team of the
two playing is expected to win. If we use the World Rugby
rankings to predict the outcomes of the matches, we obtain a
very similar accuracy: 76%.⁶
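The ranking baseline amounts to a one-line rule, sketched below. The function names, team labels, and ranking positions are placeholders of our own, not World Rugby data; matches involving an unranked side are skipped, mirroring the exclusion described in footnote 6.

```python
def predict_by_ranking(team_a, team_b, ranking):
    """Pick the better-ranked team (1 = best); None if a team is unranked."""
    if team_a not in ranking or team_b not in ranking:
        return None
    return team_a if ranking[team_a] < ranking[team_b] else team_b


def baseline_accuracy(matches, ranking):
    """Return (hits, scored) over the matches where both teams are ranked."""
    hits = scored = 0
    for team_a, team_b, winner in matches:
        guess = predict_by_ranking(team_a, team_b, ranking)
        if guess is None:
            continue  # unranked side: the match is left out of the count
        scored += 1
        hits += guess == winner
    return hits, scored


# Placeholder ranking and results.
ranking = {"A": 1, "B": 2, "C": 3}
matches = [("A", "B", "A"), ("B", "C", "C"),
           ("A", "C", "A"), ("A", "Reserves", "A")]
hits, scored = baseline_accuracy(matches, ranking)
```

On the real data the denominator would be 17 rather than 18; an accuracy of 76% over 17 matches is consistent with 13 correct picks (13/17 ≈ 0.765), although the text does not state the count explicitly.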
We can do slightly better by looking at historical odds
data⁷. Bookmakers are more invested than World Rugby in
getting a specific match prediction right. In fact, their
accuracy was higher than that of both the World Rugby rankings
and our model: 86%. However, we could find data only for 14
matches, the ones involving New Zealand, because there is no
historic record for the USA rugby matches. This makes the
prediction task easier: lower ranked teams are more
unpredictable when playing each other, and the bookmakers
picked New Zealand in all the matches it played, New Zealand
being such a dominant rugby team.
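Turning odds into a prediction can be sketched as follows. We assume decimal odds here, since the text does not say in which format the historical odds were recorded; the function names and the odds values are placeholders. With decimal odds, the favourite is the outcome with the lowest odds, and the implied probability is the reciprocal of the odds, normalised to remove the bookmaker's margin.

```python
def favourite(decimal_odds):
    """Outcome with the lowest decimal odds, i.e. the bookmaker's pick."""
    return min(decimal_odds, key=decimal_odds.get)


def implied_probabilities(decimal_odds):
    """Reciprocal odds, rescaled so that the probabilities sum to one."""
    raw = {outcome: 1.0 / odds for outcome, odds in decimal_odds.items()}
    total = sum(raw.values())  # exceeds 1 when the book carries a margin
    return {outcome: p / total for outcome, p in raw.items()}


# Placeholder odds for a two-outcome market.
odds = {"home": 1.25, "away": 4.0}
pick = favourite(odds)
probabilities = implied_probabilities(odds)
```

An accuracy of 86% over 14 matches is consistent with 12 correct picks (12/14 ≈ 0.857), although the text does not state the count explicitly.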
To conclude this section, it is worth noting two things.
First, our model was able to successfully predict the biggest
upset of the 18 played matches: the victory of England over
New Zealand. Neither World Rugby nor bookmakers predicted
that. Second, our model is a purely structural system that
has no information about which team and which players are
performing. As such, its information pool is more restricted
than the one available to both World Rugby and bookmakers.
The fact that the model’s performance is on par with theirs
is rather encouraging. It is true that we then feed the model
perfect information recorded during the match, but we detail
in the conclusions how we plan to create a truly predictive
⁵ The match was Australia v New Zealand, played on October 20th, 2012.
Our model predicted a win for New Zealand.
⁶ Note that this is calculated over 17 matches, not 18, because one match
involved the reserve English national team, the England Saxons, which is not
ranked by World Rugby.
In this paper we build a multilayer network analysis framework
to describe the performance of rugby teams during a match. We
build two layers: a pass network and a tackle network. We
extract features from these layers and we use them at two
analytical levels. First, we extract features from the network
as a whole, representing the entire match. Then, we divide the
match into sequences of a single match action and we extract
sequence features. We use these features as correlates of match
performance, estimated by the number of meters a team advanced
on the field. We discover that, using the global features,
connectivity is very important for
a team to successfully advance. Second, we perform single
sequence mining and we find that, when considering actions
in isolation, most features have no significant relation with
a team’s performance. This shows how different rugby is from
soccer, where this analysis yielded the opposite result [9].
Our interpretation is that soccer is a game
of tactics, where each sequence yields results that are mostly
independent from the other sequences, while rugby is a game
of strategy, where sequences build on each other to obtain the
intended result. This is not to say that sequence analysis is not
useful: by looking at the dynamic graph one might be able
to understand key moments of games – i.e. moments where
the prospective winner changes. Finally, we show how our
predictive framework is on par with state of the art bookmaker
estimates, and better suited to predict upset results.
There are a number of directions that we can explore in
future work. First, we can work on a network comparison
of different sports, mainly soccer. This would build on top
of the differences between the two sports we highlighted
here. Second, we can investigate other reasons for the poor
predictive performance of sequences. Sequences of sequences
might give more insight to appreciate the dominance of a
team not during the whole game or just one sequence, but
during an intermediate period of time. Third, we are planning
to use our framework to shed light on
large performance shocks. In our data, we have several New
Zealand versus Ireland matches, with very different outcomes:
one ended 22-19 and another 60-0. The question would be:
what changed in Ireland’s performance across these two
matches, played only a week apart? Another interesting case
study would be the England versus New Zealand upset:
what made the English team the only one able to beat New
Zealand in the 14 matches in our data? Finally, we could test
our predictive model on an actual prediction. We could build
a predicted match network before the match starts and use it
to pick the winner before the match happens, not after as we
did here. To do so, we would need more data, as 18 matches
are not enough for reliably training our system.
Acknowledgments. We thank Opta for having made avail-
able the data on which this paper is based. We thank Vittorio
Romano for helping us with his knowledge of rugby rules and
game dynamics. Michele Coscia has been partly supported
by FNRS, grant 24927961. This work has been partially
funded by the following European projects: Cimplex (Grant
Agreement 641191) and SoBigData RI (Grant Agreement
[1] Lowell W Beineke, Ortrud R Oellermann, and Raymond E Pippert. The
average connectivity of a graph. Discrete mathematics, 252(1):31–45,
[2] Michele Berlingerio, Michele Coscia, Fosca Giannotti, Anna Monreale,
and Dino Pedreschi. Evolving networks: Eras and turning points.
Intelligent Data Analysis, 17(1):27–48, 2013.
[3] Michele Berlingerio, Michele Coscia, Fosca Giannotti, Anna Monreale,
and Dino Pedreschi. Multidimensional networks: foundations of struc-
tural analysis. World Wide Web, 16(5-6):567–593, 2013.
[4] Stefano Boccaletti, Vito Latora, Yamir Moreno, Martin Chavez, and D-
U Hwang. Complex networks: Structure and dynamics. Physics reports,
424(4):175–308, 2006.
[5] Björn Bringmann, Michele Berlingerio, Francesco Bonchi, and Aristides
Gionis. Learning and predicting the evolution of social networks.
Intelligent Systems, IEEE, 25(4):26–35, 2010.
[6] P Cintia, L Pappalardo, and D Pedreschi. Mining efficient training
patterns of non-professional cyclists. In 22nd Italian Symposium on
Advanced Database Systems (SEBD 2014), Sorrento Coast, Italy, June
16–18, 2014, pages 1–8, 2014.
[7] Paolo Cintia, Fosca Giannotti, Luca Pappalardo, Dino Pedreschi, and
Marco Malvaldi. The harsh rule of the goals: Data-driven performance
indicators for football teams. In 2015 IEEE International Conference
on Data Science and Advanced Analytics, DSAA 2015, Campus des
Cordeliers, Paris, France, October 19-21, 2015, pages 1–10, 2015.
[8] Paolo Cintia, Luca Pappalardo, and Dino Pedreschi. Engine matters:
A first large scale data driven study on cyclists’ performance. In Data
Mining Workshops (ICDMW), 2013 IEEE 13th International Conference
on, pages 147–153. IEEE, 2013.
[9] Paolo Cintia, Salvatore Rinzivillo, and Luca Pappalardo. A network-
based approach to evaluate the performance of football teams. In
Machine Learning and Data Mining for Sports Analytics Workshop,
Porto, Portugal, 2015.
[10] Michele Coscia, Salvatore Rinzivillo, Fosca Giannotti, and Dino Pe-
dreschi. Optimal spatial resolution for the analysis of human mobility.
In Advances in Social Networks Analysis and Mining (ASONAM), 2012
IEEE/ACM International Conference on, pages 248–252. IEEE, 2012.
[11] Linton C Freeman. Centrality in social networks conceptual clarification.
Social networks, 1(3):215–239, 1978.
[12] John Hollinger. The player efficiency rating, 2009.
[13] Mikko Kivelä, Alex Arenas, Marc Barthelemy, James P Gleeson, Yamir
Moreno, and Mason A Porter. Multilayer networks. Journal of Complex
Networks, 2(3):203–271, 2014.
[14] Joaquin Lago-Ballesteros and Carlos Lago-Peñas. Performance in team
sports: Identifying the keys to success in soccer. Journal of Human
Kinetics, 25:85–91, 2010.
[15] David Lazer and Allan Friedman. The parable of the hare and the
tortoise: Small worlds, diversity, and system performance. 2005.
[16] Nicholas M Luscombe, M Madan Babu, Haiyuan Yu, Michael Snyder,
Sarah A Teichmann, and Mark Gerstein. Genomic analysis of reg-
ulatory network dynamics reveals large topological changes. Nature,
431(7006):308–312, 2004.
[17] Robert M May and Alun L Lloyd. Infection dynamics on scale-free
networks. Physical Review E, 64(6):066112, 2001.
[18] Shun Motegi and Naoki Masuda. A network-based dynamical ranking
system for competitive sports. Scientific reports, 2, 2012.
[19] Mark EJ Newman. Mixing patterns in networks. Physical Review E,
67(2):026126, 2003.
[20] Juyong Park and Mark EJ Newman. A network-based ranking system
for US college football. Journal of Statistical Mechanics: Theory and
Experiment, 2005(10):P10014, 2005.
[21] Joe Rosales and Scott Spratt. Who is responsible for a called strike? In
The 9th annual MIT Sloan Sports Analytics Conference, 2015.
[22] Emre Sarigöl, René Pfitzner, Ingo Scholtes, Antonios Garas, and Frank
Schweitzer. Predicting scientific success based on coauthorship net-
works. EPJ Data Science, 3(1):1–16, 2014.
[23] R. Schumaker, O Solieman, and H. Chen. Sports Data Mining. Springer,
[24] Clément Sire and Sidney Redner. Understanding baseball team standings
and streaks. The European Physical Journal B, 67(3):473–481, 2009.
[25] Lloyd Smith, Bret Lipscomb, and Adam Simkins. Data mining in sports:
Predicting Cy Young award winners. Journal of Computing Sciences in
Colleges, 22(4):115–121, 2007.
[26] Tom AB Snijders. The statistical evaluation of social network dynamics.
Sociological methodology, 31(1):361–395, 2001.
[27] Pedro OS Vaz de Melo, Virgilio AF Almeida, Antonio AF Loureiro,
and Christos Faloutsos. Forecasting in the NBA and other team sports:
Network effects in action. ACM Transactions on Knowledge Discovery
from Data (TKDD), 6(3):13, 2012.
[28] Burcu Yucesoy and Albert-László Barabási. Untangling performance
from success. arXiv preprint arXiv:1512.00894, 2015.
... A change in the defensive team's position (i.e. an affordance or opportunity for action) could be perceived by several attacking players simultaneously, and may invite collective action from these players to use quick and accurate passing to capitalise on a gap in the defensive line. Despite the logical necessity for effective cooperative passing interactions amongst team-mates in rugby union, only limited research has reported on this topic (Cintia et al., 2016). This is surprising given that there is a suitable framework (i.e. ...
... rugby union and rugby league). To our knowledge, the only article reporting on passing networks within competitive rugby union match play was that of Cintia et al. (2016). These researchers reported that connectivity was important to gain territory (a proxy used in place of match success) and that a model using metrics derived from network analysis was on-par with bookmakers when predicting territory/outcome (i.e. ...
... This study utilised a cross-sectional approach, whereby the data from Australian male Super Rugby teams were utilised from the 2015-2019 seasons (four teams and 321 matches). The passes completed by players were defined as hand-to-hand transitions of the ball, and were encoded by Opta (OptaPro, London, UK), a database which has been frequently utilised within rugby research (Cintia et al., 2016;Croft et al., 2015;George et al., 2015;Watson et al., 2017). Passes were only accounted if they were identified as completed, and the lineout throw/pass interaction was considered a separate skill and was not encoded by the data provider as a pass, so it was not included in this study. ...
This study investigated cooperative passing interactions in elite rugby match play. Associations between team network metrics and match outcomes were also investigated. A cross-sectional approach was adopted, using data from four Australian Super Rugby teams, across five seasons. 44,178 passing actions were included across 321 team-fixture observations. Network metrics were calculated for each positional group within each match, and two statistical models were developed; First: a mixed-effects multinomial regression to identify differences between positional groups; and second: a mixed-effects binomial logistic regression to determine the association between team-level network metrics and match outcomes. Differences were identified between positional groups e.g. Halves had the highest out-degree centrality and betweenness, while Centres had higher eigenvector centrality than all other positions. Within the Forwards pack, the Back Row had greater in-degree, out-degree, and betweenness than the Tight Five. Regarding match outcomes, the model explained only 6.9% of variance, although greater in-degree centralisation (OR = 1.847 [1.241–2.749], p = 0.002) and lower eigenvector centralisation (OR = 0.655 [0.440–0.975]; p = 0.037) were associated with successful outcomes. Cooperative passing networks in rugby union may provide useful information to describe how various positions interact, and some behaviours may contribute towards successful team performance.
... A total of 53 of the 112 studies investigated a performance outcome in relation to cooperative networks. This included associations to network variables of match outcome, 24,26 stage of competition, such as qualification or finals, 27,28 ladder positions, 15,29 win-loss prediction models, 9,30 score margins 15,26 and comparisons of the winning team of the overall competition with others. 2,31 Almost all papers, 107 out of 112, investigated a comparator to network measures. ...
Team invasion games are sports in which individual team members interact and exchange information to coordinate their behaviours and actions in pursuit of the common goal of winning matches. Researchers have used social network analysis to quantify the cooperative behaviours of sports teams (cooperative network analysis), yet this research exists across an array of disciplines and uses various methods. Therefore, accessibility for practitioners and researchers interested in using it to quantify team cooperation in team invasion games may be limited. This systematic mapping review aimed to identify, report and discuss research in this emerging research area. Articles were systematically searched in electronic databases and reference list scans resulting in 112 papers included. Football was the most studied sport ( n = 91), and passing was the most observed interaction between players within a sports team ( n = 107). This review further revealed a lack of consistency in reporting between the included studies with respect to nomenclature and network measures. A comprehensive map of the current literature on the use of cooperative network analysis in team invasion games is provided which can be used by practitioners and researchers tasked with or interested in analysing team performance.
... The model which gave the highest percentage of correct predictions, with 59.21%, was the only one to include a venue variable, implying in part that playing at home does have a significant impact on the match outcome. Since home advantage is a phenomenon which has been proven to occur in most sports, including rugby [23] and cricket [24], it is discussed in one form or another in many studies written on the topic of match outcome prediction [9,[25][26][27][28][29]. ...
Full-text available
Recently, football has seen the creation of various novel, ubiquitous metrics used throughout clubs' analytics departments. These can influence many of their day-to-day operations ranging from financial decisions on player transfers, to evaluation of team performance. At the forefront of this scientific movement is the metric expected goals, a measure which allows analysts to quantify how likely a given shot is to result in a goal however, xG models have not until this point considered using important features, e.g., player/team ability and psychological effects, and is not widely trusted by everyone in the wider football community. This study aims to solve both these issues through the implementation of machine learning techniques by, modelling expected goals values using previously untested features and comparing the predictive ability of traditional statistics against this newly developed metric. Error values from the expected goals models built in this work were shown to be competitive with optimal values from other papers, and some of the features added in this study were revealed to have a significant impact on expected goals model outputs. Secondly, not only was expected goals found to be a superior predictor of a football team's future success when compared to traditional statistics, but also our results outperformed those collected from an industry leader in the same area.
... The researchers constructed a bipartite network using spatial locations of all shots to explore the match styles of gold, silver, and bronze medal badminton players. Such approach has been widely adopted in studies of football [20], basketball [21], and rugby [22], where researchers explored team's connection or interaction patterns by dividing the pitch into different zones. Herrera-Diestra et al. [20] and Cintia et al. [23] constructed a pitch network of nodes consisting of particular field subdivisions to evaluate the performance of teams. ...
This research aimed to describe and quantify the stroke patterns profiling professional tennis players by constructing a complex network specifically related to ball bounce locations. A total sample of 16,863 points from 127 Australian Open matches played by 128 male tennis single players were gathered and scrutinized for creating the bipartite tennis stroke network (TSN) with court zones divided into 40 nodes. Afterwards, network metrics were used to assess the prominence of different court zones: in/out-degree centrality (I/ODC), eigenvector centrality (EC), betweenness centrality (BC) and shortest-path length. The results showed that zone-5 (zone-36) and zone-8 (zone-33) generally had higher levels of zone utilization considering their I/ODC, EC and BC. Zone-8 (zone-33) to zone-33 (zone-8) turned out to have the least short path length with a value of 0.0008 in all paths which had the most path utilization. Moreover, different rally lengths and tournament rounds can lead to a series of stroke patterns. As the tournament rounds progressed from the 1st round to the final, the consistency in ball bounces of the zones tended to decrease during short to medium or long rallies. Future investigations and sport performance analysts could adopt the TSN method to analyze tactics and specific striking styles of individual players.
... A rather new approach in predicting performance is based on machine learning and network science [17,18]. Such methods have been used in relation to sport [19][20][21] and particularly football [22,23]. Most of the past research in this area, however, either focuses on inter-team interactions and modelling player behaviour rather than league tournament's results prediction, or are limited in scope-particularly, they rarely take an historical approach in order to study the game as an evolving phenomenon [24][25][26]. ...
Full-text available
In recent years, excessive monetization of football and professionalism among the players have been argued to have affected the quality of the match in different ways. On the one hand, playing football has become a high-income profession and the players are highly motivated; on the other hand, stronger teams have higher incomes and therefore afford better players leading to an even stronger appearance in tournaments that can make the game more imbalanced and hence predictable. To quantify and document this observation, in this work, we take a minimalist network science approach to measure the predictability of football over 26 years in major European leagues. We show that over time, the games in major leagues have indeed become more predictable. We provide further support for this observation by showing that inequality between teams has increased and the home-field advantage has been vanishing ubiquitously. We do not include any direct analysis on the effects of monetization on football’s predictability or therefore, lack of excitement; however, we propose several hypotheses which could be tested in future analyses.
... We also explore the differences in the collective behavior of male and female teams computing the passing networks, i.e., graphs in which nodes are players and edges represent passes between teammates in a match [19,[22][23][24][25]. From the passing network of a team T in a match g Pitch zones from where free-kicks and shots in motion are more likely to be made by male players (a) and female players (b). ...
Full-text available
Women’s football is gaining supporters and practitioners worldwide, raising questions about what the differences are with men’s football. While the two sports are often compared based on the players’ physical attributes, we analyze the spatio-temporal events during matches in the last World Cups to compare male and female teams based on their technical performance. We train an artificial intelligence model to recognize if a team is male or female based on variables that describe a match’s playing intensity, accuracy, and performance quality. Our model accurately distinguishes between men’s and women’s football, revealing crucial technical differences, which we investigate through the extraction of explanations from the classifier’s decisions. The differences between men’s and women’s football are rooted in play accuracy, the recovery time of ball possession, and the players’ performance quality. Our methodology may help journalists and fans understand what makes women’s football a distinct sport and coaches design tactics tailored to female teams.
... Social network analysis (SNA) may be a useful tool to provide deeper investigations as it can be used to study the behaviors of players, and their interactions with their teammates as well as with opponents during a match. [36][37][38][39][40] In this context, network nodes (players) as well as their interactions (passes, contacts, movement coordinates) provide a framework for performance modelling by identifying the impact of key pass-chains, main actors, and major events (try, replacement, exclusion, etc.) on the specific design of each of these networks. Based on this framework, collective experience among identified players could be associated to specific performance indicators. ...
Coaches and analysts prepare for upcoming matches by identifying common patterns in the positioning and movement of the competing teams in specific situations. Existing approaches in this domain typically rely on manual video analysis and formation discussion using whiteboards; or expert systems that rely on state-of-the-art video and trajectory visualization techniques and advanced user interaction. We bridge the gap between these approaches by contributing a light-weight, simplified interaction and visualization system, which we conceptualized in an iterative design study with the coaching team of a European first league soccer team. Our approach is walk-up usable by all domain stakeholders, and at the same time, can leverage advanced data retrieval and analysis techniques: a virtual magnetic tactic-board. Users place and move digital magnets on a virtual tactic-board, and these interactions get translated to spatio-temporal queries, used to retrieve relevant situations from massive team movement data. Despite such seemingly imprecise query input, our approach is highly usable, supports quick user exploration, and retrieval of relevant results via query relaxation. Appropriate simplified result visualization supports in-depth analyses to explore team behavior, such as formation detection, movement analysis, and what-if analysis. We evaluated our approach with several experts from European first league soccer clubs. The results show that our approach makes the complex analytical processes needed for the identification of tactical behavior directly accessible to domain experts for the first time, demonstrating our support of coaches in preparation for future encounters.
Conference Paper
Full-text available
The recent emergence of the so called online social fitness open up new scenarios for fascinating challenges in the field of data science. Through these platforms, users can collect, monitor and share with friends their sport performance, with interesting details about heart-rate, watt consumption and calories burned. The availability of this data, collected among a large number of users, gives us the possibility to explore new data mining applications. In the current work, we present the results of a study conducted on a sample of 29,284 cyclists downloaded via APIs from the social fitness platform We defined two basic metrics: a measure of training effort, that is how much a cyclist struggled during the workout; and a measure of training performance indicating the results achieved during the training. Although the average effort is weakly correlated with the average performance, by deeply investigating workouts time evolution and cyclists' training characteristics interesting findings came out. We found that athletes that better improve their performance follow precise training patterns usually referred as overcompensation theory, with alternation of stress peaks and rest periods. Studies and experiments related to such theory, up to now, have always been conducted by sports doctors on a few dozen professionals athletes. To the best of our knowledge, our study is the first corroboration on large scale of this theory.
Full-text available
Fame, popularity and celebrity status, frequently used tokens of success, are often loosely related to, or even divorced from professional performance. This dichotomy is partly rooted in the difficulty to distinguish performance, an individual measure that captures the actions of a performer, from success, a collective measure that captures a community's reactions to these actions. Yet, finding the relationship between the two measures is essential for all areas that aim to objectively reward excellence, from science to business. Here we quantify the relationship between performance and success by focusing on tennis, an individual sport where the two quantities can be independently measured. We show that a predictive model, relying only on a tennis player's performance in tournaments, can accurately predict an athlete's popularity, both during a player's active years and after retirement. Hence the model establishes a direct link between performance and momentary popularity. The agreement between the performance-driven and observed popularity suggests that in most areas of human achievement exceptional visibility may be rooted in detectable performance measures.
Full-text available
The multi-million sports-betting market is based on the fact that the task of predicting the outcome of a sports event is very hard. Even with the aid of an uncountable number of descriptive statistics and background information, only a few can correctly guess the outcome of a game or a league. In this work, our approach is to move away from the traditional way of predicting sports events, and instead to model sports leagues as networks of players and teams where the only information available is the work relationships among them. We propose two network-based models to predict the behavior of teams in sports leagues. These models are parameter-free, that is, they do not have a single parameter, and moreover are sport-agnostic: they can be applied directly to any team sports league. First, we view a sports league as a network in evolution, and we infer the implicit feedback behind network changes and properties over the years. Then, we use this knowledge to construct the network-based prediction models, which can, with a significantly high probability, indicate how well a team will perform over a season. We compare our proposed models with other prediction models in two of the most popular sports leagues: the National Basketball Association (NBA) and the Major League Baseball (MLB). Our model shows consistently good results in comparison with the other models and, relying upon the network properties of the teams, we achieved a ≈ 14% rank prediction accuracy improvement over our best competitor.
Conference Paper
Full-text available
Sports analytics in general, and football (soccer in USA) analytics in particular, have evolved in recent years in an amazing way, thanks to automated or semi-automated sensing technologies that provide high-fidelity data streams extracted from every game. In this paper we propose a data-driven approach and show that there is a large potential to boost the understanding of football team performance. From observational data of football games we extract a set of pass-based performance indicators and summarize them in the H indicator. We observe a strong correlation among the proposed indicator and the success of a team, and therefore perform a simulation on the four major European championships (78 teams, almost 1500 games). The outcome of each game in the championship was replaced by a synthetic outcome (win, loss or draw) based on the performance indicators computed for each team. We found that the final rankings in the simulated championships are very close to the actual rankings in the real championships, and show that teams with high ranking error show extreme values of a defense/attack efficiency measure, the Pezzali score. Our results are surprising given the simplicity of the proposed indicators, suggesting that a complex systems' view on football data has the potential of revealing hidden patterns and behavior of superior quality.
Conference Paper
Full-text available
The striking proliferation of sensing technologies that provide high-fidelity data streams extracted from every game induced an amazing evolution of football statistics. Nowadays professional statistical analysis firms like ProZone and Opta provide data to football clubs, coaches and leagues, who are starting to analyze these data to monitor their players and improve team strategies. Standard approaches in evaluating and predicting team performance are based on history-related factors such as past victories or defeats, record in qualification games and margin of victory in past games. In contrast with traditional models, in this paper we propose a model based on the observation of players' behavior on the pitch. We model a the game of a team as a network and extract simple network measures, showing the value of our approach on predicting the outcomes of a long-running tournament such as Italian major league.
The availability of massive network and mobility data from diverse domains has fostered the analysis of human behaviors and interactions. This data availability leads to challenges in the knowledge discovery community. Several different analyses have been performed on the traces of human trajectories, such as understanding the real borders of human mobility or mining social interactions derived from mobility and vice versa. However, the data quality of the digital traces of human mobility has a dramatic impact on the knowledge that it is possible to mine, and this issue has not been thoroughly tackled so far in the literature. In this paper, we mine and analyze with complex network techniques a large dataset of human trajectories, a GPS dataset from more than 150k vehicles in Italy. We build a multi-resolution grid and we map the trajectories with several complex networks, by connecting the different areas of our region of interest. Then we analyze the structural properties of these networks and the quality of the borders it is possible to infer from them. The result is a significant advancement in our understanding of the data transformation process that is needed to connect mobility with social network analysis and mining.
The intuitive background for measures of structural centrality in social networks is reviewed and existing measures are evaluated in terms of their consistency with intuitions and their interpretability.
Performance in Team Sports: Identifying the Keys to Success in Soccer
The aim of this study was to identify specific performance indicators that discriminate the top clubs from the others based on significantly different pitch action performance in the Spanish Soccer League. All 380 games corresponding to the 2008-2009 season have been analyzed. The studied variables were divided into three groups related to goals scored (goals for, goals against, total shots, shots on goal, shooting accuracy, shots for a goal), offense (assists, crosses, offsides committed, fouls received, corners, ball possession) and defense (crosses against, offsides received, fouls committed, corners against, yellow cards, red cards). Data were analyzed performing a one-way ANOVA. Significant differences across sections of the league table were found for the following pitch actions: goals for, total shots, shots on goal, shots for a goal, assists and ball possession. The main findings of this study suggest that top teams had a higher average of goals for, total shots and shots on goal than middle and bottom teams (p<0.05). Bottom teams needed a higher number of shots for scoring a goal than the other groups of teams (p<0.05). Middle teams showed a lower value in assists and ball possession than top teams (p<0.05). In conclusion, this paper presents values that can be used as normative data to design and evaluate practices and competitions for peak performance soccer teams in a collective way.
Within the large body of research in complex network analysis, an important topic is the temporal evolution of networks. Existing approaches aim at analyzing the evolution on the global and the local scale, extracting properties of either the entire network or local patterns. In this paper, we focus on detecting clusters of temporal snapshots of a network, to be interpreted as eras of evolution. To this aim, we introduce a novel hierarchical clustering methodology, based on a dissimilarity measure derived from the Jaccard coefficient between two temporal snapshots of the network, able to detect the turning points at the beginning of the eras. We devise a framework to discover and browse the eras, in either a top-down or a bottom-up fashion, supporting the exploration of the evolution at any level of temporal resolution. We show how our approach applies to real networks and null models, by detecting eras in an evolving co-authorship graph extracted from a bibliographic dataset, a collaboration graph extracted from a cinema database, and a network extracted from a database of terrorist attacks; we illustrate how the discovered temporal clustering highlights the crucial moments when the networks witnessed profound changes in their structure. Our approach is finally boosted by introducing a meaningful labeling of the obtained clusters, such as the characterizing topics of each discovered era, thus adding a semantic dimension to our analysis.
A Bayesian classifier was created to predict Cy Young Award winners in American baseball. The model was compared against two statistical models designed to perform the same task. Over the years from 1967 through 2006, the accuracy of the Bayesian classifier was similar to that of the other two models; when restricted to starting pitchers, all three were more than 80% correct. Accuracy was lower for all three models when relief pitchers were included in the data.