arXiv:1705.11186v1 [physics.soc-ph] 31 May 2017
FORECASTING IN LIGHT OF BIG DATA
HYKEL HOSNI AND ANGELO VULPIANI
Abstract. Predicting the future state of a system has always been a natural motivation for science and practical applications. Such a topic, beyond its obvious technical and societal relevance, is also interesting from a conceptual point of view. This owes to the fact that forecasting lends itself to two equally radical, yet opposite methodologies. A reductionist one, based on the first principles, and the naïve-inductivist one, based only on data. This latter view has recently gained some attention in response to the availability of unprecedented amounts of data and increasingly sophisticated algorithmic analytic techniques. The purpose of this note is to assess critically the role of big data in reshaping the key aspects of forecasting and in particular the claim that bigger data leads to better predictions. Drawing on the representative example of weather forecasts we argue that this is not generally the case. We conclude by suggesting that a clever and context-dependent compromise between modelling and quantitative analysis stands out as the best forecasting strategy, as anticipated nearly a century ago by Richardson and von Neumann.
Nothing is more practical than a good theory (L. Boltzmann)
1. Introduction and motivation
Uncertainty spans our lives and forecasting is how we cope with it, individually, socially, institutionally, and scientifically. As a consequence, the concept of forecast is a multifaceted one. Science, as a whole, moves forward by making and testing forecasts. Political institutions make substantial use of economic forecasting to devise their policies. Most of us rely on weather forecasts to plan our daily activities. Thus, in forecasting, the boundaries between the natural and the social sciences are often crossed, as well as the boundaries between the scientific, technological and ethical domains.

This rather complex picture has been enriched significantly, over the past few years, by the rapidly increasing availability of methods for collecting and processing vast amounts of data. This has revived a substantial interest in purely inductive methods, which are expected to serve the most disparate
needs, from commercial services to data-driven science. Data brokers sell to third parties the digital footprints recorded by our internet activities or credit card transactions. Those can be put to a number of different uses, not all of them ethically neutral. For instance, aggressive forms of personalised marketing algorithms identify women who are likely to be pregnant based on their internet activity, and, similarly, health-related web searches have been shown to influence individual credit scores [26]. At the same time, data-intensive projects lie at the heart of extremely ambitious, cutting-edge scientific enterprises, including the US “Brain Research through Innovative Neurotechnologies” (http://www.braininitiative.nih.gov/) and the African-Australian Square Kilometre Array, a radio telescope array consisting of thousands of receivers (http://skatelescope.org/).¹

¹ See, e.g., [2] for an appraisal of how experiments of this kind may lead to a paradigm shift in the philosophy of science.
Those examples illustrate clearly that big data spans radically diverse domains. This, together with its sodality with machine learning, has recently been fuelling an all-encompassing enthusiasm, which is loosely rooted in a twofold presupposition: first, that big data will lead to much better forecasts; second, that it will do so across the board, from scientific discovery to medical, financial, commercial and political applications. It is this enthusiasm which has recently led to making a case for the predictive-analytics analogue of universal Turing machines, unblushingly referred to as The Master Algorithm [12].

Based on this twofold presupposition, big data and predictive analytics are expected to have a major impact on society, on technology, and all the way up to the scientific method itself [21]. The extent to which those promises are likely to be fulfilled is currently a matter for debate across a number of disciplines [1, 15, 22, 17, 3, 23], while some early success stories rather quickly turned into macroscopic failures [16]. This note adds to the methodological debate by challenging both aspects of the presupposition behind big data enthusiasm. First, more data may lead to worse predictions. Second, a suitably specified context is crucial for forecasts to be scientifically meaningful. Both points will be made with reference to a highly representative forecasting problem: weather predictions.
The remainder of the paper is organised as follows. Section 2 begins by recalling that the very meaning of scientific prediction depends significantly on an underlying theoretical context. Then we move on, in Section 3, to challenging the naïve inductivist view which goes hand in hand with big data enthusiasm. In a rather elementary setting we illustrate the practical
impossibility of inferring future behaviour from the past when the dimension of the problem is moderately large. Section 4 develops this further by emphasising that forecasts depend significantly on the modeller’s ability to identify the proper level of description of the target system. To this end we draw on the history of weather forecasting, where the early attempts at arriving at a quantitative solution turned out to be unsuccessful precisely because they took into account too much data. The representativeness of the example suggests that this constitutes a serious challenge to the view according to which big data could make do with the sole analysis of correlations.
The main lesson can be put as follows: as anticipated nearly a century ago by Richardson and von Neumann, a clever and context-dependent trade-off between modelling and quantitative analysis stands out as the best strategy for meaningful prediction. This flies in the face of the by now infamous claim put forward in 2008 in Wired by its then editor C. Anderson that “the data deluge makes the scientific method obsolete”. In our experience academics have a tendency to roll their eyes when confronted with this and similar claims, and hasten to add that non-academic publications should not be given so much credit. We believe otherwise. Indeed, we think that the importance of the cultural consequences of such claims is reason enough for academics to take scientific and methodological issue with them, independently of their publication venue. Whilst Anderson’s argument fails to stand methodological scrutiny, as the present paper recalls, its key message (big data enthusiasm) has clearly percolated through society at large. This may lead to very serious social and ethical shortcomings, for the combination of statistical methods and machine learning techniques for predictive analytics is currently finding cavalier application in a number of very sensitive intelligence and policing activities, as we now briefly recall.

This clearly illustrates that the scope of the epistemological problem tackled by this note extends far beyond the scientific method and academic silos.
1.1. From SKYNET to PredPol. Early in 2016 a debate took place on alleged drone attacks in Pakistan. The controversial article by C. Grothoff and J. M. Porup² opened as follows:

In 2014, the former director of both the CIA and NSA proclaimed that “we kill people based on metadata.” Now, a new examination of previously published Snowden documents suggests that many of those people may have been innocent.

² http://arstechnica.co.uk/security/2016/02/the-nsas-skynet-program-may-be-killing-thousands-of-innocent-people/
Recall that SKYNET is the US National Security Agency’s programme aimed at monitoring mobile phone networks in Pakistan. Leaked documents [31] show that the primary goal of this programme is the identification of potential affiliates of the Al-Qaeda network. Further information suggests that SKYNET builds on classification techniques, fed primarily on GSM data drawn from the entire Pakistani population. This obviously puts the classification method at high risk of overfitting, given, of course, that the vast majority of the population is not linked to terrorist activities. Not surprisingly, then, the Snowden papers revealed a rather telling result of SKYNET’s sophisticated machine learning, which assigned to Ahmad Zaidan, a bureau chief for Al-Jazeera in Islamabad, the highest probability of being an Al-Qaeda courier.
Two points are worth observing. First note, as some commentators have reported [30], that the classification of Zaidan as strongly linked to Al-Qaeda cannot be dismissed as utterly wrong. It all depends, of course, on what we mean by “being linked”. As a journalist in the field he was certainly “linked” to the organisation, and very much so if one counts the two interviews he did with Osama Bin Laden. But of course, “being linked” with a terror organisation may mean something entirely different, namely being actively involved in the pursuit of its goals. This fundamental bit of contextual information is probably impossible to infer for a classification technique, even the most accurate one. But the SKYNET algorithms are far from that, which brings us to the second noteworthy point. The leaked documents put the false positive rate of the classification method used by SKYNET at between 0.008% and 0.18%. Since the surveillance programme gathers data from a population of 55 million people, this amounts to up to 99 thousand Pakistanis who may have been wrongly labelled as “terrorists”.
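The arithmetic behind these figures is elementary (our own back-of-the-envelope rendering of the numbers quoted above):

\[
0.00008 \times 55\times 10^{6} \approx 4\,400, \qquad 0.0018 \times 55\times 10^{6} = 99\,000,
\]

so even the lower end of the reported range would correspond to thousands of people misclassified.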
Whether or not this actually led to deadly attacks through the “Find-Fix-Finish” strategy based on Predator drones, this example illustrates the shortcomings of the claimed universality of the combination of big data and machine learning. For if the SKYNET programme were about detecting unsolicited emails, rather than potential terror suspects, a false positive rate of 0.008% would be considered exceptionally good. It is far from good if it may lead to highly defamatory accusations, if not outright death, for thousands of innocent people. The observation that terrorist identification and spam detection are completely different problems, with incomparable social, legal, and ethical implications, though apparently trivial, may easily be overlooked as a consequence of the big data enthusiasm.
On a less spectacular, but no less worrying scale, this can be seen to feed the increasing excitement for predictive policing. Police departments in the United States and in Europe have recently been purchasing commercially available software to predict crimes. California-based PredPol³ is widely used across the country and by some police departments in the United Kingdom. The New York Times reports⁴ that Coplogic⁵ has contracts with 5,000 police departments in the US. Keycrime⁶ is a Milan-based firm which has recently been contracted by the Italian police. This list could easily be extended.

³ http://www.predpol.com/
⁴ The Risk to Civil Liberties of Fighting Crime With Big Data, 6 November 2016.
⁵ http://www.coplogic.com/
⁶ http://www.keycrime.com/

Predictive policing’s main selling point is of course expense reduction. If we can predict where the next crime is going to be committed, we can optimise patrolling. Being more precise requires fewer resources and less taxpayers’ money, and it delivers surgical results. But context is once again neglected. When introducing the methods and techniques underlying predictive policing, the authors of the 190-page RAND report [27] on the subject note that

These analytical tools, and the IT that supports them, are largely developed by and for the commercial world.
This, we believe, suffices to illustrate the relevance and urgency of a matter
which we now move on to discuss in greater generality. To this end we shall
begin by recalling a seemingly obvious, and yet surprisingly often overlooked,
feature of the forecasting problem, namely that not all forecasts are equal.
2. On forecasting
Laplace grasped rather clearly one important feature of how probability and uncertainty relate to information when he pointed out that probability depends partly on our knowledge and partly on our ignorance. What we do know clearly affects our understanding of what we don’t know and, consequently, our ability to estimate its probability.

It cannot be surprising, then, that the meaning of scientific prediction or forecasts changes with the growth of science. In [25], for instance, it is suggested that one can get a clearer understanding of what physics is by being specific about the accepted meaning of physical predictions.
The origins of the very concept of scientific forecast can in fact be traced back to the beginning of modern physics. The paradigmatic example is classical mechanics: the deterministic world in which (for a limited class of phenomena) one can submit definite Yes/No predictions to experimental testing. A major conceptual revolution took place in the mid 1800s with the introduction of probabilistic prediction, a notion which has since taken on three distinct interpretations. The first relates to the introduction of statistical mechanics, and is indeed responsible for introducing a novel, stochastic view of the laws of nature. The second started at the beginning of the 1900s with the discovery of quantum mechanics. The third, which is coming of age, relates to the investigation of complex systems. It is also observed in [25] that this development of the meaning of scientific forecasting amounted to its progressive weakening. Whilst the concept of stochastic prediction in statistical mechanics is clearly weaker than the Yes/No prediction of the next solar eclipse, it can be regarded as being stronger than predictions about complex systems, which may involve probability intervals. The upside of increasingly weaker notions of forecast is the extension of the applicability of physics to a wider set of problems. The downside is the lack of precision.
It is interesting to note that the first major shift in perspective – from the binary forecasts of classical mechanics to the probabilistic ones of statistical mechanics – can be motivated from an informational point of view. To illustrate, we borrow from a classic presentation of Ergodic Theory [13], in which a gas with $k$ molecules contained in a three-dimensional box is considered. Since particles can move in any direction of the (Euclidean) space, we are looking at a system with $n = 3k$ degrees of freedom. Assuming complete information about the molecules’ masses and the forces they exert, the instantaneous state of the system can be described fully (at least in principle) by fixing $n$ spatial coordinates and the $n$ corresponding velocities, i.e. by picking a point in $2n$-dimensional Euclidean space. We are now interested in looking at how the system evolves in time according to some underlying physical law. In practice, though, the information we do possess is seldom enough to determine the answer.

[This led Gibbs to] abandon the deterministic study of one state (i.e., one point in phase space) in favor of a statistical study of an ensemble of states (i.e., a subset of phase space). Instead of asking “what will the state of the system be at time $t$?”, we should ask “what is the probability that at time $t$ the state of the system will belong to a specified subset of phase space?”. [13]
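Purely as an illustration of this shift (our schematic rendering, not a formula taken from [13]): writing the instantaneous state as

\[
X = (q_1,\dots ,q_n,\; v_1,\dots ,v_n)\in\mathbb{R}^{2n}, \qquad n = 3k,
\]

the deterministic question asks for $X(t)$ itself, whereas the statistical question asks for $\Pr\bigl(X(t)\in A\bigr)$ for a specified region $A$ of phase space.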
This (to our present lights) very natural observation had enormous consequences. So it is likewise natural to ask, today, whether the present ability to acquire, store, and analyse unprecedented amounts of data may take the concept of forecast to the next level.

In what follows we address this question in an elementary setting. In particular, we ask whether meaningful predictions about the future are possible using only our knowledge of the past states of a system, without models for the evolution equations. Our answer is negative, to the extent that rather severe difficulties are immediately found even in a very abstract and simplified situation. As we shall point out, the most difficult challenge for this view is understanding the “proper level” of abstraction of the system. This is apparent in the paramount case of weather forecasting discussed in Section 4. We will see there that the key to understanding the “proper level” of abstraction lies in identifying the “relevant variables” and the effective equations which rule their time evolution. It is important to stress that the procedure of building such a description does not follow a fixed protocol, applicable in all contexts once certain conditions are met. It should rather be considered a sort of art, based on the intuition and the experience of the researcher.
3. An extreme inductivist approach to forecasting using Big Data
According to a vaguely defined yet rather commonly held view [15], big data may allow us to dispense with theory, modelling, or even hypothesising. All of this would be encompassed, across domains, by smart enough machine learning algorithms operating on large enough data sets. On this extreme inductivist conception, forecasts depend solely on data. Does this provide us with a new meaning of prediction, and indeed one which will render scientific modelling as we currently understand it obsolete?

Two hypotheses, which are seldom made explicit, are needed to articulate an affirmative answer:

(1) Similar premisses lead to similar conclusions (Analogy);
(2) Systems which exhibit a certain behaviour will continue doing so (Determinism).

Note that both assumptions are clearly at work in the very idea of predictive policing recalled above. For predicting who is going to commit the next crime, and where this is going to happen, requires one to think of the disposition to commit crimes as a persistent feature of certain people, who in
turn, tend to conform to certain specific features. Those analogies, and the deterministic character of the ‘disposition to commit crimes’, make it very easy to mistake correlation for causation. Racial profiling is the most obvious, but certainly not the sole, ethical concern currently being raised in connection with the first performance assessments of predictive policing [32].

Let us go back to our key point by noting that Analogy and Determinism have long been debated in connection with forecasting and scientific prediction. That a system which behaves in a certain way will do so again seems a rather natural claim but, as pointed out by Maxwell,⁷ it is not such an obvious assumption after all.

It is a metaphysical doctrine that from the same antecedents follow the same consequents. [...] But it is not of much use in a world like this, in which the same antecedents never again concur, and nothing ever happens twice. [...] The physical axiom which has a somewhat similar aspect is “That from like antecedents follow like consequents”.

⁷ Quoted in Lewis Campbell and William Garnett, The Life of James Clerk Maxwell, Macmillan, London (1882); reprinted by Johnson Reprint, New York (1969), p. 440.
In his Essai philosophique sur les probabilités, Laplace argued that analogy and induction, along with a “happy tact”, provide the principal means for “approaching certainty” in situations in which the probabilities involved are “impossible to submit to calculus”. Laplace then hastened to warn the reader against the subtleties of reasoning by induction and the difficulties of pinning down the right “similarity” between causes and effects which is required for the sound application of analogical reasoning.

More recently, de Finetti sought to redo the foundations of probability by challenging the very idea of repeated events, which constitutes the starting point of frequentist approaches à la von Mises, a view which is not central to Kolmogorov’s axiomatisation, but for which the Russian voiced some sympathy. In a vein rather similar to Maxwell’s, de Finetti argued extensively [10, 11] that thinking of events as “repeatable” is a modelling assumption. If the modeller thinks that two events are in fact instances of the same phenomenon, she/he should state that as a subjective and explicit assumption.
knowledge of the past, without the aid of theory. Let us then turn our atten-
tion to this view, and frame the question in the simplest possible terms. We
are interested in forecasts such that future states of a systems are predicted
solely on the basis of known past states. If this turns out to be problematic
in a highly abstract situation, then it can hardly be expected to work in
contexts marred by high model-uncertainty, like the ones of interest for big
data applications.
Basically [34], one looks for a past state of the system “near” to the present
one: if it can be found at day k, then it makes sense to assume that tomorrow
the system will be “near” to day k+ 1. In more formal terms, given the
series {x1, ..., xM}where xjis the vector describing the state of the system
at time jt, we look in the past for an analogous state, that is a vector xk
with k < M “near enough” (i.e. such that |xkxM|< ǫ, being ǫthe desired
degree of accuracy). Once we find such a vector, we “predict” the future
at times M+n > M by simply assuming for xM+nthe state xk+n. It all
seems quite easy, but it is not at all obvious that an analog can be found.
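For concreteness, here is a minimal sketch of the method of analogs for a generic multivariate record (our illustration, not the authors' code); the array name states, the tolerance eps and the lead time n are placeholder names.

```python
import numpy as np

def analog_forecast(states, n, eps):
    """Method of analogs: predict the state n >= 1 steps ahead of states[-1].

    states : array of shape (M, d) holding the past states x_1, ..., x_M
    n      : lead time, in units of the sampling interval
    eps    : tolerance defining what counts as an analog
    """
    x_present = states[-1]
    candidates = states[:-n]              # past states x_k for which x_{k+n} is still in the record
    dists = np.linalg.norm(candidates - x_present, axis=1)
    k = int(np.argmin(dists))             # best analog available in the record
    if dists[k] > eps:
        return None                       # no analog within the desired accuracy
    return states[k + n]                  # forecast: x_{M+n} is assumed to equal x_{k+n}
```

As the rest of this section explains, the delicate point is not the code but whether a record of realistic length contains any analog at all.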
The problem of finding an analog is strictly linked to the celebrated Poincaré recurrence theorem:⁸ after a suitable time, a deterministic system with a bounded phase space returns to a state near to its initial condition [28, 14]. Thus an analog surely exists, but how long do we have to go back to find it? The answer was given by the Polish mathematician Mark Kac, who proved a lemma [14] to the effect that the average return time to a region $A$ is proportional to the inverse of the probability $P(A)$ that the system is in $A$.
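In symbols, and measuring time in units of the sampling interval, Kac's lemma gives for the mean recurrence time of a region $A$

\[
\langle \tau_A \rangle \propto \frac{1}{P(A)}.
\]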
To understand how hard it is to observe a recurrence, and hence to find an analog, consider a system of dimension $D$.⁹ The probability $P(A)$ of being in a region $A$ that extends in every direction by a fraction $\epsilon$ is proportional to $\epsilon^{D}$, therefore the mean recurrence time is $O(\epsilon^{-D})$. If $D$ is large (say, larger than 10), even for not very high levels of precision (for instance, 5%, that is $\epsilon = 0.05$), the return time is so large that in practice a recurrence is never observed.
⁸ In its original version the Poincaré recurrence theorem states that: given a Hamiltonian system with a bounded phase space $\Gamma$, and a set $A \subset \Gamma$, all the trajectories starting from $x \in A$ will return to $A$ after some time, repeatedly and infinitely many times, except for those in a set of zero probability. Actually, though this is seldom stressed in elementary courses, the theorem can easily be extended to dissipative ergodic systems provided one only considers initial conditions on the attractor, and “zero probability” is interpreted with respect to the invariant probability on the attractor [6].
⁹ To be precise, if the system is dissipative, $D$ is the fractal dimension $D_A$ of the attractor [4].
[Figure 1. The relative precision of the best analog, $\epsilon_{\min}(M)/\epsilon_{\max}$, as a function of the size $M$ of the sequence, together with the reference scaling $M^{-1/D_A}$. The data have been obtained numerically [4] from a simplified climatic model introduced by Lorenz, with two different choices of the parameters, see [18]; the vector $x$ is in $\mathbb{R}^N$ with $N = 20$ and $N = 21$.]
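To get a feeling for the orders of magnitude involved, the following back-of-the-envelope script (ours, purely illustrative) combines Kac's lemma with $P(A) \propto \epsilon^{D}$ at the 5% accuracy used above:

```python
# Record length needed to observe one analog of accuracy eps in dimension D:
# by Kac's lemma the mean recurrence time scales as eps**(-D) sampling steps.
eps = 0.05                      # 5% accuracy, as in the example in the text
for D in (3, 6, 10, 15):
    steps = eps ** (-D)
    print(f"D = {D:2d}  ->  record length of order {steps:.1e} steps")

# D =  3  ->  record length of order 8.0e+03 steps
# D =  6  ->  record length of order 6.4e+07 steps
# D = 10  ->  record length of order 1.0e+13 steps
# D = 15  ->  record length of order 3.3e+19 steps
```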
That is to say that the required analog, whose existence is guaranteed in theory, sometimes cannot be expected to be found in practice, even if complete and precise information about the system is available to us.

Fig. 1 shows how even for moderately large values of the fractal dimension of the attractor, $D_A$, a good analog can be obtained only in time series of enormous length. If $D_A$ is small (in the example $D_A \simeq 3.1$), for an analog with a precision of 1% a sequence of length $O(10^2)$ is enough; on the contrary, for $D_A \simeq 6.6$ we need a very long sequence, at least $O(10^9)$.
In addition, usually we do not know the vector $x$ describing the state of the system. This rather serious difficulty is well known in statistical physics; it has been stressed, e.g., by Onsager and Machlup [24] in their seminal work on fluctuations and irreversible processes, with the caveat “how do you know you have taken enough variables, for it to be Markovian?”, and by Ma [20]: “the hidden worry of thermodynamics is: we do not know how many coordinates or forces are necessary to completely specify an equilibrium state”.

Takens [33] made an important contribution to this topic: he showed that from the study of a time series $\{u_1, \dots, u_M\}$, where $u_j$ is an observable sampled at the discrete times $j\Delta t$, it is possible (if we know that the system is deterministic and described by a finite-dimensional vector, and $M$ is large enough) to determine the proper variable $x$. Unfortunately, at the practical level, the method has rather severe limitations:
a) It works only if we know a priori that the system is deterministic;
b) The protocol fails if the dimension of the attractor is large enough
(say more than 5 or 6).
Once again Kac’s lemma sheds light on the key difficulty encountered here: the minimum length $M$ of the time series allowing for the use of Takens’ approach increases as $C^{D}$, with $C = O(100)$ [34, 4]. Therefore this method cannot be used, apart from special cases (with a small dimension), to build up a model from the data. All extreme inductivist approaches will have to come to terms with this fundamental fact. One of the few successes of the method of analogs is tidal prediction from past history. This is in spite of the fact that tides are chaotic; the reason is the low number of effective degrees of freedom involved [4].
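To make the idea behind Takens' reconstruction concrete, here is a minimal delay-embedding sketch (ours, purely illustrative); the embedding dimension m and the delay tau must be chosen by the modeller and are not prescribed by the theorem.

```python
import numpy as np

def delay_embedding(u, m, tau):
    """Reconstruct state vectors from a scalar time series u by delay embedding.

    Each reconstructed state is X_j = (u_j, u_{j+tau}, ..., u_{j+(m-1)*tau});
    for deterministic, finite-dimensional dynamics and long enough series,
    Takens' theorem guarantees this is a faithful substitute for the state x.
    """
    u = np.asarray(u)
    n_vectors = len(u) - (m - 1) * tau
    if n_vectors <= 0:
        raise ValueError("time series too short for this embedding")
    return np.column_stack([u[i * tau : i * tau + n_vectors] for i in range(m)])
```

The reconstructed vectors could then be fed to a routine such as analog_forecast above; as stressed in the text, in practice this only pays off when the attractor dimension is small (say, no more than 5 or 6).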
4. Weather forecasting: the mother of all approaches to prediction
Weather forecasts provide a very good illustration of some central aspects of predictive models, not least because of the extreme accuracy which this field has managed to achieve over the past decades. And yet this accuracy could be attained only when it became clear that too much data would be detrimental to the accuracy of the model. Indeed, as we now briefly review, in the early days weather forecasts featured a naive form of inductivism not dissimilar to the one fuelling the big data enthusiasm.

Let us stress that the main limit to predictions based on analogs is not the sensitivity to initial conditions typical of chaos: as realized by Lorenz, the main issue is actually finding good analogs [4].
The first modern steps in weather forecasting are due to Richardson [29, 19] who, in his visionary work, introduced many of the ideas on which modern meteorology is based. His approach was, to a certain extent, in line with genuine reductionism, and may be summarised as follows: the atmosphere evolves according to the hydrodynamic (and thermodynamic) equations for the velocity, the density, and so on; therefore, future weather can be predicted, in principle at least, by solving the proper partial differential equations, with initial conditions given by the present state of the atmosphere. Richardson’s key idea was correct, but in order to put it into practice it was necessary to introduce one further ingredient that he could not possibly have known [5]. A few decades later, von Neumann and
Charney noticed that the equations originally proposed by Richardson, even though correct, are not suitable for weather forecasting [19, 9]. The apparently paradoxical reason is that they are too accurate: they also describe high-frequency wave motions that are irrelevant for meteorology. So it is necessary to construct effective equations that get rid of the fast variables. The effective equations have great practical advantages, e.g. it is possible to adopt large integration time steps, making the numerical computations satisfactorily efficient. Even more importantly, they are able to capture the essence of the phenomena of interest, which could otherwise be hidden in too detailed a description, as in the case of the complete set of original equations. It is important to stress that the effective equations are not mere approximations of the original equations, and they are obtained with a subtle mixture of hypotheses, theory and observations [9, 5].
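To give a schematic (and admittedly oversimplified) picture of what “getting rid of the fast variables” means, one may think of a two-timescale system of the form

\[
\frac{d\mathbf{x}}{dt} = \mathbf{f}(\mathbf{x},\mathbf{y}), \qquad
\frac{d\mathbf{y}}{dt} = \frac{1}{\tau}\,\mathbf{g}(\mathbf{x},\mathbf{y}), \qquad \tau \ll 1,
\]

where $\mathbf{x}$ collects the slow, meteorologically relevant variables and $\mathbf{y}$ the fast ones (e.g. high-frequency waves). The effective equations keep only the slow dynamics, $d\mathbf{x}/dt \simeq \mathbf{f}_{\mathrm{eff}}(\mathbf{x})$, where $\mathbf{f}_{\mathrm{eff}}$ incorporates the average effect of the fast variables; this caricature is ours, but the filtered equations used in the first numerical forecasts can be read in this spirit [9, 5].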
5. Concluding remarks
The above argument shows that in weather forecasting the accuracy of prediction need not be monotonic in the sheer amount of data. Indeed, beyond a certain point the opposite is true. This, in our opinion, is a serious methodological objection to the piecemeal big data enthusiasm. Given its representativeness among all forecasting methods, the conclusions drawn with respect to predicting the weather are far reaching, and help unify a number of observations that have recently been put forward along the same lines.
In many sciences and in engineering, an ever increasing gap between theory and experiment can be observed. This gap tends to widen particularly in the presence of complex features in natural systems science [25]. In socio-economic systems the gap between data and our scientific ability to actually understand them is typically enormous. Surely the availability of huge amounts of data, sophisticated methods for their retrieval and unprecedented computational power for their analysis will help move science and technology forward. But in spite of a persistent emphasis on a fourth paradigm (beyond the traditional ones, i.e. experiment, theory and computation) based only on data, there is as yet no evidence that data alone can bring about scientifically meaningful advances. On the contrary, as nicely illustrated by Crutchfield [8], up to now it seems that the only way to understand some non-trivial scientific or technological problem is to follow the traditional approach based on a clever combination of data, theory (and/or computations), intuition and a wise use of previous knowledge. Similar conclusions have been reached in the computational biosciences. The authors
of [7] point out very clearly not only the methodological shortcomings (and ineffectiveness) of relying on data alone, but also unfold the implications of methodologically unwarranted big data enthusiasm for the allocation of research funds to healthcare-related projects: “A substantial portion of funding used to gather and process data should be diverted towards efforts to discern the laws of biology”.
Big data undoubtedly constitutes a great opportunity for scientific and technological advance, with a potential for considerable socio-economic impact. To make the most of it, however, the ensuing developments at the interface of statistics, machine learning and artificial intelligence must be coupled with adequate methodological foundations, not least because of the serious ethical, legal and, more generally, societal consequences of the possible misuses of this technology. This note has contributed to elucidating the terms of this problem by focussing on the potential for big data to reshape our current understanding of forecasting. To this end we pointed out, in a very elementary setting, some serious problems that the naïve inductivist approach to forecasting must face: the idea according to which reliable predictions can be obtained solely on the grounds of our knowledge of the past runs into insurmountable problems, even in the most idealised and controlled modelling setting.
Chaos is often considered the main limiting factor to predictability in deterministic systems. However, this difficulty is the relevant one only as long as the evolution laws of the system under consideration are known. If, on the contrary, the information on the system evolution is based only on observational data, the bottleneck lies in Poincaré recurrences which, in turn, depend on the number of effective degrees of freedom involved. Indeed, even in the most optimistic conditions, if the state vector of the system were known with arbitrary precision, the amount of data necessary to make meaningful predictions would grow exponentially with the effective number of degrees of freedom, independently of the presence of chaos. However, when, as for tidal predictions, the number of degrees of freedom associated with the scales of interest is relatively small, the future can be successfully predicted from past history. In addition, in the absence of a theory, a purely inductive modelling methodology can only be based on time series and the method of analogs, with the difficulties already discussed [34].
We therefore conclude that the big data revolution is by all means a welcome one for the new opportunities it opens. The role of modelling, however, cannot be discounted: larger datasets cannot compensate for the lack of an appropriate level of description [9, 5], which may make useful forecasting practically impossible.
References
[1] C. S. Calude and G. Longo. The Deluge of Spurious Correlations in Big Data. Foundations of Science, 21:1–18, 2016.
[2] D. Casacuberta and J. Vallverdú. E-science and the data deluge. Philosophical Psychology, 27(1):126–140, 2014.
[3] S. Canali. Big Data, epistemology and causality: Knowledge in and knowledge out in EXPOsOMICS. Big Data & Society, 3(2):1–11, 2016.
[4] F. Cecconi, M. Cencini, M. Falcioni, and A. Vulpiani. Predicting the future from the past: an old problem from a modern perspective. American Journal of Physics, 80(11):1001–1008, 2012.
[5] S. Chibbaro, L. Rondoni, and A. Vulpiani. Reductionism, Emergence and Levels of Reality. Springer-Verlag, Berlin, 2014.
[6] P. Collet and J.-P. Eckmann. Concepts and Results in Chaotic Dynamics: A Short Course. Springer-Verlag, Berlin, 2006.
[7] P. V. Coveney, E. R. Dougherty, and R. R. Highfield. Big data need big theory too. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2080):1–11, 2016.
[8] J. P. Crutchfield. The dreams of theory. Wiley Interdisciplinary Reviews: Computational Statistics, 6:75–79, 2014.
[9] A. Dahan Dalmedico. History and epistemology of models: meteorology as a case study. Archive for History of Exact Sciences, 55:395–422, 2001.
[10] B. de Finetti. Theory of Probability, Vol. 1. John Wiley and Sons, New York, 1974.
[11] B. de Finetti. Philosophical Lectures on Probability. Edited by A. Mura, translated by H. Hosni. Springer Verlag, Berlin, 2008.
[12] P. Domingos. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books, New York, 2015.
[13] P. R. Halmos. Lectures on Ergodic Theory. Chelsea Publishing, London, 1956.
[14] M. Kac. On the notion of recurrence in discrete stochastic processes. Bulletin of the American Mathematical Society, 53:1002–1010, 1947.
[15] R. Kitchin. Big Data, new epistemologies and paradigm shifts. Big Data & Society, 1:1–12, 2014.
[16] D. Lazer, R. Kennedy, G. King, and A. Vespignani. The Parable of Google Flu: Traps in Big Data Analysis. Science, 343(6167):1203–1205, 2014.
[17] S. Leonelli. Data-Centric Biology: A Philosophical Study. University of Chicago Press, Chicago, 2016.
[18] E. N. Lorenz. Predictability – a problem partly solved. In Proc. Seminar on Predictability, ECMWF, Reading, UK, pages 1–18, 1996.
[19] P. Lynch. The Emergence of Numerical Weather Prediction: Richardson's Dream. Cambridge University Press, Cambridge, 2006.
[20] S. K. Ma. Statistical Mechanics. World Scientific, Singapore, 1985.
[21] V. Mayer-Schönberger and K. Cukier. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin, New York, 2013.
[22] H. Nowotny. The Cunning of Uncertainty. Polity, London, 2016.
[23] M. Nural, M. E. Cotterell, and J. Miller. Using Semantics in Predictive Big Data Analytics. In Proceedings of the 2015 IEEE International Congress on Big Data, pages 254–261, 2015.
[24] L. Onsager and S. Machlup. Fluctuations and irreversible processes. Physical Review, 91:1505–1512, 1953.
[25] G. Parisi. Complex Systems: a Physicist's Viewpoint. Physica A, 263:557–564, 1999.
[26] F. Pasquale. The Black Box Society. Harvard University Press, Cambridge, MA, 2015.
[27] W. L. Perry, B. McInnes, C. C. Price, S. C. Smith, and J. S. Hollywood. Predictive Policing: The Role of Crime Forecasting in Law Enforcement Operations. RAND Corporation, Santa Monica, 2013.
[28] H. Poincaré. Sur le problème des trois corps et les équations de la dynamique. Acta Mathematica, 13:1–270, 1890.
[29] L. F. Richardson. Weather Prediction by Numerical Process. Cambridge University Press, Cambridge, 1922.
[30] M. Robbins. Has a rampaging AI algorithm really killed thousands in Pakistan? The Guardian, 18 February 2016. http://www.theguardian.com/science/the-lay-scientist/2016/feb/18/has-a-rampaging-ai-algorithm-really-ki
[31] SKYNET: Applying Advanced Cloud-based Behavior Analytics. The Intercept, 8 May 2015. https://theintercept.com/document/2015/05/08/skynet-applying-advanced-cloud-based-behavior-analytics
[32] J. Saunders, P. Hunt, and J. S. Hollywood. Predictions put into practice: a quasi-experimental evaluation of Chicago's predictive policing pilot. Journal of Experimental Criminology, 12:1–25, 2016.
[33] F. Takens. Detecting strange attractors in turbulence. In D. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, Lecture Notes in Mathematics 898, pages 366–381, 1981.
[34] A. S. Weigend and N. A. Gershenfeld, editors. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, Reading, MA, 1994.
(H.H.) Dipartimento di Filosofia, Università degli Studi di Milano, and (A.V.) Dipartimento di Fisica, Università degli Studi di Roma Sapienza and Centro Linceo Interdisciplinare “Beniamino Segre”, Accademia dei Lincei, Roma (Italy).
E-mail address: hykel.hosni@unimi.it; Angelo.Vulpiani@roma1.infn.it