Storks Deliver Babies (p = 0.008)
Robert Matthews
Aston University, Birmingham, England.
This article shows that a highly statistically
significant correlation exists between stork
populations and human birth rates across Europe.
While storks may not deliver babies, unthinking
interpretation of correlation and p-values can
certainly deliver unreliable conclusions.
Introductory statistics textbooks routinely warn
of the dangers of confusing correlation with
causation, pointing out that while a high correlation
coefficient is indicative of (linear) association,
it cannot be taken as a measure of causation. Such
warnings are typically accompanied by illustrative
examples, such as the correlation between the
reading skills of children and their shoe size, or the
apparent relationship between educational level
and unemployment (see e.g. Freedman et al. 1998).
However, such examples are often either trivially
explained via an obvious confounder (e.g. age, in
the case of reading age and shoe size) or are not
obviously cases of mere association (e.g.
educational level may indeed be at least partly
responsible for time spent unemployed). In what
follows, I give an example based on genuine data
of an association which is clearly ludicrous, but
which cannot be so easily dismissed as non-causal
via an obvious confounder.
My starting point is the familiar folk tale that
babies are delivered by storks. The origins of this
connection are believed to lie partly in the
association between storks and the concept of
women as bringers of life, and also in the bird's
feeding habits, which were once regarded as a
search for embryonic life in water (Cooper 1992).
The legend lives on to this day, with neonate-
bearing storks being a regular feature of greetings
cards celebrating births.
While it is (I trust) obvious that the legend is
complete nonsense, it is legitimate to ask precisely
how one might set about refuting it scientifically. If
one were approaching the question in the same
way that many other links are investigated (e.g.
suspected links between diet and cancer risk), one
may well decide to carry out a correlational study,
to see if the number of storks in a country bears a
simple relationship to the number of human births
in that country. Although the presence of a
statistically significant degree of correlation cannot
be taken to imply causation, its absence would
certainly constitute evidence against a simple
relationship. This possibility can quickly be
investigated in the present case using standard
hypothesis testing, with the null hypothesis being
the absence of any correlation between the number
of storks and the number of live births in a
particular country. This I now proceed to do.
36 .Teaching Statistics. Volume 22, Number 2, Summer 2000
The white stork (Ciconia ciconia) is a surprisingly
common bird in many parts of Europe, and data
on the number of breeding pairs are available for
17 European countries (Harbard 1999, pers.
comm.); the latest figures, covering the period from
1980 to 1990, are given in table 1, along with
demographic data taken from Britannica
Yearbook for 1990.
Plotting the number of stork pairs against the
number of births in each of the 17 countries, one
can discern signs of a possible correlation between
the two (see figure 1).
The existence of this correlation is confirmed by
performing a linear regression of the annual
number of births in each country (the final column
in table 1) against the number of breeding pairs
of white storks (column 3). This leads to a
correlation coefficient of $r = 0.62$, whose statistical
significance can be gauged using the standard
t-test, where $t = r\sqrt{(n-2)/(1-r^2)}$ and $n$ is the
sample size. In our case, $n = 17$ so that $t = 3.06$,
which for $n - 2 = 15$ degrees of freedom leads to
a p-value of 0.008.
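The calculation above can be retraced with a short script. What follows is a minimal sketch rather than the author's own procedure: it uses only the Python standard library, takes the stork and birth columns from table 1, and obtains the two-sided p-value by numerically integrating the Student t density with Simpson's rule (a choice made here for self-containment). The function names `pearson_r`, `t_density` and `two_sided_p` are illustrative. If the table has been transcribed correctly, the printed values should agree with the quoted r, t and p to the precision given in the text.

```python
import math

# Table 1 data in the table's country order (Albania ... Turkey):
# breeding pairs of white storks, and annual births in thousands.
storks = [100, 300, 1, 5000, 9, 140, 3300, 2500, 4, 5000, 5,
          30000, 1500, 5000, 8000, 150, 25000]
births = [83, 87, 118, 117, 59, 774, 901, 106, 188, 124, 551,
          610, 120, 367, 439, 82, 1576]


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the t density integrated over [-|t|, |t|].

    The integral over [0, |t|] is approximated with Simpson's rule
    (steps must be even) and doubled by symmetry.
    """
    a = abs(t)
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 1 - 2 * (h / 3) * total


n = len(storks)
r = pearson_r(storks, births)
t = r * math.sqrt((n - 2) / (1 - r * r))  # t = r * sqrt((n - 2) / (1 - r^2))
p = two_sided_p(t, n - 2)
print(f"n = {n}, r = {r:.2f}, t = {t:.2f}, p = {p:.3f}")
```

In practice a statistics library would normally supply the t distribution; the hand-rolled integration is used here only so that the sketch runs without extra dependencies.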
Country      Area (km²)  Storks (pairs)  Humans (millions)  Birth rate (thousands/yr)
Albania          28,750             100                3.2                         83
Austria          83,860             300                7.6                         87
Belgium          30,520               1                9.9                        118
Bulgaria        111,000            5000                9.0                        117
Denmark          43,100               9                5.1                         59
France          544,000             140                 56                        774
Germany         357,000            3300                 78                        901
Greece          132,000            2500                 10                        106
Holland          41,900               4                 15                        188
Hungary          93,000            5000                 11                        124
Italy           301,280               5                 57                        551
Poland          312,680          30,000                 38                        610
Portugal         92,390            1500                 10                        120
Romania         237,500            5000                 23                        367
Spain           504,750            8000                 39                        439
Switzerland      41,290             150                6.7                         82
Turkey          779,450          25,000                 56                       1576
Table 1. Geographic, human and stork data for 17
European countries
Fig 1. How the number of human births varies with stork populations in 17 European countries.
What are we to make of this result, which points
to a highly statistically significant degree of
correlation between stork populations and birth
rates? The correlation coefficient is not particularly
high, but according to its p-value, there is only a
1 in 125 chance of obtaining at least as impressive
a value assuming the null hypothesis of no
correlation were true. Yet as with any p-value (and
contrary to what unwary users of them believe),
this does not imply that the probability that mere
fluke really is the correct explanation is just 1 in
125; still less does it imply a 124/125 = 99.2%
probability that storks really do deliver babies.
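What the p-value does measure can be made concrete by simulation. The sketch below is an illustration under assumptions that are not in the article: it draws pairs of independent, normally distributed samples of size 17 (so the null hypothesis of no correlation holds by construction) and counts how often the sample correlation is at least as large in magnitude as 0.62. The function name `null_exceedance_rate`, the number of trials and the seed are all arbitrary choices made here.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def null_exceedance_rate(n=17, r_observed=0.62, trials=20000, seed=0):
    """Fraction of null samples whose |r| is at least r_observed.

    Each trial draws two independent standard-normal samples of size n,
    so any correlation between them is pure chance.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(x, y)) >= r_observed:
            hits += 1
    return hits / trials


rate = null_exceedance_rate()
print(f"fraction of null samples with |r| >= 0.62: {rate:.4f}")
print(f"1 in 125 = {1 / 125:.4f}")
```

The simulated fraction estimates the probability of data at least this extreme given that the null hypothesis is true. It says nothing directly about the probability that the null hypothesis is true given the data, which is the quantity the unwary reading of a p-value mistakes it for.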
Such apparent nit-picking distinctions are frequently
overlooked by consumers of p-values. In
the case of the correlation between storks and
human births, however, they no longer seem so
pedantic: indeed, they provide the very welcome
'escape route' by which to avoid a patently
ludicrous inference. The most plausible explanation
of the observed correlation is, of course,
the existence of a confounding variable: some
factor common to both birth rates and the
number of breeding pairs of storks which (like
age in the reading skill/shoe-size correlation) can
lead to a statistical correlation between two
variables which are not directly linked themselves.
One candidate for a potential confounding variable
is land area: readers are invited to investigate this
possibility using the data in table 1.
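One way to take up this invitation is to compute a first-order partial correlation between storks and births with land area held fixed. The article does not prescribe a method, so this is only one possible approach, sketched with the standard formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The data are the area, stork and birth columns of table 1; the function names are illustrative.

```python
import math

# Table 1 data in the table's country order (Albania ... Turkey).
area = [28750, 83860, 30520, 111000, 43100, 544000, 357000, 132000, 41900,
        93000, 301280, 312680, 92390, 237500, 504750, 41290, 779450]  # km^2
storks = [100, 300, 1, 5000, 9, 140, 3300, 2500, 4, 5000, 5,
          30000, 1500, 5000, 8000, 150, 25000]  # breeding pairs
births = [83, 87, 118, 117, 59, 774, 901, 106, 188, 124, 551,
          610, 120, 367, 439, 82, 1576]  # thousands per year


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_storks_births = pearson_r(storks, births)
r_storks_area = pearson_r(storks, area)
r_births_area = pearson_r(births, area)
r_partial = partial_r(storks, births, area)

print(f"r(storks, births)        = {r_storks_births:.2f}")
print(f"r(storks, area)          = {r_storks_area:.2f}")
print(f"r(births, area)          = {r_births_area:.2f}")
print(f"r(storks, births | area) = {r_partial:.2f}")
```

A partial correlation well below the raw coefficient would be consistent with land area accounting for much of the association. It would not show that area is the only confounder, nor would it settle any causal question by itself.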
Standard statistical texts routinely warn of the
fallacy of mistaking correlation for causation, but
the examples they provide are usually either trivial,
with obvious confounders, or lack clear non-
causality. The empirical relationship between the
number of stork breeding pairs and human birth
rates in 17 European countries provides a non-trivial
example of a correlation which is highly statistically
significant, not immediately explicable and yet
causally nonsensical. Indeed, its sheer absurdity has
pedagogic value beyond the correlation/causation
fallacy alone, as it compels greater attention to be
paid to the precise meaning of p-values, and
promotes greater recognition of the fact that
rejection of the null hypothesis does not imply the
correctness of the substantive hypothesis.
The author is very grateful to Chris Harbard of
the Royal Society for the Protection of Birds for
supplying the stork data, and to Professor Dennis
Lindley for valuable discussions.
Cooper, J.C. (ed.) (1992). Brewer's Myth and
Legend. London: Cassell.
Freedman, D., Pisani, R. and Purves, R.
(1998). Statistics (3rd edn). New York:
W.W. Norton.