Chapter 2

Misuses of Statistical Analysis in Climate Research

by Hans von Storch

Acknowledgments: I thank Bob Livezey for his most helpful critical comments, and Ashwini Kulkarni for responding so positively to my requests to discuss the problem of correlation and trend-tests.
2.1 Prologue
The history of misuses of statistics is as long as the history of statistics itself.
The following is a personal assessment about such misuses in our field, climate
research. Some people might find my subjective essay of the matter unfair
and not balanced. This might be so, but an effective drug sometimes tastes
bitter.
The application of statistical analysis in climate research is methodologi-
cally more complicated than in many other sciences, among others because
of the following reasons:
• In climate research it is only very rarely possible to perform real independent experiments (see Navarra's discussion in Chapter 1). There is more or less only one observational record which is analysed again and again, so that the processes of building hypotheses and testing hypotheses are hardly separable. Only with dynamical models can independent data be created - with the problem that these data describe the real climate system only to some unknown extent.
• Almost all data in climate research are interrelated both in space and
time - this spatial and temporal correlation is most useful since it al-
lows the reconstruction of the space-time state of the atmosphere and
the ocean from a limited number of observations. However, for statisti-
cal inference, i.e., the process of inferring from a limited sample robust
statements about an hypothetical underlying “true” structure, this cor-
relation causes difficulties since most standard statistical techniques use
the basic premise that the data are derived in independent experiments.
Because of these two problems the fundamental question of how much
information about the examined process is really available can often hardly
be answered. Confusion about the amount of information is an excellent
hotbed for methodological insufficiencies and even outright errors. Many
such insufficiencies and errors arise from
• The obsession with statistical recipes, in particular hypothesis testing. Some people, and sometimes even peer reviewers, react like Pavlov's dogs when they see a hypothesis derived from data and demand a statistical test of the hypothesis. (See Section 2.2.)
• The use of statistical techniques as a cook-book-like recipe without a real understanding of the concepts and of the limitations arising from unavoidable basic assumptions. Often these basic assumptions are disregarded, with the effect that the conclusion of the statistical analysis is void. A standard example is the disregard of serial correlation. (See Sections 2.3 and 9.4.)
• The misunderstanding of given names. Sometimes physically meaningful
names are attributed to mathematically defined objects. These objects,
for instance the Decorrelation Time, make perfect sense when used as
prescribed. However, often the statistical definition is forgotten and the
physical meaning of the name is taken as a definition of the object - which
is then interpreted in a different and sometimes inadequate manner. (See
Section 2.4.)
• The use of sophisticated techniques. It happens again and again that some people expect miracle-like results from advanced techniques. The results of such advanced techniques, supposedly not understandable for a "layman", are then believed without further doubt. (See Section 2.5.)
2.2 Mandatory Testing and the Mexican Hat
In the desert at the border of Utah and Arizona there is a famous combination
of vertically aligned stones named the “Mexican Hat” which looks like a
human with a Mexican hat. It is a random product of nature and not man-
made . . . really? Can we test the null hypothesis “The Mexican Hat is of
natural origin”? To do so we need a test statistic for a pile of stones and a
probability distribution for this test statistic under the null hypothesis. Let’s
take
$$t(p) = \begin{cases} 1 & \text{if } p \text{ forms a Mexican Hat} \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$

for any pile of stones p. How do we get a probability distribution of t(p) for all piles of stones p not affected by man? - We walk through the desert, examine a large number, say n = 10^6, of piles of stones, and count the frequency of t(p) = 0 and of t(p) = 1. Now, the Mexican Hat is famous for good reasons - there is only one p with t(p) = 1, namely the Mexican Hat itself. The other n - 1 = 10^6 - 1 samples go with t(p) = 0. Therefore the probability distribution for p not affected by man is

$$\mathrm{prob}\,(t(p) = k) = \begin{cases} 10^{-6} & \text{for } k = 1 \\ 1 - 10^{-6} & \text{for } k = 0 \end{cases} \qquad (2.2)$$

After these preparations everything is ready for the final test. We reject the null hypothesis with a risk of 10^{-6} if t(Mexican Hat) = 1. This condition is fulfilled and we may conclude: The Mexican Hat is not of natural origin but man-made.
Obviously, this argument is pretty absurd - but where is the logical error?
The fundamental error is that the null hypothesis is not independent of the
data which are used to conduct the test. We know a-priori that the Mexican Hat is a rare event, therefore the improbability of finding such a combination of stones cannot be used as evidence against its natural origin. The same
trick can of course be used to “prove” that any rare event is “non-natural”,
be it a heat wave or a particularly violent storm - the probability of observing
a rare event is small.
One might argue that no serious scientist would fall into this trap. However,
they do. The hypothesis of a connection between solar activity and the
statistics of climate on Earth is old and has been debated heatedly over
many decades. The debate had faded away in the last few decades - and
has been refueled by a remarkable observation by K. Labitzke. She studied
the relationship between the solar activity and the stratospheric temperature
at the North Pole. There was no obvious relationship - but she saw that
during years in which the Quasibiennial Oscillation (QBO) was in its West
Phase, there was an excellent positive correlation between solar activity and
North Pole temperature whereas during years with the QBO in its East Phase
Figure 2.1: Labitzke's and van Loon's relationship between solar flux and the temperature at 30 hPa at the North Pole for all winters during which the QBO is in its West Phase and in its East Phase. The correlations are 0.1, 0.8 and -0.5. (From Labitzke and van Loon, 1988).
[Figure panels not reproduced: curves of the 10.7 cm solar flux and the 30 hPa temperature (°C) against time (1956-1990), for the WEST and EAST phases of the QBO; part of the record is marked "Independent data".]
there was a good negative correlation (Labitzke, 1987; Labitzke and van Loon,
1988).
Labitzke’s finding was and is spectacular - and obviously right for the data
from the time interval at her disposal (see Figure 2.1). Of course it could be
that the result was a coincidence as unlikely as the formation of a Mexican
Hat. Or it could represent a real on-going signal. Unfortunately, the data
which were used by Labitzke to formulate her hypothesis can no longer be
used for the assessment of whether we deal with a signal or a coincidence.
Therefore an answer to this question requires information unrelated to the
data as for instance dynamical arguments or GCM experiments. However,
physical hypotheses on the nature of the solar-weather link were not available
and are possibly developing right now - so that nothing was left but to wait for
more data and better understanding. (The data which have become available
since Labitzke’s discovery in 1987 support the hypothesis.)
In spite of this fundamental problem an intense debate about the “statisti-
cal significance” broke out. The reviewers of the first comprehensive paper on
that matter by Labitzke and van Loon (1988) demanded a test. Reluctantly
the authors did what they were asked for and found, of course, an extremely small risk for the rejection of the null hypothesis "The solar-weather link is
zero”. After the publication various other papers were published dealing with
technical aspects of the test - while the basic problem that the data to conduct
the test had been used to formulate the null hypothesis remained.
When hypotheses are to be derived from limited data, I suggest two alterna-
tive routes to go. If the time scale of the considered process is short compared
to the available data, then split the full data set into two parts. Derive the
hypothesis (for instance a statistical model) from the first half of the data and
examine the hypothesis with the remaining part of the data.¹ If the time scale
of the considered process is long compared to the time series such that a split
into two parts is impossible, then I recommend using all data to build a model
optimally fitting the data. Check whether the fitted model is consistent with all known physical features and state explicitly that it is impossible to make
statements about the reliability of the model because of limited evidence.
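A minimal sketch of the first route in Python; the choice of a simple linear relation as the "statistical model", and the variable names, are assumptions made only for illustration and are not part of the original recommendation:

```python
import numpy as np

def split_sample_check(predictor, response):
    """Derive a hypothesis (here: a linear relation) from the first half of the
    record and examine it on the untouched second half."""
    x = np.asarray(predictor, dtype=float)
    y = np.asarray(response, dtype=float)
    half = len(x) // 2

    # Step 1 - hypothesis building: fit y = a*x + b on the first half only.
    a, b = np.polyfit(x[:half], y[:half], 1)

    # Step 2 - hypothesis testing: how well does the fitted relation describe
    # the second half, which played no role in building it?
    y_pred = a * x[half:] + b
    verification_corr = np.corrcoef(y_pred, y[half:])[0, 1]
    return a, b, verification_corr
```

If the correlation on the verification half collapses, the relation found in the first half was probably a fitting artefact - the Mexican Hat situation all over again.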
2.3 Neglecting Serial Correlation
Most standard statistical techniques are derived with explicit need for sta-
tistically independent data. However, almost all climatic data are somehow
correlated in time. The resulting problems for testing null hypotheses are discussed in some detail in Section 9.4. In the case of the t-test the problem is
nowadays often acknowledged - and as a cure people try to determine the
“equivalent sample size” (see Section 2.4). When done properly, the t-test
¹An example of this approach is offered by Wallace and Gutzler (1981).
Figure 2.2: Rejection rates of the Mann-Kendall test of the null hypothesis "no trend" when applied to 1000 time series of length n generated by an AR(1)-process (2.3) with prescribed α. The adopted nominal risk of the test is 5%.
Top: results for unprocessed serially correlated data.
Bottom: results after pre-whitening the data with (2.4). (From Kulkarni and von Storch, 1995)
becomes conservative - and when the "equivalent sample size" is "optimized" the test becomes liberal². We discuss this case in detail in Section 2.4.
There are, however, again and again cases in which people simply ignore
this condition, in particular when dealing with more exotic tests such as the
Mann-Kendall test, which is used to reject the null hypothesis of “no trends”.
To demonstrate that the result of such a test really depends strongly on the
autocorrelation, Kulkarni and von Storch (1995) made a series of Monte Carlo
experiments with AR(1)-processes with different values of the parameter α.
$$X_t = \alpha X_{t-1} + N_t \qquad (2.3)$$

with Gaussian "white noise" N_t, which is neither auto-correlated nor correlated with X_{t-k} for k ≥ 1. α is the lag-1 autocorrelation of X_t. 1000 iid³ time series of different lengths, varying from n = 100 to n = 1000, were generated and a Mann-Kendall test was performed. Since the time series have no
trends, we expect a (false) rejection rate of 5% if we adopt a risk of 5%, i.e., 50
out of the 1000 tests should return the result “reject null hypothesis”. The
actual rejection rate is much higher (see Figure 2.2). For autocorrelations
α ≤ 0.10 the actual rejection rate is about the nominal rate of 5%, but for α = 0.3 the rate is already 0.15, and for α = 0.6 the rate is > 0.30. If we test a
data field with a lag-1 autocorrelation of 0.3, we must expect that on average
at 15% of all points a “statistically significant trend” is found even though
there is no trend but only “red noise”. This finding is mostly independent of
the time series length.
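The effect is easy to reproduce. The following sketch (the function names are mine and this is not the code of Kulkarni and von Storch; the Mann-Kendall statistic is implemented with the usual Gaussian approximation and without a tie correction, which is adequate for continuous data) should qualitatively reproduce the upper panel of Figure 2.2:

```python
import numpy as np

def ar1(n, alpha, rng):
    """AR(1) series X_t = alpha*X_{t-1} + N_t with Gaussian white noise (eq. 2.3)."""
    x = np.empty(n)
    x[0] = rng.standard_normal() / np.sqrt(1.0 - alpha**2)  # start in the stationary state
    for t in range(1, n):
        x[t] = alpha * x[t - 1] + rng.standard_normal()
    return x

def mann_kendall_reject(x, z_crit=1.96):
    """Two-sided Mann-Kendall test of 'no trend' at a nominal 5% risk.
    Gaussian approximation of the test statistic, no tie correction."""
    n = len(x)
    s = 0.0
    for k in range(n - 1):
        s += np.sum(np.sign(x[k + 1:] - x[k]))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - np.sign(s)) / np.sqrt(var_s)  # continuity correction
    return abs(z) > z_crit

rng = np.random.default_rng(42)
n, trials = 200, 1000
for alpha in (0.0, 0.1, 0.3, 0.6):
    rate = np.mean([mann_kendall_reject(ar1(n, alpha, rng)) for _ in range(trials)])
    print(f"alpha = {alpha:.1f}: rejection rate = {rate:.2f}  (nominal 0.05)")
```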
When we have physical reasons to assume that the considered time series is
a sum of a trend and stochastic fluctuations generated by an AR(1) process,
and this assumption is sometimes reasonable, then there is a simple cure,
the success of which is demonstrated in the lower panel of Figure 2.2. Before
conducting the Mann-Kendall test, the time series is "pre-whitened" by first estimating the lag-1 autocorrelation α̂, and by replacing the original time series X_t by the series

$$Y_t = X_t - \hat{\alpha} X_{t-1} \qquad (2.4)$$
The "pre-whitened" time series is considerably less plagued by serial correlation, and the same Monte Carlo test as above returns actual rejection rates close to the nominal one, at least for moderate autocorrelations and not too short time series. The filter operation (2.4) also affects any trend; however, other Monte Carlo experiments have revealed that the power of the test is reduced only weakly as long as α is not too large.
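A minimal sketch of the pre-whitening step (2.4), with the lag-1 autocorrelation estimated from the series itself; this is an illustration of the recipe described above, not the original code:

```python
import numpy as np

def prewhiten(x):
    """Pre-whitening as in (2.4): Y_t = X_t - alpha_hat * X_{t-1},
    with alpha_hat the lag-1 autocorrelation estimated from the series."""
    x = np.asarray(x, dtype=float)
    x0 = x - x.mean()
    alpha_hat = np.sum(x0[1:] * x0[:-1]) / np.sum(x0[:-1] ** 2)
    return x[1:] - alpha_hat * x[:-1], alpha_hat
```

Running the Monte Carlo experiment of the previous sketch on the pre-whitened series instead of the raw series should bring the rejection rates back close to the nominal 5%, at least for moderate α, in line with the lower panel of Figure 2.2.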
A word of caution is, however, required: If the process is not AR(1) but of higher order or of a different model type, then the pre-whitening (2.4) is insufficient and the Mann-Kendall test still rejects more null hypotheses than specified by the significance level.

²A test is named "liberal" if it rejects the null hypothesis more often than specified by the significance level. A "conservative" test rejects less often than specified by the significance level.
³"iid" stands for "independent identically distributed".
Another possible cure is to “prune” the data, i.e., to form a subset of
observations which are temporally well separated so that any two consecutive
samples in the reduced data set are no longer autocorrelated (see Section
9.4.3).
When you use a technique which assumes independent data and you believe that serial correlation might be prevalent in your data, I suggest the following "Monte Carlo" diagnostic: Generate synthetic time series with a prescribed serial correlation, for instance by means of an AR(1)-process (2.3). Create time series without correlation (α = 0) and with correlation (0 < α < 1) and check whether the analysis, which is made with the real data, returns different results for the cases with and without serial correlation. If they are different, you cannot use the chosen technique.
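In code, the diagnostic might look like this; `my_analysis` is a hypothetical placeholder for whatever statistic or test is applied to the real data, and the default values are arbitrary:

```python
import numpy as np

def ar1(n, alpha, rng):
    """Synthetic AR(1) series X_t = alpha*X_{t-1} + N_t (eq. 2.3)."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = alpha * x[t - 1] + rng.standard_normal()
    return x

def serial_correlation_diagnostic(my_analysis, n, alpha=0.5, trials=500, seed=0):
    """Run the analysis on synthetic series without (alpha = 0) and with
    (0 < alpha < 1) serial correlation and report the average outcome of both.
    A clear difference means the technique cannot be used on correlated data."""
    rng = np.random.default_rng(seed)
    result_independent = np.mean([my_analysis(ar1(n, 0.0, rng)) for _ in range(trials)])
    result_correlated = np.mean([my_analysis(ar1(n, alpha, rng)) for _ in range(trials)])
    return result_independent, result_correlated

# If my_analysis returns 1 when a test rejects its null hypothesis and 0 otherwise,
# the two numbers are the rejection rates without and with serial correlation.
```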
2.4 Misleading Names: The Case of the
Decorrelation Time
The concept of "the" Decorrelation Time is based on the following reasoning:⁴ The variance of the mean X̄_n = (1/n) Σ_{k=1}^n X_k of n identically distributed and independent random variables X_k = X is

$$\mathrm{Var}\left(\bar{X}_n\right) = \frac{1}{n}\,\mathrm{Var}(X) \qquad (2.5)$$

If the X_k are autocorrelated then (2.5) is no longer valid, but we may define a number, named the equivalent sample size n′, such that

$$\mathrm{Var}\left(\bar{X}_n\right) = \frac{1}{n'}\,\mathrm{Var}(X) \qquad (2.6)$$

The decorrelation time is then defined as

$$\tau_D = \lim_{n\to\infty} \frac{n}{n'}\,\Delta t = \left[1 + 2\sum_{\Delta=1}^{\infty}\rho(\Delta)\right]\Delta t \qquad (2.7)$$

with the autocorrelation function ρ of X_t.
The decorrelation time for an AR(1) process (2.3) is

$$\tau_D = \frac{1+\alpha}{1-\alpha}\,\Delta t \qquad (2.8)$$
There are several conceptual problems with “the” Decorrelation Time:
⁴This section is entirely based on the paper by Zwiers and von Storch (1995). See also
Section 9.4.3.
• The definition (2.7) of a decorrelation time makes sense when dealing with the problem of the mean of n consecutive serially correlated observations. However, its arbitrariness in defining a characteristic time scale becomes obvious when we reformulate our problem by replacing the mean in (2.6) by, for instance, the variance. Then the characteristic time scale is (Trenberth, 1984):
τ="1 + 2
X
k=1
ρ2(k)#t
Thus the characteristic time scale τ depends markedly on the statistical problem under consideration. These numbers are, in general, not physically defined numbers.
• For an AR(1)-process we have to distinguish between the physically meaningful processes with positive memory (α > 0) and the physically meaningless processes with negative memory (α < 0). If α > 0 then formula (2.8) gives a time τ_D > Δt representative of the decay of the auto-correlation function. Thus, in this case, τ_D may be seen as a physically useful time scale, namely a "persistence time scale" (but see the dependency on the time step discussed below). If α < 0 then (2.8) returns times τ_D < Δt, even though probability statements for any two states with an even time lag are identical to those of an AR(1) process with AR-coefficient |α|.
Thus the number τ_D makes sense as a characteristic time scale when dealing with red noise processes. But for many higher-order AR(p)-processes the number τ_D does not reflect useful physical information.
• The Decorrelation Time depends on the time increment Δt: To demonstrate this dependency we consider again the AR(1)-process (2.3) with a time increment of Δt = 1 and α ≥ 0. Then we may construct other AR(1) processes with time increments k by noting that

$$X_t = \alpha^k X_{t-k} + N'_t \qquad (2.9)$$

with some noise term N′_t which is a function of N_t . . . N_{t-k+1}. The decorrelation times τ_D of the two processes (2.3, 2.9) are, because of α < 1:

$$\tau_{D,1} = \frac{1+\alpha}{1-\alpha}\cdot 1 \geq 1 \quad\text{and}\quad \tau_{D,k} = \frac{1+\alpha^k}{1-\alpha^k}\cdot k \geq k \qquad (2.10)$$

so that

$$\lim_{k\to\infty}\frac{\tau_{D,k}}{k} = 1 \qquad (2.11)$$
Figure 2.3: The dependency of the decorrelation time τ_{D,k} (2.10) on the time increment k (horizontal axis) and on the coefficient α (0.95, 0.90, 0.80, 0.70 and 0.50; see labels). (From von Storch and Zwiers, 1999).
That means that the decorrelation time is at least as long as the time increment; in case of "white noise", with α = 0, the decorrelation time is always equal to the time increment. In Figure 2.3 the dimensional decorrelation times are plotted for different α-values and different time increments k. The longer the time increment, the larger the decorrelation time. For sufficiently large time increments we have τ_{D,k} = k. For small α-values, such as α = 0.5, we have virtually τ_{D,k} = k already after k = 5. If α = 0.8 then τ_{D,1} = 9, τ_{D,11} = 13.1 and τ_{D,21} = 21.4. If the time increment is 1 day, then the decorrelation time of an α = 0.8-process is 9 days or 21 days - depending on whether we sample the process once a day or once every 21 days (see the short sketch below).
We conclude that the absolute value of the decorrelation time is of ques-
tionable informational value. However, the relative values obtained from
several time series sampled with the same time increment are useful to
infer whether the system has in some components a longer memory than
in others. If the decorrelation time is well above the time increment,
as in case of the α= 0.95-curve in Figure 2.3, then the number has
some informational value whereas decorrelation times close to the time
increment, as in case of the α= 0.5-curve, are mostly useless.
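The numbers quoted above (τ_{D,1} = 9, τ_{D,11} = 13.1, τ_{D,21} = 21.4 for α = 0.8) are easily checked. The sketch below evaluates (2.8) and (2.10) for an AR(1) process; the closed form used for the Trenberth variance time scale, τ = [(1 + α²)/(1 − α²)]Δt, is my own evaluation of the sum above for ρ(k) = α^k and is included only to show that different statistical problems lead to different "characteristic times" for the same process:

```python
def tau_D(alpha, dt=1.0):
    """Decorrelation time of an AR(1) process, eq. (2.8)."""
    return (1.0 + alpha) / (1.0 - alpha) * dt

def tau_D_k(alpha, k):
    """Decorrelation time of the same AR(1) process sampled every k steps, eq. (2.10)."""
    return (1.0 + alpha**k) / (1.0 - alpha**k) * k

def tau_variance(alpha, dt=1.0):
    """Trenberth (1984) time scale of the variance problem, evaluated for AR(1):
    [1 + 2*sum_k rho(k)^2]*dt = (1 + alpha^2) / (1 - alpha^2) * dt."""
    return (1.0 + alpha**2) / (1.0 - alpha**2) * dt

alpha = 0.8
print(tau_D(alpha))                                          # 9.0
print([round(tau_D_k(alpha, k), 1) for k in (1, 11, 21)])    # [9.0, 13.1, 21.4]
print(round(tau_variance(alpha), 2))                         # about 4.56 - a different "characteristic time"
```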
We have seen that the name “Decorrelation Time” is not based on physical
reasoning but on strictly mathematical grounds. Nevertheless the number is
often incorrectly interpreted as the minimum time so that two consecutive
observations X_t and X_{t+τ_D} are independent. If used as a vague estimate
with the reservations mentioned above, such a use is in order. However, the
number is often introduced as crucial parameter in test routines. Probably
the most frequent victim of this misuse is the conventional t-test.
We illustrate this case by a simple example from Zwiers and von Storch
(1995): We want to answer the question whether the long-term mean winter
temperatures in Hamburg and Victoria are equal. To answer this question,
we have at our disposal daily observations for one winter from both locations.
We treat the winter temperatures at both locations as random variables, say T_H and T_V. The "long term mean" winter temperatures at the two locations, denoted as µ_H and µ_V respectively, are parameters of the probability distributions of these random variables. In the statistical nomenclature the question we pose is: do the samples of temperature observations contain sufficient evidence to reject the null hypothesis H₀: µ_H − µ_V = 0.
The standard approach to this problem is to use the Student's t-test.
The test is conducted by imposing a statistical model upon the processes
which resulted in the temperature samples and then, within the confines of
this model, measuring the degree to which the data agree with H0. An essen-
tial part of the model which is implicit in the t-test is the assumption that the
data which enter the test represent a set of statistically independent obser-
vations. In our case, and in many other applications in climate research, this
assumption is not satisfied. The Student's t-test usually becomes "liberal" in these circumstances. That is, it tends to reject the null hypothesis on weaker evidence than is implied by the significance level⁵ which is specified
for the test. One manifestation of this problem is that the Student’s t-test
will reject the null hypothesis more frequently than expected when the null
hypothesis is true.
A relatively clean and simple solution to this problem is to form subsam-
ples of approximately independent observations from the observations. In
the case of daily temperature data, one might use physical insight to argue
that observations which are, say, 5 days apart, are effectively independent
of each other. If the number of samples, the sample means and standard deviations from these reduced data sets are denoted by n, T̃_H, T̃_V, σ̃_H and σ̃_V respectively, then the test statistic

$$t = \frac{\tilde{T}_H - \tilde{T}_V}{\sqrt{(\tilde{\sigma}_H^2 + \tilde{\sigma}_V^2)/n}} \qquad (2.12)$$

has a Student's t-distribution with n degrees of freedom provided that the null hypothesis is true⁶ and a test can be conducted at the chosen significance level by comparing the value of (2.12) with the percentiles of the t(n)-distribution.

⁵The significance level indicates the probability with which the null hypothesis will be rejected when it is true.
⁶Strictly speaking, this is true only if the standard deviations of T_H and T_V are equal.
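As an illustration of this subsampling strategy, the sketch below thins two daily series to every 5th value and evaluates (2.12) on the reduced samples; the 5-day spacing and the function name are assumptions for the example, not values from the text:

```python
import numpy as np
from scipy import stats

def subsample_t_test(temp_H, temp_V, spacing=5):
    """Thin both daily series to every `spacing`-th value (approximately
    independent observations) and evaluate the statistic (2.12)."""
    h = np.asarray(temp_H, dtype=float)[::spacing]
    v = np.asarray(temp_V, dtype=float)[::spacing]
    n = min(len(h), len(v))
    h, v = h[:n], v[:n]
    t = (h.mean() - v.mean()) / np.sqrt((h.std(ddof=1)**2 + v.std(ddof=1)**2) / n)
    p_value = 2.0 * stats.t.sf(abs(t), df=n)  # compared with the t(n)-distribution, as above
    return t, p_value
```

With a single 90-day winter and a 5-day spacing, n is only about 18 per location, which is exactly the loss of data discussed next.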
The advantage of (2.12) is that this test operates as specified by the user
provided that the interval between successive observations is long enough.
The disadvantage is that a reduced amount of data is utilized in the analy-
sis. Therefore, the following concept was developed in the 1970s to overcome
this disadvantage: The numerator in (2.12) is a random variable because it
differs from one pair of temperature samples to the next. When the observa-
tions which comprise the samples are serially uncorrelated the denominator
in (2.12) is an estimate of the standard deviation of the numerator and the
ratio can be thought of as an expression of the difference of means in units
of estimated standard deviations. For serially correlated data, with sample means T̃ and sample standard deviations σ̃ derived from all available observations, the standard deviation of T̃_H − T̃_V is √((σ̃_H² + σ̃_V²)/n′), with the equivalent sample size n′ as defined in (2.6). For sufficiently large sample sizes the ratio

$$t = \frac{\tilde{T}_H - \tilde{T}_V}{\sqrt{(\tilde{\sigma}_H^2 + \tilde{\sigma}_V^2)/n'}} \qquad (2.13)$$
has a standard Gaussian distribution with zero mean and standard deviation
one. Thus one can conduct a test by comparing (2.13) to the percentiles of
the standard Gaussian distribution.
So far everything is fine.
Since t(n′) is approximately equal to the Gaussian distribution for n′ ≥ 30, one may compare the test statistic (2.13) also with the percentiles of the t(n′)-distribution. The incorrect step is the heuristic assumption that this prescription - "compare with the percentiles of the t(n′), or t(n′−1), distribution" - would be right for small (n′ < 30) equivalent sample sizes. The rationale of doing so is the tacitly assumed fact that the statistic (2.13) would be t(n′)- or t(n′−1)-distributed under the null hypothesis. However, this assumption is simply wrong. The statistic (2.13) is not t(k)-distributed for any k, be it the equivalent sample size n′ or any other number. This result has been published by several authors (Katz (1982), Thiébaux and Zwiers (1984) and Zwiers and von Storch (1995)) but has stubbornly been ignored by most of the atmospheric sciences community.
A justification for the small sample size test would be that its behaviour
under the null hypothesis is well approximated by the t-test with the equiv-
alent sample size representing the degrees of freedom. But this is not so, as
is demonstrated by the following example with an AR(1)-process (2.3) with
α = 0.60. The exact equivalent sample size n′ = n/4 is known for the process since its parameters are completely known. One hundred independent samples of variable length n were randomly generated. Each sample was used to test the null hypothesis H₀: E(X_t) = 0 with the t-statistic (2.13) at the 5%
significance level. If the test operates correctly the null hypothesis should be
(incorrectly) rejected 5% of the time. The actual rejection rate (Figure 2.4)
Figure 2.4: The rate of erroneous rejections of the null hypothesis of equal means for the case of autocorrelated data in a Monte Carlo experiment. The "equivalent sample size" n′ (in the diagram labeled n_e) is either the correct number, derived from the true parameters of the considered AR(1)-process, or estimated from the best technique identified by Zwiers and von Storch (1995). (From von Storch and Zwiers, 1995). [Horizontal axis: sample size n.]
is notably smaller than the expected rate of 5% for 4n′ = n ≤ 30. Thus, the t-test operating with the true equivalent sample size is conservative and thus wrong.
More problems show up when the equivalent sample size is unknown. In this case it may be possible to specify n′ on the basis of physical reasoning. Assuming that conservative practices are used, this should result in underestimated values of n′ and consequently even more conservative tests. In most applications, however, an attempt is made to estimate n′ from the same data that are used to compute the sample mean and variance. Monte Carlo experiments show that the actual rejection rate of the t-test tends to be greater than the nominal rate when n′ is estimated. This case has also been simulated in a series of Monte Carlo experiments with the same AR(1)-process. The resulting rate of erroneous rejections is shown in Figure 2.4 - for small sample sizes the actual significance level can be several times greater than the nominal significance level. Thus, the t-test operating with the estimated equivalent sample size is liberal and thus wrong.
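Both failure modes can be reproduced with a short Monte Carlo sketch along the lines of the experiment described above (this is not the code of Zwiers and von Storch; the lag-1 estimator used to derive n′ from the data is one simple choice among several, and the exact rates will differ from Figure 2.4):

```python
import numpy as np
from scipy import stats

def ar1(n, alpha, rng):
    x = np.empty(n)
    x[0] = rng.standard_normal() / np.sqrt(1.0 - alpha**2)   # stationary start
    for t in range(1, n):
        x[t] = alpha * x[t - 1] + rng.standard_normal()
    return x

def reject(x, n_eq):
    """One-sample analogue of (2.13): test H0: E(X_t) = 0, compared with t(n')."""
    t = x.mean() / np.sqrt(x.var(ddof=1) / n_eq)
    return abs(t) > stats.t.ppf(0.975, df=n_eq)               # 5% risk, two-sided

alpha, n, trials = 0.6, 60, 2000
rng = np.random.default_rng(1)
rej_true = rej_est = 0
for _ in range(trials):
    x = ar1(n, alpha, rng)
    n_true = n * (1 - alpha) / (1 + alpha)                    # exact n' = n/4 for alpha = 0.6
    x0 = x - x.mean()
    a_hat = np.clip(np.sum(x0[1:] * x0[:-1]) / np.sum(x0**2), 0.0, 0.99)
    n_est = n * (1 - a_hat) / (1 + a_hat)                     # n' estimated from the data itself
    rej_true += reject(x, n_true)
    rej_est += reject(x, n_est)
print("true n' :     ", rej_true / trials)   # tends to fall below the nominal 0.05 (conservative)
print("estimated n' :", rej_est / trials)    # tends to exceed the nominal 0.05 (liberal)
```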
Zwiers and von Storch (1995) offer a “table look-up” test as a useful alter-
native to the inadequate “t-test with equivalent sample size” for situations
with serial correlations similar to red noise processes.
2.5 Use of Advanced Techniques
The following case is an educational example which demonstrates how easily
an otherwise careful analysis can be damaged by an inconsistency hidden in
a seemingly unimportant detail of an advanced technique. When people have
experience with the advanced technique for a while then such errors are often
found mainly by instinct (“This result cannot be true - I must have made
an error.”) - but when it is new then the researcher is somewhat defenseless
against such errors.
The background of the present case was the search for evidence of bifur-
cations and other fingerprints of truly nonlinear behaviour of the dynamical
system “atmosphere”. Even though the nonlinearity of the dynamics of the
planetary-scale atmospheric circulation was accepted as an obvious fact by
the meteorological community, atmospheric scientists only began to discuss
the possibility of two or more stable states in the late 1970’s. If such multiple
stable states exist, it should be possible to find bi- or multi-modal distribu-
tions in the observed data (if these states are well separated).
Hansen and Sutera (1986) identified a bi-modal distribution in a variable
characterizing the energy of the planetary-scale waves in the Northern Hemi-
sphere winter. Daily amplitudes for the zonal wavenumbers k = 2 to 4 for 500 hPa height were averaged for midlatitudes. A "wave-amplitude indicator" Z was finally obtained by subtracting the annual cycle and by filtering out all variability on time scales shorter than 5 days. The probability density function f_Z was estimated by applying a technique called the maximum penalty technique to 16 winters of daily data. The resulting f_Z had two maxima separated by a minor minimum. This bimodality was taken as proof
of the existence of two stable states of the atmospheric general circulation:
A “zonal regime”, with Z<0, exhibiting small amplitudes of the planetary
waves and a “wavy regime”, with Z>0, with amplified planetary-scale zonal
disturbances.
Hansen and Sutera performed a “Monte Carlo” experiment to evaluate the
likelihood of fitting a bimodal distribution to the data with the maximum
penalty technique even if the generating distribution is unimodal. The au-
thors concluded that this likelihood is small. On the basis of this statistical
check, the found bimodality was taken for granted by many scientists for
almost a decade.
When I read the paper, I had never heard about the “maximum penalty
method” but had no doubts that everything would have been done properly
in the analysis. The importance of the question prompted other scientists to
perform the same analysis to further refine and verify the results. Nitsche et
al. (1994) reanalysed step-by-step the same data set which had been used in
the original analysis and came to the conclusion that the purportedly small
probability for a misfit was large. The error in the original analysis was not
at all obvious. Only by carefully scrutinizing the pitfalls of the maximum
penalty technique did Nitsche and coworkers find the inconsistency between
the Monte Carlo experiments and the analysis of the observational data.
Nitsche et al. reproduced the original estimation, but showed that some-
thing like 150 years of daily data would be required to exclude with sufficient
certainty the possibility that the underlying distribution would be unimodal.
What this boils down to is that the null hypothesis, according to which the distribution would be unimodal, is not rejected by the available data - and the published test was wrong. However, since the failure to reject the null
hypothesis does not imply the acceptance of the null hypothesis (but merely
the lack of enough evidence to reject it), the present situation is that the
(alternative) hypothesis “The sample distribution does not originate from a
unimodal distribution” is not falsified but still open for discussion.
I have learned the following rule to be useful when dealing with advanced
methods: Such methods are often needed to find a signal in a vast noisy phase
space, i.e., the needle in the haystack - but after having the needle in our hand,
we should be able to identify the needle as a needle by simply looking at it.⁷
Whenever you are unable to do so there is a good chance that something is
rotten in the analysis.
⁷See again the study of Wallace and Gutzler, who identified their teleconnection patterns first by examining correlation maps - and then by simple weighted means of a few grid point values - see Section 12.1.
2.6 Epilogue
I have chosen the examples of this Chapter to advise users of statistical con-
cepts to be aware of the sometimes hidden assumptions of these concepts.
Statistical Analysis is not a Wunderwaffe⁸ to extract a wealth of information
from a limited sample of observations. More results require more assump-
tions, i.e., information given by theories and other insights unrelated to the
data under consideration.
But, even if it is not a Wunderwaffe, Statistical Analysis is an indispensable
tool in the evaluation of limited empirical evidence. The results of Statistical
Analysis are not miracle-like enlightenment but sound and understandable
assessments of the consistency of concepts and data.
⁸Magic bullet.