Incremental Statistical Measures
Katharina Tschumitschew, Frank Klawonn
1 Introduction
Statistics and statistical methods are used in almost every aspect of modern life, such as medicine, social surveys, economics and marketing, to name only a few application areas. A vast number of sophisticated statistical software tools can be used to search and test for structures and patterns in data. Simple summary statistics already provide important information about the data generating process. Characteristics of the data distribution can be described by summary statistics like the following ones.
• Measures of location: The mean and quantiles provide information about the location of the distribution. Mean and median are representatives of the centre of the distribution.
• Measures of spread: Common measures for the variation in the data are the standard deviation, the variance and the interquartile range.
• Shape: The third and fourth moments provide information about the skewness and the kurtosis of a probability distribution.
• Dependence: For instance, the Pearson correlation coefficient is a measure for the linear dependency between two variables. Other common measures for statistical dependency between two variables are rank correlation coefficients like Spearman's rho or Kendall's tau.
Katharina Tschumitschew
Department of Computer Science, Ostfalia University of Applied Sciences, Salzdahlumer Str. 46/48, D-38302 Wolfenbuettel, Germany, e-mail: katharina.tschumitschew@ostfalia.de
Frank Klawonn
Department of Computer Science, Ostfalia University of Applied Sciences, Salzdahlumer Str. 46/48, D-38302 Wolfenbuettel, Germany, e-mail: f.klawonn@ostfalia.de
and
Bioinformatics and Statistics, Helmholtz Centre for Infection Research, Inhoffenstr. 7, D-38124 Braunschweig, Germany, e-mail: frank.klawonn@helmholtz-hzi.de
Apart from providing information about the location and spread of the data distribution, quantiles also play an important role in robust data analysis, since they are less sensitive to outliers.
Summary statistics can be used in a purely exploratory context to describe properties of the data, but also as estimators for model parameters of an assumed underlying data distribution.
More complex and powerful methods for statistical data analysis are, for instance, hypothesis tests. Statistical hypothesis testing allows us to discover the current state of affairs and therefore helps us to make decisions based on the gained knowledge. Hypothesis tests can be applied to a great variety of problems: we may need to test just a single parameter or the whole distribution of the data.
However, classical statistics operates with a finite, fixed data set. On the other hand, nowadays it is very important to continuously collect and analyse data sets that grow over time, since the (new) data may contain useful information. Sensor data as well as the seasonal behaviour of markets, weather or animals are in the focus of diverse research studies. The amount of recorded data increases each day. Apart from the huge amount of data to be dealt with, another problem is that the data arrive continuously in time. Such kind of data is called a data stream. A data stream can be characterised as an unlimited sequence of values arriving step by step over time. One of the main problems for the analysis of data streams is limited computing and memory capacity. It is impossible to hold the whole data set in the main memory of a computer or computing device like an ECU (electronic control unit) that might also be responsible for other tasks than just analysing the data. Moreover, the results of the analysis should be presented in acceptable time, sometimes even under very strict time constraints, so that the user or system can react in real time. Therefore, the analysis of data streams requires efficient online computations. Algorithms based on incremental or recursive computation schemes satisfy these requirements. Such methods do not store all historical data and do not need to browse through old data to update an estimator or an analysis; in the ideal case, each data value is touched only once.
Consequently, the application of statistical methods to data streams requires modifications of the standard calculation schemes in order to be able to carry out the computations online. Since data come in step by step, incremental calculations are needed to avoid restarting the computation process from scratch each time new data arrive and to save memory, so that not the whole data set must be kept in memory. Statistical measures like the sample mean, the variance and moments in general and the Pearson correlation coefficient render themselves easily to incremental computation schemes, whereas, for instance, standard quantile computations need the whole data set. In such cases, new incremental methods must be developed that avoid sorting the whole data set, since sorting requires in principle access to the whole data set. Several approaches for the online estimation of quantiles are presented, for instance, in [9, 19, 1, 25].
Another important aspect in data stream analysis is that the data generating process does not remain static, i.e. the underlying probabilistic model cannot be assumed to be stationary: changes in the data structure may occur over time.
Dealing with non-stationary data requires change detection and online adaptation. Different kinds of non-stationarity have been classified in [2]:

• Changes in the data distribution: the change occurs in the data distribution itself. For instance, the mean or variance of the data distribution may change over time.
• Changes in concept: here, concept change refers to changes of a target variable. A target variable is a variable whose values we try to predict based on the model estimated from the data; for linear regression, for instance, it is a change of the parameters of the linear relationship between the variables.
  – Concept drift: concept drift describes gradual changes of the concept. In statistics, this is usually called structural drift.
  – Concept shift: concept shift refers to an abrupt change, which is also referred to as structural break.
Hence change detection and online adaptation of statistical estimators are required for non-stationary data streams. Various strategies to handle non-stationarity have been proposed; see for instance [11] for a detailed survey of change detection methods. Statistical hypothesis tests may also be used for change detection. Since we are working with data streams, it is required that the calculations for the hypothesis tests can be carried out in an incremental way. For instance, the $\chi^2$-test and the $t$-test¹ render themselves easily to incremental computations. Based on change detection strategies, one can derive information on the sampling strategy, for instance the optimal size of a time window for parameter estimation on non-stationary data streams [26, 3].
This chapter is organised as follows. Incremental computations of the mean, variance, third and fourth moments and the Pearson correlation coefficient are explained in Section 2. Furthermore, two algorithms for the online estimation of quantiles are described in Section 3. In Section 4 we provide online adaptations of statistical hypothesis tests and discuss different change detection strategies.
2 Incremental calculation of moments and the Pearson correlation coefficient
Statistical measures like sample moments provide valuable information about the data distribution. The sample mean or empirical mean (the first sample moment) is a measure of the centre of location of the data distribution; a measure of variability is the sample variance (based on the second sample central moment). The third and fourth central moments are used to compute the skewness and kurtosis of the data sample. Skewness provides information about the asymmetry of the data distribution, and kurtosis gives an idea about the degree of peakedness of the distribution.
¹ For precise definitions see Section 4.
Another important statistic is the correlation coefficient. The correlation coefficient is a measure for the linear dependency between two variables.
In this section we introduce incremental calculations for these statistical measures.
In the following, we consider a real-valued sample $x_1,\dots,x_t,\dots$ ($x_i \in \mathbb{R}$ for all $i \in \{1,\dots,t,\dots\}$).
Definition 1. Let $x_1,\dots,x_t$ be a random sample from the distribution of the random variable $X$. The sample or empirical mean of the sample of size $t$, denoted by $\bar{x}_t$, is given by the formula

$$\bar{x}_t = \frac{1}{t}\sum_{i=1}^{t} x_i. \qquad (1)$$
Equation (1) cannot be applied directly in the context of data streams, since it would require considering all sample values at each time step. Fortunately, Equation (1) can easily be transformed into an incremental scheme:

$$\bar{x}_t = \frac{1}{t}\sum_{i=1}^{t} x_i = \frac{1}{t}\left(x_t + \sum_{i=1}^{t-1} x_i\right) = \frac{1}{t}\bigl(x_t + (t-1)\,\bar{x}_{t-1}\bigr) = \bar{x}_{t-1} + \frac{1}{t}\,(x_t - \bar{x}_{t-1}). \qquad (2)$$
The incremental update Equation (2) requires only three values to calculate the sample mean at time point $t$:

• the mean at time point $t-1$,
• the sample value at time point $t$,
• the number of sample values so far.
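As an illustration, the update rule (2) can be implemented in a few lines of Python (the class name and interface are our own sketch, not part of the original method description):

```python
class IncrementalMean:
    """Running sample mean via the update rule of Equation (2): only the
    previous mean, the count and the new value are needed."""

    def __init__(self):
        self.t = 0       # number of values seen so far
        self.mean = 0.0  # current sample mean

    def update(self, x):
        self.t += 1
        self.mean += (x - self.mean) / self.t  # Equation (2)
        return self.mean
```

Feeding the values 3.8, 5.2, 6.1, 4.2 one by one yields the same result as summing them and dividing by four, without ever storing the whole sample.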
The empirical or sample variance can be calculated in an incremental fashion in a similar way.

Definition 2. Let $x_1,\dots,x_t$ be a random sample from the distribution of the random variable $X$. The empirical or sample variance of a sample of size $t$ is given by

$$s_t^2 = \frac{1}{t-1}\sum_{i=1}^{t} (x_i - \bar{x}_t)^2. \qquad (3)$$

Furthermore, $s_t = \sqrt{s_t^2}$ is called the sample standard deviation.
In order to simplify the calculations we use the following notation:

$$\tilde{m}_{2,t} = \sum_{i=1}^{t} (x_i - \bar{x}_t)^2. \qquad (4)$$
In the following, the formula for the incremental calculation is derived from Equation (4) using Equation (2):

$$\begin{aligned}
\tilde{m}_{2,t} - \tilde{m}_{2,t-1} &= \sum_{i=1}^{t} x_i^2 - t\,\bar{x}_t^2 - \sum_{i=1}^{t-1} x_i^2 + (t-1)\,\bar{x}_{t-1}^2 \\
&= x_t^2 - t\,\bar{x}_t^2 + (t-1)\,\bar{x}_{t-1}^2 \\
&= x_t^2 - \bar{x}_{t-1}^2 + t\left(\bar{x}_{t-1}^2 - \bar{x}_t^2\right) \\
&= x_t^2 - \bar{x}_{t-1}^2 + t\,(\bar{x}_{t-1} - \bar{x}_t)(\bar{x}_{t-1} + \bar{x}_t) \\
&= x_t^2 - \bar{x}_{t-1}^2 + t\left(-\tfrac{1}{t}(x_t - \bar{x}_{t-1})\right)(\bar{x}_{t-1} + \bar{x}_t) \\
&= x_t^2 - \bar{x}_{t-1}^2 - (x_t - \bar{x}_{t-1})(\bar{x}_{t-1} + \bar{x}_t) \\
&= (x_t - \bar{x}_{t-1})(x_t + \bar{x}_{t-1} - \bar{x}_{t-1} - \bar{x}_t) \\
&= (x_t - \bar{x}_{t-1})(x_t - \bar{x}_t).
\end{aligned}$$

Consequently, we obtain the following recurrence formula for the second central moment:

$$\tilde{m}_{2,t} = \tilde{m}_{2,t-1} + (x_t - \bar{x}_{t-1})(x_t - \bar{x}_t). \qquad (5)$$
The unbiased estimator for the variance of the sample according to Equation (5) is given by

$$s_t^2 = \frac{1}{t-1}\,\tilde{m}_{2,t} = \frac{(t-2)\,s_{t-1}^2 + (x_t - \bar{x}_{t-1})(x_t - \bar{x}_t)}{t-1}. \qquad (6)$$
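A minimal sketch combining Equations (2), (5) and (6) in Python (names are our own; the scheme mirrors the well-known Welford-style one-pass variance computation):

```python
class IncrementalVariance:
    """One-pass sample variance: the running mean follows Equation (2) and
    the accumulator m2 = ~m_{2,t} follows Equation (5)."""

    def __init__(self):
        self.t = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.t += 1
        delta = x - self.mean                # x_t - mean_{t-1}
        self.mean += delta / self.t          # Equation (2)
        self.m2 += delta * (x - self.mean)   # Equation (5)

    def variance(self):
        # unbiased sample variance s_t^2 = m2 / (t - 1), cf. Equation (6)
        if self.t < 2:
            return float("nan")
        return self.m2 / (self.t - 1)
```

Note that `delta` must be computed against the old mean before the mean is updated; this ordering is exactly what makes the product in Equation (5) come out right.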
Definition 3. Let $x_1,\dots,x_t$ be a random sample from the distribution of the random variable $X$. Then the $k$th central moment of a sample of size $t$ is defined by

$$m_{k,t} = \frac{1}{t}\sum_{i=1}^{t} (x_i - \bar{x}_t)^k. \qquad (7)$$

In order to simplify the computations and to facilitate the readability of the text, we use the following expression for the derivation:

$$\tilde{m}_{k,t} = \sum_{i=1}^{t} (x_i - \bar{x}_t)^k, \qquad (8)$$

therefore $\tilde{m}_{k,t} = t\cdot m_{k,t}$.
For the third- and fourth-order moments, which are needed to calculate the skewness and kurtosis of the data distribution, incremental formulae can be derived in a similar way, in the form of pairwise update equations for $\tilde{m}_{3,t}$ and $\tilde{m}_{4,t}$. With $b = \frac{x_t - \bar{x}_{t-1}}{t}$, so that $x_i - \bar{x}_t = (x_i - \bar{x}_{t-1}) - b$ and $x_t - \bar{x}_t = (t-1)\,b$, we obtain

$$\begin{aligned}
\tilde{m}_{3,t} &= \sum_{i=1}^{t-1} (x_i - \bar{x}_t)^3 + (x_t - \bar{x}_t)^3 \\
&= \sum_{i=1}^{t-1} \bigl((x_i - \bar{x}_{t-1}) - b\bigr)^3 + (tb - b)^3 \\
&= \sum_{i=1}^{t-1} \left[(x_i - \bar{x}_{t-1})^3 - 3b\,(x_i - \bar{x}_{t-1})^2 + 3b^2\,(x_i - \bar{x}_{t-1}) - b^3\right] + (t-1)^3 b^3 \\
&= \tilde{m}_{3,t-1} - 3b\,\tilde{m}_{2,t-1} - (t-1)\,b^3 + (t-1)^3 b^3 \\
&= \tilde{m}_{3,t-1} - 3b\,\tilde{m}_{2,t-1} + t\,(t-1)(t-2)\,b^3, \qquad (9)
\end{aligned}$$

where we have used $\sum_{i=1}^{t-1}(x_i - \bar{x}_{t-1}) = 0$.

From Equation (9) we obtain a one-pass formula for the third-order centred statistical moment of a sample of size $t$:

$$\tilde{m}_{3,t} = \tilde{m}_{3,t-1} - \frac{3\,(x_t - \bar{x}_{t-1})}{t}\,\tilde{m}_{2,t-1} + \frac{(t-1)(t-2)}{t^2}\,(x_t - \bar{x}_{t-1})^3. \qquad (10)$$

The derivation for the fourth-order moment is very similar to Equation (9) and thus is not detailed here:

$$\tilde{m}_{4,t} = \tilde{m}_{4,t-1} - \frac{4\,(x_t - \bar{x}_{t-1})}{t}\,\tilde{m}_{3,t-1} + \frac{6\,(x_t - \bar{x}_{t-1})^2}{t^2}\,\tilde{m}_{2,t-1} + \frac{(t-1)(t^2 - 3t + 3)}{t^3}\,(x_t - \bar{x}_{t-1})^4. \qquad (11)$$
The results presented above offer the essential formulae for efficient one-pass calculations of statistical moments up to the fourth order. These are important when the mean, variance, skewness and kurtosis of a data stream should be calculated. Although these measures cover the needs of the vast majority of applications for data analysis, sometimes higher-order statistics should be used. For the computation of higher-order statistical moments see for instance [6].
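The update equations can be combined into a single one-pass computation of skewness and kurtosis. The sketch below is our own (class and method names are assumptions, and skewness and kurtosis are formed from the biased moments $m_k = \tilde{m}_{k,t}/t$); note that the update order matters, since Equations (10) and (11) use the moments of the previous step:

```python
class IncrementalMoments:
    """One-pass central moments up to order four, following the update
    Equations (5), (10) and (11) for m2, m3, m4 (the ~m_{k,t} accumulators)."""

    def __init__(self):
        self.t = 0
        self.mean = 0.0
        self.m2 = self.m3 = self.m4 = 0.0

    def update(self, x):
        t = self.t + 1
        d = x - self.mean  # x_t - mean_{t-1}
        # m4 and m3 must be updated first: they use the previous m3 and m2.
        self.m4 += (-4.0 * d / t * self.m3
                    + 6.0 * (d / t) ** 2 * self.m2
                    + (t - 1) * (t * t - 3 * t + 3) / t ** 3 * d ** 4)  # Eq. (11)
        self.m3 += (-3.0 * d / t * self.m2
                    + (t - 1) * (t - 2) / t ** 2 * d ** 3)              # Eq. (10)
        self.m2 += d * d * (t - 1) / t                                  # Eq. (5)
        self.mean += d / t                                              # Eq. (2)
        self.t = t

    def skewness(self):
        return (self.m3 / self.t) / (self.m2 / self.t) ** 1.5

    def kurtosis(self):
        return (self.m4 / self.t) / (self.m2 / self.t) ** 2
```

In exact arithmetic the accumulators agree with the two-pass sums $\sum_i (x_i - \bar{x}_t)^k$; in floating point they differ only by rounding error, while touching each value once.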
Now we derive a formula for the incremental calculation of the sample correlation coefficient.

Definition 4. Let $x_1,\dots,x_t$ be a random sample from the distribution of the random variable $X$ and $y_1,\dots,y_t$ be a random sample from the distribution of the random variable $Y$. Then the sample Pearson correlation coefficient of the sample of size $t$, denoted by $r_{xy,t}$, is given by the formula

$$r_{xy,t} = \frac{\sum_{i=1}^{t} (x_i - \bar{x}_t)(y_i - \bar{y}_t)}{(t-1)\,s_{x,t}\,s_{y,t}}, \qquad (12)$$

where $\bar{x}_t$ and $\bar{y}_t$ are the sample means of $X$ and $Y$ and $s_{x,t}$ and $s_{y,t}$ are the sample standard deviations of $X$ and $Y$, respectively.
The incremental formula for the sample standard deviation can be easily derived from the incremental formula for the sample variance (6). Hence only the numerator of Equation (12) needs to be considered further; up to the factor $t-1$, it represents the sample covariance $s_{xy,t}$.

Definition 5. Let $x_1,\dots,x_t$ be a random sample from the distribution of the random variable $X$ and $y_1,\dots,y_t$ be a random sample from the distribution of the random variable $Y$. Then the sample covariance $s_{xy,t}$ of the sample of size $t$ is given by

$$s_{xy,t} = \frac{\sum_{i=1}^{t} (x_i - \bar{x}_t)(y_i - \bar{y}_t)}{t-1}, \qquad (13)$$

where $\bar{x}_t$ and $\bar{y}_t$ are the sample means of $X$ and $Y$, respectively.
The formula for the incremental calculation of the covariance is given by

$$\begin{aligned}
(t-1)\,s_{xy,t} &= \sum_{i=1}^{t-1} (x_i - \bar{x}_t)(y_i - \bar{y}_t) + (x_t - \bar{x}_t)(y_t - \bar{y}_t) \\
&= \sum_{i=1}^{t-1} \bigl((x_i - \bar{x}_{t-1}) - b_x\bigr)\bigl((y_i - \bar{y}_{t-1}) - b_y\bigr) + (t-1)^2\,b_x b_y \\
&= (t-2)\,s_{xy,t-1} + t\,(t-1)\,b_x b_y, \qquad (14)
\end{aligned}$$

where $b_x = \frac{x_t - \bar{x}_{t-1}}{t}$ and $b_y = \frac{y_t - \bar{y}_{t-1}}{t}$. Hence the incremental formula for the sample covariance is

$$s_{xy,t} = \frac{t-2}{t-1}\,s_{xy,t-1} + \frac{1}{t}\,(x_t - \bar{x}_{t-1})(y_t - \bar{y}_{t-1}). \qquad (15)$$
Therefore, to update the Pearson correlation coefficient, we have to compute the sample standard deviations and the covariance first and subsequently use Equation (12).
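Putting Equations (2), (6), (12) and (15) together gives an online correlation estimator; the following is a sketch under our own naming (the variance recursions are the two-variable analogue of Equation (6)):

```python
class IncrementalCorrelation:
    """Online Pearson correlation: running means via Equation (2), covariance
    via Equation (15) and variances via the analogous form of Equation (6)."""

    def __init__(self):
        self.t = 0
        self.mx = self.my = 0.0    # running means of x and y
        self.sxy = 0.0             # sample covariance s_{xy,t}
        self.sxx = self.syy = 0.0  # sample variances of x and y

    def update(self, x, y):
        self.t += 1
        t = self.t
        dx = x - self.mx           # x_t - mean_{t-1}
        dy = y - self.my
        if t > 1:
            f = (t - 2) / (t - 1)
            self.sxy = f * self.sxy + dx * dy / t   # Equation (15)
            self.sxx = f * self.sxx + dx * dx / t
            self.syy = f * self.syy + dy * dy / t
        self.mx += dx / t          # Equation (2)
        self.my += dy / t

    def correlation(self):
        # Equation (12): r = s_xy / (s_x * s_y)
        return self.sxy / (self.sxx * self.syy) ** 0.5
```

As with the variance, the deltas are taken against the old means before the means are updated.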
Above, we presented incremental calculations for the empirical mean, the empirical variance, the third and fourth sample central moments and the sample correlation coefficient. These statistical measures can also be considered as estimators of the corresponding parameters of the data distribution. Therefore, we are interested in the question how many values $x_i$ we need to get a "good" estimation of the parameters. Of course, as we deal with a data stream, in general we will have a large amount of data. However, some applications are based on time window techniques, for instance the change detection methods presented in Section 4. There we need to compare at least two samples of data, and on that account the data have to be split into smaller parts. To answer the question about the optimal amount of data for statistical estimators, we have to analyse the variances of the parameter estimators. The variance of an estimator shows how efficient this estimator is.
Here we restrict our considerations to a random sample from a normal distribution with expected value 0. Let $X_1,\dots,X_t$ be independent and identically distributed (i.i.d.) random variables following a normal distribution, $X_i \sim N(0, \sigma^2)$, and let $x_1,\dots,x_t$ be observed values of these random variables.

The variance of the estimator of the expected value² $\bar{X}_t = \frac{1}{t}\sum_{i=1}^{t} X_i$ is given by

$$\mathrm{Var}(\bar{X}_t) = \frac{\sigma^2}{t}. \qquad (16)$$

The variance of the unbiased estimator of the variance $S_t^2 = \frac{1}{t-1}\sum_{i=1}^{t} (X_i - \bar{X}_t)^2$ is given by

$$\mathrm{Var}(S_t^2) = \frac{2}{t-1}\,\sigma^4. \qquad (17)$$

The variance of the estimator of the third moment is shown in Equation (18) (see [6] for more detailed information):

$$\mathrm{Var}(M_{3,t}) = \frac{6\,(t-1)(t-2)}{t^3}\,\sigma^6. \qquad (18)$$
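For choosing a time-window size, Equations (16), (17) and (18) can simply be evaluated; a small sketch (function names and the linear search are our own illustration):

```python
def estimator_variances(t, sigma2=1.0):
    """Variances of the estimators of the mean, the variance and the third
    moment for an N(0, sigma2) sample of size t, Equations (16)-(18)."""
    var_mean = sigma2 / t                                      # Eq. (16)
    var_s2 = 2.0 * sigma2 ** 2 / (t - 1)                       # Eq. (17)
    var_m3 = 6.0 * (t - 1) * (t - 2) / t ** 3 * sigma2 ** 3    # Eq. (18)
    return var_mean, var_s2, var_m3

def smallest_t(target, which, sigma2=1.0):
    """Smallest sample size whose estimator variance drops below `target`
    (plain linear search; meant for the monotonically decreasing cases
    (16) and (17), since Equation (18) first grows for very small t)."""
    t = 2
    while estimator_variances(t, sigma2)[which] > target:
        t += 1
    return t
```

For a standard normal population, a window of 50 values already brings the variance of the mean estimator down to 0.02, whereas the third moment estimator still has variance of roughly 0.11 at that point, in line with the discussion below.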
Figure 1 shows Equations (16), (17) and (18) as functions in tfor
σ
2=1 (stan
dard normal population). It is obvious that for small amounts of data, the variance
of the estimators is quite large, consequently more values are needed to obtain a
reliable estimation of distribution parameters. Furthermore the optimal sample size
depends on the statistic to be computed. For instance, for the sample mean and a
sample of size 50, the variance is already small enough, whereas for the third mo
ment estimator to have the same variance, many more observations are needed.
20
40
60
80
1
0
0
t
0
.0
5
0
.1
0
0
.1
5
0
.2
0
0
.2
5
0
.3
0
Var
V
arM3
V
arS2
V
arX
Fig. 1 Variances from bottom to top of parameter estimators for the expected value, the variance
and the third moment of a standard normal distribution
² We use capital letters here to distinguish between random variables and real numbers, which are denoted by small letters.
We apply the same considerations to the sample correlation coefficient. Let $X$ and $Y$ be two random variables following normal distributions and let $X_1,\dots,X_t$ and $Y_1,\dots,Y_t$ be i.i.d. samples of $X$ and $Y$, respectively: $X_i \sim N(0, \sigma_x^2)$ and $Y_i \sim N(0, \sigma_y^2)$. Assume the correlation between $X$ and $Y$ is equal to $\rho_{XY}$. Then the asymptotic variance of the sample correlation coefficient is given by (see [7])

$$\mathrm{Var}(R_{XY,t}) \approx \frac{\left(1 - \rho_{XY}^2\right)^2}{t}. \qquad (19)$$

Attention should be paid to the asymptotic nature of Equation (19): this formula can be used only for sufficiently large $t$ (see [7]). Equation (19) is illustrated in Figure 2 as a function of $t$ for $\rho_{XY} = 0.9$. Since the plots for different values of $\rho_{XY}$ are very similar, they are not shown here.
Fig. 2 Asymptotic variance of the sample correlation coefficient
In this section we have provided equations for the incremental calculation of the sample mean, the sample variance, the third and fourth moments and the Pearson correlation coefficient. These statistics allow us to summarise a set of observations analytically. Since we assume that the observations reflect the population as a whole, these statistics give us an idea about the underlying data distribution. Other important summary statistics are sample quantiles. Incremental approaches for quantile estimation are described in the next section.
3 Incremental quantile estimation

Quantiles play an important role in statistics, especially in robust statistics, since they are not or only little sensitive to outliers. For $q \in (0,1)$, the $q$-quantile has the property that $q\cdot 100\%$ of the data are smaller and $(1-q)\cdot 100\%$ of the data are larger than this value. The median, i.e. the 50%-quantile, is a robust measure of location, and the interquartile range³ is a robust measure of spread. Incremental or recursive techniques for quantile estimation are not as obvious as for statistical moments, since the computation of a sample quantile needs the entire sorted data. Nevertheless, there are techniques for incremental quantile estimation. In this section, we describe two different approaches. The first approach is restricted to continuous symmetric unimodal distributions and is therefore not applicable to all real-world data. The second approach is not restricted to any kind of distribution and is not limited to continuous random variables. We also provide experimental results for both algorithms for different kinds of distributions.
3.1 Incremental quantile estimation for continuous random variables

Definition 6. For a random variable $X$ with cumulative distribution function $F_X$, the $q$-quantile ($q \in (0,1)$) is defined as $\inf\{x \in \mathbb{R} \mid F_X(x) \ge q\}$. If $x_q$ is the $q$-quantile of a continuous random variable, this implies $P(X \le x_q) = q$ and $P(X \ge x_q) = 1 - q$.
For continuous random variables, an incremental scheme for quantile estimation
is proposed in [10]. This approach is based on the following theorem.
Theorem 1. Let $\{\xi_t\}_{t=0,1,\dots}$ be a sequence of independent, identically distributed (i.i.d.) random variables with cumulative distribution function $F_\xi$. Assume that the density function $f_\xi(x)$ exists and is continuous in the $\alpha$-quantile $x_\alpha$ for an arbitrarily chosen $\alpha$ ($0 < \alpha < 1$). Further let the inequality

$$f_\xi(x_\alpha) > 0 \qquad (20)$$

be fulfilled. Let $\{c_t\}_{t=0,1,\dots}$ be a (control) sequence of real numbers satisfying the conditions

$$\sum_{t=0}^{\infty} c_t = \infty, \qquad \sum_{t=0}^{\infty} c_t^2 < \infty. \qquad (21)$$

Then the stochastic process $X_t$ defined by

$$X_0 = \xi_0, \qquad (22)$$
$$X_{t+1} = X_t + c_t\,Y_{t+1}(X_t, \xi_{t+1}), \qquad (23)$$

with

$$Y_{t+1} = \begin{cases} \alpha - 1 & \text{if } \xi_{t+1} < X_t, \\ \alpha & \text{if } \xi_{t+1} \ge X_t, \end{cases} \qquad (24)$$

almost surely converges to the quantile $x_\alpha$.

³ The interquartile range is the mid-range containing 50% of the data; it is computed as the difference between the 75%- and the 25%-quantile: $\mathrm{IQR} = x_{0.75} - x_{0.25}$.
The proof of the theorem is based on stochastic approximation and can be found in [18]. A standard choice of the sequence $\{c_t\}_{t=0,1,\dots}$ is $c_t = 1/t$. However, convergence might be extremely slow for certain distributions. Therefore, techniques to choose a suitable sequence $\{c_t\}_{t=0,1,\dots}$, for instance based on an estimation of the probability density function of the sampled random variable, are proposed in [17, 10].
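A direct transcription of Equations (22)-(24) fits in a few lines; the function name and the shift in the default control sequence are our own choices (we start $c_t = 1/t$ at $t = 1$ by using $1/(t+1)$ internally to avoid division by zero):

```python
def sa_quantile(stream, alpha, c=lambda t: 1.0 / (t + 1)):
    """Stochastic approximation estimate of the alpha-quantile after
    Theorem 1: X_0 = xi_0 (Eq. 22), X_{t+1} = X_t + c_t * Y_{t+1} (Eq. 23),
    with the two-valued increment Y of Equation (24)."""
    it = iter(stream)
    x = next(it)                                   # Equation (22)
    for t, xi in enumerate(it):
        y = (alpha - 1.0) if xi < x else alpha     # Equation (24)
        x += c(t) * y                              # Equation (23)
    return x
```

The estimate only moves by $\pm c_t$ times a constant in each step, which makes the role of the control sequence visible: once the $c_t$ are tiny, the estimate is essentially frozen, which is exactly the adaptation problem discussed next.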
Although this technique of incremental quantile estimation has only minimal memory requirements, it has certain disadvantages:

• It is only suitable for continuous random variables.
• Unless the sequence $\{c_t\}_{t=0,1,\dots}$ is well chosen, convergence can be extremely slow.
• When the sampled random variable changes over time, especially when the $c_t$ are already close to zero, the incremental estimate of the quantile will remain almost constant and the change will go unnoticed.

In the following we present an algorithm to overcome these problems.
3.2 Incremental quantile estimation

Here we provide a more general approach which is not limited to continuous random variables. First we describe an algorithm for incremental median estimation, which can be generalised to arbitrary quantiles. Since this algorithm is not very suitable for non-central quantiles, we then modify the approach in such a way that it yields good results for all quantiles.
3.2.1 Incremental median estimation

Before we discuss the general problem of incremental quantile estimation, we first focus on the special case of the median, since we will need the results for the median to develop suitable methods for arbitrary quantiles.

For the incremental computation of the median we store a fixed number of values, a buffer of $m$ sorted data values $a_1,\dots,a_m$, in the ideal case the $\frac{m}{2}$ closest values left and the $\frac{m}{2}$ closest values right of the median, so that the interval $[a_1, a_m]$ contains the median. We also need two counters $L$ and $R$ to store the number of values outside the interval $[a_1, a_m]$, counting the values left and right of the interval separately. Initially, $L$ and $R$ are set to zero.
The algorithm works as follows. The first $m$ data points $x_1,\dots,x_m$ are used to fill the buffer. They are entered into the buffer in increasing order, i.e. $a_i = x_{[i]}$ where $x_{[1]} \le \dots \le x_{[m]}$ are the sorted values $x_1,\dots,x_m$. After the buffer is filled, the algorithm handles the incoming values $x_t$ in the following way:

1. If $x_t < a_1$, i.e. the new value lies left of the interval supposed to contain the median, then $L_{\mathrm{new}} := L_{\mathrm{old}} + 1$.
2. If $x_t > a_m$, i.e. the new value lies right of the interval supposed to contain the median, then $R_{\mathrm{new}} := R_{\mathrm{old}} + 1$.
3. If $a_i \le x_t \le a_{i+1}$ ($1 \le i < m$), $x_t$ is entered into the buffer at position $a_i$ or $a_{i+1}$. Of course, the other values have to be shifted accordingly, and the old left bound $a_1$ or the old right bound $a_m$ will be dropped. Since in the ideal case the median is the value in the middle of the buffer, the algorithm tries to achieve this by balancing the number of values left and right of the interval $[a_1, a_m]$. Therefore, the following rule is applied:
   a. If $L < R$, then remove $a_1$, increase $L$, i.e. $L_{\mathrm{new}} := L_{\mathrm{old}} + 1$, shift the values $a_2,\dots,a_i$ one position to the left and enter $x_t$ in $a_i$.
   b. Otherwise remove $a_m$, increase $R$, i.e. $R_{\mathrm{new}} := R_{\mathrm{old}} + 1$, shift the values $a_{i+1},\dots,a_{m-1}$ one position to the right and enter $x_t$ in $a_{i+1}$.
In each step, the median $\hat{q}_{0.5}$ can easily be calculated from the values in the buffer and the counters $L$ and $R$. With $n = L + m + R$ denoting the total number of values,

$$\hat{q}_{0.5} = \begin{cases} a_{\frac{n+1}{2} - L} & \text{if } n \text{ is odd}, \\[4pt] \dfrac{a_{\frac{n}{2} - L} + a_{\frac{n}{2} + 1 - L}}{2} & \text{if } n \text{ is even}. \end{cases} \qquad (25)$$

It should be noted that it can happen that at least one of these buffer indices is not within the bounds $1,\dots,m$ of the buffer, so that the computation of the median fails. The interval length $a_m - a_1$ can only decrease, and at least for continuous distributions $X$ with probability density function $f_X(q_{0.5}) > 0$, where $q_{0.5}$ is the true median of $X$, it will tend to zero with increasing sample size. In an ideal situation, the buffer of $m$ stored values contains exactly the values in the middle of the sample. Here we assume that at this point in time the sample consists of $m + t$ values.
Table 1 A small example data set

t    | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9
data | 3.8 | 5.2 | 6.1 | 4.2 | 7.5 | 6.3 | 5.4 | 5.9 | 3.9
Table 2 illustrates how this algorithm works with an extremely small buffer of size $m = 4$, based on the data set given in Table 1.
In the following we generalise and modify the incremental median algorithm presented above and analyse the algorithm in more detail.
Table 2 The development of the buffer and the two counters for the small example data set in Table 1

t | L | a1  a2  a3  a4  | R
4 | 0 | 3.8 4.2 5.2 6.1 | 0
5 | 0 | 3.8 4.2 5.2 6.1 | 1
6 | 0 | 3.8 4.2 5.2 6.1 | 2
7 | 1 | 4.2 5.2 5.4 6.1 | 2
8 | 2 | 5.2 5.4 5.9 6.1 | 2
9 | 3 | 5.2 5.4 5.9 6.1 | 2
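The buffer-based procedure above can be sketched in Python as follows (class and method names are our own; the sorted buffer is kept with the standard `bisect` module):

```python
import bisect

class IncrementalMedian:
    """Buffer-based incremental median of Section 3.2.1: a sorted buffer of m
    values plus counters L and R for values discarded left and right of it."""

    def __init__(self, m):
        self.m = m
        self.buf = []   # a_1,...,a_m, kept sorted
        self.L = 0
        self.R = 0

    def update(self, x):
        if len(self.buf) < self.m:       # initial fill with the first m values
            bisect.insort(self.buf, x)
        elif x < self.buf[0]:            # case 1: left of [a_1, a_m]
            self.L += 1
        elif x > self.buf[-1]:           # case 2: right of [a_1, a_m]
            self.R += 1
        elif self.L < self.R:            # case 3a: drop a_1, balance counters
            self.buf.pop(0)
            self.L += 1
            bisect.insort(self.buf, x)
        else:                            # case 3b: drop a_m
            self.buf.pop()
            self.R += 1
            bisect.insort(self.buf, x)

    def median(self):
        # Equation (25); raises IndexError when the buffer has "failed".
        n = self.L + len(self.buf) + self.R
        if n % 2 == 1:
            return self.buf[(n + 1) // 2 - 1 - self.L]
        i = n // 2 - 1 - self.L
        return (self.buf[i] + self.buf[i + 1]) / 2.0
```

Fed with the nine values of Table 1 and $m = 4$, the object ends in the state of the last row of Table 2 ($L = 3$, buffer 5.2, 5.4, 5.9, 6.1, $R = 2$) and returns the exact median 5.4.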
3.2.2 An ad hoc algorithm

The algorithm for incremental median estimation can be generalised to arbitrary quantiles in a straightforward manner. For the incremental $q$-quantile estimation ($0 < q < 1$) only case 3 requires a modification. Instead of trying to get the same values for the counters $L$ and $R$, we now try to balance the counters in such a way that $qR \approx (1-q)L$ holds. This means that step 3a is applied if $L < (1-q)\,t$ holds, otherwise step 3b is carried out, where $t$ is the number of data values sampled after the buffer of length $m$ has been filled.

Therefore, in the ideal case, when we achieve this balance, a proportion of $q$ of the data points lies left and a proportion of $(1-q)$ lies right of the interval defined by the buffer of length $m$.
Now we are interested in the properties of the incremental quantile estimator presented above. Since we are simply selecting the $k$th order statistic of the sample, at least for continuous random variables and larger sample sizes we can provide an asymptotic distribution of the order statistic and therefore of the estimator.

Assume the sample comes from a continuous random variable $X$ and we are interested in an estimation of the $q$-quantile $x_q$. Assume furthermore that the probability density function $f_X$ is continuous and positive at $x_q$. Let $\xi^t_k$ ($k = \lfloor tq \rfloor + 1$) denote the $k$th order statistic of an i.i.d. sample of size $t$. Then $\xi^t_k$ has an asymptotic normal distribution [7]:

$$N\!\left(x_q;\ \frac{q\,(1-q)}{t\,f_X^2(x_q)}\right). \qquad (26)$$

From Equation (26) we can obtain valuable information about the quantile estimator. In order to have a more efficient and reliable estimator, we want the variance in (26) to be as small as possible. Under the assumption that we know the data distribution, we can compute the variance of $\xi^t_k$.

Let $X$ be a random variable following a standard normal distribution and assume we have a sample $x_1,\dots,x_t$ of $X$, i.e. these values are realisations of the i.i.d. random variables $X_i \sim N(0,1)$. We are interested in the median of $X$. According to Equation (26), the sample median $\xi^t_{\lfloor 0.5t \rfloor + 1}$ asymptotically follows a normal distribution:

$$\xi^t_{\lfloor 0.5t \rfloor + 1} \sim N\!\left(0;\ \frac{\pi}{2t}\right). \qquad (27)$$
Figure 3 shows the variance of the order statistic $\xi^t_{\lfloor 0.5t \rfloor + 1}$ as a function of $t$ when the chosen quantile is $q = 0.5$, i.e. the median, and the original distribution from which the sample comes is a standard normal distribution $N(0;1)$. The second curve in the figure corresponds to the variance of the sample mean.

Fig. 3 Variances (from bottom to top) of $\bar{X}$ and $\xi^t_k$ under the assumption of a standard normal distribution of $X$

The variance of the sample mean $\bar{X}$ is only slightly better than that of the order statistic $\xi^t_{\lfloor 0.5t \rfloor + 1}$; nevertheless, we should keep in mind the asymptotic character of the distribution (26).
Furthermore, from Equation (26) we obtain another nice property of the incremental quantile estimator: it is an asymptotically unbiased estimator of sample quantiles. It is even a consistent estimator.

Unfortunately, as was shown in [25], the probability for the algorithm to fail is much smaller for the estimation of the median than for arbitrary quantiles. Therefore, despite the nice properties of this estimator, this simple generalisation of the incremental median estimation algorithm to arbitrary quantiles is not very useful in practice. In order to amend this problem, we provide a modified algorithm based on presampling.
3.2.3 Incremental quantile estimation with presampling: iQPres

Here we introduce the algorithm iQPres (incremental quantile estimation with presampling) [25]. As already mentioned above, the failure probability for the incremental quantile estimation algorithm in Subsection 3.2.2 is lower for the median than for extreme quantiles. Therefore, to minimise the failure probability we introduce an incremental quantile estimation algorithm with presampling.

Assume we want to estimate the $q$-quantile. We presample $n$ values and simply take the $l$th smallest value $x_{(l)}$ from the presample for some fixed $l \in \{1,\dots,n\}$. At the moment, $l$ does not even have to be related to the $q$-quantile. The probability
that $x_{(l)}$ is larger than the $q$-quantile of interest is

$$p_l = \sum_{i=0}^{l-1} \binom{n}{i}\, q^i\, (1-q)^{n-i}. \qquad (28)$$

So when we apply presampling in this way, we obtain the new (presampled) distribution of the order statistic $\xi^n_l$. From Equation (28) we can immediately see that the $(1-p_l)$-quantile of $\xi^n_l$ is the same as the $q$-quantile of $X$. Therefore, instead of estimating the $q$-quantile of $X$, we estimate the $(1-p_l)$-quantile of $\xi^n_l$. Of course, this is only helpful when $l$ is chosen in such a way that the failure probabilities for the $(1-p_l)$-quantile are significantly lower than the failure probabilities for the $q$-quantile. In order to achieve this, $l$ should be chosen in such a way that $(1-p_l)$ is as close to 0.5 as possible.
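The selection of $l$ only involves binomial probabilities; the following sketch (helper names are ours, and the reading of $p_l$ as the probability that the $l$th smallest of $n$ presampled values exceeds the $q$-quantile is our interpretation of Equation (28)) picks the $l$ that moves the transformed quantile level closest to the median:

```python
from math import comb

def binom_cdf(k, n, q):
    """P(Bin(n, q) <= k)."""
    return sum(comb(n, i) * q ** i * (1 - q) ** (n - i) for i in range(k + 1))

def choose_l(n, q):
    """Choose l in {1,...,n} such that 1 - p_l is as close to 0.5 as possible,
    where p_l = P(x_(l) exceeds the q-quantile) = P(Bin(n, q) <= l - 1)."""
    return min(range(1, n + 1),
               key=lambda l: abs(0.5 - (1.0 - binom_cdf(l - 1, n, q))))
```

For example, with a presample of $n = 10$ values and the lower quartile $q = 0.25$, this selection yields $l = 3$, turning the 0.25-quantile of $X$ into a roughly central quantile of the order statistic $\xi^n_l$.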
We want to estimate the $q$-quantile ($0 < q < 1$). Fix the parameters $m$, $l$, $n$ (for an optimal choice see [25]).

1. Presampling: $n$ succeeding values are stored in increasing order in a buffer $b_n$ of length $n$. Then we select the $l$th element in the buffer. The buffer is emptied afterwards for the next presample of $n$ values.
2. Estimation of the $(1-p_l)$-quantile based on the $l$th elements of the presampling buffer: this is carried out according to the algorithm described in Subsection 3.2.2.

The quantile is then estimated in the usual way, i.e.

$$k = \left\lfloor (m + L + R)\,(1 - p_l) - L + 0.5 \right\rfloor,$$
$$r = (m + L + R)\,(1 - p_l) - L + 0.5 - k,$$
$$\hat{q} = (1 - r)\cdot a_k + r\cdot a_{k+1} \quad \text{(quantile estimator)}.$$

Of course, this only works when the algorithm has not failed, i.e. the corresponding index $k$ lies within the buffer of $m$ values.
3.3 Experimental results

In this section we present an experimental evaluation of the algorithm iQPres and of the algorithm described in Section 3.1. The evaluation is based on artificial data sets.

First, we consider estimations of the lower and upper quartile as well as the median for different distributions:

• an exponential distribution with parameter $\lambda = 4$ (Exp(4)),
• a standard normal distribution ($N(0;1)$),
• a uniform distribution on the unit interval ($U(0,1)$),
• an asymmetric bimodal distribution given by a Gaussian mixture model (GM) of two normal distributions, with cumulative distribution function $F(x) = 0.3\cdot F_{N(-3;1)}(x) + 0.7\cdot F_{N(1;1)}(x)$, where $F_{N(\mu;\sigma^2)}$ denotes the cumulative distribution function of the normal distribution with expected value $\mu$ and variance $\sigma^2$. Its probability density function is shown in Figure 4.
Fig. 4 An example for an asymmetric, bimodal probability density function
The quantile estimations were carried out for samples of size 10000 that were generated from these distributions. We repeated each estimation 1000 times. Tables 3-5 show the average over all estimations for our algorithm (iQPres with a memory size of $M = 150$) and for the technique based on Theorem 1, where we used the control sequence $c_t = 1/t$. The mean squared error over the 1000 repeated runs is also shown in the tables.
Table 3 Estimation of the lower quartile q = 0.25

Distr.   True quantile   iQPres      Equation 23   MSE (iQPres)   MSE (Equation 23)
Exp(4)   1.150728        1.152182    1.718059      2.130621E-5    2.675568
N(0;1)   -0.674490       -0.672235   -0.678989     5.611009E-6    0.008013
U(0,1)   0.250000        0.250885    0.250845      1.541123E-6    4.191695E-5
GM       -2.043442       -2.042703   0.185340      1.087618E-5    5.331730
For the uniform distribution, incremental quantile estimation based on equation (23) and iQPres lead to very similar and good results. For the normal distribution, both algorithms yield quite good results, but iQPres seems to be slightly more efficient with a smaller mean squared error. For the bimodal distribution based on the
Table 4 Estimation of the median q = 0.5

Distr.   True quantile   iQPres      Equation 23   MSE (iQPres)   MSE (Equation 23)
Exp(4)   2.772589        2.7462635   5.775925      7.485865E-4    10.906919
N(0;1)   0.000000        6.8324E-4   0.047590      1.786715E-5    0.009726
U(0,1)   0.500000        0.495781    0.499955      1.779917E-5    2.529276E-6
GM       0.434425        0.434396    0.117499      2.365156E-6    0.451943
Table 5 Estimation of the upper quartile q = 0.75

Distr.   True quantile   iQPres      Equation 23   MSE (iQPres)   MSE (Equation 23)
Exp(4)   5.545177        5.554385    5.062660      1.054132E-4    0.919735
N(0;1)   0.674490        0.674840    0.656452      3.600748E-7    0.003732
U(0,1)   0.750000        0.750883    0.749919      8.443136E-7    2.068730E-5
GM       1.366114        1.366838    0.027163      1.193377E-6    2.207112
Gaussian mixture model and a skewed distribution such as the exponential distribution, the estimations of the algorithm based on equation (23) are more or less useless, at least when no specific effort is invested to find an optimal control sequence {c_t}_{t=0,1,...}. iQPres does not have any problems with these distributions. As already mentioned before, iQPres also does not require the sampling distribution to be continuous, whereas this is a necessary assumption for the technique based on equation (23).
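Equation (23) itself is not reproduced in this excerpt; a standard stochastic-approximation recursion of the following form, with control sequence c_t = 1/t, illustrates the kind of update such techniques use. This is a sketch under that assumption, not necessarily the exact recursion of Section 3.1.

```python
def sa_quantile(xs, q, q0=0.0):
    """Stochastic-approximation estimate of the q-quantile with
    control sequence c_t = 1/t: the estimate is nudged up by c_t*q
    when the observation lies above it, and down by c_t*(1-q)
    otherwise."""
    est = q0
    for t, x in enumerate(xs, start=1):
        c_t = 1.0 / t
        est += c_t * (q - (1.0 if x <= est else 0.0))
    return est
```

With data from a uniform distribution on [0,1], the median estimate settles near 0.5; for skewed or multimodal data, the quality of such estimates depends strongly on the control sequence, which is exactly the weakness discussed above.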
4 Hypothesis tests and change detection
In this section we demonstrate how hypothesis testing can be adapted to an incremental computation scheme for the cases of the χ²-test and the t-test. Moreover, we discuss the problem of non-stationary data and explain various change detection strategies with the main focus on the use of statistical tests.
4.1 Incremental hypothesis tests
Statistical tests are methods to check the validity of hypotheses about distributions or properties of distributions of random variables. Since statistical tests rely on samples, they cannot definitely verify or falsify a hypothesis. They can only provide probabilistic information supporting or rejecting the hypothesis under consideration.
Statistical tests usually consider a null hypothesis H_0 and an alternative hypothesis H_1. The hypotheses may concern parameters of a given class of distributions, for instance the unknown expected value and variance of a normal distribution. Such tests are called parameter tests. In such cases, the a priori assumption is that the data
definitely originate from a normal distribution; only the parameters are unknown. In contrast to parameter tests, non-parametric tests concern more general hypotheses, for example whether it is reasonable at all to assume that the data come from a normal distribution.
The error probability that the test will erroneously reject the null hypothesis, given the null hypothesis is true, is used as an indicator of the reliability of the test. Sometimes a so-called p-value is used. The p-value is the smallest error probability that can be admitted so that the test will still reject the null hypothesis for a given sample. Therefore, a low p-value is a good indicator for rejecting the null hypothesis. Usually, the acceptable error probability α (α-error) should be specified in advance, before the test is carried out. The smaller α is chosen, the more reliable is the test when the outcome is to reject the null hypothesis. However, when α is chosen too small, the test will tend not to reject the null hypothesis, even though the sample might not speak in favour of it.
Some hypothesis tests can be applied to data streams, since they can be calculated in an incremental fashion. In this section we discuss the incremental adaptation of two statistical tests, the χ²-test and the t-test. Note that the application of hypothesis tests to data streams, using incremental computation or window techniques, requires the repeated execution of the test. This can cause the problem of multiple testing, which is described later in this section.
4.1.1 χ²-test
The χ²-test has various applications. The principal idea of the χ²-test is the comparison of two distributions. One can check whether two samples come from the same distribution, whether a single sample follows a given distribution, or whether two variables are independent.
Example 1. A die is thrown 120 times and the observed frequencies are as follows: 1 is obtained 30 times, 2 is obtained 25 times, 3 is obtained 18 times, 4 is obtained 10 times, 5 is obtained 22 times and 6 is obtained 15 times. We are interested in the question whether the die is fair or not.
The null hypothesis H_0 for the χ²-test claims that the data follow a certain (cumulative) probability distribution F(x). The distribution of the null hypothesis is then compared to the distribution of the data. The null hypothesis can for instance be a given distribution, e.g. a uniform or a normal distribution, and the χ²-test can give an indication whether the data strongly deviate from this expected distribution. For an independence test for two variables, the joint distribution of the sample is compared to the product of the marginal distributions. If these distributions differ significantly, this is an indication that the variables might not be independent.
The main idea of the χ²-test is to determine how well the observed frequencies fit the theoretical/expected frequencies specified by the null hypothesis. Therefore, the χ²-test is appropriate for data from categorical or nominally scaled random variables. In order to apply the test to continuous numeric data, the data domain should be partitioned into r categories first.
First we discuss the χ² goodness-of-fit test. Here we assume to know from which distribution the data come. Then the hypotheses H_0 and H_1 can be stated as follows:

H_0: The sample comes from the distribution F_X.
H_1: The sample does not come from the distribution F_X.

Therefore the problem from Example 1 can be solved with the help of the χ² goodness-of-fit test. Consequently, the hypotheses H_0 and H_1 are chosen as follows:

H_0: P(X = 1) = p_1 = 1/6, ..., P(X = 6) = p_6 = 1/6
H_1: P(X = i) ≠ 1/6 for at least one value i ∈ {1,...,6}
Let X_1,...,X_n be i.i.d. continuous random variables and x_1,...,x_n the observations from these random variables. Then the test statistic is computed as follows:

χ² = ∑_{i=1}^{r} (O_i − E_i)² / E_i    (29)

where the O_i are the observed frequencies and the E_i are the expected frequencies. Since we are dealing with continuous random variables, we have to carry out a discretisation of the data domain in order to compute the observed and expected frequencies. Let F_X(x) be the assumed cumulative distribution function. The x-axis has to be split into r pairwise disjoint sets or bins S_i. Then the expected frequency in bin S_i is given by

E_i = n · (F_X(a_{i+1}) − F_X(a_i))    (30)

where [a_i, a_{i+1}) is the interval corresponding to bin S_i. Furthermore, for the observed frequencies we obtain

O_i = ∑_{x_k ∈ S_i} 1.    (31)

O_i is therefore the number of observations in the i-th interval.
The statistic (29) has an approximate χ²-distribution with (r − 1) degrees of freedom under the following assumptions: First, the observations are independent from each other. Secondly, the categories – the bins S_i – are mutually exclusive and exhaustive. This means that no category may have an expected frequency of zero, i.e. ∀ i ∈ {1,...,r}: E_i > 0. Furthermore, no more than 20% of the categories should have an expected frequency less than five. If this is not the case, categories should be merged or redefined. Note that this might also lead to a different number of degrees of freedom.
Therefore, the hypothesis H_0 that the sample comes from the particular distribution F_X is rejected if

∑_{i=1}^{r} (O_i − E_i)² / E_i > χ²_{1−α}    (32)

where χ²_{1−α} is the (1 − α)-quantile of the χ²-distribution with (r − 1) degrees of freedom.
Table 6 summarizes the observed and expected frequencies and the computations for Example 1. All E_i are greater than zero, in fact all are greater than five. Therefore, there is no need to combine categories.

Table 6 Example 1

number i on the die   E_i   O_i   (O_i − E_i)²/E_i
1                     20    30    5
2                     20    25    1.25
3                     20    18    0.2
4                     20    10    5
5                     20    22    0.2
6                     20    15    1.25

The test statistic is computed as follows:

∑_{i=1}^{r} (O_i − E_i)² / E_i = 5 + 1.25 + 0.2 + 5 + 0.2 + 1.25 = 12.9    (33)
The obtained result χ² = 12.9 has to be compared with the (1 − α)-quantile of the χ²-distribution; for that purpose see a table of the χ²-distribution ([7]). The corresponding degrees of freedom are computed as explained above: (r − 1) = (6 − 1) = 5. For α = 0.05 the tabulated critical value for 5 degrees of freedom is χ²_{0.95} = 11.07, which is smaller than the computed test statistic. Therefore the null hypothesis is rejected at the 0.05 significance level. For significance level 0.02 the critical value is χ²_{0.98} = 13.388, and therefore the null hypothesis cannot be rejected at this level. This result can be summarized as follows: with χ² = 12.9 and 5 degrees of freedom, the null hypothesis can be rejected for all significance levels bigger than 0.024. This indicates that the die is unfair.
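The computation for Example 1 is easy to reproduce; the sketch below recomputes the statistic of equation (33) (the critical value 11.07 is taken from the text above, not computed here).

```python
def chi2_statistic(observed, expected):
    """Chi-square goodness-of-fit statistic, equation (29)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Die example: 120 throws, expected frequency 120/6 = 20 per face.
observed = [30, 25, 18, 10, 22, 15]
expected = [120 / 6] * 6
stat = chi2_statistic(observed, expected)   # 12.9, as in equation (33)
# 12.9 > 11.07 (the tabulated 0.95-quantile for 5 degrees of freedom),
# so H0 is rejected at the 0.05 significance level.
```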
In order to adapt the χ² goodness-of-fit test to incremental calculation, the observed frequencies should be computed in an incremental fashion:

O_i^{(t)} = O_i^{(t−1)} + 1  if x_t ∈ S_i,
O_i^{(t)} = O_i^{(t−1)}      otherwise.    (34)

The expected frequencies should also be recalculated corresponding to the increasing number of observations:

E_i^{(t)} = E_i^{(t−1)} · t/(t − 1).    (35)
Another very common test is the χ² independence test. This test evaluates the general hypothesis that two variables are statistically independent from each other.
Let X and Y be two random variables and (x_1, y_1),...,(x_n, y_n) the observed values of these variables. For continuous random variables the data domains should be partitioned into r and q categories, respectively. Therefore the observed values of X can be assigned to one of the categories S^X_1,...,S^X_r and the observed values of Y to one of the categories S^Y_1,...,S^Y_q. Then O_{ij} is the frequency of occurrence of observations (x_k, y_k) with x_k ∈ S^X_i and y_k ∈ S^Y_j. Furthermore,
O_{i•} = ∑_{j=1}^{q} O_{ij}    (36)

and

O_{•j} = ∑_{i=1}^{r} O_{ij}    (37)

denote the marginal observed frequencies.
Table 7 illustrates the observed absolute frequencies. The total number of observations in the table is n. The notation O_{ij} represents the number of observations in the cell with index ij (i-th row and j-th column), O_{i•} the number of observations in the i-th row and O_{•j} the number of observations in the j-th column. Such a table is called a contingency table.
Table 7 Contingency table

X \ Y          S^Y_1 ... S^Y_j ... S^Y_q   marginal of X
S^X_1          O_11  ... O_1j  ... O_1q    O_1•
...            ...       ...       ...     ...
S^X_i          O_i1  ... O_ij  ... O_iq    O_i•
...            ...       ...       ...     ...
S^X_r          O_r1  ... O_rj  ... O_rq    O_r•
marginal of Y  O_•1  ... O_•j  ... O_•q    n
It is assumed that the random variables X and Y are statistically independent. Let p_{ij} be the probability of being in the i-th category of the domain of X and the j-th category of the domain of Y; p_{i•} and p_{•j} are the corresponding marginal probabilities. Then, corresponding to the assumption of independence, for each pair

p_{ij} = p_{i•} · p_{•j}    (38)

holds. Equation (38) defines statistical independence. Therefore the null and the alternative hypothesis are as follows:

H_0: p_{ij} = p_{i•} · p_{•j} for all i and j
H_1: p_{ij} ≠ p_{i•} · p_{•j} for at least one pair (i, j)

Thus, if X and Y are independent, then the expected absolute frequencies are given by

E_{ij} = O_{i•} · O_{•j} / n.    (39)
The test statistic, again checking the observed frequencies against the expected frequencies under the null hypothesis, is as follows:

χ² = ∑_{i=1}^{r} ∑_{j=1}^{q} (O_{ij} − E_{ij})² / E_{ij}    (40)
The test statistic has an approximate χ²-distribution with (r − 1)(q − 1) degrees of freedom. Consequently, the hypothesis H_0 that X and Y are independent can be rejected if

∑_{i=1}^{r} ∑_{j=1}^{q} (O_{ij} − E_{ij})² / E_{ij} ≥ χ²_{1−α}    (41)

where χ²_{1−α} is the (1 − α)-quantile of the χ²-distribution with (r − 1)(q − 1) degrees of freedom.
For the incremental computation of O_{i•}, O_{•j} and O_{ij}, corresponding formulae must be developed. For time point t and the new observed values (x_t, y_t) the incremental formulae are given by

O_{i•}^{(t)} = O_{i•}^{(t−1)} + 1  if x_t ∈ S^X_i,  O_{i•}^{(t)} = O_{i•}^{(t−1)} otherwise.    (42)

O_{•j}^{(t)} = O_{•j}^{(t−1)} + 1  if y_t ∈ S^Y_j,  O_{•j}^{(t)} = O_{•j}^{(t−1)} otherwise.    (43)

O_{ij}^{(t)} = O_{ij}^{(t−1)} + 1  if x_t ∈ S^X_i ∧ y_t ∈ S^Y_j,  O_{ij}^{(t)} = O_{ij}^{(t−1)} otherwise.    (44)
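A sketch of the updates (42)-(44) in Python; the class layout and the index functions bin_x and bin_y, which map raw values to category indices, are our own illustration.

```python
class IncrementalContingency:
    """Incremental contingency-table counts for the chi-square
    independence test, following updates (42)-(44)."""

    def __init__(self, r, q, bin_x, bin_y):
        self.O = [[0] * q for _ in range(r)]    # cell counts O_ij
        self.row = [0] * r                      # marginals O_i.
        self.col = [0] * q                      # marginals O_.j
        self.n = 0
        self.bin_x, self.bin_y = bin_x, bin_y

    def update(self, x, y):
        i, j = self.bin_x(x), self.bin_y(y)
        self.O[i][j] += 1                       # update (44)
        self.row[i] += 1                        # update (42)
        self.col[j] += 1                        # update (43)
        self.n += 1

    def statistic(self):
        stat = 0.0
        for i in range(len(self.row)):
            for j in range(len(self.col)):
                e = self.row[i] * self.col[j] / self.n   # E_ij, equation (39)
                if e > 0:
                    stat += (self.O[i][j] - e) ** 2 / e
        return stat
```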
The χ² goodness-of-fit test can be extended to a χ² homogeneity test ([22]). Whereas the χ² goodness-of-fit test can be used only for a single sample, the χ² homogeneity test is used to compare whether two or more samples come from the same population.
Let X_1,...,X_m (m ≥ 2) be discrete random variables, or continuous random variables discretised into r categories S_1,...,S_r. The data for each of the m samples from the random variables X_1,...,X_m (overall n values) are entered into a contingency table. This table is similar to the one for the χ² independence test.
Table 8 Contingency table

values \ variables  X_1  ... X_j  ... X_m   ∑
S_1                 O_11 ... O_1j ... O_1m  O_1•
...                 ...      ...      ...   ...
S_i                 O_i1 ... O_ij ... O_im  O_i•
...                 ...      ...      ...   ...
S_r                 O_r1 ... O_rj ... O_rm  O_r•
∑                   O_•1 ... O_•j ... O_•m  n
The samples are represented by the columns and the categories by the rows of Table 8. We assume that each of the samples is randomly drawn from the same distribution. The χ² homogeneity test checks whether the m samples are homogeneous with respect to the observed frequencies. If the hypothesis H_0 is true, the expected frequency in the i-th category will be the same for all of the m random variables. Therefore, the null and the alternative hypothesis can be stated as follows:

H_0: p_{ij} = p_{i•} · p_{•j} for all i and j
H_1: p_{ij} ≠ p_{i•} · p_{•j} for at least one pair (i, j).

From H_0 it follows that the rows are independent of the columns. Therefore, the computation of an expected frequency can be summarized by

E_{ij} = O_{i•} · O_{•j} / n.    (45)

Although the χ² independence test and the χ² homogeneity test evaluate different hypotheses, they are computed identically. Therefore, the incremental adaptation of the χ² independence test can also be applied to the χ² homogeneity test.
In the case of two samples, commonly the Kolmogorov-Smirnov test is used, since it is an exact test and, in contrast to the χ²-test, can be applied directly without previous discretisation of continuous distributions. However, the Kolmogorov-Smirnov test does not have any obvious incremental calculation scheme. It is described in Section 4.2.2.
4.1.2 The t-test
The next hypothesis test for which we want to provide incremental computation is the t-test. Different kinds of t-tests are in use. We restrict our considerations to the one-sample t-test and the t-test for two independent samples with equal variance.
The one-sample t-test evaluates whether a sample with a particular mean could be drawn from a population with known expected value μ_0. Let X_1,...,X_n be i.i.d. with X_i ∼ N(μ; σ²) and unknown variance σ². The null and the alternative hypothesis for the two-sided test are:

H_0: μ = μ_0, the sample comes from the normal distribution with expected value μ_0.
H_1: μ ≠ μ_0, the sample comes from a normal distribution with an expected value differing from μ_0.

The test statistic is given by

T = √n (X̄ − μ_0)/S    (46)

where X̄ is the sample mean and S the sample standard deviation. The statistic (46) is t-distributed with (n − 1) degrees of freedom. H_0 is rejected if

t < −t_{1−α/2}  or  t > t_{1−α/2}    (47)
where t_{1−α/2} is the (1 − α/2)-quantile of the t-distribution with (n − 1) degrees of freedom and t is the computed value of the test statistic (46), i.e. t = √n (x̄ − μ_0)/s.
One-sided tests are given by the following null and alternative hypotheses:

H_0: μ ≤ μ_0 and H_1: μ > μ_0. H_0 is rejected if t > t_{1−α}.
H_0: μ ≥ μ_0 and H_1: μ < μ_0. H_0 is rejected if t < −t_{1−α}.
This test can be very easily adapted to incremental computation. For this purpose the sample mean and the sample variance have to be updated as in Equations (2) and (6), respectively, as described in Section 2. Note that the degrees of freedom of the t-distribution have to be updated in each step as well:

t_{n+1} = √(n+1) (x̄_{n+1} − μ_0)/s_{n+1}    (48)

Unlike in the previous notation, we use here n + 1 for the time point, since the letter t is already used for the computed test statistic. Furthermore, as mentioned above, the (1 − α/2)-quantile of the t-distribution with n degrees of freedom should be used to evaluate the null hypothesis. However, for n ≥ 30 the quantiles of the standard normal distribution can be used as an approximation of the quantiles of the t-distribution.
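Equations (2) and (6) for the incremental mean and variance are not reproduced in this excerpt; the sketch below uses the standard Welford recurrences in their place, which serve the same purpose.

```python
import math

class IncrementalTTest:
    """One-sample t statistic maintained incrementally (a sketch):
    mean and variance are updated with Welford-style recurrences,
    standing in for equations (2) and (6) of the chapter."""

    def __init__(self, mu0):
        self.mu0 = mu0
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0            # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (x - self.mean)

    def statistic(self):
        # t = sqrt(n) * (mean - mu0) / s, with n - 1 degrees of freedom
        s = math.sqrt(self.M2 / (self.n - 1))
        return math.sqrt(self.n) * (self.mean - self.mu0) / s
```

After each update, statistic() is compared with the (1 − α/2)-quantile of the t-distribution for the current degrees of freedom.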
The t-test for two independent samples is used to evaluate whether two independent samples come from two normal distributions with the same expected value. The two sample means x̄ and ȳ are used to estimate the expected values μ_X and μ_Y of the underlying distributions. If the result of the test is significant, we assume that the samples come from two normal distributions with different expected values. Furthermore, we assume that the variances of the underlying distributions are unknown.
The t-test is based on the following assumptions:

•The samples are drawn randomly.
•The underlying distribution is a normal distribution.
•The variances of the underlying distributions are equal, i.e. σ²_X = σ²_Y.

Let X_1,...,X_{n_1} be i.i.d. with X_i ∼ N(μ_X; σ²_X) and Y_1,...,Y_{n_2} be i.i.d. with Y_i ∼ N(μ_Y; σ²_Y), with unknown expected values, unknown variances and σ²_X = σ²_Y.
The null and the alternative hypothesis can be defined as follows:

H_0: μ_X = μ_Y, the samples come from the same normal distribution.
H_1: μ_X ≠ μ_Y, the samples come from normal distributions with different expected values.

In this case, a two-sided test is carried out; however, similar to the one-sample t-test, a one-sided test can also be defined.
The test statistic is computed as follows:

T = (X̄ − Ȳ) / √( ((n_1 − 1)S²_X + (n_2 − 1)S²_Y) / (n_1 + n_2 − 2) ) · √( n_1 n_2 / (n_1 + n_2) )    (49)
where S²_X and S²_Y are the unbiased estimators for the variances of X and Y, respectively.
Equation (49) is a general equation for the t-test for two independent samples and can be used in both cases of equal and unequal sample sizes. The statistic (49) has a t-distribution with (n_1 + n_2 − 2) degrees of freedom. Let

t = (x̄ − ȳ) / √( ((n_1 − 1)s²_X + (n_2 − 1)s²_Y) / (n_1 + n_2 − 2) ) · √( n_1 n_2 / (n_1 + n_2) )    (50)

be the computed value of the statistic (49). Then the hypothesis H_0 that the samples come from the same normal distribution is rejected if

t < −t_{1−α/2}  or  t > t_{1−α/2}    (51)

where t_{1−α/2} is the (1 − α/2)-quantile of the t-distribution with (n_1 + n_2 − 2) degrees of freedom.
Similar to the one-sample t-test, the t-test for two independent samples can easily be computed in an incremental fashion, since the sample means and the variances can be calculated in an incremental way. Here the degrees of freedom should also be updated with the new observed values.
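For stored subsamples, the statistic (49) reads as follows in Python; in the streaming setting the means and variances would, as just described, be maintained incrementally instead.

```python
import math

def two_sample_t(xs, ys):
    """Two-sample t statistic with pooled variance, equation (49),
    computed from stored samples for illustration."""
    n1, n2 = len(xs), len(ys)
    mx, my = sum(xs) / n1, sum(ys) / n2
    sx2 = sum((x - mx) ** 2 for x in xs) / (n1 - 1)   # unbiased variance of X
    sy2 = sum((y - my) ** 2 for y in ys) / (n2 - 1)   # unbiased variance of Y
    pooled = ((n1 - 1) * sx2 + (n2 - 1) * sy2) / (n1 + n2 - 2)
    return (mx - my) / math.sqrt(pooled) * math.sqrt(n1 * n2 / (n1 + n2))
```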
4.1.3 Multiple Testing
Multiple testing refers to the simultaneous application of a number of tests. Instead of a single null hypothesis, tests for a set of null hypotheses H_0, H_1,...,H_n are considered. These null hypotheses do not have to exclude each other.
An example for multiple testing is a test whether m random variables X_1,...,X_m are pairwise independent. This means the null hypotheses are H_{1,2},...,H_{1,m},..., H_{m−1,m}, where H_{i,j} states that X_i and X_j are independent.
Multiple testing leads to the undesired effect of cumulating the α-error. The α-error α is the probability to reject the null hypothesis erroneously, given it is true. Choosing α = 0.05 means that in 5% of the cases the null hypothesis would be rejected, although it is true. When k tests are applied to the same sample, then the error probability for each test is α. Under the assumption that the null hypotheses are all true and the tests are independent, the probability that at least one test will reject its null hypothesis erroneously is

P(N ≥ 1) = 1 − P(N = 0)    (52)
         = 1 − (1 − α) · (1 − α) · ... · (1 − α)    (53)
         = 1 − (1 − α)^k    (54)

where N is the number of tests rejecting their null hypothesis.
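Equation (54) is cheap to evaluate and shows how quickly this cumulated error grows:

```python
def familywise_error(alpha, k):
    """Probability that at least one of k independent tests of true
    null hypotheses rejects erroneously, equation (54)."""
    return 1.0 - (1.0 - alpha) ** k
```

For α = 0.05 and k = 10 independent tests this probability already exceeds 0.40.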
A variety of approaches have been proposed to handle the problem of cumulating the α-error. In the following, two common methods will be introduced shortly.
The simplest and most conservative method is the Bonferroni correction [21]. When k null hypotheses are tested simultaneously and α is the desired overall α-error for all tests together, then the corrected α-error for each single test should be chosen as α̃ = α/k. The justification for this correction is the inequality

P(⋃_i A_i) ≤ ∑_i P(A_i).    (55)

For the Bonferroni correction, A_i is the event that the null hypothesis H_i is rejected, although it is true. In this way, the probability that one or more of the tests rejects its corresponding null hypothesis is at most α. In order to guarantee the significance level α, each single test must be carried out with the corrected level α̃.
The Bonferroni correction is a very rough and conservative approximation of the true α-error. One of its disadvantages is that the corrected significance level α̃ becomes very low, so that it becomes almost impossible to reject any of the null hypotheses.
The simple single-step Bonferroni correction has been improved by Holm [12]. The Bonferroni-Holm method is a multi-step procedure in which the necessary corrections are carried out stepwise. This method usually yields larger corrected α-values than the simple Bonferroni correction.
When k hypotheses are tested simultaneously and the overall α-error for all tests is α, for each of the tests the corresponding p-value is computed based on the sample x, and the p-values are sorted in ascending order:

p_[1](x) ≤ p_[2](x) ≤ ... ≤ p_[k](x)    (56)

The null hypotheses H_i are ordered in the same way:

H_[1], H_[2], ..., H_[k]    (57)

In the first step, H_[1] is tested by comparing p_[1] with α/k. If p_[1] > α/k holds, then H_[1] and the other null hypotheses H_[2],...,H_[k] are not rejected. The method terminates in this case. However, if p_[1] ≤ α/k holds, H_[1] is rejected and the next null hypothesis H_[2] is tested by comparing the p-value p_[2] with the corrected α-value α/(k − 1). If p_[2] > α/(k − 1) holds, H_[2] and the remaining null hypotheses H_[3],...,H_[k] are not rejected. If p_[2] ≤ α/(k − 1) holds, H_[2] is rejected and the procedure continues with H_[3] in the same way.
The Bonferroni-Holm method tests the hypotheses in the order of their p-values, starting with H_[1]. The corrected values α/k, α/(k − 1), ..., α are increasing. Therefore, the Bonferroni-Holm method rejects at least those hypotheses that are also rejected by the simple Bonferroni correction, but in general more hypotheses can be rejected.
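The step-down procedure can be sketched compactly; the function returns the indices of rejected hypotheses, and the interface is our own.

```python
def bonferroni_holm(p_values, alpha):
    """Bonferroni-Holm step-down procedure: returns the set of
    indices into p_values whose null hypotheses are rejected."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])   # ascending p-values
    rejected = set()
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (k - step):   # thresholds alpha/k, alpha/(k-1), ...
            rejected.add(idx)
        else:
            break                                 # stop at the first non-rejection
    return rejected
```

For instance, with p-values (0.01, 0.04, 0.03) and α = 0.05, only the first hypothesis is rejected, since 0.03 > 0.05/2 stops the procedure.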
4.2 Change detection strategies
Detecting changes in data streams has become a very important area of research in many application fields, such as stock markets, web activities or sensor measurements, just to name a few. The main problem for change detection in data streams is the limited memory capacity: it is unrealistic to store the full history of the data stream. Therefore, efficient change detection strategies tailored to the data stream should be used. The main requirements for such approaches are low computational costs, fast change detection and high accuracy. Moreover, it is important to distinguish between true changes and false alarms. Abrupt changes as well as slow drift in the data generating process can occur; therefore, a "good" algorithm should be able to detect both kinds of changes.
Various strategies have been proposed to handle this problem, see for instance [11] for a detailed survey of change detection methods. Most of these approaches are based on time window techniques [2, 15]. Furthermore, several approaches for evolving data streams are discussed in [14, 13, 8].
In this section, we introduce two types of change detection strategies: change detection based on incremental computation and based on window techniques. Furthermore, we put the main focus on statistical tests. We assume that we deal with numeric data streams. As already mentioned in the introduction, two types of change are distinguished: concept change and change of the data distribution. We do not differentiate between the two in this work, since the distribution of the target variable changes in both cases.
4.2.1 iQPres for change detection
The incremental quantile estimator iQPres from Section 3.2.3 can be used for change detection [25]. In case the sampling distribution changes, with a drift of the quantile to be estimated as a consequence, such changes will be noticed, since the simple version of iQPres without shifted parallel estimations will fail in the sense that it is not able to balance the counters L and R any more.
In order to illustrate how iQPres can be applied to change detection, we consider daily measurements of the gas production in a waste water treatment plant over a period of more than eight years. The measurements are shown in Figure 5.
iQPres has been applied to this data set to estimate the median with a memory size of M = 30. The optimal choice for the sizes of the buffers for presampling and median estimation is then n = 3 and m = 27, respectively. At the three time points 508, 2604 and 2964, the buffer cannot be balanced anymore, indicating that the median has changed. These three time points are indicated by vertical lines in Figure 5. The arrows indicate whether the median has increased or decreased. An increase corresponds to an unbalanced buffer with the right counter R becoming too large, whereas a decrease leads to an unbalanced buffer with the left counter L becoming too large. The median increases at the first point at 508 from 998 before
Fig. 5 An example of change detection for time series data from a waste water treatment plant
and 1361 after this point. At time point 2604 the median increases to 1406 and drops again to 1193 at time point 2964.
Note that algorithms based on Theorem 1 mentioned in Section 3.1 are not suitable for change detection.
By using iQPres for change detection in the data distribution, we assume that the median of the distribution changes with time. However, if this is not the case and only another parameter, like the variance of the underlying distribution, changes, other strategies for change detection should be used.
4.2.2 Statistical tests for change detection
The theory of hypothesis testing is the main background for change detection. Several algorithms for change detection are based on hypothesis tests.
Hypothesis tests can be applied to change detection in two different ways:

•Change detection through incremental computation of the tests: in this approach the test is computed in an incremental fashion, for instance as explained in Section 4.1. Consequently, a change can be detected if the test starts to yield different results than before.
•Window techniques: in this approach the data stream is divided into time windows. A sliding window can be used as well as non-overlapping windows. In order to detect potential changes, we need either to compare data from an earlier window with data from a newer one, or to test only the new data (for instance, whether the data follow a known or assumed distribution). When the window size is not too large, it is not necessary to be able to compute the tests in an incremental fashion. Therefore, we are not restricted to tests that render themselves to incremental computations, but many other tests can be used. Hybrid approaches combining both techniques are also possible. Of course, window techniques with incremental computations within the window will lead to less memory consumption and faster computations.
We will not give a detailed description of change detection based on incremental computation here, since the principles of these methods are explained in Section 4.1. However, the problem of multiple testing as discussed in Section 4.1 should be taken into account when a test is applied again and again over time. Even if the underlying distribution does not change over time, any test will erroneously reject the null hypothesis of no change in the long run, if we only carry out the test often enough. Different approaches to solve this problem are presented in Section 4.1.3. Another problem of this approach is the "burden of old data": if a large amount of data has been analysed already and the change is not very drastic, it may happen that the change is detected with a large delay, or not detected at all when a very large window is used. On that account it may be useful to reinitialise the test from time to time.
To detect changes with a window technique, we need to compare two samples of data and have to decide whether the hypothesis H_0 that they come from the same distribution is true.
First we present a general meta-algorithm for change detection based on a window technique, without any specific fixed test. This algorithm is shown in Figure 6. The constant step specifies after how many new values the change detection should be checked again.
1   Initialise window W, i = 0
2   for each new value x_t do
3       if i < step then
4           W ← W ∪ {x_t}      (i.e., add x_t to W)
5           W ← W \ {w_0}      (i.e., remove the oldest element from W)
6           i = i + 1
7           if i = step then
8               i = 0
9               split W into W_0 and W_1
10              test W_0 and W_1 for change
11              if change detected then
12                  report change at time t
13              end if
14          end if
15      end if
16  end for

Fig. 6 General scheme of a change detection algorithm based on time windows and statistical tests
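The scheme of Figure 6 can be sketched in Python as follows; the two-sample decision function test is a placeholder, here instantiated with a toy mean-shift check rather than one of the statistical tests of Section 4.1.

```python
from collections import deque

def detect_changes(stream, window_size, step, test):
    """Sliding-window change detection following Figure 6: every
    `step` new values, the full window is split in half and
    test(W0, W1) decides whether a change is reported."""
    window = deque(maxlen=window_size)   # appending beyond maxlen drops the oldest value
    changes = []
    i = 0
    for t, x in enumerate(stream):
        window.append(x)
        i += 1
        if i == step:
            i = 0
            if len(window) == window_size:      # only test once the window is full
                half = window_size // 2
                values = list(window)
                if test(values[:half], values[half:]):
                    changes.append(t)
    return changes

def mean_shift_test(w0, w1):
    """Toy two-sample decision function (an assumption for
    illustration): report a change when the subwindow means
    differ by more than 0.5."""
    return abs(sum(w0) / len(w0) - sum(w1) / len(w1)) > 0.5
```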
This approach follows a simple idea: when the data from two subwindows of W are judged as "distinct enough", a change is detected. Here "distinct enough" is specified by the selected statistical test for distribution change. In general, we assume a splitting of W into two subwindows of equal size. Nevertheless, any "valid" splitting can be used. Valid is meant in terms of the amount of data that is needed for the test to be reliable.
However, with a badly selected cut point the change may be detected with a large delay, as Figure 7 shows. The rightmost part indicates a change in the data stream.

Fig. 7 Subwindows problem

As the change occurs almost at the end of the subwindow W_1, it is most likely that the change remains undetected at first. Of course, since the window is moved forward as new data points arrive, at some point the change will be detected, but it may be of essential interest to detect the change as early as possible.
To solve this problem, we modify the algorithm in Figure 6 in the following way: instead of splitting the window W only once, the splitting is carried out several times. Figure 8 shows the modified part of the algorithm in Figure 6, starting at step 9.

9   for each valid split W = W_0 ∪ W_1 do
10      test W_0 and W_1 for change
11      if change detected then
12          report change at time t
13      end if
14  end for

Fig. 8 Modification of the algorithm for change detection to avoid the subwindows problem
Fig. 8 Modiﬁcation of the algorithm for change detection to avoid the subwindows problem
How many times the window should be split, should be decided based on the
required performance and precision of the algorithm. We can run the test for each
sufﬁciently large subwindow ofW, although the performance of the algorithm will
decrease, or we can carry out ﬁxed number of splits. Note that also for the win
dows technique based approach, attention should be paid to the problem of multiple
testing (see Section 4.1.3). Furthermore, we do not specify here the effect of the
detected change. The question whether the window should be reinitialised depends
on the application. A change in the variance of the data stream might have a strong
effect on the task to be fulﬁlled with the online analysis of the data stream or it
might have no effect as long the mean value remains more or less stable.
For the hypothesis test in step 10 of the algorithm, any appropriate test for a
distribution change can be chosen. Since we do not necessarily have to apply an
incremental scheme for the hypothesis test, the Kolmogorov-Smirnov test can also
be considered for change detection. The Kolmogorov-Smirnov test is designed to
compare two distributions, i.e. to decide whether they are equal or not. Therefore,
two kinds of questions can be answered with the help of the Kolmogorov-Smirnov test:
• Does the sample arise from a particular known distribution?
• Do two samples coming from different time windows have the same distribution?
We are particularly interested in the second question. For this purpose, the two-sample
Kolmogorov-Smirnov goodness-of-fit test should be used.
Let $X_1,\ldots,X_n$ and $Y_1,\ldots,Y_m$ be two independent random samples from distributions
with cumulative distribution functions $F_X$ and $F_Y$, respectively. We want to
test the hypothesis $H_0: F_X = F_Y$ against the hypothesis $H_1: F_X \neq F_Y$. The
Kolmogorov-Smirnov statistic is given by

$$D_{n,m} = \sup_x \left| S_{X,n}(x) - S_{Y,m}(x) \right| \qquad (58)$$

where $S_{X,n}(x)$ and $S_{Y,m}(x)$ are the corresponding empirical cumulative distribution
functions$^4$ of the first and second sample. $H_0$ is rejected at level $\alpha$ if

$$\sqrt{\frac{nm}{m+n}}\, D_{n,m} > K_\alpha \qquad (60)$$

where $K_\alpha$ is the $(1-\alpha)$ quantile of the Kolmogorov distribution.
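A direct computation of the statistic (58) and the decision rule (60) can be sketched as follows. This is a minimal illustration, with the well-known asymptotic critical value $K_{0.05} \approx 1.36$ hard-coded as a default rather than computed from the Kolmogorov distribution.

```python
from bisect import bisect_right
from math import sqrt

def ks_two_sample(xs, ys, k_alpha=1.36):
    """Two-sample Kolmogorov-Smirnov test.

    Returns (d, reject), where d is the statistic (58) and reject is
    the decision (60). The supremum of |S_X,n(x) - S_Y,m(x)| is
    attained at a jump point of one of the two empirical CDFs, so it
    suffices to scan the observed values of both samples.
    """
    n, m = len(xs), len(ys)
    xs, ys = sorted(xs), sorted(ys)
    # empirical CDF value at t is (number of values <= t) / sample size
    d = max(abs(bisect_right(xs, t) / n - bisect_right(ys, t) / m)
            for t in set(xs + ys))
    reject = sqrt(n * m / (n + m)) * d > k_alpha
    return d, reject
```

In practice a library routine such as R's `ks.test`, mentioned below, would typically be used instead of a hand-rolled computation.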
To adapt the Kolmogorov-Smirnov test as a change detection algorithm, first the
significance level $\alpha$ has to be chosen (we can also use, for instance, the Bonferroni
correction to avoid the multiple testing problem). The value of $K_\alpha$ either needs
numerical computation or should be stored in a table$^5$. Furthermore, the values from the
subwindows W0 and W1 represent the two samples $x_1,\ldots,x_n$ and $y_1,\ldots,y_m$. Then the
empirical cumulative distribution functions $S_{X,n}(x)$ and $S_{Y,m}(x)$ and the
Kolmogorov-Smirnov statistic have to be computed. Note that for the computation
of $S_{X,n}(x)$ and $S_{Y,m}(x)$, in the case of a single splitting, the samples have to be sorted only
initially; afterwards, new values are inserted into and old values deleted from the
sorted lists. In the case of multiple splittings, we have to decide whether to
sort each time from scratch or to store a sorted list for each kind of splitting.
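For a single splitting, the sorted-list maintenance described above can, for example, be done with binary search. The helper below is an illustrative sketch, not the authors' implementation; insertion and deletion positions are found in O(log n), although the list shifting itself still costs O(n) per update.

```python
from bisect import insort, bisect_left, bisect_right

class SortedSample:
    """One subwindow's sample, kept sorted across window movements."""

    def __init__(self, values):
        self.data = sorted(values)        # sorted only once, initially

    def insert(self, x):
        insort(self.data, x)              # value entering the subwindow

    def delete(self, x):
        i = bisect_left(self.data, x)
        if i == len(self.data) or self.data[i] != x:
            raise ValueError(f"{x} not in sample")
        del self.data[i]                  # value leaving the subwindow

    def ecdf(self, t):
        """S(t) = (number of values <= t) / n, cf. footnote 4."""
        return bisect_right(self.data, t) / len(self.data)
```

With one such structure per subwindow, the Kolmogorov-Smirnov statistic can be recomputed after every window movement without re-sorting from scratch.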
An implementation of the Kolmogorov-Smirnov test is, for instance, available in
the R statistics library (see [4] for more information).
Algorithm 8, based on the Kolmogorov-Smirnov test as the hypothesis test in step
10, has been implemented in Java using R libraries and has been tested with artificial
data. For the data generation process, the following model was used:
$^4$ Let $x_{r_1} \leq x_{r_2} \leq \ldots \leq x_{r_n}$ be a sample in ascending order from the random variables $X_1,\ldots,X_n$. Then the
empirical distribution function of the sample is given by

$$S_{X,n}(x) = \begin{cases} 0 & \text{if } x \leq x_{r_1}, \\ \frac{k}{n} & \text{if } x_{r_k} < x \leq x_{r_{k+1}}, \\ 1 & \text{if } x > x_{r_n}. \end{cases} \qquad (59)$$

$^5$ This applies also to the $t$-test and the $\chi^2$-test.
$$Y_t = \sum_{i=1}^{t} X_i. \qquad (61)$$

We assume the random variables $X_i$ to be normally distributed with expected value
$\mu = 0$ and variance $\sigma^2$, i.e. $X_i \sim N(0, \sigma^2)$. Here $Y_t$ is a one-dimensional random
walk [24]. To make the situation more realistic, we consider the following model:

$$Z_t \sim N(y_t, 1). \qquad (62)$$

The process (62) can be understood as a constant model with drift and noise; the
noise follows a normal distribution whose expected value equals the actual value of
the random walk and whose variance is 1.

The data were generated with the following parameters: $\sigma_1 = 0.02$, $\sigma_2 = 0.1$.
Therefore the data have a slow drift and are furthermore corrupted with noise.
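A minimal sketch of the data generation, assuming `sigma` is the standard deviation of the random-walk increments (the role we read $\sigma_1$ and $\sigma_2$ as playing in the experiments):

```python
import random

def generate_stream(length, sigma, seed=None):
    """Sample Z_1,...,Z_T from the model (61)-(62): Y_t is a random
    walk with N(0, sigma^2) increments, and Z_t ~ N(Y_t, 1)."""
    rng = random.Random(seed)
    y, zs = 0.0, []
    for _ in range(length):
        y += rng.gauss(0.0, sigma)     # random-walk step, eq. (61)
        zs.append(rng.gauss(y, 1.0))   # noisy observation, eq. (62)
    return zs

# e.g. a slowly drifting stream as in the experiment
stream = generate_stream(5000, sigma=0.02, seed=1)
```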
Fig. 9 An example of change detection for the data generated by the process (62).
Algorithm 8 has been applied to this data set. The size of the window W was
chosen to be 500. The window is always split into two subwindows of equal size,
i.e. 250. The data are identified by the algorithm as non-stationary. Only very short
sequences are considered to be stationary by the Kolmogorov-Smirnov test. These
sequences are marked by the darker areas in Figure 9. In the interval [11445, 14414],
stationary parts are mixed with occasionally occurring small non-stationary parts.
For easier interpretation, we joined these parts into one larger area. Of course, since
we are dealing with a window, the real stationary areas are not exactly the same
as shown in the figure. The quality of change detection depends on the window size:
for slow gradual changes in the form of a concept drift, a larger window is a better
choice, whereas for abrupt changes in terms of a concept shift, a smaller window is
advantageous.
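Putting the pieces together, the windowed detection can be sketched on a scaled-down example. The window size here is our own choice for illustration, the KS computation follows (58) and (60) with the asymptotic critical value 1.36 for $\alpha = 0.05$, and for clarity the stream is noise-free with a single abrupt concept shift.

```python
from bisect import bisect_right
from math import sqrt

def ks_reject(xs, ys, k_alpha=1.36):
    """Decision rule (60) with the KS statistic (58); k_alpha = 1.36 is
    the asymptotic critical value for alpha = 0.05."""
    n, m = len(xs), len(ys)
    xs, ys = sorted(xs), sorted(ys)
    d = max(abs(bisect_right(xs, t) / n - bisect_right(ys, t) / m)
            for t in set(xs + ys))
    return sqrt(n * m / (n + m)) * d > k_alpha

def first_change(stream, window_size=100):
    """Slide window W over the stream, split it into two equal halves
    W0 and W1, and return the first time index at which the test rejects."""
    half = window_size // 2
    for t in range(window_size, len(stream) + 1):
        w = stream[t - window_size:t]
        if ks_reject(w[:half], w[half:]):
            return t
    return None

# an abrupt concept shift: the level jumps from 0 to 5 at t = 300
data = [0.0] * 300 + [5.0] * 300
```

On this stream, the shift at t = 300 is reported only once enough post-change values have entered W1, which illustrates the detection delay discussed for Figure 7; with noisy data and heavily overlapping windows, the multiple-testing caveat from Section 4.1.3 applies.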
5 Conclusions
We have introduced incremental computation schemes for statistical measures or
indices like the mean, the median, the variance, the interquartile range or the Pearson
correlation coefficient. Such indices provide information about the characteristics
of the probability distribution that generates the data stream. Although incremental
computations are designed to handle large amounts of data, it is not extremely useful
to calculate the above-mentioned statistical measures for extremely large data sets,
since they quickly converge to the parameter of the probability distribution they are
designed to estimate, as can be seen in Figures 1, 2 and 3. Of course, convergence
will only occur when the underlying data stream is stationary.
It is therefore very important to use such statistical measures or hypothesis tests
for change detection. Change detection is a crucial aspect for non-stationary data
streams or "evolving systems". It has been demonstrated in [26] that naïve adaptation
without taking any effort to distinguish between noise and true changes of the
underlying sample distribution can lead to very undesired results. Statistical measures
and tests can help to discover true changes in the distribution and to distinguish them
from random noise.
Applications of such change detection methods can be found in areas like quality
control and manufacturing [16, 20], intrusion detection [27] or medical diagnosis
[5].
The main focus of this chapter is on univariate methods. There are also extensions to
multidimensional data [23], which are beyond the scope of this contribution.
References
1. Aho, A.V., Hopcroft, J.E., Ullman, J.D.: Data Structures and Algorithms. Addison-Wesley, Boston
(1987)
2. Basseville, M., Nikiforov, I.: Detection of Abrupt Changes: Theory and Application (Prentice
Hall information and system sciences series). Prentice Hall, Upper Saddle River, New Jersey
(1993)
3. Beringer, J., Hüllermeier, E.: Efficient instance-based learning on data streams. Intelligent
Data Analysis 11, 627–650 (2007)
4. Crawley, M.: Statistics: An Introduction using R. Wiley, New York (2005)
5. Dutta, S., Chattopadhyay, M.: A change detection algorithm for medical cell images. In: Proc.
Intern. Conf. on Scientific Paradigm Shift in Information Technology and Management, pp.
524–527. IEEE, Kolkata (2011)
6. Fisher, R.A.: Moments and product moments of sampling distributions. Proceedings of the
London Mathematical Society, Series 2, 30, pp. 199–238 (1929)
7. Fisz, M.: Probability Theory and Mathematical Statistics. Wiley, New York (1963)
8. Ganti, V., Gehrke, J., Ramakrishnan, R.: Mining data streams under block evolution. SIGKDD
Explorations 3, 1–10 (2002)
9. Gelper, S., Schettlinger, K., Croux, C., Gather, U.: Robust online scale estimation in time
series: A model-free approach. Journal of Statistical Planning & Inference 139(2), 335–349
(2008)
10. Grieszbach, G., Schack, B.: Adaptive quantile estimation and its application in analysis of
biological signals. Biometrical Journal 35, 166–179 (1993)
11. Gustafsson, F.: Adaptive Filtering and Change Detection. Wiley, New York (2000)
12. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of
Statistics 6, 65–70 (1979)
13. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Proceedings of
the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
(2001)
14. Ikonomovska, E., Gama, J., Sebastião, R., Gjorgjevik, D.: Regression trees from data streams
with drift detection. In: 11th Int. Conf. on Discovery Science, LNAI, vol. 5808, pp. 121–135.
Springer, Berlin (2009)
15. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Proc. 30th VLDB
Conf., pp. 199–238. Toronto, Canada (2004)
16. Lai, T.: Sequential change-point detection in quality control and dynamic systems. Journal of
the Royal Statistical Society, Series B 57, 613–658 (1995)
17. Möller, E., Grieszbach, G., Schack, B., Witte, H.: Statistical properties and control algorithms
of recursive quantile estimators. Biometrical Journal 42, 729–746 (2000)
18. Nevelson, M., Chasminsky, R.: Stochastic Approximation and Recurrent Estimation. Verlag
Nauka, Moscow (1972)
19. Qiu, G.: An improved recursive median filtering scheme for image processing. IEEE Transactions
on Image Processing 5, 646–648 (1996)
20. Ruusunen, M., Paavola, M., Pirttimaa, M., Leiviska, K.: Comparison of three change detection
algorithms for an electronics manufacturing process. In: Proc. 2005 IEEE International
Symposium on Computational Intelligence in Robotics and Automation, pp. 679–683 (2005)
21. Shaffer, J.P.: Multiple hypothesis testing. Annual Review of Psychology 46, 561–584 (1995)
22. Sheskin, D.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press,
Boca Raton, Florida (1997)
23. Song, X., Wu, M., Jermaine, C., Ranka, S.: Statistical change detection for multidimensional
data. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge
discovery and data mining, pp. 667–676. ACM, New York (2007)
24. Spitzer, F.: Principles of Random Walk (2nd edition). Springer, Berlin (2001)
25. Tschumitschew, K., Klawonn, F.: Incremental quantile estimation. Evolving Systems 1, 253–
264 (2010)
26. Tschumitschew, K., Klawonn, F.: The need for benchmarks with data from stochastic processes
and metamodels in evolving systems. In: P. Angelov, D. Filev, N. Kasabov (eds.) International
Symposium on Evolving Intelligent Systems, pp. 30–33. SSAISB, Leicester (2010)
27. Wang, K., Stolfo, S.: Anomalous payload-based network intrusion detection. In: E. Jonsson,
A. Valdes, M. Almgren (eds.) Recent Advances in Intrusion Detection, pp. 203–222. Springer,
Berlin (2004)