Abstract

Statistical measures provide essential and valuable information about data and are needed for any kind of data analysis. They can be used in a purely exploratory context to describe properties of the data, but also as estimators for model parameters or in the context of hypothesis testing. For example, the mean value is a measure of location, but also an estimator for the expected value of the probability distribution from which the data are sampled. Statistical moments of higher order than the mean provide information about the variance, the skewness and the kurtosis of a probability distribution. The Pearson correlation coefficient is a measure of linear dependency between two variables. In robust statistics, quantiles play an important role, since they are less sensitive to outliers. The median is an alternative measure of location, the interquartile range an alternative measure of dispersion. The application of statistical measures to data streams requires on-line calculation. Since data come in step by step, incremental calculations are needed to avoid restarting the computation process from scratch each time new data arrive and to save memory, so that the whole data set does not need to be kept in memory. Statistical measures like the mean, the variance, moments in general and the Pearson correlation coefficient render themselves easily to incremental computation, whereas recursive or incremental algorithms for quantiles are not as simple or obvious. Non-stationarity is another important aspect of data streams that needs to be taken into account: the parameters of the underlying sampling distribution might change over time. Change detection and on-line adaptation of statistical estimators are required for non-stationary data streams. Hypothesis tests like the χ²-test or the t-test can be a basis for change detection, since they can also be calculated in an incremental fashion.
Based on change detection strategies, one can derive information on the sampling strategy, for instance the optimal size of a time window for parameter estimations of non-stationary data streams. © 2012 Springer Science+Business Media New York. All rights reserved.
Incremental Statistical Measures
Katharina Tschumitschew, Frank Klawonn
1 Introduction
Statistics and statistical methods are used in almost every aspect of modern life, like
medicine, social surveys, economy and marketing, to name only a few application
areas. A vast number of sophisticated statistical software tools can be used to search
and test for structures and patterns in data. Important information about the data
generating process is provided by simple summary statistics. Characteristics of
the data distribution can be described by summary statistics like the following ones.
Measures of location: The mean and quantiles provide information about the location
of the distribution. Mean and median are representatives for the centre of the distribution.
Measures of spread: Common measures for the variation in the data are standard
deviation, variance and interquartile range.
Shape: The third and fourth moments provide information about the skewness
and the kurtosis of a probability distribution.
Dependence: For instance, the Pearson correlation coefficient is a measure of the
linear dependency between two variables. Other common measures of statistical
dependency between two variables are rank correlation coefficients like Spearman's
rho or Kendall's tau.
Katharina Tschumitschew
Department of Computer Science, Ostfalia University of Applied Sciences, Salzdahlumer Str.
46/48, D-38302 Wolfenbuettel, Germany, e-mail:
Frank Klawonn
Department of Computer Science, Ostfalia University of Applied Sciences, Salzdahlumer Str.
46/48, D-38302 Wolfenbuettel, Germany, e-mail:
Bioinformatics and Statistics, Helmholtz Centre for Infection Research, Inhoffenstr. 7, D-38124
Braunschweig, Germany, e-mail:
Apart from providing information about location and spread of the data distribution,
quantiles also play an important role in robust data analysis, since they are less
sensitive to outliers.
Summary statistics can be used in a purely exploratory context to describe properties
of the data, but also as estimators for model parameters of an assumed underlying
data distribution.
More complex and powerful methods for statistical data analysis are, for instance,
hypothesis tests. Statistical hypothesis testing allows us to discover the current state
of affairs and therefore helps us to make decisions based on the gained knowledge.
Hypothesis tests can be applied to a great variety of problems. We may need to test
just a single parameter or the whole distribution of the data.
However, classical statistics operates with a finite, fixed data set. On the other
hand, nowadays it is very important to continuously collect and analyse data sets
that grow over time, since the new data may contain useful information. Sensor
data as well as the seasonal behaviour of markets, weather or animals are in the focus
of diverse research studies. The amount of recorded data increases each day. Apart
from the huge amount of data to be dealt with, another problem is that the data arrive
continuously in time. Such data are called a data stream. A data stream can be
characterised as an unlimited sequence of values arriving step by step over time.
One of the main problems for the analysis of data streams is limited computing and
memory capabilities. It is impossible to hold the whole data set in the main memory
of a computer or computing device like an ECU (electronic control unit) that might
also be responsible for other tasks than just analysing the data. Moreover, the results
of the analysis should be presented in acceptable time, sometimes even under very
strict time constraints, so that the user or system can react in real time. Therefore, the
analysis of data streams requires efficient on-line computations. Algorithms based
on incremental or recursive computation schemes satisfy the above requirements.
Such methods do not store all historical data and do not need to browse through
old data to update an estimator or an analysis; in the ideal case, each data value is
touched only once.
Consequently, the application of statistical methods to data streams requires modifications
of the standard calculation schemes in order to be able to carry out the computations
on-line. Since data come in step by step, incremental calculations are
needed to avoid restarting the computation process from scratch each time new data arrive
and to save memory, so that the whole data set does not have to be kept in memory.
Statistical measures like the sample mean, the variance and moments in general and the
Pearson correlation coefficient render themselves easily to incremental computation
schemes, whereas, for instance, standard quantile computations need the whole data
set. In such cases, new incremental methods must be developed that avoid
sorting the whole data set, since sorting requires in principle access to the whole
data set. Several approaches for the on-line estimation of quantiles are presented for
instance in [9, 19, 1, 25].
Another important aspect in data stream analysis is that the data generating process
does not remain static, i.e. the underlying probabilistic model cannot be assumed
to be stationary. Changes in the data structure may occur over time.
Dealing with non-stationary data requires change detection and on-line adaptation.
Different kinds of non-stationarity have been classified in [2]:
Changes in the data distribution: the change occurs in the data distribution itself. For
instance, the mean or the variance of the data distribution may change over time.
Changes in concept: here, concept drift refers to changes of a target variable. A
target variable is a variable whose values we try to predict based on the model
estimated from the data; for linear regression, for instance, it is a change of the
parameters of the linear relationship between the variables.
Concept drift: concept drift describes gradual changes of the concept. In statistics,
this is usually called structural drift.
Concept shift: concept shift refers to an abrupt change, which is also referred
to as structural break.
Hence change detection and on-line adaptation of statistical estimators are required
for non-stationary data streams. Various strategies to handle non-stationarity
have been proposed, see for instance [11] for a detailed survey of change detection
methods. Statistical hypothesis tests may also be used for change detection. Since we
are working with data streams, it is required that the calculations for the hypothesis
tests can be carried out in an incremental way. For instance, the χ²-test and the
t-test¹ render themselves easily to incremental computations. Based on change detection
strategies, one can derive information on the sampling strategy, for instance
the optimal size of a time window for parameter estimations of non-stationary data
streams [26, 3].
This chapter is organised as follows. Incremental computations of the mean, the variance,
the third and fourth moments and the Pearson correlation coefficient are explained
in Section 2. Furthermore, two algorithms for the on-line estimation of quantiles are
described in Section 3. In Section 4 we provide on-line adaptations of statistical
hypothesis tests and discuss different change detection strategies.
2 Incremental calculation of moments and the Pearson correlation coefficient
Statistical measures like sample central moments provide valuable information
about the data distribution. The sample mean or empirical mean (first sample
moment) is the measure of the centre of location of the data distribution, while
the sample variance (second sample central moment) is the measure of variability. The
third and fourth central moments are used to compute skewness and kurtosis of the
data sample. Skewness provides information about the asymmetry of the data
distribution, and kurtosis gives an idea about the degree of peakedness of the distribution.
¹ For precise definitions see Section 4.
Another important statistic is the correlation coefficient, which is a measure of the
linear dependency between two variables.
In this section we introduce incremental calculations for these statistical measures.
In the following, we consider a real-valued sample x_1,...,x_t,... (x_i ∈ ℝ for all i).
Definition 1. Let x_1,...,x_t be a random sample from the distribution of the random
variable X. The sample or empirical mean of the sample of size t, denoted by x̄_t, is given by
the formula
$\bar{x}_t = \frac{1}{t} \sum_{i=1}^{t} x_i.$ (1)
Equation (1) cannot be applied directly in the context of data streams, since it
would require considering all sample values at each time step. Fortunately, Equation
(1) can easily be transformed into an incremental scheme:
$\bar{x}_t = \bar{x}_{t-1} + \frac{x_t - \bar{x}_{t-1}}{t}.$ (2)
The incremental update Equation (2) requires only three values to calculate the sample
mean at time point t:
The mean at time point t−1.
The sample value at time point t.
The number of sample values so far.
The empirical or sample variance can be calculated in an incremental fashion in
a similar way.
Definition 2. Let x_1,...,x_t be a random sample from the distribution of the random
variable X. The empirical or sample variance of a sample of size t is given by
$s_t^2 = \frac{1}{t-1} \sum_{i=1}^{t} (x_i - \bar{x}_t)^2.$ (3)
Furthermore, $s_t = \sqrt{s_t^2}$ is called the sample standard deviation.
In order to simplify the calculation we use the following notation:
$\tilde{m}_{2,t} = \sum_{i=1}^{t} (x_i - \bar{x}_t)^2.$ (4)
The formula for incremental calculation is derived from Equation (4) using Equation (2).
We obtain the following recurrence formula for the second central moment:
$\tilde{m}_{2,t} = \tilde{m}_{2,t-1} + (x_t - \bar{x}_{t-1})(x_t - \bar{x}_t).$ (5)
The unbiased estimator for the variance of the sample according to Equation (5)
is given by
$s_t^2 = \frac{\tilde{m}_{2,t}}{t-1}.$ (6)
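As an illustration, the update equations (2), (5) and (6) can be sketched in a few lines of Python. The class name and structure below are our own and not part of the chapter:

```python
class IncrementalMeanVariance:
    """One-pass sample mean and unbiased variance.

    Maintains the running mean (Equation (2)) and the sum of squared
    deviations ~m_{2,t} (Equation (5)); the unbiased variance follows
    from Equation (6)."""

    def __init__(self):
        self.t = 0        # number of values seen so far
        self.mean = 0.0   # running sample mean
        self.m2 = 0.0     # running sum of squared deviations

    def update(self, x):
        self.t += 1
        delta = x - self.mean               # x_t - mean_{t-1}
        self.mean += delta / self.t         # Equation (2)
        self.m2 += delta * (x - self.mean)  # Equation (5)

    def variance(self):
        return self.m2 / (self.t - 1) if self.t > 1 else 0.0  # Equation (6)
```

Each arriving value is touched exactly once and only three numbers are kept in memory, which is exactly the property required for data streams.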
Definition 3. Let x_1,...,x_t be a random sample from the distribution of the random
variable X. Then the k-th central moment of a sample of size t is defined by
$m_{k,t} = \frac{1}{t} \sum_{i=1}^{t} (x_i - \bar{x}_t)^k.$ (7)
In order to simplify the computations and to facilitate the readability of the text we
use the following expression for the derivation:
$\tilde{m}_{k,t} = \sum_{i=1}^{t} (x_i - \bar{x}_t)^k,$ (8)
therefore $\tilde{m}_{k,t} = t \cdot m_{k,t}$.
For the third- and fourth-order moments, which are needed to calculate skewness
and kurtosis of the data distribution, incremental formulae can be derived in a similar
way, in the form of pairwise update equations for $\tilde{m}_{3,t}$ and $\tilde{m}_{4,t}$. With
$b = \frac{x_t - \bar{x}_{t-1}}{t},$
one obtains a one-pass formula for the third-order centred statistical moment of a sample of size t:
$\tilde{m}_{3,t} = \tilde{m}_{3,t-1} - 3b\,\tilde{m}_{2,t-1} + \frac{(t-1)(t-2)}{t^2}\,(x_t - \bar{x}_{t-1})^3.$ (10)
The derivation for the fourth-order moment is very similar and thus is not detailed here; the corresponding update is
$\tilde{m}_{4,t} = \tilde{m}_{4,t-1} - 4b\,\tilde{m}_{3,t-1} + 6b^2\,\tilde{m}_{2,t-1} + \frac{(t-1)(t^2-3t+3)}{t^3}\,(x_t - \bar{x}_{t-1})^4.$ (11)
The results presented above offer the essential formulae for efficient, one-pass calculations
of statistical moments up to the fourth order. These are important whenever
the mean, variance, skewness and kurtosis of a data stream should be calculated. Although
these measures cover the needs of the vast majority of applications for data
analysis, sometimes higher-order statistics should be used. For the computation of
higher-order statistical moments see for instance [6].
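The pairwise update equations can be combined into a single one-pass estimator of skewness and kurtosis. The following sketch is our own illustration (names are hypothetical); the updates are applied in the order fourth, third, second moment, so that each update uses the previous values of the lower-order deviation sums:

```python
class IncrementalMoments:
    """One-pass central moments up to order four.

    m2, m3, m4 hold the deviation sums ~m_{k,t}; dividing by t yields
    the central moments m_{k,t} used for skewness and kurtosis."""

    def __init__(self):
        self.t = 0
        self.mean = 0.0
        self.m2 = self.m3 = self.m4 = 0.0

    def update(self, x):
        t = self.t + 1
        delta = x - self.mean  # x_t - mean_{t-1}
        b = delta / t          # b as in the update equations
        # Fourth and third moment first: they need the old m3 and m2.
        self.m4 += (-4 * b * self.m3 + 6 * b * b * self.m2
                    + (t - 1) * (t * t - 3 * t + 3) * delta ** 4 / t ** 3)
        self.m3 += -3 * b * self.m2 + (t - 1) * (t - 2) * delta ** 3 / t ** 2
        self.m2 += (t - 1) * delta ** 2 / t  # equivalent to Equation (5)
        self.mean += b
        self.t = t

    def skewness(self):
        return (self.m3 / self.t) / (self.m2 / self.t) ** 1.5

    def kurtosis(self):
        return (self.m4 / self.t) / (self.m2 / self.t) ** 2
```

The update order matters: swapping the m2 update ahead of m3 or m4 would silently corrupt the higher moments.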
Now we derive a formula for the incremental calculation of the sample correlation
coefficient.
Definition 4. Let x_1,...,x_t be a random sample from the distribution of the random
variable X and y_1,...,y_t be a random sample from the distribution of the random
variable Y. Then the sample Pearson correlation coefficient of the sample of size t,
denoted by r_{xy,t}, is given by the formula
$r_{xy,t} = \frac{\frac{1}{t-1}\sum_{i=1}^{t}(x_i - \bar{x}_t)(y_i - \bar{y}_t)}{s_{x,t}\, s_{y,t}},$ (12)
where x̄_t and ȳ_t are the sample means and s_{x,t} and s_{y,t} the sample
standard deviations of X and Y, respectively.
The incremental formula for the sample standard deviation can be easily derived
from the incremental formula for the sample variance (6). Hence only the numerator of
Equation (12) needs to be considered further; it represents the sample covariance s_{xy,t}.
Definition 5. Let x_1,...,x_t be a random sample from the distribution of the random
variable X and y_1,...,y_t be a random sample from the distribution of the random
variable Y. Then the sample covariance s_{xy,t} of the sample of size t is given by
$s_{xy,t} = \frac{1}{t-1} \sum_{i=1}^{t} (x_i - \bar{x}_t)(y_i - \bar{y}_t),$ (13)
where x̄_t and ȳ_t are the sample means of X and Y, respectively.
With the notation $\tilde{s}_{xy,t} = \sum_{i=1}^{t}(x_i - \bar{x}_t)(y_i - \bar{y}_t)$, the formula for the incremental calculation of the covariance is given by
$\tilde{s}_{xy,t} = \tilde{s}_{xy,t-1} + t(t-1)\, b_x b_y,$ (14)
where $b_x = \frac{x_t - \bar{x}_{t-1}}{t}$ and $b_y = \frac{y_t - \bar{y}_{t-1}}{t}$. Hence the incremental formula for the sample
covariance is
$s_{xy,t} = \frac{\tilde{s}_{xy,t}}{t-1}.$ (15)
Therefore, to update the Pearson correlation coefficient, we have to compute the
sample standard deviation and covariance first and subsequently use Equation (12).
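To make the procedure concrete, here is a sketch (again our own illustration, not the authors' code) that maintains the running means, the deviation sums for x and y and the co-deviation sum, and combines them as in Equation (12); the (t−1) factors cancel in the ratio:

```python
import math

class IncrementalCorrelation:
    """One-pass Pearson correlation coefficient for a stream of pairs."""

    def __init__(self):
        self.t = 0
        self.mx = self.my = 0.0    # running means of x and y
        self.sxx = self.syy = 0.0  # deviation sums ~m_{2,t} for x and y
        self.sxy = 0.0             # co-deviation sum ~s_{xy,t}

    def update(self, x, y):
        self.t += 1
        dx = x - self.mx           # x_t minus the old x-mean
        dy = y - self.my           # y_t minus the old y-mean
        self.mx += dx / self.t
        self.my += dy / self.t
        self.sxx += dx * (x - self.mx)
        self.syy += dy * (y - self.my)
        self.sxy += dx * (y - self.my)  # incremental covariance update

    def correlation(self):
        return self.sxy / math.sqrt(self.sxx * self.syy)
```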
So far in this section we have presented incremental calculations for the empirical
mean, the empirical variance, the third and fourth sample central moments and the sample
correlation coefficient. These statistical measures can also be considered as estimators
of the corresponding parameters of the data distribution. Therefore, we are interested
in the question how many values x_i we need to get a "good" estimation of
the parameters. Of course, as we deal with a data stream, in general we will have a
large amount of data. However, some applications are based on time window techniques,
for instance the change detection methods presented in Section 4. There
we need to compare at least two samples of data, and on that account the data have to be
split into smaller parts. To answer the question about the optimal amount of data for
statistical estimators, we have to analyse the variances of the parameter estimators.
The variance of an estimator shows how efficient this estimator is.
Here we restrict our considerations to a random sample from a normal distribution
with expected value 0. Let X_1,...,X_t be independent and identically
distributed (i.i.d.) random variables following a normal distribution, X_i ∼ N(0, σ²),
and let x_1,...,x_t be observed values of these random variables.
The variance of the estimator of the expected value² $\bar{X} = \frac{1}{t}\sum_{i=1}^{t} X_i$ is given by
$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{t}.$ (16)
The variance of the unbiased estimator of the variance $S^2 = \frac{1}{t-1}\sum_{i=1}^{t}(X_i - \bar{X})^2$
is given by
$\mathrm{Var}(S^2) = \frac{2\sigma^4}{t-1}.$ (17)
The variance of the distribution of the third moment is shown in Equation (18)
(see [6] for more detailed information):
$\mathrm{Var}(M_3) \approx \frac{6\sigma^6}{t}.$ (18)
Figure 1 shows Equations (16), (17) and (18) as functions in t for σ² = 1 (standard
normal population). It is obvious that for small amounts of data, the variance
of the estimators is quite large; consequently, more values are needed to obtain a
reliable estimation of the distribution parameters. Furthermore, the optimal sample size
depends on the statistic to be computed. For instance, for the sample mean and a
sample of size 50, the variance is already small enough, whereas for the third moment
estimator to have the same variance, many more observations are needed.
Fig. 1 Variances from bottom to top of parameter estimators for the expected value, the variance
and the third moment of a standard normal distribution
² We use capital letters here to distinguish between random variables and real numbers, which are
denoted by small letters.
We apply the same considerations to the sample correlation coefficient. Let X
and Y be two random variables following normal distributions and let X_1,...,X_t
and Y_1,...,Y_t be i.i.d. samples of X and Y, respectively: X_i ∼ N(0, σ_x²) and
Y_i ∼ N(0, σ_y²). Assume the correlation between X and Y is equal to ρ_{XY}. Then the
asymptotic variance of the sample correlation coefficient is given by (see [7])
$\mathrm{Var}(r_{xy,t}) \approx \frac{(1 - \rho_{XY}^2)^2}{t}.$ (19)
Attention should be paid to the asymptotic nature of Equation (19). This formula
can be used only for sufficiently large t (see [7]). Equation (19) is illustrated in
Figure 2 as a function in t for ρ_{XY} = 0.9. Since for different values of ρ_{XY} the plots
are very similar, they are not shown here.
Fig. 2 Asymptotic variance of the sample correlation coefficient
In this section we have provided equations for the incremental calculation of the
sample mean, the sample variance, the third and fourth moments and the Pearson correlation
coefficient. These statistics allow us to summarise a set of observations analytically.
Since we assume that the observations reflect the population as a whole,
these statistics give us an idea about the underlying data distribution. Other important
summary statistics are sample quantiles. Incremental approaches for quantile
estimation are described in the next section.
3 Incremental quantile estimation
Quantiles play an important role in statistics, especially in robust statistics, since
they are less sensitive to outliers. For q ∈ (0,1), the q-quantile has the property
that q·100% of the data are smaller and (1−q)·100% of the data are larger
than this value. The median, i.e. the 50%-quantile, is a robust measure of location,
and the interquartile range³ is a robust measure of spread. Incremental or recursive
techniques for quantile estimation are not as obvious as for statistical moments,
since the computation of a sample quantile needs the entire sorted data. Nevertheless,
there are techniques for incremental quantile estimation. In this section, we
describe two different approaches. The first approach is restricted to continuous symmetric
unimodal distributions and is therefore not applicable to all real-world data.
The second approach is not restricted to any kind of distribution and is
not limited to continuous random variables. We also provide experimental results
for both algorithms for different kinds of distributions.
3.1 Incremental quantile estimation for continuous random variables
Definition 6. For a random variable X with cumulative distribution function F_X, the
q-quantile (q ∈ (0,1)) is defined as inf{x ∈ ℝ | F_X(x) ≥ q}. If x_q is the q-quantile of
a continuous random variable, this implies P(X ≤ x_q) = q and P(X ≥ x_q) = 1 − q.
For continuous random variables, an incremental scheme for quantile estimation
is proposed in [10]. This approach is based on the following theorem.
Theorem 1. Let {ξ_t}_{t=0,1,...} be a sequence of independent, identically distributed
(i.i.d.) random variables with cumulative distribution function F_ξ. Assume that the
density function f_ξ(x) exists and is continuous at the q-quantile x_q (0 < q < 1).
Further let the inequality
$f_\xi(x_q) > 0$ (20)
be fulfilled. Let {c_t}_{t=0,1,...} be a (control) sequence of real numbers satisfying the
conditions
$\sum_{t=0}^{\infty} c_t = \infty,$ (21)
$\sum_{t=0}^{\infty} c_t^2 < \infty.$ (22)
Then the stochastic process X_t defined by
$X_{t+1} = X_t + c_t \left( q - \mathbb{1}\{\xi_t \le X_t\} \right)$ (23)
almost surely converges to the quantile x_q.
³ The interquartile range is the mid-range containing 50% of the data and it is computed as the
difference between the 75%- and the 25%-quantiles: IQR = x_{0.75} − x_{0.25}.
The proof of the theorem is based on stochastic approximation and can be found
in [18]. A standard choice of the sequence {c_t}_{t=0,1,...} is c_t = 1/t. However, convergence
might be extremely slow for certain distributions. Therefore, techniques
to choose a suitable sequence {c_t}_{t=0,1,...}, for instance based on an estimation of
the probability density function of the sampled random variable, are proposed in
[17, 10].
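A minimal sketch of the stochastic approximation scheme of Theorem 1 with the standard control sequence c_t = 1/t; the function name and starting value are our own choices, not taken from [10]:

```python
def sa_quantile(stream, q, x0=0.0):
    """Estimate the q-quantile of a data stream by the update of
    Equation (23): the estimate moves up by c_t*q when the new value
    lies above it, and down by c_t*(1-q) otherwise."""
    x = x0
    for t, xi in enumerate(stream, start=1):
        c = 1.0 / t  # control sequence c_t = 1/t
        x += c * (q - (1.0 if xi <= x else 0.0))
    return x
```

On a uniform stream the estimate settles close to the true median, but as discussed above, convergence can be very slow for less convenient distributions.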
Although this technique of incremental quantile estimation has only minimal
memory requirements, it has certain disadvantages:
It is only suitable for continuous random variables.
Unless the sequence {c_t}_{t=0,1,...} is well chosen, convergence can be extremely slow.
When the sampled random variable changes over time, especially when the c_t
are already close to zero, the incremental estimation of the quantile will remain
almost constant and the change will go unnoticed.
In the following we present an algorithm to overcome these problems.
3.2 Incremental quantile estimation
Here we provide a more general approach which is not limited to continuous random
variables. First we describe an algorithm for incremental median estimation, which
can be generalised to arbitrary quantiles. Since this simple generalisation is not very suitable
for non-central quantiles, we then modify the approach in such a way that it yields good
results for all quantiles.
3.2.1 Incremental median estimation
Before we discuss the general problem of incremental quantile estimation, we first
focus on the special case of the median, since we will need the results for the
median to develop suitable methods for arbitrary quantiles.
For the incremental computation of the median we store a fixed number of values, a buffer
of m sorted data values a_1,...,a_m: in the ideal case the m/2 closest values to the left and the m/2
closest values to the right of the median, so that the interval [a_1, a_m] contains the median.
We also need two counters L and R to store the number of values outside the interval
[a_1, a_m], counting the values left and right of the interval separately. Initially, L and
R are set to zero.
The algorithm works as follows. The first m data points x_1,...,x_m are used to fill
the buffer. They are entered into the buffer in increasing order, i.e. a_i = x_[i], where
x_[1] ≤ ... ≤ x_[m] are the sorted values x_1,...,x_m. After the buffer is filled, the algorithm
handles the incoming values x_t in the following way:
1. If x_t < a_1, i.e. the new value lies left of the interval supposed to contain the
median, then L_new := L_old + 1.
2. If x_t > a_m, i.e. the new value lies right of the interval supposed to contain the
median, then R_new := R_old + 1.
3. If a_i ≤ x_t ≤ a_{i+1} (1 ≤ i < m), x_t is entered into the buffer at position i or i+1.
Of course, the other values have to be shifted accordingly, and the old left bound
a_1 or the old right bound a_m will be dropped. Since in the ideal case the median
is the value in the middle of the buffer, the algorithm tries to achieve this by
balancing the number of values left and right of the interval [a_1, a_m]. Therefore,
the following rule is applied:
a. If L < R, then remove a_1, increase L, i.e. L_new := L_old + 1, shift the values
a_2,...,a_i one position to the left and enter x_t at position i.
b. Otherwise remove a_m, increase R, i.e. R_new := R_old + 1, shift the values
a_{i+1},...,a_{m-1} one position to the right and enter x_t at position i+1.
In each step, the median q̂_{0.5} can be easily calculated from the values in the
buffer and the counters L and R by
$\hat{q}_{0.5} = \begin{cases} a_{\frac{t+1}{2}-L} & \text{if } t \text{ is odd,} \\ \frac{1}{2}\left(a_{\frac{t}{2}-L} + a_{\frac{t}{2}+1-L}\right) & \text{if } t \text{ is even,} \end{cases}$ (25)
where t = L + m + R is the total number of values seen so far.
It should be noted that it can happen that at least one of the indices (t+1)/2 − L
or t/2 + 1 − L, with t = L + m + R, is not within the bounds 1,...,m of the buffer
indices, so that the computation of the median fails. The interval length a_m − a_1 can
only decrease, and at least for continuous distributions X with probability density
function f_X(q_{0.5}) > 0, where q_{0.5} is the true median of X, it will tend to zero with
increasing sample size. In an ideal situation, the buffer of m stored values contains
exactly the values in the middle of the sample. Here we assume that at this point in
time the sample consists of m + t values.
Table 1 A small example data set
data 3.8 5.2 6.1 4.2 7.5 6.3 5.4 5.9 3.9
Table 2 illustrates how this algorithm works with an extremely small buffer of
size m=4 based on the data set given in Table 1.
In the following we generalise and modify the incremental median algorithm
proposed in the previous subsection and analyse the algorithm in more detail.
Table 2 The development of the buffer and the two counters for the small example data set in
Table 1
t L a_1 a_2 a_3 a_4 R
4 0 3.8 4.2 5.2 6.1 0
5 0 3.8 4.2 5.2 6.1 1
6 0 3.8 4.2 5.2 6.1 2
7 1 4.2 5.2 5.4 6.1 2
8 2 5.2 5.4 5.9 6.1 2
9 3 5.2 5.4 5.9 6.1 2
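For illustration, the buffer-based median algorithm can be sketched as follows. This is our own code, not the authors'; it uses a Python list for the buffer, and the bounds check for a failing median index is omitted for brevity:

```python
import bisect

class IncrementalMedian:
    """Buffer of m sorted values plus counters L and R for the values
    discarded left and right of the buffer interval [a_1, a_m]."""

    def __init__(self, m):
        self.m = m
        self.buf = []  # a_1,...,a_m in increasing order
        self.L = 0     # values discarded left of the interval
        self.R = 0     # values discarded right of the interval

    def update(self, x):
        if len(self.buf) < self.m:  # initial fill with the first m values
            bisect.insort(self.buf, x)
        elif x < self.buf[0]:       # case 1: left of the interval
            self.L += 1
        elif x > self.buf[-1]:      # case 2: right of the interval
            self.R += 1
        elif self.L < self.R:       # case 3a: drop a_1, count it left
            self.buf.pop(0)
            self.L += 1
            bisect.insort(self.buf, x)
        else:                       # case 3b: drop a_m, count it right
            self.buf.pop()
            self.R += 1
            bisect.insort(self.buf, x)

    def median(self):
        t = self.L + len(self.buf) + self.R  # total number of values
        if t % 2 == 1:  # Equation (25), translated to 0-based indexing
            return self.buf[(t + 1) // 2 - self.L - 1]
        return 0.5 * (self.buf[t // 2 - self.L - 1] + self.buf[t // 2 - self.L])
```

Fed with the example data of Table 1 and m = 4, the buffer and counters evolve exactly as in Table 2, and the final median estimate is 5.4, the true median of the nine values.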
3.2.2 An ad hoc algorithm
This algorithm for incremental median estimation can be generalised to arbitrary
quantiles in a straight forward manner. For the incremental q-quantile estimation
(0 <q<1) only case 3 requires a modification. Instead of trying to get the same
values for the counters LandR, we now try to balance the counters in such a way that
qR (1q)Lholds. This means, step 3a is applied if L<(1q)tholds, otherwise
step 3b is carried out. tis the number of data sampled after the buffer of length m
has been filled.
Therefore, in the ideal case, when we achieve this balance, a proportionof qof
the data points lies left and a proportion of (1q)lies right of the interval defined
by the buffer of length m.
Now we are interested in the properties of the incremental quantile estimator presented
above. Since we are simply selecting the k-th order statistic of the sample, at
least for continuous random variables and larger presampling sizes, we can provide
an asymptotic distribution of the order statistic and therefore of the estimator.
Assume the sample comes from a continuous random variable X and we are
interested in an estimation of the q-quantile x_q. Assume furthermore that the probability
density function f_X is continuous and positive at x_q. Let X̃_k
denote the k-th order statistic from an i.i.d. sample of size t, with k/t → q. Then X̃_k has an asymptotic normal
distribution [7]:
$\tilde{X}_k \sim N\!\left(x_q,\ \frac{q(1-q)}{t\, f_X(x_q)^2}\right).$ (26)
From Equation (26) we can obtain valuable information about the quantile estimator.
In order to have a more efficient and reliable estimator, we want the variance
in (26) to be as small as possible. Under the assumption that we know the data
distribution, we can compute this variance explicitly.
Let X be a random variable following a standard normal distribution and assume
we have a sample x_1,...,x_t of X, i.e. these values are realisations of the i.i.d. random
variables X_i ∼ N(0,1). We are interested in the median of X. According to Equation
(26), the sample median X̃_{0.5t+1} asymptotically follows a normal distribution:
$\tilde{X}_{0.5t+1} \sim N\!\left(0,\ \frac{\pi}{2t}\right),$ (27)
since $f_X(0) = 1/\sqrt{2\pi}$ and $q(1-q) = 1/4$.
Figure 3 shows the variance of the order statistic X̃_{0.5t+1} as a function in t when
the chosen quantile is q = 0.5, i.e. the median, and the original distribution from
which the sample comes is a standard normal distribution N(0;1). The second curve
in the figure corresponds to the variance of the sample mean.
Fig. 3 Variances, from bottom to top, of X̄ and X̃_{0.5t+1} under the assumption of a standard normal distribution
of X
The variance of the sample mean X̄ is only slightly better than that of the order
statistic X̃_{0.5t+1}; nevertheless we should keep in mind the asymptotic character of
the distribution (26).
Furthermore, from Equation (26) we obtain another nice property of the incremental
quantile estimator: it is an asymptotically unbiased estimator of sample
quantiles. It is even a consistent estimator.
Unfortunately, as was shown in [25], the probability for the algorithm to fail is
much smaller for the estimation of the median than for arbitrary quantiles. Therefore,
despite the nice properties of this estimator, this simple generalisation of the
incremental median estimation algorithm to arbitrary quantiles is not very useful in
practice. In order to amend this problem, we provide a modified algorithm based on presampling.
3.2.3 Incremental quantile estimation with presampling: iQPres
Here we introduce the algorithm iQPres (incremental quantile estimation with presampling)
[25]. As already mentioned above, the failure probability for the incremental
quantile estimation algorithm in Subsection 3.2.2 is lower for the median
than for extreme quantiles. Therefore, to minimise the failure probability, we introduce
an incremental quantile estimation algorithm with presampling.
Assume we want to estimate the q-quantile. We presample n values and simply
take the l-th smallest value x_(l) from the presample, for some fixed l ∈ {1,...,n}.
At the moment, l does not even have to be related to the q-quantile. The probability
that x_(l) is smaller than the q-quantile of interest is
$P\left(x_{(l)} < x_q\right) = \sum_{i=l}^{n} \binom{n}{i} q^i (1-q)^{n-i} =: 1 - p_l.$ (28)
So when we apply presampling in this way, we obtain the new (presampled) distribution
(order statistic) X̃_l. From Equation (28) we can immediately see that the
(1−p_l)-quantile of X̃_l is the same as the q-quantile of X. Therefore, instead of
estimating the q-quantile of X, we estimate the (1−p_l)-quantile of X̃_l. Of course,
this is only helpful when l is chosen in such a way that the failure probabilities
for the (1−p_l)-quantile are significantly lower than the failure probabilities for the
q-quantile. In order to achieve this, l should be chosen in such a way that (1−p_l) is
as close to 0.5 as possible.
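The choice of l can be illustrated with a small helper. This is our own sketch with hypothetical function names; [25] derives the actual optimal parameter choices:

```python
from math import comb

def level(n, l, q):
    """1 - p_l = P(x_(l) < x_q): the probability that at least l of the
    n presampled values fall below the q-quantile (Equation (28))."""
    return sum(comb(n, i) * q ** i * (1 - q) ** (n - i) for i in range(l, n + 1))

def choose_l(n, q):
    """Pick l so that the transformed level 1 - p_l is as close to 0.5
    as possible."""
    return min(range(1, n + 1), key=lambda l: abs(level(n, l, q) - 0.5))
```

For instance, for n = 10 and q = 0.25, the choice l = 3 gives 1 − p_l ≈ 0.47, so the lower quartile of X is estimated via a near-median of the presampled order statistics.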
We want to estimate the q-quantile (0 < q < 1). Fix the parameters m, l and n (for
an optimal choice see [25]).
1. Presampling: n succeeding values are stored in increasing order in a buffer b_n
of length n. Then we select the l-th element of the buffer. The buffer is emptied
afterwards for the next presample of n values.
2. Estimation of the (1−p_l)-quantile based on the l-th elements delivered by the presampling:
this is carried out according to the algorithm described in Subsection 3.2.2.
The quantile is then estimated in the usual way, i.e.
$\hat{q} = (1-r) \cdot a_{k-R} + r \cdot a_{k-R+1}$ (quantile estimator).
Of course, this only works when the algorithm has not failed, i.e. the corresponding
index k is within the buffer of m values.
3.3 Experimental results
In this section we present an experimental evaluation of the two presented algorithms:
iQPres and the method described in Section 3.1. The evaluation is based on artificial
data sets.
First, we consider estimations of the lower and upper quartile as well as the
median for different distributions:
Exponential distribution with parameter 4 (Exp(4))
Standard normal distribution (N(0;1))
Uniform distribution on the unit interval (U(0,1))
An asymmetric bimodal distribution given by a Gaussian mixture model (GM) of
two normal distributions. The cumulative distribution function of this distribution
is given by
$F(x) = 0.3 \cdot F_{N(-3;1)}(x) + 0.7 \cdot F_{N(1;1)}(x),$
where $F_{N(\mu;\sigma^2)}$ denotes the cumulative distribution function of the normal distribution
with expected value μ and variance σ². Its probability density function is
shown in Figure 4.
Fig. 4 An example for an asymmetric, bimodal probability density function
The quantile estimations were carried out for samples of size 10000 that were
generated from these distributions. We have repeated each estimation 1000 times.
Tables 3-5 show the average over all estimations for our algorithm (iQPres with a
memory size of M = 150) and for the technique based on Theorem 1, where we used
the control sequence c_t = 1/t. The mean squared error over the 1000 repeated runs is
also shown in the tables.
Table 3 Estimation of the lower quartile q=0.25
Distr. True quantile iQPres Equation 23 MSE (iQPres) MSE (Equation 23)
Exp(4) 1.150728 1.152182 1.718059 2.130621E-5 2.675568
N(0;1) -0.674490 -0.672235 -0.678989 5.611009E-6 0.008013
U(0,1) 0.250000 0.250885 0.250845 1.541123E-6 4.191695E-5
GM -2.043442 -2.042703 0.185340 1.087618E-5 5.331730
For the uniform distribution, incremental quantile estimation based on equation
(23) and iQPres leads to very similar and good results. For the normal distribution,
both algorithms yield quite good results, but iQPres seems to be slightly more efficient with a smaller mean squared error. For the bimodal distribution based on the
Incremental Statistical Measures 17
Table 4 Estimation of the median q=0.5
Distr. True quantile iQPres Equation 23 MSE (iQPres) MSE (Equation 23)
Exp(4) 2.772589 2.7462635 5.775925 7.485865E-4 10.906919
N(0;1) 0.000000 6.8324E-4 -0.047590 1.786715E-5 0.009726
U(0,1) 0.500000 0.495781 0.499955 1.779917E-5 2.529276E-6
GM 0.434425 0.434396 0.117499 2.365156E-6 0.451943
Table 5 Estimation of the upper quartile q=0.75
Distr. True quantile iQPres Equation 23 MSE (iQPres) MSE (Equation 23)
Exp(4) 5.545177 5.554385 5.062660 1.054132E-4 0.919735
N(0;1) 0.674490 0.674840 0.656452 3.600748E-7 0.003732
U(0,1) 0.750000 0.750883 0.749919 8.443136E-7 2.068730E-5
GM 1.366114 1.366838 0.027163 1.193377E-6 2.207112
Gaussian mixture model and a skewed distribution such as the exponential distribution, the estimations from the algorithm based on equation (23) are more or less useless, at least when no specific effort is invested to find an optimal control sequence {c_t}_{t=0,1,...}. iQPres does not have any problems with these distributions. As already mentioned before, iQPres also does not require the sampling distribution to be continuous, whereas this is a necessary assumption for the technique based on equation (23).
4 Hypothesis tests and change detection
In this section we demonstrate how hypothesis testing can be adapted to an incremental computation scheme for the cases of the χ²-test and the t-test. Moreover, we discuss the problem of non-stationary data and explain various change detection strategies with the main focus on the use of statistical tests.
4.1 Incremental hypothesis tests
Statistical tests are methods to check the validity of hypotheses about distributions or properties of distributions of random variables. Since statistical tests rely on samples, they cannot definitely verify or falsify a hypothesis. They can only provide probabilistic information supporting or rejecting the hypothesis under consideration.
Statistical tests usually consider a null hypothesis H0 and an alternative hypothesis H1. The hypotheses may concern parameters of a given class of distributions, for instance unknown expected value and variance of a normal distribution. Such tests are called parameter tests. In such cases, the a priori assumption is that the data definitely originate from a normal distribution. Only the parameters are unknown. In contrast to parameter tests, nonparametric tests concern more general hypotheses, for example whether it is reasonable at all to assume that the data come from a normal distribution.
The error probability that the test will erroneously reject the null hypothesis,
given the null hypothesis is true, is used as an indicator of the reliability of the test.
Sometimes a so-called p-value is used. The p-value is the smallest error probability that can be admitted so that the test will still reject the null hypothesis for a given sample. Therefore, a low p-value is a good indicator for rejecting the null hypothesis. Usually, the acceptable error probability α (the α-error) should be specified in advance, before the test is carried out. The smaller α is chosen, the more reliable the test is when the outcome is to reject the null hypothesis. However, when α is chosen too small, then the test will not tend to reject the null hypothesis, although the sample might not speak in favour of it.
Some hypothesis tests can be applied to data streams, since they can be calculated in an incremental fashion. In this section, we discuss the incremental adaptation of two statistical tests, the χ²-test and the t-test. Note that the application of hypothesis tests to data streams, using incremental computation or window techniques, requires the repeated execution of the test. This can cause the problem of multiple testing, which is described later in this section.
4.1.1 The χ²-test

The χ²-test has various applications. The principal idea of the χ²-test is the comparison of two distributions. One can check whether two samples come from the same distribution, whether a single sample follows a given distribution, or whether two samples are independent.
Example 1. A die is thrown 120 times and the observed frequencies are as follows: 1 is obtained 30 times, 2 is obtained 25 times, 3 is obtained 18 times, 4 is obtained 10 times, 5 is obtained 22 times, and 6 is obtained 15 times. We are interested in the question whether the die is fair or not.
The null hypothesis H0 for the χ²-test claims that the data follow a certain (cumulative) probability distribution F(x). The distribution of the null hypothesis is then compared to the distribution of the data. The null hypothesis can for instance be a given distribution, e.g. a uniform or a normal distribution, and the χ²-test can give an indication whether the data strongly deviate from this expected distribution. For an independence test for two variables, the joint distribution of the sample is compared to the product of the marginal distributions. If these distributions differ significantly, this is an indication that the variables might not be independent.
The main idea of the χ²-test is to determine how well the observed frequencies fit the theoretical/expected frequencies specified by the null hypothesis. Therefore, the χ²-test is appropriate for data from categorical or nominally scaled random variables. In order to apply the test to continuous numeric data, the data domain should be partitioned into r categories first.
First we discuss the χ² goodness-of-fit test. Here we assume to know from which distribution the data come. Then the hypotheses H0 and H1 can be stated as follows:

H0: The sample comes from the distribution F_X.
H1: The sample does not come from the distribution F_X.

Therefore, the problem from Example 1 can be solved with the help of the χ² goodness-of-fit test. Consequently, the hypotheses H0 and H1 are chosen as follows:

H0: p_i = 1/6 for all i ∈ {1,...,6}
H1: p_i ≠ 1/6 for at least one value i ∈ {1,...,6}

where p_i denotes the probability of obtaining the number i with the die.
Let X1,...,Xn be i.i.d. continuous random variables and x1,...,xn the observations from these random variables. Then the test statistic is computed as follows:

χ² = Σ_{i=1}^{r} (O_i − E_i)² / E_i   (29)

where O_i are the observed frequencies and E_i are the expected frequencies.
Since we are dealing with continuous random variables, to compute the observed and expected frequencies we should carry out a discretisation of the data domain. Let F_X(x) be the assumed cumulative distribution function. The x-axis has to be split into r pairwise disjoint sets or bins S_i. Then the expected frequency in bin S_i is given by

E_i = n (F_X(a_{i+1}) − F_X(a_i))   (30)

where [a_i, a_{i+1}) is the interval corresponding to bin S_i.
Furthermore, for the observed frequencies we obtain

O_i = |{ j ∈ {1,...,n} : x_j ∈ [a_i, a_{i+1}) }|.   (31)

O_i is therefore the number of observations in the i-th interval.
The statistic (29) has an approximate χ²-distribution with (r − 1) degrees of freedom under the following assumptions: First, the observations are independent from each other. Secondly, the categories – the bins S_i – are mutually exclusive and exhaustive. This means that no category may have an expected frequency of zero, i.e. ∀i ∈ {1,...,r}: E_i > 0. Furthermore, no more than 20% of the categories should have an expected frequency less than five. If this is not the case, categories should be merged or redefined. Note that this might also lead to a different number of degrees of freedom.
Therefore, the hypothesis H0 that the sample comes from the particular distribution F_X is rejected if

χ² = Σ_{i=1}^{r} (O_i − E_i)² / E_i > χ²_{1−α}(r − 1)   (32)

where χ²_{1−α}(r − 1) is the (1−α)-quantile of the χ²-distribution with (r − 1) degrees of
Table 6 summarizes the observed and expected frequencies and computations for Example 1. All E_i are greater than zero, even greater than 4. Therefore, there is no
Table 6 Observed and expected frequencies for Example 1
number i on the die | E_i | O_i | (O_i − E_i)²/E_i
1 | 20 | 30 | 5
2 | 20 | 25 | 1.25
3 | 20 | 18 | 0.2
4 | 20 | 10 | 5
5 | 20 | 22 | 0.2
6 | 20 | 15 | 1.25
need to combine categories. The test statistic is computed as follows:

χ² = Σ_{i=1}^{6} (O_i − E_i)² / E_i = 5 + 1.25 + 0.2 + 5 + 0.2 + 1.25 = 12.9   (33)
The obtained result χ² = 12.9 should be evaluated with the (1−α)-quantile of the χ²-distribution. For that purpose, see a table of the χ²-distribution ([7]). The corresponding degrees of freedom are computed as explained above: (r − 1) = (6 − 1) = 5. For α = 0.05, the tabulated critical value for 5 degrees of freedom is χ²_{0.95} = 11.07, which is smaller than the computed test statistic. Therefore the null hypothesis is rejected at the 0.05 significance level. For significance level 0.02, the critical value is χ²_{0.98} = 13.388 and therefore the null hypothesis cannot be rejected at this level. This result can be summarized as follows: the null hypothesis can be rejected for χ² = 12.9 with 5 degrees of freedom at all significance levels bigger than 0.024. This indicates that the die is unfair.
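The computation for Example 1 can be reproduced with a few lines of Python; the critical values 11.07 and 13.388 are taken from the χ²-table as above:

```python
# Chi-square goodness-of-fit statistic for the die example (Example 1):
# 120 throws, expected frequency E_i = 20 per face under a fair die.
observed = [30, 25, 18, 10, 22, 15]
expected = [20] * 6

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 is 12.9, matching equation (33)

# Critical values from a chi-square table with r - 1 = 5 degrees of freedom:
assert chi2 > 11.07    # rejected at significance level 0.05
assert chi2 < 13.388   # not rejected at significance level 0.02
```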
In order to adapt the χ² goodness-of-fit test to incremental calculation, the observed frequencies should be computed in an incremental fashion:

O_i^{(t)} = O_i^{(t−1)} + 1 if x_t ∈ S_i, and O_i^{(t)} = O_i^{(t−1)} otherwise.   (34)

The expected frequencies should also be recalculated corresponding to the increasing number of observations.
Another very common test is the χ² independence test. This test evaluates the general hypothesis that two variables are statistically independent from each other.
Let X and Y be two random variables and (x1, y1),...,(xn, yn) the observed values of these variables. For continuous random variables, the data domains should be partitioned into r and q categories, respectively. Therefore, the observed values of X can be assigned to one of the categories S^X_1,...,S^X_r and the observed values of Y to one of the categories S^Y_1,...,S^Y_q. Then O_ij is the frequency of occurrence of observations (x_k, y_k) with x_k ∈ S^X_i and y_k ∈ S^Y_j. Furthermore,

O_{i·} = Σ_{j=1}^{q} O_ij   (36)

O_{·j} = Σ_{i=1}^{r} O_ij   (37)

denote the marginal observed frequencies.
Table 7 illustrates the observed absolute frequencies. The total number of observations in the table is n. The notation O_ij represents the number of observations in the cell with index ij (i-th row and j-th column), O_{i·} the number of observations in the i-th row and O_{·j} the number of observations in the j-th column. This table is called a contingency table.
Table 7 Contingency table
              | S^Y_1 ... S^Y_j ... S^Y_q | marginal of X
S^X_1         | O_11 ... O_1j ... O_1q    | O_{1·}
...           | ...                       | ...
S^X_i         | O_i1 ... O_ij ... O_iq    | O_{i·}
...           | ...                       | ...
S^X_r         | O_r1 ... O_rj ... O_rq    | O_{r·}
marginal of Y | O_{·1} ... O_{·j} ... O_{·q} | n
It is assumed that the random variables X and Y are statistically independent. Let p_ij be the probability of being in the i-th category of the domain of X and the j-th category of the domain of Y. p_{i·} and p_{·j} are the corresponding marginal probabilities. Then, corresponding to the assumption of independence, for each pair (i, j)

p_ij = p_{i·} · p_{·j}   (38)

holds. Equation (38) defines statistical independence. Therefore, the null and the alternative hypothesis are as follows:

H0: p_ij = p_{i·} · p_{·j} for all (i, j)
H1: p_ij ≠ p_{i·} · p_{·j} for at least one pair (i, j)
Thus, if X and Y are independent, then the expected absolute frequencies are given by

E_ij = O_{i·} · O_{·j} / n.   (39)

The test statistic, again checking the observed frequencies against the expected frequencies under the null hypothesis, is as follows:

χ² = Σ_{i=1}^{r} Σ_{j=1}^{q} (O_ij − E_ij)² / E_ij   (40)
The test statistic has an approximate χ²-distribution with (r − 1)(q − 1) degrees of freedom. Consequently, the hypothesis H0 that X and Y are independent can be rejected if

χ² > χ²_{1−α}((r − 1)(q − 1))   (41)

where χ²_{1−α}((r − 1)(q − 1)) is the (1−α)-quantile of the χ²-distribution with (r − 1)(q − 1) degrees of freedom.
For the incremental computation of O_{i·}, O_{·j} and O_ij, corresponding formulae must be developed. For the time point t and the new observed values (x_t, y_t), the incremental formulae are given by

O_{i·}^{(t)} = O_{i·}^{(t−1)} + 1 if x_t ∈ S^X_i, and O_{i·}^{(t)} = O_{i·}^{(t−1)} otherwise,   (42)

O_{·j}^{(t)} = O_{·j}^{(t−1)} + 1 if y_t ∈ S^Y_j, and O_{·j}^{(t)} = O_{·j}^{(t−1)} otherwise,   (43)

O_ij^{(t)} = O_ij^{(t−1)} + 1 if x_t ∈ S^X_i and y_t ∈ S^Y_j, and O_ij^{(t)} = O_ij^{(t−1)} otherwise.   (44)
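The update rules (42)-(44) together with the statistic (40) can be sketched as follows (illustrative code; the category-mapping callbacks `cat_x` and `cat_y` are assumptions, not part of the original description):

```python
class IncrementalContingency:
    """Incremental contingency table for the chi-square independence test
    (illustrative sketch). `cat_x(x)` and `cat_y(y)` map observations to
    0-based category indices."""
    def __init__(self, r, q, cat_x, cat_y):
        self.row = [0] * r                        # marginals O_{i.}
        self.col = [0] * q                        # marginals O_{.j}
        self.cell = [[0] * q for _ in range(r)]   # cell counts O_ij
        self.n = 0
        self.cat_x, self.cat_y = cat_x, cat_y

    def update(self, x, y):
        i, j = self.cat_x(x), self.cat_y(y)
        self.row[i] += 1       # update (42)
        self.col[j] += 1       # update (43)
        self.cell[i][j] += 1   # update (44)
        self.n += 1

    def statistic(self):
        chi2 = 0.0
        for i in range(len(self.row)):
            for j in range(len(self.col)):
                e = self.row[i] * self.col[j] / self.n  # E_ij = O_{i.}*O_{.j}/n
                if e > 0:
                    chi2 += (self.cell[i][j] - e) ** 2 / e
        return chi2
```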
The χ² goodness-of-fit test can be extended to a χ² homogeneity test ([22]). Whereas the χ² goodness-of-fit test can be used only for a single sample, the χ² homogeneity test is used to compare whether two or more samples come from the same population.
Let X1,...,Xm (m ≥ 2) be discrete random variables, or continuous random variables discretised into r categories S1,...,Sr. The data for each of the m samples from the random variables X1,...,Xm (overall n values) are entered in a contingency table. This table is similar to the one for the χ² independence test.
Table 8 Contingency table
values \ variables | X_1 ... X_j ... X_m |
S_1                | O_11 ... O_1j ... O_1m | O_{1·}
...                | ...                    | ...
S_i                | O_i1 ... O_ij ... O_im | O_{i·}
...                | ...                    | ...
S_r                | O_r1 ... O_rj ... O_rm | O_{r·}
                   | O_{·1} ... O_{·j} ... O_{·m} | n
The samples are represented by the columns and the categories by the rows of Table 8. We assume that each of the samples is randomly drawn from the same distribution. The χ² homogeneity test checks whether the m samples are homogeneous with respect to the observed frequencies. If the hypothesis H0 is true, the expected frequency in the i-th category will be the same for all of the m random variables. Therefore, the null and the alternative hypothesis can be stated as follows:

H0: p_ij = p_{i·} · p_{·j} for all (i, j)
H1: p_ij ≠ p_{i·} · p_{·j} for at least one pair (i, j).

From H0 it follows that the rows are independent of the columns. Therefore, the computation of an expected frequency can be summarized by

E_ij = O_{i·} · O_{·j} / n.
Although the χ² independence test and the χ² homogeneity test evaluate different hypotheses, they are computed identically. Therefore, the incremental adaptation of the χ² independence test can also be applied to the χ² homogeneity test.
Commonly, in the case of two samples, the Kolmogorov-Smirnov test is used, since it is an exact test and, in contrast to the χ²-test, can be applied directly without previous discretisation of continuous distributions. However, the Kolmogorov-Smirnov test does not have any obvious incremental calculation scheme. The Kolmogorov-Smirnov test is described in Section 4.2.2.
4.1.2 The t-test
The next hypothesis test for which we want to provide incremental computation is the t-test. Different kinds of the t-test are used. We restrict our considerations to the one-sample t-test and the t-test for two independent samples with equal variance.
The one-sample t-test evaluates whether a sample with a particular mean could be drawn from a population with known expected value μ0. Let X1,...,Xn be i.i.d. with Xi ∼ N(μ, σ²) and unknown variance σ². The null and the alternative hypothesis for the two-sided test are:

H0: μ = μ0, the sample comes from the normal distribution with expected value μ0.
H1: μ ≠ μ0, the sample comes from a normal distribution with an expected value differing from μ0.
The test statistic is given by

T = √n (X̄ − μ0) / S   (46)

where X̄ is the sample mean and S the sample standard deviation. The statistic (46) is t-distributed with (n − 1) degrees of freedom. H0 is rejected if

t < −t_{1−α/2} or t > t_{1−α/2}

where t_{1−α/2} is the (1−α/2)-quantile of the t-distribution with (n − 1) degrees of freedom and t is the computed value of the test statistic (46), i.e. t = √n (x̄ − μ0)/s.
One-sided tests are given by the following null and alternative hypotheses:

- H0: μ ≤ μ0 and H1: μ > μ0. H0 is rejected if t > t_{1−α}.
- H0: μ ≥ μ0 and H1: μ < μ0. H0 is rejected if t < −t_{1−α}.
This test can be very easily adapted to incremental computation. For this purpose, the sample mean and the sample variance have to be updated as in Equations (2) and (6), respectively, as described in Section 2. Note that the degrees of freedom of the t-distribution should be updated in each step as well.
Unlike in previous notations, we use here n + 1 for the time point, since the letter t is already used for the computed test statistic. Furthermore, as mentioned above, the (1−α/2)-quantile of the t-distribution with n degrees of freedom should be used to evaluate the null hypothesis. However, for n ≥ 30, the quantiles of the standard normal distribution can be used as an approximation of the quantiles of the t-distribution.
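A sketch of the incremental one-sample t statistic, using a Welford-style running mean and variance in the spirit of Equations (2) and (6) (illustrative code, not the chapter's reference implementation):

```python
import math

class IncrementalTTest:
    """One-sample t statistic with incremental mean/variance updates
    (illustrative sketch); degrees of freedom grow with each value."""
    def __init__(self, mu0):
        self.mu0 = mu0
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def statistic(self):
        # t = sqrt(n) * (xbar - mu0) / s, with n - 1 degrees of freedom
        s = math.sqrt(self.m2 / (self.n - 1))
        return math.sqrt(self.n) * (self.mean - self.mu0) / s
```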
The t-test for two independent samples is used to evaluate whether two independent samples come from two normal distributions with the same expected value. The two sample means x̄ and ȳ are used to estimate the expected values μ_X and μ_Y of the underlying distributions. If the result of the test is significant, we assume that the samples come from two normal distributions with different expected values. Furthermore, we assume that the variances of the underlying distributions are equal.
The t-test is based on the following assumptions:

- The samples are drawn randomly.
- The underlying distribution is a normal distribution.
- The variances of the underlying distributions are equal, i.e. σ²_X = σ²_Y.

Let X1,...,X_{n1} be i.i.d. with Xi ∼ N(μ_X, σ²_X) and Y1,...,Y_{n2} be i.i.d. with Yi ∼ N(μ_Y, σ²_Y), with unknown expected values and unknown variances. The null and the alternative hypothesis can be defined as follows:

H0: μ_X = μ_Y, the samples come from the same normal distribution.
H1: μ_X ≠ μ_Y, the samples come from normal distributions with different expected values.

In this case, a two-sided test is carried out; however, similar to the one-sample t-test, a one-sided test can also be defined.
The test statistic is computed as follows:

T = (X̄ − Ȳ) / √( ((n1 − 1) S²_X + (n2 − 1) S²_Y) / (n1 + n2 − 2) · (1/n1 + 1/n2) )   (49)

where S²_X and S²_Y are the unbiased estimators for the variances of X and Y, respectively. Equation (49) is a general equation for the t-test for two independent samples and can be used in both cases of equal and unequal sample sizes.
The statistic (49) has a t-distribution with (n1 + n2 − 2) degrees of freedom. Let t be the computed value of the statistic (49). Then the hypothesis H0 that the samples come from the same normal distribution is rejected if

t < −t_{1−α/2} or t > t_{1−α/2}

where t_{1−α/2} is the (1−α/2)-quantile of the t-distribution with (n1 + n2 − 2) degrees of freedom.
Similar to the one-sample t-test, the t-test for two independent samples can easily be computed in an incremental fashion, since the sample means and the variances can be calculated in an incremental way. Here the degrees of freedom should also be updated with the new observed values.
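Given incrementally maintained sample means and unbiased variances, the statistic (49) itself is a one-line computation; a hedged sketch (function name is ours):

```python
import math

def two_sample_t(mean_x, var_x, n1, mean_y, var_y, n2):
    """Pooled two-sample t statistic as in equation (49), illustrative
    sketch; the arguments are the incrementally maintained sample means
    and unbiased variances. Degrees of freedom: n1 + n2 - 2."""
    pooled = ((n1 - 1) * var_x + (n2 - 1) * var_y) / (n1 + n2 - 2)
    return (mean_x - mean_y) / math.sqrt(pooled * (1 / n1 + 1 / n2))
```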
4.1.3 Multiple Testing
Multiple testing refers to the application of a number of tests simultaneously. Instead of a single null hypothesis, tests for a set of null hypotheses H0, H1,...,Hn are considered. These null hypotheses do not have to exclude each other.
An example for multiple testing is a test whether m random variables X1,...,Xm are pairwise independent. This means the null hypotheses are H_{1,2},...,H_{1,m},...,H_{m−1,m}, where H_{i,j} states that Xi and Xj are independent.
Multiple testing leads to the undesired effect of cumulating the α-error. The α-error is the probability to reject the null hypothesis erroneously, given it is true. A value of α = 0.05 means that in 5% of the cases the null hypothesis would be rejected, although it is true. When k tests are applied to the same sample, then the error probability for each test is α. Under the assumption that the null hypotheses are all true and the tests are independent, the probability that at least one test will reject its null hypothesis erroneously is

P(k* ≥ 1) = 1 − (1 − α)^k

where k* is the number of tests rejecting the null hypothesis.
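For illustration, the cumulated α-error for k = 10 independent tests at level α = 0.05:

```python
# Cumulated alpha-error for k independent tests, each at level alpha:
# P(at least one erroneous rejection) = 1 - (1 - alpha)^k
alpha, k = 0.05, 10
fwe = 1 - (1 - alpha) ** k
print(round(fwe, 4))  # 0.4013: roughly a 40% chance of at least one false alarm
```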
A variety of approaches have been proposed to handle the problem of cumulating the α-error. In the following, two common methods will be introduced shortly.
The simplest and most conservative method is the Bonferroni correction [21]. When k null hypotheses are tested simultaneously and α is the desired overall α-error for all tests together, then the corrected α-error for each single test should be chosen as α̃ = α/k. The justification for this correction is the inequality

P(A_1 ∪ ... ∪ A_k) ≤ P(A_1) + ... + P(A_k).

For the Bonferroni correction, A_i is the event that the null hypothesis H_i is rejected, although it is true. In this way, the probability that one or more of the tests rejects its corresponding null hypothesis is at most α. In order to guarantee the significance level α, each single test must be carried out with the corrected level α̃ = α/k.
The Bonferroni correction is a very rough and conservative approximation for the true α-error. One of its disadvantages is that the corrected significance level α̃ can become very low, so that it becomes almost impossible to reject any of the null hypotheses.
The simple single-step Bonferroni correction has been improved by Holm [12]. The Bonferroni-Holm method is a multi-step procedure in which the necessary corrections are carried out stepwise. This method usually yields larger corrected α-values than the simple Bonferroni correction.
When k hypotheses are tested simultaneously with overall α-error α for all tests, for each of the tests the corresponding p-value is computed based on the sample, and the p-values are sorted in ascending order: p_[1] ≤ ... ≤ p_[k]. The null hypotheses H_[i] are ordered in the same way.
In the first step, H_[1] is tested by comparing p_[1] with α/k. If p_[1] > α/k holds, then H_[1] and the other null hypotheses H_[2],...,H_[k] are not rejected. The method terminates in this case. However, if p_[1] ≤ α/k holds, H_[1] is rejected and the next null hypothesis H_[2] is tested by comparing the p-value p_[2] with the corrected level α/(k−1). If p_[2] > α/(k−1) holds, H_[2] and the remaining null hypotheses H_[3],...,H_[k] are not rejected. If p_[2] ≤ α/(k−1) holds, H_[2] is rejected and the procedure continues with H_[3] in the same manner.
The Bonferroni-Holm method tests the hypotheses in the order of their p-values, starting with H_[1]. The corrected significance levels α/k, α/(k−1),... are increasing. Therefore, the Bonferroni-Holm method rejects at least those hypotheses that are also rejected by the simple Bonferroni correction, but in general more hypotheses can be rejected.
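The step-down procedure can be sketched as follows (illustrative code; it returns one rejection flag per hypothesis in the original order):

```python
def holm_reject(p_values, alpha):
    """Bonferroni-Holm step-down procedure (illustrative sketch).

    Sorts the p-values in ascending order, compares p_[i] with
    alpha / (k - i), and stops at the first non-rejection."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (k - step):
            reject[idx] = True
        else:
            break   # this and all larger p-values are not rejected
    return reject
```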
4.2 Change detection strategies
Detecting changes in data streams has become a very important area of research in many application fields, such as stock markets, web activities or sensor measurements, just to name a few. The main problem for change detection in data streams is limited memory capacity. It is unrealistic to store the full history of the data stream. Therefore, efficient change detection strategies tailored to the data stream should be used. The main requirements for such approaches are low computational costs, fast change detection and high accuracy. Moreover, it is important to distinguish between true changes and false alarms. Abrupt changes as well as slow drift in the data generating process can occur. Therefore, a “good” algorithm should be able to detect both kinds of changes.
Various strategies have been proposed to handle this problem; see for instance [11] for a detailed survey of change detection methods. Most of these approaches are based on time window techniques [2, 15]. Furthermore, several approaches for evolving data streams are discussed in [14, 13, 8].
In this section, we introduce two types of change detection strategies: incremental computation and window technique based change detection. Furthermore, we put the main focus on statistical tests. We assume to deal with numeric data streams. As already mentioned in the introduction, two types of change are identified: concept change and change of data distribution. We do not differentiate between both of them in this work, since the distribution of the target variable will be changed in both cases.
4.2.1 iQPres for change detection
The incremental quantile estimator iQPres from Section 3.2.3 can be used for change detection [25]. In case the sampling distribution changes, with a drift of the quantile to be estimated as a consequence, such changes will be noticed, since the simple version of iQPres without shifted parallel estimations will fail in the sense that it is not able to balance the counters L and R any more.
In order to illustrate how iQPres can be applied to change detection, we consider daily measurements of gas production in a waste water treatment plant over a period of more than eight years. The measurements are shown in Figure 5.
iQPres has been applied to this data set to estimate the median with a memory size of M = 30. The optimal choice for the sizes of the buffers for presampling and median estimation is then n = 3 and m = 27, respectively. At the three time points 508, 2604 and 2964, the buffer cannot be balanced anymore, indicating that the median has changed. These three time points are indicated by vertical lines in Figure 5. The arrows indicate whether the median has increased or decreased. An increase corresponds to an unbalanced buffer with the right counter R becoming too large, whereas a decrease leads to an unbalanced buffer with the left counter L becoming too large. The median increases at the first point at 508 from 998 before
Fig. 5 An example of change detection for time series data from a waste water treatment plant
to 1361 after this point. At time point 2604, the median increases to 1406 and drops again to 1193 at time point 2964.
Note that algorithms based on Theorem 1 mentioned in Section 3.1 are not suitable for change detection.
By using iQPres for change detection in the data distribution, we assume that the median of the distribution changes with time. However, if this is not the case and only another parameter, like the variance of the underlying distribution, changes, other strategies for change detection should be used.
4.2.2 Statistical tests for change detection
The theory of hypothesis testing is the main background for change detection. Several algorithms for change detection are based on hypothesis tests.
Hypothesis tests can be applied to change detection in two different ways:

- Change detection through incremental computation of the tests: with this approach, the test is computed in an incremental fashion, for instance as explained in Section 4.1. Consequently, a change can be detected if the test starts to yield different results than before.
- Window techniques: with this approach, the data stream is divided into time windows. A sliding window can be used as well as non-overlapping windows. In order to detect potential changes, we need either to compare data from an earlier window with data from a newer one, or to test only the new data (for instance, whether the data follow a known or assumed distribution). When the window size is not too large, it is not necessary to be able to compute the tests in an incremental fashion. Therefore, we are not restricted to tests that render themselves to incremental computations, but many other tests can be used. Hybrid approaches combining both techniques are also possible. Of course, window techniques with incremental computations within the window will lead to less memory consumption and faster computations.
We will not give a detailed description of change detection based on incremental computation here, since the principles of these methods are explained in Section 4.1. However, the problem of multiple testing as discussed in Section 4.1 should be taken into account when a test is applied again and again over time. Even if the underlying distribution does not change over time, any test will erroneously reject the null hypothesis of no change in the long run if we only carry out the test often enough. Different approaches to solve this problem are presented in Section 4.1.3. Another problem of this approach is the “burden of old data”. If a large amount of data has been analysed already and the change is not very drastic, it may happen that the change will be detected with a large delay, or not detected at all when a very large window is used. On that account, it may be useful to reinitialise the test from time to time.
To detect changes with the window technique, we need to compare two samples of data and have to decide whether the hypothesis H0 that they come from the same distribution is true.
First we will present a general meta-algorithm for change detection based on a window technique, without any specific fixed test. This algorithm is presented in Figure 6. The constant step specifies after how many new values the change detection should be checked again.
1  Initialise window W, i = 0
2  for each new x_t do
3    if i < step then
4      W ← W ∪ {x_t}  (i.e., add x_t to W)
5      W ← W \ {w_0}  (i.e., remove the oldest element from W)
6      i ← i + 1
7      if i = step then
8        i ← 0
9        split W into W_0 and W_1
10       test W_0 and W_1 for change
11       if change detected then
12         report change at time t
13       end if
14     end if
15   end if
16 end for
Fig. 6 General scheme of a change detection algorithm based on time windows and statistical tests
This approach follows a simple idea: when the data from two subwindows of W are judged as “distinct enough”, a change is detected. Here “distinct enough” is specified by the selected statistical test for distribution change. In general, we assume the splitting of W into two subwindows of equal size. Nevertheless, any “valid” splitting can be used. Valid is meant in terms of the amount of data that is needed for the test to be reliable.
However, with a badly selected cut point, the change may be detected with a large delay, as Figure 7 shows. The rightmost part indicates a change in the data stream.
Fig. 7 Subwindows problem
As the change occurs almost at the end of the subwindow W_1, it is most likely that the change remains undetected at first. Of course, since the window will be moved forward as new data points arrive, at some point the change will be detected, but it may be of essential interest to detect the change as early as possible.
To solve this problem, we modify the algorithm in Figure 6 in the following way: instead of splitting the window W only once, the splitting is carried out several times. Figure 8 shows the modified part of the algorithm in Figure 6, starting at step 9.
9    for each valid split W = W_0 ∪ W_1 do
10     test W_0 and W_1 for change
11     if change detected then
12       report change at time t
13     end if
14   end for
Fig. 8 Modification of the algorithm for change detection to avoid the subwindows problem
How many times the window should be split should be decided based on the required performance and precision of the algorithm. We can run the test for each sufficiently large subwindow of W, although the performance of the algorithm will decrease, or we can carry out a fixed number of splits. Note that also for the window technique based approach, attention should be paid to the problem of multiple testing (see Section 4.1.3). Furthermore, we do not specify here the effect of the detected change. The question whether the window should be reinitialised depends on the application. A change in the variance of the data stream might have a strong effect on the task to be fulfilled with the on-line analysis of the data stream, or it might have no effect as long as the mean value remains more or less stable.
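The meta-algorithm of Figure 6 with the multiple-split modification of Figure 8 can be sketched as follows (illustrative code; `test_fn` stands for any two-sample test, and the particular choice of split points is an assumption):

```python
from collections import deque

def detect_changes(stream, window_size, step, test_fn):
    """Sliding-window change detection in the spirit of Figures 6 and 8
    (illustrative sketch). `test_fn(w0, w1)` is any two-sample test
    returning True when the subwindows differ significantly."""
    window = deque(maxlen=window_size)
    changes = []
    for t, x in enumerate(stream):
        window.append(x)   # the oldest element drops out automatically
        if len(window) == window_size and t % step == 0:
            data = list(window)
            # try several valid splits W = W0 ∪ W1 (Figure 8)
            for cut in range(window_size // 4, 3 * window_size // 4 + 1,
                             window_size // 4):
                if test_fn(data[:cut], data[cut:]):
                    changes.append(t)  # report change at time t
                    break
    return changes
```

Whether the window is reinitialised after a reported change is left to the application, as discussed above.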
For the hypothesis test in step 10 of the algorithm, any appropriate test for distribution change can be chosen. Since we do not necessarily have to apply an incremental scheme for the hypothesis test, the Kolmogorov-Smirnov test can also be considered for change detection. The Kolmogorov-Smirnov test is designed to compare two distributions, i.e. to decide whether they are equal or not. Therefore, two kinds of questions can be answered with the help of the Kolmogorov-Smirnov test:

- Does the sample arise from a particular known distribution?
- Do two samples coming from different time windows have the same distribution?

We are particularly interested in the second question. For this purpose, the two-sample Kolmogorov-Smirnov goodness-of-fit test should be used.
Let $X_1,\dots,X_n$ and $Y_1,\dots,Y_m$ be two independent random samples from distributions
with cumulative distribution functions $F_X$ and $F_Y$, respectively. We want to
test the hypothesis $H_0: F_X = F_Y$ against the hypothesis $H_1: F_X \neq F_Y$. The
Kolmogorov-Smirnov statistic is given by
$$D_{n,m} = \sup_x \left| S_{X,n}(x) - S_{Y,m}(x) \right|,$$
where $S_{X,n}(x)$ and $S_{Y,m}(x)$ are the corresponding empirical cumulative distribution
functions⁴ of the first and the second sample. $H_0$ is rejected at significance level $\alpha$ if
$$\sqrt{\frac{nm}{n+m}}\; D_{n,m} > K_\alpha,$$
where $K_\alpha$ is the $(1-\alpha)$-quantile of the Kolmogorov distribution.
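The statistic and the decision rule can be sketched in plain Python. This is a minimal illustration under the standard asymptotic formulation, not the chapter's implementation; the function names are ours, and $K_\alpha$ is obtained by numerically inverting the Kolmogorov distribution function rather than by a table lookup.

```python
# Sketch of the two-sample Kolmogorov-Smirnov decision: D_{n,m} is the
# largest difference between the two empirical CDFs, and K_alpha is found
# by bisection on the Kolmogorov distribution function. Names illustrative.
import math

def ks_statistic(xs, ys):
    """D_{n,m} = sup_x |S_{X,n}(x) - S_{Y,m}(x)|, evaluated at the data points."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    for v in sorted(set(xs + ys)):
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def kolmogorov_cdf(x, terms=100):
    """K(x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2)."""
    if x <= 0.0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

def kolmogorov_quantile(p):
    """p-quantile of the Kolmogorov distribution, found by bisection."""
    lo, hi = 1e-8, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if kolmogorov_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def ks_reject(xs, ys, alpha=0.05):
    """Reject H0: F_X = F_Y at significance level alpha?"""
    n, m = len(xs), len(ys)
    d = ks_statistic(xs, ys)
    return math.sqrt(n * m / (n + m)) * d > kolmogorov_quantile(1.0 - alpha)
```

For small samples the asymptotic Kolmogorov distribution is only an approximation; exact tables or a library routine such as R's `ks.test` are preferable in practice.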
To adapt the Kolmogorov-Smirnov test as a change detection algorithm, first the
significance level $\alpha$ has to be chosen (we can also use, for instance, the Bonferroni
correction to avoid the multiple testing problem). The value of $K_\alpha$ either needs
numerical computation or should be stored in a table⁵. Furthermore, the values from the
subwindows $W_0$ and $W_1$ represent the two samples $x_1,\dots,x_n$ and $y_1,\dots,y_m$. Then the
empirical cumulative distribution functions $S_{X,n}(x)$ and $S_{Y,m}(x)$ and the
Kolmogorov-Smirnov statistic are computed. Note that for the computation
of $S_{X,n}(x)$ and $S_{Y,m}(x)$ in the case of a unique splitting, the samples have to be sorted only
initially; afterwards, new values are inserted into and old values are deleted from
the sorted lists. In the case of multiple splittings, we have to decide whether to
sort each time from scratch or to save sorted lists for each kind of splitting.
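The incremental maintenance of the sorted samples can be sketched with Python's `bisect` module; the class and its interface are illustrative, not part of the chapter's implementation.

```python
# Sketch of keeping a subwindow's values sorted incrementally, so the
# empirical CDFs can be computed without re-sorting from scratch at every
# step. The class name and interface are illustrative.
from bisect import insort, bisect_left, bisect_right
from collections import deque

class SortedWindow:
    """Sliding window whose values are additionally kept in sorted order."""

    def __init__(self, size):
        self.size = size
        self.order = deque()   # values in arrival order, for expiry
        self.values = []       # the same values, kept sorted

    def add(self, value):
        if len(self.order) == self.size:
            oldest = self.order.popleft()                  # value leaving the window
            del self.values[bisect_left(self.values, oldest)]
        self.order.append(value)
        insort(self.values, value)                         # O(n) insertion, no full sort

    def ecdf(self, x):
        """Empirical cumulative distribution function of the window at x."""
        return bisect_right(self.values, x) / len(self.values)
```

Each update costs one deletion and one insertion in a sorted list instead of a full sort of the subwindow.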
An implementation of the Kolmogorov-Smirnov test is for instance available in
the R statistics library (see [4] for more information).
Algorithm 8, based on the Kolmogorov-Smirnov test as the hypothesis test in step
10, has been implemented in Java using R libraries and has been tested with artificial
data. For the data generation process the following model was used:
$$Y_t = \sum_{i=1}^{t} X_i .$$
⁴ Let $x_{r_1}, x_{r_2}, \dots, x_{r_n}$ be the sample from the random variables $X_1,\dots,X_n$ in ascending order. Then the
empirical distribution function of the sample is given by
$$S_{X,n}(x) = \begin{cases} 0 & \text{if } x \le x_{r_1}, \\ \frac{k}{n} & \text{if } x_{r_k} < x \le x_{r_{k+1}}, \\ 1 & \text{if } x > x_{r_n}. \end{cases}$$
⁵ This applies also to the t-test and the $\chi^2$-test.
We assume the random variables $X_i$ to be normally distributed with expected value
$\mu = 0$ and variance $\sigma^2$, i.e. $X_i \sim N(0, \sigma^2)$. Here $Y_t$ is a one-dimensional random
walk [24]. To make the situation more realistic, we consider the following model:
$$Z_t = c + \varepsilon_t, \qquad \varepsilon_t \sim N(Y_t, 1). \qquad (62)$$
The process (62) can be understood as a constant model with drift and noise: the
noise follows a normal distribution whose expected value equals the actual value of
the random walk and whose variance is 1.
The data were generated with parameters chosen such that the data exhibit a slow
drift and are furthermore corrupted with noise.
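The data-generating process described above can be sketched as follows. The values of `sigma` and `c` below are illustrative placeholders, not the parameters actually used in the chapter's experiments.

```python
# Sketch of the generating process: a random walk Y_t driven by increments
# X_i ~ N(0, sigma^2), observed as a constant c plus noise with mean Y_t
# and variance 1. sigma and c are illustrative placeholder values.
import random

def generate_stream(length, sigma=0.01, c=0.0, seed=42):
    rng = random.Random(seed)
    y = 0.0                                   # random walk state Y_t
    stream = []
    for _ in range(length):
        y += rng.gauss(0.0, sigma)            # increment X_i ~ N(0, sigma^2)
        stream.append(c + rng.gauss(y, 1.0))  # observation with noise ~ N(Y_t, 1)
    return stream
```

A small `sigma` yields the slow drift described above, while the unit-variance observation noise dominates locally.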
Fig. 9 An example of change detection for the data generated by the process (62).
Algorithm 8 has been applied to this data set. The size of the window W was
chosen to be 500. The window is always split into two subwindows of equal size,
i.e. 250. The data are identified by the algorithm as non-stationary. Only very short
sequences are considered to be stationary by the Kolmogorov-Smirnov test. These
sequences are marked by the darker areas in Figure 9. In the interval [11445, 14414]
stationary parts are mixed with occasionally occurring small non-stationary parts.
For easier interpretation we joined these parts into one larger area. Of course, since
we are dealing with a window, the real stationary areas are not exactly the same
as shown in the figure. The quality of change detection depends on the window size:
for slow gradual changes in the form of a concept drift a larger window is the better
choice, whereas for abrupt changes in terms of a concept shift a smaller window is
of advantage.
5 Conclusions
We have introduced incremental computation schemes for statistical measures or
indices like the mean, the median, the variance, the interquartile range or the Pearson
correlation coefficient. Such indices provide information about the characteristics
of the probability distribution that generates the data stream. Although incremental
computations are designed to handle large amounts of data, it is of limited use
to calculate the above-mentioned statistical measures over extremely large data sets,
since they quickly converge to the parameters of the probability distribution they are
designed to estimate, as can be seen in Figures 1, 2 and 3. Of course, convergence
will only occur when the underlying data stream is stationary.
It is therefore very important to use such statistical measures or hypothesis tests
for change detection. Change detection is a crucial aspect for non-stationary data
streams or “evolving systems”. It has been demonstrated in [26] that naïve adaptation,
without any effort to distinguish between noise and true changes of the underlying
sampling distribution, can lead to highly undesirable results. Statistical measures
and tests can help to discover true changes in the distribution and to distinguish them
from random noise.
Applications of such change detection methods can be found in areas like quality
control and manufacturing [16, 20], intrusion detection [27] or medical diagnosis.
The main focus of this chapter is on univariate methods. There are also extensions to
multidimensional data [23] which are outside the scope of this contribution.
References

1. Aho, A.V., Ullman, J.D., Hopcroft, J.E.: Data Structures and Algorithms. Addison-Wesley, Boston (1983)
2. Basseville, M., Nikiforov, I.: Detection of Abrupt Changes: Theory and Application (Prentice Hall Information and System Sciences Series). Prentice Hall, Upper Saddle River, New Jersey (1993)
3. Beringer, J., Hüllermeier, E.: Efficient instance-based learning on data streams. Intelligent Data Analysis 11, 627–650 (2007)
4. Crawley, M.: Statistics: An Introduction using R. Wiley, New York (2005)
5. Dutta, S., Chattopadhyay, M.: A change detection algorithm for medical cell images. In: Proc.
Intern. Conf. on Scientific Paradigm Shift in Information Technology and Management, pp.
524–527. IEEE, Kolkata (2011)
6. Fisher, R.A.: Moments and product moments of sampling distributions. In: Proceedings of the London Mathematical Society, Series 2, 30, pp. 199–238 (1929)
7. Fisz, M.: Probability Theory and Mathematical Statistics. Wiley, New York (1963)
8. Ganti, V., Gehrke, J., Ramakrishnan, R.: Mining data streams under block evolution. SIGKDD
Explorations 3, 1–10 (2002)
9. Gelper, S., Schettlinger, K., Croux, C., Gather, U.: Robust online scale estimation in time series: A model-free approach. Journal of Statistical Planning & Inference 139(2), 335–349 (2009)
10. Grieszbach, G., Schack, B.: Adaptive quantile estimation and its application in analysis of biological signals. Biometrical Journal 35, 166–179 (1993)
11. Gustafsson, F.: Adaptive Filtering and Change Detection. Wiley, New York (2000)
12. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of
Statistics 6, 65–70 (1979)
13. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 97–106. ACM, New York (2001)
14. Ikonomovska, E., Gama, J., Sebastião, R., Gjorgjevik, D.: Regression trees from data streams with drift detection. In: 11th Int. Conf. on Discovery Science, LNAI, vol. 5808, pp. 121–135. Springer, Berlin (2009)
15. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Proc. 30th VLDB Conf., pp. 180–191. Toronto, Canada (2004)
16. Lai, T.: Sequential changepoint detection in quality control and dynamic systems. Journal of
the Royal Statistical Society, Series B 57, 613–658 (1995)
17. Möller, E., Grieszbach, G., Schack, B., Witte, H.: Statistical properties and control algorithms of recursive quantile estimators. Biometrical Journal 42, 729–746 (2000)
18. Nevelson, M., Chasminsky, R.: Stochastic Approximation and Recurrent Estimation. Verlag Nauka, Moscow (1972)
19. Qiu, G.: An improved recursive median filtering scheme for image processing. IEEE Trans-
actions on Image Processing 5, 646–648 (1996)
20. Ruusunen, M., Paavola, M., Pirttimaa, M., Leiviska, K.: Comparison of three change detec-
tion algorithms for an electronics manufacturing process. In: Proc. 2005 IEEE International
Symposium on Computational Intelligence in Robotics and Automation, pp. 679–683 (2005)
21. Shaffer, J.P.: Multiple hypothesis testing. Annual Review of Psychology 46, 561–584 (1995)
22. Sheskin, D.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC-Press,
Boca Raton, Florida (1997)
23. Song, X., Wu, M., Jermaine, C., Ranka, S.: Statistical change detection for multi-dimensional
data. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge
discovery and data mining, pp. 667–676. ACM, New York (2007)
24. Spitzer, F.: Principles of Random Walk (2nd edition). Springer, Berlin (2001)
25. Tschumitschew, K., Klawonn, F.: Incremental quantile estimation. Evolving Systems 1, 253–
264 (2010)
26. Tschumitschew, K., Klawonn, F.: The need for benchmarks with data from stochastic processes and meta-models in evolving systems. In: P. Angelov, D. Filev, N. Kasabov (eds.) International Symposium on Evolving Intelligent Systems, pp. 30–33. SSAISB, Leicester (2010)
27. Wang, K., Stolfo, S.: Anomalous payload-based network intrusion detection. In: E. Jonsson,
A. Valdes, M. Almgren (eds.) Recent Advances in Intrusion Detection, pp. 203–222. Springer,
Berlin (2004)