Deferral Classification of Evolving Temporal Dependent
Data Streams
ABSTRACT
Data streams generated in real-time can be strongly temporally dependent. In this case, standard techniques that assume class labels are uncorrelated may not be good enough. To deal with this problem, in this paper we present a new algorithm, based on deferral learning, to classify temporally correlated data over streams that may be time-varying. We show how simple classifiers such as Naive Bayes can boost their performance using this new meta-learning methodology. We give an empirical validation of our new algorithm over several real and artificial datasets.
Keywords
data streams, classification, temporal dependence
1. INTRODUCTION
Nowadays, Big Data applications can be found in many diverse fields that require deep computational insight, such as financial markets data, energy data, and many others [6]. These sources of data generated in real-time may have a strong dependence from one instance to the next: instances labeled with a specific class label are slightly more likely to be followed by instances with the same class label.
In this type of data we can observe short-term memory in the stream: in the case of binary classification, the probability of seeing a positive or a negative depends on the instance that has just passed. This seems to be the case in financial markets, and in energy-related markets such as the electricity market.
Acknowledging this type of temporal dependence is very important, since the performance of a classifier depends on it. A classifier dealing with temporally dependent data should always be compared with the no-change classifier, which predicts as class label the last class label seen on the stream. The reason is that, on datasets with strong temporal dependence, this very simple classifier beats more complex classifiers such as decision trees or Naive Bayes [4].
In this paper we present a new meta-classifier that can boost the performance of any classifier so that it can deal with this temporal dependency, and we run an empirical evaluation to show its benefits.
The paper is structured as follows: in Section 2 we discuss related
work, and in Section 3 we present our new meta classifier. We per-
form and discuss an extensive experimental evaluation in Section 4.
We end the paper giving some conclusions in Section 5.
2. RELATED WORK
The temporal dependency of data has been studied in time series [5],
and in regression [15].
In classification, a solution based on the no-change classifier was proposed in the winning entry of the EUNITE 2003 competition [16], which was about predicting glass quality from a glass production line. The problem required multi-step prediction into the future, and the winners used linear interpolation from the last observed value (as the prediction at t + 1) to the global mean (as the prediction at t + 20, the value furthest into the future). This was described as the Naive Rules approach.
Zliobaite [17] detected the temporal dependence component in the
Electricity Dataset. The Electricity Dataset due to [9] is a popu-
lar benchmark for testing adaptive classifiers. It has been used in
over 40 concept drift experiments [7, 12, 3, 14]. The Electricity Dataset was collected from the Australian New South Wales Electricity Market. It contains 45,312 instances recording electricity prices at 30-minute intervals. The class label identifies the change of the price (UP or DOWN) relative to a moving average of the last 24 hours. The data is subject to concept drift due to changing consumption habits, unexpected events, and seasonality.
The Kappa Plus Statistic was proposed in [4] to measure the performance of a classifier under temporal dependence. Just as the standard Kappa Statistic normalizes the accuracy of a classifier against a majority-class classifier, this new measure normalizes the accuracy of a classifier against the no-change classifier.
3. DEFERRAL CLASSIFIER
In this section we propose a new algorithm to classify evolving data streams, ensuring that the performance of our new classifier is at least as good as that of the no-change classifier.
Our new deferral classifier is based on the following pseudo-code:
• Each time a new instance arrives, make a prediction:
  – If the prediction is sufficiently certain (based on a probability threshold t), accept it;
  – If the prediction is not sufficiently certain, default to predicting the outcome of the last instance.
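This rule can be sketched as follows (an illustration only, not the authors' MOA/Java implementation; the vote dictionary and the threshold value are hypothetical):

```python
def deferral_predict(base_votes, last_label, t=0.7):
    """Accept the base learner's prediction only when its confidence
    reaches the threshold t; otherwise defer to the last label seen
    (the no-change prediction)."""
    label, confidence = max(base_votes.items(), key=lambda kv: kv[1])
    if confidence >= t:
        return label       # prediction is sufficiently certain
    return last_label      # not certain enough: defer

# A confident base prediction is accepted...
print(deferral_predict({"UP": 0.9, "DOWN": 0.1}, "DOWN"))    # UP
# ...an uncertain one falls back to the previous class label.
print(deferral_predict({"UP": 0.55, "DOWN": 0.45}, "DOWN"))  # DOWN
```

Setting t = 0 recovers the base classifier unchanged, while a high t makes the scheme behave almost like the no-change classifier.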
We extend it with two new features:
• tuning the threshold parameter t automatically as the stream is processed, keeping records of the accuracy over a sliding window for different values of t (such as 0.1, 0.2, ..., 1.0) and then picking the threshold that produces the lowest error;
• instead of using the last instance, using a consensus over the last n instances.
We implement an improvement to the first feature in the following way: we compute an exponential average of the error for each possible threshold and always pick the threshold with the lowest average error.
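This tuning step can be sketched as follows (a hedged illustration: the candidate grid matches the values listed above, but the decay factor 0.99 is an assumption, not a value from the paper):

```python
class ThresholdTuner:
    """Keep an exponentially averaged error for each candidate
    threshold and always use the one with the lowest average error."""

    def __init__(self, candidates=(0.1, 0.2, 0.3, 0.4, 0.5,
                                   0.6, 0.7, 0.8, 0.9, 1.0),
                 decay=0.99):
        self.decay = decay
        self.avg_error = {t: 0.0 for t in candidates}

    def update(self, confidence, base_label, last_label, true_label):
        # For every candidate t, score the prediction the deferral
        # rule would have made if t were the active threshold.
        for t in self.avg_error:
            pred = base_label if confidence >= t else last_label
            err = 0.0 if pred == true_label else 1.0
            self.avg_error[t] = (self.decay * self.avg_error[t]
                                 + (1.0 - self.decay) * err)

    def best(self):
        return min(self.avg_error, key=self.avg_error.get)
```

When the base learner is systematically wrong and the no-change prediction is right, the tuner drifts toward a high threshold (deferring often), and vice versa.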
We implemented the second feature as a classifier with a parameter α that maintains a vote for each class: when a new instance arrives, it multiplies all the votes by (1 − α) and adds α to the vote of the class of the instance. It is identical to a classifier that predicts the last class seen when α = 1, but when α < 1 it gives more weight to previous instances beyond the most recent one.
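A sketch of this vote-decay scheme (illustration only; the class labels are arbitrary strings):

```python
class DecayedVoteClassifier:
    """Consensus over recent labels: multiply all class votes by
    (1 - alpha) and add alpha to the vote of the arriving class.
    With alpha = 1 this reduces to the no-change classifier."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.votes = {}

    def observe(self, label):
        for c in self.votes:
            self.votes[c] *= (1.0 - self.alpha)
        self.votes[label] = self.votes.get(label, 0.0) + self.alpha

    def predict(self):
        # Predict the class with the largest accumulated vote.
        return max(self.votes, key=self.votes.get)
```

With α = 1 the prediction is always the last label observed; with a smaller α, a long run of one class can outvote a single deviating instance.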
4. EXPERIMENTAL EVALUATION
In this section, we perform two evaluations to compare the new classification scheme with previous state-of-the-art strategies:
• comparison on standard real and artificial datasets;
• comparison on artificial streams generated by adding strong temporal dependence.
Massive Online Analysis (MOA) [2] is a software environment for
implementing algorithms and running experiments for online learn-
ing from data streams. All algorithms evaluated in this paper were
implemented in the Java programming language by extending the
MOA software.
We use the experimental framework for concept drift presented in [3]. Considering data streams as data generated from pure distributions, we can model a concept drift event as a weighted combination of two pure distributions that characterize the target concepts before and after the drift. This framework defines, based on the sigmoid function, the probability that a new instance of the stream belongs to the new concept after the drift.
DEFINITION 1. Given two data streams a, b, we define c = a ⊕^W_{t0} b as the data stream built by joining the two data streams a and b, where t0 is the point of change, W is the length of change, Pr[c(t) = b(t)] = 1/(1 + e^{−4(t−t0)/W}), and Pr[c(t) = a(t)] = 1 − Pr[c(t) = b(t)].
In order to create a data stream with multiple concept changes, we can build new data streams by joining different concept drifts, i.e. (((a ⊕^{W0}_{t0} b) ⊕^{W1}_{t1} c) ⊕^{W2}_{t2} d) ...
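Definition 1 can be sketched as a simple generator that draws each instance from stream a or b with the sigmoid probability (a hedged illustration; the constant streams of "a"s and "b"s stand in for real stream generators):

```python
import itertools
import math
import random

def sigmoid_join(stream_a, stream_b, t0, w, seed=None):
    """Yield instances from c = a (+)^W_{t0} b: at time t, draw from b
    with probability 1 / (1 + exp(-4 (t - t0) / W))."""
    rng = random.Random(seed)
    for t, (x_a, x_b) in enumerate(zip(stream_a, stream_b)):
        p_b = 1.0 / (1.0 + math.exp(-4.0 * (t - t0) / w))
        yield x_b if rng.random() < p_b else x_a

# Join two trivial streams with the change centered at t0 = 500.
c = list(itertools.islice(
    sigmoid_join(itertools.repeat("a"), itertools.repeat("b"),
                 t0=500, w=100, seed=1), 1000))
print(c[0], c[-1])  # early instances come from a, late ones from b
```

Nesting calls to `sigmoid_join` reproduces the multi-drift composition (((a ⊕ b) ⊕ c) ⊕ d) ...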
Figure 1: Accuracy and Kappa Statistic on the Electricity Market Dataset, over time (instances), for Naive Bayes, the Deferral Classifier with α = 1, the Deferral Classifier with α = 0.5, and the Temporal Augmented Classifier.
4.1 Datasets for concept drift
Synthetic data has several advantages – it is easier to reproduce
and there is little cost in terms of storage and transmission. For this
paper we use the following data generators most commonly found
in the literature.
Rotating Hyperplane This data was used as a testbed for CVFDT versus VFDT in [11]. Examples for which ∑_{i=1}^{d} w_i x_i ≥ w_0 are labeled positive, and examples for which ∑_{i=1}^{d} w_i x_i < w_0 are labeled negative. Hyperplanes are useful for simulating time-changing concepts, because we can change the orientation and position of the hyperplane in a smooth manner by changing the relative size of the weights.
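The labeling rule can be sketched directly (the weight vector and threshold below are hypothetical values, not parameters from the paper):

```python
import random

def hyperplane_label(x, w, w0):
    """Label an example positive iff sum_i w_i * x_i >= w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) >= w0

# Hypothetical weights; rotating the hyperplane smoothly amounts to
# slowly changing these weights as the stream is generated.
w = [0.4, -0.2, 0.3]
rng = random.Random(7)
stream = [[rng.random() for _ in w] for _ in range(5)]
labels = [hyperplane_label(x, w, w0=0.25) for x in stream]
```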
Random RBF Generator This generator was devised to offer an alternative complex concept type that is not straightforward to approximate with a decision tree model. The RBF (Radial Basis Function) generator works as follows: a fixed number of random centroids are generated. Each centroid has a random position, a single standard deviation, a class label and a weight. New examples are generated by selecting a centroid at random, taking weights into consideration so that centroids with higher weight are more likely to be chosen. A random direction is chosen to offset the attribute values from the central point. Drift is introduced by moving the centroids with constant speed.

Figure 2: Accuracy, κ and κ+ on the Forest Covertype dataset, over time (instances), for Naive Bayes, the Deferral Classifier with α = 1, the Deferral Classifier with α = 0.5, and the Temporal Augmented Classifier.
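A simplified sketch of this generation process (hedged: the per-step random displacement is a simplification; the actual MOA generator moves each centroid along a fixed random direction at constant speed):

```python
import math
import random

def make_rbf_generator(n_centroids, n_attrs, speed, seed=0):
    """Sketch of the Random RBF generator: pick a centroid by weight,
    offset the attributes in a random direction scaled by the
    centroid's standard deviation, then move the centroids."""
    rng = random.Random(seed)
    cents = [{"pos": [rng.random() for _ in range(n_attrs)],
              "std": rng.random(), "label": rng.randrange(2),
              "weight": rng.random()} for _ in range(n_centroids)]

    def next_instance():
        c = rng.choices(cents, weights=[cc["weight"] for cc in cents])[0]
        direction = [rng.gauss(0, 1) for _ in range(n_attrs)]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        r = rng.gauss(0, 1) * c["std"]   # offset magnitude
        x = [p + r * d / norm for p, d in zip(c["pos"], direction)]
        for cc in cents:                 # introduce drift
            cc["pos"] = [p + speed * rng.uniform(-1, 1)
                         for p in cc["pos"]]
        return x, c["label"]

    return next_instance
```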
4.2 Real-World Data
The UCI machine learning repository [1] contains some real-world
benchmark data for evaluating machine learning techniques. We
consider the following datasets: Forest Covertype, Poker-Hand,
and Electricity.
Forest Covertype Contains the forest cover type for 30 x 30 meter
cells obtained from US Forest Service (USFS) Region 2 Re-
source Information System (RIS) data. It contains 581,012
instances and 54 attributes, and it has been used in several
papers on data stream classification [8, 13].
Poker-Hand Consists of 1,000,000 instances and 11 attributes.
Each record of the Poker-Hand dataset is an example of a
hand consisting of five playing cards drawn from a standard
deck of 52. Each card is described using two attributes (suit
and rank), for a total of 10 predictive attributes. There is one
class attribute that describes the “Poker Hand”.
Electricity is another widely used dataset described by M. Har-
ries [10] and analysed by Gama [7]. This data was collected
from the Australian New South Wales Electricity Market. In
this market, prices are not fixed and are affected by demand
and supply of the market. They are set every five minutes.
The ELEC dataset contains 45,312 instances. The class la-
bel identifies the change of the price relative to a moving
average of the last 24 hours.
We use normalized versions of these datasets, so that the numerical values are between 0 and 1. In the Poker-Hand dataset, the cards are not ordered, i.e. a hand can be represented by any permutation, which makes it very hard for propositional learners, especially linear ones. We use a modified version, where cards are sorted by rank and suit and duplicates have been removed. With this modification the dataset loses about 171,799 examples, and comes down to 829,201 examples.
These datasets are small compared to the synthetic datasets we consider. Another important fact is that we do not know when drift occurs, or indeed whether there is any drift. We may simulate concept drift by joining the three datasets, merging their attributes, and supposing that each dataset corresponds to a different concept:

CovPokElec = (CoverType ⊕^{5,000}_{581,012} Poker) ⊕^{5,000}_{1,000,000} ELEC

As all examples need to have the same number of attributes, we simply concatenate all the attributes, and set the number of classes to the maximum number of classes of all the datasets.
4.3 Results
We ran an experimental evaluation to test our new deferral classifier. We compare the original Naive Bayes with the following classifiers:
• deferral classifier with α = 1,
• deferral classifier with α = 0.5,
• temporal augmented classifier [4], where the class label of the previous instance is used as an additional attribute.

                 Naive Bayes   Deferral           Deferral             Temporal Augmented
                               Classifier α = 1   Classifier α = 0.5   Classifier
COVTYPE              60.52         95.07              94.54                85.67
ELECTRICITY          73.36         85.21              84.85                78.48
POKER                59.55         64.21              63.19                74.73
COVPOKELEC           24.24         49.54              47.54                34.45
HYP(10,0.001)        70.91         70.91              70.91                70.91
HYP(10,0.0001)       91.25         91.25              91.25                91.25
RBF(0,0)             51.21         51.19              51.20                51.21
RBF(50,0.001)        29.14         28.90              28.91                29.12
RBF(10,0.001)        51.96         51.96              51.96                51.95
RBF(50,0.0001)       30.99         30.56              30.62                30.99
RBF(10,0.0001)       52.10         52.10              52.10                52.10
Average              54.11         60.99              60.64                59.17

Table 1: Comparison of Naive Bayes, the Deferral Classifier with α = 1 and α = 0.5, and the Temporal Augmented Classifier. Accuracy is measured as the final percentage of examples correctly classified over the 1 million test/train interleaved evaluation. The best individual accuracies are indicated in boldface.
We use the datasets introduced previously for evaluation. The ex-
periments were performed on 2.66 GHz Core 2 Duo E6750 ma-
chines with 4 GB of memory.
The evaluation methodology used was Interleaved Test-Then-Train: every example is used for testing the model before using it to train. This interleaved test-then-train procedure was carried out on one million examples from the hyperplane and RandomRBF datasets. The parameters of these streams are the following:
• RBF(x,v): RandomRBF data stream of 5 classes with x centroids moving at speed v.
• HYP(x,v): Hyperplane data stream of 5 classes with x attributes changing at speed v.
We report the following measures, based on the accuracy of the classifiers compared with very simple baselines:
• final accuracy p0;
• κ = (p0 − pc)/(1 − pc), where pc is the probability that the classifier predicts correctly by chance;
• κ+ = (p0 − p′e)/(1 − p′e), where p′e is the accuracy of the classifier that predicts using the label of the last instance seen.
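Both normalizations are straightforward to compute from the three accuracies (the values below are hypothetical, chosen to show how κ can look good while κ+ is negative):

```python
def kappa(p0, pc):
    """Kappa Statistic: accuracy normalized against chance (pc)."""
    return (p0 - pc) / (1.0 - pc)

def kappa_plus(p0, pe_prime):
    """Kappa Plus Statistic: accuracy normalized against the
    no-change classifier, whose accuracy is pe_prime."""
    return (p0 - pe_prime) / (1.0 - pe_prime)

# Hypothetical stream: 80% accuracy, 50% chance agreement, and a
# no-change classifier that already gets 85% right.
print(kappa(0.80, 0.50))       # ≈ 0.6: well above chance
print(kappa_plus(0.80, 0.85))  # ≈ -0.33: worse than no-change
```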
We plot the learning curves for the ELECTRICITY dataset in Figure 1 and for the FOREST COVERTYPE dataset in Figure 2. We observe that the two deferral classifiers perform similarly, and with higher accuracy, κ and κ+ than both the single Naive Bayes and the temporal augmented Naive Bayes.
Tables 1, 2 and 3 report the performance of the classification models induced on the synthetic data and the real datasets: FOREST COVERTYPE, POKER-HAND, ELECTRICITY and COVPOKELEC. The performance is measured as the final value over the test/train interleaved evaluation.
We see that the deferral classifiers are superior to the Temporal Augmented Classifier and to the Naive Bayes classifier alone. In Table 3, we see that the classifiers have positive κ+ values on the artificial datasets, as these lack strong temporal dependence, but in some cases negative values on the real datasets, indicating performance worse than the no-change classifier. The classifier with the best accuracy, Kappa Statistic and Kappa Plus Statistic is the deferral classifier using α = 1.
4.4 Artificial Simulator
In order to create artificial data streams with controlled temporal dependence, we built an artificial meta-generator that can be used on top of the current data stream generators.
It works in the following way: given a threshold probability p, each time we want to generate a data instance from the stream generator:
• r = random number drawn from a uniform distribution between 0 and 1
• if r is less than or equal to p:
  – class = class of the previous instance
• else:
  – class = a randomly selected class that is not the same as that of the previous instance
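This meta-generator can be sketched as a wrapper over the labels produced by any underlying generator (a hedged illustration; the label sequence and class set below are made up):

```python
import random

def add_temporal_dependence(labels, classes, p, seed=None):
    """Rewrite a label sequence so that, with probability p, each
    instance repeats the previous class, and otherwise gets a class
    different from the previous one."""
    rng = random.Random(seed)
    out = []
    prev = None
    for lbl in labels:
        if prev is not None and rng.random() <= p:
            lbl = prev                          # repeat previous class
        elif prev is not None:
            lbl = rng.choice([c for c in classes
                              if c != prev])    # force a change
        out.append(lbl)
        prev = lbl
    return out

dependent = add_temporal_dependence(
    ["A", "B", "C", "A"] * 250, classes=["A", "B", "C"], p=0.9, seed=3)
```

With p close to 1, long runs of the same class appear, so the no-change classifier becomes very hard to beat without deferral.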
Figure 3 shows a simple experiment. We use the Random Tree generator and a probability of p = 0.9. We observe that our deferral classifier classifies the instances of the stream with added strong temporal dependence with 20% higher accuracy than the single Naive Bayes classifier.
5. CONCLUSIONS
In this paper we presented a new deferral classifier to address the problem of temporal dependence in evolving data streams. We showed the benefits of the new method by running an empirical evaluation over several datasets, using our method as a meta-classifier over the Naive Bayes classifier.
Figure 3: Accuracy and Kappa Statistic on the artificially generated stream with controlled temporal dependence of p = 0.9, using the standard random tree generator, for Naive Bayes, the Deferral Classifier with α = 1, and the Temporal Augmented Classifier.
As future work, we would like to continue studying this problem in
more depth, and try to apply these techniques to the more challeng-
ing setting of evolving data stream multi-label classification, where
the number of labels is not fixed, and the probability distribution
that is generating the data may be evolving.
6. REFERENCES
[1] A. Asuncion and D. Newman. UCI machine learning
repository, 2007.
[2] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA:
Massive online analysis. J. of Mach. Learn. Res.,
11:1601–1604, 2010.
[3] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and
R. Gavaldà. New ensemble methods for evolving data
streams. In Proc. of the 15th ACM SIGKDD int. conf. on
Knowledge discovery and data mining, KDD, pages
139–148, 2009.
[4] A. Bifet, J. Read, I. Zliobaite, B. Pfahringer, and G. Holmes.
Pitfalls in benchmarking data stream classification and how
to avoid them. In Proc. of the European Conference on
Machine Learning and Principles and Practice of
Knowledge Discovery in Databases, ECMLPKDD, pages
465–479, 2013.
[5] G. E. P. Box and G. M. Jenkins. Time Series Analysis:
Forecasting and Control. Prentice Hall PTR, Upper Saddle
River, NJ, USA, 3rd edition, 1994.
[6] K. Cukier. Data, data everywhere. The Economist Report,
2010.
[7] J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning
with drift detection. In Proc. of the 7th Brazilian Symp. on
Artificial Intelligence, SBIA, pages 286–295, 2004.
[8] J. Gama, R. Rocha, and P. Medas. Accurate decision trees for
mining high-speed data streams. In KDD, pages 523–528,
2003.
M. Harries. SPLICE-2 comparative evaluation: Electricity
pricing. Technical report, University of New South Wales, 1999.
[10] M. Harries. SPLICE-2 comparative evaluation: Electricity
pricing. Technical report, University of New South Wales,
1999.
[11] G. Hulten, L. Spencer, and P. Domingos. Mining
time-changing data streams. In Proc. of the 7th ACM
SIGKDD int. conf. on Knowl. disc. and data mining, KDD,
pages 97–106, 2001.
[12] J. Kolter and M. Maloof. Dynamic weighted majority: An
ensemble method for drifting concepts. J. of Mach. Learn.
Res., 8:2755–2790, 2007.
[13] N. C. Oza and S. J. Russell. Experimental comparisons of
online and batch versions of bagging and boosting. In KDD,
pages 359–364, 2001.
[14] G. Ross, N. Adams, D. Tasoulis, and D. Hand. Exponentially
weighted moving average charts for detecting concept drift.
Pattern Recogn. Lett, 33:191–198, 2012.
[15] G. Seber and C. Wild. Nonlinear Regression. Wiley Series in
Probability and Statistics. Wiley, 2003.
[16] M. Wojnarski. Prediction of product quality in glass
manufacturing process using LTF-A neural network.
Technical report, EUNITE Competition, 2003.
[17] I. Zliobaite. How good is the electricity benchmark for
evaluating concept drift adaptation. CoRR, abs/1301.3524,
2013.
                 Naive Bayes   Deferral           Deferral             Temporal Augmented
                               Classifier α = 1   Classifier α = 0.5   Classifier
COVTYPE              40.91         92.09              91.23                77.99
ELECTRICITY          42.54         69.76              69.01                54.42
POKER                24.79         34.78              32.59                54.60
COVPOKELEC           12.40         30.04              28.37                23.38
HYP(10,0.001)        41.82         41.82              41.82                41.81
HYP(10,0.0001)       82.51         82.51              82.51                82.50
RBF(0,0)             35.47         35.45              35.46                35.47
RBF(50,0.001)         2.29          2.25               2.24                 2.26
RBF(10,0.001)        36.18         36.17              36.17                36.16
RBF(50,0.0001)        5.49          5.37               5.43                 5.48
RBF(10,0.0001)       36.39         36.39              36.39                36.39
Average              32.80         42.42              41.93                40.95

Table 2: Comparison of Naive Bayes, the Deferral Classifier with α = 1 and α = 0.5, and the Temporal Augmented Classifier. The Kappa Statistic is measured at the end of the 1 million example test/train interleaved evaluation. The best individual values are indicated in boldface.
                 Naive Bayes   Deferral           Deferral             Temporal Augmented
                               Classifier α = 1   Classifier α = 0.5   Classifier
COVTYPE             -699.37         0.14             -10.65              -190.18
ELECTRICITY          -81.59        -0.84              -3.29               -46.70
POKER                -58.84       -40.54             -44.55                 0.75
COVPOKELEC          -338.46      -192.05            -203.62              -279.35
HYP(10,0.001)         41.83        41.83              41.83                41.82
HYP(10,0.0001)        82.50        82.50              82.50                82.50
RBF(0,0)              36.87        36.85              36.86                36.87
RBF(50,0.001)          8.31         8.01               8.02                 8.29
RBF(10,0.001)         37.85        37.85              37.85                37.83
RBF(50,0.0001)        10.72        10.16              10.24                10.71
RBF(10,0.0001)        38.03        38.02              38.03                38.02
Average              -83.83         1.99              -0.62               -23.58

Table 3: The Kappa Plus Statistic, measured at the end of the 1 million example test/train interleaved evaluation. The best individual values are indicated in boldface.