Deferral Classification of Evolving Temporal Dependent
Data Streams
ABSTRACT
Data streams generated in real-time can be strongly temporally dependent. In this case, standard techniques, which assume that class labels are not correlated, may produce sub-optimal performance because the assumption is incorrect. To deal with this problem, we present in this paper a new algorithm to classify temporally correlated data based on deferral learning, suitable for learning over time-varying streams. We show how simple classifiers such as Naive Bayes can boost their performance using this new meta-learning methodology. We give an empirical validation of our new algorithm over several real and artificial datasets.
Keywords
data streams, classification, temporal dependence
1. INTRODUCTION
Nowadays, Big Data applications can be found in many diverse fields that require deep computational insights, such as financial markets, energy markets, and many others [6]. These sources of data generated in real-time may have a strong dependence from one instance to the next: instances labeled with a specific class label are slightly more likely to be followed by instances with the same class label.
In this type of data, we can observe that there is short-term memory in the stream: in the case of binary classification, the probability of seeing a positive or a negative instance depends on the instance that has just passed. This seems to be the case in financial markets, and in energy-related markets such as the electricity market.
Acknowledging this type of temporal dependence is very important, since the performance of the classifier depends on it. A classifier dealing with temporally dependent data should always be compared with the no-change classifier, a classifier that predicts, as a class label, the last class label seen on the stream. This is due to the fact that, on datasets with strong temporal dependence, this very simple classifier beats more complex classifiers such as decision trees or Naive Bayes [4].
In this paper we present a new meta-classifier that can boost the performance of any classifier so that it can deal with this temporal data dependency, and we run an empirical evaluation to show its benefits.
The paper is structured as follows: in Section 2 we discuss related work, and in Section 3 we present our new meta-classifier. We perform and discuss an extensive experimental evaluation in Section 4. We end the paper with some conclusions in Section 5.
2. RELATED WORK
The temporal dependency of data has been studied in time series [5] and in regression [15].
In classification, a solution based on the no-change classifier was proposed in the winning solution of the EUNITE 2003 competition [16], which was about predicting glass quality on a glass production line. The problem required multi-step prediction into the future, and the winners used linear interpolation from the last value (as the prediction for the first value, at t + 1) to the global mean (as the prediction for the value at t + 20, the one furthest into the future). This was described as the Naive Rules approach.
Zliobaite [17] detected the temporal dependence component in the Electricity Dataset. The Electricity Dataset, due to [9], is a popular benchmark for testing adaptive classifiers. It has been used in over 40 concept drift experiments [7, 12, 3, 14]. The Electricity Dataset was collected from the Australian New South Wales Electricity Market. The dataset contains 45,312 instances which record electricity prices at 30 minute intervals. The class label identifies the change of the price (UP or DOWN) relative to a moving average of the last 24 hours. The data is subject to concept drift due to changing consumption habits, unexpected events and seasonality.
The Kappa Plus Statistic was proposed in [4] to measure the performance of a classifier under temporal dependence. Whereas the standard Kappa Statistic normalizes the accuracy of a classifier against the performance of a majority-class classifier, this new measure normalizes the accuracy of a classifier against the no-change classifier.
3. DEFERRAL CLASSIFIER
In this section we propose a new algorithm to classify evolving data streams, ensuring that the performance of our new classifier will be at least as good as that of the no-change classifier.
Our new deferral classifier is based on the following pseudo-code:

    Each time a new instance arrives, make a prediction.
    If the prediction is sufficiently certain (based on a
      probability threshold t), then accept it.
    If the prediction is not sufficiently certain, default
      to predicting the outcome of the last instance.
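The decision rule above can be sketched in a few lines of Python. This is an illustrative sketch only (the paper's implementation extends MOA in Java); the probability-dictionary interface and function name are assumptions:

```python
def deferral_predict(base_probs, last_label, t=0.8):
    """Deferral rule: trust the base classifier only when its most
    confident class probability reaches threshold t; otherwise fall
    back to the no-change prediction (the last label seen)."""
    best_label = max(base_probs, key=base_probs.get)
    if base_probs[best_label] >= t:
        return best_label   # prediction is sufficiently certain
    return last_label       # defer to the no-change classifier
```

For example, with class probabilities {"UP": 0.55, "DOWN": 0.45} and threshold t = 0.8, the base prediction is not certain enough, so the rule returns the previous label.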
We extend it with two new features:
1. tuning the threshold parameter t automatically as the stream is processed, keeping records of the accuracy over a sliding window for different values of t (such as 0.1, 0.2, ..., 1.0) and then picking the threshold that produces the lowest error;
2. instead of using the last instance, using a consensus over the last n instances.
We implement an improvement to the first feature in the following way: we compute an exponential average of the error for each possible threshold and always pick the threshold with the lowest average error.
We implemented the second feature as a classifier with a parameter α that maintains a vote count for each class: when a new instance arrives, it multiplies all the votes by (1 − α) and adds α to the vote of the class of the instance. It is identical to a classifier that predicts the last class seen when α = 1, but when α < 1 it gives more weight to previous instances beyond the most recent.
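A minimal sketch of this α-weighted consensus in Python (names are illustrative; the paper's implementation is in Java on top of MOA):

```python
class DecayedVote:
    """Consensus over recent labels: each class keeps a vote; on every
    new instance all votes are multiplied by (1 - alpha) and alpha is
    added to the observed class's vote. With alpha = 1 this reduces to
    the no-change classifier (only the last label matters)."""
    def __init__(self, alpha):
        self.alpha = alpha
        self.votes = {}

    def observe(self, label):
        for c in self.votes:
            self.votes[c] *= (1 - self.alpha)
        self.votes[label] = self.votes.get(label, 0.0) + self.alpha

    def predict(self):
        # fall back prediction: class with the largest decayed vote
        return max(self.votes, key=self.votes.get)
```

With α = 0.3, for example, a single recent "DOWN" does not immediately outweigh a run of "UP" labels, which is the intended smoothing over the last n instances.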
4. EXPERIMENTAL EVALUATION
In this section, we perform two evaluations to compare the new classification scheme with previous state-of-the-art strategies:
- comparison on standard real and artificial datasets;
- comparison on artificial streams generated by adding strong temporal dependence.
Massive Online Analysis (MOA) [2] is a software environment for implementing algorithms and running experiments for online learning from data streams. All algorithms evaluated in this paper were implemented in the Java programming language by extending the MOA software.
We use the experimental framework for concept drift presented in [3]. Considering data streams as data generated from pure distributions, we can model a concept drift event as a weighted combination of two pure distributions that characterize the target concepts before and after the drift. This framework defines the probability that a new instance of the stream belongs to the new concept after the drift based on the sigmoid function.
DEFINITION 1. Given two data streams a, b, we define c = a ⊕_{t0}^{W} b as the data stream built by joining the two data streams a and b, where t0 is the point of change, W is the length of change, Pr[c(t) = b(t)] = 1/(1 + e^{−4(t−t0)/W}) and Pr[c(t) = a(t)] = 1 − Pr[c(t) = b(t)].
In order to create a data stream with multiple concept changes, we can build new data streams by joining different concept drifts, i.e. (((a ⊕_{t0}^{W0} b) ⊕_{t1}^{W1} c) ⊕_{t2}^{W2} d) ...
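The sigmoid-weighted join of Definition 1 can be sketched as a generator that, at each time step, draws the next instance from stream b with probability 1/(1 + e^{−4(t−t0)/W}). This is an illustrative sketch, not the framework's actual code:

```python
import math
import random

def joined_stream(a, b, t0, w, rng=None):
    """Join two instance generators a and b as in Definition 1:
    at time t the instance comes from b with probability
    1 / (1 + exp(-4 (t - t0) / w)), so the stream shifts smoothly
    from concept a to concept b around t0 over a window of length w."""
    rng = rng or random.Random(1)
    t = 0
    while True:
        p_b = 1.0 / (1.0 + math.exp(-4.0 * (t - t0) / w))
        yield next(b) if rng.random() < p_b else next(a)
        t += 1
```

Nesting calls to `joined_stream` reproduces the multiple-drift construction (((a ⊕ b) ⊕ c) ⊕ d) described above.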
[Figure 1: Accuracy and Kappa Statistic on the Electricity Market Dataset. Curves shown: Naive Bayes, Deferral Classifier α = 1, Deferral Classifier α = 0.5, Temporal Augmented Classifier.]
4.1 Datasets for concept drift
Synthetic data has several advantages – it is easier to reproduce
and there is little cost in terms of storage and transmission. For this
paper we use the following data generators most commonly found
in the literature.
Rotating Hyperplane. This data was used as a testbed for CVFDT versus VFDT in [11]. Examples for which Σ_{i=1}^{d} w_i x_i ≥ w_0 are labeled positive, and examples for which Σ_{i=1}^{d} w_i x_i < w_0 are labeled negative. Hyperplanes are useful for simulating time-changing concepts, because we can change the orientation and position of the hyperplane in a smooth manner by changing the relative size of the weights.
Random RBF Generator. This generator was devised to offer an alternate complex concept type that is not straightforward to approximate with a decision tree model. The RBF (Radial Basis Function) generator works as follows: a fixed number of random centroids are generated. Each center has a random position, a single standard deviation, a class label and a weight. New examples are generated by selecting a center at random, taking weights into consideration so that centers with higher weight are more likely to be chosen. A random direction is chosen to offset the attribute values from the central point. Drift is introduced by moving the centroids with constant speed.

[Figure 2: Accuracy, κ and κ+ on the Forest Covertype dataset. Curves shown: Naive Bayes, Deferral Classifier α = 1, Deferral Classifier α = 0.5, Temporal Augmented Classifier.]
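The RBF generation procedure above can be sketched as follows. This is a simplified illustration of the described scheme; parameter names are assumptions and do not match MOA's actual generator API:

```python
import random

def make_rbf_generator(n_centroids=50, n_attrs=10, n_classes=5,
                       speed=0.0, seed=1):
    """Sketch of the Random RBF generator: random weighted centroids,
    each with a position, standard deviation and class label. Each
    instance offsets a weight-sampled centroid's position along a
    random direction; drift moves all centroids by `speed`."""
    rng = random.Random(seed)
    cents = [{
        "pos": [rng.random() for _ in range(n_attrs)],
        "std": rng.random(),
        "cls": rng.randrange(n_classes),
        "w": rng.random(),
    } for _ in range(n_centroids)]

    def next_instance():
        # weighted centroid choice, then offset along a random direction
        c = rng.choices(cents, weights=[cc["w"] for cc in cents])[0]
        direction = [rng.gauss(0, 1) for _ in range(n_attrs)]
        x = [p + c["std"] * d for p, d in zip(c["pos"], direction)]
        # drift: nudge every centroid's position
        for cent in cents:
            cent["pos"] = [p + speed * rng.uniform(-1, 1)
                           for p in cent["pos"]]
        return x, c["cls"]

    return next_instance
```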
4.2 Real-World Data
The UCI machine learning repository [1] contains some real-world
benchmark data for evaluating machine learning techniques. We
consider the following datasets: Forest Covertype, Poker-Hand,
and Electricity.
Forest Covertype. Contains the forest cover type for 30 x 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 581,012 instances and 54 attributes, and it has been used in several papers on data stream classification [8, 13].
Poker-Hand. Consists of 1,000,000 instances and 11 attributes. Each record of the Poker-Hand dataset is an example of a hand consisting of five playing cards drawn from a standard deck of 52. Each card is described using two attributes (suit and rank), for a total of 10 predictive attributes. There is one class attribute that describes the "Poker Hand".
Electricity. Another widely used dataset, described by M. Harries [10] and analysed by Gama [7]. This data was collected from the Australian New South Wales Electricity Market. In this market, prices are not fixed and are affected by the demand and supply of the market. They are set every five minutes. The ELEC dataset contains 45,312 instances. The class label identifies the change of the price relative to a moving average of the last 24 hours.
We use normalized versions of these datasets, so that the numerical values are between 0 and 1. In the Poker-Hand dataset, the cards are not ordered, i.e. a hand can be represented by any permutation, which makes it very hard for propositional learners, especially for linear ones. We use a modified version, where cards are sorted by rank and suit, and duplicates have been removed. This dataset loses about 171,799 examples, and comes down to 829,201 examples.
These datasets are small compared to the synthetic datasets we consider. Another important fact is that we do not know when drift occurs, or indeed whether there is any drift. We may simulate concept drift by joining the three datasets, merging attributes, and supposing that each dataset corresponds to a different concept:

    CovPokElec = (CoverType ⊕_{581,012}^{5,000} Poker) ⊕_{1,000,000}^{5,000} ELEC

As all examples need to have the same number of attributes, we simply concatenate all the attributes, and set the number of classes to the maximum number of classes of all the datasets.
4.3 Results
We ran an experimental evaluation to test our new deferral classifier. We compare the original Naive Bayes with the following classifiers:
- deferral classifier with α = 1,
- deferral classifier with α = 0.5,
- temporal augmented classifier [4], where the class label of the previous instance is used as an additional attribute.

Table 1: Comparison of Naive Bayes, the Deferral Classifier with α = 1 and α = 0.5, and the Temporal Augmented Classifier. Accuracy is measured as the final percentage of examples correctly classified over the 1 million test/train interleaved evaluation. The best individual accuracies are indicated in boldface.

                   Naive Bayes   Deferral α=1   Deferral α=0.5   Temporal Augmented
COVTYPE                  60.52          95.07            94.54                85.67
ELECTRICITY              73.36          85.21            84.85                78.48
POKER                    59.55          64.21            63.19                74.73
COVPOKELEC               24.24          49.54            47.54                34.45
HYP(10,0.001)            70.91          70.91            70.91                70.91
HYP(10,0.0001)           91.25          91.25            91.25                91.25
RBF(0,0)                 51.21          51.19            51.20                51.21
RBF(50,0.001)            29.14          28.90            28.91                29.12
RBF(10,0.001)            51.96          51.96            51.96                51.95
RBF(50,0.0001)           30.99          30.56            30.62                30.99
RBF(10,0.0001)           52.10          52.10            52.10                52.10
Average                  54.11          60.99            60.64                59.17
We use the datasets introduced previously for evaluation. The experiments were performed on 2.66 GHz Core 2 Duo E6750 machines with 4 GB of memory.
The evaluation methodology used was Interleaved Test-Then-Train: every example was used for testing the model before using it to train. This interleaved test-then-train procedure was carried out on one million examples from the Hyperplane and RandomRBF datasets. The parameters of these streams are the following:
- RBF(x, v): RandomRBF data stream of 5 classes with x centroids moving at speed v.
- HYP(x, v): Hyperplane data stream of 5 classes with x attributes changing at speed v.
We report the following measures, based on the accuracy of the classifiers compared with very simple classifiers:
- final accuracy p0;
- κ = (p0 − pc)/(1 − pc), where pc is the probability that the classifier predicts correctly by chance;
- κ+ = (p0 − p'e)/(1 − p'e), where p'e is the accuracy of the classifier that predicts using the label of the last instance seen.
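The two normalized measures are direct one-line computations. A sketch, following the definitions above (function names are ours):

```python
def kappa(p0, pc):
    """Kappa Statistic: accuracy p0 normalized against the accuracy pc
    of a chance (majority-class-based) classifier."""
    return (p0 - pc) / (1 - pc)

def kappa_plus(p0, pe_prime):
    """Kappa Plus Statistic: accuracy p0 normalized against the
    accuracy pe_prime of the no-change classifier."""
    return (p0 - pe_prime) / (1 - pe_prime)
```

Note that κ+ is negative whenever the classifier's accuracy is below that of the no-change classifier, which is exactly the situation the measure is designed to expose on temporally dependent streams.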
We plot the learning curves for the ELECTRICITY dataset in Figure 1 and for the FOREST COVERTYPE dataset in Figure 2. We observe that the two deferral classifiers perform similarly to each other, and both achieve higher accuracy, κ and κ+ than the single Naive Bayes and the temporal augmented Naive Bayes.
Tables 1, 2 and 3 report the performance of the classification models induced on the synthetic data and the real datasets: FOREST COVERTYPE, POKER-HAND, ELECTRICITY and COVPOKELEC. The performance is measured as the final percentage of examples correctly classified over the test/train interleaved evaluation.
We see that the deferral classifiers are superior to the Temporal Augmented Classifier and to the Naive Bayes classifier alone. In Table 3, we see that the classifiers have positive κ+ values on the artificial datasets, as these do not have strong temporal dependence, but on the real datasets the values are in some cases negative, indicating performance worse than that of the no-change classifier. The classifier with the best accuracy, Kappa Statistic and Kappa Plus Statistic is the deferral classifier with α = 1.
4.4 Artificial Simulator
In order to create artificial data streams with controlled temporal dependence, we create an artificial meta-generator that can be used on top of the current data stream generators.
It works in the following way: given a threshold probability p, each time we want to generate a data instance from the stream generator:

    r = random number from a uniform distribution
        between 0 and 1
    if r is less than or equal to p:
        class = class of previous instance
    else:
        class = a randomly selected class that is not
                the same as the previous instance
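The meta-generator above can be sketched as a wrapper around any base stream of (attributes, label) pairs. A minimal sketch, with illustrative names (the actual implementation is a MOA generator in Java):

```python
import random

def add_temporal_dependence(stream, classes, p, rng=random.Random(42)):
    """Wrap a base stream: with probability p the emitted label repeats
    the previous one; otherwise a different class is chosen at random,
    following the meta-generator pseudo-code above."""
    prev = None
    for x, y in stream:
        if prev is not None and rng.random() <= p:
            label = prev                      # repeat: temporal dependence
        else:
            others = [c for c in classes if c != prev]
            label = rng.choice(others)        # switch to a different class
        prev = label
        yield x, label
```

With p = 0.9 roughly 90% of consecutive labels repeat, giving the strong temporal dependence used in the experiment below.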
Figure 3 shows a simple experiment. We use the Random Tree generator and a probability of p = 0.9. We observe that our deferral classifier is able to classify the instances of the stream with added strong temporal dependence with about 20% higher accuracy than the single Naive Bayes classifier.
5. CONCLUSIONS
In this paper we presented a new deferral classifier to address the problem of temporal dependence in evolving data streams. We showed the benefits of the new method by running an empirical evaluation over several datasets, using our method as a meta-classifier over the Naive Bayes classifier.
[Figure 3: Accuracy and Kappa Statistic on the artificially generated stream with controlled temporal dependence of p = 0.9, using the standard Random Tree generator. Curves shown: Naive Bayes, Deferral Classifier α = 1, Temporal Augmented Classifier.]
As future work, we would like to continue studying this problem in
more depth, and try to apply these techniques to the more challeng-
ing setting of evolving data stream multi-label classification, where
the number of labels is not fixed, and the probability distribution
that is generating the data may be evolving.
6. REFERENCES
[1] A. Asuncion and D. Newman. UCI machine learning
repository, 2007.
[2] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA:
Massive online analysis. J. of Mach. Learn. Res.,
11:1601–1604, 2010.
[3] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and
R. Gavaldà. New ensemble methods for evolving data
streams. In Proc. of the 15th ACM SIGKDD int. conf. on
Knowledge discovery and data mining, KDD, pages
139–148, 2009.
[4] A. Bifet, J. Read, I. Zliobaite, B. Pfahringer, and G. Holmes.
Pitfalls in benchmarking data stream classification and how
to avoid them. In Proc. of the European Conference on
Machine Learning and Principles and Practice of
Knowledge Discovery in Databases, ECMLPKDD, pages
465–479, 2013.
[5] G. E. P. Box and G. M. Jenkins. Time Series Analysis:
Forecasting and Control. Prentice Hall PTR, Upper Saddle
River, NJ, USA, 3rd edition, 1994.
[6] K. Cukier. Data, data everywhere. The Economist Report,
2010.
[7] J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning
with drift detection. In Proc. of the 7th Brazilian Symp. on
Artificial Intelligence, SBIA, pages 286–295, 2004.
[8] J. Gama, R. Rocha, and P. Medas. Accurate decision trees for
mining high-speed data streams. In KDD, pages 523–528,
2003.
[9] M. Harries. SPLICE-2 comparative evaluation: Electricity
pricing. Tech. report, University of New South Wales, 1999.
[10] M. Harries. SPLICE-2 comparative evaluation: Electricity pricing. Technical report, University of New South Wales, 1999.
[11] G. Hulten, L. Spencer, and P. Domingos. Mining
time-changing data streams. In Proc. of the 7th ACM
SIGKDD int. conf. on Knowl. disc. and data mining, KDD,
pages 97–106, 2001.
[12] J. Kolter and M. Maloof. Dynamic weighted majority: An
ensemble method for drifting concepts. J. of Mach. Learn.
Res., 8:2755–2790, 2007.
[13] N. C. Oza and S. J. Russell. Experimental comparisons of
online and batch versions of bagging and boosting. In KDD,
pages 359–364, 2001.
[14] G. Ross, N. Adams, D. Tasoulis, and D. Hand. Exponentially
weighted moving average charts for detecting concept drift.
Pattern Recognition Letters, 33:191–198, 2012.
[15] G. Seber and C. Wild. Nonlinear Regression. Wiley Series in
Probability and Statistics. Wiley, 2003.
[16] M. Wojnarski. Prediction of product quality in glass manufacturing process using LTF-A neural network. Technical report, EUNITE Competition, 2003.
[17] I. Zliobaite. How good is the electricity benchmark for
evaluating concept drift adaptation. CoRR, abs/1301.3524,
2013.
Table 2: Comparison of Naive Bayes, the Deferral Classifier with α = 1 and α = 0.5, and the Temporal Augmented Classifier. The Kappa Statistic is measured at the end of the 1 million test/train interleaved evaluation. The best individual values are indicated in boldface.

                   Naive Bayes   Deferral α=1   Deferral α=0.5   Temporal Augmented
COVTYPE                  40.91          92.09            91.23                77.99
ELECTRICITY              42.54          69.76            69.01                54.42
POKER                    24.79          34.78            32.59                54.60
COVPOKELEC               12.40          30.04            28.37                23.38
HYP(10,0.001)            41.82          41.82            41.82                41.81
HYP(10,0.0001)           82.51          82.51            82.51                82.50
RBF(0,0)                 35.47          35.45            35.46                35.47
RBF(50,0.001)             2.29           2.25             2.24                 2.26
RBF(10,0.001)            36.18          36.17            36.17                36.16
RBF(50,0.0001)            5.49           5.37             5.43                 5.48
RBF(10,0.0001)           36.39          36.39            36.39                36.39
Average                  32.80          42.42            41.93                40.95
Table 3: Kappa Plus Statistic, measured at the end of the 1 million test/train interleaved evaluation. The best individual values are indicated in boldface.

                   Naive Bayes   Deferral α=1   Deferral α=0.5   Temporal Augmented
COVTYPE                -699.37           0.14           -10.65              -190.18
ELECTRICITY             -81.59          -0.84            -3.29               -46.70
POKER                   -58.84         -40.54           -44.55                 0.75
COVPOKELEC             -338.46        -192.05          -203.62              -279.35
HYP(10,0.001)            41.83          41.83            41.83                41.82
HYP(10,0.0001)           82.50          82.50            82.50                82.50
RBF(0,0)                 36.87          36.85            36.86                36.87
RBF(50,0.001)             8.31           8.01             8.02                 8.29
RBF(10,0.001)            37.85          37.85            37.85                37.83
RBF(50,0.0001)           10.72          10.16            10.24                10.71
RBF(10,0.0001)           38.03          38.02            38.03                38.02
Average                 -83.83           1.99            -0.62               -23.58