Use of Ensembles of Fourier Spectra in Capturing

Recurrent Concepts in Data Streams

Sripirakas Sakthithasan and Russel Pears

School of Computer and Mathematical Sciences

Auckland University of Technology, New Zealand

Email: {ssakthit, rpears}@aut.ac.nz

Albert Bifet and Bernhard Pfahringer

Department of Computer Science

University of Waikato, New Zealand

Email: {albifet, bfahringer}@cs.waikato.ac.nz

Abstract—In this research, we apply ensembles of Fourier

encoded spectra to capture and mine recurring concepts in a data

stream environment. Previous research showed that compact ver-

sions of Decision Trees can be obtained by applying the Discrete

Fourier Transform to accurately capture recurrent concepts in a

data stream. However, in highly volatile environments where new

concepts emerge often, the approach of encoding each concept

in a separate spectrum is no longer viable due to memory

overload and thus in this research we present an ensemble

approach that addresses this problem. Our empirical results on

real world data and synthetic data exhibiting varying degrees

of recurrence reveal that the ensemble approach outperforms

the single spectrum approach in terms of classiﬁcation accuracy,

memory and execution time.

I. INTRODUCTION

In many real world applications, patterns or concepts recur

over time. Machine learning applications that model, capture

and recognize concept re-occurrence gain signiﬁcant efﬁciency

and accuracy advantages over systems that simply re-learn

concepts each time they re-occur. When such applications

include safety and time critical requirements, the need for

concept re-use to support decision making becomes even more

compelling.

Auto-pilot systems sense environmental changes and take

appropriate action (classiﬁers, in the supervised machine learn-

ing context) to avoid disasters and to ﬂy smoothly. As en-

vironmental conditions change, appropriate actions must be

taken in the shortest possible time in the interest of safety.

Thus for example, a situation that involves the occurrence

of a sudden low pressure area coupled with high winds (a

concept that would be captured by a classiﬁer) would require

appropriate action to keep the aircraft on a steady trajectory. A

machine learning system that is coupled to a ﬂight simulator

can learn such concepts in the form of classiﬁers and store

them in a repository for timely re-use when the aircraft is on

live ﬂying missions. In live ﬂying mode the autopilot system

can quickly re-use the stored classiﬁers when such situations

re-occur. Additionally, in live ﬂying mode, new potentially

hazardous situations not experienced in simulator mode can

also be learned and stored as classiﬁers in the repository for

future use.

In a real world setting, there is an abundance of applications

that exhibit such recurring behavior such as stock and sales

applications where timely decision making results in improved

productivity. Our research setting is a data stream environment

where we seek to capture concepts as they occur, store them

in highly compressed form in a repository and to re-use

such concepts for classiﬁcation when the need arises in the

future. A number of challenges need to be overcome. Firstly,

a compression scheme that captures concepts using minimal
storage is required, because in a highly volatile, high dimensional
environment memory overhead will be a prime concern: the number
of concepts will grow continuously in time given
the unbounded nature of data streams. Secondly, in real-world

environments, concepts rarely, if ever, occur in exactly their

original form and so a mechanism is needed to recognize par-

tial re-occurrence of concepts. Thirdly, the concept encoding

scheme needs to be efﬁcient in order to support high speed

data stream environments.

In order to meet the above challenges, we extend the work

proposed in [16] in a number of ways. In [16] concepts were

initially captured using decision trees and the Discrete Fourier

Transform (DFT) was applied to encode them into spectra

yielding compressed versions of the original decision trees.

Firstly, instead of encoding each concept using its own

Fourier spectrum, we use an ensemble approach to aggregate

individual spectra into a single uniﬁed spectrum. This has two

advantages, the ﬁrst of which is reduction of memory overhead.

Memory is further reduced as Fourier coefﬁcients that are

common between different spectra can be combined into a

single coefﬁcient, thus eliminating redundancy. The second

advantage arises from the use of an ensemble: new concepts

that manifest as a combination of previously occurring con-

cepts already present in the ensemble have a higher likelihood

of being recognized, resulting in better accuracy and stability

over large segments of the data stream.

Secondly, we devise an efﬁcient scheme for spectral energy

thresholding that directly controls the degree of compression

that can be obtained in encoding concepts in the repository.

Thirdly, we optimize the DFT encoding process by re-

moving the need for computing a potentially expensive inner

product operation on vectors.

II. RELATED RESEARCH

While a vast literature on concept drift detection exists

[13], only a small body of work exists so far on exploitation

of recurrent concepts. The methods that exist fall into two

broad categories. Firstly, methods that store past concepts as

models and then use a meta-learning mechanism to ﬁnd the

best match when a concept drift is triggered [5], [7]. Secondly,

methods that store past concepts as an ensemble of classiﬁers.

The method proposed in this research belongs to the second

category where ensembles remember past concepts.

An algorithm called REDDLA is presented in [14]. This

algorithm is designed to handle recurring concepts with unlabeled
data instances. One of the key issues is that explicit domain
knowledge of the concept recurrence interval is required. The
other issue is high memory overhead.

Lazarescu in [11] proposed an evidence forgetting mecha-

nism based on a multiple window approach and a prediction

module to adapt classiﬁers based on estimating future rate

of change. Whenever the difference between the observed and
estimated rates of change is above a threshold, a classifier
that best represents the current concept is stored in a
repository. Experimentation on the STAGGER data set showed that

the proposed approach outperformed the FLORA method on

classiﬁcation accuracy with re-emergence of previous concepts

in the stream.

Ramamurthy and Bhatnagar [15] use an ensemble approach

based on a set of classiﬁers in a global set G. An ensemble of

classiﬁers is built dynamically from a collection of classiﬁers

in G, if none of the existing individual classiﬁers are able to

meet a minimum accuracy threshold based on a user deﬁned

acceptance factor. Whenever the ensemble accuracy falls below

the accuracy threshold, G is updated with a new classiﬁer

trained on the current chunk of data.

Another ensemble based approach by Katakis et al. is

proposed in [9]. A mapping function is applied on data stream

instances to form conceptual vectors which are then grouped

together into a set of clusters. A classiﬁer is incrementally built

on each cluster and an ensemble is formed based on the set

of classiﬁers. Experimentation on the Usenet data set showed

that the ensemble approach produced better accuracy than a

simple incremental version of the Naive Bayes classiﬁer.

Gomes et al. [7] used a two layer approach with the ﬁrst

layer consisting of a set of classiﬁers trained on the current

concept, while the second layer contains classiﬁers created

from past concepts. A concept drift detector ﬂags when a

warning state is triggered and incoming data instances are

buffered to prepare a new classiﬁer. If the number of instances

in the warning window is below a threshold, the classiﬁer in

layer 1 is used instead of re-using classiﬁers in layer 2. One

major issue with this method is validity of the assumption that

explicit contextual information is available in the data stream.

Gama and Kosina also proposed a two layered system in

[5] which is designed for delayed labelling, similar in some

respects to the Gomes et al. [7] approach. In their approach,

Gama and Kosina pair a base classiﬁer in the ﬁrst layer

with a referee in the second layer. Each referee learns the regions of
feature space in which its corresponding base classifier predicts
accurately and is thus able to express a level of confidence in
its base classifier with respect to a newly generated concept.

The base classiﬁer which receives the highest conﬁdence score

is selected, provided that it is above a user deﬁned hit ratio

parameter; if not, a new classiﬁer is learnt.

Just-in-Time classifiers are the solution proposed by Alippi
et al. [1] to deal with recurrent concepts. Concept change

detection is carried out on the classiﬁcation accuracy as well as

by observing the distribution of input instances. The drawback

is that this model is designed for abrupt drifts and is weak at

handling gradual changes.

Recently, Sakthithasan and Pears in [16] used the Discrete

Fourier Transform (DFT) to encode decision trees into a

highly compressed form for future use. They showed that DFT

encoding is very effective in improving classiﬁcation accuracy,

memory usage and processing time in general. Their method maintains a
pool of Fourier spectra and a decision tree forest in parallel.
The decision tree forest dominates the model when none of the
existing Fourier spectra matches the current concept; otherwise,
classification is done by the best performing Fourier spectrum.

III. APPLICATION OF THE DISCRETE FOURIER TRANSFORM ON DECISION TREES

The Discrete Fourier Transform (DFT) has a vast area of

application in diverse domains such as time series analysis,

signal processing, image processing and so on. It turns out,
as Park [12] and Kargupta [10] show, that the DFT is very
effective in terms of classification when applied to a decision
tree model.

Kargupta et al. [10], working in the domain of distributed

data mining, showed that the Fourier spectrum fully captures

a decision tree in algebraic form, meaning that the Fourier

representation preserves the same classiﬁcation power as the

original decision tree.

A. Transforming Decision Tree into Fourier Spectrum

A decision tree can be represented in compact algebraic

form by applying the DFT to paths of the tree. Each Fourier

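coefficient $\omega_j$ is given by:

$$\omega_j = \frac{1}{2^d} \sum_{x} f(x)\, \psi_j^{\lambda}(x) \qquad (1)$$

where $\psi_j^{\lambda}(x) = \prod_m \exp\left(\frac{2\pi i}{\lambda_m} x_m j_m\right)$ is the Fourier basis function; $j$ and $x$ are strings of length $d$; $x_m$ and $j_m$ represent the $m$th attribute value in $x$ and $j$ respectively; $\lambda_m$ is the cardinality of the $m$th attribute; and $f(x)$ is the classification outcome of path vector $x$.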

Fig. 1. Decision Tree with 3 binary features

Figure 1 shows a simple example with 3 binary valued
features $x_1$, $x_2$ and $x_3$, out of which only $x_1$ and $x_3$ are
actually used in the classification.

With the wild card operator * in place we can use equations

(1) and (2) to calculate non zero coefﬁcients. Thus for example

we can compute:

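$$\omega_{000} = \tfrac{4}{8} f(\ast\ast 0)\,\psi_{000}(\ast\ast 0) + \tfrac{2}{8} f(0 \ast 1)\,\psi_{000}(0 \ast 1) + \tfrac{2}{8} f(1 \ast 1)\,\psi_{000}(1 \ast 1) = \tfrac{3}{4}$$

$$\omega_{001} = \tfrac{4}{8} f(\ast\ast 0)\,\psi_{001}(\ast\ast 0) + \tfrac{2}{8} f(0 \ast 1)\,\psi_{001}(0 \ast 1) + \tfrac{2}{8} f(1 \ast 1)\,\psi_{001}(1 \ast 1) = \tfrac{1}{4}$$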

and so on.
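The calculation above can be checked by brute force over all $2^d$ instances. The following is a minimal sketch, not the authors' implementation. Because Figure 1 is not reproduced here, the leaf labels in `f` are an assumption, chosen so that the two coefficients worked out above (3/4 and 1/4) are recovered; the names `f`, `psi` and `coefficient` are illustrative only.

```python
from itertools import product

D = 3  # number of binary features x1, x2, x3


def f(x):
    """Assumed tree (hypothetical leaf labels): 1 when x3 == 0, else the value of x1."""
    x1, _, x3 = x
    return 1 if x3 == 0 else x1


def psi(j, x):
    """Binary Fourier basis function: (-1) raised to the inner product of j and x."""
    return (-1) ** sum(jm * xm for jm, xm in zip(j, x))


def coefficient(j):
    """Equation (1): average of f(x) * psi_j(x) over all 2^d instances."""
    return sum(f(x) * psi(j, x) for x in product((0, 1), repeat=D)) / 2 ** D


spectrum = {j: coefficient(j) for j in product((0, 1), repeat=D)}
nonzero = {j: w for j, w in spectrum.items() if w != 0}

for j, w in sorted(nonzero.items()):
    print("".join(map(str, j)), w)
```

Under this assumed tree only four of the eight coefficients are non zero, and none of them involves the unused feature $x_2$, which illustrates why the spectrum of a small tree is compact.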

Kargupta et al. in [10] showed that the Fourier spectrum of

a given decision tree can be approximated by computing only a

small number of low order coefﬁcients, thus reducing storage

overhead. With a suitable thresholding scheme in place, the

Fourier spectrum consisting of the set of low order coefﬁcients

is thus an ideal mechanism for capturing past concepts.

Furthermore, classiﬁcation of unlabeled data instances can

be done directly in the Fourier domain as it is well known that

the inverse of the DFT deﬁned in expression 1 can be used

to recover the classiﬁcation value, thus avoiding the need for

expensive reconstruction of a decision tree from its Fourier

spectrum. The inverse Fourier Transform is given by

$$f(x) = \sum_{j} \omega_j\, \overline{\psi}_j^{\lambda}(x) \qquad (2)$$

where $\overline{\psi}_j^{\lambda}(x)$ is the complex conjugate of $\psi_j^{\lambda}(x)$. An instance
can be transformed into a binary vector through the symbolic
mapping between the actual attribute value and the mapped value
(either 0 or 1 in the binary case). It can then be classified using
the inverse function in equation (2). Suppose the instance is 010;
the classification value $f(010)$ can be calculated as follows:

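$$\begin{aligned}
f(010) ={}& (-1)^{000 \cdot 010}\,\omega_{000} + (-1)^{001 \cdot 010}\,\omega_{001} \\
&+ (-1)^{010 \cdot 010}\,\omega_{010} + (-1)^{011 \cdot 010}\,\omega_{011} \\
&+ (-1)^{100 \cdot 010}\,\omega_{100} + (-1)^{101 \cdot 010}\,\omega_{101} \\
&+ (-1)^{110 \cdot 010}\,\omega_{110} + (-1)^{111 \cdot 010}\,\omega_{111} = 1
\end{aligned} \qquad (3)$$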
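Classification in the Fourier domain can be sketched in a few lines. The spectrum below is hypothetical: it holds the four non zero coefficients of an assumed tree over three binary features (leaf 1 when $x_3 = 0$, otherwise the value of $x_1$), normalised as in equation (1). With that normalisation the inverse in equation (2) is a plain sum of coefficient times basis value; in the binary case the basis function is real, so it equals its own conjugate.

```python
# Hypothetical spectrum of an assumed 3-feature tree (not taken from the paper's figure).
SPECTRUM = {
    (0, 0, 0): 0.75,
    (0, 0, 1): 0.25,
    (1, 0, 0): -0.25,
    (1, 0, 1): 0.25,
}


def psi(j, x):
    """Binary Fourier basis function: (-1) raised to the inner product of j and x."""
    return (-1) ** sum(jm * xm for jm, xm in zip(j, x))


def classify(x, spectrum=SPECTRUM):
    """Equation (2): sum of coefficient * basis value over the stored partitions."""
    return sum(w * psi(j, x) for j, w in spectrum.items())


print(classify((0, 1, 0)))  # instance 010
```

Only the stored non zero coefficients are visited, so the cost of classifying an instance grows with the size of the thresholded spectrum rather than with $2^d$.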

IV. EXPLOITATION OF THE FOURIER TRANSFORM FOR RECURRENT CONCEPT CAPTURE

We ﬁrst present the basic algorithm used in Section IV-A

and then go on to discuss an optimization that we used for

energy thresholding in Section IV-B.

We use CBDT [8] as the base classifier, which maintains a
forest of Hoeffding Trees [4]. CBDT is dynamic in the sense
that it can adapt to changing concepts at drift detection points.

As shown in Figure 2, the memory is divided into two

segments: the forest of Hoeffding trees; and a pool of Fourier

Spectra. The forest learns and undergoes structural modiﬁca-

tion on a continuous basis. The pool maintains a collection

of Fourier Spectra encoded from Hoeffding Trees, each of

which had the best classiﬁcation accuracy across the forest at a

particular concept drift point. Each Hoeffding Tree and Fourier

Spectrum is equipped with an instance of a drift detector. In

this research, we use the SeqDrift2 drift detector [13] as the

default option.

In [16], each Fourier spectrum is represented individually

as a Fourier Concept Tree (FCT). In this work, we aggregate

spectra and maintain a pool of ensemble spectra known as

Ensemble Pool (EP). The aggregation process is carried out in

two different ways. Algorithm EPa aggregates with reference
to similarity based on accuracy whereas EP aggregates based
on structural similarity. We describe the EP process in
Algorithm EP and discuss how FCT can be generated from it as a
special case.

Fig. 2. An Architecture for Recurrent Concept Capture

In practice, any incremental decision tree approach that

uses a forest of trees can be used in place of CBDT base

classiﬁer.

A. EP Algorithm

Algorithm EP
Input: Energy Threshold ET, Accuracy Tie Threshold τ
Input: Structural Similarity Threshold α
Output: Best Performing Classifier C that suits current concept
1. Plant a Hoeffding tree rooted on each attribute found in the data stream
2. C is set to a randomly selected Hoeffding tree model from forest
3. Initialise an empty pool
4. Read an instance I from the data stream
5. repeat
6.     Apply all classifiers in forest and pool to classify I
7.     Append 0 to the embedded drift detector's window for each classifier if classification is correct, else 1
8. until Drift is detected by the current best classifier C
9. if C is from the forest
10.     Identify best performing Fourier Spectrum F in pool
11.     if (accuracy(C) - accuracy(F)) > τ
12.         Apply DFT on model C to produce Fourier Spectrum F* using energy threshold ET
13.         if F* is not already in pool
14.             Call Aggregation
15. Identify best classifier C across forest and pool
16. GoTo 4

Algorithm Aggregation
Input: Fourier Spectrum F*, a set of existing ensembles E in pool
Output: Updated Pool
1. repeat Over all data instances i
2.     for Each ensemble E in the pool
3.         d(E) = d(E) + |c(F*, i) − c(E, i)|
4. until Next concept drift point
5. E* = arg min_E d(E)
6. if (E* ≥ α) Merge F* with E*
7. else insert F* as a new spectrum

In step 1, a Hoeffding Tree rooted on each attribute is
created. In step 2, a tree is randomly chosen as the best
performing classifier C. Next, an empty pool is created in step
3. Each incoming instance is routed to all trees in the forest and
pool until a concept drift signal is triggered by the drift detector
instance attached to the best classifier C (steps 4 to 8). At the
first concept drift point, the best performing tree C (in terms
of the drift detector's estimate of accuracy) is transformed into
a Fourier Spectrum F* after energy thresholding [16]. The
assumption in this method is that the tree with the highest
accuracy helps locate concept changes more precisely than other
trees, because it captures concepts in greater detail than the
others and hence achieves the highest accuracy. Thereafter, F* is
stored in the repository for reuse whenever the concept recurs.
The spectra stored in the repository are fixed in nature as the
intention is to capture past concepts. A new best performing
classifier is then identified as shown in step 15.

At each subsequent drift point, if the best classifier is from
the pool then that classifier is applied to classify data instances
until a new best classifier emerges at a subsequent drift point.
Otherwise, if the best classifier is from the forest, two tests are
made prior to applying the DFT to reduce redundancy in the
pool. Firstly, we check whether the difference in accuracy
between the best Hoeffding tree in the forest (C) and the best
performing Fourier Spectrum (F) in the pool (from step 10) is
greater than a user defined tie threshold τ (step 11). If this test
succeeds, the DFT is applied to C to produce F* (step 12).
Furthermore, a second test is made to ensure that its Fourier
representation F* is not already in the pool (step 13). If this
test is also passed, Algorithm Aggregation is called to integrate
F* into a selected existing Fourier Spectrum (E*) or plant F*
as a separate Fourier spectrum in the pool (step 14).

Algorithm Aggregation searches for the spectrum (E*)
that has the greatest structural similarity to the currently
generated spectrum (F*) (steps 1 to 5). Step 3 evaluates the degree
of disagreement (d) between the classification decisions (c) of
F* and E on data instance i. The degree of disagreement between
F* and each existing ensemble E in the pool can easily be
updated incrementally in Algorithm EP using a single counter
variable at each ensemble E. This removes the need for steps 2 to
4 in Algorithm Aggregation. As an alternative to aggregating
structurally similar spectra, we used accuracy as the measure
that defines similarity. Similarity based on accuracy leads to
aggregating similarly performing Fourier Spectra together. Thus,
we test the hypothesis that aggregating two spectra based on
structural similarity produces better performing trees than
aggregating based on accuracy.
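The disagreement counters used to select E* can be sketched as follows. This is an illustrative reading rather than the authors' code: classification decisions are assumed to be 0/1 values, and the comparison against α is assumed to be made on the agreement rate $1 - d(E)/n$ over the $n$ instances seen, which the text does not spell out. The function name `choose_ensemble` and the dictionary layout are hypothetical.

```python
def choose_ensemble(new_decisions, pool_decisions, alpha):
    """Return the name of the ensemble to merge F* into, or None to insert F* separately.

    new_decisions  -- 0/1 decisions of the new spectrum F* on each instance
    pool_decisions -- dict mapping ensemble name to its 0/1 decisions on the same instances
    alpha          -- structural similarity threshold (assumed to apply to the agreement rate)
    """
    n = len(new_decisions)
    if n == 0 or not pool_decisions:
        return None

    # One counter per ensemble, updated instance by instance (step 3 of Aggregation).
    disagreement = {name: 0 for name in pool_decisions}
    for i, c_new in enumerate(new_decisions):
        for name, decisions in pool_decisions.items():
            disagreement[name] += abs(c_new - decisions[i])

    # Step 5: the ensemble with the smallest disagreement is the closest match.
    best = min(disagreement, key=disagreement.get)

    # Assumed test for step 6: merge only if the agreement rate reaches alpha.
    similarity = 1.0 - disagreement[best] / n
    return best if similarity >= alpha else None


pool = {"E1": [1, 0, 1, 0], "E2": [0, 1, 0, 0]}
print(choose_ensemble([1, 0, 1, 1], pool, alpha=0.7))
```

Because each counter is a running sum, it can be maintained while instances are being classified, which is the incremental update described above.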

As stated earlier, FCT omits the call to Algorithm Aggre-

gation and inserts (F*) as it is, and is thus a special case of

EP.

B. Optimising the Energy Thresholding Process

Sakthithasan et al. in [16] showed that classification accuracy
is sensitive to spectral energy, which is given by the
sum of squares of the coefficients [10]; the higher
the energy, the greater the classification accuracy in general.

Thresholding on spectral energy is thus an effective method of

obtaining a compact spectrum while retaining the classiﬁcation

power inherent in the decision tree counterpart.

A solution described in [16] was to iterate through each
order of the spectrum and compute the ratio of the energy at order
i−1 to that at order i. Thresholding can then be
implemented at order O when the ratio is less than some
small tolerance value, say 0.01. The drawback of this simple
solution is that it does not guarantee that the cumulative
energy up to order O contains a specified proportion of the
total energy. Fortunately, a solution exists for this problem.
Theorem 1 proves that E(T) (the total energy of the Fourier spectrum)
equals $\omega_0$ (the 0th coefficient). Thus, the total energy can
be computed efficiently, without having to enumerate all the
individual coefficients.

Theorem 1: The total spectral energy $E = \sum_j \omega_j^2 = \omega_0$,
where $\omega_0$ denotes the coefficient with order 0, which is easily
computed as its Fourier basis function is unity.

Proof: Omitted due to lack of space; it can be found
at http://cogprints.org/9879/

This optimization signiﬁcantly increases processing speed,

especially in high dimensional data stream environments.
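Theorem 1 can be checked numerically on a small example, and the result used to pick a cut-off order. The sketch below assumes a binary class label (0 or 1), for which $f(x)^2 = f(x)$ and the sum of squared coefficients therefore equals the mean of $f$, that is $\omega_0$. The tree in `f` is hypothetical, and the cut-off rule (smallest order whose cumulative energy reaches a chosen proportion of $\omega_0$) is an assumed reading of the thresholding scheme, not a transcription of it.

```python
from itertools import product

D = 3  # number of binary features


def f(x):
    """Hypothetical tree: 1 when x3 == 0, else the value of x1."""
    x1, _, x3 = x
    return 1 if x3 == 0 else x1


def psi(j, x):
    """Binary Fourier basis function."""
    return (-1) ** sum(jm * xm for jm, xm in zip(j, x))


def full_spectrum():
    """All 2^d coefficients, normalised by 1 / 2^d as in equation (1)."""
    points = list(product((0, 1), repeat=D))
    return {j: sum(f(x) * psi(j, x) for x in points) / len(points) for j in points}


def cutoff_order(spectrum, proportion):
    """Smallest order whose cumulative energy reaches proportion * omega_0."""
    total = spectrum[(0,) * D]  # Theorem 1: total energy equals omega_0
    cumulative = 0.0
    for order in range(D + 1):
        # The order of a partition is its number of non zero positions.
        cumulative += sum(w * w for j, w in spectrum.items() if sum(j) == order)
        if cumulative >= proportion * total:
            return order
    return D


spec = full_spectrum()
total_energy = sum(w * w for w in spec.values())
print(total_energy, spec[(0, 0, 0)])
print(cutoff_order(spec, 0.9))
```

The point of the identity is that `total` is read from a single coefficient, so the stopping test never requires the higher order coefficients to be computed first.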

The next optimization targets the Fourier
basis function calculation in equation (1), especially when
wildcard characters (denoting absence of a feature) are
present in a path vector x of a Hoeffding Tree.

C. Optimizing the Computation of the Fourier Basis Function

The computation of a Fourier basis function for a given
partition $j$ in a generic $n$-ary ($n \ge 2$) domain is given by:

$$\sum_{x \in S} \psi_j(x) = \sum_{x \in S} \prod_{m} \exp\left(\frac{2\pi i\, j_m x_m}{\lambda_m}\right) \qquad (4)$$

Thus we can see from (4) that the computation of
$\sum_{x \in S} \psi_j(x)$ over a set of schema S requires the computation
of an expensive inner product operation between x and
j. However, it is possible to optimize this inner product
computation as defined in Theorem 2.

Theorem 2: The computation of $\sum_{x \in S} \psi_j(x)$ can be
optimized as follows:

Case 1: If there exists at least one $(p, \ast)$ combination
with $p \in j$, $p \neq 0$ and $\ast$ a wild card character defining a set
of schema $S$, then $\sum_{x \in S} \psi_j(x) = 0$.

Case 2: else if there exist $n$ combinations of $(0, \ast)$
pairs in the $j$ and $x$ vectors respectively, then

$$\sum_{x \in S} \psi_j(x) = \lambda \prod_{k=n}^{d-1} \exp\left(\frac{2\pi i\, j_k x_k}{\lambda_k}\right), \quad \text{where } \lambda = \prod_{l=0}^{n-1} \lambda_l$$

Proof: Omitted due to lack of space; it can be accessed
from: http://cogprints.org/9879/

The value of Case 1 is that a simple scan of the j and
x vectors will save a total of d multiplications and d−1
additions.

We now turn our attention to Case 2. Since $\prod_l \lambda_l$ is a
constant for all possible values of j and x, the value of Case
2 is that a scan of the two vectors will avoid the overhead of
n multiplications and n−1 additions.
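Both cases of Theorem 2 can be compared against a brute-force evaluation of equation (4). The sketch below is illustrative: wildcards are represented by `None`, the tuple `lam` holds assumed attribute cardinalities, and the shortcut accepts wildcards at any position rather than only the leading $n$ positions used in the statement of the theorem.

```python
import cmath
from itertools import product
from math import prod

WILDCARD = None  # marks a feature that is absent from the path


def psi(j, x, lam):
    """Fourier basis function for attributes with cardinalities lam."""
    angle = sum(2 * cmath.pi * jm * xm / lm for jm, xm, lm in zip(j, x, lam))
    return cmath.exp(1j * angle)


def brute_force_sum(j, schema, lam):
    """Equation (4): expand every wildcard and sum the basis function over the schema."""
    choices = [range(lm) if xm is WILDCARD else (xm,) for xm, lm in zip(schema, lam)]
    return sum(psi(j, x, lam) for x in product(*choices))


def shortcut_sum(j, schema, lam):
    """Theorem 2: a single scan of j and the schema replaces the expansion."""
    wild = [m for m, xm in enumerate(schema) if xm is WILDCARD]
    if any(j[m] != 0 for m in wild):
        return 0j  # Case 1: a non zero j value paired with a wildcard
    fixed = [m for m, xm in enumerate(schema) if xm is not WILDCARD]
    scale = prod(lam[m] for m in wild)  # Case 2: product of wildcard cardinalities
    angle = sum(2 * cmath.pi * j[m] * schema[m] / lam[m] for m in fixed)
    return scale * cmath.exp(1j * angle)


lam = (2, 3, 2)                    # assumed cardinalities of three attributes
schema = (WILDCARD, WILDCARD, 1)   # path that tests only the third attribute
for j in [(0, 0, 1), (0, 2, 1)]:
    print(j, brute_force_sum(j, schema, lam), shortcut_sum(j, schema, lam))
```

The brute-force version visits every instance covered by the schema, while the shortcut touches each attribute position once, which is where the saving described above comes from.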

Even with these optimizations, coefﬁcient calculation may

be expensive in a large dimensional data set. In the next section

we present a strategy to further optimize the derivation of the

spectrum.

D. Localized Approach to Ensemble Learning in the Fourier

Domain

In order to realize the full beneﬁts of ensemble learning in

the Fourier domain, we aggregate individual spectra $s_i(x)$ that
represent different concepts which manifest at different points
in the stream.

$$s_c(x) = \sum_{i} A_i\, s_i(x) = \sum_{i} A_i \sum_{j \in P_i} \omega_j^{(i)}\, \psi_j(x) \qquad (5)$$

where $s_c(x)$ denotes the ensemble spectrum produced from the
individual spectra $s_i(x)$ produced at different points $i$ in the
stream; $A_i$ is the classification accuracy of its corresponding
spectrum and $P_i$ is the set of partitions for non zero coefficients
in spectrum $s_i$.

Park in [12] used ensemble learning with Fourier spectra
in a setting different to ours, considering a distributed
system with each node $i$ producing its own spectrum $s_i(x)$
and aggregation taking place at a central node. In our setting
of a data stream environment, we do not have all spectra in
advance but we can still use the same principle due to the
distributive nature of the linear weighted sum expressed by
(5). Hence, we use:

$$s_c^{(i+1)}(x) = s_c^{(i)}(x) + A_{i+1}\, s_{i+1}(x) \qquad (6)$$

where $s_c^{(i+1)}(x)$ and $s_c^{(i)}(x)$ represent the ensemble spectra at concept
drift points $i+1$ and $i$ respectively in the stream and $s_{i+1}(x)$
is the spectrum produced at drift point $i+1$ with accuracy
$A_{i+1}$.

We use expression (6) for implementing ensemble learning

but with one essential difference. A direct application of (6)

using the entire (global) set of attributes G comprising the data

set would be inefﬁcient. As there are an exponential number

of coefﬁcients with respect to the number of attributes, this

could cause a bottleneck in high dimensional environments.

One practical solution is to populate the spectrum using only

attributes present in a given tree. The major advantage of this

approach is smaller computational overhead as the Fourier

transform effort is directly proportional to the size of the

attribute set used. Then this initial spectrum can be extended to

a full length spectrum containing the attributes that are absent

in the given tree, using a simple transformation scheme.

We define an attribute set of a Decision Tree as that subset
of attributes which define splits in the tree. Suppose that we
are integrating spectra from trees $D_1$ and $D_2$, having attribute
sets $L$ and $M$ respectively. We apply the DFT on $D_1$ to obtain
$S_1$ using only the attributes in its attribute set $L$ and not all
attributes in $G$. Similarly we generate $S_2$ from $D_2$ using only
the attributes defined in $M$.

Now, in order to integrate $S_1$ with $S_2$, we need to account
for differences in the attribute sets $L$ and $M$. To do this, we
take $S_1$ and expand the spectrum by incorporating attributes in
the set $M \setminus L$. The expansion is defined by a single operation:
for each schema instance in the spectrum (say $S_1$), expand
the spectrum by adding 0 to all attribute index positions in
set $M \setminus L$. The coefficient value after expansion will remain
the same, as the classification value $f$ for all of these added
index positions remains unchanged. We are now in a position

to integrate two spectra produced from their own localized

set of attributes. Essentially, this means that we now have

a more efﬁcient method of implementing ensemble learning

using expression (6).
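The expansion step and the incremental update in expression (6) can be sketched with spectra stored as dictionaries from partitions to coefficients. Everything named here is illustrative: the attribute names, the coefficient values and the accuracies are made up, and `expand` and `aggregate` are hypothetical helpers rather than functions from the authors' system.

```python
def expand(coeffs, attrs, target_attrs):
    """Re-index a local spectrum over target_attrs, writing 0 at the added positions."""
    index = {a: k for k, a in enumerate(attrs)}
    return {
        tuple(j[index[a]] if a in index else 0 for a in target_attrs): w
        for j, w in coeffs.items()
    }


def aggregate(ensemble, coeffs, accuracy):
    """Expression (6): add an accuracy-weighted spectrum to the running ensemble."""
    merged = dict(ensemble)
    for j, w in coeffs.items():
        # Coefficients that share a partition collapse into a single entry.
        merged[j] = merged.get(j, 0.0) + accuracy * w
    return merged


# Two local spectra over different attribute sets L and M (made-up values).
s1_attrs, s1 = ("x1", "x3"), {(0, 0): 0.75, (0, 1): 0.25}
s2_attrs, s2 = ("x2",), {(0,): 0.5, (1,): -0.5}

union = ("x1", "x2", "x3")
ensemble = {}
ensemble = aggregate(ensemble, expand(s1, s1_attrs, union), accuracy=0.8)
ensemble = aggregate(ensemble, expand(s2, s2_attrs, union), accuracy=0.6)
print(ensemble)
```

In this example the two spectra contribute four coefficients between them but the ensemble stores three, since both share the order 0 partition.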

The next section presents the empirical outcomes of the

proposed models with the above mentioned optimizations.

V. EXPERIMENTAL STUDY

The main focus of the study is to assess the effectiveness
of the ensemble EP approach vis-a-vis FCT in respect of
classification accuracy, memory consumption, processing
speed and tolerance to noise. We also assessed the sensitivity
of EP's accuracy to two significant factors: pool size and the
impact of the drift detector. All experimentation was done with
the following parameter values:

Tree Forest: Max Node Count=5000, Max Number of

Fourier spectra=10, Tie Threshold τ=0.01

SeqDrift2/ADWIN [2]: drift signiﬁcance value=0.01

A. Datasets Used for the Experimental Study

1) Synthetic Data: We experimented with the Rotating

Hyperplane data generator that is commonly used in drift

detection and recurrent concept mining. The dataset was

generated within the MOA data stream tool [3]. We injected

concept recurrence into the stream at known points so that we

could evaluate the capabilities of FCT and EP to recognize and

exploit such recurrences. For this dataset 10 different concepts

were generated, each of which spanned 5,000 instances and

each occurred a total of 3 times at different points in the

stream. In order to challenge the concept recognition process,

we added 10% noise by inverting the class labels of 10% of

randomly selected instances.

2) Real World Data: Spam Data Set: The Spam dataset
was used in its original form (from http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift),
which encapsulates an evolution of Spam messages. There are 9,324 instances and 499
informative attributes.

Electricity Data Set: The NSW Electricity dataset is also used
in its original form (from http://moa.cms.waikato.ac.nz/datasets/).
There are two classes, Up and Down, that indicate the change of price
with respect to the moving average of the prices in the last 24 hours.

Flight Data Set: This dataset is generated through the
use of NASA's FLTz flight simulator, which was designed to
simulate flight conditions experienced with commercial flights.
It consists of a set of 20 separate files, each containing data
about a single flight with four scenarios: take off, climb,
cruise and landing. Data is recorded every second and a data
instance is produced. The "Velocity" feature is chosen as the
class feature as it needs to be adjusted in order to maintain
aircraft stability during various maneuvers such as take off
and landing. Velocity was discretized into binary outcomes
"UP" or "DOWN" depending on the directional change of the
moving average in a window of size 10 data instances.
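One way to read the discretization of the class feature is sketched below. The details are assumptions, since the text gives only the window size: the moving average is taken over the latest 10 readings, a label is emitted once two consecutive averages exist, and an unchanged average is labelled "DOWN". The function name `label_direction` is hypothetical.

```python
from collections import deque


def label_direction(values, window=10):
    """Label each step UP or DOWN by the change in the moving average of the last `window` values."""
    recent = deque(maxlen=window)
    previous_average = None
    labels = []
    for value in values:
        recent.append(value)
        if len(recent) < window:
            continue  # not enough readings for a full window yet
        average = sum(recent) / window
        if previous_average is not None:
            # Assumed tie handling: an unchanged average counts as DOWN.
            labels.append("UP" if average > previous_average else "DOWN")
        previous_average = average
    return labels


print(label_direction(list(range(12))))
```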

B. Comparative Study: Ensemble versus Single Spectrum Ap-

proach

Previous research on the use of Fourier spectrum re-

vealed accuracy and memory advantages over meta learning

approaches such as the one employed by Gama and Kosina

in storing past concepts in a repository [16]. For details of

the advantages of the Fourier approach and experimentation

with it the reader is referred to [16]. Our focus here is a

comparative study of the Ensemble approach versus the single

spectrum approach. With this in mind we designed three types

of experiments.

1) Accuracy: Accuracy is a critical performance measure
in many practical applications. Due to the dynamic nature of
data streams, classification accuracy on the current concept was
taken as the performance measure. Figure 3 presents accuracy
values of all algorithms at 10 equal-sized sub-divisions of the
stream. We also present overall mean and standard deviations
of accuracy taken across the entire stream for each dataset.

Fig. 3. Accuracy Profiles

Fig. 3 shows that the individual accuracies across segments
and overall accuracy across the entire stream are consistent
with each other. MetaCT, which uses a referee based strategy,
was found to be the worst performing algorithm on all datasets.
In contrast, EP outperforms the other algorithms in general,
followed by EPa and FCT. These results show clearly that
DFT based methods are superior in a dynamic data stream
environment.

FCT does not exploit aggregation of Fourier Spectra and

is hence challenged in a memory constrained environment

where the number of models stored for reuse is limited. Figure

3 depicts the performance in such an environment where

memory is severely limited. This introduces a large burden

on FCT to re-learn concepts after change. EP is more resilient

at small pool sizes as any given concept that recurs can be

approximated by a linear combination of spectra embedded

in the ensemble, just as a waveform of arbitrary shape can

be approximated by a large enough sum of sine functions in

signal processing.

Examining model usage statistics, EP was 3.7 times higher
in model re-use on the Flight dataset. The corresponding value
was 2.6 for EPa on the same dataset. This provides empirical
support for the claim that an aggregation-based model such
as EP has a significant advantage in reducing the degree of
relearning. For Rotating Hyperplane with known recurrence
points, the advantage of EP over its counterparts is very explicit.
We display the stream segment for the third round of concept
occurrences, spanning the 10 concepts. Each of the 10 intervals
represents the second recurrence of a concept, and the figure
shows that EP outperforms FCT on 8/10 concepts; EPa and
MetaCT on 7/10 concepts; and CBDT on all 10 concepts.

The next key aspect in a memory constrained environment

is memory consumption which is assessed in the following

section.

2) Memory: Memory consumption is inﬂuenced by the

degree of generalizability of a given algorithm. A greater

degree of generalizability promotes higher re-use and reduces

the number of spectra that need to be stored in the repository

to achieve a given level of classiﬁcation accuracy. In this

context it will be interesting to compare the consumption of

EP with that of FCT as they have contrasting model re-use

characteristics.

MetaCT(SeqDrift2) and CBDT were excluded from mem-

ory comparison due to their relatively poor performance in the

previous experiment.

TABLE I. MEMORY USAGE WITH POOL SIZE SET TO 10

Dataset            Average Pool Memory (in KBs)
                   FCT     EPa     EP
Flight             32.1    20.2    18.1
Electricity        31.6    16.1    14.1
Rot. Hyperplane    48.4    38.6    27.9
Spam               17.3    17.2    16.4

Table I presents the average memory consumption of the pool
over the entirety of each dataset. As mentioned in Section IV,
each of the above algorithms in Table I has two components:
a forest and a repository pool. Memory consumed by the forest
is not a distinguishing factor as there was only a very marginal
difference between the algorithms and thus the focus was on
the repository pool.

Without exception, EP consumed the least memory of

the three algorithms. This was expected as EP

structurally examines instance vectors (i.e. those corresponding to

classification paths in the Hoeffding tree) and aggregates similar

vectors together. In EPa, on the other hand, structural

similarity is not guaranteed, and two structurally very different

spectra producing similar accuracy could be chosen as the

candidates to be aggregated, thus resulting in larger spectra.

Table I provides evidence to support this premise as the

memory consumed by EPa is higher than that of EP but

lower than FCT. On average over all datasets, EP achieved a

41% reduction in memory consumption in relation to FCT; the

corresponding ﬁgure for Electricity was 55%. This represents

a signiﬁcant beneﬁt of applying aggregation in Fourier space.
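The reduction figures quoted above can be re-derived from Table I. The sketch below is a worked check rather than the authors' own computation. It assumes that "on average over all datasets" means the reduction in total pool memory summed across the four datasets, which reproduces the 41% figure; the unweighted mean of the per-dataset reductions is lower, at roughly 37%. The names `table_i`, `reduction_pct`, `per_dataset_reduction` and `pooled_reduction` are illustrative.

```python
# Average pool memory in KB, copied from Table I.
table_i = {
    "Flight": {"FCT": 32.1, "EPa": 20.2, "EP": 18.1},
    "Electricity": {"FCT": 31.6, "EPa": 16.1, "EP": 14.1},
    "Rot. Hyperplane": {"FCT": 48.4, "EPa": 38.6, "EP": 27.9},
    "Spam": {"FCT": 17.3, "EPa": 17.2, "EP": 16.4},
}


def reduction_pct(baseline, value):
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


def per_dataset_reduction(table, method="EP", baseline="FCT"):
    """Memory reduction of `method` relative to `baseline`, per dataset."""
    return {
        name: reduction_pct(row[baseline], row[method])
        for name, row in table.items()
    }


def pooled_reduction(table, method="EP", baseline="FCT"):
    """Reduction in total memory summed over all datasets (assumed averaging)."""
    total_baseline = sum(row[baseline] for row in table.values())
    total_method = sum(row[method] for row in table.values())
    return reduction_pct(total_baseline, total_method)
```

Under this assumption, `pooled_reduction(table_i)` rounds to 41 and the Electricity entry of `per_dataset_reduction(table_i)` rounds to 55, in agreement with the text.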

3) Processing Speed: DFT application is a potential perfor-

mance bottleneck when compared to classiﬁcation, especially

in high dimensional data streams.

Processing speed depends on several factors: FCT must

maintain, and classify with, a larger number of Fourier

spectra than EP and EPa; aggregation in EP and EPa

generalizes models, reducing re-learning and hence

the need for DFT application; and aggregation itself carries

a computational overhead. This section therefore assesses the

trade-off between the single and aggregated Fourier approaches in

terms of processing speed.

TABLE II. PROCESSING SPEED IN INSTANCES PER SECOND

Dataset                FCT        EPa       EP
Flight                 797.2      731.2     836.9
Electricity            11600.3    9002.5    11402.5
Rotating Hyperplane    5647.8     5413.8    5804.5
Spam                   4.2        3.9       4.2

Table II shows that EP is the fastest on Flight and Rotating

Hyperplane, ties with FCT on Spam, and is marginally slower than

FCT on Electricity. EPa, even though it has the potential to be

faster due to its simple aggregation strategy, suffers from

inappropriate aggregations that introduce instability, thus

triggering more drift points than its EP counterpart. EP, on the

other hand, performs structural similarity comparison efficiently

by incrementally updating simple counters that record the number

of disagreements in classification between the current winner tree

and every Fourier spectrum in the pool. Although EP's aggregation

strategy requires more computational effort than that of EPa,

that effort is compensated for by its stability, which triggers

fewer false drift alarms than either EPa or FCT.

Therefore, this experiment demonstrates that an expensive

operation such as aggregation, if applied appropriately, can

yield a direct processing speed advantage over a period of

time.
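The counter-based comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DisagreementTracker`, its methods and the spectrum identifiers are hypothetical, and the rule that turns disagreement counts into an aggregation decision is not specified in this section.

```python
class DisagreementTracker:
    """Counts, per pooled spectrum, how often it disagrees with the winner tree."""

    def __init__(self, pool_ids):
        # One counter per spectrum in the pool (identifiers are hypothetical).
        self.counts = {pid: 0 for pid in pool_ids}
        self.seen = 0

    def update(self, winner_label, pool_labels):
        """Process one instance.

        pool_labels maps spectrum id -> label that spectrum predicts for the
        current instance; winner_label is the winner tree's prediction.
        """
        self.seen += 1
        for pid, label in pool_labels.items():
            if label != winner_label:
                self.counts[pid] += 1  # constant-time incremental update

    def disagreement_rate(self, pid):
        """Fraction of instances on which spectrum `pid` disagreed so far."""
        return self.counts[pid] / self.seen if self.seen else 0.0

    def most_similar(self):
        """Spectrum with the fewest disagreements (ties: first inserted)."""
        return min(self.counts, key=self.counts.get)
```

Because each instance costs only one comparison and at most one increment per pooled spectrum, the similarity information is available at a drift point without re-scanning past instances, which is consistent with the low overhead reported for EP.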

4) Effects of Noise: Algorithms that work well in noise-

free environments will fail in noisy environments if they

lack the ability to generalize to new data by removing minor

variations which often correspond to noise. DFT application, as

mentioned earlier, extracts signiﬁcant coefﬁcients by ignoring

minor coefﬁcients that may capture noise inherent in data. It

was shown in [16] that DFT application provides robustness in

a noisy environment as opposed to a non-DFT based approach

such as MetaCT. Therefore, this experiment is aimed at testing

whether aggregation has an added advantage over a non-

aggregation based method such as FCT.
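The idea of retaining significant coefficients while discarding minor ones can be illustrated with a simple energy-based cut-off. The sketch below is an assumption-laden illustration, not the thresholding rule used in the paper: the function name `threshold_by_energy`, the 0.95 energy fraction and the bit-string coefficient keys are all hypothetical.

```python
def threshold_by_energy(coefficients, energy_fraction=0.95):
    """Keep the largest-magnitude coefficients until they hold the requested
    share of the total energy (sum of squared magnitudes)."""
    total = sum(abs(c) ** 2 for c in coefficients.values())
    if total == 0:
        return {}
    kept, accumulated = {}, 0.0
    # Visit coefficients from highest to lowest energy.
    ranked = sorted(
        coefficients.items(), key=lambda kv: abs(kv[1]) ** 2, reverse=True
    )
    for key, value in ranked:
        kept[key] = value
        accumulated += abs(value) ** 2
        if accumulated >= energy_fraction * total:
            break
    return kept


# Hypothetical spectrum: two dominant coefficients plus small ones that
# could plausibly correspond to noise.
spectrum = {"000": 0.6, "100": 0.3, "010": -0.05, "001": 0.02, "110": 0.01}
compact = threshold_by_energy(spectrum)
```

Here `compact` retains only the coefficients keyed `"000"` and `"100"`; the three small coefficients are dropped, which is the mechanism by which minor variations are filtered out of the stored spectrum.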

Fig. 4. The impact of noise on accuracy

Figure 4 shows the percentage decrease in accuracy at noise

levels of 20% and 30% for FCT and EP, relative to accuracy on the

original Flight dataset. It is clear that the decrease in accuracy

is higher at the 30% noise level. More interesting is the

higher tolerance of EP to noise compared to FCT. In 8/10

intervals at the 20% noise level, EP shows a smaller

decrease than its counterpart. At the 30% noise level,

the fraction is 4/10, with the two tied in

two other intervals. Again, as with the other metrics that we

tracked, the superior performance of EP can be explained in

terms of its power to generalize, making it more robust to the

effects of noise.³

Next we examine the sensitivity of EP to key parameters

that significantly affect performance. Due to the superiority of

EP over the other algorithms, the study was confined to this

algorithm. Please refer to [16] for a sensitivity analysis of FCT's

parameters.

C. Sensitivity Analysis

EP(SeqDrift2) has two key parameters of its own: pool size

and choice of drift detector.

1) Pool Size: In this experiment we contrasted classification

accuracy at two different ends of the pool size scale,

namely 1 and 10. In the context of the Flight dataset, which

has four concepts, a pool size of 1 represents an extremely

limited memory environment and a size of 10 represents a

situation where memory is plentiful. Figure 5 shows accuracy

values over 10 intervals. Interestingly, EP with pool size 1

has the highest accuracy in 8/10 intervals. EP gains 7.6%

and 7.2% in accuracy over FCT at pool sizes

1 and 10, respectively. This is a significant outcome of this

research. Even in an extremely memory-constrained environment,

EP achieves its best accuracy, exceeding that of a setting with a much

higher memory capacity. The implication is that ensemble

accuracy increases with greater diversity, which resonates with

the findings of [6]. This illustrates the strength of

aggregation applied in the EP algorithm. As more memory

becomes available at pool size 10, FCT's accuracy converges

to that of its counterpart, as expected. At the higher memory

setting FCT can accommodate more spectra in its pool that

are tailored to specific concepts.

Fig. 5. The impact of pool size on the Flight dataset

³ The other three datasets that we experimented with displayed similar trends

to that of the Flight dataset and were thus omitted in the interest of

space.

2) Impact of Drift Detector: A drift detector that incor-

rectly triggers change points leads to partial learning of a con-

cept and underdeveloped classifiers being stored in the pool.

This introduces ﬂuctuations in accuracy, which in turn trigger

change detections, causing even more ﬂuctuations and so on.

This is a cyclic problem. On the other hand, if a drift detector

fails to detect changes, classiﬁers are not updated in a timely

fashion, thus leading to poor performance. This situation may

arise if a drift detector has signiﬁcantly high detection delay in

signaling changes. The ADWIN and SeqDrift2 drift detectors,

as shown in [13], have contrasting properties. SeqDrift2 has a

lower false positive rate than ADWIN while having similar

sensitivity to ADWIN. Therefore, the comparative study is

largely governed by false positive detections.

Fig. 6. The impact of drift detector on EP with pool size 10

Figure 6 reveals that SeqDrift2 helped EP to reduce the

frequency of sudden accuracy drops seen with ADWIN, due

to the latter signaling false changes in concepts. In the segment

shown in Figure 6, there is a 5% gain in accuracy by using

SeqDrift2 and it is 3.4% over the entire data set.

VI. CONCLUSIONS AND FUTURE WORK

In this research we proposed a novel approach for cap-

turing and exploiting recurring concepts in data streams. We

optimized the derivation of the Fourier spectrum by employing

two mechanisms: one for energy thresholding and the other for

speeding up computation of the Fourier basis functions.

This research revealed that the ensemble approach outper-

formed the single spectrum approach and is thus the method

of choice in high speed dynamic environments that generate

a large number of concepts over the progression of the stream.

In such environments FCT would be challenged in terms of

memory capacity and would be forced to ﬂush portions of

its repository sooner than EP, thus losing its ability to exploit

concept recurrences and in turn leading to a loss of accuracy.

However, as shown in the experiments, care needs to be

taken on how spectra are combined: a naive approach of simply

combining similarly performing spectra in terms of accuracy

can be worse than maintaining single spectra. We showed

that the structural similarity scheme outperformed the other

two approaches on a broad set of criteria including accuracy,

robustness to noise and over-ﬁtting, memory consumption and

processing speed.

In terms of future work there are two promising directions.

First, we believe that it is possible to further reduce the computational

effort involved in deriving the spectrum by only keeping the

lowest order coefﬁcient at each leaf node of the Decision

Tree together with a residual coefﬁcient that captures the

contribution of other coefﬁcients at that node. Secondly, at

each concept drift point we can parallelize computation of the

spectrum in one thread while processing incoming instances

in another thread in a parallel environment such as a Spark

framework.

REFERENCES

[1] Alippi, C. Boracchi, G. & Roveri, M. Just-In-Time Classifiers for Re-

current Concepts. IEEE Transactions on Neural Networks and Learning

Systems, vol(24), no(4), pages 620–634, 2013.

[2] Bifet, A. & Gavaldà, R. Learning from Time-Changing Data with

Adaptive Windowing. In Proceedings of the 7th SIAM ICDM, pages

443–448. SIAM, 2007.

[3] Bifet, A. Holmes, G. Kirkby, R. & Pfahringer, B. MOA: Massive Online

Analysis. The Journal of Machine Learning Research, vol(11), pages

1601–1604, 2010.

[4] Domingos, P. & Hulten, G. Mining High-speed Data Streams. In

Proceedings of the ACM SIGKDD’00, pages 71–80, New York, NY,

USA, 2000. ACM.

[5] Gama, J. & Kosina, P. Learning about the Learning Process. In Advances

in Intelligent Data Analysis X, vol(7014) of Lecture Notes in Computer

Science, pages 162–172. Springer Berlin Heidelberg, 2011.

[6] Gashler, M., Giraud-Carrier C., & Martinez, T. Decision Tree Ensemble:

Small Heterogeneous Is Better Than Large Homogeneous. 7th Interna-

tional Conference on Machine Learning and Applications, pages 900–

905, IEEE Computer Society, 2008.

[7] Gomes, J. Menasalvas, E. & Sousa, P. Tracking Recurrent Concepts

Using Context. In Rough Sets and Current Trends in Computing,

vol(6086), pages 168–177. Springer Berlin Heidelberg, 2010.

[8] Hoeglinger, S. Pears, R. & Koh, Y. CBDT: A Concept Based Approach

to Data Stream Mining. In Proceedings of the PAKDD ’09, pages 1006–

1012, Berlin, Heidelberg, 2009. Springer-Verlag.

[9] Katakis, I. Tsoumakas, G. & Vlahavas, I. An Ensemble of Classiﬁers for

Coping with Recurring Contexts in Data Streams. In Proceedings of the

ECAI’08, pages 763–764, Amsterdam, The Netherlands,

2008. The IOS Press.

[10] Kargupta, H. Park, B. & Dutta, H. Orthogonal Decision Trees. IEEE

Transactions on Knowledge and Data Engineering, vol(18), no(8), pages

1028–1042, 2006.

[11] Lazarescu, M. A Multi-Resolution Learning Approach to Tracking

Concept Drift and Recurrent Concepts. In 5th international workshop

on Pattern Recognition in Information Systems, 2005.

[12] Park, B. Knowledge Discovery from Heterogeneous Data Streams Using

Fourier Spectrum of Decision Trees. PhD thesis, Pullman, WA, USA,

2001.

[13] Pears, R. Sripirakas, S. & Koh, Y. Detecting concept change in dynamic

data streams. Machine Learning, 97:3, pp 259–293, 2014.

[14] Li, P. Wu, X. & Hu, X. Mining recurring concept

drifts with limited labeled streaming data. ACM Trans. Intell. Syst.

Technol., vol(3), no(2), pages 29:1–29:32, 2012.

[15] Ramamurthy, S. & Bhatnagar, R. Tracking recurrent concept drift

in streaming data using ensemble classiﬁers. In 6th International

Conference on Machine Learning Applications, pages 404–409, Dec

2007.

[16] Sripirakas, S. & Pears, R. Mining Recurrent Concepts in Data Streams

Using the Discrete Fourier Transform. In DaWaK’14, vol(8646) of

Lecture Notes in Computer Science, pp 439–451. Springer International

Publishing, 2014.