
Data Stream Classiﬁcation using Random Feature

Functions and Novel Method Combinations

Diego Marrón^a, Jesse Read^b, Albert Bifet^c, Nacho Navarro^a

^a Department of Computer Architecture, Universitat Politècnica de Catalunya, and Department of Computer Science, Barcelona Supercomputing Center, Spain

^b Aalto University and HIIT, Finland

^c Télécom ParisTech, Paris, France

Abstract

Big Data streams are being generated in a faster, bigger, and more commonplace fashion. In this scenario, Hoeffding Trees are an established method for classification. Several extensions exist, including high-performing ensemble setups such

as online and leveraging bagging. Also, k-nearest neighbors is a popular choice,

with most extensions dealing with the inherent performance limitations over a

potentially-inﬁnite stream.

At the same time, gradient descent methods are becoming increasingly pop-

ular, owing in part to the successes of deep learning. Although deep neural

networks can learn incrementally, they have so far proved too sensitive to hyper-

parameter options and initial conditions to be considered an eﬀective ‘oﬀ-the-

shelf’ data-streams solution.

In this work, we look at combinations of Hoeﬀding-trees, nearest neighbour,

and gradient descent methods with a streaming preprocessing approach in the

form of a random feature functions ﬁlter for additional predictive power.

We further extend the investigation to implementing methods on GPUs,

which we test on some large real-world datasets, and show the beneﬁts of using

GPUs for data-stream learning due to their high scalability.

Our empirical evaluation yields positive results for the novel approaches that we experiment with, highlighting important issues, and sheds light on promising future directions in approaches to data-stream classification.

Keywords: Data Stream Mining, Big Data, Classiﬁcation, GPUs.

1. Introduction

There is a trend towards working with big and dynamic data sources. This

tendency is clear both in real world applications and the academic literature.

Email addresses: dmarron@ac.upc.edu (Diego Marrón), jesse.read@aalto.fi (Jesse Read), albert.bifet@telecom-paristech.fr (Albert Bifet), nacho@ac.upc.edu (Nacho Navarro)


Many modern data sources are not only dynamic but often generated at high

speed and must be classiﬁed in real time. Such contexts can be found in sensor

applications (e.g., tracking and activity monitoring), demand prediction (e.g.,

of electricity), manufacturing processes, robotics, email, news feeds, and social

networks. Real-time analysis of data streams is becoming a key area of data

mining research as the number of applications in this area grows.

The requirements for a classifier in a data stream are to:

• Be able to make a classification at any time

• Deal with a potentially infinite number of examples

• Access each example in the stream just once

These requirements can in fact be met by a variety of learning schemes, includ-

ing even batch learners (e.g., [1]), where batches are constantly gathered over

time, and newer models replace older ones as memory ﬁlls up. Nevertheless,

incremental methods remain strongly preferred in the data streams literature,

and particularly the Hoeffding tree (HT) and its variations [2, 3] and k-nearest neighbors (kNN) [4]. Support for these options is given by large-scale empirical comparisons [5], where it is also found that methods such as naive Bayes and stochastic gradient descent (SGD)-based learners are relatively poor performers.

Classiﬁcation in data streams is a major area of research, in which Hoeﬀding

trees have long been a favoured method. The main contribution of this paper

is to show that random feature functions can be leveraged by other algorithms

to obtain similar or even improved performance over tree-based methods.

With the recent popularity of Deep Learning (DL) methods, we also want to test how a random feature filter in the form of a random projection layer performs with Deep Neural Networks (DNNs).

DL aims for a better data representation at multiple layers of abstraction, and for each layer the network needs to be fine-tuned. In classification, a common algorithm to fine-tune the network is SGD, which tries to minimize the error at the output layer using an objective function, such as the Mean Squared Error (MSE). A gradient vector is used to back-propagate the error to previous layers. The gradient nature of the algorithm makes it suitable for training incrementally with batches of size one, similar to how incremental training is done in data streams. Unfortunately, DNNs are very sensitive to hyper-parameters such as the learning rate (η), the momentum (µ), the number of neurons per level, or the number of levels. It is therefore not straightforward to provide an off-the-shelf method for data streams.

Propagation between layers is usually done in the form of matrix-vector

or matrix-matrix multiplications, which are computationally intensive operations.

Often hardware accelerators such as FPGAs or GPUs are used to accelerate the

calculations. Despite some efforts, acceleration of the HT and kNN algorithms for data streams on GPUs still has some limitations. We discuss this briefly

in Section 2.

In recent years, Extreme Learning Machines [6] (ELMs) have emerged as

a popular framework in Machine Learning. ELMs are a type of feed-forward


neural networks characterized by a random initialization of their hidden layer,

combined with a fast training algorithm. Our random feature method is based

on this approach.

We made use of the MOA (Massive Online Analysis) framework [7], a soft-

ware environment for implementing algorithms and running experiments for

online learning from data streams in Java. It implements a large number of

modern methods for classiﬁcation in streams, including HT, kNN, and SGD-

based methods. We make use of MOA’s extensive library of methods to form

novel combinations with these methods and further employ an extremely rapid

preprocessing technique of projecting the input into a new space via random

feature functions (similar to ELMs). We then took the methods purely related

to Neural Networks (those which proved most promising under random projec-

tions) and implemented them using NVIDIA GPUs and CUDA 7.0; comparing

performance to the methods in MOA.

This paper is organized as follows: Section 2 introduces related work on tree

based approaches, neural networks, and data streams on GPU. We discuss the

use of random features in Sections 3 and 4 for HT/kNN methods and neural

networks respectively. We ﬁrst present the evaluation of tree-based methods

in Section 5 and later in Section 6 we extend the SGD method in the form of

DNNs, using diﬀerent activation functions. We ﬁnally conclude the paper in

Section 7.

2. Related Work

Hoeﬀding trees [2] are state-of-the-art in classiﬁcation for data streams and

they predict by choosing the majority class at each leaf. However, these trees

may be conservative at first, and in many situations the naive Bayes method outper-

forms the standard Hoeﬀding tree initially, although it is eventually overtaken

[8]. A proposed hybrid adaptive method (by [8]) is a Hoeﬀding tree with naive

Bayes at the leaves, i.e., returning a naive Bayes prediction at the leaves, if

it has been so far more accurate overall than the majority class. Given its

widespread acceptance, this is the default in MOA, and we denote this method

in the experimental Section simply as HT. In fact, the naive Bayes classiﬁcation

comes for free, since it can be made with the same statistics that are collected

anyway by the tree.

Other established examples of feature-space transformation include principal component analysis (reviewed also in [9]) and Restricted Boltzmann Ma-

chines (RBMs) [10]. RBMs can be seen as a probabilistic binary version of PCA,

for finding higher-level feature representations. They have gained widespread

popularity in recent years due to their use in successful deep learning approaches.

In this case, z = φ(x) = f(W⊤x) for some non-linearity f: a sigmoid function is typical, but more recently rectified linear units (ReLUs, [11]) have come into favour. The weight matrix W is learned with gradient-based methods [12], and

the projected output should provide a better feature representation for a neural

network or any oﬀ-the-shelf method. This approach was applied to data streams


already in [13], but concluded that the sensitivity to hyper-parameters and ini-

tial conditions prevented good ‘out-of-the-box’ deployment in data streams.

Approaches such as the so-called extreme learning machines (ELMs) [14]

avoid tricky parametrizations by simply using random functions (indeed, ELMs

are basically linear learners on top of non-linear data transformations). Despite the hidden layer weights being random, it has been proven that ELMs are still capable of universal approximation of any non-constant piecewise continuous function [15].

An incremental version of ELMs is also proposed in [16]. It starts with a small network, and new neurons are added at each step until a stopping criterion on size or residual error is reached. The difference with our incremental build is that we use one instance at a time, simulating that instances arrive over time, and we incrementally train the network. Also, our number of neurons is fixed during training; in other words, we do not add or remove any neuron during the process.

Nowadays, in 2015, it is difficult to talk about DL and DNNs without mentioning GPUs. They are massively parallel architectures providing outstanding performance for High Performance Computing and a very good performance/watt ratio, as their architecture suits the needs of DNN computations very well. Many tools include a back-end to offload the computation to the GPU. NVIDIA has its own portal for deep learning on GPUs at https://developer.nvidia.com/deep-learning.

GPUs have not only been used to accelerate DL/DNN computations due to their performance; they have also been used to successfully accelerate HTs and ensembles. However, few works exist in the context of data streams and GPUs.

The only work we are aware of regarding HT in the context of online real-time data stream mining is [17], where the authors present a parallel implementation of HT and Random Forests for binary trees and data streams, achieving good speedups but with limitations on the size and with high memory consumption. A more generic GPU implementation of Random Forests is presented in [18]. In [19] the authors introduce an open source library, available on GitHub, for image labelling using random forests. The library is also tested on a cell phone with VGA resolution in real time, with good results.

Also, kNN has already been successfully ported to GPUs [20]. That paper

presented one of the ﬁrst implementations of the “brute force” kNN on GPUs,

and compared it with several CPU-based implementations, with speedups up to two orders of magnitude. kNN is also used in business intelligence [21] and also has a GPU implementation there. In the same way as with HT, a tool for machine learning (including kNN) is described in [22].

3. Tree Based Random Feature Functions

Transforming the feature space prior to learning and classiﬁcation is an es-

tablished idea in the statistical and machine learning literature [9], for example

with basis (or feature-) functions. Suppose the input instance is x of length d. This vector is transformed to a new space z = φ(x) via a function φ, creating a new vector z of length h. Any off-the-shelf model now treats z as if it were the input. The functions can either be chosen suitably by a domain expert, or simply chosen to achieve a higher-dimensional representation of the input. Polynomials and splines are a typical choice.
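As a concrete illustration (a minimal sketch, not the MOA filter; the function and variable names are ours), the following Python snippet shows a random feature filter of this kind: a fixed random matrix W maps an instance x of length d to a new vector z of length h, which any off-the-shelf learner can then treat as the input. We assume ReLU feature functions here, as used for the filter in Section 5, and a random bias vector.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_random_filter(d, h):
    """Create a fixed random projection from d inputs to h feature functions."""
    W = rng.normal(0.0, 1.0, size=(h, d))  # random weights, never trained
    b = rng.normal(0.0, 1.0, size=h)       # random biases (an assumption)
    def phi(x):
        # ReLU feature functions: z = max(0, Wx + b)
        return np.maximum(0.0, W @ x + b)
    return phi

phi = make_random_filter(d=8, h=80)        # e.g. h = 10d, as in Section 5
x = rng.uniform(-1.0, 1.0, size=8)         # one incoming instance
z = phi(x)                                 # feed z to any off-the-shelf learner
```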

Regarding HTs with additional algorithms in the leaves (as described in

Section 2), this ﬁlter can either be placed before the HT, or before the method

in the leaves, or both.

In this paper we adapt this methodology to deal with other classiﬁers in a

similar way, namely kNN and an SGD-based method (rather than naive Bayes)

at the leaves. We denote these cases HT-kNN and HT-SGD, respectively. For

example, in HT-SGD, a gradient descent learner is employed in the leaves of

each tree. Similarly to HT, predictions by the kNN or SGD-based method

are only used if they are more accurate on average than the majority class.
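The rule for using the learner in a leaf only when it has so far been more accurate than the majority class can be sketched as follows. This is an illustrative simplification with hypothetical names, not the MOA implementation; learner stands for any incremental classifier (kNN, SGD, or naive Bayes) exposing predict and partial_fit.

```python
class LeafWithLearner:
    """Leaf that falls back to the majority class unless its learner
    has been more accurate so far (prequential bookkeeping)."""

    def __init__(self, learner):
        self.learner = learner          # hypothetical incremental classifier
        self.class_counts = {}          # statistics the tree collects anyway
        self.learner_hits = 0
        self.majority_hits = 0
        self.n_seen = 0

    def majority_class(self):
        if not self.class_counts:
            return None
        return max(self.class_counts, key=self.class_counts.get)

    def predict(self, x):
        if self.n_seen > 0 and self.learner_hits > self.majority_hits:
            return self.learner.predict(x)
        return self.majority_class()

    def train(self, x, y):
        # score both candidate predictors on (x, y) before learning from it
        if self.majority_class() == y:
            self.majority_hits += 1
        if self.n_seen > 0 and self.learner.predict(x) == y:
            self.learner_hits += 1
        self.class_counts[y] = self.class_counts.get(y, 0) + 1
        self.learner.partial_fit(x, y)
        self.n_seen += 1
```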

3.1. Ensembles in Data Streams

Bagging is an ensemble method used to improve the accuracy of classiﬁer

methods. Non-streaming bagging [23] builds a set of M base models, training each model with a bootstrap sample of size N created by drawing random samples with replacement from the original training set. Each base model's training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution. For large values of N, this binomial distribution tends to a Poisson(λ = 1) distribution, for which P(K = k) = exp(−1)/k!.

Using this fact, Oza and Russell [24, 25] proposed Online Bagging, an online

method that instead of sampling with replacement, gives each example a weight

according to Poisson(1). ADWIN Bagging [26] is an adaptive version of Online

Bagging that uses a change detector to decide when to discard under-performing

ensemble models.

Leveraging Bagging (LB, [3]) improves ADWIN Bagging, increasing the weights

of this resampling by using a larger value of λ to compute the value of the Poisson

distribution. The Poisson distribution is used to model the number of events

occurring within a given time interval. It proved very competitive.
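The resampling scheme behind Online Bagging and Leveraging Bagging can be sketched in a few lines (an illustration with a hypothetical incremental-learner interface, not the MOA code): each base model is trained on the incoming example k times, with k drawn from a Poisson distribution whose mean is 1 for Online Bagging and a larger value for Leveraging Bagging.

```python
import numpy as np

rng = np.random.default_rng(0)

def online_bagging_update(ensemble, x, y, lam=1.0):
    """Train every base model on (x, y) a Poisson(lam)-distributed number of times.

    lam = 1.0 corresponds to Online Bagging; Leveraging Bagging uses a larger lam.
    """
    for model in ensemble:
        k = rng.poisson(lam)
        for _ in range(k):
            model.partial_fit(x, y)  # hypothetical incremental learner interface
```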

Again, we can run a ﬁlter on the input instances before entering the ensemble

of trees, or at the leaves. It is even possible to run the ﬁlter again on the

output of an ensemble (i.e., the votes), before running an additional stacking

procedure. This kind of methodology can give way to rather ‘deep’ classiﬁers.

Figure 1 illustrates a possible setup. In this sense of multiple levels we could

also call our approach deep learning. It is debatable whether decision trees can

be called a deep method (their levels simply partition an existing feature space rather than create a higher-level feature space). However, several of the methods we investigate have at least

multiple levels of feature transformation, which is behind the power of most deep

methods. In the following Section we investigate the empirical performance of

several novel combinations based on the methodology described so far.

Figure 1: An example setup: Input x is filtered (i.e., projected) to random layer z (first layer of connections), which goes to an ensemble of, for example, HTs (second layer), wherein instances are partitioned to the leaves and are again filtered (third layer) and used as training for, say, SGD, producing (in the fourth layer of connections) the final vote y. Note, however, that we only draw the final two layers with respect to the first of the HT models.

Figure 2: Random Projection Layer: Input x is projected to a random layer z (first layer of connections), which is trained to produce the final vote y.

4. Neural Networks with Random Projections for Data Streams

Data streams are potentially infinite and can evolve with time. This means the statistical distribution of the data we are interested in can change. The idea behind the random layer is to improve data localization across the space that the trained layer sees. Imagine the instance is a tiny luminous point in the space: with enough random neurons acting as a mirror, we hope the trained layer can better capture the movement of the data. The strategy used by the random projection layer is shown in Figure 2.

Except for the fact that it is never trained, the random layer is a normal layer and needs its activation functions; in this work sigmoid, ReLU, ReLU incremental and a Radial Basis Function are used.

The sigmoid function used is the standard logistic function, with σ(a_k) ∈ (0, 1):

\sigma(a_k) = \frac{1}{1 + e^{-a_k}}

where a_k = W_k^⊤ x is the k-th activation and W is the d × h weight matrix (d input attributes, h output features).

Figure 3: Terrain of a ReLU basis function on two input attributes x1, x2; the feature function z is given on the vertical axis.

ReLU functions are defined as:

z_k = f(a_k) = \max(0, a_k)

As stated in Section 2, ReLU activations are very efficient, as they require only a comparison. In our random projection we expect near 50% of the neurons to be active for a single instance (the terrain of a ReLU is exemplified in Figure 3).

One variation we can make to the standard ReLU is to use the mean value of the attribute as a threshold. The mean value is calculated incrementally as instances arrive. We call this variant ReLU incremental, and it is defined as:

f(a_k) = \max(\bar{a}_k, a_k)

The last activation function we use is the Radial Basis Function (RBF):

\phi(x) = e^{-\frac{(x - c_i)^2}{2\sigma^2}}

where x is an input instance attribute value and c_i is a random value set at initialization time. Each neuron in the random layer has its own set of c_i, and the vectors x and c have the same length, so we can see the operation (x − c_i)^2 as the squared Euclidean distance of the current instance to a randomly positioned center. The σ^2 is a free parameter. A simplification we can make to this notation is:

\gamma = \frac{1}{2\sigma^2}

In our experiments we try different γ values, passed at the command line. We use the following notation in our experiments:

\phi(x) = e^{-\gamma (x - c_i)^2}
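For illustration, the four activation functions of the random layer could be written as follows. This is a minimal sketch; the array shapes, the order in which the incremental mean is updated, and the handling of the random centers are our assumptions.

```python
import numpy as np

def sigmoid(a):
    # standard logistic sigmoid, range (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    # z_k = max(0, a_k)
    return np.maximum(0.0, a)

class ReluIncremental:
    """ReLU variant that thresholds at the incrementally updated mean activation."""
    def __init__(self, h):
        self.mean = np.zeros(h)
        self.count = 0

    def __call__(self, a):
        out = np.maximum(self.mean, a)
        self.count += 1
        self.mean += (a - self.mean) / self.count  # running mean per output
        return out

def rbf(x, centers, gamma):
    """phi_i(x) = exp(-gamma * ||x - c_i||^2) for each random center c_i (rows of centers)."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-gamma * sq_dist)
```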

All matrices and vectors in our model are initialized using random numbers. The matrices are used as normal weight matrices, but the role of the vectors is activation-function dependent. Usually initialization is done using random numbers with µ = 0 and σ = 1. Assuming our data range is in [−1, 1], if we put a Gaussian centered at one of the endpoints, half of its area of influence is wasted and will never see a point, making it harder to fill the whole space and so to discover points.

Figure 4: Random number initialization strategies: (a) typical random initialization; (b) better use of the random range.

If a smaller range is used, σ ∈ (0, 1) (note the open interval), we can improve each neuron's area of influence, as shown in Figure 4b: in red, the range of the random numbers is smaller than the data range, so a Gaussian placed at a random endpoint improves its area of influence. Here we used a Gaussian function as an example, but the idea extends in the same way to the other activation functions. In fact this is what we do in Section 6, especially for the sigmoid neurons, as they are always used at the trained layer.

5. Random Feature Function Evaluation

Among the methods we investigate (e.g., HT, kNN, SGD1), diﬀerent levels

of ﬁlters and ensembles and possibly additional classiﬁcation in the leaves (in

the case of HT), there are a multitude of possible combinations. We ﬁrst inves-

tigate the viability of random feature functions and their eﬀect on the diﬀerent

classifiers, comparing these common methods with their ‘filtered’ versions, which we denote HT-F, kNN-F, and SGD-F. This study led us to novel combina-

tions, which we further compare to the benchmark methods and state-of-the-art

Leveraging Bagging (LB-HT).

The random features used in these evaluations are basically ELMs [14]. In this Section, we use only ReLU (explained in Section 4) as the activation function. In Section 4 we extend the functions used within the random features to define a random projection layer for DNNs.

Our random features are based on ELMs, which are defined using Radial Basis Functions; in this section, however, we use ReLU as the activation function. Both functions are defined in detail in Section 4.

All experiments in this section were carried out using the MOA framework [7]

with prequential evaluation: each individual example is used to test the model

1 We refer, in this case, to the instantiation with default parameters in MOA, i.e., minimizing hinge loss.

Table 1: Data sources used in the experimental evaluation. Synthetic datasets are listed first.

Dataset #Attributes #Instances

RBF1 10 100,000

HYP1 10 100,000

LED1 24 100,000

LOC1 25 100,000

LOC2 2500 100,000

Poker 10 829,201

Electricity 8 45,312

CoverType 54 581,012

SUSY 8 5,000,000

before it is used for training, and from this the accuracy can be incrementally

updated. We used an 8-core (3.20GHz each) desktop machine allowing up to 1

gigabyte of RAM per run (all methods were able to ﬁnish).
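For reference, prequential (test-then-train) evaluation amounts to the following loop. This is an illustrative sketch, not the MOA evaluator, and model stands for any incremental learner with predict and partial_fit.

```python
def prequential_accuracy(model, stream):
    """Test each instance before training on it; yield the running accuracy."""
    correct = 0
    for n, (x, y) in enumerate(stream, start=1):
        if model.predict(x) == y:   # test first (an untrained model simply guesses)
            correct += 1
        model.partial_fit(x, y)     # then train on the same instance, exactly once
        yield correct / n
```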

Table 1 lists the data sources used. A thorough description of most of the

datasets is given in [5]. Of the others, LOC1 and LOC2 are datasets dealing

with classifying the location of an object in a grid of 5 ×5 and 50 ×50 pix-

els respectively, described in [27]. SUSY [28] has features that are kinematic

properties measured by particle detectors in an accelerator. The binary class

distinguishes between a signal process which produces supersymmetric particles

and a background process which does not. It is one of the largest datasets in

the UCI repository that we could ﬁnd.

For the feature filter we used parameters of h = 5d hidden units for kNN-F and h = 10d for SGD-F and HT-F (a decision based on the relative computational

sensitivity of kNN to a larger attribute space – for LOC2 this means 25,000

attributes in the projected space for SGD-F, and half of that for kNN-F) –

except where this is varied in Figure 5. For kNN we used a buﬀer size of 5000.

For LB we specify 10 models. In other cases, the default parameters in MOA

are used.

Figure 5 displays the results of varying the relative size of the new feature

space (with respect to the original feature space) on two real-world datasets. Note that

the feature space is diﬀerent, so even when this ratio is 1 : 1, performance may

diﬀer.

With regard to kNN, performance improves with more feature functions. In

one of the two cases, this is suﬃcient to overtake kNN on the original feature

space. Unfortunately, kNN is particularly sensitive to the number of attributes,

so complexity becomes an issue long before other methods. The new feature

space does not help the performance of HT, and in neither case does it reach the

performance of HT on the original feature space. In fact, it begins to decrease

again. This is because too many features makes it diﬃcult for HT to become

conﬁdent enough to split on, and may split poorly. Also, by partitioning the

feature space, interaction between the features is lost. SGD reacts best to a new


Table 2: Final Accuracy and Running Times. The dataset-wise ranking is given

in (parentheses) and the average of these ranks is given in the ﬁnal row.

(a) Accuracy

Dataset HT SGD kNN LB-HT LB-SGD-F kNN-F SGD-F HT-kNN HT-SGD-F

RBF1 75.0 (5) 54.5 (9) 92.0 (2) 88.7 (4) 72.0 (8) 90.4 (3) 72.0 (7) 92.6 (1) 73.7 (6)

RBFD 65.7 (5) 51.3 (9) 88.6 (1) 79.5 (4) 59.8 (8) 86.3 (2) 59.9 (7) 84.9 (3) 59.9 (6)

HYP1 87.7 (1) 50.3 (9) 82.9 (4) 85.7 (2) 67.2 (7) 77.0 (5) 67.2 (7) 83.3 (3) 67.9 (6)

LED1 73.1 (1) 10.3 (9) 62.8 (3) 72.0 (2) 15.6 (6) 49.0 (5) 15.5 (7) 62.8 (3) 15.5 (7)

POKR 76.1 (6) 68.9 (9) 69.3 (8) 87.6 (1) 82.3 (2) 81.5 (4) 81.9 (3) 74.8 (7) 80.1 (5)

LOC1 85.5 (8) 80.4 (9) 91.0 (2) 90.5 (6) 90.7 (4) 88.8 (7) 90.7 (3) 91.3 (1) 90.7 (5)

LOC2 56.3 (5) 51.5 (9) 75.7 (2) 52.6 (8) 56.8 (4) 74.5 (3) 55.9 (7) 75.9 (1) 56.1 (6)

ELEC 79.2 (4) 57.6 (9) 78.4 (5) 89.8 (1) 74.8 (7) 74.2 (8) 74.8 (6) 82.5 (2) 81.8 (3)

COVT 80.3 (5) 60.7 (9) 92.2 (1) 91.7 (2) 78.7 (6) 91.6 (3) 78.7 (7) 91.2 (4) 78.3 (8)

SUSY 78.2 (3) 76.5 (7) 67.5 (9) 78.7 (1) 77.7 (4) 71.2 (8) 77.7 (5) 77.2 (6) 78.4 (2)

avg rank 4.30 8.80 3.70 3.10 5.60 4.80 5.90 3.10 5.40

(b) Running Time (s)

Dataset HT SGD kNN LB-HT LB-SGD-F kNN-F SGD-F HT-kNN HT-SGD-F

RBF1 0 (3) 0 (1) 3 (6) 3 (5) 4 (8) 14 (9) 0 (2) 4 (7) 1 (4)

RBFD 1 (3) 0 (1) 3 (6) 2 (5) 4 (8) 15 (9) 0 (2) 4 (7) 1 (4)

HYP1 0 (2) 0 (1) 3 (6) 2 (5) 4 (7) 13 (9) 0 (3) 4 (8) 1 (4)

LED1 0 (2) 0 (1) 7 (6) 2 (5) 17 (8) 40 (9) 1 (3) 8 (7) 1 (4)

POKR 9 (2) 3 (1) 455 (8) 91 (5) 279 (6) 1539 (9) 21 (3) 422 (7) 26 (4)

LOC1 1 (2) 0 (1) 8 (7) 2 (5) 21 (8) 48 (9) 1 (3) 8 (6) 2 (4)

LOC2 9 (2) 4 (1) 1276 (7) 93 (3) 1917 (8) 2270 (9) 367 (5) 1230 (6) 350 (4)

ELEC 1 (3) 0 (1) 14 (7) 10 (6) 9 (5) 49 (9) 1 (2) 19 (8) 2 (4)

COVT 19 (2) 11 (1) 605 (6) 220 (3) 4119 (9) 3998 (8) 233 (4) 727 (7) 250 (5)

SUSY 45 (2) 25 (1) 1464 (8) 530 (5) 1040 (6) 4714 (9) 118 (3) 1428 (7) 159 (4)

avg rank 2.30 1.00 6.70 4.70 7.30 8.90 3.00 7.00 4.10

feature space. As noticed earlier [5], SGD is a poor performer compared to HTs,

however, working in a feature space of random ReLUs, SGD-F actually reaches

HT performance (on SUSY, and looks promising under ELEC) with similar time

complexity. Even at 1,000 times the original feature space, running time is

acceptable (only several seconds per 10,000 instances). On the other hand,

the increased memory use is signiﬁcant across all methods. SGD requires 1,000

times more memory in this setting.

From this initial investigation we formulate several method combinations

for a more extensive evaluation. Table 2 displays the ﬁnal accuracy over the

data stream. The ﬁrst four columns represent the baselines and state-of-the-art

(LB-HT), and remaining columns are a selection of new method combinations.

Figure 6 gives a more detailed over-time view of the largest dataset (SUSY),

with the average performance plotted over the entire stream over 100 intervals,

and also the ﬁrst 1/10th of the data (again, over 100 intervals). The second plot

gives more of an idea about how models respond to fresh concepts. Learning

new concepts is a fundamental part of adapting to concept drift in data streams.

Regarding this experiment some of the most important observations and

conclusions are as follows:

•SGD-F (i.e., SGD with random feature functions), even in this ﬁrst anal-

ysis, out-competes established methods like kNN on several datasets.

•kNN beneﬁts relatively less (than SGD) from the feature functions ﬁlter.

This is expected, since kNN is already a non-linear learner. However, on

a few datasets accuracy is 5-10 percentage points higher with the ﬁlter.

Figure 5: Accuracy (Figure 5a) and Running Time (Figure 5b) on a 10,000-instance segment of two real-world datasets (ELEC and SUSY) for varying proportions of h (number of hidden units / basis functions) with respect to d. kNN has been cut out after h/d = 1000 due to scalability reasons. Each plot compares HT, kNN and SGD with their filtered versions. Note the log scale on the horizontal axis.

Figure 6: Performance over the first 50,000 examples (right) of the SUSY data, each divided into 100 windows.

•kNN can be used eﬀectively in the leaves of HT instead of the default

of naive Bayes. There is an additional computational cost involved, but

results showed this to be a highly competitive method – equal best overall in predictive performance, tied with state-of-the-art LB-HT.

•HT is diﬃcult to improve on using feature functions (at least with the

ReLUs that we experimented with). Again, this can be attributed to HT

being a non-linear learner. Peak accuracy is reached in a relatively short

space of time.

•SGD takes longer than HT or LB-HT to reach competitive accuracy, but

the gap narrows signiﬁcantly with more examples (for example, under

SUSY). On the largest datasets, the ﬁnal average accuracy is within a

percentage point – and this average includes initial poorer performance.

Therefore, on particularly big data streams (which are increasingly com-

mon), HTs could ﬁnd themselves increasingly challenged to stay ahead of

these methods.

•HT-SGD-F is comparable to the state of the art LB-HT on several datasets,

but demonstrates more favourable running times.

•Unlike many deep learning techniques, these random functions do not

require sensitive calibration.

•Unsurprisingly, kNN-based methods perform best on the dataset RBFD

which has a drifting concept, since they automatically phase out older

concepts. We did not look into detail about dealing with concept drift in

this paper, but this can be dealt with by ‘meta methods’, e.g., [29].

•Employing random feature functions as a ‘ﬁlter’ in the MOA framework is

a convenient and flexible way to apply them in a range of different data-stream

methods.

6. GPU Extended Evaluation

In the evaluations of the previous Section, we noticed that SGD methods gain the strongest advantage from random feature functions. This, added to the increasing popularity of DL methods, led us to select this strategy for further investigation and experimentation in this Section. A natural choice for implementing DNNs is to use GPUs to accelerate the calculations. Our experiments were evaluated on an NVIDIA Tesla K40c with 12 GB of RAM, 15 SMX units and up to 2880 simultaneous threads, using CUDA 7.0.

Another motivation to use GPUs is the selection of the network hyper-parameters by cross-validation: for each dataset and activation function, different configurations are tested and the best performing one is chosen. It turned out this was a high number of combinations, and a way to accelerate the process is to use GPUs.


Table 3: Random number initialization strategy for the different activation functions

Activation Weight Matrix Bias Vector

RBF mean=0.0 and std=1.0 gamma

Sigmoid mean=0.0 and std=0.9 mean=0.0 and std=0.2

ReLU mean=0.0 and std=1.0 mean=0.0 and std=0.1

ReLU Inc mean=0.0 and std=1.0 0.0

The random projection layer is implemented using a standard two-layer feed-forward fully connected network. The input is fed to the random layer, which is never trained, and the output from this layer is forwarded to the trained layer. In this work we use SGD as the training algorithm and MSE as the objective function for the last layer.
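A minimal CPU sketch of such a network is shown below for illustration. It is not the CUDA implementation used for the experiments in this Section; the momentum formulation, the one-hot targets, and the weight shapes are assumptions, with initialization values loosely following Table 3 for a ReLU random layer.

```python
import numpy as np

rng = np.random.default_rng(1)

class RandomProjectionNet:
    """Fixed random (ReLU) layer followed by a sigmoid output layer
    trained with SGD on the mean squared error."""

    def __init__(self, d, h, n_classes, lr=0.1, momentum=0.3):
        self.W1 = rng.normal(0.0, 1.0, size=(h, d))           # random layer, never trained
        self.b1 = rng.normal(0.0, 0.1, size=h)
        self.W2 = rng.normal(0.0, 0.9, size=(n_classes, h))   # trained layer
        self.b2 = np.zeros(n_classes)
        self.vW = np.zeros_like(self.W2)                      # momentum buffers
        self.vb = np.zeros_like(self.b2)
        self.lr, self.mu = lr, momentum

    def _forward(self, x):
        z = np.maximum(0.0, self.W1 @ x + self.b1)            # random ReLU features
        y = 1.0 / (1.0 + np.exp(-(self.W2 @ z + self.b2)))    # sigmoid outputs
        return z, y

    def predict(self, x):
        return int(np.argmax(self._forward(x)[1]))

    def partial_fit(self, x, label):
        z, y = self._forward(x)
        t = np.zeros_like(y)
        t[label] = 1.0                                        # one-hot target
        delta = (y - t) * y * (1.0 - y)                       # gradient of MSE through sigmoid
        self.vW = self.mu * self.vW - self.lr * np.outer(delta, z)
        self.vb = self.mu * self.vb - self.lr * delta
        self.W2 += self.vW                                    # only the output layer is updated
        self.b2 += self.vb
```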

We use three of the data sources from Table 1: Covertype (COVT), Electricity

(ELEC), and SUSY. This way we can compare the accuracy obtained in this

Section against well known state-of-the-art algorithms.

The initialization of each layer depends on the activation function used. We tried different random number initialization strategies, and those for which we achieved the best results are summarized in Table 3. Most of the weight matrices are initialized using random numbers with mean = 0 and σ = 1.0, except for the sigmoid activation function. The purpose and usage of the bias vector is activation-function dependent.

Different activation functions have been tested at the random layer: RBF-gamma, Sigmoid, ReLU, and incremental ReLU. Sigmoid and ReLU are used in the standard way. As we can see in Table 3, the bias vector for RBF stores the gammas; in our evaluations we use γ = {0.001, 0.01, 0.1, 1.0, 10.0}. ReLU incremental uses the bias vector to store the incremental mean for each output attribute. At the trained layer we always use the standard sigmoid as the activation function.
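To make Table 3 concrete, a sketch of how the random layer could be initialized per activation function is given below (hypothetical helper; the shapes and the use of a single γ per layer are assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)

# Values taken from Table 3 (weight std, and how the bias vector is used)
INIT = {
    "rbf":      {"w_std": 1.0, "bias": "gamma"},   # bias vector stores the gammas
    "sigmoid":  {"w_std": 0.9, "b_std": 0.2},
    "relu":     {"w_std": 1.0, "b_std": 0.1},
    "relu_inc": {"w_std": 1.0, "bias": "zeros"},   # bias stores the running means
}

def init_random_layer(activation, d, h, gamma=1.0):
    cfg = INIT[activation]
    W = rng.normal(0.0, cfg["w_std"], size=(h, d))
    if cfg.get("bias") == "gamma":
        b = np.full(h, gamma)
    elif cfg.get("bias") == "zeros":
        b = np.zeros(h)
    else:
        b = rng.normal(0.0, cfg["b_std"], size=h)
    return W, b
```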

In the same way as in Section 5, the network is built incrementally using prequential learning; we visit each instance only once. This is in contrast to typical DNN training, where instances are loaded in batches and the algorithm iterates over them a given number of times and, every time the error is reduced, the model is checkpointed and deployed.

Table 4 summarizes the best results we obtained, and compares them with the best results obtained in the Section 5 evaluations. We chose the comparison algorithms by accuracy, and compared the running times against them. Configurations were chosen by cross-validation using the following parameters: µ ∈ [0.1, 1.0] with an increment of 0.1, and a similar range for the learning rate. Sizes tested: [10, 100] with an increment of 10, [100, 1000] with an increment of 100, and two more sizes: 1500 and 2000.
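The resulting grid can be enumerated as in the short sketch below; the exact learning-rate values are an assumption inferred from the values reported in Tables 5-7 (0.01, 0.11, ..., 1.01).

```python
import itertools

momentums = [round(0.1 * i, 2) for i in range(1, 11)]            # 0.1 .. 1.0
learning_rates = [round(0.01 + 0.1 * i, 2) for i in range(11)]   # 0.01 .. 1.01 (assumed)
sizes = list(range(10, 101, 10)) + list(range(200, 1001, 100)) + [1500, 2000]

grid = list(itertools.product(momentums, learning_rates, sizes))
print(len(grid), "configurations per dataset and activation function")
```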

For the electricity dataset the random projection layer (RPL) obtained an accuracy of 85.33%, using a random layer of 100 neurons and a sigmoid activation function. As we can see in Table 4, the best performing algorithm is LB-HT, which achieved 89.8%.

Table 4: GPU tests; best results.

Best Algorithms Random Projection

Dataset Alg Acc (%) Time (s) Activ Size Acc (%) Time (s) speedup

ELEC LB-HT 89.8 10 Sigmoid 100 85.33 1.2 8x

COVT kNN 92.2 605 ReLU 2000 94.59 32 17x

SUSY LB-HT 78.7 530 RBF 600 77.63 172 3x

Table 5: ELEC Evaluation

Activation Random Neurons µ η Accuracy(%)

SIG 100 0.3 0.11 85.33

ReLU 400 0.3 0.01 84.95

ReLU inc 200 0.3 0.01 84.97

RBF γ=0.001 2000 0.7 1.01 72.13

RBF γ=0.01 2000 0.7 1.01 72.13

RBF γ=0.1 2000 0.7 1.01 72.13

RBF γ=1.0 2000 0.7 1.01 72.13

RBF γ=10.0 2000 0.7 1.01 72.13

Compared with the results in Table 2, our method is the second best result, 4.47 percentage points lower.

In the CoverType dataset evaluation RPL obtained the best result for this dataset, with an accuracy of 94.59%, improving on the kNN algorithm by 2.39 percentage points using a ReLU activation function.

Finally, our RPL performed relatively poorly on the SUSY dataset, using 600 random neurons. We obtained 77.63%, 1.07 percentage points less than LB-HT. This gap is smaller than the one obtained on the electricity dataset, but if we rank our result against those in Table 2 we are in sixth position.

With regard to the time to complete, we can see the GPU is faster on all of

the three datasets. For the electricity dataset RPL is 8 times faster, for the

CoverType dataset 17 times faster and 3 times faster for the SUSY dataset.

We now detail the activation curves for each dataset; the momentum and learning rate in each figure are the same across all sizes, and we used the ones from the best results to see how size affects the accuracy.

6.1. Electricity Dataset

Table 5 summarizes the best results for each activation function and its configuration. As seen previously, the sigmoid activation function performed better than the others for this dataset. In second position we find the ReLU and ReLU inc activation functions, which gave similar results, slightly worse than the sigmoid. Regarding the RBF, all configurations we tried performed worse compared to sigmoid, ReLU and ReLU inc, but very similarly for the different gammas. In Figure 7 we can see how accuracy changes with the different sizes we tested.

Figure 7: ELEC dataset: accuracy (%) versus random layer size for the sig, relu, relu inc, and rbf gamma-1.0 activation functions; the best accuracy (85.34%) is marked.

Table 6: COV Evaluation

Activation Random Neurons µ η Accuracy(%)

SIG 1000 0.4 0.11 94.45

ReLU 2000 0.4 0.01 94.59

ReLU inc 2000 0.4 0.01 94.58

RBF γ=0.001 90 0.9 1.01 73.18

RBF γ=0.01 90 0.9 1.01 73.18

RBF γ=0.1 90 0.5 1.01 73.18

RBF γ=1.0 90 0.8 1.01 73.18

RBF γ=10.0 90 1.0 1.01 73.18

Figure 8: COV normalized dataset: accuracy (%) versus random layer size for the sig, relu, relu inc, and rbf gamma-1.0 activation functions; the best accuracy (94.59%) is marked.

6.2. CoverType Dataset

Table 6 gives the best results for the COVT dataset for each activation function. We can see a similar pattern as with the ELEC evaluation: SIG, ReLU and ReLU inc performed much better than the RBFs, and all three can beat the results shown in Table 2. This time the best result is obtained with the ReLU activation function at the random layer.

In Figure 8 we can see the activation curves. Although we got the best result with ReLU, the sigmoid has a better learning curve and is very close to the ReLU accuracy. ReLU inc has a very similar learning curve to the standard ReLU. The different RBFs, for the same momentum and learning rate, give very similar (if not equal) results across different sizes, so we chose the smaller sizes.

6.3. SUSY Dataset

Table 7 shows the best results for each activation function on the SUSY dataset, and Figure 9 the learning curves. The most noticeable effect is that sigmoid, ReLU and ReLU inc stop learning very soon: with only 20 random neurons, ReLU reached its maximum peak with 74.85%. The RBFs, which performed poorly in the previous evaluations, here are the ones with the best results.

One curious result is that the RBFs perform at around 70-something percent in all 3 evaluations. Even if 2 of the 3 results are not very good, it seems they are not very sensitive to the different datasets, and somehow the results are stable across different data distributions.


Table 7: SUSY Evaluation

Activation Random Neurons µ η Accuracy(%)

SIG 20 1 0.61 67.28

ReLU 20 1 0.61 74.84

ReLU inc 20 1 0.91 74.80

RBF γ=0.001 600 1 0.71 77.63

RBF γ=0.01 600 1 0.71 77.63

RBF γ=0.1 600 1 0.71 77.63

RBF γ=1.0 600 1 0.71 77.63

RBF γ=10.0 600 1 0.71 77.63

Figure 9: SUSY dataset: accuracy (%) versus random layer size for the sig, relu, relu inc, and rbf gamma-1.0 activation functions; the best accuracy (77.63%) is marked.

7. Conclusions

In this paper, we studied combinations of Hoeﬀding trees, nearest neighbour,

and gradient descent methods adding a layer based on a random feature function

ﬁlter. We found that this random layer can turn a simple gradient descent

learner into a competitive method for real-time data analysis. With this ﬁrst

attempt we could even improve on current state-of-the-art algorithms, scoring

the best and the second best results for two out of three datasets tested. Like

Hoeffding Trees and nearest neighbour methods, but unlike many other

gradient descent-based methods, the random layer works well without intensive

parameter tuning.

We successfully extended and implemented these methods on GPUs, obtaining powerful

predictive performance. This suggests that using GPUs for data stream min-

ing is a promising research topic for obtaining new fast and adaptive machine

learning methodologies.

In the future we intend to look at adding and pruning units incrementally in the stream over time, to make more efficient use of memory and to adapt to drifting concepts. We would also like to continue studying how to obtain new, highly scalable methods using GPUs.

Acknowledgment

This work was supported in part by the Aalto University AEF research pro-

gramme http://energyefficiency.aalto.fi/en/, by NVIDIA through the

UPC/BSC GPU Center of Excellence, and the Spanish Ministry of Science and

Technology through the TIN2012-34557.

[1] W. Qu, Y. Zhang, J. Zhu, Q. Qiu, Mining multi-label concept-drifting data

streams using dynamic classiﬁer ensemble, in: Asian Conference on Ma-

chine Learning, Vol. 5828 of Lecture Notes in Computer Science, Springer,

2009, pp. 308–321.

[2] P. Domingos, G. Hulten, Mining high-speed data streams, in: Proceedings

of the 6th ACM SIGKDD International Conference on Knowledge Discov-

ery and Data Mining, 2000, pp. 71–80.

[3] A. Bifet, G. Holmes, B. Pfahringer, Leveraging bagging for evolving data

streams, in: ECML PKDD’10, Springer-Verlag, Berlin, Heidelberg, 2010,

pp. 135–150.

[4] A. Shaker, E. H¨ullermeier, Instance-based classiﬁcation and regression on

data streams, in: Learning in Non-Stationary Environments, Springer New

York, 2012, pp. 185–201.

[5] J. Read, A. Bifet, B. Pfahringer, G. Holmes, Batch-incremental versus

instance-incremental learning in dynamic and evolving data, in: 11th Int.

Symposium on Intelligent Data Analysis, 2012.


[6] G. Huang, What are extreme learning machines? Filling the gap between Frank Rosenblatt's dream and John von Neumann's puzzle, Cognitive Com-

putation 7 (3) (2015) 263–278. doi:10.1007/s12559-015-9333-0.

URL http://dx.doi.org/10.1007/s12559-015-9333-0

[7] A. Bifet, G. Holmes, R. Kirkby, B. Pfahringer, MOA: Massive Online Anal-

ysis http://moa.cs.waikato.ac.nz/, Journal of Machine Learning Research

(JMLR).

[8] G. Holmes, R. Kirkby, B. Pfahringer, Stress-testing Hoeﬀding trees, in: 9th

European Conference on Principles and Practice of Knowledge Discovery

in Databases (PKDD ’05), 2005, pp. 495–502.

[9] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning,

Springer Series in Statistics, Springer New York Inc., New York, NY, USA,

2001.

[10] G. Hinton, R. Salakhutdinov, Reducing the dimensionality of data with

neural networks, Science 313 (5786) (2006) 504 – 507.

[11] V. Nair, G. E. Hinton, Rectiﬁed linear units improve restricted boltzmann

machines, in: Proceedings of the 27th International Conference on Machine

Learning (ICML-10), June 21-24, 2010, Haifa, Israel, 2010, pp. 807–814.

[12] G. Hinton, Training products of experts by minimizing contrastive diver-

gence, Neural Computation 14 (8) (2000) 1711–1800.

[13] J. Read, F. Perez-Cruz, A. Bifet, Deep learning in multi-label data-streams,

in: Symposium on Applied Computing, ACM, 2015.

[14] G.-B. Huang, D. Wang, Y. Lan, Extreme learning machines: a survey,

International Journal of Machine Learning and Cybernetics 2 (2) (2011)

107–122. doi:10.1007/s13042-011-0019-y.

[15] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using in-

cremental constructive feedforward networks with random hidden nodes,

Neural Networks, IEEE Transactions on 17 (4) (2006) 879–892. doi:

10.1109/TNN.2006.875977.

[16] G.-B. Huang, M.-B. Li, L. Chen, C.-K. Siew, Incremental extreme learning machine with fully complex hidden nodes (2007).

[17] D. Marron, A. Bifet, G. D. F. Morales, Random forests of very fast deci-

sion trees on gpu for mining evolving big data streams, in: 21st European

Conference on Artiﬁcial Intelligence 2014, 2014.

[18] H. Grahn, N. Lavesson, M. H. Lapajne, D. Slat, Cudarf: A cuda-based

implementation of random forests., in: H. J. Siegel, A. El-Kadi (Eds.),

AICCSA, IEEE Computer Society, 2011, pp. 95–101.


[19] H. Schulz, B. Waldvogel, R. Sheikh, S. Behnke, CURFIL: random forests

for image labeling on GPU, in: VISAPP 2015 - Proceedings of the 10th

International Conference on Computer Vision Theory and Applications,

Volume 2, Berlin, Germany, 11-14 March, 2015., 2015, pp. 156–164.

[20] V. Garcia, E. Debreuve, M. Barlaud, Fast k nearest neighbor search using

gpu, in: 2008 IEEE Computer Society Conference on Computer Vision and

Pattern Recognition Workshops, 2008, pp. 1–6.

[21] L. Huang, Z. Liu, Z. Yan, P. Liu, Q. Cai, An implementation of high perfor-

mance parallel knn algorithm based on gpu, in: Networking and Distributed

Computing (ICNDC), 2012 Third International Conference on, 2012, pp.

30–30. doi:10.1109/ICNDC.2012.15.

[22] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou,

Y. Chen, Pudiannao: A polyvalent machine learning accelerator, in: Pro-

ceedings of the Twentieth International Conference on Architectural Sup-

port for Programming Languages and Operating Systems, ASPLOS ’15,

2015, pp. 369–381.

[23] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140. doi:

http://dx.doi.org/10.1023/A:1018054314350.

[24] N. C. Oza, S. J. Russell, Experimental comparisons of online and batch

versions of bagging and boosting, in: KDD, 2001, pp. 359–364.

[25] N. Oza, S. Russell, Online bagging and boosting, in: Artiﬁcial Intelligence

and Statistics 2001, Morgan Kaufmann, 2001, pp. 105–112.

[26] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, R. Gavald`a, New ensem-

ble methods for evolving data streams, in: ACM SIGKDD international

conference on Knowledge discovery and data mining (KDD ’09), 2009, pp.

139–148.

[27] J. Read, J. Hollm´en, A deep interpretation of classiﬁer chains, in: Advances

in Intelligent Data Analysis XIII - 13th International Symposium, IDA

2014, 2014, pp. 251–262.

[28] P. Baldi, P. Sadowski, D. Whiteson, Searching for exotic particles in high-

energy physics with deep learning, Nature Communications 5 (4308).

[29] A. Bifet, R. Gavald`a, Learning from time-changing data with adaptive

windowing, in: SIAM International Conference on Data Mining, 2007.
