
Data Stream Classification using Random Feature Functions and Novel Method Combinations



Big Data streams are becoming faster, bigger, and more commonplace. In this scenario, Hoeffding Trees are an established method for classification. Several extensions exist, including high-performing ensemble setups such as online and leveraging bagging. Also, k-nearest neighbors is a popular choice, with most extensions dealing with the inherent performance limitations over a potentially-infinite stream. At the same time, gradient descent methods are becoming increasingly popular, owing in part to the successes of deep learning. Although deep neural networks can learn incrementally, they have so far proved too sensitive to hyper-parameter options and initial conditions to be considered an effective ‘off-the-shelf’ data-streams solution. In this work, we look at combinations of Hoeffding trees, nearest neighbour, and gradient descent methods with a streaming preprocessing approach in the form of a random feature functions filter for additional predictive power. We further extend the investigation to implementing methods on GPUs, which we test on some large real-world datasets, and show the benefits of using GPUs for data-stream learning due to their high scalability. Our empirical evaluation yields positive results for the novel approaches that we experiment with, highlights important issues, and sheds light on promising future directions in approaches to data-stream classification.
Diego Marrón (a), Jesse Read (b), Albert Bifet (c), Nacho Navarro (a)
(a) Department of Computer Architecture, Universitat Politècnica de Catalunya and with the
Department of Computer Science, Barcelona Supercomputing Center, Spain
(b) Aalto University and HIIT, Finland
(c) Télécom ParisTech, Paris, France
Keywords: Data Stream Mining, Big Data, Classification, GPUs.
1. Introduction
There is a trend towards working with big and dynamic data sources. This
tendency is clear both in real world applications and the academic literature.
Preprint submitted to Elsevier November 4, 2015
arXiv:1511.00971v1 [cs.LG] 3 Nov 2015
Many modern data sources are not only dynamic but often generated at high
speed and must be classified in real time. Such contexts can be found in sensor
applications (e.g., tracking and activity monitoring), demand prediction (e.g.,
of electricity), manufacturing processes, robotics, email, news feeds, and social
networks. Real-time analysis of data streams is becoming a key area of data
mining research as the number of applications in this area grows.
The requirements for a classifier in a data stream are to:
• Be able to make a classification at any time
• Deal with a potentially infinite number of examples
• Access each example in the stream just once
These requirements can in fact be met by a variety of learning schemes, including
even batch learners (e.g., [1]), where batches are constantly gathered over
time, and newer models replace older ones as memory fills up. Nevertheless,
incremental methods remain strongly preferred in the data streams literature,
particularly the Hoeffding tree (HT) and its variations [2, 3], and k-nearest
neighbors (kNN) [4]. Support for these options is given by large-scale empirical
comparisons [5], where it is also found that methods such as naive Bayes and
those based on stochastic gradient descent (SGD) are relatively poor performers.
Classification in data streams is a major area of research, in which Hoeffding
trees have long been a favoured method. The main contribution of this paper
is to show that random feature functions can be leveraged by other algorithms
to obtain similar or even improved performance over tree-based methods.
With the recent popularity of Deep Learning (DL) methods, we also want to
test how random features in the form of a random projection layer perform in
Deep Neural Networks (DNNs).
DL aims for a better data representation at multiple layers of abstraction,
and for each layer the network needs to be fine-tuned. In classification, a
common algorithm to fine-tune the network is SGD, which tries to minimize the
error at the output layer using an objective function such as Mean Squared
Error (MSE). A gradient vector is used to back-propagate the error to previous
layers. The gradient-based nature of the algorithm makes it suitable for
training in batches of size one, similar to how incremental learners are trained.
Unfortunately, DNNs are very sensitive to hyper-parameters such as the learning
rate ($\eta$), the momentum ($\mu$), the number of neurons per level, and the
number of levels. It is therefore not straightforward to provide an off-the-shelf
method for data streams.
Propagation between layers is usually done in the form of matrix-vector
or matrix-matrix multiplications, which are computationally intensive operations.
Hardware accelerators such as FPGAs or GPUs are often used to accelerate the
calculations. Despite some efforts, the acceleration of HT and kNN algorithms for
data streams on GPUs has some limitations. We discuss this briefly
in Section 2.
In recent years, Extreme Learning Machines (ELMs) [6] have emerged as
a popular framework in Machine Learning. ELMs are a type of feed-forward
neural network characterized by a random initialization of their hidden layer,
combined with a fast training algorithm. Our random feature method is based
on this approach.
We made use of the MOA (Massive Online Analysis) framework [7], a soft-
ware environment for implementing algorithms and running experiments for
online learning from data streams in Java. It implements a large number of
modern methods for classification in streams, including HT, kNN, and SGD-
based methods. We make use of MOA’s extensive library of methods to form
novel combinations with these methods and further employ an extremely rapid
preprocessing technique of projecting the input into a new space via random
feature functions (similar to ELMs). We then took the methods purely related
to Neural Networks (those which proved most promising under random projections)
and implemented them using NVIDIA GPUs and CUDA 7.0, comparing
performance to the methods in MOA.
This paper is organized as follows: Section 2 introduces related work on tree
based approaches, neural networks, and data streams on GPU. We discuss the
use of random features in Sections 3 and 4 for HT/kNN methods and neural
networks respectively. We first present the evaluation of tree-based methods
in Section 5 and later in Section 6 we extend the SGD method in the form of
DNNs, using different activation functions. We finally conclude the paper in
Section 7.
2. Related Work
Hoeffding trees [2] are state-of-the-art in classification for data streams, and
they predict by choosing the majority class at each leaf. However, these trees
may be conservative at first, and in many situations the naive Bayes method
outperforms the standard Hoeffding tree initially, although it is eventually
overtaken [8]. A hybrid adaptive method proposed in [8] is a Hoeffding tree with
naive Bayes at the leaves, i.e., returning a naive Bayes prediction at a leaf if
it has so far been more accurate overall than the majority class. Given its
widespread acceptance, this is the default in MOA, and we denote this method
in the experimental Section simply as HT. In fact, the naive Bayes classification
comes for free, since it can be made with the same statistics that are collected
anyway by the tree.
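The adaptive choice at a leaf can be illustrated with a short Python sketch. It is not the MOA implementation: the class name `AdaptiveLeaf`, the restriction to nominal attributes, and the add-one smoothing (which assumes two values per attribute) are assumptions made for illustration only. The sketch shows how both predictors share the same counts, and how naive Bayes is used only while it has been more accurate than the majority class.

```python
import math
from collections import Counter


class AdaptiveLeaf:
    """Leaf that answers with naive Bayes only while NB has beaten majority class."""

    def __init__(self):
        self.class_counts = Counter()   # n(y)
        self.value_counts = Counter()   # n(attribute index, value, y)
        self.mc_correct = 0             # hits of the majority-class predictor
        self.nb_correct = 0             # hits of the naive Bayes predictor

    def _majority(self):
        if not self.class_counts:
            return None
        return self.class_counts.most_common(1)[0][0]

    def _naive_bayes(self, x):
        if not self.class_counts:
            return None
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for y, n_y in self.class_counts.items():
            score = math.log(n_y / total)
            for i, v in enumerate(x):
                # Add-one smoothing; the "+ 2" assumes binary attributes.
                score += math.log((self.value_counts[(i, v, y)] + 1) / (n_y + 2))
            if score > best_score:
                best, best_score = y, score
        return best

    def predict(self, x):
        if self.nb_correct > self.mc_correct:
            return self._naive_bayes(x)
        return self._majority()

    def learn(self, x, y):
        # Score both predictors on the example before updating the counts.
        self.mc_correct += self._majority() == y
        self.nb_correct += self._naive_bayes(x) == y
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.value_counts[(i, v, y)] += 1
```

Both predictors read from `class_counts` and `value_counts`, which mirrors the remark that the naive Bayes prediction needs no statistics beyond those the tree already collects.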
Other established examples of feature transformation include principal component
analysis (reviewed also in [9]) and Restricted Boltzmann Machines (RBMs) [10].
RBMs can be seen as a probabilistic binary version of PCA
for finding higher-level feature representations. They have received widespread
popularity in recent years due to their use in successful deep learning approaches.
In this case, $z = \phi(x) = f(W^\top x)$ for some non-linearity $f$: a sigmoid function
is typical, but more recently rectified linear units (ReLUs, [11]) have come into
favour. The weight matrix $W$ is learned with gradient-based methods [12], and
the projected output should provide a better feature representation for a neural
network or any off-the-shelf method. This approach was already applied to data
streams in [13], where it was concluded that the sensitivity to hyper-parameters and
initial conditions prevented good ‘out-of-the-box’ deployment in data streams.
Approaches such as the so-called extreme learning machines (ELMs) [14]
avoid tricky parametrizations by simply using random functions (indeed, ELMs
are basically linear learners on top of non-linear data transformations).
Despite the hidden layer weights being random, it has been proven that ELMs are
still capable of universal approximation of any non-constant piecewise continuous
function [15].
An incremental version of ELMs is proposed in [16]. It starts with
a small network, and new neurons are added at each step until a stopping
criterion of size or residual error is reached. The difference with our incremental
approach is that we process one instance at a time, simulating their arrival over
time, and we incrementally train the network. Also, our number of neurons is fixed
during training; in other words, we do not add or remove any neurons during the process.
Nowadays, in 2015, it is difficult to talk about DL and DNNs without
mentioning GPUs. They are massively parallel architectures providing
outstanding performance for High Performance Computing and a very good
performance/watt ratio, and their architecture suits the needs of
DNN computations very well. Many tools include a back-end to offload the
computation to the GPU. NVIDIA has its own portal for deep learning on GPUs.
GPUs have not only been used to accelerate DL/DNN computations; they have
also been used to successfully accelerate HT and ensembles.
However, few works are provided in the context of data streams.
The only work we are aware of regarding HT in the context of online real-time
data stream mining is [17], where the authors present a parallel implementation
of HT and Random Forests for binary trees and data streams, achieving
good speedups but with limitations on size and with high memory consumption.
A more generic HT implementation of Random Forests is presented in
[18]. In [19] the authors introduced an open source library, available on GitHub,
to predict image labelling using random forests. The library is also tested
on a cell phone at VGA resolution in real time, with good results.
Also, kNN has already been successfully ported to GPUs [20]. That paper
presented one of the first implementations of the “brute force” kNN on GPUs,
and compared it with several CPU-based implementations, with speedups of up to
two orders of magnitude. kNN is also used in business intelligence [21], where it
also has a GPU implementation. As with HT, a tool for
machine learning (including kNN) is described in [22].
3. Tree Based Random Feature Functions
Transforming the feature space prior to learning and classification is an
established idea in the statistical and machine learning literature [9], for example
with basis (or feature) functions. Suppose the input instance is $x$ of length $d$.
This vector is transformed to a new space $z = \phi(x)$ via function $\phi$, creating a
new vector $z$ of length $h$. Any off-the-shelf model now treats $z$ as if it were the
input. The functions can either be chosen suitably by a domain expert, or simply
chosen to achieve a higher-dimensional representation of the input. Polynomials
and splines are a typical choice.
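As a concrete instance of such a transformation, the following Python sketch expands an instance of length $d$ into all monomials up to a given degree. The function name `polynomial_features` and the choice of monomials without a constant term are illustrative assumptions, not a construction taken from the cited literature.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(x, degree=2):
    """Map x (length d) to z = phi(x): all monomials of degree 1..degree."""
    z = []
    for k in range(1, degree + 1):
        # Each index tuple, e.g. (0, 1), selects the factors of one monomial.
        for idx in combinations_with_replacement(range(len(x)), k):
            z.append(prod(x[i] for i in idx))
    return z
```

For $x = (2, 3)$ and degree 2 this gives $z = (x_1, x_2, x_1^2, x_1 x_2, x_2^2) = (2, 3, 4, 6, 9)$, so $d = 2$ becomes $h = 5$; a downstream learner would simply receive $z$ in place of $x$.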
Regarding HTs with additional algorithms in the leaves (as described in
Section 2), this filter can either be placed before the HT, or before the method
in the leaves, or both.
In this paper we adapt this methodology to deal with other classifiers in a
similar way, namely kNN and an SGD-based method (rather than naive Bayes)
at the leaves. We denote these cases HT-kNN and HT-SGD, respectively. For
example, in HT-SGD, a gradient descent learner is employed in the leaves of
each tree. Similarly to HT, predictions by the kNN and an SGD-based method
are only used if they are more accurate on average than the majority class.
3.1. Ensembles in Data Streams
Bagging is an ensemble method used to improve the accuracy of classifier
methods. Non-streaming bagging [23] builds a set of $M$ base models, training
each model with a bootstrap sample of size $N$ created by drawing random
samples with replacement from the original training set. Each base model’s
training set contains each of the original training examples $K$ times, where
$P(K = k)$ follows a binomial distribution. For large values of $N$, this binomial
distribution tends to a Poisson($\lambda = 1$) distribution, where $P(K = k) = \exp(-1)/k!$.
Using this fact, Oza and Russell [24, 25] proposed Online Bagging, an online
method that instead of sampling with replacement, gives each example a weight
according to Poisson(1). ADWIN Bagging [26] is an adaptive version of Online
Bagging that uses a change detector to decide when to discard under-performing
ensemble models.
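The Poisson weighting behind Online Bagging can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation of [24, 25] or MOA: `MajorityClass` is a hypothetical stand-in for a real base learner, the Poisson sampler uses Knuth's multiplication method, and the change detector of ADWIN Bagging is omitted. Passing a larger `lam` corresponds to the heavier resampling that Leveraging Bagging relies on.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class MajorityClass:
    """Hypothetical base learner: predicts the class with the largest weight."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, x, y, weight=1):
        self.counts[y] += weight

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class OnlineBagging:
    def __init__(self, make_model, n_models=10, lam=1.0, seed=0):
        self.models = [make_model() for _ in range(n_models)]
        self.lam = lam
        self.rng = random.Random(seed)

    def learn(self, x, y):
        for model in self.models:
            k = poisson(self.lam, self.rng)  # weight of (x, y) for this model
            if k > 0:
                model.learn(x, y, weight=k)

    def predict(self, x):
        votes = Counter(m.predict(x) for m in self.models)
        votes.pop(None, None)  # ignore models that have not seen any data
        return votes.most_common(1)[0][0] if votes else None
```

With `lam=1.0`, each model ignores an example with probability $\exp(-1) \approx 0.37$, which is the online counterpart of an example being left out of a bootstrap sample.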
Leveraging Bagging (LB, [3]) improves ADWIN Bagging by increasing the weights
of this resampling, using a larger value of $\lambda$ to compute the value of the
Poisson distribution. The Poisson distribution is used to model the number of events
occurring within a given time interval. LB proved very competitive.
Again, we can run a filter on the input instances before entering the ensemble
of trees, or at the leaves. It is even possible to run the filter again on the
output of an ensemble (i.e., the votes), before running an additional stacking
procedure. This kind of methodology can give way to rather ‘deep’ classifiers.
Figure 1 illustrates a possible setup. In this sense of multiple levels we could
also call our approach deep learning. It is debatable whether decision trees can
be called a deep method, because their levels simply partition an existing feature
space rather than create a higher-level feature space. However, several of the
methods we investigate have at least multiple levels of feature transformation,
which is behind the power of most deep methods. In the following Section we
investigate the empirical performance of several novel combinations based on the
methodology described so far.
Figure 1: An example setup: Input $x$ is filtered (i.e., projected) to random layer
$z$ (first layer of connections), which goes to an ensemble of, for example, HTs
(second layer), wherein instances are partitioned to the leaves and are again
filtered (third layer) and used as training for, say, SGD, producing (in the final
layer of connections) the final vote $y$. Note, however, that we only draw the final
two layers with respect to the first of the HT models.
Figure 2: Random Projection Layer: Input $x$ is projected to a random layer $z$
(first layer of connections), which is trained to produce the final vote $y$.
4. Neural Networks with Random Projections for Data Streams
Data streams are potentially infinite, and so they can evolve with time. This
means the statistical distribution of the data we are interested in can change.
The idea behind the random layer is to improve data localization across the
space the trained layer sees. Imagine the instance is a tiny luminous point in
the space: with enough random neurons acting as mirrors, we hope the trained
layer can better capture the data movement. The strategy used by the random
projection layer is shown in Figure 2.
Except for the fact that it is never trained, the random layer is a normal layer
and needs an activation function. In this work, sigmoid, ReLU, ReLU incremental,
and Radial Basis Function activations are used.
The sigmoid function used is the standard one, with $\sigma(a_k) \in (0, 1)$:
$$\sigma(a_k) = \frac{1}{1 + e^{-a_k}}$$
where $a_k = W_k^\top x$ is the $k$-th activation and $W$ is the $d \times h$ weight
matrix ($d$ input attributes, $h$ output features).
Figure 3: Terrain of a ReLU basis function on two input attributes $x_1$, $x_2$; the
feature function $z$ is given on the vertical axis.
ReLU functions are defined as:
$$z_k = f(a_k) = \max(0, a_k)$$
As stated in Section 2, ReLU activations are very efficient as they require only
a comparison. In our random projection we expect nearly 50% of the neurons to
be active for a single instance (the terrain of a ReLU is exemplified in Figure 3).
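The expectation that about half of the random neurons are active can be checked with a small Python sketch. It assumes zero-mean Gaussian weights and no bias term, in which case each activation $a_k$ is symmetric around zero; the function name `random_relu_layer`, the layer sizes, and the example instance are illustrative choices.

```python
import random


def random_relu_layer(d, h, seed=0):
    """Build z = max(0, W^T x) with a fixed random d x h weight matrix W."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(h)] for _ in range(d)]

    def forward(x):
        # a_k is the k-th activation: the k-th column of W applied to x.
        a = [sum(W[i][k] * x[i] for i in range(d)) for k in range(h)]
        return [max(0.0, a_k) for a_k in a]

    return forward


forward = random_relu_layer(d=8, h=2000)
z = forward([0.3, -0.7, 0.1, 0.9, -0.2, 0.5, -0.4, 0.8])
# Fraction of neurons with a strictly positive output for this instance.
active_fraction = sum(1 for z_k in z if z_k > 0) / len(z)
```

Under these assumptions `active_fraction` should be close to 0.5 for any non-zero instance; a non-zero bias or a non-symmetric weight distribution would shift it.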
One variation on the standard ReLU is to use the mean value
of the attribute as a threshold. The mean value is calculated incrementally as
instances arrive. We call this variant ReLU incremental, and it is defined as:
$$f(a_k) = \max(\bar{a}_k, a_k)$$
where $\bar{a}_k$ is the incrementally computed mean.
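A minimal Python sketch of this variant follows. It assumes that the running mean is kept per output and is updated with the current activation before thresholding; the text does not fix that order, and the class name `IncrementalReLU` is hypothetical.

```python
class IncrementalReLU:
    """ReLU variant whose threshold is the running mean of each activation."""

    def __init__(self, h):
        self.mean = [0.0] * h   # one running mean per output
        self.n = 0              # number of instances seen so far

    def __call__(self, a):
        self.n += 1
        out = []
        for k, a_k in enumerate(a):
            # Incremental mean update: no past activations are stored.
            self.mean[k] += (a_k - self.mean[k]) / self.n
            out.append(max(self.mean[k], a_k))
        return out
```

For the activation sequence 2, 0, 4 on a single neuron, the running means are 2, 1, 2 and the outputs are 2, 1, 4: the second value is lifted to the mean instead of being clipped at zero.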
The last activation function we use is the Radial Basis Function (RBF):
$$\phi(x) = e^{-\frac{(x - c_i)^2}{2\sigma^2}}$$
where $x$ is an input instance attribute value and $c_i$ is a random value set at
initialization time. Each neuron in the random layer has its own set of $c_i$, and
the vectors $x$ and $c$ have the same length. So, we can see the operation
$(x - c_i)^2$ as the squared Euclidean distance of the current instance to a randomly
positioned center. $\sigma^2$ is a free parameter. A simplification we can make to
this notation is:
$$\gamma = \frac{1}{2\sigma^2}$$
In our experiments we try different $\gamma$ values passed at the command line. We
use the following notation in our experiments:
$$\phi(x) = e^{-\gamma (x - c_i)^2}$$
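The RBF layer in the $\gamma$ notation can be sketched in Python as follows. The centers are drawn here from a standard Gaussian, which is an assumption made for the example, and `make_rbf_layer` is a hypothetical helper name.

```python
import math
import random


def make_rbf_layer(d, h, gamma, seed=0):
    """Return (forward, centers) for h RBF neurons with random fixed centers."""
    rng = random.Random(seed)
    centers = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(h)]

    def forward(x):
        out = []
        for c in centers:
            # Squared Euclidean distance between the instance and the center.
            sq_dist = sum((x_j - c_j) ** 2 for x_j, c_j in zip(x, c))
            out.append(math.exp(-gamma * sq_dist))
        return out

    return forward, centers
```

Each output lies in $(0, 1]$ and equals 1 exactly when the instance coincides with the center. A larger $\gamma$ narrows the region in which a neuron responds, so the same instance produces smaller outputs.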
All matrices and vectors in our model are initialized using random numbers.
Matrices are used as normal weight matrices, but the function of the vectors
depends on the activation function. Usually, initialization is done using random
numbers with $\mu = 0$ and $\sigma = 1$. Assuming our data range is $[-1, 1]$, if we put
a Gaussian centered at one of the endpoints, half of its area of influence is
wasted and will never see a point, making it harder to cover the whole space and
therefore to discover points.
Figure 4: Random number initialization strategies
If a smaller range is used, $\sigma \in (0, 1)$ (note the open interval), we can improve
each neuron’s area of influence, as shown in Figure 4b. In red, the range of the
random numbers is smaller than the data range, so a Gaussian placed at the endpoint
of the random range has a larger useful area of influence. We used a Gaussian
function as an example, but the idea extends in the same way to other activation
functions. In fact, this is what we do in Section 6, especially for the
sigmoid neurons, as they are always used at the trained layer.
5. Random Feature Function Evaluation
Among the methods we investigate (e.g., HT, kNN, SGD¹), different levels
of filters and ensembles, and possibly additional classification in the leaves (in
the case of HT), there is a multitude of possible combinations. We first
investigate the viability of random feature functions and their effect on the
different classifiers, comparing these common methods with their ‘filtered’
versions, which we denote HT-F, kNN-F, and SGD-F. This study led us to novel
combinations, which we further compare to the benchmark methods and state-of-the-art
Leveraging Bagging (LB-HT).
The random features used in these evaluations are basically ELMs [14]. ELMs are
defined using Radial Basis Functions, but in this Section we use only ReLU as the
activation function. Both functions are defined in detail in Section 4, where we
also extend the functions used within the random feature to define
a random projection layer for DNNs.
All experiments in this section were carried out using the MOA framework [7]
with prequential evaluation: each individual example is used to test the model
before it is used for training, and from this the accuracy can be incrementally
updated. We used an 8-core (3.20GHz each) desktop machine allowing up to 1
gigabyte of RAM per run (all methods were able to finish).
¹ We refer, in this case, to the instantiation with default parameters in MOA, i.e.,
minimizing hinge loss.
Table 1: Data sources used in the experimental evaluation. Synthetic datasets
are listed first.
Dataset       #Attributes   #Instances
RBF1                   10      100,000
HYP1                   10      100,000
LED1                   24      100,000
LOC1                   25      100,000
LOC2                 2500      100,000
Poker                  10      829,201
Electricity             8       45,312
CoverType              54      581,012
SUSY                    8    5,000,000
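The test-then-train protocol of prequential evaluation can be sketched in Python as follows. This is an illustration, not MOA code: `MajorityClass` is a hypothetical stand-in for any incremental classifier exposing `predict` and `learn`.

```python
from collections import Counter


class MajorityClass:
    """Hypothetical incremental learner: predicts the most frequent class so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Use each example first for testing, then for training, exactly once."""
    correct = 0
    seen = 0
    for x, y in stream:
        if model.predict(x) == y:   # test on the example first
            correct += 1
        model.learn(x, y)           # then train on it
        seen += 1
    return correct / seen if seen else 0.0
```

On the label sequence a, a, b, a the majority-class learner is right on the second and fourth examples, giving a prequential accuracy of 0.5. Each example is visited once, which matches the single-pass requirement stated in the Introduction.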
Table 1 lists the data sources used. A thorough description of most of the
datasets is given in [5]. Of the others, LOC1 and LOC2 are datasets dealing
with classifying the location of an object in a grid of 5 × 5 and 50 × 50 pixels,
respectively, described in [27]. SUSY [28] has features that are kinematic
properties measured by particle detectors in an accelerator. The binary class
distinguishes between a signal process which produces supersymmetric particles
and a background process which does not. It is one of the largest datasets in
the UCI repository that we could find.
For the feature filter we used $h = 5d$ hidden units for kNN-F and
$h = 10d$ for SGD-F and HT-F, except where this is varied in Figure 5. This
decision is based on the relative computational sensitivity of kNN to a larger
attribute space: for LOC2 this means 25,000 attributes in the projected space
for SGD-F, and half of that for kNN-F. For kNN we used a buffer size of 5000.
For LB we specify 10 models. In other cases, the default parameters in MOA
are used.
Figure 5 displays the results of varying the relative size of the new feature
space (with respect to the original feature space) on two real-world datasets. Note
that the feature space is different, so even when this ratio is 1 : 1, performance may
differ.
With regard to kNN, performance improves with more feature functions. In
one of the two cases, this is sufficient to overtake kNN on the original feature
space. Unfortunately, kNN is particularly sensitive to the number of attributes,
so complexity becomes an issue long before other methods. The new feature
space does not help the performance of HT, and in neither case does it reach the
performance of HT on the original feature space. In fact, it begins to decrease
again. This is because too many features make it difficult for HT to become
confident enough to split, and it may split poorly. Also, by partitioning the
feature space, interaction between the features is lost. SGD reacts best to a new
feature space. As noticed earlier [5], SGD is a poor performer compared to HTs;
however, working in a feature space of random ReLUs, SGD-F actually reaches
HT performance (on SUSY, and looks promising under ELEC) with similar time
complexity. Even at 1,000 times the original feature space, running time is
acceptable (only several seconds per 10,000 instances). On the other hand,
the increased memory use is significant across all methods. SGD requires 1,000
times more memory in this setting.
Table 2: Final Accuracy and Running Times. The dataset-wise ranking is given
in (parentheses) and the average of these ranks is given in the final row.
(a) Accuracy
RBF1 75.0 (5) 54.5 (9) 92.0 (2) 88.7 (4) 72.0 (8) 90.4 (3) 72.0 (7) 92.6 (1) 73.7 (6)
RBFD 65.7 (5) 51.3 (9) 88.6 (1) 79.5 (4) 59.8 (8) 86.3 (2) 59.9 (7) 84.9 (3) 59.9 (6)
HYP1 87.7 (1) 50.3 (9) 82.9 (4) 85.7 (2) 67.2 (7) 77.0 (5) 67.2 (7) 83.3 (3) 67.9 (6)
LED1 73.1 (1) 10.3 (9) 62.8 (3) 72.0 (2) 15.6 (6) 49.0 (5) 15.5 (7) 62.8 (3) 15.5 (7)
POKR 76.1 (6) 68.9 (9) 69.3 (8) 87.6 (1) 82.3 (2) 81.5 (4) 81.9 (3) 74.8 (7) 80.1 (5)
LOC1 85.5 (8) 80.4 (9) 91.0 (2) 90.5 (6) 90.7 (4) 88.8 (7) 90.7 (3) 91.3 (1) 90.7 (5)
LOC2 56.3 (5) 51.5 (9) 75.7 (2) 52.6 (8) 56.8 (4) 74.5 (3) 55.9 (7) 75.9 (1) 56.1 (6)
ELEC 79.2 (4) 57.6 (9) 78.4 (5) 89.8 (1) 74.8 (7) 74.2 (8) 74.8 (6) 82.5 (2) 81.8 (3)
COVT 80.3 (5) 60.7 (9) 92.2 (1) 91.7 (2) 78.7 (6) 91.6 (3) 78.7 (7) 91.2 (4) 78.3 (8)
SUSY 78.2 (3) 76.5 (7) 67.5 (9) 78.7 (1) 77.7 (4) 71.2 (8) 77.7 (5) 77.2 (6) 78.4 (2)
avg rank 4.30 8.80 3.70 3.10 5.60 4.80 5.90 3.10 5.40
(b) Running Time (s)
RBF1 0 (3) 0 (1) 3 (6) 3 (5) 4 (8) 14 (9) 0 (2) 4 (7) 1 (4)
RBFD 1 (3) 0 (1) 3 (6) 2 (5) 4 (8) 15 (9) 0 (2) 4 (7) 1 (4)
HYP1 0 (2) 0 (1) 3 (6) 2 (5) 4 (7) 13 (9) 0 (3) 4 (8) 1 (4)
LED1 0 (2) 0 (1) 7 (6) 2 (5) 17 (8) 40 (9) 1 (3) 8 (7) 1 (4)
POKR 9 (2) 3 (1) 455 (8) 91 (5) 279 (6) 1539 (9) 21 (3) 422 (7) 26 (4)
LOC1 1 (2) 0 (1) 8 (7) 2 (5) 21 (8) 48 (9) 1 (3) 8 (6) 2 (4)
LOC2 9 (2) 4 (1) 1276 (7) 93 (3) 1917 (8) 2270 (9) 367 (5) 1230 (6) 350 (4)
ELEC 1 (3) 0 (1) 14 (7) 10 (6) 9 (5) 49 (9) 1 (2) 19 (8) 2 (4)
COVT 19 (2) 11 (1) 605 (6) 220 (3) 4119 (9) 3998 (8) 233 (4) 727 (7) 250 (5)
SUSY 45 (2) 25 (1) 1464 (8) 530 (5) 1040 (6) 4714 (9) 118 (3) 1428 (7) 159 (4)
avg rank 2.30 1.00 6.70 4.70 7.30 8.90 3.00 7.00 4.10
From this initial investigation we formulate several method combinations
for a more extensive evaluation. Table 2 displays the final accuracy over the
data stream. The first four columns represent the baselines and state-of-the-art
(LB-HT), and remaining columns are a selection of new method combinations.
Figure 6 gives a more detailed over-time view of the largest dataset (SUSY),
with the average performance plotted over the entire stream over 100 intervals,
and also the first 1/10th of the data (again, over 100 intervals). The second plot
gives more of an idea about how models respond to fresh concepts. Learning
new concepts is a fundamental part of adapting to concept drift in data streams.
Figure 5: Accuracy (Figure 5a) and Running Time in seconds (Figure 5b) on a
10,000-instance segment of two real-world datasets (ELEC and SUSY) for varying
proportions of $h$ (number of hidden units / basis functions) with respect to $d$.
kNN has been cut out after h = 1000/d for scalability reasons. Note the log scale
on the horizontal axis ($\log h/d$).
Figure 6: Performance over the first 50,000 examples (right) of the SUSY data, in
each case divided into 100 windows. Axes: accuracy (%) and log time (s).
Regarding this experiment, some of the most important observations and
conclusions are as follows:
• SGD-F (i.e., SGD with random feature functions), even in this first analysis,
  out-competes established methods like kNN on several datasets.
• kNN benefits relatively less (than SGD) from the feature functions filter.
  This is expected, since kNN is already a non-linear learner. However, on
  a few datasets accuracy is 5-10 percentage points higher with the filter.
• kNN can be used effectively in the leaves of HT instead of the default
  of naive Bayes. There is an additional computational cost involved, but
  results showed this to be a highly competitive method: equal best overall
  in predictive performance, tied with the state-of-the-art LB-HT.
• HT is difficult to improve on using feature functions (at least with the
  ReLUs that we experimented with). Again, this can be attributed to HT
  being a non-linear learner. Peak accuracy is reached in a relatively short
  space of time.
• SGD takes longer than HT or LB-HT to reach competitive accuracy, but
  the gap narrows significantly with more examples (for example, under
  SUSY). On the largest datasets, the final average accuracy is within a
  percentage point, and this average includes the initial poorer performance.
  Therefore, on particularly big data streams (which are increasingly
  common), HTs could find themselves increasingly challenged to stay ahead of
  these methods.
• HT-SGD-F is comparable to the state-of-the-art LB-HT on several datasets,
  but demonstrates more favourable running times.
• Unlike many deep learning techniques, these random functions do not
  require sensitive calibration.
• Unsurprisingly, kNN-based methods perform best on the dataset RBFD,
  which has a drifting concept, since they automatically phase out older
  concepts. We did not look in detail at dealing with concept drift in
  this paper, but this can be dealt with by ‘meta methods’, e.g., [29].
• Employing random feature functions as a ‘filter’ in the MOA framework is
  a convenient and flexible way to apply it in a range of different data-stream
6. GPU Extended Evaluation
In the evaluations of the previous Section, we noticed that SGD methods gain the
most from random feature functions. Given this and the increasing
popularity of DL methods, we selected this strategy for further investigation
and experimentation in this Section. A natural choice for implementing DNNs
is to use GPUs to accelerate the calculations. Our experiments were run
on an NVIDIA Tesla K40c with 12GB of RAM, 15 SMX and up to 2880
simultaneous threads, using CUDA 7.0.
Another motivation for using GPUs is the selection of the network
hyper-parameters by cross-validation: for each dataset and activation function,
different configurations are tested and the best performing one is chosen. It turned
out that this was a high number of combinations, and a way to accelerate the process
is to use GPUs.
Table 3: Random number initialization strategy for the different activation
functions.
Activation   Weight Matrix           Bias Vector
RBF          mean=0.0 and std=1.0    gamma
Sigmoid      mean=0.0 and std=0.9    mean=0.0 and std=0.2
ReLU         mean=0.0 and std=1.0    mean=0.0 and std=0.1
ReLU Inc     mean=0.0 and std=1.0    0.0
The random projection layer is implemented using a standard two-layer
feed-forward fully connected network. The input is fed to the random layer,
which is never trained, and the output from this layer is forwarded to the trained
layer. In this work we use SGD and MSE as the training algorithm and
objective function, respectively, for the last layer.
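The structure of such a network can be sketched in plain Python. This is a CPU illustration under stated assumptions, not the CUDA implementation evaluated here: the class name `RandomProjectionNet`, the ReLU random layer without bias, the one-sigmoid-unit-per-class output encoding, and the default values of `eta` and `mu` are choices made for the example.

```python
import math
import random


def sigmoid(a):
    """Numerically safe logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


class RandomProjectionNet:
    def __init__(self, d, h, n_classes, eta=0.1, mu=0.3, seed=0):
        rng = random.Random(seed)
        self.eta, self.mu = eta, mu
        # Random layer: drawn once and never updated.
        self.R = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(h)]
        # Trained layer: one sigmoid unit per class.
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(h)] for _ in range(n_classes)]
        self.b = [0.0] * n_classes
        # Momentum buffers for the trained layer only.
        self.vW = [[0.0] * h for _ in range(n_classes)]
        self.vb = [0.0] * n_classes

    def hidden(self, x):
        return [max(0.0, sum(r_j * x_j for r_j, x_j in zip(r, x))) for r in self.R]

    def outputs(self, z):
        return [sigmoid(sum(w_j * z_j for w_j, z_j in zip(w, z)) + b)
                for w, b in zip(self.W, self.b)]

    def predict(self, x):
        y = self.outputs(self.hidden(x))
        return max(range(len(y)), key=y.__getitem__)

    def learn(self, x, label):
        """One SGD step with momentum on the squared error of one instance."""
        z = self.hidden(x)
        y = self.outputs(z)
        for k, y_k in enumerate(y):
            target = 1.0 if k == label else 0.0
            # Gradient of 0.5 * (y_k - target)^2 w.r.t. the pre-activation.
            delta = (y_k - target) * y_k * (1.0 - y_k)
            for j, z_j in enumerate(z):
                self.vW[k][j] = self.mu * self.vW[k][j] - self.eta * delta * z_j
                self.W[k][j] += self.vW[k][j]
            self.vb[k] = self.mu * self.vb[k] - self.eta * delta
            self.b[k] += self.vb[k]
```

Only `W` and `b` are touched in `learn`; `R` keeps the values drawn at construction time, which is the defining property of the random layer. Calling `predict` and then `learn` on each arriving instance gives the prequential, batch-size-one training regime used in this work.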
We use three of the data sources from Table 1: CoverType (COVT), Electricity
(ELEC), and SUSY. This way we can compare the accuracy obtained in this
Section against well-known state-of-the-art algorithms.
The initialization of each layer depends on the activation function used. We
tried different random number initialization strategies, and those for which we
achieved the best results are summarized in Table 3. Most of the weight matrices
are initialized using random numbers with mean 0 and $\sigma = 1.0$, except for the
sigmoid activation function. The purpose and usage of the bias vector depend on
the activation function.
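The initialization strategies of Table 3 can be written down as a small lookup table. The following Python sketch is illustrative only: the dictionary `INIT`, the helper `init_layer`, and the key names are hypothetical, and storing one copy of $\gamma$ per neuron in the bias vector is an assumption about the layout.

```python
import random

# (mean, std) for the weight matrix, and the bias specification, per Table 3.
INIT = {
    "rbf":      {"weights": (0.0, 1.0), "bias": "gamma"},
    "sigmoid":  {"weights": (0.0, 0.9), "bias": (0.0, 0.2)},
    "relu":     {"weights": (0.0, 1.0), "bias": (0.0, 0.1)},
    "relu_inc": {"weights": (0.0, 1.0), "bias": 0.0},
}


def init_layer(activation, d, h, gamma=None, seed=0):
    """Return (W, b): a d x h weight matrix and a bias vector of length h."""
    rng = random.Random(seed)
    spec = INIT[activation]
    mean, std = spec["weights"]
    W = [[rng.gauss(mean, std) for _ in range(h)] for _ in range(d)]
    bias_spec = spec["bias"]
    if bias_spec == "gamma":
        if gamma is None:
            raise ValueError("the RBF activation needs a gamma value")
        b = [gamma] * h                      # bias vector stores the gammas
    elif isinstance(bias_spec, tuple):
        b = [rng.gauss(*bias_spec) for _ in range(h)]
    else:
        b = [bias_spec] * h                  # constant start, e.g. running means
    return W, b
```

For example, `init_layer("rbf", d=4, h=6, gamma=0.1)` returns a 4 × 6 Gaussian weight matrix and a bias vector holding six copies of 0.1.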
Different activation functions have been tested at the random layer: RBF-gamma,
Sigmoid, ReLU, and ReLU incremental. Sigmoid and ReLU are used in the
standard way. As we can see in Table 3, the bias vector for RBF stores the gammas;
in our evaluations we use $\gamma \in \{0.001, 0.01, 0.1, 1.0, 10.0\}$. ReLU incremental
uses the bias vector to store the incremental mean for each output attribute.
At the trained layer, we always use the standard sigmoid as the activation
function.
In the same way as in Section 5, the network is built incrementally using
prequential learning; we visit each instance only once. This is in contrast to
typical DNN training, where instances are loaded in batches and the algorithm
iterates over them a given number of times; every time the error is reduced,
the model is checkpointed and made available for use.
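The prequential (test-then-train) scheme can be sketched in a few lines. The function and the toy majority-class learner below are illustrative names introduced here; any model exposing a predict and a learn operation could be plugged in instead.

```python
def prequential_accuracy(stream, predict, learn):
    """Test-then-train over a stream, visiting each instance exactly once.

    stream  : iterable of (x, label) pairs
    predict : callable x -> predicted label
    learn   : callable (x, label) -> None, updates the model in place
    """
    correct = total = 0
    for x, label in stream:
        # 1) Test on the instance before the model has seen its label.
        if predict(x) == label:
            correct += 1
        total += 1
        # 2) Then train on that same instance, once.
        learn(x, label)
    return correct / total if total else 0.0


class MajorityClass:
    """Toy learner used only to exercise the loop above."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def learn(self, x, label):
        self.counts[label] = self.counts.get(label, 0) + 1
```

Because every prediction is made before the corresponding label is used for training, the resulting accuracy is an estimate on unseen data without holding out a separate test set.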
Table 4 summarizes the best results we obtained and compares them with the
best results obtained in the Section 5 evaluations. We chose the algorithms
by accuracy and compared their running times against ours. Configurations were
chosen by cross-validation using the following parameters: momentum
µ ∈ [0.1, 1.0] with an increment of 0.1, and a similar range for the learning
rate η. Random layer sizes tested: [10, 100] with an increment of 10,
[100, 1000] with an increment of 100, and two more sizes, 1500 and 2000.
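To give an idea of the size of this search space, the grid can be enumerated as follows. The text only states "a similar range" for the learning rate, so reusing the ten momentum values for it is an assumption of this sketch (the learning rates reported in the result tables, such as 0.11 or 0.61, suggest that the actual grid may have been offset).

```python
from itertools import product

# Random-layer sizes: 10..100 step 10, 100..1000 step 100, plus 1500 and 2000.
sizes = sorted(set(range(10, 101, 10)) | set(range(100, 1001, 100)) | {1500, 2000})

# Momentum values 0.1, 0.2, ..., 1.0 (built from integers to avoid float drift).
momenta = [k / 10 for k in range(1, 11)]

# Assumption: the learning rate takes the same ten values as the momentum.
learning_rates = list(momenta)


def configurations():
    """Yield every (size, momentum, learning_rate) combination."""
    return product(sizes, momenta, learning_rates)


n_configs = len(sizes) * len(momenta) * len(learning_rates)
```

Under these assumptions there are 21 sizes and 2100 combinations per dataset and activation function, before the five values of gamma are taken into account for the RBF.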
For the electricity dataset, the random projection layer (RPL) obtained an
accuracy of 85.33% using a random layer of 100 neurons and a sigmoid activation
function. As we can see in Table 4, the best-performing algorithm is LB-HT,
which achieved 89.8%. Compared with the results in Table 2, our method is the
second-best result, 4.47 percentage points lower.
Table 4: GPU tests; best results.
Dataset | Best alg | Best acc (%) | Best time (s) | RPL activ | RPL size | RPL acc (%) | RPL time (s) | Speedup
ELEC    | LB-HT    | 89.8         | 10            | Sigmoid   | 100      | 85.33       | 1.2          | 8x
COVT    | kNN      | 92.2         | 605           | ReLU      | 2000     | 94.59       | 32           | 17x
SUSY    | LB-HT    | 78.7         | 530           | RBF       | 600      | 77.63       | 172          | 3x
Table 5: ELEC Evaluation
Activation   | Random Neurons | µ   | η    | Accuracy (%)
SIG          | 100            | 0.3 | 0.11 | 85.33
ReLU         | 400            | 0.3 | 0.01 | 84.95
ReLU inc     | 200            | 0.3 | 0.01 | 84.97
RBF γ=0.001  | 2000           | 0.7 | 1.01 | 72.13
RBF γ=0.01   | 2000           | 0.7 | 1.01 | 72.13
RBF γ=0.1    | 2000           | 0.7 | 1.01 | 72.13
RBF γ=1.0    | 2000           | 0.7 | 1.01 | 72.13
RBF γ=10.0   | 2000           | 0.7 | 1.01 | 72.13
In the covertype dataset evaluation, RPL obtained the best result for this
dataset with an accuracy of 94.59% using a ReLU activation function, improving
on the kNN algorithm by 2.39 percentage points.
Finally, our RPL performed relatively poorly on the SUSY dataset using 600
random neurons. We obtained 77.63%, 1.07 percentage points less than LB-HT.
This gap is smaller than the one obtained on the electricity dataset, but if we
rank our results with those in Table 2 we are in sixth position.
With regard to the time to complete, we can see that the GPU is faster on all
three datasets. For the electricity dataset RPL is 8 times faster, for the
CoverType dataset 17 times faster, and for the SUSY dataset 3 times faster.
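The accuracy gaps and time ratios quoted above can be recomputed directly from the values in Table 4. The snippet below is only a worked check of that arithmetic; the dictionary layout and function name are introduced here for illustration. Ratios computed from the tabulated times need not reproduce the speedup column exactly: for COVT the tabulated times give roughly 19, against the reported 17x.

```python
# (best baseline acc %, baseline time s, RPL acc %, RPL time s), from Table 4.
table4 = {
    "ELEC": (89.8, 10.0, 85.33, 1.2),
    "COVT": (92.2, 605.0, 94.59, 32.0),
    "SUSY": (78.7, 530.0, 77.63, 172.0),
}


def compare(best_acc, best_time, rpl_acc, rpl_time):
    """Return (accuracy gap in percentage points, time ratio).

    A positive gap means the random projection layer is more accurate;
    a ratio above 1 means it is faster than the baseline.
    """
    return round(rpl_acc - best_acc, 2), best_time / rpl_time


summary = {name: compare(*row) for name, row in table4.items()}

for name, (gap, ratio) in summary.items():
    print(f"{name}: gap = {gap:+.2f} points, time ratio = {ratio:.1f}")
```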
We now detail the activation curves for each dataset. The momentum and
learning rate in each figure are the same across all sizes; we used the values
that gave the best results in order to see how size affects the accuracy.
6.1. Electricity Dataset
Table 5 summarizes the best results for each activation function and its
configuration. As we saw previously, the sigmoid activation function performed
better than the others for this dataset. In second position we find the ReLU
and ReLU inc activation functions, which gave similar results, slightly worse
than the sigmoid. Regarding the RBF, all the configurations we tried performed
worse than sigmoid, ReLU and ReLU inc, with very similar results across the
different gammas. In Figure 7 we can see how accuracy changes with the
different sizes.
Figure 7: Elec Dataset. [Plot of Accuracy (%) against Random Layer Size; recovered legend entries include relu inc and rbf gamma-1.0.]
Table 6: COV Evaluation
Activation   | Random Neurons | µ   | η    | Accuracy (%)
SIG          | 1000           | 0.4 | 0.11 | 94.45
ReLU         | 2000           | 0.4 | 0.01 | 94.59
ReLU inc     | 2000           | 0.4 | 0.01 | 94.58
RBF γ=0.001  | 90             | 0.9 | 1.01 | 73.18
RBF γ=0.01   | 90             | 0.9 | 1.01 | 73.18
RBF γ=0.1    | 90             | 0.5 | 1.01 | 73.18
RBF γ=1.0    | 90             | 0.8 | 1.01 | 73.18
RBF γ=10.0   | 90             | 1.0 | 1.01 | 73.18
Figure 8: COV Normalized Dataset. [Plot of Accuracy (%) against Random Layer Size; recovered legend entries include relu inc and rbf gamma-1.0.]
6.2. CoverType Dataset
Table 6 gives the best results for the COVT dataset for each activation
function. We can see a pattern similar to the ELEC evaluation: SIG, ReLU
and ReLU inc performed much better than the RBFs, and all three beat the
results shown in Table 2. This time the best result is obtained with the ReLU
activation function at the random layer.
In Figure 8 we can see the activation curves. Although we got the best result
with ReLU, the sigmoid has a better learning curve and is very close to the
ReLU accuracy. ReLU inc has a learning curve very similar to that of the
standard ReLU. For the same momentum and learning rate, the different RBFs give
very similar (if not equal) results across sizes, so we chose the smaller sizes.
6.3. SUSY Dataset
Table 7 shows the best results for the SUSY dataset for each activation
function, and Figure 9 the learning curves. The most noticeable effect is that
sigmoid, ReLU and ReLU inc stop learning very soon: with only 20 random neurons
ReLU reached its peak of 74.85%. The RBFs, which performed poorly in the
previous evaluations, are here the ones with the best results.
One curious result is that the RBFs score between 72% and 78% in all three
evaluations (72.13%, 73.18% and 77.63%). Even if two of the three results are
not very good, the RBFs do not seem very sensitive to the dataset, and their
results are fairly stable across different data distributions.
Table 7: SUSY Evaluation
Activation   | Random Neurons | µ | η    | Accuracy (%)
SIG          | 20             | 1 | 0.61 | 67.28
ReLU         | 20             | 1 | 0.61 | 74.84
ReLU inc     | 20             | 1 | 0.91 | 74.80
RBF γ=0.001  | 600            | 1 | 0.71 | 77.63
RBF γ=0.01   | 600            | 1 | 0.71 | 77.63
RBF γ=0.1    | 600            | 1 | 0.71 | 77.63
RBF γ=1.0    | 600            | 1 | 0.71 | 77.63
RBF γ=10.0   | 600            | 1 | 0.71 | 77.63
Figure 9: SUSY Dataset. [Plot of Accuracy (%) against Random Layer Size; recovered legend entries include relu inc and rbf gamma-1.0.]
7. Conclusions
In this paper, we studied combinations of Hoeffding trees, nearest neighbour,
and gradient descent methods, adding a layer based on a random feature function
filter. We found that this random layer can turn a simple gradient descent
learner into a competitive method for real-time data analysis. With this first
attempt we could even improve on current state-of-the-art algorithms, scoring
the best and the second best results for two out of three datasets tested. Like
Hoeffding trees and nearest neighbour methods, but unlike many other
gradient descent-based methods, the random layer works well without intensive
parameter tuning.
We successfully extended the methods and implemented them on GPUs, obtaining
powerful predictive performance. This suggests that using GPUs for data stream
mining is a promising research topic for obtaining new fast and adaptive
machine learning methodologies.
In the future we intend to look at adding and pruning units incrementally
in the stream over time, to make more efficient use of memory and to adapt to
drifting concepts. We would also like to continue studying how to obtain new
highly scalable methods using GPUs.
Acknowledgements
This work was supported in part by the Aalto University AEF research
programme, by NVIDIA through the UPC/BSC GPU Center of Excellence, and by the
Spanish Ministry of Science and Technology through TIN2012-34557.
[1] W. Qu, Y. Zhang, J. Zhu, Q. Qiu, Mining multi-label concept-drifting data
streams using dynamic classifier ensemble, in: Asian Conference on Ma-
chine Learning, Vol. 5828 of Lecture Notes in Computer Science, Springer,
2009, pp. 308–321.
[2] P. Domingos, G. Hulten, Mining high-speed data streams, in: Proceedings
of the 6th ACM SIGKDD International Conference on Knowledge Discov-
ery and Data Mining, 2000, pp. 71–80.
[3] A. Bifet, G. Holmes, B. Pfahringer, Leveraging bagging for evolving data
streams, in: ECML PKDD’10, Springer-Verlag, Berlin, Heidelberg, 2010,
pp. 135–150.
[4] A. Shaker, E. H¨ullermeier, Instance-based classification and regression on
data streams, in: Learning in Non-Stationary Environments, Springer New
York, 2012, pp. 185–201.
[5] J. Read, A. Bifet, B. Pfahringer, G. Holmes, Batch-incremental versus
instance-incremental learning in dynamic and evolving data, in: 11th Int.
Symposium on Intelligent Data Analysis, 2012.
[6] G. Huang, What are extreme learning machines? filling the gap between
frank rosenblatt’s dream and john von neumann’s puzzle, Cognitive Com-
putation 7 (3) (2015) 263–278. doi:10.1007/s12559-015-9333-0.
[7] A. Bifet, G. Holmes, R. Kirkby, B. Pfahringer, MOA: Massive Online Anal-
ysis, Journal of Machine Learning Research
[8] G. Holmes, R. Kirkby, B. Pfahringer, Stress-testing Hoeffding trees, in: 9th
European Conference on Principles and Practice of Knowledge Discovery
in Databases (PKDD ’05), 2005, pp. 495–502.
[9] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning,
Springer Series in Statistics, Springer New York Inc., New York, NY, USA,
[10] G. Hinton, R. Salakhutdinov, Reducing the dimensionality of data with
neural networks, Science 313 (5786) (2006) 504 – 507.
[11] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann
machines, in: Proceedings of the 27th International Conference on Machine
Learning (ICML-10), June 21-24, 2010, Haifa, Israel, 2010, pp. 807–814.
[12] G. Hinton, Training products of experts by minimizing contrastive diver-
gence, Neural Computation 14 (8) (2000) 1711–1800.
[13] J. Read, F. Perez-Cruz, A. Bifet, Deep learning in multi-label data-streams,
in: Symposium on Applied Computing, ACM, 2015.
[14] G.-B. Huang, D. Wang, Y. Lan, Extreme learning machines: a survey,
International Journal of Machine Learning and Cybernetics 2 (2) (2011)
107–122. doi:10.1007/s13042-011-0019-y.
[15] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using in-
cremental constructive feedforward networks with random hidden nodes,
Neural Networks, IEEE Transactions on 17 (4) (2006) 879–892. doi:
[16] G. bin Huang, M. bin Li, L. Chen, C. kheong Siew, Incremental extreme
learning machine with fully complex hidden nodes (2007).
[17] D. Marron, A. Bifet, G. D. F. Morales, Random forests of very fast deci-
sion trees on gpu for mining evolving big data streams, in: 21st European
Conference on Artificial Intelligence 2014, 2014.
[18] H. Grahn, N. Lavesson, M. H. Lapajne, D. Slat, Cudarf: A cuda-based
implementation of random forests., in: H. J. Siegel, A. El-Kadi (Eds.),
AICCSA, IEEE Computer Society, 2011, pp. 95–101.
[19] H. Schulz, B. Waldvogel, R. Sheikh, S. Behnke, CURFIL: random forests
for image labeling on GPU, in: VISAPP 2015 - Proceedings of the 10th
International Conference on Computer Vision Theory and Applications,
Volume 2, Berlin, Germany, 11-14 March, 2015., 2015, pp. 156–164.
[20] V. Garcia, E. Debreuve, M. Barlaud, Fast k nearest neighbor search using
gpu, in: 2008 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition Workshops, 2008, pp. 1–6.
[21] L. Huang, Z. Liu, Z. Yan, P. Liu, Q. Cai, An implementation of high perfor-
mance parallel knn algorithm based on gpu, in: Networking and Distributed
Computing (ICNDC), 2012 Third International Conference on, 2012, pp.
30–30. doi:10.1109/ICNDC.2012.15.
[22] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou,
Y. Chen, Pudiannao: A polyvalent machine learning accelerator, in: Pro-
ceedings of the Twentieth International Conference on Architectural Sup-
port for Programming Languages and Operating Systems, ASPLOS ’15,
2015, pp. 369–381.
[23] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140. doi:
[24] N. C. Oza, S. J. Russell, Experimental comparisons of online and batch
versions of bagging and boosting, in: KDD, 2001, pp. 359–364.
[25] N. Oza, S. Russell, Online bagging and boosting, in: Artificial Intelligence
and Statistics 2001, Morgan Kaufmann, 2001, pp. 105–112.
[26] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, R. Gavald`a, New ensem-
ble methods for evolving data streams, in: ACM SIGKDD international
conference on Knowledge discovery and data mining (KDD ’09), 2009, pp.
[27] J. Read, J. Hollm´en, A deep interpretation of classifier chains, in: Advances
in Intelligent Data Analysis XIII - 13th International Symposium, IDA
2014, 2014, pp. 251–262.
[28] P. Baldi, P. Sadowski, D. Whiteson, Searching for exotic particles in high-
energy physics with deep learning, Nature Communications 5 (4308).
[29] A. Bifet, R. Gavald`a, Learning from time-changing data with adaptive
windowing, in: SIAM International Conference on Data Mining, 2007.