Learning Fast and Slow:
A Unified Batch/Stream Framework
Jacob Montiel, Albert Bifet, Viktor Losing, Jesse Read and Talel Abdessalem
Télécom ParisTech, Université Paris-Saclay, Paris, France
{jacob.montiel, albert.bifet}
Bielefeld University, Bielefeld, Germany
HONDA Research Institute Europe, Offenbach, Germany
LIX, École Polytechnique, Palaiseau, France
UMI CNRS IPAL, National University of Singapore
Abstract—Data ubiquity highlights the need for efficient and adaptable data-driven solutions. In this paper, we present FAST AND SLOW LEARNING (FSL), a novel unified framework that sheds light on the symbiosis between batch and stream learning. FSL works by employing Fast (stream) and Slow (batch) Learners, emulating the mechanisms used by humans to make decisions. We showcase the applicability of FSL on the task of classification by introducing the FAST AND SLOW CLASSIFIER (FSC). A Fast Learner provides predictions on the spot, continuously updating its model and adapting to changes in the data. The Slow Learner, on the other hand, provides predictions considering a wider spectrum of seen data, requiring more time and data to create complex models. Once enough data has been collected, FSC trains the Slow Learner and starts tracking the performance of both learners. A drift detection mechanism triggers the creation of new Slow models when the current Slow model becomes obsolete. FSC selects between the Fast and Slow Learners according to their performance on new incoming data. Test results on real and synthetic data show that FSC effectively drives the positive interaction of stream and batch models for learning from evolving data streams.
Index Terms—Machine Learning; Classification; Batch Learning; Stream Learning; Concept Drift
I. INTRODUCTION

A common premise (communis opinio) in Machine Learning is that Batch and Stream Learning are mutually exclusive. A typical starting point when tackling Machine Learning problems is to categorize them as either batch or stream, and from that point most studies treat them separately. This separation often arises from specific requirements and expectations for both types of learning.
In batch learning, constraints on resources such as time and memory are somewhat relaxed (within acceptable limits) in favor of complex models that generalize well to the data describing a phenomenon. For example, Neural Networks (NN) and tree ensembles, such as Random Forest (RF) and eXtreme Gradient Boosting (XGB), are popular classification techniques that require large amounts of data for training. Additionally, as the amount and complexity of the data increases, these techniques are known to be computationally demanding [1]. For example, in [2], a large deep neural network is used to learn high-level features: training a 9-layer locally connected sparse autoencoder on 10 million images takes 3 days on 1000 machines (16000 cores).
On the other hand, requirements for stream learning [3] are based on the assumption of a continuous flow of data. In this context, storing data is impractical, so models are incrementally trained: data is processed as it arrives (one instance at a time) and is never re-used nor stored. Streaming techniques prioritize the optimization of two resources: time and storage. Since streams are assumed to be potentially infinite, stream models are required to be always available; that is, they must provide predictions at any time.
In contrast to the batch setting, where the data distribution is assumed static, in data streams the relationship between the input data and the corresponding response (label) may change over time; this is known as Concept Drift. Without intervention, batch learners will fail after a drift occurs, since they were essentially trained on data that does not correspond to the current concept. A common intervention is to re-train a new batch model on the new concept. Stream techniques, on the other hand, adapt to new concepts as the model is updated, and some techniques even include special mechanisms designed to handle drifts.
In this work, we consider that learning is a continuous activity and aim to show that there exists an overlapping region where batch and stream learning can positively interact, benefiting from their strengths while compensating for their weaknesses. Stream learners are easier and cheaper to maintain, adapt (react) to changes in the data distribution, and provide predictions based on the most recent data. Batch learners require time to collect sufficient data before they can build a model, but once it is available, they typically generate more complex and accurate models.
While storing a complete data stream is impractical, current
trends in (cheap) data storage [4], [5] provide the means to
store large subsets of the stream. In fact, big data sets for batch
learning are usually generated by collecting data over time. In
this scenario, where data is continuously arriving, we propose
a framework composed of stream and batch learners.
This unified approach is similar to the model proposed by Daniel Kahneman, Nobel laureate in Economics, in his best-selling book Thinking, Fast and Slow [6] to describe the mechanisms behind human decision-making. The central thesis of this book is a dichotomy between two modes of thought: System 1 is fast, instinctive and emotional; System 2 is slower, more deliberative, and more logical. Equivalent systems for Machine Learning are outlined in Table I.
TABLE I: The Fast and Slow systems for Machine Learning.

Fast                        Slow
Cheap (mem., time)          Expensive (mem., time)
Always ready                Trains on large batches
Robust to drifts, adapts    Complex and robust models
Focus on the present        Generalizes the larger scheme
The rest of the paper is organized as follows: Sec. II provides an overview of related work. Sec. III describes the FAST AND SLOW LEARNING framework and its application to classification. Sec. IV outlines the experimental evaluation. Sec. V examines test results. Conclusions and future directions are discussed in Sec. VI.
II. RELATED WORK

In recent years, Learning from Data Streams has received increasing attention due to the growth in the generation of Fast Data and the democratization of Machine Learning. Concept Drift is of special interest when working with evolving data due to its impact on learning; multiple approaches have been proposed, and [7] provides a complete overview of this topic. In the following, we describe relevant studies.
A couple of drift-aware ensembles work under the Long-Term/Short-Term Memory model (LTM-STM). This model is also found in Long Short-Term Memory networks (LSTMs) [8], a well-known type of Recurrent Neural Network which is known to perform better than the standard version for many tasks. Proposed by Baddeley and Hitch in the seminal work [9], the STM model describes the working memory, a cognitive system responsible for holding information for processing. According to this model, separate buffers are used for different types of information, and they are separate from the LTM. The Two-Layer System presented in [10] is composed of a Learning Layer (Layer 1), where a model is learned for the original problem, and a Control Layer (Layer 2) that monitors the learning process of the first layer. In the second layer, a drift detection mechanism is used to identify regions of stable concepts, which are stored in a pool of long-term-memory models. Similarly, the Self Adjusting Memory (SAM) model [11] builds an ensemble with models targeting current or former concepts. SAM is built using two memories: the STM for the current concept, and the LTM to retain information about past concepts. A cleaning process is in charge of controlling the size of the STM while keeping the information in the LTM consistent with the information in the STM.
Fig. 1: Fast and Slow Learning Framework overview.
III. FAST AND SLOW LEARNING

In this paper, we introduce the FAST AND SLOW LEARNING (FSL) framework. Similarly to [10] and [11], we approach learning as a combination of two systems: FAST and SLOW. Moreover, we consider learning a continuous task where batch and stream learning co-exist. The FAST system consists of a stream model M_f, updated over time. The SLOW system consists of a sequence of batch models {M_b1, M_b2, . . . , M_bn} that are replaced as they become obsolete, and a data buffer that stores the most recent data for training. Under this schema, FAST and SLOW learners complement each other: the FAST learner, incrementally trained as data becomes available, provides predictions consistent with the current concept and is able to adapt quickly to concept drift, whereas the SLOW learner first collects a large amount of data to generate complex models, which are superior in times of stable concepts. An overview of the FSL framework is shown in Figure 1.
A. Fast and Slow Classifier
In the following, we describe an application of the FSL framework in the form of the FAST AND SLOW CLASSIFIER (FSC), where the internal models in the FAST and SLOW systems correspond to classification models. Consider a continuous stream of data A = {(~x_i, y_i)} | i = 1, . . . , t, where i represents the current time and t → ∞. ~x_i is a feature vector and y_i the corresponding target class, such that y_i ∈ {C_1, C_2, . . . , C_k}. A FAST (stream) model M_f is trained on single instances as they arrive. A SLOW (batch) model M_s is trained using the data buffer B = {(~x_i, y_i)} | i = b − n + 1, . . . , b, such that B contains the last n samples in A with respect to a given point in time b.
STREAM: train_f(M_f, (~x_i, y_i))    (1)
BATCH: train_s(M_s, B)    (2)
The objective is to accurately predict the class of the current feature vector ~x_t, as its label is presumed to be initially unknown. The predicted label of a model M is denoted as ŷ. As soon as the true label y is revealed, the performance P is measured according to some loss function ℓ:

P(M) = ℓ(y, ŷ)    (3)
Fig. 2: FSC operation. In single mode, the FAST model is used for prediction while SLOW is buffering the data. In dual mode, both models predict each sample but only one is selected based on their recent performance. SLOW models are replaced in the presence of concept drift.
Performance of stream models is measured using prequential evaluation, where an instance's class is predicted before using it for training. That is, for the i-th instance:

ŷ_i = M_{i−1}(~x_i)    (4)
train_f(M_{i−1}, (~x_i, y_i))    (5)
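This test-then-train loop can be sketched in a few lines of plain Python; the `MajorityClass` stub and the `predict`/`learn` method names are illustrative assumptions, not the paper's API:

```python
def prequential_accuracy(model, stream):
    """Prequential (test-then-train) evaluation: predict each instance
    before using its revealed label for training."""
    correct, seen = 0, 0
    for x, y in stream:
        y_hat = model.predict(x)   # test first...
        correct += (y_hat == y)
        seen += 1
        model.learn(x, y)          # ...then train on the revealed label
    return correct / seen

class MajorityClass:
    """Toy incremental learner (illustration only): predicts the most
    frequent class seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

stream = [([0.1], 1), ([0.2], 1), ([0.3], 0), ([0.4], 1)]
acc = prequential_accuracy(MajorityClass(), stream)
```

Because every label is used for both testing and training exactly once, the estimate covers the entire stream without holding out data.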
FSC works in two modes, outlined in Figure 2, where the FAST and SLOW systems perform slightly different tasks:

Single mode: Corresponds to the beginning of the stream, from i = 0 until M_s is trained (and provides predictions). In this mode, only the FAST system is active and FSC effectively performs as a stream classifier, whereas the SLOW system stores incoming data (~x_i, y_i) into its buffer B and triggers the training of M_s when the buffer reaches its limit.

Dual mode: Begins once the training of M_s is complete. From this point, FSC tracks the performance of M_f and M_s and selects the one with the best (current) performance P_t over the most recent data. Data buffering in B continues in a sliding-window fashion and is used for training new batch models upon detecting concept drifts. Once FSC enters dual mode, it remains in that mode.
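The two-mode control flow above can be sketched as a minimal skeleton. `FastSlowSketch`, its method names, and the `_Const` stub learner are hypothetical illustrations, not the authors' implementation; performance tracking is reduced to two placeholder scores:

```python
class FastSlowSketch:
    """Sketch of FSC's single/dual-mode control flow (illustration only)."""
    def __init__(self, fast, slow_train, buffer_size):
        self.fast = fast                 # incremental (stream) learner
        self.slow_train = slow_train     # callable: buffer -> batch model
        self.buffer, self.buffer_size = [], buffer_size
        self.slow = None                 # no SLOW model yet: single mode
        self.p_fast = self.p_slow = 0.0  # tracked performance (placeholders)

    def predict(self, x):
        # Single mode, or FAST is the current top performer (FAST by default).
        if self.slow is None or self.p_fast >= self.p_slow:
            return self.fast.predict(x)
        return self.slow.predict(x)      # dual mode, SLOW on top

    def learn(self, x, y):
        self.fast.learn(x, y)            # FAST trains on every instance
        self.buffer.append((x, y))
        if self.slow is None and len(self.buffer) >= self.buffer_size:
            self.slow = self.slow_train(self.buffer)   # enter dual mode
        elif self.slow is not None:
            self.buffer.pop(0)           # keep a sliding buffer

class _Const:
    """Stub learner that always predicts a constant class."""
    def __init__(self, c): self.c = c
    def predict(self, x): return self.c
    def learn(self, x, y): pass

fsc = FastSlowSketch(_Const(0), lambda buf: _Const(1), buffer_size=2)
fsc.learn([0.0], 0)                 # single mode: buffer not yet full
fsc.learn([1.0], 1)                 # buffer full -> SLOW trained, dual mode
mode_is_dual = fsc.slow is not None
```

Note that once `slow` is set the sketch never returns to single mode, mirroring the description above.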
Notice that when we refer to M_s as slow, we refer to the training time as well as the time required for data collection. Since data arrives at time intervals ∆t, the time for the SLOW classifier to be ready to predict is:

idle time = (∆t × n) + time(train(M_s)),    (6)

where n is the number of samples. The worst-case scenario is when data arrives at slow rates (large ∆t) and training time is high, for example for a complex deep neural network with hundreds of layers.
1) Training Batch Learners on Data Streams: If the concept in the stream changes, M_s becomes obsolete and has to be replaced by a new model trained on the current concept. We propose a strategy to track the performance of M_s and trigger the training of new models automatically. FSC uses the ADWIN [12] change detector to track changes in the performance of M_s. ADWIN keeps one dynamically sized window that grows when there is no change and shrinks otherwise. If the difference between the means of two possible adjacent sub-windows exceeds a threshold depending on a predefined confidence parameter δ, then a change is detected and the dynamic window reduces its size. As soon as ADWIN detects a change, the training of new models is triggered.
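ADWIN's adaptive windowing is more involved than space permits; as a loose, self-contained stand-in, the sketch below compares the means of two fixed-size adjacent windows of a performance signal. The class name and the fixed `threshold` are assumptions for illustration only; ADWIN itself sizes the window adaptively and derives its cut threshold from the confidence parameter δ via a Hoeffding-style bound:

```python
from collections import deque

class TwoWindowDetector:
    """Greatly simplified stand-in for ADWIN: flags a change when the means
    of two fixed-size adjacent sub-windows of a signal differ by more than
    a fixed threshold (illustration only, not ADWIN itself)."""
    def __init__(self, half=50, threshold=0.2):
        self.old = deque(maxlen=half)   # older sub-window
        self.new = deque(maxlen=half)   # most recent sub-window
        self.threshold = threshold

    def add(self, value):
        if len(self.new) == self.new.maxlen:
            self.old.append(self.new.popleft())  # slide oldest item back
        self.new.append(value)

    def change_detected(self):
        if len(self.old) < self.old.maxlen:
            return False                # not enough history yet
        mean = lambda w: sum(w) / len(w)
        return abs(mean(self.old) - mean(self.new)) > self.threshold
```

Feeding the detector a per-sample accuracy signal (1 for a hit, 0 for a miss) makes a sustained accuracy drop of M_s trip the detector, which is the event that would trigger retraining.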
FSC uses the buffer B to train and evaluate new models. B is split into two adjacent sets B_train and B_test, where |B_train| ≫ |B_test|, keeping the data-arrival order so that B_test contains the last samples from the stream. In addition to training a model on B_train, FSC trains K candidate models on different samplings of B_train, in order to improve the chances of training an optimal model. FSC applies three different sampling strategies:

Probabilistic Adaptive Window (PAW) [13] keeps in logarithmic memory a sample of items W, giving more weight to newer items while maintaining some older items. Each new item i is stored into W; then, for each element j in W, a random number r in (0, 1) is drawn and j is removed from W if r > 2^{−1/|W|}.
Reservoir Sampling (RS) [14] samples R random items from a stream (without replacement) and is equally biased towards old and new items. For a sample of size R with i items seen so far, the chance of any item being chosen is R/i. When the next item arrives, an already-chosen item remains with probability R/(i + 1) and the new item is kept with probability R/(i + 1).
Weighted Reservoir Sampling (WRS) [15] samples from a weighted distribution where each item in the stream has a corresponding weight. A key k_i = u_i^{1/w_i} is assigned to each item i in the stream, where w_i is the weight of the item and u_i is a random number in (0, 1). Finally, the R items with the largest keys k_i are kept. FSC assigns weights to instances in B_train based on their arrival order, with larger weights given to newer instances.
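The two reservoir-based strategies can be sketched as follows (these are the standard algorithms from [14] and [15]; the function names are ours, and PAW is omitted for brevity):

```python
import random
import heapq

def reservoir_sample(stream, R):
    """Classic reservoir sampling [14]: after i items, every item has had
    probability R/i of being in the sample (uniform, without replacement)."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if len(sample) < R:
            sample.append(item)
        elif random.random() < R / i:
            sample[random.randrange(R)] = item   # evict a uniform victim
    return sample

def weighted_reservoir_sample(weighted_stream, R):
    """Efraimidis-Spirakis-style WRS [15]: assign key k_i = u_i ** (1/w_i)
    and keep the R items with the largest keys; larger weights make
    inclusion more likely."""
    heap = []   # min-heap of (key, index, item); index breaks key ties
    for idx, (item, w) in enumerate(weighted_stream):
        key = random.random() ** (1.0 / w)
        entry = (key, idx, item)
        if len(heap) < R:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)      # drop smallest key
    return [item for _, _, item in heap]
```

For the WRS50/WRS75 configurations described later, the weights would grow with arrival order so that newer instances of B_train dominate the sample.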
Once all candidate models (M′_s1, . . . , M′_s(K+1)) are trained, FSC uses B_test to compare their performance against M_s. FSC picks between the top candidate M′_s,top and the current model M_s based on their performance:

M_s = { M_s,       if P(M_s) > P(M′_s,top)
      { M′_s,top,  otherwise    (7)
2) Model Selection: During dual-mode operation, FSC applies the prequential method to evaluate the performance P_t of M_f and M_s at time t and identify the current top performer. Then, FSC yields predictions from the current top performer for the next n instances. Similar to humans, FSC uses the FAST system by default. Thus, predictions from FSC are given by

ŷ = { M_f(~x),  if P_t(M_f) ≥ P_t(M_s)
    { M_s(~x),  otherwise    (8)

One way to measure performance is using sliding windows, such that only the last n samples are taken into account. Let y_W and ŷ_W be the true and predicted classes in the window W; then the prequential error at time t is

E_t(W) = ℓ_t(y_W, ŷ_W)    (9)
A consequence of excluding older samples is that measurements can vary considerably between consecutive windows, which translates into more transitions between M_f and M_s. An alternative to sliding windows are fading factors [16], which are faster and memory-less. This method decreases the influence of older samples on the measurement as the stream progresses. Computing the loss function ℓ for each sample, the prequential error at time t using fading factors is estimated by

E_t = S_t / B_t    (10)
S_t = ℓ_t + (α × S_{t−1})    (11)
B_t = 1 + (α × B_{t−1})    (12)

where α ∈ ℝ : 0 ≪ α ≤ 1 is the forgetting factor. In Sec. IV, we show how using fading factors effectively reduces the number of FAST-SLOW model transitions while keeping the performance of FSC optimal.
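The fading-factor estimate can be maintained with O(1) memory, which is what makes it cheaper than a sliding window. The class below is a small sketch under a 0-1 loss; the name `FadingError` is ours:

```python
class FadingError:
    """Memory-less prequential error with fading factors [16]:
    E_t = S_t / B_t, with S_t = l_t + alpha * S_{t-1} and
    B_t = 1 + alpha * B_{t-1}, where l_t is the per-sample 0-1 loss."""
    def __init__(self, alpha=0.999):
        self.alpha = alpha
        self.s = 0.0   # discounted sum of losses (S_t)
        self.b = 0.0   # discounted sample count (B_t)

    def update(self, y_true, y_pred):
        loss = 0.0 if y_true == y_pred else 1.0
        self.s = loss + self.alpha * self.s
        self.b = 1.0 + self.alpha * self.b
        return self.error()

    def error(self):
        return self.s / self.b if self.b else 0.0
```

Two such trackers, one per model, would suffice to rank M_f against M_s at every step without storing any past samples.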
We propose two main variants of FSC based on the method used to measure performance. In the following, we refer to the variant that measures performance over sliding windows as FSC, while FSC_ff uses fading factors.
IV. EXPERIMENTAL EVALUATION

We are interested in the task of classification in the presence of concept drift. In order to thoroughly evaluate FSC, we select data sets from a variety of domains, for both binary and multi-class problems; see Table II. We use 7 synthetic data sets and 5 real-world data sets. The synthetic data sets include different types of drift: abrupt (the concept changes suddenly), gradual (the concept changes slowly over a region where past and new concepts are present) and incremental-fast (there exist multiple intermediate concepts). For further details, we refer the reader to [17] for AGRAWAL, HYPER and SEA; and to [11] for MIXED DRIFT and TRANSIENT CHESSBOARD.
We configure FSC using Hoeffding Trees (HT) [18] as M_f and eXtreme Gradient Boosting (XGB) [19] as M_s. In this
TABLE II: Datasets. [Type] S: synthetic data; R: real-world data. [Drifts] A: abrupt; G: gradual; If: incremental fast.

          # instances  # features  # classes  Type  Drift
AGRa      1000000      9           2          S     A
AGRg      1000000      9           2          S     G
HYPERf    1000000      10          2          S     If
MDRIFTa   600000       2           10         S     A
SEAa      1000000      3           2          S     A
SEAg      1000000      3           2          S     G
TCHESSa   200000       2           8          S     A
AIRL      539383       7           2          R     -
COVTYPE   581012       54          7          R     -
ELEC      45312        6           2          R     -
POKER     829201       10          10         R     -
WEATHER   18159        8           2          R     -
section, we use the labels FAST and SLOW to refer to M_f and M_s, respectively. Using this configuration, we aim to show that FSC is an effective collaboration strategy for learning from evolving data streams. Given that neither HT nor XGB has an explicit mechanism to handle drifts, robustness in performance on drift zones can be reasonably attributed to the drift detection mechanism in FSC, which triggers the creation of new SLOW models. In our tests, we assume the training time for SLOW models to be smaller than the time ∆t between incoming samples; see Eq. 6. This means that M_s is ready to predict from instance n + 1. The size of the buffer B is set to n = 10000 for all data sets except for the smaller ones (ELEC and WEATHER), where n = 5000. The size of the window W used to compare SLOW models is set to 200 samples, the same number of samples used to periodically compare FAST vs SLOW performance and perform model selection. For sampling the data in B_train, we set the window size in PAW equal to |B_train| (remember that the actual sample size is dynamic), where |B_train| = |B| − |B_test| = 9800. Two different sample sizes, corresponding to 50% and 75% of B_train, are used for Reservoir Sampling (RS50, RS75) and Weighted Reservoir Sampling (WRS50, WRS75). Finally, we set the confidence δ = 0.002 for ADWIN and the forgetting factor α = 0.999 for fading factors. We perform two main experiments:
1) Compare FSC against the performance of FAST and SLOW over the stream; that is, measure the impact on performance of the collaboration between models. During single-mode operation, only the FAST model is active, thus FSC defaults to this model for predictions and their performance is the same. Once SLOW is trained, FSC enters dual mode and further model trainings are triggered by the drift detector. It is important to notice that FAST and FSC cover the entire stream while SLOW only covers part of it. We compare the performance of the models during both modes.
2) Compare FSC and FSC_ff against other learning methods: the Passive Aggressive Classifier (PAC) [20], an incremental learning method that modifies its prediction mechanism in response to feedback on current performance; the Perceptron Classifier (PRCP); Hoeffding Adaptive Trees (HAT) [21], which update the tree structure in response to drifts; and Self Adjusting Memory (SAM) [11], described in Sec. II. It is important to note that HAT and SAM are methods designed to handle Concept Drift.
FSC is implemented using scikit-multiflow [22], a stream learning framework in Python. Tests are performed using the implementations of PAC and PRCP in scikit-learn [23], the Python implementation of XGB, and the implementations of HT, HAT and SAM from scikit-multiflow. All classification methods are set to default parameters. To measure performance, we use accuracy for binary classification and the kappa statistic for multi-class classification.
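As a reference for the metrics used, the kappa statistic compares the observed accuracy p0 against the chance agreement pe implied by the class marginals. A minimal sketch of the standard Cohen's kappa (not code from the paper):

```python
from collections import Counter

def kappa_statistic(y_true, y_pred):
    """Cohen's kappa: agreement beyond chance, (p0 - pe) / (1 - pe), where
    p0 is observed accuracy and pe the expected accuracy of a chance
    classifier matching the marginal class distributions."""
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pe = sum(true_counts[c] * pred_counts.get(c, 0)
             for c in true_counts) / (n * n)
    if pe == 1.0:
        return 1.0 if p0 == 1.0 else 0.0   # degenerate single-class case
    return (p0 - pe) / (1 - pe)
```

Kappa is preferred over raw accuracy for multi-class streams because it discounts the accuracy a trivial classifier would obtain on imbalanced class distributions.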
TABLE III: Test results for FSC using sliding windows for performance tracking and model selection. Values in single+dual are the performance of FAST and FSC for the whole stream. Type indicates if the stream is B: Binary or M: Multi-class. The main performance indicator is accuracy for binary streams and kappa for multi-class.

                            Accuracy                          Kappa
                single+dual      dual mode          single+dual      dual mode
                FAST   FSC    SLOW   FAST   FSC     FAST   FSC    SLOW   FAST   FSC
AGRa         B  90.07  91.23  86.97  90.04  91.22   78.62  81.01  73.86  78.57  80.99
AGRg         B  80.79  89.05  85.95  80.67  89.01   61.02  76.82  70.40  60.79  76.76
HYPERf       B  78.76  84.73  84.49  78.70  84.72   57.43  69.37  68.87  57.28  69.34
MDRIFTa      M  44.95  53.54  52.08  44.77  53.50   35.71  45.29  43.66  35.48  45.22
SEAa         B  86.43  88.41  88.37  86.43  88.43   71.10  75.23  75.14  71.11  75.29
SEAg         B  86.43  88.09  88.06  86.43  88.11   71.08  74.58  74.51  71.09  74.63
TCHESSa      M  68.89  95.66  98.35  69.76  97.96   50.11  78.75  84.46  50.89  81.07
synth avg       76.62  84.39  83.47  76.69  84.71   60.72  71.58  70.13  60.74  71.90
AIRL         B  63.81  65.15  63.95  63.86  65.23   12.67  16.51  17.75  12.79  16.70
COVTYPE      M  82.33  83.23  77.75  82.73  83.64   59.19  60.70  47.35  59.47  61.00
ELEC         B  77.88  77.96  76.06  77.12  77.24   51.68  52.26  49.21  50.54  51.25
POKER        M  74.15  89.07  88.30  74.24  89.33   32.95  76.54  79.29  33.16  77.29
WEATHER      B  73.97  76.16  77.19  74.10  76.66   31.98  38.96  41.91  32.65  40.79
real avg        74.43  78.31  76.65  74.41  78.42   37.69  48.99  47.10  37.72  49.41
overall avg     75.71  81.86  80.63  75.74  82.09   51.13  62.17  60.53  51.15  62.53
V. RESULTS

An example of the FAST and SLOW models working independently is shown in Figure 3. Test results for the FAST and SLOW models correspond to HT and XGB, respectively. The plots in Figs. 4 and 5 show model performance over time (top), the currently selected model, indicating the source (FAST or SLOW) of the predictions yielded by FSC (middle), and the drift detection flag (bottom). Due to space limitations, not all plots are shown.
Results for FSC are summarized in Table III, divided depending on the operation mode (single+dual, dual) in which they were measured. Results indicate that FSC performs better than FAST for all data sets across the entire stream (single+dual). Additionally, FSC performs better than FAST and SLOW during dual mode in most cases. It should be noted that when we say that FSC performs better, we refer to the optimal integration of predictions from the FAST and SLOW models. We report performance in terms of accuracy and the kappa statistic, but it is important to note that the main performance indicator used by FSC is accuracy for binary streams and the kappa statistic for multi-class streams.
In AGRa (Figure 4a) and AGRg (Figure 4b) we can distinguish the regions where drifts are simulated and their
Fig. 3: FAST (orange) and SLOW (blue) models' sliding accuracy on AGRg, which contains 3 gradual drifts at the 25%, 50% and 75% marks.
impact on the classifiers. In both cases, the SLOW system reacts faster than FAST due to the change detection mechanism that triggers the training of new batch models. In AGRa, the SLOW system recovers quickly after the second drift but is not able to train an optimal model for the concept, and FSC chooses FAST. This is corrected later, as a new drift is detected (around the 700K mark). In AGRg, FAST takes longer to adapt since the concept drift is gradual. In this case, we see two clear regions after the second and third drifts where FSC opts for the SLOW model.
A more complex scenario is seen in COVTYPE (Figure 4e), where FSC constantly changes between FAST and SLOW, with only small regions dominated by one model. Despite this, we notice that FSC yields predictions from the current top performer, resulting in an overall improvement. In ELEC (Figure 4f), model selection is easier to observe given the smaller size of the stream. In these cases, we see that FSC benefits from continuously picking predictions from both the FAST and SLOW models.
The results from HYPERf (Figure 4c) are interesting since they exemplify the case where one of the classifiers becomes superior. FAST's performance drops around the middle of the stream and FSC defaults to SLOW to provide predictions. This finding has important implications: it highlights the scenario where a batch model is a better predictor for a stream, but requires time to become optimal. The first half of the stream can be considered a kind of tuning phase where FSC yields predictions from both FAST and SLOW. When an optimal SLOW model is trained, FSC defaults to it. We have successfully trained an optimal batch model for the stream while keeping the requirement of providing predictions at any time.
As previously indicated, TCHESSa (Figure 4d) is designed to handicap classifiers that favour new concepts over old ones. Abrupt and frequent drifts are challenging for FAST, which
Fig. 4: FSC (green) vs FAST (orange) vs SLOW (blue) current (sliding) performance calculated over the last 200 instances, shown for (a) AGRa and (b) AGRg. Below each plot are the flags corresponding to model selection and drift detection. Accuracy is shown for binary classification and the kappa statistic for multi-class classification.
starts with very low performance and takes most of the stream to improve, although it remains below SLOW. On the other hand, SLOW performs consistently better despite sudden (and very short) drops due to the nature of the stream.
It is important to notice that the proposed mechanism in charge of creating new SLOW models is effective in most cases, enabling the positive interaction between FAST and SLOW. The average ratio of selected candidate SLOW models per sampling strategy is displayed in Table IV. We observe that WRS75 is the sampling strategy that results in the most optimal SLOW models, followed by WRS50.

TABLE IV: Average ratio of SLOW model selection per sampling strategy. Buffer corresponds to using all instances in the buffer.

          Buffer  PAW   RS50  RS75  WRS50  WRS75
average   3.38    4.51  6.54  6.85  11.82  66.91
In Figure 5 we see that using fading factors to measure performance effectively reduces the number of transitions between the FAST and SLOW models. Although the impact on overall performance is marginal, with gains < 1% for FSC_ff over FSC, the usage of fading factors represents a good compromise between performance, number of model transitions and resources. Due to space limitations, we only show plots for selected datasets.
In AGRg (Figure 5a), we see a significant drop in the number of model transitions in regions where one of the models starts showing better performance, around the 300K and 550K marks. Another interesting example is HYPERf (Figure 5b): as previously discussed, SLOW is the overall top performer for the second half of the stream. Using sliding windows, we still see some model transitions whose impact on performance is presumably marginal (Figure 4c). These transitions are not observed (after the middle mark) using fading factors; in this case, FSC yields predictions from SLOW for the rest of
Fig. 5: FSC_ff with fading factors (green) vs FAST (orange) vs SLOW (blue) current (sliding) performance, shown for (a) AGRg and (b) HYPERf. Notice that using fading factors results in fewer transitions between models.
TABLE V: Comparison of FSC and FSC_ff against standalone stream methods.

                               Accuracy                                          Kappa
            FSC    FSCff  PAC    PRCP   HT     HAT    SAM      FSC    FSCff  PAC    PRCP   HT     HAT    SAM
AGRa     B  91.23  91.35  54.83  55.55  90.07  80.67  68.62    81.01  81.24  9.31   10.84  78.62  61.23  36.18
AGRg     B  89.05  89.10  54.22  54.90  80.79  79.19  68.89    76.82  76.93  8.08   9.52   61.02  58.29  32.67
HYPERf   B  84.73  84.83  81.89  82.53  78.76  86.89  86.97    69.37  69.57  63.78  65.05  57.43  73.77  73.94
MDRIFTa  M  53.54  53.20  38.00  17.28  44.95  45.47  86.66    45.29  44.87  11.66  7.32   35.71  39.16  85.07
SEAa     B  88.41  88.48  70.82  74.93  86.43  82.68  87.60    75.23  75.39  39.96  47.81  71.10  63.53  73.85
SEAg     B  88.09  88.18  70.73  74.85  86.43  82.47  87.29    74.58  74.77  39.77  41.31  71.08  63.09  73.19
TCHESSa  M  95.66  96.03  57.58  57.60  68.89  32.26  93.76    78.75  79.39  51.52  51.55  50.11  22.56  92.87
synth avg   84.39  84.45  61.15  59.66  76.62  60.80  82.83    71.58  71.74  32.01  33.34  60.72  54.52  66.82
AIRL     B  65.15  65.30  58.12  57.55  63.81  60.80  60.47    16.51  16.73  15.18  14.08  12.67  19.62  18.61
COVTYPE  M  83.23  83.72  95.32  94.59  82.33  93.99  95.12    60.70  62.24  92.48  91.31  59.19  90.33  92.17
ELEC     B  77.96  77.83  85.95  85.83  77.88  87.44  79.89    52.26  51.89  71.24  70.99  51.68  74.18  58.61
POKER    M  89.09  88.01  77.98  77.06  74.15  69.91  81.84    80.35  78.37  60.91  59.35  32.95  46.15  66.92
WEATHER  B  76.16  76.30  68.06  68.05  73.97  69.29  78.10    38.96  39.19  25.81  25.86  31.98  30.05  44.88
real avg    78.32  78.23  77.09  76.62  74.43  76.29  79.08    49.76  49.68  53.12  52.32  37.69  52.07  56.24
overall avg 81.86  81.86  67.79  66.73  75.71  72.59  81.27    62.49  62.55  40.81  41.25  51.13  53.50  62.41
the stream. As expected, COVTYPE in Figure 5c remains a complex case where both models interact continuously. However, we see an overall reduction in the number of model transitions without compromising performance. Finally, although in ELEC fading factors reduce performance, the difference is marginal. This can be attributed to the small size of the stream and the reduced number of model transitions; consequently, FSC sometimes misses the mark on the top performer for incoming samples. A possible solution for small streams is to reduce the window size (200 samples in our tests) in order to track the performance of both models more closely.
The test results displayed in Table V show how FSC and FSC_ff stand against established stream methods. The objective is to show that the FAST AND SLOW CLASSIFIER is an effective leveraging strategy where a stream model (HT) collaborates with a batch model (XGB); thus, we consider HT as the baseline performance in our tests.
First, we observe that both FSC and FSC_ff outperform the baseline; on average, FSC_ff is slightly better than FSC in terms of kappa and performs equally in terms of accuracy. Although the differences in performance are marginal, we consider that FSC_ff has the upper hand, given that fading factors are cheaper to maintain and result in fewer model transitions.
We see that PAC and PRCP are the overall worst performers, both below the baseline. This can be justified in part by their lack of support for Concept Drift. However, it is important to point out that PAC is the top performer for COVTYPE. Interestingly, although HAT shows good performance on some datasets, on average it performs close (−3.13% accuracy, +2.37% kappa) to the baseline despite the fact that it includes a drift detection mechanism. SAM, on the other hand, consistently performs above the baseline (+5.56% accuracy, +11.28% kappa). Thus, it is relevant to analyze how FSC stands against SAM.
Experimental results confirm that FSC_ff leverages the performance of a standalone stream model (HT) by means of a batch model (XGB). We see that FSC_ff outperforms the other incremental learners for most datasets and, on average, performs slightly above the standalone top performer, SAM. It is important to remind the reader that we explicitly use methods that are not designed to handle drift (HT and XGB) as base models for FSC_ff. This is because the objective of our tests is to show how the proposed framework performs in the presence of Concept Drift by means of the positive collaboration of stream and batch models. It is expected that in actual applications the appropriate batch and stream models will be used to configure FSC, based on the type of problem, available resources and other specific requirements. This is why FSC is designed to be model-agnostic.
Based on the test results, we conclude that FSC_ff represents a good compromise in terms of resources, model transitions and overall performance. The variety in our datasets outlines different scenarios where the FAST AND SLOW CLASSIFIER can effectively handle the interaction of FAST and SLOW models, as well as the regions where defaulting to only one of the models is the best option.
We introduced FAST AND SLOW LEARNING, a new unified
framework that uses fast (stream) and slow (batch) models.
Our research shows that, when learning is considered a continuous
task, and contrary to popular belief, a symbiosis of batch and
stream learning is feasible and can be effective. We showed
the applicability of the proposed framework for the classification
of streams with concept drift. Test results on synthetic and
real-world data confirm that the FAST AND SLOW CLASSIFIER
leverages stream and batch methods by combining
their strengths: a FAST model adapts to changes in the data
and provides predictions based on current concepts, while a
SLOW model generates complex models built on wider data
distributions. A drift detection mechanism triggers the creation
of new SLOW models that are consistent with the new concept.
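The drift-triggered replacement of SLOW models can be sketched as follows. This is a hedged illustration: the detector below is a toy error-rate comparison, not the windowing-based detector used in the paper, but it shows the control flow in which a flagged change discards the current Slow model and schedules a replacement trained on post-change data.

```python
# Illustrative drift detector (an assumption for this sketch, not the paper's
# method): flag drift when the error rate over a recent window exceeds the
# long-run error rate by more than delta.
from collections import deque

class SimpleDriftDetector:
    def __init__(self, window=50, delta=0.25):
        self.recent = deque(maxlen=window)  # last `window` error indicators
        self.total_err = 0                  # errors seen over the whole stream
        self.n = 0                          # instances seen
        self.delta = delta
    def add(self, error):
        """Feed one error indicator (0 = correct, 1 = wrong); return True on drift."""
        self.recent.append(error)
        self.total_err += error
        self.n += 1
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate - self.total_err / self.n > self.delta
```

Feeding the Slow model's per-instance errors into such a detector, a stable concept produces no signal, while a sudden jump in the error rate raises the drift flag, at which point a new SLOW model consistent with the new concept would be trained.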
Although our present study focuses on the feasibility
of a unified batch/stream framework, our findings outline an
area of opportunity to further investigate ways to exploit the
collaboration between different types of learners. Future work
includes using an ensemble instead of a single batch model, and
maintaining a pool of old models, which can be useful in
the presence of recurrent concepts.