Learning Fast and Slow:
A Unified Batch/Stream Framework
Jacob Montiel∗, Albert Bifet∗, Viktor Losing†‡, Jesse Read§ and Talel Abdessalem∗¶
∗LTCI, Télécom ParisTech, Université Paris-Saclay, Paris, France
{jacob.montiel, albert.bifet}@telecom-paristech.fr
†Bielefeld University, Bielefeld, Germany
vlosing@techfak.uni-bielefeld.de
‡HONDA Research Institute Europe, Offenbach, Germany
§LIX, École Polytechnique, Palaiseau, France
jesse.read@polytechnique.edu
¶UMI CNRS IPAL, National University of Singapore
talel.abdessalem@enst.fr
Abstract—Data ubiquity highlights the need for efficient and adaptable data-driven solutions. In this paper, we present FAST AND SLOW LEARNING (FSL), a novel unified framework that sheds light on the symbiosis between batch and stream learning. FSL works by employing Fast (stream) and Slow (batch) learners, emulating the mechanisms used by humans to make decisions. We showcase the applicability of FSL on the task of classification by introducing the FAST AND SLOW CLASSIFIER (FSC). A Fast Learner provides predictions on the spot, continuously updating its model and adapting to changes in the data. The Slow Learner, on the other hand, provides predictions considering a wider spectrum of seen data, requiring more time and data to create complex models. Once enough data has been collected, FSC trains the Slow Learner and starts tracking the performance of both learners. A drift detection mechanism triggers the creation of new Slow models when the current Slow model becomes obsolete. FSC selects between Fast and Slow learners according to their performance on new incoming data. Test results on real and synthetic data show that FSC effectively drives the positive interaction of stream and batch models for learning from evolving data streams.
Index Terms—Machine Learning; Classification; Batch Learning; Stream Learning; Concept Drift
I. INTRODUCTION
A common premise (communis opinio) in Machine Learning is that Batch and Stream Learning are mutually exclusive. A typical starting point when tackling Machine Learning problems is to categorize them as either batch or stream, and from that point most studies treat them separately. This separation often arises from specific requirements and expectations for each type of learning.
In batch learning, constraints on resources such as time and memory are somewhat relaxed (within acceptable limits) in favor of complex models that generalize well over the data describing a phenomenon. For example, Neural Networks (NN) and tree ensembles, such as Random Forest (RF) and eXtreme Gradient Boosting (XGB), are popular techniques in classification that require large amounts of data for training. Additionally, as the amount and complexity of the data increase, these techniques are known to be computationally demanding [1]. For example, in [2], a large deep neural network is used to learn high-level features: training a 9-layer locally connected sparse autoencoder on 10 million images took 3 days using 1,000 machines (16,000 cores).
On the other hand, requirements for stream learning [3] are based on the assumption of a continuous flow of data. In this context, storing data is impractical, thus models are incrementally trained: data is processed as it arrives (one sample at a time) and is never re-used nor stored. Streaming techniques prioritize the optimization of two resources: time and storage. Since streams are assumed to be potentially infinite, stream models are required to be always available, meaning that they must provide predictions at any time.
Unlike the batch setting, where the data distribution is assumed static, in data streams the relationship between input data and the corresponding response (label) may change over time; this is known as Concept Drift. Without intervention, batch learners will fail after drift occurs, since they were essentially trained on data that does not correspond to the current concept. A common intervention is to re-train a new batch model on the new concept. Stream techniques, on the other hand, adapt to new concepts as the model is updated, and some techniques even include special mechanisms designed to handle drifts.
In this work, we consider that learning is a continuous activity and aim to show that there exists an overlapping region where batch and stream learning can positively interact, benefiting from their strengths while compensating for their weaknesses. Stream learners are easier and cheaper to maintain; they adapt (react) to changes in the data distribution and provide predictions based on the most recent data. Batch learners require time to collect sufficient data before they can build a model, but once available, they typically generate more complex and accurate models.
While storing a complete data stream is impractical, current
trends in (cheap) data storage [4], [5] provide the means to
store large subsets of the stream. In fact, big data sets for batch
learning are usually generated by collecting data over time. In
this scenario, where data is continuously arriving, we propose
a framework composed of stream and batch learners.
This unified approach is similar to the model proposed by the Nobel Prize in Economics laureate Daniel Kahneman in his best-selling book Thinking, Fast and Slow [6] to describe the mechanisms behind human decision-making. The central thesis of this book is a dichotomy between two modes of thought: System 1 is fast, instinctive and emotional; System 2 is slower, more deliberative, and more logical. Equivalent systems for Machine Learning are outlined in Table I.
TABLE I: The Fast and Slow systems for Machine Learning.

FAST SYSTEM                 SLOW SYSTEM
Cheap (mem., time)          Expensive (mem., time)
Always ready                Trains on large batches
Robust to drifts, adapts    Complex and robust models
Focus on the present        Generalizes the larger scheme
The rest of the paper is organized as follows: Sec. II provides an overview of related work. Sec. III describes the FAST AND SLOW LEARNING framework and its application to classification. Sec. IV outlines the experimental evaluation. Sec. V examines test results. Conclusions and future directions are discussed in Sec. VI.
II. RELATED WORK
In recent years, Learning from Data Streams has received increasing attention due to the growth in the generation of Fast Data and the democratization of Machine Learning. Concept Drift is of special interest when working with evolving data due to its impact on learning, and multiple approaches have been proposed; [7] provides a complete overview of this topic. In the following, we describe relevant studies.
A couple of drift-aware ensembles work under the Long-Term/Short-Term Memory model (LTM-STM). This model is also found in Long Short-Term Memory networks (LSTMs) [8], a well-known type of Recurrent Neural Network which, for many tasks, performs better than the standard version. Proposed by Baddeley and Hitch in the seminal work [9], the STM model describes working memory, a cognitive system responsible for holding information for processing. According to this model, separate buffers are used for different types of information, and they are separate from the LTM. The Two-Layer System presented in [10] is composed of a Learning Layer (Layer 1), where a model is learned for the original problem, and a Control Layer (Layer 2) that monitors the learning process of the first layer. In the second layer, a drift detection mechanism is used to identify regions of stable concepts, which are stored in a pool of long-term memory models. Similarly, the Self-Adjusting Memory (SAM) model [11] builds an ensemble with models targeting current or former concepts. SAM is built using two memories: the STM for the current concept, and the LTM to retain information about past concepts. A cleaning process is in charge of controlling the size of the STM while keeping the information in the LTM consistent with the information in the STM.
Fig. 1: Fast and Slow Learning Framework overview.
III. FAST AND SLOW LEARNING FRAMEWORK
In this paper, we introduce the FAST AND SLOW LEARNING (FSL) framework. Similarly to [10] and [11], we approach learning as a combination of two systems: FAST and SLOW. Moreover, we consider learning as a continuous task where batch and stream learning co-exist. The FAST system consists of a stream model $M_f$, updated over time. The SLOW system consists of a sequence of batch models $\{M_{s_1}, M_{s_2}, \dots, M_{s_n}\}$ that are replaced as they become obsolete, and a data buffer that stores the most recent data for training. Under this schema, the FAST and SLOW learners complement each other: the FAST learner, incrementally trained as data becomes available, provides predictions consistent with the current concept and is able to adapt quickly to concept drift, whereas the SLOW learner first collects a large amount of data to generate complex models, which are superior in times of stable concepts. An overview of the FSL framework is shown in Figure 1.
A. Fast and Slow Classifier
In the following, we describe an application of the FSL framework in the form of the FAST AND SLOW CLASSIFIER (FSC), where the internal models in the FAST and SLOW systems correspond to classification models. Consider a continuous stream of data $A = \{(\vec{x}_i, y_i)\},\ i = 1, \dots, t$, where $i$ represents the current time and $t \to \infty$; $\vec{x}_i$ is a feature vector and $y_i$ the corresponding target class, such that $y_i \in \{C_1, C_2, \dots, C_k\}$. A FAST (stream) model $M_f$ is trained on single instances as they arrive. A SLOW (batch) model $M_s$ is trained using the data buffer $B = \{(\vec{x}_i, y_i)\},\ i = b - n + 1, \dots, b$, such that $B$ contains the last $n$ samples in $A$ with respect to a given point in time $b$.
STREAM: $\mathrm{train}_f(M_f, (\vec{x}_i, y_i))$   (1)
BATCH: $\mathrm{train}_s(M_s, B)$   (2)
The objective is to accurately predict the class of the current feature vector $\vec{x}_t$, as its label is presumed to be initially unknown. The predicted label of a model $M$ is denoted as $\hat{y}$. As soon as the true label $y$ is revealed, the performance $P$ is measured according to some loss function $\ell$:

$P(M) = \ell(y, \hat{y})$   (3)
Fig. 2: FSC operation. In single mode, the FAST model is used for prediction while SLOW buffers the data. In dual mode, both models predict each sample but only one is selected based on their recent performance. SLOW models are replaced in the presence of concept drift.
The performance of stream models is measured using prequential evaluation, where an instance's class is predicted before the instance is used for training. That is, for the $i$-th instance:

$\hat{y}_i = M_{i-1}(\vec{x}_i)$   (4)
$\mathrm{train}_f(M_{i-1}, (\vec{x}_i, y_i))$   (5)
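To make the evaluation scheme concrete, below is a minimal sketch of the prequential (test-then-train) loop in Python; it assumes `model` exposes the `predict`/`partial_fit` interface shared by scikit-multiflow and scikit-learn incremental learners, and that `stream` yields one (feature vector, label) pair at a time.

```python
def prequential_accuracy(model, stream, classes):
    """Prequential (test-then-train) evaluation: Eq. (4) then Eq. (5) per instance."""
    correct, seen = 0, 0
    for x, y in stream:                       # one (feature vector, label) at a time
        y_hat = model.predict([x])[0]         # Eq. (4): predict with the model so far
        correct += int(y_hat == y)
        seen += 1
        model.partial_fit([x], [y], classes=classes)  # Eq. (5): then train on it
    return correct / seen
```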
FSC works in two modes, outlined in Figure 2, where the FAST and SLOW systems perform slightly different tasks:
• Single mode: Corresponds to the beginning of the stream, from $i = 0$ until $M_s$ is trained (and provides predictions). In this mode, only the FAST system is active and FSC effectively performs as a stream classifier, whereas the SLOW system stores incoming data into its buffer, $B \leftarrow (\vec{x}_i, y_i)$, and triggers the training of $M_s$ when the buffer reaches its limit.
• Dual mode: Begins once the training of $M_s$ is complete. From this point, FSC tracks the performance of $M_f$ and $M_s$ and selects the one with the best (current) performance $P_t$ over the most recent data. Data buffering in $B$ continues in a sliding-window fashion, and the buffer is used to train new batch models upon detecting concept drift. Once FSC enters dual mode, it remains in that mode. A skeleton of this two-mode operation is sketched below.
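The following is a sketch only: `train_slow` (fits a batch model on the buffer) and `recent_perf` (the recent performance $P_t$ of a model) are hypothetical helpers, and the drift-triggered retraining of Sec. III-A1 is omitted for brevity.

```python
from collections import deque

def fsc_loop(fast, stream, classes, n, train_slow, recent_perf):
    """Sketch of FSC operation: single mode until B fills, dual mode afterwards."""
    buffer = deque(maxlen=n)            # sliding-window buffer B of the last n samples
    slow = None                         # no SLOW model yet: single mode
    for x, y in stream:
        if slow is None:                # single mode: FAST answers alone
            y_hat = fast.predict([x])[0]
        else:                           # dual mode: current top performer answers
            top = fast if recent_perf(fast) >= recent_perf(slow) else slow
            y_hat = top.predict([x])[0]
        yield y_hat, y
        fast.partial_fit([x], [y], classes=classes)   # FAST is always updated
        buffer.append((x, y))
        if slow is None and len(buffer) == n:
            slow = train_slow(buffer)   # buffer full: train SLOW, enter dual mode
```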
Notice that when we refer to $M_s$ as slow, we refer to the training time as well as the time required for data collection. Since data arrives at time intervals $\Delta t$, the time for the SLOW classifier to be ready to predict is

$\text{idle time} = (\Delta t \times n) + \mathrm{time}(\mathrm{train}(M_s))$,   (6)

where $n$ is the number of samples. For instance, with $\Delta t = 1$ second and $n = 10000$, data collection alone takes almost three hours before training even begins. The worst-case scenario is when data arrives at a slow rate (large $\Delta t$) and training time is high, for example with a complex deep neural network with hundreds of layers.
1) Training Batch Learners on Data Streams: If the concept in the stream changes, $M_s$ becomes obsolete and has to be replaced by a new model trained on the current concept. We propose a strategy to track the performance of $M_s$ and trigger the training of new models automatically. FSC uses the ADWIN [12] change detector to track changes in the performance of $M_s$. ADWIN keeps a dynamically sized window that grows when there is no change and shrinks otherwise. If the difference between the means of two possible adjacent sub-windows exceeds a threshold depending on a predefined confidence parameter $\delta$, then a change is detected and the dynamic window reduces its size. As soon as ADWIN detects a change, the training of new models is triggered.
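As a sketch, the trigger can be wired up with scikit-multiflow's ADWIN detector as below; feeding it the per-instance 0/1 error of $M_s$ is our assumption for "tracking its performance", and `retrain_candidates` is a hypothetical hook.

```python
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN(delta=0.002)            # confidence parameter delta

def track_slow(y_true, y_pred, retrain_candidates):
    """Feed ADWIN the 0/1 error of M_s; trigger new SLOW candidates on change."""
    adwin.add_element(float(y_true != y_pred))   # 1.0 on error, 0.0 otherwise
    if adwin.detected_change():
        retrain_candidates()          # hypothetical hook: train candidates on B
```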
FSC uses the buffer $B$ to train and evaluate new models. $B$ is split into two adjacent sets, $B_{\mathrm{train}}$ and $B_{\mathrm{test}}$, where $|B_{\mathrm{train}}| \gg |B_{\mathrm{test}}|$, keeping the data arrival order so that $B_{\mathrm{test}}$ contains the last samples from the stream. In addition to training a model on $B_{\mathrm{train}}$, FSC trains $K$ candidate models on different samplings of $B_{\mathrm{train}}$, in order to improve the chances of training an optimal model. FSC applies three different sampling strategies (the weighted variant is sketched after the list):
• Probabilistic Adaptive Window (PAW) [13] keeps in logarithmic memory a sample of items $W$, giving more weight to newer items but maintaining some older items. Each new item $i$ is stored into $W$; then, for each element $j$ in $W$, a random number $r$ in $(0,1)$ is drawn and $j$ is removed from $W$ if $r > 2^{-1/|W|}$.
• Reservoir Sampling (RS) [14] samples $R$ random items from a stream (without replacement) and is equally biased towards old and new items. For a sample of size $R$ with $i$ items seen so far, the chance of any item being chosen is $R/i$. When the next item arrives, an already-chosen item remains with probability $R/(i+1)$ and the new item is kept with probability $R/(i+1)$.
• Weighted Reservoir Sampling (WRS) [15] samples from a weighted distribution where each item in the stream has a corresponding weight. A key $k_i = u_i^{1/w_i}$ is assigned to each item $i$ in the stream, where $w_i$ is the weight of the item and $u_i$ is a random number in $(0,1)$. Finally, the $R$ items with the largest keys $k_i$ are kept. FSC assigns weights to instances in $B_{\mathrm{train}}$ based on their arrival order, with larger weights given to newer instances.
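As an illustration of the last strategy, here is a minimal sketch of the Efraimidis-Spirakis weighted reservoir sampler; weighting by arrival order (newer items heavier) mirrors how FSC weights $B_{\mathrm{train}}$, although the linear weight scheme used here is an assumption for illustration.

```python
import heapq
import random

def weighted_reservoir_sample(items, R):
    """Efraimidis-Spirakis WRS: keep the R items with the largest keys u_i^(1/w_i)."""
    heap = []                                   # min-heap of (key, index, item)
    for i, item in enumerate(items):
        w = i + 1                               # illustrative weights: newer items heavier
        key = random.random() ** (1.0 / w)      # k_i = u_i^(1/w_i), u_i in (0, 1)
        if len(heap) < R:
            heapq.heappush(heap, (key, i, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, item))   # evict the smallest key
    return [item for _, _, item in heap]
```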
Once all candidate models $M_{s'_1}, \dots, M_{s'_{K+1}}$ are trained, FSC uses $B_{\mathrm{test}}$ to compare their performance against $M_s$. FSC picks between the top candidate $M_{s'_{\mathrm{top}}}$ and the current model $M_s$ based on their performance:

$M_s = \begin{cases} M_s, & \text{if } P(M_s) > P(M_{s'_{\mathrm{top}}}) \\ M_{s'_{\mathrm{top}}}, & \text{otherwise} \end{cases}$   (7)
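A sketch of this candidate tournament follows; `samplers` (the strategies above), `fit_batch` (fits a fresh batch model) and `score` (evaluates on $B_{\mathrm{test}}$) are hypothetical helper names.

```python
def select_slow(current, b_train, b_test, samplers, fit_batch, score):
    """Train candidates on samplings of B_train; keep the best on B_test (Eq. 7)."""
    candidates = [fit_batch(b_train)]                         # model on all of B_train
    candidates += [fit_batch(s(b_train)) for s in samplers]   # K sampled candidates
    top = max(candidates, key=lambda m: score(m, b_test))
    # the current model survives only if it still outperforms the top candidate
    return current if score(current, b_test) > score(top, b_test) else top
```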
2) Model selection: During dual-mode operation, FSC applies the prequential method to evaluate the performance $P_t$ of $M_f$ and $M_s$ at time $t$, in order to identify the current top performer. FSC then yields predictions from the current top performer for the next $n$ instances. Similar to humans, FSC uses the FAST system by default. Thus, predictions from FSC are given by:

$\mathrm{FSC}(\vec{x}) = \begin{cases} M_f(\vec{x}), & \text{if } P_t(M_f) \ge P_t(M_s) \\ M_s(\vec{x}), & \text{otherwise} \end{cases}$   (8)
One way to measure performance is to use sliding windows, such that only the last $n$ samples are taken into account. Let $y_W$ and $\hat{y}_W$ be the true and predicted classes in the window $W$; then the prequential error at time $t$ is

$E_t(W) = \ell_t(y_W, \hat{y}_W)$   (9)

A consequence of excluding older samples is that measurements can vary considerably between consecutive windows, which translates into more transitions between $M_f$ and $M_s$.
An alternative to sliding windows are fading factors [16], which are faster and memory-less. This method decreases the influence of older samples on the measurement as the stream progresses. Computing the loss function $\ell$ for each sample, the prequential error at time $t$ using fading factors is estimated by

$E_t = S_t / B_t$   (10)
$S_t = \ell(y_t, \hat{y}_t) + (\alpha \times S_{t-1})$   (11)
$B_t = 1 + (\alpha \times B_{t-1})$   (12)

where $\alpha \in \mathbb{R},\ 0 \ll \alpha \le 1$, is the forgetting factor. In Sec. IV, we show how using fading factors effectively reduces the number of FAST-SLOW model transitions while keeping the performance of FSC optimal.
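These recurrences take only a few lines to implement; below is a minimal sketch, assuming the 0/1 loss as an illustrative choice of $\ell$.

```python
class FadingFactorError:
    """Memory-less prequential error E_t = S_t / B_t, Eqs. (10)-(12)."""

    def __init__(self, alpha=0.999):
        self.alpha, self.s, self.b = alpha, 0.0, 0.0

    def update(self, y_true, y_pred):
        loss = float(y_true != y_pred)          # 0/1 loss, an illustrative choice
        self.s = loss + self.alpha * self.s     # Eq. (11)
        self.b = 1.0 + self.alpha * self.b      # Eq. (12)
        return self.s / self.b                  # Eq. (10)
```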
We propose two main variants of FSC based on the method used to measure performance. In the following, we refer to the variant that measures performance over sliding windows as FSC, while FSC_ff uses fading factors.
IV. EXPERIMENTAL EVALUATION
We are interested in the task of classification in the presence of concept drift. In order to thoroughly evaluate FSC, we select data sets from a variety of domains, for both binary and multi-class problems; see Table II. We use 7 synthetic data sets and 5 real-world data sets. The synthetic data sets include different types of drift: abrupt (the concept changes suddenly), gradual (the concept changes slowly over a region where past and new concepts are present) and incremental-fast (there exist multiple intermediate concepts). For further details, we refer the reader to [17] for AGRAWAL, HYPER and SEA; and to [11] for MIXED DRIFT and TRANSIENT CHESSBOARD.
We configure FSC using Hoeffding Trees (HT) [18] as $M_f$ and eXtreme Gradient Boosting (XGB) [19] as $M_s$.
TABLE II: Datasets. [Type] S: synthetic data; R: real-world data. [Drifts] A: abrupt; G: gradual; If: incremental fast.

            # instances  # features  # classes  Type  Drift
AGR_a         1,000,000           9          2     S      A
AGR_g         1,000,000           9          2     S      G
HYPER_f       1,000,000          10          2     S     If
MDRIFT_a        600,000           2         10     S      A
SEA_a         1,000,000           3          2     S      A
SEA_g         1,000,000           3          2     S      G
TCHESS_a        200,000           2          8     S      A
AIRL            539,383           7          2     R      -
COVTYPE         581,012          54          7     R      -
ELEC             45,312           6          2     R      -
POKER           829,201          10         10     R      -
WEATHER          18,159           8          2     R      -
In this section, we use the labels FAST and SLOW to refer to $M_f$ and $M_s$, respectively. Using this configuration, we aim to show that FSC is an effective collaboration strategy for learning from evolving data streams. Given that neither HT nor XGB has an explicit mechanism to handle drift, robustness in performance in drift zones can be reasonably attributed to the drift detection mechanism in FSC, which triggers the creation of new SLOW models. In our tests, we assume the training time for SLOW models to be smaller than the time $\Delta t$ between incoming samples; see Eq. 6. This means that $M_s$ is ready to predict from instance $n+1$. The size of the buffer $B$ is set to $n = 10000$ for all data sets except the smaller ones (ELEC and WEATHER), where $n = 5000$. The size of the window $W$ used to compare SLOW models is set to 200 samples, the same number of samples used to periodically compare FAST vs. SLOW performance and perform model selection.
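This configuration maps to a few lines of setup, sketched below; the class names follow scikit-multiflow and xgboost at the time of writing (in later scikit-multiflow releases `HoeffdingTree` is renamed `HoeffdingTreeClassifier`), so the exact imports may vary.

```python
from skmultiflow.trees import HoeffdingTree       # FAST model (HT)
from skmultiflow.drift_detection import ADWIN
from xgboost import XGBClassifier                 # SLOW model (XGB)

fast = HoeffdingTree()                    # incremental learner, default parameters
make_slow = lambda: XGBClassifier()       # a fresh batch model per (re)training
detector = ADWIN(delta=0.002)             # drift confidence used in our tests
BUFFER_N = 10000                          # n = 5000 for ELEC and WEATHER
EVAL_WINDOW = 200                         # samples for model comparison/selection
ALPHA = 0.999                             # forgetting factor for FSC_ff
```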
For sampling the data in $B_{\mathrm{train}}$, we set the window size in PAW equal to $|B_{\mathrm{train}}|$ (remember that the actual sample size is dynamic), where $|B_{\mathrm{train}}| = |B| - |B_{\mathrm{test}}| = 9800$. Two different sample sizes, corresponding to 50% and 75% of $B_{\mathrm{train}}$, are used for Reservoir Sampling (RS50, RS75) and Weighted Reservoir Sampling (WRS50, WRS75). Finally, we set the confidence $\delta = 0.002$ for ADWIN and the forgetting factor $\alpha = 0.999$ for fading factors. We perform two main tests:
1) Compare FSC against the performance of FAST and SLOW over the stream, that is, measure the impact on performance of the collaboration between models. During single-mode operation, only the FAST model is active, thus FSC defaults to this model for predictions and their performance is the same. Once SLOW is trained, FSC enters dual mode and further model trainings are triggered by the drift detector. It is important to notice that FAST and FSC cover the entire stream while SLOW only covers part of it. We compare the performance of the models during both modes.
2) Compare FSC and FSC_ff against other learning methods: the Passive-Aggressive Classifier (PAC) [20], an incremental learning method that modifies its prediction mechanism in response to feedback on current performance; the Perceptron Classifier (PRCP); Hoeffding Adaptive Trees (HAT) [21], which update the tree structure in response to drift; and Self-Adjusting Memory (SAM) [11], described in Sec. II. It is important to note that HAT and SAM are methods designed to handle Concept Drift.
FSC is implemented using scikit-multiflow [22], a stream learning framework in Python. Tests are performed using the implementations of PAC and PRCP in scikit-learn [23], the Python implementation of XGB, and the implementations of HT, HAT and SAM from scikit-multiflow. All classification methods are set to default parameters. To measure performance, we use accuracy for binary classification and the kappa statistic for multi-class classification.
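For reference, the scikit-learn baselines can be instantiated as below; both expose `partial_fit`, so they run through the same prequential loop sketched in Sec. III.

```python
from sklearn.linear_model import PassiveAggressiveClassifier, Perceptron

baselines = {
    "PAC": PassiveAggressiveClassifier(),   # default parameters, as in our tests
    "PRCP": Perceptron(),
}
# Both support partial_fit, so each baseline can be evaluated with the same
# prequential_accuracy(model, stream, classes) loop sketched earlier.
```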
TABLE III: Test results for FSC using sliding windows for performance tracking and model selection. Numbers in bold are top values. Values in single+dual are the average performance of FAST and FSC over the whole stream. Type indicates whether the stream is B: binary or M: multi-class. The main performance indicator is accuracy for binary streams and kappa for multi-class.

                              Accuracy                            Kappa
                  single+dual       dual mode         single+dual       dual mode
             Type  FAST   FSC    SLOW  FAST   FSC     FAST   FSC    SLOW  FAST   FSC
AGR_a          B  90.07  91.23  86.97  90.04  91.22  78.62  81.01  73.86  78.57  80.99
AGR_g          B  80.79  89.05  85.95  80.67  89.01  61.02  76.82  70.40  60.79  76.76
HYPER_f        B  78.76  84.73  84.49  78.70  84.72  57.43  69.37  68.87  57.28  69.34
MDRIFT_a       M  44.95  53.54  52.08  44.77  53.50  35.71  45.29  43.66  35.48  45.22
SEA_a          B  86.43  88.41  88.37  86.43  88.43  71.10  75.23  75.14  71.11  75.29
SEA_g          B  86.43  88.09  88.06  86.43  88.11  71.08  74.58  74.51  71.09  74.63
TCHESS_a       M  68.89  95.66  98.35  69.76  97.96  50.11  78.75  84.46  50.89  81.07
synth avg         76.62  84.39  83.47  76.69  84.71  60.72  71.58  70.13  60.74  71.90
AIRL           B  63.81  65.15  63.95  63.86  65.23  12.67  16.51  17.75  12.79  16.70
COVTYPE        M  82.33  83.23  77.75  82.73  83.64  59.19  60.70  47.35  59.47  61.00
ELEC           B  77.88  77.96  76.06  77.12  77.24  51.68  52.26  49.21  50.54  51.25
POKER          M  74.15  89.07  88.30  74.24  89.33  32.95  76.54  79.29  33.16  77.29
WEATHER        B  73.97  76.16  77.19  74.10  76.66  31.98  38.96  41.91  32.65  40.79
real avg          74.43  78.31  76.65  74.41  78.42  37.69  48.99  47.10  37.72  49.41
overall avg       75.71  81.86  80.63  75.74  82.09  51.13  62.17  60.53  51.15  62.53
V. RESULTS
An example of the FAST and SLOW models working independently is shown in Figure 3. Test results for the FAST and SLOW models correspond to HT and XGB, respectively. The plots in Figs. 4 and 5 show model performance over time (top), the currently selected model, indicating the source (FAST or SLOW) of the predictions yielded by FSC (middle), and the drift detection flag (bottom). Due to space limitations, not all plots are shown.
Results for FSC are summarized in Table III, divided depending on the operation mode (single+dual, dual) in which they were measured. Results indicate that FSC performs better than FAST for all data sets across the entire stream (single+dual). Moreover, FSC performs better than FAST and SLOW during dual mode in most cases. It should be noted that when we say that FSC performs better, we refer to the optimal integration of predictions from the FAST and SLOW models. We report performance in terms of accuracy and the kappa statistic, but it is important to note that the main performance indicator used by FSC is accuracy for binary streams and the kappa statistic for multi-class streams.
Fig. 3: FAST (orange) and SLOW (blue) models' sliding accuracy on AGR_g, which contains 3 gradual drifts at the 25%, 50% and 75% marks.

In AGR_a (Figure 4a) and AGR_g (Figure 4b) we can distinguish the regions where drifts are simulated and their impact on the classifiers. In both cases, the SLOW system reacts faster than FAST due to the change detection mechanism that triggers the training of new batch models. In AGR_a, the SLOW system recovers quickly after the second drift but is not able to train an optimal model for the concept, and FSC chooses FAST. This is corrected later, as a new drift is detected (around the 700K mark). In AGR_g, FAST takes longer to adapt since the concept drift is gradual. In this case, we see two clear regions after the second and third drifts where FSC opts for the SLOW model.
A more complex scenario is seen in COVTYPE (Figure 4e), where FSC constantly changes between FAST and SLOW, with only small regions dominated by one model. Despite this, we notice that FSC yields predictions from the current top performer, resulting in an overall improvement. In ELEC (Figure 4f), model selection is easier to observe given the smaller size of the stream. In these cases, we see that FSC benefits from continuously picking predictions from both the FAST and SLOW models.
The results from HYPER_f (Figure 4c) are interesting since they exemplify the case where one of the classifiers becomes superior. FAST performance drops around the middle of the stream and FSC defaults to SLOW to provide predictions. This finding has important implications: it highlights the scenario where a batch model is a better predictor for a stream but requires time to become optimal. The first half of the stream can be considered a kind of tuning phase where FSC yields predictions from FAST and SLOW. Once an optimal SLOW model is trained, FSC defaults to it. We have successfully trained an optimal batch model for the stream while keeping the requirement of providing predictions at any time.
As previously indicated, TCHESS_a (Figure 4d) is designed to handicap classifiers that favour new concepts over old ones. Abrupt and frequent drifts are challenging for FAST, which starts with very low performance and takes most of the stream to improve, although it remains below SLOW. On the other hand, SLOW performs consistently well despite sudden (and very short) drops due to the nature of the stream.

Fig. 4: FSC (green) vs FAST (orange) vs SLOW (blue) current (sliding) performance calculated over the last 200 instances, for (a) AGR_a, (b) AGR_g, (c) HYPER_f, (d) TCHESS_a, (e) COVTYPE and (f) ELEC. Below each plot are the flags corresponding to model selection and drift detection. Accuracy is shown for binary classification and the kappa statistic for multi-class classification.
It is important to notice that the proposed mechanism in charge of creating new SLOW models is effective in most cases, enabling the positive interaction between FAST and SLOW. The average ratio of selected candidate SLOW models per sampling strategy is displayed in Table IV. We observe that WRS75 is the sampling strategy that results in the most optimal SLOW models, followed by WRS50.
TABLE IV: Average ratio of SLOW model selection per sampling strategy. Buffer corresponds to using all instances in the buffer.

         Buffer   PAW  RS50  RS75  WRS50  WRS75
average    3.38  4.51  6.54  6.85  11.82  66.91
In Figure 5 we see that using fading factors to measure performance effectively reduces the number of transitions between the FAST and SLOW models. Although the impact on overall performance is marginal, with gains below 1% for FSC_ff over FSC, the usage of fading factors represents a good compromise between performance, number of model transitions and resources. Due to space limitations, we only show plots for selected datasets.
In AGR_g (Figure 5a), we see a significant drop in the number of model transitions in regions where one of the models starts showing better performance, around the 300K and 550K marks. Another interesting example is HYPER_f (Figure 5b); as previously discussed, SLOW is the overall top performer for the second half of the stream. Using sliding windows, we still see some model transitions whose impact on performance is presumably marginal (Figure 4c). These transitions are not observed (after the middle mark) when using fading factors; in this case, FSC yields predictions from SLOW for the rest of the stream.

Fig. 5: FSC_ff with fading factors (green) vs FAST (orange) vs SLOW (blue) current (sliding) performance, for (a) AGR_g, (b) HYPER_f, (c) COVTYPE and (d) ELEC. Notice that using fading factors results in fewer transitions between models.
TABLE V: Comparison of FSC and FSC_ff against stand-alone stream methods.

Accuracy
             Type    FSC  FSC_ff    PAC   PRCP     HT    HAT    SAM
AGR_a          B   91.23   91.35  54.83  55.55  90.07  80.67  68.62
AGR_g          B   89.05   89.10  54.22  54.90  80.79  79.19  68.89
HYPER_f        B   84.73   84.83  81.89  82.53  78.76  86.89  86.97
MDRIFT_a       M   53.54   53.20  38.00  17.28  44.95  45.47  86.66
SEA_a          B   88.41   88.48  70.82  74.93  86.43  82.68  87.60
SEA_g          B   88.09   88.18  70.73  74.85  86.43  82.47  87.29
TCHESS_a       M   95.66   96.03  57.58  57.60  68.89  32.26  93.76
synth avg          84.39   84.45  61.15  59.66  76.62  60.80  82.83
AIRL           B   65.15   65.30  58.12  57.55  63.81  60.80  60.47
COVTYPE        M   83.23   83.72  95.32  94.59  82.33  93.99  95.12
ELEC           B   77.96   77.83  85.95  85.83  77.88  87.44  79.89
POKER          M   89.09   88.01  77.98  77.06  74.15  69.91  81.84
WEATHER        B   76.16   76.30  68.06  68.05  73.97  69.29  78.10
real avg           78.32   78.23  77.09  76.62  74.43  76.29  79.08
overall avg        81.86   81.86  67.79  66.73  75.71  72.59  81.27

Kappa
             Type    FSC  FSC_ff    PAC   PRCP     HT    HAT    SAM
AGR_a          B   81.01   81.24   9.31  10.84  78.62  61.23  36.18
AGR_g          B   76.82   76.93   8.08   9.52  61.02  58.29  32.67
HYPER_f        B   69.37   69.57  63.78  65.05  57.43  73.77  73.94
MDRIFT_a       M   45.29   44.87  11.66   7.32  35.71  39.16  85.07
SEA_a          B   75.23   75.39  39.96  47.81  71.10  63.53  73.85
SEA_g          B   74.58   74.77  39.77  41.31  71.08  63.09  73.19
TCHESS_a       M   78.75   79.39  51.52  51.55  50.11  22.56  92.87
synth avg          71.58   71.74  32.01  33.34  60.72  54.52  66.82
AIRL           B   16.51   16.73  15.18  14.08  12.67  19.62  18.61
COVTYPE        M   60.70   62.24  92.48  91.31  59.19  90.33  92.17
ELEC           B   52.26   51.89  71.24  70.99  51.68  74.18  58.61
POKER          M   80.35   78.37  60.91  59.35  32.95  46.15  66.92
WEATHER        B   38.96   39.19  25.81  25.86  31.98  30.05  44.88
real avg           49.76   49.68  53.12  52.32  37.69  52.07  56.24
overall avg        62.49   62.55  40.81  41.25  51.13  53.50  62.41
As expected, COVTYPE (Figure 5c) remains a complex case where both models interact continuously. However, we see an overall reduction in the number of model transitions without compromising performance. Finally, although fading factors reduce performance on ELEC, the difference is marginal. This can be attributed to the small size of the stream and the reduced number of model transitions; consequently, FSC sometimes misses the mark on the top performer for incoming samples. A possible solution for small streams is to reduce the window size (200 samples in our tests) in order to track the performance of both models more closely.
The test results displayed in Table V show how FSC and FSC_ff stand against established stream methods. The objective is to show that the FAST AND SLOW CLASSIFIER is an effective leveraging strategy where a stream model (HT) collaborates with a batch model (XGB); thus, we consider HT as the baseline performance in our tests.
First, we observe that both FSC and FSC_ff outperform the baseline; on average, FSC_ff is slightly better than FSC in terms of kappa and performs equally in terms of accuracy. Although the differences in performance are marginal, we consider that FSC_ff has the upper hand, given that fading factors are cheaper to maintain and result in fewer model transitions.
We see that PAC and PRCP are the overall worst performers, both below the baseline. This can be justified in part by their lack of support for Concept Drift. However, it is important to point out that PAC is the top performer for COVTYPE. Interestingly, although HAT shows good performance on some datasets, on average it performs close to the baseline (−3.13% accuracy, +2.37% kappa), despite the fact that it includes a drift detection mechanism. SAM, on the other hand, consistently performs above the baseline (+5.56% accuracy, +11.28% kappa). Thus, it is relevant to analyze how FSC stands against SAM.
Experimental results confirm that FSC_ff leverages the performance of a stand-alone stream model (HT) by means of a batch model (XGB). We see that FSC_ff outperforms other incremental learners for most datasets and on average performs slightly above the standalone top performer, SAM. It is important to remind the reader that we explicitly use methods that are not designed to handle drift (HT and XGB) as base models for FSC_ff. This is because the objective of our tests is to show how the proposed framework performs in the presence of Concept Drift by means of the positive collaboration of stream and batch models. It is expected that in actual applications the appropriate batch and stream models will be used to configure FSC, based on the type of problem, available resources and other specific requirements. This is why FSC is designed to be model-agnostic.
Based on the test results, we conclude that FSC_ff represents a good compromise in terms of resources, model transitions and overall performance. The variety in our datasets outlines different scenarios where the FAST AND SLOW CLASSIFIER can effectively handle the interaction of FAST and SLOW models, as well as the regions where defaulting to only one of the models is the best option.
VI. CONCLUSIONS
We introduced FAST AND SLOW LEARNING, a new unified framework that uses fast (stream) and slow (batch) models. Our research shows that, considering learning as a continuous task, and contrary to popular belief, a symbiosis of batch and stream learning is feasible and can be effective. We showed the applicability of the proposed framework for the classification of streams with concept drift. Test results on synthetic and real-world data confirm that the FAST AND SLOW CLASSIFIER leverages stream and batch methods by combining their strengths: a FAST model adapts to changes in the data and provides predictions based on current concepts, while a SLOW model generates complex models built on wider data distributions. A drift detection mechanism triggers the creation of new SLOW models that are consistent with the new concept.
Although our present study is focused on the feasibility of a unified batch/stream framework, our findings outline an area of opportunity to further investigate ways to exploit the collaboration between different types of learners. Future work includes using an ensemble instead of single batch models, and a pool of models to keep old models, which can be useful in the presence of recurrent concepts.
REFERENCES
[1] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya,
R. Wald, and E. Muharemagic, “Deep learning applications and chal-
lenges in big data analytics,” Journal of Big Data, vol. 2, no. 1, p. 1,
2015.
[2] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, “Building high-level features using large scale unsupervised learning,” in Proceedings of the 29th International Conference on Machine Learning. USA: Omnipress, 2012, pp. 507–514.
[3] A. Bifet, R. Gavaldà, G. Holmes, and B. Pfahringer, Machine Learning for Data Streams with Practical Examples in MOA. MIT Press, 2018, https://moa.cms.waikato.ac.nz/book/.
[4] C. P. Chen and C.-Y. Zhang, “Data-intensive applications, challenges,
techniques and technologies: A survey on big data,” Information Sci-
ences, vol. 275, pp. 314–347, 2014.
[5] A. Siddiqa, I. A. T. Hashem, I. Yaqoob, M. Marjani, S. Shamshirband,
A. Gani, and F. Nasaruddin, “A survey of big data management:
Taxonomy and state-of-the-art,” Journal of Network and Computer
Applications, vol. 71, pp. 151–166, 2016.
[6] D. Kahneman, Thinking, fast and slow. New York: Farrar, Straus and
Giroux, 2011.
[7] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A survey on concept drift adaptation,” ACM Computing Surveys, vol. 46, no. 4, pp. 1–37, 2014.
[8] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[9] A. D. Baddeley and G. Hitch, “Working memory,” in Psychology of
learning and motivation. Elsevier, 1974, vol. 8, pp. 47–89.
[10] J. Gama and P. Kosina, “Recurrent concepts in data streams classifica-
tion,” Knowledge and Information Systems, vol. 40, no. 3, pp. 489–507,
2014.
[11] V. Losing, B. Hammer, and H. Wersing, “KNN classifier with self
adjusting memory for heterogeneous concept drift,” Proceedings - IEEE
International Conference on Data Mining, ICDM, vol. 1, pp. 291–300,
2017.
[12] A. Bifet and R. Gavaldà, “Learning from Time-Changing Data with Adaptive Windowing,” Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 443–448, 2007.
[13] A. Bifet, B. Pfahringer, J. Read, and G. Holmes, “Efficient data stream
classification via probabilistic adaptive windows,” in Proceedings of the
28th annual ACM symposium on applied computing. ACM, 2013, pp.
801–806.
[14] J. S. Vitter, “Random sampling with a reservoir,” ACM Transactions on
Mathematical Software, vol. 11, no. 1, pp. 37–57, 1985.
[15] P. S. Efraimidis and P. G. Spirakis, “Weighted random sampling with a
reservoir,” Information Processing Letters, vol. 97, no. 5, pp. 181–185,
2006.
[16] J. Gama, R. Sebastião, and P. P. Rodrigues, “On evaluating stream learning algorithms,” Machine Learning, vol. 90, no. 3, pp. 317–346, 2013.
[17] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes, and T. Abdessalem, “Adaptive random forests for evolving data stream classification,” Machine Learning, vol. 106, no. 9-10, pp. 1469–1495, 2017.
[18] P. Domingos and G. Hulten, “Mining high-speed data streams,” Kdd,
pp. 71–80, 2000.
[19] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,”
Kdd, pp. 1–10, 2016.
[20] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online Passive-Aggressive Algorithms,” Journal of Machine Learning Research, vol. 7, pp. 551–585, 2006.
[21] A. Bifet and R. Gavaldà, “Adaptive Learning from Evolving Data Streams,” 8th International Symposium on Intelligent Data Analysis, pp. 249–260, 2009.
[22] J. Montiel, J. Read, A. Bifet, and T. Abdessalem, “Scikit-
Multiflow: A Multi-output Streaming Framework,” Journal of Machine
Learning Research, 2018, accepted. [Online]. Available: https:
//github.com/scikit-multiflow/scikit-multiflow
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al.,
“Scikit-learn: Machine learning in python,” Journal of machine learning
research, vol. 12, no. Oct, pp. 2825–2830, 2011.