
VHT: Vertical Hoeffding Tree

Nicolas Kourtellis, Telefonica Research, nicolas.kourtellis@telefonica.com
Gianmarco De Francisci Morales, Qatar Computing Research Institute, gdfm@acm.org
Albert Bifet, Telecom ParisTech, albert@albertbifet.com
Arinto Murdopo, LARC-SMU, arintom@smu.edu.sg

ABSTRACT

IoT Big Data requires new machine learning methods able to scale to large volumes of data arriving at high speed. Decision trees are popular machine learning models, since they are very effective yet easy to interpret and visualize. In the literature, we can find distributed algorithms for learning decision trees, and also streaming algorithms, but no algorithms that combine both features. In this paper we present the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees. It features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining distributed data streams, and is thus able to run on real-world clusters. We run several experiments to study the accuracy and throughput performance of our new VHT algorithm, as well as its ability to scale while keeping its superior performance with respect to non-distributed decision trees.

1. INTRODUCTION

Nowadays, we generate data from many of our daily ac-

tivities as we interact with software systems continuously.

The posts in a social network like Twitter or Facebook, the

purchases with a credit card, the clicks in a website, or the

access to the GPS, can all potentially produce useful in-

formation for interested parties. The recent advancements

in mobile devices and wearable technology have further in-

creased the rate and amount of data being generated. People

now generate data anywhere, anytime, by using a multitude

of gadgets and technologies. In the limit, the Internet of

Things (IoT) will continuously produce data without any

human intervention, thus leading to a dramatic increase in the volume and velocity of data. It is estimated that the IoT will consist of almost 50 billion objects by 2020.1

There is a common pattern to most modern data sources:

data is generated continuously, as a stream. Extracting

1 http://www.cisco.com/c/dam/en_us/about/ac79/docs/innov/IoT_IBSG_0411FINAL.pdf

knowledge from these massive streams of data to create

models, and using them, e.g., to choose a suitable business

strategy, or to improve healthcare services, can generate sub-

stantial competitive advantages. Many applications need to

process incoming data and react on-the-ﬂy by using compre-

hensible prediction mechanisms. For example, when a bank

monitors the transactions of its clients to detect frauds, it

needs to identify and verify a fraud as soon as the transac-

tion is performed, and immediately either block it, or adjust

the prediction mechanism.

Streaming data analytics systems need to process and manage data streams in a fast and efficient way, due to the strin-

gent restrictions in terms of time and memory imposed by

the streaming setting. The input to the system is an un-

bounded stream arriving at high speed. Therefore, we need

to use simple models that scale gracefully with the amount

of data. Additionally, we need to let the model make the right decision online. But how can we trust that the model

is right? A way to create trust is to enhance understand-

ing of the model and its interpretability, for instance via

visualization. There are several models that satisfy both requirements; however, for reasons we discuss below, in this work we focus on decision trees.

A decision tree is a classic decision support tool that uses

a tree-like model. In machine learning, it can be used for

both classiﬁcation and regression [4]. At its core, a decision

tree is a model where internal nodes are tests on attributes,

branches are possible outcomes of these tests, and leaves are decisions, e.g., a class assignment.

Decision trees, and in general tree-based classiﬁers, are

widely popular, for several reasons. First, the model is very

easy to interpret. It is easy to understand how the model

reaches a classiﬁcation decision, and the relative importance

of features. Trees are also easy to visualize, and to modify

according to domain knowledge. Second, prediction is very

fast. Once the model is trained, classifying a new instance requires just a number of very fast checks that is logarithmic in the size of the model. For this reason, they are commonly used

in one of the most time-sensitive domains nowadays – Web

search [5, 16]. Third, trees are powerful classiﬁers that can

model non-linear relationships. Indeed, their performance,

especially when used in ensemble methods such as boosting,

bagging, and random forests, is outstanding [9].

Learning the optimal decision tree for a given labeled

dataset is NP-complete even for very simple settings [13].

Practical methods for building tree models usually employ

a greedy heuristic that optimizes decisions locally at each

node [4]. In a nutshell, the greedy heuristic starts with

arXiv:1607.08325v1 [cs.DC] 28 Jul 2016

an empty node (the root) as the initial model, and works

by recursively sorting the whole dataset through the cur-

rent model. Each leaf of the tree collects statistics on the

distribution of attribute-class co-occurrences in the part of

dataset that reaches the leaf. When all the dataset has been

analyzed, each leaf picks the best attribute according to a

splitting criterion (e.g., entropy or information gain). Then,

it becomes an internal node that branches on that attribute,

splits the dataset into newly created child leaves, and calls the procedure recursively for these leaves. The proce-

dure usually stops when the leaf is pure (i.e., only one class

reaches the leaf), or when the number of instances reaching

the leaf is small enough. This recursive greedy heuristic is

inherently a batch process, as it needs to process the whole

dataset before taking a split decision. However, streaming

variants of tree learners also exist.

The Hoeﬀding tree [8] (a.k.a. VFDT) is a streaming de-

cision tree learner with statistical guarantees. In particular,

by leveraging the Chernoﬀ-Hoeﬀding bound [12], it guaran-

tees that the learned model is asymptotically close to the

model learned by the batch greedy heuristic, under mild as-

sumptions.

The learning algorithm is very simple. Each leaf keeps

track of the statistics for the portion of the stream it is

reached by, and computes the best two attributes according

to the splitting criterion. Let ΔG be the difference between the values of the splitting criterion for these two attributes. Let ε be a quantity that depends on a user-defined confidence parameter δ, and that decreases with the number of instances processed. When ΔG > ε, the currently best attribute is selected to split the leaf. The Hoeffding bound guarantees that this choice is the correct one with probability larger than 1 − δ.
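This split test can be sketched in a few lines of Python. The fragment below is an illustrative sketch, not the paper's implementation; `value_range` stands for the range R of the splitting criterion, and the function names are our own:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon depends on the confidence delta and shrinks as the
    number of observed instances n grows."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, value_range, delta, n):
    """Split when the gap between the top two attributes exceeds epsilon,
    i.e., the best attribute is correct with probability > 1 - delta."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)
```

For example, with a criterion range of 1, δ = 0.05, and 100 instances, ε ≈ 0.12, so a gap of 0.7 triggers a split while a gap of 0.01 does not.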

Streaming algorithms are only one of the two main ways

to deal with massive datasets, the other being distributed

algorithms [6]. However, even though streaming algorithms

are very eﬃcient, they are still bounded by the limits of a

single machine. As argued by Agarwal et al. [1], “there are

natural reasons for studying distributed machine learning

on a cluster.” Nowadays, the data itself is usually already

distributed, and the cost of moving it to a single machine is

too high. Furthermore, cluster computing with commodity

servers is economically more viable than using powerful sin-

gle machines, as testiﬁed by innumerable web companies [2].

Finally, “the largest problem solvable by a single machine

will always be constrained by the rate at which the hard-

ware improves, which has been steadily dwarfed by the rate

at which our data sizes have been increasing over the past

decade” [1].

For all the aforementioned reasons, the goal of the cur-

rent work is to propose a tree learning algorithm for the

streaming setting that runs in a distributed environment.

By combining the eﬃciency of streaming algorithms with

the scalability of distributed processing we aim at provid-

ing a practical tool to tackle the complexities of “big data”,

namely its velocity and volume. Speciﬁcally, we develop our

algorithm in the context of Apache SAMOA [7], an open-

source platform for mining big data streams.2

We name our algorithm the Vertical Hoeﬀding Tree (VHT).

The vertical part stands for the type of parallelism we em-

ploy, namely, vertical data parallelism. Similarly to the orig-

2 http://samoa.incubator.apache.org

inal Hoeﬀding tree, the VHT features anytime prediction

and continuous learning.

Naturally, the combination of streaming and distributed

algorithms presents its own unique challenges. Other ap-

proaches have been proposed for parallel algorithms, which

however do not take into account the characteristics of mod-

ern, shared-nothing cluster computing environments [3].

Concisely, we make the following contributions:

• we propose VHT, the first distributed streaming algorithm for learning decision trees;

•in doing so, we explore a novel way of parallelizing deci-

sion trees via vertical parallelism;

•we deploy our algorithm on top of SAMOA, and run

it on a real-world Storm cluster to test scalability and

accuracy;

•we experiment with large datasets of tens of thousands

of attributes and obtain high accuracy (up to 80%) and

high throughput (offering up to a 20× speedup over serial streaming solutions).

The outline of the paper is as follows. We discuss related

work in Section 2, and some preliminary concepts in Sec-

tion 3. We present the new VHT algorithm in Section 4, var-

ious optimization and implementation details in Section 5,

and an empirical evaluation in Section 6, with several ex-

perimental setups on real and synthetic datasets. Finally,

with Section 7 we conclude this work.

2. RELATED WORK

The literature abounds with streaming and distributed

machine learning algorithms, though none of them features both characteristics simultaneously. Reviewing all these al-

gorithms is out of the scope of this paper, so we focus our

attention on decision trees. We also review the few attempts

at creating distributed streaming learning algorithms that

have been proposed so far.

Algorithms. One of the pioneer works in decision tree

induction for the streaming setting is the Very Fast Decision

Tree algorithm (VFDT) [8]. This work addresses the time and memory bottlenecks that prevent conventional batch algorithms from processing unbounded streams. Its main contribution is the use of the Hoeffding bound to decide how many instances are required to achieve a certain level of confidence in a split decision. This work has been the basis for a large number

of improvements, such as dealing with concept drift [11] and

handling continuous numeric attributes [10].

PLANET [15] is a framework for learning tree models

on massive datasets. This framework utilizes MapReduce

to provide scalability. The authors propose a PLANET

scheduler to transform steps in decision tree induction into

MapReduce jobs that can be executed in parallel. PLANET uses task parallelism, where each task (node splitting) is executed by one MapReduce job that runs independently.

Clearly, MapReduce is a batch programming paradigm which

is not suited to deal with streams of data.

Ye et al. [20] show how to distribute and parallelize Gra-

dient Boosted Decision Trees (GBDT). The authors ﬁrst

implement MapReduce-based GBDT that employs horizon-

tal data partitioning. Converting GBDT to the MapReduce model is fairly straightforward. However, due to the high overhead of HDFS as a communication medium when splitting nodes, the authors conclude that MapReduce is not suitable


for this kind of algorithm. The authors then implement

GBDT by using MPI. This implementation uses vertical

data partitioning by splitting the data based on their at-

tributes. This partitioning technique minimizes inter-machine

communication cost. Vertical parallelism is also the data

partitioning strategy we choose for the VHT algorithm.

While technically not a tree, Vu et al. [19] propose the ﬁrst

distributed streaming rule-based regression algorithm. The

algorithm is in spirit similar to the VHT, as it uses vertical

parallelism and runs on top of distributed SPEs. However,

it creates a diﬀerent kind of model and deals with regression

rather than classiﬁcation.

Frameworks. We identify two frameworks that belong

to the category of distributed streaming machine learning:

Jubatus and StormMOA. Jubatus3 is an example of a distributed streaming machine learning framework. It includes a library for streaming machine learning tasks such as regression, classification, recommendation, anomaly detection, and graph mining. It introduces the concept of a local ML model, meaning that multiple models can run at the same time, each processing a different set of data. Using this technique, Jubatus achieves horizontal scalability via horizontal parallelism in partitioning data. We test horizontal parallelism in our experiments by implementing a horizontally scaled version of the Hoeffding tree.

Jubatus establishes tight coupling between the machine

learning library implementation and the underlying distributed

stream processing engine (SPE). The reason is that Jubatus builds and implements its own custom distributed SPE. In addi-

tion, Jubatus does not oﬀer any tree learning algorithm, as

all of its models need to be linear by construction.

StormMOA4 is a project that combines MOA with Storm to satisfy the need for a scalable implementation of streaming ML frameworks. It uses Storm's Trident abstraction and the MOA library to implement OzaBag and OzaBoost [14].

Similarly to Jubatus, StormMOA also establishes tight

coupling between MOA (the machine learning library) and

Storm (the underlying distributed SPE). This coupling prevents StormMOA from being extended to use other SPEs to execute the machine learning library.

StormMOA only allows running a single model in each Storm bolt (processor). This characteristic restricts the kinds of models that can be run in parallel to ensembles. The

sharding algorithm we use in the experimental section can

be seen as an instance of this type of framework.

3. PRELIMINARIES

This section introduces the background needed to under-

stand the VHT algorithm. First, we review the literature on

inducing decision trees on a stream. Then, we present the

programming paradigm oﬀered by Apache SAMOA.

3.1 Hoeffding Tree

A decision tree consists of a tree structure, where each

internal node corresponds to a test on an attribute. The

node splits into a branch for each attribute value (for dis-

crete attributes), or a set of branches according to ranges of

the value (for continuous attributes). Leaves contain clas-

siﬁcation predictors, usually majority class classiﬁers, i.e.,

3 http://jubat.us/en
4 http://github.com/vpa1977/stormmoa

Algorithm 1 HoeffdingTreeInduction(X, HT, δ)

Require: X, a labeled training instance.
Require: HT, the current decision tree.
1: Use HT to sort X into a leaf l
2: Update sufficient statistic in l
3: Increment n_l, the number of instances seen at l
4: if n_l mod n_min = 0 and not all instances seen at l belong to the same class then
5:   For each attribute, compute G_l(X_i)
6:   Let X_a be the attribute with highest G_l
7:   Let X_b be the attribute with second highest G_l
8:   Compute the Hoeffding bound ε = √(R² ln(1/δ) / (2n_l))
9:   if X_a ≠ X_∅ and (G_l(X_a) − G_l(X_b) > ε or ε < τ) then
10:    Replace l with an internal node branching on X_a
11:    for all branches of the split do
12:      Add a new leaf with derived sufficient statistic from the split node
13:    end for
14:  end if
15: end if

each leaf predicts the class belonging to the majority of the

instances that reach the leaf.

Decision tree models are very easy to interpret and vi-

sualize. The class predicted by a tree can be explained in

terms of a sequence of tests on its attributes. Each attribute

contributes to the ﬁnal decision, and it’s easy to understand

the importance of each attribute.5

The Hoeﬀding tree or VFDT is a very fast decision tree for

streaming data. Its main characteristic is that rather than

reusing instances recursively down the tree, it uses them

only once. Algorithm 1 shows a high-level description of the

Hoeﬀding tree.

At the beginning of the learning phase, it creates a tree with only a single node. The Hoeffding tree induction (Algorithm 1) is invoked for each training instance X that arrives. First, the algorithm sorts the instance into a leaf l (line 1). This leaf is a learning leaf, therefore the algorithm updates the sufficient statistic in l (line 2). In this case, the sufficient statistic is the class distribution for each value of each attribute. In practice, the algorithm increases a counter n_ijk, for attribute i, value j, and class k. The algorithm also increments the number of instances (n_l) seen at leaf l based on X's weight (line 3).
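As an illustration, the per-leaf sufficient statistic can be sketched as a nested counter in Python. The class and method names below are hypothetical, not part of VFDT or SAMOA:

```python
from collections import defaultdict

class LeafStatistics:
    """Per-leaf sufficient statistics: a counter n[i][j][k] for
    attribute i, attribute value j, and class k (illustrative names)."""
    def __init__(self):
        self.n = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
        self.total = 0.0

    def update(self, attributes, label, weight=1.0):
        # One increment per (attribute index, observed value, class label).
        for i, value in enumerate(attributes):
            self.n[i][value][label] += weight
        # n_l grows by the instance weight.
        self.total += weight
```

A leaf would call `update` once per training instance routed to it, mirroring lines 2 and 3 of Algorithm 1.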

A single instance usually does not change the distribution

significantly enough, therefore the algorithm tries to grow the tree only after a certain number of instances n_min has been sorted to the leaf. In addition, the algorithm does not grow the tree if all the instances that reached l belong to the same class (line 4).

To grow the tree, the algorithm attempts to ﬁnd a good

attribute to split the leaf on. The algorithm iterates through

each attribute and calculates the corresponding splitting criterion G_l(X_i) (line 5). This criterion is an information-theoretic function, such as entropy or information gain, which is computed by making use of the counters n_ijk. The algo-

rithm also computes the criterion for the scenario where no

5 http://blog.datadive.net/interpreting-random-forests


split takes place (X_∅). Domingos and Hulten [8] refer to this inclusion of a no-split scenario with the term pre-pruning.

The algorithm then chooses the best (X_a) and the second best (X_b) attributes based on the criterion (lines 6 and 7). By using these chosen attributes, it computes the difference of their splitting criterion values ΔG_l = G_l(X_a) − G_l(X_b). To determine whether the leaf needs to be split, it compares the difference ΔG_l to the Hoeffding bound ε for the current confidence parameter δ (where R is the range of possible values of the criterion). If the difference is larger than the bound (ΔG_l > ε), then X_a is the best attribute with high confidence 1 − δ, and can therefore be used to split the leaf. Line 9 shows the complete condition to split the leaf.

If the best attribute is the no-split scenario (X_∅), the algorithm does not perform any split. The algorithm also uses a tie-breaking mechanism, governed by a parameter τ, to handle the case where the difference in splitting criterion between X_a and X_b is very small. If the Hoeffding bound becomes smaller than τ (ΔG_l < ε < τ), then the current best attribute is chosen regardless of the value of ΔG_l.

If the algorithm splits the node, it replaces the leaf l with an internal node. It also creates branches based on the best attribute that lead to newly created leaves, and initializes these leaves using the class distribution observed at the best attribute the branches start from (lines 10 to 13).
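For concreteness, the splitting criterion can be computed from one attribute's slice of the n_ijk counters. The Python sketch below computes information gain; the function names and the dictionary layout are illustrative, not the paper's code:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(value_class_counts):
    """Information gain of splitting a leaf on one attribute.
    value_class_counts: {attribute_value: {class: count}}, i.e., one
    attribute's slice of the n_ijk counters."""
    class_totals = Counter()
    for per_class in value_class_counts.values():
        class_totals.update(per_class)
    total = sum(class_totals.values())
    before = entropy(list(class_totals.values()))
    # Weighted entropy of the children created by the split.
    after = sum(
        (sum(per_class.values()) / total) * entropy(list(per_class.values()))
        for per_class in value_class_counts.values()
    )
    return before - after
```

A perfectly separating attribute recovers the full class entropy, while an uninformative one yields a gain of zero; the leaf would pick the attribute maximizing this value as X_a.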

3.2 SAMOA

Apache SAMOA6is an open-source distributed stream

mining platform initially developed at Yahoo Labs [6]. It

allows easy implementation and deployment of distributed

streaming machine learning algorithms on supported dis-

tributed stream processing engines (DSPEs). Besides, it

provides the ability to integrate new DSPEs into the frame-

work and leverage their scalability for performing big data

mining.

SAMOA is both a framework and a library. As a frame-

work, it allows the algorithm developer to abstract from the

underlying execution engine, and therefore reuse their code

on diﬀerent engines. It features a pluggable architecture

that allows it to run on several distributed stream process-

ing engines such as Storm,7 Samza,8 and Flink.9 This capability is achieved by designing a minimal API that captures the essence of modern DSPEs. This API also allows developers to easily write new bindings to port SAMOA to new execution

engines. SAMOA takes care of hiding the diﬀerences of the

underlying DSPEs in terms of API and deployment.

An algorithm in SAMOA is represented by a directed

graph of operators that communicate via messages along

streams which connect pairs of nodes. Borrowing the ter-

minology from Storm, this graph is called a Topology. Each

node in a Topology is a Processor that sends messages through a Stream. A Processor is a container for the code that imple-

ments the algorithm. At runtime, several parallel replicas

of a Processor are instantiated by the framework. Repli-

cas work in parallel, with each receiving and processing a

portion of the input stream. These replicas can be instanti-

ated on the same or diﬀerent physical computing resources,

according to the DSPE used. A Stream can have a single

source but several destinations (akin to a pub-sub system).

6 https://samoa.incubator.apache.org
7 https://storm.apache.org
8 https://samza.apache.org
9 https://flink.apache.org

Figure 1: High level diagram of the VHT topology. [Diagram: a Source sends instance events to the Model Aggregator via shuffle grouping; the Model Aggregator sends attribute events to the Local Statistics via key grouping, and split messages via all grouping.]

A Topology is built by using a Topology Builder, which con-

nects the various pieces of user code to the platform code

and performs the necessary bookkeeping in the background.

A processor receives Content Events via a Stream. Al-

gorithm developers instantiate a Stream by associating it

with exactly one source Processor. When the destination

Processor wants to connect to a Stream, it needs to specify

the grouping mechanism which determines how the Stream

partitions and routes the transported Content Events. Cur-

rently there are three grouping mechanisms in SAMOA:

• Shuffle grouping, which routes the Content Events in a round-robin way among the corresponding Processor instances. This grouping ensures that each Processor instance receives the same number of Content Events from the stream.
• Key grouping, which routes the Content Events based on their key, i.e., the Content Events with the same key are always routed by the Stream to the same Processor instance.
• All grouping, which replicates the Content Events and broadcasts them to all downstream Processor instances.
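The three grouping mechanisms can be sketched as simple routing functions. This is an illustrative Python fragment with hypothetical names; the actual routing in SAMOA is delegated to the underlying engine:

```python
def shuffle_group(event_index, num_instances):
    """Round-robin: the i-th event goes to instance i mod p."""
    return event_index % num_instances

def key_group(key, num_instances):
    """Hash-based: the same key always lands on the same instance."""
    return hash(key) % num_instances

def all_group(num_instances):
    """Broadcast: every downstream instance receives the event."""
    return list(range(num_instances))
```

Key grouping is what gives the stream its partitioning guarantee: two events with equal keys are deterministically routed to the same Processor instance within a run.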

4. ALGORITHM

In this section, we explain the details of our proposed algo-

rithm, the Vertical Hoeﬀding Tree, which is a data-parallel,

distributed version of the Hoeﬀding tree described in Sec-

tion 3. First, we describe the parallelization and the ideas

behind our design choice. Then, we present the engineering

details and optimizations we employed to obtain the best

performance.

4.1 Vertical Parallelism

Data parallelism is a way of distributing work across dif-

ferent nodes in a parallel computing environment such as

a cluster. In this setting, each node executes the same

operation on diﬀerent parts of the dataset. Contrast this

deﬁnition with task parallelism (aka pipelined parallelism),

where each node executes a diﬀerent operator and the whole

dataset ﬂows through each node at diﬀerent stages. When

applicable, data parallelism is able to scale to much larger


deployments, for two reasons: (i) data usually has much higher intrinsic parallelism that can be leveraged compared to tasks, and (ii) it is easier to balance the load of a data-

parallel application compared to a task-parallel one. These

attributes have led to the high popularity of the currently

available DSPEs. For these reasons, we employ data paral-

lelism in the design of VHT.

In machine learning, it is common to think about data

in matrix form. A typical linear classification formulation requires finding a vector x such that A · x ≈ b, where A is the data matrix and b is a class label vector. The matrix A is n × m-dimensional, with n being the number of data instances and m being the number of attributes of the dataset.

Clearly, there are two ways to slice this data matrix to

obtain data parallelism: by row or by column. The former is called horizontal parallelism, the latter vertical parallelism. With horizontal parallelism, data instances are inde-

pendent from each other, and can be processed in isolation

while considering all available attributes. With vertical par-

allelism, instead, attributes are considered independent from

each other.
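The two slicings can be sketched directly on a small matrix stored as a list of rows. The helper names below are illustrative, not part of the paper's implementation:

```python
def horizontal_partition(rows, p):
    """Slice the n-by-m data matrix by rows: worker w receives whole
    instances (every p-th row, starting at w)."""
    return [rows[w::p] for w in range(p)]

def vertical_partition(rows, p):
    """Slice by columns: worker w receives, for every instance, only the
    attributes j with j mod p == w."""
    m = len(rows[0])
    return [[[row[j] for j in range(m) if j % p == w] for row in rows]
            for w in range(p)]
```

With horizontal partitioning each worker sees full instances but only some of them; with vertical partitioning each worker sees all instances but only some attributes, which is the layout the VHT exploits.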

The fundamental operation of the algorithm is to accumulate statistics n_ijk (i.e., counters) for triplets of attribute i, value j, and class k, for each leaf of the tree. The counters

for each leaf are independent, so let us consider the case for

a single leaf. These counters, together with the learned tree

structure, form the state of the VHT algorithm.

Diﬀerent kinds of parallelism distribute the counters across

computing nodes in diﬀerent ways. When using horizontal

parallelism, the instances are distributed randomly, there-

fore multiple instances of the same counter need to exist

on several nodes. On the other hand, when using vertical

parallelism, the counters for one attribute are grouped on a

single node.

This latter design has several advantages. First, by hav-

ing a single copy of the counter, the memory requirements

for the model are the same as in the sequential version. In

contrast, with horizontal parallelism a single attribute may be tracked on every node, thus the memory requirements grow linearly with the parallelism level. Second, with each attribute tracked independently, the computa-

tion of the split criterion can be performed in parallel by

several nodes. Conversely, with horizontal partitioning the

algorithm needs to (centrally) aggregate the partial counters

before being able to compute the splitting criterion.

Of course, the vertically-parallel design also has its drawbacks. In particular, horizontal parallelism achieves a good load balance much more easily, even though solutions for this problem have recently been proposed [17, 18]. In addition, if the instance stream arrives in row format, it needs to be transformed into column format, and this transformation generates additional CPU overhead at the source. In-

deed, each attribute that constitutes an instance needs to

be sent independently, and needs to carry the class label of

its instance. Therefore, both the number of messages and

the size of the data transferred increase.

Nevertheless, as shown in Section 6, the advantages of

vertical parallelism outweigh its disadvantages for several

real-world settings.

4.2 Algorithm Structure

We are now ready to explain the structure of the VHT

algorithm. Recall from Section 3 that there are two main

Algorithm 2 Model Aggregator: VerticalHoeffdingTreeInduction(E, VHT_tree)

Require: E, a training instance from the source PI, wrapped in an instance content event
Require: VHT_tree, the current state of the decision tree in the model-aggregator PI
1: Use VHT_tree to sort E into a leaf l
2: Send attribute content events to local-statistic PIs
3: Increment n_l, the number of instances seen at l
4: if n_l mod n_min = 0 and not all instances seen at l belong to the same class then
5:   Add l into the list of splitting leaves
6:   Send a compute content event with the id of leaf l to all local-statistic PIs
7: end if

Algorithm 3 Local Statistic: UpdateLocalStatistic(attribute, local_statistic)

Require: attribute, an attribute content event
Require: local_statistic, the local statistic, which could be implemented as Table<leaf_id, attribute_id>
1: Update local_statistic with the data in attribute: attribute value, class value, and instance weight

parts to the Hoeﬀding tree algorithm: sorting the instances

through the current model, and accumulating statistics of

the stream at each leaf node. This separation offers a neat cut point to modularize the algorithm into two separate components. We call the first component model aggregator, and

the second component local statistics. Figure 1 presents a

visual depiction of the algorithm, speciﬁcally, of its compo-

nents and of how the data ﬂow among them.

The model aggregator holds the current model (the tree)

produced so far. Its main duty is to receive the incoming

instances and sort them to the correct leaf. If the instance

is unlabeled, the model predicts the label at the leaf and

sends it downstream (e.g., for evaluation). Otherwise, if the

instance is labeled it is used as training data. The VHT

decomposes the instance into its constituent attributes, at-

taches the class label to each, and sends them independently

to the following stage, the local statistics. Algorithm 2 shows the pseudocode for the model aggregator.
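The decomposition step (line 2 of Algorithm 2) can be sketched as follows. The event layout and field names are hypothetical, chosen only to mirror the attribute content event described in the text:

```python
def decompose(instance_attributes, label, leaf_id):
    """Turn one labeled instance into per-attribute content events,
    each carrying the class label and keyed by (leaf id, attribute id)
    so that key grouping routes it to the right local statistic."""
    return [
        {"key": (leaf_id, i), "attribute_id": i, "value": v, "class": label}
        for i, v in enumerate(instance_attributes)
    ]
```

Note that the class label is replicated into every event, which is one source of the extra message volume discussed in Section 4.1.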

The local statistics contain the sufficient statistics n_ijk for a set of attribute-value-class triples. Conceptually, the

local statistics Processor can be viewed as a large distributed

table, indexed by leaf id (row), and attribute id (column).

The value of the cell represents a set of counters, one for

each pair of attribute value and class. The local statistics

simply accumulate statistics on the data sent to them by the model aggregator. Pseudocode for the update function is

shown in Algorithm 3.

In SAMOA, we implement vertical parallelism by connect-

ing the model to the statistics via key grouping. We use a

composite key composed of the leaf id and the attribute id. Horizontal parallelism can similarly be implemented via shuffle

grouping on the instances themselves.

Messages. During the execution of the VHT, the types of content events that are sent and received by the different parts of the algorithm are summarized in Table 1.


Table 1: Different types of content events used during the execution of the VHT algorithm.

Name          Parameters                                   From                   To
instance      <attribute_1, ..., attribute_m, class C>     Source PI              Model Aggregator PI
attribute     <attribute id, attribute value, class C>     Model Aggregator PI    Local Statistic PI id = <leaf id + attribute id>
compute       <leaf id>                                    Model Aggregator PI    All Local Statistic PIs
local-result  <X_a^local, X_b^local>                       Local Statistic PI id  Model Aggregator PI
drop          <leaf id>                                    Model Aggregator PI    All Local Statistic PIs

Algorithm 4 Local Statistic: ReceiveComputeMessage(compute, local_statistic)

Require: compute, a compute content event
Require: local_statistic, the local statistic, which could be implemented as Table<leaf_id, attribute_id>
1: Get the id of leaf l from the compute content event
2: For each attribute i that belongs to leaf l in the local statistic, compute G_l(X_i)
3: Find X_a^local, the attribute with the highest G_l based on the local statistic
4: Find X_b^local, the attribute with the second highest G_l based on the local statistic
5: Send X_a^local and X_b^local using a local-result content event to the model-aggregator PI via the computation-result stream

Leaf splitting. Periodically, the model aggregator tries to see if the model needs to evolve by splitting a leaf. When a sufficient number of instances have been sorted through a leaf, it sends a broadcast message to the statistics, asking them to compute the split criterion for the given leaf id. The statistics get the table corresponding to the leaf and, for each attribute, compute the splitting criterion in parallel (e.g., information gain or entropy). Each local statistic Processor then sends back to the model the top two attributes according to the chosen criterion, together with their scores. The model aggregator simply needs to compute the overall top two attributes, apply the Hoeffding bound, and see whether the leaf needs to be split. Refer to Algorithm 4 for the pseudocode.

Two cases can arise: the leaf needs splitting, or it does not. In the latter case, the algorithm simply continues without taking any action. In the former case, the model modifies the tree by splitting the leaf on the selected attribute, and generates one new leaf for each possible value of the branch. Then, it broadcasts a drop message containing the former leaf id to the local statistics. This message is needed to release the resources held by the leaf and make space for the newly created leaves. Subsequently, the tree can resume sorting instances to the new leaves. The local statistics create a new table for the new leaves lazily, whenever they first receive a previously unseen leaf id. Algorithm 5 shows this part of the process. In its simplest version, the algorithm drops the incoming instances that arrive while the tree adjustment is performed. We show in the next section an optimized version that buffers them to improve accuracy.
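The aggregator side of this decision can be sketched as follows: merge the top-two candidates reported by all local statistics, compute the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2 n_l)), and split when the best attribute clearly wins or on a near-tie (τ). The function names and default parameter values are ours, for illustration only.

```python
import math

# Sketch of the model aggregator's split decision: combine the top-two
# attributes reported by the local statistics, apply the Hoeffding bound,
# and decide whether the leaf splits. Names and defaults are illustrative.

def hoeffding_bound(r, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(r * r * math.log(1.0 / delta) / (2.0 * n))

def decide_split(local_results, n_leaf, r=1.0, delta=1e-7, tau=0.05):
    """local_results: (gain, attribute) pairs, the candidates returned by
    every local statistic. Returns the attribute to split on, or None."""
    (g_a, x_a), (g_b, _) = sorted(local_results, reverse=True)[:2]
    eps = hoeffding_bound(r, delta, n_leaf)
    # Split when the best attribute clearly wins, or on a near-tie (tau).
    if g_a - g_b > eps or eps < tau:
        return x_a
    return None

results = [(0.9, 'x1'), (0.2, 'x2'), (0.85, 'x3'), (0.1, 'x4')]
decide_split(results, n_leaf=10000)   # enough evidence: split on 'x1'
decide_split(results, n_leaf=100)     # bound still too wide: no split
```

With few instances the bound ε is wide and the aggregator waits; as n_l grows, ε shrinks and the same gain difference eventually justifies a split.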

5. VHT OPTIMIZATIONS

We now introduce three optimizations that improve the performance of the VHT: model replication, optimistic split execution, and instance buffering. The first deals with the throughput and I/O capability of the algorithm, by removing its single bottleneck at the model aggregator. The latter two deal with the problem of computing the split criterion in a distributed environment.

Algorithm 5 Model Aggregator: Receive(local result, VHT tree)
Require: local result is a local-result content event
Require: VHT tree is the current state of the decision tree in the model-aggregator PI
1: Get the correct leaf l from the list of splitting leaves
2: Update X_a and X_b in the splitting leaf l with X_a^local and X_b^local from local result
3: if local results from all local-statistic PIs received or timeout reached then
4:   Compute the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2 n_l))
5:   if X_a ≠ X_∅ and (G_l(X_a) − G_l(X_b) > ε or ε < τ) then
6:     Replace l with a split node on X_a
7:     for all branches of the split do
8:       Add a new leaf with sufficient statistics derived from the split node
9:     end for
10:    Send a drop content event with the id of leaf l to all local-statistic PIs
11:  end if
12: end if

Figure 2: Deployment diagram for VHT. (Components: Source (n), Model (n), Stats (n), Evaluator (1); streams: instance, attribute, control, split, result; groupings: shuffle, key, all.)

Model replication. If the model is maintained in a single Processor, it can easily become a bottleneck in the construction and maintenance of the tree, especially under a high instance arrival rate. Instead, parallel replication and maintenance of the model on multiple Processors allows for higher throughput, but comes with increased management complexity and a possible drop in accuracy.

In order to materialize the model replication of the VHT, two issues must be resolved: how to distribute incoming instances to models, and how to perform consistent leaf splitting across all models, thus guaranteeing consistent maintenance of the tree in all models. The first issue can be easily solved via shuffle grouping from the source Processor to all parallel model aggregator Processors. Assuming p parallel models, shuffle grouping routes incoming instances in a round-robin fashion among them, guaranteeing an equal split of instances among the models.
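Round-robin shuffle grouping can be sketched in a few lines; this is an illustration of the dispatch pattern, not SAMOA's grouping implementation.

```python
from itertools import cycle

# Sketch of shuffle grouping from the source to p model replicas:
# plain round-robin dispatch, so each model sees 1/p of the stream.
# Illustrative only, not the actual SAMOA grouping implementation.

def shuffle_group(instances, p):
    """Assign instances to p models in round-robin order."""
    assignment = {m: [] for m in range(p)}
    for model, inst in zip(cycle(range(p)), instances):
        assignment[model].append(inst)
    return assignment

parts = shuffle_group(list(range(10)), p=3)
# The three models receive 4, 3, and 3 instances: an equal split,
# up to one instance.
```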

The second issue requires a more elaborate solution, for two reasons. First, in a fully distributed mode, each model can decide to send a control message to the statistics at any time. To escape the problem of having inconsistent models, one model (e.g., the first to be created in the topology construction) is appointed the role of primary model and is responsible for broadcasting the control message to the statistics. The frequency of this broadcast is adjusted to take into account the level of model parallelism p, and the fact that each model receives 1/p-th of the total instances.

Second, the exact number of instances n_l seen at each leaf l is not available at any central point. Instead, each model handles a portion of the stream, and thus only a partial count of instances for a leaf l (n'_l) is available to each model for the computation of the Hoeffding bound. To remedy this problem, a naive approach would be to estimate, at the models, that n_l ≈ n'_l × p. However, this approach can over- or under-estimate the true number of instances seen per leaf if the instances are not dense, i.e., if they do not contain values for all attributes.

A better approach is to have the local statistics broadcast back to all models their estimate of n'_l, along with their top two attributes X_a^local and X_b^local. Note that, for sparse instances, the value of n'_l is still an estimate and not an exact value, due to the way attributes are distributed to statistics via key grouping on the composite key (leaf id, attribute id). Indeed, for dense instances, each instance gets decomposed into all its attributes, so the model sends a message per instance to the statistics. However, for sparse instances, each instance gets decomposed into only a subset of its attributes (those with non-zero values). Therefore, each local statistic may receive a different number of instances per leaf, and thus its counter n'_l underestimates the real value n_l. However, the models can now independently compute n''_l = max n'_l, the maximum over all received estimates n'_l. This value is a good estimate of the true value of n_l, especially for real-world skewed datasets, where one of the attributes is extremely frequent. Finally, the models can compute the Hoeffding bound for the particular leaf, and decide in a consistent manner across all models whether to split, thus allowing the maintenance of the same tree in all model processors.
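The combination step above is a one-liner; the sketch below only illustrates the max-of-estimates idea, with hypothetical values.

```python
# Sketch of the consistent count estimate with p model replicas: each
# local statistic broadcasts its per-leaf counter n'_l, and every model
# takes n''_l = max over the received estimates as the instance count
# for the Hoeffding bound. Values below are hypothetical.

def combined_leaf_count(local_estimates):
    """n''_l = max of the n'_l reported by the local statistics."""
    return max(local_estimates)

# With sparse instances, statistics see different subsets of attributes,
# so their counters diverge; the max is the least-biased underestimate.
reports = [940, 310, 120]              # n'_l at three local statistics
n_leaf = combined_leaf_count(reports)  # 940
```

Since every model receives the same broadcast, each computes the same n''_l, and therefore takes the same split decision, keeping the trees identical.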

Optimistic split execution. In the simplest version of the VHT algorithm, whenever the decision on splitting is being taken, labeled instances are simply thrown away. This behavior clearly wastes useful data, and is thus not desirable.

Note that there are two possible outcomes when a split decision is taken. If the algorithm decides to split the current leaf, all the statistics accumulated so far for the leaf are dropped. Otherwise, the leaf keeps accumulating statistics. In either case, the algorithm is better served by using the instances that arrive during the split. If the split is taken, the in-transit instances have no effect in any case. However, if the split is not taken, the instances can be correctly used to accumulate statistics.

Given these observations, we modify the VHT algorithm to keep sending instances that arrive during splits to the local statistics. We call this variant of the algorithm wk(0).

Instance buffering. The feedback for a split from the local statistics to the model aggregator comes with a delay that can affect the performance of the model. While the model is waiting for this feedback to decide whether a split should be taken, the information from the instances that arrive can be lost if the node splits. To avoid this waste, we add a buffer that stores instances in the model during a split decision. The algorithm can replay these instances if the model decides to split. That is, instances that arrive during a split decision are sent downstream and are accounted for in the current local statistics. If a split occurs, these statistics are dropped, and the instances are replayed from the buffer before resuming normal operations. Conversely, if no split occurs, the buffer is simply dropped.

To avoid increasing the memory pressure of the algorithm, the buffer resides on disk. Access to the buffer is sequential both while writing and while reading, so it does not represent a bottleneck for the algorithm. We also limit the maximum size of the buffer, to avoid delaying newly arriving instances excessively. The optimal size of the buffer depends on the number of attributes of the instances, the arrival rate, the delay of the feedback from the local statistics, and the specific hardware configuration. Therefore, we let the user customize its size with a parameter z, and we refer to this version of the algorithm as wk(z).
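The wk(z) buffer semantics can be sketched as below. This in-memory version is a simplification (the paper's buffer resides on disk), and the class and method names are ours.

```python
from collections import deque

# Sketch of the wk(z) instance buffer: while a split decision for a leaf
# is pending, up to z instances are kept; on a split they are replayed
# through the new tree, otherwise the buffer is discarded. In-memory
# simplification of the disk-backed buffer; names are illustrative.

class SplitBuffer:
    def __init__(self, z):
        self.buffer = deque(maxlen=z)   # bounded buffer of size z

    def offer(self, instance):
        """Keep the instance if there is room; it flows downstream
        to the local statistics either way."""
        if len(self.buffer) < self.buffer.maxlen:
            self.buffer.append(instance)
            return True
        return False                    # buffer full: not retained

    def resolve(self, did_split, replay):
        """On a split, replay buffered instances through the new tree
        (old statistics were dropped); either way, clear the buffer."""
        if did_split:
            for inst in self.buffer:
                replay(inst)
        self.buffer.clear()
```

If no split occurs the instances were already incorporated downstream, so clearing the buffer without replay loses nothing.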

Timeout. Each model waits for a timeout to receive all responses back from the statistics before computing the new splits. This timeout is primarily a system check to avoid the model waiting indefinitely. The timeout parameter may impact the performance of the tree: if it is too large, many instances will not fit in the available buffer and will therefore be lost. Thus, the size of the buffer is closely tied to this timeout parameter: the buffer should hold enough instances to be replayed if the model has a leaf that decides to split.

6. EXPERIMENTS

In our experimental evaluation of the VHT method, we aim to answer the following questions:

Q1: How does a centralized VHT compare to a centralized Hoeffding tree (available in MOA) with respect to accuracy and throughput?
Q2: How does the vertical parallelism used by VHT compare to horizontal parallelism?
Q3: What is the effect of the number and density of attributes?
Q4: How does discarding or buffering instances affect the performance of VHT?

6.1 Experimental setup

In order to study these questions, we experiment with five datasets (two synthetic generators and three real datasets), five different versions of the Hoeffding tree algorithm, and up to four levels of computing parallelism. We measure classification accuracy during the execution and at the end, as well as throughput (number of classified instances per second). We execute each experimental configuration ten times, and report the average of these measures.

Synthetic datasets. We use synthetic data streams produced by two random generators: one for dense and one for sparse attributes.


• Dense attributes are extracted from a random decision tree. We test different numbers of attributes, and include both categorical and numerical types. The label for each configuration is the number of categorical-numerical attributes used (e.g., 100-100 means the configuration has 100 categorical and 100 numerical attributes). We produce 10 differently seeded streams with 1M instances for each tree, with one of two balanced classes in each instance, and take measurements every 100k instances.

• Sparse attributes are extracted from a random tweet generator. We test different dimensionalities for the attribute space: 100, 1k, and 10k. These attributes represent the appearance of words from a predefined bag-of-words. On average, the generator produces 15 words per tweet (the size of a tweet is Gaussian), and uses a Zipf distribution with skew z = 1.5 to select words from the bag. We produce 10 differently seeded streams with 1M tweets in each stream. Each tweet has a binary class chosen uniformly at random, which conditions the Zipf distribution used to generate the words.
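A generator of this kind can be sketched as follows. This is not the actual generator used in the experiments: it mirrors the stated parameters (Gaussian tweet size with mean 15, Zipf skew 1.5, balanced binary class) but simplifies away the class-conditioning of the word distribution, and all names are ours.

```python
import random

# Sketch of a sparse stream generator: each "tweet" samples a Gaussian
# number of words (mean 15) from a bag-of-words via a Zipf-like
# distribution with skew 1.5. Simplified illustration only.

def make_tweet(rng, vocab_size, skew=1.5, mean_words=15, sd_words=3):
    # Zipf weights: word at rank r has weight 1 / r^skew.
    weights = [1.0 / (rank ** skew) for rank in range(1, vocab_size + 1)]
    n_words = max(1, round(rng.gauss(mean_words, sd_words)))
    words = rng.choices(range(vocab_size), weights=weights, k=n_words)
    label = rng.randint(0, 1)           # balanced binary class
    # Sparse representation: only attributes with non-zero values.
    return {w: 1 for w in words}, label

rng = random.Random(42)
attrs, label = make_tweet(rng, vocab_size=1000)
```

The sparse representation explains the counter divergence discussed in Section 5: each instance yields messages only for its non-zero attributes.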

Real datasets. We also test VHT on three real data streams to assess its performance on benchmark data.
• (elec) Electricity. This dataset has 45,312 instances, 8 numerical attributes, and 2 classes.
• (phy) Particle Physics. This dataset has 50,000 instances for training, 78 numerical attributes, and 2 classes.
• (covtype) CovertypeNorm. This dataset has 581,012 instances, 54 numerical attributes, and 7 classes.

Algorithms. We compare the following versions of the Hoeffding tree algorithm.
• MOA: the standard Hoeffding tree in MOA.
• local: this algorithm executes VHT in a local, sequential execution engine. All split decisions are made sequentially in the same process, with no communication or feedback delays between statistics and model.
• wok: this algorithm discards instances that arrive during a split decision. This version is the vanilla VHT.
• wk(z): this algorithm sends instances that arrive during a split decision downstream. It also adds instances to a buffer of size z until it is full. If the split decision is taken, it replays the instances in the buffer through the new tree model. Otherwise, it discards the buffer, as the instances have already been incorporated in the statistics downstream.
• sharding: splits the incoming stream horizontally among an ensemble of Hoeffding trees. The final prediction is computed by majority voting. This method is an instance of horizontal parallelism applied to Hoeffding trees.

Experimental configuration. All experiments are performed on a Linux server with 24 cores (Intel Xeon X5650) clocked at 2.67GHz, L1d cache: 32kB, L1i cache: 32kB, L2 cache: 256kB, L3 cache: 12288kB, and 65GB of main memory. On this server, we run a Storm cluster (v0.9.3) and ZooKeeper (v3.4.6). We use SAMOA v0.4.0 (development version) and MOA v2016.04, available from the respective project websites.

The benchmark datasets are available at http://moa.cms.waikato.ac.nz/datasets/ and http://osmot.cs.cornell.edu/kddcup/datasets.html.

Figure 3: Accuracy of VHT executed in local mode on SAMOA compared to MOA, for dense and sparse datasets.

Figure 4: Execution time of VHT in local mode on SAMOA compared to MOA, for dense and sparse datasets.

We use several parallelism levels in the range p = 2, ..., 16, depending on the experimental configuration. For dense instances, we stop at p = 8 due to memory constraints, while for sparse instances we scale up to p = 16. We disable model replication (i.e., we use a single model aggregator), as in our setup the model is not the bottleneck.

6.2 Accuracy and time of VHT local vs. MOA

In this first set of experiments, we test whether VHT performs as well as its counterpart, the Hoeffding tree in MOA. This is mostly a sanity check to confirm that the algorithm used to build the VHT does not affect the performance of the tree when all instances are processed sequentially by the model. To verify this fact, we execute VHT local and MOA with both dense and sparse instances. Figure 3 shows that VHT local achieves the same accuracy as MOA, even besting it at times. However, VHT local always takes longer than MOA to execute, as shown in Figure 4. Indeed, the local execution engine of SAMOA is optimized for simplicity rather than speed. Therefore, the additional overhead required to interface VHT to DSPEs is not amortized by scaling the algorithm out. Future optimized versions of VHT and the local execution engine should be able to close this gap.

6.3 Accuracy of VHT local vs. distributed

Next, we compare the performance of VHT local with VHT built in a distributed fashion over multiple processors for scalability. We use up to p = 8 parallel statistics, due to memory restrictions, as our setup runs on a single machine. In this set of experiments we compare the different versions of VHT, wok and wk(z), to understand the impact of keeping instances for training after a model's split. The accuracy of the model might be affected, compared to the local execution, due to delays in the feedback loop between statistics and model. That is, instances arriving during a split will be classified using an older version of the model compared to the sequential execution. As our target is a distributed system where independent processes run without coordination, this delay is a characteristic of the algorithm as much as of the distributed SPE we employ.

We expect that buffering instances and replaying them when a split is decided would improve the accuracy of the model. In fact, this is the case for dense instances with a small number of attributes (i.e., around 200), as shown in Figure 5. However, when the number of available attributes increases significantly, the load imposed on the model seems to outweigh the benefits of keeping the instances for replaying. We conjecture that the increased load of computing the splitting criterion in the statistics further delays the feedback needed to compute the split. Therefore, a larger number of instances are classified with an older model, thus negatively affecting the accuracy of the tree. In this case, the additional load imposed by replaying the buffer further delays the split decision. For this reason, the accuracy of VHT wk(z) drops by about 30% compared to VHT local. Conversely, the accuracy of VHT wok drops more gracefully, and is always within 18% of the local version.

VHT always performs approximately 10% better than sharding. For dense instances with a large number of attributes (20k), sharding fails to complete because its memory requirements exceed the available memory. Indeed, sharding builds a full model for each shard, on a subset of the stream. Therefore, its memory requirements are p times higher than a standard Hoeffding tree.

When using sparse instances, the number of attributes per instance is constant, while the dimensionality of the attribute space increases. In this scenario, increasing the number of attributes does not put additional load on the system. Indeed, Figure 6 shows that the accuracy of all versions is quite similar, and close to the local one. This observation is in line with our conjecture that overload on the system is the cause of the drop in accuracy on dense instances.

We also study how the accuracy evolves over time. In general, the accuracy of all algorithms is rather stable, as shown in Figures 7 and 8. For instances with 10 to 100 attributes, all algorithms perform similarly. For dense instances, the versions of VHT with buffering, wk(z), outperform wok, which in turn outperforms sharding. This result confirms that buffering is beneficial for a small number of attributes. When the number of attributes increases to a few thousand per instance, the performance of these more elaborate algorithms drops considerably. However, VHT wok continues to perform relatively well, and better than sharding. This performance, coupled with a good speedup over MOA (as shown next), makes it a viable option for streams with a large number of attributes and a large number of instances.

6.4 Speedup of VHT distributed vs. MOA

Since the accuracy of VHT wk(z) is not satisfactory for both types of instances, next we focus our investigation on VHT wok.

Figure 9 shows the speedup of VHT for dense instances. VHT wok is about 2-10 times faster than VHT local and up to 4 times faster than MOA. Clearly, the algorithm achieves a higher speedup when more attributes are present in each instance, as (i) there is more opportunity for parallelization, and (ii) the implicit load shedding caused by discarding instances during splits has a larger effect. Even though sharding performs well in speedup with respect to MOA on a small number of attributes, it fails to build a model for a large number of attributes due to running out of memory. In addition, even for a small number of attributes, VHT wok outperforms sharding with a parallelism of 8. Thus, it is clear from the results that the vertical parallelism used by VHT offers better scaling behavior than the horizontal parallelism used by sharding.

Figure 7: Evolution of accuracy with respect to instances arriving, for several versions of VHT (local, wok, wk(z)) and sharding, for dense datasets.

Figure 8: Evolution of accuracy with respect to instances arriving, for several versions of VHT (local, wok, wk(z)) and sharding, for sparse datasets.

When testing the algorithms on sparse instances, as shown in Figure 10, we notice that VHT wok can reach up to 60 times the throughput of VHT local and 20 times that of MOA (for clarity, we only show the results with respect to MOA). Similarly to what we observed for dense instances, a higher speedup is achieved when a larger number of attributes are present for the model to process. This very large superlinear speedup (20× with p = 2) is due to the aggressive load shedding implicit in the wok version of VHT. The algorithm actually performs consistently less work than the local version and MOA.

However, note that for sparse instances the algorithm processes a constant number of attributes, albeit from an increasingly larger space. Therefore, in this setup, wok has a constant overhead for processing each sparse instance, differently from the dense case. VHT wok outperforms sharding in most scenarios, and especially for larger numbers of attributes and larger parallelism.

Increased parallelism does not impact the accuracy of the model (see Figures 5 and 6), but improves its throughput. Boosting the parallelism from 2 to 4 makes VHT wok up to 2 times faster. However, adding more processors does not improve the speedup, and in some cases there is a slowdown due to additional communication overhead (for dense instances). Particularly for sparse instances, parallelism does not impact accuracy, which enables handling large sparse data streams while achieving high speedup over MOA.

Figure 5: Accuracy of several versions of VHT (local, wok, wk(z)) and sharding, for dense datasets (parallelism 2, 4, 8).

Figure 6: Accuracy of several versions of VHT (local, wok, wk(z)) and sharding, for sparse datasets (parallelism 2, 4, 8, 16).

Figure 9: Speedup of VHT wok executed on SAMOA compared to MOA for dense datasets (parallelism 2, 4, 8).

6.5 Performance on real-world datasets

Tables 2 and 3 show the performance of VHT, either running in local mode or in a distributed fashion over a Storm cluster with a few processors. We also test two different versions of VHT: wok and wk(0). In the same tables we compare VHT's performance with MOA and sharding.

The results on these real datasets demonstrate that VHT can perform similarly to MOA with respect to accuracy, and at the same time process the instances faster. In fact, for the largest dataset, covtypeNorm, VHT wok exhibits a 1.8× speedup with respect to MOA, even though the number of attributes is not very large (54 numerical attributes). VHT wok also performs better than sharding, even though the latter is faster in some cases. However, the speedup offered by sharding decreases when the parallelism level is increased from 2 to 4 shards.

Table 2: Average accuracy (%) for different algorithms, with parallelism level (p), on the real-world datasets.

Dataset   MOA    VHT                                                 Sharding
                 local  wok p=2  wok p=4  wk(0) p=2  wk(0) p=4       p=2    p=4
elec      75.4   75.4   75.0     75.2     75.4       75.6            74.7   74.3
phy       63.3   63.8   62.6     62.7     63.8       63.7            62.4   61.4
covtype   67.9   68.4   68.0     68.8     67.5       68.0            67.9   60.0

Table 3: Average execution time (seconds) for different algorithms, with parallelism level (p), on the real-world datasets.

Dataset   MOA    VHT                                                 Sharding
                 local  wok p=2  wok p=4  wk(0) p=2  wk(0) p=4       p=2    p=4
elec      1.09   1      2        2        2          2               2      2.33
phy       5.41   4      3.25     4        3          3.75            3      4
covtype   21.77  16     12       12       13         12              9      11

6.6 Summary

In conclusion, our VHT algorithm has the following performance traits. We learned that for a small number of attributes, it helps to buffer incoming instances so that they can be used in future split decisions. For a larger number of attributes, the load on the model can be high, and larger delays can be observed in the integration of the feedback from the local statistics into the model. In this case, buffered instances may not be used on the most up-to-date model, which can penalize the overall accuracy of the model.

With respect to a centralized sequential tree model (MOA), VHT processes dense instances with thousands of attributes up to 4× faster, with only a 10-20% drop in accuracy. It can also process sparse instances with thousands of attributes up to 20× faster, with only a 5-10% drop in accuracy. Also, its ability to build the tree in a distributed fashion using tens of processors allows it to scale and accommodate thousands of attributes and parse millions of instances. Competing methods cannot handle these data sizes due to increased memory and computational complexity.

7. CONCLUSION

The rapid increase in the number of users of social media and of IoT devices has led to a multifold increase in the data available for analysis. For these data to be analyzed, and business or other insights to be extracted, new machine learning methods are required that are able to scale to large volumes of fast data arriving at high speed.

In this paper we presented the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees, which can be used to perform classification tasks on such large data streams arriving at high rates. VHT features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining big data streams, and is thus able to run on real-world clusters.

Through exhaustive experimentation, and in comparison to a centralized sequential tree model, we show that VHT can process dense and sparse instances with thousands of attributes up to 4× and 20× faster, respectively, with only a small degradation in accuracy. Also, VHT's ability to build the decision tree in a distributed fashion using tens of processors allows it to scale and accommodate thousands of attributes and parse millions of instances. We also show that competing distributed methods cannot handle the same data sizes due to memory and computational complexity.


Figure 10: Speedup of VHT wok executed on SAMOA compared to MOA for sparse datasets (parallelism 2, 4, 8, 16).
