Cross-Device Consumer Identification
Girma Kejela
Department of Electrical Engineering
and Computer Science
University of Stavanger
Stavanger, Norway
Email: girma.kejela@uis.no
Chunming Rong
Department of Electrical Engineering
and Computer Science
University of Stavanger
Stavanger, Norway
Email: chunming.rong@uis.no
Abstract—Nowadays, a typical household owns multiple digital devices that can be connected to the Internet. Advertising companies want to seamlessly reach the consumers behind the devices rather than the devices themselves. However, a consumer's identity becomes fragmented as they switch from one device to another. A naive approach is to use deterministic features such as user name, telephone number, and email address; however, consumers might refrain from giving away such personal information for privacy and security reasons. The challenge in the ICDM2015 contest is to develop an accurate probabilistic model for predicting cross-device consumer identity without using deterministic user information.
In this paper we present an accurate and scalable cross-device solution using an ensemble of Gradient Boosting Decision Trees (GBDT) and Random Forest. Our final solution ranks 9th on both the public and private leaderboards (LB) with an F0.5 score of 0.855.
Keywords—Ensemble; Xgboost; Deep Learning; GBM; Random Forest; ICDM2015 contest
I. INTRODUCTION
The ICDM 2015 contest is sponsored by Drawbridge, a leading company in probabilistic cross-device identity solutions. The task is to identify sets of computer cookies and mobile devices that belong to the same user. We entered the contest with the intention of designing an accurate and scalable probabilistic model that enables brands to target users across devices without asking them to log in (i.e., to give away their personal information).
Our approach involves data cleansing, predicting some of the missing values, joining mobile devices with computer cookies based on the frequency with which they are seen on the same IP address, feature engineering, supervised learning, model combination/ensembling, and searching for more matches where the model's confidence in the predicted best match is low. We used command-line tools such as 'sed' and 'cat' to clean the data. Missing values of some categorical variables (anonymous_c0 and anonymous_c1) in the device_all_basic and cookie_all_basic data were predicted from the other available variables. Prediction and replacement of the missing values was done in their respective native tables before joining them with the other tables. For example, to predict the missing anonymous_c0 values in the device_all_basic table, all the other variables except the index variables were used. The part of the data with known categories served as the training set, and the part with missing values served as the test set for prediction. We kept 15% of the training set as a validation set, used for optimizing the parameters of the model.
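The missing-value imputation described above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the classifier choice (scikit-learn's RandomForestClassifier) and the toy feature values are our own assumptions; only the column naming convention (anonymous_c0, etc.) comes from the contest data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def impute_categorical(df, target_col, feature_cols):
    """Predict missing values of one categorical column from the others.

    Rows where target_col is known form the training set; rows where it
    is missing form the prediction set, exactly as described in the text.
    """
    known = df[df[target_col].notna()]
    missing = df[df[target_col].isna()]
    if missing.empty:
        return df

    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(known[feature_cols], known[target_col])

    df = df.copy()
    df.loc[missing.index, target_col] = clf.predict(missing[feature_cols])
    return df

# Toy stand-in for device_all_basic: anonymous_c0 has gaps.
devices = pd.DataFrame({
    "anonymous_c0": [1, 0, 1, None, 0, None],
    "anonymous_5":  [3, 7, 3, 3, 7, 7],
    "anonymous_6":  [1, 2, 1, 1, 2, 2],
})
devices = impute_categorical(devices, "anonymous_c0",
                             ["anonymous_5", "anonymous_6"])
```

The same call would be made once per table (device_all_basic, cookie_all_basic) before any joining, so the imputed values travel with their native rows.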
II. PREPARING DATA FOR BINARY CLASSIFICATION
The first challenge we faced when we started to participate in this contest was to join the device, cookie, IP, and property tables and construct the training and test sets for binary classification. An attempt to join on any of the categorical or index variables resulted in a tremendously large dataset with a large proportion of false matches. In this work, a step-by-step approach was used to join devices and cookies based on common IP addresses, starting with the most frequent IP on which both the device and the cookie have been seen. As we observed throughout our experiments, cellular IPs yield a larger proportion of false matches than non-cellular IPs, leading to a huge dataset that is computationally expensive. Thus, we filtered out the majority of cellular IPs based on the ip_frequency variable. We also removed cookies with unknown drawbridge_handle, as keeping them failed to improve the performance of the model.
The assumption behind joining devices and cookies on common IP addresses is that if the same user owns multiple devices, those devices should be seen on the same IP address at least once. A device can have multiple IP matches, and the data might grow very large if we consider all possible matches. A reasonable workaround is to first return the most frequent cookie seen on the same IP as the device, then continue joining on the next most frequent cookie on that IP, until matching cookies have been found for most devices in the training data.
[Figure 1: diagram of the table-joining pipeline. id_all_ip and ipagg_all are joined on IP; device_all_basic is joined on device_id and cookie_all_basic on cookie_id with id_property; the resulting devices and cookies are then joined on IP according to Algorithm 1, producing device-cookie pairs.]
Figure 1. Joining the device, cookie, IP, and property tables
Algorithm 1 Join devices (D) and cookies (C) based on IP address
1: procedure JOIN(D, C)
2:   C1 ← the 5 most frequent cookies on each IP
3:   DC1 ← D ⋈ C1                 ▷ inner join on IP
4:   C2 ← C \ C1
5:   Cellular_cookies ← C2 with cellular IP
6:   uniq_cellular ← remove repeated cookies from Cellular_cookies, keeping those with the highest ip_frequency
7:   non_Cellular ← C2 with non-cellular IP
8:   new_Cookies ← non_Cellular ∪ uniq_cellular
9:   DC2 ← D ⋈ new_Cookies        ▷ inner join on IP
10:  DC ← DC1 ∪ DC2
11:  return DC                    ▷ return device-cookie pairs
12: end procedure
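Algorithm 1 can be sketched in pandas roughly as below. This is a simplified illustration under assumed schemas: the frame layouts (one row per device-IP and cookie-IP observation) and the is_cellular flag are our own stand-ins, not the contest's exact tables.

```python
import pandas as pd

def join_devices_cookies(devices, cookies, top_k=5):
    """Pandas sketch of Algorithm 1.

    devices: DataFrame [device_id, ip]
    cookies: DataFrame [cookie_id, ip, is_cellular, ip_frequency]
    """
    # Step 2: keep only the top_k most frequent cookies on each IP.
    c1 = (cookies.sort_values("ip_frequency", ascending=False)
                 .groupby("ip").head(top_k))
    dc1 = devices.merge(c1, on="ip")            # step 3: inner join on IP

    # Step 4: cookies not already covered by c1.
    c2 = cookies.loc[~cookies.index.isin(c1.index)]

    # Steps 5-6: for cellular IPs, keep one row per cookie,
    # the one with the highest ip_frequency.
    cellular = c2[c2["is_cellular"]]
    uniq_cellular = (cellular.sort_values("ip_frequency", ascending=False)
                             .drop_duplicates("cookie_id"))
    non_cellular = c2[~c2["is_cellular"]]       # step 7

    # Steps 8-10: join the remaining cookies and union both pair sets.
    new_cookies = pd.concat([non_cellular, uniq_cellular])
    dc2 = devices.merge(new_cookies, on="ip")
    return pd.concat([dc1, dc2]).drop_duplicates(["device_id", "cookie_id"])

# Toy usage: one device on IP "a", seven candidate cookies on the same IP.
devices = pd.DataFrame({"device_id": ["d1"], "ip": ["a"]})
cookies = pd.DataFrame({
    "cookie_id": [f"c{i}" for i in range(7)],
    "ip": ["a"] * 7,
    "is_cellular": [False] * 6 + [True],
    "ip_frequency": [7, 6, 5, 4, 3, 2, 1],
})
pairs = join_devices_cookies(devices, cookies)
```

In the real pipeline the second join (DC2) would additionally filter out most cellular IPs by ip_frequency, as the text explains.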
III. FEATURE ENGINEERING
All the features in the device_all_basic, cookie_all_basic, id_all_ip, ipagg_all, and id_all_property tables have been used. We did not see any improvement in performance from including features from the property_category table and considered it redundant. Five categorical features, namely anonymous_c0, anonymous_c1, anonymous_c2, country, and property_id, are common to both devices and cookies but are not used for joining the tables. Instead of representing them twice in the device-cookie pair, we replaced them with binary features: if an instance of a variable on the cookie side matches the corresponding variable on the device side, the new binary variable takes the value 1, otherwise 0. After replacing each pair with a single binary variable, the number of features is reduced from 48 to 43. This also removed variables with a large number of categories (e.g., anonymous_c2 and property_id), resulting in faster training.
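The replacement of each shared categorical pair with a single 0/1 match flag can be sketched as follows; the toy values and the dev_/c_ prefixes on the paired columns are illustrative.

```python
import pandas as pd

# Hypothetical device-cookie pair frame with paired categorical columns;
# country and anonymous_c0 stand in for the five shared variables.
pairs = pd.DataFrame({
    "dev_country": ["NO", "US", "DE"],
    "c_country":   ["NO", "GB", "DE"],
    "dev_anonymous_c0": [1, 0, 1],
    "c_anonymous_c0":   [1, 0, 0],
})

# Replace each (device, cookie) column pair with a single binary feature:
# 1 if the two sides agree, 0 otherwise, then drop the original pair.
for col in ["country", "anonymous_c0"]:
    pairs[f"match_{col}"] = (pairs[f"dev_{col}"] == pairs[f"c_{col}"]).astype(int)
    pairs = pairs.drop(columns=[f"dev_{col}", f"c_{col}"])
```

Applied to all five shared variables, this is what shrinks the feature set from 48 to 43 columns.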
Very rare categories of the categorical variables that are specific to the device or cookie tables have been removed. For instance, we removed instances of the device_os variable that occurred fewer than 2,000 times in a dataset of around 21 million data points, keeping the 48 most frequent categories. Similarly, we kept 110 categories from computer_browser_version, 6 from device_type, and 48 from computer_os_type. After excluding index variables such as device_id, cookie_id, device_drawbridge_handle, cookie_drawbridge_handle, and IP, we were left with 38 variables. The same set of variables has been used for all of the models, but we generated dummy variables from the non-binary categorical features for the Xgboost model.
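The rare-category filtering and the dummy encoding for Xgboost can be sketched as below. Note one deliberate simplification: the paper removes rare instances, while this sketch collapses them into a catch-all bucket, a close and common variant; the tiny threshold and toy values are for illustration only.

```python
import pandas as pd

def drop_rare_categories(df, col, min_count=2000, other="RARE"):
    """Collapse categories seen fewer than min_count times into one bucket.

    The 2,000 threshold is the one quoted in the text for device_os.
    """
    counts = df[col].value_counts()
    keep = counts[counts >= min_count].index
    return df[col].where(df[col].isin(keep), other)

# Toy example with a much smaller threshold for illustration.
df = pd.DataFrame({"device_os": ["ios"] * 5 + ["android"] * 4 + ["rare_os"]})
df["device_os"] = drop_rare_categories(df, "device_os", min_count=2)

# One-hot (dummy) encoding of the surviving categories, as used for Xgboost.
dummies = pd.get_dummies(df["device_os"], prefix="device_os")
```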
Table I shows the relative importance of the 10 most relevant variables as computed by the GBM model. Features with the 'dev_' prefix are device-side features and those with the 'c_' prefix are cookie-side features.
Table I
VARIABLE IMPORTANCE FOR THE 10 MOST IMPORTANT VARIABLES

Rank  Variable name             Rel. importance  Perc. importance
1     dev_ip_anonymous_c2       2578230.75       47.57%
2     c_ip_anonymous_c2         829895.06        15.31%
3     c_anonymous_5             460277.00        8.49%
4     c_idxip_anonymous_c3      378855.90        6.99%
5     computer_browser_version  210147.73        3.88%
6     device_os                 185905.98        3.43%
7     dev_idxip_anonymous_c3    146490.69        2.70%
8     computer_os_type          130295.96        2.40%
9     dev_ip_anonymous_c1       100744.43        1.86%
10    dev_anonymous_5           49326.52         0.91%
IV. SUPERVISED LEARNING
A user-related variable, drawbridge_handle, was given only for the training set; it was used to construct the binary target variable representing positive and negative matches. If the drawbridge_handle on the device side is the same as the drawbridge_handle on the cookie side, the device and cookie belong to the same user and the pair is a positive match; otherwise it is a negative match. Once the problem of matching devices and cookies that belong to a single user is transformed into a binary classification problem, we can apply state-of-the-art predictive models that separate true matches from false matches with high accuracy.
In many cases, a combination of learning models yields better predictive performance than any of the individual models. Our final result was obtained by combining Xgboost, Random Forest, and GBM. As can be seen from Table II, the best single model is Xgboost. Our test set contains 6 million data points, out of which only 61,156 devices have one or more cookies that belong to the same user as the device. The submission requires only these 61,156 devices with their true cookie matches, and we selected the final matches based on the maximum predicted probability of a positive match.
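A minimal sketch of the averaging ensemble and the per-device best-match selection follows; the frame layout and column names are illustrative, not the contest schema.

```python
import pandas as pd

# Hypothetical per-pair predicted probabilities from the three models.
pairs = pd.DataFrame({
    "device_id": ["d1", "d1", "d2", "d2"],
    "cookie_id": ["c1", "c2", "c3", "c4"],
    "p_xgb": [0.9, 0.3, 0.2, 0.6],
    "p_gbm": [0.8, 0.4, 0.1, 0.7],
    "p_rf":  [0.7, 0.2, 0.3, 0.5],
})

# Simple averaging ensemble of the three model outputs...
pairs["p_ens"] = pairs[["p_xgb", "p_gbm", "p_rf"]].mean(axis=1)

# ...then keep, for each device, the cookie with the highest
# ensemble probability as its submitted match.
best = pairs.loc[pairs.groupby("device_id")["p_ens"].idxmax()]
```

A real submission would additionally allow more than one cookie per device where several candidates score highly, since some users own multiple cookies.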
To further improve prediction performance, we selected the device ids in the test data whose maximum predicted probability was less than 0.4 and re-joined them with cookies based on all possible IP matches. We did this because a low probability for the best match could mean that the true match was not included in the test set, and we wanted to search for more possible matches among the cookies that were discarded during the joining process. This created a new 7-million-row test set from just 5,501 device IDs in the device_test_basic data. We predicted with the already trained GBM model and used the maximum probability to identify the best match for submission. This improved the score on the public LB from 0.852 to 0.855.
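The low-confidence rematching step can be sketched as follows; the 0.4 cut-off comes from the text, while the frames, column names, and toy values are our own illustration.

```python
import pandas as pd

def rematch_low_confidence(best, device_ip, cookie_ip, threshold=0.4):
    """Re-join low-confidence devices with cookies on ALL shared IPs.

    best:      DataFrame [device_id, max_prob] - each device's best score
    device_ip: DataFrame [device_id, ip] - every IP a device was seen on
    cookie_ip: DataFrame [cookie_id, ip] - every IP a cookie was seen on
    """
    low = best.loc[best["max_prob"] < threshold, ["device_id"]]
    # All IPs on which a low-confidence device was seen...
    dev_ips = low.merge(device_ip, on="device_id")
    # ...paired with every cookie ever seen on those IPs, with no
    # frequency filtering this time.
    return dev_ips.merge(cookie_ip, on="ip")[["device_id", "cookie_id"]]

best = pd.DataFrame({"device_id": ["d1", "d2"], "max_prob": [0.9, 0.2]})
device_ip = pd.DataFrame({"device_id": ["d1", "d2", "d2"],
                          "ip": ["a", "b", "c"]})
cookie_ip = pd.DataFrame({"cookie_id": ["c1", "c2", "c3"],
                          "ip": ["b", "b", "c"]})
candidates = rematch_low_confidence(best, device_ip, cookie_ip)
```

The expanded candidate set is then scored with the already trained GBM, and the maximum-probability cookie replaces the weak original match.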
Table II
LEADERBOARD SCORE (F0.5) OF INDIVIDUAL MODELS AND THEIR ENSEMBLES

Model                                       Public LB  Private LB  Remark
Deep learning                               0.8338     0.8351      Hidden layers: (128, 64, 64, 32), Rectifier with dropout
Xgboost                                     0.8531     0.8541      learning_rate: 0.01, depth: 10, num_round: 2000
GBM                                         0.84792    0.84793     learning_rate: 0.01, depth: 10, num_trees: 1500
RF                                          0.8455     0.8472      num_trees: 1000, sample_rate: 0.85
Ensemble: GBM, RF, Xgboost                  0.8525     0.8535      Combination method: averaging
Ensemble: GBM, RF, Xgboost with rematching  0.855391   0.855454    Combination method: averaging and maximum probability
V. SUMMARY AND CONCLUSION
The tables of the dataset were pre-processed and joined based on common variable matches. We used step-by-step IP-based joining to identify 176,133 cookies that belong to the same users as the devices given in the training all_basic data, out of 180,083 possible matches. Instead of considering all possible IP matches, which produces a tremendously large dataset, we filtered out the less frequent cellular IPs. This left circa 2.2% of the training set without their true matches, and we expect the same proportion to be missing from the test set, corresponding to a 2.2% error. The final error is the sum of this error and the error produced by the binary classification models.
Supervised machine learning models have been used to separate positive matches from false matches (i.e., cookies that have been seen on the same IP as a device but do not belong to the same user). The ensemble of GBM, Random Forest, and Xgboost outperforms each of the individual models. The best single model is Xgboost, with a public LB score of 0.853.
Discarding cellular IPs during the joining of devices with cookies helped to reduce the size of the training data, but it accounts for circa 2.2% of the final error. We tried to compensate for this error by selecting devices with a low predicted probability (in this case, less than 0.4) and joining them with cookies based on all possible IP matches, creating a new test set. We used the already trained GBM to predict on the new test set and selected the predictions with maximum probability as the best matches. This improved the F0.5 score on the private LB by circa 0.002. For better performance, our solution could be combined with other solutions that identified a higher percentage of device-cookie matches but used less accurate predictive models.