Content available from Journal of Advanced Transportation
Research Article
Driver and Path Detection through Time-Series Classification
Mario Luca Bernardi,1 Marta Cimitile,2 Fabio Martinelli,3 and Francesco Mercaldo3
1Giustino Fortunato University, Benevento, Italy
2Unitelma Sapienza, Rome, Italy
3National Research Council of Italy (CNR), Pisa, Italy
Correspondence should be addressed to Marta Cimitile; marta.cimitile@unitelma.it
Received 7 September 2017; Revised 17 January 2018; Accepted 12 February 2018; Published 22 March 2018
Academic Editor: Aboelmaged Noureldin
Copyright © Mario Luca Bernardi et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Driver identification and path kind identification are becoming critical topics, given the automobile industry's increasing interest in improving driver experience and safety and the need to reduce global environmental problems. Since a growing number of increasingly sophisticated and accurate car sensors and monitoring systems have been produced in recent years, several proposed approaches are based on the analysis of large amounts of real-time data describing the driving experience. In this work, a set of behavioral features extracted by a car monitoring system is proposed to perform driver identification and path kind identification and to evaluate the driver’s familiarity with a given vehicle. The proposed feature model is exploited using a time-series classification approach based on a multilayer perceptron (MLP) network to evaluate its effectiveness for the goals listed above. The experiment is done on a real dataset composed of totally observations (each observation consists of a given person driving a given car on a predefined path) and shows that the proposed features have a very good driver and path identification and profiling ability.
1. Introduction
Driver identification allows discovering the identity of a vehicle driver using what he possesses and/or his physical and behavioral characteristics []. This topic has recently become critical for the automobile industry, with a growing number of increasingly sophisticated and accurate car sensors and monitoring systems able to extract information about the driver (e.g., hand geometry, keystroke dynamics, and voice print). The reason for the interest in the topic is that driver identification may improve driver experience by allowing (i) safer driving and intelligent assistance in case of emergencies, (ii) more comfortable driving, and (iii) a reduction of global environmental problems []. Regarding driver safety, driver identification may support detecting changes in the driver (due to possible indisposition or drunkenness) and activating security procedures (e.g., a ring may invite the driver to stop). The study of the driver behavior for each segment of a road also allows profiling each road section, supporting the activation of alert signals when more caution is required (e.g., a vocal message may alert the driver to reduce the brake pressure in a dangerous curve) []. Concerning driver comfort, driver identification may, for instance, discover which member of the family is currently driving the car and, consequently, automatically configure the car equipment (e.g., radio volume and frequency, temperature, or even speed limit) [, ]. Finally, driver identification can be useful to suggest new car improvements based on the driver preferences, or new systems aimed at reducing car consumption and pollution based on the driving characteristics. Given these advantages of driver profiling and identification, several studies have been proposed in recent years focusing on the identification of driver physical and behavioral features. The physical features [, ] are stable human characteristics that have been widely used in the banking and forensic domains to guarantee higher security than the more traditional authentication system based on the ownership of a key (which can be easily bypassed when someone gets hold of the key). The behavioral features are used to detect individual personality features and have become the target of several studies in recent years that are mainly focused on speaker recognition []. The limit of these approaches is that they are based on the analysis of only one feature. This may cause high uncertainty in driver identification, especially if there is a noisy sensor. For this reason, new approaches based on multimodal identification systems have been introduced [, ]. They provide more accurate driver identification based on the detection and analysis of a larger set of behavioral features.
Hindawi
Journal of Advanced Transportation
Volume 2018, Article ID 1758731, 20 pages
https://doi.org/10.1155/2018/1758731
In this work, we propose a set of behavioral features to perform driver identification and path kind identification (i.e., dirt road, road with bumps, and highway) and to evaluate the driver’s familiarity with a given vehicle. Moreover, we exploit a time-series classification (TSC) approach based on a multilayer perceptron (MLP) network to evaluate the effectiveness of the proposed set of features, which are extracted using a CAN bus monitoring system.
We have identified three research questions (RQs):
RQ1: To what extent does the set of extracted features
allow driver identication?
RQ2: To what extent does the set of extracted features
allow path kind identication?
RQ3: Is it possible, using the available features, to
identify the driver’s familiarity with the vehicle (e.g.,
ownership)?
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the background of our study. Section 4 presents the proposed approach, discussing the overall classification process. The subsequent sections discuss a set of experiments to answer the proposed research questions using three datasets (made of driving sessions performed by ten drivers on four cars in five different paths). Finally, the last section provides conclusive remarks and future directions.
2. Related Work
In the past, collecting real-world automotive data was limited by the difficulty of equipping cars with sensors. Since the introduction of the CAN protocol (http://www.can.bosch.com), this limit has been overcome and driving style identification has become a very appealing scenario. The CAN protocol defines a generic communication standard for all the vehicle electronic devices. It can cooperate with the OBD II diagnostic connector (http://obd-elm.com/), which provides a candidate list of vehicle parameters to monitor along with how to encode their data.
Researchers in [] discuss a driver identification approach based on the driving behavior signals observed while the driver is following another vehicle. The analyzed signals (measured using a driving simulator) are the following: accelerator pedal, brake pedal, vehicle velocity, and distance from the vehicle in front. The approach obtains an identification rate equal to % for twelve drivers and equal to % for thirty drivers.
The accelerator and the steering wheel are analyzed in [] as characteristics to discriminate between different drivers. The authors employ a hidden Markov model (HMM) on the considered features to model the driver characteristics. They build two models for each driver: one trained on accelerator data and the other on steering wheel angle data. The models are then used to identify different drivers, obtaining an accuracy rate equal to %.
Another HMM-based approach to model driver behavior is proposed in []. This method employs a simulated driving environment to evaluate the effectiveness of the proposed solution. Van Ly [] explores the possibility of using the inertial sensors of the vehicle from the CAN bus to build a profile of the driver, observing braking and turning events to characterize an individual compared to acceleration events.
Authors in [, ] represent gas and brake pedal operation patterns with the Gaussian mixture model (GMM). They obtain an identification rate equal to .% using data extracted by a driving simulator and equal to .% for a field test with drivers, resulting in % and % error reduction, respectively, over a driver model based on raw pedal operation signals without spectral analysis. Considering data from steering wheel angle, brake status, acceleration status, and vehicle speed, researchers in [] model the driver behavior through HMMs and GMMs with the aim of capturing the sequence of driving characteristics acquired from the CAN bus information. They obtain % accuracy for action classification and % accuracy for driver identification.
Authors in [] classify real-world mechanical features from the CAN bus with four different classification algorithms: they obtain an accuracy equal to . using decision tree, equal to . using KNN, equal to . for RandomForest, and equal to . using the MLP algorithm. Researchers in [] classify a set of features extracted from the powertrain signals of the vehicle, showing that the learned classifier is able to recognize the human driving style based on the power demands placed on the vehicle powertrain with an overall accuracy equal to %.
In [], the features extracted from the accelerator and brake pedal pressure are used as inputs to a fuzzy neural network (FNN) system to ascertain the identity of the driver. Two fuzzy neural networks, namely, the evolving fuzzy neural network (EFuNN) and the adaptive network-based fuzzy inference system (ANFIS), are used to demonstrate the viability of the two proposed techniques.
Summarizing the results of the studies described above, we can conclude that the obtained identification rate ranges from % [] to . []. The method we propose is able to reach a precision rate equal to %, improving on the current literature in terms of precision. Furthermore, the existing methods are often tested in simulated environments, where a plethora of variables (such as traffic jams and the number of cars involved in the scenarios) are set a priori.
In contrast, we conduct experiments in a real-world environment, in order to take into account real-world variables that cannot be predicted. In addition, we perform a set of experiments aiming to identify the car owner regardless of the car and path. The other discussed methods usually perform the experiments considering a single setting: for example, in the experiment proposed in [] (the method for which the best precision values are obtained) the drivers under analysis perform the same path on the same car.
3. Background: Time-Series Classification Approaches
Machine Learning (ML) studies and implements algorithms that learn from monitored data and make predictions about their future values. Here, we focus on TSC algorithms used to classify sequences of observations of a phenomenon [], as effectively experimented in [] in another context.
A Time Series (TS) consists of a sequence of discrete-time observations evaluated at successive equally spaced points in time. Time series are widely used to describe the time course of a phenomenon, to predict its future trend, and to relate it to other phenomena under study. For this reason, they are adopted in several domains with different aims. A problem recurring in several domains is TS classification (TSC), which requires training a classifier on a set of cases, where each case contains an ordered set of real-valued attributes and a class label (qualifying its kind or nature). TSC problems arise in a wide range of fields including environmental sciences, computational biology, image processing, and software engineering.
ML approaches are widely adopted to perform such classification. There are two main families of ML algorithms:
(i) Supervised Learning (SL). The classifier is presented with example inputs and their desired outputs, given by an oracle, and the goal is to learn a general rule that maps inputs to outputs. The learning phase is the process of building a model able to discriminate the classes from a set of records that contain class labels.
(ii) Unsupervised Learning. No labels are given to the learning algorithm, leaving it to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data), or a means towards an end (feature learning).
In this paper, the focus is on supervised algorithms to perform classification of TS data.
A well-known class of classifiers is based on decision trees. A decision tree can be used to predict the future values of an item or to perform classification tasks. The classification consists of splitting a dataset into smaller sets organized as a binary tree-like structure, obtaining an easier description of the distribution of the data [].
RandomForest [] is a widely adopted decision-tree algorithm that averages multiple deep decision trees (trained on different parts of the same training set) with the aim of reducing the variance. It constructs several decision trees at training time and obtains a prediction from each tree; the output is the mode of the predicted classes (for classification) or the mean of the predictions (for regression). Random decision forests are used to correct decision trees' habit of overfitting to their training set.
Another well-known approach to TS classification is based on Dynamic Time Warping (DTW), which allows an elastic shifting of the time axis to accommodate sequences that are similar but out of phase. DTW is a classic technique for time-series classification; it is normally used with instance-based classifiers and can be incorporated into a decision-tree algorithm. For constructing the decision trees, the J48 algorithm was exploited []. It starts by building decision trees from a set of training data and then visits all decision nodes. At each iteration, it chooses the most active node and splits it, until each leaf is reached and no more splits are obtainable.
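As an illustration of the elastic alignment mentioned above, the following is a minimal sketch of the classic DTW distance combined with a 1-nearest-neighbor (instance-based) classifier. It is not the implementation used in this study: the function names, the absolute-difference local cost, and the toy sequences are assumptions made for the example.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences (O(len(a) * len(b)))."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost: absolute difference
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def nearest_neighbor_label(query, labeled_series):
    """Return the label of the training series closest to `query` under DTW."""
    return min(labeled_series, key=lambda pair: dtw_distance(query, pair[0]))[1]


# Hypothetical toy signals: the same bump shifted in time, and a constant signal.
bump = [0, 0, 1, 2, 1, 0, 0]
shifted_bump = [0, 1, 2, 1, 0, 0, 0]
steady = [1, 1, 1, 1, 1, 1, 1]

print(dtw_distance(bump, shifted_bump))
print(nearest_neighbor_label(shifted_bump, [(bump, "bump"), (steady, "steady")]))
```

Because the warping path may repeat samples, the two out-of-phase bumps align at no cost, whereas a pointwise comparison of the same sequences would report a nonzero difference.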
Other effective approaches to classification problems exploit neural networks. MLPs are among the most popular kinds of neural networks and have been used in a wide variety of modeling, classification, prediction, and optimization problems. They are a subclass of more general structures called feedforward neural networks, which are capable of approximating generic classes of functions (including continuous and integrable functions).
In this paper, an MLP network architecture is used.
An MLP neural network is based on multiple layers of nodes in a directed graph where each layer is fully connected to the next one []. In particular, the simplest MLP network model describes a fully connected network with three layers (one input layer, one hidden layer, and one output layer) in which each node is a neuron that uses a nonlinear activation function.
An activation function in an MLP network is a (typically nonlinear) function transforming the weighted sum of the inputs to a neuron into an output value. Usually, it maps the range of input signals into a reference interval (such as [−1,1] or [0,1]). The activation functions used in typical neural networks are classified into threshold, linear, and nonlinear activation functions. In MLP networks the hidden layers have nonlinear activation functions, whereas the output layer can have either linear or nonlinear ones. The MLP models adopt the following bipolar sigmoidal function (i.e., the hyperbolic tangent sigmoid) as activation function:
$$f(v) = \frac{2}{1 + e^{-2v}} - 1, \tag{1}$$
whereas a linear activation function is used for the output layer. A single network model, for example, contains at least three fully connected layers (one input layer, one hidden layer, and one output layer). The hidden layer neurons are connected to the nodes of the input layer by weights, and weights likewise connect the neurons of the output layer with the neurons of the hidden layer. All neurons within a layer apply the same activation function.
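The bipolar sigmoid described above is algebraically identical to the hyperbolic tangent, and the following minimal sketch makes that concrete together with a forward pass through a three-layer network with a linear output neuron. The weights, biases, and function names are hypothetical values chosen only for illustration; they are not taken from the trained models of this study.

```python
import math


def bipolar_sigmoid(v):
    """Bipolar sigmoid 2 / (1 + exp(-2v)) - 1, which equals tanh(v)."""
    return 2.0 / (1.0 + math.exp(-2.0 * v)) - 1.0


def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: one hidden layer with bipolar sigmoid units, one linear output neuron."""
    hidden = [
        bipolar_sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    # Linear activation on the output layer: just the weighted sum plus bias.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# The closed form agrees with math.tanh up to floating-point rounding.
for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(v, bipolar_sigmoid(v), math.tanh(v))

# Hypothetical 2-input, 2-hidden-unit network.
print(mlp_forward([0.3, -0.1], [[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0], [0.5, 0.5], 0.25))
```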
MLP behavior and performance depend upon three fundamental aspects: (i) the activation functions of the units; (ii) the network architecture; and (iii) the weight of each input connection. Since the first two aspects are fixed, the behavior of the MLP is mainly defined by the current values of the weights. The weights are defined during the training process: they are initially set to random values, and then instances of the training set are repeatedly exposed to the network. The input values of an instance are placed on the input units and the output of the net is compared with the desired output for this instance. Then, all the weights in the net are adjusted slightly in the direction that would bring the output values of the net closer to the desired output. There are several learning algorithms with which a network can be trained, but the most well-known and widely used one to estimate the values of the weights is the Backpropagation (BP) algorithm. An MLP network is trained using some form of gradient descent, and the gradients are calculated using backpropagation. For classification, the learning algorithm minimizes the cross-entropy loss function ($L$) and weights are updated using Stochastic Gradient Descent (SGD), evaluating the gradient of the loss function as follows:
$$W \leftarrow W - \eta \, \nabla_W L(\hat{y}, y, W), \tag{2}$$
where $\eta$ is the learning rate parameter, which controls the step size in the parameter space search, and the cross-entropy loss function $L$ is given by
$$L(\hat{y}, y, W) = -y \ln \hat{y} - (1 - y) \ln (1 - \hat{y}) + \alpha \lVert W \rVert_2^2, \tag{3}$$
where $y$ is the desired output, $\hat{y}$ is the output of the network, $W$ denotes the weights, $\alpha \lVert W \rVert_2^2$ is a regularization factor that penalizes complex models, and $\alpha > 0$ is a positive parameter that controls the magnitude of the imposed penalty.
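To make the weight update concrete, the sketch below applies one SGD step to a single logistic output neuron using the regularized cross-entropy loss described above. It is a simplified illustration rather than the training code of this study: the single-neuron setting, the choice to leave the bias unregularized, and all numeric values are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, x, y, alpha):
    """Cross-entropy for one sample plus the L2 penalty alpha * ||w||^2."""
    y_hat = predict(w, b, x)
    cross_entropy = -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    return cross_entropy + alpha * sum(wi * wi for wi in w)


def sgd_step(w, b, x, y, eta, alpha):
    """One update w <- w - eta * dL/dw for a single training sample."""
    error = predict(w, b, x) - y  # derivative of the cross-entropy w.r.t. the pre-activation
    new_w = [wi - eta * (error * xi + 2.0 * alpha * wi) for wi, xi in zip(w, x)]
    new_b = b - eta * error  # assumption: the bias is not regularized
    return new_w, new_b


# Hypothetical sample and hyperparameters.
w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1
eta, alpha = 0.1, 0.01

before = loss(w, b, x, y, alpha)
w, b = sgd_step(w, b, x, y, eta, alpha)
after = loss(w, b, x, y, alpha)
print(before, after)
```

With a small learning rate the loss on the sample decreases after the step, which is the behavior the update rule is designed to produce.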
For binary classification, the function computed by the network passes through the logistic function
$$g(z) = \frac{1}{1 + e^{-z}} \tag{4}$$
to obtain output values between zero and one. In this case, a threshold assigns samples with outputs larger than or equal to it to the positive class and the rest to the negative one. When there are $k > 2$ classes, the output of the network is a vector of size $k$. In this case, in order to assign a single class to the input sample, the softmax function is used, which can be defined as
$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{l=1}^{k} \exp(z_l)}, \tag{5}$$
where $z_i$ represents the $i$th element of the input to softmax, which corresponds to class $i$, and $k$ is the number of classes. The result is a vector containing the probabilities that the sample belongs to each class, and the output class is the one with the highest probability.
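The two output stages described above can be sketched in a few lines. The snippet is illustrative only: the function names and the input scores are hypothetical, and the subtraction of the maximum inside the softmax is a standard numerical-stability device that does not change the result.

```python
import math


def logistic(z):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    """Softmax over a list of scores; subtracting max(z) avoids overflow."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(z):
    """Index of the class with the highest softmax probability."""
    probabilities = softmax(z)
    return max(range(len(probabilities)), key=probabilities.__getitem__)


scores = [0.1, 2.0, -1.0]  # hypothetical network outputs for three classes
print(softmax(scores))
print(predict_class(scores))
```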
In this paper, a dataset that has $k$ classes and an architecture based on $k$ joint models (each with one output layer neuron) has been adopted. All the models are jointly trained across the folds of the cross-validation over the classes on the whole input, and hence they are all characterized by the same best parameters (i.e., the same number of neurons in the hidden layer, the same learning rates, and the same number of iterations). As already observed, since this architecture with $k > 2$ models is suitable to detect more than two classes, it requires the softmax stage, which takes the $k$ inputs (the probabilities that the input sample belongs to each class) and provides the best class as output.
4. The Methodology
This section presents the proposed approach for the classification of drivers and paths using data extracted from vehicle sensors. Figure 1 summarizes the overall mining process, structured as two main subprocesses: (a) Datasets Generation and (b) MLP Training and Time-Series Classification. The remaining part of the section describes these two subprocesses in more detail.
4.1. Datasets Generation. The Datasets Generation steps are reported in Figure 1(a). The first step consists of the dataset cleaning (removal of incomplete and wrong data values) and normalization. The cleaning and normalization activity is necessary since real-world data tend to be rather noisy, incomplete, or even inconsistent. It applies techniques that fill missing values, filter out the noise, and correct (or remove) the inconsistent values in the dataset. We adopted the following data cleaning subprocess to polish data produced by the car monitors in order to obtain a consistent dataset that is suitable for statistical inference:
(i) fixing missing values
(ii) removing noise
(iii) removing special characters or values
(iv) verifying semantic consistency
(v) normalization.
In the last step, numerical attributes are normalized using a min–max normalization that performs a linear transformation of the original data. If $\min_X$ and $\max_X$ are the minimum and maximum values for the attribute $X$, the min–max normalization maps a value $v_i$ of $X$ to a value $v_i'$ in the range $[\mathrm{newMin}_X, \mathrm{newMax}_X]$ by computing
$$v_i' = \frac{v_i - \min_X}{\max_X - \min_X}\,(\mathrm{newMax}_X - \mathrm{newMin}_X) + \mathrm{newMin}_X. \tag{6}$$
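As a worked instance of the min–max normalization above, the following minimal sketch rescales one attribute to a target range. The function name, the default target range [0, 1], the handling of a constant attribute, and the sample engine RPM readings are assumptions made for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map `values` from [min(values), max(values)] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no spread, so map it to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


engine_rpm = [800, 1600, 2400]  # hypothetical readings
print(min_max_normalize(engine_rpm))
print(min_max_normalize(engine_rpm, new_min=-1.0, new_max=1.0))
```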
The normalized data is split into two sets (Training and Test Set Generation step): (i) a training set used to train the classifier and (ii) a test set used to assess the performance of the classifier. The dataset partitioning is performed using a k-fold cross-validation approach [] that splits the data into k equally sized folds. Subsequently, k iterations of training and validation are performed (for each iteration a different fold of the data is used for validation and the remaining folds are used for training). Moreover, to ensure that each fold is representative, the data are stratified prior to being split into folds. This model selection method, according to [], provides a less biased estimation of the accuracy.
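The stratified partitioning described above can be sketched as follows. This is a simplified illustration, not the tooling used in the study: the function name and the round-robin assignment are assumptions, no shuffling is performed, and the driver labels are only sample values.

```python
from collections import defaultdict


def stratified_k_fold(labels, k):
    """Yield (train_indices, test_indices) pairs for k stratified folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal the indices of each class round-robin so every fold keeps the class proportions.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(index for j, fold in enumerate(folds) if j != i for index in fold)
        yield train, test


labels = ["DM1"] * 4 + ["DF1"] * 4  # hypothetical class labels, one per observation
for train, test in stratified_k_fold(labels, k=2):
    print("train:", train, "test:", test)
```

Each fold is used exactly once for validation, and every fold contains the same share of each class as the full dataset.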
4.2. MLP Training and Time-Series Classification. The MLP Training and Time-Series Classification process is reported in Figure 1(b) and is based on an MLP network for learning [].
Figure 1: The overall classification process. (a) Datasets Generation: driving data; cleaning and normalization; cleaned and normalized driving data; Training and Test Set Generation; training examples; unlabeled testing examples. (b) MLP training and time-series classification: time-series segmentation; time-series windows; postprocessing; trend/value-based windows’ features; multilayer perceptron network model generation; MLP classification model; driver and path classification.
The adopted time-series classification approach is based on the following main steps:
(i) Time-Series Segmentation, in which the time series are analyzed and divided into segments (i.e., windows)
(ii) Postprocessing, in which for each window a representation based on value and trend features is generated
(iii) Multilayer Perceptron Network Model Generation, in which an MLP network is trained
(iv) Classification, in which the trained classifier is tested on new time-series samples to perform the classification step.
In the first step, the multivariate time series is divided into a sequence of segments by sliding a window (which can be fixed or threshold-based) incrementally across the time-series values. In this paper, we experimented with two time-series sliding window approaches: fixed windows and threshold-based windows (which we refer to as Start&Stop). The former exploits windows of fixed sizes. In particular, we defined five windows of increasing sizes ranging from sec to sec. Outside of this range, results or performances were not acceptable. The latter segmentation approach defines time-series regions based on signal variation thresholds. A Start&Stop region is defined as the interval in which the vehicle moves off from a steady state until it reaches another steady state. This means that only transitory parts of the signals are captured. The rationale for using just these portions of the time series instead of the entire signal is intuitive: when signals become constant, they give little information (especially concerning the driving behavior). This has proven to be very effective in filtering out data that waste classifier memory but provide no useful information to help discriminate driving behavior or path kind.
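The two segmentation strategies described above can be sketched as follows. This is a hedged illustration, not the segmentation code of the study: the function names, the use of non-overlapping fixed windows, the definition of "steady" as a sample-to-sample change not exceeding a threshold, and the speed trace are all assumptions.

```python
def fixed_windows(series, size):
    """Consecutive, non-overlapping windows of `size` samples; an incomplete tail is dropped."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


def start_stop_regions(series, threshold):
    """Return (start, end) index pairs, end exclusive, of transitory regions.

    A region starts at the last steady sample before the signal begins to change by
    more than `threshold` per step and ends once the change falls back within it.
    """
    regions, start = [], None
    for i in range(1, len(series)):
        moving = abs(series[i] - series[i - 1]) > threshold
        if moving and start is None:
            start = i - 1
        elif not moving and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:  # the signal was still changing at the end of the trace
        regions.append((start, len(series)))
    return regions


speed = [0, 0, 0, 5, 12, 20, 20, 20, 14, 6, 0, 0]  # hypothetical speed trace
print(fixed_windows(speed, 4))
for start, end in start_stop_regions(speed, threshold=1):
    print(speed[start:end])
```

On this trace the threshold-based approach keeps only the acceleration and the deceleration phases and discards the constant stretches.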
After the segmentation, in the Postprocessing step, a set of features is evaluated for each time-series window. In particular, each window $w_i$ is represented by a sequence of features $(v_1, \ldots, v_n, t_1, \ldots, t_m)$ containing $n$ value-based features and $m$ trend-based features. The value-based features represent the discretized values of the time series and take one value in a discretized set (in the [0,1] range). The trend-based features describe the trend of the time series local to the window and are usually represented using shape-based metrics (including moments of various orders, mean, standard deviation, average energy, entropy, skewness, and kurtosis). In this paper, only a single feature is used as a value-based feature and two moments (standard deviation and skewness) are used to represent trend-based features. The approach to generating these representations is similar to the trend and value-based analysis proposed in [].
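A window representation of this kind can be sketched as below. The text does not specify how the single value-based feature is computed, so the sketch assumes it is the window mean rounded to a fixed number of levels in [0, 1]; the function name, the number of levels, and the use of population moments are likewise assumptions.

```python
import math


def window_features(window, levels=11):
    """Return (value, standard deviation, skewness) for one normalized window.

    Assumes the samples are already min-max normalized to the [0, 1] range.
    """
    n = len(window)
    mean = sum(window) / n
    variance = sum((x - mean) ** 2 for x in window) / n
    std = math.sqrt(variance)
    # Skewness as the third standardized moment; defined as 0 for a constant window.
    skewness = 0.0 if std == 0 else sum((x - mean) ** 3 for x in window) / n / std ** 3
    # Assumed discretization: round the mean to one of `levels` equally spaced values.
    value = round(mean * (levels - 1)) / (levels - 1)
    return value, std, skewness


print(window_features([0.0, 0.5, 1.0]))       # symmetric window
print(window_features([0.0, 0.0, 0.0, 1.0]))  # window with a late peak
```

A symmetric window has zero skewness, while a window dominated by low values with one late peak has positive skewness, which is the kind of shape information the trend-based features are meant to capture.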
In the Multilayer Perceptron Network Model Generation step, an MLP network, built as an ensemble of networks, has been used. It is instantiated for each dataset and exploits a number of network models equal to the number of the classes to be detected.
The input vectors for the MLP network contain the window representation components (the value and the two moments) representing data for each feature reported in Table 1.

Table 1: Features involved in the study with the relative description and features source (i.e., a feature extracted by the OBD connector or gathered by the device GPS).

| Feature | Description |
| --- | --- |
| CO2 in g/km (Average) (g/km) | CO2 emission computed on average |
| Cost per mile/km (Instant) (/km) | Instantaneous cost per kilometer |
| Cost per mile/km (Trip) (/km) | Trip cost per kilometer |
| Engine Load (%) | Percentage load of the engine |
| Engine RPM (rpm) | Revolutions per minute of the engine |
| Fuel cost (trip) (cost) | Cost of the fuel consumed during the trip |
| () | -axis gyroscope, i.e., pitch |
| () | -axis gyroscope, i.e., yaw |
| () | -axis gyroscope, i.e., roll |
| GPS Speed (meters/second) | Speed as measured by the GPS unit |
| GPS versus OBD Speed difference (km/h) | Speed difference between GPS chipset and OBD |
| Kilometers per Liter (Instant) (kpl) | Instantaneous number of kilometers traveled per liter of fuel |
| Kilometers per Liter (Long Term Average) (kpl) | Average number of kilometers traveled per liter of fuel |
| Liters Per Kilometer (Instant) (l/km) | Instantaneous liters consumed per km |
| Liters Per Kilometer (Long Term Average) (l/km) | Average liters consumed per km |
| Trip Average KPL (kpl) | Distance traveled (in km)/fuel consumed by the vehicle |
In this step, the MLP network is trained using the class
labels available for each set of value-based and trend-based
information of each window.
In the classification step, the trained classifier can be used to classify new data and is validated on new samples to assess its performance.
Features Selection. The proposed classification approach is based on some assumptions that are confirmed by the experimental results: (i) each person has a different driving style that can be recognized irrespective of the path or vehicle; (ii) there exist features whose distributions of values are well separated between women and men, allowing an effective gender identification; (iii) there exist features, related to how the vehicle is stressed by the road, that allow recognizing the road kind (among several types). Based on such considerations, we selected an initial set of sixteen features, which are reported in Table 1.
The set of features reported in Table 1 is analyzed by performing a feature selection step to reduce the dimensionality of the dataset. This step requires analyzing and understanding each feature’s impact on the classification model. The most relevant features are selected by evaluating correlations among the features (selecting a subset of orthogonal and independent features allows discarding redundant information). To this aim, we used a correlation-based feature selection (CFS) algorithm exploiting correlation matrix filters, as discussed in []. The CFS approach rates the features using a correlation-based heuristic evaluation function. This function is biased towards subsets of features that are highly correlated with the class to be detected and uncorrelated with each other. Irrelevant features can be ignored given their low correlation with the class, and redundant features can be eliminated given their high correlation with one or more of the remaining features. A feature is accepted if it supports classification in regions of the example space not already covered by other features. The CFS evaluation function can be expressed as follows:
$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}. \tag{7}$$
$M_S$ represents the heuristic goal function of the subset $S$ containing $k$ features. The mean feature-class correlation is indicated as $\overline{r_{cf}}$, while $\overline{r_{ff}}$ is the average feature-feature intercorrelation. The numerator of the equation indicates the capability of a set of features to classify an example, while the denominator indicates how much redundancy there is among the features in the set.
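The behavior of the CFS evaluation function can be checked numerically with the short sketch below. The function name and the correlation values are hypothetical; only the formula itself follows the expression given above.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset of k features.

    mean_feature_class_corr: average correlation between each feature and the class.
    mean_feature_feature_corr: average pairwise correlation among the features.
    """
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Two hypothetical subsets with equal relevance but different redundancy.
print(cfs_merit(4, 0.5, 0.1))  # weakly intercorrelated features
print(cfs_merit(4, 0.5, 0.9))  # highly redundant features
```

For the same feature-class correlation, the subset whose features are less correlated with each other obtains the higher merit, which is why redundant features tend to be discarded.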
5. Experiment Setting
In this section, we describe the features extraction process
and the dataset realized to perform the experiments.
5.1. Features Extraction. We constructed a dataset by gathering data from the CAN bus of a set of real vehicles. In order to collect data, the Torque Pro application (https://play.google.com/store/apps/details?id=org.prowl.torque) and a Mini Bluetooth ELM OBD Scanner were used.
The OBD scanner was installed on the vehicles to produce a self-diagnostic report generated by the onboard monitoring system.
The data is recorded every second during driving by the Torque Pro application running on an Android smartphone fixed in the car using an adequate support.
5.2. Datasets Definition. In this paper three different datasets (A, B, and C) are exploited to answer the proposed research questions (some further details and the download of the datasets are available at https://github.com/martinacimitile/Car-Data-Mining/wiki). All the datasets refer to the area shown in Figure 2, where the study has been executed: the axes of the figure are associated with longitude and latitude, whereas the color represents one of the five paths.

Figure 2: An overview of the paths under study: km in five paths of different kind in the Naples area. Axes: latitude and longitude; color: path (P1, P2, P3, P4, P5).
Each dataset contains two replicas of the same observation obtained under different conditions to avoid bias.
A is composed of two replicas of observations and has the following characteristics:
(i) Each observation describes the driving on the entire track reported in Figure 2.
(ii) The observations are performed by four drivers (DM1, DF1, DM2, and DM3) on four cars (Hyundai i, Lancia Y, Fiat Punto, and Nissan Note).
(iii) For each replica, four persons drive four cars on the entire track.
B is composed of two replicas of observations having the following characteristics:
(i) Each observation describes one of the paths composing the track of Figure 2. As shown by the figure, we consider five consecutive paths (called, resp., P1, P2, P3, P4, and P5).
(ii) Five men (DM1, ..., DM5) and five women (DF1, ..., DF5) drive the same car (Hyundai i).
C is composed of two replicas of observations having the following characteristics:
(i) Each observation describes one of the paths reported in Figure 2.
(ii) The observations are performed by four drivers (DM1, DF1, DM2, and DM3) on four cars (Hyundai i, Lancia Y, Fiat Punto, and Nissan Note) and on five paths.
(iii) For each replica, each person drives all the cars one time on each path.
5.3. Descriptive Statistics. In this section, descriptive statistics are used to describe the features used in this study (a quantitative analysis of the distributions of features has been performed). The analysis shows that several groups of features are well separated and median values do not fall into interquartile ranges of the distributions.
In order to provide statistical evidence that the features can be considered characteristic of the driver behavior, we show box plots of the feature distributions for four drivers (i.e., DM1, DF1, DM2, and DM3).

Figure: Box plots related to the distribution of the features Average Trip Speed, Liters Per Kilometers, Trip Average KPL, and Engine RPM for the drivers DM1, DF1, DM2, and DM3.
The box plot figure shows the distributions, for the four drivers, of Average Trip Speed, Liters Per Kilometers, Trip Average KPL, and Engine RPM.
As shown in the figure, the Average Trip Speed of the four drivers is the same, but it is interesting to observe that the drivers DM2 and 1 do not present peaks of speed (they keep the same speed variation both in acceleration and in deceleration during the entire driving session).
The figure also shows that even if the interquartile ranges are comparable among the drivers, the median values are quite different for different features.
Moreover, for the feature “Liter per Kilometer,” the distributions present different medians for the drivers. This is symptomatic of the different fuel consumption among the drivers involved in the experiment.
The driver DM1 is confirmed to be the most aggressive in terms of driving style: indeed his median is very close to the third quartile, and this is confirmed by the fact that he is the driver that reaches the highest speed (as confirmed by the box plots in the figure). The driver DF1, on the other side, presents a fuel consumption very close to the one exhibited by DM1 but, as we observe from the box plot in the figure, she reaches an average speed very close to the one of DM2, consuming less fuel than 1 and 1.
The driver DM3 exhibits an average fuel consumption slightly lower than DM2: this confirms the balanced driving style of the driver.
As shown by the box plots, the medians of the Trip Average KPL (i.e., the ratio between distance traveled and fuel consumed) for DM2 and DM3 exhibit the best values; moreover, DM1 and DF1 present almost the same medians. This means that DM2 and DM3 can travel more kilometers with less fuel with respect to DM1 and DF1.
The distributions relating to the Engine RPM box plot show that the drivers DM2 and 1 exhibit the highest values of Engine RPM: probably they change gear too late, and hence the Engine RPM rises. On the other side, we observe that the drivers 1 and DM3 present a lower median value for the considered feature: this means that they do not stress the engine.
This is also confirmed by looking at the scatter plots of the distributions. Panel (a) of the scatter plot figure shows the long-term average of the kilometers per liter achieved by the drivers (shown in different colors). The scatter plot, in this case, shows that the distributions are well separated. The same considerations can be made for the distribution of velocity with respect to path kind, shown in panel (b). In this case, the scatter plot reveals that different path kinds have quite different velocity distributions.
6. Experiments Description
In this section, the experiments conducted to answer the RQs introduced earlier are described.
6.1. Experiments Design. The set of conducted experiments is summarized in the experiments overview table. The table shows, for each RQ, the list of the conducted experiments (the experiment label, its goal, and the involved dataset).
Looking at the table, RQ1 is explored with five experiments (E11, E12, E13, E14, and E15). E11, E12, and E13 aim to evaluate the effectiveness of the approach in identifying the driver in three cases:
(i) Regardless of the car but fixing the path (E11)
(ii) Regardless of the path but fixing the car (E12)
(iii) Regardless of both car and path (E13).
Moreover, E14 and E15 aim to identify the gender of the driver:
(i) Regardless of the car but fixing the path (E14)
(ii) Regardless of the path but fixing the car (E15).
Similarly, RQ2 is explored with a single experiment (E21) aiming to evaluate if it is possible to identify the path kind fixing the car but regardless of the path and the driver. RQ3 is explored by means of two experiments. The first experiment (E31) aims at evaluating if it is possible to identify the driver familiarity with the vehicle fixing the car but regardless of the path and the driver. The second experiment (E32) aims at evaluating if it is possible to identify the driver familiarity with the vehicle when car, driver, and path can change.
Each experiment consists in applying the classification method described earlier to a specific dataset (the experiments overview table reports, for each experiment, the considered dataset).
In particular, in E11 the classification method is applied to the dataset 𝐴, since it contains several drivers and a single path. As for E12, the dataset 𝐵 can be used to study the influence of the path, fixing the car (since it was built on a single car). In E13 the dataset 𝐶 is used since it is based on four cars and contains five paths. E14 is conducted using the dataset 𝐴 since we need multiple cars involving drivers of both genders. Conversely, E15 is performed on the dataset 𝐵 since it is based on five paths for each driver and contains a well-balanced set of male and female drivers (five men and five women) on a single car. For E2 we exploited the dataset 𝐵 since it contains ten drivers and five paths and it was performed on a single car. RQ3 is explored in the experiments E31 and E32 involving, respectively, the datasets 𝐴 and 𝐵. Following the classification process shown in the corresponding figure, each considered dataset is cleaned and normalized by performing the Datasets Generation steps. The normalized dataset is used to generate a training set and a test set. Each experiment is performed using the MLP classifier. Moreover, the experiment E11 is performed using two alternative classifiers (RF and DTW) in order to compare the effectiveness of the MLP classifier with respect to other approaches often used in the literature. Finally, each experiment is also repeated using different window segmentation strategies (1 s, 10 s, 30 s, 60 s, and the Start&Stop method) in order to evaluate the impact of the sliding window approach on the classifier performance.
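The segmentation strategies mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the sample layout (timestamp in seconds, speed), the function names, and the `stop_speed` threshold are assumptions made for illustration, and Start&Stop is interpreted here as cutting a segment whenever the speed drops to (or below) a threshold.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (timestamp in seconds, speed).
Sample = Tuple[float, float]


def fixed_windows(samples: Sequence[Sample], size_s: float) -> List[List[Sample]]:
    """Split a time-ordered series into consecutive windows of size_s seconds."""
    if size_s <= 0:
        raise ValueError("size_s must be positive")
    if not samples:
        return []
    start = samples[0][0]
    windows: List[List[Sample]] = []
    for sample in samples:
        index = int((sample[0] - start) // size_s)
        while len(windows) <= index:
            windows.append([])
        windows[index].append(sample)
    # Drop windows that received no samples (gaps in the recording).
    return [w for w in windows if w]


def start_stop_windows(samples: Sequence[Sample],
                       stop_speed: float = 0.5) -> List[List[Sample]]:
    """Group samples into maximal runs where speed stays above stop_speed.

    The threshold value is an assumption, not taken from the paper.
    """
    segments: List[List[Sample]] = []
    current: List[Sample] = []
    for sample in samples:
        if sample[1] > stop_speed:
            current.append(sample)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```

With one sample per second, `fixed_windows(samples, 60)` yields one window per minute of driving, while `start_stop_windows` yields one segment per stretch of continuous motion, so segment length varies with traffic.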
Another consideration concerns the training dataset. It is obtained as a partition of the normalized dataset and is augmented with a column that specifies the classification label associated with the evaluated instance. This column was derived by an expert looking at the evaluated instance in the area considered for the study. This classification is then used both to train the classifier and to perform the evaluation. In the context of RQ1, for the experiments E11, E12, and E13, the considered datasets are augmented with a column that specifies the driver identity. For E11 and E13 the possible driver identity labels are DM1, DF1, DM2, and DM3 (corresponding to all the possible drivers involved in 𝐴 and 𝐶). For E12, looking at the explored dataset 𝐵, the driver identity label can assume the following values: DM1, DM2, DM3, DM4, and DM5 and DF1, DF2, DF3, DF4, and DF5. Similarly, for the experiments E14 and E15 the considered datasets are augmented with a column that specifies the driver gender (the driver is labeled as “Male” or “Female”). In the experiment conducted to answer RQ2, the dataset 𝐵 was augmented with a column that specifies the kind of the path (the path is labeled as “City Street”, “Highway”, or “Dirt road”), needed to perform the training and the validation. Similarly, in the context of RQ3, the explored datasets were augmented with a column that specifies the ownership (“Owner” or “Not Owner”).
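The label augmentation described above can be sketched as follows. This is only an illustration: the row layout (a dictionary per instance), the column names `driver`, `rpm`, and `gender`, and the helper `add_label` are hypothetical; in the study the labels were assigned by an expert rather than looked up from a table.

```python
from typing import Dict, List

# Driver labels for datasets A and C, with the gender labels used in E14/E15.
GENDER_BY_DRIVER: Dict[str, str] = {
    "DM1": "Male",
    "DF1": "Female",
    "DM2": "Male",
    "DM3": "Male",
}


def add_label(rows: List[dict], source_key: str, label_key: str,
              mapping: Dict[str, str]) -> List[dict]:
    """Return copies of rows with a new label column derived from source_key."""
    labeled = []
    for row in rows:
        new_row = dict(row)  # copy, so the normalized dataset is left untouched
        new_row[label_key] = mapping[row[source_key]]
        labeled.append(new_row)
    return labeled
```

The same helper would cover the path kind and ownership labels by passing a different mapping, which keeps the labeling step separate from the feature columns.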
6.2. Evaluation Strategy. The validation has been performed using classification quality metrics. The three metrics used to evaluate the performance of our approach for the research questions are precision, recall, and accuracy.
Precision has been computed as the proportion of the observations that truly belong to the investigated class (e.g., driver, driver gender, path kind, and driver's familiarity) among all those which were assigned to the class. It is the ratio of the number of records correctly assigned to a specific class to the total number of records assigned to that class (correct and incorrect ones):

$$\mathrm{Precision} = \frac{tp}{tp + fp},$$

where tp indicates the number of true positives and fp indicates the number of false positives.
The recall has been computed as the proportion of observations that were assigned to a given class, among all the observations that truly belong to the class. It is the ratio of the number of relevant records retrieved to the total number of relevant records:

$$\mathrm{Recall} = \frac{tp}{tp + fn},$$

where tp indicates the number of true positives and fn indicates the number of false negatives.
Accuracy is defined as a statistical measure of how well a binary classification is able to evaluate correctly the instances under analysis with respect to the considered features. Basically, the accuracy is the proportion of true results (both true positives and true negatives) among the total number of instances evaluated.

Figure: Scatter plots of the data distribution on dataset 𝐴. (a) Kilometers per Liter (long-term average) with respect to fuel cost, for different drivers. (b) Speeds with respect to different path kinds (city street, highway, and dirt road).

Table: Experiments overview.
RQ1: to what extent do OBD II features allow driver identification?
  E11 (dataset 𝐴): Evaluate the effectiveness of the proposed approach to identify the driver regardless of the car but fixing the path.
  E12 (dataset 𝐵): Evaluate the effectiveness of the proposed approach to identify the driver regardless of the path but fixing the car.
  E13 (dataset 𝐶): Evaluate the effectiveness of the proposed approach to identify the driver regardless of both car and path.
  E14 (dataset 𝐴): Evaluate the effectiveness of the proposed approach to identify the driver gender regardless of the car but fixing the path.
  E15 (dataset 𝐵): Evaluate the effectiveness of the proposed approach to identify the driver gender regardless of the path but fixing the car.
RQ2: to what extent do OBD II features allow path kind identification?
  E21 (dataset 𝐵): Evaluate the effectiveness of the proposed approach to identify the path kind fixing the car but regardless of the path and the driver.
RQ3: is it possible, using the features available in OBD II, to identify the driver's familiarity with the vehicle?
  E31 (dataset 𝐴): Evaluate the effectiveness of the proposed approach to identify the driver familiarity with the vehicle fixing the car but regardless of the path and the driver.
  E32 (dataset 𝐵): Evaluate the effectiveness of the proposed approach to identify the driver familiarity with the vehicle when car, driver, and path can change.

Table: E11, precision, recall, accuracy, and training time obtained by using the MLP classifier for different window sizes. Columns: average window size (with number of training segments), driver, driver recall, driver precision, overall precision, overall recall, accuracy, training time (sec). Rows: one block per window size (four fixed sizes and Start&Stop), each listing the drivers DF1, DM1, DM2, and DM3.
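The three metrics defined above can be computed from paired lists of true and predicted labels. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and a class with an empty denominator is given a value of 0.0 by convention, which is an assumption rather than something stated in the text.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def per_class_metrics(y_true: Sequence[Hashable],
                      y_pred: Sequence[Hashable]) -> Dict[Hashable, Dict[str, float]]:
    """Per-class precision tp/(tp+fp) and recall tp/(tp+fn)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    metrics = {}
    for label in set(y_true) | set(y_pred):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / p_den if p_den else 0.0,
            "recall": tp[label] / r_den if r_den else 0.0,
        }
    return metrics


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of instances whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, with true labels DM1, DM1, DF1, DF1 and predictions DM1, DF1, DF1, DF1, class DM1 has precision 1.0 and recall 0.5, class DF1 has precision 2/3 and recall 1.0, and the accuracy is 0.75. Averaging the per-class values gives overall precision and recall figures of the kind reported alongside the per-driver ones.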
In the following, for each RQ, the results of the experimentation are described and discussed. The datasets were all replicated once in different conditions of traffic and timing, to avoid the bias of such variables. The results were consistent with the ones obtained on the first replica and hence are not reported in detail. In the online repository (https://github.com/martinacimitile/Car-Data-Mining/wiki) all the datasets, along with replicas, are provided to allow experiments replication.
6.3. Discussion of Results. The E11 table for the MLP classifier reports the results of the experiment E11 performed by using the MLP classifier. In the first column of the table, the adopted window sizes are reported. For each window size, we also specified the number of training segments that are collected. Starting from the third column, the table reports, for each segmentation choice and for each driver (column two), the values of precision, recall, accuracy, and training times obtained by using the MLP classifier. Precision and recall are evaluated for each class (the driver in this context) and as averages. The results of the experiment E11 performed by using the RF and the DTW classifiers are shown in a separate table (for brevity, they are shown only for this first experiment). Comparing the two tables, we can conclude that the best results are obtained for MLP on almost all the sizes. Based on the results, we can also conclude that the best segmentation choice on MLP is the threshold-based segmentation. However, MLP is also the slowest classifier in terms of training time. Moreover, MLP and RF perform better than DTW on medium window sizes and for threshold-based segmentation.
The E11 results are also summarized in the E11 trend figure. Panel (a) shows the trend of the classification accuracy for MLP, RF, and DTW with respect to the adopted window strategy. Panel (b) shows the training times: RF and DTW are comparable but several times (x) faster than the MLP network.
Table: E11, precision, recall, accuracy, and training time obtained by using the RF and DTW classifiers for different window sizes. Columns: average window size (with number of training segments), driver, and, for each of Random Forest (RF) and DTW, driver recall, driver precision, overall precision, overall recall, accuracy, and training time (sec). Rows: one block per window size (four fixed sizes and Start&Stop), each listing the drivers DF1, DM1, DM2, and DM3.

Figure: E11, trend of accuracy (a) and training times (b) evaluated by using MLP, RF, and DTW classifiers for different window sizes. Training segments per window strategy: 1 s, 14592; 10 s, 1459; 30 s, 486; 60 s, 243; Start&Stop, 7562.
As for experiments E12 and E13, the results are reported in the corresponding tables. As we can see, the data follow the same trends observed for E11.
The following considerations can be made:
(i) Driver classification on the dataset 𝐴, on a single path, gives better results than the same identification on dataset 𝐵, which is performed on a fixed car but varying the path.
(ii) The effectiveness obtained in E13 (i.e., regardless of the car and the path) is, as expected, lower than the effectiveness of E11 and E12. This is important to estimate the quality of classification for applications where the path or the car is fixed. In those cases, an RF or a DTW approach could be adopted since precision and recall are higher and could be acceptable (leading to faster training times). Conversely, the MLP approach remains the best classifier in our experimentation in terms of classification quality, and it is the best one to choose for the general case.
Results for precision and recall of the driver gender identification experiments (E14 and E15) for different window sizes are reported in the corresponding tables. The tables show that even if the best results have been obtained for the threshold-based segmentation, the fixed windows of sixty seconds provide quite reasonable results with a much shorter training time. This means that, for applications more sensitive to training time, the sixty-second window could be preferred.
The E2 table shows the results of experiment E2. It reports, for each segmentation choice, the results (precision, recall, accuracy, and training times) of the runs on the MLP classifier for path kind identification among three classes (highway, city street, and dirt road). Precision and recall are evaluated for each class (the path kind in this context) and as averages. The best results are obtained, in this case too, for threshold-based segmentation on the MLP. In particular, the threshold-based segmentation provides the best accuracy of . (which is also the best accuracy obtained for path kind identification) whereas, for fixed windows segmentation, the best value of accuracy was . (obtained for the fixed window size of s). Fixed window segments that are too small (around second) or too wide (more than seconds) provided very bad results, limiting useful window sizes to the range ( s, s).
The tables for E31 and E32 report the results of the respective experiments. They report, for each segmentation choice, the results for the MLP classifier in terms of precision, recall, accuracy, and training times. The evaluated instance is the familiarity detection and it is labeled as “Owner” or “Not Owner.” Precision and recall are evaluated for each class and as averages.
Threshold-based segmentation on the MLP is confirmed to be the best choice. For E31, the best value for accuracy (which was also the best achieved overall accuracy for familiarity detection) was .. The performance trend among fixed window sizes is consistent with other classifiers. For experiment E32, the best value of accuracy is ., showing that when car, driver, and path can change, the classifier has worse performance.
Finally, even if it is not possible to directly compare the obtained results with the results obtained in the related work (each approach is tested on a different dataset), we can observe that the obtained precision rate is very encouraging. In particular, we reach a precision rate equal to . compared with the precision rate of . obtained in [] (the best precision described in related work).
We conclude this section by reporting the trend of the accuracy metric for all the experiments and all the segments. The accuracy trend figure highlights that the adoption of threshold-based segmentation improves accuracy with respect to using fixed windows for all the groups of features.
Table: E12, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), driver, driver recall, driver precision, overall precision, overall recall, accuracy, training time (sec). Rows: one block per window size (four fixed sizes and Start&Stop), each listing the drivers DF1 to DF5 and DM1 to DM5.

Table: E13, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: as for E12. Rows: one block per window size, each listing the drivers DF1, DM1, DM2, and DM3.

Table: E14, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), gender, gender precision, gender recall, overall precision, overall recall, accuracy, training time (sec). Rows: one block per window size, each listing FEMALE and MALE.

Table: E15, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns and rows: as for E14.

Table: E2, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), path kind, path kind recall, path kind precision, average path kind precision, average path kind recall, overall accuracy, training time (sec). Rows: one block per window size, each listing Highway, City street, and Dirt road.

Table: E31, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), ownership, ownership recall, ownership precision, overall precision, overall recall, accuracy, training time (sec). Rows: one block per window size, each listing Owner and Not Owner.

Table: E32, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns and rows: as for E31.

Figure: Accuracy trend over segments (1 s, 10 s, 30 s, 60 s, and Start&Stop) and experiments (E11, E12, E13, E14, E15, E2, E31, and E32).

6.4. Threats to Validity. In this section the main threats to the validity of our research are discussed. Construct validity represents the quality of choices about the particular forms of the variables (i.e., the choice of outcome measure or the choice of treatment). It concerns the relationship between theory and observation. In our proposal, some problems can be introduced by hypothesis guessing on the part of the involved drivers. Drivers can know or guess the desired end result, and they can change their behavior. This risk is mitigated by not disclosing to the drivers the scope of the study and which features are monitored. Moreover, a bias in the experimental design can be introduced by path knowledge. If some drivers know the tested path well, they can behave differently from drivers who have never driven along that path. This risk is mitigated by training all the drivers on the paths so that their knowledge is similar.
Internal validity is concerned with the possibility that some other factors would be more suitable than the proposed features to perform classification. To exclude this eventuality, we performed a specific feature selection step studying correlation and independence for all features available in the OBD II standard. Moreover, in order to best validate the training of the classifier, we adopted a k-fold cross-validation.
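The cross-validation step mentioned above can be sketched as an index split. This is a generic illustration, not the authors' setup: the function name `k_fold_indices` is hypothetical, the number of folds is left as a parameter because its value is not recoverable here, and the folds are contiguous (no shuffling), which is a simplifying assumption.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering range(n_samples).

    Folds are contiguous blocks; the first n_samples % k folds get one
    extra element so that every sample is tested exactly once.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train, test))
        start = stop
    return folds
```

Each pair would be used to train the classifier on the `train` segments and compute precision, recall, and accuracy on the `test` segments, averaging the results over the k rounds. With overlapping or adjacent time windows, contiguous folds keep neighboring segments in the same fold, which limits leakage between training and test data.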
Conclusion validity regards the degree to which the conclusions we state about the relationship (between the treatment and the outcome) are reasonable.
Threats to external validity concern the generalization of our findings. Of course, replication on further projects to confirm or contradict the obtained results is always desirable.
7. Conclusion
This paper proposes an approach to identify the driver, the familiarity of the driver with the vehicle, and the kind of the road based on the study of the behavior of a person during driving. It is based on the assumptions that a proper set of behavioral features can be used to (i) recognize different drivers by capturing their different driving styles; (ii) detect their familiarity with the car; and (iii) detect the road kind on which they are driving (among several types). Based on these assumptions, we extracted, using a monitoring system placed in the cars, an effective set of features whose samples are sent to a time-series classification approach. The classifier exploits a supervised learning approach and is based on an MLP network; its performance was compared with classic decision tree classifiers. The proposed time-series classifier has been shown to be effective at identifying the driver, the driver gender, and the road kind after being trained on the proposed set of behavioral features. Specifically, the approach has been evaluated with eight experiments on three datasets made from real data logged on four cars driven by ten drivers in the Naples area. Each experiment allows exploring a specific aspect of the proposed research questions, evaluating the effectiveness of the proposed approach to
(i) identify the driver regardless of the car but fixing the path;
(ii) identify the driver regardless of the path but fixing the car;
(iii) identify the driver regardless of both car and path;
(iv) identify the driver gender regardless of the car but fixing the path;
(v) identify the driver gender regardless of the path but fixing the car;
(vi) identify the path kind fixing the car but regardless of the path and the driver;
(vii) identify the driver familiarity with the vehicle fixing the car but regardless of the path and the driver;
(viii) identify the driver familiarity with the vehicle when car, driver, and path can change.
The obtained results show high accuracy for all the performed experiments. In particular, the proposed approach is very effective in identifying the driver. The best accuracy (.) is obtained to identify the gender of the driver regardless of the car but fixing the path. Good accuracy is also obtained in all the other experiments aiming to perform driver identification (the accuracy value is never less than .). As for the driver gender identification (evaluated regardless of the car but fixing the path), the proposed approach shows an accuracy of ., revealing that men and women have different driving styles. Moreover, the proposed approach can also be used to identify the road kind (highway, city street, and dirt road) by fixing the car but regardless of the path and the driver. The obtained accuracy value is, in this case, equal to .. Effective results are also obtained in the detection of driver familiarity. Here we have a higher accuracy (.) in identifying the driver familiarity when the vehicle is fixed. A slightly lower accuracy (.) is obtained when all cars, drivers, and paths can change. Further experimentation in this general case should be performed to see if the size and the number of hidden layers of a single network and the number of networks positively influence the resulting accuracy.
Finally, we also compare the proposed MLP classifier with classic decision tree classifiers. The obtained results show that, even if all the classifiers are characterized by high values of accuracy, the best performance is obtained by using the proposed ensemble MLP classifier. Training times, however, also show that our ensemble classifier is the slowest during the learning phase. This means that our approach is well suited for applications that do not require real-time identification of changing drivers (e.g., the owner can perform continuous training on their car and can be detected with very high levels of accuracy). As future work, we are extending the study by adding a behavioral characterization of the driver using both formal approaches (using model checking) and fuzzy rule extraction from the example dataset. This will not only allow performing the identification of driver and paths but will also provide an explanation of how the identified driver is behaving with respect to predefined classes (e.g., polluting driver, aggressive driver, and cheap driver). Moreover, the evaluation can be extended to a higher number of drivers, cars, and paths. Finally, the application of the proposed approach to road kind identification can be further explored by considering a more accurate road classification model.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work has been partially supported by H2020 EU-funded projects NeCS and CISP and EIT-Digital Project HII and PRIN “Governing Adaptive and Unplanned Systems of Systems” and the EU Project CyberSure.
References
[] K. Igarashi, C. Miyajima, K. Itou, K. Takeda, F. Itakura, and
H. Abut, “Biometric identication using driving behavioral
signals,” in Proceedings of the IEEE International Conference on
Multimedia and Expo (ICME ’04),pp.–,Taipei,Taiwan.
[] M. Okamoto, S. Otani, Y. Kaitani, and K. Uchida, “Identication
of driver operations with extraction of driving primitives,” in
Proceedings of the 2011 20th IEEE International Conference on
Control Applications, CCA 2011, pp. –, USA, September
.
[] M.Itoh,D.Yokoyama,M.Toyoda,andM.Kitsuregawa,“Visual
interface for exploring caution spots from vehicle recorder big
data,” in Proceedings of the 3rd IEEE International Conference
on Big Data, IEEE Big Data 2015, pp. –, USA, November
.
[] A. Wahab, C. Quek, C. K. Tan, and K. Takeda, “Driving prole
modeling and recognition based on so computing approach,”
IEEE Transactions on Neural Networks and Learning Systems,
vol. , no. , pp. –, .
[] F.Martinelli,F.Mercaldo,A.Orlando,V.Nardone,A.Santone,
and A. K. Sangaiah, “Human behavior characterization for
driving style recognition in vehicle system,” Computers &
Electrical Engineering,.
[] J. Daugman, “How iris recognition works,” IEEE Transactions
on Circuits and Systems for Video Technology,vol.,no.,pp.
–, .
[] S. Gonz´
alez, C. M. Travieso, J. B. Alonso, and M. A. Ferrer,
“Automatic biometric identication system by hand geometry,”
in Proceedings of the 37th Annual International Carnahan
Conference on Security Technology, pp. –, twn, October
.
[] D. A. Reynolds, “An overview of automatic speaker recognition
technology,” in Proceedings of IEEE International Conference on
Acoustics, Speech, and Signal Processing, pp. IV-–IV-,
Orlando, FL, USA, May .
[] S. Ribari´
c, D. Ribari´
c, and N. Paveˇ
si´
c, “Multimodal biometric
user-identication system for network-based applications,” IEE
Proceedings Vision, Image and Signal Processing,vol.,no.,
pp.–,.
[] R. W. Frischholz and U. Dieckmann, “BioID: A multimodal
biometric identication system,” e Computer Journal,vol.,
no. , pp. –, .
[] T. Wakita, K. Ozawa, C. Miyajima et al., “Driver identication
using driving behavior signals,” IEICE Transaction on Informa-
tion and Systems,vol.E-D,no.,pp.–,.
[] X. Zhang, X. Zhao, and J. Rong, “A study of individual
characteristics of driving behavior based on hidden markov
model,” Sensors & Transducers Journal,vol.,no.,pp.–
, .
[] M. Enev, A. Takakuwa, K. Koscher, and T. Kohno, “Automobile
driver fingerprinting,” in Proceedings on Privacy Enhancing
Technologies, vol. , pp. –, .
[] M. Van Ly, S. Martin, and M. M. Trivedi, “Driver classification
and driving style recognition using inertial sensors,” in
Proceedings of the 2013 IEEE Intelligent Vehicles Symposium, IEEE IV
2013, pp. –, Australia, June .
[] C. Miyajima, Y. Nishiwaki, K. Ozawa et al., “Driver modeling
based on driving behavior and its evaluation in driver
identification,” Proceedings of the IEEE, vol. , no. , pp. –, .
[] Y. Nishiwaki, K. Ozawa, T. Wakita, C. Miyajima, K. Itou, and
K. Takeda, “Driver identification based on spectral analysis
of driving behavioral signals,” in Advances for In-Vehicle and
Mobile Systems, pp. –, Springer, .
[] S. Choi, J. Kim, D. Kwak, P. Angkititrakul, and J. H. Hansen,
“Analysis and classification of driver behavior using in-vehicle
CAN-bus information,” Biennial Workshop on DSP for In-Vehicle
and Mobile Systems, pp. –, .
[] B. I. Kwak, J. Woo, and H. K. Kim, “Know your master: Driver
profiling-based anti-theft method,” in Proceedings of the 14th
Annual Conference on Privacy, Security and Trust, PST 2016, pp.
–, New Zealand, December .
[] G. Kedar-Dongarkar and M. Das, “Driver classification for
optimization of energy usage in a vehicle,” in Proceedings of
the Conference on Systems Engineering Research, CSER 2012, pp.
–, USA, March .
[] X. Meng, K. K. Lee, and Y. Xu, “Human driving behavior
recognition based on hidden Markov models,” in Proceedings
of the 2006 IEEE International Conference on Robotics and
Biomimetics, ROBIO 2006, pp. –, China, December .
[] P. Esling and C. Agon, “Time-series data mining,” ACM
Computing Surveys, vol. , no. , article , .
[] M. L. Bernardi, M. Cimitile, F. Martinelli, and F. Mercaldo,
“A time series classification approach to game bot detection,”
in Proceedings of the 7th International Conference on Web
Intelligence, Mining and Semantics, (WIMS ’17), pp. –, New
York, NY, USA, June .
[] N. K. Al-Salihy and T. Ibrikci, “Classifying breast cancer by
using decision tree algorithms,” in Proceedings of the 6th
International Conference (ICSCA ’17), pp. –, Bangkok,
Thailand, February .
[] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of
Statistical Learning, Springer New York Inc., New York, NY,
USA, .
[] Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. L. Zhao, “Time series
classification using multi-channels deep convolutional neural
networks,” in International Conference on Web-Age Information
Management, F. Li, G. Li, S.-W. Hwang, B. Yao, and Z. Zhang,
Eds., vol. of Lecture Notes in Computer Science, pp. –
, .
[] M. Stone, “Cross-validatory choice and assessment of statistical
predictions,” Journal of the Royal Statistical Society, vol. , pp.
–, .
[] R. Kohavi, “A study of cross-validation and bootstrap for
accuracy estimation and model selection,” in Proceedings of the
14th International Joint Conference on Artificial Intelligence, vol.
of IJCAI95, pp. –, Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA, , http://dl.acm.org/citation.cfm.
[] A. Nanopoulos, R. Alcock, and Y. Manolopoulos, “Feature-
based classification of time-series data,” International Journal of
Computer Research, vol. , pp. –, .
[] B. Esmael, A. Arnaout, R. K. Fruhwirth, and G. Thonhauser,
Multivariate Time Series Classification by Combining Trend-
Based and Value-Based Approximations, Springer Berlin
Heidelberg, Berlin, Heidelberg, .
[] N. Sánchez-Maroño, A. Alonso-Betanzos, and M. Tombilla-
Sanromán, Filter Methods for Feature Selection – A Comparative
Study, Springer Berlin Heidelberg, Berlin, Heidelberg, .
Available via license: CC BY 4.0