Driver and Path Detection through Time-Series Classification

Abstract

Driver identification and path kind identification are becoming critical topics, given the automobile industry's increasing interest in improving driver experience and safety and the need to reduce global environmental problems. Since increasingly sophisticated and accurate car sensors and monitoring systems have been produced in recent years, several proposed approaches are based on the analysis of large amounts of real-time data describing the driving experience. In this work, a set of behavioral features extracted by a car monitoring system is proposed to perform driver identification and path kind identification and to evaluate the driver's familiarity with a given vehicle. The proposed feature model is exploited using a time-series classification approach based on a multilayer perceptron (MLP) network to evaluate its effectiveness for the goals listed above. The experiment is carried out on a real dataset composed of 292 observations in total (each observation consists of a given person driving a given car on a predefined path) and shows that the proposed features have a very good driver and path identification and profiling ability.
Research Article
Mario Luca Bernardi,1 Marta Cimitile,2 Fabio Martinelli,3 and Francesco Mercaldo3
1 Giustino Fortunato University, Benevento, Italy
2 Unitelma Sapienza, Rome, Italy
3 National Research Council of Italy (CNR), Pisa, Italy
Correspondence should be addressed to Marta Cimitile; marta.cimitile@unitelma.it
Received 7 September 2017; Revised 17 January 2018; Accepted 12 February 2018; Published 22 March 2018
Academic Editor: Aboelmaged Noureldin
Copyright © 2018 Mario Luca Bernardi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Driver identication and path kind identication are becoming very critical topics given the increasing interest of automobile
industry to improve driver experience and safety and given the necessity to reduce the global environmental problems. Since in
the last years a high number of always more sophisticated and accurate car sensors and monitoring systems are produced, several
proposed approaches are based on the analysis of a huge amount of real-time data describing driving experience. In this work, a set
of behavioral features extracted by a car monitoring system is proposed to realize driver identication and path kind identication
and to evaluate driver’s familiarity with a given vehicle. e proposed feature model is exploited using a time-series classication
approach based on a multilayer perceptron (MLP) network to evaluate their eectiveness for the goals listed above. e experiment
is done on a real dataset composed of totally  observations (each observation consists of a given person driving a given car on
a predened path) and shows that the proposed features have a very good driver and path identication and proling ability.
Hindawi
Journal of Advanced Transportation
Volume 2018, Article ID 1758731, 20 pages
https://doi.org/10.1155/2018/1758731
1. Introduction
Driver identification allows discovering the identity of a vehicle driver using what he possesses and/or his physical and behavioral characteristics []. This topic has recently become critical for the automobile industry, with an increasing number of ever more sophisticated and accurate car sensors and monitoring systems able to extract information about the driver (e.g., hand geometry, keystroke dynamics, and voiceprint). The reason for the interest in the topic is that driver identification may improve the driver experience by allowing (i) safer driving and intelligent assistance in case of emergencies, (ii) more comfortable driving, and (iii) a reduction of global environmental problems []. Regarding driver safety, driver identification may help detect changes in the driver (due to possible indisposition or drunkenness) and activate security procedures (e.g., a ring may invite the driver to stop). The study of driver behavior for each segment of a road also allows profiling each road section, supporting the activation of alert signals when more caution is required (e.g., a vocal message may alert the driver to reduce the brake pressure in a dangerous curve) []. Concerning driver comfort, driver identification may, for instance, discover which member of the family is currently driving the car and, consequently, automatically adjust the car equipment (e.g., radio volume and frequency, temperature, or even speed limit) [, ]. Finally, driver identification can be useful to suggest new car improvements based on the driver's preferences, or new systems aimed at reducing the car's consumption and pollution based on the driving characteristics. Given these advantages of driver profiling and identification, several studies have been proposed in recent years focusing on the identification of driver physical and behavioral features. The physical features [, ] are stable human characteristics that have been widely used in the banking and forensic domains to guarantee higher security than the more traditional authentication system based on the ownership of a key (which can be easily bypassed when someone gets hold of the key). The behavioral features are used to detect individual personality features and have become the target of several recent studies that are mainly focused on speaker recognition []. The limit of these approaches is that they are based on the analysis of only one feature. This may cause high uncertainty in driver identification, especially if a sensor is noisy. For this reason, new approaches based on multimodal identification systems have been introduced [, ]. They provide more accurate driver identification based on the detection and analysis of a larger set of behavioral features.
In this work, we propose a set of behavioral features to perform driver identification and path kind identification (i.e., dirt road, road with bumps, and highway) and to evaluate the driver's familiarity with a given vehicle. Moreover, we exploit a time-series classification (TSC) approach based on a multilayer perceptron (MLP) network to evaluate the effectiveness of the proposed set of features, which are extracted by using a CAN bus monitoring system.
We have identied three research questions (RQs):
RQ1: To what extent does the set of extracted features
allow driver identication?
RQ2: To what extent does the set of extracted features
allow path kind identication?
RQ3: Is it possible, using the available features, to
identify the driver’s familiarity with the vehicle (e.g.,
ownership)?
e rest of the paper is organized as follows.
Section  discusses related work. Section  describes the
background of our study. Section  presents the proposed
approach discussing the overall classication process. Sec-
tions  and  discuss a set of experiments to answer the
proposed research questions using three datasets (made of
 driving sessions performed by ten drivers on four cars
in ve dierent paths). Finally, Section  provides conclusive
remarks and future directions.
2. Related Work
In the past, retrieving real-world automotive data was limited by the difficulty of equipping cars with sensors. Since the introduction of the CAN protocol (http://www.can.bosch.com), this limit has been overcome and driving style identification has become a very appealing scenario. The CAN protocol defines a generic communication standard for all the vehicle electronic devices. It can cooperate with the OBD II diagnostic connector (http://obd-elm.com/), which provides a candidate list of vehicle parameters to monitor along with how to encode their data.
Researchers in [] discuss a driver identification approach that is based on the driving behavior signals observed while the driver is following another vehicle. The analyzed signals (measured using a driving simulator) are the following: accelerator pedal, brake pedal, vehicle velocity, and distance from the vehicle in front. The approach obtains an identification rate equal to % for twelve drivers and equal to % for thirty drivers.
The accelerator and the steering wheel are analyzed in [] as characteristics to discriminate between different drivers. The authors employ a hidden Markov model (HMM) on the considered features to model the driver characteristics. They build two models for each driver: one trained from accelerator data and the other learned from steering wheel angle data. The models are then used to identify different drivers, obtaining an accuracy rate equal to %.
Another HMM-based approach to model human driver behavior is proposed in []. This method employs a simulated driving environment to evaluate the effectiveness of the proposed solution. Van Ly [] explores the possibility of using the inertial sensors of the vehicle from the CAN bus to build a profile of the driver, observing braking and turning events to characterize an individual compared to acceleration events.
Authors in [, ] represent gas and brake pedal oper-
ation patterns with the Gaussian mixture model (GMM).
eyobtainanidenticationrateequalto.%using
data extracted by a driving simulator and equal to .%
for a eld test with  drivers, resulting in % and %
error reduction, respectively, over a driver model based
on raw pedal operation signals without spectral analysis.
Considering data from steering wheel angle, brake status,
acceleration status, and vehicle speed, researchers in []
model the driver behavior through HMMs and GMMs with
the aim of capturing the sequence of driving characteristics
acquired from the CAN bus information. ey obtain % of
the accuracy of action classication and % of accuracy for
driver identication.
Authors in [] classify real-world mechanical features
from the CAN bus with four dierent classication algo-
rithms: they obtain an accuracy equal to . using decision
tree, equal to . using KNN, equal to . for Random-
Forest, and equal to . using MLP algorithm. Researchers
in [] classify a set of features extracted from the powertrain
signals of the vehicle, showing that the learned classier is
abletorecognizethehumandrivingstylebasedonthepower
demands placed on the vehicle powertrain with an overall
accuracy equal to %.
In [], the features extracted from the accelerator and
brakepedalpressureareusedasinputstoafuzzyneural
network (FNN) system to ascertain the identity of the driver.
Two fuzzy neural networks, namely, the evolving fuzzy neural
network (EFuNN) and the adaptive network-based fuzzy
inference system (ANFIS), are used to demonstrate the
viability of the two proposed techniques.
Summarizing the results obtained from the above-described studies, we can conclude that the obtained identification rate ranges from % [] to . []. The method we propose is able to reach a precision rate equal to %, outperforming the current literature in terms of precision. Furthermore, the existing methods are often tested in simulated environments, where a plethora of variables (like traffic jams and the number of cars involved in the scenarios) are set a priori.
In contrast, we conduct experiments in a real-world environment, in order to take into account real-world variables that cannot be predicted. In addition, we perform a set of experiments aiming to identify the car owner regardless of the car and path. The other discussed methods, instead, usually perform the experiments considering a single setting: for example, in the experiment proposed in [] (the method for which the best precision values are obtained), the drivers under analysis drive the same path in the same car.
3. Background: Time-Series Classification Approaches
Machine Learning (ML) studies and implements algorithms that learn from monitored data and make predictions about their future values. Here, we focus on TSC algorithms used to classify sequences of observations of a phenomenon [], as effectively experimented in [] in another context.
A Time Series (TS) consists of a sequence of discrete-time observations evaluated at successive, equally spaced points in time. Time series are widely used to describe the time course of a phenomenon. A TS is useful to predict the future trend of the phenomenon and relate it to other phenomena under study. For this reason, time series are adopted in several domains with different aims. A problem recurring in several domains is TS classification (TSC), which requires training a classifier on a set of cases, where each case contains an ordered set of real-valued attributes and a class label (qualifying its kind or nature). TSC problems arise in a wide range of fields including environmental sciences, computational biology, image processing, and software engineering.
ML approaches are widely adopted to perform such classification. There are two main families of ML algorithms:
(i) Supervised Learning (SL). The classifier is presented with example inputs and their desired outputs, given by an oracle, and the goal is to learn a general rule that maps inputs to outputs. The learning phase is the process of building a model able to discriminate the classes from a set of records that contain class labels.
(ii) Unsupervised Learning. No labels are given to the learning algorithm, leaving it to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data), or a means towards an end (feature learning).
In this paper, the focus is on supervised algorithms to perform classification of TS data.
A well-known class of classier is based on decision trees.
A decision tree can be used to predict the future values of
an item or to perform classication tasks. e classication
consists of splitting a dataset into smaller sets organized as a
binary tree-like structure obtaining an easier description of
the distribution of the data [].
RandomForest [] is a widely adopted decision trees
algorithm proposing a way of averaging multiple deep deci-
sion trees (trained on dierent parts of the same training set)
with the aim of reducing the variance. It starts by constructing
several decision trees at training time. For each tree, the class
(which can be the mode of the classication or the mean of
prediction) is obtained. Random decision forests are used to
correct decision trees habit of overtting to their training set.
Another well-known approach to TS classification is based on Dynamic Time Warping (DTW), which allows an elastic shifting of the time axis to accommodate sequences that are similar but out of phase. DTW is a classic technique for time-series classification. It is normally used with instance-based classifiers and can be incorporated into a decision tree algorithm. For constructing the decision trees, the J algorithm was exploited []. It starts by building decision trees from a set of training data and then visits all decision nodes. At each iteration, it chooses the most active node and splits it until each leaf is reached and no more splits are obtainable.
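As an illustration of the elastic alignment described above, the following is a minimal sketch of the classic DTW distance paired with a 1-nearest-neighbor (instance-based) classifier. It is not the implementation used in the paper; the function names, the absolute-difference local cost, and the example labels are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost (assumed: absolute difference)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_neighbor_label(query, labeled_series):
    """1-NN classification: return the label of the DTW-closest series."""
    return min(labeled_series, key=lambda item: dtw_distance(query, item[0]))[1]
```

Two sequences with the same shape but shifted in time, such as `[0, 1, 2, 1, 0]` and `[0, 0, 1, 2, 1, 0]`, get a DTW distance of zero, whereas a point-by-point comparison would not even be defined for their different lengths.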
Other eective approaches to classication problems
exploit neural networks. MLPs are among the most popular
kinds of neural networks and have been used in a wide
variety of modeling, classication, prediction, and optimiza-
tion problems. ey are subclasses of more general struc-
tures called feedforward neural networks that are capable
of approximating generic classes of functions (including
continuous and integrable functions).
In this paper, an MLP network architecture is used.
An MLP neural network is based on multiple layers of
nodes in a directed graph where each layer is fully connected
to the next one []. In particular, the simplest MLP network
model describes a fully connected network with three layers
(one input layer, one hidden layer, and an output layer) in
which each node is a neuron that uses a nonlinear activation
function.
An activation function in an MLP network is a (typically nonlinear) function transforming the weighted sum of inputs to a neuron into an output value. Usually, it maps the range of input signals into a reference interval (such as [−1, 1] or [0, 1]). The activation functions used in typical neural networks are classified into threshold, linear, and nonlinear activation functions. For MLP networks, the hidden layers have nonlinear activation functions, whereas the output layer can have either linear or nonlinear ones. The MLP models adopt the following bipolar sigmoidal function (i.e., the hyperbolic tangent sigmoid) as activation function:
$$\varphi(v) = \frac{2}{1 + e^{-2v}} - 1, \tag{1}$$
whereas a linear activation function is used for the output layer.
Looking, for example, at a single network model, it contains at least three fully connected layers (one input layer, one hidden layer, and one output layer). The hidden layer neurons are connected to the nodes of the input layer by weights, and further weights connect the neurons of the output layer with the neurons of the hidden layer. All neurons within a layer apply the same activation function.
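The structure described above can be sketched as a forward pass through a three-layer network with a bipolar sigmoid hidden layer and a single linear output neuron. This is only an illustrative sketch: the function names and the use of bias terms are assumptions, and the weights would normally come from training rather than being passed in by hand.

```python
import math


def bipolar_sigmoid(v):
    """Bipolar sigmoid 2 / (1 + exp(-2v)) - 1, which coincides with tanh(v)."""
    return 2.0 / (1.0 + math.exp(-2.0 * v)) - 1.0


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input layer -> one bipolar sigmoid hidden layer -> one linear output neuron."""
    hidden = [
        bipolar_sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    # Linear activation on the output layer: just the weighted sum.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
```

Each row of `hidden_weights` holds the weights connecting all input nodes to one hidden neuron, which mirrors the fully connected layout described in the text.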
MLP behavior and performance depend upon three fundamental aspects: (i) the activation functions of the units; (ii) the network architecture; and (iii) the weight of each input connection. Since the first two aspects are fixed, the behavior of the MLP is mainly defined by the current values of the weights. The weights are defined during the training process: they are initially set to random values, and then instances of the training set are repeatedly exposed to the network. The values for the input of an instance are placed on the input units and the output of the net is compared with the desired output for this instance. Then, all the weights in the net are adjusted slightly in the direction that would bring the output values of the net closer to the values of the desired output. There are several learning algorithms with which a network can be trained, but the most well-known and widely used learning algorithm to estimate the values of the weights is the Backpropagation (BP) algorithm. An MLP network is trained using some form of gradient descent, and the gradients are calculated using backpropagation. For classification, the learning algorithm minimizes the cross-entropy loss function ($L$) and weights are updated using Stochastic Gradient Descent (SGD), evaluating the gradient of the loss function as follows:
$$W \leftarrow W - \eta \left( \alpha \frac{\partial R(W)}{\partial W} + \frac{\partial L}{\partial W} \right), \tag{2}$$
where $\eta$ is the learning rate parameter, which controls the step size in the parameter space search, $R(W)$ is the regularization term weighted by $\alpha$, and the cross-entropy loss function $L$ is given by
$$L(\hat{y}, y, W) = -y \ln \hat{y} - (1 - y) \ln(1 - \hat{y}) + \alpha \lVert W \rVert_2^2, \tag{3}$$
where $y$ is the desired output, $\hat{y}$ is the network output, $\alpha \lVert W \rVert_2^2$ is a regularization factor that penalizes complex models, and $\alpha > 0$ is a positive parameter that controls the magnitude of the imposed penalty.
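To make the update rule concrete, the following minimal sketch performs one SGD step for a single logistic output neuron under a cross-entropy loss with an L2 penalty. It is a simplified illustration, not the training code used in the study: the neuron has no bias, the penalty is assumed to be `alpha` times the squared norm of the weights, and all names are hypothetical.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(y_hat, y, w, alpha):
    """Cross-entropy for one sample plus the L2 penalty alpha * ||w||^2."""
    penalty = alpha * sum(wi * wi for wi in w)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat) + penalty


def sgd_step(w, x, y, eta, alpha):
    """One SGD update of the weights w for the sample (x, y)."""
    y_hat = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    # Gradient of the cross-entropy term: (y_hat - y) * x_i
    # Gradient of the penalty term:       2 * alpha * w_i
    return [wi - eta * ((y_hat - y) * xi + 2 * alpha * wi) for wi, xi in zip(w, x)]
```

Starting from zero weights with `x = [1.0, 2.0]`, `y = 1`, `eta = 0.1`, and `alpha = 0.01`, a single step moves the weights towards the input direction and lowers the loss on that sample.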
For binary classication, the function computed by the
network passes through the logistic function:
()=1
(1+−𝑧)()
to obtain output values between zero and one. In this case
a threshold would assign samples of outputs larger than or
equal to tothepositiveclassandtherestofthenegativeone.
When there are >2classes, the output of the network is a
vector of size . In this case, in order to assign a single class
totheinputsample,thesomaxfunctionisused,whichcan
be dened as
softmax 𝑖= exp 𝑖
𝑘
𝑙=1 exp 𝑙,()
where 𝑖represents the th element of the input to somax,
which corresponds to class ,andis the number of classes.
e result is a vector containing the probabilities that the
sample belongs to each class and the output class is the one
with the highest probability.
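The two output stages described above can be sketched in a few lines. This is an illustrative sketch with hypothetical function names; subtracting the maximum before exponentiating is a common numerical safeguard added here and does not change the result.

```python
import math


def logistic(z):
    """Maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    """Turns a vector of k scores into k probabilities that sum to one."""
    shift = max(z)  # numerical safeguard; cancels out in the ratio
    exps = [math.exp(zi - shift) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(z):
    """Index of the class with the highest softmax probability."""
    probabilities = softmax(z)
    return max(range(len(z)), key=lambda i: probabilities[i])
```

For a score vector such as `[0.1, 2.0, -1.0]`, `predict_class` returns index 1, the position of the largest score.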
In this paper, a dataset that has $k$ classes and an architecture based on $k$ joint models (each with one output layer neuron) have been adopted. All the models are jointly trained using cross-validation over the $k$ classes on the whole input, and hence they are all characterized by the same best parameters (i.e., the same number of neurons in the hidden layer, the same learning rates, and the same number of iterations). As already observed, since this architecture, with $k > 2$ models, is meant to detect more than two classes, it requires the softmax stage, which takes the model outputs (the probabilities that the input sample belongs to each class) and provides the best class as output.
4. The Methodology
This section presents the proposed approach for the classification of drivers and paths using data extracted from vehicle sensors. Figure 1 summarizes the overall mining process, structured as two main subprocesses: (a) Datasets Generation and (b) MLP Training and Time-Series Classification. The remaining part of the section describes these two subprocesses in more detail.
4.1. Datasets Generation. The Datasets Generation steps are reported in Figure 1(a). The first step consists of dataset cleaning (by removing incomplete and wrong data values) and normalization. The cleaning and normalization activity is necessary since real-world data tend to be rather noisy, incomplete, or even inconsistent. It applies techniques aimed at filling missing values, filtering out the noise, and correcting (or removing) the inconsistent values from the dataset. We adopted the following data cleaning subprocess to polish the data produced by the car monitors in order to obtain a consistent dataset that is suitable for statistical inference:
(i) fixing missing values
(ii) removing noise
(iii) removing special characters or values
(iv) verifying semantic consistency
(v) normalization.
In the last step, numerical attributes are normalized using a min–max normalization that performs a linear transformation of the original data. If $\min_X$ and $\max_X$ are the minimum and maximum values for the attribute $X$, the min–max normalization maps a value $v_i$ of $X$ to a value $v'_i$ in the range $[\mathrm{newMin}_X, \mathrm{newMax}_X]$ by computing
$$v'_i = \frac{v_i - \min_X}{\max_X - \min_X} \left( \mathrm{newMax}_X - \mathrm{newMin}_X \right) + \mathrm{newMin}_X. \tag{6}$$
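The min–max mapping can be sketched as a short function. The name and the handling of a constant attribute (where the maximum equals the minimum and the formula is undefined) are assumptions made for this illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map the values of one attribute into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: the formula would divide by zero, so every
        # value is mapped to new_min (a convention assumed here).
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
```

For example, engine speeds of 800, 1600, and 2400 rpm are mapped to 0.0, 0.5, and 1.0 with the default target range.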
The normalized data is split into two sets (Training and Test Set Generation step): (i) a training set used to train the classifier and (ii) a test set used to assess the performance of the classifier. The dataset partitioning is performed by using a $k$-Fold Cross-Validation approach [], which consists of splitting the data into $k$ equally sized folds. Subsequently, $k$ iterations of training and validation are performed (for each iteration a different fold of the data is used for validation and the remaining folds are used for training). Moreover, to ensure that each fold is representative, the data are stratified prior to being split into folds. This model selection method, according to [], provides a less biased estimation of the accuracy.
4.2. MLP Training and Time-Series Classification. The MLP Training and Time-Series Classification process is reported in Figure 1(b) and is based on an MLP network for learning [].
Figure 1: The overall classification process. (a) Datasets Generation. (b) MLP training and time-series classification.
e adopted time-series classication approach is based
on the following main steps:
(i) Time-Series Segmentation, in which the time series
are analyzed and divided into segments (i.e., win-
dows)
(ii) Postprocessing, in which for each window a represen-
tation based on values and trends features is generated
(iii) Multilayer Perceptron Network Model Generation, in
which an MLP network is trained
(iv) Classication, in which the trained classier is tested
on new time-series samples to perform the classica-
tion step.
In the rst step, the multivariate time series is rstly
divided into a sequence of segments by sliding a window
(which can be xed or threshold-based) incrementally across
the time-series values. In this paper, we experimented two
time-series sliding window approaches: xed windows and
threshold-based windows (we refer to it as Start&Stop). e
former exploits windows of xed sizes. In particular, we
dened ve windows of increasing sizes ranging from  sec
to  sec. Outside of this range, results or performances were
not acceptable. e latter segmentation approach denes
time-series regions based on signal variation thresholds. A
Start&Stop region is dened as the interval in which the
vehicle moves o from a steady state until it reaches another
steady state. is means that only transitory parts of the sig-
nals are captured. e rationale for using just these portions
of the time series instead of the entire signal is intuitive:
when signals became constants, they give little information
(especially for what concerns the driving behavior). is, in
eect, has proven to be very eective in ltering out useless
information that wastes classier memory but provides no
useful information to help discriminating driving behavior or
path kind.
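The two segmentation strategies can be sketched as follows. This is a simplified illustration, not the authors' procedure: the fixed windows are assumed to be non-overlapping, a sample is assumed to be "transitory" when it differs from the previous sample by more than a threshold, and the function names and threshold are hypothetical. With one sample per second, the window size in samples equals its duration in seconds.

```python
def fixed_windows(series, size):
    """Split a series into consecutive, non-overlapping windows of `size` samples."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


def start_stop_regions(series, threshold):
    """Return (start, end) index pairs covering the transitory parts of a signal.

    A region starts when the signal leaves a steady state (the change between
    consecutive samples exceeds `threshold`) and ends when the signal becomes
    steady again.
    """
    regions, start = [], None
    for i in range(1, len(series)):
        moving = abs(series[i] - series[i - 1]) > threshold
        if moving and start is None:
            start = i - 1                    # leaving a steady state
        elif not moving and start is not None:
            regions.append((start, i - 1))   # another steady state is reached
            start = None
    if start is not None:
        regions.append((start, len(series) - 1))
    return regions
```

On a speed trace such as `[0, 0, 0, 5, 10, 15, 15, 15, 10, 5, 5]` with a threshold of 1, only the acceleration and deceleration phases are returned, while the constant stretches are skipped.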
Aer the segmentation, in the Postprocessing step, a
setoffeaturesisevaluatedforeachtime-serieswindow.In
particular, each window 𝑖is represented by a sequence
of features (V1,...,V𝑛,1,...,𝑚)containing value-
based features and trend-based features. e rst features
represent the discretized values of the time series and take
one value in a discretized set (in the [0,1]range). e second
set of features describes the trend of the time series
local to the window and is usually represented using shape-
based metrics (including moments of various orders, mean,
standard deviation, average energy, entropy, skewness, and
kurtosis). In this paper, only a single feature is used as a
value-basedfeatureandtwomoments(standarddeviation
and skewness) are used to represent trend-based features. e
approach to generating these representations is similar to the
trend and the value-based analysis proposed in [].
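A window representation made of one value-based feature and two trend-based moments can be sketched as follows. The choice of the window mean rounded to one decimal as the discretized value is an assumption made for illustration (the text does not specify the discretization), as are the function name and the convention of zero skewness for a constant window.

```python
from statistics import fmean, pstdev


def window_features(window):
    """Return (value, standard deviation, skewness) for one window in [0, 1]."""
    mean = fmean(window)
    std = pstdev(window)
    if std == 0.0:
        skewness = 0.0  # constant window: no asymmetry (convention assumed here)
    else:
        # Third standardized moment of the samples in the window.
        skewness = sum(((x - mean) / std) ** 3 for x in window) / len(window)
    value = round(mean, 1)  # assumed discretization: mean rounded to 0.1 steps
    return value, std, skewness
```

A symmetric window such as `[0.0, 0.5, 1.0]` yields a skewness of (numerically) zero, while a window with a single late peak such as `[0.0, 0.0, 0.0, 1.0]` yields a positive skewness.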
In the Multilayer Perceptron Network Model Generation step, an MLP network, namely the ensemble of $k$ networks, has been used. It is instantiated for each dataset and exploits a number of network models equal to the number of classes to be detected.
The input vectors for the MLP network contain the window representation components (the value and the two moments) representing data for each feature reported in Table 1.
Table 1: Features involved in the study with the relative description and features source (i.e., a feature extracted by the OBD connector or gathered by the device GPS).
No. | Feature | Description
1 | CO2 in g/km (Average) (g/km) | CO2 consumption computed on average
2 | Cost per mile/km (Instant) (/km) | Instantaneous cost per kilometer
3 | Cost per mile/km (Trip) (/km) | Trip cost per kilometer
4 | Engine Load (%) | Percentage load of the engine
5 | Engine RPM (rpm) | Revolutions per minute of the engine
6 | Fuel cost (trip) (cost) | Cost of the fuel consumed during the trip
7 | () | -axis gyroscope, i.e., pitch
8 | () | -axis gyroscope, i.e., yaw
9 | () | -axis gyroscope, i.e., roll
10 | GPS Speed (meters/second) | Speed as measured by the GPS unit
11 | GPS versus OBD Speed difference (km/h) | Speed difference between GPS chipset and OBD
12 | Kilometers per Liter (Instant) (kpl) | Instantaneous number of kilometers traveled per liter of fuel
13 | Kilometers per Liter (Long Term Average) (kpl) | Average number of kilometers traveled per liter of fuel
14 | Liters Per 100 Kilometer (Instant) (l/100 km) | Instantaneous liters consumed per 100 km
15 | Liters Per 100 Kilometer (Long Term Average) (l/100 km) | Average liters consumed per 100 km
16 | Trip Average KPL (kpl) | Distance traveled (in km)/fuel consumed by the vehicle
In this step, the MLP network is trained using the class labels available for each set of value-based and trend-based information of each window.
In the classification step, the trained classifier can be used to classify new data and is validated on new samples to assess its performance.
Features Selection. The proposed classification approach is based on some assumptions that are confirmed by the experimental results: (i) each person has a different driving style that can be recognized irrespective of the path or vehicle; (ii) there exist features whose distributions of values are well separated between women and men, allowing an effective gender identification; (iii) there exist features, related to how the vehicle is stressed by the road, that allow recognizing the road kind (among several types). Based on such considerations, we selected an initial set of sixteen features that are reported in Table 1.
The set of features reported in Table 1 is analyzed by performing a feature selection step to reduce the dimensionality of the dataset. This step requires analyzing and understanding each feature's impact on the classification model. The most relevant features are selected by evaluating correlations among the features (selecting a subset of orthogonal and independent features allows discarding redundant information). To this aim, we used a correlation-based feature selection (CFS) algorithm exploiting correlation matrix filters as discussed in []. The CFS approach rates the features using a correlation-based heuristic evaluation function. This function is biased towards those subsets of features that are highly correlated with the class to be detected and uncorrelated with each other. Irrelevant features can be ignored given their low correlation with the class, and redundant features can be eliminated given their high correlation with one or more of the remaining features. A feature is accepted if it improves classification in regions of the example space not already covered by other features. The CFS evaluation function can be expressed as follows:
$$M_S = \frac{k \, \overline{r}_{cf}}{\sqrt{k + k(k - 1) \, \overline{r}_{ff}}}, \tag{7}$$
where $M_S$ represents the heuristic goal function of the subset $S$ containing $k$ features. The mean feature-class correlation is indicated as $\overline{r}_{cf}$, while $\overline{r}_{ff}$ is the average feature-feature intercorrelation. The numerator of the equation indicates the capability of a set of features to classify an example, while the denominator indicates how much redundancy there is among the features of the set.
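The behavior of the CFS heuristic can be checked numerically with a short sketch. It assumes the standard form of the merit function (the number of features times the mean feature-class correlation, divided by the square root of the redundancy term); the function and parameter names are illustrative.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a subset of k features under the CFS heuristic."""
    return (k * mean_feature_class_corr) / math.sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )
```

With four features and a mean feature-class correlation of 0.5, the merit is 1.0 when the features are mutually uncorrelated and drops to 0.5 when they are perfectly correlated with each other, which is how the function penalizes redundant subsets.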
5. Experiment Setting
In this section, we describe the feature extraction process and the dataset built to perform the experiments.
5.1. Features Extraction. We constructed a dataset by gathering data from the CAN bus of a set of real vehicles. In order to collect data, the Torque Pro (https://play.google.com/store/apps/details?id=org.prowl.torque) application and a Mini Bluetooth ELM OBD  Scanner were used.
The OBD scanner was installed on the vehicles to produce a self-diagnostic report generated by the onboard monitoring system.
The data is recorded every second during driving using the Torque Pro application on an Android smartphone fixed in the car using an adequate support.
5.2. Datasets Denition. In this paper three dierent
datasets (some further details and the download of the
Journal of Advanced Transportation
14.3910
14.3915
14.3920
14.3925
14.3930
14.3935
14.3940
14.3945
14.3950
14.3955
14.3960
14.3965
14.3970
14.3975
14.3980
14.3985
14.3990
14.3995
14.4000
14.4005
14.4010
14.4015
14.4020
14.4025
14.4030
14.4035
14.4040
14.4045
14.4050
14.4055
14.4060
14.4065
14.4070
14.4075
14.4080
14.4085
14.4090
14.4095
14.4100
14.4105
14.4110
14.4115
14.4120
14.4125
14.4130
14.4135
14.4140
Longitude
40.8975 40.9000 40.9025 40.9050 40.9075 40.9100 40.9125 40.9150 40.9175 40.9200 40.9225 40.9250 40.9275 40.930040.8950
Latitude
Path
P
1
P
2
P
3
P
4
P
5
F : An overview of the paths under study:  km in ve paths of dierent kind in the Naples area.
datasets are available at https://github.com/martinacimitile/
Car-Data-Mining/wiki) (𝐴,𝐵,and𝐶) are exploited to
answertheproposedresearchquestions.Allthedatasets
refertotheareashowninFigure,wherethestudyhasbeen
executed: -and-axes of the gure are associated with
longitude and latitude whereas the color represents one of
ve paths.
Each dataset contains two replicas of the same observation
obtained under different conditions to avoid bias.
𝐴 is composed of two replicas of observations and has
the following characteristics:
(i) Each observation describes the driving on the entire
track reported in the figure of the paths under study.
(ii) The observations are performed by four drivers
(DM1, DF1, DM2, and DM3) on the cars (Hyundai i,
Lancia y, Fiat Punto, and Nissan Note).
(iii) For each replica, four persons drive four cars on the
entire track.
𝐵 is composed of two replicas of observations having
the following characteristics:
(i) Each observation describes one of the paths composing
the track in the figure of the paths under study. As shown
by the figure we consider five consecutive paths (called,
resp., P1, P2, P3, P4, and P5).
(ii) Five men (DM1, ..., DM5) and five women
(DF1, ..., DF5) drive the same car (Hyundai i).
𝐶 is composed of two replicas of observations having
the following characteristics:
(i) Each observation describes one of the paths reported
in the figure of the paths under study.
(ii) The observations are performed by four drivers
(DM1, DF1, DM2, and DM3) on the cars (Hyundai i,
Lancia y, Fiat Punto, and Nissan Note) and on the paths.
(iii) For each replica, each person drives all the cars one
time on each path.
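The structure of the three datasets can be sketched as a set of combinations of drivers, cars, and paths. The snippet below is only an illustration under a stated assumption: every combination listed in the dataset descriptions is observed exactly once per replica. The car labels are shortened placeholders. Under this assumption the total over two replicas agrees with the 292 observations reported in the abstract.

```python
from itertools import product

# Labels follow the figure legends; car names are shortened placeholders.
drivers_four = ["DM1", "DF1", "DM2", "DM3"]
drivers_ten = [f"DM{i}" for i in range(1, 6)] + [f"DF{i}" for i in range(1, 6)]
cars = ["Hyundai", "Lancia", "Fiat Punto", "Nissan Note"]
paths = ["P1", "P2", "P3", "P4", "P5"]


def observations(*factors):
    """Enumerate one observation per combination of the given factors."""
    return list(product(*factors))


# Assumption: each combination appears once in each replica.
per_replica = {
    "A": len(observations(drivers_four, cars)),         # entire track
    "B": len(observations(drivers_ten, paths)),         # single car
    "C": len(observations(drivers_four, cars, paths)),  # cars and paths vary
}
total = 2 * sum(per_replica.values())  # two replicas of each dataset
print(per_replica, total)
```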
5.3. Descriptive Statistics. In this section, descriptive statistics
are used to describe the features used in this study (a
quantitative analysis of the distributions of features has
been performed). The analysis shows that several groups of
features are well separated and median values do not fall into
interquartile ranges of the distributions.
In order to provide statistical evidence that the features
can be considered as characteristics of the driver
behavior, we show the feature distributions of four drivers
(i.e., DM1, DF1, DM2, and DM3) as box plots.
Figure: Box plots related to the distribution of the features Average Trip Speed, Liters Per 100 Kilometers, Trip Average KPL, and Engine RPM for the drivers DM1, DF1, DM2, and DM3.
The box plots show the distributions of four drivers related to
Average Trip Speed, Liters Per 100 Kilometers, Trip Average
KPL, and Engine RPM.
As shown in the figure, the Average Trip Speed of the four
drivers is the same but it is interesting to observe that the
drivers DM2 and 1 do not present peaks of speed (they
keep the same speed variation both in acceleration and in
deceleration during the entire driving session).
The figure also shows that even if the interquartile ranges
are comparable among the drivers, the median values are
quite different for different features.
Moreover, for the feature "Liters Per 100 Kilometers,"
distributions present different medians for the drivers. This is
symptomatic of the different fuel consumption between the
drivers involved in the experiment.
The driver DM1 is confirmed to be the most aggressive
in terms of driving style: indeed his median is very close to
the third quartile, and this is confirmed by the fact that he is
the driver that reaches the highest speed (as confirmed by the
box plots). The driver DF1, on the other hand, presents
a fuel consumption very close to the one exhibited by DM1
but, as we observe from the box plot, she reaches
an average speed very close to the one of DM2, consuming
less fuel than DM1 and DF1.
The driver DM3 exhibits an average fuel consumption
slightly lower than DM2: this confirms the balanced driving
style of the driver.
As shown by the box plots, the medians of the Trip
Average KPL (i.e., the ratio between distance traveled and
fuel consumed) for DM2 and DM3 exhibit the best values;
moreover DM1 and DF1 present almost the same medians.
This means that DM2 and DM3 can travel more kilometers
with less fuel with respect to DM1 and DF1.
The distributions relating to the Engine RPM box plot
show that the drivers DM2 and 1 exhibit the higher value
related to the Engine RPM: probably they change gear too
late, and hence the Engine RPM rises. On the other hand,
we observe that the drivers 1 and DM3 present a lower median
value for the considered feature: this means that they do not
stress the engine.
This is also confirmed looking at the scatter plots of the
distributions. Panel (a) of the scatter plot figure shows a long
term average of the kilometers per liter that are done by
drivers (shown in different colors). The scatter plot, in this
case, shows that the distributions are well separated. The same
considerations can be made for the distribution of velocity
concerning path kind that is shown in panel (b). In this
case, the scatter plot reveals that different path kinds have
quite different velocity distributions.
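The box-plot reading used in this section (comparing medians against interquartile ranges) can be sketched with the standard library. The samples below are hypothetical and the driver names are placeholders; they are not values from the datasets. The helper only illustrates the separation criterion, assuming the "inclusive" quartile method.

```python
from statistics import quantiles


def box_stats(values):
    """Quartiles and interquartile range of a sample (inclusive method)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}


def median_outside_iqr(a, b):
    """True when the median of sample a falls outside the IQR of sample b."""
    stats_a, stats_b = box_stats(a), box_stats(b)
    return not (stats_b["q1"] <= stats_a["median"] <= stats_b["q3"])


# Hypothetical Trip Average KPL samples for two placeholder drivers.
kpl_by_driver = {
    "driver_x": [9.0, 9.5, 10.0, 10.5, 11.0],
    "driver_y": [13.0, 13.5, 14.0, 14.5, 15.0],
}
stats_x = box_stats(kpl_by_driver["driver_x"])
separated = median_outside_iqr(kpl_by_driver["driver_x"], kpl_by_driver["driver_y"])
print(stats_x, separated)
```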
6. Experiments Description
In this section, the description of the experiments conducted
to answer the RQs introduced in Section  is reported.
6.1. Experiments Design. The set of conducted experiments is
summarized in the experiments overview table. The table shows,
for each RQ, the list of the conducted experiments (the
experiment label, its goal, and the involved dataset).
Looking at the table, RQ1 is explored with five experiments
(E11, E12, E13, E14, and E15). E11, E12, and E13 aim to
evaluate the effectiveness of the approach in identifying the
driver in three cases:
(i) Regardless of the car but fixing the path (E11)
(ii) Regardless of the path but fixing the car (E12)
(iii) Regardless of both car and path (E13).
Moreover, E14 and E15 aim to identify the gender of the driver:
(i) Regardless of the car but fixing the path (E14)
(ii) Regardless of the path but fixing the car (E15).
Similarly, RQ2 is explored with a single experiment (E21)
aiming to evaluate if it is possible to identify the path kind
fixing the car but regardless of the path and the driver. RQ3 is
explored by means of two experiments. The first experiment
(E31) aims at evaluating if it is possible to identify the driver
familiarity with the vehicle fixing the car but regardless of
the path and the driver. The second experiment (E32) aims
at evaluating if it is possible to identify the driver familiarity
with the vehicle when car, driver, and path can change.
Each experiment consists in applying the classification
method described earlier on a specific dataset (the experiments
overview table reports for each experiment the considered
dataset).
In particular in E11 the classification method is applied
to the dataset 𝐴, since it contains several drivers and a
single path. As for E12, the dataset 𝐵 can be
used to study the influence of the path, fixing the car (since
it was built on a single car). In E13 the dataset 𝐶 is used
since it is based on four cars and contains five paths. E14
is conducted using the dataset 𝐴 since we need multiple
cars involving drivers of both genders. Conversely, E15 is
performed on the dataset 𝐵 since it is based on five paths
for each driver and contains a well-balanced set of male
and female drivers (five men and five women) on a single
car. For E2 we exploited the dataset 𝐵 since it contains
ten drivers and five paths and it was performed on a
single car. RQ3 is explored in the experiments E31 and E32
involving, respectively, the datasets 𝐴 and 𝐵. Based on the
classification process depicted in the corresponding figure,
each considered dataset is cleaned and normalized performing
the Datasets Generation steps. The normalized dataset is used
to generate a training set and a test set. Each experiment is
performed using the MLP classifier. Moreover, the
experiment E11 is performed using two alternative classifiers
(RF and DTW) in order to compare the effectiveness of the
MLP classifier with respect to other approaches often used
in literature. Finally, each experiment is also repeated using
different window segmentation strategies (1 s, 10 s, 30 s, 60 s,
and the Start&Stop method) in order to evaluate the impact of
the sliding window approach on the classifier performances.
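The window segmentation strategies mentioned above can be sketched as follows. Two assumptions are made here and are not taken from the paper: fixed windows are non-overlapping and an incomplete trailing window is dropped, and the Start&Stop method is approximated by cutting the series whenever the speed is at or below a stop threshold. Function names and the threshold parameter are hypothetical. With 14592 one-second samples, the fixed-window assumption yields the segment counts shown in the E11 accuracy figure (1459, 486, and 243 for 10 s, 30 s, and 60 s).

```python
def fixed_windows(samples, window_s, sample_period_s=1):
    """Split a series into non-overlapping windows of window_s seconds.

    Assumption: an incomplete trailing window is discarded.
    """
    size = max(1, window_s // sample_period_s)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def start_stop_segments(speeds, stop_threshold=0.0):
    """Assumed Start&Stop rule: one segment per stretch of motion.

    A segment ends when the speed drops to the threshold or below.
    """
    segments, current = [], []
    for speed in speeds:
        if speed > stop_threshold:
            current.append(speed)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


one_second_samples = list(range(14592))
counts = {w: len(fixed_windows(one_second_samples, w)) for w in (1, 10, 30, 60)}
moving = start_stop_segments([0, 3, 5, 0, 0, 4, 0])
print(counts, moving)
```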
Another consideration can be made on the training
dataset. It is obtained as a partition of the normalized
dataset and it is augmented with a column that specifies the
classification labels associated with the evaluated instance.
This column was derived by an expert looking at the evaluated
instance in the area considered for the study. This classification
is then used to both train the classifier and perform the
evaluation. In the context of RQ1, for the experiments E11,
E12, and E13, the considered datasets are augmented with a
column that specifies the driver identity. For E11 and E13
the possible driver identity labels are DM1, DF1, DM2,
and DM3 (corresponding to all the possible drivers involved
in 𝐴 and 𝐶). For E12, looking at the explored dataset 𝐵,
the driver identity label can assume the following values:
DM1, DM2, DM3, DM4, and DM5 and DF1, DF2, DF3, DF4,
and DF5. Similarly, for the experiments E14 and E15
the considered datasets are augmented with
a column that specifies the driver gender (the driver is labeled
as "Male" or "Female"). In the experiment conducted to answer
RQ2, the dataset 𝐵 was augmented with a column that
specifies the kind of the path (the path is labeled as "City
Street," "Highway," or "Dirt road") needed to perform the
training and the validation. Similarly, in the context of RQ3,
the explored datasets were augmented with a column that
specifies the ownership ("Owner" or "Not Owner").
Figure: Scatter plots of the data distribution on dataset 𝐴. (a) Kilometers per Liter (long term average, kpl) with respect to fuel cost (trip), for different drivers (DF1 to DF5 and DM1 to DM5). (b) GPS speed (meters/second) with respect to different path kinds (city street, highway, and dirt road).
Table: Experiments overview.
RQ | Experiment label | Dataset | Experiment goal
RQ1: to what extent do OBD II features allow driver identification? | E11 | 𝐴 | Evaluate the effectiveness of the proposed approach to identify the driver regardless of the car but fixing the path
RQ1 | E12 | 𝐵 | Evaluate the effectiveness of the proposed approach to identify the driver regardless of the path but fixing the car
RQ1 | E13 | 𝐶 | Evaluate the effectiveness of the proposed approach to identify the driver regardless of both car and path
RQ1 | E14 | 𝐴 | Evaluate the effectiveness of the proposed approach to identify the driver gender regardless of the car but fixing the path
RQ1 | E15 | 𝐵 | Evaluate the effectiveness of the proposed approach to identify the driver gender regardless of the path but fixing the car
RQ2: to what extent do OBD II features allow path kind identification? | E21 | 𝐵 | Evaluate the effectiveness of the proposed approach to identify the path kind fixing the car but regardless of the path and the driver
RQ3: is it possible, using the features available in OBD II, to identify the driver's familiarity with the vehicle? | E31 | 𝐴 | Evaluate the effectiveness of the proposed approach to identify the driver familiarity with the vehicle fixing the car but regardless of the path and the driver
RQ3 | E32 | 𝐵 | Evaluate the effectiveness of the proposed approach to identify the driver familiarity with the vehicle when car, driver, and path can change
Table: E11, precision, recall, accuracy, and training time obtained by using the MLP classifier for different window sizes. Columns: average window size (with number of training segments), driver (DF1, DM1, DM2, DM3), driver recall, driver precision, overall precision, overall recall, accuracy, training time (sec). Rows: the fixed window sizes and Start&Stop.
6.2. Evaluation Strategy. The validation has been performed
using classification quality metrics. The three metrics used to
evaluate the performance of our approach for the research
questions are precision, recall, and accuracy.
Precision has been computed as the proportion of the
observations that truly belong to the investigated class (e.g.,
driver, driver gender, driver path, and driver's familiarity)
among all those which were assigned to the class. It is the ratio
of the number of records correctly assigned to a specific class
to the total number of records assigned to that class (correct
and incorrect ones):
$$\text{Precision} = \frac{tp}{tp + fp},$$
where tp indicates the number of true positives and fp
indicates the number of false positives.
The recall has been computed as the proportion of
observations that were assigned to a given class, among all the
observations that truly belong to the class. It is the ratio of the
number of relevant records retrieved to the total number of
relevant records:
$$\text{Recall} = \frac{tp}{tp + fn},$$
where tp indicates the number of true positives and fn
indicates the number of false negatives.
Accuracy is defined as a statistical measure of how well a
binary classification is able to evaluate correctly the instances
under analysis with respect to the considered features. Basically
the accuracy is the proportion of true results (both
true positives and true negatives) among the total number of
instances evaluated.
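The three metrics defined above can be computed from the true and predicted labels as in the following sketch. The label sequences are hypothetical and only show how per-class precision and recall differ from the overall accuracy; the function names are illustrative, and a class with no assigned or no true instances is given a score of 0.0 by convention here.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision and recall of one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    """Proportion of instances whose predicted label is correct."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical predictions for six segments and three drivers.
y_true = ["DM1", "DM1", "DF1", "DF1", "DM2", "DM2"]
y_pred = ["DM1", "DF1", "DF1", "DF1", "DM2", "DM1"]
dm1 = per_class_metrics(y_true, y_pred, "DM1")
df1 = per_class_metrics(y_true, y_pred, "DF1")
overall = accuracy(y_true, y_pred)
print(dm1, df1, overall)
```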
In the following, for each RQ, the results of the
experimentation are described and discussed. The datasets
were all replicated once in different conditions of traffic and
timing, to avoid the bias of such variables. The results were
consistent with the ones obtained on the first replica and
hence are not reported in detail. In the online repository
(https://github.com/martinacimitile/Car-Data-Mining/wiki)
all the datasets, along with replicas, are provided to allow
experiments replication.
6.3. Discussion of Results. The E11 MLP table reports results
for the experiment E11 performed by using the MLP classifier.
In the first column of the table, the adopted window sizes
are reported. For each window size, we also specified the
number of observations that are collected (e.g., considering
a 1-second window, the number of collected training segments
is 14592). Starting from the third column, the table
reports, for each segmentation choice and for each driver
(column two), the values of precision, recall, accuracy, and
training times obtained by using the MLP classifier. Precision
and recall are evaluated for each class (the driver in this
context) and as averages. The results of the experiment E11
performed by using the RF and the DTW classifiers are shown
in the RF and DTW table (for brevity, they are shown only for
this first experiment). Comparing the two tables, we can
conclude that the best results are obtained for MLP on almost
all the sizes. Based on the results, we can also conclude that
the best segmentation choice on MLP is the threshold-based
segmentation. However, MLP is also the slowest classifier in
terms of training time. Moreover, MLP and RF perform better
than DTW on medium window sizes and for threshold-based
segmentation.
The E11 results are also summarized in the E11 trend figure.
Panel (a) shows the trend of the classification accuracy
for MLP, RF, and DTW with respect to the adopted window
strategy. Panel (b) shows the training times: RF and DTW
are comparable but several times faster than the MLP
network.
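For readers unfamiliar with the DTW baseline used in the comparison above, the following is a minimal sketch of the dynamic time warping distance with an absolute-difference local cost, followed by a nearest-neighbor label lookup. Pairing DTW with a 1-nearest-neighbor rule is an assumption made here for illustration; the section does not describe how the DTW classifier was configured. All names and series are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: minimal cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def nearest_label(query, labeled_series):
    """Assumed 1-nearest-neighbor rule: label of the closest series."""
    return min(labeled_series, key=lambda item: dtw_distance(query, item[0]))[1]


# Hypothetical speed segments with placeholder labels.
training = [([1, 2, 2, 3], "driver_x"), ([5, 6, 7], "driver_y")]
predicted = nearest_label([1, 2, 3], training)
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]), predicted)
```

The quadratic cost table is the usual reason DTW becomes expensive on long windows, which is worth keeping in mind when reading the training-time comparison.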
Table: E11, precision, recall, accuracy, and training time obtained by using the RF and DTW classifiers for different window sizes. Columns: average window size (with number of training segments), driver (DF1, DM1, DM2, DM3), and, for each of RandomForest (RF) and DTW: driver recall, driver precision, overall precision, overall recall, accuracy, training time (sec). Rows: the fixed window sizes and Start&Stop.
Figure: E11, trend of accuracy (a) and training times (b) evaluated by using MLP, RF, and DTW classifiers for different window sizes. Panel (a): accuracy (0 to 1) of MLP, RF, and DTW per window segment size. Panel (b): training times in seconds (0 to 1200) of MLP, RF, and DTW per window segment size. Window strategies and training segments: 1 s (14592), 10 s (1459), 30 s (486), 60 s (243), Start&Stop (7562).
For what concerns experiments E12 and E13, the results
are reported in the corresponding MLP tables. As we can see,
data follows the same trends observed for E11.
The following considerations can be made:
(i) Driver classification on the dataset 𝐴, on a single
path, gives better results than the same identification on
dataset 𝐵, which is performed on a fixed car but varying
the path.
(ii) The effectiveness obtained in E13 (i.e., regardless of
the car and the path) is, as expected, lower than E11
and E12 effectiveness. This is important to estimate
the quality of classification for applications where the
path or the car is fixed. In those cases, a RF or a DTW
approach could be adopted since precision and recall
are higher and could be acceptable (leading to faster
training times). Conversely, the MLP approach remains
the best classifier in our experimentation in terms of
classification quality, and it is the best one to choose
for the general case.
Results for precision and recall of the driver gender
identification experiments (E14 and E15) for different window
sizes are reported in the corresponding tables.
The tables show that even if the best results have been
obtained for the threshold-based segmentation, the fixed
windows of sixty seconds provide quite reasonable results
with a much shorter training time. This means that,
for applications more sensitive to training time, the
sixty-second window could be preferred.
The E2 table shows the results of experiment E2. It reports,
for each segmentation choice, the results (precision, recall,
accuracy, and training times) of the runs on the MLP classifier
for path kind identification among three classes (highway, city
street, and dirt road). Precision and recall are evaluated for
each class (the path kind in this context) and as averages. The
best results are also obtained in this case for threshold-based
segmentation on the MLP. In particular, the threshold-based
segmentation provides the best accuracy obtained for path
kind identification, ahead of the best value obtained with
fixed windows segmentation. Fixed window segments that are
too small or too wide provided very bad results, limiting the
range of useful window sizes.
The E31 and E32 tables, respectively, report results of
experiments E31 and E32. They report, for each segmentation
choice, the results for the MLP classifier in terms of precision,
recall, accuracy, and training times. The evaluated instance is
the familiarity detection and it is labeled as "Owner" and "Not
Owner." Precision and recall are evaluated for each class and
as averages.
Threshold-based segmentation on the MLP is confirmed to
be the best choice. For E31, the best value for accuracy
was also the best achieved overall accuracy for
familiarity detection. The performance trend among
fixed windows sizes is consistent with other classifiers. For
experiment E32, the best value of accuracy is lower, showing
that when car, driver, and path can change, the classifier has
worse performance.
Finally, even if it is not possible to directly compare
the obtained results with the results obtained in the work
discussed in the related work section (each approach is tested
on a different dataset), we can observe that the obtained
precision rate is
very encouraging. However, we reach a precision rate equal
to . with respect to the precision rate of . obtained in
[] (it is the best precision described in related work).
We conclude this section reporting the trend of the accuracy
metric for all the experiments and all the segments. The
accuracy trend figure highlights that the adoption of
threshold-based segmentation improves accuracy with respect
to using fixed windows for all the groups of features.
Table: E12, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), driver (DF1 to DF5, DM1 to DM5), driver recall, driver precision, overall precision, overall recall, accuracy, training time (sec). Rows: the fixed window sizes and Start&Stop.
Table: E13, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), driver (DF1, DM1, DM2, DM3), driver recall, driver precision, overall precision, overall recall, accuracy, training time (sec). Rows: the fixed window sizes and Start&Stop.
Table: E14, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), gender (FEMALE, MALE), gender precision, gender recall, overall precision, overall recall, accuracy, training time (sec). Rows: the fixed window sizes and Start&Stop.
Table: E15, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), gender (FEMALE, MALE), gender precision, gender recall, overall precision, overall recall, accuracy, training time (sec). Rows: the fixed window sizes and Start&Stop.
Table: E2, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), path kind (Highway, City street, Dirt road), path kind recall, path kind precision, average path kind precision, average path kind recall, overall accuracy, training time (sec). Rows: the fixed window sizes and Start&Stop.
Table: E31, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), ownership (Owner, Not Owner), ownership recall, ownership precision, overall precision, overall recall, accuracy, training time (sec). Rows: the fixed window sizes and Start&Stop.
Table: E32, precision, recall, accuracy, and training time evaluated by using MLP for different window sizes. Columns: average window size (with number of training segments), ownership (Owner, Not Owner), ownership recall, ownership precision, overall precision, overall recall, accuracy, training time (sec). Rows: the fixed window sizes and Start&Stop.
Figure: Accuracy trend over segments and experiments. X-axis: segment (1 s, 10 s, 30 s, 60 s, Start&Stop); y-axis: accuracy; series: E11, E12, E13, E14, E15, E2, E31, E32.
6.4. Threats to Validity. In this section the main threats to
the validity of our research are discussed. Construct validity
represents the quality of choices about the particular forms
of the variables (i.e., the choice of outcome measure or the
choice of treatment). They concern the relationship between
theory and observation. In our proposal, some problems can
be introduced by the hypothesis guessing of all the involved
drivers. Drivers can know or guess the desired end-result,
and they can change their behavior. This risk is mitigated by
not disclosing to the drivers the scope of the study and the
monitored features. Moreover, a bias in experimental
design can be introduced by the path knowledge. If some
drivers know the tested path well they can behave differently
compared to the drivers that never drove along that
path. This risk is mitigated by training all the drivers on the
paths so that their knowledge is similar.
Internal validity is concerned with the possibility that
some factors would be more suitable than the proposed features
to perform classification. To exclude this eventuality, we
performed a specific feature selection step studying correlation
and independence for all features available in the OBD II
standard. Moreover, in order to best validate the training of
the classifier, we adopted k-fold cross-validation.
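The cross-validation step mentioned above can be sketched as an index split into k folds. The number of folds used in the study is not recoverable from this section, so `k` is left as a parameter; the function name is hypothetical, and the sketch assumes contiguous, unshuffled folds for simplicity.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples.

    Assumption: contiguous folds without shuffling; the first
    n_samples % k folds receive one extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=3)
for train, test in folds:
    print(len(train), test)
```

Each sample is used for testing exactly once and for training in the remaining folds, so the reported metrics would be averaged over the k runs.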
Conclusion validity regards the degree to which the
conclusions we state about the relationship (between the
treatment and the outcome) are reasonable.
Threats to external validity concern the generalization of
our findings. Of course, replication on further projects to
confirm or contradict the obtained results is always desirable.
7. Conclusion
is paper proposes an approach to identify the driver, the
familiarity of the driver with the vehicle, and the kind of the
roadbasedonthestudyofthebehaviorofapersonduring
the driving. It is based on the assumptions that a proper set
ofbehavioralfeaturescanbeusedto(i)recognizedierent
drivers by capturing their dierent driving style; (ii) detect
their familiarity with the car; and (iii) detect the road kind
on which they are driving (among several types). Based on
these assumptions, we extracted, using a monitoring system
placed in the cars, an eective set of features whose samples
are sent to a time-series classication approach. e classier
exploits a supervised learning approach and is based on an
MLP network; its performances were compared with classic
decision tree classiers. e proposed time-series classier
has been proved to be eective at identifying the driver, the
driver genre, and the road kind aer being trained on the
proposed set of behavioral features. Specically, the approach
has been evaluated with eight experiments on three datasets
made from real data logged on four cars driven by ten drivers
in the Naples area. Each experiment allows exploring a
specic aspect of the proposed research questions, evaluating
the eectiveness of the proposed approach to
(i) identify the driver regardless of the car but xing the
path;
(ii) identify the driver regardless of the path but xing the
car;
(iii) identify the driver regardless of both car and path;
(iv) identify the driver gender regardless of the car but
xing the path;
(v) identify the driver gender regardless of the path but
xing the car;
(vi) identify the path kind xing the car but regardless of
the path and the driver;
(vii) identify the driver familiarity with the vehicle xing
the car but regardless of the path and the driver;
(viii) identify the driver familiarity with the vehicle when
car, driver, and path can change.
The obtained results show high accuracy in all the
experiments performed. In particular, the proposed approach is
very effective in identifying the driver. The best accuracy (.) is
obtained when identifying the gender of the driver regardless of the
car but fixing the path. Good accuracy is also obtained in all
the other experiments aiming to perform driver identification
(the accuracy value is never less than .). For
driver gender identification (evaluated regardless of the
car but fixing the path), the proposed approach shows an
accuracy of ., indicating that men and women have different
driving styles. Moreover, the proposed approach can also be
used to identify the road kind (highway, city street, and dirt
road) by fixing the car but regardless of the path and the
driver. The obtained accuracy value is, in this case, equal to
.. Effective results are also obtained in the detection of
driver familiarity. Here the accuracy is higher (.) when
identifying the driver familiarity with the vehicle fixed.
A slightly lower accuracy (.) is obtained when cars,
drivers, and paths can all change. Further experimentation in
this general case should be performed to see whether the size and the
number of hidden layers of a single network and the number
of networks positively influence the resulting accuracy.
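The further experimentation suggested here amounts to a search over three settings: the number of hidden layers, their size, and the number of networks. A minimal sketch of how such a search could be enumerated follows. The value ranges and the `evaluate` callback are assumptions made for illustration; the text does not state which values to explore or how each configuration would be scored.

```python
from itertools import product

# Hypothetical search ranges; the values are placeholders, not the paper's.
HIDDEN_LAYER_COUNTS = (1, 2, 3)
HIDDEN_LAYER_SIZES = (16, 32, 64)
ENSEMBLE_SIZES = (1, 3, 5)


def candidate_configs():
    """Yield every combination of layer count, layer size, and network count."""
    for n_layers, layer_size, n_networks in product(
        HIDDEN_LAYER_COUNTS, HIDDEN_LAYER_SIZES, ENSEMBLE_SIZES
    ):
        yield {
            # One tuple entry per hidden layer, all of the same width.
            "hidden_layers": (layer_size,) * n_layers,
            "n_networks": n_networks,
        }


def best_config(evaluate):
    """Return the configuration with the highest score.

    `evaluate` is supplied by the caller and maps a configuration to a
    number, for example an accuracy estimated on held-out data.
    """
    return max(candidate_configs(), key=evaluate)
```

Keeping the scoring function outside the search makes it possible to reuse the same grid with different evaluation protocols.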
Finally, we also compare the proposed MLP classifier with
classic decision tree classifiers. The results show that,
although all the classifiers achieve high
accuracy, the best performance is obtained by the
proposed ensemble MLP classifier. Training times, however,
also show that our ensemble classifier is the slowest during
the learning phase. This means that our approach is best suited
to applications that do not require real-time identification
of changing drivers (e.g., the owner can perform continuous
training on their own car and can then be detected with very high
accuracy). As future work, we are extending the study by adding
a behavioral characterization of the driver using both formal
approaches (model checking) and fuzzy rule extraction
from the example dataset. This will not only allow
the identification of drivers and paths but will also provide
an explanation of how the identified driver behaves with respect to
predefined classes (e.g., polluting driver, aggressive driver,
and cheap driver). Moreover, the evaluation can be extended
to a higher number of drivers, cars, and paths. Finally,
the application of the proposed approach to road kind
identification can be further explored by considering a more
accurate road classification model.
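To illustrate what combining several networks into one ensemble classifier can look like, the sketch below merges per-network predictions by majority vote. Majority voting is an assumption made for this example; this section does not state how the ensemble combines its member networks, and the `majority_vote` helper is a hypothetical name.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-network label sequences into one label per observation.

    `predictions` holds one sequence of predicted labels per network, all
    covering the same observations in the same order. On a tie, the label
    that appears first among the tied votes wins.
    """
    if not predictions:
        raise ValueError("at least one network's predictions are required")
    n_observations = len(predictions[0])
    if any(len(p) != n_observations for p in predictions):
        raise ValueError("every network must predict the same observations")
    # zip(*predictions) groups the votes cast for each observation.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

With three networks voting on three observations, `majority_vote([["a", "b", "c"], ["a", "b", "b"], ["b", "b", "c"]])` yields `["a", "b", "c"]`.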
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work has been partially supported by H EU-
funded projects NeCS and CISP, the EIT-Digital Project HII,
the PRIN project “Governing Adaptive and Unplanned Systems of
Systems”, and the EU Project CyberSure .