
Abstract

Human mobility data are an important proxy for understanding human mobility dynamics and developing useful analytical services. Unfortunately, these data are very sensitive since they may enable the re-identification of individuals in a database. Existing frameworks for privacy risk assessment in human mobility data provide data providers with tools to control and mitigate privacy risks, but they suffer two main shortcomings: (i) they have a high computational complexity; (ii) the privacy risk must be re-computed every time new data records become available. In this paper, we propose novel re-identification attacks and a fast and flexible data mining approach for privacy risk assessment in human mobility data. The idea is to learn classifiers to capture the relation between individual mobility patterns and the level of privacy risk of individuals. We show the effectiveness of our approach by an extensive experimentation on real-world GPS data in two urban areas, and investigate the relations between human mobility patterns and privacy risk.
A Data Mining Approach to Assess Privacy Risk in Human
Mobility Data
ROBERTO PELLUNGRINI, Department of Computer Science, University of Pisa, Italy
LUCA PAPPALARDO and FRANCESCA PRATESI, Department of Computer Science, University
of Pisa, Italy – ISTI-CNR, Pisa, Italy
ANNA MONREALE, Department of Computer Science, University of Pisa, Italy
Human mobility data are an important proxy to understand human mobility dynamics, develop analytical services, and design mathematical models for simulation and what-if analysis. Unfortunately, mobility data are very sensitive since they may enable the re-identification of individuals in a database. Existing frameworks for privacy risk assessment provide data providers with tools to control and mitigate privacy risks, but they suffer two main shortcomings: (i) they have a high computational complexity; (ii) the privacy risk must be recomputed every time new data records become available and for every selection of individuals, geographic areas, or time windows. In this article, we propose a fast and flexible approach to estimate privacy risk in human mobility data. The idea is to train classifiers to capture the relation between individual mobility patterns and the level of privacy risk of individuals. We show the effectiveness of our approach by an extensive experiment on real-world GPS data in two urban areas and investigate the relations between human mobility patterns and the privacy risk of individuals.
CCS Concepts: • Security and privacy → Pseudonymity, anonymity and untraceability; Usability in security and privacy; • Computing methodologies → Classification and regression trees; Transfer learning;
Additional Key Words and Phrases: Human mobility, data mining, privacy
ACM Reference format:
Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, and Anna Monreale. 2017. A Data Mining Approach
to Assess Privacy Risk in Human Mobility Data. ACM Trans. Intell. Syst. Technol. 9, 3, Article 31 (December
2017), 27 pages.
https://doi.org/10.1145/3106774
1 INTRODUCTION
Human mobility analysis has attracted growing interest from different disciplines in the past decade due to its importance in a wide range of applications, ranging from urban planning and transportation engineering (Wang et al. 2012; Pappalardo et al. 2015; Marchetti et al. 2015; Pappalardo et al. 2016) to public health (Colizza et al. 2007; Tizzoni et al. 2014). The availability of massive collections of mobility data and the development of sophisticated techniques for their analysis and mining (Zheng and Zhou 2011; Zheng 2015) have offered the unprecedented
This work has been partially funded by the European project SoBigData RI (Grant Agreement 654024).
Authors’ addresses: R. Pellungrini, L. Pappalardo, and A. Monreale, Computer Science Department, University of Pisa,
Italy; emails: roberto.pellungrini@di.unipi.it, luca.pappalardo@isti.cnr.it, anna.monreale@unipi.it; F. Pratesi, KDDLab -
ISTI - CNR, Via G. Moruzzi 1, Pisa, Italy; email: francesca.pratesi@isti.cnr.it.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
2017 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 2157-6904/2017/12-ART31 $15.00
https://doi.org/10.1145/3106774
ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 3, Article 31. Publication date: December 2017.
opportunity to observe human mobility at large scales and in great detail, leading to the discovery of the fundamental quantitative patterns of human mobility (Gonzalez et al. 2008; Pappalardo et al. 2013; Song et al. 2010b; Pappalardo et al. 2015), accurate predictions of future human whereabouts (Gambs et al. 2012; Lu et al. 2013), and the mathematical modeling of the main aspects of human mobility dynamics (Jiang et al. 2016; Pappalardo et al. 2016; Pappalardo and Simini 2016; Song et al. 2010a). These analyses are generally conducted on large datasets storing detailed information about the spatio-temporal points visited by individuals in a territory, like GPS tracks (Bazzani et al. 2010; Giannotti et al. 2011; Pappalardo et al. 2013, 2015) or mobile phone data (Gonzalez et al. 2008; Song et al. 2010b; Simini et al. 2012; Pappalardo et al. 2015). It goes without saying that mobility data are sensitive because people's whereabouts might reveal intimate personal information or allow the re-identification of individuals in a database, creating serious privacy risks (Rubinstein 2013). For example, it has been shown that just four spatio-temporal points can be enough to uniquely identify 95% of individuals in a mobility dataset (de Montjoye et al. 2013). Therefore, if mobility data are analyzed with malicious intent, there can be a serious violation of the privacy rights of the individuals involved.
Driven by these sensitive issues, in recent years, researchers from different disciplines have developed algorithms, methodologies, and frameworks to mitigate the individual privacy risks associated with the analysis of GPS trajectories, mobile phone data, and Big Data in general (Abul et al. 2008a; Monreale et al. 2014b; Wong et al. 2007). These tools aim at preserving both the right to privacy of individuals and the quality of the analytical results. However, to enable a practical application of the privacy-preserving techniques proposed in the literature, it is necessary to find a trade-off between privacy protection and data quality. To this aim, Pratesi et al. (2016) proposes a framework for the privacy risk assessment of individuals in a mobility dataset. This framework is compliant with the new EU General Data Protection Regulation, which explicitly imposes on data controllers an assessment of the impact of data protection for the most risky processes.1
Although frameworks like the one presented in Pratesi et al. (2016) are proved to be effective in many mobility scenarios, they suffer a major limitation: The privacy risk assessment has a high computational complexity because it requires a computation of the maximum risk of re-identification (or privacy risk) given the external knowledge that a malicious adversary might use in conducting an attack. The generation of the external knowledge is nonpolynomial in time since it considers all the possible ways the adversary can try to re-identify an individual in a mobility dataset. The computational complexity is a severe limitation because the privacy risks must be recomputed every time new data become available and for every selection of individuals, geographic areas, and periods of time.
In this article, we propose a data mining approach for privacy risk assessment that overcomes the computational shortcomings of existing frameworks. We first introduce a repertoire of re-identification attacks on mobility data and then use a data mining classifier to predict the level of privacy risk for an individual based solely on her mobility patterns. We evaluate our approach on real-world mobility data with an extensive experiment. Starting from a dataset of around 1 million GPS tracks produced by 12,000 private vehicles traveling in two urban areas in Italy during one month, we extract individual mobility patterns and compute the privacy risk level associated with vehicles according to the repertoire of re-identification attacks. We then train data mining classifiers and use them to determine (in polynomial time) the privacy risk level of previously unseen vehicles whose data were not used in the learning phase, based just on their individual mobility patterns. In a scenario where a Data Analyst requests a Data Provider for mobility data to develop an analytical service, the Data Provider (e.g., a mobile phone carrier) can use the classifiers
1The EU General Data Protection Regulation can be found at http://bit.ly/1TlgbjI.
to immediately identify risky individuals (i.e., individuals with a high level of privacy risk). Then, the Data Provider can select the most suitable privacy-preserving technique (e.g., k-anonymity, differential privacy) to reduce their privacy risk and release safe data to the Data Analyst. Although our approach is constrained to a fixed set of re-identification attacks, it can be easily extended to any type of attack defined on human mobility data.
Our experiments on GPS data show two main results. First, the classifiers are accurate in classifying the privacy risk level of unseen individuals in the two urban areas. The classifiers' predictions are particularly accurate in classifying the lowest and the highest levels of privacy risk, allowing an immediate distinction between safe individuals and risky individuals. In particular, we observe a high recall (99%) on the class of maximum privacy risk, meaning that the probability of misclassifying a high-risk individual as a low-risk individual is negligible. The second remarkable result is that the classifiers built on one urban area are effective when used to determine the privacy risk level of individuals in the other urban area. This suggests that the predictive models are able to infer rather general relationships between mobility patterns and privacy risk, which are independent of the number of individuals, the width of the geographic area, and the length of the period of observation. This means that the Data Provider can reuse the same classifiers for every selection of the dataset without the need to redo the training process every time. Finally, we quantify the impact of every individual mobility measure on the classifiers, observing that it changes with the type of re-identification attack considered (i.e., different attacks are based on gathering information about different mobility patterns). Based on these results, we think our work provides two main contributions. First, we show that we can effectively use data mining to estimate the privacy risk of individuals in a fast, accurate, and precise way, overcoming the computational issues related to existing frameworks. Second, we shed light on the relationships between individual human mobility patterns and the risk of re-identification, which was not clearly investigated in the literature.
The article is organized as follows. In Section 2, we define the data structures to describe human mobility data according to different data aggregations. In Section 3, we introduce the framework used for the privacy risk assessment, while Section 4 describes the data mining approach we propose. In Section 5, we show the results of our experiments, and we discuss them in Section 6. Section 7 presents the main works related to our article, and, finally, Section 8 concludes the article by proposing some lines of new research.
2 DATA DEFINITIONS
The approach we present in this article is tailored for human mobility data: data describing the
movements of a set of individuals during a period of observation. This type of data is generally
collected in an automatic way through electronic devices (e.g., mobile phones, GPS devices) in the
form of raw trajectory data. A raw trajectory of an individual is a sequence of records identifying
the movements of that individual during the period of observation (Zheng and Zhou 2011; Zheng
2015). Every record has the following fields: the identifier of the individual, a geographic location expressed in coordinates (generally latitude and longitude), and a timestamp indicating when the individual stopped in or went through that location. Depending on the specific application, a raw trajectory can be aggregated into different mobility data structures:
Denition 2.1 (Trajectory). The trajectory Tuof an individual uis a temporally ordered sequence
of tuples Tu=(l1,t1),(l2,t2),...,(ln,tn),whereli=(xi,yi)is a location, xiand yiare the coor-
dinates of the geographic location, and tiis the corresponding timestamp, ti<tjif i<j.
Denition 2.2 (Frequency vector). The frequency vector Wuof an individual uis a sequence of
tuples Wu=(l1,w1),(l2,w2),...,(ln,wn)where li=(xi,yi)is a location, wiis the frequency of
the location (i.e., how many times location liappears in the individual’s trajectoryTu), and wi>wj
if i<j. A frequency vector Wuis hence an aggregation of a trajectoryTu.
Denition 2.3 (Probability vector). The probability vector Puof an individual uis a sequence of
tuples Pu=(l1,p1),(l2,p2),...,(ln,pn),whereli=(xi,yi)is a location, piis the probability that
location liappears in Wu(i.e., pi=wi
liWuwi), and pi>pjif i<j. A probability vector Puis hence
an aggregation of a frequency vector Wu.
Denition 2.4 (Mobility Dataset). A mobility dataset is a set of mobility data structures D=
{S1,S2,...,Sn}where Su(1 un) is the mobility data structure of individual u. For example, a
mobility dataset can be a set of trajectories {T1,...,Tn}, a set of frequency vectors {W1,...,Wn},
or a set of probability vectors {P1,...,Pn}.Notethatthethreesetshavethesamesizen.
In the following, using the terms visit or point, we refer indifferently to a tuple in a trajectory, a tuple in a frequency vector, or a tuple in a probability vector. In other words, a visit v_i indicates a pair consisting of a location l_i and a piece of supplementary information (e.g., the timestamp t_i, the frequency w_i, or the probability p_i of the location). Moreover, we denote by U_set = {u_1, ..., u_n} the set of distinct individuals and by L_set = {l_1, ..., l_m} the set of distinct locations in a mobility dataset D. In this article, we assume that mobility data are represented with one of the data structures just described.
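To make the three data structures concrete, here is a minimal Python sketch (our own illustration, not the authors' code) that aggregates a trajectory into a frequency vector and a probability vector as in Definitions 2.2 and 2.3; locations are represented as hashable labels and timestamps as plain numbers:

```python
from collections import Counter

def frequency_vector(trajectory):
    """Aggregate a trajectory [(location, timestamp), ...] into a
    frequency vector [(location, count), ...] sorted by decreasing count."""
    counts = Counter(loc for loc, _ in trajectory)
    return sorted(counts.items(), key=lambda lw: -lw[1])

def probability_vector(trajectory):
    """Aggregate a trajectory into a probability vector
    [(location, probability), ...] sorted by decreasing probability."""
    freq = frequency_vector(trajectory)
    total = sum(w for _, w in freq)
    return [(loc, w / total) for loc, w in freq]
```

For instance, a trajectory visiting Pisa twice and Lucca once yields the frequency vector [("Pisa", 2), ("Lucca", 1)] and the probability vector [("Pisa", 2/3), ("Lucca", 1/3)].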
3 PRIVACY RISK ASSESSMENT FRAMEWORK
Several methodologies have been proposed in the literature for privacy risk assessment. In this ar-
ticle, we consider the framework proposed in Pratesi et al. (2016), which allows for the assessment
of the privacy risk inherent to human mobility data. The framework considers a scenario where
a Data Analyst requests a Data Provider human mobility data in order to develop an analytical
service. For its part, the Data Provider has to guarantee the right to privacy of the individuals
whose data are recorded. As a rst step, the Data Analyst communicates to the Data Provider the
data requirements for the analytical service. Assuming that the Data Provider stores a database
D, it aggregates, selects, and lters the dataset Dto meet the requirements of the Data Analyst
and produces a set of mobility datasets {D1,...,Dz}each with a dierent data structure and/or
aggregation of the data. The Data Provider then reiterates a four-step procedure until it considers
the data delivery safe:
In this article, we focus on improving Step (2) of the Data Delivery Procedure (i.e., Privacy Risk Computation), which is the most critical one from a computational point of view. Computing the privacy risk of an individual means simulating the several possible attacks that a malicious adversary can perform and computing the privacy risks associated with each attack. The privacy risk of an individual is related to her probability of re-identification in a mobility dataset with respect to a set of re-identification attacks. A re-identification attack assumes that an adversary gains access to a mobility dataset. On the basis of some background knowledge about an individual (i.e., the knowledge of a subset of her mobility data), the adversary tries to re-identify all the records in the dataset regarding the individual under attack. In this article, we use the definition of privacy risk (or re-identification risk) introduced in Samarati and Sweeney (1998a), Samarati (2001), and Sweeney (2002) and widely used in the literature. There can be many background knowledge categories, every category may have several background knowledge configurations, and every configuration may have many instances.
A background knowledge category is a kind of information known by the adversary about a specific set of dimensions of an individual's mobility data. Typical dimensions in mobility data are space, time, frequency of visiting a location, and probability of visiting a location (Section 2). Two examples of background knowledge categories are a subset of the locations visited by an individual (spatial dimension) and the specific times an individual visited those locations (spatial and temporal dimensions). The number k of the elements of a category known by the adversary is called the background knowledge configuration. An example of background knowledge configuration is the knowledge by the adversary of k = 3 locations of an individual. Finally, an instance of background knowledge is the specific information known by the adversary, such as a visit in a specific location.
We formalize these concepts as follows:
Denition 3.1 (Background knowledge conguration). Given a background knowledge category
B, we denote with Bk∈B={B1,B2,...,Bn}a specic background knowledge conguration,
where krepresents the number of elements in Bknown by the adversary. We dene an element
bBkas an instance of background knowledge conguration.
Example 3.2. Suppose a trajectory T_u = ⟨v_1, v_2, v_3, v_4⟩ of an individual u is present in the Data Provider's dataset D, where v_i = (l_i, t_i) is a visit, l_i is a location, and t_i the time when u visited location l_i, with i = 1, ..., 4 and t_i < t_j if i < j. Based on T_u, the Data Provider can generate all the possible instances of a background knowledge configuration that an adversary might use to re-identify the whole trajectory T_u. Considering the knowledge by the adversary of ordered subsequences of locations and k = 2, we obtain the background knowledge configuration B_2 = {(v_1, v_2), (v_1, v_3), (v_1, v_4), (v_2, v_3), (v_2, v_4), (v_3, v_4)}. The adversary, for example, might know instance b = (v_1, v_4) ∈ B_2 and aim at detecting all the records in D regarding individual u in order to reconstruct the whole trajectory T_u.
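The enumeration of B_2 in Example 3.2 is a plain combinations computation. A sketch (the function name is our own, and visits are assumed to be ordered by timestamp):

```python
from itertools import combinations

def background_configuration(trajectory, k):
    """All instances of the background knowledge configuration B_k:
    the k-subsequences of the trajectory's visits. Since the trajectory
    is temporally ordered, itertools.combinations preserves the temporal
    order of the visits within each instance."""
    return list(combinations(trajectory, k))
```

For a trajectory of four visits and k = 2, this yields exactly the six instances listed in the example.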
Let 𝒟 be a database, D a mobility dataset extracted from 𝒟 as an aggregation of the data on specific dimensions (e.g., an aggregated data structure and/or a filtering on time and/or space), and D_u the set of records representing individual u in D; we define the probability of re-identification as follows:

Definition 3.3 (Probability of re-identification). Given an attack, a function matching(d, b) indicating whether or not a record d ∈ D matches the instance of background knowledge configuration b ∈ B_k, and a function M(D, b) = {d ∈ D | matching(d, b) = True}, we define the probability of re-identification of an individual u in dataset D as:
PR_D(d = u | b) = 1 / |M(D, b)|,

that is, the probability to associate a record d ∈ D to an individual u, given instance b ∈ B_k.
Note that PR_D(d = u | b) = 0 if the individual u is not represented in D. Since each instance b ∈ B_k has its own probability of re-identification, we define the risk of re-identification of an individual as the maximum probability of re-identification over the set of instances of a background knowledge configuration:

Definition 3.4 (Risk of re-identification or privacy risk). The risk of re-identification (or privacy risk) of an individual u given a background knowledge configuration B_k is her maximum probability of re-identification Risk(u, D) = max PR_D(d = u | b) for b ∈ B_k. The risk of re-identification has the lower bound |D_u| / |D| (a random choice in D), and Risk(u, D) = 0 if u ∉ D.
To clarify the concepts of probability of re-identification and privacy risk, we provide the following example that, given a mobility dataset D of trajectories, shows how we can compute the two measures for a specific attack.
Example 3.5. Consider a set of individuals U_set = {u_1, u_2, u_3, u_4, u_5, u_6} and the corresponding dataset D of trajectories:

D = {
T_u1 = ⟨(2011/02/03, Lucca), (2011/02/03, Leghorn), (2011/02/03, Pisa), (2011/02/04, Florence)⟩
T_u2 = ⟨(2011/02/03, Lucca), (2011/02/03, Pisa), (2011/02/04, Lucca), (2011/02/04, Leghorn)⟩
T_u3 = ⟨(2011/02/03, Leghorn), (2011/02/03, Pisa), (2011/02/04, Lucca), (2011/02/04, Florence)⟩
T_u4 = ⟨(2011/02/04, Pisa), (2011/02/04, Leghorn), (2011/02/04, Florence)⟩
T_u5 = ⟨(2011/02/04, Pisa), (2011/02/04, Florence), (2011/02/05, Lucca)⟩
T_u6 = ⟨(2011/02/04, Lucca), (2011/02/04, Leghorn)⟩
}
Assume an adversary wants to perform an attack on individual u_1 knowing only the locations she visited (without any information about the time), with background knowledge configuration B_2 (i.e., the adversary knows two of the locations visited by individual u_1). We compute the risk of re-identification of individual u_1, given the dataset D of trajectories and the knowledge of the adversary, in two steps:

(1) We compute the probability of re-identification for every possible instance b ∈ B_2. Instance b = {Lucca, Leghorn} has a probability of re-identification PR_D(d = u_1 | {Lucca, Leghorn}) = 1/4 because the pair {Lucca, Leghorn} appears in trajectories T_u1, T_u2, T_u3, and T_u6 (i.e., in a total of four trajectories). Instance {Lucca, Pisa} has a probability of re-identification PR_D(d = u_1 | {Lucca, Pisa}) = 1/4 because the pair appears in four trajectories T_u1, T_u2, T_u3, and T_u5. Instance {Lucca, Florence} has a probability of re-identification PR_D(d = u_1 | {Lucca, Florence}) = 1/3 because the pair appears in three trajectories T_u1, T_u3, and T_u5. Analogously, we compute the probability of re-identification for the other three possible instances: PR_D(d = u_1 | {Leghorn, Pisa}) = 1/4, PR_D(d = u_1 | {Leghorn, Florence}) = 1/3, PR_D(d = u_1 | {Pisa, Florence}) = 1/4;

(2) We compute the risk of re-identification of individual u_1 as the maximum of the probabilities of re-identification among all instances in B_2: Risk(u_1) = max(1/4, 1/4, 1/3, 1/4, 1/3, 1/4) = 1/3.
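The two steps of Example 3.5 can be reproduced with a short brute-force sketch (our own illustration, not the framework's implementation; each visit is encoded as a (day, location) pair with simplified day indices, and the attack uses only the set of locations, ignoring time):

```python
from itertools import combinations

def location_attack_risk(dataset, u, k):
    """Risk of re-identification of u under the location attack of
    Example 3.5: the adversary knows k distinct locations visited by u,
    with no temporal information. Visits are (day, location) pairs."""
    locations = {loc for _, loc in dataset[u]}
    risk = 0.0
    for instance in combinations(sorted(locations), k):
        # an individual matches if all k known locations appear in her data
        matches = sum(1 for T in dataset.values()
                      if set(instance) <= {loc for _, loc in T})
        risk = max(risk, 1 / matches)  # u herself always matches
    return risk

# The dataset of Example 3.5, with days 3, 4, 5 standing in for the dates.
D = {
    "u1": [(3, "Lucca"), (3, "Leghorn"), (3, "Pisa"), (4, "Florence")],
    "u2": [(3, "Lucca"), (3, "Pisa"), (4, "Lucca"), (4, "Leghorn")],
    "u3": [(3, "Leghorn"), (3, "Pisa"), (4, "Lucca"), (4, "Florence")],
    "u4": [(4, "Pisa"), (4, "Leghorn"), (4, "Florence")],
    "u5": [(4, "Pisa"), (4, "Florence"), (5, "Lucca")],
    "u6": [(4, "Lucca"), (4, "Leghorn")],
}
```

Running it on the example reproduces Risk(u_1) = 1/3.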
We remark that the Data Provider does not know in advance the instance associated with the highest probability of re-identification of individual u_1 (i.e., the "best" combination of points from the perspective of the malicious adversary). The Data Provider can use the preceding computation in a preventive manner to identify the instance yielding the highest probability of re-identification, which is, for individual u_1, instance {Leghorn, Florence}. Due to the definition of risk, which depends on both an attacked individual's structure and the structures of all the other individuals in the dataset, identifying a priori an attack where the adversary has access to the best k-combination of points is difficult for the Data Provider. A particular case where the Data Provider can immediately recognize the best k-combination of points is a scenario where the adversary knows a location visited only by the individual under attack. Since the Data Provider has a view of the entire dataset, she can simulate such an attack by selecting the locations visited by just one individual (i.e., with number of visits equal to 1). In such a case, computing the privacy risk for the individuals visiting those locations does not require any combinatorial computation because the privacy risk is 1 for any value of k.
An individual is hence associated with several privacy risks, one for every background knowledge configuration of an attack. Every privacy risk of an individual can be computed using the procedure described in Section 1 of the Supplementary Material.
3.1 Computational Complexity of Privacy Risk Computation
The procedure of privacy risk computation has a high computational complexity. We assume that the adversary uses all the information available to her when conducting a re-identification attack on an individual. Since it is unlikely that an adversary knows the complete movements of an individual (i.e., all the points), we introduced the concept of background knowledge configuration B_k, which indicates the portion of points k known by the adversary when performing an attack on an individual. The higher k, the more points the adversary knows about the individual's movements. The maximum possible value of k is len, the length of the data structure of an individual.
The best k-combination of points is the one leading to the highest probability of re-identification of the individual under attack. However, we do not know such a best combination in advance. For this reason, given k, when we simulate an attack, we compute all the possible k-combinations of points an adversary could know. Given a combination of k points, we assume that the adversary uses all these k points to conduct the attack. This leads to a high overall computational complexity O(C(len, k) × N), where C(len, k) = len! / (k!(len − k)!) is the binomial coefficient, since the framework generates C(len, k) background knowledge configuration instances and, for each instance, it executes N matching operations by applying the function matching.

In the extreme case where the adversary knows the complete movement of an individual (i.e., she knows all the points), we have k = len and the computational complexity is O(N). In general, in the range k ∈ [1, len/2] the computational complexity of the attack simulation increases with k, while for k ∈ [len/2, len] the computational complexity decreases with k. While all the C(len, k) possible instances must necessarily be considered since, as already stated, we cannot exclude any of them a priori, we can reduce the number N of matching operations between a single instance and the data structures in the dataset by eliminating unnecessary comparisons. To clarify the point, consider an attack where we try to match an instance b = ⟨(l_1, t_1), (l_2, t_2)⟩ against a trajectory T starting with the visit (l_3, t_3), with t_1 < t_2 < t_3. Since T is temporally ordered (see Definition 2.1), we can immediately exclude that b can be found in T. Although the overall worst-case complexity of the attack remains O(C(len, k) × N), in practice, this optimization speeds up the execution by skipping unnecessary comparisons during the matching between an instance and a trajectory. However, as we will show in Section 5, in practice, the matching optimizations do not eliminate the computational problem, and the simulation of the attacks can take up to 2 weeks to compute the privacy risks of individuals in our datasets.
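The temporal pruning described above can be sketched as a matching function (our own illustration, not the paper's implementation) that checks whether an instance occurs as an ordered subsequence of a trajectory and bails out as soon as the timestamps rule a match out:

```python
def matches(instance, trajectory):
    """True if the instance (a temporally ordered tuple of visits, each a
    (location, timestamp) pair) occurs as a subsequence of the trajectory.
    Both sequences are ordered by timestamp, so once the current trajectory
    timestamp exceeds that of the next visit still to be matched, no match
    is possible and the scan stops early."""
    i = 0
    for visit in trajectory:
        if visit == instance[i]:
            i += 1
            if i == len(instance):
                return True  # all k known visits found, in order
        elif visit[1] > instance[i][1]:
            # temporal pruning: every later visit of T is even more recent,
            # so instance[i] can no longer occur
            return False
    return False
```

With b = ⟨(l_1, t_1), (l_2, t_2)⟩ and a trajectory starting at (l_3, t_3), t_1 < t_2 < t_3, the very first comparison already rejects the match, as in the discussion above.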
Example 3.6. Consider the following scenario where an adversary knows 5 locations of an individual with a trajectory of length len = 50. Computing the privacy risk of the individual with respect to the background knowledge configuration B_5 requires the generation of C(50, 5) = 2,118,760 background knowledge instances. In a dataset of N = 100,000 individuals, each with len = 50, the overall simulation of the attack would take around 210 billion matching operations.
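The arithmetic of Example 3.6 can be checked directly with Python's standard library (a sketch, not part of the paper):

```python
from math import comb

len_T, k, N = 50, 5, 100_000
instances = comb(len_T, k)    # background knowledge instances per individual
matching_ops = instances * N  # matching operations over the whole dataset
```

Here `instances` is 2,118,760 and `matching_ops` is 211,876,000,000, i.e., the "around 210 billion" operations of the example.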
4 A DATA MINING APPROACH FOR PRIVACY RISK ASSESSMENT
Given its computational complexity, Procedure 3.2 (Privacy Risk Computation) becomes infeasible as the size of the dataset increases, since it requires enormous time and computational costs. This drawback is even more serious if we consider that the privacy risks must necessarily be recomputed every time the mobility dataset is updated with new data records and for every selection of individuals, geographic areas, and periods of time. To overcome these problems, we propose a fast and flexible data mining approach. The idea is to train a predictive model to predict the privacy risk of an individual based solely on her individual mobility patterns. The predictive model can be either a regression model, if we want to predict the actual value of privacy risk, or a classification model, if we want to predict the level of privacy risk. The training of the predictive model uses a training dataset where every example refers to a single individual and consists of (i) a vector of the individual's mobility features and (ii) the privacy risk value or the privacy risk level of the individual, depending on whether we perform a regression or a classification task, respectively.
Formally, we dene a regression training dataset as a tuple TR =(F,R)where Fis the set of the
individual’s mobility feature vectors and Ris the vector of the individual’s privacy risk. Similarly,
we dene a classication training dataset as a tuple TC =(F,C)where Cis the vector of the in-
dividual’s privacy risk level (e.g., from level 1 indicating no risk to level 10 indicating maximum
privacy risk). We dene a possible set Fof mobility features in Section 4.1, and we introduce a
repertoire of attacks on mobility data that can be used to assess privacy risks in Section 4.2.We
describe how to construct the regression training dataset and the classication training dataset in
Section 4.3. In Section 4.4, we describe how a Data Provider can use our approach in practice to
determine the privacy risk of individuals in her database. We make our approach parametric with
respect to the predictive algorithm: In our experiments, we use a Random Forest regressor and a
Random Forest classier for the regression and classication experiments, respectively (Section 5),
but every algorithm available in the literature can be used for the predictive tasks. Note that our
approach is constrained to the xed well-dened set of attacks introduced in Section 4.2,whichis
a representative set of nine suciently diverse attacks tailored for the data structures required to
compute standard individual human mobility measures. Our approach can be easily extended to
any type of attack defined on human mobility data by using the privacy framework proposed by Pratesi et al. (2016).
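As a concrete, entirely illustrative sketch of how a classification training dataset TC = (F, C) could be assembled, the snippet below discretizes privacy risk values into levels; the thresholds are made up for illustration, and the paper's actual binning and Random Forest classifier are not reproduced here:

```python
def risk_level(risk, thresholds=(0.0, 0.1, 0.25, 0.5, 1.0)):
    """Map a privacy risk value in (0, 1] to a discrete risk level
    (1 = lowest bucket, len(thresholds) - 1 = maximum risk).
    The thresholds here are illustrative, not the paper's."""
    for level in range(1, len(thresholds)):
        if risk <= thresholds[level]:
            return level
    raise ValueError("risk must lie in (0, 1]")

def classification_training_set(features, risks):
    """Build TC = (F, C): one (feature-vector, risk-level) example per
    individual, from dicts mapping individuals to feature vectors and
    to privacy risk values."""
    return [(features[u], risk_level(risks[u])) for u in features]
```

Each (feature vector, level) pair is one training example; any off-the-shelf classifier can then be fit on these pairs.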
4.1 Individual Mobility Features
The mobility dynamics of an individual can be described by a set of measures widely used in the
literature. Some measures describe specific aspects of an individual's mobility; other measures describe an individual's mobility in relation to collective mobility.

A subset of these measures can be simply obtained as aggregations of an individual's trajectory or frequency vector. The number of visits V of an individual is the length of her trajectory (i.e., the sum of all the visits she made in any location during the period of observation (Gonzalez et al. 2008; Pappalardo et al. 2015)). By dividing this quantity by the number of days in the period of observation, we obtain the average number of daily visits V̄, which is a measure of the erratic behavior of an individual during the day (Pappalardo and Simini 2016). The length Locs of the frequency vector of an individual indicates the number of distinct places visited by the individual during the period of observation (Gonzalez et al. 2008; Song et al. 2010a). Dividing Locs by the number of available locations in the considered territory, we obtain Locs_ratio, which indicates the fraction of territory exploited by an individual in her mobility behavior. The maximum distance D_max traveled by an individual is defined as the length of the longest trip of the individual during the period of observation (Williams et al. 2015), while D^trip_max is defined as the ratio between D_max and the maximum possible distance between the locations in the area of observation. The sum of all the trip lengths traveled by the individual during the period of observation is defined as D_sum (Williams et al. 2015). It can also be averaged over the days in the period of observation, thus obtaining D̄_sum.
In addition to these simple quantities, more complex measures can be computed from an individual's mobility data, such as the radius of gyration (Gonzalez et al. 2008; Pappalardo et al. 2013) and the mobility entropy (Eagle and Pentland 2009; Song et al. 2010b). The radius of gyration r_g is the characteristic distance traveled by an individual during the period of observation, formally defined by Gonzalez et al. (2008) and Pappalardo et al. (2013, 2015) as:

r_g = \sqrt{\frac{1}{V} \sum_{i \in L} w_i (r_i - r_{cm})^2},

where w_i is the individual's visitation frequency of location i, V is the total number of visits of the individual, r_i is a bi-dimensional vector describing the geographical coordinates of location i, and r_{cm} = \frac{1}{V} \sum_{i \in L} r_i is the center of mass of the individual (Gonzalez et al. 2008; Pappalardo et al. 2013). The mobility entropy E is a measure of the predictability of an individual's trajectory. Formally, it is defined as the Shannon entropy of an individual's movements (Eagle and Pentland 2009; Song et al. 2010b):

E = -\sum_{i \in L} p_i \log_2 p_i,

where p_i is the probability of location i in an individual's probability vector.
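The two measures above can be computed directly from a list of visits. The sketch below uses planar coordinates and Euclidean distance for simplicity (real analyses would use geodesic distance); function names are illustrative, not the authors' code.

```python
import math
from collections import Counter

def radius_of_gyration(visits):
    """visits: list of (x, y) coordinates, one entry per visit.
    Characteristic distance from the individual's center of mass,
    following r_g = sqrt((1/V) * sum_i w_i (r_i - r_cm)^2); summing
    over visits is equivalent to weighting locations by frequency."""
    V = len(visits)
    cx = sum(x for x, y in visits) / V
    cy = sum(y for x, y in visits) / V
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in visits) / V)

def mobility_entropy(visits):
    """Shannon entropy of the visitation probabilities p_i."""
    V = len(visits)
    return -sum((c / V) * math.log2(c / V) for c in Counter(visits).values())
```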
Also, for each individual, we keep track of the characteristics of three different locations: the most visited location, the second most visited location, and the least visited location. The frequency w_i of a location i is the number of times an individual visited location i during the period of observation, while the average frequency w̄_i is the daily average frequency of location i. We also define w_i^pop as the frequency of a location divided by the popularity of that location in the whole dataset. The quantity U_i^ratio is the number of distinct individuals who visited a location i divided by the total number |U_set| of individuals in the dataset, while U_i is the number of distinct individuals
31:10 R. Pellungrini et al.
Table 1. The Individual Mobility Measures Used in Our Work

Symbol      | Name                       | Structures                                     | Attacks
V           | visits                     | trajectory                                     | Location, Location Sequence, Visit
V̄           | daily visits               | trajectory                                     | Location, Location Sequence, Visit
D_max       | max distance               | trajectory                                     | Location, Location Sequence, Visit
D_sum       | sum distances              | trajectory                                     | Location, Location Sequence, Visit
D̄_sum       | D_sum per day              | trajectory                                     | Location, Location Sequence, Visit
D_max^trip  | D_max over area            | trajectory, location set                       | Location, Location Sequence, Visit
Locs        | distinct locations         | frequency vector                               | Frequent Location, Frequent Location Sequence
Locs_ratio  | Locs over area             | frequency vector, location set                 | Frequent Location, Frequent Location Sequence
R_g         | radius of gyration         | probability vector                             | Probability
E           | mobility entropy           | probability vector                             | Probability
E_i         | location entropy           | probability vector, probability vector dataset | Probability
U_i         | individuals per location   | frequency vector, frequency vector dataset     | Frequency, Proportion, Home and Work
U_i^ratio   | U_i over individuals       | frequency vector, frequency vector dataset     | Frequency, Proportion, Home and Work
w_i         | location frequency         | frequency vector, frequency vector dataset     | Frequency, Proportion, Home and Work
w_i^pop     | w_i over overall frequency | frequency vector, frequency vector dataset     | Frequency, Proportion, Home and Work
w̄_i         | daily location frequency   | frequency vector, frequency vector dataset     | Frequency, Proportion, Home and Work

For every mobility measure, we indicate the minimal data structures (among those presented in Section 2) needed to compute it and the attacks that can be performed on the corresponding data structures.
who visited location i during the period of observation. Finally, the location entropy E_i is the predictability of location i, defined as:

E_i = -\sum_{u \in U_i} p_u \log_2 p_u,

where p_u is the probability that individual u visits location i.

Table 1 indicates, for every mobility measure, the minimal data structures (among those presented in Section 2) required for its computation and the possible re-identification attacks that can be conducted on these structures. Every individual u in the dataset is described by a mobility vector m_u of the 16 mobility features described earlier. The collection of the mobility vectors of individuals u_1, ..., u_n is the mobility matrix F = (m_{u_1}, ..., m_{u_n}). It is worth noting that all the measures can be computed in time linear in the size of the corresponding data structure.
4.2 Privacy Aacks on Mobility Data
In this section, we describe the attacks we use in this article: the Proportion and Probability attacks are a novel contribution, while the others are attacks already existing in the literature. In Section 2 of the Supplementary Material, we provide the pseudocode to reproduce the attacks and some toy examples that illustrate how the attacks work.2

2 We also provide the Python code we use for the simulation of the attacks at https://github.com/pellungrobe/privacy-lib.
4.2.1 Location Aack. In a Location attack, the adversary knows a certain number of locations
visited by the individual, but she does not know the temporal order of the visits. Since an individual
might visit the same location multiple times in a trajectory, the adversary’s knowledge is a multiset
that may contain more occurrences of the same location. This is similar to considering the locations
as items of transactions. Similar attacks on transactional databases are used in Terrovitis et al.
(2008), Xu et al. (2008a), and Xu et al. (2008b) with the dierence that a transaction is a set of
items and not a multiset. Given an individual s,wedenotebyL(Ts)the multiset of locations liTs
visited by s. The background knowledge category of a Location attack is dened as follows:
Denition 4.1 (Location background knowledge). Let kbe the number of locations liof an indi-
vidual sknown by the adversary. The Location background knowledge is a set of congurations
based on klocations, dened as Bk=L(Ts)[k].Here,L(Ts)[k]denotes the set of all the possible
k-combinations of the elements in set L(Ts).
Since each instance b ∈ B_k is a subset of locations X_s ⊆ L(T_s) of length k, given a record d ∈ D and the corresponding individual u, we define the matching function as:

matching(d, b) = \begin{cases} \text{true} & \text{if } b \subseteq L(T_u) \\ \text{false} & \text{otherwise} \end{cases} \quad (1)
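Under these definitions, simulating the Location attack amounts to enumerating an individual's k-combinations of locations and counting, for each one, how many records match it. The sketch below takes the worst case over configurations, scoring each as the reciprocal of the number of matching records; this is a minimal illustration of the idea, not the authors' simulation library, and the data layout (plain lists of location identifiers) is an assumption.

```python
from collections import Counter
from itertools import combinations

def location_match(record, background):
    """True if the multiset `background` is contained in the
    multiset of locations visited in `record`."""
    rec = Counter(record)
    return all(rec[loc] >= cnt for loc, cnt in Counter(background).items())

def location_attack_risk(dataset, uid, k):
    """Worst-case re-identification risk of individual `uid` in
    `dataset` (a dict id -> list of locations) under an adversary
    knowing k locations: the maximum over all k-combinations of
    1 / (number of matching records)."""
    risk = 0.0
    for combo in set(combinations(sorted(dataset[uid]), k)):
        matches = sum(location_match(t, combo) for t in dataset.values())
        risk = max(risk, 1.0 / matches)  # the individual's own record always matches
    return risk
```

A risk of 1.0 means some k-combination of the individual's locations is unique in the dataset and re-identifies her exactly.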
4.2.2 Location Sequence Attack. In a Location Sequence attack, introduced in Mohammed et al. (2009) and Monreale et al. (2014a), the adversary knows a subset of the locations visited by the individual and the temporal ordering of the visits. Given an individual s, we denote by L(T_s) the sequence of locations l_i ∈ T_s visited by s. The background knowledge category of a Location Sequence attack is defined as follows:

Definition 4.2 (Location sequence background knowledge). Let k be the number of locations l_i of an individual s known by the adversary. The Location Sequence background knowledge is a set of configurations based on k locations, defined as B_k = L(T_s)^[k], where L(T_s)^[k] denotes the set of all the possible k-subsequences of the elements in the sequence L(T_s).

We indicate with a ⊑ b that a is a subsequence of b. Each instance b ∈ B_k is a subsequence of locations X_s ⊑ L(T_s) of length k. Given a record d ∈ D and the corresponding individual u, we define the matching function as:

matching(d, b) = \begin{cases} \text{true} & \text{if } b \sqsubseteq L(T_u) \\ \text{false} & \text{otherwise} \end{cases} \quad (2)
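The subsequence test b ⊑ L(T_u) can be implemented with a single pass over the trajectory; a minimal sketch (locations as plain identifiers, name illustrative):

```python
def is_subsequence(b, seq):
    """True if the elements of b occur in seq in the same relative
    order, not necessarily contiguously.  Each `in` test on the shared
    iterator consumes seq up to (and including) the matched element,
    which enforces the ordering."""
    it = iter(seq)
    return all(loc in it for loc in b)
```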
4.2.3 Visit Aack. In a Visit attack, introduced in Abul et al. (2008b), Yarovoy et al. (2009),
Monreale et al. (2010a), and de Montjoye et al. (2013), an adversary knows a subset of the loca-
tions visited by the individual and the time the individual visited these locations. The background
knowledge category of a Visit attack is dened as:
Denition 4.3 (Visit based background knowledge). Let kbe the number of visits vof a individual
sknown by the adversary. The Visit background knowledge is a set of congurations based on
kvisits, dened as Bk=T[k]
swhere T[k]
sdenotes the set of all the possible k-subsequences of the
elements in trajectory Ts.
Each instance bBkis a spatio-temporal subsequence Xsof length k. The subsequence Xshas
a positive match with a specic trajectory if the latter supports bin terms of both spatial and
temporal dimensions. Thus, given a record dD, we dene the matching function as:
matchinд(d,b)=true (li,ti)b,(ld
i,td
i)d|li=ld
iti=td
i
false otherwise (3)
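Because timestamps inside a trajectory are strictly increasing, requiring every known (location, time) pair to appear in the record already preserves the ordering, so the matching predicate reduces to multiple exact membership tests. A sketch (data layout assumed, name illustrative):

```python
def visit_match(record, background):
    """record and background are lists of (location, timestamp) pairs.
    A record matches if it contains every known visit with both the
    same location and the same time."""
    visits = set(record)
    return all(v in visits for v in background)
```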
4.2.4 Frequent Location and Sequence Attack. We also introduce two attacks based on the knowledge of location frequencies. In the Frequent Location attack, the adversary knows a number of frequent locations visited by an individual, while in the Frequent Location Sequence attack, the adversary knows a subset of the locations visited by an individual and their relative ordering with respect to frequency (from most frequent to least frequent). The Frequent Location attack is similar to the Location attack, with the difference that, in frequency vectors, a location can appear only once. As a consequence, this attack follows the same principle as Terrovitis et al. (2008) and Xu et al. (2008a, 2008b). The Frequent Location Sequence attack is similar to the Location Sequence attack, with two differences: first, a location can appear only once in the vector; second, locations in a frequency vector are ordered by descending frequency and not by time. Thus, the location set or sequence X_s of length k cannot contain repetitions of locations. We omit the definitions of the matching functions because they are similar to those of the attacks conducted on trajectories: They must only account for the absence of location repetitions.
4.2.5 Frequency Aack. We introduce an attack where an adversary knows the locations visited
by the individual, their reciprocal ordering of frequency, and the minimum number of visits of the
individual to the locations. Thus, when searching for specic subsequences, the adversary must
consider also subsequences containing the known locations with a greater frequency. We recall
that, in the case of frequency vectors, we denote with visit vWthe pair composed by the frequent
location and its frequency. We also recall that we denote withWsthe frequency vector of individual
s. The background knowledge category of a Frequency attack is dened as follows:
Denition 4.4 (Frequency background knowledge). Let kbe the number of visits vof the frequency
vector of individual sknown by the adversary. The Frequency background knowledge is a set of
congurations based on kvisits, dened as Bk=W[k]
swhere W[k]
sdenotes the set of all possible
k-combinations of frequency vector Ws.
Each instance bBkis a frequency vector, and, given a record dD, we dene the matching
function as:
matchinд(d,b)=true (li,wi)b,(ld
i,wd
i)W|li=ld
iwiwd
i
false otherwise (4)
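The key point of Equation (4) is that the known frequency is a lower bound (w_i ≤ w_i^d), since the adversary knows only a minimum number of visits. A minimal sketch, representing frequency vectors as dicts (an assumption; names illustrative):

```python
def frequency_match(record, background):
    """record and background map locations to visit counts.
    A record matches if every known location appears in it with at
    least the known frequency (w_i <= w_i^d)."""
    return all(record.get(loc, 0) >= w for loc, w in background.items())
```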
4.2.6 Home and Work Attack. In the Home and Work attack, introduced in Zang and Bolot (2011), the adversary knows the two most frequent locations of an individual and their frequencies. It essentially assumes the same background knowledge as the Frequency attack, but restricted to two locations. This is the only attack where the background knowledge configuration is composed of just a single 2-combination for each individual. Mechanically, the matching function for this type of attack is identical to the matching function of the Frequency attack.
4.2.7 Proportion Aack. We introduce an attack assuming that an adversary knows a subset of
locations visited by an individual and also the relative proportion between the number of visits to
these locations. In particular, the adversary knows the proportion between the frequency of the
most frequent known location and the frequency of the other known locations. This means that the
candidate set of possible matches consists of all the set of locations with similar proportions. Given
a set of visits XW,we denote with l1 the most frequent location of Xand with w1its frequency.
We also denote with prithe proportion between wiand w1for each viv1X.Wethendenote
with LR a set of frequent locations liwith their respectivepri. The background knowledge category
for this attack is dened as follows:
Denition 4.5 (Proportion background knowledge). Let kbe the number of locations liof an indi-
vidual sknown by the adversary. The Proportion background knowledge is a set of congurations
ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 3, Article 31. Publication date: December 2017.
A Data Mining Approach to Assess Privacy Risk in Human Mobility Data 31:13
based on klocations, dened as Bk=LR[k]
swhere LR[k]
sdenotes the set of all possible k-
combinations of the frequent locations liwith associated pri.
Each adversary's knowledge instance b ∈ B_k is an LR structure, as previously defined. Given a record d ∈ D, we define the matching function as:

matching(d, b) = \begin{cases} \text{true} & \text{if } \forall (l_i, pr_i) \in b \;\, \exists (l_i^d, pr_i^d) \in LR^d : l_i = l_i^d \wedge pr_i \in [pr_i^d - \delta, pr_i^d + \delta] \\ \text{false} & \text{otherwise} \end{cases} \quad (5)

In the equation, δ is a tolerance factor for the matching of proportions. In our experiments, δ = 0.1.
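Building the LR structure and testing Equation (5) can be sketched as follows (dict-based layout and names are illustrative assumptions, not the authors' code):

```python
def proportions(freqs):
    """Build the LR structure from a location -> frequency dict: each
    location's frequency divided by the maximum known frequency w_1."""
    w1 = max(freqs.values())
    return {loc: w / w1 for loc, w in freqs.items()}

def proportion_match(record_lr, background_lr, delta=0.1):
    """True if every known location appears in the record's LR
    structure with a proportion within +/- delta (Equation (5))."""
    return all(loc in record_lr and abs(pr - record_lr[loc]) <= delta
               for loc, pr in background_lr.items())
```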
4.2.8 Probability Aack. In a Probability attack an adversary knows the locations visited by
an individual and the probability of that individual to visit each location. This attack is similar to
the one introduced by Unnikrishnan and Naini (2013), where the goal is to match musers with
mpublic statistics, like empirical frequencies. However, there are some dierences between the
two attacks: The attack proposed in Unnikrishnan and Naini (2013) works on two sets of data,
called strings. One of the sets represents the published aggregated data of individuals, the other
represents the auxiliary information known by the adversary about the individuals in the data. The
two sets are equal in size, and also all the strings in the two sets have the same length. Given these
assumptions, Unnikrishnan and Naini (2013) propose an attack based on the minimum weight
bipartite matching. Conversely, in our Probability attack, we try to match a single background
knowledge instance with the set of probability vectors. Therefore, we cannot rely on matching
algorithms on a bipartite graph because we can not make assumptions regarding the length of
the sets or the length of the data: In general, the length of the probability vectors is not the same
among the individuals and is greater than the length of the background knowledge conguration
instances.
We recall that, in the case of probability vectors, we denote with visit vPthepaircomposed
of the frequent location and its probability. We also recall that we denote with Psthe probability
vector of individual s. The background knowledge category for this attack is dened as follows:
Denition 4.6 (Probability background knowledge). Let kbe the number of visits vof the proba-
bility vector of individual sknown by the adversary. The Probability-based background knowledge
is a set of congurations based on kvisits, dened as Bk=P[k]
swhere P[k]
sdenotes the set of all
possible k-combinations of probability vector Ps.
Each adversary’s knowledge bBkis a probability vector, and, given a record dD,wedene
the matching function as:
matchinд(d,b)=true (li,pi)b,(ld
i,pd
i)d|li=ld
ipi[pd
iδ,pd
i+δ]
false otherwise (6)
In the equation, δis a tolerance factor for the matching of probabilities. In our experiments,
δ=0.1.
4.3 Construction of Training Dataset
Given an attack i based on a specific background knowledge configuration B_j^i, the regression training dataset TR_j^i and the classification training dataset TC_j^i can be constructed by the following three-step procedure:

(1) Given a mobility dataset D, for every individual u we compute the set of individual mobility features described in Section 4.1 based on her mobility data. Every individual u is hence described by a mobility feature vector m_u. All the individuals' mobility feature vectors compose the mobility matrix F = (m_1, ..., m_n), where n is the number of individuals in D;
(2) For every individual, we simulate the attack with background knowledge configuration B_j^i on D in order to compute a privacy risk value for every individual. We obtain a privacy risk vector R_j^i = (r_1, ..., r_n). The regression training set is hence TR_j^i = (F, R_j^i);
(3) We transform the regression training set TR_j^i into a classification training set TC_j^i by discretizing the vector R_j^i on the intervals [0.0], (0, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.5], (0.5, 1.0]. We obtain in this way a privacy risk level vector C_j^i = (c_1, ..., c_n). The classification training set is hence TC_j^i = (F, C_j^i).
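The discretization in step (3) can be sketched as a simple binning function over the interval edges listed above (the function name and string labels are illustrative):

```python
def risk_level(r):
    """Discretize a privacy risk value r in [0, 1] into the six levels
    [0.0], (0,0.1], (0.1,0.2], (0.2,0.3], (0.3,0.5], (0.5,1.0]."""
    if r == 0.0:
        return "[0.0]"
    for hi, label in [(0.1, "(0,0.1]"), (0.2, "(0.1,0.2]"),
                      (0.3, "(0.2,0.3]"), (0.5, "(0.3,0.5]"),
                      (1.0, "(0.5,1.0]")]:
        if r <= hi:
            return label
    raise ValueError("risk must be in [0, 1]")

# Discretizing a simulated risk vector into a risk level vector C.
risks = [1.0, 0.5, 1.0, 0.25, 0.03]
levels = [risk_level(r) for r in risks]
```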
Every regression classication dataset TRi
jor classication training dataset TCi
jis used to train
a predictive model Mi
j. The predictive model will be used by the Data Provider to immediately es-
timate the privacy risk value or the privacy risk level of previously unseen individuals whose data
were not used in the learning process, with respect to attack i, background knowledge congura-
tion Bi
j, and dataset D.
Example 4.7 (Construction of a classification training set). Consider a mobility dataset of trajectories D = {T_{u_1}, T_{u_2}, T_{u_3}, T_{u_4}, T_{u_5}} corresponding to five individuals u_1, u_2, u_3, u_4, and u_5. Given an attack i, a background knowledge configuration B_j^i, and the dataset D, we construct the classification training set TC_j^i as follows:

(1) For every individual u_i, we compute the 16 individual mobility measures based on her trajectory T_{u_i}. Every individual u_i is hence described by a mobility feature vector of length 16, m_{u_i} = (m_1^{(u_i)}, ..., m_16^{(u_i)}). All the mobility feature vectors compose the mobility matrix F = (m_{u_1}, m_{u_2}, m_{u_3}, m_{u_4}, m_{u_5});
(2) We simulate the attack with configuration B_j^i on dataset D and obtain a vector of five privacy risk values R_j^i = (r_{u_1}, r_{u_2}, r_{u_3}, r_{u_4}, r_{u_5}), one for each individual;
(3) Suppose that the actual privacy risks resulting from the simulation are R_j^i = (1.0, 0.5, 1.0, 0.25, 0.03). We discretize the values of the privacy risk vector R_j^i on the intervals [0.0], (0, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.5], (0.5, 1.0]. We hence obtain a privacy risk level vector C_j^i = ((0.5, 1.0], (0.3, 0.5], (0.5, 1.0], (0.2, 0.3], (0, 0.1]) and the classification training dataset TC_j^i = (F, C_j^i).
4.4 Usage of the Data Mining Approach
The Data Provider can use a classifier M_j^i to determine the level of privacy risk with respect to an attack i and background knowledge configuration B_j^i for: (i) previously unseen individuals whose data were not used in the learning process, or (ii) a selection of individuals in the database already used in the learning process. It is worth noting that, with existing methods, the privacy risk of individuals in scenario (ii) must be recomputed by simulating attack i from scratch. In contrast, the usage of classifier M_j^i allows us to obtain the privacy risk of the selected individuals immediately. The computation of the mobility measures and the classification of the privacy risk level can be done in polynomial time as a one-off procedure.
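This train-once, predict-immediately workflow can be sketched with scikit-learn, which the authors use in their experiments. The mobility matrix and risk labels below are synthetic stand-ins (random features, a toy two-level labeling), and all variable names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Illustrative stand-ins: a mobility matrix F (16 features per
# individual) and privacy risk levels C from a simulated attack.
F = rng.random((200, 16))
C = (F[:, 0] > 0.5).astype(int)  # toy labels: 1 = high risk, 0 = low risk

# One-off training on the simulated attack results...
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(F, C)

# ...then immediate risk-level estimation for previously unseen
# individuals, with no need to re-simulate the attack.
F_new = rng.random((5, 16))
predicted_levels = model.predict(F_new)
```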
To clarify this point, consider the following scenario. A Data Analyst asks the Data Provider for updated mobility data about a new set of individuals, with the purpose of studying their characteristic traveled distance (radius of gyration r_g) and the predictability of their movements (mobility entropy E). Since both measures can be computed from a probability vector (see Table 1), the Data Provider can release just the probability vectors of the requested individuals. Before that, however, the Data Provider wants to determine the level of privacy risk of the individuals with respect to the Probability attack (P) and several background knowledge configurations B_j^P. The Data Provider uses a previously trained classifier M_j^P to obtain the privacy risk level of the individuals. On the basis of the privacy risks obtained from M_j^P, the Data Provider can immediately identify risky individuals (i.e., individuals with a high level of privacy risk). She can then decide either to filter out the risky individuals or to select suitable privacy-preserving techniques (e.g., k-anonymity or differential privacy) and transform their mobility data in such a way that their privacy is preserved. In the next section, we present an extensive evaluation of our methodology on real-world mobility data and show the effectiveness of the proposed data mining approach.
5 EXPERIMENTS
For all the attacks defined except the Home and Work attack, we consider four background knowledge configurations B_k with k = 2, 3, 4, 5, where configuration B_k corresponds to an attack where the adversary knows k locations visited by the individual. For the Home and Work attack, we have just one possible background knowledge configuration, where the adversary knows the most frequent location and the second most frequent location of an individual.
We use a dataset provided by Octo Telematics3 storing the GPS tracks of private vehicles traveling in two Italian urban areas, Florence and Pisa, from May 1, 2011, to May 31, 2011. In particular, we have 9,715 private vehicles in the Florence dataset and 2,280 vehicles in the Pisa dataset. The GPS device embedded in a vehicle automatically turns on when the vehicle starts, and the sequence of GPS points that the device produces every 30 seconds forms the global GPS track of the vehicle. When the vehicle stops, no points are logged or sent. We exploit these stops to split the global GPS track of a vehicle into several subtracks, corresponding to the trips performed by the vehicle. To ignore small stops like traffic lights and gas stations, we follow the strategy commonly used in the literature (Pappalardo et al. 2013, 2015) and choose a stop duration threshold of at least 20 minutes: If the time interval between two consecutive GPS points of the vehicle is larger than 20 minutes, the first point is considered the end of a trip and the second one the start of another trip.4 We assign each origin and destination point of the obtained subtracks to the corresponding census cell according to the information provided by the Italian National Statistics Bureau (ISTAT), in order to assign every origin and destination point to a location (Pappalardo et al. 2015). This allows us to describe the mobility of every vehicle in the Florence and Pisa datasets in terms of a trajectory, in compliance with the definition introduced in Section 2. Since our purpose is to provide a tool to immediately discriminate between individuals with low risk and individuals with high risk, in this section, we show the results of classification experiments. We also perform regression experiments, where we predict the exact value of privacy risk, and show the corresponding results in Section 3.4 of the Supplementary Material.
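The 20-minute trip-splitting rule described above can be sketched as a single pass over a time-ordered track (the tuple layout and function name are illustrative assumptions):

```python
from datetime import datetime, timedelta

def split_into_trips(track, stop_threshold=timedelta(minutes=20)):
    """Split a GPS track (time-ordered list of (timestamp, lat, lon))
    into trips: a gap larger than the stop threshold between two
    consecutive points ends one trip and starts the next."""
    trips, current = [], [track[0]]
    for prev, point in zip(track, track[1:]):
        if point[0] - prev[0] > stop_threshold:
            trips.append(current)
            current = []
        current.append(point)
    trips.append(current)
    return trips
```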
We construct a classication training dataset TCi
jfor every distinct background knowledge
conguration Bi
jof the attacks described in Section 4.2. This means that, in our experiments, we
build a total of 33 distinct classication training datasets for 33 distinct classication experiments.
This is because we consider four background knowledge congurations (k=2,3,4,5) for eight
attacks (Visit, Frequency, Location, Frequent Location Sequence, Frequent Location, Probability,
Proportion, Sequence), and just one background knowledge conguration for the Home and
Work attack. So we construct a total of (8×4)+1=33 distinct classication training datasets.
3https://www.octotelematics.com/.
4We also performed the extraction of the trips using dierent stop duration thresholds (5, 10, 15, 20, 30, 40 minutes), without
nding signicant dierences in the sample of short trips and in the statistical analysis we present in this article.
Every classication dataset TCi
jis used to train a classier Mi
jusing Random Forest (Hastie et al.
2009).5We evaluate the overall performance of a classier by two metrics (Tan et al. 2005): (i) the
accuracy of classication ACC =|ˆ
f(xi)=f(xi)|
n,wheref(xi)is the actual label of individual i,ˆ
f(xi)
is the predicted label, and nis the number of individuals in the training dataset; and (ii) the
weighted average F-measure, dened as F=cC|c|2TP
2TP+FP+FN, where TP, FP, FN stand for
the numbers of true positives, false positives, and false negatives resulting from classication; C
is the set of labels; and |c|is the support of a label. All the experiments are performed using a
k-fold cross validation procedure with k=10. We also perform a holdout-validation nding similar
results in terms of accuracy and the F-measure with respect to the cross-validation method (see
Supplementary Material, Section 3.3).6
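The evaluation protocol (Random Forest, 10-fold cross-validation, accuracy and weighted F-measure) can be sketched with scikit-learn; the feature matrix and labels below are synthetic stand-ins, not the paper's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
F = rng.random((300, 16))                  # stand-in mobility matrix
C = (F[:, 0] + F[:, 1] > 1.0).astype(int)  # stand-in risk levels

# 10-fold cross-validation scoring both accuracy and weighted F-measure.
scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=0),
    F, C, cv=10, scoring=["accuracy", "f1_weighted"])

acc = scores["test_accuracy"].mean()
f_w = scores["test_f1_weighted"].mean()
```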
Table 2 (columns Florence and Pisa) summarizes the results of the 33 classification tasks for both the Florence and Pisa datasets. We compare the performance of a classifier M_j^i with the performance of a baseline classifier, which generates predictions by respecting the distribution of privacy risk labels in C_j^i.5 In Table 2, we observe a significant gain in both accuracy and F-measure of the classifiers over the baseline. For example, in predicting the Probability privacy risk levels, the classifier reaches maximum performance values of ACC = 0.95 and F-measure = 0.95 (configuration k = 4, Florence), a significant improvement with respect to the baseline model with ACC = 0.56 and F = 0.56. The Home and Work variable has the weakest relation with the individual mobility features, reaching the lowest performance values: ACC = 0.62 and F = 0.59 (where the baseline has ACC = 0.37 and F = 0.37). Finally, the classification results for Florence and Pisa are comparable, with slightly better performances for the Florence dataset (see Table 2). It is worth noting that, for some attacks, such as the Visit attack, we obtain very similar performances in terms of both accuracy and F-measure for any k. This is due to the fact that the privacy risk distributions resulting from simulating the attack are similar for any k ≥ 2 (Figures 1(c) and 4(c) in the Supplementary Material). In contrast, for the Location Sequence attack, we observe that the distribution of privacy risk for k = 2 differs from the distributions of privacy risk for k ≥ 3 (Figures 1(b) and 4(b) in the Supplementary Material). In our classification results, this translates into a difference between k = 2 and k ≥ 3: The classification performances become stable for k ≥ 3. Since the classifiers are accurate especially for the class of maximum risk (0.5, 1], and since for k ≥ 3 the number of individuals with maximum privacy risk increases, the performance of the classifiers improves.
It is important to highlight that classifying a high-risk individual as a low-risk individual can be
a major issue. For our application, the recall is important to evaluate the performance of a classi-
er: A high recall on the highest risk class (0.5,1.0] indicates that a very low number of high-risk
individuals are misclassied as low-risk individuals. To be usable in practice, classiers need to
have a high recall on the highest risk class. Figure 1(a)-(b) show a matrix representing the clas-
sication error for every label of background knowledge conguration k=4 of the Probability
attack, for Florence (a) and Pisa (b). An element i,jin the matrix indicates the fraction of instances
for which the actual label jis classied as label iby the classier. The diagonal of the matrix,
hence, indicates the classier’s recall for every label. We observe that the recall of the highest risk
class (0.5,1.0] is 99% for Florence and 98% for Pisa. In particular, we observe that all the misclas-
sications of the classiers for the highest risk class are made predicting class (0.3, 0.5] (i.e., the
second highest class of risk). So there is a zero probability of misclassifying high-risk individuals
5We use the implementation provided by the scikit-learn package in Python (Pedregosa et al. 2011).
6The Python code for attacks simulation and classication tasks is available at https://github.com/pellungrobe/
privacy-mobility-lib.
ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 3, Article 31. Publication date: December 2017.
A Data Mining Approach to Assess Privacy Risk in Human Mobility Data 31:17
Table 2. Results of the 33 Classification Experiments for the Florence and the Pisa Datasets
The classication performance is evaluated by the overall accuracy (ACC) and the weighted F-measure (F) by using ak-fold
cross validation with k=10. In columns FI PI and PI FI, where FI indicates Florence and PI indicates Pisa, we show
the results of classication where we train the classiers on the rst urban area and try to predict the privacy risks of
individuals in the second urban area.
as low-risk individuals (i.e., classes [0.0] and (0.0, 0.1]). Similarly, in Figure 1(c)-(d), an element i,j
in the matrix indicates the fraction of instances for which the predicted label jis actually label iin
the dataset. The diagonal matrix indicates in this case the classier’s precision for every label. We
observe that the classier is very precise for the two lowest (risk [0.0] and risk (0.0,0.1]) and
the highest (risk (0.5,1.0]) privacy risk labels: Both the recall and the precision of these labels
ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 3, Article 31. Publication date: December 2017.
31:18 R. Pellungrini et al.
Fig. 1. Classification error per class for classifier MP
4Probability aack Pand background knowledge config-
uration BP
4, for Florence (a, c) and Pisa (b, d). An element i,jin the matrices (a) and (b) indicates the fraction
of instances for which the actual class jis classified as class i. The diagonal of the matrices (a) and (b), hence,
indicates the classifier’s recall for every class. An element i,jin the matrices (c) and (d) indicates the fraction
of instances for which the predicted class jis actually class iin the dataset. The diagonal of matrices (c) and
(d) indicates in this case the classifier’s precision for every class. We observe that the classifier has both high
recall and high precision on the first two classes (low risk) and the last class (maximum risk). We provide the
matrices for all the other classifiers in Section 3.6 of the Supplementary Material.
are close to 1. Even on the labels where recall and precision are lower (i.e., (0.1,0.2], (0.2,0.3],
(0.3,0.5]), the classier is more prone to predict a higher level of risk than a lower level of risk.
These conservative choices allow the Data Provider to limit the privacy violation of individuals:
It is hence unlikely that a classier assigns to an individual a privacy risk label that is lower than
ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 3, Article 31. Publication date: December 2017.
A Data Mining Approach to Assess Privacy Risk in Human Mobility Data 31:19
Fig. 2. The distribution of average importance of the mobility features for all 33 classifiers (Florence dataset).
her actual privacy risk label. We report in Section 3.6 of the Supplementary Material the matrices
corresponding to the classification results of all the other background knowledge configurations.
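The way recall and precision are read off the matrices of Figure 1 can be sketched as follows. This is not the authors' code; the labels are a hypothetical toy example, but the computation (diagonal over row sums for recall, diagonal over column sums for precision) matches the description above.

```python
# Sketch: per-class recall and precision read directly off a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 2, 2, 2, 3, 3, 1, 0]   # hypothetical risk-class labels
y_pred = [0, 0, 1, 2, 3, 2, 3, 3, 2, 0]

cm = confusion_matrix(y_true, y_pred)      # rows: actual class, cols: predicted class
recall = np.diag(cm) / cm.sum(axis=1)      # diagonal over row sums
precision = np.diag(cm) / cm.sum(axis=0)   # diagonal over column sums
print(recall, precision)
```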
In Table 2 (columns FI→PI and PI→FI), we also show the results of other classification experiments
where we train a classifier on the Florence dataset and use it to classify the privacy risk
labels of vehicles in the Pisa dataset, and vice versa. Even if the two datasets cover disjoint sets of
vehicles, we observe good predictive performance, comparable to the performance of classifiers
where the training set and the test set belong to the same original dataset.
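The cross-city experiment can be sketched as below. This is a minimal stand-in, not the authors' pipeline: the feature matrices, label vector, and dimensions are hypothetical, but the structure (fit on one city, predict on the other) is the one described above.

```python
# Sketch of the FI -> PI experiment: train a random forest on one city's
# mobility features, predict privacy risk labels in the other city.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_florence = rng.random((200, 5))        # 200 vehicles, 5 mobility features (toy)
y_florence = rng.integers(0, 6, 200)     # 6 discretized risk classes
X_pisa = rng.random((80, 5))             # disjoint set of vehicles (toy)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_florence, y_florence)          # train on "Florence"...
pisa_pred = clf.predict(X_pisa)          # ...predict risk labels for "Pisa"
```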
Importance of Mobility Features. We quantify the importance of every mobility feature in a classifier
M^i_j by taking its average importance in the decision trees of the resulting random forest.
The importance of a feature in a decision tree is computed as the (normalized) total reduction
of classification entropy brought by that feature in the tree (Hastie et al. 2009). Figure 2 shows
a heatmap representing the average importance of every mobility feature to the 33 classifiers in
Florence, where every column corresponds to a classifier and every row corresponds to a mobility
feature. We report the same heatmap for Pisa in Section 3.5 of the Supplementary Material. We
observe the following results. First, while classifiers corresponding to different configurations of
the same attack show similar distributions of importances, classifiers corresponding to configurations
of different attacks produce different distributions. For example, in the classifiers corresponding
to the four configurations of the Visit attack, the average number of visits V is, not surprisingly,
the most important mobility feature (Figure 2). In contrast, in the classifiers corresponding to the
Table 3. The Average Importance of Every Mobility Feature Computed Over All
33 Classifiers for Florence and Pisa

rank | Florence measure | impo. | Pisa measure | impo.
1    | V           | 3.66 | Locs_ratio  | 3.24
2    | E           | 2.92 | D_sum       | 3.22
3    | D_sum       | 2.75 | V           | 2.87
4    | Locs_ratio  | 2.51 | E           | 2.62
5    | V           | 1.91 | V           | 1.69
6    | w^pop_1     | 1.77 | Locs        | 1.66
7    | Locs        | 1.67 | w^pop_1     | 1.62
8    | U_1         | 1.44 | U_1         | 1.46
9    | U^ratio_1   | 1.32 | U^ratio_1   | 1.40
10   | D_sum       | 1.19 | U_2         | 1.16
11   | U_2         | 1.12 | U^ratio_n   | 1.09
12   | w^pop_2     | 1.07 | w^pop_2     | 1.07
13   | E_1         | 1.05 | E_1         | 1.06
14   | U^ratio_n   | 0.99 | D_sum       | 0.98
15   | U^ratio_2   | 0.96 | U^ratio_2   | 0.92
16   | U_n         | 0.88 | U_n         | 0.88
17   | w^pop_n     | 0.83 | r_g         | 0.87
18   | E_n         | 0.79 | E_n         | 0.79
19   | E_2         | 0.74 | E_2         | 0.75
20   | D_max       | 0.68 | w^pop_n     | 0.73
21   | D^trip_max  | 0.63 | D^trip_max  | 0.67
22   | r_g         | 0.61 | D_max       | 0.58
23   | w_1         | 0.42 | w_1         | 0.48
24   | w_2         | 0.40 | w_1         | 0.44
25   | w_1         | 0.36 | w_2         | 0.36
26   | w_n         | 0.13 | w_n         | 0.15
27   | w_n         | 0.12 | w_2         | 0.13
28   | w_2         | 0.10 | w_n        | 0.13

We observe a correlation r = 0.96 between the importance of the mobility features in Florence and Pisa.
Table 4. Comparison of Execution Times of Attack Simulations and Classification
Tasks on Florence and Pisa

Attack (summed over k = 2..5) | Florence simulation | Florence classifier | Pisa simulation | Pisa classifier
Home and Work              | 149s (2.5m)    | 7s   | 5s             | 3s
Frequency                  | 645s (10m)     | 22s  | 20s            | 10s
Frequent Location Sequence | 846s (14m)     | 22s  | 23s            | 10s
Proportion                 | 900s (15m)     | 24s  | 30s            | 10s
Frequent Location          | 997s (17m)     | 22s  | 30s            | 10s
Probability                | 1,165s (20m)   | 22s  | 37s            | 10s
Visit                      | 2,274s (38m)   | 16s  | 95s (1.5m)     | 9s
Location Sequence          | >168h (1 week) | 22s  | >168h (1 week) | 10s
Location                   | >168h (1 week) | 22s  | >168h (1 week) | 10s
total                      | >2 weeks       | 172s | >2 weeks       | 79s
four congurations of the Proportion attack, Vhas a low importance while Dsum,E,and Locsratio
have the highest importance. A second result is that the distribution of the average importances
for Florence and Pisa are similar: We observe a Pearson correlation r=0.96 between the two im-
portances of the same variables in the two urban areas. Table 3shows a ranking of the average
importance the mobility features have in the classiers, for Florence and Pisa. Here, we observe
that individual measures (e.g., E,V,V) tend to be the most important ones, while location-based
features (e.g., Wi,Ei) tend to be less important.
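The entropy-based importance described above can be sketched with scikit-learn, whose random forests expose exactly this quantity (mean impurity decrease, normalized over features) as `feature_importances_`. The data and feature names below are hypothetical; the first feature is made informative on purpose so that it dominates the ranking.

```python
# Sketch: average feature importance in a random forest trained with the
# entropy criterion, i.e., the normalized total entropy reduction each
# feature contributes across the trees.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
features = ["V", "E", "D_sum", "Locs_ratio", "r_g"]   # hypothetical feature names
X = rng.random((300, len(features)))
y = (X[:, 0] > 0.5).astype(int)        # only the first feature determines the label

forest = RandomForestClassifier(n_estimators=50, criterion="entropy",
                                random_state=1).fit(X, y)
importance = dict(zip(features, forest.feature_importances_))
# importances are normalized to sum to 1; "V" dominates in this toy setup
```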
Execution Times. We show the computational improvement of our approach in terms of execution
time by comparing in Table 4 the execution times of the attack simulations and the execution
times of the classification tasks.7 The execution time of a single classification task is the sum of
three subtasks: (i) the execution time of training the classifier on the training set, (ii) the execution
time of using the trained classifier to predict the classes on the test set, and (iii) the execution time
of evaluating the performance of the classification (i.e., computing accuracy and F-measure). Table 4
shows that the execution time of the attack simulations is low for the Frequency, Frequent Location
Sequence, Proportion, Frequent Location, Probability, and Visit attacks (a few seconds for Pisa and
a few minutes for Florence). However, for Location Sequence and Location, the execution times are
huge: more than 1 week each. In contrast, the classification tasks have constant execution times
of around 10s for Pisa and 22s for Florence. In summary, our approach can compute the risk levels
for all 33 attacks in both Florence and Pisa in about 250 seconds (less than 5 minutes), while the attack
simulations require more than 2 weeks of computation.
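The three timed subtasks can be sketched as below. The data are hypothetical placeholders; the point is only the decomposition of a classification task's execution time into training, prediction, and evaluation, mirroring the measurements reported in Table 4.

```python
# Sketch: decomposing a classification task's execution time into its
# three subtasks (train, predict, evaluate).
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
X_train, y_train = rng.random((500, 5)), rng.integers(0, 6, 500)
X_test, y_test = rng.random((100, 5)), rng.integers(0, 6, 100)

t0 = time.perf_counter()
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
t1 = time.perf_counter()                      # (i) training time: t1 - t0
y_pred = clf.predict(X_test)
t2 = time.perf_counter()                      # (ii) prediction time: t2 - t1
accuracy = accuracy_score(y_test, y_pred)
t3 = time.perf_counter()                      # (iii) evaluation time: t3 - t2
print(f"train {t1-t0:.2f}s  predict {t2-t1:.2f}s  evaluate {t3-t2:.2f}s")
```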
6 DISCUSSION
The implementation of our data mining approach on real mobility data produces three remarkable
results. First, the classifiers provide precise estimates of individuals' privacy risk, especially for the
lowest and the highest privacy risk levels (Table 2). Moreover, the classifiers built
on a given dataset (e.g., Florence) can be effectively used to estimate the privacy risks in a different
dataset (e.g., Pisa; Table 2). These outcomes suggest that the classifiers can be a valid and fast
alternative to existing privacy risk assessment tools. Instead of recomputing all the privacy risks
when new data records become available and for every selection of individuals and geographic
areas or periods of time, which would result in high computational costs, a Data Provider can
effectively use the classifiers to obtain immediate and reliable estimates for every individual.
Second, different types of attacks generate different distributions of importance of the mobility
measures in the classifiers (Figure 2). In particular, while some mobility measures are irrelevant
for determining the privacy risk of an individual regardless of the type of attack (e.g., w_n),
other mobility measures are very relevant to determine the privacy risk of an individual (e.g., V
and E). In other words, while some mobility measures provide a high predictive power, others
are irrelevant and cannot be used alone to determine the privacy risk level of an individual. This
suggests that both the learning phase and the predictive task should be done by computing the
extensive set of mobility measures using the maximal data structure (the trajectory), even when
a more aggregated data structure (e.g., a frequency vector) is sufficient for the Data Analysts'
needs. However, this is not a problem in terms of computational costs because all the measures
can be computed in time linear in the size of the dataset. It is worth noting that our approach can
easily deal with changes in the long-term mobility patterns of an individual due, for example, to
migration or changes in home/workplace. Every time new mobility data for an individual become
available, the Data Provider can recompute her mobility features. To take into account long-term
changes in mobility patterns, the recomputation of mobility measures can be done at regular time
intervals (e.g., every month) by considering a time window with the most recent data (e.g., the last
6 months of data). The regular updates and the time window allow the Data Provider to consider
the recent mobility history of an individual and obtain up-to-date individual mobility patterns.
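The sliding-window update described above can be sketched as follows. The record layout (column names `timestamp`, `location`) and the two recomputed features are hypothetical simplifications, not the paper's feature set.

```python
# Sketch: keep only the most recent 6 months of an individual's records
# before recomputing her mobility features.
import pandas as pd

records = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-05", "2016-09-10",
                                 "2016-12-20", "2017-01-15"]),
    "location": ["A", "B", "A", "C"],
})

now = pd.Timestamp("2017-02-01")
window = records[records["timestamp"] >= now - pd.DateOffset(months=6)]

# recompute simple features on the recent window only
n_visits = len(window)                        # e.g., number of visits V
n_locations = window["location"].nunique()    # e.g., distinct locations Locs
```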
A third remarkable result is that on both datasets the mobility measures describing aspects
related to the individual alone, such as the number of visits V in the individual's trajectory and
the mobility entropy E, are the most important features with which to classify the privacy risk
of individuals (Table 3); their importance is far larger than that of the location-based measures
(e.g., w_1, w_2, w_n) and of the measures comparing individual mobility to collective mobility
patterns (e.g., D^trip_max, w^pop_m). This result
7For a given type of attack, we report the sum of the execution times of the attacks for configurations k = 2, 3, 4, 5. We
perform the experiments on Ubuntu 16.04.1 LTS 64 bit, 32GB RAM, 3.30GHz Intel Core i7.
is important because, in contrast with existing privacy risk assessment frameworks, it allows for
estimating the privacy risk of an individual based on a limited amount of information about the
collectivity. Since every individual can obtain an estimate of her own privacy risk based just on
her mobility data, this increases awareness about personal data and helps her decide whether
or not to share mobility data with third parties. This is compliant with a user-centric ecosystem
(Forum 2013) like the one implemented by the personal data store (de Montjoye et al. 2012), where
each individual has full control of her personal data life-cycle. For this reason, our data mining
approach can be integrated into the personal data store as a further tool available to the data owner.
7 RELATED WORKS
This article focuses on the mobility data of individuals traveling by car. An overview of the problems,
techniques, and methodologies related to urban mobility data and urban computing can be
found in Zheng et al. (2014). Human mobility data contain personal, sensitive information and can
reveal many facets of the private life of individuals, leading to the possibility of a serious privacy
violation. Nevertheless, in the past years, many techniques for privacy-preserving analysis of human
mobility data have been proposed in the literature (Giannotti et al. 2013), showing that it is
possible to design analytical mobility services where the quality of results coexists with the protection
of personal data. A widely used privacy-preserving model is k-anonymity (Samarati and
Sweeney 1998a, 1998b), which requires that an individual should not be identifiable within a group
of size smaller than k based on their quasi-identifiers (QIDs), a set of attributes that can be used to
uniquely identify individuals. Abul et al. (2008b) propose the (k, δ)-anonymity model, which takes
advantage of the inherent uncertainty of the moving object's whereabouts, where δ represents
the location precision. Assuming that different adversaries own disjoint parts of an individual's
trajectory, Terrovitis and Mamoulis (2008) reduce privacy risk by relying on the suppression of
the dangerous observations from each individual's trajectory. Yarovoy et al. (2009) propose the
attack-graphs method, based on k-anonymity, to defend against attacks. Monreale et al. (2010b)
illustrate a generalization-based approach to achieve k-anonymity.
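The k-anonymity requirement discussed above can be sketched with a small check: a table is k-anonymous with respect to a set of QIDs when every QID combination is shared by at least k records. This is a toy illustration, not code from the cited papers; the table contents are hypothetical.

```python
# Sketch: checking whether a table satisfies k-anonymity w.r.t. a set of QIDs.
from collections import Counter

def is_k_anonymous(rows, qid_indexes, k):
    """rows: list of tuples; qid_indexes: positions of the quasi-identifiers."""
    groups = Counter(tuple(row[i] for i in qid_indexes) for row in rows)
    return all(count >= k for count in groups.values())

table = [
    ("56xxx", "1980-1985", "flu"),    # (zip, birth range, sensitive attribute)
    ("56xxx", "1980-1985", "cold"),
    ("57xxx", "1990-1995", "flu"),
]
print(is_k_anonymous(table, qid_indexes=[0, 1], k=2))  # one QID group has size 1
```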
Other works are based on the so-called differential privacy model (Dwork et al. 2006). Monreale
et al. (2013), for example, consider a privacy-preserving distributed aggregation framework for
movement data, proposing the application of an ε-differential privacy model. Cormode et al. (2012)
propose to publish a contingency table of trajectory data, where each cell in the table contains
the number of individuals commuting from the given source location to the given destination
location. Gambs et al. (2014) propose a mobility model called Mobility Markov Chain, built
upon mobility traces to re-identify an individual, while Ji et al. (2014) define several similarity
metrics which can be combined in a unified framework to de-anonymize mobility
and social network data.
One of the most important works on privacy risk assessment is the LINDDUN methodology (Deng
et al. 2011), a privacy-aware threat analysis framework based on Microsoft's STRIDE methodology
(Swiderski and Snyder 2004), useful for modeling privacy threats in software-based systems. In the
past years, different techniques for risk management have been proposed, such as OWASP's
Risk Rating Methodology (OWASP 2016), NIST's Special Publication 800-30 (Stoneburner et al.
2002), SEI's OCTAVE (Alberts et al. 1999), and Microsoft's DREAD (Meier and Corporation 2003).
Unfortunately, many of these works do not consider privacy risk assessment and simply include
privacy considerations when assessing the impact of threats. Trabelsi et al. (2009) elaborate an
entropy-based method to evaluate the disclosure risk of personal data, trying to manage privacy
risks quantitatively. The unicity measure proposed in Song et al. (2014) and Achara et al. (2015)
evaluates privacy risk as the number of records/trajectories that are uniquely identified.
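The unicity idea can be sketched as follows: given h points known to an adversary, a trajectory is "unique" if it is the only one in the dataset containing all of them, and unicity is the fraction of such cases. This is a simplified Monte Carlo sketch over toy trajectories (sets of visited locations), not the cited authors' implementation.

```python
# Sketch: estimating the unicity of a trajectory dataset given h known points.
import random

def unicity(trajectories, h, trials=1000, seed=0):
    rng = random.Random(seed)
    unique = 0
    for _ in range(trials):
        traj = rng.choice(trajectories)                       # adversary's target
        known = set(rng.sample(sorted(traj), min(h, len(traj))))
        matches = sum(1 for t in trajectories if known <= t)  # trajectories containing all known points
        unique += matches == 1
    return unique / trials

trajs = [{"A", "B", "C"}, {"A", "B", "D"}, {"E", "F"}]
print(unicity(trajs, h=2))
```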
Basu et al. (2014) propose an empirical risk model for the estimation of privacy risk for trajectory
data and a framework to improve privacy risk estimation for mobility data, evaluating their model
using k-anonymized data. Armando et al. (2015) propose a risk-aware framework for information
disclosure which supports runtime risk assessment. In this framework, access-control decisions are
based on the disclosure risk associated with a data access request, and adaptive anonymization is
used as a risk-mitigation method. Unfortunately, this framework only works on relational datasets
since it needs to discriminate between QIDs and sensitive attributes.
Other works in the literature study the re-identification risk as a privacy measure in the context
of network and social media data (Narayanan and Shmatikov 2009; Ramachandran et al. 2014) or
combine network data and mobile phone data to re-identify people (Cecaj et al. 2016). Combining
multiple data sources for the attack, and considering network data instead of mobility data in our
methodology, are among the most interesting extensions that we intend to investigate in future work.
In this article, we use the privacy risk assessment framework introduced by Pratesi et al. (2016)
(Section 3) to calculate the privacy risk of each individual in a mobility dataset. Our novel
contribution is to overcome the inherent computational complexity of this framework by proposing a
data mining approach that uses classifiers to predict the privacy risk of an individual
based solely on her mobility patterns.
8 CONCLUSION
Human mobility data are a precious proxy to improve our understanding of human dynamics, as
well as to improve urban planning, transportation engineering, and epidemic modeling. Nevertheless,
human mobility data contain sensitive information which, if analyzed with malicious intent,
can lead to a serious violation of the privacy of the individuals involved. In this article, we proposed
a fast and flexible data mining approach for estimating the privacy risk in human mobility
data, one that overcomes the computational issues of existing privacy risk assessment frameworks.
We validated our approach with an extensive experimentation on real-world GPS data, showing
that we can achieve accurate estimations of privacy risks. In particular, the results showed that
(i) the classifiers are accurate, especially on the highest and the lowest privacy risk classes; and
(ii) the classifiers have a conservative behavior (i.e., misclassified individuals are more likely
assigned to classes of higher risk than to classes of lower risk with respect to their actual class of
privacy risk). Moreover, we observed that a classifier trained on data related to a specific urban area
can be effectively used to predict the privacy risk of individuals in another urban area.
We want to highlight some limitations of the article that we plan to overcome in future works.
First, we do not investigate Step (3) and Step (4) of the Data Delivery Procedure (Procedure 3.1), that
is, the most suitable techniques to reduce the privacy risk of individuals in the dataset while still
guaranteeing data quality for mobility analytics. Diverse techniques are proposed in the literature,
such as k-anonymity (Samarati and Sweeney 1998b) or differential privacy (Dwork et al. 2006),
ranging from removing a fraction of the records or individuals, to injecting artificial records to
hide risky individuals, to modifying the data structures of the riskiest individuals. Our approach
provides a fast tool to immediately obtain the privacy risks of individuals, leaving to the Data
Provider the choice of the most suitable privacy-preserving techniques to manage and mitigate
the privacy risks of individuals. In future works, we plan to perform an extensive experimentation
to select the best techniques to reduce the privacy risk of individuals in mobility datasets while at
the same time ensuring high data quality for analytical services.
Our approach can be extended in several directions. First, we plan to apply our data mining
approach to mobility datasets with different characteristics, such as mobile phone data, which
generally cover a larger geographic area (e.g., an entire country). This would allow us to deeply
investigate the “portability” of our approach (i.e., to what extent the classifiers trained on one
geographic zone can be used to predict the privacy risk of individuals in another geographic zone).
Second, the repertoire of attacks can be extended by adding new attacks or by defining “multi-attacks”
(i.e., combining multiple existing attacks). For example, a powerful “multi-attack” would
be a combination of Proportion and Probability: An adversary would know a set of k locations
and the corresponding probabilities and relative proportions. Many other multi-attacks can be
designed, and we leave this interesting line of research for future work. Third, we plan to investigate
whether our approach can be extended to contexts other than human mobility, such as the estimation
of privacy risk in social networks. It would indeed be interesting to investigate to what extent
data mining classifiers are able to infer the relations between social network metrics and individual
risk of re-identification in social network data. Last, it would be interesting to repeat the experiments
with a larger repertoire of machine learning algorithms and identify the best performer, or
to combine them with boosting or bagging techniques to further improve the classification results.
We leave these interesting tasks for future work.
REFERENCES
Osman Abul, Francesco Bonchi, and Mirco Nanni. 2008a. Never walk alone: Uncertainty for anonymity in moving objects
databases. In Proceedings of the 24th International Conference on Data Engineering (ICDE’08). 376–385. DOI:https://doi.
org/10.1109/ICDE.2008.4497446
Osman Abul, Francesco Bonchi, and Mirco Nanni. 2008b. Never walk alone: Uncertainty for anonymity in moving objects
databases. In ICDE’08. 376–385.
Jagdish Prasad Achara, Gergely Ács, and Claude Castelluccia. 2015. On the unicity of smartphone applications. In Proceed-
ings of the 14th ACM Workshop on Privacy in the Electronic Society (WPES 2015), Denver, Colorado, USA, October 12, 2015.
27–36. DOI:https://doi.org/10.1145/2808138.2808146
Christopher Alberts, Sandra Behrens, Richard Pethia, and William Wilson. 1999. Operationally Critical Threat, Asset, and
Vulnerability Evaluation (OCTAVE) Framework, Version 1.0. Technical Report CMU/SEI-99-TR-017. Software Engineer-
ing Institute, Carnegie Mellon University, Pittsburgh, PA. http://resources.sei.cmu.edu/library/asset-view.cfm?AssetID=
13473.
Alessandro Armando, Michele Bezzi, Nadia Metoui, and Antonino Sabetta. 2015. Risk-based privacy-aware information
disclosure. International Journal of Secure Software Engineering 6, 2 (April 2015), 70–89. DOI:https://doi.org/10.4018/
IJSSE.2015040104
Anirban Basu, Anna Monreale, Juan Camilo Corena, Fosca Giannotti, Dino Pedreschi, Shinsaku Kiyomoto, Yutaka Miyake,
Tadashi Yanagihara, and Roberto Trasarti. 2014. A privacy risk model for trajectory data. In Trust Management VIII,
Jianying Zhou, Nurit Gal-Oz, Jie Zhang, and Ehud Gudes (Eds.). IFIP Advances in Information and Communication
Technology, Vol. 430. Springer Berlin, 125–140. DOI:https://doi.org/10.1007/978-3-662-43813-8_9
Armando Bazzani, Bruno Giorgini, Sandro Rambaldi, Riccardo Gallotti, and Luca Giovannini. 2010. Statistical laws in urban
mobility from microscopic GPS data in the area of Florence. Journal of Statistical Mechanics: Theory and Experiment 2010,
5 (2010), P05001. http://stacks.iop.org/1742-5468/2010/i=05/a=P05001
Alket Cecaj, Marco Mamei, and Franco Zambonelli. 2016. Re-identification and information fusion between anonymized
CDR and social network data. Journal of Ambient Intelligence and Humanized Computing 7, 1 (2016), 83–96. DOI:https:
//doi.org/10.1007/s12652-015-0303-x
Vittoria Colizza, Alain Barrat, Marc Barthelemy, Alain-Jacques Valleron, and Alessandro Vespignani. 2007. Modeling the
worldwide spread of pandemic influenza: Baseline case and containment interventions. PLOS Medicine 4, 1 (Jan. 2007),
1–16. DOI:https://doi.org/10.1371/journal.pmed.0040013
Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Thanh T. L. Tran. 2012. Differentially private summaries
for sparse data. In ICDT’12. 299–311.
Yves-Alexandre de Montjoye, César A. Hidalgo, Michel Verleysen, and Vincent D. Blondel. 2013. Unique in the crowd: The
privacy bounds of human mobility. Scientific Reports 3 (March 2013), 1376. http://dx.doi.org/10.1038/srep01376.
Yves-Alexandre de Montjoye, Samuel S. Wang, and Alex Pentland. 2012. On the trusted use of large-scale personal data.
IEEE Data Engineering Bull. 35, 4 (2012), 5–8.
Mina Deng, Kim Wuyts, Riccardo Scandariato, Bart Preneel, and Wouter Joosen. 2011. A privacy threat analysis framework:
Supporting the elicitation and fulfillment of privacy requirements. Requirements Engineering 16, 1 (March 2011), 3–32.
DOI:https://doi.org/10.1007/s00766-010-0115-7
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data
analysis. In TCC’06. 265–284.
Nathan Eagle and Alex S. Pentland. 2009. Eigenbehaviors: Identifying structure in routine. Behavioral Ecology and Sociobi-
ology 63, 7 (1 May 2009), 1057–1066. DOI:https://doi.org/10.1007/s00265-009-0739-0
Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez. 2012. Next place prediction using mobility
Markov chains. In Proceedings of the 1st Workshop on Measurement, Privacy, and Mobility (MPM'12). ACM, New York,
Article 3, 6 pages. DOI:https://doi.org/10.1145/2181196.2181199
Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez. 2014. De-anonymization attack on geolocated
data. Journal of Computer and System Sciences 80 (2014), 1597–1614.
Fosca Giannotti, Anna Monreale, and Dino Pedreschi. 2013. Mobility data and privacy. In Mobility Data Modeling, Manage-
ment, and Understanding, C. Renso, S. Spaccapietra, E. Zimanyi (Eds.). Springer, 174–193.
Fosca Giannotti, Mirco Nanni, Dino Pedreschi, Fabio Pinelli, Chiara Renso, Salvatore Rinzivillo, and Roberto Trasarti. 2011.
Unveiling the complexity of human mobility by querying and mining massive trajectory data. The VLDB Journal 20, 5
(2011), 695. DOI:https://doi.org/10.1007/s00778-011-0244-8
Marta C. Gonzalez, Cesar A. Hidalgo, and Albert-Laszlo Barabasi. 2008. Understanding individual human mobility patterns.
Nature 453, 7196 (June 2008), 779–782. DOI:https://doi.org/10.1038/nature06958
Trevor J. Hastie, Robert John Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer, New York.
Shouling Ji, Weiqing Li, Mudhakar Srivatsa, Jing Selena He, and Raheem Beyah. 2014. Structure Based Data De-
Anonymization of Social Networks and Mobility Traces. Springer International Publishing, Cham, 237–254. DOI:https:
//doi.org/10.1007/978-3-319-13257-0_14
Shan Jiang, Yingxiang Yang, Siddharth Gupta, Daniele Veneziano, Shounak Athavale, and Marta C. González. 2016.
The TimeGeo modeling framework for urban mobility without travel surveys. Proceedings of the National Academy
of Sciences 113, 37 (2016), E5370–E5378. DOI:https://doi.org/10.1073/pnas.1524261113
Xin Lu, Erik Wetter, Nita Bharti, Andrew J. Tatem, and Linus Bengtsson. 2013. Approaching the limit of predictability in
human mobility. Scientific Reports 3, 1, 2923. http://dx.doi.org/10.1038/srep02923
Stefano Marchetti, Caterina Giusti, Monica Pratesi, Nicola Salvati, Fosca Giannotti, Dino Pedreschi, Salvatore Rinzivillo,
Luca Pappalardo, and Lorenzo Gabrielli. 2015. Small area model-based estimators using big data sources. Journal of
Official Statistics 31, 2 (2015), 263–281.
J. D. Meier and Microsoft Corporation. 2003. Improving Web Application Security: Threats and Countermeasures. Microsoft.
Noman Mohammed, Benjamin C. M. Fung, and Mourad Debbabi. 2009. Walking in the crowd: Anonymizing trajectory data
for pattern analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09).
ACM, New York, 1441–1444.
Anna Monreale, Gennady Andrienko, Natalia Andrienko, Fosca Giannotti, Dino Pedreschi, Salvatore Rinzivillo, and Stefan
Wrobel. 2010a. Movement data anonymity through generalization. Transactions on Data Privacy 3, 2 (Aug. 2010), 91–121.
Anna Monreale, Gennady L. Andrienko, Natalia V. Andrienko, Fosca Giannotti, Dino Pedreschi, Salvatore Rinzivillo, and
Stefan Wrobel. 2010b. Movement data anonymity through generalization. Transactions on Data Privacy 3, 2 (2010), 91–
121.
Anna Monreale, Dino Pedreschi, Ruggero G. Pensa, and Fabio Pinelli. 2014a. Anonymity preserving sequential pattern
mining. Artificial Intelligence and Law 22, 2 (2014), 141–173. DOI:https://doi.org/10.1007/s10506-014-9154-6
Anna Monreale, Salvatore Rinzivillo, Francesca Pratesi, Fosca Giannotti, and Dino Pedreschi. 2014b. Privacy-by-
design in big data analytics and social mining. EPJ Data Science 3, 1 (2014), 10. DOI:https://doi.org/10.1140/epjds/
s13688-014-0010-4
Anna Monreale, Wendy Hui Wang, Francesca Pratesi, Salvatore Rinzivillo, Dino Pedreschi, Gennady Andrienko, and Na-
talia Andrienko. 2013. Privacy-Preserving Distributed Movement Data Aggregation. Springer International Publishing,
225–245. DOI:https://doi.org/10.1007/978-3-319-00615-4_13
Arvind Narayanan and Vitaly Shmatikov. 2009. De-anonymizing social networks. In Proceedings of the 30th IEEE Symposium
on Security and Privacy (S&P’09). 173–187. DOI:https://doi.org/10.1109/SP.2009.22
OWASP. 2016. Risk rating methodology. Retrieved from https://www.owasp.org/index.php/OWASP_Risk_Rating_
Methodology.
Luca Pappalardo, Dino Pedreschi, Zbigniew Smoreda, and Fosca Giannotti. 2015. Using big data to study the link between
human mobility and socio-economic development. In Proceedings of the 2015 IEEE International Conference on Big Data
(Big Data’15). 871–878. DOI:https://doi.org/10.1109/BigData.2015.7363835
Luca Pappalardo, Salvatore Rinzivillo, Zehui Qu, Dino Pedreschi, and Fosca Giannotti. 2013. Understanding the patterns
of car travel. The European Physical Journal Special Topics 215, 1 (2013), 61–73. DOI:https://doi.org/10.1140/epjst/
e2013-01715-5
Luca Pappalardo, Salvatore Rinzivillo, and Filippo Simini. 2016. Human mobility modelling: Exploration and preferential
return meet the gravity model. Procedia Computer Science 83 (2016), 934–939. DOI:https://doi.org/10.1016/j.procs.2016.
04.188 The 7th International Conference on Ambient Systems, Networks and Technologies (ANT 2016) / The 6th Inter-
national Conference on Sustainable Energy Information Technology (SEIT-2016) / Aliated Workshops.
Luca Pappalardo and Filippo Simini. 2016. Modelling spatio-temporal routines in human mobility. CoRR abs/1607.05952
(2016). http://arxiv.org/abs/1607.05952
Luca Pappalardo, Filippo Simini, Salvatore Rinzivillo, Dino Pedreschi, Fosca Giannotti, and Albert-Laszlo Barabasi. 2015.
Returners and explorers dichotomy in human mobility. Nature Communications 6 (Sept. 2015). http://dx.doi.org/10.1038/
ncomms9166
Luca Pappalardo, Maarten Vanhoof, Lorenzo Gabrielli, Zbigniew Smoreda, Dino Pedreschi, and Fosca Giannotti. 2016. An
analytical framework to nowcast well-being using mobile phone data. International Journal of Data Science and Analytics
2, 1 (2016), 75–92. DOI:https://doi.org/10.1007/s41060-016-0013-2
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
Francesca Pratesi, Anna Monreale, Roberto Trasarti, Fosca Giannotti, Dino Pedreschi, and Tadashi Yanagihara. 2016.
PRISQUIT: A System for Assessing Privacy Risk versus Quality in Data Sharing. Technical Report 2016-TR-043. ISTI -
CNR, Pisa, Italy.
Arthi Ramachandran, Yunsung Kim, and Augustin Chaintreau. 2014. “I knew they clicked when I saw them with their
friends”: Identifying your silent web visitors on social media. In Proceedings of the 2nd ACM Conference on Online Social
Networks (COSN’14). 239–246. DOI:https://doi.org/10.1145/2660460.2660461
Ira S. Rubinstein. 2013. Big data: The end of privacy or a new beginning? International Data Privacy Law (2013). DOI:https://
doi.org/10.1093/idpl/ips036
Pierangela Samarati. 2001. Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledge and
Data Engineering 13, 6 (2001), 1010–1027. DOI:https://doi.org/10.1109/69.971193
Pierangela Samarati and Latanya Sweeney. 1998a. Generalizing data to provide anonymity when disclosing information
(abstract). In PODS. 188.
Pierangela Samarati and Latanya Sweeney. 1998b. Protecting privacy when disclosing information: K-anonymity and its
enforcement through generalization and suppression. In Proceedings of the IEEE Symposium on Research in Security and
Privacy. 384–393.
Filippo Simini, Marta C. Gonzalez, Amos Maritan, and Albert-Laszlo Barabasi. 2012. A universal model for mobility and
migration patterns. Nature 484, 7392 (April 2012), 96–100. http://dx.doi.org/10.1038/nature10856
Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. 2010a. Modelling the scaling properties of human
mobility. Nature Physics 6, 10 (Oct. 2010), 818–823. http://dx.doi.org/10.1038/nphys1760
Chaoming Song, Zehui Qu, Nicholas Blumm, and Albert-László Barabási. 2010b. Limits of predictability in human mobility.
Science 327, 5968 (2010), 1018–1021. DOI:https://doi.org/10.1126/science.1177170
Yi Song, Daniel Dahlmeier, and Stéphane Bressan. 2014. Not so unique in the crowd: A simple and effective algorithm for
anonymizing location data. In Proceedings of the 1st International Workshop on Privacy-Preserving IR: When Information
Retrieval Meets Privacy and Security co-located with 37th Annual International ACM SIGIR Conference (PIR@SIGIR’14).
19–24.
G. Stoneburner, A. Goguen, and A. Feringa. 2002. Risk Management Guide for Information Technology Systems: Recommen-
dations of the National Institute of Standards and Technology. NIST special publication, Vol. 800. U.S. Department of
Commerce, National Institute of Standards and Technology.
Latanya Sweeney. 2002. K-anonymity: A model for protecting privacy. International Journal of Uncertainty and Fuzziness
in Knowledge-Based Systems 10, 5 (Oct. 2002), 557–570. DOI:https://doi.org/10.1142/S0218488502001648
Frank Swiderski and Window Snyder. 2004. Threat Modeling. O’Reilly Media.
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. 2005. Introduction to Data Mining (1st Edition). Addison-Wesley
Longman Publishing Co., Inc., Boston, MA.
Manolis Terrovitis and Nikos Mamoulis. 2008. Privacy preservation in the publication of trajectories. In MDM. 65–72.
Manolis Terrovitis, Nikos Mamoulis, and Panos Kalnis. 2008. Privacy-preserving anonymization of set-valued data. Pro-
ceedings of the VLDB Endowment 1, 1 (Aug. 2008), 115–125. DOI:https://doi.org/10.14778/1453856.1453874
Michele Tizzoni, Paolo Bajardi, Adeline Decuyper, Guillaume Kon Kam King, Christian M. Schneider, Vincent Blondel,
Zbigniew Smoreda, Marta C. González, and Vittoria Colizza. 2014. On the use of human mobility proxies for modeling
epidemics. PLOS Computational Biology 10, 7 (Jul. 2014), 1–15. DOI:https://doi.org/10.1371/journal.pcbi.1003716
Slim Trabelsi, Vincent Salzgeber, Michele Bezzi, and Gilles Montagnon. 2009. Data disclosure risk evaluation. In CRiSIS’09.
35–72.
ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 3, Article 31. Publication date: December 2017.
Jayakrishnan Unnikrishnan and Farid Movahedi Naini. 2013. De-anonymizing private data by matching statistics. In Pro-
ceedings of the 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton’13). 1616–1623.
DOI:https://doi.org/10.1109/Allerton.2013.6736722
Pu Wang, Timothy Hunter, Alexandre M. Bayen, Katja Schechtner, and Marta C. González. 2012. Understanding road usage
patterns in urban areas. Scientific Reports 2 (Dec. 2012), Article 1001. http://dx.doi.org/10.1038/srep01001
Nathalie E. Williams, Timothy A. Thomas, Matthew Dunbar, Nathan Eagle, and Adrian Dobra. 2015. Measures of human
mobility using mobile phone records enhanced with GIS data. PLoS ONE 10, 7 (July 2015), 1–16. DOI:https://doi.org/10.
1371/journal.pone.0133630
W. K. Wong, David W. Cheung, Edward Hung, Ben Kao, and Nikos Mamoulis. 2007. Security in outsourcing of association
rule mining. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07). VLDB Endowment,
111–122.
World Economic Forum. 2013. Unlocking the Value of Personal Data: From Collection to Usage. Retrieved from http://
www3.weforum.org/docs/WEF_IT_UnlockingValuePersonalData_CollectionUsage_Report_2013.pdf.
Yabo Xu, Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Jian Pei. 2008a. Publishing sensitive transactions for
itemset utility. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM’08). 1109–1114.
Yabo Xu, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu. 2008b. Anonymizing transaction databases for publication. In
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 767–775.
Roman Yarovoy, Francesco Bonchi, Laks V. S. Lakshmanan, and Wendy Hui Wang. 2009. Anonymizing moving objects:
How to hide a MOB in a crowd? In EDBT. 72–83.
Hui Zang and Jean Bolot. 2011. Anonymization of location data does not work: A large-scale measurement study. In Pro-
ceedings of the 17th Annual International Conference on Mobile Computing and Networking (MobiCom’11). ACM, New
York, 145–156. DOI:https://doi.org/10.1145/2030613.2030630
Yu Zheng. 2015. Trajectory data mining: An overview. ACM Transactions on Intelligent Systems and Technology 6, 3 (2015),
29:1–29:41. DOI:https://doi.org/10.1145/2743025
Yu Zheng, Licia Capra, Ouri Wolfson, and Hai Yang. 2014. Urban computing: Concepts, methodologies, and applications.
ACM Transactions on Intelligent Systems and Technology 5, 3 (Sept. 2014), Article 38, 55 pages. DOI:https://doi.org/10.1145/
2629592
Yu Zheng and Xiaofang Zhou (Eds.). 2011. Computing with Spatial Trajectories. Springer. DOI:https://doi.org/10.1007/
978-1-4614-1629-6
Received December 2016; revised May 2017; accepted May 2017