Fast estimation of privacy risk in human
mobility data
Roberto Pellungrini1, Luca Pappalardo1,2, Francesca Pratesi1,2, and Anna Monreale1
1Department of Computer Science, University of Pisa, Italy
2ISTI-CNR, Pisa, Italy
Abstract. Mobility data are an important proxy to understand the pat-
terns of human movements, develop analytical services and design models
for simulation and prediction of human dynamics. Unfortunately mobil-
ity data are also very sensitive, since they may contain personal informa-
tion about the individuals involved. Existing frameworks for privacy risk
assessment enable the data providers to quantify and mitigate privacy
risks, but they suffer two main limitations: (i) they have a high compu-
tational complexity; (ii) the privacy risk must be re-computed for each
new set of individuals, geographic areas or time windows. In this paper
we explore a fast and flexible solution to estimate privacy risk in human
mobility data, using predictive models to capture the relation between
an individual’s mobility patterns and her privacy risk. We show the effec-
tiveness of our approach by experimentation on a real-world GPS dataset
and provide a comparison with traditional methods.
1 Introduction
In recent years human mobility analysis has attracted growing interest due to
its importance in several applications such as urban planning, transportation
engineering and public health (11). The wide availability of these data has offered
the opportunity to observe human movements at large scales and in great detail,
leading to the discovery of quantitative patterns (9), the mathematical modeling
of human mobility (10; 15) and so on. Unfortunately mobility data are sensitive
because they may reveal personal information or allow the re-identification of in-
dividuals, creating serious privacy risks if they are analyzed with malicious intent
(13). Driven by these concerns, researchers have developed methodologies
and frameworks to mitigate the individual privacy risks associated with the study
of GPS trajectories and Big Data in general (1). These tools aim at preserving
both the individual's right to privacy and the effectiveness of the analytical
results, trying to find a reasonable trade-off between privacy protection and data
quality. They also allow the definition of infrastructures for supporting privacy
and of technical requirements for data protection, linking privacy-preserving
solutions with legal regulations, since privacy risk assessment
is required by the new EU General Data Protection Regulation. To this
aim, Pratesi et al. (12) propose a framework for the privacy risk assessment of
individuals in mobility datasets. Although frameworks like the one presented in
(12) are effective in many scenarios, they suffer a major drawback: the privacy
risk assessment has a high computational complexity (non-polynomial in time)
because it computes the maximum privacy risk given an external knowledge that
a malicious adversary may have, i.e., it considers all the possible ways the ad-
versary can try to re-identify an individual. Moreover, the privacy risks must
be recomputed every time new data become available and for every selection of
individuals, geographic areas and periods of time.
In this paper we propose a data mining approach for privacy risk assessment
that overcomes the computational limitations of existing frameworks. We first
introduce some possible re-identification attacks on mobility data and compute
the privacy risk level of each individual according to these attacks. We then
train a regressor on such data to estimate in polynomial time the privacy risk
level of previously unseen vehicles based just on their individual mobility
patterns. In a scenario where a Data Analyst asks a Data Provider for mobility
data to deploy an analytical service, the
Data Provider (e.g., a mobile phone carrier) can use the regressor to immedi-
ately identify individuals with a high privacy risk. Then, the Data Provider can
select the most suitable privacy-preserving technique (e.g., k-anonymity, differ-
ential privacy) to mitigate their privacy risk and release only safe data to the
Data Analyst. Our experiments on GPS data show that our approach is fairly
accurate in predicting the privacy risk of unseen individuals in an urban area.
The rest of the paper is organized as follows. In Section 2 we define the
data structures to describe human mobility data according to different data
aggregations. In Section 3 we introduce the framework used for the privacy risk
assessment, while Section 4 describes the data mining approach we propose. In
Section 5 we present and discuss the results of our experiments. Section
6 reviews the main works related to ours, and finally Section 7 concludes
the paper by proposing some lines of future research.
2 Data Definitions
The approach we present in this paper is tailored for human mobility data,
i.e., data describing the movements of a set of individuals. This type of data is
generally collected in an automatic way through electronic devices (e.g., mobile
phones, GPS devices) in the form of raw trajectory data. Every record has the
following fields: the identifier of the individual, a geographic location expressed in
coordinates (generally latitude and longitude), a timestamp indicating when the
individual stopped in or went through that location. Depending on the specific
application, a trajectory can be aggregated into different data structures:
Definition 1 (Trajectory). The trajectory T_u of an individual u is a temporally
ordered sequence of tuples T_u = ⟨(l_1, t_1), (l_2, t_2), . . . , (l_n, t_n)⟩, where l_i =
(x_i, y_i) is a location, x_i and y_i are the coordinates of the geographic location,
and t_i is the corresponding timestamp, with t_i < t_j if i < j.
Definition 2 (Frequency vector). The frequency vector W_u of an individual
u is a sequence of tuples W_u = ⟨(l_1, w_1), (l_2, w_2), . . . , (l_n, w_n)⟩, where l_i = (x_i, y_i)
is a location, w_i is the frequency of the location, i.e., how many times location
l_i appears in the individual's trajectory T_u, and w_i > w_j if i < j. A frequency
vector W_u is hence an aggregation of a trajectory T_u.
In the following, with the term visit we refer indifferently to a tuple in
a trajectory or in a frequency vector. In other words, a visit indicates a pair
consisting of a location and a piece of supplementary information, e.g., the
timestamp or the frequency. We denote with D a mobility dataset, which we
assume is a set of one of the above data types (trajectory or frequency vector).
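A frequency vector can be obtained from a trajectory with a simple linear-time aggregation. Below is a minimal sketch in Python; the coordinates and variable names are illustrative, not taken from the paper's code:

```python
from collections import Counter

# A trajectory: temporally ordered (location, timestamp) visits,
# where a location is an (x, y) coordinate pair.
T_u = [((43.77, 11.25), 1), ((43.78, 11.26), 2),
       ((43.77, 11.25), 3), ((43.79, 11.24), 4)]

def frequency_vector(trajectory):
    """Aggregate a trajectory into a frequency vector:
    (location, frequency) tuples sorted by descending frequency."""
    counts = Counter(loc for loc, _ in trajectory)
    return sorted(counts.items(), key=lambda lw: -lw[1])

W_u = frequency_vector(T_u)
# The most frequent location comes first, per Definition 2.
```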
3 Privacy Risk Assessment Framework
In this paper we consider the work proposed in (12), which allows for the privacy
risk assessment of human mobility data. This framework considers a scenario
where a Data Analyst asks a Data Provider for data to develop an analytical
service. The Data Provider must guarantee the right to privacy of the individ-
uals whose data are recorded. First, the Data Analyst transmits to the Data
Provider the data requirements for the service. With these specifications, the
Data Provider queries its dataset D, producing a set of datasets {D1, . . . , Dz},
each with different data structures and data aggregations. The Data Provider
then reiterates a procedure until it considers the data delivery safe:
(1) Identification of Attacks : identify a set of possible attacks that an adversary
might conduct in order to re-identify individuals in the datasets {D1, . . . , Dz};
(2) Privacy Risk Computation: simulate the attacks and compute the set of
privacy risk values for every individual in the mobility datasets {D1, . . . , Dz};
(3) Dataset Selection: select a mobility dataset D ∈ {D1, . . . , Dz} with the best
trade-off between the privacy risks of individuals and the data quality, given
a certain level of tolerated privacy risk and the Data Analyst's requirements;
(4) Risk Mitigation and Data Delivery: apply a privacy-preserving transformation
(e.g., generalization, randomization, etc.) on the chosen mobility dataset
D to eliminate the residual privacy risk, producing a filtered mobility dataset
D_filt. Deliver D_filt to the Data Analyst once D_filt is adequately safe.
In this paper we focus on improving step (2), i.e., Privacy Risk Computation,
which is the most critical one from a computational point of view. Computing
the privacy risk of an individual means simulating several possible attacks a
malicious adversary can perform and computing the privacy risks associated
to each attack. The privacy risk of an individual is related to her probability
of re-identification in a dataset w.r.t. a set of re-identification attacks. A re-
identification attack assumes that an adversary gains access to a dataset. On the
basis of some background knowledge about an individual, i.e., the knowledge of
a subset of her mobility data, the adversary tries to re-identify all the records
in the dataset regarding the individual under attack. In this paper we use the
definition of privacy risk (or re-identification risk) introduced in (14).
A background knowledge represents both the kind and quantity of informa-
tion known by an adversary. Two examples of kinds of background knowledge
are a subset of the locations visited by an individual (spatial dimension) and the
specific times an individual visited those locations (spatial and temporal dimen-
sions). We denote with k the number of elements known by the adversary. So,
for example, a specific background knowledge is the knowledge of three specific
locations visited by the individual under attack. We denote a set of background
knowledge of size k with B_k and a specific background knowledge with b.
Let 𝒟 be a database, D a mobility dataset extracted from 𝒟 as an aggregation
of the data on specific dimensions (e.g., an aggregated data structure and/or a
filtering on time and/or space), and D_u the set of records representing individual
u in D. We define the probability of re-identification as follows:
Definition 3 (Probability of re-identification). Given an attack, a function
matching(d, b) indicating whether or not a record d ∈ D matches the background
knowledge b, and a function M(D, b) = {d ∈ D | matching(d, b) = True},
we define the probability of re-identification of an individual u in dataset D as
PR_D(d = u|b) = 1/|M(D, b)|, that is, the probability of associating record d ∈ D to
individual u, given background knowledge b.
Note that PR_D(d = u|b) = 0 if the individual u is not represented in D. Since
each background knowledge b has its own probability of re-identification, we
define the risk of re-identification of an individual as the maximum probability
of re-identification over the set of possible background knowledge:
Definition 4 (Risk of re-identification or Privacy risk). The risk of re-
identification (or privacy risk) of an individual u given a set of background
knowledge B_k is her maximum probability of re-identification, Risk(u, D) =
max PR_D(d = u|b) for b ∈ B_k. The risk of re-identification has lower bound
1/|D| (a random choice in D), and Risk(u, D) = 0 if u ∉ D.
An individual is hence associated with several privacy risks, one for each
background knowledge configuration of an attack. Every privacy risk of an
individual can be computed using the following procedure:
1. define an attack based on a specific background knowledge;
2. given an individual and fixing k, compute all the possible b ∈ B_k and the
corresponding probabilities of re-identification;
3. select the privacy risk of the individual for a set B_k as the maximum prob-
ability of re-identification across all b ∈ B_k.
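The three steps can be sketched for an order-insensitive attack in which the adversary knows k locations. This is a brute-force illustration in Python; the multiset matching rule and the helper names are our simplifying assumptions, modeled on the attacks defined later in Section 4.2:

```python
from itertools import combinations
from collections import Counter

def matching(record, b):
    """True if the background knowledge b (a multiset of locations)
    is contained in the record (a list of visited locations)."""
    counts = Counter(record)
    return all(counts[loc] >= c for loc, c in Counter(b).items())

def privacy_risk(D, u, k):
    """Maximum probability of re-identification of individual u in
    dataset D over all background knowledge b of size k."""
    risk = 0.0
    # Step 2: enumerate every b in B_k, i.e., all k-combinations of u's data.
    for b in set(combinations(sorted(D[u]), k)):
        matches = sum(1 for record in D.values() if matching(record, b))
        # Step 3: keep the maximum probability 1/|M(D, b)|.
        risk = max(risk, 1.0 / matches)  # matches >= 1: u matches her own b
    return risk

D = {"u1": ["A", "B", "C"], "u2": ["A", "B", "D"], "u3": ["A", "C"]}
# b = ("B", "C") matches only u1, so u1 is fully re-identifiable with k = 2.
```

The nested enumeration over all k-combinations is exactly the source of the combinatorial cost discussed in the next subsection.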
3.1 Computational Complexity of Privacy Risk Computation
The procedure of privacy risk computation has a high computational complexity.
We assume that the adversary uses all the information available to her when
conducting a re-identification attack on an individual. The maximum possible
value of k is len, the length of the data structure of an individual. Since it
is unlikely that an adversary knows the complete movements of an individual
(i.e., all the points), we have to reason about different, reasonable values
of k. To compute all b ∈ B_k we have to compute the k-combinations of elements
of the original data structure. We need all b to correctly compute the risk
of re-identification, since we have to know all the possible probabilities of re-
identification. This leads to a high overall computational complexity O(C(len, k) ×
N), where C(len, k) is the binomial coefficient "len choose k": the framework
generates C(len, k) background knowledge configurations b and, for each
b, it executes N matching operations by applying function matching. While
some optimizations can be made depending on the kind of attack simulated, the
overall complexity of the procedure is dominated by the C(len, k) term.
4 Fast Privacy Risk Assessment with Data Mining
Given its computational complexity, the privacy risk computation becomes un-
feasible as the size of the dataset increases. This drawback is even more serious
if we consider that the privacy risks must necessarily be re-computed every time
the mobility dataset is updated and for every selection of individuals, geographic
areas and periods of time. In order to overcome these problems, we propose a
fast and flexible data mining approach. The idea is to train a regression model
to predict the privacy risk of an individual based solely on her individual mo-
bility patterns. The training of the predictive model is made by using a dataset
where every record refers to an individual and consists of (i) a vector of the in-
dividual’s mobility features and (ii) the privacy risk value of the individual. We
make our approach parametric with respect to the predictive algorithm: in our
experiments we use a Random Forest regressor, but any regression algorithm
available in the literature can be used for the predictive task. Note that our
approach is constrained to the fixed, well-defined set of attacks introduced in Section 4.2, which
is a representative set of nine sufficiently diverse attacks tailored for the data
structures required to compute standard individual human mobility measures.
Our approach can be easily extended to any type of attack defined on human
mobility data by using the privacy framework proposed by (12).
4.1 Individual Mobility Features
The mobility dynamics of an individual can be described by a set of measures
widely used in the literature. Some measures describe specific aspects of an
individual's mobility; other measures describe an individual's mobility in relation
to collective mobility. A subset of these measures can be simply obtained as
aggregation of an individual’s trajectory or frequency vector. The number of
visits V of an individual is the length of her trajectory, i.e., the sum of all the
visits she made to any location during the period of observation (9). By dividing
this quantity by the number of days in the period of observation we obtain the
average number of daily visits V̄, which is a measure of the erratic behavior of
an individual during the day (10). The length Locs of the frequency vector of
an individual indicates the number of distinct places visited by the individual
during the period of observation (15). Dividing Locs by the number of available
locations in the considered territory we obtain Locs_ratio, which indicates the
fraction of territory exploited by an individual in her mobility behavior. The
maximum distance D_max traveled by an individual is defined as the length of
the longest trip of the individual during the period of observation (20), while
D_max_ratio is defined as the ratio between D_max and the maximum possible distance
between the locations in the area of observation. The sum of all the trip lengths
traveled by the individual during the period of observation is defined as D_sum
(20). It can also be averaged over the days in the period of observation, obtaining
D̄_sum. The radius of gyration r_g is the characteristic distance traveled by
an individual during the period of observation, formally defined as in (9). The
mobility entropy E is a measure of the predictability of an individual's trajectory.
Formally, it is defined as the Shannon entropy of an individual's movements (7).
Also, for each individual we keep track of the characteristics of three different
locations: the most visited location, the second most visited location and the
least visited location. The frequency w_i of a location i is the number of times an
individual visited location i during the period of observation, while the average
frequency w̄_i is the daily average frequency of location i. We also define w_pop
as the frequency of a location divided by the popularity of that location in the
whole dataset. The quantity U^ratio_i is the number of distinct individuals that
visited a location i divided by the total number |U_set| of individuals in the dataset,
while U_i is the number of distinct individuals that visited location i during the
period of observation. Finally, the location entropy E_i is the predictability of
location i, defined as a variation of the Shannon entropy.
Every individual u in the dataset is described by a mobility vector m_u of the
16 mobility features described above. It is worth noting that all the measures
can be computed in linear time in the size of the corresponding data structure.
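A few of these measures can be sketched directly from the data structures of Section 2. The following is an illustrative mini-implementation in Python; the feature names follow the text, but distances are plain Euclidean for brevity rather than geographic:

```python
import math
from collections import Counter

def mobility_features(trajectory, n_days):
    """Compute a handful of the mobility features for one individual.
    trajectory: list of (x, y) locations in temporal order."""
    V = len(trajectory)                      # number of visits
    V_daily = V / n_days                     # average daily visits
    freqs = Counter(trajectory)
    Locs = len(freqs)                        # number of distinct locations
    # Radius of gyration: characteristic distance from the center of mass.
    cx = sum(x for x, _ in trajectory) / V
    cy = sum(y for _, y in trajectory) / V
    r_g = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2
                        for x, y in trajectory) / V)
    # Mobility entropy: Shannon entropy of the visitation frequencies.
    E = -sum((w / V) * math.log2(w / V) for w in freqs.values())
    return {"V": V, "V_daily": V_daily, "Locs": Locs, "r_g": r_g, "E": E}

feats = mobility_features([(0, 0), (0, 0), (1, 0), (0, 1)], n_days=2)
```

Each measure is a single pass over the trajectory or frequency vector, consistent with the linear-time claim above.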
4.2 Privacy attacks on mobility data
In this section we describe the attacks we use in this paper:
Location Attack In a Location attack the adversary knows a certain num-
ber of locations visited by the individual but she does not know the temporal
order of the visits. Since an individual might visit the same location multiple
times in a trajectory, the adversary’s knowledge is a multiset that may contain
multiple occurrences of the same location.
Location Sequence Attack In a Location Sequence attack the adversary knows
a subset of the locations visited by the individual and the temporal ordering of
the visits.
Visit Attack In a Visit attack the adversary knows a subset of the locations
visited by the individual and the time the individual visited these locations.
Frequent Location and Sequence Attack We also introduce two attacks
based on the knowledge of the location frequency. In the Frequent Location at-
tack the adversary knows a number of frequent locations visited by an individual,
while in the Frequent Location Sequence attack the adversary knows a subset of
the locations visited by an individual and the relative ordering with respect to
the frequencies (from most frequent to least frequent). The Frequent Location
attack is similar to the Location attack with the difference that in frequency vec-
tors a location can appear only once. The Frequent Location Sequence attack is
similar to the Location Sequence attack, with two differences: first, a location
can appear only once in the vector; second, locations in a frequency vector are
ordered by descending frequency and not by time. Thus the location sequence
X_s of length k cannot contain repetitions of locations.
Frequency Attack We introduce an attack where the adversary knows the lo-
cations visited by the individual, their reciprocal ordering of frequency, and the
minimum number of visits of the individual. This means that, when searching for
specific subsequences, the adversary must also consider subsequences containing
the known locations with a greater frequency.
Home And Work Attack In the Home and Work attack the adversary knows
the two most frequent locations of an individual and their frequencies. It essentially
assumes the same background knowledge as the Frequency attack, but related
only to two locations. This is the only attack where the set of background
knowledge is fixed and composed of just a single 2-combination for each individual.
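The key difference between the order-insensitive attacks and the order-sensitive ones lies in the matching test. An illustrative comparison in Python (the function names are ours, not the paper's):

```python
from collections import Counter

def matches_location(trajectory, b):
    """Location attack: b is a multiset of locations, order ignored."""
    counts = Counter(trajectory)
    return all(counts[loc] >= c for loc, c in Counter(b).items())

def matches_sequence(trajectory, b):
    """Location Sequence attack: b must appear as a (not necessarily
    contiguous) subsequence of the trajectory, in temporal order."""
    it = iter(trajectory)
    return all(loc in it for loc in b)  # each `in` consumes the iterator

T = ["A", "B", "A", "C"]
# ("C", "A") is contained as a multiset but not as an ordered subsequence.
```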
4.3 Construction of training dataset
Given an attack i based on a specific set of background knowledge B^i_j, the
regression training dataset TR^i_j can be constructed by the following two-step
procedure: first, given a mobility dataset D, for every individual u we compute
the set of individual mobility features described in Section 4.1 based on her
mobility data. Every individual u is hence described by a mobility feature vector
m_u. All the individuals' mobility feature vectors compose the mobility matrix
F = (m_1, . . . , m_n), where n is the number of individuals in D; second, for every
individual we simulate the attack with B^i_j on D, in order to compute a privacy
risk value for every individual. We obtain a privacy risk vector R^i_j = (r_1, . . . , r_n).
The regression training set is hence TR^i_j = (F, R^i_j).
Every regression dataset TR^i_j is used to train a predictive model M^i_j. If
0 ≤ i ≤ I, where I is the number of different kinds of attack, and 0 ≤ j ≤ J,
where J is the number of different sets of possible background knowledge, we
have a total of J × I models. For example, if we consider sets of background
knowledge ranging in size from j = 1 to j = 5 for 7 different attacks, we would
have I = 7 and J = 5. The predictive model will be used by the Data Provider
to immediately estimate the privacy risk value of previously unseen individuals,
whose data were not used in the learning process, with respect to attack i, set
of background knowledge B^i_j and dataset D.
Example 1 (Construction of regression training set). Let us consider a mobility
dataset of trajectories D = {T_u1, T_u2, T_u3, T_u4, T_u5} corresponding to five individuals
u1, u2, u3, u4 and u5. Given an attack i, a set of background knowledge
B^i_j and dataset D, we construct the regression training set TR^i_j as follows:
first, for every individual u_i we compute the 16 individual mobility measures
based on her trajectory T_ui. Every individual u_i is hence described by a mobility
feature vector of length 16, m_ui = (m^(ui)_1, . . . , m^(ui)_16). All the mobility feature
vectors compose the mobility matrix F = (m_u1, m_u2, m_u3, m_u4, m_u5); second, we
simulate the attack with B^i_j on dataset D and obtain a vector of five privacy risk
values R^i_j = (r_u1, r_u2, r_u3, r_u4, r_u5), one for each individual.
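The two-step construction amounts to pairing a feature matrix F with a risk vector R. A schematic sketch in Python, where `mobility_features` and `simulate_attack` are hypothetical stand-ins (here, deliberately trivial toys) for the computations described in Sections 4.1 and 3:

```python
def build_training_set(D, mobility_features, simulate_attack):
    """D: dict mapping individual -> mobility data.
    Returns (F, R): feature matrix and privacy risk vector, where row i
    of F and entry i of R refer to the same individual."""
    individuals = sorted(D)
    F = [mobility_features(D[u]) for u in individuals]   # mobility matrix F
    R = [simulate_attack(D, u) for u in individuals]     # risk vector R^i_j
    return F, R

# Toy stand-ins, for illustration only:
toy_features = lambda data: [len(data), len(set(data))]
toy_attack = lambda D, u: 1.0 / sum(1 for v in D.values()
                                    if set(v) & set(D[u]))

D = {"u1": ["A", "A", "B"], "u2": ["C"]}
F, R = build_training_set(D, toy_features, toy_attack)
# (F, R) is the regression training set TR^i_j of Section 4.3.
```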
4.4 Usage of the regression approach
The Data Provider can use a regression model M^i_j to determine the value of
privacy risk with respect to an attack i and a set of background knowledge B^i_j
for: (i) previously unseen individuals, whose data were not used in the learning
process; (ii) a selection of individuals in the database already used in the
learning process. It is worth noting that with existing methods the privacy risk
of individuals in scenario (ii) must be recomputed by simulating attack i from
scratch. In contrast, the regression model M^i_j allows for obtaining the
privacy risk of the selected individuals immediately. The computation of the
mobility measures and the regression of privacy risk can be done in polynomial
time as a one-off procedure. To clarify this point, let us consider the following
scenario. A Data Analyst asks the Data Provider for updated mobility data
about a new set of individuals with the purpose of studying their characteristic
traveled distance (radius of gyration rg) and the predictability of their move-
ments (mobility entropy E). Since both measures can be computed by using a
frequency vector, the Data Provider can release just the frequency vectors of the
individuals requested. Before that, however, the Data Provider wants to determine
the level of privacy risk of the individuals with respect to the Frequency
attack (F) and several sets of background knowledge B^F_j. The Data Provider
uses the previously trained regression model M^F_j to obtain the privacy risk of the
individuals: it computes the mobility features for the individuals in the dataset
and gives them in input to the regression model, obtaining an estimate of privacy
risk. On the basis of the privacy risks obtained from M^F_j, the Data Provider
can identify risky individuals, i.e., individuals with a high privacy risk. She can
then decide either to filter out the risky individuals or to select
suitable privacy-preserving techniques (e.g., k-anonymity or differential privacy)
and transform their mobility data in such a way that their privacy is preserved.
In the next section we present an evaluation of our methodology on real-world
mobility data and show the effectiveness of the proposed regression approach.
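This deployment scenario can be sketched end to end. For self-containment, a simple 1-nearest-neighbour regressor stands in below for the Random Forest regressor used in the paper's experiments (in practice one would use, e.g., scikit-learn's RandomForestRegressor); the feature vectors and risk values are toy data:

```python
import math

class NearestNeighbourRegressor:
    """Minimal stand-in for the paper's Random Forest regressor:
    predicts the risk of the closest training individual in feature space."""
    def fit(self, F, R):
        self.F, self.R = F, R
        return self

    def predict(self, F_new):
        return [self.R[min(range(len(self.F)),
                           key=lambda i: math.dist(self.F[i], x))]
                for x in F_new]

# Training data: mobility feature vectors and simulated privacy risks.
F_train = [[10.0, 3.0], [200.0, 50.0], [40.0, 8.0]]
R_train = [1.0, 0.1, 0.5]

model = NearestNeighbourRegressor().fit(F_train, R_train)
# A previously unseen individual: risk estimated from features alone,
# with no attack simulation required.
risk = model.predict([[35.0, 9.0]])[0]
```

The point of the sketch is the workflow, not the model class: once trained, any regressor turns risk assessment into a single cheap prediction per individual.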
5 Experiments
For all the attacks defined, except the Home and Work attack, we consider four
sets of background knowledge B_k with k = 2, 3, 4, 5, where each B_k corresponds
to an attack where the adversary knows k locations visited by the individual.
For the Home and Work attack we have just one possible set of background
knowledge, where the adversary knows the two most frequent locations of an
individual. We use a dataset provided by Octo Telematics³ storing the GPS
tracks of 9,715 private vehicles traveling in Florence, a very populous area of
central Italy, from 1st May to 31st May 2011, corresponding to 179,318 trajectories.
We assign each origin and destination point of the original raw trajectories to
the corresponding census cell, according to the information provided by the
Italian National Statistics Bureau, so that every origin and destination point
corresponds to a location (9). This allows us to describe the mobility of every vehicle in terms
of a trajectory, in compliance with the definition introduced in Section 2. We
first performed a simulation of the attacks, computing the privacy risk values
for all individuals in the dataset and for all B_k.⁴ We then performed regression
experiments using a Random Forest regressor.⁵ Table 1 shows the average Mean
Squared Error (mse) and the average coefficient of determination R² resulting
from the regression experiments for all the attacks. The results are averaged over
k = 2, 3, 4, 5, since the empirical distributions of privacy risk are fairly similar
across different values of k. Also, mse and R² are almost identical for each kind
of attack. The best results are obtained for the Frequent Location Sequence attack,
with mse = 0.01 and R² = 0.92, while the weakest results are obtained
for the Home and Work attack, with mse = 0.07 and R² = 0.50. Overall, the
results show good predictive performance across all attacks, suggesting that
regression could indeed be an accurate alternative to the direct computation of
privacy risk.
Execution Times We show the computational improvement of our approach in
terms of execution time by comparing in Table 2 the execution times of the attack
simulations and the execution times of the regression tasks.6The execution time
of a single regression task is the sum of three subtasks: (i) the execution time
of training the regressor on the training set; (ii) the execution time of using the
trained regressor to predict the risk on the test set; (iii) the execution time of
evaluating the performance of regression. Table 2 shows that the execution time
of attack simulations is low for most of the attacks, except for Location Sequence
and Location, for which execution times are huge: more than one week each. In
contrast, the regression tasks have roughly constant execution times of around 22s. In
summary, our approach can compute the risk levels for all the 33 attacks in 179
seconds (less than 3 minutes), while the attack simulations require more than
two weeks of computation.
⁴ The Python code for attack simulation is available here:
⁵ We use the Python package scikit-learn to perform the regression experiments.
⁶ For a given type of attack we report the sum of the execution times of the attacks
for configurations k = 2, 3, 4, 5. We perform the experiments on Ubuntu 16.04.1 LTS
64 bit, 32 GB RAM, 3.30GHz Intel Core i7.
predicted variable            mse    R²
Frequent Location Sequence    0.01   0.92
Visit                         0.01   0.89
Frequency                     0.02   0.88
Location                      0.02   0.90
Location Sequence             0.02   0.84
Frequent Location             0.03   0.73
Home and Work                 0.07   0.50
Table 1: Results of the regression experiments.
variable (Σ, k = 2..5)        simulation        regression
Home and Work                 149s (2.5m)       7s
Frequency                     645s (10m)        22s
Frequent Location Sequence    846s (14m)        22s
Frequent Location             997s (10m)        22s
Visit                         2,274s (38m)      16s
Location Sequence             >168h (1 week)    22s
Location                      >168h (1 week)    22s
total                         >2 weeks          172s
Table 2: Execution times of attack simulations and regression tasks.
Discussion The preliminary work presented above shows some promising re-
sults. The coefficient of determination and the execution times suggest that
the regression can be a valid and fast alternative to existing privacy risk assess-
ment tools. Instead of re-computing privacy risks when new data records become
available, which would result in high computational costs, a Data Provider can
effectively use the regressors to obtain immediate and reliable estimates for every
individual. The mobility measures can be computed in linear time in the size of
the dataset. Every time new mobility data of an individual become available,
the Data Provider can recompute her mobility features. To take into account
long-term changes in mobility patterns the recomputation of mobility measures
can be done at regular time intervals (e.g., every month) by considering a time
window with the most recent data (e.g., the last six months of data).
6 Related Works
Human mobility data contain sensitive personal information and can reveal
many facets of the private life of individuals, leading to potential privacy
violations. To prevent privacy leaks, many techniques have been
proposed in the literature. A widely used privacy-preserving model is k-anonymity
(14), which requires that an individual not be identifiable within a group
of size smaller than k based on their quasi-identifiers (QIDs), i.e., a set of
attributes that can be used to uniquely identify individuals. Assuming that
adversaries own disjoint parts of a trajectory, (18) reduces privacy risk by relying on
the suppression of dangerous observations from each individual's trajectory.
In (21), the authors propose the attack-graphs method, based on k-anonymity,
to defend against attacks. Other works are based on the differential privacy model
(6). (8) considers a privacy-preserving distributed aggregation framework for
movement data. (4) proposes to publish a contingency table of trajectory data,
where each cell contains the number of individuals commuting from a source
to a destination. (26) defines several similarity metrics which can be combined
in a unified framework to provide de-anonymization of mobility data and social
network data. One of the most important works about privacy risk assessment is
the LINDDUN methodology (5), a privacy-aware framework for modeling
privacy threats in software-based systems. In recent years, different techniques
for risk management have been proposed, such as NIST’s Special Publication
800-30 (17) and SEI’s OCTAVE (2). Unfortunately, many of these works do
not consider privacy risk assessment and simply include privacy considerations
when assessing the impact of threats. In (19), the authors elaborate an entropy-
based method to evaluate the disclosure risk of personal data, trying to
quantitatively manage privacy risks. The unicity measure proposed in (16) evaluates the
privacy risk as the number of records/trajectories which are uniquely identified.
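The unicity measure can be illustrated with a simple simulation: draw a few spatio-temporal points from each trajectory and count how many trajectories in the dataset contain them all. A minimal sketch (simplified with respect to (16); all names and data are illustrative):

```python
import random

def unicity(trajectories, num_points, seed=0):
    """Fraction of trajectories uniquely re-identified by `num_points`
    spatio-temporal points drawn at random from each trajectory."""
    rng = random.Random(seed)
    unique = 0
    for traj in trajectories:
        if len(traj) < num_points:
            continue  # too short to draw the probe points from
        probe = set(rng.sample(traj, num_points))
        # An adversary knowing `probe` matches every trajectory containing it
        matches = sum(1 for t in trajectories if probe <= set(t))
        if matches == 1:
            unique += 1
    return unique / len(trajectories)

# Toy example: three trajectories of (location, hour) points
trajs = [
    [("A", 9), ("B", 10), ("C", 11)],
    [("A", 9), ("B", 10), ("D", 12)],
    [("E", 8), ("F", 9), ("G", 10)],
]
print(unicity(trajs, num_points=3))  # 1.0: the full trajectories are all distinct
```

The higher the unicity for small `num_points`, the easier re-identification becomes with limited background knowledge.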
(25) proposes an empirical risk framework for improving privacy risk estimation
for mobility data, evaluating the model on k-anonymized data. (3) proposes
a risk-aware framework for information disclosure which supports runtime
risk assessment, using adaptive anonymization as a risk-mitigation method.
Unfortunately, this framework only works on relational datasets, since it needs
to discriminate between quasi-identifiers and sensitive attributes. In this paper we
use the privacy risk assessment framework introduced by (12) (Section 3) to
calculate the privacy risks of each individual in a mobility dataset.
7 Conclusion
Human mobility data are a precious proxy to improve our understanding of
human dynamics, as well as to improve urban planning, transportation engineering
and epidemic modeling. Nevertheless, human mobility data contain sensitive
information which can lead to serious violations of the privacy of the individuals
involved. In this paper we explored a fast and flexible solution for estimating
the privacy risk in human mobility data, which overcomes the computational
issues of existing privacy risk assessment frameworks. We showed through
experiments that our approach achieves good estimates of privacy risk.
As future work, our approach should be tested more extensively on different
and larger datasets. It would also be interesting to evaluate the importance
of mobility features with respect to the prediction of risk: if every individual
can obtain an estimate of her own privacy risk based just on her mobility data,
this increases awareness about personal data and helps her decide whether or
not to share those data with third parties. Another possible extension of our
method would be to apply more refined data mining techniques to assess the
privacy risk of individuals. Moreover, our approach provides a fast tool to
immediately obtain the privacy risks of individuals, leaving to the Data Provider
the choice of the most suitable privacy-preserving techniques to manage and
mitigate those risks. Finally, it would be interesting to perform an extensive
experimentation to select the techniques that best reduce the privacy risk of
individuals in mobility datasets while ensuring high data quality for analytical
services.
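The feature-importance analysis envisioned above could be prototyped model-agnostically with permutation importance: shuffle one mobility feature and measure the drop in the risk model's accuracy. A minimal sketch with hypothetical data and a stand-in model (none of these names or values come from our experiments):

```python
import random

def permutation_importance(predict, X, y, feature_idx, seed=0):
    """Accuracy drop when one feature column is randomly shuffled:
    a model-agnostic proxy for how much that feature contributes
    to privacy risk prediction (illustrative sketch only)."""
    rng = random.Random(seed)
    base = sum(predict(x) == t for x, t in zip(X, y)) / len(y)
    col = [x[feature_idx] for x in X]
    rng.shuffle(col)
    X_perm = [list(x) for x in X]
    for row, v in zip(X_perm, col):
        row[feature_idx] = v
    perm = sum(predict(x) == t for x, t in zip(X_perm, y)) / len(y)
    return base - perm

# Hypothetical data: feature 0 (e.g., number of distinct locations) drives
# the risk label, feature 1 is irrelevant noise.
X = [[1, 9], [2, 8], [7, 3], [8, 2]]
y = [0, 0, 1, 1]
risk_model = lambda x: int(x[0] > 5)  # stand-in for a trained classifier
print(permutation_importance(risk_model, X, y, 0))  # importance of the informative feature
print(permutation_importance(risk_model, X, y, 1))  # 0.0: shuffling an unused feature changes nothing
```

Features with near-zero importance could be dropped from the individual's risk estimate without degrading it.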
Acknowledgments. Funded by the European project SoBigData (Grant Agreement 654024).
References
1. O. Abul, F. Bonchi, and M. Nanni. Never Walk Alone: Uncertainty for Anonymity
in Moving Objects Databases. In ICDE 2008. 376–385.
2. C. Alberts, S. Behrens, R. Pethia, and W. Wilson. 1999. Operationally Critical
Threat, Asset, and Vulnerability Evaluation (OCTAVE) Framework, Version 1.0.
CMU/SEI-99-TR-017. Software Engineering Institute, Carnegie Mellon University.
3. A. Armando, M. Bezzi, N. Metoui, and A. Sabetta. Risk-Based Privacy-Aware
Information Disclosure. Int. J. Secur. Softw. Eng. 6, 2 (April 2015), 70–89.
4. G. Cormode, C. M. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially
private summaries for sparse data. In ICDT ’12. 299–311.
5. M. Deng, K. Wuyts, R. Scandariato, B. Preneel, and W. Joosen. A privacy threat
analysis framework: supporting the elicitation and fulfillment of privacy requirements.
Requir. Eng. 16, 1 (2011), 3–32.
6. C. Dwork, F. McSherry, K. Nissim, and A. Smith. 2006. Calibrating Noise to
Sensitivity in Private Data Analysis. In TCC ’06. 265–284.
7. N. Eagle and A. S. Pentland. Eigenbehaviors: identifying structure in routine. Be-
havioral Ecology and Sociobiology 63, 7 (2009), 1057–1066.
8. A. Monreale, W. H. Wang, F. Pratesi, S. Rinzivillo, D. Pedreschi, G. Andrienko,
and N. Andrienko. 2013. Privacy-Preserving Distributed Movement Data Aggregation.
Springer International Publishing, 225–245.
9. L. Pappalardo, F. Simini, S. Rinzivillo, D. Pedreschi, F. Giannotti, and A.-L.
Barabasi. Returners and explorers dichotomy in human mobility. Nature Com-
munications 6 (2015).
10. L. Pappalardo and F. Simini. Modelling spatio-temporal routines in human mo-
bility. CoRR abs/1607.05952 (2016).
11. L. Pappalardo, M. Vanhoof, L. Gabrielli, Z. Smoreda, D. Pedreschi, and F. Gi-
annotti. An analytical framework to nowcast well-being using mobile phone data.
International Journal of Data Science and Analytics 2, 1 (2016), 75–92.
12. F. Pratesi, A. Monreale, R. Trasarti, F. Giannotti, D. Pedreschi, and T. Yanagihara.
PRISQUIT: a System for Assessing Privacy Risk versus Quality in Data
Sharing. Technical Report 2016-TR-043. ISTI-CNR, Pisa, Italy.
13. I. S. Rubinstein. Big Data: The End of Privacy or a New Beginning? International
Data Privacy Law (2013).
14. P. Samarati and L. Sweeney. 1998. Generalizing Data to Provide Anonymity
when Disclosing Information (Abstract). In PODS. 188.
15. C. Song, T. Koren, P. Wang, and A.-L. Barabasi. Modelling the scaling properties
of human mobility. Nat Phys 6, 10 (2010), 818–823.
16. Y. Song, D. Dahlmeier, and S. Bressan. Not So Unique in the Crowd: a Simple and
Effective Algorithm for Anonymizing Location Data. In PIR@SIGIR 2014. 19–24.
17. G. Stoneburner, A. Goguen, and A. Feringa. 2002. Risk Management Guide for
Information Technology Systems: Recommendations of the National Institute of Stan-
dards and Technology. NIST special publication, Vol. 800.
18. M. Terrovitis and N. Mamoulis. 2008. Privacy Preservation in the Publication of
Trajectories. In MDM. 65–72.
19. S. Trabelsi, V. Salzgeber, M. Bezzi, and G. Montagnon. 2009. Data disclosure risk
evaluation. In CRiSIS ’09. 35–72.
20. N. E. Williams, T. A. Thomas, M. Dunbar, N. Eagle, and A. Dobra. Measures of
Human Mobility Using Mobile Phone Records Enhanced with GIS Data. PLoS ONE
10, 7 (2015), 1–16.
21. R. Yarovoy, F. Bonchi, L. V. S. Lakshmanan, and W. H. Wang. 2009. Anonymizing
moving objects: how to hide a MOB in a crowd?. In EDBT. 72–83.
22. N. Mohammed, B. C.M. Fung, and M. Debbabi. Walking in the Crowd: Anonymiz-
ing Trajectory Data for Pattern Analysis. In CIKM 2009. 1441–1444.
23. H. Zang and J. Bolot. Anonymization of Location Data Does Not Work: A Large-
scale Measurement Study. In MobiCom 2011. 145–156.
24. J. Unnikrishnan and F. M. Naini. De-anonymizing private data by matching statis-
tics. In Allerton 2013. 1616–1623.
25. A. Basu, A. Monreale, J. C. Corena, F. Giannotti, D. Pedreschi, S. Kiyomoto, Y.
Miyake, T. Yanagihara, and R. Trasarti. 2014. A Privacy Risk Model for Trajectory
Data. In Trust Management VIII. 125–140.
26. S. Ji, Weiqing Li, M. Srivatsa, J. S. He, and R. Beyah. 2014. Structure Based Data
De-Anonymization of Social Networks and Mobility Traces. 237–254.