Fast estimation of privacy risk in human mobility data
Roberto Pellungrini (1), Luca Pappalardo (1,2), Francesca Pratesi (1,2), and Anna Monreale (1)
(1) Department of Computer Science, University of Pisa, Italy
(2) ISTI-CNR, Pisa, Italy
Abstract. Mobility data are an important proxy to understand the patterns of human movements, develop analytical services and design models for simulation and prediction of human dynamics. Unfortunately mobility data are also very sensitive, since they may contain personal information about the individuals involved. Existing frameworks for privacy risk assessment enable the data providers to quantify and mitigate privacy risks, but they suffer two main limitations: (i) they have a high computational complexity; (ii) the privacy risk must be re-computed for each new set of individuals, geographic areas or time windows. In this paper we explore a fast and flexible solution to estimate privacy risk in human mobility data, using predictive models to capture the relation between an individual's mobility patterns and her privacy risk. We show the effectiveness of our approach by experimentation on a real-world GPS dataset and provide a comparison with traditional methods.
1 Introduction
In recent years, human mobility analysis has attracted growing interest due to its importance in several applications such as urban planning, transportation engineering and public health (11). The great availability of these data has offered
the opportunity to observe human movements at large scales and in great detail,
leading to the discovery of quantitative patterns (9), the mathematical modeling
of human mobility (10; 15) and so on. Unfortunately mobility data are sensitive
because they may reveal personal information or allow the re-identification of in-
dividuals, creating serious privacy risks if they are analyzed with malicious intent
(13). Driven by these privacy concerns, researchers have developed methodologies and frameworks to mitigate the individual privacy risks associated with the study of GPS trajectories and Big Data in general (1). These tools aim at preserving both the individuals' right to privacy and the effectiveness of the analytical results, trying to find a reasonable trade-off between privacy protection and data quality. They also support the definition of infrastructures for privacy protection and of technical requirements for data protection, linking privacy-preserving solutions to legal regulations, since assessing privacy risk is required by the new EU General Data Protection Regulation. To this
aim, Pratesi et al. (12) propose a framework for the privacy risk assessment of
individuals in mobility datasets. Although frameworks like the one presented in
(12) are effective in many scenarios, they suffer a major drawback: the privacy
risk assessment has a high computational complexity (non-polynomial in time)
because it computes the maximum privacy risk given an external knowledge that
a malicious adversary may have, i.e., it considers all the possible ways the ad-
versary can try to re-identify an individual. Moreover, the privacy risks must
be recomputed every time new data become available and for every selection of
individuals, geographic areas and periods of time.
In this paper we propose a data mining approach for privacy risk assessment
that overcomes the computational limitations of existing frameworks. We first
introduce some possible re-identification attacks on mobility data and compute the privacy risk level of each individual according to these attacks. We then train a regressor on such data to estimate in polynomial time the privacy risk level of previously unseen vehicles
based just on their individual mobility patterns. In a scenario where a Data An-
alyst asks a Data Provider for mobility data to deploy an analytical service, the
Data Provider (e.g., a mobile phone carrier) can use the regressor to immedi-
ately identify individuals with a high privacy risk. Then, the Data Provider can
select the most suitable privacy-preserving technique (e.g., k-anonymity, differ-
ential privacy) to mitigate their privacy risk and release only safe data to the
Data Analyst. Our experiments on GPS data show that our approach is fairly accurate in predicting the privacy risk of unseen individuals in an urban area.
The rest of the paper is organized as follows. In Section 2 we define the
data structures to describe human mobility data according to different data
aggregations. In Section 3 we introduce the framework used for the privacy risk
assessment, while Section 4 describes the data mining approach we propose. In
Section 5 we show the results of our experiments and we discuss them. Section
6 presents the main works related to our paper and, finally, Section 7 concludes the paper by proposing some lines of future research.
2 Data Definitions
The approach we present in this paper is tailored for human mobility data,
i.e., data describing the movements of a set of individuals. This type of data is
generally collected in an automatic way through electronic devices (e.g., mobile phones, GPS devices) in the form of raw trajectory data. Every record has the following fields: the identifier of the individual, a geographic location expressed in coordinates (generally latitude and longitude), and a timestamp indicating when the individual stopped in or went through that location. Depending on the specific
application, a trajectory can be aggregated into different data structures:
Definition 1 (Trajectory). The trajectory $T_u$ of an individual $u$ is a temporally ordered sequence of tuples $T_u = \langle (l_1, t_1), (l_2, t_2), \ldots, (l_n, t_n) \rangle$, where $l_i = (x_i, y_i)$ is a location, $x_i$ and $y_i$ are the coordinates of the geographic location, and $t_i$ is the corresponding timestamp, with $t_i < t_j$ if $i < j$.

Definition 2 (Frequency vector). The frequency vector $W_u$ of an individual $u$ is a sequence of tuples $W_u = \langle (l_1, w_1), (l_2, w_2), \ldots, (l_n, w_n) \rangle$, where $l_i = (x_i, y_i)$ is a location, $w_i$ is the frequency of the location, i.e., how many times location $l_i$ appears in the individual's trajectory $T_u$, and $w_i > w_j$ if $i < j$. A frequency vector $W_u$ is hence an aggregation of a trajectory $T_u$.
In the following, with the term visit we refer interchangeably to a tuple in a trajectory or in a frequency vector. In other words, a visit indicates a pair consisting of a location and a piece of supplementary information, e.g., the timestamp or the frequency. We denote with $D$ a mobility dataset, which we assume is a set of records of one of the above data types (trajectories or frequency vectors).
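To make the two data structures concrete, the following sketch (illustrative Python, with hypothetical names, not part of the original framework) aggregates a trajectory into a frequency vector by counting how many times each location appears and sorting by descending frequency, as in Definition 2.

```python
from collections import Counter
from typing import List, Tuple

Location = Tuple[float, float]                 # (x, y) coordinates of a location
Visit = Tuple[Location, float]                 # (location, timestamp)
Trajectory = List[Visit]                       # temporally ordered visits (Definition 1)
FrequencyVector = List[Tuple[Location, int]]   # (location, frequency), sorted by frequency

def to_frequency_vector(trajectory: Trajectory) -> FrequencyVector:
    """Aggregate a trajectory T_u into a frequency vector W_u (Definition 2)."""
    counts = Counter(location for location, _ in trajectory)
    # locations ordered by descending frequency, as required by the definition
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Example: three visits to two distinct locations
T_u = [((43.77, 11.25), 1.0), ((43.78, 11.26), 2.0), ((43.77, 11.25), 3.0)]
W_u = to_frequency_vector(T_u)   # [((43.77, 11.25), 2), ((43.78, 11.26), 1)]
```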
3 Privacy Risk Assessment Framework
In this paper we consider the work proposed in (12), which allows for the privacy
risk assessment of human mobility data. This framework considers a scenario
where a Data Analyst asks a Data Provider for data to develop an analytical
service. The Data Provider must guarantee the right to privacy of the individ-
uals whose data are recorded. First, the Data Analyst transmits to the Data
Provider the data requirements for the service. With these specifications, the
Data Provider queries its database $\mathcal{D}$, producing a set of mobility datasets $\{D_1, \ldots, D_z\}$, each with different data structures and data aggregations. The Data Provider then iterates the following procedure until it considers the data delivery safe:
(1) Identification of Attacks: identify a set of possible attacks that an adversary might conduct in order to re-identify individuals in the datasets $\{D_1, \ldots, D_z\}$;
(2) Privacy Risk Computation: simulate the attacks and compute the set of privacy risk values for every individual in the mobility datasets $\{D_1, \ldots, D_z\}$;
(3) Dataset Selection: select a mobility dataset $D \in \{D_1, \ldots, D_z\}$ with the best trade-off between the privacy risks of individuals and the data quality, given a certain level of tolerated privacy risk and the Data Analyst's requirements;
(4) Risk Mitigation and Data delivery: apply a privacy-preserving transformation (e.g., generalization, randomization, etc.) on the chosen mobility dataset $D$ to eliminate the residual privacy risk, producing a filtered mobility dataset $D_{filt}$. Deliver $D_{filt}$ to the Data Analyst when $D_{filt}$ is adequately safe.
In this paper we focus on improving step (2), i.e., Privacy Risk Computation,
which is the most critical one from a computational point of view. Computing
the privacy risk of an individual means simulating several possible attacks a
malicious adversary can perform and computing the privacy risks associated
to each attack. The privacy risk of an individual is related to her probability
of re-identification in a dataset w.r.t. a set of re-identification attacks. A re-
identification attack assumes that an adversary gains access to a dataset. On the
basis of some background knowledge about an individual, i.e., the knowledge of
a subset of her mobility data, the adversary tries to re-identify all the records
in the dataset regarding the individual under attack. In this paper we use the
definition of privacy risk (or re-identification risk) introduced in (14).
A background knowledge represents both the kind and quantity of informa-
tion known by an adversary. Two examples of kinds of background knowledge
are a subset of the locations visited by an individual (spatial dimension) and the
specific times an individual visited those locations (spatial and temporal dimen-
sions). We denote with $k$ the number of elements known by the adversary. So, for example, a specific background knowledge is the knowledge of three specific locations visited by the individual under attack. We denote a set of background knowledge of size $k$ with $B_k$ and a specific background knowledge with $b$.
Let $\mathcal{D}$ be a database, $D$ a mobility dataset extracted from $\mathcal{D}$ as an aggregation of the data on specific dimensions (e.g., an aggregated data structure and/or a filtering on time and/or space), and $D_u$ the set of records representing individual $u$ in $D$. We define the probability of re-identification as follows:

Definition 3 (Probability of re-identification). Given an attack, a function $matching(d, b)$ indicating whether or not a record $d \in D$ matches the background knowledge $b$, and a function $M(D, b) = \{d \in D \mid matching(d, b) = True\}$, we define the probability of re-identification of an individual $u$ in dataset $D$ as $PR_D(d = u \mid b) = \frac{1}{|M(D, b)|}$, that is, the probability to associate record $d \in D$ to individual $u$, given background knowledge $b$.
Note that $PR_D(d = u \mid b) = 0$ if the individual $u$ is not represented in $D$. Since each background knowledge $b$ has its own probability of re-identification, we define the risk of re-identification of an individual as the maximum probability of re-identification over the set of possible background knowledge:

Definition 4 (Risk of re-identification or Privacy risk). The risk of re-identification (or privacy risk) of an individual $u$ given a set of background knowledge $B_k$ is her maximum probability of re-identification $Risk(u, D) = \max_{b \in B_k} PR_D(d = u \mid b)$. The risk of re-identification has the lower bound $\frac{|D_u|}{|D|}$ (a random choice in $D$), and $Risk(u, D) = 0$ if $u \notin D$.
An individual is hence associated with several privacy risks, one for every set of background knowledge of an attack. Every privacy risk of an individual can be computed using the following procedure:
1. define an attack based on a specific background knowledge;
2. given an individual and fixing $k$, compute all the possible $b \in B_k$ and the corresponding probabilities of re-identification;
3. select the privacy risk of the individual for a set $B_k$ as the maximum probability of re-identification across all $b \in B_k$.
3.1 Computational Complexity of Privacy Risk Computation
The procedure of privacy risk computation has a high computational complexity.
We assume that the adversary uses all the information available to her when
conducting a re-identification attack on an individual. The maximum possible
value of $k$ is $len$, the length of the data structure of an individual. Since it is unlikely that an adversary knows the complete movement of an individual (i.e., all the points), we have to reason about different and reasonable values of $k$. To compute all $b \in B_k$ we have to compute the $k$-combinations of elements from the original data structure. We need all $b$ to correctly compute the risk of re-identification, since we have to know all the possible probabilities of re-identification. This leads to a high overall computational complexity $O(\binom{len}{k} \times N)$, since the framework generates $\binom{len}{k}$ instances of background knowledge $b$ and, for each $b$, it executes $N$ matching operations by applying function $matching$. While some optimizations can be made depending on the kind of attack simulated, the overall complexity of the procedure is dominated by the $\binom{len}{k}$ term.
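As a concrete illustration of why the $\binom{len}{k}$ term dominates, the following sketch (illustrative Python under assumed matching semantics, not the authors' implementation) computes the privacy risk of one individual for a Location-style attack by enumerating every $k$-combination of her visited locations and counting, for each candidate background knowledge $b$, the records of the dataset that match it.

```python
from itertools import combinations
from collections import Counter

def matching(record, b):
    """Location attack: a record (list of (location, time) visits) matches if it
    contains every location in b at least as many times as b does (multiset containment)."""
    record_counts = Counter(loc for loc, _ in record)
    return all(record_counts[loc] >= n for loc, n in Counter(b).items())

def privacy_risk(dataset, individual, k):
    """Brute-force Risk(u, D) = max over b in B_k of 1 / |M(D, b)|.
    dataset: dict mapping individual id -> trajectory (list of (location, time))."""
    locations = [loc for loc, _ in dataset[individual]]
    risk = 0.0
    # all k-combinations of the individual's locations: O(C(len, k)) candidates
    for b in set(combinations(sorted(locations), k)):
        # N matching operations per candidate background knowledge b
        matches = sum(1 for record in dataset.values() if matching(record, b))
        if matches > 0:
            risk = max(risk, 1.0 / matches)
    return risk
```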
4 Fast Privacy Risk Assessment with Data Mining
Given its computational complexity, the privacy risk computation becomes infeasible as the size of the dataset increases. This drawback is even more serious if we consider that the privacy risks must necessarily be re-computed every time
the mobility dataset is updated and for every selection of individuals, geographic
areas and periods of time. In order to overcome these problems, we propose a
fast and flexible data mining approach. The idea is to train a regression model
to predict the privacy risk of an individual based solely on her individual mo-
bility patterns. The training of the predictive model is made by using a dataset
where every record refers to an individual and consists of (i) a vector of the in-
dividual’s mobility features and (ii) the privacy risk value of the individual. We
make our approach parametric with respect to the predictive algorithm: in our
experiments we use a Random Forest regressor, but every algorithm available in the literature can be used for the predictive tasks. Note that our approach is constrained to the fixed, well-defined set of attacks introduced in Section 4.2, which is a representative set of sufficiently diverse attacks tailored for the data structures required to compute standard individual human mobility measures. Our approach can be easily extended to any type of attack defined on human mobility data by using the privacy framework proposed in (12).
4.1 Individual Mobility Features
The mobility dynamics of an individual can be described by a set of measures
widely used in literature. Some measures describe specific aspects of an indi-
vidual’s mobility; other measures describe an individual’s mobility in relation
to collective mobility. A subset of these measures can be simply obtained as
aggregation of an individual’s trajectory or frequency vector. The number of
visits $V$ of an individual is the length of her trajectory, i.e., the sum of all the visits she made in any location during the period of observation (9). By dividing this quantity by the number of days in the period of observation we obtain the average number of daily visits $\overline{V}$, which is a measure of the erratic behavior of an individual during the day (10). The length $Locs$ of the frequency vector of an individual indicates the number of distinct places visited by the individual during the period of observation (15). Dividing $Locs$ by the number of available locations in the considered territory we obtain $Locs_{ratio}$, which indicates the fraction of territory exploited by an individual in her mobility behavior. The maximum distance $D_{max}$ traveled by an individual is defined as the length of the longest trip of the individual during the period of observation (20), while $D^{trip}_{max}$ is defined as the ratio between $D_{max}$ and the maximum possible distance between the locations in the area of observation. The sum of all the trip lengths traveled by the individual during the period of observation is defined as $D_{sum}$ (20). It can also be averaged over the days in the period of observation, obtaining $\overline{D}_{sum}$. The radius of gyration $r_g$ is the characteristic distance traveled by an individual during the period of observation, formally defined as in (9). The mobility entropy $E$ is a measure of the predictability of an individual's trajectory. Formally, it is defined as the Shannon entropy of an individual's movements (7). Also, for each individual we keep track of the characteristics of three different locations: the most visited location, the second most visited location and the least visited location. The frequency $w_i$ of a location $i$ is the number of times an individual visited location $i$ during the period of observation, while the average frequency $\overline{w}_i$ is the daily average frequency of location $i$. We also define $w^{pop}_i$ as the frequency of a location divided by the popularity of that location in the whole dataset. The quantity $U^{ratio}_i$ is the number of distinct individuals that visited a location $i$ divided by the total number $|U_{set}|$ of individuals in the dataset, while $U_i$ is the number of distinct individuals that visited location $i$ during the period of observation. Finally, the location entropy $E_i$ is the predictability of location $i$, defined as a variation of the Shannon entropy.

Every individual $u$ in the dataset is described by a mobility vector $m_u$ of the 16 mobility features described above. It is worth noting that all the measures can be computed in linear time in the size of the corresponding data structure.
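For illustration, the sketch below (hypothetical Python, not the authors' code) computes a few of the features just described, the number of visits $V$, the number of distinct locations $Locs$, the radius of gyration $r_g$ and the mobility entropy $E$, from a single frequency vector; the entropy uses one common unnormalized Shannon formulation, and distances are plain Euclidean for simplicity.

```python
import math

def mobility_features(frequency_vector):
    """Compute a subset of the mobility features from a frequency vector
    W_u = [((x, y), w), ...]. Illustrative only."""
    V = sum(w for _, w in frequency_vector)            # total number of visits
    Locs = len(frequency_vector)                       # distinct locations visited
    # center of mass of the visited locations, weighted by visitation frequency
    cx = sum(x * w for (x, y), w in frequency_vector) / V
    cy = sum(y * w for (x, y), w in frequency_vector) / V
    # radius of gyration: characteristic distance from the center of mass
    r_g = math.sqrt(sum(w * ((x - cx) ** 2 + (y - cy) ** 2)
                        for (x, y), w in frequency_vector) / V)
    # mobility entropy: Shannon entropy of the visitation frequencies
    probs = [w / V for _, w in frequency_vector]
    E = -sum(p * math.log2(p) for p in probs if p > 0)
    return {"V": V, "Locs": Locs, "r_g": r_g, "E": E}
```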
4.2 Privacy attacks on mobility data
In this section we describe the attacks we use in this paper:
Location Attack In a Location attack the adversary knows a certain num-
ber of locations visited by the individual but she does not know the temporal
order of the visits. Since an individual might visit the same location multiple
times in a trajectory, the adversary’s knowledge is a multiset that may contain
more occurrences of the same location.
Location Sequence Attack In a Location Sequence attack the adversary knows
a subset of the locations visited by the individual and the temporal ordering of
the visits.
Visit Attack In a Visit attack the adversary knows a subset of the locations
visited by the individual and the time the individual visited these locations.
Frequent Location and Sequence Attack We also introduce two attacks
based on the knowledge of the location frequency. In the Frequent Location at-
tack the adversary knows a number of frequent locations visited by an individual,
while in the Frequent Location Sequence attack the adversary knows a subset of
the locations visited by an individual and the relative ordering with respect to
the frequencies (from most frequent to least frequent). The Frequent Location
attack is similar to the Location attack with the difference that in frequency vec-
tors a location can appear only once. The Frequent Location Sequence attack is
similar to the Location Sequence attack, with two differences: first, a location
can appear only once in the vector; second, locations in a frequency vector are
ordered by descending frequency and not by time. Thus the location sequence $X_s$ of length $k$ cannot contain repetitions of locations.
Frequency Attack We introduce an attack where the adversary knows the lo-
cations visited by the individual, their reciprocal ordering of frequency, and the
minimum number of visits of the individual. This means that, when searching for
specific subsequences, the adversary must also consider subsequences containing the known locations with a greater frequency.
Home And Work Attack In the Home and Work attack the adversary knows
the two most frequent locations of an individual and their frequencies. It essen-
tially assumes the same background knowledge as the Frequency attack, but related only to two locations. This is the only attack where the set of background knowl-
edge is fixed and composed of just a single 2-combination for each individual.
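To make the differences between the attacks concrete, the sketch below (illustrative Python under assumed matching semantics, not the authors' exact implementation) shows matching predicates for three of them: the Location Sequence attack requires the known locations to appear in the same temporal order, the Visit attack additionally requires matching timestamps, and the Frequent Location attack simply checks membership in the frequency vector, where each location occurs at most once.

```python
def match_location_sequence(trajectory, b):
    """Location Sequence attack: the known locations b must appear in the
    trajectory in the same temporal order (as a non-contiguous subsequence)."""
    remaining = iter(loc for loc, _ in trajectory)
    return all(loc in remaining for loc in b)

def match_visit(trajectory, b):
    """Visit attack: every known (location, timestamp) pair must occur in the trajectory."""
    visits = set(trajectory)
    return all(pair in visits for pair in b)

def match_frequent_location(frequency_vector, b):
    """Frequent Location attack: every known location must appear in the
    individual's frequency vector (each location occurs there at most once)."""
    locations = {loc for loc, _ in frequency_vector}
    return all(loc in locations for loc in b)
```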
4.3 Construction of training dataset
Given an attack $i$ based on a specific set of background knowledge $B^i_j$, the regression training dataset $TR^i_j$ can be constructed by the following two-step procedure: first, given a mobility dataset $D$, for every individual $u$ we compute the set of individual mobility features described in Section 4.1 based on her mobility data. Every individual $u$ is hence described by a mobility feature vector $m_u$. All the individuals' mobility feature vectors compose the mobility matrix $F = (m_1, \ldots, m_n)$, where $n$ is the number of individuals in $D$; second, for every individual we simulate the attack with $B^i_j$ on $D$, in order to compute a privacy risk value for every individual. We obtain a privacy risk vector $R^i_j = (r_1, \ldots, r_n)$. The regression training set is hence $TR^i_j = (F, R^i_j)$.

Every regression dataset $TR^i_j$ is used to train a predictive model $M^i_j$. If $i$ ranges over the $I$ different kinds of attack and $j$ over the $J$ different sets of possible background knowledge, we have a total of $J \times I$ models. For example, if we consider sets of background knowledge ranging in size from $j = 1$ to $j = 5$ for 7 different attacks, we would have $I = 7$ and $J = 5$. The predictive model will be used by the Data Provider to immediately estimate the privacy risk value of previously unseen individuals, whose data were not used in the learning process, with respect to attack $i$, set of background knowledge $B^i_j$ and dataset $D$.
Example 1 (Construction of regression training set). Let us consider a mobility dataset of trajectories $D = \{T_{u_1}, T_{u_2}, T_{u_3}, T_{u_4}, T_{u_5}\}$ corresponding to five individuals $u_1, u_2, u_3, u_4$ and $u_5$. Given an attack $i$, a set of background knowledge $B^i_j$ and dataset $D$, we construct the regression training set $TR^i_j$ as follows: first, for every individual $u_i$ we compute the 16 individual mobility measures based on her trajectory $T_{u_i}$. Every individual $u_i$ is hence described by a mobility feature vector of length 16, $m_{u_i} = (m^{(u_i)}_1, \ldots, m^{(u_i)}_{16})$. All the mobility feature vectors compose the mobility matrix $F = (m_{u_1}, m_{u_2}, m_{u_3}, m_{u_4}, m_{u_5})$; second, we simulate the attack with $B^i_j$ on dataset $D$ and obtain a vector of five privacy risk values $R^i_j = (r_{u_1}, r_{u_2}, r_{u_3}, r_{u_4}, r_{u_5})$, one for each individual.
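A minimal sketch of the training-set construction and of the regression step follows, using scikit-learn's RandomForestRegressor as in the experiments of Section 5. The helpers feature_vector (the 16 mobility features of Section 4.1 as a numeric vector), attack_risk (the attack simulation of Section 3) and the dataset dictionary are hypothetical placeholders, and the split into training and test individuals is only one possible evaluation setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def build_training_set(dataset, attack_risk, k):
    """Build TR = (F, R): the mobility matrix F and the privacy risk vector R
    obtained by simulating the attack with background knowledge of size k."""
    individuals = sorted(dataset)
    F = np.array([feature_vector(dataset[u]) for u in individuals])   # mobility matrix
    R = np.array([attack_risk(dataset, u, k) for u in individuals])   # privacy risk vector
    return F, R

# Train the regressor M^i_j on (F, R^i_j) and evaluate it on held-out individuals
F, R = build_training_set(dataset, attack_risk=privacy_risk, k=3)
F_train, F_test, R_train, R_test = train_test_split(F, R, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(F_train, R_train)
predictions = model.predict(F_test)
print("mse:", mean_squared_error(R_test, predictions))
print("R2:", r2_score(R_test, predictions))
```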
4.4 Usage of the regression approach
The Data Provider can use a regression model $M^i_j$ to determine the value of privacy risk with respect to an attack $i$ and a set of background knowledge $B^i_j$ for: (i) previously unseen individuals, whose data were not used in the learning process; (ii) a selection of individuals in the database already used in the learning process. It is worth noting that with existing methods the privacy risk of individuals in scenario (ii) must be recomputed by simulating attack $i$ from scratch. In contrast, the usage of regression model $M^i_j$ allows for obtaining the privacy risk of the selected individuals immediately. The computation of the mobility measures and the regression of privacy risk can be done in polynomial
time as a one-off procedure. To clarify this point, let us consider the following
scenario. A Data Analyst asks the Data Provider for updated mobility data about a new set of individuals with the purpose of studying their characteristic traveled distance (radius of gyration $r_g$) and the predictability of their move-
ments (mobility entropy E). Since both measures can be computed by using a
frequency vector, the Data Provider can release just the frequency vectors of the
individuals requested. Before that, however, the Data Provider wants to deter-
mine the level of privacy risk of the individuals with respect to the Frequency
attack ($F$) and several sets of background knowledge $B^F_j$. The Data Provider uses the regression model $M^F_j$, previously trained, to obtain the privacy risk of the individuals. So the Data Provider computes the mobility features for the individuals in the dataset and gives them as input to the regression model, obtaining an estimation of privacy risk. On the basis of the privacy risks obtained from $M^F_j$, the
Data Provider can identify risky individuals, i.e., individuals with a high privacy
risk. She then can decide to either filter out the risky individuals or to select
suitable privacy-preserving techniques (e.g., k-anonymity or differential privacy)
and transform their mobility data in such a way that their privacy is preserved.
In the next section we present an evaluation of our methodology on real-world
mobility data and show the effectiveness of the proposed regression approach.
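A short sketch of this usage step (continuing the previous one; the trained model, the feature_vector helper, the new_individuals_data dictionary and the risk threshold are illustrative assumptions):

```python
import numpy as np

def identify_risky_individuals(model, dataset, risk_threshold=0.5):
    """Estimate the privacy risk of each individual with the trained regressor and
    return those whose predicted risk exceeds the tolerated threshold."""
    individuals = sorted(dataset)
    features = np.array([feature_vector(dataset[u]) for u in individuals])
    predicted_risk = model.predict(features)
    return [u for u, r in zip(individuals, predicted_risk) if r > risk_threshold]

# Individuals flagged here can be filtered out or transformed with a suitable
# privacy-preserving technique (e.g., k-anonymity, differential privacy)
# before the data are delivered to the Data Analyst.
risky = identify_risky_individuals(model, new_individuals_data, risk_threshold=0.5)
```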
5 Experiments
For all the attacks defined, except the Home and Work attack, we consider four sets of background knowledge $B_k$ with $k = 2, 3, 4, 5$, where each $B_k$ corresponds to an attack where the adversary knows $k$ locations visited by the individual. For the Home and Work attack we have just one possible set of background knowledge, where the adversary knows the two most frequent locations of an individual. We use a dataset provided by Octo Telematics^3 storing the GPS tracks of 9,715 private vehicles traveling in Florence, a very populous area of central Italy, from 1st May to 31st May 2011, corresponding to 179,318 trajectories. We assign each origin and destination point of the original raw trajectories to the corresponding census cell, according to the information provided by the Italian National Statistics Bureau, so that every origin and destination point is mapped to a location (9). This allows us to describe the mobility of every vehicle in terms of a trajectory, in compliance with the definition introduced in Section 2. We first performed a simulation of the attacks, computing the privacy risk values for all individuals in the dataset and for all $B_k$.^4 We then performed regression experiments using a Random Forest regressor.^5 Table 1 shows the average Mean Squared Error ($mse$) and the average coefficient of determination $R^2$ resulting from the regression experiments for all the attacks. The results are averaged over $k = 2, 3, 4, 5$, since the empirical distributions of privacy risk are fairly similar across different values of $k$. Also, $mse$ and $R^2$ are almost identical for each kind of attack across the different values of $k$. The best results are obtained for the Frequent Location Sequence attack, with $mse$ = 0.01 and $R^2$ = 0.92, while the weakest results are obtained for the Home and Work attack, with $mse$ = 0.07 and $R^2$ = 0.50. Overall, the results show good predictive performance across all attacks, suggesting that regression could indeed be an accurate alternative to the direct computation of privacy risk.
Execution Times We show the computational improvement of our approach in
terms of execution time by comparing in Table 2 the execution times of the attack
simulations and the execution times of the regression tasks.6The execution time
of a single regression task is the sum of three subtasks: (i) the execution time
of training the regressor on the training set; (ii) the execution time of using the
trained regressor to predict the risk on the test set; (iii) the execution time of
evaluating the performance of regression. Table 2 shows that the execution time
of attack simulations is low for most of the attacks except for Location Sequence
and Location, for which execution times are huge: more than 1 week each. In
contrast the regression tasks have constant execution times of around 22s. In
summary, our approach can compute the risk levels for all the 33 attacks in 179
seconds (less than 3 minutes), while the attack simulations require more than
two weeks of computation.
^3 https://www.octotelematics.com/
^4 The Python code for attack simulation is available here: https://github.com/pellungrobe/privacy-mobility-lib
^5 We use the Python package scikit-learn to perform the regression experiments.
^6 For a given type of attack we report the sum of the execution times of the attacks for configurations k = 2, 3, 4, 5. We perform the experiments on Ubuntu 16.04.1 LTS 64 bit, 32 GB RAM, 3.30GHz Intel Core i7.
predicted variable | mse | R^2
Frequent Location Sequence | 0.01 | 0.92
Visit | 0.01 | 0.89
Frequency | 0.02 | 0.88
Location | 0.02 | 0.90
Location Sequence | 0.02 | 0.84
Frequent Location | 0.03 | 0.73
Home and Work | 0.07 | 0.50
Table 1: Results of regression experiments.
variable | simulation ($\sum_{k=2}^{5}$) | regression
Home and Work | 149s (2.5m) | 7s
Frequency | 645s (10m) | 22s
Frequent Location Sequence | 846s (14m) | 22s
Frequent Location | 997s (10m) | 22s
Visit | 2,274s (38m) | 16s
Location Sequence | >168h (1 week) | 22s
Location | >168h (1 week) | 22s
total | >2 weeks | 172s
Table 2: Execution times of attack simulations and regression tasks.
Discussion The preliminary work presented above shows some promising re-
sults. The coefficient of determination and the execution times suggest that
the regression can be a valid and fast alternative to existing privacy risk assess-
ment tools. Instead of re-computing privacy risks when new data records become
available, which would result in high computational costs, a Data Provider can
effectively use the regressors to obtain immediate and reliable estimates for every
individual. The mobility measures can be computed in linear time in the size of
the dataset. Every time new mobility data of an individual become available,
the Data Provider can recompute her mobility features. To take into account
long-term changes in mobility patterns the recomputation of mobility measures
can be done at regular time intervals (e.g., every month) by considering a time
window with the most recent data (e.g., the last six months of data).
6 Related Work
Human mobility data contain personal sensitive information and can reveal many facets of the private life of individuals, leading to potential privacy violations. To overcome the possibility of privacy leaks, many techniques have been proposed in the literature. A widely used privacy-preserving model is $k$-anonymity (14), which requires that an individual should not be identifiable within a group of size smaller than $k$ based on their quasi-identifiers (QIDs), i.e., a set of attributes that can be used to uniquely identify individuals. Assuming that adver-
saries own disjoint parts of a trajectory, (18) reduces privacy risk by relying on
the suppression of the dangerous observations from each individual’s trajectory.
In (21), authors propose the attack-graphs method to defend against attacks,
based on k-anonymity. Other works are based on the differential privacy model
(6). (8) considers a privacy-preserving distributed aggregation framework for
movement data. (4) proposes to publish a contingency table of trajectory data,
where each cell contains the number of individuals commuting from a source
to a destination. (26) defines several similarity metrics which can be combined
in a unified framework to provide de-anonymization of mobility data and social
network data. One of the most important works about privacy risk assessment is
the Linddun methodology (5), a privacy-aware framework, useful for modeling
privacy threats in software-based systems. In the last years, different techniques
for risk management have been proposed, such as NIST’s Special Publication
800-30 (17) and SEI’s OCTAVE (2). Unfortunately, many of these works do
not consider privacy risk assessment and simply include privacy considerations
when assessing the impact of threats. In (19), the authors elaborate an entropy-based method to evaluate the disclosure risk of personal data, trying to quantitatively manage privacy risks. The unicity measure proposed in (16) evaluates the
privacy risk as the number of records/trajectories which are uniquely identified.
(25) proposes an empirical risk framework for improving privacy risk estimation
for mobility data, evaluating their model using k-anonymized data. (3) pro-
poses a risk-aware framework for information disclosure which supports runtime
risk assessment, using adaptive anonymization as risk-mitigation method. Un-
fortunately, this framework only works on relational datasets since it needs to
discriminate between quasi-identifiers and sensitive attributes. In this paper we
use the privacy risk assessment framework introduced by (12) (Section 3) to
calculate the privacy risks of each individual in a mobility dataset.
7 Conclusion
Human mobility data are a precious proxy to improve our understanding of hu-
man dynamics, as well as to improve urban planning, transportation engineering
and epidemic modeling. Nevertheless human mobility data contain sensitive in-
formation which can lead to a serious violation of the privacy of the individuals
involved. In this paper we explored a fast and flexible solution for estimating
the privacy risk in human mobility data, which overcomes the computational
issues of existing privacy risk assessment frameworks. We showed through ex-
perimentations that our approach can achieve good estimations of privacy risks.
As future work, it would be necessary to test our approach more extensively on
different, more numerous datasets. It would be also interesting to evaluate the
importance of mobility features with respect to the prediction of risk. Indeed, if
every individual can obtain an estimate of her own privacy risk based just on her
mobility data, this increases the awareness about personal data and helps her in
deciding whether or not to share the mobility data with third parties. Another
possible extension of our method would be to apply more refined data mining
techniques to assess the privacy risk of individuals. Moreover, our approach pro-
vides a fast tool to immediately obtain the privacy risks of individuals, leaving to
the Data Provider the choice of the most suitable privacy preserving techniques
to manage and mitigate the privacy risks of individuals. It would be interesting
to perform an extensive experimentation to select the best techniques to reduce the privacy risk of individuals in mobility datasets while at the same time ensuring high data quality for analytical services.
Acknowledgment
Funded by the European project SoBigData (Grant Agreement 654024).
References
1. O. Abul, F. Bonchi, and M. Nanni. Never Walk Alone: Uncertainty for Anonymity
in Moving Objects Databases. In ICDE 2008. 376–385.
2. C. Alberts, S. Behrens, R. Pethia, and W. Wilson. 1999. Operationally Critical
Threat, Asset, and Vulnerability Evaluation (OCTAVE) Framework, Version 1.0.
CMU/SEI-99-TR-017. Software Engineering Institute, Carnegie Mellon University.
http://resources.sei.cmu.edu/library/asset-view.cfm?AssetID=13473
3. A. Armando, M. Bezzi, N. Metoui, and A. Sabetta. Risk-Based Privacy-Aware
Information Disclosure. Int. J. Secur. Softw. Eng. 6, 2 (April 2015), 70–89.
4. G. Cormode, C. M. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially
private summaries for sparse data. In ICDT ’12. 299–311.
5. M. Deng, K. Wuyts, R. Scandariato, B. Preneel, and W. Joosen. A privacy threat
analysis framework: supporting the elicitation and fulfillment of privacy requirements.
Requir. Eng. 16, 1 (2011), 3–32.
6. C. Dwork, F. McSherry, K. Nissim, and A. Smith. 2006. Calibrating Noise to
Sensitivity in Private Data Analysis. In TCC ’06. 265–284.
7. N. Eagle and A. S. Pentland. Eigenbehaviors: identifying structure in routine. Be-
havioral Ecology and Sociobiology 63, 7 (2009), 1057–1066.
8. A. Monreale, W. H. Wang, F. Pratesi, S. Rinzivillo, D. Pedreschi, G. Andrienko,
and N. Andrienko. 2013. Privacy-Preserving Distributed Movement Data Aggregation.
Springer International Publishing, 225–245.
9. L. Pappalardo, F. Simini, S. Rinzivillo, D. Pedreschi, F. Giannotti, and A.-L.
Barabasi. Returners and explorers dichotomy in human mobility. Nature Com-
munications 6 (2015).
10. L. Pappalardo and F. Simini. Modelling spatio-temporal routines in human mo-
bility. CoRR abs/1607.05952 (2016).
11. L. Pappalardo, M. Vanhoof, L. Gabrielli, Z. Smoreda, D. Pedreschi, and F. Gi-
annotti. An analytical framework to nowcast well-being using mobile phone data.
International Journal of Data Science and Analytics 2, 1 (2016), 75–92.
12. F. Pratesi, A. Monreale, R. Trasarti, F. Giannotti, D. Pedreschi, and T. Yanag-
ihara. PRISQUIT: a System for Assessing Privacy Risk versus Quality in Data
Sharing. Technical Report 2016-TR-043. ISTI - CNR, Pisa, Italy. FriNov20162291.
13. I. S. Rubinstein. Big Data: The End of Privacy or a New Beginning? International
Data Privacy Law (2013).
14. P. Samarati and L. Sweeney. 1998a. Generalizing Data to Provide Anonymity
when Disclosing Information (Abstract). In PODS. 188.
15. C. Song, T. Koren, P. Wang, and A.-L. Barabasi. Modelling the scaling properties
of human mobility. Nat Phys 6, 10 (2010), 818–823.
16. Y. Song, D. Dahlmeier, and S. Bressan. Not So Unique in the Crowd: a Simple and
Effective Algorithm for Anonymizing Location Data. In PIR@SIGIR 2014. 19–24.
17. G. Stoneburner, A. Goguen, and A. Feringa. 2002. Risk Management Guide for
Information Technology Systems: Recommendations of the National Institute of Stan-
dards and Technology. NIST special publication, Vol. 800.
18. M. Terrovitis and N. Mamoulis. 2008. Privacy Preservation in the Publication of
Trajectories. In MDM. 65–72.
19. S. Trabelsi, V. Salzgeber, M. Bezzi, and G. Montagnon. 2009. Data disclosure risk
evaluation. In CRiSIS ’09. 35–72.
20. N. E. Williams, T. A. Thomas, M. Dunbar, N. Eagle, and A. Dobra. Measures of
Human Mobility Using Mobile Phone Records Enhanced with GIS Data. PLoS ONE
10, 7 (2015), 1–16.
21. R. Yarovoy, F. Bonchi, L. V. S. Lakshmanan, and W. H. Wang. 2009. Anonymizing
moving objects: how to hide a MOB in a crowd?. In EDBT. 72–83.
22. N. Mohammed, B. C.M. Fung, and M. Debbabi. Walking in the Crowd: Anonymiz-
ing Trajectory Data for Pattern Analysis. In CIKM 2009. 1441–1444.
23. H. Zang and J. Bolot. Anonymization of Location Data Does Not Work: A Large-
scale Measurement Study. In MobiCom 2011. 145–156.
24. J. Unnikrishnan and F. M. Naini. De-anonymizing private data by matching statis-
tics. In Allerton 2013. 1616–1623.
25. A. Basu, A. Monreale, J. C. Corena, F. Giannotti, D. Pedreschi, S. Kiyomoto, Y.
Miyake, T. Yanagihara, and R. Trasarti. 2014. A Privacy Risk Model for Trajectory
Data. In Trust Management VIII, pp 125–140.
26. S. Ji, Weiqing Li, M. Srivatsa, J. S. He, and R. Beyah. 2014. Structure Based Data
De-Anonymization of Social Networks and Mobility Traces. 237–254.
Recent research has illustrated privacy breaches that can be effected on an anonymized dataset by an attacker who has access to auxiliary information about the users. Most of these attack strategies rely on the uniqueness of specific aspects of the users' data - e.g., observing a mobile user at just a few points on the time-location space are sufficient to uniquely identify him/her from an anonymized set of users. In this work, we consider de-anonymization attacks on anonymized summary statistics in the form of histograms. Such summary statistics are useful for many applications that do not need knowledge about exact user behavior. We consider an attacker who has access to an anonymized set of histograms of K users' data and an independent set of data belonging to the same users. Modeling the users' data as i.i.d., we study the composite hypothesis testing problem of identifying the correct matching between the anonymized histograms from the first set and the user data from the second. We propose a Generalized Likelihood Ratio Test as a solution to this problem and show that the solution can be identified using a minimum weight matching algorithm on an K × K complete bipartite weighted graph. We show that a variant of this solution is asymptotically optimal as the data lengths are increased.We apply the algorithm on mobility traces of over 1000 users on EPFL campus collected during two weeks and show that up to 70% of the users can be correctly matched. These results show that anonymized summary statistics of mobility traces themselves contain a significant amount of information that can be used to uniquely identify users by an attacker who has access to auxiliary information about the statistics.