Conference Paper

Practical Privacy Preservation in a Mobile Cloud Environment

... This computing model has additional advantages including lower operational and deployment costs due to its unique pricing policy (based on a pay-as-you-use model) where users do not explicitly provision or configure virtual machines (VMs) or containers but are charged only for the resources consumed by the application functions during execution [6], [7]. The serverless computing model has been successfully adopted in a wide range of application domains, including processing event streams and next-generation web services and applications [8], [9]. All major commercial cloud service providers now offer serverless computing platforms, including AWS Lambda (https://aws.amazon.com/lambda/), Google Cloud Functions (https://cloud.google.com/functions) ...
Preprint
In recent years, Serverless Computing has emerged as a compelling cloud-based model for the development of a wide range of data-intensive applications. However, rapid container provisioning introduces non-trivial challenges for FaaS cloud providers, as (i) real-world FaaS workloads may exhibit highly dynamic request patterns, (ii) applications have service-level objectives (SLOs) that must be met, and (iii) container provisioning can be a costly process. In this paper, we present SLOPE, a prediction framework for serverless FaaS platforms to address the aforementioned challenges. Specifically, it trains a neural network model that utilizes knowledge from past runs in order to estimate the number of instances required to satisfy the invocation rate requirements of the serverless applications. In cases where a priori knowledge is not available, SLOPE makes predictions using a graph edit distance approach to capture the similarities among serverless applications. Our experimental results illustrate the efficiency and benefits of our approach, which can reduce the operating costs by 66.25% on average.
... In the IoT domain, serverless computing enables seamless communication and real-time analytics for sensor data. Furthermore, it is widely adopted for developing chatbots [7], voice assistants, event-driven applications, and real-time applications [8] [9]. Additionally, serverless architectures are utilized in image and video processing [10], microservices and APIs, DevOps automation, ecommerce, and online retail [11]. ...
Preprint
Serverless computing, also referred to as Function-as-a-Service (FaaS), is a cloud computing model that has attracted significant attention and has been widely adopted in recent years. The serverless computing model offers an intuitive, event-based interface that makes the development and deployment of scalable cloud-based applications easier and more cost-effective. An important aspect that has not been examined in these systems is their energy consumption during application execution. One way to deal with this issue is to schedule the function invocations in an energy-efficient way. However, efficient scheduling of applications in a multi-tenant environment, like FaaS systems, poses significant challenges. The trade-off between the server's energy usage and the hosted functions' performance requirements needs to be taken into consideration. In this work, we propose an Energy Efficient Scheduler for orchestrating the execution of serverless functions so that it minimizes energy consumption while satisfying the applications' performance demands. Our approach considers real-time performance measurements and historical data and applies a novel DVFS technique to minimize energy consumption. Our detailed experimental evaluation using realistic workloads on our local cluster illustrates the working and benefits of our approach.
... umes of patient records. Patient data consists of sensitive health-related records whereby the confidentiality and privacy of the patient data are of utmost importance [7] [8] [9]. SE enables secure searching on the encrypted patient data [10] [11]. ...
Article
Full-text available
Serverless computing has seen rapid growth, thanks to its adaptability, elasticity, and deployment agility, and has been embraced by both cloud providers and users. However, this surge in serverless adoption has prompted a reevaluation of security concerns, and searchable encryption has thus emerged as a crucial technology. This paper explores Searchable Encryption as a Service (SEaaS) and introduces an innovative privacy-preserving Multiple Keyword Searchable Encryption (MKSE) scheme within a serverless cloud environment, addressing previously unmet security goals. The proposed scheme employs probabilistic encryption and leverages fully homomorphic encryption to enable operations on ciphertext, facilitating searches on encrypted data. Its core innovation lies in the use of probabilistic encryption for private multi-keyword searches. To validate its practicality, we deploy the scheme on the public cloud infrastructure, "Contabo," and conduct rigorous testing on a real-world dataset. The results demonstrate that our novel scheme successfully preserves the privacy of search queries and access patterns, achieving robust security. This research contributes to the field of serverless cloud security, particularly in the context of searchable encryption, by providing a refined solution for safeguarding data while maintaining usability in a serverless computing landscape.
Conference Paper
Full-text available
The proliferation of the Internet of Things (IoT) and the success of resource-rich cloud services have pushed the data processing horizon towards the edge of the network. This has the potential to address bandwidth costs as well as latency, availability and data privacy concerns. Serverless computing, a cloud computing model for stateless and event-driven applications, promises to further improve Quality of Service (QoS) by eliminating the burden of always-on infrastructure through ephemeral containers. Open source serverless frameworks have been introduced to avoid the vendor lock-in and computation restrictions of public cloud platforms and to bring the power of serverless computing to on-premises deployments. In an IoT environment, these frameworks can leverage the computational capabilities of devices in the local network to further improve QoS of applications delivered to the user. However, these frameworks have not been evaluated in a resource-constrained, edge computing environment. In this work we evaluate four open source serverless frameworks, namely Kubeless, Apache OpenWhisk, OpenFaaS, and Knative. Each framework is installed on a bare-metal, single master, Kubernetes cluster. We use the JMeter framework to evaluate the response time, throughput and success rate of functions deployed using these frameworks under different workloads. The evaluation results are presented and open research opportunities are discussed.
Conference Paper
Full-text available
Sequential pattern mining is a well-studied data mining task with wide applications. However, fine-tuning the minsup parameter of sequential pattern mining algorithms to generate enough patterns is difficult and time-consuming. To address this issue, the task of top-k sequential pattern mining has been defined, where k is the number of sequential patterns to be found, and is set by the user. In this paper, we present an efficient algorithm for this problem named TKS (Top-K Sequential pattern mining). TKS utilizes a vertical bitmap database representation, a novel data structure named PMAP (Precedence Map) and several efficient strategies to prune the search space. An extensive experimental study on real datasets shows that TKS outperforms TSP, the current state-of-the-art algorithm for top-k sequential pattern mining by more than an order of magnitude in execution time and memory.
Article
Full-text available
With the recent surge of location based social networks (LBSNs), activity data of millions of users has become attainable. This data contains not only spatial and temporal stamps of user activity, but also its semantic information. LBSNs can help to understand mobile users' spatial temporal activity preference (STAP), which can enable a wide range of ubiquitous applications, such as personalized context-aware location recommendation and group-oriented advertisement. However, modeling such user-specific STAP needs to tackle high-dimensional data, i.e., user-location-time-activity quadruples, which is complicated and usually suffers from a data sparsity problem. In order to address this problem, we propose a STAP model. It first models the spatial and temporal activity preference separately, and then uses a principled way to combine them for preference inference. In order to characterize the impact of spatial features on user activity preference, we propose the notion of personal functional region and related parameters to model and infer user spatial activity preference. In order to model the user temporal activity preference with sparse user activity data in LBSNs, we propose to exploit the temporal activity similarity among different users and apply nonnegative tensor factorization to collaboratively infer temporal activity preference. Finally, we put forward a context-aware fusion framework to combine the spatial and temporal activity preference models for preference inference. We evaluate our proposed approach on three real-world datasets collected from New York and Tokyo, and show that our STAP model consistently outperforms the baseline approaches in various settings.
Conference Paper
Full-text available
String similarity is most often measured by weighted or unweighted edit distance d(x, y). Ristad and Yianilos (1998) defined stochastic edit distance - a probability distribution p(y | x) whose parameters can be trained from data. We generalize this so that the probability of choosing each edit operation can depend on contextual features. We show how to construct and train a probabilistic finite-state transducer that computes our stochastic contextual edit distance. To illustrate the improvement from conditioning on context, we model typos found in social media text.
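For orientation, the unweighted edit distance d(x, y) that the abstract above takes as its starting point can be sketched as follows. This is the classic Levenshtein dynamic program, not the stochastic contextual model the paper proposes; the function name `edit_distance` is an illustrative assumption.

```python
def edit_distance(x: str, y: str) -> int:
    """Unweighted (Levenshtein) edit distance d(x, y).

    Illustrative baseline only: every insertion, deletion and
    substitution costs 1, with no learned or contextual weights.
    """
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, cy in enumerate(y, start=1):
            substitution = prev[j - 1] + (cx != cy)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

A stochastic variant would replace the unit costs with probabilities of each edit operation; a contextual variant would additionally let those probabilities depend on neighbouring characters.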
Conference Paper
Full-text available
Many scientists perform extensive computations by executing large bags of similar tasks (BoTs) in mixtures of computational environments, such as grids and clouds. Although the reliability and cost may vary considerably across these environments, no tool exists to assist scientists in the selection of environments that can both fulfill deadlines and fit budgets. To address this situation, we introduce the Expert BoT scheduling framework. Our framework systematically selects from a large search space the Pareto-efficient scheduling strategies, that is, the strategies that deliver the best results for both makespan and cost. Expert chooses from them the best strategy according to a general, user-specified utility function. Through simulations and experiments in real production environments, we demonstrate that Expert can substantially reduce both makespan and cost in comparison to common scheduling strategies. For bioinformatics BoTs executed in a real mixed grid + cloud environment, we show how the scheduling strategy selected by Expert reduces both makespan and cost by 30%-70%, in comparison to commonly-used scheduling strategies.
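The two-step selection described above (keep the Pareto-efficient strategies, then apply a user-specified utility function) can be sketched as below. This is a minimal illustration and not the Expert framework itself; the `Strategy` type, the function names and the sample numbers are all hypothetical.

```python
from typing import Callable, List, NamedTuple


class Strategy(NamedTuple):
    """Hypothetical scheduling strategy with its two objectives."""
    name: str
    makespan: float  # lower is better
    cost: float      # lower is better


def dominates(a: Strategy, b: Strategy) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    return (a.makespan <= b.makespan and a.cost <= b.cost
            and (a.makespan < b.makespan or a.cost < b.cost))


def pareto_front(strategies: List[Strategy]) -> List[Strategy]:
    """Keep only the strategies that no other strategy dominates."""
    return [s for s in strategies
            if not any(dominates(other, s) for other in strategies)]


def pick_best(strategies: List[Strategy],
              utility: Callable[[Strategy], float]) -> Strategy:
    """Choose the Pareto-efficient strategy with the highest utility."""
    return max(pareto_front(strategies), key=utility)


# Made-up example values, for illustration only.
candidates = [
    Strategy("A", makespan=10.0, cost=5.0),
    Strategy("B", makespan=8.0, cost=7.0),
    Strategy("C", makespan=12.0, cost=6.0),  # dominated by A
    Strategy("D", makespan=8.0, cost=9.0),   # dominated by B
]
fastest = pick_best(candidates, utility=lambda s: -s.makespan)
```

Filtering before applying the utility keeps the choice independent of how the utility weighs the two objectives: any monotone utility will pick a member of the Pareto front.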
Conference Paper
Full-text available
Advances in sensing and tracking technology enable location-based applications but they also create significant privacy risks. Anonymity can provide a high degree of privacy, save service users from dealing with service providers' privacy policies, and reduce the service providers' requirements for safeguarding private information. However, guaranteeing anonymous usage of location-based services requires that the precise location information transmitted by a user cannot be easily used to re-identify the subject. This paper presents a middleware architecture and algorithms that can be used by a centralized location broker service. The adaptive algorithms adjust the resolution of location information along spatial or temporal dimensions to meet specified anonymity constraints based on the entities who may be using location services within a given area. Using a model based on automotive traffic counts and cartographic material, we estimate the realistically expected spatial resolution for different anonymity constraints. The median resolution generated by our algorithms is 125 meters. Thus, anonymous location-based requests for urban areas would have the same accuracy currently needed for E-911 services; this would provide sufficient resolution for wayfinding, automated bus routing services and similar location-dependent services.
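One way to picture "adjusting the spatial resolution to meet an anonymity constraint" is a quadtree-style subdivision: keep shrinking the reported region around the requester while it still holds at least k users. The sketch below assumes that approach; the function names, the half-open box convention and the `min_size` cut-off are illustrative choices, not details taken from the abstract.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), half-open


def contains(box: Box, p: Point) -> bool:
    """True if p lies in the half-open box [x_min, x_max) x [y_min, y_max)."""
    x0, y0, x1, y1 = box
    return x0 <= p[0] < x1 and y0 <= p[1] < y1


def cloak(user: Point, others: List[Point], area: Box, k: int,
          min_size: float = 1.0) -> Optional[Box]:
    """Return the smallest quadrant around `user` that holds at least k users.

    The requester counts towards k. Returns None if even the whole
    area holds fewer than k users (assumed policy: refuse to answer).
    """
    if not contains(area, user):
        raise ValueError("user must lie inside the covered area")
    if sum(contains(area, p) for p in others) + 1 < k:
        return None
    box = area
    while True:
        x0, y0, x1, y1 = box
        if (x1 - x0) <= min_size or (y1 - y0) <= min_size:
            return box  # stop subdividing below the minimum resolution
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                     (x0, ym, xm, y1), (xm, ym, x1, y1)]
        child = next(q for q in quadrants if contains(q, user))
        population = sum(contains(child, p) for p in others) + 1
        if population < k:
            return box  # the child would violate the constraint
        box = child
```

The service then receives the returned box instead of the exact position, so a denser area yields a finer resolution for the same k.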
Conference Paper
Full-text available
Geography and social relationships are inextricably intertwined; the people we interact with on a daily basis almost always live near us. As people spend more time online, data regarding these two dimensions -- geography and social relationships -- are becoming increasingly precise, allowing us to build reliable models to describe their interaction. These models have important implications in the design of location-based services, security intrusion detection, and social media supporting local communities. Using user-supplied address data and the network of associations between members of the Facebook social network, we can directly observe and measure the relationship between geography and friendship. Using these measurements, we introduce an algorithm that predicts the location of an individual from a sparse set of located users with performance that exceeds IP-based geolocation. This algorithm is efficient and scalable, and could be run on a network containing hundreds of millions of users.
Conference Paper
Full-text available
It is a well-known fact that the progress of personal communication devices leads to serious concerns about privacy in general, and location privacy in particular. As a response to these issues, a number of Location-Privacy Protection Mechanisms (LPPMs) have been proposed during the last decade. However, their assessment and comparison remains problematic because of the absence of a systematic method to quantify them. In particular, the assumptions about the attacker’s model tend to be incomplete, with the risk of a possibly wrong estimation of the users’ location privacy. In this paper, we address these issues by providing a formal framework for the analysis of LPPMs; it captures, in particular, the prior information that might be available to the attacker, and various attacks that he can perform. The privacy of users and the success of the adversary in his location-inference attacks are two sides of the same coin. We revise location privacy by giving a simple, yet comprehensive, model to formulate all types of location-information disclosure attacks. Thus, by formalizing the adversary’s performance, we propose and justify the right metric to quantify location privacy. We clarify the difference between three aspects of the adversary’s inference attacks, namely their accuracy, certainty, and correctness. We show that correctness determines the privacy of users. In other words, the expected estimation error of the adversary is the metric of users’ location privacy. We rely on well-established statistical methods to formalize and implement the attacks in a tool: the Location-Privacy Meter that measures the location privacy of mobile users, given various LPPMs. In addition to evaluating some example LPPMs, by using our tool, we assess the appropriateness of some popular metrics for location privacy: entropy and k-anonymity. The results show a lack of satisfactory correlation between these two metrics and the success of the adversary in inferring the users’ actual locations.
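The distinction between certainty and correctness can be made concrete with a small sketch. It assumes the adversary's output is a discrete posterior over candidate locations, measures certainty as Shannon entropy, and measures correctness as the expected Euclidean distance to the true location; the function names and the choice of distance are assumptions for illustration, not the tool described above.

```python
import math
from typing import Dict, Tuple

Location = Tuple[float, float]


def expected_estimation_error(posterior: Dict[Location, float],
                              true_location: Location) -> float:
    """Expected distance between the adversary's guess and the truth.

    `posterior` maps each candidate location to the probability the
    adversary assigns to it (assumed to sum to 1).
    """
    return sum(p * math.dist(loc, true_location)
               for loc, p in posterior.items())


def entropy_bits(posterior: Dict[Location, float]) -> float:
    """Shannon entropy of the posterior: how uncertain the adversary is."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)


truth = (0.0, 0.0)
hedged = {(0.0, 0.0): 0.5, (3.0, 4.0): 0.5}  # uncertain, partly right
confident_wrong = {(3.0, 4.0): 1.0}          # fully certain, but wrong
```

Here `confident_wrong` has zero entropy yet a larger expected error than `hedged`, which illustrates why a low-entropy posterior alone does not imply low privacy.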
Article
Full-text available
The increasing trend of embedding positioning capabilities (for example, GPS) in mobile devices facilitates the widespread use of location-based services. For such applications to succeed, privacy and confidentiality are essential. Existing privacy-enhancing techniques rely on encryption to safeguard communication channels, and on pseudonyms to protect user identities. Nevertheless, the query contents may disclose the physical location of the user. In this paper, we present a framework for preventing location-based identity inference of users who issue spatial queries to location-based services. We propose transformations based on the well-established K-anonymity concept to compute exact answers for range and nearest neighbor search, without revealing the query source. Our methods optimize the entire process of anonymizing the requests and processing the transformed spatial queries. Extensive experimental studies suggest that the proposed techniques are applicable to real-life scenarios with numerous mobile users.
Article
Mobile sensor networks are a great source of data. By collecting data with mobile sensor nodes from individuals in a user community, e.g. using their smartphones, we can learn global information such as traffic congestion patterns in the city, location of key community facilities, and locations of gathering places. Can we publish and run queries on mobile sensor network databases without disclosing information about individual nodes? Differential privacy is a strong notion of privacy which guarantees that very little will be learned about individual records in the database, no matter what the attackers already know or wish to learn. Still, there is no practical system applying differential privacy algorithms for clustering points on real databases. This paper describes the construction of small coresets for computing k-means clustering of a set of points while preserving differential privacy. As a result, we give the first k-means clustering algorithm that is both differentially private, and has an approximation error that depends sub-linearly on the data's dimension d. Previous results introduced errors that are exponential in d. We implemented this algorithm and used it to create differentially private location data from GPS tracks. Specifically our algorithm allows clustering GPS databases generated from mobile nodes, while letting the user control the introduced noise due to privacy. We provide experimental results for the system and algorithms, and compare them to existing techniques. To the best of our knowledge, this is the first practical system that enables differentially private clustering on real data.
Conference Paper
Motivated by the increasing need to preserve privacy in digital devices, we introduce the on-device public-private model of computation. Our motivation comes from social-network based recommender systems in which the users want to receive recommendations based on the information available on their devices, as well as the suggestions of their social contacts, without sharing such information or contacts with the central recommendation system. Our model allows us to solve many algorithmic problems while providing absolute (deterministic) guarantees of the privacy of on-device data and the user's contacts. In fact, we ensure that the private data and private contacts are never revealed to the central system. Our restrictive model of computation presents several interesting algorithmic challenges because any computation based on private information and contacts must be performed on local devices of limited capabilities. Despite these challenges, under realistic assumptions of inter-device communication, we show several efficient algorithms for fundamental data mining and machine learning problems, ranging from k-means clustering to heavy hitters. We complement this analysis with strong impossibility results for efficient private algorithms without allowing inter-device communication. In our experimental evaluation, we show that our private algorithms provide results almost as accurate as those of the non-private ones while speeding up the on-device computations by orders of magnitude.
Conference Paper
As mobile devices and location-based services become increasingly ubiquitous, the privacy of mobile users' location traces continues to be a major concern. Traditional privacy solutions rely on perturbing each position in a user's trace and replacing it with a fake location. However, recent studies have shown that such point-based perturbation of locations is susceptible to inference attacks and suffers from serious utility losses, because it disregards the moving trajectory and continuity in full location traces. In this paper, we argue that privacy-preserving synthesis of complete location traces can be an effective solution to this problem. We present AdaTrace, a scalable location trace synthesizer with three novel features: provable statistical privacy, deterministic attack resilience, and strong utility preservation. AdaTrace builds a generative model from a given set of real traces through a four-phase synthesis process consisting of feature extraction, synopsis learning, privacy and utility preserving noise injection, and generation of differentially private synthetic location traces. The output traces crafted by AdaTrace preserve utility-critical information existing in real traces, and are robust against known location trace attacks. We validate the effectiveness of AdaTrace by comparing it with three state-of-the-art approaches (ngram, DPT, and SGLT) using real location trace datasets (Geolife and Taxi) as well as a simulated dataset of 50,000 vehicles in Oldenburg, Germany. AdaTrace offers up to 3-fold improvement in trajectory utility, and is orders of magnitude faster than previous work, while preserving differential privacy and attack resilience.
Conference Paper
The development of positioning technologies has resulted in an increasing amount of mobility data being available. While bringing a lot of convenience to people's lives, such availability also raises serious concerns about privacy. In this paper, we concentrate on one of the most sensitive kinds of information that can be inferred from mobility data, namely social relationships. We propose a novel social relation inference attack that relies on an advanced feature learning technique to automatically summarize users' mobility features. Compared to existing approaches, our attack is able to predict any two individuals' social relation, and it does not require the adversary to have any prior knowledge of existing social relations. These advantages significantly increase the applicability of our attack and the scope of the privacy assessment. Extensive experiments conducted on a large dataset demonstrate that our inference attack is effective, and achieves between 13% and 20% improvement over the best state-of-the-art scheme. We propose three defense mechanisms -- hiding, replacement and generalization -- and evaluate their effectiveness for mitigating the social link privacy risks stemming from mobility data sharing. Our experimental results show that both hiding and replacement mechanisms outperform generalization. Moreover, hiding and replacement achieve a comparable trade-off between utility and privacy, the former preserving better utility and the latter providing better privacy.
Conference Paper
Mobile sensor networks are a great source of data. By collecting data with mobile sensor nodes from individuals in a user community, e.g. using their smartphones, we can learn global information such as traffic congestion patterns in the city, location of key community facilities, and locations of gathering places. Can we publish and run queries on mobile sensor network databases without disclosing information about individual nodes? Differential privacy is a strong notion of privacy which guarantees that very little will be learned about individual records in the database, no matter what the attackers already know or wish to learn. Still, there is no practical system applying differential privacy algorithms for clustering points on real databases. This paper describes the construction of small coresets for computing k-means clustering of a set of points while preserving differential privacy. As a result, we give the first k-means clustering algorithm that is both differentially private, and has an approximation error that depends sub-linearly on the data's dimension d. Previous results introduced errors that are exponential in d. We implemented this algorithm and used it to create differentially private location data from GPS tracks. Specifically our algorithm allows clustering GPS databases generated from mobile nodes, while letting the user control the introduced noise due to privacy. We provide experimental results for the system and algorithms, and compare them to existing techniques. To the best of our knowledge, this is the first practical system that enables differentially private clustering on real data.
Conference Paper
This paper contributes to mobile crowdsourcing applications by developing a privacy preserving framework that enables users to contribute content to the community while controlling their privacy exposure. One fundamental challenge in such applications is how to preserve user privacy, as participants may end up revealing a great deal of user-identified, geo-located data, which can easily unfold user trajectories or sensitive locations (e.g., user's home or work location). In this paper we develop PROMPT, a highly efficient privacy preserving framework that runs locally on mobile devices. PROMPT relies on a novel geometric approximation approach to preserve user privacy, by evaluating the privacy exposure of users before sharing their geo-located data. Our detailed experimental evaluation using real-world datasets illustrates that our approach is effective, practical and has low overhead on smartphones.
Article
Along with the popularity of mobile social networks (MSNs) is the increasing danger of privacy breaches due to user location exposures. In this work, we take an initial step towards quantifying location privacy leakage from MSNs by matching the users' shared locations with their real mobility traces. We conduct a three-week real-world experiment with 30 participants and discover that both direct location sharing (e.g., Weibo or Renren) and indirect location sharing (e.g., Wechat or Skout) can reveal a small percentage of users' real points of interest (POIs). We further propose a novel attack to allow an external adversary to infer the demographics (e.g., age, gender, education) after observing users' exposed location profiles. We implement this attack on a large real-world dataset involving 22,843 mobile users. The experimental results show that the attacker can effectively predict demographic attributes about users with some shared locations. To resist such attacks, we propose SmartMask, a context-based system-level privacy protection solution, designed to automatically learn users' privacy preferences under different contexts and provide a transparent privacy control for MSN users. The effectiveness and efficiency of SmartMask have been well validated by extensive experiments.
Conference Paper
Concerns about location privacy frequently arise with the rapid development of GPS-enabled devices and location-based applications. While spatial transformation techniques such as location perturbation or generalization have been studied extensively, most techniques rely on syntactic privacy models without rigorous privacy guarantees. Many of them only consider static scenarios or perturb the location at single timestamps without considering temporal correlations of a moving user's locations, and hence are vulnerable to various inference attacks. While differential privacy has been accepted as a standard for privacy protection, applying differential privacy in location based applications presents new challenges, as the protection needs to be enforced on the fly for a single user and needs to incorporate temporal correlations between a user's locations. In this paper, we propose a systematic solution to preserve location privacy with a rigorous privacy guarantee. First, we propose a new definition, "δ-location set" based differential privacy, to account for the temporal correlations in location data. Second, we show that the well known l1-norm sensitivity fails to capture the geometric sensitivity in multidimensional space and propose a new notion, sensitivity hull, based on which the error of differential privacy is bounded. Third, to obtain the optimal utility we present a planar isotropic mechanism (PIM) for location perturbation, which is the first mechanism achieving the lower bound of differential privacy. Experiments on real-world datasets also demonstrate that PIM significantly outperforms baseline approaches in data utility.
Article
The increasing popularity of social media encourages more and more users to participate in various online activities and produces data at an unprecedented rate. Social media data is big, linked, noisy, highly unstructured and incomplete, and differs from data in traditional data mining, which cultivates a new research field - social media mining. Social theories from social sciences are helpful to explain social phenomena. The scale and properties of social media data are very different from those of the data that social sciences use to develop social theories. As a new type of social data, social media data raises a fundamental question - can we apply social theories to social media data? Recent advances in computer science provide necessary computational tools and techniques for us to verify social theories on large-scale social media data. Social theories have been applied to mining social media. In this article, we review some key social theories in mining social media, their verification approaches, interesting findings, and state-of-the-art algorithms. We also discuss some future directions in this active area of mining social media with social theories.
Conference Paper
The ubiquity of mobile devices and the popularity of location-based services have generated, for the first time, rich datasets of people's location information at a very high fidelity. These location datasets can be used to study people's behavior - for example, social studies have shown that people who are seen together frequently at the same place and at the same time are most probably socially related. In this paper, we are interested in inferring these social connections by analyzing people's location information, which is useful in a variety of application domains from sales and marketing to intelligence analysis. In particular, we propose an entropy-based model (EBM) that not only infers social connections but also estimates the strength of social connections by analyzing people's co-occurrences in space and time. We examine two independent ways, diversity and weighted frequency, through which co-occurrences contribute to social strength. In addition, we take the characteristics of each location into consideration in order to compensate for cases where only limited location information is available. We conducted extensive sets of experiments with real-world datasets including both people's location data and their social connections, where we used the latter as the ground-truth to verify the results of applying our approach to the former. We show that our approach outperforms the competitors.
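The "diversity" of co-occurrences mentioned above can be illustrated with an entropy-based measure: two people who meet at many different places score higher than two people who meet repeatedly at one place. The sketch below uses the exponential of the Shannon entropy as a simplifying assumption; the abstract does not specify the exact entropy form, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cooccurrence_diversity(locations: Sequence[Hashable]) -> float:
    """Effective number of distinct places where two users co-occurred.

    `locations` has one entry per observed co-occurrence. The result is
    exp(H), where H is the Shannon entropy of the location distribution
    (an illustrative choice, not necessarily the paper's definition).
    """
    counts = Counter(locations)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no co-occurrences observed
    entropy = -sum((c / total) * math.log(c / total)
                   for c in counts.values())
    return math.exp(entropy)


same_place = cooccurrence_diversity(["cafe"] * 4)
many_places = cooccurrence_diversity(["cafe", "gym", "park", "office"])
```

Both pairs have four co-occurrences, but the second pair's meetings are spread over four places, which is harder to explain by coincidence at a single popular venue.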
Article
The paradigm of coresets has recently emerged as a powerful tool for efficiently approximating various extent measures of a point set P. Using this paradigm, one quickly computes a small subset Q of P, called a coreset, that approximates the original set P and then solves the problem on Q using a relatively inefficient algorithm. The solution for Q is then translated to an approximate solution to the original point set P. This paper describes the ways in which this paradigm has been successfully applied to various optimization and extent measure problems.
Conference Paper
Even though human movement and mobility patterns have a high degree of freedom and variation, they also exhibit structural patterns due to geographic and social constraints. Using cell phone location data, as well as data from two online location-based social networks, we aim to understand what basic laws govern human motion and dynamics. We find that humans experience a combination of periodic movement that is geographically limited and seemingly random jumps correlated with their social networks. Short-ranged travel is periodic both spatially and temporally and not affected by the social network structure, while long-distance travel is more influenced by social network ties. We show that social relationships can explain about 10% to 30% of all human movement, while periodic behavior explains 50% to 70%. Based on our findings, we develop a model of human mobility that combines periodic short-range movements with travel due to the social network structure. We show that our model reliably predicts the locations and dynamics of future human movement and gives an order of magnitude better performance than present models of human mobility.
Conference Paper
A privacy-aware proximity detection service determines if two mobile users are close to each other without requiring them to disclose their exact locations. Existing proposals for such services provide weak privacy, give low accuracy guarantees, incur high communication costs, or lack flexibility in user preferences. We address these shortcomings with a client-server solution for proximity detection, based on encrypted, multi-level partitions of the spatial domain. Our service notifies a user if any friend users enter the user’s specified area of interest, called the vicinity region. This region, in contrast to related work, can be of any shape and can be flexibly changed on the fly. Encryption and blind evaluation on the server ensure strong privacy, while low communication costs are achieved by an adaptive location-update policy. Experimental results show that the flexible functionality of the proposed solution is provided with low communication cost.
Conference Paper
A ubiquitous computing environment provides comfortable conditions for anyone to access diverse networks without being concerned about time, place, or device. The Location-Based Service (LBS), for example, provides various convenient services using an individual's location information. However, on the flip side of this convenience, if a user's location information is exposed, it can lead to serious privacy problems. This paper proposes an anonymous communication model that uses an echo agent for LBS to guarantee privacy. Located between the user and the service provider, the echo agent sends both the real user's route and dummy routes to prevent user location information from being captured and traced by service providers. It maintains an effective data schema to collect, store, and use the users' past routes and dummy routes through a heuristic algorithm. Finally, it develops and applies the policies to manage them. Unlike existing methods, this model provides powerful anonymity because it generates dummy routes that are similar to the real routes of users. In addition, simulation results for dummy generation show that the proposed model reduces the probability of generating a wrong node to a tenth of that of the existing model. Thus, privacy of user location information can be protected via this method.
Conference Paper
We examine a very large-scale data set of more than 30 billion call records made by 25 million cell phone users across all 50 states of the US and attempt to determine to what extent anonymized location data can reveal private user information. Our approach is to infer, from the call records, the "top N" locations for each user and correlate this information with publicly available side information such as census data. For example, the measured "top 2" locations likely correspond to home and work locations, the "top 3" to home, work, and shopping/school/commute path locations. We consider the cases where those "top N" locations are measured with different levels of granularity, ranging from a cell sector to whole cell, zip code, city, county and state. We then compute the anonymity set, namely the number of users uniquely identified by a given set of "top N" locations at different granularity levels. We find that the "top 1" location does not typically yield small anonymity sets. However, the "top 2" and "top 3" locations do, certainly at the sector or cell-level granularity. We consider a variety of different factors that might impact the size of the anonymity set, for example the distance between the "top N" locations or the geographic environment (rural vs. urban). We also examine to what extent specific side information, in particular the size of the user's social network, decreases the anonymity set and therefore increases risks to privacy. Our study shows that sharing anonymized location data will likely lead to privacy risks and that, at a minimum, the data needs to be coarse in either the time domain (meaning the data is collected over short periods of time, in which case inferring the top N locations reliably is difficult) or the space domain (meaning the data granularity is strictly coarser than the cell level). In both cases, the utility of the anonymized location data will be decreased, potentially by a significant amount.
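The "top N" anonymity-set measurement described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the helper names `top_n` and `anonymity_set_sizes`, the toy `records`, the ordered treatment of the top-N tuple, and the alphabetical tie-breaking rule are all hypothetical.

```python
from collections import Counter


def top_n(visits, n):
    """Return a user's n most frequent locations as an ordered tuple.

    Ties are broken alphabetically so the result is deterministic
    (an assumption of this sketch).
    """
    counts = Counter(visits)
    ranked = sorted(counts, key=lambda loc: (-counts[loc], loc))
    return tuple(ranked[:n])


def anonymity_set_sizes(records, n):
    """Map each user to the number of users sharing the same top-n tuple."""
    keys = {user: top_n(visits, n) for user, visits in records.items()}
    group_sizes = Counter(keys.values())
    return {user: group_sizes[key] for user, key in keys.items()}


# Hypothetical records: user -> list of observed cell identifiers.
records = {
    "u1": ["A", "A", "A", "B", "B", "C"],
    "u2": ["A", "A", "B", "D"],
    "u3": ["A", "A", "A", "C", "C", "B"],
}

print(anonymity_set_sizes(records, 1))
print(anonymity_set_sizes(records, 2))
```

In this toy data all three users share the same "top 1" location, while adding the second location isolates `u3` in an anonymity set of size one, which mirrors the abstract's observation that "top 2" locations shrink anonymity sets far more than "top 1".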
Mining social media with social theories: a survey
  • J Tang
  • Y Chang
  • H Liu
Serverless boom or bust? An analysis of economic incentives
  • Lin
A utility-preserving and scalable technique for protecting location data with geo-indistinguishability
  • Ahuja