Conference Paper

Privacy Violations Using Microtargeted Ads: A Case Study

Authors: Aleksandra Korolova

Abstract

In this paper we propose a new class of attacks that exploit advertising systems offering micro-targeting capabilities in order to breach user privacy. We study the advertising system offered by the world's largest online social network, Facebook, and the risks that the design of the system poses to the privacy of its users. We propose, describe and provide experimental evidence of several novel approaches to exploiting the advertising system in order to obtain private user information. We communicated our findings to Facebook on July 13, 2010, and received a prompt response. On July 20, 2010, Facebook launched a change to their advertising system that made the kind of attacks we describe more difficult but not impossible to implement.
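To make the mechanism concrete, here is a minimal Python simulation of the inference-from-impressions idea the abstract describes: the attacker pairs targeting attributes believed to match only one user with each candidate value of an attribute that user keeps private, and whichever campaign reports impressions reveals the hidden value. The `ToyAdPlatform` class and its fields are illustrative stand-ins, not Facebook's actual advertising API.

```python
# Toy simulation of the inference-from-impressions attack; the platform object
# is a hypothetical stand-in, not a real advertising API.
class ToyAdPlatform:
    def __init__(self, users):
        self.users = users  # list of dicts with both public and private attributes

    def run_campaign(self, targeting):
        """Return the number of users whose profile matches every targeting key."""
        return sum(all(u.get(k) == v for k, v in targeting.items()) for u in self.users)

# One user keeps their age range private ("only me"); the other attributes are known to the attacker.
population = [
    {"city": "Palo Alto", "workplace": "Acme Corp", "interest": "hiking", "age_range": "30-34"},
    {"city": "Palo Alto", "workplace": "Other Inc", "interest": "hiking", "age_range": "25-29"},
]
platform = ToyAdPlatform(population)

# Attributes the attacker already knows and believes uniquely identify the victim.
known = {"city": "Palo Alto", "workplace": "Acme Corp", "interest": "hiking"}

# Run one micro-targeted campaign per candidate value of the hidden attribute.
for candidate in ["18-24", "25-29", "30-34", "35-39"]:
    impressions = platform.run_campaign({**known, "age_range": candidate})
    if impressions > 0:
        print(f"Victim's hidden age range is likely: {candidate}")
```

The same structure underlies the inference-from-clicks variant: instead of impression counts, the attacker observes which campaign is billed for user interaction.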


... Therefore, social media advertising has become the source of a growing number of privacy concerns for internet users. The Facebook advertising platform, in particular, has been the source of a number of controversies in recent years regarding privacy violations [113,154] and Facebook's ability to be used by dishonest actors for discriminatory advertising [16,24,147] or ad-driven propaganda to influence elections [43]. For example, ProPublica demonstrated how Facebook allowed advertisers to reach users associated with the topic of 'Jew Haters' [24], and also allowed advertisers to exclude people from ads about employment based on their age [16]. ...
... Despite these issues, the fact that Facebook is constantly in the spotlight over potential or actual misuse of the platform, and the many studies on how this system could be manipulated [16,24,64,66,113,147,154,155], there is little to no understanding of how the ecosystem works overall, and of what we can do to bring more transparency. ...
... Concerns about the vulnerabilities of advertising interfaces have been raised by journalists [24,64,111] and researchers [59,95,113,147,154,155] alike. ProPublica pointed out how Facebook's rich advertising interface allowed advertisers to exclude users by race [64], which is illegal in the US, and how the problem persisted one year later [111], despite Facebook's measures to counter such problems [52]. ...
Thesis
Social media advertising has been subject to many privacy complaints. It is largely unknown who advertises on social media, what data the platforms have about users, and why users are being shown particular ads. In response, platforms like Facebook have introduced transparency mechanisms whereby users receive explanations about why they received an ad and what data Facebook has inferred about them. The aim of this thesis is to increase transparency in social media advertising. We build a browser extension, AdAnalyst, which collects the ads that users see on Facebook and the explanations Facebook provides for them, and in return provides users with aggregated statistics about the ads they receive. By using AdAnalyst, and by conducting experiments in which we target the users we monitor with ads, we find that Facebook's explanations are incomplete, misleading and vague. Additionally, we look at who is advertising on Facebook and how they target users. We identify a wide range of advertisers, some of which belong to potentially sensitive categories such as politics or health. We also find that advertisers employ targeting strategies that can be invasive or opaque. Finally, we develop a collaborative method that allows us to infer why a user has been targeted with ads on Facebook by looking at the characteristics of users that received the same ad.
... The implementation of consent mechanisms according to the GDPR requirements has begun to receive attention from privacy researchers (Datta et al. 2015). Privacy and transparency problems related to ad platforms are also coming under increased scrutiny (Speicher et al. 2018; Andreou et al. 2018; Irfan and Aleksandra 2018; Korolova 2010; Venkatadri et al. 2019; Datta et al. 2018; Castelluccia et al. 2012; Parra-Arnau et al. 2017). ...
... Second, Facebook has been questioned over the years by regulators and privacy advocates about its privacy practices (New York Times 2018) and investigated by various data protection authorities. Researchers (Andreou et al. 2018; Speicher et al. 2018; Korolova 2010; Irfan and Aleksandra 2018; Ribeiro et al. 2019) have uncovered that various privacy harms, such as discrimination, inciting social division, micro-targeting, single-house-based targeting and disclosure of personal data to advertisers, can result from its advertising platform. Facebook has also recently suffered various security and privacy breaches, not to mention its own privacy-invasive policies. ...
... Thus, they present extremely high privacy risks such as micro-targeting and discrimination (Korolova 2010; Irfan and Aleksandra 2018; Speicher et al. 2018; Ribeiro et al. 2019). Consent upholds transparency by ensuring that users are informed and that they have the choice to agree to specific data processing purposes rather than generic ones. ...
Article
Full-text available
The EU General Data Protection Regulation (GDPR) recognizes the data subject’s consent as one of the legal grounds for data processing. Targeted advertising, based on personal data processing, is a central source of revenue for data controllers such as Google and Facebook. At present, the implementation of consent mechanisms for such advertisements is often not well developed in practice, and their compliance with the GDPR requirements can be questioned. The absence of consent may mean unlawful data processing and a lack of control for the user (data subject) over their personal data. However, consent mechanisms that do not fully satisfy GDPR requirements can give users a false sense of control, encouraging them to allow the processing of more personal data than they would have otherwise. In this paper, we identify the features, originating from GDPR requirements, of consent mechanisms. For example, the GDPR specifies that consent must be informed and freely given, among other requirements. We then examine the Ad Consent Mechanism of Facebook, which is based on processing of user activity data off Facebook Company Products provided by third parties, with respect to these features. We discuss to what extent this consent mechanism respects these features. To the best of our knowledge, our evaluation of Facebook’s Ad Consent Mechanism is the first of its kind.
... The Facebook advertising platform has been the source of a number of controversies in recent years regarding privacy violations [31], [40], lack of transparency on how it provides information to users about the ads they see [22], and lately, Facebook's ability to be used by dishonest actors for discriminatory advertising [9], [16], [38] or ad-driven propaganda to influence elections [19]. For example, ProPublica demonstrated how Facebook allowed advertisers to reach users associated with the topic of 'Jew Haters' [9], and also allowed advertisers to exclude people from ads about employment based on their age [16]. ...
... Because advertising platforms have been the vectors for privacy violations [31], [40], discriminatory advertising [9], [16], [38], and ad-driven propaganda [19], we begin by examining who the set of advertisers are and what features they have that might indicate their trustworthiness. Estimating the trustworthiness of an advertiser, however, is a difficult task. ...
... In addition, Venkatadri et al. [40] demonstrated several attacks that allow adversaries to infer users' phone numbers or de-anonymize the visitors of a proprietary website. Finally, Korolova et al. [31] demonstrated mechanisms through which an advertiser can infer the private attributes of a user. In our study, we only exploit the Facebook advertising interface to gather various statistics about the attributes Facebook allows advertisers to use when targeting users. ...
... This opens the way for loss of user privacy. Korolova [2010] provides experimental evidence of several approaches to obtain private user information from the advertising system of the world's largest online social network, Facebook. Calandrino et al. [2011] show methods to infer users' private transactions from the information routinely revealed by the users to the advertising system. ...
... LibraryThing, and Amazon. To deal with such methods, one of the proposals suggested by Korolova [2010] asks advertising systems to use only public user information. While this approach can defend user privacy against almost all such attacks, it is highly unlikely that any for-profit corporation would adopt it, because it would make targeted advertising campaigns infeasible. ...
... However, despite the best of intentions by the corporations hosting the advertising systems, personal user information stored at the advertising systems can nonetheless be misused for malicious purposes. Indeed, Korolova [2010] notes that using Facebook's advertising systems one can infer a user's age, sexual orientation, relationship status, political and religious affiliation, presence or absence of a particular interest, as well as exact birthday. Kosinski et al. [2013] show that it is possible to accurately predict a range of highly sensitive personal attributes. Privacy-preserving input has been used for data collection by Wang et al. [2016]. ...
Thesis
In this thesis, we study sequential decision-making problems in which, for each of its decisions, the learner receives feedback that it uses to guide its future decisions. To go beyond the conventional feedback that has been well studied for sequential decision-making problems such as multi-armed bandits, we consider forms of partial feedback motivated by practical applications. First, we consider the dueling bandits problem, in which the learner selects two actions at each time step and receives relative (i.e., preference) feedback between the instantaneous values of these two actions. In particular, we propose an optimal algorithm that allows the learner to achieve near-optimal cumulative regret (regret being the difference between the optimal cumulative reward and the learner's realized cumulative reward). Second, we consider the corrupt bandits problem, in which a stochastic corruption process perturbs the feedback. For this problem as well, we design algorithms that achieve asymptotically optimal cumulative regret. Furthermore, we examine the relationship between these two problems within the framework of partial monitoring, a generic paradigm for sequential decision making with partial feedback.
... Some intermediaries such as Facebook allow sellers to define a target audience using attributes including date of birth, gender and location before they bid. Korolova (2010) demonstrates that sellers can select attributes so that they are satisfied only by a single user, effectively revealing the target consumer's demographic information that was supposed to be private. See Korolova (2010) and Venkatadri, Andreou, Liu, Mislove, Gummadi, Loiseau, and Goga (2018) for more details. ...
... Korolova (2010) demonstrates that sellers can select attributes so that they are satisfied only by a single user, effectively revealing the target consumer's demographic information that was supposed to be private. See Korolova (2010) and Venkatadri, Andreou, Liu, Mislove, Gummadi, Loiseau, and Goga (2018) for more details. This has sparked concerns about consumer data leakage through targeted advertisements, and has served as one of the motivations for data protection regulation concerning the sharing of consumer data with sellers. ...
Article
This dissertation consists of two essays that examine issues related to data - how data is generated, used and monetized. In Chapter 1, I study how intermediaries such as Amazon and Google recommend products and services to consumers for which they receive compensation from the recommended sellers. Consumers will find these recommendations useful only if they are informative about the quality of the match between the sellers’ offerings and the consumer’s needs. The intermediary would like the consumer to purchase the product from the recommended seller, but is constrained because consumers need not follow the recommendation. I frame the intermediary’s problem as a mechanism design problem in which the mechanism designer cannot directly choose the outcome, but must encourage the consumer to choose the desired outcome. I show that in the optimal mechanism, the recommended seller has the largest non-negative virtual willingness to pay adjusted for the cost of persuasion. The optimal mechanism can be implemented via a handicap auction. I use this model to provide insights for current policy debates. In Chapter 2, in the joint work with Mallesh Pai and Rakesh Vohra, we propose a statistical test for identifying whether a policy or an algorithm is designed by a principal with discriminatory tastes. The test can be used for identifying, for example, whether predictive policing algorithms are discriminatory against minority neighborhoods. We also argue that the marginal outcome test (Becker (1993)), the most popular test of taste-based discrimination, fails for policies. We consider a canonical setup where the principal designs a policy (algorithm) that maps signals (data) to decisions for each group, such as whether to patrol or not for each area. The principal commits to the policy, which in turn affects agents’ incentives to take action, such as whether to commit a crime. In this environment, the marginal outcome test fails because the principal not only cares about the marginal benefit of catching a criminal but how patrolling changes agents’ incentive to commit a crime. We propose a new statistical test that deviates from the marginal outcome test precisely as much as the incentive effect.
... For example, in audience size estimation for targeted advertising, by definition the counts of users satisfying certain criteria are released to the advertiser (who may be a firm, researcher, or stalker). The billing system is another pathway, as exploited by Korolova [2010]. In movie recommendation systems, the viewing habits of one user affect the recommendations made to a different user. ...
... • Calandrino et al. [2011] describe how to exploit the relative timing of a consumer's blog posts and changes in the outputs of a product recommendation system to infer purchases made, but not blogged about, by the blogger. • Korolova [2010] created several Facebook advertising campaigns, each pairing sufficiently many known attributes of the same specific individual (enough to single her out) with a different candidate age range. The campaign for which Korolova was charged revealed the individual's age range. ...
Article
Full-text available
Differential privacy is at a turning point. Implementations have been successfully leveraged in private industry, the public sector, and academia in a wide variety of applications, allowing scientists, engineers, and researchers to learn about populations of interest without learning about specific individuals within them. Because differential privacy allows us to quantify cumulative privacy loss, these differentially private systems will, for the first time, allow us to measure and compare the total privacy loss due to these personal data-intensive activities. Appropriately leveraged, this could be a watershed moment for privacy. Like other technologies and techniques that allow for a range of instantiations, implementation details matter. When meaningfully implemented, differential privacy supports deep data-driven insights with minimal worst-case privacy loss. When not meaningfully implemented, differential privacy delivers privacy mostly in name. Using differential privacy to maximize learning while providing a meaningful degree of privacy requires judicious choices with respect to the privacy parameter epsilon, among other factors. However, there is little understanding of what the optimal value of epsilon is for a given system or class of systems, purposes, or data, or how to go about determining it. To understand current differential privacy implementations and how organizations make these key choices in practice, we conducted interviews with practitioners to learn from their experiences of implementing differential privacy. We found no clear consensus on how to choose epsilon, nor is there agreement on how to approach this and other key implementation decisions. Given the importance of these implementation details there is a need for shared learning amongst the differential privacy community. To serve these purposes, we propose the creation of the Epsilon Registry—a publicly available communal body of knowledge about differential privacy implementations that can be used by various stakeholders to drive the identification and adoption of judicious differentially private implementations.
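Because the abstract turns on how the privacy parameter epsilon trades noise against utility, a short generic illustration may help; the Laplace mechanism below is textbook differential privacy, not code from any of the implementations discussed in the interviews.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise calibrated to sensitivity / epsilon."""
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Smaller epsilon means stronger privacy but larger error in the released statistic.
true_count = 1000
for eps in [0.01, 0.1, 1.0, 10.0]:
    noisy = np.array([laplace_count(true_count, eps) for _ in range(10_000)])
    print(f"epsilon={eps:<5} mean absolute error ≈ {np.mean(np.abs(noisy - true_count)):.1f}")
```

Running this makes the registry's motivation tangible: an order-of-magnitude change in epsilon changes the expected error by an order of magnitude, so the choice is consequential and worth documenting.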
... As shown in figure 1, the number of active Facebook users as of April 2018 was about 2.2 billion as given in Ref. [6]. Given the importance of this site and its widespread reach, it is important to consider the need to use social networking media, especially Facebook, for educational purposes. ...
... As stated in Ref. [20,21], Facebook can be very easily used as a tool and means of advertising. For example, ...
Article
Full-text available
Thanks to the rapid development of information technology, there are many new tools that can be used for teaching and learning. In the past, classical classrooms required teachers to write everything on the blackboard with chalk. Later, new tools such as projectors, desktop computers, and smart boards appeared in classrooms. Nowadays we also have mobile devices, which can be used for teaching and learning at a distance: because of their mobility, we no longer need to be in a physical classroom to follow courses offered by our teachers, and can instead join virtual classrooms established in the cloud. There are also many social media applications that can be used for teaching and learning. For example, it is now possible to teach courses online with Facebook as a supplementary tool, and some instructors use Facebook as a learning environment in which instructors and learners interact simultaneously. Facebook has made a significant contribution towards solving the problems faced by practical education students during their practical training, and there is an increasing trend in the study community to use Facebook to address these problems. The aim of this research is to encourage instructors and students to teach and learn using Facebook as part of a new system of education, namely online distance learning. In this study, we explain how mobile devices and social media have been used for teaching and learning master's-level courses at the Department of Software Engineering of the College of Technology at Firat University in Turkey. We explore the advantages and disadvantages of Facebook as a teaching and learning environment and offer recommendations, resulting from our research, for using Facebook as a teaching and learning platform.
... Incidents such as the famous Hugo Awards 2015 attack [27] have already raised alarming concerns about privacy. Korolova [66] exploited the micro-targeting feature of Facebook's advertisement system to infer private user information easily from data visible to "only me", including inference from impressions and inference from clicks. ...
Preprint
Users have renewed interest in protecting their private data in the digital space. When they do not believe that their privacy is sufficiently protected by one platform, they will readily switch to another. Such an increasing level of privacy awareness has made privacy preservation an essential research topic. Nevertheless, new privacy attacks are emerging day by day. Therefore, a holistic survey comparing known attacks on privacy preservation and their mitigation schemes is needed in the literature. We fill this gap by assessing the resilience of privacy-preserving methods to various attacks and conducting a comprehensive review of countermeasures from a broader perspective. First, we introduce the fundamental concepts and critical components of privacy attacks. Second, we comprehensively cover major privacy attacks targeted at anonymous data, statistical aggregate data, and privacy-preserving models. We also summarize popular countermeasures to mitigate these attacks. Finally, some promising future research directions and related issues in the privacy community are envisaged. We believe this survey will shed light on privacy research and encourage researchers to fully understand the resilience of different existing privacy-preserving approaches.
... Machine learning models trained on sensitive user data present the risk of leaking private user information [Dwork et al., 2007, Korolova, 2010, Calandrino et al., 2011, Shokri et al., 2017]. Differential Privacy (DP) [Dwork et al., 2006] mitigates this risk, and has become a gold standard widely adopted in industry and government [Abowd, 2018, Wilson et al., 2020, Rogers et al., 2021, Amin et al., 2022]. ...
Preprint
We study the problem of multi-task learning under user-level differential privacy, in which $n$ users contribute data to $m$ tasks, each involving a subset of users. One important aspect of the problem, that can significantly impact quality, is the distribution skew among tasks. Certain tasks may have much fewer data samples than others, making them more susceptible to the noise added for privacy. It is natural to ask whether algorithms can adapt to this skew to improve the overall utility. We give a systematic analysis of the problem, by studying how to optimally allocate a user's privacy budget among tasks. We propose a generic algorithm, based on an adaptive reweighting of the empirical loss, and show that when there is task distribution skew, this gives a quantifiable improvement of excess empirical risk. Experimental studies on recommendation problems that exhibit a long tail of small tasks, demonstrate that our methods significantly improve utility, achieving the state of the art on two standard benchmarks.
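As a rough illustration of the budget-allocation question this abstract raises (and not the paper's adaptive loss-reweighting algorithm), the sketch below splits a single user's privacy budget across tasks in proportion to how much data the user contributes to each task, and calibrates Laplace noise per task accordingly; the task names and sizes are made up.

```python
import numpy as np

def split_budget(samples_per_task: dict[str, int], total_epsilon: float) -> dict[str, float]:
    """Toy allocation: give tasks with more of the user's data a larger share of epsilon."""
    total = sum(samples_per_task.values())
    return {task: total_epsilon * n / total for task, n in samples_per_task.items()}

def noisy_mean(values: np.ndarray, epsilon: float, clip: float = 1.0) -> float:
    """Release a clipped mean with Laplace noise; a small epsilon share or little data means more noise."""
    clipped = np.clip(values, -clip, clip)
    return float(clipped.mean() + np.random.laplace(0.0, 2 * clip / (epsilon * len(clipped))))

# Hypothetical numbers of samples one user contributes to each task.
user_tasks = {"movies": 200, "books": 20, "news": 5}
for task, eps in split_budget(user_tasks, total_epsilon=1.0).items():
    data = np.random.uniform(-1.0, 1.0, size=user_tasks[task])
    print(f"{task}: epsilon share {eps:.3f}, noisy mean {noisy_mean(data, eps):.3f} (true mean ≈ 0)")
```

The skew problem is visible in the output: the smallest task receives both the least data and the smallest epsilon share, so its released statistic is the noisiest, which is exactly the regime the paper's reweighting targets.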
... Typically, advertising systems are established by leading social media networks, web browsers and other popular websites. Korolova (2010); Kosinski, Stillwell, and Graepel (2013) show that it is possible to accurately predict a range of highly sensitive personal attributes including age, sexual orientation, relationship status, political and religious affiliation using the feedback available to the advertising systems. Such possible breaches of privacy make it necessary to protect personal user information not only from the advertisers but also from the advertising systems. ...
Preprint
We study the problem of preserving privacy while still providing high utility in sequential decision making scenarios in a changing environment. We consider an abruptly changing environment: the environment remains constant within periods and changes at unknown time instants. To formulate this problem, we propose a variant of multi-armed bandits called non-stationary stochastic corrupt bandits. We construct an algorithm called SW-KLUCB-CF and prove an upper bound on its utility using the performance measure of regret. The proven regret upper bound for SW-KLUCB-CF is near-optimal in the number of time steps and matches the best known bound for analogous problems in terms of the number of time steps and the number of changes. Moreover, we present a provably optimal mechanism which can guarantee the desired level of local differential privacy while providing high utility.
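SW-KLUCB-CF itself is not reproduced here, but the core sliding-window idea for abruptly changing environments can be sketched generically: compute arm statistics over only the most recent pulls so that rewards observed before a change point are forgotten. The code below is a plain sliding-window UCB illustration under that assumption, without the corruption or local-privacy components of the authors' algorithm.

```python
import math
import random
from collections import deque

def sliding_window_ucb(arm_means_per_phase, horizon=6000, window=500):
    """Generic sliding-window UCB: arm statistics use only the last `window` pulls."""
    n_arms = len(arm_means_per_phase[0])
    history = deque(maxlen=window)  # (arm, reward) pairs inside the window
    total_reward = 0.0
    for t in range(horizon):
        # The environment changes abruptly halfway through the horizon.
        means = arm_means_per_phase[min(t // (horizon // 2), len(arm_means_per_phase) - 1)]
        counts = [1e-9] * n_arms
        sums = [0.0] * n_arms
        for arm, r in history:
            counts[arm] += 1
            sums[arm] += r
        # UCB index computed over the windowed statistics only.
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(min(t, window) + 2) / counts[a])
               for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: ucb[a])
        reward = 1.0 if random.random() < means[arm] else 0.0
        history.append((arm, reward))
        total_reward += reward
    return total_reward

# Two phases: the best arm switches after the change point.
print(sliding_window_ucb([[0.2, 0.8], [0.8, 0.2]]))
```

The window length trades off adaptivity to changes against statistical accuracy within a phase, which is the same tension the paper's regret bound quantifies in terms of the number of changes.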
... The advertising system collects from advertisers the ads they want to display and their targeting criteria and then delivers the ads to people fitting those criteria. Rather than 'selling' information about their users, the business model is to sell space to advertisers, giving them access to people based on their demographics and interests (Facebook, 2007, November 6; Korolova, 2010). Why a user received a particular ad is therefore the result of a complex process depending upon many inputs including: what the platform thinks the user is interested in; characteristics of users the advertiser wants to reach; the set of advertisers and parameters of their campaigns; the bid prices of all advertisers; active users on the platform at a particular time; and the algorithm used to match ads to users (Andreou et al., 2018). ...
Chapter
Full-text available
To understand how the optimisation of emotion incubates false information online, this chapter examines profiling and targeting in citizen-political communications. Profiling and targeting are how emotion is understood, harnessed, amplified, dampened, manipulated and optimised. This chapter focuses on profiling and targeting in political campaigning as this is an intensively studied area awash with emotion and deception and attracts uneven protections across the world. Specifically, this chapter examines the targeting and profiling technologies and practices in political campaigning in the USA, UK and India, so highlighting the impact of different data protection regimes as well as uneven digital literacies. In exploring these issues, this chapter also outlines key tools and techniques utilised by digital political campaigners in the big data era to profile and target datafied emotions.
... Analytics are often conducted with weak or no privacy guarantees whatsoever, sometimes relying on ad hoc techniques for anonymity [77,171,278,280]. A prominent example of this is the U.S. Census, which, prior to 2020, relied on a heuristic algorithm of data swapping that has been shown vulnerable to reconstruction attacks that reveal sensitive information about large portions of the population [124]. ...
Article
Collecting distributed data from millions of individuals for the purpose of analytics is a common scenario – from Apple collecting typed words and emojis to improve its keyboard suggestions, to Google collecting location data to see how busy restaurants and businesses are. This data is often sensitive, and can be overly revealing about the individuals and communities whose data is being analyzed en masse. Differential privacy has become the gold-standard method to give strong individual privacy guarantees while releasing aggregate statistics about sensitive data. However, the process of computing such statistics can itself be a privacy risk. For instance, a simple approach would be to collect all the raw data at a single central entity, which then computes and releases the statistics. This entity then has to be trusted to not abuse the raw data; in practice, it can be difficult to find an entity with the requisite level of trust. In this thesis, we describe a new approach that uses cryptographic techniques to collect data privately and safely, without placing trust in any party. Although the natural candidates, such as secure multiparty computation (MPC) and fully homomorphic encryption (FHE) do not scale to millions of parties on their own, our key insight is that there are ways to refactor computations in such a way that they can be done using simpler techniques that do scale, such as additively homomorphic encryption. Our solution restructures centralized computations into distributed protocols that can be executed efficiently at scale. The systems we design based on this approach can support billions of participants and can handle a variety of real queries from the literature, including machine learning tasks, Pregel-style graph queries, and queries over large categorical data. We automate the distributed refactoring so that analysts can write the query as if the data were centralized without understanding how the rewriting works, and we protect against malicious parties who aim to poison or bias the results.
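One of the simpler building blocks behind this kind of trust-free aggregation is additive secret sharing, sketched below under the assumption of non-colluding aggregation servers; the thesis itself builds on additively homomorphic encryption and automated query refactoring, so this is an illustration of the general idea rather than its actual protocol.

```python
import secrets

PRIME = 2**61 - 1  # field modulus for the toy example

def share(value: int, n_servers: int) -> list[int]:
    """Split a private value into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def aggregate(all_user_shares: list[list[int]]) -> int:
    """Each server sums only the shares it received; combining the per-server totals reveals just the aggregate."""
    n_servers = len(all_user_shares[0])
    server_totals = [sum(user[s] for user in all_user_shares) % PRIME for s in range(n_servers)]
    return sum(server_totals) % PRIME

user_values = [3, 7, 12, 5]                                # private inputs
shares = [share(v, n_servers=3) for v in user_values]      # one share per server, per user
print(aggregate(shares))                                    # prints 27 without any server seeing a raw value
```

Each individual share is uniformly random, so no single server learns anything about a user's value; only the final combined total, the statistic the analyst wanted, is revealed.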
... The goal of privacy protection is to prevent the algorithm's output from revealing a user's private information, such as item preferences. Real-world privacy breaches have been reported in Amazon's recommendation system [8] and Facebook's advertisement system [23], where an adversary can learn considerable side information about a user solely based on the system's recommendation sequence. ...
Preprint
Bandit algorithms have become a reference solution for interactive recommendation. However, as such algorithms directly interact with users for improved recommendations, serious privacy concerns have been raised regarding their practical use. In this work, we propose a differentially private linear contextual bandit algorithm, via a tree-based mechanism to add Laplace or Gaussian noise to model parameters. Our key insight is that as the model converges during online update, the global sensitivity of its parameters shrinks over time (thus named dynamic global sensitivity). Compared with existing solutions, our dynamic global sensitivity analysis allows us to inject less noise to obtain $(\epsilon, \delta)$-differential privacy with added regret caused by noise injection in $\tilde O(\log{T}\sqrt{T}/\epsilon)$. We provide a rigorous theoretical analysis over the amount of noise added via dynamic global sensitivity and the corresponding upper regret bound of our proposed algorithm. Experimental results on both synthetic and real-world datasets confirmed the algorithm's advantage against existing solutions.
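The intuition behind a shrinking "dynamic global sensitivity" (an estimate built from more observations moves less when any single observation changes, so the same privacy level requires less noise) can be illustrated with a toy running mean; this is not the paper's tree-based mechanism or its contextual bandit, and composition across the sequence of releases is ignored for brevity.

```python
import numpy as np

def private_running_means(rewards, epsilon=1.0):
    """Release a noisy running mean after each step; the mean's sensitivity shrinks as 1/t."""
    released = []
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += r
        sensitivity = 1.0 / t  # rewards assumed bounded in [0, 1], so one change moves the mean by at most 1/t
        noisy = total / t + np.random.laplace(0.0, sensitivity / epsilon)
        released.append(noisy)
    return released

rewards = np.random.binomial(1, 0.6, size=1000)
estimates = private_running_means(rewards)
print(f"early estimate: {estimates[9]:.3f}, late estimate: {estimates[-1]:.3f} (true mean ≈ 0.6)")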
... Nonetheless, any campaign that targets or personalizes its messages will have to weigh the advantage of specific group targeting with the potential risk of decreasing sample size. There have already been several studies that have addressed the privacy risks of targeted advertisements, particularly when they target very small groups, but such risks have been continuously reported and resolved [51,52]. ...
Article
Full-text available
Although established marketing techniques have been applied to design more effective health campaigns, more often than not, the same message is broadcasted to large populations, irrespective of unique characteristics. As individual digital device use has increased, so have individual digital footprints, creating potential opportunities for targeted digital health interventions. We propose a novel precision public health campaign framework to structure and standardize the process of designing and delivering tailored health messages to target particular population segments using social media-targeted advertising tools. Our framework consists of five stages: defining a campaign goal, priority audience, and evaluation metrics; splitting the target audience into smaller segments; tailoring the message for each segment and conducting a pilot test; running the health campaign formally; and evaluating the performance of the campaigns. We have demonstrated how the framework works through 2 case studies. The precision public health campaign framework has the potential to support higher population uptake and engagement rates by encouraging a more standardized, concise, efficient, and targeted approach to public health campaign development.
... Differential privacy [15] is a mathematically quantifiable privacy guarantee for a data set used by a computation that analyzes it. While it originally emerged in the database and data mining communities, triggered by privacy concerns in Machine Learning (ML) [18,19,30,39,58,60], DP has garnered enormous traction in the ML community over the last decade [1,5,7,9,11,18,19,53,54,57]. The global model is sent by the federation server to users, each of which retrains it on its private data and sends the updated model parameters back to the federation server. ...
Preprint
Federated Learning (FL) is quickly becoming a go-to distributed training paradigm for users to jointly train a global model without physically sharing their data. Users can indirectly contribute to, and directly benefit from, a much larger aggregate data corpus used to train the global model. However, literature on successful application of FL in real-world problem settings is somewhat sparse. In this paper, we describe our experience applying a FL based solution to the Named Entity Recognition (NER) task for an adverse event detection application in the context of mass scale vaccination programs. We present a comprehensive empirical analysis of various dimensions of benefits gained with FL based training. Furthermore, we investigate effects of tighter Differential Privacy (DP) constraints in highly sensitive settings where federation users must enforce Local DP to ensure strict privacy guarantees. We show that local DP can severely cripple the global model's prediction accuracy, thus disincentivizing users from participating in the federation. In response, we demonstrate how recent innovation in personalization methods can help significantly recover the lost accuracy. We focus our analysis on the Federated Fine-Tuning algorithm, FedFT, and prove that it is not PAC Identifiable, thus making it even more attractive for FL-based training.
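A minimal sketch of the kind of local-DP federated update the paper studies: each client clips its model update and adds Gaussian noise on-device before sending it to the server, which simply averages the noisy updates. The toy linear-regression task, parameter names, and noise calibration are illustrative assumptions, not the paper's FedFT setup or its NER model.

```python
import numpy as np

def local_dp_update(update: np.ndarray, clip_norm: float, noise_std: float) -> np.ndarray:
    """Clip the client's update and add Gaussian noise before it leaves the device."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + np.random.normal(0.0, noise_std, size=update.shape)

def federated_round(global_w, client_data, lr=0.1, clip_norm=1.0, noise_std=0.3):
    """One FedAvg-style round on a toy linear regression, with local DP noise added per client."""
    updates = []
    for X, y in client_data:
        grad = 2 * X.T @ (X @ global_w - y) / len(y)  # gradient of the squared error
        updates.append(local_dp_update(-lr * grad, clip_norm, noise_std))
    return global_w + np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(0.0, 0.1, size=50)))

w = np.zeros(2)
for _ in range(200):
    w = federated_round(w, clients)
# The residual error grows with noise_std: the accuracy cost of stricter local DP the paper quantifies.
print("true weights:", true_w, "learned with local DP noise:", np.round(w, 2))
```

Increasing `noise_std` in this sketch visibly degrades the learned weights, which mirrors the paper's observation that tight local DP can cripple global model accuracy and motivates its personalization-based recovery.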
... Our paper focuses on a more foundational question of what should be considered political advertising. Several other studies have pinpointed problems with the Facebook ad ecosystem without focusing on political advertising such as discrimination [34], lack of transparency [2,3], and security and privacy problems [21,37]. ...
Preprint
Full-text available
Online political advertising has grown significantly over the last few years. To monitor online sponsored political discourse, companies such as Facebook, Google, and Twitter have created public Ad Libraries collecting the political ads that run on their platforms. Currently, both policymakers and platforms are debating further restrictions on political advertising to deter misuses. This paper investigates whether we can reliably distinguish political ads from non-political ads. We take an empirical approach to analyze what kind of ads are deemed political by ordinary people and what kind of ads lead to disagreement. Our results show a significant disagreement between what ad platforms, ordinary people, and advertisers consider political and suggest that this disagreement mainly comes from diverging opinions on which ads address social issues. Overall our results imply that it is important to consider social issue ads as political, but they also complicate political advertising regulations.
... Microtargeting is also sometimes used to garner public support during election campaigns. The practice of microtargeting, though effective, has raised questions regarding violations of the privacy of user information [73]. ...
Article
Full-text available
Artificial Intelligence (AI) as a technology has existed for less than a century. In spite of this, it has made great strides. The rapid progress made in this field has aroused the curiosity of many technologists around the globe, and many companies across various domains are curious to explore its potential. For a field that has achieved so much in such a short duration, it is imperative that people who aim to work in Artificial Intelligence study its origins, recent developments, and future possibilities of expansion to gain better insight into the field. This paper encapsulates the notable progress made in Artificial Intelligence from its conceptualization to its current state and future possibilities across various fields. It covers concepts such as the Turing machine, the Turing test, historical developments in Artificial Intelligence, expert systems, big data, robotics, current developments in Artificial Intelligence across various fields, and future possibilities for exploration.
... The applied approach results in 97% classification accuracy. In [28], the author proposed another group of attacks to breach user privacy. This novel class of attacks exploits advertising systems that offer audience-targeting capabilities at the micro level. ...
Article
Full-text available
In the present era, online social networking has become an important phenomenon in human society. However, a large section of users is not aware of the security and privacy concerns involved in it. People have a tendency to publish sensitive and private information, for example date of birth, mobile numbers, places checked in, live locations, emotions, and the names of their spouse and other family members, which may potentially prove disastrous. By monitoring their social network updates, cyber attackers first collect users' public information, which is then used to acquire confidential information such as banking details and to launch security attacks, e.g. fake identity attacks. Such attacks or information leaks may gravely affect users' lives. In this technology-laden era, it is imperative that users be well aware of the potential risks involved in online social networks. This paper comprehensively surveys the evolution of online social networks, their associated risks, and solutions. The various security models and state-of-the-art algorithms are discussed, along with a comparative meta-analysis using machine learning, deep learning, and statistical testing to recommend a better solution.
... Nonetheless, any campaign that targets or personalizes its messages will have to weigh the advantage of specific group targeting with the potential risk of decreasing sample size. There have already been several studies that have addressed the privacy risks of targeted advertisements, particularly when they target very small groups, but such risks have been continuously reported and resolved [51,52]. ...
Preprint
While established marketing techniques have been applied to design more effective health campaigns, more often than not, the same message is broadcast to large populations, irrespective of unique characteristics. As individual digital device usage has increased, so have individual digital footprints, creating potential opportunities for targeted digital health interventions. We propose a novel Precision Public Health Campaign (PPHC) framework to structure and standardize the process of designing and delivering tailored health messages to target particular population segments using social media targeted advertising tools. Our framework consists of five stages: (1) defining a campaign goal, priority audience, and evaluation metrics, (2) splitting the target audience into smaller segments, (3) tailoring the message for each segment and doing a pilot test, (4) running the health campaign formally, and (5) evaluating the performance of the campaigns. We demonstrate how the framework works through two case studies. The PPHC framework has the potential to support higher population uptake and engagement rates by encouraging a more standardized, concise, efficient, and targeted approach to public health campaign development.
... For example, Venkatadri et al. [71] noted the possibility of targeting individuals with sensitive PII, such as phone numbers provided for security purposes and phone numbers derived from friends' contact lists. Furthermore, attackers can exploit the vulnerabilities of the advertising interfaces to breach user privacy [47] or to employ discriminatory targeting [22,65]. On the subject of transparency, Kreiss and McGregor [49] consider that both Facebook and Google have been opaque in their decision-making, following policies that are not transparent and that were applied without explicit justification. ...
Conference Paper
Full-text available
This study investigates the possibilities and limits presented by the newly created ad libraries from Facebook and Google to analyze online political campaigns. We selected Germany as a case study and focused on the months leading up to the 2019 elections to the European Parliament. We identified the political actors that were active advertisers, compared their spending, and contrasted the number of ad impressions with user engagement on their organic online content. From the political ads, we extracted the unique ads and manually analyzed a subsample of them. Furthermore, we explored regional and demographic distributions of users reached by the advertisements and used them as a proxy for the advertisers’ targeting strategies. We also compared the success of the ad campaigns on boosted Facebook posts. We found that even though all the major German political parties engaged in online ad campaigns, they kept their attempts at microtargeting to a minimum. Although their Facebook-sponsored posts were more successful than normal posts, we did not find statistical significance for all the political parties. Interestingly, we noticed that the distribution of users reached by the right-wing party Alternative für Deutschland (AfD) diverges from that of the other parties. Finally, we discuss further challenges for enhancing transparency in online advertising.
... Manipulation, misinformation, and related concepts have entered the global political discourse. This was nothing unexpected from a computer science perspective; academic privacy research had pinpointed many of the risks well before these gained mainstream traction [15]. Later on, social media and technology companies sought to answer the public uproar by traditional means of corporate social responsibility: by producing voluntary transparency reports on political ads. ...
Preprint
Online political advertisements have become an important element in electoral campaigning throughout the world. At the same time, concepts such as misinformation and manipulation have emerged as a global concern. Although these concepts are distinct from online political ads and data-driven electoral campaigning, they tend to share a similar trait related to valence, the intrinsic attractiveness or averseness of a message. Given this background, the paper examines online political ads by using a dataset collected from Google's transparency reports. The examination is framed to the mid-2019 situation in Europe, including the European Parliament election in particular. According to the results based on sentiment analysis of the textual ads displayed via Google's advertisement machinery, (i) most of the political ads have expressed positive sentiments, although these vary greatly between (ii) European countries as well as across (iii) European political parties. In addition to these results, the paper contributes to the timely discussion about data-driven electoral campaigning and its relation to politics and democracy.
... The paper [16] analyzes the possible causes of privacy breaches and describes several attacks that exploit advertising systems with micro-targeting capabilities. The authors focus on the Facebook case study, in particular on the risks of user privacy leakage. ...
... The granularity of the user data held by these entities has given rise to powerful microtargeting capabilities. These capabilities have in turn produced audience-selection tools that enable advertisers to target groups of users with great precision [57]. In Fig. 6, we show an interface offered by a social network and a DSP to choose an audience for better ad targeting. ...
Article
Full-text available
Online tracking is the key enabling technology of modern online advertising. In the recently established model of real-time bidding (RTB), the web pages tracked by ad platforms are shared with advertising agencies (also called DSPs), which, in an auction-based system, may bid for user ad impressions. Since tracking data are no longer confined to ad platforms, RTB poses serious risks to privacy, especially with regard to user profiling, a practice that can be conducted at a very low cost by any DSP or related agency, as we reveal here. In this work, we illustrate these privacy risks by examining a data set with the real ad-auctions of a DSP, and show that for at least 55% of the users tracked by this agency, it paid nothing for their browsing data. To mitigate this abuse, we propose a system that regulates the distribution of bid requests (containing user tracking data) to potentially interested bidders, depending on their previous behavior. In our approach, an ad platform restricts the sharing of tracking data by limiting the number of DSPs participating in each auction, thereby leaving unchanged the current RTB architecture and protocols. However, doing so may have an evident impact on the ad platform’s revenue. The proposed system is designed accordingly, to ensure the revenue is maximized while the abuse by DSPs is prevented to a large degree. Experimental results seem to suggest that our system is able to correct misbehaving DSPs, and consequently enhance user privacy.
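The gist of the regulation mechanism summarized in this abstract can be sketched as follows; the behaviour scores and the selection rule are made-up placeholders, not the paper's actual revenue-aware design.

```python
import random

# Hypothetical behaviour scores in [0, 1]: higher means the DSP bids and pays more often,
# lower means it mostly free-rides on the tracking data in bid requests.
dsp_scores = {"dsp_a": 0.9, "dsp_b": 0.6, "dsp_c": 0.2, "dsp_d": 0.05}

def select_bidders(scores: dict[str, float], k: int) -> list[str]:
    """Invite at most k DSPs to an auction, weighted by their past behaviour."""
    names, weights = zip(*scores.items())
    chosen = set()
    while len(chosen) < min(k, len(names)):
        chosen.add(random.choices(names, weights=weights, k=1)[0])
    return sorted(chosen)

# Over many auctions, misbehaving DSPs receive far fewer bid requests, and hence far less tracking data.
counts = {name: 0 for name in dsp_scores}
for _ in range(10_000):
    for name in select_bidders(dsp_scores, k=2):
        counts[name] += 1
print(counts)
```

The design tension the paper addresses is visible even here: inviting fewer DSPs per auction protects users' browsing data but reduces auction competition, which is why the proposed system optimizes the restriction against the ad platform's revenue.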
... Related Work. Within the privacy-preserving framework, there exist privacy-preserving data mining (PPDM) techniques in the database community [2] [3] [4] whose goal is to prevent association of any instance in a database with a person. In addition to PPDM, many privacy-preserving machine learning (PPML) techniques [5], [6], [7], [8], [9], [10] have been proposed to deal with data beyond those in traditional databases. ...
Article
This paper proposes SEquential GAme NEtwork (SEGANE), a novel deep neural network (DNN) architecture for optimizing the performance of machine learning applications with multiple competing objectives. Specifically, SEGANE is evaluated in the context of data sanitization which aims to remove any pre-specified private information from the data in real time while keeping the relevant information used to improve the inference accuracy about the non-private information. In some settings, preserving private information and improving inference performance about non-private information are competing objectives. In such cases, SEGANE provides a sequential game framework and algorithmic tools to implement data sanitization schemes with flexible trade-off between these two objectives. We use two datasets: MNIST (hand-written digits) and IMDB (gender and age) to evaluate SEGANE. For MNIST, even numbers are considered private while numbers larger than 10 are considered non-private. For IMDB, in one setting, gender is considered private while age is non-private, and vice versa in another setting. Our experimental results on these datasets show that SEGANE is highly effective in removing private information from the dataset while allowing non-private data to be mined effectively.
... Korolova [16] was the first to point out privacy attacks based on micro-targeted online ads. Follow-up work has reverse-engineered the targeting options provided by major online ad networks [33] and explored privacy [2] and bias [26] issues of these online ad networks. ...
Preprint
Full-text available
During the summer of 2018, Facebook, Google, and Twitter created policies and implemented transparent archives that include U.S. political advertisements which ran on their platforms. Through our analysis of over 1.3 million ads with political content, we show how different types of political advertisers are disseminating U.S. political messages using Facebook, Google, and Twitter's advertising platforms. We find that in total, ads with political content included in these archives have generated between 8.67 billion and 33.8 billion impressions and that sponsors have spent over $300 million USD on advertising with U.S. political content. We are able to improve our understanding of political advertisers on these platforms. We have also discovered a significant amount of advertising by quasi for-profit media companies that appeared to exist for the sole purpose of creating deceptive online communities focused on spreading political messaging and not for directly generating profits. Advertising by such groups is a relatively recent phenomenon, and appears to be thriving on online platforms due to the lower regulatory requirements compared to traditional advertising platforms. We have found through our attempts to collect and analyze this data that there are many limitations and weaknesses that enable intentional or accidental deception and bypassing of the current implementations of these transparency archives. We provide several suggestions for how these archives could be made more robust and useful. Overall, these efforts by Facebook, Google, and Twitter have improved political advertising transparency of honest and, in some cases, possibly dishonest advertisers on their platforms. We thank the people at these companies who have built these archives and continue to improve them.
... As such we believe that privacy concerns for our visualization are similar to those of other aggregate population data. However, researchers have pointed out how the Facebook advertising API could in the past be misused to obtain personally identifiable information [5,4]. Facebook has since closed the corresponding loopholes concerning the audience estimates for so-called "custom audiences". ...
Conference Paper
Full-text available
We present two interactive data visualizations of fine-grained demographic information for New York City, US, and Doha, Qatar, obtained using Facebook's Marketing API. The visualizations make innovative use of treemaps to support a bi-modal data selection and visualization of both "where are people of type X" and "what type of people are in location Y." The two interactive visualizations aim to both show-case a front-end for census-type information and to demonstrate the richness of Facebook's advertising data.
... What kinds of information can we release about social networks while preserving the privacy of their users? Straightforward approaches, such as removing obvious identifiers or releasing summaries that concern at least a certain number of nodes, can be easily broken [46,38]. ...
Preprint
Full-text available
Motivated by growing concerns over ensuring privacy on social networks, we develop new algorithms and impossibility results for fitting complex statistical models to network data subject to rigorous privacy guarantees. We consider the so-called node-differentially private algorithms, which compute information about a graph or network while provably revealing almost no information about the presence or absence of a particular node in the graph. We provide new algorithms for node-differentially private estimation for a popular and expressive family of network models: stochastic block models and their generalization, graphons. Our algorithms improve on prior work, reducing their error quadratically and matching, in many regimes, the optimal nonprivate algorithm. We also show that for the simplest random graph models ($G(n,p)$ and $G(n,m)$), node-private algorithms can be qualitatively more accurate than for more complex models---converging at a rate of $\frac{1}{\epsilon^2 n^{3}}$ instead of $\frac{1}{\epsilon^2 n^2}$. This result uses a new extension lemma for differentially private algorithms that we hope will be broadly useful.
... In such situations, there is clearly a threat to individual privacy. Even worse, individual private information may be retrieved from sanitized datasets, usually by correlating multiple datasets coming from multiple sources [3] [4] [5] [6]. ...
Article
Full-text available
Differential privacy, and other closely related notions such as dχ-privacy, are at the heart of the privacy framework when considering the use of randomization to ensure data privacy. Such a guarantee is always subject to some trade-off between the privacy level and the accuracy of the result. While the privacy parameter of differentially private algorithms controls this trade-off, it is often a hard task to choose a meaningful value for this numerical parameter. Only a few works have tackled this issue, and the present paper's goal is to continue this effort in two ways. First, we propose a generic framework to decide whether a privacy parameter value is sufficient to prevent some predetermined and well-understood risks for privacy. Second, we instantiate our framework on mobility data from real-life datasets, and show some insightful features necessary for practical applications of randomized sanitization mechanisms. In our framework, we model scenarios where an attacker's goal is to de-sanitize data previously sanitized in the sense of dχ-privacy, a privacy guarantee close to that of differential privacy. Each attack is associated with a meaningful risk of data disclosure, and the level of success of the attack suggests a relevant value for the corresponding privacy parameter.
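As one concrete instance of the kind of dχ-private sanitization of mobility data this paper evaluates, the sketch below applies the standard planar Laplace mechanism used for geo-indistinguishability; the coordinates and parameter values are arbitrary, and the paper's attack-based framework for choosing the parameter is not reproduced.

```python
import numpy as np

def planar_laplace(x: float, y: float, epsilon: float) -> tuple[float, float]:
    """Perturb a 2D location with planar Laplace noise: uniform angle, radius ~ Gamma(2, 1/epsilon)."""
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    r = np.random.gamma(shape=2.0, scale=1.0 / epsilon)
    return x + r * np.cos(theta), y + r * np.sin(theta)

# Arbitrary planar coordinates; epsilon is expressed per unit of distance.
true_loc = (2.35, 48.85)
for eps in [0.1, 1.0, 10.0]:
    noisy = planar_laplace(*true_loc, epsilon=eps)
    dist = np.hypot(noisy[0] - true_loc[0], noisy[1] - true_loc[1])
    print(f"epsilon={eps:<4} reported {tuple(round(c, 3) for c in noisy)}, displaced by {dist:.3f}")
```

The displacement distances make the paper's question tangible: a given epsilon is only "sufficient" relative to a concrete de-sanitization risk, such as how precisely an attacker can still localize the user.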
Article
News headlines about privacy invasions, discrimination, and biases discovered in the platforms of big technology companies are commonplace today, and big tech's reluctance to disclose how they operate counteracts ideals of transparency, openness, and accountability. This book is for computer science students and researchers who want to study big tech's corporate surveillance from an experimental, empirical, or quantitative point of view and thereby contribute to holding big tech accountable. As a comprehensive technical resource, it guides readers through the corporate surveillance landscape and describes in detail how corporate surveillance works, how it can be studied experimentally, and what existing studies have found. It provides a thorough foundation in the necessary research methods and tools, and introduces the current research landscape along with a wide range of open issues and challenges. The book also explains how to consider ethical issues and how to turn research results into real-world change.
Article
Installing audio-based applications exposes users to the risk of the data processor extracting additional information beyond the task the user permitted. To solve these privacy concerns, we propose to integrate an on-edge data obfuscation between the audio sensor and the recognition algorithm. We introduce a novel privacy loss metric and use adversarial learning to train an obfuscator. Contrary to existing work, our technique does not require users to specify which sensitive attributes they want to protect (opt-out) but instead only provide permission for specific tasks (opt-in). Moreover, we do not require retraining of recognition algorithms, making the obfuscated data compatible with existing methods. We experimentally validate our approach on four voice datasets and show that we can protect several attributes of the speaker, including gender, identity, and emotional state with a minimal recognition accuracy degradation.
Chapter
Federated Learning (FL) is quickly becoming a go-to distributed training paradigm for users to jointly train a global model without physically sharing their data. Users can indirectly contribute to, and directly benefit from, a much larger aggregate data corpus used to train the global model. However, literature on successful application of FL in real-world problem settings is somewhat sparse. In this paper, we describe our experience applying a FL based solution to the Named Entity Recognition (NER) task for an adverse event detection application in the context of mass scale vaccination programs. We present a comprehensive empirical analysis of various dimensions of benefits gained with FL based training. Furthermore, we investigate effects of tighter Differential Privacy (DP) constraints in highly sensitive settings where federation users must enforce DP to ensure strict privacy guarantees. We show that DP can severely cripple the global model’s prediction accuracy, thus disincentivizing users from participating in the federation. In response, we demonstrate how recent innovation in personalization methods can help significantly recover the lost accuracy.
Article
In recent years the number of individuals struggling with mental illness has increased, and traditional mental health services are now considered insufficient under the current circumstances, which has prompted researchers to develop new approaches for mental healthcare. Social media usage is growing, and it has been utilized to provide additional insight on mental health by using the information shared by individuals as well as data taken from their social media activity. While this approach may provide a unique and effective perspective for mental health services, it is critical that privacy risks and protections are considered in the process. Social media services collect, process, and store a substantial amount of information about their users, and how that information is shared, as well as what types of predictions are made, may pose serious privacy concerns. This study aims to understand how privacy is addressed and emphasized when social media data is used for mental healthcare by conducting a systematic review of previous scholarly papers on the topic. Solove's taxonomy of privacy is used to evaluate these publications' privacy considerations and to demonstrate the privacy risks that may arise when social media data is used for mental health.
Poster
Full-text available
DNA molecules can retain information at high density, with high durability and low overall energy cost. This would make DNA-based data storage systems a compelling solution for closing the increasing gap between global data production and our current means of storing data. While key technical developments in recent decades have allowed DNA-based data storage systems to slowly progress closer to mainstream usage, there has been an overall lack of discourse surrounding the potential implications of such systems in the context of human-computer interaction (HCI). This article introduces the DNA-based technology, followed by highlights of some of the potential opportunities and challenges it brings to the HCI community. In summary, DNA-based data storage systems offer a new research topic for user experience studies and data physicalization, driven by the inherent biological qualities of DNA. As a tool, given the longevity of DNA, the system could also function as a multi-lifespan information management product designed to help address long-term wicked problems. In terms of challenges, ethical implications surrounding technology ownership, and communication hurdles for HCI researchers working with the new technology, should also be considered and addressed.
Chapter
Online political advertisements have become an important element in electoral campaigning throughout the world. At the same time, concepts such as disinformation and manipulation have emerged as a global concern. Although these concepts are distinct from online political ads and data-driven electoral campaigning, they tend to share a trait related to valence, the intrinsic attractiveness or averseness of a message. Given this background, the paper examines online political ads using a dataset collected from Google's transparency reports. The examination is framed around the mid-2019 situation in Europe, including in particular the European Parliament elections. According to the results, based on sentiment analysis of the textual ads displayed via Google's advertisement machinery, (i) most of the political ads have expressed positive sentiments, although these vary greatly between (ii) European countries as well as across (iii) European political parties. In addition to these results, the paper contributes to the timely discussion about data-driven electoral campaigning and its relation to politics and democracy.
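As a toy illustration of valence scoring of ad texts (the chapter's actual sentiment method is not specified here; the word lists and example ads below are made up):

    POSITIVE = {"secure", "better", "stronger", "future", "together", "hope"}
    NEGATIVE = {"crisis", "threat", "fear", "corrupt", "failure"}

    def ad_sentiment(text):
        # Crude polarity score in [-1, 1] from word counts; real studies use richer models.
        words = [w.strip(".,!?").lower() for w in text.split()]
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

    print(ad_sentiment("A stronger, more secure future for Europe"))   # positive
    print(ad_sentiment("Stop the crisis, end the corrupt politics"))   # negative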
Chapter
Mental healthcare services are insufficient under the current circumstances due to the growing population with mental health issues and the lack of the mental health professionals, services, and programs that are needed. Traditional methods are often time-consuming, expensive, and not timely. At the same time, an increasing number of people are using social media to interact with others and to share their personal stories and reflections. In this study we examined whether online users' social media activities were influenced by their mental well-being. To carry out this research, we assessed Twitter activities of participants that reported high symptoms of depression and of those with lower or no symptoms of depression. Our results confirm the influence on their activities, in addition to other interesting insights. We believe these findings can be beneficial to mental healthcare providers, provided users' privacy is preserved.
Chapter
Both the International Education Organization (OIE) and UNESCO have stated that promoting collaborative activities is a key competence for sustainable development. This postulate focuses on collaboration with local and international networks. Along these lines, it is important to note that, in any teamwork, the members are people who interact while sharing objectives, rules, and deadlines linked to the activity. Given this reality, it is essential to promote study-team activities in higher education, where students can develop skills to solve problems in multidisciplinary groups. To support the process of generating efficient study teams, in this investigation we present a system capable of exploring the best alternatives for automatically organizing homogeneous study teams that favor the best performance. Our proposal uses a personalized genetic algorithm (GA) based on student learning styles and academic profile. The experimentation phase has yielded positive results compared to the self-organization method and the teacher-imposition method.
Chapter
Challenging operational tasks, such as complex, unexpected incidents and severe accidents, are characterised by an increase in operators' mental demands, stress-induced deterioration of cognitive capacity, and increased time pressure to resolve the situation. This combination can negatively affect the operator crew's performance. This paper describes the progress of a research project that models the stress and workload of 54 nuclear power plant operators during simulated incident and accident scenarios. Here, we demonstrate how an extensive empirical field study with psychophysiological assessments can be successfully performed in a simulator with free movement. We also describe the modelling approach used to examine the relationship between stress, workload, and performance, with moderating effects of operator role and the efficiency of abnormal and emergency operating procedure (OP) use. Even though some observations can already be made, the results of the study are, at this point, preliminary.
Article
Over 35% of Americans belong to racial minority groups. Racism targeting these individuals results in a range of harmful physical, psychological, and practical consequences. The present work aims to shed light on the current sense-making and support-seeking practices exhibited by targets of racism, as well as to identify the core needs and barriers that future socio-technical interventions could potentially address. The long-term goal of this work is to understand how CSCW researchers and designers could best support members of marginalized groups to make sense of and to seek support for experiences with racism. Narrative episode interviews with targets of racism revealed a number of key entry points for intervention. For example, participants' personal stories confirmed that uncertainty, both about the nature and consequences of the experience of racism, is a key motivator for support-seeking. In addition, despite the need for support, participants largely do not trust public forms of social media for support-seeking. We discuss how participants' accounts of the complex labor involved in determining who "gets it" in identifying potential supporters, and in navigating the complexities of trust and agency in sharing their experiences, present clear implications for the design of new socio-technical platforms for members of racial minority groups.
Conference Paper
Full-text available
Traditionally, robots have been stand-alone systems. In recent years, however, they have increasingly been connected to external knowledge resources through the Internet of Things (IoT). These robots are thus becoming part of the IoT and can realistically leverage Internet of Robotic Things (IoRT) technologies. IoRT can facilitate Human-Robot Interaction (HRI) at functional (commanding and programming) and social levels, as well as serve as a means for remote interaction. IoRT-HRI can cause privacy issues for humans, partly because robots can collect data using the IoT and move in the real world, and partly because robots can learn to read human social cues and adapt or correct their behavior accordingly. In this paper, we address the topic of privacy preservation for IoRT-HRI applications. The objective is to design a data release framework, called a Privacy Filter (PF), that can prevent an adversary from mining private information from the released data while keeping utility data. In the experiments, we test our framework on two publicly accessible datasets: MNIST (handwritten digits) and UCI-HAR (activity recognition from motion). Our experimental results on these datasets show that PF is highly effective in removing private information from the dataset while allowing utility data to be mined effectively.
Conference Paper
Data brokers such as Acxiom and Experian are in the business of collecting and selling data on people; the data they sell is commonly used to feed marketing as well as political campaigns. Despite the ongoing privacy debate, there is still very limited visibility into data collection by data brokers. Recently, however, online advertising services such as Facebook have begun to partner with data brokers, in order to add additional targeting features to their platform, providing avenues to gain insight into data broker information. In this paper, we leverage the Facebook advertising system, and its partnership with six data brokers across seven countries, in order to gain insight into the extent and accuracy of data collection by data brokers today. We find that a surprisingly large percentage of Facebook accounts (e.g., above 90% in the U.S.) are successfully linked to data broker information. Moreover, by running controlled ads to 183 crowdsourced U.S.-based volunteers, we find that at least 40% of data-broker-sourced user attributes are not at all accurate, that users can have widely varying fractions of inaccurate attributes, and that even important information such as financial information can have a high degree of inaccuracy. Overall, this paper provides the first fine-grained look into the extent and accuracy of data collection by offline data brokers, helping to inform the ongoing privacy debate.
Conference Paper
This paper proposes SEquential GAme NEtwork (SEGANE), a novel deep neural network (DNN) architecture for optimizing the performance of machine learning applications with multiple competing objectives. Specifically, SEGANE is evaluated in the context of data sanitization which aims to remove any pre-specified private information from the data in real time while keeping the relevant information used to improve the inference accuracy about the non-private information. In some settings, preserving private information and improving inference performance about non-private information are competing objectives. In such cases, SEGANE provides a sequential game framework and algorithmic tools to implement data sanitization schemes with flexible trade-off between these two objectives. We use two datasets: MNIST (hand-written digits) and IMDB (gender and age) to evaluate SEGANE. For MNIST, even numbers are considered private while numbers larger than 10 are considered non-private. For IMDB, in one setting, gender is considered private while age is non-private, and vice versa in another setting. Our experimental results on these datasets show that SEGANE is highly effective in removing private information from the dataset while allowing non-private data to be mined effectively.
Conference Paper
Preserving privacy of users is a key requirement of web-scale data mining applications and systems such as web search, recommender systems, crowdsourced platforms, and analytics applications, and has witnessed a renewed focus in light of recent data breaches and new regulations such as GDPR. In this tutorial, we first present an overview of privacy breaches over the last two decades and the lessons learned, key regulations and laws, and the evolution of privacy techniques leading to the differential privacy definition and techniques. Then, we focus on the application of privacy-preserving data mining techniques in practice, by presenting case studies such as Apple's differential privacy deployment for iOS / macOS, Google's RAPPOR, LinkedIn Salary, and Microsoft's differential privacy deployment for collecting Windows telemetry. We conclude with open problems and challenges for the data mining / machine learning community, based on our experiences in industry.
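A minimal sketch of the randomized-response idea that underlies deployments such as RAPPOR (illustrative parameters, not any vendor's actual implementation): each user reports a noisy bit, and the aggregator debiases the observed frequency.

    import random

    def randomized_response(truth: bool, p_truth: float, rng: random.Random) -> bool:
        # Report the true bit with probability p_truth, otherwise flip a fair coin.
        return truth if rng.random() < p_truth else (rng.random() < 0.5)

    def estimate_rate(reports, p_truth: float) -> float:
        # Debias the observed frequency to estimate the true population rate.
        observed = sum(reports) / len(reports)
        return (observed - (1 - p_truth) * 0.5) / p_truth

    rng = random.Random(0)
    true_bits = [rng.random() < 0.3 for _ in range(100000)]           # 30% hold the sensitive attribute
    reports = [randomized_response(b, 0.75, rng) for b in true_bits]  # each user sends a noisy bit
    print(round(estimate_rate(reports, 0.75), 3))                     # close to 0.30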
Article
Full-text available
Understanding the evolution of the user base as well as user engagement of online services is critical not only for the service operators but also for customers, investors, and users. While research works address this issue for online services such as Twitter, MySpace, or Google+, such detailed analysis is missing for Facebook, which is currently the largest online social network. This paper presents the first detailed study of the demographic and geographic composition and evolution of the user base and user engagement in Facebook over a period of three years. To this end, we have implemented a measurement methodology that leverages the marketing API of Facebook to retrieve actual information about the number of total users and the number of daily active users across 230 countries and age groups ranging between 13 and 65+. The analysis reveals that Facebook is still growing and geographically expanding. Moreover, the growth pattern is heterogeneous across age groups, genders, and geographical regions. In particular, from a demographic perspective, Facebook shows the lowest growth among adolescents. Gender-based analysis shows that growth among men is still higher than growth among women. Our geographical analysis reveals that while Facebook's growth is slower in Western countries, it presents the fastest growth in developing countries, mainly located in Africa and Central Asia; analyzing penetration in these countries also shows that they are at earlier stages of Facebook adoption. Leveraging external socioeconomic datasets, we also show that this heterogeneous growth can be characterized by indicators such as availability of and access to the Internet, Facebook popularity, and factors related to population growth and gender inequality.
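The measurement idea can be sketched as follows; note that the endpoint path, parameter names, and response handling below are placeholders standing in for Facebook's marketing API, not a verified description of it.

    import json
    import requests

    def audience_estimate(api_base, access_token, country, age_min, age_max, gender):
        # Hypothetical call pattern: ask the ad platform how many users match a
        # demographic targeting spec (country, age range, gender).
        targeting = {
            "geo_locations": {"countries": [country]},
            "age_min": age_min,
            "age_max": age_max,
            "genders": [gender],
        }
        resp = requests.get(
            f"{api_base}/reachestimate",   # placeholder endpoint, not a documented URL
            params={"targeting_spec": json.dumps(targeting), "access_token": access_token},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    # Looping such calls over 230 countries and all age/gender groups, repeatedly over
    # time, would yield the kind of demographic snapshots the paper tracks.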
Conference Paper
Full-text available
We continue a line of research initiated in [10,11] on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called true answer is the result of applying f to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user. Previous work focused on the case of noisy sums, in which f = Σ_i g(x_i), where x_i denotes the i-th row of the database and g maps database rows to [0,1]. We extend the study to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f. Roughly speaking, this is the amount by which any single argument to f can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case. The first step is a very clean characterization of privacy in terms of indistinguishability of transcripts. Additionally, we obtain separation results showing the increased value of interactive sanitization mechanisms over non-interactive ones.
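In modern notation (mine, summarizing the abstract), the calibration result reads roughly as:

    \Delta f \;=\; \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1,
    \qquad
    \mathcal{K}(D) \;=\; f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right),

where D ~ D' ranges over databases differing in a single row; the perturbed answer K(D) then satisfies ε-indistinguishability, i.e. ε-differential privacy in later terminology.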
Article
Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher's anonymity, but it was not much of a shield. No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from "numb fingers" to "60 single men" to "dog that urinates on everything." And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for "landscapers in Lilburn, Ga," several people with the last name Arnold and "homes sold in shadow lake subdivision gwinnett county georgia." It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments and loves her three dogs. "Those are my searches," she said, after a reporter read part of the list to her. AOL removed the search data from its site over the weekend and apologized for its release, saying it was an unauthorized move by a team that had hoped it would benefit academic researchers. But the detailed records of searches conducted by Ms. Arnold and 657,000 other Americans, copies of which continue to circulate online, underscore how much people unintentionally reveal about themselves when they use search engines — and how risky it can be for companies like AOL, Google and Yahoo to compile such data. Those risks have long pitted privacy advocates against online marketers and other Internet companies seeking to profit from the Internet's unique ability to track the comings and goings of users, allowing for more focused and therefore more lucrative advertising. But the unintended consequences of all that data being compiled, stored and cross-linked are what Marc Rotenberg, the executive director of the Electronic Privacy Information Center, a privacy rights group in Washington, called "a ticking privacy time bomb." Mr. Rotenberg pointed to Google's own joust earlier this year with the Justice Department over a subpoena for some of its search data. The company successfully fended off the agency's demand in court, but several other search companies, including AOL, complied. The Justice Department sought the information to help it defend a challenge to a law that is meant to shield children from sexually explicit material.
Book
This report is based on the findings of a daily tracking survey on Americans' use of the internet. The results in this report are based on data from telephone interviews conducted by Princeton Survey Research Associates International between August 18 and September 14, 2009, among a total sample of 2,253 adults aged 18 and older, including 560 cell phone interviews. Interviews were conducted in both English (n=2,179) and Spanish (n=74). For results based on the total sample, one can say with 95% confidence that the error attributable to sampling and other random effects is plus or minus 2.3 percentage points. For results based on internet users (n=1,698), the margin of sampling error is plus or minus 2.7 percentage points. In addition to sampling error, question wording and practical difficulties in conducting telephone surveys may introduce some error or bias into the findings of opinion polls.
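The quoted margins are consistent with the standard 95% confidence bound for a proportion, inflated somewhat by weighting and design effects (my reading, not stated in the report):

    \mathrm{MOE}_{95\%} \;\approx\; 1.96\,\sqrt{\frac{p(1-p)}{n}}
    \;\le\; 1.96\,\sqrt{\frac{0.25}{2253}} \approx 2.1 \text{ points (total sample)},
    \qquad
    1.96\,\sqrt{\frac{0.25}{1698}} \approx 2.4 \text{ points (internet users)}.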
Article
Despite previous examinations of business actions, consumer reactions, and regulatory efforts, there has been no direct comparison of consumer and marketer expectations for establishing and respecting privacy boundaries. This study directly compares consumer segments' and marketers' expectations for privacy boundaries that regulate marketers' access to consumers and their information. Using data from a national online survey, the authors compare three consumer segments' preferences regarding the boundaries for the use of eight information technologies (cookies, biometrics, loyalty cards, radio frequency identification, text messaging, pop-up advertisements, telemarketing, and spam) with survey results of marketing managers and database vendors for the same set of questions. The results identify consumer segments and technologies for which consumer expectations differ from marketers and, thus, for which more regulatory and public policy attention and research scholarship is needed.
Article
This nationally representative telephone (wire-line and cell phone) survey explores Americans' opinions about behavioral targeting by marketers, a controversial issue currently before government policymakers. Behavioral targeting involves two types of activities: following users' actions and then tailoring advertisements for the users based on those actions. While privacy advocates have lambasted behavioral targeting for tracking and labeling people in ways they do not know or understand, marketers have defended the practice by insisting it gives Americans what they want: advertisements and other forms of content that are as relevant to their lives as possible. Contrary to what many marketers claim, most adult Americans (66%) do not want marketers to tailor advertisements to their interests. Moreover, when Americans are informed of three common ways that marketers gather data about people in order to tailor ads, even higher percentages (between 73% and 86%) say they would not want such advertising. Even among young adults, whom advertisers often portray as caring little about information privacy, more than half (55%) of 18- to 24-year-olds do not want tailored advertising. And contrary to consistent assertions of marketers, young adults have as strong an aversion to being followed across websites and offline (for example, in stores) as do older adults. This survey finds that Americans want openness with marketers. If marketers want to continue to use various forms of behavioral targeting in their interactions with Americans, they must work with policymakers to open up the process so that individuals can learn exactly how their information is being collected and used, and then exercise control over their data. We offer specific proposals in this direction. An overarching one is for marketers to implement a regime of information respect toward the public rather than to treat them as objects from which they can take information in order to optimally persuade them.
Conference Paper
We investigate the degree to which modern web browsers are subject to "device fingerprinting" via the version and configuration information that they will transmit to websites upon request. We implemented one possible fingerprinting algorithm, and collected these fingerprints from a large sample of browsers that visited our test site, panopticlick.eff.org. We observe that the distribution of our fingerprint contains at least 18.1 bits of entropy, meaning that if we pick a browser at random, at best we expect that only one in 286,777 other browsers will share its fingerprint. Among browsers that support Flash or Java, the situation is worse, with the average browser carrying at least 18.8 bits of identifying information. 94.2% of browsers with Flash or Java were unique in our sample. By observing returning visitors, we estimate how rapidly browser fingerprints might change over time. In our sample, fingerprints changed quite rapidly, but even a simple heuristic was usually able to guess when a fingerprint was an "upgraded" version of a previously observed browser's fingerprint, with 99.1% of guesses correct and a false positive rate of only 0.86%. We discuss what privacy threat browser fingerprinting poses in practice, and what countermeasures may be appropriate to prevent it. There is a tradeoff between protection against fingerprintability and certain kinds of debuggability, which in current browsers is weighted heavily against privacy. Paradoxically, anti-fingerprinting privacy technologies can be self-defeating if they are not used by a sufficient number of people; we show that some privacy measures currently fall victim to this paradox, but others do not.
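The headline figures are two views of the same quantity; a quick check (assuming only the numbers quoted in the abstract):

    import math

    one_in = 286777                     # "only one in 286,777 other browsers"
    print(round(math.log2(one_in), 2))  # about 18.13 bits of identifying information

    bits = 18.8                         # average for browsers with Flash or Java
    print(round(2 ** bits))             # about 456,000 equally likely configurations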
Conference Paper
Online behavioral advertising (OBA) refers to the practice of tracking users across web sites in order to infer user interests and preferences. These interests and preferences are then used for selecting ads to present to the user. There is great concern that behavioral advertising in its present form infringes on user privacy. The resulting public debate — which includes consumer advocacy organizations, professional associations, and government agencies — is premised on the notion that OBA and privacy are inherently in conflict. In this paper we propose a practical architecture that enables targeting without compromising user privacy. Behavioral profiling and targeting in our system takes place in the user's browser. We discuss the effectiveness of the system as well as potential social engineering and web-based attacks on the architecture. One complication is billing; ad-networks must bill the correct advertiser without knowing which ad was displayed to the user. We propose an efficient cryptographic billing system that directly solves the problem. We implemented the core targeting system as a Firefox extension and report on its effectiveness.
Conference Paper
Online advertising supports many Internet services, such as search, email, and social networks. At the same time, there are widespread concerns about the privacy loss associated with user targeting. Yet, very little is publicly known about how ad networks operate, especially with regard to how they use user information to target users. This paper takes a first principled look at measurement methodologies for ad networks. It proposes new metrics that are robust to the high levels of noise inherent in ad distribution, identifies measurement pitfalls and artifacts, and provides mitigation strategies. It also presents an analysis of how three different classes of advertising -- search, contextual, and social networks, use user profile information today.
Conference Paper
According to a famous study (10) of the 1990 census data, 87% of the US population can be uniquely identified by gender, ZIP code and full date of birth. This short paper revisits the uniqueness of simple demographics in the US population based on the most recent census data (the 2000 census). We offer a detailed, comprehensive and up-to-date picture of the threat to privacy posed by the disclosure of simple demographic information. Our results generally agree with the findings of (10), although we find that disclosing one's gender, ZIP code and full date of birth allows for unique identification of fewer individuals (63% of the US population) than reported in (10). We hope that our study will be a useful reference for privacy researchers who need simple estimates of the comparative threat of disclosing various demographic data.
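A minimal sketch of the uniqueness computation on microdata (toy records, not census data): count how many individuals share each (gender, ZIP, date-of-birth) combination and report the fraction whose combination is unique.

    from collections import Counter

    def fraction_unique(records):
        # Fraction of individuals whose (gender, ZIP, date-of-birth) combination is unique.
        counts = Counter((r["gender"], r["zip"], r["dob"]) for r in records)
        unique_people = sum(1 for r in records if counts[(r["gender"], r["zip"], r["dob"])] == 1)
        return unique_people / len(records)

    toy = [
        {"gender": "F", "zip": "30047", "dob": "1944-07-21"},
        {"gender": "M", "zip": "30047", "dob": "1980-01-01"},
        {"gender": "M", "zip": "30047", "dob": "1980-01-01"},  # shares its quasi-identifiers with the previous row
    ]
    print(fraction_unique(toy))   # 1 of 3 records is unique -> 0.333...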
Conference Paper
For purposes of this paper, we define "Personally identifiable information" (PII) as information which can be used to distinguish or trace an individual's identity either alone or when combined with other information that is linkable to a specific individual. The popularity of Online Social Networks (OSN) has accelerated the appearance of vast amounts of personal information on the Internet. Our research shows that it is possible for third-parties to link PII, which is leaked via OSNs, with user actions both within OSN sites and elsewhere on non-OSN sites. We refer to this ability to link PII and combine it with other information as "leakage". We have identified multiple ways by which such leakage occurs and discuss measures to prevent it.
Conference Paper
Behavioral Targeting (BT) is a technique used by online advertisers to increase the effectiveness of their campaigns, and is playing an increasingly important role in the online advertising market. However, it remains underexplored in academia how much BT can truly help online advertising in search engines. In this paper we provide an empirical study on the click-through log of advertisements collected from a commercial search engine. From the experimental results over a period of seven days, we draw three important conclusions: (1) users who clicked the same ad truly have similar behaviors on the Web; (2) the Click-Through Rate (CTR) of an ad can on average be improved by as much as 670% by properly segmenting users for behaviorally targeted advertising in sponsored search; (3) using short-term user behaviors to represent users is more effective than using long-term user behaviors for BT. We conducted statistical t-tests, which verified that all conclusions drawn in the paper are statistically significant. To the best of our knowledge, this work is the first empirical study of BT on the click-through log of real-world ads.
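For concreteness, CTR and the reported lift are computed as below (the click and impression counts are made up to reproduce a 670% lift, not taken from the paper):

    def ctr(clicks, impressions):
        return clicks / impressions

    baseline = ctr(clicks=200, impressions=100000)    # ad shown to everyone: CTR = 0.2%
    segmented = ctr(clicks=154, impressions=10000)    # ad shown to a behaviorally similar segment: CTR = 1.54%

    lift = (segmented - baseline) / baseline
    print(f"CTR lift: {lift:.0%}")                    # 670% with these illustrative numbers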
Article
In the information realm, loss of privacy is usually associated with failure to control access to information, to control the flow of information, or to control the purposes for which information is employed. Differential privacy arose in a context in which ensuring privacy is a challenge even if all these control problems are solved: privacy-preserving statistical analysis of data. The problem of statistical disclosure control (revealing accurate statistics about a set of respondents while preserving the privacy of individuals) has a venerable history, with an extensive literature spanning statistics, theoretical computer science, security, databases, and cryptography (see, for example, the excellent survey of Adam and Wortmann [1], the discussion of related work in Blum et al. [2], and the Journal of Official Statistics issue dedicated to confidentiality and disclosure control).
Conference Paper
In this paper we study the privacy preservation properties of a specific technique for query log anonymization: token-based hashing. In this approach, each query is tokenized, and then a secure hash function is applied to each token. We show that statistical techniques may be applied to partially compromise the anonymization. We then analyze the specific risks that arise from these partial compromises, focused on the revelation of identity from unambiguous names, addresses, and so forth, and the revelation of facts associated with an identity that are deemed to be highly sensitive. Our goal in this work is twofold: to show that token-based hashing is unsuitable for anonymization, and to present a concrete analysis of specific techniques that may be effective in breaching privacy, against which other anonymization schemes should be measured.
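A toy illustration of why token hashing leaks (not the paper's exact statistical machinery): token frequencies survive hashing, so an attacker can align the most frequent hashes with a public ranking of common query terms.

    import hashlib
    from collections import Counter

    def hash_token(token, salt=b"secret"):
        return hashlib.sha256(salt + token.encode()).hexdigest()

    # Anonymized log: every token replaced by its salted hash.
    queries = ["cheap flights", "cheap hotels", "flights to boston", "cheap flights"]
    hashed_log = [[hash_token(t) for t in q.split()] for q in queries]

    # Attacker side: rank hashed tokens by frequency and align them with a public
    # reference ranking of query terms (here a toy stand-in).
    hash_freq = Counter(h for q in hashed_log for h in q).most_common()
    reference_rank = ["cheap", "flights", "hotels", "to", "boston"]

    guesses = {h: word for (h, _), word in zip(hash_freq, reference_rank)}
    print([[guesses.get(h, "?") for h in q] for q in hashed_log])   # plaintext queries recovered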
Conference Paper
In a social network, nodes correspond to people or other social entities, and edges correspond to social links between them. In an effort to preserve privacy, the practice of anonymization replaces names with meaningless unique identifiers. We describe a family of attacks such that even from a single anonymized copy of a social network, it is possible for an adversary to learn whether edges exist or not between specific targeted pairs of nodes.
Article
This paper describes the winning entry to the IJCNN 2011 Social Network Challenge run by Kaggle.com. The goal of the contest was to promote research on real-world link prediction, and the dataset was a graph obtained by crawling the popular Flickr social photo sharing website, with user identities scrubbed. By de-anonymizing much of the competition test set using our own Flickr crawl, we were able to effectively game the competition. Our attack represents a new application of de-anonymization to gaming machine learning contests, suggesting changes in how future competitions should be run. We introduce a new simulated annealing-based weighted graph matching algorithm for the seeding step of de-anonymization. We also show how to combine de-anonymization with link prediction (the latter is required to achieve good performance on the portion of the test set not de-anonymized), for example by training the predictor on the de-anonymized portion of the test set, and combining probabilistic predictions from de-anonymization and link prediction.
Conference Paper
We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary's background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world's largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
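A toy version of the general linkage idea (not the paper's scoring function): score every anonymized record against the adversary's approximate auxiliary knowledge and keep the best match.

    def match_score(aux, record, date_tolerance_days=14):
        # Count auxiliary (movie, rating, date) observations consistent with a candidate record.
        score = 0
        for movie, (rating, day) in aux.items():
            if movie in record:
                r, d = record[movie]
                if abs(r - rating) <= 1 and abs(d - day) <= date_tolerance_days:
                    score += 1
        return score

    # Anonymized records: subscriber id -> {movie: (rating, day-of-year)}.
    records = {
        "u17": {"Brazil": (5, 40), "Fight Club": (4, 41), "Shrek": (3, 200)},
        "u42": {"Brazil": (2, 300), "Titanic": (5, 10)},
    }
    # Adversary knows a few approximate ratings and dates for the target (e.g. from public posts).
    aux = {"Brazil": (5, 38), "Fight Club": (5, 45)}

    best = max(records, key=lambda uid: match_score(aux, records[uid]))
    print(best)   # "u17" is re-identified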
Conference Paper
We study the role that privacy-preserving algorithms, which prevent the leakage of specific information about participants, can play in the design of mechanisms for strategic agents, which must encourage players to honestly report information. Specifically, we show that the recent notion of differential privacy, in addition to its own intrinsic virtue, can ensure that participants have limited effect on the outcome of the mechanism, and as a consequence have limited incentive to lie. More precisely, mechanisms with differential privacy make truthful reporting an approximately dominant strategy under arbitrary player utility functions, are automatically resilient to coalitions, and easily allow repeatability. We study several special cases of the unlimited supply auction problem, providing new results for digital goods auctions, attribute auctions, and auctions with arbitrary structural constraints on the prices. As an important prelude to developing a privacy-preserving auction mechanism, we introduce and study a generalization of previous privacy work that accommodates the high sensitivity of the auction setting, where a single participant may dramatically alter the optimal fixed price, and a slight change in the offered price may take the revenue from optimal to zero.