Arvind Narayanan's research while affiliated with Princeton University and other places

Publications (84)

Preprint
Full-text available
Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML method...
Article
Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields wher...
Article
We collect and analyze a corpus of more than 300,000 political emails sent during the 2020 US election cycle. These emails were sent by over 3000 political campaigns and organizations including federal and state level candidates as well as Political Action Committees. We find that in this corpus, manipulative tactics—techniques using some level of...
Chapter
Blockchain analysis is essential for understanding how cryptocurrencies like Bitcoin are used in practice, and address clustering is a cornerstone of blockchain analysis. However, current techniques rely on heuristics that have not been rigorously evaluated or optimized. In this paper, we tackle several challenges of change address identification a...
Preprint
The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is inde...
Article
Full-text available
Machine learning models are known to perpetuate and even amplify the biases present in the data. However, these data biases frequently do not become apparent until after the models are deployed. Our work tackles this issue and enables the preemptive analysis of large-scale datasets. REvealing VIsual biaSEs (REVISE) is a tool that assists in the inv...
Preprint
Full-text available
Online platforms play an increasingly important role in shaping democracy by influencing the distribution of political information to the electorate. In recent years, political campaigns have spent heavily on the platforms' algorithmic tools to target voters with online advertising. While the public interest in understanding how platforms perform t...
Chapter
The use of big data research methods has grown tremendously in both academia and industry. One of the most fundamental rules of responsible big data research is the steadfast recognition that most data represent or impact people. For some projects, sharing data is an expectation of the human participants involved and thus a key part of ethical rese...
Preprint
Full-text available
Recent concerns that machine learning (ML) may be facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences, as well as calls for greater integration of statistical approaches to...
Preprint
Full-text available
Concerns about privacy, bias, and harmful applications have shone a light on the ethics of machine learning datasets, even leading to the retraction of prominent datasets including DukeMTMC, MS-Celeb-1M, TinyImages, and VGGFace2. In response, the machine learning community has called for higher ethical standards, transparency efforts, and technical...
Preprint
Full-text available
Simulation can enable the study of recommender system (RS) evolution while circumventing many of the issues of empirical longitudinal studies; simulations are comparatively easier to implement, are highly controlled, and pose no ethical risk to human participants. How simulation can best contribute to scientific insight about RS alongside qualitati...
Preprint
Full-text available
Simulation has emerged as a popular method to study the long-term societal consequences of recommender systems. This approach allows researchers to specify their theoretical model explicitly and observe the evolution of system-level outcomes over time. However, performing simulation-based studies often requires researchers to build their own simula...
Preprint
Blockchain analysis is essential for understanding how cryptocurrencies like Bitcoin are used in practice, and address clustering is a cornerstone of blockchain analysis. However, current techniques rely on heuristics that have not been rigorously evaluated or optimized. In this paper, we tackle several challenges of change address identification a...
Preprint
Full-text available
Universities have been forced to rely on remote educational technology to facilitate the rapid shift to online learning. In doing so, they acquire new risks of security vulnerabilities and privacy violations. To help universities navigate this landscape, we develop a model that describes the actors, incentives, and risks, informed by surveying 105...
Chapter
Machine learning models are known to perpetuate and even amplify the biases present in the data. However, these data biases frequently do not become apparent until after the models are deployed. To tackle this issue and to enable the preemptive analysis of large-scale dataset, we present our tool. REVISE (REvealing VIsual biaSEs) is a tool that ass...
Article
Full-text available
We investigate data exfiltration by third-party scripts directly embedded on web pages. Specifically, we study three attacks: misuse of browsers’ internal login managers, social data exfiltration, and whole-DOM exfiltration. Although the possibility of these attacks was well known, we provide the first empirical evidence based on measurements of 30...
Preprint
Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. So far, prior research has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have...
Article
Dark patterns are an abuse of the tremendous power that designers hold in their hands. As public awareness of dark patterns grows, so does the potential fallout. Journalists and academics have been scrutinizing dark patterns, and the backlash from these exposures can destroy brand reputations and bring companies under the lenses of regulators. Desi...
Preprint
Machine learning models are known to perpetuate the biases present in the data, but oftentimes these biases aren't known until after the models are deployed. We present the Visual Bias Extraction (ViBE) Tool that assists in the investigation of a visual dataset, surfacing potential dataset biases along three dimensions: (1) object-based, (2) gender...
Article
Dark patterns are user interface design choices that benefit an online service by coercing, steering, or deceiving users into making unintended and potentially harmful decisions. We present automated techniques that enable experts to identify dark patterns on a large set of websites. Using these techniques, we study shopping websites, which often u...
Conference Paper
The number of Internet-connected TV devices has grown significantly in recent years, especially Over-the-Top ("OTT") streaming devices, such as Roku TV and Amazon Fire TV. OTT devices offer an alternative to multi-channel television subscription services, and are often monetized through behavioral advertising. To shed light on the privacy practices...
Preprint
Dark patterns are user interface design choices that benefit an online service by coercing, steering, or deceiving users into making unintended and potentially harmful decisions. We present automated techniques that enable experts to identify dark patterns on a large set of websites. Using these techniques, we study shopping websites, which often u...
Article
Full-text available
The proliferation of smart home Internet of things (IoT) devices presents unprecedented challenges for preserving privacy within the home. In this paper, we demonstrate that a passive network observer (e.g., an Internet service provider) can infer private in-home activities by analyzing Internet traffic from commercially available smart home device...
Conference Paper
The security of most existing cryptocurrencies is based on a concept called Proof-of-Work, in which users must solve a computationally hard cryptopuzzle to authorize transactions ("one unit of computation, one vote''). This leads to enormous expenditure on hardware and electricity in order to collect the rewards associated with transaction authoriz...
Preprint
Full-text available
A bstract Consumer genetics databases hold dense genotypes of millions of people, and the number is growing quickly [1] [2]. In 2018, law enforcement agencies began using such databases to identify anonymous DNA via long-range familial searches. We show that this technique is far more powerful if combined with a genealogical database of the type co...
Preprint
The proliferation of smart home Internet of Things (IoT) devices presents unprecedented challenges for preserving privacy within the home. In this paper, we demonstrate that a passive network observer (e.g., an Internet service provider) can infer private in-home activities by analyzing Internet traffic from commercially available smart home device...
Preprint
The security of most existing cryptocurrencies is based on a concept called Proof-of-Work, in which users must solve a computationally hard cryptopuzzle to authorize transactions (`one unit of computation, one vote'). This leads to enormous expenditure on hardware and electricity in order to collect the rewards associated with transaction authoriza...
Preprint
Users are often ill-equipped to identify online advertisements that masquerade as non-advertising content. Because such hidden advertisements can mislead and harm users, the Federal Trade Commission (FTC) requires all advertising content to be adequately disclosed. In this paper, we examined disclosures within affiliate marketing, an endorsement-ba...
Conference Paper
In this paper, we present two web-based attacks against local IoT devices that any malicious web page or third-party script can perform, even when the devices are behind NATs. In our attack scenario, a victim visits the attacker's website, which contains a malicious script that communicates with IoT devices on the local network that have open HTTP...
Article
Full-text available
Monero is a privacy-centric cryptocurrency that allows users to obscure their transactions by including chaff coins, called “mixins,” along with the actual coins they spend. In this paper, we empirically evaluate two weaknesses in Monero’s mixin sampling strategy. First, about 62% of transaction inputs with one or more mixins are vulnerable to “cha...
Conference Paper
Blockchain technology is assembled from pieces that have long pedigrees in the academic literature, such as linked timestamping, consensus, and proof of work. In this tutorial, I'll begin by summarizing these components and how they fit together in Bitcoin's blockchain design. Then I'll present abstract models of blockchains; such abstractions help...
Article
While disclosures relating to various forms of Internet advertising are well established and follow specific formats, endorsement marketing disclosures are often open-ended in nature and written by individual publishers. Because such marketing often appears as part of publishers' actual content, ensuring that it is adequately disclosed is critical...
Article
Full-text available
We show that the simple act of viewing emails contains privacy pitfalls for the unwary. We assembled a corpus of commercial mailing-list emails, and find a network of hundreds of third parties that track email recipients via methods such as embedded pixels. About 30% of emails leak the recipient’s email address to one or more of these third parties...
Article
IF YOU HAVE read about bitcoin in the press and have some familiarity with academic research in the field of cryptography, you might reasonably come away with the following impression: Several decades' worth of research on digital cash, beginning with David Chaum,10,12 did not lead to commercial success because it required a centralized, bank-like...
Article
Analysis of blockchain data is useful for both scientific research and commercial applications. We present BlockSci, an open-source software platform for blockchain analysis. BlockSci is versatile in its support for different blockchains and analysis tasks. It incorporates an in-memory, analytical (rather than transactional) database, making it sev...
Article
The growing market for smart home IoT devices promises new conveniences for consumers while presenting new challenges for preserving privacy within the home. Many smart home devices have always-on sensors that capture users' offline activities in their living spaces and transmit information about these activities on the Internet. In this paper, we...
Article
Full-text available
We show how third-party web trackers can deanonymize users of cryptocurrencies. We present two distinct but complementary attacks. On most shopping websites, third party trackers receive information about user purchases for purposes of advertising and analytics. We show that, if the user pays using a cryptocurrency, trackers typically possess enoug...
Article
We’ve seen repeatedly that ideas in the research literature can be gradually forgotten or lie unappreciated, especially if they are ahead of their time, even in popular areas of research. Both practitioners and academics would do well to revisit old ideas to glean insights for present systems. Bitcoin was unusual and successful not because it was o...
Article
In the cryptographic currency Bitcoin, all transactions are recorded in the blockchain - a public, global, and immutable ledger. Because transactions are public, Bitcoin and its users employ obfuscation to maintain a degree of financial privacy. Critically, and in contrast to typical uses of obfuscation, in Bitcoin obfuscation is not aimed against...
Article
We present a systematic study of ad blocking - and the associated "arms race" - as a security problem. We model ad blocking as a state space with four states and six state transitions, which correspond to techniques that can be deployed by either publishers or ad blockers. We argue that this is a complete model of the system. We propose several new...
Chapter
When you browse the web, hidden “third parties” collect a large amount of data about your behavior. This data feeds algorithms to target ads to you, tailor your news recommendations, and sometimes vary prices of online products. The network of trackers comprises hundreds of entities, but consumers have little awareness of its pervasiveness and soph...
Article
Full-text available
Machine learning is a means to derive artificial intelligence by discovering patterns in existing data. Here we show that applying machine learning to ordinary human language results in human-like semantic biases. We replicate a spectrum of known biases, as measured by the Implicit Association Test, using a widely used, purely statistical machine-l...
Article
Monero is a privacy-centric cryptocurrency that allows users to obscure their transaction graph by including chaff coins, called "mixins," along with the actual coins they spend. In this report, we empirically evaluate two weak- nesses in Monero's mixin sampling strategy. First, about 62% of transaction inputs with one or more mixins are vulnerable...
Article
Full-text available
The use of big data research methods has grown tremendously over the past five years in both academia and industry. As the size and complexity of available datasets has grown, so too have the ethical questions raised by big data research. These questions become increasingly urgent as data and research agendas move well beyond those typical of the c...
Article
First, Arvind Narayanan and Andrew Miller, co-authors of the increasingly popular open-access Princeton Bitcoin textbook, provide an overview of ongoing research in cryptocurrencies. Second, Song Han provides an overview of hardware trends related to another long-studied academic problem that has recently seen an explosion in popularity: deep learn...
Conference Paper
Bitcoin provides two incentives for miners: block rewards and transaction fees. The former accounts for the vast majority of miner revenues at the beginning of the system, but it is expected to transition to the latter as the block rewards dwindle. There has been an implicit belief that whether miners are paid by block rewards or transaction fees d...
Conference Paper
We present the largest and most detailed measurement of online tracking conducted to date, based on a crawl of the top 1 million websites. We make 15 types of measurements on each site, including stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and the exchange of tracking data between diff...
Article
Full-text available
Machines learn what people know implicitly AlphaGo has demonstrated that a machine can learn how to do things that people spend many years of concentrated study learning, and it can rapidly learn how to do them better than any human can. Caliskan et al. now show that machines can learn word associations from written texts and that these association...
Conference Paper
While threshold signature schemes have been presented before, there has never been an optimal threshold signature algorithm for DSA. The properties of DSA make it quite challenging to build a threshold version. In this paper, we present a threshold DSA scheme that is efficient and optimal. We also present a compelling application to use our scheme:...
Chapter
Once released to the public, data cannot be taken back. As time passes, data analytic techniques improve and additional datasets become public that can reveal information about the original data. It follows that released data will get increasingly vulnerable to re-identification—unless methods with provable privacy properties are used for the data...
Article
The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. Previous work has examined attribution of authors from both source code and compiled binaries, and found that while source code can be attributed with very high accuracy, the attribution of executable bina...
Article
Bit coin has emerged as the most successful cryptographic currency in history. Within two years of its quiet launch in 2009, Bit coin grew to comprise billions of dollars of economic value despite only cursory analysis of the system's design. Since then a growing literature has identified hidden-but-important properties of the system, discovered at...
Conference Paper
We study the ability of a passive eavesdropper to leverage "third-party" HTTP tracking cookies for mass surveillance. If two web pages embed the same tracker which tags the browser with a unique cookie, then the adversary can link visits to those pages from the same user (i.e., browser instance) even if the user's IP address varies. Further, many p...
Article
We present the first large-scale studies of three advanced web tracking mechanisms - canvas fingerprinting, evercookies and use of "cookie syncing" in conjunction with evercookies. Canvas fingerprinting, a recently developed form of browser fingerprinting, has not previously been reported in the wild; our results show that over 5% of the top 100,00...
Conference Paper
We propose Mixcoin, a protocol to facilitate anonymous payments in Bitcoin and similar cryptocurrencies. We build on the emergent phenomenon of currency mixes, adding an accountability mechanism to expose theft. We demonstrate that incentives of mixes and clients can be aligned to ensure that rational mixes will not steal. Our scheme is efficient a...
Article
Embedding coverage of ethics in software engineering courses would help students draw strength and wisdom from dialogue with other future members of their profession. Without a sense of professional ethics, individuals might justify to themselves conduct that would be much more difficult to justify in front of others. Additionally, professional eth...
Article
Despite privacy-preserving cryptography technologies' potential, they've largely failed to find commercial adoption. Reasons include people's unawareness of privacy-preserving cryptography, developers' lack of expertise, the field's complexity, economic constraints, and trust issues. View part 1 of this article (from the March/April 2013 issue) her...
Conference Paper
Perceptual, "context-aware" applications that observe their environment and interact with users via cameras and other sensors are becoming ubiquitous on personal computers, mobile phones, gaming platforms, household robots, and augmented-reality devices. This raises new privacy risks. We describe the design and implementation of DARKLY, a practical...
Article
One way to use cryptography for privacy is to tweak various systems to be privacy-preserving. But the more radical cypherpunk movement sought to wield crypto as a weapon of freedom, autonomy, and privacy that would fundamentally and inexorably reshape social, economic, and political power structures. This installment of On the Horizon primarily exa...
Article
While the Internet was conceived as a decentralized network, the most widely used web applications today tend toward centralization. Control increasingly rests with centralized service providers who, as a consequence, have also amassed unprecedented amounts of data about the behaviors and personalities of individuals. Developers, regulators, and co...
Conference Paper
Many commercial websites use recommender systems to help customers locate products and content. Modern recommenders are based on collaborative filtering: they use patterns learned from users' behavior to make recommendations, usually in the form of related-items lists. The scale and complexity of these systems, along with the fact that their output...
Article
This paper describes the winning entry to the IJCNN 2011 Social Network Challenge run by Kaggle.com. The goal of the contest was to promote research on real-world link prediction, and the dataset was a graph obtained by crawling the popular Flickr social photo sharing website, with user identities scrubbed. By de-anonymizing much of the competition...
Conference Paper
Online behavioral advertising (OBA) refers to the practice of tracking users across web sites in order to infer user interests and preferences. These interests and preferences are then used for selecting ads to present to the user. There is great concern that behavioral advertising in its present form infringes on user privacy. The resulting public...
Article
Developing effective privacy protection technologies is a critical challenge for security and privacy research as the amount and variety of data collected about individuals increase exponentially.
Article
Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc. We present a framework for analyzing privacy and anonymity in...
Conference Paper
We present a new class of statistical de- anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary's background knowledge. We apply our de-anonymization methodology to th...
Chapter
Obfuscation, when used as a technical term, refers to hiding information “in plain sight” inside computer code or digital data. The history of obfuscation in modern computing can be traced to two events that took place in 1976. The first was the publication of Diffie and Hellman’s seminal paper on public-key cryptography [DH76]. This paper is famou...
Article
We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary's background knowledge. We apply our de-anonymization methodology to the...
Article
We study the problem of circuit obfuscation, i.e., transforming the circuit in a way that hides everything except its input-output behavior. Barak et al. showed that a universal obfuscator that obfuscates every circuit class cannot exist, leaving open the possibility of special-purpose obfuscators. Known positive results for obfuscation are limited...
Article
largest online movie rental service—publicly released a dataset containing movie ratings of 500,000 Netflix subscribers. The dataset is intended to be anonymous, and all personally identifying information has been removed. We demonstrate that an attacker who knows only a little bit about an individual subscriber can easily identify this subscriber’...
Conference Paper
We investigate whether it is possible to encrypt a database and then give it away in such a form that users can still access it, but only in a restricted way. In contrast to conventional privacy mechanisms that aim to prevent any access to individual records, we aim to restrict the set of queries that can be feasibly evaluated on the encrypted data...
Conference Paper
Human-memorable passwords are a mainstay of computer security. To decrease vulnerability of passwords to brute-force dictionary attacks, many organizations enforce complicated password-creation rules and require that passwords include numerals and special characters. We demonstrate that as long as passwords remain human-memorable, they are vulnerab...
Article
Online behavioral advertising (OBA) refers to the practice of tracking users across web sites in order to infer user interests and preferences. These interests and preferences are then used for selecting ads to present to the user. There is great concern that behavioral advertising in its present form infringes on user privacy. The resulting public...

Citations

... Separately, the adoption of benchmarking best practices from predictive modeling fields, including machine learning (Weber et al., 2019;Mangul et al., 2019;Mitchell et al., 2019;Kapoor and Narayanan, 2023), can help facilitate progress in explanatory modeling. The following list describes three important examples of these practices: ...
... At the extreme, some of these forms of benefit-such as the economic reward of selling AI systems-are consistent with the proliferation of AI applications that don't actually perform their stated task and do not provide a direct benefit to anyone impacted by their operation (Raji et al., 2022). This prospect echoes recent criticisms that frameworks for ethical and responsible AI are too centrally focused on the values and interests of technology creators (Birhane et al., 2022) and on the way that relevant moral values are instantiated within technologies rather than on the larger impacts of those technologies (Wang et al., 2022;Laufer et al., 2022), including questions of authority and power dynamics, human sovereignty and autonomy, and democratic values like liberty and freedom (Washington and Kuo, 2020;Barabas et al., 2020). * Authors contributed equally to this work. ...
... For example, Mathur et al. use structural topic modeling to study manipulative tactics in a corpus of U.S. political campaign emails from the 2020 election cycle. One of their findings is that the median active sender of campaign emails uses "sensationalist clickbait" in 37% of those emails [167]. A question about the external validity of this claim might be: is the frequency of sensationalist clickbait in campaign emails similar in other U.S. election cycles? ...
... We first examine the proposed methods capability of improving node classification tasks. We run each experiment with three trial runs and report the average test set F1-score of all 5 GNNs and random forest models in the timesteps from [34][35][36][37][38][39][40][41][42][43][44][45][46][47][48][49] separated into four intervals. The best results in each model group are underlined and shown in bold in Table 1. ...
... Compared to older statistical methods, they offer increased predictive accuracy [1], the ability to process large amounts of data [12], and the ability to use different types of data for scientific research, such as text, images, and video [7]. However, the rapid uptake of ML methods has been accompanied by concerns of validity, reproducibility, and generalizability [13][14][15][16][17][18][19]. There are several reasons for concern. ...
... However, this resulting uncertainty subsequently erodes trust in news disseminated via social media. Papakyriakopoulos et al. [29] also offer a large-scale data analysis of publicly available data, aiming to provide a critical assessment of how these platforms either amplified or moderated the distribution of political advertisements. Additionally, Ma et al. [20] engage in a conceptual replication and extension of this influential study. ...
... While algorithmic bias mitigation is well explored (Barocas et al. 2019;Ntoutsi et al. 2020;Mitchell et al. 2021;Mehrabi et al. 2021), few works tackle bias detection, especially in the CV domain. A common strategy (Fabbrizzi et al. 2022) is to extract information from the visual data, arrange it into tabular form, and quantify and mitigate bias on this data type (Zhao et al. 2017;Buolamwini and Gebru 2018;Merler et al. 2019;Wang et al. 2022). However, the reduction does not necessarily preserve all biased characteristics of the visual data space. ...
... The authors investigated the readability and text features from 2009 to 2019. The authors concluded that the privacy policies are long, hard to read, and lack transparency in some areas such as tracking cookies and data sharing with third parties [28]. Bateni compared the privacy policies before and after the General Data Protection Regulation (GDPR) [24,27,29]. ...
... Fairness Toolkits. Several analysis tools [3,8,14,17,18,37] exist to discover biases in datasets and models. For example, Google's Know Your Data [8] focuses on discovering correlations between attribute labels and metadata labels in TensorFlow image datasets. ...
... With recent [2], [1] and imminent restrictions [34] on third-party cookies, trackers are moving towards alternative methods of identifying users, which include the use of firstparty cookies [10], [8], [7], personally identifiable information (PII) [35], [36], and device/browser fingerprinting [37], [38], [39]. In contrast to third-party cookies that are automatically included as headers in outgoing HTTP requests, these alternative tracking methods rely on link decoration for identifier exfiltration. ...