Arvind Narayanan's research while affiliated with Princeton University and other places
Publications (84)
Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML method...
Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields wher...
We collect and analyze a corpus of more than 300,000 political emails sent during the 2020 US election cycle. These emails were sent by over 3000 political campaigns and organizations including federal and state level candidates as well as Political Action Committees. We find that in this corpus, manipulative tactics—techniques using some level of...
Blockchain analysis is essential for understanding how cryptocurrencies like Bitcoin are used in practice, and address clustering is a cornerstone of blockchain analysis. However, current techniques rely on heuristics that have not been rigorously evaluated or optimized. In this paper, we tackle several challenges of change address identification a...
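For concreteness, here is a minimal sketch of the classic multi-input co-spending heuristic that address clustering typically builds on: addresses whose coins are spent together in one transaction are assumed to share an owner. The transaction structure and addresses below are hypothetical, and real clustering must also handle the change-address subtleties this paper studies.

```python
# Minimal sketch of the multi-input heuristic over hypothetical transactions.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster_addresses(transactions):
    """Group input addresses that co-spend in any transaction."""
    uf = UnionFind()
    for tx in transactions:
        inputs = tx["inputs"]
        for addr in inputs:
            uf.find(addr)            # register single-input addresses too
        for addr in inputs[1:]:
            uf.union(inputs[0], addr)
    clusters = {}
    for addr in uf.parent:
        clusters.setdefault(uf.find(addr), set()).add(addr)
    return list(clusters.values())

txs = [{"inputs": ["addr1", "addr2"]},
       {"inputs": ["addr2", "addr3"]},
       {"inputs": ["addr4"]}]
print(cluster_addresses(txs))  # addr1-3 merge into one cluster; addr4 alone
```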
The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is inde...
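Among the pitfalls this line of work catalogs, preprocessing leakage is easy to show in a few lines. A minimal illustration with synthetic data (the specific numbers are irrelevant; the contrast between fitting the scaler before versus inside the split is the point):

```python
# One common leakage pattern: fitting preprocessing on the full dataset
# before the train/test split leaks test-set statistics into training.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Leaky: the scaler sees the test rows before the split.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky = LogisticRegression().fit(X_tr, y_tr)

# Sound: all preprocessing is fit inside the training fold only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
sound = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
print(leaky.score(X_te, y_te), sound.score(X_te, y_te))
```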
Machine learning models are known to perpetuate and even amplify the biases present in the data. However, these data biases frequently do not become apparent until after the models are deployed. Our work tackles this issue and enables the preemptive analysis of large-scale datasets. REvealing VIsual biaSEs (REVISE) is a tool that assists in the inv...
Online platforms play an increasingly important role in shaping democracy by influencing the distribution of political information to the electorate. In recent years, political campaigns have spent heavily on the platforms' algorithmic tools to target voters with online advertising. While the public interest in understanding how platforms perform t...
The use of big data research methods has grown tremendously in both academia and industry. One of the most fundamental rules of responsible big data research is the steadfast recognition that most data represent or impact people. For some projects, sharing data is an expectation of the human participants involved and thus a key part of ethical rese...
Recent concerns that machine learning (ML) may be facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences, as well as calls for greater integration of statistical approaches to...
Concerns about privacy, bias, and harmful applications have shone a light on the ethics of machine learning datasets, even leading to the retraction of prominent datasets including DukeMTMC, MS-Celeb-1M, TinyImages, and VGGFace2. In response, the machine learning community has called for higher ethical standards, transparency efforts, and technical...
Simulation can enable the study of recommender system (RS) evolution while circumventing many of the issues of empirical longitudinal studies; simulations are comparatively easier to implement, are highly controlled, and pose no ethical risk to human participants. How simulation can best contribute to scientific insight about RS alongside qualitati...
Simulation has emerged as a popular method to study the long-term societal consequences of recommender systems. This approach allows researchers to specify their theoretical model explicitly and observe the evolution of system-level outcomes over time. However, performing simulation-based studies often requires researchers to build their own simula...
Blockchain analysis is essential for understanding how cryptocurrencies like Bitcoin are used in practice, and address clustering is a cornerstone of blockchain analysis. However, current techniques rely on heuristics that have not been rigorously evaluated or optimized. In this paper, we tackle several challenges of change address identification a...
Universities have been forced to rely on remote educational technology to facilitate the rapid shift to online learning. In doing so, they acquire new risks of security vulnerabilities and privacy violations. To help universities navigate this landscape, we develop a model that describes the actors, incentives, and risks, informed by surveying 105...
Machine learning models are known to perpetuate and even amplify the biases present in the data. However, these data biases frequently do not become apparent until after the models are deployed. To tackle this issue and to enable the preemptive analysis of large-scale dataset, we present our tool. REVISE (REvealing VIsual biaSEs) is a tool that ass...
We investigate data exfiltration by third-party scripts directly embedded on web pages. Specifically, we study three attacks: misuse of browsers’ internal login managers, social data exfiltration, and whole-DOM exfiltration. Although the possibility of these attacks was well known, we provide the first empirical evidence based on measurements of 30...
The evolution of tricky user interfaces.
Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. So far, prior research has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have...
Dark patterns are an abuse of the tremendous power that designers hold in their hands. As public awareness of dark patterns grows, so does the potential fallout. Journalists and academics have been scrutinizing dark patterns, and the backlash from these exposures can destroy brand reputations and bring companies under the lenses of regulators. Desi...
Machine learning models are known to perpetuate the biases present in the data, but oftentimes these biases aren't known until after the models are deployed. We present the Visual Bias Extraction (ViBE) Tool that assists in the investigation of a visual dataset, surfacing potential dataset biases along three dimensions: (1) object-based, (2) gender...
Dark patterns are user interface design choices that benefit an online service by coercing, steering, or deceiving users into making unintended and potentially harmful decisions. We present automated techniques that enable experts to identify dark patterns on a large set of websites. Using these techniques, we study shopping websites, which often u...
The number of Internet-connected TV devices has grown significantly in recent years, especially Over-the-Top ("OTT") streaming devices, such as Roku TV and Amazon Fire TV. OTT devices offer an alternative to multi-channel television subscription services, and are often monetized through behavioral advertising. To shed light on the privacy practices...
Dark patterns are user interface design choices that benefit an online service by coercing, steering, or deceiving users into making unintended and potentially harmful decisions. We present automated techniques that enable experts to identify dark patterns on a large set of websites. Using these techniques, we study shopping websites, which often u...
The proliferation of smart home Internet of things (IoT) devices presents unprecedented challenges for preserving privacy within the home. In this paper, we demonstrate that a passive network observer (e.g., an Internet service provider) can infer private in-home activities by analyzing Internet traffic from commercially available smart home device...
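To make the inference concrete: even with payloads encrypted, send and receive volumes over time are visible to a network observer. The sketch below uses synthetic stand-in traces rather than real device captures, and shows only the rate-based classification idea, not the paper's full pipeline.

```python
# A passive observer bins a device's encrypted traffic into time windows
# and classifies in-home activity from traffic volume alone.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(packet_sizes, packet_times, duration=10.0, window=1.0):
    """Bytes per window: coarse features that survive encryption."""
    bins = np.arange(0.0, duration + window, window)
    volume, _ = np.histogram(packet_times, bins=bins, weights=packet_sizes)
    return volume

rng = np.random.default_rng(1)
# Hypothetical labeled traces: e.g. a device idling vs. actively streaming.
X, y = [], []
for label, rate in [("idle", 50), ("active", 800)]:
    for _ in range(50):
        times = np.sort(rng.uniform(0, 10, size=200))
        sizes = rng.poisson(rate, size=200)
        X.append(window_features(sizes, times))
        y.append(label)

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # in-sample only; a real study uses held-out traces
```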
The security of most existing cryptocurrencies is based on a concept called Proof-of-Work, in which users must solve a computationally hard cryptopuzzle to authorize transactions ("one unit of computation, one vote''). This leads to enormous expenditure on hardware and electricity in order to collect the rewards associated with transaction authoriz...
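For readers unfamiliar with the cryptopuzzle, here is a toy version of the hash-based search at the heart of proof-of-work. The difficulty is deliberately tiny; real networks use vastly higher targets.

```python
# Find a nonce so the block hash falls below a difficulty target
# ("one unit of computation, one vote").
import hashlib

def mine(block_data: bytes, difficulty_bits: int = 20) -> int:
    target = 2 ** (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

print("winning nonce:", mine(b"example block header"))
```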
Consumer genetics databases hold dense genotypes of millions of people, and the number is growing quickly [1] [2]. In 2018, law enforcement agencies began using such databases to identify anonymous DNA via long-range familial searches. We show that this technique is far more powerful if combined with a genealogical database of the type co...
The proliferation of smart home Internet of Things (IoT) devices presents unprecedented challenges for preserving privacy within the home. In this paper, we demonstrate that a passive network observer (e.g., an Internet service provider) can infer private in-home activities by analyzing Internet traffic from commercially available smart home device...
The security of most existing cryptocurrencies is based on a concept called Proof-of-Work, in which users must solve a computationally hard cryptopuzzle to authorize transactions (`one unit of computation, one vote'). This leads to enormous expenditure on hardware and electricity in order to collect the rewards associated with transaction authoriza...
Users are often ill-equipped to identify online advertisements that masquerade as non-advertising content. Because such hidden advertisements can mislead and harm users, the Federal Trade Commission (FTC) requires all advertising content to be adequately disclosed. In this paper, we examined disclosures within affiliate marketing, an endorsement-ba...
In this paper, we present two web-based attacks against local IoT devices that any malicious web page or third-party script can perform, even when the devices are behind NATs. In our attack scenario, a victim visits the attacker's website, which contains a malicious script that communicates with IoT devices on the local network that have open HTTP...
Monero is a privacy-centric cryptocurrency that allows users to obscure their transactions by including chaff coins, called “mixins,” along with the actual coins they spend. In this paper, we empirically evaluate two weaknesses in Monero’s mixin sampling strategy. First, about 62% of transaction inputs with one or more mixins are vulnerable to “cha...
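One of the sampling weaknesses studied in this line of work is temporal: the real input in a ring is disproportionately likely to be the newest referenced coin. A stripped-down sketch of that guess-newest heuristic, over hypothetical ring members:

```python
# Among the candidate coins referenced by a ring signature, guess that the
# most recently created one is the real spend. Ring members are made-up
# (coin id, creation block height) pairs.

def guess_real_input(ring):
    return max(ring, key=lambda coin: coin[1])[0]

ring = [("coin_a", 100_000), ("coin_b", 152_340), ("coin_c", 149_900)]
print(guess_real_input(ring))  # 'coin_b', the newest member
```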
Blockchain technology is assembled from pieces that have long pedigrees in the academic literature, such as linked timestamping, consensus, and proof of work. In this tutorial, I'll begin by summarizing these components and how they fit together in Bitcoin's blockchain design. Then I'll present abstract models of blockchains; such abstractions help...
While disclosures relating to various forms of Internet advertising are well established and follow specific formats, endorsement marketing disclosures are often open-ended in nature and written by individual publishers. Because such marketing often appears as part of publishers' actual content, ensuring that it is adequately disclosed is critical...
We show that the simple act of viewing emails contains privacy pitfalls for the unwary. We assembled a corpus of commercial mailing-list emails, and find a network of hundreds of third parties that track email recipients via methods such as embedded pixels. About 30% of emails leak the recipient’s email address to one or more of these third parties...
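A simplified version of the leak-detection step: scan an email's HTML for image URLs that embed the recipient's address, in plaintext or under common hashes. The tracker domain, URL, and address below are made up for illustration.

```python
# Detect whether any embedded image URL carries the recipient's address.
import hashlib
import re

def address_encodings(email: str):
    e = email.strip().lower().encode()
    return {
        "plain": email.lower(),
        "md5": hashlib.md5(e).hexdigest(),
        "sha1": hashlib.sha1(e).hexdigest(),
        "sha256": hashlib.sha256(e).hexdigest(),
    }

def find_leaks(html: str, recipient: str):
    urls = re.findall(r'<img[^>]+src=["\']([^"\']+)', html, re.I)
    encodings = address_encodings(recipient)
    return [(url, kind) for url in urls
            for kind, token in encodings.items() if token in url.lower()]

html = ('<img src="https://tracker.example/p.gif?u='
        + hashlib.md5(b"alice@example.com").hexdigest() + '">')
print(find_leaks(html, "alice@example.com"))  # reports an md5-encoded leak
```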
If you have read about bitcoin in the press and have some familiarity with academic research in the field of cryptography, you might reasonably come away with the following impression: Several decades' worth of research on digital cash, beginning with David Chaum [10, 12], did not lead to commercial success because it required a centralized, bank-like...
Analysis of blockchain data is useful for both scientific research and commercial applications. We present BlockSci, an open-source software platform for blockchain analysis. BlockSci is versatile in its support for different blockchains and analysis tasks. It incorporates an in-memory, analytical (rather than transactional) database, making it sev...
The growing market for smart home IoT devices promises new conveniences for consumers while presenting new challenges for preserving privacy within the home. Many smart home devices have always-on sensors that capture users' offline activities in their living spaces and transmit information about these activities on the Internet. In this paper, we...
We show how third-party web trackers can deanonymize users of cryptocurrencies. We present two distinct but complementary attacks. On most shopping websites, third party trackers receive information about user purchases for purposes of advertising and analytics. We show that, if the user pays using a cryptocurrency, trackers typically possess enoug...
We’ve seen repeatedly that ideas in the research literature can be gradually forgotten or lie unappreciated, especially if they are ahead of their time, even in popular areas of research. Both practitioners and academics would do well to revisit old ideas to glean insights for present systems. Bitcoin was unusual and successful not because it was o...
In the cryptographic currency Bitcoin, all transactions are recorded in the blockchain - a public, global, and immutable ledger. Because transactions are public, Bitcoin and its users employ obfuscation to maintain a degree of financial privacy. Critically, and in contrast to typical uses of obfuscation, in Bitcoin obfuscation is not aimed against...
We present a systematic study of ad blocking - and the associated "arms race" - as a security problem. We model ad blocking as a state space with four states and six state transitions, which correspond to techniques that can be deployed by either publishers or ad blockers. We argue that this is a complete model of the system. We propose several new...
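The paper's exact four states and six transitions are not reproduced here; purely to illustrate the modeling style, the snippet below encodes a toy arms-race state machine with placeholder state and move names.

```python
# A state space where moves by publishers or ad blockers drive transitions.
# Names are illustrative, not the paper's model.
TRANSITIONS = {
    ("ads_shown", "deploy_ad_blocker"): "ads_blocked",
    ("ads_blocked", "deploy_blocker_detection"): "blocker_detected",
    ("blocker_detected", "hide_blocker_from_detection"): "ads_blocked",
    ("ads_blocked", "obfuscate_ads"): "ads_shown",
}

def play(start, moves):
    state = start
    for move in moves:
        state = TRANSITIONS.get((state, move), state)  # illegal moves no-op
        print(f"{move:>30} -> {state}")
    return state

play("ads_shown", ["deploy_ad_blocker", "deploy_blocker_detection",
                   "hide_blocker_from_detection"])
```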
When you browse the web, hidden “third parties” collect a large amount of data about your behavior. This data feeds algorithms to target ads to you, tailor your news recommendations, and sometimes vary prices of online products. The network of trackers comprises hundreds of entities, but consumers have little awareness of its pervasiveness and soph...
Machine learning is a means to derive artificial intelligence by discovering patterns in existing data. Here we show that applying machine learning to ordinary human language results in human-like semantic biases. We replicate a spectrum of known biases, as measured by the Implicit Association Test, using a widely used, purely statistical machine-l...
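The measurement instrument in this work is the Word Embedding Association Test: differential cosine association between target and attribute word sets. A minimal sketch with random stand-in vectors follows; a real replication would load trained embeddings such as GloVe instead.

```python
# WEAT test statistic over toy vectors standing in for word embeddings.
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def assoc(w, A, B):
    """s(w, A, B): differential association of word vector w."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat(X, Y, A, B):
    """Sum of associations of targets X minus those of targets Y."""
    return sum(assoc(x, A, B) for x in X) - sum(assoc(y, A, B) for y in Y)

rng = np.random.default_rng(0)
flowers = rng.normal(size=(4, 50))      # stand-ins for flower words
insects = rng.normal(size=(4, 50))      # stand-ins for insect words
pleasant = rng.normal(size=(4, 50))
unpleasant = rng.normal(size=(4, 50))
print(weat(flowers, insects, pleasant, unpleasant))
```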
Monero is a privacy-centric cryptocurrency that allows users to obscure their transaction graph by including chaff coins, called "mixins," along with the actual coins they spend. In this report, we empirically evaluate two weaknesses in Monero's mixin sampling strategy. First, about 62% of transaction inputs with one or more mixins are vulnerable...
The use of big data research methods has grown tremendously over the past five years in both academia and industry. As the size and complexity of available datasets has grown, so too have the ethical questions raised by big data research. These questions become increasingly urgent as data and research agendas move well beyond those typical of the c...
First, Arvind Narayanan and Andrew Miller, co-authors of the increasingly popular open-access Princeton Bitcoin textbook, provide an overview of ongoing research in cryptocurrencies. Second, Song Han provides an overview of hardware trends related to another long-studied academic problem that has recently seen an explosion in popularity: deep learn...
Bitcoin provides two incentives for miners: block rewards and transaction fees. The former accounts for the vast majority of miner revenues at the beginning of the system, but it is expected to transition to the latter as the block rewards dwindle. There has been an implicit belief that whether miners are paid by block rewards or transaction fees d...
We present the largest and most detailed measurement of online tracking conducted to date, based on a crawl of the top 1 million websites. We make 15 types of measurements on each site, including stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and the exchange of tracking data between diff...
Machines learn what people know implicitly
AlphaGo has demonstrated that a machine can learn how to do things that people spend many years of concentrated study learning, and it can rapidly learn how to do them better than any human can. Caliskan et al. now show that machines can learn word associations from written texts and that these association...
While threshold signature schemes have been presented before, there has never been an optimal threshold signature algorithm for DSA. The properties of DSA make it quite challenging to build a threshold version. In this paper, we present a threshold DSA scheme that is efficient and optimal. We also present a compelling application to use our scheme:...
Once released to the public, data cannot be taken back. As time passes, data analytic techniques improve and additional datasets become public that can reveal information about the original data. It follows that released data will get increasingly vulnerable to re-identification—unless methods with provable privacy properties are used for the data...
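Differential privacy is the standard example of a release method whose guarantee does not degrade as new auxiliary data appears later. A minimal Laplace-mechanism sketch for a counting query (toy data, illustrative epsilon):

```python
# Laplace mechanism: noise scaled to the query's sensitivity yields an
# epsilon-differentially-private count, regardless of future datasets.
import numpy as np

def dp_count(records, predicate, epsilon, rng):
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # one record changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

rng = np.random.default_rng(0)
ages = [23, 35, 41, 29, 62, 57, 33]
print(dp_count(ages, lambda a: a > 40, epsilon=0.5, rng=rng))
```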
Expert-curated guides to the best of CS research.
The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. Previous work has examined attribution of authors from both source code and compiled binaries, and found that while source code can be attributed with very high accuracy, the attribution of executable bina...
Bitcoin has emerged as the most successful cryptographic currency in history. Within two years of its quiet launch in 2009, Bitcoin grew to comprise billions of dollars of economic value despite only cursory analysis of the system's design. Since then a growing literature has identified hidden-but-important properties of the system, discovered at...
We study the ability of a passive eavesdropper to leverage "third-party" HTTP tracking cookies for mass surveillance. If two web pages embed the same tracker which tags the browser with a unique cookie, then the adversary can link visits to those pages from the same user (i.e., browser instance) even if the user's IP address varies. Further, many p...
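The linking step itself is simple once cookie values are visible on the wire; a toy sketch with hypothetical cleartext requests:

```python
# Group sniffed requests by third-party cookie value: visits from shifting
# IP addresses collapse back into per-browser profiles.
from collections import defaultdict

requests = [
    {"ip": "1.2.3.4", "page": "news.example", "cookie": "trkr=u42"},
    {"ip": "5.6.7.8", "page": "shop.example", "cookie": "trkr=u42"},
    {"ip": "9.9.9.9", "page": "blog.example", "cookie": "trkr=u77"},
]

profiles = defaultdict(list)
for r in requests:
    profiles[r["cookie"]].append((r["ip"], r["page"]))

for cookie, visits in profiles.items():
    print(cookie, "->", visits)  # u42 links two visits despite new IPs
```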
We present the first large-scale studies of three advanced web tracking mechanisms - canvas fingerprinting, evercookies and use of "cookie syncing" in conjunction with evercookies. Canvas fingerprinting, a recently developed form of browser fingerprinting, has not previously been reported in the wild; our results show that over 5% of the top 100,00...
We propose Mixcoin, a protocol to facilitate anonymous payments in Bitcoin and similar cryptocurrencies. We build on the emergent phenomenon of currency mixes, adding an accountability mechanism to expose theft. We demonstrate that incentives of mixes and clients can be aligned to ensure that rational mixes will not steal. Our scheme is efficient a...
Embedding coverage of ethics in software engineering courses would help students draw strength and wisdom from dialogue with other future members of their profession. Without a sense of professional ethics, individuals might justify to themselves conduct that would be much more difficult to justify in front of others. Additionally, professional eth...
Despite privacy-preserving cryptography technologies' potential, they've largely failed to find commercial adoption. Reasons include people's unawareness of privacy-preserving cryptography, developers' lack of expertise, the field's complexity, economic constraints, and trust issues. View part 1 of this article (from the March/April 2013 issue) her...
Perceptual, "context-aware" applications that observe their environment and interact with users via cameras and other sensors are becoming ubiquitous on personal computers, mobile phones, gaming platforms, household robots, and augmented-reality devices. This raises new privacy risks. We describe the design and implementation of DARKLY, a practical...
One way to use cryptography for privacy is to tweak various systems to be privacy-preserving. But the more radical cypherpunk movement sought to wield crypto as a weapon of freedom, autonomy, and privacy that would fundamentally and inexorably reshape social, economic, and political power structures. This installment of On the Horizon primarily exa...
While the Internet was conceived as a decentralized network, the most widely used web applications today tend toward centralization. Control increasingly rests with centralized service providers who, as a consequence, have also amassed unprecedented amounts of data about the behaviors and personalities of individuals. Developers, regulators, and co...
Many commercial websites use recommender systems to help customers locate products and content. Modern recommenders are based on collaborative filtering: they use patterns learned from users' behavior to make recommendations, usually in the form of related-items lists. The scale and complexity of these systems, along with the fact that their output...
This paper describes the winning entry to the IJCNN 2011 Social Network Challenge run by Kaggle.com. The goal of the contest was to promote research on real-world link prediction, and the dataset was a graph obtained by crawling the popular Flickr social photo sharing website, with user identities scrubbed. By de-anonymizing much of the competition...
Online behavioral advertising (OBA) refers to the practice of tracking users across web sites in order to infer user interests and preferences. These interests and preferences are then used for selecting ads to present to the user. There is great concern that behavioral advertising in its present form infringes on user privacy. The resulting public...
Developing effective privacy protection technologies is a critical challenge for security and privacy research as the amount and variety of data collected about individuals increase exponentially.
Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc. We present a framework for analyzing privacy and anonymity in...
We present a new class of statistical de- anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary's background knowledge. We apply our de-anonymization methodology to th...
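The flavor of the attack can be sketched as weighted record matching: agreement on rare items is far more identifying than agreement on popular ones. The records and popularity weights below are toys; the real algorithm also checks that the best match stands out sufficiently from the runner-up before declaring a hit.

```python
# Match an auxiliary record against anonymized records, weighting
# agreement on rare items more heavily.
import math

def score(aux, record, item_popularity):
    s = 0.0
    for item, rating in aux.items():
        if item in record and abs(record[item] - rating) <= 1:
            s += 1.0 / math.log(1 + item_popularity[item])  # rare = heavy
    return s

def deanonymize(aux, records, item_popularity):
    return max(records, key=lambda rid: score(aux, records[rid], item_popularity))

popularity = {"blockbuster": 10_000, "obscure_doc": 12, "cult_film": 40}
records = {
    "user_a": {"blockbuster": 5, "obscure_doc": 4, "cult_film": 2},
    "user_b": {"blockbuster": 5},
}
aux = {"obscure_doc": 4, "cult_film": 2}  # adversary's background knowledge
print(deanonymize(aux, records, popularity))  # 'user_a'
```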
Obfuscation, when used as a technical term, refers to hiding information “in plain sight” inside computer code or digital data. The history of obfuscation in modern computing can be traced to two events that took place in 1976. The first was the publication of Diffie and Hellman’s seminal paper on public-key cryptography [DH76]. This paper is famou...
We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary's background knowledge. We apply our de-anonymization methodology to the...
We study the problem of circuit obfuscation, i.e., transforming the circuit in a way that hides everything except its input-output behavior. Barak et al. showed that a universal obfuscator that obfuscates every circuit class cannot exist, leaving open the possibility of special-purpose obfuscators. Known positive results for obfuscation are limited...
Netflix—the world's largest online movie rental service—publicly released a dataset containing movie ratings of 500,000 Netflix subscribers. The dataset is intended to be anonymous, and all personally identifying information has been removed. We demonstrate that an attacker who knows only a little bit about an individual subscriber can easily identify this subscriber'...
We investigate whether it is possible to encrypt a database and then give it away in such a form that users can still access it, but only in a restricted way. In contrast to conventional privacy mechanisms that aim to prevent any access to individual records, we aim to restrict the set of queries that can be feasibly evaluated on the encrypted data...
Human-memorable passwords are a mainstay of computer security. To decrease vulnerability of passwords to brute-force dictionary attacks, many organizations enforce complicated password-creation rules and require that passwords include numerals and special characters. We demonstrate that as long as passwords remain human-memorable, they are vulnerab...
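The key observation can be sketched with a character-level Markov model: because memorable passwords follow the statistics of language, a model trained on a sample ranks language-like candidates far above random strings, shrinking the effective search space. The training words below are toy examples, not a real leaked corpus.

```python
# Rank candidate passwords by a bigram character model with add-one smoothing.
from collections import defaultdict
import math
import string

ALPHABET = string.ascii_lowercase
training = ["password", "sunshine", "princess", "dragon", "letmein"]

counts = defaultdict(lambda: defaultdict(int))
for word in training:
    for a, b in zip(word, word[1:]):
        counts[a][b] += 1

def transition_prob(a, b):
    """Add-one smoothed P(b | a) over the lowercase alphabet."""
    total = sum(counts[a].values()) + len(ALPHABET)
    return (counts[a][b] + 1) / total

def log_likelihood(candidate):
    ll = sum(math.log(transition_prob(a, b))
             for a, b in zip(candidate, candidate[1:]))
    return ll / max(len(candidate) - 1, 1)  # per-transition average

# Language-like strings score higher, so they get guessed first.
for cand in ["sunshade", "xqzvkwpt"]:
    print(cand, round(log_likelihood(cand), 2))
```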
Online behavioral advertising (OBA) refers to the practice of tracking users across web sites in order to infer user interests and preferences. These interests and preferences are then used for selecting ads to present to the user. There is great concern that behavioral advertising in its present form infringes on user privacy. The resulting public...
Citations
... Separately, the adoption of benchmarking best practices from predictive modeling fields, including machine learning (Weber et al., 2019;Mangul et al., 2019;Mitchell et al., 2019;Kapoor and Narayanan, 2023), can help facilitate progress in explanatory modeling. The following list describes three important examples of these practices: ...
... At the extreme, some of these forms of benefit-such as the economic reward of selling AI systems-are consistent with the proliferation of AI applications that don't actually perform their stated task and do not provide a direct benefit to anyone impacted by their operation (Raji et al., 2022). This prospect echoes recent criticisms that frameworks for ethical and responsible AI are too centrally focused on the values and interests of technology creators (Birhane et al., 2022) and on the way that relevant moral values are instantiated within technologies rather than on the larger impacts of those technologies (Wang et al., 2022;Laufer et al., 2022), including questions of authority and power dynamics, human sovereignty and autonomy, and democratic values like liberty and freedom (Washington and Kuo, 2020;Barabas et al., 2020). * Authors contributed equally to this work. ...
... For example, Mathur et al. use structural topic modeling to study manipulative tactics in a corpus of U.S. political campaign emails from the 2020 election cycle. One of their findings is that the median active sender of campaign emails uses "sensationalist clickbait" in 37% of those emails [167]. A question about the external validity of this claim might be: is the frequency of sensationalist clickbait in campaign emails similar in other U.S. election cycles? ...
... We first examine the proposed method's capability of improving node classification tasks. We run each experiment with three trial runs and report the average test-set F1-score of all 5 GNNs and random forest models in the timesteps from 34 to 49, separated into four intervals. The best results in each model group are underlined and shown in bold in Table 1. ...
... Compared to older statistical methods, they offer increased predictive accuracy [1], the ability to process large amounts of data [12], and the ability to use different types of data for scientific research, such as text, images, and video [7]. However, the rapid uptake of ML methods has been accompanied by concerns about validity, reproducibility, and generalizability [13][14][15][16][17][18][19]. There are several reasons for concern. ...
... However, this resulting uncertainty subsequently erodes trust in news disseminated via social media. Papakyriakopoulos et al. [29] also offer a large-scale data analysis of publicly available data, aiming to provide a critical assessment of how these platforms either amplified or moderated the distribution of political advertisements. Additionally, Ma et al. [20] engage in a conceptual replication and extension of this influential study. ...
... While algorithmic bias mitigation is well explored (Barocas et al. 2019;Ntoutsi et al. 2020;Mitchell et al. 2021;Mehrabi et al. 2021), few works tackle bias detection, especially in the CV domain. A common strategy (Fabbrizzi et al. 2022) is to extract information from the visual data, arrange it into tabular form, and quantify and mitigate bias on this data type (Zhao et al. 2017;Buolamwini and Gebru 2018;Merler et al. 2019;Wang et al. 2022). However, the reduction does not necessarily preserve all biased characteristics of the visual data space. ...
... The authors investigated the readability and text features from 2009 to 2019. The authors concluded that the privacy policies are long, hard to read, and lack transparency in some areas such as tracking cookies and data sharing with third parties [28]. Bateni compared the privacy policies before and after the General Data Protection Regulation (GDPR) [24,27,29]. ...
... Fairness Toolkits. Several analysis tools [3,8,14,17,18,37] exist to discover biases in datasets and models. For example, Google's Know Your Data [8] focuses on discovering correlations between attribute labels and metadata labels in TensorFlow image datasets. ...
... With recent [2], [1] and imminent restrictions [34] on third-party cookies, trackers are moving towards alternative methods of identifying users, which include the use of first-party cookies [10], [8], [7], personally identifiable information (PII) [35], [36], and device/browser fingerprinting [37], [38], [39]. In contrast to third-party cookies that are automatically included as headers in outgoing HTTP requests, these alternative tracking methods rely on link decoration for identifier exfiltration. ...