Michael Carl Tschantz's research while affiliated with Berkeley Earth and other places

Publications (60)

Preprint
The near universal condemnation of proxy discrimination hides a disagreement over what it is. This work surveys various notions of proxy and proxy discrimination found in prior work and represents them in a common framework. These notions variously turn on statistical dependencies, causal effects, and intentions. It discusses the limitations and us...
Thesis
Full-text available
Widespread Chinese social media applications such as Sina Weibo (Chinese Twitter), the most popular social network in China, are widely known for monitoring and deleting posts to conform to Chinese government requirements. Censorship of Chinese social media is a complex process that involves many factors. There are multiple stakeholders and many di...
Preprint
Bias in machine learning has manifested injustice in several areas, such as medicine, hiring, and criminal justice. In response, computer scientists have developed myriad definitions of fairness to correct this bias in fielded algorithms. While some definitions are based on established legal and ethical norms, others are largely mathematical. It is...
Preprint
Interactions between bids to show ads online can lead to an advertiser's ad being shown to more men than women even when the advertiser does not target towards men. We design bidding strategies that advertisers can use to avoid such emergent discrimination without having to modify the auction mechanism. We mathematically analyze the strategies to d...
Preprint
Full-text available
Widespread Chinese social media applications such as Weibo are widely known for monitoring and deleting posts to conform to Chinese government requirements. In this paper, we focus on analyzing a dataset of censored and uncensored posts in Weibo. Unlike previous work that only considers text content of posts, we take a multi-modal approach that ta...
Conference Paper
We study how to evaluate Anti-Fingerprinting Privacy Enhancing Technologies (AFPETs). Experimental methods have the advantage of control and precision, and can be applied to new AFPETs that currently lack a user base. Observational methods have the advantage of scale and drawing from the browsers currently in real-world use. We propose a novel comb...
Preprint
We measure how effective Privacy Enhancing Technologies (PETs) are at protecting users from website fingerprinting. Our measurements use both experimental and observational methods. Experimental methods allow control, precision, and use on new PETs that currently lack a user base. Observational methods enable scale and drawing from the browsers cur...
Preprint
Full-text available
Under U.S. law, marketing databases exist under almost no legal restrictions concerning accuracy, access, or confidentiality. We explore the possible (mis)use of these databases in a criminal context by conducting two experiments. First, we show how this data can be used for "cybercasing" by using this data to resolve the physical addresses of indi...
Conference Paper
Full-text available
Google's Ad Settings shows the gender and age that Google has inferred about a web user. We compare the inferred values to the self-reported values of 501 survey participants. We find that Google often does not show an inference, but when it does, it is typically correct. We explore which usage characteristics, such as using privacy enhancing techn...
Preprint
We mathematically compare three competing definitions of group-level nondiscrimination: demographic parity, equalized odds, and calibration. Using the theoretical framework of Friedler et al., we study the properties of each definition under various worldviews, which are assumptions about how, if at all, the observed data is biased. We prove that d...
Preprint
Full-text available
Google's Ad Settings shows the gender and age that Google has inferred about a web user. We compare the inferred values to the self-reported values of 501 survey participants. We find that Google often does not show an inference, but when it does, it is typically correct. We explore which usage characteristics, such as using privacy enhancing techn...
Preprint
Privacy and nondiscrimination are related but different. We make this observation precise in two ways. First, we show that both privacy and nondiscrimination have two versions, a causal version and a statistical associative version, with each version corresponding to a competing view of the proper goal of privacy or nondiscrimination. Second, for each...
Conference Paper
Facing undesired traffic from the Tor anonymity network, online service providers discriminate against Tor users. In this study we characterize the extent of discrimination faced by Tor users and the nature of undesired traffic exiting from the Tor network - a task complicated by Tor's need to maintain user anonymity. We leverage multiple independe...
Preprint
Full-text available
This paper examines the different reasons websites may vary in their availability by location. Prior works on availability mostly focus on censorship by nation states. We look at three forms of server-side blocking: blocking visitors from the EU to avoid GDPR compliance, blocking based upon the visitor's country, and blocking due to security concer...
Preprint
Full-text available
One of the Internet's greatest strengths is the degree to which it facilitates access to any of its resources from users anywhere in the world. However, users in the developing world have complained of websites blocking their countries. We explore this phenomenon using a measurement study. With a combination of automated page loads, manual checking...
Article
We present associative and causal views of differential privacy. Under the associative view, the possibility of dependencies between data points precludes a simple statement of differential privacy's guarantee as conditioning upon a single changed data point. However, a simple characterization of differential privacy as limiting the effect of a sin...
Conference Paper
We present and evaluate a large-scale malware detection system integrating machine learning with expert reviewers, treating reviewers as a limited labeling resource. We demonstrate that even in small numbers, reviewers can vastly improve the system’s ability to keep pace with evolving threats. We conduct our evaluation on a sample of VirusTotal sub...
Article
Full-text available
The malware detection arms race involves constant change: malware changes to evade detection and labels change as detection mechanisms react. Recognizing that malware changes over time, prior work has enforced temporally consistent samples by requiring that training binaries predate evaluation binaries. We present temporally consistent labels, requ...
Conference Paper
We examine the problem of aggregating the results of multiple anti-virus (AV) vendors' detectors into a single authoritative ground-truth label for every binary. To do so, we adapt a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend. We use training based on Expectation Maxim...
Article
Full-text available
To partly address people’s concerns over web tracking, Google has created the Ad Settings webpage to provide information about and some choice over the profiles Google creates on users. We present AdFisher, an automated tool that explores how user behaviors, Google’s ads, and Ad Settings interact. AdFisher can run browser-based experiments and anal...
Conference Paper
Full-text available
Active learning is an area of machine learning examining strategies for allocation of finite resources, particularly human labeling efforts and to an extent feature extraction, in situations where available data exceeds available resources. In this open problem paper, we motivate the necessity of active learning in the security domain, identify pro...
Article
We argue that the evaluation of censorship evasion tools should depend upon economic models of censorship. We illustrate our position with a simple model of the costs of censorship. We show how this model makes suggestions for how to evade censorship. In particular, from it, we develop evaluation criteria. We examine how our criteria compare to the...
Article
Full-text available
To partly address people's concerns over web tracking, Google has created the Ad Settings webpage to provide information about and some choice over the profiles Google creates on users. We present AdFisher, an automated tool that explores how user behaviors, Google's ads, and Ad Settings interact. Our tool uses a rigorous experimental design and an...
Article
Full-text available
Information flow analysis has largely ignored the setting where the analyst has neither control over nor a complete model of the analyzed system. We formalize such limited information flow analyses and study an instance of it: detecting the usage of data by websites. We prove that these problems are ones of causal inference. Leveraging this connect...
Article
We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive data sets, and augment it with a heuristic procedure to...
Conference Paper
In this position paper, we argue that to be of practical interest, a machine-learning based security system must engage with the human operators beyond feature engineering and instance labeling to address the challenge of drift in adversarial environments. We propose that designers of such systems broaden the classification goal into an explanatory...
Conference Paper
Full-text available
Privacy policies in sectors as diverse as Web services, finance and healthcare often place restrictions on the purposes for which a governed entity may use personal information. Thus, automated methods for enforcing privacy policies require a semantics of purpose restrictions to determine whether a governed agent used information for a purpose. We...
Article
Full-text available
Privacy policies often place restrictions on the purposes for which a governed entity may use personal information. For example, regulations, such as the Health Insurance Portability and Accountability Act (HIPAA), require that hospital employees use medical information for only certain purposes, such as treatment, but not for others, such as gossi...
Article
Full-text available
Differential privacy is a promising approach to privacy preserving data analysis with a well-developed theory for functions. Despite recent work on implementing systems that aim to provide differential privacy, the problem of formally verifying that these systems have differential privacy has not been adequately addressed. We develop a formal proba...
Article
Full-text available
Privacy policies often place requirements on the purposes for which a governed entity may use personal information. For example, regulations, such as HIPAA, require that hospital employees use medical information for only certain purposes, such as treatment. Thus, using formal or automated methods for enforcing privacy policies requires a semantics...
Article
Full-text available
Differential privacy is a promising approach to privacy preserving data analysis with a well-developed theory for functions. Despite recent work on implementing systems that aim to provide differential privacy, the problem of formally verifying that these systems have differential privacy has not been adequately addressed. This paper presents the f...
Conference Paper
Full-text available
Privacy means something different to everyone. Against a vast and rich canvas of diverse types of...
Article
Full-text available
Differential privacy is a promising approach to privacy-preserving data analysis. There is now a well-developed theory of differentially private functions. Despite recent work on implementing database systems that aim to provide differential privacy and distributed systems that use differential privacy as a basis for higher level security propertie...
Article
Full-text available
We present a specialization of quantitative information flow to programs that compute statistics. We provide an approach for estimating the information flows present in such programs based on Monte Carlo simulation and argue that it is more accurate than previous approaches in this domain.
Conference Paper
Full-text available
Programs should keep sensitive information, such as medical records, confidential. We present a static analysis that extracts from a program's source code a sound approximation of the most restrictive conditional confidentiality policy that the program obeys. To formalize conditional confidentiality policies, we present a modified definition of non...
Chapter
Full-text available
The Sample Average Approximation (SAA) method is a technique for approximating solutions to stochastic programs. Here, we attempt to scale up the SAA method to harder problems than those previously studied. We argue that to apply the SAA method effectively, there are three parameters to optimize: the number of evaluations, the number of scenarios,...
Conference Paper
The growing importance of access control has led to the definition of numerous languages for specifying policies. Since these languages are based on different foundations, language users and designers would benefit from formal means to compare them. We present a set of properties that examine the behavior of policies under enlarged requests, po...
Conference Paper
Sensitive data are increasingly available on-line through the Web and other distributed protocols. This heightens the need to carefully control access to data. Control means not only preventing the leakage of data but also permitting access to necessary information. Indeed, the same datum is often treated differently depending on context. System de...
Article
Full-text available
We formalize the supplier offer acceptance problem in TAC SCM as a multi-stage stochastic program. In addition, we suggest a heuristic for solving this problem using the rollout method, following one or two stage approximations of the multi-stage stochastic program as the base policy during rollouts. We also describe a heuristic based on the notion...
Article
Full-text available
In this paper, we combine two approaches to handling uncertainty: we use techniques for finding optimal solutions in the expected sense to solve combinatorial optimization problems in an online setting. The problem we address is the scheduling component of the Trading Agent Competition in Supply Chain Management (TAC SCM) problem, a combinatorial o...
Article
Full-text available
In this paper, we combine two approaches to handling uncertainty: we use techniques for finding optimal solutions in the expected sense to solve combinatorial optimization problems in an online setting.
Article
Full-text available
The paper describes the design of the agent Botticelli, a finalist in the 2003 Trading Agent Competition in Supply Chain Management (TAC SCM). In TAC SCM, a simulated computer manufacturing scenario, Botticelli competes with other agents to win customer orders and negotiates with suppliers to procure the components necessary to complete its orders. W...
Article
Languages for the specification of access-control policies should support language features that allow for policies to be written in a clear manner. This work presents a set of language features found in current access-control languages and formalizes a set of intuitive properties the author believes to be relevant to policy clarity. The a...
Article
Tamper-evident software has the property that a verifier can detect a violation of program integrity during execution. In this paper, we study programs that through their own execution provide sufficient information in the form of responses and timing to detect tampering. We refer to such programs as timed tamper-evident programs. We formalize the...
Article
We present translations from a logic with indexed lax modalities to first-order intuitionistic logic and intuitionistic linear logic. These translations rely on a continuation passing style encoding for the lax modalities. We show that our translations preserve provability of formulas.
Article
Sensitive data are increasingly available on-line through the Web and other distributed protocols. This heightens the need to carefully control access to data. Control means not only preventing the leakage of data but also permitting access to necessary information. Indeed, the same datum is often treated differently depending on context. System...
Article
Full-text available
We examine a well-known confidentiality requirement called noninterference and argue that many systems do not meet this requirement despite maintaining the privacy of their users. We discuss a weaker requirement called incident-insensitive noninterference that captures why these systems maintain the privacy of their users while possibly not satisfying...

Citations

... This is because saliency maps are mathematically capable of responding only to the most important features, and it cannot be ruled out that some comparably less important features also had an impact on the output of a model. However, such features could be crucial, for example, when it comes to assessing possible biases regarding protected attributes [50]. ...
... Within group fairness, there are still two opposing viewpoints: we're all equal (WAE) and what you see is what you get (WYSIWYG) [42,109]. The WAE viewpoint considers that all groups have similar abilities to perform the task, e.g., all groups of people are equally capable of walking more, while the WYSIWYG viewpoint holds that the data reflect each group's ability to perform the task, e.g., some groups of people might be less capable. ...
... This concern is real, as the cost-per-conversion often varies significantly across demographic groups (Lambrecht and Tucker 2016), meaning that some groups may be inadvertently left behind; this concept is referred to as "crowding out" of the market. On the other hand, imposing strict demographic parity (e.g., requiring that the demographic distribution of recruited individuals matches the composition of the SNAP-eligible population) can result in unacceptably high cost, meaning fewer people overall are ultimately enrolled (Gelauff et al. 2020; Nasr and Tschantz 2020). Figure 1. ...
... In response, governments provided principles and regulations to guide organizations developing AI (Smuha 2019). In this light, several recent publications investigate concepts, such as trustworthiness (e.g., Thiebes et al. 2021), explainability (e.g., Meske et al. 2020, Liao & Varshney 2021), fairness (e.g., Lee et al. 2020, Datta et al. 2021), or responsibility (Blodgett et al. 2022, Arrieta et al. 2020), to develop methods under the umbrella term of AI ethics (Shneiderman 2020). A prevalent theme in the literature investigates system trust and trustworthiness as a contractual phenomenon based on the functionalities an AI system aspires to offer the end-user (Vianello et al. 2022). ...
... We lay out a set of relatively non-controversial desiderata for disclosure risk assessment methods. We use these desiderata to analyze three major frameworks: absolute disclosure risk, prior-to-posterior comparisons, and the counterfactual comparisons that motivate differential privacy [34,32,85]. Based on this analysis, we conclude that it is impossible to simultaneously satisfy all the desiderata. ...
... Most of the papers that were discarded in this round were either literature surveys in the domain of machine learning for software engineering (i.e., using ML techniques to facilitate software engineering tasks; not relevant to this study) or used interviews or surveys to evaluate tools. We also removed papers that have a narrow focus or are entirely model-centric, e.g., interviewing only data scientists about their modeling work (e.g., [23,35,46,80]) or interviewing only non-technical people (e.g., [12,33,100,117]). ...
... This impairs data integrity. Recent evidence has shown that Weibo posts that incite collective action (Chen et al., 2013; King et al., 2014), pertain to China's state leaders (Arefi et al., 2019), or criticize central government (Yang, 2009) are more likely to be censored. Moreover, a perception of censorship is likely to lead to users' self-censorship, either in the form of modifying the content or in inaction (e.g. ...
... Vastel et al. [79] detect browser fingerprint inconsistencies in a small manually-curated dataset. Datta et al. [29] and Merzdovnik et al. [56] evaluate and measure anti-fingerprint techniques. As a comparison, our measurement is the first to perform a billion-scale study on adversarial browser fingerprints. ...
... The cookie or ID contains information relevant to the user's demographics and interests; this can include information that the user has either directly entered during their website browsing (eg, by explicitly reporting their gender to a website) or that has been estimated based on patterns in their web use (eg, making purchases or visiting websites more frequently associated with a particular gender or age group). If the user is logged into their Google account while browsing the web (eg, is logged into Gmail or using the Chrome browser with their user profile logged in), this information also includes data gathered from their use of Google services [59,60]. If Google Analytics algorithmically determines that it does not have sufficient data to validly estimate a user's gender or age range, it will not provide an estimate for that user's demographic variables. ...
... They demonstrated that their algorithm improves the security of Tor clients by 36% on average, and ASes with high Tor bandwidth can be less resilient to active routing attacks than ASes with lower Tor bandwidth. Singh et al. (2017) studied abusive traffic on Tor such as spamming, vulnerability scanning, scraping, and other undesired behavior used by online service providers to discriminate against Tor users. The authors utilized several data sources for their study, such as email complaints sent to exit operators, commercial IP blacklists, web page crawls via Tor, and privacy-sensitive measurements of their own Tor exit nodes. ...