
MobilityMirror: Bias-Adjusted Transportation Datasets: First Workshop, BiDU 2018, Rio de Janeiro, Brazil, August 31, 2018, Revised Selected Papers

We describe customized synthetic datasets for publishing mobility data. Companies are providing new transportation modalities, and their data is of high value for integrative transportation research, policy enforcement, and public accountability. However, these companies are disincentivized from sharing data not only to protect the privacy of individuals (drivers and/or passengers), but also to protect their own competitive advantage. Moreover, demographic biases arising from how the services are delivered may be amplified if released data is used in other contexts.

Online analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide the insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose a system to detect, explain, and resolve bias in decision-support queries. We give a simple definition of a biased query, which performs a set of independence tests on the data to detect bias. We propose a novel technique that gives explanations for bias, thus assisting an analyst in understanding what goes on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query, which shows what the analyst intended to examine. In a thorough evaluation on several real datasets we show both the quality and the performance of our techniques, including the completely automatic discovery of the revolutionary insights from a famous 1973 discrimination case.
Machine learning can impact people with legal or ethical consequences when it is used to automate decisions in areas such as insurance, lending, hiring, and predictive policing. In many of these scenarios, previous decisions have been made that are unfairly biased against certain subpopulations, for example those of a particular race, gender, or sexual orientation. Since this past data may be biased, machine learning predictors must account for this to avoid perpetuating or creating discriminatory practices. In this paper, we develop a framework for modeling fairness using tools from causal inference. Our definition of counterfactual fairness captures the intuition that a decision is fair towards an individual if it is the same in (a) the actual world and (b) a counterfactual world where the individual belonged to a different demographic group. We demonstrate our framework on a real-world problem of fair prediction of success in law school.
Digital maps represent an incredible HCI success: they have transformed the way people navigate in and access information about the world. While these platforms contain terabytes of data about road networks and points of interest (POIs), their information about physical accessibility is commensurately poor. Moreover, because of their highly graphical nature and reliance on gesture and mouse input, digital maps can be inaccessible to some user groups (e.g., those with visual or motor impairments). While there is active HCI work towards addressing both concerns, to our knowledge, there has been no direct effort to unite this research community. The goal of this SIG is threefold: first, to bring together and network scholars and practitioners who are broadly working in the area of accessible maps; second, to identify grand challenges and open problems; third, to help better establish accessible maps as a valuable topic with important HCI-related research problems.
We continue a line of research initiated in Dinur and Nissim (2003); Dwork and Nissim (2004); and Blum et al. (2005) on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function $f$ mapping databases to reals, the so-called true answer is the result of applying $f$ to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user. Previous work focused on the case of noisy sums, in which $f = \sum_i g(x_i)$, where $x_i$ denotes the $i$th row of the database and $g$ maps database rows to $[0,1]$. We extend the study to general functions $f$, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function $f$. Roughly speaking, this is the amount that any single argument to $f$ can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case. The first step is a very clean definition of privacy, now known as differential privacy, and a measure of its loss. We also provide a set of tools for designing and combining differentially private algorithms, permitting the construction of complex differentially private analytical tools from simple differentially private primitives. Finally, we obtain separation results showing the increased value of interactive statistical release mechanisms over non-interactive ones.
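The noisy-sum case described in this abstract can be illustrated with a minimal sketch. It assumes Laplace noise with scale equal to sensitivity divided by a privacy parameter `epsilon`, which is the standard calibration for this setting. The function names (`laplace_noise`, `noisy_sum`) are hypothetical, and the sketch ignores floating-point subtleties that a production release mechanism would need to handle.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_sum(rows, g, epsilon: float, rng: random.Random) -> float:
    """Release sum_i g(x_i) plus noise, where g maps each row into [0, 1]."""
    # Changing a single row moves the sum by at most 1, so the sensitivity is 1.
    sensitivity = 1.0
    true_answer = sum(g(x) for x in rows)
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

if __name__ == "__main__":
    rng = random.Random(0)
    rows = [0.2, 0.9, 0.5]  # toy rows, already in [0, 1]
    print(noisy_sum(rows, lambda x: x, epsilon=1.0, rng=rng))
```

A smaller `epsilon` gives a larger noise scale, trading accuracy for a stronger privacy guarantee.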
Society is increasingly relying on data-driven predictive models for automated decision making. Not by design, but owing to the nature and noisiness of observational data, such models may systematically disadvantage people belonging to certain categories or groups, instead of relying solely on individual merits. This may happen even if the computing process is fair and well-intentioned. Discrimination-aware data mining studies how to make predictive models free from discrimination when the historical data on which they are built may be biased, incomplete, or even contain past discriminatory decisions. Discrimination-aware data mining is an emerging research discipline, and there is no firm consensus yet on how to measure the performance of algorithms. The goal of this survey is to review various discrimination measures that have been used, analytically and computationally analyze their performance, and highlight implications of using one or another measure. We also describe measures from other disciplines, which have not been used for measuring discrimination, but potentially could be suitable for this purpose. This survey is primarily intended for researchers in data mining and machine learning as a step towards producing a unifying view of performance criteria when developing new algorithms for non-discriminatory predictive modeling. In addition, practitioners and policy makers could use this study when diagnosing potential discrimination by predictive models.
Bike-sharing programs, with initiatives to increase bike use and improve accessibility of urban transit, have received increasing attention in a growing number of cities across the world. The latest generation of bike-sharing systems has employed smart card technology that produces station-based data or trip-level data. This facilitates studies of the practical use of these systems. However, few studies have paid attention to the changes in users and system usage over the years, as well as the impact of system expansion on its usage. Monitoring the changes of system usage over years enables the identification of system performance and can serve as an input for improving the location-allocation of stations. The objective of this study is to explore the impact of the expansion of a bicycle-sharing system on the usage of the system. This was conducted for a bicycle-sharing system in Zhongshan (China), using operational usage data of different years following system expansion. To this end, we performed statistical and spatial analyses to examine the changes in both users and system usage before and after the system expansion. The findings show that there is a big variation in users and aggregate usage following the system expansion. However, the trend in spatial distribution of demand shows no substantial difference over the years, i.e. the same high-demand and low-demand areas appear. There are decreases in demand for some old stations over the years, which can be attributed to either the negative performance of the system or the competition of nearby new stations. Expanding the system not only extends the original users’ ability to reach new areas but also attracts new users to use bike-sharing systems. In the conclusions, we present and discuss the findings, and offer recommendations for the further expansion of the system.
Social scientists and data analysts are increasingly making use of Big Data in their analyses. These data sets are often “found data” arising from purely observational sources rather than data derived under strict rules of a statistically designed experiment. However, since these large data sets easily meet the sample size requirements of most statistical procedures, they give analysts a false sense of security as they proceed to focus on employing traditional statistical methods. We explain how most analyses performed on Big Data today lead to “precisely inaccurate” results that hide biases in the data but are easily overlooked due to the enhanced significance of the results created by the data size. Before any analyses are performed on large data sets, we recommend employing a simple data segmentation technique to control for some major components of observational data biases. These segments will help to improve the accuracy of the results.
With the increasing prevalence of information networks, research on privacy-preserving network data publishing has received substantial attention recently. There are two streams of relevant research, targeting different privacy requirements. A large body of existing works focus on preventing node re-identification against adversaries with structural background knowledge, while some other studies aim to thwart edge disclosure. In general, the line of research on preventing edge disclosure is less fruitful, largely due to lack of a formal privacy model. The recent emergence of differential privacy has shown great promise for rigorous prevention of edge disclosure. Yet recent research indicates that differential privacy is vulnerable to data correlation, which hinders its application to network data that may be inherently correlated. In this paper, we show that differential privacy could be tuned to provide provable privacy guarantees even in the correlated setting by introducing an extra parameter, which measures the extent of correlation. We subsequently provide a holistic solution for non-interactive network data publication. First, we generate a private vertex labeling for a given network dataset to make the corresponding adjacency matrix form dense clusters. Next, we adaptively identify dense regions of the adjacency matrix by a data-dependent partitioning process. Finally, we reconstruct a noisy adjacency matrix by a novel use of the exponential mechanism. To our best knowledge, this is the first work providing a practical solution for publishing real-life network data via differential privacy. Extensive experiments demonstrate that our approach performs well on different types of real-life network datasets.
To partly address people’s concerns over web tracking, Google has created the Ad Settings webpage to provide information about and some choice over the profiles Google creates on users. We present AdFisher, an automated tool that explores how user behaviors, Google’s ads, and Ad Settings interact. AdFisher can run browser-based experiments and analyze data using machine learning and significance tests. Our tool uses a rigorous experimental design and statistical analysis to ensure the statistical soundness of our results. We use AdFisher to find that the Ad Settings was opaque about some features of a user’s profile, that it does provide some choice on ads, and that these choices can lead to seemingly discriminatory ads. In particular, we found that visiting webpages associated with substance abuse changed the ads shown but not the settings page. We also found that setting the gender to female resulted in getting fewer instances of an ad related to high paying jobs than setting it to male. We cannot determine who caused these findings due to our limited visibility into the ad ecosystem, which includes Google, advertisers, websites, and users. Nevertheless, these results can form the starting point for deeper investigations by either the companies themselves or by regulatory bodies.
Differential privacy is fast becoming the method of choice for releasing data under strong privacy guarantees. A standard mechanism is to add noise to the counts in contingency tables derived from the dataset. However, when the dataset is sparse in its underlying domain, this vastly increases the size of the published data, to the point of making the mechanism infeasible. We propose a general framework to overcome this problem. Our approach releases a compact summary of the noisy data with the same privacy guarantee and with similar utility. Our main result is an efficient method for computing the summary directly from the input data, without materializing the vast noisy data. We instantiate this general framework for several summarization methods. Our experiments show that this is a highly practical solution: The summaries are up to 1000 times smaller, and can be computed in less than 1% of the time compared to standard methods. Finally, our framework works with various data transformations, such as wavelets or sketches.
We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier's antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints to an individual's privacy and have important implications for the design of frameworks and institutions dedicated to protect the privacy of individuals.
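The uniqueness measurement this abstract describes can be sketched on toy data. The sketch assumes each trace is a set of (location, hour) points and that the adversary knows `p` points drawn at random from a user's trace. The function name `unicity` and the sample traces are hypothetical, not the authors' code or data.

```python
import random

def unicity(traces, p, rng):
    """Fraction of users whose p known points match no trace but their own."""
    unique = 0
    for points in traces.values():
        # Outside information: up to p points sampled from this user's trace.
        known = rng.sample(sorted(points), min(p, len(points)))
        # Count every trace (including the user's own) consistent with them.
        matches = sum(
            1 for other in traces.values() if all(pt in other for pt in known)
        )
        if matches == 1:
            unique += 1
    return unique / len(traces)

if __name__ == "__main__":
    toy = {
        "a": {(1, 8), (2, 9)},
        "b": {(1, 8), (3, 9)},
        "c": {(1, 8), (2, 9)},
    }
    print(unicity(toy, p=2, rng=random.Random(0)))
```

Coarsening the data corresponds to mapping locations and hours to larger bins before building the sets, which makes more traces collide.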
We continue a line of research initiated in [10,11] on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function $f$ mapping databases to reals, the so-called true answer is the result of applying $f$ to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user. Previous work focused on the case of noisy sums, in which $f = \sum_i g(x_i)$, where $x_i$ denotes the $i$th row of the database and $g$ maps database rows to $[0,1]$. We extend the study to general functions $f$, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function $f$. Roughly speaking, this is the amount that any single argument to $f$ can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case. The first step is a very clean characterization of privacy in terms of indistinguishability of transcripts. Additionally, we obtain separation results showing the increased value of interactive sanitization mechanisms over non-interactive ones.
Differential privacy is a strong notion for protecting individual privacy in privacy preserving data analysis or publishing. In this paper, we study the problem of differentially private histogram release for random workloads. We study two multidimensional partitioning strategies including: 1) a baseline cell-based partitioning strategy for releasing an equi-width cell histogram, and 2) an innovative 2-phase kd-tree based partitioning strategy for releasing a v-optimal histogram. We formally analyze the utility of the released histograms and quantify the errors for answering linear queries such as counting queries. We formally characterize the property of the input data that will guarantee the optimality of the algorithm. Finally, we implement and experimentally evaluate several applications using the released histograms, including counting queries, classification, and blocking for record linkage and show the benefit of our approach.
Public transit systems play an important role in combating traffic congestion, reducing carbon emissions, and promoting compact, sustainable urban communities. The usability of public transit can be significantly enhanced by providing good traveler information systems. We describe OneBusAway, a set of transit tools focused on providing real-time arrival information for Seattle-area bus riders. We then present results from a survey of OneBusAway users that show a set of important positive outcomes: strongly increased overall satisfaction with public transit, decreased waiting time, increased transit trips per week, increased feelings of safety, and even a health benefit in terms of increased distance walked when using transit. Finally, we discuss the design and policy implications of these results and plans for future research in this area.
We propose the first differentially private aggregation algorithm for distributed time-series data that offers good practical utility without any trusted server. This addresses two important challenges in participatory data-mining applications where (i) individual users collect temporally correlated time-series data (such as location traces, web history, personal health data), and (ii) an untrusted third-party aggregator wishes to run aggregate queries on the data. To ensure differential privacy for time-series data despite the presence of temporal correlation, we propose the Fourier Perturbation Algorithm (FPA_k). Standard differential privacy techniques perform poorly for time-series data. To answer n queries, such techniques can result in a noise of Θ(n) to each query answer, making the answers practically useless if n is large. Our FPA_k algorithm perturbs the Discrete Fourier Transform of the query answers. For answering n queries, FPA_k improves the expected error from Θ(n) to roughly Θ(k), where k is the number of Fourier coefficients that can (approximately) reconstruct all the n query answers. Our experiments show that k ≪ n for many real-life data-sets, resulting in a huge error-improvement for FPA_k. To deal with the absence of a trusted central server, we propose the Distributed Laplace Perturbation Algorithm (DLPA) to add noise in a distributed way in order to guarantee differential privacy. To the best of our knowledge, DLPA is the first distributed differentially private algorithm that can scale with a large number of users: DLPA outperforms the only other distributed solution for differential privacy proposed so far, by reducing the computational load per user from O(U) to O(1), where U is the number of users.
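The idea of perturbing only a few Fourier coefficients can be sketched as follows. This is a simplified illustration in the spirit of the abstract, not the authors' algorithm. It assumes real-valued query answers, keeps the `k` lowest frequencies, mirrors their conjugates so the reconstruction stays real, and takes the Laplace noise scale as a caller-supplied parameter. Choosing that scale so the output is differentially private is not derived here. All function names are hypothetical, and the DFT is a naive O(n²) version to stay dependency-free.

```python
import cmath
import random

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(coeffs):
    """Naive inverse discrete Fourier transform."""
    n = len(coeffs)
    return [sum(coeffs[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]

def laplace(scale, rng):
    """Laplace(0, scale) sample; a scale of 0 means no noise."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def fourier_perturb(answers, k, noise_scale, rng):
    """Perturb the k lowest-frequency DFT coefficients and reconstruct n answers."""
    n = len(answers)
    assert 1 <= k <= (n + 1) // 2, "k must not reach the mirrored half"
    coeffs = dft(answers)
    kept = [0j] * n  # discarded coefficients are padded with zeros
    for f in range(k):
        noisy = coeffs[f] + complex(laplace(noise_scale, rng), laplace(noise_scale, rng))
        kept[f] = noisy
        if f != 0:
            kept[n - f] = noisy.conjugate()  # keeps the inverse transform real
    return [v.real for v in idft(kept)]

if __name__ == "__main__":
    series = [3.0, 3.2, 2.9, 3.1, 3.0, 3.3, 2.8, 3.1]
    print(fourier_perturb(series, k=2, noise_scale=0.5, rng=random.Random(0)))
```

Because noise is added to only `k` coefficients rather than to all `n` answers, the total injected noise grows with `k`, which is the source of the error improvement the abstract reports when k is much smaller than n.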
Privacy preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, $\epsilon$-differential privacy provides one of the strongest privacy guarantees. Existing data publishing methods that achieve $\epsilon$-differential privacy, however, offer little data utility. In particular, if the output dataset is used to answer count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders the results useless. In this paper, we develop a data publishing technique that ensures $\epsilon$-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it. We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we show the effectiveness and efficiency of our solution.
We show that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is differentially-private and consistent, but in addition, it is often much more accurate. We show, both theoretically and experimentally, that these techniques can be used for estimating the degree sequence of a graph very precisely, and for computing a histogram that can support arbitrary range queries accurately.
In a randomized audit study, we sent passengers in Boston, MA on nearly 1000 rides on controlled routes using the Uber and Lyft smartphone apps, recording key performance metrics. Passengers randomly selected between accounts that used African American-sounding and white-sounding names. We find that the probability an Uber driver accepts a ride, sees the name, and then cancels doubles when passengers used the account attached to the African American-sounding name. In contrast, Lyft drivers observe the name before accepting a ride and, as expected, we find no effect of name on cancellations. We do not, however, find that the increase in cancellations leads to measurably longer wait times for Uber.
Differential privacy is a strong notion for protecting individual privacy in data analysis or publication, with strong guarantees against adversaries with arbitrary background knowledge. A histogram is a representative and popular tool for data publication and visualization tasks. With the growth of data analysis and increasing demand for data release, protecting private data and preventing sensitive information from leaking has become one of the major challenges for histogram publication. In recent years, many approaches have been proposed for publishing histograms with differential privacy. This paper explores the problem of publishing histograms with differential privacy, and provides a systematic summary of existing research efforts in this field, beginning with a discussion of the basic principles and characteristics of the technology. Furthermore, we provide a comprehensive comparison of a series of state-of-the-art histogram publication schemes. Finally, we provide suggestions for future work in this area.
This paper defines software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates, focusing on causality in discriminatory behavior. Evidence of software discrimination has been found in modern software systems that recommend criminal sentences, grant access to financial products, and determine who is allowed to participate in promotions. Our approach, Themis, generates efficient test suites to measure discrimination. Given a schema describing valid system inputs, Themis generates discrimination tests automatically and does not require an oracle. We evaluate Themis on 20 software systems, 12 of which come from prior work with explicit focus on avoiding discrimination. We find that (1) Themis is effective at discovering software discrimination, (2) state-of-the-art techniques for removing discrimination from algorithms fail in many situations, at times discriminating against as much as 98% of an input subdomain, (3) Themis optimizations are effective at producing efficient test suites for measuring discrimination, and (4) Themis is more efficient on systems that exhibit more discrimination. We thus demonstrate that fairness testing is a critical aspect of the software development cycle in domains with possible discrimination and provide initial tools for measuring software discrimination.
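The testing idea in this abstract can be sketched as a black-box check: generate inputs from a schema, vary only the protected attribute, and count how often the decision changes. This is a minimal illustration under that reading of the abstract, not the Themis tool itself. The function name `causal_discrimination`, the schema layout, and the two toy decision functions are hypothetical.

```python
import random

def causal_discrimination(decide, schema, protected, trials, rng):
    """Fraction of random inputs whose decision changes with the protected attribute."""
    changed = 0
    for _ in range(trials):
        # Draw one valid input from the schema (attribute name -> allowed values).
        base = {name: rng.choice(values) for name, values in schema.items()}
        # Hold everything else fixed and try every value of the protected attribute.
        outputs = {decide({**base, protected: v}) for v in schema[protected]}
        if len(outputs) > 1:
            changed += 1
    return changed / trials

if __name__ == "__main__":
    schema = {"gender": ["f", "m"], "income": [10, 50, 90]}
    by_income = lambda person: person["income"] >= 50   # ignores gender
    by_gender = lambda person: person["gender"] == "m"  # depends only on gender
    rng = random.Random(0)
    print(causal_discrimination(by_income, schema, "gender", 200, rng))
    print(causal_discrimination(by_gender, schema, "gender", 200, rng))
```

No oracle is needed because the check compares the system's outputs against each other rather than against a known correct answer.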
Recent work on fairness in machine learning has focused on various statistical discrimination criteria and how they trade off. Most of these criteria are observational: They depend only on the joint distribution of predictor, protected attribute, features, and outcome. While convenient to work with, observational criteria have severe inherent limitations that prevent them from resolving matters of fairness conclusively. Going beyond observational criteria, we frame the problem of discrimination based on protected attributes in the language of causal reasoning. This viewpoint shifts attention from "What is the right fairness criterion?" to "What do we want to assume about the causal data generating process?" Through the lens of causality, we make several contributions. First, we crisply articulate why and when observational criteria fail, thus formalizing what was before a matter of opinion. Second, our approach exposes previously ignored subtleties and why they are fundamental to the problem. Finally, we put forward natural causal non-discrimination criteria and develop algorithms that satisfy them.
Many data analysis tasks, such as solving prediction problems or inferring cause effect relationships, can be framed as statistical inference on models with outcome variables. This type of inference has been very successful in a variety of applications, including image and video analysis, speech recognition, machine translation, autonomous vehicle control, game playing, and validating hypotheses in the empirical sciences. As statistical and machine learning models become an increasingly ubiquitous part of our lives, policymakers, regulators, and advocates have expressed fears about the harmful impact of deployment of such models that encode harmful and discriminatory biases of their creators. A growing community is now addressing issues of fairness and transparency in data analysis in part by defining, analyzing, and mitigating harmful effects of algorithmic bias from a variety of perspectives and frameworks [3, 4, 6, 7, 8, 18]. In this paper, we consider the problem of fair statistical inference involving outcome variables. Examples include classification and regression problems, and estimating treatment effects in randomized trials or observational data. The issue of fairness arises in such problems where some covariates or treatments are "sensitive", in the sense of having potential of creating discrimination. In this paper, we argue that the presence of discrimination in our setting can be formalized in a sensible way as the presence of an effect of a sensitive covariate on the outcome along certain causal pathways, a view which generalizes [16]. We discuss a number of complications that arise in classical statistical inference due to this view, and suggest workarounds, based on recent work in causal and semi-parametric inference.
Matrix low-rank approximation is intimately related to data modelling, a problem that arises frequently in many different fields. This book is a comprehensive exposition of the theory, algorithms, and applications of structured low-rank approximation. Local optimization methods and effective suboptimal convex relaxations for Toeplitz, Hankel, and Sylvester structured problems are presented. A major part of the text is devoted to application of the theory. Applications described include: system and control theory (approximate realization, model reduction, output error, and errors-in-variables identification); signal processing (harmonic retrieval, sum-of-damped exponentials, finite impulse response modeling, and array processing); machine learning (multidimensional scaling and recommender systems); computer vision (algebraic curve fitting and fundamental matrix estimation); bioinformatics for microarray data analysis; chemometrics for multivariate calibration; psychometrics for factor analysis; and computer algebra for approximate common divisor computation. Special knowledge from the respective application fields is not required. The book is complemented by a software implementation of the methods presented, which makes the theory directly applicable in practice. In particular, all numerical examples in the book are included in demonstration files and can be reproduced by the reader. This gives hands-on experience with the theory and methods detailed. In addition, exercises and MATLAB examples will help the reader quickly assimilate the theory on a chapter-by-chapter basis.
In risk assessment and predictive policing, biased data can yield biased results.
We propose a learning algorithm for fair classification that achieves both group fairness (the proportion of members in a protected group receiving positive classification is identical to the proportion in the population as a whole), and individual fairness (similar individuals should be treated similarly). We formulate fairness as an optimization problem of finding a good representation of the data with two competing goals: to encode the data as well as possible, while simultaneously obfuscating any information about membership in the protected group. We show positive results of our algorithm relative to other known techniques, on three datasets. Moreover, we demonstrate several advantages to our approach. First, our intermediate representation can be used for other classification tasks (i.e., transfer learning is possible); secondly, we take a step toward learning a distance metric which can find important dimensions of the data for classification.
In this paper, we present DPSense, an approach to publish statistical information from datasets under differential privacy via sensitivity control. More specifically, we consider the problem of publishing column counts for high-dimensional datasets, such as query logs or the Netflix dataset. The key challenge is that, because the sensitivity is high, high-magnitude noise needs to be added to satisfy differential privacy. We explore how to effectively perform sensitivity control, i.e., limiting the contribution of each tuple in the dataset. We introduce a novel low-sensitivity quality function that enables one to effectively choose a contribution limit while satisfying differential privacy. Based on DPSense, we further propose an extension to correct the under-estimation bias, which we call DPSense-S. Experimental results show that our proposed approaches advance the state of the art for publishing noisy column counts and for finding the columns with the highest counts. Finally, we analyze and discuss the stability of DPSense and DPSense-S, which benefits from the high correlation between quality function and error, as well as other insights into DPSense, DPSense-S, and existing approaches.
We proposed and developed a taxi-sharing system that accepts taxi passengers’ real-time ride requests sent from smartphones and schedules proper taxis to pick them up via ridesharing, subject to time, capacity, and monetary constraints. The monetary constraints provide incentives for both passengers and taxi drivers: passengers will not pay more compared with no ridesharing and get compensated if their travel time is lengthened due to ridesharing; taxi drivers will make money for all the detour distance due to ridesharing. While such a system is of significant social and environmental benefit, e.g., saving energy consumption and satisfying people's commute, real-time taxi-sharing has not been well studied yet. To this end, we devise a mobile-cloud architecture based taxi-sharing system. Taxi riders and taxi drivers use the taxi-sharing service provided by the system via a smartphone app. The cloud first finds candidate taxis quickly for a taxi ride request using a taxi searching algorithm supported by a spatio-temporal index. A scheduling process is then performed in the cloud to select a taxi that satisfies the request with minimum increase in travel distance. We built an experimental platform using the GPS trajectories generated by over 33,000 taxis over a period of three months. A ride request generator is developed (available at∼sma/ridesharing) in terms of the stochastic process modelling real ride requests learned from the data set. Tested on this platform with extensive experiments, our proposed system demonstrated its efficiency, effectiveness and scalability. For example, when the ratio of the number of ride requests to the number of taxis is 6, our proposed system serves three times as many taxi riders as that when no ridesharing is performed while saving 11 percent in total travel distance and 7 percent taxi fare per rider.
What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender) and an explicit description of the process. When computers are involved, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the process, we propose making inferences based on the data it uses. We present four contributions. First, we link disparate impact to a measure of classification accuracy that, while known, has received relatively little attention. Second, we propose a test for disparate impact based on how well the protected class can be predicted from the other attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.
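"Widely different outcomes for different groups" is commonly quantified as a ratio of selection rates. The sketch below computes that ratio directly from outcomes; it is not the prediction-based test the abstract proposes. The 0.8 threshold is the "four-fifths" guideline from U.S. employment practice, which this abstract does not state, and the function names are hypothetical.

```python
from typing import Sequence


def selection_rate(selected: Sequence[int]) -> float:
    """Fraction of individuals with a positive (1) outcome."""
    if not selected:
        raise ValueError("group must be non-empty")
    return sum(selected) / len(selected)


def disparate_impact_ratio(selected: Sequence[int],
                           protected: Sequence[bool]) -> float:
    """Selection rate of the protected group divided by the selection
    rate of everyone else. Values well below 1 indicate that the
    protected group is selected less often."""
    prot = [s for s, g in zip(selected, protected) if g]
    rest = [s for s, g in zip(selected, protected) if not g]
    rest_rate = selection_rate(rest)
    if rest_rate == 0:
        raise ValueError("unprotected group has no positive outcomes")
    return selection_rate(prot) / rest_rate


if __name__ == "__main__":
    outcomes = [1, 0, 0, 0, 1, 1, 1, 0]
    is_protected = [True] * 4 + [False] * 4
    ratio = disparate_impact_ratio(outcomes, is_protected)
    # Assumed threshold: the four-fifths guideline, not from the abstract.
    print(ratio, "flagged" if ratio < 0.8 else "not flagged")
```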
Conference Paper
Evaluating the performance of database systems is crucial when database vendors or researchers are developing new technologies. But such evaluation tasks rely heavily on actual data and query workloads that are often unavailable to researchers due to privacy restrictions. To overcome this barrier, we propose a framework for the release of a synthetic database which accurately models selected performance properties of the original database. We improve on prior work on synthetic database generation by providing a formal, rigorous guarantee of privacy. Accuracy is achieved by generating synthetic data using a carefully selected set of statistical properties of the original data which balance privacy loss with relevance to the given query workload. An important contribution of our framework is an extension of standard differential privacy to multiple tables.
Causal effects are defined as comparisons of potential outcomes under different treatments on a common set of units. Observed values of the potential outcomes are revealed by the assignment mechanism, a probabilistic model for the treatment each unit receives as a function of covariates and potential outcomes. Fisher made tremendous contributions to causal inference through his work on the design of randomized experiments, but the potential outcomes perspective applies to other complex experiments and nonrandomized studies as well. As noted by Kempthorne in his 1976 discussion of Savage's Fisher lecture, Fisher never bridged his work on experimental design and his work on parametric modeling, a bridge that appears nearly automatic with an appropriate view of the potential outcomes framework, where the potential outcomes and covariates are given a Bayesian distribution to complete the model specification. Also, this framework crisply separates scientific inference for causal effects and decisions based on such inference, a distinction evident in Fisher's discussion of tests of significance versus tests in an accept/reject framework. But Fisher never used the potential outcomes framework, originally proposed by Neyman in the context of randomized experiments, and as a result he provided generally flawed advice concerning the use of the analysis of covariance to adjust for posttreatment concomitants in randomized trials.
Conference Paper
Differential privacy has emerged as one of the most promising privacy models for private data release. It can be used to release different types of data, and, in particular, histograms, which provide useful summaries of a dataset. Several differentially private histogram releasing schemes have been proposed recently. However, most of them directly add noise to the histogram counts, resulting in undesirable accuracy. In this paper, we propose two sanitization techniques that exploit the inherent redundancy of real-life datasets in order to boost the accuracy of histograms. They lossily compress the data and sanitize the compressed data. Our first scheme is an optimization of the Fourier Perturbation Algorithm (FPA) presented in [13]. It improves the accuracy of the initial FPA by a factor of 10. The other scheme relies on clustering and exploits the redundancy between bins. Our extensive experimental evaluation over various real-life and synthetic datasets demonstrates that our techniques preserve very accurate distributions and considerably improve the accuracy of range queries over attributed histograms.
Conference Paper
In 1977 Dalenius articulated a desideratum for statistical databases: nothing about an individual should be learnable from the database that cannot be learned without access to the database. We give a general impossibility result showing that a formalization of Dalenius’ goal along the lines of semantic security cannot be achieved. Contrary to intuition, a variant of the result threatens the privacy even of someone not in the database. This state of affairs suggests a new measure, differential privacy, which, intuitively, captures the increased risk to one’s privacy incurred by participating in a database. The techniques developed in a sequence of papers [8, 13, 3], culminating in those described in [12], can achieve any desired level of privacy under this measure. In many cases, extremely accurate information about the database can be provided while simultaneously ensuring very high levels of privacy.
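The abstract gives only the intuition behind differential privacy. For reference, the standard formal statement is sketched below; the symbols $\mathcal{M}$, $D$, $D'$, $S$, and $\varepsilon$ are notation chosen here and do not appear in the abstract.

```latex
A randomized mechanism $\mathcal{M}$ satisfies $\varepsilon$-differential
privacy if, for every pair of databases $D$ and $D'$ that differ in the
data of a single individual, and for every set $S$ of possible outputs,
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,
  \Pr[\mathcal{M}(D') \in S].
\]
```

A smaller $\varepsilon$ means the output distribution changes less when one person's data is added or removed, which is the "increased risk incurred by participating" that the abstract describes.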
Conference Paper
Spearman's footrule and Kendall's tau are two well-established distances between rankings. They, however, fail to take into account concepts crucial to evaluating a result set in information retrieval: element relevance and positional information. That is, changing the rank of a highly-relevant document should result in a higher penalty than changing the rank of an irrelevant document; a similar logic holds for the top versus the bottom of the result ordering. In this work, we extend both of these metrics to those with position and element weights, and show that a variant of the Diaconis-Graham inequality still holds: the two generalized measures remain within a constant factor of each other for all permutations. We continue by extending the element weights into a distance metric between elements. For example, in search evaluation, swapping the order of two nearly duplicate results should result in little penalty, even if these two are highly relevant and appear at the top of the list. We extend the distance measures to this more general case and show that they remain within a constant factor of each other. We conclude by conducting simple experiments on web search data with the proposed measures. Our experiments show that the weighted generalizations are more robust and consistent with each other than their unweighted counterparts.
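For orientation, the two classical (unweighted) distances and the Diaconis-Graham relation between them can be written in a few lines. This sketch covers only the unweighted case, where Kendall's tau K and the footrule F satisfy K ≤ F ≤ 2K; the weighted generalizations proposed in the paper are not reproduced. Function names are illustrative.

```python
from itertools import combinations, permutations
from typing import Hashable, Sequence


def footrule(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Spearman's footrule: total displacement of each element
    between ranking a and ranking b."""
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(abs(i - pos_b[x]) for i, x in enumerate(a))


def kendall_tau(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Kendall's tau distance: number of pairs ordered one way in a
    and the opposite way in b."""
    pos_b = {x: i for i, x in enumerate(b)}
    # combinations(a, 2) yields (x, y) with x ranked before y in a.
    return sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])


if __name__ == "__main__":
    base = list(range(5))
    # Check K <= F <= 2K for every permutation of five elements.
    for perm in permutations(base):
        k, f = kendall_tau(base, perm), footrule(base, perm)
        assert k <= f <= 2 * k
    print(kendall_tau([1, 2, 3], [3, 2, 1]), footrule([1, 2, 3], [3, 2, 1]))
```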
Machine bias: risk assessments in criminal sentencing
  • J Angwin
  • J Larson
  • S Mattu
  • L Kirchner
First, do no harm: Ethical guidelines for applying predictive tools within human services
  • MetroLab Network
Universally utility-maximizing privacy mechanisms
  • A Ghosh
  • T Roughgarden
  • M Sundararajan
Discrimination in online ad delivery
  • L Sweeney
Differentially private histogram publication
  • J Xu
  • Z Zhang
  • X Xiao
  • Y Yang
  • G Yu
  • M Winslett
Racial and gender discrimination in transportation network companies
  • Y Ge
  • C R Knittel
  • D Mackenzie
  • S Zoepf