Conference Paper

Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures

Authors: Matt Fredrikson, Somesh Jha, and Thomas Ristenpart

Abstract

Machine-learning (ML) algorithms are increasingly utilized in privacy-sensitive applications such as predicting lifestyle choices, making medical diagnoses, and facial recognition. In a model inversion attack, recently introduced in a case study of linear classifiers in personalized medicine by Fredrikson et al., adversarial access to an ML model is abused to learn sensitive genomic information about individuals. Whether model inversion attacks apply to settings outside theirs, however, is unknown. We develop a new class of model inversion attack that exploits confidence values revealed along with predictions. Our new attacks are applicable in a variety of settings, and we explore two in depth: decision trees for lifestyle surveys as used on machine-learning-as-a-service systems and neural networks for facial recognition. In both cases confidence values are revealed to those with the ability to make prediction queries to models. We experimentally show attacks that are able to estimate whether a respondent in a lifestyle survey admitted to cheating on their significant other and, in the other context, show how to recover recognizable images of people's faces given only their name and access to the ML model. We also initiate experimental exploration of natural countermeasures, investigating a privacy-aware decision tree training algorithm that is a simple variant of CART learning, as well as revealing only rounded confidence values. The lesson that emerges is that one can avoid these kinds of MI attacks with negligible degradation to utility.
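
As a concrete illustration of the attack surface the abstract describes, the sketch below inverts a classifier by gradient descent on the input so as to maximize the confidence the model reports for a target class. The tiny network, image size, and hyperparameters are placeholders chosen for the example, not the authors' facial-recognition setup; this is a minimal sketch of the confidence-exploiting idea rather than the paper's exact algorithm.

```python
# Minimal sketch of a confidence-exploiting model inversion attack: gradient
# descent on the input to maximize the confidence reported for a target class.
# The model below is a placeholder, not the paper's facial-recognition network.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder "target model" returning class confidences (softmax outputs).
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 10), nn.Softmax(dim=1))
model.eval()

def invert(target_class: int, steps: int = 500, lr: float = 0.1) -> torch.Tensor:
    """Reconstruct an input the model assigns high confidence for target_class."""
    x = torch.zeros(1, 1, 32, 32, requires_grad=True)  # start from a blank image
    optimizer = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        confidence = model(x)[0, target_class]
        loss = 1.0 - confidence            # cost falls as confidence rises
        loss.backward()
        optimizer.step()
        x.data.clamp_(0.0, 1.0)            # keep the reconstruction a valid image
    return x.detach()

reconstruction = invert(target_class=3)
print(model(reconstruction)[0, 3].item())  # confidence for the targeted class
```

Rounding the reported confidence values, one of the countermeasures the abstract mentions, coarsens exactly the signal this loop climbs.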

... For instance, if the training data of the pre-trained models contain privacy-sensitive information, an adversary who downloads the pre-trained models could potentially perform different privacy attacks to infer the private information. In particular, membership inference attacks [18,19] have been studied to infer whether a private instance is in the training set, and model inversion attacks have been studied to reconstruct the private training instances under certain assumptions [28,11,10,26], which raises more privacy and safety concerns. ...
... Membership inference attacks and model inversion attacks are two major categories of such attacks. In membership inference attacks [18,19], the adversary aims to decide whether a sample is a member of the training set, while in model inversion attacks [28,11,10,26], the adversary attempts to reconstruct the training set under certain assumptions. [11] was the first to propose model inversion attacks aiming at recovering private training data. ...
... The authors demonstrated that personal genetic markers could be effectively recovered given the output of the model and auxiliary knowledge. [10] extended model inversion to more complex models, including shallow neural networks for face recognition. The recovered data with their proposed method are identified as the original person at a much higher rate than random guessing. ...
Preprint
Full-text available
Transfer learning through the use of pre-trained models has become a growing trend in the machine learning community. Consequently, numerous pre-trained models are released online to facilitate further research. However, this raises extensive concerns about whether these pre-trained models would leak privacy-sensitive information of their training data. Thus, in this work, we aim to answer the following questions: "Can we effectively recover private information from these pre-trained models? What are the sufficient conditions to retrieve such sensitive information?" We first explore different statistical information which can discriminate the private training distribution from other distributions. Based on our observations, we propose a novel private data reconstruction framework, SecretGen, to effectively recover private information. Compared with previous methods, which can recover private data given the ground-truth prediction of the targeted recovery instance, SecretGen does not require such prior knowledge, making it more practical. We conduct extensive experiments on different datasets under diverse scenarios to compare SecretGen with other baselines and provide a systematic benchmark to better understand the impact of different auxiliary information and optimization operations. We show that without prior knowledge about the true class prediction, SecretGen is able to recover private data with performance similar to the ones that leverage such prior knowledge. If the prior knowledge is given, SecretGen will significantly outperform baseline methods. We also propose several quantitative metrics to further quantify the privacy vulnerability of pre-trained models, which will help the model selection for privacy-sensitive applications. Our code is available at: https://github.com/AI-secure/SecretGen.
... Despite the ML data being carefully curated in local data centers isolated from the open network [4], recent research on ML privacy [5]-[9] reveals that, once a deep neural network (DNN) finishes its training process on a private dataset, the model immediately becomes an exploitable source of privacy breaches. By interacting with the trained model in a full-knowledge manner (i.e., with known parameters, model architecture, etc.) or via the prediction API, attacks are known to infer global/individual sensitive attributes of the training data [10]-[13], infer data membership [14], [15], or even steal the full functionality of the private datasets from the trained model [5], [6], [8]. The success of these previous attacks reflects the tension between the confidential data and the openly accessible trained model. ...
... One branch of attacks focuses more on the privacy of the training data as a whole. Inferring global sensitive information of the training data, the model inversion attack [10], [13] and the property inference attack [11], [23] respectively target revealing the class representatives of the training data (e.g., the average face of an identity when attacking a face recognition model) and whether a first-order predicate holds on the training data (e.g., whether a face dataset contains no whites). Stealing the functionality of a private training dataset, the model extraction attack constructs a surrogate model by distilling the black-box prediction API [5]-[7], [24] or directly reverse-engineers the model parameters by exploiting the properties of rectified linear units [8], [25], [26]. ...
Preprint
Full-text available
In this paper, we present a novel insider attack called Matryoshka, which employs an irrelevant scheduled-to-publish DNN model as a carrier model for covert transmission of multiple secret models which memorize the functionality of private ML data stored in local data centers. Instead of treating the parameters of the carrier model as bit strings and applying conventional steganography, we devise a novel parameter sharing approach which exploits the learning capacity of the carrier model for information hiding. Matryoshka simultaneously achieves: (i) High Capacity -- With almost no utility loss of the carrier model, Matryoshka can hide a 26x larger secret model or 8 secret models of diverse architectures spanning different application domains in the carrier model, neither of which can be done with existing steganography techniques; (ii) Decoding Efficiency -- after downloading the published carrier model, an outside colluder can exclusively decode the hidden models from the carrier model with only several integer secrets and the knowledge of the hidden model architecture; (iii) Effectiveness -- Moreover, almost all the recovered models have similar performance as if they were trained independently on the private data; (iv) Robustness -- Information redundancy is naturally implemented to achieve resilience against common post-processing techniques on the carrier before its publishing; (v) Covertness -- A model inspector with different levels of prior knowledge could hardly differentiate a carrier model from a normal model.
... In addition to de-anonymization, M. Fredrikson also identified training data without public datasets. Nonetheless, the exposed model [12] makes no training set available; models that are only released still carry privacy risks. ...
... The attacker's goal is to find specific information in the training data [31][32][33]. The previous research published by M. Fredrikson [12] is a type of membership inference attack. Moreover, membership inference attacks include white-box and black-box membership inference attacks. ...
Article
Full-text available
The primary motivation is to address difficulties in data interpretation and reductions in model accuracy. Although differential privacy can provide data privacy guarantees, it also creates problems; in particular, the appropriate noise setting for differential privacy is still inconclusive. This paper's main contribution is finding a balance between privacy and accuracy. The training data of deep learning models may contain private or sensitive corporate information, which is vulnerable to attacks and can lead to privacy leakage when data are shared. Many strategies exist for privacy protection, and differential privacy is the most widely applied one. Google proposed federated learning technology in 2016 to solve the problem of data silos. The technology can share information without exchanging original data and has made significant progress in the medical field. However, there is still a risk of data leakage in federated learning; thus, many models are now combined with differential privacy mechanisms to minimize the risk. Data in the financial field are similar to medical data in that they contain a substantial amount of personal information, whose leakage may cause uncontrollable consequences and makes data exchange and sharing difficult. Suppose that differential privacy is applied to the financial field: financial institutions could provide customers with higher-value, personalized services and automate credit scoring and risk management. Unfortunately, the financial sector rarely applies differential privacy and has reached no consensus on parameter settings. This study compares the data security of non-private and differentially private financial visual models, finding a balance between privacy protection and model accuracy. The results show that when the privacy loss parameter ϵ is between 5.41 and 12.62, the private models can protect the training data and the accuracy does not decrease too much.
... With machine learning (ML) becoming ubiquitous in many aspects of our society, questions of its privacy and security take centre stage. A growing field of research in privacy attacks on ML [1][2][3][4] tells us that it is possible to infer information about training data even in a black-box setting, without access to model parameters. A wider population, however, is concerned with privacy practices used in the ML development cycle, such as company employees or contractors manually inspecting and annotating user data (https://www.theguardian.com/technology/2020/jan/10/skype-audio-graded-byworkers-in-china-with-no-security-measures, ...
... A rapidly expanding area of privacy-preserving machine learning research has been recently focused on the attacks that compromise privacy of training data, such as model inversion [1] and membership inference [2]. The former is based on observing the output probabilities of the target model for a given class and performing gradient descent on an input reconstruction. ...
Article
Full-text available
We consider the problem of enhancing user privacy in common data analysis and machine learning development tasks, such as data annotation and inspection, by substituting the real data with samples from a generative adversarial network. We propose employing Bayesian differential privacy as the means to achieve a rigorous theoretical guarantee while providing a better privacy-utility trade-off. We demonstrate experimentally that our approach produces higher-fidelity samples compared to prior work, allowing to (1) detect more subtle data errors and biases, and (2) reduce the need for real data labelling by achieving high accuracy when training directly on artificial samples.
... Besides, DL has the ability of automatic trainable feature extraction from high-dimensional data to achieve state-of-the-art predictions, such as image classification (13). Of note, if we train a non-private DL model with sensitive data, it becomes vulnerable to privacy inference attacks (14) and model inversion attacks (15). ...
... For example, if a DL model with two components (i.e., different batches of training data) with privacy budgets ε₁ and ε₂ has access to a private dataset, the complete DL model can achieve differential privacy with a privacy budget of ε₁ + ε₂. Besides, DP-based models are invariant to post-processing, such as a model inversion attack (14,15). Hence, Shokri et al. (27) first introduced a differentially private DL model. ...
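
The two standard differential-privacy facts this snippet leans on, sequential composition and closure under post-processing, can be stated explicitly (wording mine, not the cited paper's):

```latex
% Sequential composition and post-processing invariance of differential privacy.
\text{If } M_1 \text{ is } \epsilon_1\text{-DP and } M_2 \text{ is } \epsilon_2\text{-DP, then releasing } \bigl(M_1(D), M_2(D)\bigr) \text{ is } (\epsilon_1 + \epsilon_2)\text{-DP.}
\qquad
\text{If } M \text{ is } \epsilon\text{-DP, then } g \circ M \text{ is } \epsilon\text{-DP for any (randomized) post-processing } g.
```
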
Article
Full-text available
Proper analysis of high-dimensional human genomic data is necessary to increase human knowledge about fundamental biological questions such as disease associations and drug sensitivity. However, such data contain sensitive private information about individuals and can be used to uniquely identify an individual (i.e., a privacy violation). Therefore, raw genomic datasets cannot be publicly published or shared with researchers. The recent success of deep learning (DL) in diverse problems proved its suitability for analyzing the high volume of high-dimensional genomic data. Still, DL-based models leak information about the training samples. To overcome this challenge, we can incorporate differential privacy mechanisms into the DL analysis framework, as differential privacy can protect individuals' privacy. We proposed a differential-privacy-based DL framework to solve two biological problems: breast cancer status (BCS) and cancer type (CT) classification, and drug sensitivity prediction. To predict BCS and CT using genomic data, we built a differentially private (DP) deep autoencoder (dpAE) using private gene expression datasets that performs low-dimensional data representation learning. We used dpAE features to build multiple DP binary classifiers to predict BCS and CT in any individual. To predict drug sensitivity, we used the Genomics of Drug Sensitivity in Cancer (GDSC) dataset. We extracted GDSC's dpAE features to build our DP drug sensitivity prediction model for 265 drugs. Evaluation of our proposed DP framework shows that it achieves better prediction performance for BCS, CT, and drug sensitivity than the previously published DP work.
... Deep models can memorise a user's sensitive information [2,10,11]. Several attack types [23] including simple reverse engineering [7] can reveal private information of users. Particularly for healthcare, model inversion attacks can even recover a patient's medical images [24]. ...
Preprint
Full-text available
Rights provisioned within data protection regulations permit patients to request that knowledge about their information be eliminated by data holders. With the advent of AI learned on data, one can imagine that such rights can extend to requests for forgetting knowledge of patients' data within AI models. However, forgetting patients' imaging data from AI models is still an under-explored problem. In this paper, we study the influence of patient data on model performance and formulate two hypotheses for a patient's data: either they are common and similar to other patients' data, or they form edge cases, i.e. unique and rare cases. We show that it is not possible to easily forget patient data. We propose a targeted forgetting approach to perform patient-wise forgetting. Extensive experiments on the benchmark Automated Cardiac Diagnosis Challenge dataset showcase the improved performance of the proposed targeted forgetting approach as opposed to a state-of-the-art method.
... Another challenge in working with personal data lies in the possibility of using GANs for security attacks that accurately reveal missing characteristics of real individuals, which could compromise their privacy [15,16,17]. Optimizing the trade-off between the privacy of the source data and the quality of the synthetic data remains an open challenge [18,19]. ...
Preprint
Full-text available
Despite the remarkable success of Generative Adversarial Networks (GANs) on text, images, and videos, generating high-quality tabular data is still under development owing to some unique challenges, such as capturing dependencies in imbalanced data and optimizing the quality of synthetic patient data while preserving privacy. In this paper, we propose DP-CGANS, a differentially private conditional GAN framework consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving tabular data. DP-CGANS distinguishes categorical and continuous variables and transforms them to latent space separately. Then, we structure a conditional vector as an additional input to not only present the minority class in the imbalanced data but also capture the dependency between variables. We inject statistical noise into the gradients in the network training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model against state-of-the-art generative models on three public datasets and two real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. We demonstrate that our model outperforms other comparable models, especially in capturing dependency between variables. Finally, we present the balance between data utility and privacy in synthetic data generation considering the different data structures and characteristics of real-world datasets, such as imbalanced variables, abnormal distributions, and sparsity of data.
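
The phrase "inject statistical noise into the gradients" refers to a DP-SGD-style clip-and-noise step; a minimal sketch of that step is shown below, with illustrative constants rather than DP-CGANS's actual hyperparameters or privacy accounting.

```python
# Hedged sketch of the clip-and-noise step applied to gradients in DP-SGD-style
# training; DP-CGANS's exact mechanism may differ, and the constants here are
# illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def privatize_gradient(per_example_grads: np.ndarray,
                       clip_norm: float = 1.0,
                       noise_multiplier: float = 1.1) -> np.ndarray:
    """Clip each per-example gradient to clip_norm, sum, and add Gaussian noise."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped_sum = (per_example_grads * scale).sum(axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped_sum.shape)
    return (clipped_sum + noise) / len(per_example_grads)

grads = rng.normal(size=(32, 10))          # 32 per-example gradients of dimension 10
print(privatize_gradient(grads).shape)     # (10,) noisy averaged gradient
```
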
... In addition, generally, data contains sensitive information and it is difficult to train a model while preserving privacy. In particular, data with sensitive information cannot be transferred to untrusted third-party cloud environments (cloud GPUs and TPUs) even though they provide a powerful computing environment [3][4][5][6][7][8][9]. Accordingly, it has been challenging to train/test a ML model with encrypted images as one way for solving these issues [10]. ...
Preprint
In this paper, we propose a combined use of transformed images and vision transformer (ViT) models transformed with a secret key. We show for the first time that models trained with plain images can be directly transformed to models trained with encrypted images on the basis of the ViT architecture, and the performance of the transformed models is the same as models trained with plain images when using test images encrypted with the key. In addition, the proposed scheme does not require any specially prepared data for training models or network modification, so it also allows us to easily update the secret key. In an experiment, the effectiveness of the proposed scheme is evaluated in terms of performance degradation and model protection performance in an image classification task on the CIFAR-10 dataset.
... Despite the fact that machine learning and neural networks are widely applied in industry settings, the trained models are costly to obtain. Furthermore, there are security [27], [28] and privacy [23], [29] concerns about revealing trained models to potential adversaries. Thus, trained models should be treated as proprietary and protected accordingly. ...
Preprint
Full-text available
A variety of explanation methods have been proposed in recent years to help users gain insights into the results returned by neural networks, which are otherwise complex and opaque black-boxes. However, explanations give rise to potential side-channels that can be leveraged by an adversary for mounting attacks on the system. In particular, post-hoc explanation methods that highlight input dimensions according to their importance or relevance to the result also leak information that weakens security and privacy. In this work, we perform the first systematic characterization of the privacy and security risks arising from various popular explanation techniques. First, we propose novel explanation-guided black-box evasion attacks that lead to 10 times reduction in query count for the same success rate. We show that the adversarial advantage from explanations can be quantified as a reduction in the total variance of the estimated gradient. Second, we revisit the membership information leaked by common explanations. Contrary to observations in prior studies, via our modified attacks we show significant leakage of membership information (above 100% improvement over prior results), even in a much stricter black-box setting. Finally, we study explanation-guided model extraction attacks and demonstrate adversarial gains through a large reduction in query count.
... The attacker needs to provide some auxiliary information, which could be experience or human knowledge. The concept of model inversion was introduced by Fredrikson et al. [35]. They showed how an adversary can use the outputs of a classifier to infer the sensitive features used as inputs. ...
Article
Full-text available
Hydraulic equipment, as a typical mechanical product, has been widely used in various fields. Accurate acquisition and secure transmission of assembly deviation data are the most critical issues for hydraulic equipment manufacturers in PLM-oriented value chain collaboration. Existing deviation prediction methods are mainly used for assembly quality control and concentrate on the product design and assembly stages. However, the actual assembly deviations generated in the service stage can be used to guide equipment maintenance and tolerance design. In this paper, a high-fidelity prediction and privacy-preserving method is proposed based on the observable assembly deviations. A hierarchical graph attention network (HGAT) is established to predict the assembly feature deviations. The hierarchical generalized representation and differential privacy reconstruction techniques are also introduced to generate the graph attention network model for assembly deviation privacy preservation. A derivation gradient matrix is established to calculate the defined modified necessary index of assembly parts. Two privacy-preserving strategies are designed to protect the assembly privacy of node representations and adjacency relationships. The effectiveness and superiority of the proposed method are demonstrated by a case study with a four-column hydraulic press.
... Transfer learning is susceptible to train-test leakage since the train and test data are often generated independently without precautions to prevent leaks [6]. Research on leakage in transfer learning focuses on membership inference [35,41] (predicting if a model has seen an instance during training) and property inference [2,17] (predicting properties of the training data). Both inferences rely on the observation that neural models may memorize some training instances to generalize through interpolation [5,7] and to similar test instances [15,16]. ...
Preprint
Full-text available
Neural retrieval models are often trained on (subsets of) the millions of queries of the MS MARCO / ORCAS datasets and then tested on the 250 Robust04 queries or other TREC benchmarks with often only 50 queries. In such setups, many of the few test queries can be very similar to queries from the huge training data -- in fact, 69% of the Robust04 queries have near-duplicates in MS MARCO / ORCAS. We investigate the impact of this unintended train-test leakage by training neural retrieval models on combinations of a fixed number of MS MARCO / ORCAS queries that are highly similar to the actual test queries and an increasing number of other queries. We find that leakage can improve effectiveness and even change the ranking of systems. However, these effects diminish as the amount of leakage among all training instances decreases and thus becomes more realistic.
... In the model inversion attack, the adversary uses black-box access to the model M to extract some feature x_i, i ∈ {1, ..., n}, given some knowledge about the other features and the value y [43]. ...
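
A minimal sketch of this attribute-inference flavour of model inversion is given below: enumerate candidate values for the unknown feature x_i and keep the one under which the black-box model best supports the observed label y. The confidence-times-prior scoring is an illustrative choice, not the exact estimator of [43].

```python
# Hedged sketch of attribute-inference model inversion: try each candidate value
# for the unknown feature and keep the one that best explains the observed label.
from typing import Callable, Sequence

def invert_feature(model: Callable[[Sequence[float]], Sequence[float]],
                   known_features: Sequence[float],
                   i: int,
                   candidates: Sequence[float],
                   y: int,
                   prior: dict) -> float:
    """Return the candidate value for feature i maximizing confidence in label y."""
    best_value, best_score = None, float("-inf")
    for v in candidates:
        x = list(known_features)
        x[i] = v
        score = model(x)[y] * prior.get(v, 1.0)   # black-box query, weighted by prior
        if score > best_score:
            best_value, best_score = v, score
    return best_value

# Toy usage: a fake "model" over 3 features, inverting feature 0 from {0, 1}.
toy_model = lambda x: [0.2 + 0.6 * x[0], 0.8 - 0.6 * x[0]]
print(invert_feature(toy_model, [None, 0.4, 1.0], i=0, candidates=[0, 1], y=0,
                     prior={0: 0.7, 1: 0.3}))
```
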
Preprint
Full-text available
Federated learning is a data decentralization privacy-preserving technique used to perform machine or deep learning in a secure way. In this paper we present theoretical aspects of federated learning, such as the presentation of an aggregation operator, different types of federated learning, and issues to be taken into account in relation to the distribution of data from the clients, together with an exhaustive analysis of a use case where the number of clients varies. Specifically, a use case of medical image analysis is proposed, using chest X-ray images obtained from an open data repository. In addition to the advantages related to privacy, improvements in predictions (in terms of accuracy and area under the curve) and reductions in execution time will be studied with respect to the classical case (the centralized approach). Different clients will be simulated from the training data, selected in an unbalanced manner, i.e., they do not all have the same amount of data. The results of considering three or ten clients are presented and compared with each other and against the centralized case. Two approaches will be analyzed for the case of intermittent clients, as in a real scenario some clients may leave the training and new ones may join it. The evolution of the results for the test set in terms of accuracy, area under the curve and execution time is shown as the number of clients into which the original data is divided increases. Finally, improvements and future work in the field are proposed.
... In [41], a Privacy-Preserving Adversarial Protector Network (PPAPNet) was developed as an improved technique for adding noise to the face. Experiments revealed that the PPAPNet performs remarkably in converting original images into high-quality deidentified images and resisting inversion attacks [42]. Reference [43] directly processed face images in the pixel space to realize DP, regardless of the image's distribution characteristics. ...
Article
Full-text available
In recent years, the security and privacy issues of face data in video surveillance have become a research hotspot. How to protect privacy while maintaining the utility of monitored faces is a challenging problem. At present, most mainstream methods maintain data utility with respect to pre-defined criteria such as the structural similarity or shape of the face, which draws criticism for poor versatility and adaptability. This paper proposes a novel generative framework called Quality Maintenance-Variational AutoEncoder (QM-VAE), which takes full advantage of existing privacy protection technologies. We innovatively add the loss of service quality to the loss function to ensure the generation of de-identified face images with guided quality preservation. The proposed model automatically adjusts the generated image according to the different service quality evaluators, so it is generic and efficient in different service scenarios, even some that have nothing to do with simple visual effects. We take facial expression recognition as an example and present experiments on the CelebA dataset to demonstrate the utility-preservation capabilities of QM-VAE. The experimental data show that QM-VAE has the highest quality retention rate of 86%. Compared with the existing method, QM-VAE generates de-identified face images with significantly improved utility, increasing the effect by 6.7%.
... For instance, social robots challenge users' private information (psychological and social privacy) due to their autonomy and potential for social bonding [lutz2019privacy]. The privacy-sensitive leakage of a model on its training data is defined as the information that an adversary can learn about the training data from the model that they cannot infer from other models trained on other data from the same distribution [4][5][6][7]. The utility gain is defined as the information that is learned from the model about the data population, which can be applied to measure the attacking performance. ...
Article
Full-text available
The training and application of machine learning models can leak a significant amount of information about their training dataset, which can be exploited by inference attacks or model inversion in fields such as computer vision and social robotics. Conventional privacy-preserving methods apply differential privacy to the training process, which might negatively influence convergence or robustness. We first conjecture the necessary steps to carry out a successful membership inference attack in a machine learning setting and then explicitly formulate the defense based on the conjecture. This paper investigates the construction of new training parameters with a Loss-based Differentiation Strategy (LDS) for a new learning model. The main idea of LDS is to partition the training dataset into folds and sort their training parameters by similarity to enable a privacy-accuracy inequality. The LDS-based model leaks less information under MIA than the primitive learning model and makes it impossible for the adversary to generate representative samples. Finally, extensive simulations are conducted to validate the proposed scheme, and the results demonstrate that LDS can lower the MIA accuracy for most CNN models.
... The proposed method resides in the latter category. The study [18] examined the effect of model inversion attacks on the medical image segmentation task using U-Net and SegNet, respectively. The study in [23] proposed the deep leakage from gradients method, a differentiable model which matches the weight gradients with those of the trained model in order to reconstruct the data. ...
Article
The past decade has seen a rapid adoption of Artificial Intelligence (AI), specifically deep learning networks, in the Internet of Medical Things (IoMT) ecosystem. However, it has been shown recently that deep learning networks can be exploited by adversarial attacks that make IoMT vulnerable not only to data theft but also to the manipulation of medical diagnosis. Existing studies consider adding noise to the raw IoMT data or to model parameters, which not only reduces the overall performance concerning medical inferences but is also ineffective against the likes of the deep leakage from gradients method. In this work, we propose the proximal gradient split learning (PSGL) method for defense against model inversion attacks. The proposed method intentionally attacks the IoMT data when undergoing the deep neural network training process at the client side. We propose the use of the proximal gradient method to recover gradient maps and a decision-level fusion strategy to improve the recognition performance. Extensive analysis shows that PSGL not only provides an effective defense mechanism against model inversion attacks but also helps in improving the recognition performance on publicly available datasets. We report 14.0%, 17.9%, and 36.9% gains in accuracy over reconstructed and adversarially attacked images, respectively.
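
For background, the generic proximal gradient iteration that a method like PSGL builds on (shown here in its textbook form, not the paper's specific update rule) alternates a gradient step on the smooth term with a proximal step on the non-smooth term:

```latex
% Generic proximal gradient update for minimizing f(x) + g(x) with smooth f.
x^{(k+1)} = \operatorname{prox}_{\alpha g}\!\left(x^{(k)} - \alpha \nabla f\!\left(x^{(k)}\right)\right),
\qquad
\operatorname{prox}_{\alpha g}(v) = \arg\min_{u}\; g(u) + \tfrac{1}{2\alpha}\,\lVert u - v \rVert_2^2 .
```
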
... Some attack methods threaten the data privacy of data holders, such as the membership inference attack [4]- [7], the model inversion attack [4], [8], [9], and the property inference attack [9]- [11]. In a membership inference attack, when given a model and an exact example, the attacker infers whether this example was used to train the model or not. ...
Article
Full-text available
In recent years, deep neural networks (DNNs) have been successfully applied in various tasks, and various third-party models are available to data holders. However, data holders who blindly use third-party models to train on their data may suffer data leakage, resulting in serious data privacy problems. The Capacity Abuse Attack (CAA) is the state-of-the-art black-box attack method, which uses the labels of an augmented malicious dataset to encode information about the training data. However, the expanded malicious dataset in CAA is artificially synthesized rather than consisting of natural images and is significantly different from the original training data, so these malicious images are easy to detect. In our attack, we use a technique similar to generating poisoned datasets in backdoor attacks, making the generated malicious data similar to real, natural images and making our attack more concealed. Extensive experiments are conducted, and the results demonstrate that our attack can effectively obtain the private training data of data holders without significantly impacting the model's original task.
Index Terms — Deep neural networks, data privacy, black-box attack, backdoor attack
... Deep neural networks (DNNs) have achieved massive successes in a variety of tasks, including image understanding, natural language processing, etc (Goodfellow, Bengio, and Courville 2016;LeCun, Bengio, and Hinton 2015). On the other hand, DNNs may compromise sensitive information carried in the training data (Shokri et al. 2017;Fredrikson, Jha, and Ristenpart 2015;Zhu, Liu, and Han 2019), thereby raising privacy issues. To counter this, learning algorithms that provide principled privacy guarantees in the line of differential privacy (DP) (Dwork and Roth 2014;Kearns and Roth 2020;Vadhan 2017) have been developed. ...
Article
Full-text available
Training deep neural networks (DNNs) for meaningful differential privacy (DP) guarantees severely degrades model utility. In this paper, we demonstrate that the architecture of DNNs has a significant impact on model utility in the context of private deep learning, whereas its effect is largely unexplored in previous studies. In light of this gap, we propose the very first framework that employs neural architecture search for automatic model design in private deep learning, dubbed DPNAS. To integrate private learning with architecture search, a DP-aware approach is introduced for training candidate models composed over a delicately defined novel search space. We empirically certify the effectiveness of the proposed framework. The searched model, DPNASNet, achieves state-of-the-art privacy/utility trade-offs, e.g., for a privacy budget of (ε, δ) = (3, 1e-5), our model obtains test accuracy of 98.57% on MNIST, 88.09% on FashionMNIST, and 68.33% on CIFAR-10. Furthermore, by studying the generated architectures, we provide several intriguing findings on designing private-learning-friendly DNNs, which can shed new light on model design for deep learning with differential privacy.
... One common approach to attacking model privacy is model inversion, which infers training data from the trained model or the training process [17,18]. Hitaj et al. [19] train a GAN to generate prototypical samples of the training sets during the learning process. ...
Preprint
Full-text available
Recently, privacy concerns about person re-identification (ReID) have attracted more and more attention, and preserving the privacy of the pedestrian images used by ReID methods has become essential. De-identification (DeID) methods alleviate privacy issues by removing the identity-related information from the ReID data. However, most existing DeID methods tend to remove all personal identity-related information and compromise the usability of the de-identified data for the ReID task. In this paper, we aim to develop a technique that can achieve a good trade-off between privacy protection and data usability for person ReID. To achieve this, we propose a novel de-identification method designed explicitly for person ReID, named Person Identity Shift (PIS). PIS removes the absolute identity in a pedestrian image while preserving the identity relationship between image pairs. By exploiting the interpolation property of the variational auto-encoder, PIS shifts each pedestrian image from its current identity to another, new identity, resulting in images that still preserve the relative identities. Experimental results show that our method has a better trade-off between privacy preservation and model performance than existing de-identification methods and can defend against human and model attacks on data privacy.
... Ateniese et al. [4] were the first to formulate such an attack, demonstrating that machine learning models could unintentionally leak global properties of training data. In contrast to privacy attacks such as membership inference attacks [39] and model inversion attacks [21], their attack aimed to discover properties aggregated over all records in a dataset rather than properties of individual records. For instance, in the context of a speech recognition engine, Ateniese et al. [4] demonstrated that their attack could infer the proportion of training data generated by Indian speakers. ...
Preprint
Alongside the rapid development of data collection and analysis techniques in recent years, there is increasingly an emphasis on the need to address information leakage associated with such usage of data. To this end, much work in the privacy literature is devoted to the protection of individual users and contributors of data. However, some situations instead require a different notion of data confidentiality involving global properties aggregated over the records of a dataset. Such notions of information protection are particularly applicable for business and organization data, where global properties may reflect trade secrets, or demographic data, which can be harmful if mishandled. Recent work on property inference attacks furthermore shows how data analysis algorithms can be susceptible to leaking these global properties of data, highlighting the importance of developing mechanisms that can protect such information. In this work, we demonstrate how a distribution privacy framework can be applied to formalize the problem of protecting global properties of datasets. Given this framework, we investigate several mechanisms and their tradeoffs for providing this notion of data confidentiality. We analyze the theoretical protection guarantees offered by these mechanisms under various data assumptions, then implement and empirically evaluate these mechanisms for several data analysis tasks. The results of our experiments show that our mechanisms can indeed reduce the effectiveness of practical property inference attacks while providing utility substantially greater than a crude group differential privacy baseline. Our work thus provides groundwork for theoretically supported mechanisms for protecting global properties of datasets.
... Communicating model updates during the training process can reveal sensitive information to third parties or to the central server. In certain instances, data leakage can occur, such as when ML models 'memorize' datasets [80][81][82] and when access to model parameters and updates can be used to infer the original dataset 83 . Differential privacy 84 can further reinforce privacy protection for federated learning 70,85,86 . ...
Article
In the past decade, the application of machine learning (ML) to healthcare has helped drive the automation of physician tasks as well as enhancements in clinical capabilities and access to care. This progress has emphasized that, from model development to model deployment, data play central roles. In this Review, we provide a data-centric view of the innovations and challenges that are defining ML for healthcare. We discuss deep generative models and federated learning as strategies to augment datasets for improved model performance, as well as the use of the more recent transformer models for handling larger datasets and enhancing the modelling of clinical text. We also discuss data-focused problems in the deployment of ML, emphasizing the need to efficiently deliver data to ML models for timely clinical predictions and to account for natural data shifts that can deteriorate model performance. This Review discusses the use of deep generative models, federated learning and transformer models to address challenges in the deployment of machine learning for healthcare.
... Model weight inversion attacks such as those in [25,26] have limited impact, and they do not necessarily entail a privacy breach, as discussed in [27]. Similarly, the use of generative adversarial networks for attacks on collaborative training systems [28] has been reported to be unrealistic in [29]. ...
Article
Full-text available
In this paper, we propose a secure system for performing deep learning with distributed trainers connected to a central parameter server. Our system has the following two distinct features: (1) the distributed trainers can detect malicious activities in the server; (2) the distributed trainers can perform both vertical and horizontal neural network training. In the experiments, we apply our system to medical data including magnetic resonance and X-ray images and obtain comparable or even better area-under-the-curve scores when compared to existing scores.
... As shown in Fig. 2, without having direct access to the clients' local data, FL systems allow distributed clients to collaborate and train ML and DL models by sharing training parameters. However, even though raw data is not centralized, sharing gradient updates computed on the local clients with a central server could enable reverse-engineering attacks that passively intercept the gradients exchanged during the training process [232,233]. Moreover, malicious clients can introduce a backdoor functionality that compromises the underlying FL system during the training of the global federated model [234], which can run against the GDPR since medical data are highly sensitive and private. ...
Preprint
Full-text available
With the advent of the IoT, AI, and ML/DL algorithms, data-driven medical applications have emerged as a promising tool for designing reliable and scalable diagnostic and prognostic models from medical data. This has attracted a great deal of attention from academia to industry in recent years and has undoubtedly improved the quality of healthcare delivery. However, these AI-based medical applications still see poor adoption due to their difficulties in satisfying strict security, privacy, and quality-of-service standards (such as low latency). Moreover, medical data are usually fragmented and private, making it challenging to generate robust results across populations. Recent developments in federated learning (FL) have made it possible to train complex machine-learned models in a distributed manner. Thus, FL has become an active research domain, particularly for processing medical data at the edge of the network in a decentralized way to address privacy and security concerns. To this end, this survey paper highlights the current and future of FL technology in medical applications where data sharing is a significant burden. It also reviews and discusses the current research trends and their outcomes for designing reliable and scalable FL models. We outline FL's general statistical problems, device challenges, security and privacy concerns, and its potential in the medical domain. Moreover, our study also focuses on medical applications, where we highlight the burden of global cancer and the efficient use of FL for the development of computer-aided diagnosis tools for addressing it. We hope that this review serves as a checkpoint that sets forth the existing state-of-the-art works in a thorough manner and offers open problems and future research directions for this field.
... This attack is more related to an attack on the server-side model, not the transmitted data. The second is the inversion attack [35], which attempts to infer raw data from a processed representation. This is the same attack scenario as the aforementioned reconstruction attack. ...
Preprint
Full-text available
Deep learning models are increasingly deployed in real-world applications. These models are often deployed on the server side and receive user data in an information-rich representation to solve a specific task, such as image classification. Since images can contain sensitive information, which users might not be willing to share, privacy protection becomes increasingly important. Adversarial Representation Learning (ARL) is a common approach to train an encoder that runs on the client side and obfuscates an image. It is assumed that the obfuscated image can safely be transmitted and used for the task on the server without privacy concerns. However, in this work, we find that a trained reconstruction attacker can successfully recover the original image from existing ARL methods. To this end, we introduce a novel ARL method enhanced through low-pass filtering, limiting the amount of information available to be encoded in the frequency domain. Our experimental results reveal that our approach withstands reconstruction attacks while outperforming previous state-of-the-art methods regarding the privacy-utility trade-off. We further conduct a user study to qualitatively assess our defense against the reconstruction attack.
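
A minimal illustration of the frequency-domain low-pass filtering idea is sketched below; the cutoff and the post-hoc application are illustrative assumptions, whereas the paper integrates the filter into ARL training rather than applying it to finished images.

```python
# Hedged illustration of frequency-domain low-pass filtering as a way to limit
# the information an encoded image can carry; cutoff and filter shape are arbitrary.
import numpy as np

def low_pass(image: np.ndarray, cutoff: int = 8) -> np.ndarray:
    """Zero out all spatial frequencies outside a (2*cutoff)^2 box around DC."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    mask = np.zeros_like(spectrum)
    c_y, c_x = spectrum.shape[0] // 2, spectrum.shape[1] // 2
    mask[c_y - cutoff:c_y + cutoff, c_x - cutoff:c_x + cutoff] = 1
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return np.real(filtered)

image = np.random.rand(64, 64)
print(low_pass(image).shape)  # (64, 64), with high-frequency detail removed
```
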
... Compared with traditional machine learning collaboration that pools the training data, FL realises collaboration while keeping the training data decentralised, relying instead on the exchange of model parameters. However, there is evidence that some key information about the data can still be reconstructed even from the training parameters of the model (Fredrikson et al., 2015). Thus, the secure exchange of model parameters is of great significance. ...
Article
Full-text available
The manufacturing of complex parts, such as aircraft structural parts and aero-engine casing parts, has always been one of the focuses of the manufacturing field. The machining process involves a variety of hard problems (e.g. tool wear prediction, smart process planning), which require assumptions, simplifications and approximations during mechanism-based modelling. For these problems, supervised machine learning methods have achieved good results by fitting input–output relations from plenty of labelled data. However, data acquisition is difficult, time-consuming, and costly; thus, the amount of data in a single enterprise is often limited. To address this issue, this research aims to realise equivalent manufacturing data sharing based on federated learning (FL), a new machine learning framework that uses scattered data while protecting data privacy. An enterprise-oriented framework is first proposed to find FL participants with similar data resources. Then, the machining parameter planning task for aircraft structural parts is taken as an example to propose an FL model, which mines the knowledge and rules in the historical processing files from multiple enterprises. In addition, to address the data differences among enterprises, a domain adaptation method from transfer learning is used to obtain domain-invariant features. In the case study, a prototype platform is developed, and to validate the performance of the proposed model, a data set is built based on the historical processing files from three aircraft manufacturing enterprises. The proposed model achieves the best performance compared with a model trained only with the data from a single enterprise and a model without domain adaptation.
... • Privacy and security concerns: Although one purpose of utilizing distributed learning in wireless communication is to preserve data privacy, the model gradient information or outputs transmitted over wireless communication links can still be disclosed and reverse-traced, which means that privacy is only partially preserved [164]. Such privacy concerns are known as gradient leakage attacks [165] and membership inference [166]. ...
Preprint
Cloud-based solutions are becoming inefficient due to considerable time delays, high power consumption, and security and privacy concerns caused by billions of connected wireless devices and the zillions of bytes of data they typically produce at the network edge. A blend of edge computing and Artificial Intelligence (AI) techniques could optimally shift the resourceful computation servers closer to the network edge, which provides support for advanced AI applications (e.g., video/audio surveillance and personal recommendation systems) by enabling intelligent decision making on computing at the point of data generation as and when it is needed, and distributed Machine Learning (ML) with its potential to avoid the transmission of large datasets and the possible compromise of privacy that may exist in cloud-based centralized learning. Therefore, AI is envisioned to become native and ubiquitous in future communication and networking systems. In this paper, we conduct a comprehensive overview of recent advances in distributed intelligence in wireless networks under the umbrella of native-AI wireless networks, with a focus on the basic concepts of native-AI wireless networks, on AI-enabled edge computing, on the design of distributed learning architectures for heterogeneous networks, on the communication-efficient technologies to support distributed learning, and on AI-empowered end-to-end communications. We highlight the advantages of hybrid distributed learning architectures compared to state-of-the-art distributed learning techniques. We summarize the challenges of existing research contributions in distributed intelligence in wireless networks and identify potential future opportunities.
... The remarkable performance of face embeddings has made face recognition very popular for identity authentication in a wide range of security-sensitive applications, from financial sectors to criminal identification. However, deep CNNs have proved to be vulnerable to adversarial instances [8,19,24,27]. Adversarial inputs [15] are defined as maliciously generated instances that are indistinguishable to human eyes, created by adding small perturbations [12]. Adversarial attacks against face recognition are often divided into two types: the dodging attack, which enables attackers to evade being recognized, and the impersonation attack, in which the attacker is recognized as another individual. ...
Preprint
Full-text available
Recent successful adversarial attacks on face recognition show that, despite the remarkable progress of face recognition models, they are still far behind human intelligence in perception and recognition. This reveals the vulnerability of deep convolutional neural networks (CNNs), the state-of-the-art building block for face recognition models, to adversarial examples, which can cause serious consequences for secure systems. Gradient-based adversarial attacks have been widely studied before and proved to be successful against face recognition models. However, finding the optimized perturbation for each face requires submitting a significant number of queries to the target model. In this paper, we propose a recursive adversarial attack on face recognition using automatic face warping, which needs an extremely limited number of queries to fool the target model. Instead of a random face warping procedure, the warping functions are applied to specific detected regions of the face, such as the eyebrows, nose, and lips. We evaluate the robustness of the proposed method in the decision-based black-box attack setting, where the attackers have no access to the model parameters and gradients, but hard-label predictions and confidence scores are provided by the target model.
... This is also referred to as the attribute inference attack, and the attacker could be either the RS or external entities that have obtained public information about the user and the RS. In terms of the platform privacy, the existence of the recommendation also opens the door to the model inversion attack [93], where attackers can observe the behavior of the model and then reconstruct the private information. For example, by sending queries and receiving feedback from the RS, one can reconstruct the missing private attributes of a sample [130,201], infer if a certain record (or user) is present in the training data of the model [251,334] (a.k.a. ...
Preprint
Recommender systems (RS), serving at the forefront of Human-centered AI, are widely deployed in almost every corner of the web and facilitate the human decision-making process. However, despite their enormous capabilities and potential, RS may also lead to undesired counter-effects on users, items, producers, platforms, or even society at large, such as compromised user trust due to non-transparency, unfair treatment of different consumers or producers, and privacy concerns due to the extensive use of users' private data for personalization, just to name a few. All of these create an urgent need for Trustworthy Recommender Systems (TRS) so as to mitigate or avoid such adverse impacts and risks. In this survey, we will introduce techniques related to trustworthy and responsible recommendation, including but not limited to explainable recommendation, fairness in recommendation, privacy-aware recommendation, robustness in recommendation, and user-controllable recommendation, as well as the relationship between these different perspectives in terms of trustworthy and responsible recommendation. Through this survey, we hope to deliver readers a comprehensive view of the research area and raise attention in the community to the importance, existing research achievements, and future research directions of trustworthy recommendation.
... Despite being powerful, ML models are shown to be vulnerable to various privacy attacks [8,28,30], represented by membership inference attacks [28,27,22,31]. The goal of membership inference attack is to determine whether a data sample is used to train a target ML model. ...
Preprint
Full-text available
Semi-supervised learning (SSL) leverages both labeled and unlabeled data to train machine learning (ML) models. State-of-the-art SSL methods can achieve performance comparable to supervised learning while leveraging much less labeled data. However, most existing works focus on improving the performance of SSL. In this work, we take a different angle by studying the training data privacy of SSL. Specifically, we propose the first data augmentation-based membership inference attacks against ML models trained by SSL. Given a data sample and black-box access to a model, the goal of a membership inference attack is to determine whether the data sample belongs to the training dataset of the model. Our evaluation shows that the proposed attack can consistently outperform existing membership inference attacks and achieves the best performance against the model trained by SSL. Moreover, we uncover that the reason for membership leakage in SSL is different from the commonly believed one in supervised learning, i.e., overfitting (the gap between training and testing accuracy). We observe that the SSL model is well generalized to the testing data (with almost no overfitting) but "memorizes" the training data by giving a more confident prediction regardless of its correctness. We also explore early stopping as a countermeasure to prevent membership inference attacks against SSL. The results show that early stopping can mitigate the membership inference attack, but at the cost of the model's utility degradation.
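
The "more confident prediction" observation is what makes even a simple confidence-thresholding baseline work; the sketch below shows that baseline only, not the paper's augmentation-based attack, and the threshold and fake confidence values are illustrative assumptions.

```python
# Hedged sketch of a confidence-thresholding membership inference baseline:
# flag a sample as a training member if the model's top predicted probability
# exceeds a threshold. Illustrative data and threshold, not the paper's attack.
import numpy as np

def is_member(confidences: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """confidences: (n_samples, n_classes) softmax outputs from the target model."""
    return confidences.max(axis=1) >= threshold

rng = np.random.default_rng(0)
member_conf = rng.uniform(0.9, 1.0, size=(5, 10))      # "memorized" training samples
nonmember_conf = rng.uniform(0.1, 0.9, size=(5, 10))   # unseen test samples
print(is_member(member_conf), is_member(nonmember_conf))
```
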
... As a result, it is quite challenging to train an L2M model with a high utility. Orthogonal to this, L2M models are vulnerable to adversarial attacks, i.e., privacy model attacks (Shokri et al., 2017;Fredrikson et al., 2015;Wang et al., 2015;Papernot et al., 2016), when DNNs are trained on highly sensitive data, e.g., clinical records (Choi et al., 2017;Miotto et al., 2016), user profiles (Roumia & Steinhubl, 2014;Wu et al., 2010), and medical images (Plis et al., 2014;Helmstaedter et al., 2013). ...
Preprint
Full-text available
In this paper, we show that the process of continually learning new tasks and memorizing previous tasks introduces unknown privacy risks and challenges to bound the privacy loss. Based upon this, we introduce a formal definition of Lifelong DP, in which the participation of any data tuples in the training set of any tasks is protected, under a consistently bounded DP protection, given a growing stream of tasks. A consistently bounded DP means having only one fixed value of the DP privacy budget, regardless of the number of tasks. To preserve Lifelong DP, we propose a scalable and heterogeneous algorithm, called L2DP-ML with a streaming batch training, to efficiently train and continue releasing new versions of an L2M model, given the heterogeneity in terms of data sizes and the training order of tasks, without affecting DP protection of the private training set. An end-to-end theoretical analysis and thorough evaluations show that our mechanism is significantly better than baseline approaches in preserving Lifelong DP. The implementation of L2DP-ML is available at: https://github.com/haiphanNJIT/PrivateDeepLearning.
... Meanwhile, another branch of research demonstrates that although clients' private data is not directly revealed, the shared FL global model can unintentionally leak sensitive information about the data on which it was trained [12]. As pointed out by previous studies, the attacker can reconstruct the training data [13], deanonymize participants [14], and even infer class representatives [15] and data sample membership [16,17]. ...
Preprint
Due to the distributed nature of Federated Learning (FL), researchers have uncovered that FL is vulnerable to backdoor attacks, which aim to inject a sub-task into the FL model without corrupting the performance of the main task. A single-shot backdoor attack achieves high accuracy on both the main task and the backdoor sub-task when injected at FL model convergence. However, an early-injected single-shot backdoor attack is ineffective because: (1) the maximum backdoor effectiveness is not reached at injection because of the dilution effect from normal local updates; (2) the backdoor effect decreases quickly as the backdoor is overwritten by newly arriving normal local updates. In this paper, we strengthen the early-injected single-shot backdoor attack by utilizing FL model information leakage. We show that FL convergence can be expedited if the client trains on a dataset that mimics the distribution and gradients of the whole population. Based on this observation, we propose a two-phase backdoor attack, which includes a preliminary phase for the subsequent backdoor attack. In the preliminary phase, the attacker-controlled client first launches a whole-population distribution inference attack and then trains on a locally crafted dataset that is aligned with both the gradient and the inferred distribution. Benefiting from the preliminary phase, the later-injected backdoor achieves better effectiveness, as the backdoor effect is less likely to be diluted by normal model updates. Extensive experiments are conducted on the MNIST dataset under various data heterogeneity settings to evaluate the effectiveness of the proposed backdoor attack. Results show that the proposed backdoor outperforms existing backdoor attacks in both success rate and longevity, even when defense mechanisms are in place.
... Within HFL, data leakage attracts the most attention: [40,69] use GANs [32] to generate class-representative and synthesized samples; [25,28,39,77,80] adopt an optimization framework to invert the gradient and learn the data; [4] observes the special structure of the gradient for a single input and partially recovers the private data; [13,48] show how to reconstruct the data when they are uniformly distributed or binary; [78] further demonstrates that the sign of the cross-entropy loss can directly reveal the label. Meanwhile, other forms of attack have been proposed, including membership inference [55,60], property inference [53], which resembles the class-representative inference in model inversion [27], and model poisoning/backdoor attacks [9,10]. We refer the reader to [14] for a comprehensive survey of the security vulnerabilities in the FL ecosystem, and [42] for general open problems in FL. ...
Preprint
Full-text available
We consider vertical logistic regression (VLR) trained with mini-batch gradient descent -- a setting which has attracted growing interest among industries and proven to be useful in a wide range of applications including finance and medical research. We provide a comprehensive and rigorous privacy analysis of VLR in a class of open-source Federated Learning frameworks, where the protocols might differ from one another, yet a procedure for obtaining local gradients is implicitly shared. We first consider the honest-but-curious threat model, in which the detailed implementation of the protocol is neglected and only the shared procedure is assumed, which we abstract as an oracle. We find that even under this general setting, a single-dimension feature and the label can still be recovered from the other party under suitable constraints on the batch size, demonstrating the potential vulnerability of all frameworks following the same philosophy. Then we look into a popular instantiation of the protocol based on Homomorphic Encryption (HE). We propose an active attack that significantly weakens the constraints on batch size in the previous analysis by generating and compressing auxiliary ciphertexts. To address the privacy leakage within the HE-based protocol, we develop a simple-yet-effective countermeasure based on Differential Privacy (DP), and provide both utility and privacy guarantees for the updated algorithm. Finally, we empirically verify the effectiveness of our attack and defense on benchmark datasets. Altogether, our findings suggest that all vertical federated learning frameworks that solely depend on HE might contain severe privacy risks, and DP, which has already demonstrated its power in horizontal federated learning, can also play a crucial role in the vertical setting, especially when coupled with HE or secure multi-party computation (MPC) techniques.
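To see one reason single-sample gradients can leak features, consider plain (non-federated) logistic regression with batch size 1: the weight gradient is a scalar multiple of the feature vector, so the input is revealed up to scale. The check below is only an illustration of that general phenomenon, not the paper's attack or its VLR protocol.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w, x, y = rng.normal(size=5), rng.normal(size=5), 1.0

# Batch-size-1 logistic-regression gradient with respect to the weights:
# (sigmoid(w . x) - y) * x, i.e. a scalar times the feature vector x.
grad = (sigmoid(w @ x) - y) * x

# Because the gradient is collinear with x, the features are recovered
# up to a single unknown scalar (fixed here using the first coordinate).
recovered = grad / grad[0] * x[0]
print(np.allclose(recovered, x))   # True
```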
Article
Deep learning techniques based on neural networks have made significant achievements in various fields of artificial intelligence. However, model training requires large-scale data sets; these data sets are crowd-sourced, and model parameters encode private information, creating a risk of privacy leakage. With the trend toward sharing pretrained models, the risk of stealing training data sets through membership inference attacks and model inversion attacks is further heightened. To tackle privacy-preserving problems in deep learning tasks, we propose an improved Differential Privacy Stochastic Gradient Descent algorithm that uses a simulated annealing algorithm and a Laplace smoothing denoising mechanism to optimize the allocation of the privacy loss, and replaces constant clipping with an adaptive gradient clipping method to improve model accuracy. We also analyze in detail the privacy cost under a random-shuffle batch processing method within the framework of Subsampled Rényi Differential Privacy. Compared with existing privacy-preserving training methods with fixed or dynamic privacy parameters in classification tasks, our implementation and experiments show that we can use a smaller privacy budget to train deep neural networks with nonconvex objective functions, obtain higher model accuracy, and incur almost zero additional cost in terms of model complexity, training efficiency, and model quality.
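As a rough sketch of the "clip adaptively, then add noise" idea, the step below sets the clipping bound to a quantile of recently observed per-example gradient norms and then adds Gaussian noise. The quantile rule, `sigma`, and the learning rate are illustrative stand-ins; the paper's simulated-annealing budget allocation and its exact adaptive clipping rule are not reproduced here.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, sigma, recent_norms,
                quantile=0.5, lr=0.1, rng=np.random.default_rng(0)):
    """One DP-SGD step with an adaptive clipping threshold (sketch).

    The clip bound is set to a quantile of recently observed per-example
    gradient norms; this is a stand-in for the paper's adaptive rule,
    which the abstract does not spell out.
    """
    clip = np.quantile(recent_norms, quantile) if len(recent_norms) else 1.0
    clipped = []
    for g in per_example_grads:                         # clip each example's gradient
        n = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip / (n + 1e-12)))
        recent_norms.append(n)                          # remember norms for future steps
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(0.0, sigma * clip,
                                                     size=params.shape)
    grad_estimate = noisy_sum / len(per_example_grads)
    return params - lr * grad_estimate
```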
Article
An accountable algorithmic transparency report (ATR) should ideally investigate (a) transparency of the underlying algorithm, and (b) fairness of the algorithmic decisions, and at the same time preserve data subjects’ privacy. However, a provably formal study of the impact on data subjects’ privacy caused by the utility of releasing an ATR (one that investigates transparency and fairness) has yet to appear in the literature. The far-reaching benefit of such a study lies in the methodical characterization of privacy-utility trade-offs for the public release of ATRs, and their consequential application-specific impact on the dimensions of society, politics, and economics. In this paper, we first investigate and demonstrate potential privacy hazards brought on by the deployment of transparency and fairness measures in released ATRs. To preserve data subjects’ privacy, we then propose a linear-time optimal-privacy scheme, built upon standard linear fractional programming (LFP) theory, for announcing ATRs, subject to constraints controlling the tolerance of privacy perturbation on the utility of transparency schemes. Subsequently, we quantify the privacy-utility trade-offs induced by our scheme, and analyze the impact of privacy perturbation on fairness measures in ATRs. To the best of our knowledge, this is the first analytical work that simultaneously addresses trade-offs between the triad of privacy, utility, and fairness, applicable to algorithmic transparency reports.
Article
A feedforward-designed convolutional neural network (FF-CNN) is an interpretable neural network with low training complexity. Unlike a neural network trained using backpropagation (BP) algorithms and optimizers (e.g., stochastic gradient descent (SGD) and Adam), an FF-CNN obtains the model parameters in one feed-forward calculation based on two methods of data statistics: subspace approximation with adjusted bias and least squares regression. Currently, models based on FF-CNN training methods have achieved outstanding performance in the fields of image classification and point cloud data processing. In this study, we analyze and verify that there is a risk of user privacy leakage during the training process of an FF-CNN and that existing privacy-preserving methods for model gradients or loss functions do not apply to FF-CNN models. Therefore, we propose a securely forward-designed convolutional neural network algorithm (SFF-CNN) to protect the privacy and security of data providers for the FF-CNN model. Firstly, we propose the DPSaab algorithm to add the corresponding noise to the one-stage Saab transform in the FF-CNN design for improved protection performance. Secondly, because noise addition brings the risk of model over-fitting and further increases the possibility of privacy leakage, we propose the SJS algorithm to filter the input features of the fully connected model layer. Finally, we theoretically prove that the proposed algorithm satisfies differential privacy and experimentally demonstrate that the proposed algorithm provides strong privacy protection. The proposed algorithm outperforms the compared deep learning privacy-preserving algorithms in terms of utility and robustness.
Article
Full-text available
Recently enacted legislation grants individuals certain rights to decide in what fashion their personal data may be used and in particular a “right to be forgotten”. This poses a challenge to machine learning: how to proceed when an individual retracts permission to use data which has been part of the training process of a model? From this question emerges the field of machine unlearning, which could be broadly described as the investigation of how to “delete training data from models”. Our work complements this direction of research for the specific setting of class-wide deletion requests for classification models (e.g. deep neural networks). As a first step, we propose linear filtration as an intuitive, computationally efficient sanitization method. Our experiments demonstrate benefits in an adversarial setting over naive deletion schemes.
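For contrast with the paper's method, the snippet below is only the naive class-deletion baseline mentioned above: it drops the deleted class's column from the predicted probabilities and renormalizes. The paper's linear filtration instead applies a learned linear map to the classifier's outputs; that map is not reproduced here.

```python
import numpy as np

def naive_class_filtration(probs, deleted_class):
    """Remove one class from a classifier's output and renormalize.

    `probs` has shape (n_samples, n_classes). This is the naive deletion
    baseline referenced in the abstract, shown only to make the
    "filter the output layer" idea concrete.
    """
    kept = np.delete(probs, deleted_class, axis=1)    # drop the deleted class column
    return kept / kept.sum(axis=1, keepdims=True)     # renormalize each row
```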
Article
Digital twin (DT) technologies, comprising data-rich models and machine learning, allow the operators of smart city applications to gain an accurate representation of complex cyber-physical systems. However, the implicit need for resilient data protection must be met by integrating privacy-preserving mechanisms into the DT system design as part of an effective defence-in-depth strategy.
Article
In the past few decades, artificial intelligence (AI) technology has experienced swift developments, changing everyone’s daily life and profoundly altering the course of human society. The intention behind developing AI was and is to benefit humans by reducing labor, increasing everyday conveniences, and promoting social good. However, recent research and AI applications indicate that AI can cause unintentional harm to humans by, for example, making unreliable decisions in safety-critical scenarios or undermining fairness by inadvertently discriminating against a group or groups. Consequently, trustworthy AI has recently garnered increased attention regarding the need to avoid the adverse effects that AI could bring to people, so people can fully trust and live in harmony with AI technologies. A tremendous amount of research on trustworthy AI has been conducted and witnessed in recent years. In this survey, we present a comprehensive appraisal of trustworthy AI from a computational perspective to help readers understand the latest technologies for achieving trustworthy AI. Trustworthy AI is a large and complex subject, involving various dimensions. In this work, we focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Nondiscrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-being. For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems. We also discuss the accordant and conflicting interactions among different dimensions and discuss potential aspects for trustworthy AI to investigate in the future.
Article
Background: Modern machine learning and deep learning algorithms require large amounts of data; however, data sharing between multiple healthcare institutions is limited by privacy and security concerns. Summary: Federated learning provides a functional alternative to the single-institution approach while avoiding the pitfalls of data sharing. In cross-silo federated learning, the data do not leave a site. The raw data are stored at the site of collection. Models are created at the site of collection and are updated locally to achieve a learning objective. We demonstrate a use case with COVID-19-associated acute kidney injury (AKI). We showed that federated models outperformed their local counterparts, even when evaluated on local data in the test dataset, and that performance was similar to that of models trained on pooled data. Increases in performance at a given hospital were inversely proportional to that hospital's dataset size, which suggests that hospitals with smaller datasets have significant room for growth with federated learning approaches. Key messages: This short article provides an overview of federated learning, gives a use case for COVID-19-associated acute kidney injury, and finally details the issues along with some potential solutions.
Article
Deep Neural Network (DNN), one of the most powerful machine learning algorithms, is increasingly leveraged to overcome the bottleneck of effectively exploring and analyzing massive data to boost advanced scientific development. It is not a surprise that cloud computing providers offer cloud-based DNN as an out-of-the-box service. Though there are benefits to cloud-based DNN, the interaction mechanism among two or more entities in the cloud inevitably induces new privacy risks. This survey presents the most recent findings on privacy attacks and defenses that have appeared in cloud-based neural network services. We systematically and thoroughly review privacy attacks and defenses in the pipeline of cloud-based DNN service, i.e., data manipulation, training, and prediction. In particular, a new theory, called the cloud-based ML privacy game, is extracted from the recently published literature to provide a deep understanding of state-of-the-art research. Finally, the challenges and future work are presented to help researchers continue to push forward the contest between privacy attackers and defenders.
Article
Building performant and robust artificial intelligence (AI)–based applications for dentistry requires large and high-quality data sets, which usually reside in distributed data silos from multiple sources (e.g., different clinical institutes). Collaborative efforts are limited as privacy constraints forbid direct sharing across the borders of these data silos. Federated learning is a scalable and privacy-preserving framework for the collaborative training of AI models without data sharing, where instead the knowledge is exchanged in the form of wisdom learned from the data. This article aims to introduce the established concept of federated learning, together with its opportunities and challenges, to foster collaboration on AI-based applications within the dental research community.
Article
In this work, we study the decentralized empirical risk minimization problem under the constraint of differential privacy (DP). Based on the algorithmic framework of dual averaging, we develop a novel decentralized stochastic optimization algorithm to solve the problem. The proposed algorithm features the following: i) it perturbs the stochastic subgradient evaluated on individual data samples, so that information about the dataset can be released in a differentially private manner; ii) it employs hyperparameters that are more aggressive than those of conventional decentralized dual averaging algorithms to speed up convergence. The upper bound on the utility loss of the proposed algorithm is proven to be smaller than that of existing methods achieving the same level of DP. As a by-product, when the perturbation is removed, the non-private version of the proposed algorithm attains the optimal O(1/t) convergence rate for non-smooth stochastic optimization. Finally, experimental results are presented to demonstrate the effectiveness of the algorithm.
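A minimal single-node sketch of the pattern described above, perturbing a per-sample stochastic subgradient and feeding it into a dual-averaging update; the consensus step across neighbors, the paper's step-size schedule, and its privacy accounting are all omitted, and `sigma`, `lr`, and the 1/sqrt(t) scaling are illustrative choices.

```python
import numpy as np

def dp_dual_averaging(subgrad, data, x0, sigma, steps=100, lr=0.1,
                      rng=np.random.default_rng(0)):
    """Differentially private dual averaging on one node (sketch).

    subgrad(x, sample) returns a subgradient of the local loss at x on a
    single data sample; Gaussian noise of scale `sigma` is added before
    the subgradient enters the running dual sum. x0 is a float array.
    """
    x = x0.astype(float).copy()
    z = np.zeros_like(x)                        # running sum of noisy subgradients
    for t in range(1, steps + 1):
        sample = data[rng.integers(len(data))]  # draw one data sample
        g = subgrad(x, sample) + rng.normal(0.0, sigma, size=x.shape)
        z += g                                  # accumulate the perturbed subgradient
        x = -lr / np.sqrt(t) * z                # lazy dual-averaging update
    return x
```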
Conference Paper
Full-text available
The biometrics community enjoys an active research field that has produced algorithms for several modalities suitable for real-world applications. Despite these developments, there exist few open source implementations of complete algorithms that are maintained by the community or deployed outside a laboratory environment. In this paper we motivate the need for more community-driven open source software in the field of biometrics and present OpenBR as a candidate to address this deficiency. We overview the OpenBR software architecture and consider still-image frontal face recognition as a case study to illustrate its strengths and capabilities. All of our work is available at www.openbiometrics.org.
Conference Paper
Full-text available
Most modern face recognition systems rely on a feature representation given by a hand-crafted image descriptor, such as Local Binary Patterns (LBP), and achieve improved performance by combining several such representations. In this paper, we propose deep learning as a natural source for obtaining additional, complementary representations. To learn features in high-resolution images, we make use of convolutional deep belief networks. Moreover, to take advantage of global structure in an object class, we develop local convolutional restricted Boltzmann machines, a novel convolutional learning model that exploits the global structure by not assuming stationarity of features across the image, while maintaining scalability and robustness to small misalignments. We also present a novel application of deep learning to descriptors other than pixel intensity values, such as LBP. In addition, we compare performance of networks trained using unsupervised learning against networks with random filters, and empirically show that learning weights not only is necessary for obtaining good multilayer representations, but also provides robustness to the choice of the network architecture parameters. Finally, we show that a recognition system using only representations obtained from deep learning can achieve comparable accuracy with a system using a combination of hand-crafted image descriptors. Moreover, by combining these representations, we achieve state-of-the-art results on a real-world face verification database.
Article
Full-text available
Pylearn2 is a machine learning research library. This does not just mean that it is a collection of machine learning algorithms that share a common API; it means that it has been designed for flexibility and extensibility in order to facilitate research projects that involve new or unusual use cases. In this paper we give a brief history of the library, an overview of its basic philosophy, a summary of the library's architecture, and a description of how the Pylearn2 community functions socially.
Conference Paper
Full-text available
Many classification tasks, such as spam filtering, intrusion detection, and terrorism detection, are complicated by an adversary who wishes to avoid detection. Previous work on adversarial classification has made the unrealistic assumption that the attacker has perfect knowledge of the classifier [2]. In this paper, we introduce the adversarial classifier reverse engineering (ACRE) learning problem, the task of learning sufficient information about a classifier to construct adversarial attacks. We present efficient algorithms for reverse engineering linear classifiers with either continuous or Boolean features and demonstrate their effectiveness using real data from the domain of spam filtering.
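The sort of membership-query probing that ACRE formalizes can be illustrated with a toy black-box experiment; the sketch below merely toggles Boolean features one at a time from a base instance and records which single flips change the decision. It is not the ACRE algorithm, which additionally reasons about adversarial cost and query efficiency.

```python
def probe_boolean_features(classify, base, n_features):
    """Toggle one Boolean feature at a time and record decision flips.

    `classify` is a black-box 0/1 decision function and `base` a Boolean
    instance (list of 0/1 values). Only an illustration of query-based
    probing of a classifier, not the ACRE learning procedure itself.
    """
    base_label = classify(base)
    influential = []
    for i in range(n_features):
        probe = list(base)
        probe[i] = 1 - probe[i]            # flip a single feature
        if classify(probe) != base_label:  # did the decision change?
            influential.append(i)
    return influential
```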
Conference Paper
Full-text available
Genome-wide association studies (GWAS) aim at discovering the association between genetic variations, particularly single-nucleotide polymorphism (SNP), and common diseases, which is well recognized to be one of the most important and active areas in biomedical research. Also renowned is the privacy implication of such studies, which has been brought into the limelight by the recent attack proposed by Homer et al. Homer's attack demonstrates that it is possible to identify a GWAS participant from the allele frequencies of a large number of SNPs. Such a threat, unfortunately, was found in our research to be significantly understated. In this paper, we show that individuals can actually be identified from even a relatively small set of statistics, as those routinely published in GWAS papers. We present two attacks. The first one extends Homer's attack with a much more powerful test statistic, based on the correlations among different SNPs described by the coefficient of determination (r²). This attack can determine the presence of an individual from the statistics related to a couple of hundred SNPs. The second attack can lead to complete disclosure of hundreds of participants' SNPs, through analyzing the information derived from published statistics. We also found that those attacks can succeed even when the precisions of the statistics are low and part of the data is missing. We evaluated our attacks on real human genomes and concluded that such threats are completely realistic.
Conference Paper
Full-text available
We examine the tradeoff between privacy and usability of statistical databases. We model a statistical database by an n-bit string d_1, ..., d_n, with a query being a subset q ⊆ [n] to be answered by ∑_{i∈q} d_i. Our main result is a polynomial reconstruction algorithm of data from noisy (perturbed) subset sums. Applying this reconstruction algorithm to statistical databases, we show that in order to achieve privacy one has to add perturbation of magnitude Ω(√n). That is, smaller perturbation always results in a strong violation of privacy. We show that this result is tight by exemplifying access algorithms for statistical databases that preserve privacy while adding perturbation of magnitude Õ(√n). For time-T bounded adversaries we demonstrate a privacy-preserving access algorithm whose perturbation magnitude is ≈ √T.
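A toy numerical illustration of reconstruction from noisy subset sums, using ordinary least squares and rounding; the database size, number of queries, and noise level below are arbitrary, and this is not the paper's algorithm or its tight Ω(√n) analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 128, 1024                      # database size, number of queries
d = rng.integers(0, 2, size=n)        # secret 0/1 database

Q = rng.integers(0, 2, size=(m, n))   # random subset queries (rows are indicator vectors)
noise = rng.uniform(-2, 2, size=m)    # small per-query perturbation
answers = Q @ d + noise               # noisy subset sums released to the attacker

# Least-squares estimate followed by rounding each entry to {0, 1}.
d_hat, *_ = np.linalg.lstsq(Q, answers, rcond=None)
recovered = (d_hat > 0.5).astype(int)
print("fraction of entries recovered:", (recovered == d).mean())
```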
Conference Paper
Full-text available
There has been much interest in unsupervised learning of hierarchical generative models such as deep belief networks. Scaling such models to full-sized, high-dimensional images remains a difficult problem. To address this problem, we present the convolutional deep belief network, a hierarchical generative model which scales to realistic image sizes. This model is translation-invariant and supports efficient bottom-up and top-down probabilistic inference. Key to our approach is probabilistic max-pooling, a novel technique which shrinks the representations of higher layers in a probabilistically sound way. Our experiments show that the algorithm learns useful high-level visual features, such as object parts, from unlabeled images of objects and natural scenes. We demonstrate excellent performance on several visual recognition tasks and show that our model can perform hierarchical (bottom-up and top-down) inference over full-sized images.
Article
Full-text available
De-identified clinical data in standardized form (eg, diagnosis codes), derived from electronic medical records, are increasingly combined with research data (eg, DNA sequences) and disseminated to enable scientific investigations. This study examines whether released data can be linked with identified clinical records that are accessible via various resources to jeopardize patients' anonymity, and the ability of popular privacy protection methodologies to prevent such an attack. The study experimentally evaluates the re-identification risk of a de-identified sample of Vanderbilt's patient records involved in a genome-wide association study. It also measures the level of protection from re-identification, and data utility, provided by suppression and generalization. Privacy protection is quantified using the probability of re-identifying a patient in a larger population through diagnosis codes. Data utility is measured at a dataset level, using the percentage of retained information, as well as its description, and at a patient level, using two metrics based on the difference between the distribution of International Classification of Diseases (ICD) version 9 codes before and after applying privacy protection. More than 96% of 2800 patients' records are shown to be uniquely identified by their diagnosis codes with respect to a population of 1.2 million patients. Generalization is shown to reduce further the percentage of de-identified records by less than 2%, and over 99% of the three-digit ICD-9 codes need to be suppressed to prevent re-identification. Popular privacy protection methods are inadequate to deliver a sufficiently protected and useful result when sharing data derived from complex clinical systems. The development of alternative privacy protection models is thus required.
Article
Full-text available
Recent studies have demonstrated that statistical methods can be used to detect the presence of a single individual within a study group based on summary data reported from genome-wide association studies (GWAS). We present an analytical and empirical study of the statistical power of such methods. We thereby aim to provide quantitative guidelines for researchers wishing to make a limited number of SNPs available publicly without compromising subjects' privacy.
Article
Full-text available
Genetic variability among patients plays an important role in determining the dose of warfarin that should be used when oral anticoagulation is initiated, but practical methods of using genetic information have not been evaluated in a diverse and large population. We developed and used an algorithm for estimating the appropriate warfarin dose that is based on both clinical and genetic data from a broad population base. Clinical and genetic data from 4043 patients were used to create a dose algorithm that was based on clinical variables only and an algorithm in which genetic information was added to the clinical variables. In a validation cohort of 1009 subjects, we evaluated the potential clinical value of each algorithm by calculating the percentage of patients whose predicted dose of warfarin was within 20% of the actual stable therapeutic dose; we also evaluated other clinically relevant indicators. In the validation cohort, the pharmacogenetic algorithm accurately identified larger proportions of patients who required 21 mg of warfarin or less per week and of those who required 49 mg or more per week to achieve the target international normalized ratio than did the clinical algorithm (49.4% vs. 33.3%, P<0.001, among patients requiring < or = 21 mg per week; and 24.8% vs. 7.2%, P<0.001, among those requiring > or = 49 mg per week). The use of a pharmacogenetic algorithm for estimating the appropriate initial dose of warfarin produces recommendations that are significantly closer to the required stable therapeutic dose than those derived from a clinical algorithm or a fixed-dose approach. The greatest benefits were observed in the 46.2% of the population that required 21 mg or less of warfarin per week or 49 mg or more per week for therapeutic anticoagulation.
Article
Full-text available
Author Summary In this report we describe a framework for accurately and robustly resolving whether individuals are in a complex genomic DNA mixture using high-density single nucleotide polymorphism (SNP) genotyping microarrays. We develop a theoretical framework for detecting an individual's presence within a mixture, show its limits through simulation, and finally demonstrate experimentally the identification of the presence of genomic DNA of individuals within a series of highly complex genomic mixtures. Our approaches demonstrate straightforward identification of trace amounts (<1%) of DNA from an individual contributor within a complex mixture. We show how probe-intensity analysis of high-density SNP data can be used, even given the experimental noise of a microarray. We discuss the implications of these findings in two fields: forensics and genome-wide association (GWA) genetic studies. Within forensics, resolving whether an individual is contributing trace amounts of genomic DNA to a complex mixture is a tremendous challenge. Within GWA studies, there is a considerable push to make experimental data publicly available so that the data can be combined with other studies. Our findings show that such an approach does not completely conceal identity, since it is straightforward to assess the probability that a person or relative participated in a GWA study.
Conference Paper
Full-text available
In this paper, we present a new technique to protect the face biometric during recognition, using so-called cancellable biometrics. The technique is based on image-based (statistical) face recognition using the 2DPCA algorithm. The biometric data is transformed to its cancellable domain using polynomial functions and co-occurrence matrices. Original facial images are transformed non-linearly by a polynomial function whose parameters can be changed according to the issued version of the secure cancellable template. Co-occurrence matrices are also used in the transform to generate a distinctive feature vector, which is used for both security and recognition accuracy. The Hadamard product is used to construct the final cancellable template. The approach shows high flexibility, providing a new, mathematically proven relationship between two independent covariance matrices. The generated cancellable templates are used in the same fashion as the original facial images. The 2DPCA recognition algorithm has been used without any changes; the transformations are applied to the input images only, and yet recognition accuracy improves. Theoretical and experimental results have shown high irreversibility of the data, with accuracy improved by up to 3% over the original data.
Article
We initiate the study of privacy in pharmacogenetics, wherein machine learning models are used to guide medical treatments based on a patient's genotype and background. Performing an in-depth case study on privacy in personalized warfarin dosing, we show that suggested models carry privacy risks, in particular because attackers can perform what we call model inversion: an attacker, given the model and some demographic information about a patient, can predict the patient's genetic markers. As differential privacy (DP) is an oft-proposed solution for medical settings such as this, we evaluate its effectiveness for building private versions of pharmacogenetic models. We show that DP mechanisms prevent our model inversion attacks when the privacy budget is carefully selected. We go on to analyze the impact on utility by performing simulated clinical trials with DP dosing models. We find that for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality. We conclude that current DP mechanisms do not simultaneously improve genomic privacy while retaining desirable clinical efficacy, highlighting the need for new mechanisms that should be evaluated in situ using the general methodology introduced by our work.
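A simplified flavor of the inversion idea described above, for a generic dose model: each candidate value of the hidden genotype feature is scored by how consistent the model's prediction is with the observed stable dose, weighted by a marginal prior. `model_predict`, the feature layout, and the scoring rule are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def invert_genotype(model_predict, known_features, observed_dose,
                    genotype_values, genotype_prior):
    """Guess a hidden genotype feature from a dose model's output (sketch).

    Scores each candidate genotype by how well the model's prediction
    matches the observed stable dose, weighted by a marginal prior over
    genotype values, and returns the highest-scoring candidate.
    """
    scores = {}
    for g, prior in zip(genotype_values, genotype_prior):
        x = dict(known_features, genotype=g)      # fill in the candidate genotype
        residual = abs(model_predict(x) - observed_dose)
        scores[g] = prior * np.exp(-residual)     # higher score = more consistent
    return max(scores, key=scores.get)
```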
Conference Paper
We introduce a novel privacy framework that we call Membership Privacy. The framework includes positive membership privacy, which prevents the adversary from significantly increasing its ability to conclude that an entity is in the input dataset, and negative membership privacy, which prevents leaking of non-membership. These notions are parameterized by a family of distributions that captures the adversary's prior knowledge. The power and flexibility of the proposed framework lie in the ability to choose different distribution families to instantiate membership privacy. Many privacy notions in the literature are equivalent to membership privacy with interesting distribution families, including differential privacy, differential identifiability, and differential privacy under sampling. Casting these notions into the framework leads to a deeper understanding of the strengths and weaknesses of these notions, as well as their relationships to each other. The framework also provides a principled approach to developing new privacy notions under which better utility can be achieved than what is possible under differential privacy.
Article
We consider the power of linear reconstruction attacks in statistical data privacy, showing that they can be applied to a much wider range of settings than previously understood. Linear attacks have been studied before (Dinur and Nissim PODS'03, Dwork, McSherry and Talwar STOC'07, Kasiviswanathan, Rudelson, Smith and Ullman STOC'10, De TCC'12, Muthukrishnan and Nikolov STOC'12) but have so far been applied only in settings with releases that are obviously linear. Consider a database curator who manages a database of sensitive information but wants to release statistics about how a sensitive attribute (say, disease) in the database relates to some nonsensitive attributes (e.g., postal code, age, gender, etc). We show one can mount linear reconstruction attacks based on any release that gives: a) the fraction of records that satisfy a given non-degenerate boolean function. Such releases include contingency tables (previously studied by Kasiviswanathan et al., STOC'10) as well as more complex outputs like the error rate of classifiers such as decision trees; b) any one of a large class of M-estimators (that is, the output of empirical risk minimization algorithms), including the standard estimators for linear and logistic regression. We make two contributions: first, we show how these types of releases can be transformed into a linear format, making them amenable to existing polynomial-time reconstruction algorithms. This is already perhaps surprising, since many of the above releases (like M-estimators) are obtained by solving highly nonlinear formulations. Second, we show how to analyze the resulting attacks under various distributional assumptions on the data. Specifically, we consider a setting in which the same statistic (either a) or b) above) is released about how the sensitive attribute relates to all subsets of size k (out of a total of d) nonsensitive boolean attributes.
Article
We propose a proof-of-concept machine-learning expert system that learned knowledge of lifestyle and the associated 10-year cardiovascular disease (CVD) risks from individual-level data (i.e., the Atherosclerosis Risk in Communities Study, ARIC). The expert system prioritizes lifestyle options and identifies the one that maximally reduces an individual's 10-year CVD risk by (1) using the knowledge learned from the ARIC data and (2) communicating patient-specific cardiovascular risk information and personal limitations and preferences (as defined by variables used in this study). As a result, the optimal lifestyle is not only prioritized based on an individual's characteristics but is also relevant to personal circumstances. We also explored probable uses and tested the system in several examples using real-world scenarios and patient preferences. For example, the system identifies the most effective lifestyle activities as the starting point for an individual's behavior change, shows different levels of BMI changes and the associated CVD risk reductions to encourage weight loss, identifies whether weight loss or smoking cessation is the most urgent change for a diabetes patient, etc. Answers to the questions noted above vary based on an individual's characteristics. Our validation results from clinical trial simulations, which compared the original with the optimal lifestyle using an independent dataset, show that the optimal individualized patient-centered lifestyle consistently reduced 10-year CVD risks.
Conference Paper
In 1977 Dalenius articulated a desideratum for statistical databases: nothing about an individual should be learnable from the database that cannot be learned without access to the database. We give a general impossibility result showing that a formalization of Dalenius’ goal along the lines of semantic security cannot be achieved. Contrary to intuition, a variant of the result threatens the privacy even of someone not in the database. This state of affairs suggests a new measure, differential privacy, which, intuitively, captures the increased risk to one’s privacy incurred by participating in a database. The techniques developed in a sequence of papers [8, 13, 3], culminating in those described in [12], can achieve any desired level of privacy under this measure. In many cases, extremely accurate information about the database can be provided while simultaneously ensuring very high levels of privacy.
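For reference (the abstract above does not restate it), the standard ε-differential-privacy guarantee that formalizes this measure can be written as follows, where M is a randomized mechanism and D, D′ are databases differing in one individual's record:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
\quad \text{for every measurable set of outputs } S.
```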
Conference Paper
Over the last decade great strides have been made in developing techniques to compute functions privately. In particular, Differential Privacy gives strong promises about conclusions that can be drawn about an individual. In contrast, various syntactic methods for providing privacy (criteria such as k-anonymity and l-diversity) have been criticized for still allowing private information of an individual to be inferred. In this paper, we consider the ability of an attacker to use data meeting privacy definitions to build an accurate classifier. We demonstrate that even under Differential Privacy, such classifiers can be used to infer "private" attributes accurately in realistic data. We compare this to similar approaches for inference-based attacks on other forms of anonymized data. We show how the efficacy of all these attacks can be measured on the same scale, based on the probability of successfully inferring a private attribute. We observe that the accuracy of inference of private attributes for differentially private data and l-diverse data can be quite similar.
Conference Paper
Machine learning systems offer unparalleled flexibility in dealing with evolving input in a variety of applications, such as intrusion detection systems and spam e-mail filtering. However, machine learning algorithms themselves can be a target of attack by a malicious adversary. This paper provides a framework for answering the question, "Can machine learning be secure?" Novel contributions of this paper include a taxonomy of different types of attacks on machine learning techniques and systems, a variety of defenses against those attacks, a discussion of ideas that are important to security for machine learning, an analytical model giving a lower bound on the attacker's work function, and a list of open problems.
Conference Paper
Marginal (contingency) tables are the method of choice for government agencies releasing statistical summaries of categorical data. In this paper, we derive lower bounds on how much distortion (noise) is necessary in these tables to ensure the privacy of sensitive data. We extend a line of recent work on impossibility results for private data analysis [9, 12, 13, 15] to a natural and important class of functionalities. Consider a database consisting of n rows (one per individual), each row comprising d binary attributes. For any subset T of attributes of size |T| = k, the marginal table for T has 2^k entries; each entry counts how many times in the database a particular setting of these attributes occurs. We provide lower bounds for releasing all (d choose k) k-attribute marginal tables under several different notions of privacy. (1) We give efficient polynomial-time attacks which allow an adversary to reconstruct sensitive information given insufficiently perturbed marginal table releases. In particular, for a constant k, we obtain a tight bound of Ω̃(min{n, d^(k-1)}) on the average distortion per entry for any mechanism that releases all k-attribute marginals while providing “attribute” privacy (a weak notion implied by most privacy definitions). (2) Our reconstruction attacks require a new lower bound on the least singular value of a random matrix with correlated rows. Let M^(k) be a matrix with d^k rows formed by taking all possible k-way entry-wise products of an underlying set of d random vectors from {0,1}^n. For constant k, we show that the least singular value of M^(k) is Ω̃(√(d^k)) with high probability (the same asymptotic bound as for independent rows). (3) We obtain stronger lower bounds for marginal tables satisfying differential privacy. We give a lower bound of Ω̃(min{n, d^k}), which is tight for n = Ω̃(d^k). We extend our analysis to obtain stronger results for mechanisms that add instance-independent noise and weaker results when k is super-constant.
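To make the released object concrete, the sketch below computes one Laplace-noised k-attribute marginal table from a 0/1 database; the noise scale is the generic Laplace-mechanism choice for illustration and is unrelated to the lower bounds proved in the paper.

```python
import numpy as np
from itertools import product

def noisy_marginal(db, attrs, epsilon=1.0, rng=np.random.default_rng(2)):
    """Release a Laplace-noised marginal table for a set of binary attributes.

    `db` is an (n, d) 0/1 array and `attrs` lists the k attribute indices
    of interest. Each of the 2^k cells counts matching rows and then
    receives Laplace(1/epsilon) noise; this is only the generic mechanism,
    used to make the "marginal table" object concrete.
    """
    k = len(attrs)
    table = {}
    for setting in product([0, 1], repeat=k):            # all 2^k attribute settings
        count = np.all(db[:, attrs] == setting, axis=1).sum()
        table[setting] = count + rng.laplace(0.0, 1.0 / epsilon)
    return table
```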
Conference Paper
This work is at the intersection of two lines of research. One line, initiated by Dinur and Nissim, investigates the price, in accuracy, of protecting privacy in a statistical database. The second, growing from an extensive literature on compressed sensing (see in particular the work of Donoho and collaborators (14, 7, 13, 11)) and explicitly connected to error-correcting codes by Candes and Tao ((4); see also (5, 3)), is in the use of linear programming for error correction. Our principal result is the discovery of a sharp threshold ρ* ≈ 0.239, so that if ρ < ρ* and A is a random m × n encoding matrix of independently chosen standard Gaussians, where m = O(n), then with overwhelming probability over the choice of A, for all x ∈ R^n, LP decoding corrects ⌊ρm⌋ arbitrary errors in the encoding Ax, while decoding can be made to fail if the error rate exceeds ρ*. Our bound resolves an open question of Candes, Rudelson, Tao, and Vershynin (3) and (oddly, but explicably) refutes empirical conclusions of Donoho (11) and Candes et al. (3). By scaling and rounding we can easily transform these results to obtain polynomial-time decodable random linear codes with polynomial-sized alphabets tolerating any ρ < 0.239 fraction of arbitrary errors. In the context of privacy-preserving data mining our results say that any privacy mechanism, interactive or non-interactive, providing reasonably accurate answers to a 0.761 fraction of randomly generated weighted subset sum queries, and arbitrary answers on the remaining 0.239 fraction, is blatantly non-private.
Conference Paper
We present a new class of statistical de- anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary's background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world's largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
Augmented identity app helps you identify strangers on the street
  • C Dillow
General social surveys, 1972-2012. National Opinion Research Center {producer}; The Roper Center for Public Opinion Research, University of Connecticut {distributor}, 2013
  • T W Smith
  • P Marsden
  • M Hout
  • J Kim
The OpenCV library. Dr. Dobb's Journal of Software Tools
  • G Bradski
FiveThirtyEight.com DataLab: How Americans like their steak. http://fivethirtyeight.com/datalab/how-americans-like-their-steak
  • W Hickey
Facial recognition API
  • Kairos AR, Inc.
The ORL database of faces
  • AT&T Laboratories Cambridge
Individualized patient-centered lifestyle recommendations
  • Chih-Lin Chi
  • W Nick Street
  • Jennifer G Robinson
  • Matthew A Crawford
Estimation of Treatment Effects from Combined Data: Identification versus Data Security. NBER volume Economics of Digitization: An Agenda. To appear
  • T Komarova
  • D Nekipelov
  • E Yakovlev
Facial scanning is making gains in surveillance. The New York Times
  • C Savage
Model inversion attacks and basic countermeasures
  • M Fredrikson
  • S Jha
  • T Ristenpart
Facial recognition API
  • Skybiometry
DeepFace: Closing the Gap to Human-Level Performance in Face Verification
  • Y Taigman
  • M Yang
  • M Ranzato
  • L Wolf
Facial recognition API. https://lambdal.com/face-recognition-api
  • Lambda Labs
Prediction API. https://cloud.google.com/prediction
  • Google
Microsoft Azure Machine Learning
  • Microsoft
Social science research on pornography
  • J Prince