Conference Paper

Bounded and Unbiased Composite Differential Privacy

Authors:
  • CSIRO's Data61

... In the modern era, DP has become a de-facto standard for privacy preservation in many domains. The use of DP in non-interactive data-sharing settings (e.g., when the whole database is published in perturbed form) is also increasing, and many methods have been proposed [36]. Table 2 presents a summary and comparison of SOTA DP-based methods used in privacy-preserving data outsourcing. ...
Article
Full-text available
In the modern era, data of diverse types (medical, financial, etc.) are outsourced from data owner environments to the public domains for data mining and knowledge discovery purposes. However, data often encompass sensitive information about individuals, and outsourcing the data without sufficient protection may endanger privacy. Anonymization methods are mostly used in data outsourcing to protect privacy; however, it is very hard to apply anonymity to datasets of poor quality while maintaining an equilibrium between privacy, utility, and truthfulness (i.e., ensuring the values in anonymized data are consistent with the real data). To address these technical problems, we propose and implement a data balancing and attribute correlation-aware differential privacy (DP) method for mixed data outsourcing while accomplishing the three crucial objectives of privacy, truthfulness, and utility. Our method first identifies quality-related issues in the data and solves them in an automated manner by adding the fewest possible good-quality synthetic records. We propose a data partitioning method that exploits correlations between attributes to create blocks of data to lessen the amount of noise added by the DP model. To preserve higher truthfulness while guaranteeing privacy, categorical attributes are considered as one unit, and an exponential mechanism is applied to them. The numerical attributes are transformed using the Laplace mechanism with a relatively higher ϵ. The joint application of these mechanisms to data blocks enables effective resolution of the truthfulness–privacy trade-off, and data usability is extremely high. Extensive experiments are performed on three benchmark datasets to demonstrate the effectiveness of our method in real scenarios. The experiment results and analysis indicate significantly better performance on four different evaluation metrics compared to the recent state-of-the-art (SOTA) DP-based methods. Furthermore, our method has better efficiency than its counterparts.
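The two primitives this abstract combines are standard. As a minimal sketch of the first one, the exponential mechanism selects a categorical output with probability proportional to exp(ε·u/(2Δu)); the function name and scoring interface below are illustrative assumptions, not the cited method's actual API (a Laplace-mechanism sketch for numerical attributes appears later in this list, after the Dwork et al. reference).

```python
import numpy as np

def exponential_mechanism(candidates, scores, eps, sensitivity, rng=None):
    # Select a candidate with probability proportional to
    # exp(eps * score / (2 * sensitivity)): the textbook exponential
    # mechanism for categorical outputs.
    rng = rng or np.random.default_rng()
    logits = eps * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    logits -= logits.max()            # stabilize before exponentiating
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```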
... Unbounded and bounded DP [67,68]: If the dataset size is not fixed, you are operating under unbounded DP (neighboring datasets differ by the addition or removal of one record, so the possible datasets may be of any size). In contrast, if the dataset size is known and fixed, you are operating under bounded DP (neighboring datasets have the same known size and differ by the replacement of one record). ...
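For reference, both notions instantiate the same inequality and differ only in the neighboring relation; a standard formulation (not taken verbatim from the cited survey) is:

```latex
% eps-DP under the two neighboring relations on datasets D, D':
%   unbounded: D' = D with one record added or removed (|D'| = |D| \pm 1)
%   bounded:   D' = D with one record replaced          (|D'| = |D| = n)
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S]
\qquad \text{for every measurable set } S .
```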
Article
Full-text available
Differential privacy has recently gained prominence, especially in the context of private machine learning. While the definition of differential privacy makes it possible to provably limit the amount of information leaked by an algorithm, practical implementations of differentially private algorithms often contain subtle vulnerabilities. Therefore, there is a need for effective methods that can audit differentially private algorithms before they are deployed in the real world. The article examines studies that evaluate the privacy guarantees of differentially private machine learning. It covers a wide range of topics on the subject and provides comprehensive guidance for privacy-auditing schemes based on privacy attacks, aimed at protecting machine-learning models from privacy leakage. Our results contribute to the growing literature on differential privacy in the realm of privacy auditing and beyond, and pave the way for future research in the field of privacy-preserving models.
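Attack-based auditing of this kind typically converts a membership-inference attack's error rates into an empirical lower bound on ε, via the hypothesis-testing view of DP (TPR ≤ e^ε · FPR + δ). A minimal sketch, with a hypothetical function name and no confidence intervals:

```python
import math

def empirical_eps_lower_bound(tpr, fpr, delta=0.0):
    # Any (eps, delta)-DP mechanism forces TPR <= e^eps * FPR + delta
    # against every membership-inference distinguisher, so measured
    # attack rates imply eps >= ln((TPR - delta) / FPR).
    # Point estimate only: a rigorous audit must also bound the
    # sampling error of TPR and FPR (e.g., Clopper-Pearson intervals).
    if fpr <= 0.0 or tpr <= delta:
        return 0.0
    return math.log((tpr - delta) / fpr)
```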
... Differential privacy is a statistical technique that provides strong privacy guarantees by introducing noise into the data being shared [233]. In the context of CRNs, differential privacy is applied to protect spectrum sensing reports and usage patterns from revealing sensitive user information [234]. ...
Article
Cognitive Radio (CR) technology offers dynamic spectrum access, allowing for more efficient use of the radio frequency spectrum. While this innovation addresses spectrum scarcity, it introduces significant security and privacy concerns. This paper examines key vulnerabilities in cognitive radio networks (CRNs), including spectrum sensing data falsification, primary user emulation attacks, and denial-of-service attacks, which exploit the adaptive and opportunistic nature of CR systems. In addition, privacy challenges arise from the frequent sharing of location, identity, and spectrum usage data. This paper explores existing security frameworks, mitigation strategies, and privacy-preserving techniques, emphasizing the need for robust cryptographic methods, trust management, and real-time intrusion detection systems. The paper concludes by identifying open research areas that need attention to develop secure, resilient CRNs while preserving user privacy.
Article
Stroke includes both hemorrhagic and ischaemic stroke, and with the rising incidence of stroke, the mortality rate of hemorrhagic stroke is higher than that of ischaemic stroke, accounting for 15% of the stroke mortality rate. In this area, clinically intelligent diagnosis and treatment plays an important role. By integrating imaging features, patient clinical information, treatment plans and diagnosis, accurate personalized efficacy assessment and prognosis prediction can be achieved. In this study, machine learning models (Random Forest, XGBoost, logistic regression, LGBoost, and AdaBoost) for the exploration of factors associated with the risk of haematoma expansion (HE) were developed based on patients' diagnostic data. mRS scores were used to assess the prognostic status of the patients, Principal Component Analysis was used for data dimensionality reduction, and Spearman Correlation Analysis was used to analyze the features' direct correlation. Five machine learning models were applied to predict the probability of HE and the prognosis of hemorrhagic stroke in patients. The models were tuned using grid search and ten‐fold cross‐validation methods to obtain more accurate predictions. The results of the study showed that the mRS index and factors such as history of diabetes, history of coronary heart disease, haematoma volume and age were closely related to the prognosis of patients. Among them, RF and XGBoost performed well in predicting the probability of HE, with the area under the ROC curve reaching 0.98, while LGBoost performed best in predicting the prognostic status of hemorrhagic stroke patients.
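The tuning procedure the abstract describes (grid search with ten-fold cross-validation) is standard; a sketch with scikit-learn, where the parameter grid and scoring choice are illustrative assumptions rather than the study's actual search space:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; the study does not list its exact search space.
param_grid = {"n_estimators": [100, 300, 500],
              "max_depth": [None, 5, 10]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=10, scoring="roc_auc")
# search.fit(X, y)   # X: diagnostic features, y: haematoma-expansion labels
# search.best_params_, search.best_score_
```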
Article
Stereo cameras have been widely used in three‐dimensional (3D) reconstruction, target ranging and environment perception. The calibration of stereo cameras is a key technology of the vision system. Most of the existing methods are not flexible due to limitations on the size of the cameras' overlapping fields of view (FOV) and the degree of defocus. This paper presents a stereo calibration method in which the feature points on the target are globally encoded by the phase‐coding method. Each feature point has a specific code, and the mapping relationship between image coordinates and world coordinates can be uniquely determined. Thus, whether or not the stereo cameras capture the same area of the target, both camera coordinate systems can be unified under one global coordinate system. In addition, calibration can be accomplished even when the cameras are severely defocused, because the coding and extraction of feature points are phase‐based. Simulations and experiments were carried out to prove the flexibility and robustness of the proposed method.
Article
In digital‐intelligent warehouses, the heavy handling tasks, complex algorithms with high computational demands, and vast solution spaces pose significant challenges to achieving stable, efficient, and balanced operation of multiple Autonomous Mobile Robots (AMRs) for automated cargo handling. This paper focuses on a virtual smart warehouse environment and employs Python software to conduct simulation experiments for multi‐AMR task allocation. The simulated smart warehouse comprises three idle AMRs and 16 task points that require transportation. The experimental simulations demonstrate that the improved genetic algorithm can find the global optimal solution with relatively low computational cost, meeting the fast response requirements in real‐world operations. It enables stable operation, high efficiency, and balanced task allocation for multiple AMRs. The simulation results validate the reliability of the proposed method, effectively addressing the issues of multi‐AMR task allocation and path planning in digital‐intelligent warehouses.
Article
The morphological characteristics of electrocardiograms (ECGs) serve as a fundamental basis for diagnosing arrhythmias. Convolutional neural networks (CNNs), leveraging their local receptive field properties, effectively capture the morphological features of ECG signals and have been extensively employed in the automatic diagnosis of arrhythmias. However, the variability in the duration of ECG morphological features renders single‐scale convolutional kernels inadequate for fully extracting these features. To address this limitation, this study proposes a multi‐scale parallel joint optimization convolutional neural network (MPJO_CNN). The proposed method utilizes convolutional kernels of varying scales to extract ECG features, further refining these features via parallel computation and implementing a joint optimization strategy to enhance classification performance. Experimental results demonstrate that on the MIT‐BIH arrhythmia database, this method not only achieved state‐of‐the‐art performance, with an accuracy of 99.41% and an F1 score of 98.09%, but also showed high sensitivity to classes with fewer samples.
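The core idea (parallel convolutional kernels at several scales whose outputs are fused) can be sketched as follows in PyTorch; the kernel sizes and channel counts are illustrative assumptions, not MPJO_CNN's actual configuration:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    # Parallel 1-D convolutions with different kernel sizes capture ECG
    # morphology at several temporal scales; branch outputs are
    # concatenated along the channel axis for downstream layers.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2)
            for k in (3, 7, 15))          # illustrative scales

    def forward(self, x):                 # x: (batch, channels, time)
        return torch.cat([b(x) for b in self.branches], dim=1)
```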
Article
Urban logistics is an important bridge connecting urban supply and demand, and its development level is crucial to urban economic development. To explore the development level, spatial and temporal differences, and influencing factors of urban logistics in China, this article uses the entropy weight method, Theil index, and hierarchical regression model to analyze the panel data of urban logistics in 31 provinces (autonomous regions and municipalities) in China. The research conclusion provides a reference for the government to formulate a balanced development policy for urban logistics. It has important practical significance for guiding logistics enterprises to adjust their regional distribution.
Article
Cross‐resolution person re‐identification (CR‐ReID) seeks to overcome the challenge of retrieving and matching specific person images across cameras with varying resolutions. Numerous existing studies utilize established CNNs and ViTs to resize captured low‐resolution (LR) images and align them with high‐resolution (HR) image features, or construct common feature spaces to match between images of different resolutions. However, these methods ignore the potential feature connection between the LR and HR images of the same pedestrian identity. Besides, CNNs and ViTs often produce outliers within the attention maps of LR images; this inclination to concentrate excessively on anomalous information may obscure the genuine and anticipated characteristics shared between images, which makes it challenging to extract meaningful information from them. In this work, we propose the abnormal feature elimination and reconfiguration Transformer (ARTransformer), a novel network architecture for robust cross‐resolution person re‐identification. This method uses a resolution feature discriminator to learn resolution‐invariant features and output feature matrices of images with different resolutions. It then calculates the potential feature relationships between images of pedestrians with the same identity but different resolutions through a new cross‐resolution landmark agent attention (CR‐LAA) mechanism. Finally, it utilizes the output feature matrices to model LR and HR image interactions, mitigating abnormal image features and prioritizing attention on the target person by learning representations from input images of various resolutions. Experimental results show that ARTransformer performs well in matching images with different resolutions, even at unseen resolutions, and extensive evaluations on four real‐world datasets confirm the excellent results of our approach.
Article
Although ViT has achieved significant success in the field of image classification, research on ViT‐based object detection algorithms is still in its early stages, and their application in real‐world scenarios is limited. Furthermore, algorithms based on ViT or Transformer are prone to overfitting issues when training data is scarce. While CO‐DETR has achieved state‐of‐the‐art object detection precision on the COCO dataset leaderboard, the ViT‐based CO‐DETR also suffers from overfitting problems, which affect its detection precision on smaller datasets. Based on the study of ViT‐based object detection algorithms, a new object detection algorithm termed DC‐DETR (DropKey Co‐DETR) was proposed in this paper. It builds upon CO‐DETR and introduces a regularization method called DropKey into the Transformer attention mechanism. By randomly dropping part of the Key during the attention phase, the network is encouraged to capture global information about the target object. This method effectively alleviates the overfitting problem in ViT for object detection tasks, improving the model's precision and generalization ability. To validate the effectiveness and practical applicability of DC‐DETR in environments with limited computational resources, a dataset for hot work scenarios was collected and annotated. Based on this dataset, performance tests were conducted on the DC‐DETR, CO‐DETR, and YOLOv5 algorithms. The test results indicate that the proposed DC‐DETR algorithm exhibits superior performance, with detection precision improving by 0.7% compared to CO‐DETR and by 5.7% compared to YOLOv5. The detection speed is the same as CO‐DETR, and only 2.9 ms slower than YOLOv5. The experiments demonstrate that the proposed DC‐DETR algorithm achieves a balance between precision and speed, making it well‐suited for practical object detection applications.
Article
The distributed electric drive vehicle is a highly nonlinear and time‐varying system. To address the issue of drive slip control under varying driving forces and road surface coefficients, a novel drive slip control strategy is proposed, which considers axle load transfer during vehicle acceleration. The strategy employs an improved PSO algorithm to obtain optimal parameters for the BP neural network, uses the BP neural network for forward propagation to calculate PID parameters in real‐time, and adjusts the weight matrix through backward propagation to achieve real‐time adaptive PID control for vehicle slip. Experimental results indicate that this strategy improves the ITAE index by 13.6% and response time by 74.8% compared to the anti‐saturation PID.
Article
This paper presents a semantic segmentation algorithm for medical images, leveraging the DeepLabV3+ architecture in conjunction with the ResNeXt network. The proposed algorithm takes into account the correlation between the structures of lung images and the unique characteristics of image features. Firstly, atrous (dilated) convolution is employed to enlarge the receptive field of the network's feature map without increasing the number of network parameters. Then, the extraction of dense pixel features and the expansion of the receptive field for lung images are conducted using a Densely Connected Atrous Spatial Pyramid Pooling (DenseASPP) module integrated with the ResNeXt network, based on multi‐scale feature fusion. This ultimately leads to better refinement of the edges in segmented lung images. The algorithm has shown excellent performance in clinical applications, providing medical professionals with more precise and accurate data to inform diagnostic and treatment strategies. Our algorithm achieved a Mean Pixel Accuracy (MPA) of 0.9866, an Intersection over Union (IoU) of 0.9886, and a Mean Intersection over Union (MIoU) of 0.9761, which demonstrates superiority over other state‐of‐the‐art algorithms.
Article
The intricate spatial‐temporal dynamics and variability of sign language gestures pose significant challenges for Continuous Sign Language Recognition (CSLR) systems. Existing models often fall short in accurately capturing these complexities, leading to performance issues and frequent misalignments. To address these shortcomings, we introduce a new approach that leverages Denoising Diffusion Models (DDMs) to improve feature representation in the visual‐sequential module of CSLR systems. Originally intended for generative tasks, DDMs have shown strong potential in representation learning through a denoising process akin to Denoising Autoencoders. Our method incorporates a denoising diffusion transformer into the CSLR framework to refine spatial‐temporal features, capitalizing on the ability of diffusion models to enhance representation quality. By conditionally denoising visual feature sequences, our approach increases the discriminative capability of the system. Additionally, we introduce an additional classifier, trained with Connectionist Temporal Classification (CTC) loss, to provide complementary supervision and further boost performance. Extensive experiments demonstrate that our method significantly improves CSLR accuracy by effectively capturing the subtle details of continuous sign language gestures and overcoming the representation limitations of current models.
Article
In response to the poor sharpness and low information entropy of the traditional MSRCP (Multi‐Scale Retinex with Color Restoration) algorithm for image enhancement, we propose an improved MSRCP algorithm for low‐light image enhancement with chromaticity preservation. First, we replaced the extrema calculation method in the color restoration function with a calculation based on clipped pixel ratios. Then, we combined guided filtering and Gaussian filtering to calculate the incident component. Finally, we conducted experiments on six different low‐light images and compared the results with traditional algorithms such as SSR, MSR, MSRCR, and MSRCP. The experimental results show that our method improved the sharpness and information entropy values of the five comparison images by 5.6%–35.6% and 0.18%–15.3%, respectively.
Article
Successfully training a model requires substantial computational power, excellent model design, and high training costs, which implies that a well‐trained model holds significant commercial value. Protecting a trained Deep Neural Network (DNN) model from Intellectual Property (IP) infringement has become a matter of intense concern recently. Particularly, embedding and verifying watermarks in black‐box models without accessing internal model parameters, while ensuring the robustness and invisibility of the watermark, remains a challenging issue. Unlike many existing methods, we propose a cascade ownership verification framework based on invisible watermarks, with a focus on how to effectively protect the copyright of black‐box watermark models and detect unauthorized users' infringement behaviors. This framework consists of two parts: watermark generation and copyright verification. In the watermark generation phase, watermarked samples are generated from key samples and label images. The difference between watermarked samples and key samples is imperceptible, while a specific identifier has been injected into the watermarked samples, leaving a backdoor as an entry point for copyright verification. The copyright verification phase employs hypothesis testing to enhance the confidence level of verification. In image classification tasks based on MNIST, CIFAR‐10, and CIFAR‐100 datasets, experiments were conducted on several popular deep learning models. The experimental results show that this framework offers high security and effectiveness in protecting model copyrights and demonstrates strong robustness against pruning and fine‐tuning attacks.
Article
To address the fact that existing PM2.5 concentration prediction methods ignore the spatial and temporal factors influencing PM2.5 concentration, this paper constructs a spatial characteristic factor of PM2.5 concentration based on the maximum information coefficient and proposes a CNN‐LSTM combined prediction model based on multi‐feature fusion, which transforms the abstract spatial and temporal influencing factors into quantifiable features. The model has good feature extraction ability and a strong ability to capture short‐term transient information and long‐range dependencies in time-series data, which improves its prediction performance. The experimental results show that the prediction accuracy of the multi-feature-fusion CNN‐LSTM model is 87.21%, and its MAPE is 6.25, 4.84, and 1.29 lower than BP, SVR, and LightGBM, respectively, and 1.91 and 7.04 lower than CNN and LSTM.
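A generic CNN-LSTM stack of the kind described (convolution for short-term patterns, LSTM for long-range dependence) might look as follows in PyTorch; the layer sizes, single-step regression head, and input shape are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    # A 1-D convolution extracts local (short-term) patterns from the
    # fused feature sequence, an LSTM models long-range dependence, and
    # a linear head predicts the PM2.5 concentration at the next step.
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(n_features, 32, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, n_features)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(torch.relu(z))
        return self.head(out[:, -1])      # prediction from last step
```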
Article
Nowadays, vehicular ad‐hoc networks (VANETs) offer increased convenience to drivers and enable intelligent traffic management. However, the public wireless transmission channel in VANETs brings challenges related to security vulnerabilities and privacy leakage. In addition, vehicles produced by different manufacturers may use different cryptosystems, such as certificateless cryptosystems (CLCs) and identity‐based cryptosystems (IBC). To address privacy leakage during cross‐cryptosystem communication in VANETs, we propose a lattice‐based heterogeneous signcryption scheme named LHS‐C2I. The scheme facilitates secure bidirectional communication across cryptosystems, from CLC‐based vehicles to IBC‐based vehicles and from IBC‐based vehicles to CLC‐based vehicles. The confidentiality and authenticity of LHS‐C2I help prevent privacy leakage during cross‐cryptosystem communication and allow authentication of message integrity and the legitimacy of the sender's identity. The proposed scheme is proven to achieve Indistinguishability under Adaptive Chosen Ciphertext Attack (IND‐CCA2) and Existential Unforgeability against Adaptive Chosen Messages Attack (EUF‐CMA) in the random oracle model. Performance analysis demonstrates that LHS‐C2I outperforms existing schemes in terms of computational overhead, communication overhead, and overall security features. It is particularly well‐suited for scenarios requiring secure communication across different cryptosystems in VANETs.
Article
The algorithm for multimodal image‐text retrieval aims to overcome the differences between visual and textual data, enabling efficient and accurate recognition between images and text. Since manually labeled data are usually expensive, many researchers attempted to use low‐quality multimodal data obtained through network batch operations. This presents a challenge for the model's generalization performance and prediction accuracy. To address this issue, we construct a system of multimodal image‐text retrieval based on the fusion of pre‐trained models. Firstly, we enhance the diversity of the original data using the MixGen algorithm to improve the model's generalization performance. Next, we employ Chinese‐CLIP as the most suitable foundational model based on comparative experiments among three different models. Finally, we construct a comprehensive ensemble model with three base Chinese‐CLIP models based on the specific characteristics of the tasks, which includes a prediction‐based fusion model for the text‐to‐image task and a feature‐based fusion model for the image‐to‐text task. Extensive experiments show that our model outperforms state‐of‐the‐art single foundation models in generalization, especially with low‐quality image‐text pairs and small datasets in the Chinese context.
Article
Traditional Oncomelania hupensis detection relies on human‐eye observation, whose efficiency is reduced by easy fatigue of the human eye and limited individual cognition. To address this, an improved YOLOv8 O. hupensis detection algorithm, YOLOv8‐ESW (expectation–maximization attention [EMA], Small Target Detection Layer, and Wise‐IoU), is proposed. The original dataset is augmented using the OpenCV library: salt‐and‐pepper and Gaussian noise were added to imitate image blur caused by motion jitter, and affine, translation, flip, and other transformations were applied to imitate images captured by the camera from different angles in an instant, resulting in a total of 6000 images after data enhancement. Considering the insufficient feature fusion caused by lightweight convolution, we present the EMA module (E), which incorporates a coordinate attention mechanism and convolutional layers, and introduce a specialized layer for small target detection (S). This design significantly improves the network's ability to combine information from both shallow and deep layers, better focusing on small and occluded O. hupensis. To tackle the challenge of quality imbalance among O. hupensis samples, we employ the Wise‐IoU (WIoU) loss function (W). This approach uses a gradient gain distribution strategy and improves the model's convergence speed and regression accuracy. The YOLOv8‐ESW model, with 16.8 million parameters and requiring 98.4 GFLOPS, achieved a mAP of 92.74% on the O. hupensis dataset, a 4.09% improvement over the baseline model. Comprehensive testing confirms the enhanced network's efficacy: it significantly raises O. hupensis detection precision, minimizes both missed and false detections, and fulfills real‐time processing criteria. Compared with current mainstream models, it has clear advantages in detection accuracy and provides a reference for subsequent research on practical detection.
Article
The reliability of the catenary system is crucial for the safety and efficiency of heavy‐haul railways. This study presents a probabilistic risk analysis model for the catenary system, employing causal inference methods to capture the complex relationships among risk factors. Using historical operational data, we identify key risk contributors such as environmental conditions, vehicular loads, and equipment failures. By combining fault tree analysis (FTA) and failure mode and effects analysis (FMEA), we establish risk propagation pathways. The proposed method utilizes Bayesian networks to quantify conditional probabilities and trace the causal chains leading to potential failures. Through reverse inference, we identify critical risk nodes and their impact on system performance. This approach enhances the accuracy of risk assessment and provides an effective tool for proactive risk management in heavy‐haul railways, aiding in the optimization of maintenance strategies and strengthening the resilience of the catenary system under varying operational conditions.
Article
The Density Peaks Clustering (DPC) algorithm is well‐known for its simplicity and efficiency in clustering data of arbitrary shapes. However, it faces challenges such as inconsistent local density definitions and sample assignment errors. This paper introduces the Shared Neighbors and Natural Neighbors Density Peaks Clustering (SN‐DPC) algorithm to address these issues. SN‐DPC redefines local density by incorporating weighted shared neighbors, which enhances the density contribution from distant samples and provides a better representation of the data distribution. It also establishes a new similarity measure between samples using shared and natural neighbors, which increases intra‐cluster similarity and reduces assignment errors, thereby improving clustering performance. Compared with DPC‐CE, IDPC‐FA, DPCSA, FNDPC, and traditional DPC, SN‐DPC demonstrated superior effectiveness on both synthetic and real datasets. When applied to the analysis of electricity consumption patterns, it more accurately identified load consumption patterns and usage habits.
Article
Visible‐Infrared Person Re‐Identification (VI‐ReID) is a complex challenge in cross‐modality retrieval, wherein the goal is to recognize individuals from images captured via RGB and IR cameras. While many existing methods focus on narrowing the gap between different modalities through designing various feature‐level constraints, they often neglect the consistency of channel statistics information across the modalities, which results in suboptimal matching performance. In this work, we introduce a new approach for VI‐ReID that incorporates Cross‐Composition Normalization (CCN) and Self‐Enrichment Normalization (SEN). Specifically, Cross‐Composition Normalization is a plug‐and‐play module that can be seamlessly integrated into shallow CNN layers without requiring modifications to the training objectives. It probabilistically blends feature statistics between instances, thereby fostering the model's ability to learn inter‐modality feature distributions. Conversely, Self‐Enrichment Normalization leverages attention mechanisms to recalibrate statistics, effectively bridging the gaps between training and test distributions. This enhancement markedly boosts the discriminability of features in VI‐ReID tasks. To validate the efficacy of our proposed method, we carried out comprehensive experiments on two public cross‐modality datasets. The results clearly demonstrate the superiority of our Cross‐Composition and Self‐Enrichment normalization techniques in addressing the challenges of the VI‐ReID problem.
Article
Fatigue detection holds paramount importance in the timely identification of safety hazards. Nonetheless, prevailing fatigue detection methodologies often overlook the diverse spectrum of fatigue features or temporal cues. To address this gap, we introduce fatigue detection based on blood volume pulse signal and multi‐physical features (FDBVPS‐MF). Initially, a non‐invasive technique is employed to extract the blood volume pulse signal (BVPS) from the forehead region, which is subsequently fed into a one‐dimensional convolutional neural network (1D CNN) to build a fatigue detection model based on BVPS. Concurrently, features such as percentage of eyelid closure (PERCLOS), blink frequency (BF), and maximum closing time (MCT) are computed from eye images and combined with yawning frequency (YF) derived from mouth images to generate multi‐physical features (MF). MF is then input into the 1D CNN network to construct a fatigue detection model based on MF. Subsequently, a fusion approach employing weights derived through AdaBoost integrates the outputs of the two fatigue detection models, thus facilitating multi‐modal fatigue detection. On the UTA‐RLDD dataset, the proposed FDBVPS‐MF exhibits an accuracy and precision of 88.9% and 88.2%, respectively. Experimental findings substantiate the superior efficacy of FDBVPS‐MF over conventional methodologies.
Article
Segmenting breast tumors from dynamic contrast‐enhanced magnetic resonance images is a critical step in the early detection and diagnosis of breast cancer. However, this task becomes significantly more challenging due to the diverse shapes and sizes of tumors, which make it difficult to establish a unified perception field for modeling them. Moreover, tumor regions are often subtle or imperceptible during early detection, exacerbating the issue of extreme class imbalance. This imbalance can lead to biased training and challenge accurately segmenting tumor regions from the predominant normal tissues. To address these issues, we propose a hierarchical region contrastive learning approach for breast tumor segmentation. Our approach introduces a novel hierarchical region contrastive learning loss function that addresses the class imbalance problem. This loss function encourages the model to create a clear separation between feature embeddings by maximizing the inter‐class margin and minimizing the intra‐class distance across different levels of the feature space. In addition, we design a novel Attention‐based 3D Multi‐scale Feature Fusion Residual Module to explore more granular multi‐scale representations to improve the feature learning ability of tumors. Extensive experiments on two breast DCE‐MRI datasets demonstrate that the proposed algorithm is more competitive against several state‐of‐the‐art approaches under different segmentation metrics.
Article
Due to the greatly improved capabilities of devices, massive data, and increasing concern about data privacy, Federated Learning (FL) has been increasingly considered for applications to wireless communication networks (WCNs). Wireless FL (WFL) is a distributed method of training a global deep learning model in which a large number of participants each train a local model on their training datasets and then upload the local model updates to a central server. However, in general, nonindependent and identically distributed (non-IID) data of WCNs raises concerns about robustness, as a malicious participant could potentially inject a “backdoor” into the global model by uploading poisoned data or models over WCN. This could cause the model to misclassify malicious inputs as a specific target class while behaving normally with benign inputs. This survey provides a comprehensive review of the latest backdoor attacks and defense mechanisms. It classifies them according to their targets (data poisoning or model poisoning), the attack phase (local data collection, training, or aggregation), and defense stage (local training, before aggregation, during aggregation, or after aggregation). The strengths and limitations of existing attack strategies and defense mechanisms are analyzed in detail. Comparisons of existing attack methods and defense designs are carried out, pointing to noteworthy findings, open challenges, and potential future research directions related to security and privacy of WFL.
Article
Full-text available
A key tool for building differentially private systems is adding Gaussian noise to the output of a function evaluated on a sensitive dataset. Unfortunately, using a continuous distribution presents several practical challenges. First and foremost, finite computers cannot exactly represent samples from continuous distributions, and previous work has demonstrated that seemingly innocuous numerical errors can entirely destroy privacy. Moreover, when the underlying data is itself discrete (e.g., population counts), adding continuous noise makes the result less interpretable. With these shortcomings in mind, we introduce and analyze the discrete Gaussian in the context of differential privacy. Specifically, we theoretically and experimentally show that adding discrete Gaussian noise provides essentially the same privacy and accuracy guarantees as the addition of continuous Gaussian noise. We also present a simple and efficient algorithm for exact sampling from this distribution. This demonstrates its applicability for privately answering counting queries, or more generally, low-sensitivity integer-valued queries.
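One well-known way to sample a discrete Gaussian is by rejection from a discrete Laplace proposal, in the spirit of the algorithm this paper presents. The sketch below uses floating-point arithmetic for brevity, whereas the point of the paper's exact sampler is to avoid floating point entirely; treat it as an illustration of the distribution, not a faithful reimplementation:

```python
import numpy as np

def discrete_laplace(t, rng):
    # Difference of two i.i.d. geometric variables on {0, 1, ...} with
    # success probability p = 1 - exp(-1/t) has the two-sided geometric
    # (discrete Laplace) law P(Y = y) proportional to exp(-|y|/t).
    p = 1.0 - np.exp(-1.0 / t)
    return (rng.geometric(p) - 1) - (rng.geometric(p) - 1)

def discrete_gaussian(sigma, rng=None):
    # Rejection sampling: propose Y ~ discrete Laplace(t) with
    # t = floor(sigma) + 1, accept with probability
    # exp(-(|Y| - sigma^2/t)^2 / (2 sigma^2)).
    rng = rng or np.random.default_rng()
    t = int(np.floor(sigma)) + 1
    while True:
        y = discrete_laplace(t, rng)
        if rng.random() < np.exp(-((abs(y) - sigma**2 / t) ** 2)
                                 / (2.0 * sigma**2)):
            return y
```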
Article
Full-text available
Many data analysis operations can be expressed as a GROUP BY query on an unbounded set of partitions, followed by a per-partition aggregation. To make such a query differentially private, adding noise to each aggregation is not enough: we also need to make sure that the set of released partitions is itself differentially private. This problem is not new, and it was recently formalized as differentially private set union [14]. In this work, we continue this area of study and focus on the common setting where each user is associated with a single partition. In this setting, we propose a simple, optimal differentially private mechanism that maximizes the number of released partitions. We discuss implementation considerations, as well as the possible extension of this approach to the setting where each user contributes to a fixed, small number of partitions.
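The classical baseline for this problem is Laplace "noise plus threshold": count the users in each partition, add Laplace noise, and release only partitions whose noisy count clears a δ-calibrated threshold. The sketch below shows that baseline, not the paper's optimal mechanism (which dominates it); the function name is ours, and the threshold form is the standard one but should be checked against a vetted library before any real use:

```python
import numpy as np

def private_partitions(user_partition, eps, delta, rng=None):
    # user_partition: dict mapping each user to exactly one partition.
    # Release a partition if its noisy count exceeds the threshold
    # tau = 1 + ln(1/(2*delta)) / eps, which bounds by delta the
    # probability of releasing a partition contributed by one user.
    rng = rng or np.random.default_rng()
    counts = {}
    for p in user_partition.values():
        counts[p] = counts.get(p, 0) + 1
    tau = 1.0 + np.log(1.0 / (2.0 * delta)) / eps
    return [p for p, c in counts.items()
            if c + rng.laplace(scale=1.0 / eps) > tau]
```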
Conference Paper
Full-text available
Many agencies release datasets and statistics about groups of individuals that are used as input to a number of critical decision processes. To conform with privacy and confidentiality requirements, these agencies are often required to release privacy-preserving versions of the data. This paper studies the release of differentially private datasets and analyzes their impact on some critical resource allocation tasks under a fairness perspective. The paper shows that, when the decisions take as input differentially private data, the noise added to achieve privacy disproportionately impacts some groups over others. The paper analyzes the reasons for these disproportionate impacts and proposes guidelines to mitigate these effects. The proposed approaches are evaluated on critical decision problems that use differentially private census data.
Article
Full-text available
Differential privacy provides strong privacy preservation guarantee in information sharing. As social network analysis has been enjoying many applications, it opens a new arena for applications of differential privacy. This article presents a comprehensive survey connecting the basic principles of differential privacy and applications in social network analysis. We concisely review the foundations of differential privacy and the major variants. Then, we discuss how differential privacy is applied to social network analysis, including privacy attacks in social networks, models of differential privacy in social network analysis, and a series of popular tasks, such as analyzing degree distribution, counting subgraphs and assigning weights to edges. We also discuss a series of challenges for future work.
Article
Full-text available
Crowdsourcing plays an essential role in the Internet of Things (IoT) for data collection, where a group of workers is equipped with Internet-connected geolocated devices to collect sensor data for marketing or research purposes. In this paper, we consider crowdsourcing these workers' hot travel paths. Each worker is required to report his real-time location information, which is sensitive and has to be protected. Local differential privacy is a strong privacy concept and has been deployed in many software systems. However, local differential privacy technology needs a large number of participants to ensure the accuracy of the estimation, which is not always the case in crowdsourcing. To solve this problem, we propose a trie-based iterative statistic method, which combines additive secret sharing and local differential privacy technologies. The proposed method performs excellently even with a limited number of participants, without the need for complex computation. Specifically, the proposed method contains three main components: iterative statistics, adaptive sampling, and secure reporting. We theoretically analyze the effectiveness of the proposed method and perform extensive experiments to show that it not only provides a strict privacy guarantee but also significantly improves performance over previous solutions.
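The local-perturbation primitive underlying schemes like this is generalized (k-ary) randomized response, shown below. This is the textbook mechanism only, not the paper's trie-based construction, and the function name is ours:

```python
import numpy as np

def k_randomized_response(value, domain, eps, rng=None):
    # Report the true value with probability e^eps / (e^eps + k - 1),
    # otherwise a uniformly random *other* value from the domain.
    # Each report satisfies eps-local differential privacy.
    rng = rng or np.random.default_rng()
    k = len(domain)
    p_true = np.exp(eps) / (np.exp(eps) + k - 1)
    if rng.random() < p_true:
        return value
    others = [v for v in domain if v != value]
    return others[rng.integers(len(others))]
```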
Article
Full-text available
The availability of high-fidelity energy networks brings significant value to academic and commercial research. However, such releases also raise fundamental concerns related to privacy and security as they can reveal sensitive commercial information and expose system vulnerabilities. This paper investigates how to release the data for power networks where the parameters of transmission lines and transformers are obfuscated. It does so by using the framework of Differential Privacy (DP), that provides strong privacy guarantees and has attracted significant attention in recent years. Unfortunately, simple DP mechanisms often result in AC-infeasible networks. To address these concerns, this paper presents a novel differentially private mechanism that guarantees AC-feasibility and largely preserves the fidelity of the obfuscated power network. Experimental results also show that the obfuscation significantly reduces the potential damage of an attack carried by exploiting the released dataset.
Article
Full-text available
Blockchain technology ensures that data is tamper-proof, traceable, and trustworthy. This article introduces a well-known blockchain technology implementation—Hyperledger Fabric. The basic framework and privacy protection mechanisms of Hyperledger Fabric such as certificate authority, channel, Private Data Collection, etc. are described. As an example, a specific business scenario of supply chain finance is figured out. And accordingly, some design details about how to apply these privacy protection mechanisms are described.
Conference Paper
Full-text available
Ubiquitous mobile and wireless communication systems have the potential to revolutionize transportation systems, making accurate mobility traces and activity-based patterns available to optimize the design and operations of mobility systems. However, these rich data sets also pose significant privacy risks, potentially revealing highly sensitive information about individual agents. This paper studies how to use differential privacy to release mobility data for transportation applications. It shows that existing approaches do not provide the desired fidelity for practical uses. To remedy this limitation, the paper proposes the idea of Constraint-Based Differential Privacy (CBDP) that casts the production of a private data set as an optimization problem that redistributes the noise introduced by a randomized mechanism to satisfy fundamental constraints of the original data set. The CBDP has strong theoretical guarantees: It is a constant factor away from optimality and when the constraints capture categorical features, it runs in polynomial time. Experimental results show that CBDP ensures that a city-level multi-modal transit system has similar performance measures when designed and optimized over the real and private data sets and improves state-of-art privacy methods by an order of magnitude.
Article
Full-text available
We continue a line of research initiated in Dinur and Nissim (2003); Dwork and Nissim (2004); and Blum et al. (2005) on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called "true answer" is the result of applying f to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user. Previous work focused on the case of noisy sums, in which f = ∑_i g(x_i), where x_i denotes the i-th row of the database and g maps database rows to [0,1]. We extend the study to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the "sensitivity" of the function f. Roughly speaking, this is the amount that any single argument to f can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case. The first step is a very clean definition of privacy---now known as differential privacy---and measure of its loss. We also provide a set of tools for designing and combining differentially private algorithms, permitting the construction of complex differentially private analytical tools from simple differentially private primitives. Finally, we obtain separation results showing the increased value of interactive statistical release mechanisms over non-interactive ones.
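Calibrating noise to sensitivity is exactly the Laplace mechanism: add noise with scale Δf/ε, where Δf is the most the answer can change when one row changes. A minimal sketch (function name ours):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, eps, rng=None):
    # eps-DP release: noise scale calibrated to global sensitivity,
    # b = Delta_f / eps.
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(scale=sensitivity / eps)

# A counting query changes by at most 1 when a single row is added or
# removed, so sensitivity = 1:
# noisy_count = laplace_mechanism(count, sensitivity=1.0, eps=0.1)
```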
Conference Paper
Full-text available
We continue a line of research initiated in [10,11] on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called true answer is the result of applying f to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user. Previous work focused on the case of noisy sums, in which f = ∑_i g(x_i), where x_i denotes the i-th row of the database and g maps database rows to [0,1]. We extend the study to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f. Roughly speaking, this is the amount that any single argument to f can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case. The first step is a very clean characterization of privacy in terms of indistinguishability of transcripts. Additionally, we obtain separation results showing the increased value of interactive sanitization mechanisms over non-interactive ones.
Article
Full-text available
Recent differentially private query mechanisms offer strong privacy guarantees by adding noise to the query answer. For a single counting query, the technique is simple, accurate, and provides optimal utility. However, analysts typically wish to ask multiple queries. In this case, the optimal strategy is not apparent, and alternative query strategies can involve difficult trade-offs in accuracy, and may produce inconsistent answers. In this work we show that it is possible to significantly improve accuracy for a general class of histogram queries. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is both private and consistent, but in addition, it is often much more accurate. We apply our techniques to real datasets and show they can be used for estimating the degree sequence of a graph with extreme precision, and for computing a histogram that can support arbitrary range queries accurately.
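The simplest instance of this consistency step is a noisy total alongside noisy sub-counts: project the noisy vector onto the constraint "parent equals sum of children" by least squares. The closed form below assumes equal noise variances and is a toy version of the paper's hierarchical estimator (function name ours):

```python
import numpy as np

def enforce_sum_consistency(parent, children):
    # Least-squares projection onto {parent = sum(children)} for
    # equal-variance noisy counts: spread the residual evenly across
    # all len(children) + 1 noisy values.
    children = np.asarray(children, dtype=float)
    r = parent - children.sum()
    n = len(children) + 1
    return parent - r / n, children + r / n

# The adjusted values satisfy the constraint exactly, remain private
# (post-processing), and are typically more accurate than the raw
# noisy counts.
```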
Article
In order to efficiently provide demand side management (DSM) in the smart grid, pricing based on real-time energy usage is considered the most vital tool, because it is directly linked with the finances associated with smart meters. Hence, every smart meter user wants to pay the minimum possible amount while getting maximum benefits. In this context, usage-based dynamic pricing strategies of DSM play their role and provide users with specific incentives that help shape their load curve according to the forecasted load. However, these reported real-time values can leak the privacy of smart meter users, which can lead to serious consequences such as spying. Moreover, most dynamic pricing algorithms charge all users equally, irrespective of their contribution to causing the peak factor. Therefore, in this paper, we propose a modified usage-based dynamic pricing mechanism that only charges the users responsible for causing the peak factor. We further integrate the concept of differential privacy to protect the privacy of real-time smart metering data. To calculate accurate billing, we also propose a noise adjustment method. Finally, we propose a Demand Response enhancing Differential Pricing (DRDP) strategy that effectively enhances demand response along with providing dynamic pricing to smart meter users. We also carry out theoretical analysis of the differential privacy guarantees and of the cooperative state probability to analyze the behavior of cooperative smart meters. The performance evaluation of the DRDP strategy at various privacy parameters shows that the proposed strategy outperforms previous mechanisms in terms of dynamic pricing and privacy preservation. (A preliminary version was published at the 2020 IEEE International Conference on Communications (ICC 2020), June 2020, Dublin, Ireland, under the title "Differentially Private Dynamic Pricing for Efficient Demand Response in Smart Grid.")
Article
Because learning sometimes involves sensitive data, machine learning algorithms have been extended to offer differential privacy for training data. In practice, this has been mostly an afterthought, with privacy-preserving models obtained by re-running training with a different optimizer, but using the model architectures that already performed well in a non-privacy-preserving setting. This approach leads to less than ideal privacy/utility tradeoffs, as we show here. To improve these tradeoffs, prior work introduces variants of differential privacy that weaken the provable privacy guarantee in order to increase model utility. We show this is not necessary and instead propose that utility be improved by choosing activation functions designed explicitly for privacy-preserving training. A crucial operation in differentially private SGD is gradient clipping, which, along with modifying the optimization path (at times resulting in not optimizing a single objective function), may also introduce both significant bias and variance to the learning process. We empirically identify that exploding gradients arising from ReLU may be one of the main sources of this. We demonstrate analytically and experimentally how a general family of bounded activation functions, the tempered sigmoids, consistently outperform the currently established choice: unbounded activation functions like ReLU. Using this paradigm, we achieve new state-of-the-art accuracy on MNIST, FashionMNIST, and CIFAR10 without any modification of the learning procedure fundamentals or differential privacy analysis. While the changes we make are simple in retrospect, the simplicity of our approach facilitates its implementation and adoption to meaningfully improve state-of-the-art machine learning while still providing strong guarantees in the original framework of differential privacy.
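The tempered-sigmoid family referred to here is ψ(x) = s·σ(T·x) − o, with scale s, inverse temperature T, and offset o; the choice s=2, T=2, o=1 recovers tanh. A one-line sketch:

```python
import numpy as np

def tempered_sigmoid(x, s=2.0, T=2.0, o=1.0):
    # Bounded activation psi(x) = s * sigmoid(T * x) - o; the defaults
    # (s=2, T=2, o=1) reduce to tanh. Bounded outputs keep activations
    # and gradients controlled, which interacts well with DP-SGD's
    # per-example gradient clipping.
    return s / (1.0 + np.exp(-T * x)) - o
```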
Article
Post-processing immunity is a fundamental property of differential privacy: it enables the application of arbitrary data-independent transformations to the results of differentially private outputs without affecting their privacy guarantees. When query outputs must satisfy domain constraints, post-processing can be used to project them back onto the feasibility region. Moreover, when the feasible region is convex, a widely adopted class of post-processing steps is also guaranteed to improve accuracy. Post-processing has been applied successfully in many applications including census data, energy systems, and mobility. However, its effects on the noise distribution are poorly understood: it is often argued that post-processing may introduce bias and increase variance. This paper takes a first step towards understanding the properties of post-processing. It considers the release of census data and examines, both empirically and theoretically, the behavior of a widely adopted class of post-processing functions.
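The widely adopted class referred to here is Euclidean projection onto the convex feasible region: projection onto a convex set is non-expansive, so it cannot increase the L2 distance to the true (feasible) answer, yet, as the paper studies, it can change the noise distribution and introduce bias. A box-constraint example:

```python
import numpy as np

def project_to_box(noisy, lo, hi):
    # Data-independent post-processing: project each noisy count onto
    # the valid range [lo, hi]. Privacy is unaffected and L2 error
    # cannot grow, but clipping can bias small counts upward.
    return np.clip(noisy, lo, hi)

# e.g., census-style counts are nonnegative:
# released = project_to_box(noisy_counts, 0, np.inf)
```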
Article
When querying databases containing sensitive information, the privacy of individuals stored in the database has to be guaranteed. Such guarantees are provided by differentially private mechanisms which add controlled noise to the query responses. However, most such mechanisms do not take into consideration the valid range of the query being posed. Thus, noisy responses that fall outside of this range may potentially be produced. To rectify this and therefore improve the utility of the mechanism, the commonly-used Laplace distribution can be truncated to the valid range of the query and then normalized. However, such a data-dependent operation of normalization leaks additional information about the true query response, thereby violating the differential privacy guarantee. Here, we propose a new method which preserves the differential privacy guarantee through a careful determination of an appropriate scaling parameter for the Laplace distribution. We adapt the privacy guarantee in the context of the Laplace distribution to account for data-dependent normalization factors and study this guarantee for different classes of range constraint configurations. We provide derivations of the optimal scaling parameter (i.e., the minimal value that preserves differential privacy) for each class or provide an approximation thereof. As a result of this work, one can use the Laplace distribution to answer queries in a range-adherent and differentially private manner. To demonstrate the benefits of our proposed method of normalization, we present an experimental comparison against other range-adherent mechanisms. We show that our proposed approach is able to provide improved utility over the alternative mechanisms.
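The failure mode and the remedy are easy to picture: truncating the Laplace distribution to the valid range and renormalizing is a data-dependent operation, and the fix is to enlarge the Laplace scale before truncation. The sketch below illustrates rejection-based range adherence only; the `inflate` factor is a placeholder, not the optimal scaling parameter the paper derives:

```python
import numpy as np

def range_adherent_laplace(true_answer, lo, hi, eps, sensitivity,
                           inflate=2.0, rng=None):
    # Rejection-sample a Laplace response until it lands in the valid
    # range [lo, hi]; this is equivalent to truncation + normalization.
    # Naive normalization leaks information, so the scale is enlarged
    # beyond sensitivity/eps. `inflate` is illustrative only; the
    # paper derives the minimal privacy-preserving scale per range
    # configuration.
    rng = rng or np.random.default_rng()
    b = inflate * sensitivity / eps
    while True:
        r = true_answer + rng.laplace(scale=b)
        if lo <= r <= hi:
            return r
```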
Article
The continuous development of Healthcare 4.0 has brought great convenience to people. Through the Internet of Things technology, doctors can analyze patients’ health data and make timely diagnosis. However, behind the high efficiency, the mobile crowdsensing technology used for data transmission still has the risk of leaking the privacy of task and patient information. To this end, this paper proposes a privacy-enhanced multi-area task assignment strategy, named PMTA. Specifically, we use deep differential privacy to add noise to patient data, and then put the noise-added dataset into a deep Q-network for training, combined with a spectral clustering algorithm, to obtain an optimal classification strategy. Further, in order to address the problem of data silos, we adopt federated learning to jointly train the classification models of different hospitals to obtain a global model and realize data sharing among different hospitals. Finally, we use the optimal classification of patients for task deployment on the blockchain, and limit patients to only apply for tasks of the corresponding level through the smart contract technology, so as to protect task privacy. Experimental results show that our strategy can not only effectively protect task and patient privacy, but also achieve better system performance.
Article
Objectives: To develop and demonstrate the feasibility of a Global Open Source Severity of Illness Score (GOSSIS)-1 for critical care patients, which generalizes across healthcare systems and countries. Design: A merger of several critical care multicenter cohorts derived from registry and electronic health record data. Data were split into training (70%) and test (30%) sets, using each set exclusively for development and evaluation, respectively. Missing data were imputed when not available. Setting/patients: Two large multicenter datasets from Australia and New Zealand (Australian and New Zealand Intensive Care Society Adult Patient Database [ANZICS-APD]) and the United States (eICU Collaborative Research Database [eICU-CRD]) representing 249,229 and 131,051 patients, respectively. ANZICS-APD and eICU-CRD contributed data from 162 and 204 hospitals, respectively. The cohort included all ICU admissions discharged in 2014-2015, excluding patients less than 16 years old, admissions shorter than 6 hours, and those with a previous ICU stay. Interventions: Not applicable. Measurements and main results: GOSSIS-1 uses data collected during the ICU stay's first 24 hours, including extrema values for vital signs and laboratory results, admission diagnosis, the Glasgow Coma Scale, chronic comorbidities, and admission/demographic variables. The datasets showed significant variation in admission-related variables, case-mix, and average physiologic state. Despite this heterogeneity, test set discrimination of GOSSIS-1 was high (area under the receiver operator characteristic curve [AUROC], 0.918; 95% CI, 0.915-0.921) and calibration was excellent (standardized mortality ratio [SMR], 0.986; 95% CI, 0.966-1.005; Brier score, 0.050). Performance was maintained within ANZICS-APD (AUROC, 0.925; SMR, 0.982; Brier score, 0.047) and eICU-CRD (AUROC, 0.904; SMR, 0.992; Brier score, 0.055). Compared with GOSSIS-1, Acute Physiology and Chronic Health Evaluation (APACHE)-IIIj (ANZICS-APD) and APACHE-IVa (eICU-CRD) had worse discrimination, with AUROCs of 0.904 and 0.869, and poorer calibration, with SMRs of 0.594 and 0.770, and Brier scores of 0.059 and 0.063, respectively. Conclusions: GOSSIS-1 is a modern, free, open-source inhospital mortality prediction algorithm for critical care patients, achieving excellent discrimination and calibration across three countries.
Article
Blockchain has gradually attracted widespread attention from the IoT research community due to its decentralization, consistency, and other attributes. It builds a secure and robust system by having each participant node keep a local backup of the chain to collectively maintain the network. However, this feature brings privacy concerns: since all nodes can access the chain data, users' sensitive information is at risk of leakage. The local differential privacy (LDP) mechanism is a promising way to address this issue, as it perturbs data before it is uploaded to the chain. Traditional LDP mechanisms, however, do not fit well with blockchain because they require a fixed input range, a large data volume, and a shared privacy budget, all of which are practically difficult to guarantee in a decentralized environment. To overcome these problems, we propose a novel LDP mechanism that splits input numerical data and perturbs the digital bits individually, which requires neither a fixed input range nor a large data volume. In addition, we use an iterative approach to adaptively allocate the privacy budget across the different perturbation procedures, minimizing the total deviation of the perturbed data and increasing data utility. We employ mean estimation as the statistical utility metric, under both identical and randomized privacy budgets, to evaluate the performance of our novel LDP mechanism. The experiment results indicate that the proposed LDP mechanism performs better in different scenarios, and our adaptive privacy budget allocation model can significantly reduce the deviation of the perturbation function, providing high data utility while maintaining privacy.
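A minimal sketch of the bitwise perturbation idea in Python (names and the per-bit budget split are illustrative assumptions, not the authors' exact mechanism): each binary digit of the input is passed through randomized response with its own share of the privacy budget, and the reports are debiased before aggregation.

    import math
    import random

    def perturb_bits(value, n_bits, budgets):
        # Split a non-negative integer into binary digits and apply
        # randomized response to each digit with its own budget share.
        bits = [(value >> i) & 1 for i in range(n_bits)]
        noisy = []
        for b, eps in zip(bits, budgets):
            p_keep = math.exp(eps) / (math.exp(eps) + 1.0)  # keep bit w.p. e^eps/(e^eps+1)
            noisy.append(b if random.random() < p_keep else 1 - b)
        return noisy

    def debias_value(noisy_bits, budgets):
        # Debias each bit (E[report] = (1-p) + b*(2p-1)) and recombine;
        # the result is an unbiased estimate of the original value.
        est = 0.0
        for i, (b, eps) in enumerate(zip(noisy_bits, budgets)):
            p = math.exp(eps) / (math.exp(eps) + 1.0)
            est += (2 ** i) * (b - (1.0 - p)) / (2.0 * p - 1.0)
        return est

Averaging the debiased values over many users then yields an unbiased mean estimate, which is the utility metric the paper evaluates.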
Article
Federated learning is a promising distributed machine learning paradigm that has been playing a significant role in providing privacy-preserving learning solutions. However, alongside all its achievements, there are also limitations. First, traditional frameworks assume that all clients participate voluntarily, motivated only by improving the model's accuracy. In reality, clients usually want to be adequately compensated for the data and resources they will use before participating. Second, today's frameworks do not offer sufficient protection against malicious participants who try to skew a jointly trained model with poisoned updates. To address these concerns, we have developed a more robust federated learning scheme based on joint differential privacy. The framework provides two game-theoretic mechanisms to motivate clients to participate in training. These mechanisms are dominant-strategy truthful, individually rational, and budget-balanced. Further, the influence an adversarial client can have is quantified and restricted, and data privacy is similarly guaranteed in quantitative terms. Experiments with different training models on real-world datasets demonstrate the effectiveness of the proposed approach.
Article
The popularity of wearable smart healthcare devices has led to the emergence of a new service paradigm. However, to improve service quality, manufacturers and online service providers collect massive amounts of data. This is a serious concern, as medical data is extremely sensitive. A few schemes have been proposed to overcome this problem, but they suffer from security risks and increased overall complexity, and they provide neither implicit entity authentication nor data integrity. We address these problems by allowing rectified data access through a directing authority, known as the transcryptor, using polymorphic encryption. Entity authentication and data integrity are achieved by smartly organizing data access and key information packets. The performance of the proposed approach is tested on data of different modalities and varying sizes, while the security analysis is demonstrated using a challenge-response game model. The comparison with state-of-the-art schemes illustrates the superiority of the proposed approach.
Chapter
We initiate a study of the composition properties of interactive differentially private mechanisms. An interactive differentially private mechanism is an algorithm that allows an analyst to adaptively ask queries about a sensitive dataset, with the property that an adversarial analyst’s view of the interaction is approximately the same regardless of whether or not any individual’s data is in the dataset. Previous studies of composition of differential privacy have focused on non-interactive algorithms, but interactive mechanisms are needed to capture many of the intended applications of differential privacy and a number of the important differentially private primitives.
Article
A key factor in big data analytics and artificial intelligence is the collection of user data from a large population. However, the collection of user data comes at the price of privacy risks, not only for users but also for businesses who are vulnerable to internal and external data breaches. To address privacy issues, local differential privacy (LDP) has been proposed to enable an untrusted collector to obtain accurate statistical estimation on sensitive user data (e.g., location, health, and financial data) without actually accessing the true records. As key-value data is an extremely popular NoSQL data model, there are a few works in the literature that study LDP-based statistical estimation on key-value data. However, these works have some major limitations, including supporting small key space only, fixed key collection range, difficulty in choosing an appropriate padding length, and high communication cost. In this article, we propose a two-phase mechanism PrivKVM* as an optimized and highly-complete solution to LDP-based key-value data collection and statistics estimation. We verify its correctness and effectiveness through rigorous theoretical analysis and extensive experimental results.
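The value-perturbation idea behind PrivKV-style mechanisms can be sketched as follows (a simplified, hedged illustration; the paper's two-phase PrivKVM* pipeline adds key perturbation and iterative refinement on top of this): discretize the value to +/-1, apply randomized response to the sign, and debias at the aggregator.

    import math
    import random

    def perturb_value(v, eps):
        # Discretize v in [-1, 1] to +/-1 so that E[v_star] = v,
        # then apply randomized response to the sign.
        v_star = 1.0 if random.random() < (1.0 + v) / 2.0 else -1.0
        p = math.exp(eps) / (math.exp(eps) + 1.0)  # probability of keeping v_star
        return v_star if random.random() < p else -v_star

    def estimate_mean(reports, eps):
        # E[report] = (2p - 1) * v, so dividing by (2p - 1) debiases the mean.
        p = math.exp(eps) / (math.exp(eps) + 1.0)
        return sum(reports) / len(reports) / (2.0 * p - 1.0)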
Article
Effective physical data sharing has been facilitating the functionality of Industrial IoTs and is believed to be one primary basis for Industry 4.0. These physical data, while providing pivotal information for multiple components of a production system, also bring severe privacy issues for both workers and manufacturers, thus aggravating the challenges of data sharing. Current designs tend to simplify the behaviors of participants for better theoretical analysis, and they cannot properly handle the challenges in IIoTs, where behaviors are more complicated and correlated. Therefore, this paper proposes a privacy-preserved data sharing framework for IIoTs, where multiple competing data consumers exist in different stages of the system. The framework allows data contributors to share their contents upon request. The uploaded contents are perturbed to preserve the sensitive status of contributors, and differential privacy is adopted in the perturbation to guarantee privacy preservation. The data collector then processes and relays contents to subsequent data consumers, gaining both its own data utility and extra profits from the data relay. Two algorithms are proposed for data sharing in different scenarios, based on whether the service provider will further process the contents to retain its exclusive utility. This work also provides, for both algorithms, a comprehensive consideration of privacy, data utility, bandwidth efficiency, payment, and rationality for data sharing. Finally, evaluation on real-world datasets demonstrates the effectiveness of the proposed methods, together with clues for data sharing towards Industry 4.0.
Article
In a decentralized Internet of Things (IoT) network, a fusion center receives information from multiple sensors to infer a public hypothesis of interest. To prevent the fusion center from abusing the sensor information, each sensor sanitizes its local observation using a local privacy mapping, which is designed to achieve both inference privacy of a private hypothesis and data privacy of the sensor raw observations. Various inference and data privacy metrics have been proposed in the literature. We introduce the concept of privacy implication (with vanishing budget) to study the relationships between these privacy metrics. We propose an optimization framework in which both local differential privacy (data privacy) and information privacy (inference privacy) metrics are incorporated. In the parametric case where sensor observations’ distributions are known a priori, we propose a two-stage local privacy mapping at each sensor, and show that such an architecture is able to achieve information privacy and local differential privacy to within the predefined budgets. For the nonparametric case where sensor distributions are unknown, we adopt an empirical optimization approach. Simulation and experiment results demonstrate that our proposed approaches allow the fusion center to accurately infer the public hypothesis while protecting both inference and data privacy.
Conference Paper
Local differential privacy (LDP) is a strong privacy standard for collecting and analyzing data, which has been used, e.g., in the Chrome browser, iOS and macOS. In LDP, each user perturbs her information locally, and only sends the randomized version to an aggregator who performs analyses, which protects both the users and the aggregator against private information leaks. Although LDP has attracted much research attention in recent years, the majority of existing work focuses on applying LDP to complex data and/or analysis tasks. In this paper, we point out that the fundamental problem of collecting multidimensional data under LDP has not been addressed sufficiently, and there remains much room for improvement even for basic tasks such as computing the mean value over a single numeric attribute under LDP. Motivated by this, we first propose novel LDP mechanisms for collecting a numeric attribute, whose accuracy is at least no worse (and usually better) than existing solutions in terms of worst-case noise variance. Then, we extend these mechanisms to multidimensional data that can contain both numeric and categorical attributes, where our mechanisms always outperform existing solutions regarding worst-case noise variance. As a case study, we apply our solutions to build an LDP-compliant stochastic gradient descent algorithm (SGD), which powers many important machine learning tasks. Experiments using real datasets confirm the effectiveness of our methods, and their advantages over existing solutions.
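For context, the classical one-dimensional baseline such mechanisms are measured against is the mechanism of Duchi et al., sketched below (a hedged illustration, not the paper's improved mechanism): each user reports one of two extreme values, with probabilities chosen so that the report is unbiased for the true input.

    import math
    import random

    def duchi_mechanism(x, eps):
        # LDP report for a single numeric value x in [-1, 1]: output +C or -C
        # with C = (e^eps + 1)/(e^eps - 1), chosen so that E[report] = x.
        e = math.exp(eps)
        C = (e + 1.0) / (e - 1.0)
        p_pos = 0.5 + x / (2.0 * C)  # probability of reporting +C
        return C if random.random() < p_pos else -C

Averaging such reports across users gives an unbiased mean estimate with worst-case noise variance at most C^2, the quantity that improved mechanisms drive down.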
Article
With the rapid development of e-healthcare systems, patients equipped with resource-limited e-healthcare devices (Internet of Things) generate huge amounts of health data for health management. These health data possess significant medical value when aggregated from these distributed devices. However, efficient health data aggregation poses several security and privacy issues, such as confidentiality disclosure and differential attacks, and patients may be reluctant to contribute their health data for aggregation. In this paper, we propose a privacy-preserving health data aggregation scheme that securely collects health data from multiple sources and guarantees fair incentives for contributing patients. Specifically, we employ signature techniques to keep incentives fair for patients. Meanwhile, we add noise to the health data for differential privacy. Furthermore, we combine the Boneh-Goh-Nissim cryptosystem and Shamir secret sharing to provide data obliviousness security and fault tolerance. Security and privacy discussions show that our scheme can resist differential attacks, tolerate healthcare center failures, and keep incentives fair for patients. Performance evaluations demonstrate cost-efficient computation, communication, and storage overhead.
Article
With the growing availability of public open data, the protection of citizens' privacy has become a vital issue for governmental data publishing. However, there are a large number of operational risks in current government cloud platforms. When the cloud platform is attacked, most existing privacy protection models for data publishing cannot resist the attack if the attacker has prior background knowledge. Potential attackers may gain access to the published statistical data and identify specific individuals' background information, which may cause the disclosure of citizens' private information. To address this problem, we propose a fog-computing-based differential privacy approach for privacy-preserving data publishing. We discuss the risk of citizens' privacy disclosure related to governmental data publishing, and present a differential privacy framework for publishing governmental statistical data based on fog computing. Based on the framework, a data publishing algorithm using a MaxDiff histogram is developed, which can be used to preserve user privacy based on fog computing. Applying the differential privacy method, Laplace noise is added to the original data set, which prevents citizens' privacy from disclosure even if attackers have strong background knowledge. According to the maximum frequency difference, adjacent data bins are grouped, and the differential privacy histogram with minimum average error can then be constructed. We evaluate the proposed approach by computational experiments based on the real data set of Philippine families' income and expenditures provided by Kaggle. It shows that the proposed data publishing approach can not only effectively protect citizens' privacy, but also reduce the query sensitivity and improve the utility of the published data.
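The noise-addition step itself is the standard Laplace mechanism; a minimal sketch, assuming unit L1 sensitivity (one individual changes one histogram bin by at most one) and leaving out the MaxDiff grouping:

    import numpy as np

    def laplace_histogram(counts, eps):
        # Add Laplace(1/eps) noise to each bin of a histogram with
        # L1 sensitivity 1; the result satisfies eps-differential privacy.
        counts = np.asarray(counts, dtype=float)
        noise = np.random.laplace(loc=0.0, scale=1.0 / eps, size=counts.shape)
        return counts + noise

As the abstract describes, the MaxDiff step then groups adjacent bins by maximum frequency difference before publishing, which lowers the average error of the noisy histogram.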
Conference Paper
We propose truncated concentrated differential privacy (tCDP), a refinement of differential privacy and of concentrated differential privacy. This new definition provides robust and efficient composition guarantees, supports powerful algorithmic techniques such as privacy amplification via sub-sampling, and enables more accurate statistical analyses. In particular, we show a central task for which the new definition enables exponential accuracy improvement.
Article
The collection and analysis of telemetry data from users' devices is routinely performed by many software companies. Telemetry collection leads to improved user experience but poses significant risks to users' privacy. Locally differentially private (LDP) algorithms have recently emerged as the main tool that allows data collectors to estimate various population statistics, while preserving privacy. The guarantees provided by such algorithms are typically very strong for a single round of telemetry collection, but degrade rapidly when telemetry is collected regularly. In particular, existing LDP algorithms are not suitable for repeated collection of counter data such as daily app usage statistics. In this paper, we develop new LDP mechanisms geared towards repeated collection of counter data, with formal privacy guarantees even after being executed for an arbitrarily long period of time. For two basic analytical tasks, mean estimation and histogram estimation, our LDP mechanisms for repeated data collection provide estimates with comparable or even the same accuracy as existing single-round LDP collection mechanisms. We conduct empirical evaluation on real-world counter datasets to verify our theoretical results. Our mechanisms have been deployed by Microsoft to collect telemetry across millions of devices.
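A single round of 1-bit mean estimation in the spirit of these mechanisms can be sketched as follows (illustrative names; the repeated-collection machinery with guarantees over arbitrarily long periods is the paper's main contribution and is omitted here). Each user with a value in [0, m] sends one biased bit, and the collector debiases the aggregate:

    import math
    import random

    def one_bit_report(x, m, eps):
        # Send 1 with probability 1/(e^eps+1) + (x/m)*(e^eps-1)/(e^eps+1),
        # which is eps-LDP and linear in x.
        e = math.exp(eps)
        p_one = 1.0 / (e + 1.0) + (x / m) * (e - 1.0) / (e + 1.0)
        return 1 if random.random() < p_one else 0

    def mean_estimate(bits, m, eps):
        # Invert the linear map above; the estimate is unbiased for the true mean.
        e = math.exp(eps)
        frac = sum(bits) / len(bits)
        return m * (frac * (e + 1.0) - 1.0) / (e - 1.0)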
Conference Paper
A large amount of valuable information resides in decentralized social graphs, where no entity has access to the complete graph structure. Instead, each user maintains locally a limited view of the graph. For example, in a phone network, each user keeps a contact list locally in her phone, and does not have access to other users' contacts. The contact lists of all users form an implicit social graph that could be very useful to study the interaction patterns among different populations. However, due to privacy concerns, one could not simply collect the unfettered local views from users and reconstruct a decentralized social network. In this paper, we investigate techniques to ensure local differential privacy of individuals while collecting structural information and generating representative synthetic social graphs. We show that existing local differential privacy and synthetic graph generation techniques are insufficient for preserving important graph properties, due to excessive noise injection, inability to retain important graph structure, or both. Motivated by this, we propose LDPGen, a novel multi-phase technique that incrementally clusters users based on their connections to different partitions of the whole population. Every time a user reports information, LDPGen carefully injects noise to ensure local differential privacy. We derive optimal parameters in this process to cluster structurally-similar users together. Once a good clustering of users is obtained, LDPGen adapts existing social graph generation models to construct a synthetic social graph. We conduct comprehensive experiments over four real datasets to evaluate the quality of the obtained synthetic graphs, using a variety of metrics, including (i) important graph structural measures; (ii) quality of community discovery; and (iii) applicability in social recommendation. Our experiments show that the proposed technique produces high-quality synthetic graphs that well represent the original decentralized social graphs, and significantly outperform those from baseline approaches.
Article
GPS-enabled devices are now ubiquitous, from airplanes and cars to smartphones and wearable technology. This has resulted in a wealth of data about the movements of individuals and populations, which can be analyzed for useful information to aid in city and traffic planning, disaster preparedness and so on. However, the places that people go can disclose extremely sensitive information about them, and thus their use needs to be filtered through privacy preserving mechanisms. This turns out to be a highly challenging task: raw trajectories are highly detailed, and typically no pair is alike. Previous attempts fail either to provide adequate privacy protection, or to remain sufficiently faithful to the original behavior. This paper presents DPT, a system to synthesize mobility data based on raw GPS trajectories of individuals while ensuring strong privacy protection in the form of ε-differential privacy. DPT makes a number of novel modeling and algorithmic contributions including (i) discretization of raw trajectories using hierarchical reference systems (at multiple resolutions) to capture individual movements at differing speeds, (ii) adaptive mechanisms to select a small set of reference systems and construct prefix tree counts privately, and (iii) use of direction-weighted sampling for improved utility. While there have been prior attempts to solve the subproblems required to generate synthetic trajectories, to the best of our knowledge, ours is the first system that provides an end-to-end solution. We show the efficacy of our synthetic trajectory generation system using an extensive empirical evaluation.
Conference Paper
In the study of differential privacy, composition theorems (starting with the original paper of Dwork, McSherry, Nissim, and Smith (TCC'06)) bound the degradation of privacy when composing several differentially private algorithms. Kairouz, Oh, and Viswanath (ICML'15) showed how to compute the optimal bound for composing k arbitrary (ε,δ)-differentially private algorithms. We characterize the optimal composition for the more general case of k arbitrary (ε_1,δ_1),…,(ε_k,δ_k)-differentially private algorithms, where the privacy parameters may differ for each algorithm in the composition. We show that computing the optimal composition in general is #P-complete. Since computing optimal composition exactly is infeasible (unless FP=#P), we give an approximation algorithm that computes the composition to arbitrary accuracy in polynomial time. The algorithm is a modification of Dyer's dynamic programming approach to approximately counting solutions to knapsack problems (STOC'03).
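For reference, the baseline that optimal composition improves on is basic composition: running k mechanisms that are (ε_1,δ_1),…,(ε_k,δ_k)-differentially private on the same data is

    (ε_1 + … + ε_k, δ_1 + … + δ_k)-differentially private.

The optimal bound characterized here is strictly tighter, but computing it exactly is #P-complete, which is what motivates the paper's polynomial-time approximation.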
Conference Paper
Machine learning techniques based on neural networks are achieving remarkable results in a wide variety of domains. Often, the training of models requires large, representative datasets, which may be crowdsourced and contain sensitive information. The models should not expose private information in these datasets. Addressing this goal, we develop new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy. Our implementation and experiments demonstrate that we can train deep neural networks with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality.
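The core training step (per-example gradient clipping followed by calibrated Gaussian noise) can be sketched as follows; a minimal illustration with hypothetical names, omitting the paper's refined privacy accounting (the moments accountant):

    import numpy as np

    def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr):
        # Clip each per-example gradient to L2 norm at most clip_norm,
        # sum, add Gaussian noise scaled to the clip norm, average, and step.
        clipped = []
        for g in per_example_grads:
            g = np.asarray(g, dtype=float)
            scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
            clipped.append(g * scale)
        g_sum = np.sum(clipped, axis=0)
        noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=g_sum.shape)
        g_avg = (g_sum + noise) / len(per_example_grads)
        return params - lr * g_avg

Clipping bounds each example's contribution (the sensitivity), which is what allows the Gaussian noise to be calibrated to a fixed scale regardless of the raw gradient magnitudes.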
Article
Nowadays, gigantic amounts of crowd-sourced data from mobile devices have become widely available in social networks, enabling many important data mining applications that improve the quality of our daily lives. While providing tremendous benefits, the release of crowd-sourced social network data to the public poses considerable threats to mobile users' privacy. In this paper, we investigate the problem of real-time spatio-temporal data publishing in social networks with privacy preservation. Specifically, we consider continuous publication of population statistics and design RescueDP, an online aggregate monitoring framework over infinite streams with a w-event privacy guarantee. Its key components, including adaptive sampling, adaptive budget allocation, dynamic grouping, perturbation, and filtering, are seamlessly integrated as a whole to provide privacy-preserving statistics publishing over infinite time stamps. Moreover, we further propose an enhanced RescueDP with neural networks to accurately predict the values of statistics and improve the utility of released data. Both RescueDP and the enhanced RescueDP are proved to satisfy w-event privacy. We evaluate the proposed schemes with real-world as well as synthetic datasets and compare them with two representative w-event privacy-assured methods. Experimental results show that the proposed schemes outperform the existing methods and improve the utility of real-time data sharing with strong privacy guarantees.
Conference Paper
Adaptivity is an important feature of data analysis - the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated a general formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, accurately answers a sequence of adaptively chosen "queries" about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions towards resolving this question: • We give upper bounds on the number of samples n that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015; NIPS, 2015). • We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries (alternatively, risk minimization queries). As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that the stability notion guaranteed by differential privacy implies low generalization error. We also show that weaker stability guarantees such as bounded KL divergence and total variation distance lead to correspondingly weaker generalization guarantees.
Conference Paper
When analyzing data that has been perturbed for privacy reasons, one is often concerned about its usefulness. Recent research on differential privacy has shown that the accuracy of many data queries can be improved by post-processing the perturbed data to ensure consistency constraints that are known to hold for the original data. Most prior work converted this post-processing step into a least squares minimization problem with customized efficient solutions. While improving accuracy, this approach ignored the noise distribution in the perturbed data. In this paper, to further improve accuracy, we formulate this post-processing step as a constrained maximum likelihood estimation problem, which is equivalent to constrained L1 minimization. Instead of relying on slow linear program solvers, we present a faster generic recipe (based on ADMM) that is suitable for a wide variety of applications including differentially private contingency tables, histograms, and the matrix mechanism (linear queries). An added benefit of our formulation is that it can often take direct advantage of algorithmic tricks used by the prior work on least-squares post-processing. An extensive set of experiments on various datasets demonstrates that this approach significantly improves accuracy over prior work.
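The simplest instance of such consistency post-processing is a least-squares projection onto a single linear constraint, e.g. when the noisy bins must sum to a known total; a hedged sketch (the paper's contribution is the L1/maximum-likelihood variant solved with ADMM, which this does not show):

    import numpy as np

    def enforce_total(noisy_counts, total):
        # Least-squares projection onto {x : sum(x) = total}: the closed
        # form spreads the discrepancy evenly across all bins.
        noisy_counts = np.asarray(noisy_counts, dtype=float)
        gap = total - noisy_counts.sum()
        return noisy_counts + gap / len(noisy_counts)

Because post-processing touches only the already-noisy counts, it consumes no additional privacy budget.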
Article
In recent years, many approaches to differentially private histogram publishing have been proposed. Several approaches rely on constructing tree structures in order to decrease the error when answering large range queries. In this paper, we examine the factors affecting the accuracy of hierarchical approaches by studying the mean squared error (MSE) when answering range queries. We start with one-dimensional histograms, and analyze how the MSE changes with different branching factors, after employing constrained inference, and with different methods to allocate the privacy budget among hierarchy levels. Our analysis and experimental results show that combining the choice of a good branching factor with constrained inference outperforms the current state of the art. Finally, we extend our analysis to multi-dimensional histograms. We show that the benefits from employing hierarchical methods beyond a single dimension are significantly diminished, and when there are 3 or more dimensions, it is almost always better to use the Flat method instead of a hierarchy.
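As a rough orientation for the trade-off analyzed here (a back-of-the-envelope bound under even budget allocation, not the paper's exact expressions): with budget ε split evenly over h = ⌈log_b n⌉ levels of a hierarchy with branching factor b, each node receives Laplace noise of scale h/ε (variance 2h²/ε²), and a range query decomposes into at most 2(b−1)h nodes, giving a worst-case error of about

    Var ≤ 2(b−1)h · 2h²/ε² = 4(b−1)h³/ε²,  h = ⌈log_b n⌉.

Increasing b shrinks h but inflates the (b−1) factor, which is why a moderate branching factor beats a binary tree even before constrained inference is applied.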
Article
Privacy preservation in data mining and data release has attracted increasing research interest over a number of decades. Differential privacy is one influential privacy notion that offers a rigorous and provable privacy guarantee for data mining and data release. Existing studies on differential privacy assume that records in a data set are sampled independently. However, in real-world applications, records in a data set are rarely independent. The relationships among records are referred to as correlated information, and such a data set is defined as a correlated data set. A differential privacy technique performed on a correlated data set will disclose more information than expected, which is a serious privacy violation. Although recent research has addressed this new privacy violation, a solid solution for the correlated data set is still needed. Moreover, how to decrease the large amount of noise incurred via differential privacy in a correlated data set is yet to be explored. To fill the gap, this paper proposes an effective correlated differential privacy solution by defining a correlated sensitivity and designing a correlated data releasing mechanism. By taking the correlation levels between records into consideration, the proposed correlated sensitivity can significantly decrease the noise compared with the traditional global sensitivity. The correlated data releasing mechanism, the correlated iteration mechanism, is designed based on an iterative method to answer a large number of queries. Compared with the traditional method, the proposed correlated differential privacy solution enhances the privacy guarantee for a correlated data set at less accuracy cost. Experimental results show that the proposed solution outperforms traditional differential privacy in terms of mean squared error on large groups of queries. This also suggests that correlated differential privacy can successfully retain utility while preserving privacy.
Article
The problem of privacy-preserving data analysis has a long history spanning multiple disciplines. As electronic data about individuals becomes increasingly detailed, and as technology enables ever more powerful collection and curation of these data, the need increases for a robust, meaningful, and mathematically rigorous definition of privacy, together with a computationally rich class of algorithms that satisfy this definition. Differential Privacy is such a definition. After motivating and discussing the meaning of differential privacy, the preponderance of this monograph is devoted to fundamental techniques for achieving differential privacy, and application of these techniques in creative combinations, using the query-release problem as an ongoing example. A key point is that, by rethinking the computational goal, one can often obtain far better results than would be achieved by methodically replacing each step of a non-private computation with a differentially private implementation. Despite some astonishingly powerful computational results, there are still fundamental limitations – not just on what can be achieved with differential privacy but on what can be achieved with any method that protects against a complete breakdown in privacy. Virtually all the algorithms discussed herein maintain differential privacy against adversaries of arbitrary computational power. Certain algorithms are computationally intensive, others are efficient. Computational complexity for the adversary and the algorithm are both discussed. We then turn from fundamentals to applications other than query-release, discussing differentially private methods for mechanism design and machine learning. The vast majority of the literature on differentially private algorithms considers a single, static, database that is subject to many analyses. Differential privacy in other models, including distributed databases and computations on data streams is discussed. Finally, we note that this work is meant as a thorough introduction to the problems and techniques of differential privacy, but is not intended to be an exhaustive survey – there is by now a vast amount of work in differential privacy, and we can cover only a small portion of it.
Article
Randomized Aggregatable Privacy-Preserving Ordinal Response, or RAPPOR, is a technology for crowdsourcing statistics from end-user client software, anonymously, with strong privacy guarantees. In short, RAPPORs allow the forest of client data to be studied, without permitting the possibility of looking at individual trees. By applying randomized response in a novel manner, RAPPOR provides the mechanisms for such collection as well as efficient, high-utility analysis of the collected data. RAPPOR permits statistics to be collected on the population of client-side strings with strong privacy guarantees for each client, and without linkability of their reports.
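The basic building block is randomized response applied to each bit of the report; a simplified sketch of the permanent randomized response step (the Bloom-filter encoding and the second, instantaneous randomization round of RAPPOR are omitted):

    import random

    def permanent_rr_bit(b, f):
        # With probability f, replace the bit by a fair coin flip;
        # otherwise report it truthfully. P(report=1) = f/2 + (1-f)*b.
        if random.random() < f:
            return 1 if random.random() < 0.5 else 0
        return b

    def estimate_true_fraction(reports, f):
        # Invert P(report=1) = f/2 + (1-f)*q to recover the population
        # fraction q of true 1-bits from the aggregated reports.
        obs = sum(reports) / len(reports)
        return (obs - f / 2.0) / (1.0 - f)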
Conference Paper
Boosting is a general method for improving the accuracy of learning algorithms. We use boosting to construct improved privacy-preserving synopses of an input database. These are data structures that yield, for a given set Q of queries over an input database, reasonably accurate estimates of the responses to every query in Q, even when the number of queries is much larger than the number of rows in the database. Given a base synopsis generator that takes a distribution on Q and produces a "weak" synopsis that yields "good" answers for a majority of the weight in Q, our Boosting for Queries algorithm obtains a synopsis that is good for all of Q. We ensure privacy for the rows of the database, but the boosting is performed on the queries. We also provide the first synopsis generators for arbitrary sets of arbitrary low-sensitivity queries, i.e., queries whose answers do not vary much under the addition or deletion of a single row. In the execution of our algorithm certain tasks, each incurring some privacy loss, are performed many times. To analyze the cumulative privacy loss, we obtain an O(ε²) bound on the expected privacy loss from a single ε-differentially private mechanism. Combining this with evolution-of-confidence arguments from the literature, we get stronger bounds on the expected cumulative privacy loss due to multiple mechanisms, each of which provides ε-differential privacy or one of its relaxations, and each of which operates on (potentially) different, adaptively chosen, databases.
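The strengthened bound obtained this way is what is now usually called advanced composition (stated here in its standard form; see the paper for the exact conditions): for k adaptively chosen mechanisms, each ε-differentially private, and any δ' > 0, the composition is (ε', δ')-differentially private with

    ε' = √(2k ln(1/δ')) · ε + k · ε · (e^ε − 1),

so for small ε the cumulative privacy loss grows roughly as √k · ε rather than the k · ε of basic composition.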
Article
In the information realm, loss of privacy is usually associated with failure to control access to information, to control the flow of information, or to control the purposes for which information is employed. Differential privacy arose in a context in which ensuring privacy is a challenge even if all these control problems are solved: privacy-preserving statistical analysis of data. The problem of statistical disclosure control, revealing accurate statistics about a set of respondents while preserving the privacy of individuals, has a venerable history, with an extensive literature spanning statistics, theoretical computer science, security, databases, and cryptography (see, for example, the excellent survey of Adam and Wortmann, the discussion of related work in Blum et al., and the Journal of Official Statistics issue dedicated to confidentiality and disclosure control).