Figure 9: An illustration of data swapping; statistical traits are kept.

Synthetic data sets
Synthetic data generation is another SDC method in which an original set of tuples is replaced with a new set of look-alike tuples that still preserve the statistical properties of the original data values (Ciriani et al., 2007). Synthetic data generation falls into two major categories: fully synthetic and partially synthetic. Fully synthetic datasets, proposed by Rubin (1993), are unreal or pseudo datasets created by replacing the values in the original dataset with imputed values that retain the same statistical characteristics as the original dataset while totally hiding any sensitive or private information (Rubin, 1993) (Reiter, 2002). Little (1993), on the other hand, proposed a different approach: rather than replacing all values in the dataset with synthetic data, partially synthetic datasets are generated in which only the sensitive values are replaced with unreal or pseudo values to enhance confidentiality (Little, 1993). However, Drechsler,
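To make the fully versus partially synthetic distinction concrete, the following minimal Python sketch replaces either every value or only the flagged sensitive values with draws from a distribution fitted to the original column. The column, the normality assumption, and the sensitivity rule are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative numeric attribute (e.g., income); assumed roughly normal.
original = rng.normal(loc=50_000, scale=8_000, size=1_000)

# Fully synthetic (Rubin, 1993): every value is replaced with an imputed
# draw from a model fitted to the original column.
fully_synthetic = rng.normal(loc=original.mean(),
                             scale=original.std(),
                             size=original.size)

# Partially synthetic (Little, 1993): only values flagged as sensitive
# are replaced; the remaining values are released unchanged.
sensitive = original > np.percentile(original, 90)  # illustrative rule
partially_synthetic = original.copy()
partially_synthetic[sensitive] = rng.normal(loc=original.mean(),
                                            scale=original.std(),
                                            size=sensitive.sum())

# Statistical traits of the original are approximately preserved.
print(f"means: {original.mean():.0f} vs {fully_synthetic.mean():.0f} "
      f"vs {partially_synthetic.mean():.0f}")
```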

Source publication
Thesis
Full-text available
Abstract (Summary)
The purpose of this investigation is to study and pursue a user-defined approach in preserving data privacy while maintaining an acceptable level of data utility, using machine learning classification techniques as a gauge in the generation of synthetic data sets. This dissertation will deal with data privacy, data utility, machin...

Similar publications

Article
Full-text available
Virtual learning environments contain valuable data about students that can be correlated and analyzed to optimize learning. Modern learning environments based on data mashups that collect and integrate data from multiple sources are relevant for learning analytics systems because they provide insights into students’ learning. However, data sets in...
Article
Full-text available
Objectives: Privacy and accuracy are always trade-off factors in the field of data publishing. Ideally, both factors are considered critical for data handling. Privacy loss and accuracy loss need to be kept as low as possible for an efficient data handling system. Authors have come up with various data publishing techniques aiming to achieve b...
Conference Paper
Full-text available
The utility that modern smartphone technology provides to individuals is most often enabled by technical capabilities that are privacy-affecting by nature, i.e., smartphone apps are granted access to a multiplicity of sensitive resources required to implement context sensitivity or personalization. Due to the ineffectiveness of current privacy ris...
Article
Full-text available
In the era of the Internet of Things (IoT), drug developers can potentially access a wealth of real-world, participant-generated data that enable better insights and streamlined clinical trial processes. Protection of confidential data is of primary interest when it comes to health data, as medical condition influences daily, professional, and soci...
Conference Paper
Full-text available
For datasets in Collaborative Filtering (CF) recommendations, even if the identifier is deleted and some trivial perturbation operations are applied to ratings before they are released, there are research results claiming that the adversary could discriminate the individual's identity with a little bit of information. In this paper, we propose k...

Citations

... The more privacy we injected into the query permutation (more diversionary keywords), the less usability (fewer relevant documents), and vice versa. Finding the balance between privacy and usability needs remains a challenge, as this study further shows, and as such requires trade-offs [30]-[35]. ...
Conference Paper
Full-text available
Search engines have vast technical capabilities to retain Internet search logs for each user and thus present major privacy vulnerabilities to both individuals and organizations in revealing user intent. Additionally, many of the web search privacy enhancing tools available today require that the user trust a third party, which makes confidentiality of user intent even more challenging. The user is left at the mercy of the third party without control over his or her own privacy. In this article, we suggest a user-centric heuristic, Distortion Search, a web search query privacy methodology that works by forming obfuscated search queries via the permutation of query keyword categories, and by strategically applying k-anonymised web navigational clicks on URLs and Ads to generate a distorted user profile, thus providing confidentiality for specific user intent and queries. We provide empirical results via the evaluation of distorted web search queries in terms of retrieved search results and the resulting web ads from search engines. Preliminary experimental results indicate that web search query and specific user intent privacy might be achievable from the user side without the involvement of the search engine or other third parties.
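A rough sketch of the query-obfuscation idea described above: the true query is hidden among decoy queries built by permuting keywords drawn from diversionary categories. The category names and word lists here are illustrative assumptions, not drawn from the paper.

```python
import random

random.seed(7)

# Hypothetical true query, split into its keywords.
true_query = ["heart", "disease", "symptoms"]

# Illustrative diversionary keyword pools, one per broad topic category.
diversionary_pools = {
    "travel":  ["flights", "hotels", "visa"],
    "finance": ["mortgage", "stocks", "savings"],
    "sports":  ["football", "scores", "league"],
}

def distorted_queries(query, pools, n_decoys=3):
    """Return the true query hidden among permuted decoy queries."""
    decoys = []
    for _category, words in pools.items():
        # Permute the decoy keywords so each decoy reads like a real query.
        perm = random.sample(words, k=len(words))
        decoys.append(perm[:n_decoys])
    batch = decoys + [query]
    random.shuffle(batch)  # an observer cannot tell which query is genuine
    return [" ".join(q) for q in batch]

for q in distorted_queries(true_query, diversionary_pools):
    print(q)
```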
... The last step of the analysis is a consideration of machine learning as a gauge to quantify the usability of the privatized dataset. Both the original and privatized datasets were subjected to the KNN classification algorithm [18]. The classification accuracy on the original data was 90.50 per cent, as shown in Figure 7. ...
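A minimal sketch of using KNN classification accuracy as the usability gauge: the same classifier is trained on the original and on the privatized data, and the accuracy gap measures the usability loss. The dataset and the stand-in privatization step (simple Gaussian noise addition) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Stand-in privacy transform: additive Gaussian noise (illustrative only).
X_priv = X + rng.normal(loc=0.0, scale=0.2, size=X.shape)

def knn_accuracy(features, labels):
    """Train/test split, fit KNN, report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=1)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# The gap between the two accuracies gauges how much usability was lost.
print("original  :", knn_accuracy(X, y))
print("privatized:", knn_accuracy(X_priv, y))
```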
Article
Full-text available
With the increasing number of sophisticated cyber attacks on both government and private infrastructure, cybersecurity data sharing is critical for the advancement of collaborative research among various entities in government, the private sector, and academia. Recently, the US Congress passed the Cyber Intelligence Sharing and Protection Act as a framework for data sharing between various entities. Nevertheless, this development raises the issue of trust between the collaborating parties, since shared data could be revealing. Conversely, due to the sensitive and confidential nature of the data involved, entities would have to employ various anonymization techniques to meet legal requirements, in compliance with the confidentiality policies of both their own organizations and the federal government. Secondly, basic sharing of the data without the privatization process could make the entities involved vulnerable to insider and inference attacks. For instance, an entity sharing data on cyber attacks might accidentally reveal a sensitive network topology to an untrusted collaborator. As a contribution, we propose a modest but effective data privacy enhancement heuristic: a targeted 2k basic data swapping of individual web search log records. In this heuristic, if an individual has a set of x records in their web search log set A, those records are swapped within that individual's set A, and then swapped again with another individual's y records in set B. Our preliminary results show that data swapping is effective for big data and that it would be demanding to trace the original issuer of the queries in a given large dataset of web search logs, thus providing some level of confidentiality.
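A hedged sketch of the targeted 2k swapping idea as described in the abstract: records are first swapped within an individual's own log (set A) and then swapped across individuals (set A with set B). The record structure and the swap rule below are illustrative assumptions.

```python
import random

random.seed(3)

# Illustrative web search log records for two individuals.
set_a = ["a: flu remedies", "a: tax software", "a: hiking boots", "a: car loan"]
set_b = ["b: guitar lessons", "b: flight deals", "b: python tutorial"]

def swap_within(records):
    """First swap: permute an individual's own records (intra-set swap)."""
    swapped = records[:]
    random.shuffle(swapped)
    return swapped

def swap_across(records_a, records_b):
    """Second swap: exchange a random subset of records between two sets."""
    a, b = records_a[:], records_b[:]
    k = min(len(a), len(b)) // 2 or 1
    idx_a = random.sample(range(len(a)), k)
    idx_b = random.sample(range(len(b)), k)
    for i, j in zip(idx_a, idx_b):
        a[i], b[j] = b[j], a[i]
    return a, b

a_final, b_final = swap_across(swap_within(set_a), swap_within(set_b))
print(a_final)
print(b_final)
```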
... However, during this privacy procedure the usability of the privatized healthcare data decreases as useful information that might be sensitive is suppressed or altered. Further compounding this problem is that realizing equilibrium between the need for healthcare records privacy and the usability of the privatized dataset is intractable [1]. Attention in this study is placed on the healthcare data. ...
... The k value is tunable; in other words, the user can select the k central points around which they expect values to accumulate. K-Means employs the Euclidean distance in computing the distances between values and a central mean value, and is formally articulated in the following formula [1]: ...
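The formula itself is cut off in this excerpt; the standard K-Means objective that the description matches (an assumption, since the excerpt does not reproduce it) minimizes the sum of squared Euclidean distances between each value and its cluster mean:

```latex
J \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j \;=\; \frac{1}{\lvert S_j \rvert} \sum_{x_i \in S_j} x_i
```

where S_j is the set of values assigned to cluster j, mu_j is its central mean value, and the norm is the Euclidean distance.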
Conference Paper
Full-text available
Healthcare data regulations and laws require that entities protect patient privacy when managing healthcare records that contain personally sensitive and revealing information. However, during the process of implementing data confidentiality on healthcare records, the usability of the privatized healthcare data is reduced as sensitive information is removed or transformed. Realizing equilibrium between the need for healthcare records privacy and the usability of the privatized dataset is problematic. In this study, an analysis of healthcare records privacy and usability is done using targeted data swapping and K-Means clustering. Focus is placed on healthcare data, with the main problem being how such data can be transacted securely and confidentially while retaining data usability. The question being probed is whether healthcare data, such as the heart disease dataset, can be securely and confidentially shared while maintaining data usability.
... In case of a successful reconstruction attack, what the adversary gets is the noisy data, assuming no prior insider knowledge. Data privacy, in this first step, is achieved using noise addition [9], with a distribution ~ N(μ = 1, σ = 0.2), generating a noisy data set with similar statistical traits to the original [10]. • Phase 2: In the second step, distance transforms are applied on the noisy data set to extract coefficients. ...
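A minimal sketch of the Phase 1 noise addition step, assuming the distribution in the excerpt refers to Gaussian noise with mean 1 and standard deviation 0.2; the data below are illustrative, and the Phase 2 distance-transform step is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative original numeric data set.
original = rng.uniform(low=10, high=20, size=(200, 4))

# Phase 1: noise addition with N(mu=1, sigma=0.2) (assumed Gaussian).
noise = rng.normal(loc=1.0, scale=0.2, size=original.shape)
noisy = original + noise

# The noisy data keeps similar statistical traits (shifted mean, similar spread).
print(original.mean(axis=0).round(2), noisy.mean(axis=0).round(2))
print(original.std(axis=0).round(2), noisy.std(axis=0).round(2))
```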
... Yet still, from an anecdotal view, there seems to be an observable improvement in the clustering results for the privatized synthetic data, as shown in Figure 6, with the exception of the Iris Virginica attribute. However, after application of the Davies-Bouldin Index metric [10] [11] to test the clustering performance, there was an actual degradation in clustering performance, as shown in Figure 8 and Table 3. The Davies-Bouldin Index was reported at 0.668 for the original data and 0.765 for the privatized synthetic data. ...
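For reference, the Davies-Bouldin Index (lower values indicate better-separated clusters) can be computed with scikit-learn; the data set and the stand-in privatization step in this sketch are illustrative assumptions, not the study's actual data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

X, _ = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_priv = X + rng.normal(loc=0.0, scale=0.3, size=X.shape)  # stand-in privatization

def db_index(data, k=3):
    """Cluster with K-Means and score the partition (lower DB is better)."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    return davies_bouldin_score(data, labels)

print("original  :", round(db_index(X), 3))
print("privatized:", round(db_index(X_priv), 3))
```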
... ACKNOWLEDGEMENT A portion of this work was presented as part of the dissertation by the author in fulfillment of the requirements for the D.Sc. degree in the Computer Science Department at Bowie State University [10]. With great gratitude, I extend my sincere thanks to Dr. Claude Turner and Dr. Soo-Yeon Ji, in the Computer Science Department at Bowie State University, for their untiring assistance during this study. ...
Conference Paper
Full-text available
Organizations have an interest in research collaboration efforts that involve data sharing with peers. However, such partnerships often come with confidentiality risks that could involve insider attacks and untrustworthy collaborators who might leak sensitive information. To mitigate such data sharing vulnerabilities, entities share privatized data with retracted sensitive information. However, while such data sets might offer some assurance of privacy, maintaining the statistical traits of the original data is often problematic, leading to poor data usability. Therefore, in this paper, a confidential synthetic data generation heuristic that employs a combination of data privacy and distance transforms techniques is presented. The heuristic is used for the generation of privatized numeric synthetic data while preserving the statistical traits of the original data. Empirical results from applying unsupervised learning, using k-means, to test the usability of the privatized synthetic data set are presented. Preliminary results from this implementation show that it might be possible to generate privatized synthetic data sets with the same statistical morphological structure as the original, using data privacy and distance transforms methods.
... One way to address this problem is the generation of privatized synthetic data sets that retain the statistical traits of the original data. Therefore, as a contribution, the Filtered Classification Error Gauge (Filtered x-CEG) methodology is suggested as a heuristic for the generation of privatized synthetic data [17]. The Filtered x-CEG is a variation of the Comparative x-CEG heuristic process described in Mivule and Turner (2013) [6] and [17]. ...
... The Filtered x-CEG heuristic works as follows: (i) data privacy is applied to the data using noise addition; (ii) in the second step, the signal processing technique of discrete cosine transforms is used to mine the coefficients; (iii) the coefficients are added back to the noisy data; (iv) new privatized synthetic data is produced with a similar formation to the original [17]; (v) the moving average filter is then applied to the privatized synthetic data to improve usability; (vi) machine learning classification is used to test the filtered synthetic data for usability, with lower classification error (higher classification accuracy) as an indication of better data usability [6] [17]. Initial outcomes from this study indicate that privatized synthetic data could be produced with adequate usability levels. ...
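A compressed sketch of steps (i)-(vi) of the Filtered x-CEG heuristic as outlined in the excerpt above. The data set, noise parameters, filter window, classifier choice, and the exact way the DCT coefficients are combined with the noisy data (here a low-pass reconstruction averaged with the noisy data) are illustrative assumptions, not the published implementation.

```python
import numpy as np
from scipy.fft import dct, idct
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# (i) Data privacy via noise addition.
noisy = X + rng.normal(loc=0.0, scale=0.3, size=X.shape)

# (ii) Mine coefficients with the discrete cosine transform (per column),
#      keeping only the low-frequency coefficients (assumption).
coeffs = dct(noisy, axis=0, norm="ortho")
keep = 20                      # illustrative number of coefficients to keep
coeffs[keep:, :] = 0.0

# (iii)+(iv) Combine the coefficient reconstruction with the noisy data to
#      produce privatized synthetic data with a similar formation.
synthetic = 0.5 * (noisy + idct(coeffs, axis=0, norm="ortho"))

# (v) Moving average filter along each column to improve usability.
window = 5
kernel = np.ones(window) / window
filtered = np.column_stack(
    [np.convolve(col, kernel, mode="same") for col in synthetic.T])

# (vi) Machine learning classification as the usability gauge.
def gauge(features, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=1)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

print("original accuracy :", round(gauge(X, y), 3))
print("filtered synthetic:", round(gauge(filtered, y), 3))
```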
Article
Full-text available
In order to comply with data confidentiality requirements while meeting usability needs for researchers, entities are faced with the challenge of how to publish privatized data sets that preserve the statistical traits of the original data. One solution to this problem is the generation of privatized synthetic data sets. However, during the data privatization process, the usefulness of the data has a propensity to diminish even as privacy might be guaranteed. Furthermore, researchers have documented that finding an equilibrium between privacy and utility is intractable, often requiring trade-offs. Therefore, as a contribution, the Filtered Classification Error Gauge heuristic is presented. The suggested heuristic is a data privacy and usability model that employs data privacy, signal processing, and machine learning techniques to generate privatized synthetic data sets with acceptable levels of usability. Preliminary results from this study show that it might be possible to generate privacy-compliant synthetic data sets using a combination of data privacy, signal processing, and machine learning techniques, while preserving acceptable levels of data usability.
Conference Paper
Full-text available
The publication and sharing of network trace data is critical to the advancement of collaborative research among various entities in government, the private sector, and academia. However, due to the sensitive and confidential nature of the data involved, entities have to employ various anonymization techniques to meet legal requirements in compliance with confidentiality policies. Nevertheless, the very composition of network trace data makes it a challenge to apply anonymization techniques. On the other hand, basic application of microdata anonymization techniques on network traces is problematic and does not deliver the necessary data usability. Therefore, as a contribution, we point out some of the ongoing challenges in network trace anonymization. We then suggest usability-aware anonymization heuristics that employ microdata privacy techniques while giving consideration to the usability of the anonymized data. Our preliminary results show that, with trade-offs, it might be possible to generate anonymized network traces with enhanced usability on a case-by-case basis using microdata anonymization techniques.
Article
Full-text available
One of the challenges in implementing differential data privacy is that the utility (usefulness) of the privatized data set diminishes even as confidentiality is guaranteed. In such settings, due to excessive noise, the original data suffers a loss of statistical significance despite the fact that strong levels of data privacy are assured by differential privacy, thus making the privatized data practically valueless to the consumer of the published data. Additionally, researchers have noted that finding equilibrium between data privacy and utility requirements remains intractable, necessitating trade-offs. Therefore, as a contribution, we propose using the moving average filtering model for non-interactive differential privacy settings, with the resulting empirical data. In this model, various levels of differential privacy (DP) are applied to a data set, generating various privatized data sets. The privatized data is passed through a moving average filter, and the new filtered privatized data sets that meet a set utility threshold are finally published. Preliminary results from this study show that adjustment of the ε (epsilon) parameter in the differential privacy process and the application of the moving average filter might generate better data utility output while conserving privacy in non-interactive differential privacy settings.
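A minimal sketch of the moving-average filtering idea for non-interactive differential privacy, assuming a simple Laplace mechanism on a histogram-style release, illustrative ε values, and an illustrative utility threshold; none of these parameters are taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true counts (e.g., a histogram released non-interactively).
true_counts = rng.integers(low=50, high=200, size=60).astype(float)
sensitivity = 1.0            # one individual changes one count by at most 1
utility_threshold = 10.0     # max acceptable mean absolute error (illustrative)

def moving_average(x, window=5):
    """Simple moving average filter applied to the privatized series."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

published = {}
for epsilon in (0.1, 0.5, 1.0):
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    noisy = true_counts + rng.laplace(loc=0.0,
                                      scale=sensitivity / epsilon,
                                      size=true_counts.size)
    filtered = moving_average(noisy)
    mae = np.mean(np.abs(filtered - true_counts))
    if mae <= utility_threshold:   # publish only data sets meeting the threshold
        published[epsilon] = filtered
    print(f"epsilon={epsilon}: MAE after filtering = {mae:.2f}")
```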