Conference Paper · PDF available

Test-Driven Anonymization for Artificial Intelligence

Abstract

In recent years, the volume of data published and shared with third parties to develop artificial intelligence (AI) tools and services has increased significantly. When there are regulatory or internal requirements regarding data privacy, anonymization techniques are used to preserve privacy by transforming the data. The side effect is that anonymization may render the data useless for training and testing the AI, which depends heavily on data quality. To overcome this problem, we propose a test-driven anonymization approach for artificial intelligence tools. The approach tests different anonymization efforts to achieve a trade-off between privacy (non-functional quality) and the functional suitability of the artificial intelligence technique (functional quality). The approach has been validated on two real-life datasets in the domains of healthcare and health insurance. Each dataset is anonymized with several privacy protections and then used to train classification AIs. The results show how the data can be anonymized to achieve adequate functional suitability in the AI context while keeping the privacy of the anonymized data as high as possible. Article available at: http://digibuo.uniovi.es/dspace/bitstream/10651/52773/1/testDrivenAnonymizationAI.pdf
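The test-driven loop described in the abstract can be sketched in a few lines. Everything below is an illustrative assumption rather than the paper's actual setup: the toy dataset, the use of interval-width generalization of an age attribute as the privacy knob, and a 1-nearest-neighbour rule standing in for the trained classification AI.

```python
from collections import Counter

def generalize_age(age, width):
    """Recode an age to the lower bound of its interval (wider interval = more privacy)."""
    return (age // width) * width

def k_of(rows, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifiers: the 'k' achieved."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(groups.values())

def accuracy(train_rows, test_rows):
    """1-nearest-neighbour on age, standing in for the trained classification AI."""
    hits = 0
    for t in test_rows:
        nearest = min(train_rows, key=lambda r: abs(r["age"] - t["age"]))
        hits += nearest["label"] == t["label"]
    return hits / len(test_rows)

def tda(train, test, widths, min_accuracy):
    """Try increasingly strong anonymization; keep the most private configuration
    whose functional suitability (accuracy on the real test data) is still adequate."""
    best = None
    for width in sorted(widths):            # increasing anonymization effort
        anon = [dict(r, age=generalize_age(r["age"], width)) for r in train]
        acc = accuracy(anon, test)          # functional quality, checked on real data
        if acc >= min_accuracy:
            best = (width, k_of(anon, ["age"]), acc)
    return best

train = [{"age": a, "label": "A"} for a in (20, 21, 22)] + \
        [{"age": a, "label": "B"} for a in (40, 41, 42)]
test = [{"age": 23, "label": "A"}, {"age": 43, "label": "B"}]
best = tda(train, test, widths=[1, 5, 10, 30], min_accuracy=1.0)
```

With these toy numbers the loop settles on interval width 10, which yields k = 3 while the stand-in classifier still labels both real test records correctly; width 30 would push privacy further but drops accuracy below the threshold.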
Software Engineering Research Group
University of Oviedo
giis.uniovi.es
Test-driven Anonymization for Artificial Intelligence
2019 IEEE International Conference on Artificial Intelligence Testing (AITest 2019)
Cristian Augusto
Department of Computing
University of Oviedo
Gijón, Spain
augustocristian@uniovi.es
Jesús Morán
Department of Computing
University of Oviedo
Gijón, Spain
moranjesus@uniovi.es
Claudio de la Riva
Department of Computing
University of Oviedo
Gijón, Spain
claudio@uniovi.es
Javier Tuya
Department of Computing
University of Oviedo
Gijón, Spain
tuya@uniovi.es
Article available at: http://hdl.handle.net/10651/56868
... Likewise, the EvoSuite AI tool generates unit tests that maximize code coverage, identifying edge cases that manual testing might miss [8]. Test.ai uses machine learning to analyze past test results and optimize the order in which test cases are executed, focusing on the most critical areas first [9]. These are proven use cases of the successful integration of AI into the critical phases of the SDLC to ensure a high-quality product. ...
Article
Full-text available
Modern software systems are becoming more intricate, making the identification of risks in the software requirements phase (a fundamental aspect of the software development life cycle, SDLC) complex. Inadequate risk assessment may result in the malfunction of a software system, either in the development or the production phase. Therefore, risk prediction plays a crucial role in software requirements, serving as the first step in any software project. Hence, developing adaptive predictive models that can offer consistent and explainable insights for risk prediction is imperative. This study proposes novel ensemble class balanced nested dichotomy (EBND) fuzzy induction models for risk prediction in software requirements. Specifically, the proposed EBND models employ a hierarchical structure consisting of binary trees featuring distinct nested dichotomies that are generated randomly for each tree. Thereafter, an ensemble principle is used to refine the rules generated from the resulting binary trees. The predictive efficacy of the proposed EBND models is further extended by introducing a data sampling method into their prediction process; this sampling mitigates the underlying disparity in the class labels that may otherwise affect prediction. The efficacy of the EBND models is then evaluated and compared with current solutions on an open-source software risk dataset. The findings reveal that the EBND models demonstrate superior predictive capabilities compared with conventional models and state-of-the-art methodologies. Specifically, the EBND models achieved an average accuracy of 98%, as well as high values for the f-measure metric.
... • Matrix Multiplication Support: Efficiently supported by extending the STPC multiplication protocol to accommodate vector operations, streamlining this critical operation. • Depth-Optimized Circuits: The protocol optimizes circuit depth, improving efficiency and reducing online communication; this is achieved by combining various AND gates [37]. ...
Article
Full-text available
In Industry 4.0, data is enormously exchanged through wireless devices. Ensuring data privacy and protection is vital. The proposed review paper explores how AI techniques can safeguard sensitive information and ensure compliance. It explores fundamental technologies such as cyber-physical systems and the complex application of AI in analytics and predictive maintenance. The issues with data security are then emphasized and privacy concerns resulting from human-machine interaction are shown. Regulatory frameworks that direct enterprises are touted as essential defenses, coupled with AI-powered solutions and privacy-preserving tactics. Examples from everyday life highlight the constant battle for equilibrium. The review continues with a look ahead to future developments in interdisciplinary research and ethical issues that will influence Industry 4.0's responsible growth. In essence, this paper synthesizes a nuanced understanding of the sophisticated challenges surrounding privacy and data protection within Industry 4.0, underscoring the pivotal role of AI as a custodian of sensitive information and offering an indispensable resource for professionals, policymakers, and researchers navigating the intricate and evolving terrain of Industry 4.0 with technical precision and ethical responsibility.
... To address these concerns, it is necessary to design AI systems with robust privacy and security safeguards; for instance, AI developers can implement data anonymisation (Augusto et al., 2019) and regular security audits to ensure that data is adequately protected as suggested in Chehri et al. (2021). Regarding transparency and accountability, developers can provide documentation (Raji et al., 2020) explaining AI systems' data sources and algorithms and their potential risks and limitations. ...
... As shown in Table 3 in the Appendix, there are metrics recommended by authors or experts. While the quality attributes with the most metric recommendations (with the frequency of primary studies) are Fairness (24), Explainability (11), and Functional Suitability (8), there are also quality attributes with no metric recommendation, such as Compatibility, Portability, and Trustworthiness. Even though there are metric recommendations in the literature, several challenges remain in assuring the quality of AI-based software systems with respect to these attributes. ...
Presentation
Full-text available
ABSTRACT: Background: Artificial Intelligence (AI) is a field of science in which computers and machines are created and designed to reproduce human reasoning autonomously and intelligently, making it possible to do things that require intelligence when performed by humans. Recently, science has witnessed a significant growth in the adoption of AI in healthcare, representing a revolution in medicine and health care worldwide, driven by advances in information technology (IT), of which AI emerges as a powerhouse. Objective: Accordingly, the aim of this study was to examine the research conducted on Artificial Intelligence (AI) in medicine. Methods: To this end, a bibliometric study was carried out using the Scopus bibliographic database for data collection. The data were analyzed according to: 1) subject area; 2) main authors; 3) main journals; 4) keywords; 5) most cited articles; 6) documents by country/territory; and 7) publication period. In addition, the three main bibliometric laws were followed, namely Lotka's Law, Bradford's Law, and Zipf's Law. Final considerations: The study identified a strong presence of AI in medicine, demonstrating the influence that digital technologies exert on human beings: several AI tools are being used to support research on individuals' health, to aid clinical decision-making, and to use patient data as a basis for the prevention, diagnosis, and treatment of diseases, which can improve patients' health. The study also showed the strong presence of the United States in terms of the number of publications on AI and medicine.
Furthermore, a jump in the number of published works since 2021 was observed. For future research, we suggest an analysis that addresses more carefully the advantages and disadvantages of using AI technologies in medicine, as well as which tools are being used and for what purposes, since AI has numerous potential applications in medicine. KEYWORDS: Artificial Intelligence; Medicine; Digital Technologies; Machines.
Article
The rapidly growing demand to share data more openly creates a need for secure and privacy-preserving sharing technologies. However, there are multiple challenges associated with the development of a universal privacy-preserving data sharing mechanism, and existing solutions still fall short of their promises.
Conference Paper
Full-text available
Artificial intelligence (AI) is a broad field whose prevalence in the health sector has increased in recent years. Clinical data are the basic staple that feeds intelligent healthcare applications, but due to their sensitive character, their sharing and usage by third parties require compliance with both confidentiality agreements and security measures. Data anonymization emerges as a solution that both increases data privacy and reduces the risk of unintentional disclosure of sensitive information through data modifications. Although anonymization improves privacy, these modifications also harm the functional suitability of the data. They can affect applications that employ the anonymized data, especially data-centric ones such as AI tools. To obtain a trade-off between both qualities (privacy and functional suitability), we use the Test-Driven Anonymization (TDA) approach, which incrementally anonymizes the data used to train the AI tools and validates them against the real data until their quality is maximized. The approach is evaluated on a real-world dataset from the Spanish Institute for the Study of the Biology of Human Reproduction (INEBIR). The anonymized datasets are used to train AI tools, and the dataset that achieves the best trade-off between privacy and functional quality requirements is selected. The results show that TDA can be successfully applied to anonymize the clinical data of the INEBIR, allowing the data to be transferred to third parties without transgressing user privacy and enabling the development of useful AI tools with the anonymized data.
Chapter
Full-text available
In recent years, privacy-preserving data mining has been studied extensively, because of the wide proliferation of sensitive information on the internet. A number of algorithmic techniques have been designed for privacy-preserving data mining. In this paper, we provide a review of the state-of-the-art methods for privacy. We discuss methods for randomization, k-anonymization, and distributed privacy-preserving data mining. We also discuss cases in which the output of data mining applications needs to be sanitized for privacy-preservation purposes. We discuss the computational and theoretical limits associated with privacy-preservation over high dimensional data sets.
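The randomization family of methods mentioned above can be illustrated minimally: each sensitive value is perturbed with independent noise, masking individual records while aggregate statistics remain approximately recoverable. The noise model, its magnitude, and the toy data are assumptions for this sketch, not taken from the chapter.

```python
import random

def randomize(values, spread, seed=0):
    """Perturb each sensitive value with independent uniform noise in [-spread, spread]."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [v + rng.uniform(-spread, spread) for v in values]

ages = [23, 35, 41, 52, 60, 29, 48, 33, 57, 44]
noisy = randomize(ages, spread=10)

# individual records are masked, but the mean survives approximately,
# so distribution-level mining can still be performed on the released data
true_mean = sum(ages) / len(ages)
noisy_mean = sum(noisy) / len(noisy)
```

In the methods the chapter surveys, the data miner additionally reconstructs the original value distribution from the noisy release and the known noise distribution; this sketch shows only the perturbation step.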
Poster
Sociotechnical researchers have recently begun studying people with rare diseases. There is potential for impact if data can be anonymized and shared so additional research can take place. However, this data also presents a high risk of re-identification because of the rarity of the diseases. Using existing research on data protection techniques, we generate an anonymized version of a rare disease data set and explore the utility of this data in replicating existing rare disease research. We also explore the utility of this data for seven additional use cases generated by other researchers. We find the loss of utility varies depending on the use case, analysis method, and evaluation metrics.
Article
Often a data holder, such as a hospital or bank, needs to share person-specific records in such a way that the identities of the individuals who are the subjects of the data cannot be determined. One way to achieve this is to have the released records adhere to k-anonymity, which means each released record has at least (k-1) other records in the release whose values are indistinct over those fields that appear in external data. So, k-anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information. This paper provides a formal presentation of combining generalization and suppression to achieve k-anonymity. Generalization involves replacing (or recoding) a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. The Preferred Minimal Generalization Algorithm (MinGen), a theoretical algorithm presented herein, combines these techniques to provide k-anonymity protection with minimal distortion. The real-world algorithms Datafly and μ-Argus are compared to MinGen. Both Datafly and μ-Argus use heuristics to make approximations, and so they do not always yield optimal results. It is shown that Datafly can over-distort data and μ-Argus can additionally fail to provide adequate protection.
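The two operations the paper combines can be shown in a few lines. The toy table, the ZIP-code quasi-identifier, and the digit-masking recoding scheme are illustrative assumptions for this sketch; they are not the paper's MinGen, Datafly, or μ-Argus algorithms.

```python
from collections import Counter

def generalize_zip(zipcode, level):
    """Generalization: replace the last `level` digits with '*', yielding a less
    specific but semantically consistent value (e.g. 33001 -> 330**)."""
    return zipcode[:len(zipcode) - level] + "*" * level

def suppress(rows, quasi_ids, k):
    """Suppression: drop any record whose equivalence class over the
    quasi-identifiers still has fewer than k members after generalization."""
    sizes = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return [r for r in rows if sizes[tuple(r[q] for q in quasi_ids)] >= k]

rows = [{"zipcode": z} for z in ("33001", "33002", "33003", "10115")]
generalized = [dict(r, zipcode=generalize_zip(r["zipcode"], 2)) for r in rows]
released = suppress(generalized, ["zipcode"], k=2)
```

Here generalization merges the three 330xx records into one equivalence class of size 3, while the lone 101xx record would still be unique, so suppression removes it and the release satisfies 2-anonymity.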
Article
Big data has the potential to create significant value in health care by improving outcomes while lowering costs. Big data's defining features include the ability to handle massive data volume and variety at high velocity. New, flexible, and easily expandable information technology (IT) infrastructure, including so-called data lakes and cloud data storage and management solutions, makes big-data analytics possible. However, most health IT systems still rely on data warehouse structures. Without the right IT infrastructure, analytic tools, visualization approaches, work flows, and interfaces, the insights provided by big data are likely to be limited. Big data's success in creating value in the health care sector may require changes in current policies to balance the potential societal benefits of big-data approaches and the protection of patients' confidentiality. Other policy implications of using big data are that many current practices and policies related to data use, access, sharing, privacy, and stewardship need to be revised.
Article
One of the important issues in software testing is providing an automated test oracle. Test oracles are reliable sources of how the software under test must operate. In particular, they are used to evaluate the actual results produced by the software. However, in order to generate an automated test oracle, it is necessary to map the input domain to the output domain automatically. In this paper, Multi-Networks Oracles based on Artificial Neural Networks are introduced to handle the mapping automatically. They are an enhanced version of previous ANN-based Oracles. The proposed model was evaluated within a framework provided by mutation testing and applied to test two industry-sized case studies. In particular, a mutated version of each case study was provided and injected with some faults. Then, a fault-free version was developed as a Golden Version to evaluate the capability of the proposed oracle in finding the injected faults. Meanwhile, the quality of the proposed oracle is measured by assessing its accuracy, precision, misclassification error, and recall. Furthermore, the results of the proposed oracle are compared with former ANN-based Oracles. Accuracy of the proposed oracle was up to 98.93%, and the oracle detected up to 98% of the injected faults. The results of the study show the proposed oracle has better quality and applicability than the previous model. Keywords: Automated software testing; Software test oracle; Artificial neural networks; Mutation testing
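The learned-oracle idea above can be sketched without a neural network: a model fitted to golden-version executions predicts the expected output for each input, and the mutant's actual output is flagged when it deviates beyond a tolerance. Here a least-squares line stands in for the ANN, and the golden function, the injected fault, and the tolerance are all made-up assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit y = a*x + b, 'trained' on golden-version input/output pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def golden(x):   # fault-free "Golden Version" of the software under test
    return 2 * x + 1

def mutant(x):   # mutated version with an injected fault for inputs above 5
    return 2 * x + (1 if x <= 5 else -1)

xs = list(range(10))
a, b = fit_line(xs, [golden(x) for x in xs])   # learn the input-to-output mapping

def oracle_flags_fault(x, tolerance=0.5):
    """Compare the mutant's actual output against the oracle's expected output."""
    return abs(mutant(x) - (a * x + b)) > tolerance
```

The oracle accepts the mutant's output for inputs where the fault is dormant and flags it where the injected fault fires, mirroring how the paper scores its oracles against injected faults.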