Tackling data and model drift in AI: Strategies for maintaining accuracy during ML model inference
Surya Gangadhar Patchipala *
Director, Consulting Expert - Data, AI, ML Engineering
International Journal of Science and Research Archive, 2023, 10(02), 1198-1209
Publication history: Received on 19 September 2023; revised on 26 November 2023; accepted on 30 November 2023
Article DOI: https://doi.org/10.30574/ijsra.2023.10.2.0855
*Corresponding author: Surya Gangadhar Patchipala
Copyright © 2023 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.
Abstract
In machine learning (ML) and artificial intelligence (AI), sustaining model accuracy over time is critical, particularly in dynamic environments where data and relationships change. Data and model drift pose the challenging issues this paper explores: shifts in input data distributions or underlying model structures that progressively degrade predictive performance. The paper analyzes the main drift types in depth, covering covariate, prior probability, and concept drift on the data side, and parameter, hyperparameter, and algorithmic drift on the model side. Key causes, ranging from environmental changes to evolving data sources and overfitting, contribute to decreased model reliability.
The article also discusses practical strategies for detecting and mitigating drift, such as regular monitoring, statistical tests, and performance tracking, alongside solutions like automated recalibration, ensemble methods, and online learning models that enhance adaptability. Furthermore, the importance of feedback loops and automated systems in handling drift is emphasized, with real-world case studies illustrating drift impacts in financial and healthcare applications. Finally, emerging directions such as AI-based drift prediction, transfer learning, and robust model design are highlighted as the future of drift management in AI systems.
Keywords: Data Drift; Model Drift; AI Model Accuracy; Machine Learning Drift Detection; Concept Drift
1. Introduction
In the fast-changing world of artificial intelligence (AI) and machine learning (ML), deployed models can be perturbed by data and model drift to the point of dramatically decreased accuracy. When the environment changes, that is, when data patterns or the relationships between variables shift, drift accompanies the change and model performance deteriorates. Picture your machine learning model as an instrument that is perfectly tuned when deployed; as the world around it begins to shift, the model loses its ability to perform exactly as you want it to.
Data and model drift happen for various reasons, including changes in user behavior, market trends, and even how data is collected. If allowed to grow unchecked, drift can corrupt the accuracy of your model. With proactive strategies, however, drift can be detected and corrected early, and the model will remain accurate during inference.
1.1. Understanding Data Drift
Data drift occurs when the statistical properties of input data change over time. In simpler terms, the data your model sees in real-world use differs from the data it was trained on. For example, if you built a model to predict customer preferences from 2019 data and, in 2023, are still using that same model without updates, you probably would not see much benefit: user behavior, preferences, and external factors would have changed, and the model's predictions would have lost their effectiveness.
Data drifts for many reasons. These include seasonal patterns, such as increased spending during holidays, or temperature trends that change users' behavior. External events like economic recessions or political changes can alter data patterns dramatically. Additionally, as your user base grows and changes, the demographic makeup of the data might evolve, making it less reflective of the training set.
1.2. Understanding Model Drift
Model drift refers to a model no longer properly accomplishing its task. A model's predictive accuracy decreases over time, especially when the relationships among the variables it relies on change. For instance, a model trained on past financial data may struggle to predict the stock market as financial policies change, technology advances, and investors alter their behavior.
Model drift can also result from overfitting, when the model is so finely tuned to historical data that it fails to generalize to new data. In such a situation, the model may do well at first, but its accuracy will degrade as it is exposed to data that does not fit its learned patterns.
1.3. Why Drift is a Big Deal
Data and model drift have real-world consequences. Unreliable predictions caused by a drop in model performance can mean inefficient processes or critical failures, depending on the use case. In high-stakes fields like healthcare or finance, where decisions rest on AI predictions, even a slight drop in accuracy can mean a wrong diagnosis, financial losses, or safety risks.
This makes it crucial to stay ahead of drift, recognizing and managing it early so that corrective action can be taken before it causes significant issues.
2. Types of Data Drift
In this context, data drift refers to changes in the statistical properties of the input data that affect model performance. Data drift takes several forms, each affecting machine learning models differently.
2.1. Covariate Drift
Covariate drift happens when the distribution of the input features changes. Suppose you have an e-commerce recommendation system trained on customer preferences in one season. If new products are introduced in another season, the patterns in customer preferences may shift, and your model may struggle to make accurate recommendations. Covariate drift does not necessarily change the relationship between inputs and outputs, but the inputs evolve, making it harder for the model to predict outcomes reliably.
2.2. Prior Probability Drift
This form of drift refers to shifts in the overall probability distribution of the target classes. In a fraud detection model, for example, a sudden increase or decrease in the incidence rate of fraudulent transactions causes prior probability drift. A model trained on a stable fraud rate would have a harder time making accurate predictions as the likelihood of fraud in the population shifts.
2.3. Concept Drift
Concept drift happens when the relationship between input and output variables changes. Unlike covariate drift, which affects only the inputs, concept drift alters the underlying logic or patterns your model has learned. This is common in stock market or weather prediction, where external factors cause the relationships between variables to change over time. The same set of economic indicators, for example, may generate different market outcomes because investors behave differently or external events intervene.
Figure 1 Concept Drift Detection Framework
3. Types of Model Drift
Model drift, on the other hand, is specific to changes in the model itself, either in its performance or in its internal configuration. Several types of model drift can affect the accuracy and reliability of machine learning systems.
3.1. Parameter Drift
Parameter drift occurs when the parameters a model learned during training stop working as intended. These parameters were optimized for the training data, but as the data changes, the learned values no longer predict adequately, and model accuracy decays as more unseen or evolving data is processed.
3.2. Hyperparameter Drift
Unlike parameters, which the model learns, hyperparameters are set before training and control how the model learns. Hyperparameter drift happens when the originally selected hyperparameters no longer produce the best results because the data environment has changed. This type of drift is often subtle and may require regular hyperparameter tuning to restore performance, especially in models deployed in dynamic settings where the nature of the data changes rapidly.
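Such re-tuning can be scripted with standard tools. The sketch below is a minimal illustration rather than a prescribed method: it re-runs a small grid search on a recent window of labeled data, where the estimator, the grid, and the recent_X/recent_y data handles are hypothetical stand-ins.

```python
# Sketch: periodic hyperparameter re-tuning on a recent window of labeled data.
# recent_X and recent_y are hypothetical handles to fresh production data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1",
)
search.fit(recent_X, recent_y)  # re-tune against the current data environment
print("Best hyperparameters for current data:", search.best_params_)
model = search.best_estimator_  # candidate for redeployment after validation
```

Triggering such a search after each drift alert, rather than on a fixed calendar, keeps tuning cost proportional to how quickly the environment actually changes.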
3.3. Algorithmic Drift
Algorithmic drift is a broader concept that occurs when the machine learning algorithm itself becomes outdated or less effective. This can happen because AI methodology advances, but more often it occurs because the structure underlying the data has changed in a way the algorithm cannot handle efficiently. An algorithm that once detected anomalies well may become inadequate as data complexity grows, requiring a switch to a more sophisticated algorithm or architecture.
4. Causes of Data Drift
Understanding why data drift happens is key to developing viable safeguards for machine learning models. Data drift can stem from the ambient environment, changes in human behavior, or the data pipeline itself.
4.1. External Environmental Changes
The external environment can undergo unexpected shifts that majorly impact data patterns and erode the predictability of the data on which future predictions are based. During a sudden disruption, for example, businesses are likely to see big swings in how consumers behave, spend, or make lifestyle decisions. Such changes can push the new data patterns so far from what the models saw in training that the models cannot generalize to the new reality. Additionally, governmental policy changes, regulatory shifts, or economic downturns can introduce abrupt variations in data trends, furthering the likelihood of drift. An economic recession, for instance, may alter spending or investment behavior in the financial sector, which can mislead algorithms if the new patterns are not detected and accounted for.
4.2. Human Behavior Shifts
Human behavior is dynamic and unpredictable, and in customer-facing industries it is a large source of data drift. Social trends and shifting consumer preferences can produce new data patterns disconnected from the historical norm. Suppose increased awareness of sustainable and eco-friendly products changes what people buy: a recommendation engine trained on prior consumer preferences, without these new sustainable tendencies, will fail to recommend relevant items, resulting in inaccuracies and poor model performance. Similar shifts are common in social media, where trends evolve rapidly and data patterns from even a few months ago may become irrelevant to current behavior.
4.3. Changes in Data Distribution
Data drift can also occur when the process of collecting the data changes, when the sources of the data change, or when the structure of the data changes. The underlying distribution may vary in response to modifications in the data pipeline, such as a switch in suppliers, a change in collection methodology, or new processing steps. Even a slight difference between how the training data was collected and the data the model sees in production can degrade performance. For example, changed labeling criteria can alter the data structure and lead to unintentional drift. In healthcare, changing the types of health metrics collected for patient analysis may cause shifts that impact the accuracy of a previously trained predictive model.
5. Causes of Model Drift
Model drift refers to a gradual decline in a model's performance due to various factors, from shifting variable
relationships to issues within the model itself.
5.1. Changing Relationships Between Variables
As real-world relationships between variables evolve, they can undermine how effectively the model makes predictions. For example, if age and income once had a strong predictive relationship in a certain model, economic or societal changes might weaken or alter that relationship. When these changes go unnoticed, the model makes less accurate predictions, necessitating retraining or adjustment.
5.2. Overfitting to Past Data
When a model is overly specialized to historical data (overfitting), it cannot generalize and adapt to new data. An overfitted model may do well initially, but its accuracy falls away as data patterns change. Overfitting makes the model rigid and less able to capture emerging trends or changes in behavior.
5.3. Degradation in Prediction Performance
Over time, all models are subject to some level of performance degradation. A model's predictive accuracy may gradually decline due to ongoing shifts in the input data or in the relationships between variables. This degradation can manifest slowly, often unnoticed, until the model's predictions no longer meet acceptable standards. Consistent monitoring and recalibration are needed to counteract this drift.
6. The Impact of Drift on Machine Learning Models
Unchecked data or model drift can severely harm a machine learning model. Drift can be subtle or significant, resulting in a gradual decrease in accuracy, unreliable predictions, or a model that simply stops working as intended.
6.1. Reduced Accuracy
One of the most immediate effects of drift is a drop in accuracy. When data patterns change and the relationships the model learned no longer hold for new inputs, the model struggles to make correct predictions. In critical applications such as fraud detection or medical diagnosis, such a loss in accuracy is particularly dangerous, as predictive accuracy is vital.
6.2. Increased Prediction Errors
As drift increases, so does the frequency of prediction errors. When the model starts to behave erratically, errors pile up over time, eventually leading to bad decision-making. In a financial forecasting model, for instance, this can mean large, costly misjudgments: economic losses and strategic errors.
6.3. Failure to Generalize to New Data
Generalizing well to new data is one of the core purposes of a machine learning model. Under drift, however, the model's ability to handle previously unseen data breaks down. As the input distributions diverge from the original training data, the model fails to recognize new patterns and relationships, ultimately rendering it ineffective or obsolete.
7. How to Detect Data and Model Drift
Detecting drift early is important for keeping a model accurate and reliable. There are several ways to locate drift before it creates problems.
7.1. Monitoring Data Distributions
Regularly monitoring the distribution of incoming data can reveal early signs of drift. Data scientists can check for discrepancies between the distribution of incoming data and that of the original training data, which might point to covariate or prior probability drift. This is especially useful for spotting gradual changes that would otherwise go unnoticed.
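One common way to quantify such distributional discrepancies is the Population Stability Index (PSI). The following is a minimal NumPy sketch, under the assumption that train_col and live_col are one-dimensional arrays for a single feature; the 0.2 alert threshold is a widely used rule of thumb, not a fixed standard.

```python
import numpy as np

def psi(train_col, live_col, bins=10):
    """Population Stability Index between a training feature and live data."""
    # Bin edges from training-data quantiles, so skewed features bin sensibly.
    edges = np.quantile(train_col, np.linspace(0, 1, bins + 1))
    # Clip both samples into the training range so every value lands in a bin.
    expected, _ = np.histogram(np.clip(train_col, edges[0], edges[-1]), bins=edges)
    actual, _ = np.histogram(np.clip(live_col, edges[0], edges[-1]), bins=edges)

    eps = 1e-6  # avoids log(0) and division by zero in empty bins
    expected = expected / expected.sum() + eps
    actual = actual / actual.sum() + eps
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Simulated example: the live feature has shifted in mean and spread.
rng = np.random.default_rng(0)
train_col = rng.normal(0.0, 1.0, 10_000)
live_col = rng.normal(0.5, 1.2, 10_000)
score = psi(train_col, live_col)
print(f"PSI = {score:.3f} -> {'investigate drift' if score > 0.2 else 'stable'}")
```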
7.2. Using Statistical Tests
Differences between training and real-world data can be identified through statistical tests, such as the Kolmogorov-Smirnov test for continuous variables or the chi-squared test for categorical variables. These tests give a quantitative sense of how much the incoming data differs from the data the model was trained on, allowing data drift to be caught early.
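As a minimal sketch of both tests with SciPy (the feature arrays and category counts below are simulated stand-ins for real training and production data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Continuous feature: two-sample Kolmogorov-Smirnov test.
train_feature = rng.normal(50.0, 10.0, 5_000)
live_feature = rng.normal(55.0, 12.0, 5_000)   # simulated drifted inputs
ks_stat, ks_p = stats.ks_2samp(train_feature, live_feature)
print(f"KS statistic={ks_stat:.3f}, p-value={ks_p:.4f}")

# Categorical feature: chi-squared test on category frequency tables.
train_counts = np.array([800, 150, 50])        # category counts at training time
live_counts = np.array([650, 250, 100])        # category counts in production
chi2, chi_p, dof, _ = stats.chi2_contingency(np.stack([train_counts, live_counts]))
print(f"chi-squared={chi2:.2f}, p-value={chi_p:.4f}")

# Small p-values suggest the live data no longer matches the training data.
alpha = 0.01
if ks_p < alpha or chi_p < alpha:
    print("Possible data drift detected; investigate before trusting predictions.")
```

In practice these tests are run per feature on a schedule, and the significance level is adjusted for the number of features tested.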
7.3. Model Performance Tracking
Tracking model performance metrics such as accuracy, precision, recall, and F1 score over time makes it possible to detect drift at the model level. A sustained decline across these metrics is typically a red flag that the model is drifting and needs to be checked. Setting performance thresholds and monitoring for deviations can enable automated alerts, helping teams address issues before they escalate.
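A minimal sketch of such threshold-based alerting follows. It assumes ground-truth labels arrive with some delay so that a rolling accuracy can be computed; the window size and threshold are illustrative choices, not recommendations.

```python
from collections import deque

class PerformanceMonitor:
    """Tracks rolling accuracy and flags when it falls below a floor."""

    def __init__(self, window=500, threshold=0.90):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect
        self.threshold = threshold

    def record(self, prediction, actual):
        self.outcomes.append(int(prediction == actual))

    def check(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return None  # not enough labeled outcomes yet for a stable estimate
        accuracy = sum(self.outcomes) / len(self.outcomes)
        if accuracy < self.threshold:
            return f"ALERT: rolling accuracy {accuracy:.2%} below {self.threshold:.0%}"
        return None

# Usage: call record() as true labels arrive, and check() on a schedule.
monitor = PerformanceMonitor(window=500, threshold=0.90)
```

The same pattern extends to precision, recall, or F1 by recording the quantities those metrics need.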
Figure 2 Drift detection and adaptation framework
8. Strategies for Tackling Model Drift
Tackling model drift requires proactive measures that maintain the accuracy and reliability of machine learning models. By implementing the solutions below, you can mitigate the effects of drift and keep your model performing well in dynamic environments.
8.1. Regular Model Recalibration
One straightforward approach to managing model drift is regular recalibration. By fine-tuning your model periodically, you can ensure it adapts to new data conditions and changing relationships between variables. Whether adjusting parameters, retraining the model with fresh data, or updating hyperparameters, recalibration helps maintain accuracy as the environment evolves.
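One scheduling pattern, sketched below under stated assumptions, is to retrain on historical data blended with the freshest labeled production data, either on a fixed cadence or immediately when drift is flagged; the data handles and the 30-day cadence are illustrative.

```python
import datetime
import numpy as np

def recalibrate(model, historical_X, historical_y, recent_X, recent_y):
    """Refit the model on historical data plus the freshest labeled data."""
    X = np.concatenate([historical_X, recent_X])
    y = np.concatenate([historical_y, recent_y])
    model.fit(X, y)  # full retrain; incremental updates are an alternative
    return model

def recalibration_due(last_refit, drift_flagged, cadence_days=30):
    """Recalibrate on a fixed schedule or as soon as drift is detected."""
    overdue = (datetime.date.today() - last_refit).days >= cadence_days
    return overdue or drift_flagged
```

Weighting recent samples more heavily than historical ones is a common refinement when the environment is known to move quickly.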
8.2. Using Ensemble Methods
Ensemble methods, which combine the predictions of multiple models, are another effective strategy for combating drift. The diversity of models in an ensemble balances the effects of drift on any individual model: if one model starts to fail, the others can fill in, keeping the impact on performance to a minimum. Ensembles are especially useful in drift-prone environments, where techniques such as bagging, boosting, and stacking make the combined predictions as robust as possible.
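A minimal sketch using scikit-learn's soft-voting ensemble, one common way to combine diverse learners so that no single model's drift dominates the output; the dataset is synthetic and the base models are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Diverse base learners respond differently to distribution shifts,
# so the combined vote degrades more gracefully than any single model.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1_000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted class probabilities
)
ensemble.fit(X_train, y_train)
print("Held-out accuracy:", ensemble.score(X_test, y_test))
```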
8.3. Online Learning Approaches
Online learning models are designed to update themselves continuously as new data becomes available. Unlike traditional models, which adapt only through periodic retraining, they can incorporate new data on the fly and are thus highly resilient to drift. This real-time learning approach addresses one of the most critical vulnerabilities in machine learning systems.
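A minimal sketch of this incremental pattern uses scikit-learn's partial_fit to update a linear classifier one mini-batch at a time; the synthetic stream below drifts slowly to mimic changing production data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first call

rng = np.random.default_rng(0)
for step in range(100):  # stands in for an endless production stream
    # Inputs whose mean creeps upward over time: gradual covariate drift.
    X_batch = rng.normal(step * 0.01, 1.0, size=(32, 5))
    y_batch = (X_batch.sum(axis=1) > step * 0.05).astype(int)
    # Each call nudges the decision boundary toward the latest data,
    # absorbing gradual drift without a full offline retraining cycle.
    model.partial_fit(X_batch, y_batch, classes=classes)
```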
Figure 3 A Model Ageing Chart Based on a Model Degradation Experiment
9. Automation in Drift Detection and Correction
Automation is critical for managing data and model drift in large-scale machine learning systems, where manual drift monitoring and correction are cumbersome. By combining automated monitoring with self-healing models, organizations can detect and resolve drift efficiently, reducing the need for manual oversight and speeding up response times.
9.1. Automated Monitoring Systems
Automated monitoring systems provide continuous oversight of data and model drift in real time. These systems analyze incoming data distributions, comparing them with the original training data to spot deviations. When drift is detected, they can trigger alerts, allowing data scientists to respond quickly. Advanced automated systems go further, making immediate corrections or initiating retraining processes to recalibrate the model without human intervention. This capability helps models stay accurate and responsive, addressing drift as it arises and maintaining performance over time.
9.2. Self-Healing Models
Self-healing models represent an advanced level of automation: the model autonomously retrains itself upon detecting drift. These models monitor their own performance and, when accuracy declines, initiate corrective actions independently. By adapting to new data conditions without manual recalibration, self-healing models minimize downtime and ensure ongoing accuracy and relevance, enhancing the efficiency and reliability of machine learning systems at scale.
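A hedged sketch of the core loop follows: a drift check gates an automatic retrain, and a human is notified only if retraining fails to restore accuracy. The drift_test and notify callables are hypothetical placeholders; a statistic such as the PSI or Kolmogorov-Smirnov comparison described earlier could serve as the former.

```python
def self_healing_step(model, reference_X, live_X, live_y, drift_test, notify,
                      min_accuracy=0.85):
    """One monitoring cycle: detect drift and retrain automatically if found."""
    if drift_test(reference_X, live_X):       # e.g. a PSI- or KS-based comparison
        model.fit(live_X, live_y)             # retrain on freshly labeled data
        if model.score(live_X, live_y) < min_accuracy:
            notify("Automatic retraining did not restore accuracy; review needed.")
    return model
```

Scoring on the same data used for retraining is only a smoke test; a held-out validation slice gives a more honest check before the healed model is redeployed.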
In this rapidly growing field, the ability to manage drift is essential. Automated systems and self-healing models provide tools that keep models performing at their peak with minimal cost and manual effort, maximizing both performance and longevity.
10. The Role of Feedback Loops in Managing Drift
Effective drift management requires feedback loops that provide continuous updates from real-world data. A robust feedback loop keeps the model aligned with its dynamic environment so that it can adapt quickly and continue to perform well.
10.1. Continuous Feedback from Real-World Data
Incorporating continuous feedback from the model's operating environment allows drift to be detected early. By comparing the model's outputs against real-time data, discrepancies can be uncovered before they become serious. This feedback mechanism is indispensable in systems that must adapt rapidly and maintain accuracy consistently, such as recommendation engines and autonomous systems.
10.2. Human-in-the-Loop Systems
Human-in-the-loop systems add an extra layer of quality control by involving expert intervention when needed. Automated systems may detect drift and take corrective actions, but complex cases or critical applications can still benefit from human judgment. This approach combines automation with human expertise, allowing data scientists to oversee model updates and ensure that any adjustments made by automated systems meet accuracy and reliability standards.
11. Real-World Examples of Data and Model Drift
Data and model drift are encountered in practical applications across many fields, where they require prompt correction to keep models accurate and relevant. Because machine learning (ML) models can be very sensitive to changing conditions, some industries, such as finance and healthcare, are particularly prone to drift. Below are two case studies showing how drift affects these fields and the corrective actions typically taken.
11.1. Case Study: Drift in Financial Models
In the finance industry, market conditions shift constantly, directly affecting the reliability of the ML models used for trading, risk assessment, credit scoring, and similar tasks. A trading algorithm trained on data from a bull market, characterized by rising stock prices and high trading volume, may fall short in a bear market, when prices decline and liquidity dries up. The model was built to suggest trades under certain market conditions, and both its prediction accuracy and its suggested trades can deteriorate when those conditions change significantly. This is a concept drift situation: the relationship between inputs (market indicators) and outputs (buy/sell decisions) varies over time.
To combat drift, financial models need frequent retraining that incorporates data from varied market conditions, whether bull or bear. Organizations sometimes apply ensemble methods or online learning approaches so the models can adapt in real time as conditions evolve. By regularly updating models and including new data, financial institutions help ensure these models remain effective across varied economic landscapes.
11.2. Case Study: Drift in Healthcare Prediction Systems
Healthcare prediction systems are equally vulnerable to drift, as they rely on patient data that evolves alongside medical treatments, practices, and demographics. For instance, a model that predicts which patients are at risk of particular conditions may have been trained on historical data reflecting older medical guidelines and practices. When new treatments or protocols are adopted, such as updated guidance for treating diabetes or cardiovascular disease, the model may no longer reflect current patient care.
Moreover, changes in demographics, such as shifts in age distributions or fluctuations in health trends, can result in data drift and degraded performance. Healthcare models need continuous updates with the most recent patient data and medical knowledge to maintain accuracy. These updates often involve adding new data or adjusting the model structure to accommodate current medical practice. Organizations may also deploy feedback loops from real-world patient data to detect early signs of drift, prompting retraining when significant changes in patient demographics or treatment protocols are detected.
12. Challenges in Managing Drift
Despite the effective strategies available for handling drift in machine learning models, several challenges remain. Recognizing these hurdles is the first step toward solving the problem.
12.1. Lack of Real-Time Monitoring
Managing drift is challenging when real-time monitoring systems are absent: without them, early detection is difficult and performance degradation goes unnoticed. Many organizations rely on periodic checks or manual audits, which are often insufficient for identifying subtle changes in data distributions or model performance. As a result, drift can go undetected until it significantly impacts accuracy and decision-making.
12.2. Limited Resources for Retraining
Model retraining consumes resources: time, computing power, and human expertise. Constraints on any of these can limit an organization's ability to retrain models regularly or to adopt a more flexible retraining schedule. The result can be outdated models that fail to reflect current data conditions, increasing the risk of drift-related performance issues.
12.3. Complexities in Understanding Drift
Spotting drift can be difficult, and identifying and correcting it becomes harder with complex datasets and complex models. In particular, it can be hard to tell when drift has occurred, and to what extent, when the change in the data seems minimal, since even small shifts can greatly affect model performance in high dimensions. Additionally, the relationships between variables can change unexpectedly, complicating efforts to maintain model accuracy.
13. Future Directions for Drift Management
Researchers are consequently studying novel ways of addressing drift management in the field of AI.
13.1. Using AI to Predict Drift
One promising area of development is AI systems that predict the likelihood of drift. These systems analyze historical data and detect trends that precede drift events, allowing organizations to take preventive action before major performance issues occur. Such predictive models act as a useful early warning system, improving the responsiveness of current drift management strategies.
13.2. Creating More Robust Models
Future research also focuses on creating more robust models capable of handling a wide range of data variation. Researchers aim to reduce drift's impact by designing models that are less sensitive to any particular data distribution. Regularization, adversarial training, and diverse training datasets could all help develop models that are more resilient to environmental change.
13.3. Mitigating Drift Using Transfer Learning
Another approach to overcoming data and model drift is transfer learning, which allows a model trained on one domain to transfer its knowledge to another domain where data is scarce. For instance, a model trained on data from one demographic may be fine-tuned to work well for another demographic, closing the gap created by drift in the training set.
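As a minimal sketch of this idea, the example below trains a classifier on plentiful source-domain data and then continues training on a small, shifted target-domain sample through incremental updates; the synthetic data stands in for the two demographics.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)

# Source domain: plentiful labeled data from the original demographic.
X_src = rng.normal(0.0, 1.0, size=(5_000, 8))
y_src = (X_src[:, 0] + X_src[:, 1] > 0).astype(int)

# Target domain: scarce labels and a shifted feature distribution.
X_tgt = rng.normal(0.4, 1.1, size=(200, 8))
y_tgt = (X_tgt[:, 0] + X_tgt[:, 1] > 0.4).astype(int)

model = SGDClassifier(loss="log_loss", random_state=1)
model.fit(X_src, y_src)  # learn the general structure from the source domain

# Fine-tune: a few incremental passes over the small target sample adapt
# the decision boundary without discarding what was learned at scale.
for _ in range(5):
    model.partial_fit(X_tgt, y_tgt, classes=np.array([0, 1]))
print("Target-domain accuracy:", model.score(X_tgt, y_tgt))
```

Deep models follow the same pattern by freezing early layers and fine-tuning only the final ones on the target data.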
14. Conclusion
In the fast-moving, ever-evolving field of AI, it is critical to remain alert to drift. The drift problem is inevitable, but a strategic and proactive stance minimizes its consequences. Keeping machine learning models accurate and reliable over time means deploying robust monitoring systems, employing predictive AI tools to detect drift early, and building flexible models.
The ultimate goal is to create AI systems that are adaptable from the get-go, not just performing well when they first go
live but able to thrive in the face of changing constraints of the real world. With new techniques and tools emerging,
drift management research and innovation will enable AI models to remain responsive, resilient, and relevant in an
ever-changing world.
If you want AI systems to succeed in an unpredictable world, managing drift is about both the immediate accuracy of your decision-making and the long-term success and robustness of your AI systems.
Compliance with ethical standards
Disclosure of conflict of interest
No conflict of interest to be disclosed.
... As demonstrated by Piccialli et al. (2021, p. 5), ROLS mitigates outlier sensitivity in urban datasets, while MLR, as employed by Cao et al. (2020), provides interpretable classification of binary outcomes. Methodologically, the integration of SHAP (SHapley Additive exPlanations) values (Dewi et al., 2022, p. 72) enables granular feature interpretation, addressing the "black-box" limitations of deep learning approaches that hinder policy adoption (Patchipala, 2023). Practically, the model's design prioritizes computational efficiency achieving 91.56% accuracy with limited data and deploy-ability in low-infrastructure settings, unlike LSTM or CNN hybrids that require extensive training data (Zhu et al., 2020, p. 61). ...
... Classical machine learning techniques, particularly logistic regression and support vector machines, dominated early research due to their interpretability and modest computational requirements (Benny & Soori, 2017). These models demonstrated reasonable accuracy (75-85%) in structured parking environments but struggled with the nonlinear temporal dependencies and spatial correlations characteristic of on-street parking systems (Patchipala, 2023). ...
... Recent hybrid approaches attempt to bridge these gaps. Some studies combine MLR's interpretable framework with XGBoost-derived features, achieving 91-94% accuracy while maintaining auditability (Patchipala, 2023). Others employ MLR as a meta-learner for tree-based model outputs, particularly effective when integrating disparate data sources like weather feeds and traffic APIs (Musa et al., 2023). ...
Article
Full-text available
Purpose: This study developed an enhanced multivariate logistic regression (MLR) model integrated with robust ordinary least squares (ROLS) techniques to address parking occupancy prediction challenges in rapidly urbanizing environments. Focusing on developing country contexts with infrastructure constraints, the research targeted three limitations of conventional approaches: vulnerability to data anomalies, insufficient interpretability, and poor adaptation to resource-limited settings. Methodology: Employing design science research (DSR) methodology, the study utilized parking datasets from Kaggle and GitHub repositories. Comprehensive preprocessing included ROLS-based outlier treatment and temporal/environmental feature engineering. The model incorporated SHapley Additive exPlanations (SHAP) for interpretability and underwent hyperparameter optimization via grid search. Evaluation employed an 80-20 train-test split with accuracy, precision, recall, F1-score, and AUC-ROC metrics. Findings: The ensemble model achieved superior performance (R²=0.9007, MSE=0.00878, accuracy=91.56%) compared to standalone MLR (84.31% accuracy) and ROLS (MSE=0.00872) implementations. Key predictors included historical occupancy patterns, temporal variables, and weather conditions. SHAP analysis confirmed the model's operational transparency while maintaining computational efficiency. Unique Contribution to Theory, Practice and Policy: Implementation in real-time smart parking systems through IoT networks is recommended. Future research should pursue: 1) cross-regional validation studies, 2) dynamic pricing algorithm integration, and 3) enhanced anomaly detection mechanisms. The study provides a theoretically grounded yet practical solution optimized for developing urban contexts.
... While short-term bias detection and mitigation strategies are widely studied, the long-term accumulation of BME remains underexplored, particularly regarding its impact on model drift and information quality decay [3]. Existing AI governance frameworks often overlook this risk, leaving LLMs vulnerable to evolving into self-reinforcing misinformation engines that compromise their reliability and neutrality across critical domains like science, business, public policy, and education. ...
... Large Language Models (LLMs) have become indispensable in natural language understanding and generation across various domains, including healthcare, legal systems, and public policy. However, their outputs are prone to biases, misinformation, and errors that can compromise fairness and utility, especially in high-stakes settings [6,7,3]. These biases often originate from training data, model architecture, and user interactions and are further exacerbated by multilevel feedback loops during deployment [8,9]. ...
... At the macro level, AI models undergo self-reinforcing updates where they train on their own generated outputs, leading to model drift and long-term information decay [3,19]. ...
Conference Paper
Full-text available
Large Language Models (LLMs) are critical tools for knowledge generation and decision-making in fields such as science, business, governance, and education. However, these models are increasingly prone to Bias, Misinformation, and Errors (BME) due to multi-level feedback loops that exacerbate distortions over iterative training cycles. This paper presents a comprehensive framework for understanding these feedback mechanisms-User-AI Interaction, Algorithmic Curation, and Training Data Feedback-as primary drivers of model drift and information quality decay. We introduce three novel metrics-Bias Amplification Rate (BAR), Echo Chamber Propagation Index (ECPI), and Information Quality Decay (IQD) Score-to quantify and track the impact of feedback-driven bias propagation. Simulations demonstrate how these metrics reveal evolving risks in LLMs over successive iterations. Our findings emphasize the urgency of implementing lifecycle-wide governance frameworks incorporating real-time bias detection, algorithmic fairness constraints, and human-in-the-loop verification to ensure the long-term reliability, neutrality, and accuracy of LLM-generated outputs.
... (24,25,26) Furthermore, traditional radar frequency utilization systems are becoming congested, with their potential to scale in the future rapidly growing due to the increasing density of logistical drone systems, space-based systems, and air traffic management systems. (27) When it comes to ground-based traffic or lower airspace traffic, there is a lack of readily available, long-term, historic, and operational traffic data that can be used to gain insights into lower airspace operations. (28) In particular, obtaining global traffic data at grid points is nontrivial. ...
... Another major advantage of AI systems is their ability to process and classify a huge amount of real-time incoming data that increases data accuracy with the flow. (27) Processing could occur instantaneously and yield constant and up-to-date aviation situational awareness, which is either lacking in today's systems or tied to the physical constraints imposed by traditional methods. (27,28,29,30) AI can also offer a robust future prediction module once trained to recognize complex geophysics. ...
... (27) Processing could occur instantaneously and yield constant and up-to-date aviation situational awareness, which is either lacking in today's systems or tied to the physical constraints imposed by traditional methods. (27,28,29,30) AI can also offer a robust future prediction module once trained to recognize complex geophysics. In an adaptive system, future predictions could be more robust and more autonomous from forecast updating rates, reducing computational resources gained from onboard non-destructive artificial intelligence. ...
Article
Full-text available
The unprecedented growth of global air traffic has put immense pressure on the air traffic management systems. In light of that, global air traffic situational awareness and surveillance are indispensable, especially for satellite-based aircraft tracking systems. There has been some crucial development in the field; however, every major player in this arena relies on a single proprietary, non-transparent data feed. This is where this chapter differentiates itself. AIS data has been gaining traction recently for the same purpose and has matured considerably over the past decade; however, satellite-based communication service providers have failed to instrument significant portions of the world’s oceans. This study proposes a multimodal artificial intelligence-powered algorithm to boost the estimates of global air traffic situational awareness using the Global Air Traffic Visualization dataset. Two multimodal artificial intelligence agents categorically detect air traffic streaks in a huge collection of satellite images and notify the geospatial temporal statistical agent whenever both modalities are in concordance. A user can fine-tune the multimodal threshold hyperparameter based on the installed detection rate of datasets to get the best satellite-derived air traffic estimates.
... AI technologies improve workplace safety, worker well-being, and productivity by utilizing various technologies such as Machine learning (ML) algorithms to analyze accident data and environmental conditions to identify trends and prevent incidents (Patchipala et al., 2023). In addition, computer vision systems detect unsafe activities and ensure safety compliance (Liu et al., 2021), while NLP improves hazard detection by analyzing safety reports and worker feedback (Shamshiri et al., 2024). ...
Chapter
Full-text available
This chapter examines the incorporation of Artificial Intelligence (AI) into Occupational Health and Safety (OHS), representing a significant change in workplace safety management. It highlights the importance of AI technology in reducing risks, improving worker satisfaction, and increasing operational effectiveness in the workplace. Consideration is given to the functions of significant AI technologies in unsafe work automation, ergonomic assessments, health monitoring, and hazard identification. Some processes include predictive analytics, computer vision, machine learning, and natural language processing. Additionally, innovations that are altering safety training and skill development include wearable technology, augmented reality, and virtual reality. Alongside focusing on issues including employment displacement, regulatory gaps, and ethical problems, the chapter suggests multidisciplinary collaboration and strategic partnerships for AI-driven safety solutions.
... Once a model is deployed in production, it must be monitored continuously to detect performance degradation, model drift, or any unexpected behavior [16,27]. Studies emphasize the importance of real-time monitoring systems that can trigger alerts when models begin to underperform, allowing for timely intervention [28,29]. Tools like Prometheus and Grafana are commonly used for this purpose, but they often require custom integration with machine learning pipelines, which can be resource-intensive [25]. ...
Article
Full-text available
This paper presents a comprehensive solution for managing AI models across their lifecycle through the development of an AI model registry and lifecycle management system. As AI continues to play a crucial role across industries, the complexity of managing models—from development to deployment—presents significant challenges, especially within cross-functional teams. These challenges include issues such as model versioning, metadata management, deployment inconsistencies, and communication breakdowns among data scientists, engineers, and business stakeholders. The proposed system addresses these challenges by providing a centralized platform that integrates features such as version control, metadata management, and automated deployment, thereby improving transparency and reducing the risk of deployment errors. Furthermore, the system fosters enhanced collaboration by integrating widely-used project management tools like GitHub, Jira, and Slack, ensuring that teams remain aligned throughout the model's lifecycle. By enabling continuous monitoring and incorporating automated model drift detection, the system ensures that AI models remain accurate and efficient post-deployment. This paper also explores the technical implementation strategy for the system, including the use of containerization, cloud-native infrastructure, and microservices architecture to ensure scalability and flexibility. The implications of this work extend beyond technical considerations, as it enhances collaboration, improves model quality, and accelerates deployment cycles. Future research directions include exploring automation in model updates, scalability in large enterprises, and the integration of additional tools and frameworks. This work provides a critical step toward optimizing AI model management, offering a scalable, efficient, and secure approach to managing AI models throughout their lifecycle.
Article
Full-text available
This comprehensive article examines the critical challenges of data drift and concept drift in machine learning systems deployed across various industries. The article explores how these phenomena affect model performance in production environments, with a particular focus on healthcare, manufacturing, and autonomous systems. The article analyzes different types of drift, including covariate shifts and prior probability shifts, while exploring their manifestations and impacts. Through findings of real-world implementations, the article presents advanced detection methodologies and mitigation strategies, ranging from statistical approaches to sophisticated monitoring frameworks. The investigation extends to emerging technologies in sustainable manufacturing and edge computing environments, offering insights into future developments in drift management. The findings emphasize the importance of proactive drift detection and adaptive model maintenance for ensuring continued system reliability and performance.
Article
Full-text available
In the realm of modern building management, the focus on sustainability, efficiency, and cost-effectiveness is driving the adoption of innovative strategies. Among these, predictive maintenance (PdM) stands out as a transformative approach to managing commercial building assets. With the increasing complexity of building systems and the growing demand for operational efficiency, Australian commercial buildings are increasingly turning to machine learning (ML) to enhance maintenance practices. This paper explores the integration of predictive maintenance strategies powered by machine learning in Australian commercial buildings, with a focus on the potential benefits, challenges, and future outlook.
Article
Full-text available
This research examines the transformative role of digital twins in building condition auditing for commercial properties. Digital twins leverage advanced technologies such as the Internet of Things (IoT), artificial intelligence (AI), and data analytics to create real-time, dynamic virtual replicas of physical structures. This paper explores their applications in monitoring building performance, optimizing maintenance strategies, and ensuring regulatory compliance. It further identifies implementation challenges and presents strategies for overcoming them. The research concludes by highlighting the potential of digital twins to revolutionize building condition audits, emphasizing their impact on efficiency, accuracy, and sustainability.
Article
Full-text available
Deep learning (DL) is revolutionizing evidence-based decision-making techniques that can be applied across various sectors. Specifically, it possesses the ability to utilize two or more levels of non-linear feature transformation of the given data via representation learning in order to overcome limitations posed by large datasets. As a multidisciplinary field that is still in its nascent phase, articles that survey DL architectures encompassing the full scope of the field are rather limited. Thus, this paper comprehensively reviews the state-of-art DL modelling techniques and provides insights into their advantages and challenges. It was found that many of the models exhibit a highly domain-specific efficiency and could be trained by two or more methods. However, training DL models can be very time-consuming, expensive, and requires huge samples for better accuracy. Since DL is also susceptible to deception and misclassification and tends to get stuck on local minima, improved optimization of parameters is required to create more robust models. Regardless, DL has already been leading to groundbreaking results in the healthcare, education, security, commercial, industrial, as well as government sectors. Some models, like the convolutional neural network (CNN), generative adversarial networks (GAN), recurrent neural network (RNN), recursive neural networks, and autoencoders, are frequently used, while the potential of other models remains widely unexplored. Pertinently, hybrid conventional DL architectures have the capacity to overcome the challenges experienced by conventional models. Considering that capsule architectures may dominate future DL models, this work aimed to compile information for stakeholders involved in the development and use of DL models in the contemporary world.
Article
Full-text available
The dynamicity of real-world systems poses a significant challenge to deployed predictive machine learning (ML) models. Changes in the system on which the ML model has been trained may lead to performance degradation during the system's life cycle. Recent advances that study non-stationary environments have mainly focused on identifying and addressing such changes caused by a phenomenon called concept drift. Different terms have been used in the literature to refer to the same type of concept drift and the same term for various types. This lack of unified terminology is set out to create confusion on distinguishing between different concept drift variants. In this paper, we start by grouping concept drift types by their mathematical definitions and survey the different terms used in the literature to build a consolidated taxonomy of the field. We also review and classify performance-based concept drift detection methods proposed in the last decade. These methods utilize the predictive model's performance degradation to signal substantial changes in the systems. The classification is outlined in a hierarchical diagram to provide an orderly navigation between the methods. We present a comprehensive analysis of the main attributes and strategies for tracking and evaluating the model's performance in the predictive system. The paper concludes by discussing open research challenges and possible research directions.
Article
Full-text available
Deep learning is at the heart of the current rise of machine learning and artificial intelligence. In the field of Computer Vision, it has become the workhorse for applications ranging from self-driving cars to surveillance and security. Whereas deep neural networks have demonstrated phenomenal success (often beyond human capabilities) in solving complex problems, recent studies show that they are vulnerable to adversarial attacks in the form of subtle perturbations to inputs that lead a model to predict incorrect outputs. For images, such perturbations are often too small to be perceptible, yet they completely fool the deep learning models. Adversarial attacks pose a serious threat to the success of deep learning in practice. This fact has lead to a large influx of contributions in this direction. This article presents the first comprehensive survey on adversarial attacks on deep learning in Computer Vision. We review the works that design adversarial attacks, analyze the existence of such attacks and propose defenses against them. To emphasize that adversarial attacks are possible in practical conditions, we separately review the contributions that evaluate adversarial attacks in the real-world scenarios. Finally, we draw on the literature to provide a broader outlook of the research direction.
Conference Paper
Uncovering information from large data streams containing changes in the data distribution (concept drift) makes online learning a challenge that is becoming progressively more relevant. This paper proposes the Drift Detection Ensemble (DDE), a small ensemble classifier that aggregates the warnings and drift detections of three concept drift detectors, aiming to improve on the results of the individual methods through different strategies and configurations. DDE uses different default combinations of detectors depending on the chosen sensitivity of the ensemble. Experiments were performed against six drift detectors using their default configurations, comparing their results on multiple artificial datasets containing concept drifts of different frequencies and durations, as well as on real-world datasets. Our results indicate that the two best methods were DDE versions, and they were statistically superior to several of the individual detectors.
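The paper's exact aggregation rules are not reproduced here; the sketch below only illustrates the general DDE-like idea of voting over several detectors' signals, reusing the hypothetical DDMStyleDetector from the sketch above. The half-weight for warnings and the sensitivity threshold are assumptions made for illustration.

```python
class DetectorEnsemble:
    """Illustrative DDE-like aggregator: each member's 'drift' counts as a
    full vote and each 'warning' as half a vote; drift is declared when
    the vote total reaches the chosen sensitivity."""

    def __init__(self, detectors, sensitivity=2.0):
        self.detectors = detectors
        self.sensitivity = sensitivity

    def update(self, correct):
        score = 0.0
        for d in self.detectors:
            signal = d.update(correct)   # each member sees the same outcome
            if signal == "drift":
                score += 1.0
            elif signal == "warning":
                score += 0.5
        if score >= self.sensitivity:
            return "drift"
        return "warning" if score > 0 else "ok"

# Usage: three differently configured members, as in the sketch above.
ensemble = DetectorEnsemble(
    [DDMStyleDetector(),
     DDMStyleDetector(drift_level=2.5),
     DDMStyleDetector(min_samples=50)],
    sensitivity=2.0,
)
```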
Article
Several machine learning models, including neural networks, consistently misclassify adversarial examples: inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input leads the model to output an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results and gives the first account of the most intriguing fact about adversarial examples: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test-set error of a maxout network on the MNIST dataset.
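The "simple and fast method" referred to here is widely known as the fast gradient sign method (FGSM). Below is a minimal PyTorch sketch, assuming a differentiable classifier that returns logits and inputs scaled to [0, 1]; `epsilon` is an illustrative perturbation budget.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.1):
    """Fast gradient sign method: take one step of size epsilon in the
    direction of the sign of the loss gradient w.r.t. the input, which
    (to first order) maximally increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # y holds the true class labels
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()    # worst-case linear perturbation
    return x_adv.clamp(0.0, 1.0).detach()  # keep inputs in the valid range
```

Training on a mixture of clean and such perturbed examples is the adversarial-training procedure the abstract credits with reducing the maxout network's MNIST test error.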
Conference Paper
In this paper we study the problem of constructing accurate block-based ensemble classifiers from time-evolving data streams. AWE is the best-known representative of such ensembles. We propose a new algorithm called the Accuracy Updated Ensemble (AUE), which extends AWE by using online component classifiers and updating them according to the current distribution. Additional modifications to the weighting functions solve the problem of undesired classifier exclusion seen in AWE. Experiments on several evolving data sets show that, while still requiring constant processing time and memory, AUE is more accurate than AWE.
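The block-based weighting can be sketched briefly. Per the commonly cited formulations, AWE weights each member by how much it beats a random classifier on the newest block (MSE_r - MSE_i), while AUE uses a nonlinear form so weak members keep a small positive weight instead of being excluded; the function names and probability interface below are assumptions for illustration and should be checked against the original papers.

```python
import numpy as np

def mse_random(class_priors):
    """MSE of a classifier that predicts at random according to the
    class priors; the baseline used to judge ensemble members."""
    return sum(p * (1.0 - p) ** 2 for p in class_priors)

def member_mse(predict_proba, X_block, y_block):
    """Mean squared error of one member on the newest data block:
    the average of (1 - probability assigned to the true class)^2."""
    proba = predict_proba(X_block)                    # shape (n, n_classes)
    true_p = proba[np.arange(len(y_block)), y_block]  # y_block: int labels
    return float(np.mean((1.0 - true_p) ** 2))

def aue_weight(predict_proba, X_block, y_block, class_priors, eps=1e-9):
    """AUE-style weight 1/(MSE_r + MSE_i + eps): always positive, so a
    temporarily weak member is down-weighted rather than excluded."""
    mse_i = member_mse(predict_proba, X_block, y_block)
    return 1.0 / (mse_random(class_priors) + mse_i + eps)
```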
Article
With the introduction of 5G and beyond networks, increasing levels of intelligence and automation are being employed to manage and orchestrate virtualized networks. Through Machine Learning (ML) models, Network Service Providers (NSPs) can forecast their networks' future state and proactively react to any potential fault, performance degradation, or change in demand stemming from the dynamic nature of the network environment. As such, ML models will become a critical component of the NSP decision-making process. However, model drift poses significant challenges and can severely degrade an ML model's performance, rendering it inaccurate and ineffective. This paper discusses the various types of model drift and the dangers they pose to ML models deployed in dynamic networks. Additionally, the challenges surrounding the implementation of drift detection and mitigation schemes in resource-constrained networks are outlined. This work discusses three innovation areas for addressing model drift in dynamic networks: understanding network drift characteristics, preventative ML model maintenance, and drift-resistant ML architectures. Finally, a novel drift detection and adaptation framework for dynamic networks and an illustrative 5G case study of model drift are presented.
Article
Data stream learning has been widely studied for extracting knowledge structures from continuous and rapid data records. As data evolves over time, its underlying knowledge is subject to many challenges. Concept drift, one of the core challenges in the stream learning community, is described as a change in the statistical properties of the data over time, causing most machine learning models to become less accurate because the changes occur in unforeseen ways. This is particularly problematic, as the evolution of data can lead to dramatic changes in knowledge. We address this problem by studying the semantic representation of data streams in the Semantic Web, i.e., ontology streams. Such streams are ordered sequences of data annotated with ontological vocabulary. In particular, we exploit three levels of knowledge encoded in ontology streams to deal with concept drift: (i) the existence of novel knowledge gained from stream dynamics, (ii) the significance of knowledge change and evolution, and (iii) the (in)consistency of knowledge evolution. Such knowledge is encoded as knowledge graph embeddings through a combination of novel representations: entailment vectors, entailment weights, and a consistency vector. We illustrate our approach on classification tasks in supervised learning. Key contributions of the study include: (i) an effective knowledge graph embedding approach for stream ontologies, and (ii) a generic consistent-prediction framework with integrated knowledge graph embeddings for dealing with concept drift. Experiments show that our approach provides accurate predictions of air quality in Beijing and bus delays in Dublin using real-world ontology streams.
Article
The article presents a new algorithm for handling concept drift: the Trigger-based Ensemble (TBE) is designed to handle concept drift in surgery prediction but is shown to perform well on other classification problems as well. In primary care, queries about the need for surgical treatment are referred to a surgeon specialist. In secondary care, referrals are reviewed by a team of specialists. The possible outcomes of this review are that the referral: (i) is canceled, (ii) needs to be complemented, or (iii) is predicted to lead to surgery. In the third case, the referred patient is scheduled for an appointment with a surgeon specialist. This article focuses on the binary prediction of the third case (surgery prediction). The guidelines for referral and for reviewing referrals change due to, e.g., scientific developments and evolving clinical practice. Existing decision support is based on the expert-systems approach, which usually requires manual updates when clinical practice changes. In order to revise decision rules automatically, the occurrence of concept drift (CD) must be detected and handled. Existing CD handling techniques are often specialized; it is challenging to develop a more generic technique that performs well regardless of CD type. Experiments are conducted to measure the impact of CD on prediction performance and to reduce that impact. The experiments evaluate and compare TBE to three existing CD handling methods (AWE, Active Classifier, and Learn++) on one real-world dataset and one artificial dataset. TBE significantly outperforms the other algorithms on both datasets but is less accurate on noisy synthetic variations of the real-world dataset.
Conference Paper
Over the past five years a new approach to privacy-preserving data analysis has borne fruit [13, 18, 7, 19, 5, 37, 35, 8, 32]. This approach differs from much (but not all!) of the related literature in the statistics, databases, theory, and cryptography communities, in that a formal and ad omnia privacy guarantee is defined, and the data analysis techniques presented are rigorously proved to satisfy the guarantee. The key privacy guarantee that has emerged is differential privacy. Roughly speaking, this ensures that (almost, and quantifiably) no risk is incurred by joining a statistical database. In this survey, we recall the definition of differential privacy and two basic techniques for achieving it. We then show some interesting applications of these techniques, presenting algorithms for three specific tasks and three general results on differentially private learning.
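One of the basic techniques such surveys recall is the Laplace mechanism; below is a minimal sketch for a numeric query of bounded sensitivity, with the toy counting query and the epsilon value chosen purely for illustration.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value + Laplace(0, sensitivity/epsilon) noise, which
    satisfies epsilon-differential privacy for a query whose output can
    change by at most `sensitivity` when one record is added or removed."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query has sensitivity 1 (one person changes the
# count by at most 1), released here with a privacy budget of epsilon = 0.5.
ages = [34, 29, 41, 52, 38]
noisy_count = laplace_mechanism(sum(a > 30 for a in ages),
                                sensitivity=1.0, epsilon=0.5)
```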