International Journal of Information Security (2024) 23:2043–2061
https://doi.org/10.1007/s10207-024-00834-y
REGULAR CONTRIBUTION
Deceiving supervised machine learning models via adversarial data poisoning attacks: a case study with USB keyboards
Anil Kumar Chillara¹ · Paresh Saxena¹ · Rajib Ranjan Maiti¹ · Manik Gupta¹ · Raghu Kondapalli² · Zhichao Zhang² · Krishnakumar Kesavan²
¹ CSIS Department, BITS-Pilani, Hyderabad, Telangana 500078, India
² Axiado Corporation, 2610 Orchard Pkwy, 3rd fl., San Jose, CA 95134, USA
Corresponding author: Anil Kumar Chillara, p20190422@hyderabad.bits-pilani.ac.in
Published online: 14 March 2024
© The Author(s), under exclusive licence to Springer-Verlag GmbH, DE 2024
Abstract
Due to its plug-and-play functionality and wide device support, the universal serial bus (USB) protocol has become one of
the most widely used protocols. However, this widespread adoption has introduced a significant security concern: the implicit
trust provided to USB devices, which has created a vast array of attack vectors. Malicious USB devices exploit this trust
by disguising themselves as benign peripherals and covertly implanting malicious commands into connected host devices.
Existing research employs supervised learning models to identify such malicious devices, but our study reveals a weakness
in these models when faced with sophisticated data poisoning attacks. We propose, design and implement a sophisticated
adversarial data poisoning attack to demonstrate how these models can be manipulated to misclassify an attack device as a
benign device. Our method entails generating keystroke data using a microprogrammable keystroke attack device. We develop the adversarial attacker by meticulously analyzing the distribution of data features generated by benign users via USB keyboards. The initial training data is then poisoned through firmware-level modifications within the attack device. Upon evaluating the models, we find that detection accuracy drops significantly, from 99% to 53%, when the adversarial attacker is employed. This work highlights the critical need to reevaluate the dependability of machine learning-based USB threat detection mechanisms in the face of increasingly sophisticated attack methods. The demonstrated vulnerabilities underscore the importance of developing more robust and resilient detection strategies to protect against the evolution of malicious USB devices.
Keywords USB · Adversarial learning · Data poisoning attacks · Keystroke injection attacks · Supervised learning
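To make the threat model described in the abstract concrete, the following minimal Python sketch illustrates how benign-looking attack keystrokes slipped into the training set can collapse a supervised detector's accuracy. The synthetic timing features, their distributions, the poisoning budget, and the RandomForest model are illustrative assumptions, not the authors' dataset or pipeline.

```python
# Illustrative sketch only: synthetic keystroke-timing features and a generic
# RandomForest stand in for the paper's dataset and models; feature choices,
# distributions, and the poisoning budget are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def keystrokes(n, delay, hold):
    """n samples of (inter-keystroke delay, key-hold time) in ms."""
    return np.column_stack([rng.normal(delay[0], delay[1], n),
                            rng.normal(hold[0], hold[1], n)])

benign    = lambda n: keystrokes(n, (180, 50), (95, 20))  # human typing
scripted  = lambda n: keystrokes(n, (15, 3), (8, 2))      # naive injection device
humanized = lambda n: keystrokes(n, (90, 12), (45, 8))    # firmware tuned toward human timings

def detection_accuracy(poison=None):
    # Defender's training set: benign traffic plus known attack traces.
    X = np.vstack([benign(1000), scripted(500), humanized(500)])
    y = np.array([0] * 1000 + [1] * 1000)                 # 0 = benign, 1 = attack
    if poison is not None:                                 # poisoned records slipped in as "benign"
        X = np.vstack([X, poison])
        y = np.concatenate([y, np.zeros(len(poison), dtype=int)])
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    # At deployment the adversarial device emits humanized keystrokes.
    X_test = np.vstack([benign(500), humanized(500)])
    y_test = np.array([0] * 500 + [1] * 500)
    return accuracy_score(y_test, clf.predict(X_test))

print("clean training:    %.2f" % detection_accuracy())                 # accuracy is high
print("poisoned training: %.2f" % detection_accuracy(humanized(800)))   # collapses toward chance
```

In the clean setting the classifier separates human typing from injected keystrokes almost perfectly; once mimicking samples are planted in the benign class, the same humanized traffic passes as legitimate at test time, mirroring the reported drop only qualitatively.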
1 Introduction
The use of universal serial bus (USB) devices is projected to increase by approximately 86.36% by 2028 [1]. In addition, the USB market is anticipated to reach 46.08 billion USD, at a compound annual growth rate (CAGR) of 13.9%, by the end of 2027 [2]. The rapid expansion of USB devices across segments such as consumer electronics, automobiles, healthcare, and medical devices is a contributing factor to this growth. USB has become the standard interface for connecting peripheral devices to a host system for both information
transfer and power supply [3]. However, one major concern with the growing usage of USB devices is the associated security risk due to increasing attack vectors [4]. Honeywell's cybersecurity research report [5] suggests that 79% of cyber threats using USB devices could seriously impact the operational technology environment. It also finds that 37% of cyber threats are designed to be launched using USB flash drives, a class of USB mass storage devices. In terms of security, tracking USB flash drives is very difficult, as they can be carried in bags, laptop cases, pockets, or sometimes in unsuspecting form factors. The firmware of some USB devices can also be modified and then used to take control of critical systems [6].
Various malicious activities are feasible on the connected host system using different categories of USB hardware, as shown in Fig. 1. Hardware Trojans, for instance, are implemented via programmable microcontrollers [8]. Another possibility of malicious activity includes the use of standard ...
... Other methods, like device and host fingerprinting, show promise in detecting USB attacks [24,25]. However, adversarial data poisoning attacks significantly impair the models' ability to detect USB attacks, rendering the defense mechanism ineffective [26][27][28]. Adversaries use adversarial data poisoning during training to degrade the performance of ML models by injecting adversarial instances [29][30][31]. Kumar et al. in [28] demonstrate a sophisticated attacker using an adversarial data poisoning technique to deceive defense mechanisms that use supervised learning-based models. The attack is designed by modifying the firmware of the attack device to mimic legitimate user behavior. ...
... However, the approach relies heavily on key-hold time, which limits its effectiveness. Prior research [28] demonstrates that RF-based models relying on complex keystroke dynamics can be bypassed using advanced evasion strategies. ...
Article
Full-text available
The paper introduces a Universal Serial Bus (USB)-based defense framework, USB-GATE, which leverages a Generative Adversarial Network (GAN) and transformer-based embeddings to enhance the detection of adversarial keystroke injection attacks. USB-GATE uses a Wasserstein GAN with Gradient Penalty (WGAN-GP) to augment benign data. The framework combines benign data augmentation with multimodal transformer-based embeddings to improve the robustness of existing supervised Machine Learning (ML) models in detecting the attacks. The framework generates augmented benign data using WGAN-GP, establishing a robust baseline dataset. Subsequently, it leverages the Vision Transformer (ViT) component of Contrastive Language-Image Pre-training (CLIP) to generate embeddings that boost the performance of various supervised ML models in detecting attacks. Our evaluation highlights significant performance improvements, with the supervised ML model k-Nearest Neighbors (kNN) showing the maximum improvement, achieving a 17% boost in accuracy when the framework is applied. The Random Forest (RF) model achieves the best overall accuracy of 81.3%, marking a 5% improvement when using USB-GATE. Our results demonstrate the efficacy of USB-GATE in detecting adversarial attacks. This framework provides a promising solution for strengthening defenses. It is particularly effective against adversarial USB keystroke injection attacks.
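For readers unfamiliar with the WGAN-GP component mentioned above, the PyTorch sketch below shows the standard gradient penalty term used to stabilize a Wasserstein critic during data augmentation; the critic network, feature dimension, and penalty weight are placeholders, not USB-GATE's actual architecture.

```python
# Standard WGAN-GP gradient penalty (a sketch; the critic and data shapes are
# placeholders, not the USB-GATE architecture).
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, device="cpu"):
    """E[(||grad_xhat critic(xhat)||_2 - 1)^2] on points interpolated between real and fake."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, device=device)             # per-sample interpolation weights
    xhat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = critic(xhat)
    grads = torch.autograd.grad(outputs=scores, inputs=xhat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True, retain_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

# Toy critic over flat keystroke-feature vectors (dimension is an assumption).
critic = nn.Sequential(nn.Linear(16, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
real = torch.randn(32, 16)   # stand-in for benign keystroke features
fake = torch.randn(32, 16)   # stand-in for generator output
loss_c = critic(fake).mean() - critic(real).mean() + 10.0 * gradient_penalty(critic, real, fake)
print(loss_c.item())
```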
... We examine 1 leading NLP articles to gather information regarding the datasets employed in adversarial attack and defense studies [4]. Figure 1 illustrates the proportion of datasets employed in the pertinent NLP tasks. ...
Article
Full-text available
Advanced neural text classifiers have shown remarkable ability in the task of classification. The investigation illustrates that text classification models have an inherent vulnerability to adversarial texts, where a few words or characters are altered to create adversarial examples that mislead the machine into making incorrect predictions while preserving the intended meaning for human readers. The present study introduces Inflect-Text, a novel approach for attacking text that works at the level of individual words in a setting where the inner workings of the system are unknown. The objective is to deceive a specific neural text classifier while following specified language constraints in a manner that makes the changes undetectable to humans. Extensive investigations are carried out to evaluate the viability of the proposed attack methodology on various widely used frameworks, including Word-CNN, Bi-LSTM and three advanced transformer models, across two benchmark datasets, AG News and MR, which are commonly employed for text classification tasks. Experimental results show that the suggested attack architecture regularly outperforms conventional methods by achieving much higher attack success rates and generating better adversarial examples. The findings suggest that neural text classifiers can be bypassed, which could have substantial ramifications for existing policy approaches.
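As a rough illustration of the word-level, black-box setting described above (not the Inflect-Text algorithm itself), the sketch below greedily tries candidate word substitutions and keeps a change only when it lowers the victim classifier's confidence in the original label; the toy classifier and the candidate lexicon are invented for the example.

```python
# Greedy word-substitution attack sketch (illustration of the black-box setting,
# not the Inflect-Text method). The victim model and candidate substitutions are toys.
from typing import Callable, Dict, List

def victim_positive_score(text: str) -> float:
    """Toy black-box sentiment classifier: returns P(positive)."""
    positive = {"good": 0.9, "great": 0.95, "enjoyable": 0.85, "fine": 0.6}
    words = text.lower().split()
    return sum(positive.get(w, 0.5) for w in words) / len(words)

def greedy_attack(text: str,
                  score: Callable[[str], float],
                  candidates: Dict[str, List[str]],
                  threshold: float = 0.5) -> str:
    """Replace words one at a time, keeping a substitution only if it lowers the
    victim's confidence in the original (positive) label; stop once the label flips."""
    words = text.split()
    best = score(text)
    for i, w in enumerate(words):
        for cand in candidates.get(w.lower(), []):
            trial = words.copy()
            trial[i] = cand
            s = score(" ".join(trial))
            if s < best:                 # keep the perturbation that hurts the model most
                words, best = trial, s
            if best < threshold:         # prediction flipped
                return " ".join(words)
    return " ".join(words)

# Inflectional/synonym candidates would normally come from a constrained lexicon.
cands = {"great": ["fine"], "enjoyable": ["fine"]}
adv = greedy_attack("the movie was great and enjoyable", victim_positive_score, cands)
print(adv, victim_positive_score(adv))
```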
... The widespread use of machine learning in diverse fields such as cyber security [16], healthcare [34], and autonomous vehicles [49] makes it an attractive target for adversaries. Machine learning models are susceptible to various types of adversarial attacks, typically classified as poisoning [14], evasion [10], backdoor [47], inversion [36] and inference [27], [6] attacks. Some potential attacks are demonstrated in [7,57,22,31,48]. ...
Preprint
Full-text available
Poisoning attacks are a primary threat to machine learning models, aiming to compromise their performance and reliability by manipulating training datasets. This paper introduces a novel attack, the Outlier-Oriented Poisoning (OOP) attack, which manipulates the labels of the samples most distant from the decision boundaries. The paper also investigates the adverse impact of such attacks on different machine learning algorithms within a multiclass classification scenario, analyzing their variance and the correlation between different poisoning levels and performance degradation. To ascertain the severity of the OOP attack for different degrees (5%-25%) of poisoning, we analyzed variance, accuracy, precision, recall, F1-score, and false positive rate for the chosen ML models. Benchmarking our OOP attack, we have analyzed key characteristics of multiclass machine learning algorithms and their sensitivity to poisoning attacks. Our experimentation used three publicly available datasets: IRIS, MNIST, and ISIC. Our analysis shows that KNN and GNB are the most affected algorithms, with decreases in accuracy of 22.81% and 56.07% and increases in false positive rate to 17.14% and 40.45% for the IRIS dataset with 15% poisoning. Further, Decision Trees and Random Forest are the most resilient algorithms, with the least accuracy disruption of 12.28% and 17.52% with 15% poisoning of the IRIS dataset. We have also analyzed the correlation between the number of dataset classes and the performance degradation of models. Our analysis highlighted that the number of classes is inversely proportional to the performance degradation, specifically the decrease in model accuracy, which diminishes as the number of classes increases. Further, our analysis identified that an imbalanced dataset distribution can aggravate the impact of poisoning on machine learning models.
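A minimal sketch of the label-flipping idea behind the OOP attack follows; approximating distance to the decision boundary with a linear SVM margin and using a 15% budget on IRIS are assumptions made for illustration rather than the authors' exact criterion.

```python
# Sketch of an outlier-oriented label-flipping poisoning attack: flip the labels of
# the training samples farthest from the decision boundary, then measure the damage.
# The SVM-based distance proxy and 15% budget are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Distance from the boundary, approximated by the margin of a linear SVM.
svm = SVC(kernel="linear", decision_function_shape="ovr").fit(X_tr, y_tr)
margins = np.abs(svm.decision_function(X_tr)).max(axis=1)

budget = int(0.15 * len(y_tr))              # poison 15% of training labels
targets = np.argsort(margins)[-budget:]     # most distant (most confidently classified) samples

y_poisoned = y_tr.copy()
n_classes = len(np.unique(y))
y_poisoned[targets] = (y_poisoned[targets] + 1) % n_classes   # flip to another class

clean = KNeighborsClassifier().fit(X_tr, y_tr)
dirty = KNeighborsClassifier().fit(X_tr, y_poisoned)
print("clean accuracy:    %.3f" % accuracy_score(y_te, clean.predict(X_te)))
print("poisoned accuracy: %.3f" % accuracy_score(y_te, dirty.predict(X_te)))
```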
Conference Paper
Full-text available
Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also triggered great interest in the time series community. Among multiple advantages of Transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications. In this paper, we systematically review Transformer schemes for time series modeling by highlighting their strengths as well as limitations. In particular, we examine the development of time series Transformers in two perspectives. From the perspective of network structure, we summarize the adaptations and modifications that have been made to Transformers in order to accommodate the challenges in time series analysis. From the perspective of applications, we categorize time series Transformers based on common tasks including forecasting, anomaly detection, and classification. Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how Transformers perform in time series. Finally, we discuss and suggest future directions to provide useful research guidance.
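As a point of reference for the models surveyed above, a bare-bones PyTorch encoder for one-step-ahead time series forecasting is sketched below; the hyperparameters and the simple linear head are placeholder choices, and real time series Transformers add positional encodings, decomposition, or sparse attention on top of this skeleton.

```python
# Minimal transformer encoder for one-step-ahead forecasting of a univariate series.
# Hyperparameters are placeholders, not any specific model from the survey.
import torch
import torch.nn as nn

class TinyTSTransformer(nn.Module):
    def __init__(self, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)               # scalar observation -> model dimension
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)                # forecast the next value

    def forward(self, x):                                # x: (batch, seq_len, 1)
        h = self.encoder(self.embed(x))
        return self.head(h[:, -1, :])                    # use the last position's representation

model = TinyTSTransformer()
window = torch.randn(8, 24, 1)                           # 8 series, 24 past observations each
print(model(window).shape)                               # torch.Size([8, 1])
```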
Article
Full-text available
With the latest advances in deep learning-based generative models, it has not taken long to take advantage of their remarkable performance in the area of time series. Deep neural networks used for time series heavily depend on the size and consistency of the datasets used in training. Such data are not usually abundant in the real world, where they are typically limited and often subject to constraints that must be guaranteed. Therefore, an effective way to increase the amount of data is to use data augmentation techniques, either by adding noise or permutations or by generating new synthetic data. This work systematically reviews the current state of the art in the area to provide an overview of all available algorithms and proposes a taxonomy of the most relevant research. The efficiency of the different variants is evaluated as a central part of the process, along with the metrics used to assess performance, and the main problems concerning each model are analysed. The ultimate aim of this study is to provide a summary of the evolution and performance of the areas that produce better results, to guide future researchers in this field.
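Two of the simplest augmentation families mentioned above, adding noise (jittering) and permuting segments, can be written in a few lines of NumPy; the parameter values are arbitrary illustrations rather than recommendations from the review.

```python
# Basic time series augmentation: jittering (additive noise) and segment permutation.
# Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(42)

def jitter(x: np.ndarray, sigma: float = 0.03) -> np.ndarray:
    """Add small Gaussian noise to every time step."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def permute_segments(x: np.ndarray, n_segments: int = 4) -> np.ndarray:
    """Split the series into contiguous segments and shuffle their order."""
    segments = np.array_split(x, n_segments)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order])

series = np.sin(np.linspace(0, 4 * np.pi, 128))          # toy signal
augmented = [jitter(series), permute_segments(series)]
print([a.shape for a in augmented])
```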
Article
Full-text available
As the inter-connectivity in cyberspace continues to increase exponentially, privacy and security have become two of the most concerning issues that need to be tackled in today's state of technology. Therefore, password-based authentication should no longer be used without an additional layer of authentication, namely two-factor authentication (2FA). One of the most promising 2FA approaches is Keystroke Dynamics, which relies on the unique typing behaviour of users. Since its discovery, Keystroke Dynamics adoption has continuously grown across many use cases: generally to obtain access to a platform through typing behaviour similarity, extending into continuous keystroke monitoring on e-learning or e-exam platforms to detect illegitimate participants or cheaters. As the adoption of Keystroke Dynamics continues to grow, so do the threats lurking within. This paper proposes a novel exploitation method that utilizes computer vision to extract and learn a user's typing pattern just from a screen-recorded video that captures their typing process. By using a screen-recorded video, an attacker can eliminate the need to inject a keylogger into the victim's computer, thus rendering the attack easier to perform and more difficult to detect. Furthermore, the extracted typing pattern can be used to spoof a Keystroke Dynamics authentication mechanism with an evasion rate as high as 64%, an alarming rate given the impact of a successful attack, which allows an attacker to mimic and falsely authenticate as the victim (i.e., total account takeover). This paper also shows that from a screen-recorded video one can produce a staggering statistical similarity in keystroke timing patterns, as if an actual keylogger had been used, and that the extracted patterns can potentially be used to spoof a Keystroke Dynamics authentication service. To the author's best knowledge, there is no precedent in previous research for this kind of attack (i.e., using video to spoof keystroke dynamics). This research can be used as a baseline for future work in this area.
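To make the timing-pattern idea concrete, the sketch below derives flight-time features from a list of (key, press-timestamp) events, such as might hypothetically be reconstructed from video frame times, and compares them with an enrolled template using a Manhattan distance; the events, template, and acceptance threshold are invented, and the paper's computer-vision extraction step is not reproduced.

```python
# Sketch: flight-time features from keystroke events and a naive template comparison.
# Event data, the enrolled template, and the threshold are hypothetical values; the
# paper's computer-vision extraction step is not reproduced here.
from typing import List, Tuple

def flight_times(events: List[Tuple[str, float]]) -> List[float]:
    """Seconds between consecutive key presses (a basic keystroke-dynamics feature)."""
    times = [t for _, t in events]
    return [b - a for a, b in zip(times, times[1:])]

def manhattan(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical events recovered from video frame timestamps: (key, seconds).
observed = [("p", 0.00), ("a", 0.21), ("s", 0.37), ("s", 0.58), ("w", 0.80)]
# Hypothetical enrolled template of the victim's flight times for the same sequence.
template = [0.20, 0.17, 0.22, 0.21]

score = manhattan(flight_times(observed), template)
print("distance %.3f ->" % score, "accepted" if score < 0.15 else "rejected")
```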
Article
Full-text available
This paper focuses on training machine learning models using the XGBoost and extremely randomized trees algorithms on two datasets obtained using static and dynamic analysis of real malicious and benign samples. We then compare their success rates—both mutually and with other algorithms, such as the random forest, the decision tree, the support vector machine, and the naïve Bayes algorithms, which we compared in our previous work on the same datasets. The best performing classification models, using the XGBoost algorithm, achieved 91.9% detection accuracy and 98.2% sensitivity, 0.853 AUC, and 0.949 F1 score on the static analysis dataset, and 96.4% accuracy and 98.5% sensitivity, 0.940 AUC, and 0.977 F1 score on the dynamic analysis dataset. Then, we exported the best performing machine learning models and used them in our proposed MLMD program, automating the process of static and dynamic analysis and allowing the trained models to be used for classification on new samples.
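The model comparison described above can be reproduced in outline with scikit-learn and the xgboost package; the synthetic dataset below stands in for the static and dynamic analysis feature sets, and the hyperparameters are arbitrary.

```python
# Outline of the model comparison: XGBoost vs. extremely randomized trees on a
# placeholder dataset (the paper's static/dynamic malware features are not reproduced).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=4000, n_features=40, weights=[0.6, 0.4], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)

models = {
    "xgboost": XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss"),
    "extra_trees": ExtraTreesClassifier(n_estimators=300, random_state=1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    pred = model.predict(X_te)
    print(name,
          "acc=%.3f" % accuracy_score(y_te, pred),
          "sens=%.3f" % recall_score(y_te, pred),    # sensitivity = recall on the positive class
          "auc=%.3f" % roc_auc_score(y_te, proba),
          "f1=%.3f" % f1_score(y_te, pred))
```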
Article
This paper investigates the ethical implications of using adversarial machine learning for the purpose of obfuscation. We suggest that adversarial attacks can be justified by privacy considerations but that they can also cause collateral damage. To clarify the matter, we employ two use cases—facial recognition and medical machine learning—to evaluate the collateral damage counterarguments to privacy-induced adversarial attacks. We conclude that obfuscation by data poisoning can be justified in facial recognition but not in the medical case. We motivate our conclusion by employing psychological arguments about change, privacy considerations, and purpose limitations on machine learning applications.
Article
Forecasting models play a key role in money-making ventures in many different markets. Such models are often trained on data from various sources, some of which may be untrustworthy. An actor in a given market may be incentivised to drive predictions in a certain direction to their own benefit. Prior analyses of intelligent adversaries in a machine-learning context have focused on regression and classification. In this paper we address the non-iid setting of time series forecasting. We consider a forecaster, Bob, using a fixed, known model and a recursive forecasting method. An adversary, Alice, aims to pull Bob's forecasts toward her desired target series, and may exercise limited influence on the initial values fed into Bob's model. We consider the class of linear autoregressive models, and a flexible framework of encoding Alice's desires and constraints. We describe a method of calculating Alice's optimal attack that is computationally tractable, and empirically demonstrate its effectiveness compared to random and greedy baselines on synthetic and real-world time series data. We conclude by discussing defensive strategies in the face of Alice-like adversaries.
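A small NumPy sketch of this setting: because recursive forecasts from a linear AR(p) model are a linear function of the initial window, Alice's perturbation can be obtained with a single regularized least-squares solve. The AR coefficients, horizon, target series, and penalty weight below are invented, and the unconstrained ridge step simplifies the paper's constrained formulation.

```python
# Sketch of an attack on recursive AR forecasting: forecasts are linear in the initial
# window, so a ridge-regularised least-squares step gives Alice's perturbation.
# AR coefficients, horizon, target, and penalty weight are invented for illustration.
import numpy as np

a = np.array([0.6, 0.3])          # Bob's known AR(2) coefficients
H = 10                            # forecast horizon
p = len(a)

def forecast(window, horizon=H):
    """Bob's recursive forecasts given the last p observed values (oldest first)."""
    hist = list(window)
    out = []
    for _ in range(horizon):
        nxt = float(np.dot(a[::-1], hist[-p:]))   # x_t = a1*x_{t-1} + a2*x_{t-2}
        out.append(nxt)
        hist.append(nxt)
    return np.array(out)

# Forecasting is linear and homogeneous in the window, so build its matrix column by column.
M = np.column_stack([forecast(e) for e in np.eye(p)])

w0 = np.array([1.0, 1.2])                         # true recent observations
target = np.linspace(2.0, 3.0, H)                 # series Alice wants Bob to predict
lam = 0.5                                         # penalty keeping Alice's tampering small

delta = np.linalg.solve(M.T @ M + lam * np.eye(p), M.T @ (target - M @ w0))
print("perturbation:", np.round(delta, 3))
print("clean forecast head:   ", np.round(forecast(w0)[:3], 3))
print("attacked forecast head:", np.round(forecast(w0 + delta)[:3], 3))
```

The ridge term plays the role of Alice's limited-influence constraint here; a hard bound on the perturbation would instead require a small constrained solver.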
Article
As a new and efficient ensemble learning algorithm, XGBoost has been widely applied thanks to its many advantages, but its classification performance in the case of data imbalance is often not ideal. To address this problem, the regularization term of XGBoost is optimized and a classification algorithm based on mixed sampling and ensemble learning is proposed. The main idea is to combine SVM-SMOTE over-sampling and EasyEnsemble under-sampling technologies for data processing, and then obtain the final model based on XGBoost by training and ensembling. At the same time, the optimal parameters are automatically searched and adjusted through a Bayesian optimization algorithm to realize classification prediction. In the experimental stage, the G-mean and area under the curve (AUC) values are used as evaluation indicators to compare and analyze the classification performance of different sampling methods and algorithm models. The experimental results on the public data set also verify the feasibility and effectiveness of the proposed algorithm.
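In outline, the sampling-plus-boosting idea described above could be assembled from imbalanced-learn and xgboost as follows; the synthetic dataset and fixed hyperparameters are placeholders, and only the SVM-SMOTE stage is shown (the EasyEnsemble under-sampling step and the Bayesian hyperparameter search are omitted for brevity).

```python
# Outline of an over-sampling + XGBoost pipeline for imbalanced data, evaluated with
# G-mean and AUC. Dataset and hyperparameters are placeholders; the EasyEnsemble stage
# and the Bayesian hyperparameter search from the paper are omitted.
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SVMSMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7, stratify=y)

# Over-sample the minority class near the SVM-estimated borderline before boosting.
X_bal, y_bal = SVMSMOTE(random_state=7).fit_resample(X_tr, y_tr)

clf = XGBClassifier(n_estimators=300, max_depth=5, eval_metric="logloss").fit(X_bal, y_bal)
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
print("G-mean: %.3f" % geometric_mean_score(y_te, pred))
print("AUC:    %.3f" % roc_auc_score(y_te, proba))
```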