Kevin Leach’s research while affiliated with Vanderbilt University and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (78)


Label Errors in the Tobacco3482 Dataset
  • Preprint
  • File available

December 2024 · 1 Read

Gordon Lim · [...] · Kevin Leach

Tobacco3482 is a widely used document classification benchmark dataset. However, our manual inspection of the entire dataset uncovers widespread ontological issues, especially a large number of annotation label problems. We establish data label guidelines and find that 11.7% of the dataset is improperly annotated and should either have an unknown label or a corrected label, and 16.7% of samples in the dataset have multiple valid labels. We then analyze the mistakes of a top-performing model and find that 35% of the model's mistakes can be directly attributed to these label issues, highlighting the inherent problems with using a noisily labeled dataset as a benchmark. Supplementary material, including dataset annotations and code, is available at https://github.com/gordon-lim/tobacco3482-mistakes/.
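For readers who want to run this kind of attribution analysis on their own data, a minimal sketch follows; the column names and toy rows are illustrative assumptions, not the paper's released annotations.

```python
import pandas as pd

# Toy per-sample table: the original Tobacco3482 label, a reviewed label
# ("unknown" or a corrected class), and a model's prediction.
df = pd.DataFrame({
    "original_label":  ["Email", "Memo",   "Letter", "Report"],
    "reviewed_label":  ["Email", "Letter", "Letter", "unknown"],
    "predicted_label": ["Email", "Letter", "Memo",   "Memo"],
})

# A "mistake" is a prediction that disagrees with the original benchmark label.
mistakes = df[df["predicted_label"] != df["original_label"]]

# A mistake is attributed to a label issue when review changed the label.
attributable = (mistakes["reviewed_label"] != mistakes["original_label"]).mean()
print(f"{len(mistakes)} mistakes, {attributable:.0%} attributable to label issues")
```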


PBP: Post-training Backdoor Purification for Malware Classifiers

December 2024 · 10 Reads

Dung Thuy Nguyen · Ngoc N. Tran · [...] · Kevin Leach

In recent years, the rise of machine learning (ML) in cybersecurity has brought new challenges, including the increasing threat of backdoor poisoning attacks on ML malware classifiers. For instance, adversaries could inject malicious samples into public malware repositories, contaminating the training data and potentially causing the ML model to misclassify malware. Current countermeasures predominantly focus on detecting poisoned samples by leveraging disagreements within the outputs of a diverse set of ensemble models on training data points. However, these methods are not suitable for scenarios where Machine Learning-as-a-Service (MLaaS) is used or when users aim to remove backdoors from a model after it has been trained. Addressing this scenario, we introduce PBP, a post-training defense for malware classifiers that mitigates various types of backdoor embeddings without assuming any specific backdoor embedding mechanism. Our method exploits the influence of backdoor attacks on the activation distribution of neural networks, independent of the trigger-embedding method. In the presence of a backdoor attack, the activation distribution of each layer is distorted into a mixture of distributions. By regulating the statistics of the batch normalization layers, we can guide a backdoored model to perform similarly to a clean one. Our method demonstrates substantial advantages over several state-of-the-art methods, as evidenced by experiments on two datasets, two types of backdoor methods, and various attack configurations. Notably, our approach requires only a small portion of the training data -- only 1% -- to purify the backdoor and reduce the attack success rate from 100% to almost 0%, a 100-fold improvement over the baseline methods. Our code is available at https://github.com/judydnguyen/pbp-backdoor-purification-official.
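A minimal sketch of the batch-normalization recalibration idea described in the abstract, assuming a PyTorch model and a small loader of clean samples; this is an illustrative approximation, not the full PBP procedure.

```python
import torch
import torch.nn as nn

def recalibrate_bn(model: nn.Module, clean_loader, device: str = "cpu") -> nn.Module:
    """Re-estimate batch-norm running statistics from a small clean subset."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()   # discard statistics learned on (possibly poisoned) data
            m.momentum = None         # use a cumulative moving average over the clean batches
    model.to(device).train()          # BN layers only update running stats in train mode
    with torch.no_grad():             # weights stay untouched; only BN buffers change
        for inputs, _ in clean_loader:
            model(inputs.to(device))
    return model.eval()
```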


Figure 1: Experimental setup of the view time limit. In this paper, we investigate 100 ms, 1000 ms, and 2500 ms time limits to examine their impact on data quality and worker experience.
Figure 2: Training procedure. Participants are shown examples of each class and proceed after seeing all images in all classes.
Figure 3: Qualification and test trials. Participants are shown a dog image and must select the best breed category. In the qualification stage there is no time limit; in the time-limited test stage, the image disappears after 100 ms, 1000 ms, or 2500 ms. Participants can revisit training images by clicking each category. A grey bar shows overall progress.
Figure 7: Selected images categorized by characteristics identified as challenging under a view time limit by participants in SDOGS-10H. Note: images may exhibit multiple challenging characteristics.
Figure 9: Participants' accuracy over the course of their trials. Each line represents a sampled participant. Compared to CIFAR-10H participants without a time limit, SDOGS-10H participants with a view time limit maintained good accuracy.

Towards Fair Pay and Equal Work: Imposing View Time Limits in Crowdsourced Image Classification

November 2024 · 1 Read

Crowdsourcing is a common approach to rapidly annotate large volumes of data in machine learning applications. Typically, crowd workers are compensated with a flat rate based on an estimated completion time to meet a target hourly wage. Unfortunately, prior work has shown that variability in completion times among crowd workers led to overpayment of 168% in one case and underpayment of 16% in another. However, by setting a time limit for task completion, it is possible to manage the risk of overpaying or underpaying while still facilitating flat-rate payments. In this paper, we present an analysis of the impact of a time limit on crowd worker performance and satisfaction. We conducted a human study with a maximum view time for a crowdsourced image classification task. We find that the impact on overall crowd worker performance diminishes as view time increases. Despite some images being challenging under time limits, a consensus algorithm remains effective at preserving data quality and filters out images that need more time. Additionally, crowd workers' consistent performance throughout the time-limited task indicates sustained effort, and their psychometric questionnaire scores show they prefer shorter limits. Based on our findings, we recommend implementing task time limits as a practical approach to making compensation more equitable and predictable.
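As a rough illustration of how a consensus step can retain high-quality labels while flagging images that need more viewing time, here is a simple majority-vote sketch; the agreement threshold and data layout are assumptions, not the paper's exact algorithm.

```python
from collections import Counter

def consensus_filter(labels_per_image, min_agreement=0.7):
    """Keep majority labels that meet the agreement threshold; flag the rest."""
    accepted, needs_more_time = {}, []
    for image_id, labels in labels_per_image.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[image_id] = label
        else:
            needs_more_time.append(image_id)   # defer to a longer view time
    return accepted, needs_more_time

accepted, deferred = consensus_filter({
    "dog_001": ["husky", "husky", "husky", "malamute"],
    "dog_002": ["beagle", "husky", "pug", "beagle"],
})
# accepted == {"dog_001": "husky"}; deferred == ["dog_002"]
```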


Robust Testing for Deep Learning using Human Label Noise

November 2024 · 1 Read

In deep learning (DL) systems, label noise in training datasets often degrades model performance, as models may learn incorrect patterns from mislabeled data. The area of Learning with Noisy Labels (LNL) has introduced methods to effectively train DL models in the presence of noisily-labeled datasets. Traditionally, these methods are tested using synthetic label noise, where ground truth labels are randomly (and automatically) flipped. However, recent findings highlight that models perform substantially worse under human label noise than synthetic label noise, indicating a need for more realistic test scenarios that reflect noise introduced due to imperfect human labeling. This underscores the need for generating realistic noisy labels that simulate human label noise, enabling rigorous testing of deep neural networks without the need to collect new human-labeled datasets. To address this gap, we present Cluster-Based Noise (CBN), a method for generating feature-dependent noise that simulates human-like label noise. Using insights from our case study of label memorization in the CIFAR-10N dataset, we design CBN to create more realistic tests for evaluating LNL methods. Our experiments demonstrate that current LNL methods perform worse when tested using CBN, highlighting its use as a rigorous approach to testing neural networks. Next, we propose Soft Neighbor Label Sampling (SNLS), a method designed to handle CBN, demonstrating its improvement over existing techniques in tackling this more challenging type of noise.
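A hedged sketch of feature-dependent noise generation in the spirit of cluster-based noise: samples are clustered in feature space and a fraction of labels inside each cluster are flipped to that cluster's majority label. The published CBN procedure may differ in its details; the clustering choice and flip rule here are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_noise(features, labels, n_clusters=50, noise_rate=0.2, seed=0):
    """Flip a fraction of labels inside each cluster to the cluster's majority label."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)
    noisy = labels.copy()
    for c in range(n_clusters):
        idx = np.where(clusters == c)[0]
        if idx.size == 0:
            continue
        majority = np.bincount(labels[idx]).argmax()   # assumes integer class labels
        flip = rng.random(idx.size) < noise_rate       # choose which samples to corrupt
        noisy[idx[flip]] = majority
    return noisy
```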


Formal Logic-guided Robust Federated Learning against Poisoning Attacks

November 2024 · 7 Reads

Dung Thuy Nguyen · Ziyan An · [...] · Kevin Leach

Federated Learning (FL) offers a promising solution to the privacy concerns associated with centralized Machine Learning (ML) by enabling decentralized, collaborative learning. However, FL is vulnerable to various security threats, including poisoning attacks, where adversarial clients manipulate the training data or model updates to degrade overall model performance. Recognizing this threat, researchers have focused on developing defense mechanisms to counteract poisoning attacks in FL systems. However, existing robust FL methods predominantly focus on computer vision tasks, leaving a gap in addressing the unique challenges of FL with time series data. In this paper, we present FLORAL, a defense mechanism designed to mitigate poisoning attacks in federated learning for time-series tasks, even in scenarios with heterogeneous client data and a large number of adversarial participants. Unlike traditional model-centric defenses, FLORAL leverages logical reasoning to evaluate client trustworthiness by aligning their predictions with global time-series patterns, rather than relying solely on the similarity of client updates. Our approach extracts logical reasoning properties from clients, then hierarchically infers global properties, and uses these to verify client updates. Through formal logic verification, we assess the robustness of each client contribution, identifying deviations indicative of adversarial behavior. Experimental results on two datasets demonstrate the superior performance of our approach compared to existing baseline methods, highlighting its potential to enhance the robustness of FL in time-series applications. Notably, FLORAL reduced the prediction error by 93.27% in the best-case scenario compared to the second-best baseline. Our code is available at https://anonymous.4open.science/r/FLORAL-Robust-FTS.
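A deliberately simplified sketch of property-based client screening: each client's predictions are checked against a global temporal property (here, a bound on step-to-step change), and clients whose violation rate is too high are excluded from aggregation. FLORAL's actual verification uses hierarchically inferred formal logic properties; the property and threshold below are placeholders.

```python
import numpy as np

def violation_rate(preds, max_step=1.0):
    """Fraction of timesteps where |y_t - y_{t-1}| exceeds the allowed bound."""
    preds = np.asarray(preds, dtype=float)
    return float(np.mean(np.abs(np.diff(preds)) > max_step))

def screen_clients(client_preds, threshold=0.05):
    """Return the ids of clients whose predictions satisfy the property often enough."""
    return [cid for cid, preds in client_preds.items()
            if violation_rate(preds) <= threshold]

trusted = screen_clients({"client_a": [0.1, 0.2, 0.3, 0.4],
                          "client_b": [0.0, 5.0, -4.0, 6.0]})
# trusted == ["client_a"]; client_b's erratic series violates the bound
```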


FISC: Federated Domain Generalization via Interpolative Style Transfer and Contrastive Learning

October 2024 · 4 Reads

Dung Thuy Nguyen · [...] · Kevin Leach

Federated Learning (FL) shows promise in preserving privacy and enabling collaborative learning. However, most current solutions focus on private data collected from a single domain. A significant challenge arises when client data comes from diverse domains (i.e., domain shift), leading to poor performance on unseen domains. Existing Federated Domain Generalization approaches address this problem but assume each client holds data for an entire domain, limiting their practicality in real-world scenarios with domain-based heterogeneity and client sampling. To overcome this, we introduce FISC, a novel FL domain generalization paradigm that handles more complex domain distributions across clients. FISC enables learning across domains by extracting an interpolative style from local styles and employing contrastive learning. This strategy gives clients multi-domain representations and unbiased convergent targets. Empirical results on multiple datasets, including PACS, Office-Home, and IWildCam, show FISC outperforms state-of-the-art (SOTA) methods. Our method achieves accuracy improvements ranging from 3.64% to 57.22% on unseen domains. Our code is available at https://anonymous.4open.science/r/FISC-AAAI-16107.
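A rough sketch of interpolative style transfer at the feature-map level (in the AdaIN style): a feature map is re-normalized and given statistics interpolated between its own style and a style received from another client or domain. FISC's actual interpolation and contrastive objective are more involved; the tensor shapes and mixing weight here are assumptions.

```python
import torch

def interpolate_style(feat, other_mean, other_std, lam=0.5, eps=1e-5):
    """Apply a style interpolated between local statistics and another domain's.

    feat: (N, C, H, W) feature maps; other_mean/other_std: (N, C, 1, 1) statistics.
    """
    mean = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.std(dim=(2, 3), keepdim=True) + eps
    mixed_mean = lam * mean + (1 - lam) * other_mean   # interpolated style statistics
    mixed_std = lam * std + (1 - lam) * other_std
    return (feat - mean) / std * mixed_std + mixed_mean
```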


Figure 2: An example of the Universal Tokenizer tokenizing a file name with the Trie tokenizer and BERT AutoTokenizer.
Figure 3: The confidence score distribution for a Random Forest classifier with the Trie tokenizer on indicative and ambiguous file names from the Web Search dataset.
Figure 4: Prediction accuracy vs. prediction rate at all confidence score thresholds between 0 and 1, in increments of 0.01, for a Random Forest classifier with the Trie tokenizer trained with a negative class, predicting on all file names in the Web Search dataset, including ambiguous file names.
Figure 5: Overall accuracy vs. overall time (seconds) at all confidence score thresholds between 0 and 1, in increments of 0.01, for a Random Forest classifier with the Trie tokenizer trained with a negative class, predicting on all file names in the Web Search dataset, including ambiguous file names, with a DiT model processing the documents deferred by the Random Forest model.
Figure 6: Prediction accuracy vs. prediction rate at a 0.8 confidence threshold for the Random Forest file name classifier with a BERT tokenizer using various k-values on indicative file names from the Web Search dataset.
Document Type Classification using File Names

October 2024 · 41 Reads

Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification. Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets and the computational resources associated with analyzing whole documents. In this paper, we present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method, to accurately and efficiently classify documents based solely on file names, substantially reducing inference time. This approach distinguishes ambiguous file names from indicative ones using confidence scores and a negative class representing ambiguous file names. Our results indicate that file name classifiers can process more than 80% of the in-scope data with 96.7% accuracy when tested on a dataset with a large portion of out-of-scope data with respect to the training dataset, while being 442.43x faster than more complex models such as DiT. Our method offers a crucial solution for efficiently processing vast datasets in critical scenarios, enabling fast and more reliable document classification.
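A small illustrative pipeline in the spirit of the approach: TF-IDF features over character n-grams of file names, a Random Forest classifier, and a confidence threshold that defers low-confidence (ambiguous) names to a heavier document model. The tokenization, toy data, and 0.8 threshold are assumptions, not the paper's exact setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy training data: file names and their document types.
train_names = ["invoice_2021_q3.pdf", "holiday_photo.png", "meeting_notes_march.txt",
               "invoice_0042.pdf", "team_photo.jpg", "notes_standup.txt"]
train_types = ["invoice", "image", "notes", "invoice", "image", "notes"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),   # character n-grams of file names
    RandomForestClassifier(n_estimators=200, random_state=0),
)
clf.fit(train_names, train_types)

def classify_or_defer(name, threshold=0.8):
    """Return a predicted type, or None to defer the file to a full-document model (e.g. DiT)."""
    proba = clf.predict_proba([name])[0]
    if proba.max() >= threshold:
        return clf.classes_[proba.argmax()]
    return None

print(classify_or_defer("invoice_2022_q1.pdf"))   # likely classified by file name alone
print(classify_or_defer("scan_0001.pdf"))         # ambiguous name, likely deferred
```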


MalMixer: Few-Shot Malware Classification with Retrieval-Augmented Semi-Supervised Learning

September 2024 · 31 Reads

The recent growth and proliferation of malware have tested practitioners' ability to promptly classify new samples according to malware families. In contrast to labor-intensive reverse engineering efforts, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly manually analyzed before training. Furthermore, as novel malware samples arise that are beyond the scope of the training set, additional reverse engineering effort must be employed to update the training set. The sheer volume of new samples found in the wild creates substantial pressure on practitioners' ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier using semi-supervised learning that achieves high accuracy with sparse training data. We present a novel domain-knowledge-aware technique for augmenting malware feature representations, enhancing few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware feature augmentation methods and highlights the capabilities of similar semi-supervised classifiers in addressing malware classification issues.
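As a hedged sketch of the retrieval-augmented mixing idea, the snippet below interpolates each sample's feature vector with its nearest neighbor retrieved from a labeled set. MalMixer's domain-knowledge-aware augmentation operates on specific malware feature families and is more nuanced; the flat feature vectors and mixing weight here are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mix_with_retrieved_neighbors(features, labeled_features, lam=0.7):
    """Augment samples by mixing each with its nearest labeled neighbor in feature space."""
    features = np.asarray(features, dtype=float)
    labeled_features = np.asarray(labeled_features, dtype=float)
    retriever = NearestNeighbors(n_neighbors=1).fit(labeled_features)
    _, idx = retriever.kneighbors(features)
    neighbors = labeled_features[idx[:, 0]]
    return lam * features + (1 - lam) * neighbors   # mixed (augmented) feature vectors
```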




Citations (43)


... Like us, Saifullah, et al. raise concern over whether published model performance on the original Tobacco3482 dataset is a valid measure of a model's true capabilities. Moreover, recent work by Larson, et al. (2024) [8] found large amounts of sensitive personal information in both Tobacco3482 and RVL-CDIP; we thus caution researchers in the document understanding community on using Tobacco3482 and other datasets derived from TTID due to the presence of label errors, data bias, and sensitive material. ...

Reference:

Label Errors in the Tobacco3482 Dataset
De-Identification of Sensitive Personal Data in Datasets Derived from IIT-CDIP
  • Citing Conference Paper
  • January 2024

... Many studies focus on evaluating GenAI tools developed for specific course assessments (Daun & Brings, 2023;Vassiliou et al., 2023;R. Zhang et al., 2023) and analysing their impact on assessment practices, including their successes and limitations (Grandel et al., 2024;Kooli & Yusuf, 2024;Qureshi, 2023). However, there is a notable lack of research that reviews the role of GenAI tools in assessment design for CS education. ...

Applying Large Language Models to Enhance the Assessment of Parallel Functional Programming Assignments
  • Citing Conference Paper
  • September 2024

... By dynamically adjusting the allocation of attention, the mechanism can intelligently focus on the feature regions carrying key information in the sequence [23], realizing efficient screening and refining of information. And it is widely used in machine translation [24], machine vision [25], and gesture prediction [26]. ...

EyeTrans: Merging Human and Machine Attention for Neural Code Summarization
  • Citing Article
  • July 2024

... Utilize transfer learning and cross-lingual approaches to adapt high-resource models for low-resource settings. • Explainability: Enhance model interpretability [62] by providing insights into why specific sentences or phrases were selected for the summary. Develop visualisation tools that highlight attention weights and decision pathways in the summarisation process. ...

Do Machines and Humans Focus on Similar Code? Exploring Explainability of Large Language Models in Code Summarization
  • Citing Conference Paper
  • June 2024

... For instance, some studies have shown that ChatGPT may display certain biases when detecting harmful content, especially in cases involving politically sensitive topics or comments from specific demographic groups (Zhu et al., 2023;Li et al., 2024;Deshpande et al., 2023;Clews, 2024;Zhang, 2024). Additionally, due to the model's training data and methods, some biases may unintentionally be introduced, causing the model to behave more conservatively in certain situations (Hou et al., 2024). Secondly, in terms of bias evaluation metrics such as FPED, FNED, and SUM-ED, ChatGPT demonstrates relatively lower gender bias compared to Naive Similarly, we conducted the same experiment again on The MTC dataset (The hate speech dataset) and found similar conclusions (see Fig. 3, Table 3, and Fig. 4). ...

ChatGPT Giving Relationship Advice – How Reliable Is It?
  • Citing Article
  • May 2024

Proceedings of the International AAAI Conference on Web and Social Media

... The diversity of these studies illustrates the wide-ranging applications for wearable research, with EmbracePlus used to monitor clinical health markers, fatigue, stress, emotion, and arousal in lab spaces, simulated work environments, or real-world settings. Most studies focused on adults (n = 21) [1][2][3][4][5][6][7][8][9][10]12,13,15,16,18,[32][33][34][35][36][37], with few in older adult or clinical groups (n = 6) [3,7,8,11,17,18], and few with children (n = 3) [11,14,19]. Cardiovascular measures were most commonly used for analysis (n = 28) [1][2][3][4][5][7][8][9][10][11][12][13][14][15]17,[21][22][23]25,[27][28][29][32][33][34][35], followed by electrodermal activity (n = 25) [1][2][3][4][5][8][9][10][11]13,14,16,17,[19][20][21][22]24,25,[27][28][29][32][33][34], temperature (n = 20) [1,[3][4][5]7,9,10,13,14,[16][17][18][19][20]22,24,25,[27][28][29], and accelerometry measures (n = 12) [8,15,16,20,[22][23][24][25][27][28][29]37]. ...

Breaking the Flow: A Study of Interruptions During Software Engineering Activities
  • Citing Conference Paper
  • April 2024

... Graphics processing units (GPUs) have evolved, Fig 1, into a versatile computing platform capable of massive parallelism and tremendous throughput for graphics, scientific, and data-parallel workloads [1]. GPUs achieve high performance through many lightweight cores and abundant memory bandwidth. ...

Building a Lightweight Trusted Execution Environment for Arm GPUs

IEEE Transactions on Dependable and Secure Computing

... Task Dataset CLINC150 (Larson et al., 2019) Redwood (Larson and Leach, 2022) GOOGLE-DSTC8 Leyzer (Sowański and Janicki, 2020) HINT3 (Arora et al., 2020) NLU Chatbot-Corpus (Braun et al., 2017) MultiWOZ BANKING77 (Casanueva et al., 2020) FEWSHOTWOZ (Peng et al., 2020) ATIS (Tur et al., 2010) Schema (Rastogi et al., 2019) CrossNER WNUT17 (Derczynski et al., 2017) NER CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) CoNLL-2004 (Carreras and Màrquez, 2004) IE OntoNotes (Weischedel et al., 2013) SCIERC (Luan et al., 2018) (Tomasello et al., 2022). SLURP is substantially larger and linguistically more diverse than previous SLU datasets. ...

Redwood: Using Collision Detection to Grow a Large-Scale Intent Classification Dataset
  • Citing Conference Paper
  • January 2022

... Recent advances in document classification have demonstrated continuous improvement in performance on the Tobacco3482 [19] document image dataset (e.g., [1,5,13,16]). However, a growing body of research on dataset quality casts serious doubt on the usefulness of many benchmark datasets for evaluating model performance [2,6,7,18]. Notably, widely-used datasets like ImageNet [3] has been found to have labeling issues, including incorrect labels and images that should be assigned multiple labels [12,14,15]. ...

On Evaluation of Document Classification with RVL-CDIP
  • Citing Conference Paper
  • January 2023