Huaming Chen’s research while affiliated with The University of Sydney and other places


Publications (8)


Security for Machine Learning-based Software Systems: A Survey of Threats, Practices, and Challenges
  • Article

December 2023 · 68 Reads · 32 Citations · ACM Computing Surveys

Huaming Chen · M. Ali Babar

The rapid development of Machine Learning (ML) has demonstrated superior performance in many areas, such as computer vision and video and speech recognition, and ML is now increasingly leveraged in software systems to automate core tasks. However, how to securely develop machine learning-based modern software systems (MLBSS) remains a major challenge, and insufficient attention to it will largely limit their application in safety-critical domains. One concern is that present MLBSS development tends to be rushed, so latent vulnerabilities and privacy issues exposed to external users and attackers are largely neglected and hard to identify. Additionally, machine learning-based software systems exhibit different liabilities to novel vulnerabilities at different development stages, from requirement analysis to system maintenance, due to inherent limitations of the model and data and to external adversary capabilities. Successfully building such intelligent systems thus calls for dedicated, joint efforts from different research areas, i.e., software engineering, system security and machine learning. Most recent work on security issues for ML focuses strongly on the data and models, which has brought adversarial attacks into consideration. In this work, we consider that security issues for machine learning-based software systems may arise from inherent system defects or external adversarial attacks, and that secure development practices should be applied throughout the whole lifecycle. While machine learning has become a new threat domain for existing software engineering practices, no review work yet covers the topic. Overall, we present a holistic review of security for MLBSS, providing a systematic understanding through a structured review of three distinct aspects of security threats. Moreover, it provides a thorough account of the state of the practice for MLBSS secure development. Finally, we summarise the literature on system security assurance and motivate future research directions with open challenges. We anticipate this work provides sufficient discussion and novel insights to incorporate system security engineering into future exploration.




A Survey on Data-driven Software Vulnerability Assessment and Prioritization

April 2022 · 110 Reads · 95 Citations · ACM Computing Surveys

Software Vulnerabilities (SVs) are increasing in complexity and scale, posing great security risks to many software systems. Given the limited resources in practice, SV assessment and prioritization help practitioners devise optimal SV mitigation plans based on various SV characteristics. The surges in SV data sources and data-driven techniques such as Machine Learning and Deep Learning have taken SV assessment and prioritization to the next level. Our survey provides a taxonomy of the past research efforts and highlights the best practices for data-driven SV assessment and prioritization. We also discuss the current limitations and propose potential solutions to address such issues.
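To make the idea of data-driven SV prioritization concrete, here is a minimal toy sketch: rank vulnerabilities for mitigation by a composite severity score. The field names and the multiplicative scoring rule are purely illustrative assumptions, not a real schema or any method from the survey; in practice such scores are learnt from data (e.g. predicted CVSS metrics or exploit likelihood).

```python
# Hypothetical SV records; field names are illustrative, not a real schema.
svs = [
    {"id": "SV-1", "exploitability": 0.9, "impact": 0.4, "exposure": 1.0},
    {"id": "SV-2", "exploitability": 0.3, "impact": 0.9, "exposure": 0.5},
    {"id": "SV-3", "exploitability": 0.8, "impact": 0.8, "exposure": 0.9},
]

def priority(sv):
    # A simple multiplicative score; data-driven approaches would instead
    # predict these factors from SV descriptions and code features.
    return sv["exploitability"] * sv["impact"] * sv["exposure"]

# Devise a mitigation plan: highest-priority SVs first.
plan = sorted(svs, key=priority, reverse=True)
print([sv["id"] for sv in plan])  # → ['SV-3', 'SV-1', 'SV-2']
```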


Figures: Program Dependence Graph for the example function (black lines: control dependency edges; dashed red lines: data dependency edges); LineVD overall architecture; histogram of first true-positive statement rankings on the default test set (e.g. in 575 samples the vulnerable statement ranks first, in 67 samples second). Sections: RQ3, graph-based feature variants for statement-level vulnerability classification; RQ5, analysis of statement-level predictions.
LineVD: Statement-level Vulnerability Detection using Graph Neural Networks
  • Preprint
  • File available

March 2022 · 321 Reads · 2 Citations

Current machine learning-based software vulnerability detection methods are primarily conducted at the function level. However, a key limitation of these methods is that they do not indicate the specific lines of code contributing to vulnerabilities. This limits the ability of developers to efficiently inspect and interpret the predictions of a learnt model, which is crucial for integrating machine learning-based tools into the software development workflow. Graph-based models have shown promising performance in function-level vulnerability detection, but their capability for statement-level vulnerability detection has not been extensively explored. While interpreting function-level predictions through explainable AI is one promising direction, we herein consider the statement-level software vulnerability detection task from a fully supervised learning perspective. We propose a novel deep learning framework, LineVD, which formulates statement-level vulnerability detection as a node classification task. LineVD leverages control and data dependencies between statements using graph neural networks, and a transformer-based model to encode the raw source code tokens. In particular, by addressing the conflicting outputs between function-level and statement-level information, LineVD significantly improves prediction performance without requiring the vulnerability status of the function code. We have conducted extensive experiments on a large-scale collection of real-world C/C++ vulnerabilities obtained from multiple real-world projects, and demonstrate an increase of 105% in F1-score over the current state-of-the-art.
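The core formulation above (statement-level detection as node classification over a dependency graph) can be sketched in miniature. This is a hedged toy with random weights, not the authors' implementation: the tiny edge list, the embedding dimension, and the single GCN-style message-passing layer are all illustrative assumptions standing in for LineVD's transformer encoder and trained GNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy program of 5 statements; edges are statement-to-statement
# dependencies (control + data), as in a program dependence graph.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
n, d = 5, 8

# Stand-in for per-statement token embeddings (random here; LineVD
# would obtain these from a transformer-based code encoder).
X = rng.normal(size=(n, d))

# Symmetrically normalised adjacency with self-loops (GCN-style).
A = np.eye(n)
for s, t in edges:
    A[s, t] = A[t, s] = 1.0
deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))

# One message-passing layer, then a per-node (per-statement) scorer.
W = rng.normal(size=(d, d))
w_out = rng.normal(size=d)
H = np.maximum(A_hat @ X @ W, 0.0)          # ReLU(GCN layer)
scores = 1 / (1 + np.exp(-(H @ w_out)))     # vulnerability prob. per statement

ranked = np.argsort(-scores)  # statements ranked most-suspicious first
print(ranked)
```

Node classification means each statement gets its own score, so developers can inspect a ranked list of suspicious lines rather than a single function-level verdict.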


Noisy Label Learning for Security Defects

March 2022

·

80 Reads

Data-driven software engineering processes, such as vulnerability prediction, rely heavily on the quality of the data used. In this paper, we observe that it is infeasible to obtain a noise-free security defect dataset in practice. Unlike the vulnerable class, non-vulnerable modules are difficult to verify as truly exploit-free given the limited manual effort available. This results in uncertainty, introduces label noise into the datasets, and affects conclusion validity. To address this issue, we propose novel learning methods that are robust to label impurities and can make the most of limited labelled data: noisy label learning. We investigate various noisy label learning methods applied to software vulnerability prediction. Specifically, we propose a two-stage learning method based on noise cleaning to identify and remediate noisy samples, which improves the AUC and recall of baselines by up to 8.9% and 23.4%, respectively. Moreover, we discuss several hurdles to achieving a performance upper bound with semi-omniscient knowledge of the label noise. Overall, the experimental results show that learning from noisy labels can be effective for data-driven software and security analytics.
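The two-stage noise-cleaning idea can be illustrated with a self-contained toy: fit a model on the noisy labels, flag samples whose own label the model finds implausible, then retrain on the retained subset. This is a generic sketch under stated assumptions (synthetic blobs, plain logistic regression, a hypothetical 0.3 confidence threshold), not the paper's actual method or dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "defect" data: two Gaussian blobs, then flip 15% of the
# non-vulnerable labels to simulate unverified, noisy negatives.
n = 400
X = np.vstack([rng.normal(-1, 1, (n // 2, 2)), rng.normal(1, 1, (n // 2, 2))])
y_true = np.repeat([0, 1], n // 2)
y_noisy = y_true.copy()
flip = rng.choice(np.where(y_true == 0)[0], size=int(0.15 * (n // 2)), replace=False)
y_noisy[flip] = 1

def fit_logreg(X, y, epochs=300, lr=0.1):
    """Plain logistic regression via gradient descent (no sklearn needed)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1 / (1 + np.exp(-Xb @ w))

# Stage 1: fit on noisy labels; flag samples whose given label the
# model itself finds implausible (low self-confidence).
w1 = fit_logreg(X, y_noisy)
p = predict(w1, X)
self_conf = np.where(y_noisy == 1, p, 1 - p)
keep = self_conf > 0.3          # hypothetical cleaning threshold

# Stage 2: retrain only on the retained, presumably cleaner samples.
w2 = fit_logreg(X[keep], y_noisy[keep])

acc_noisy = ((predict(w1, X) > 0.5) == y_true).mean()
acc_clean = ((predict(w2, X) > 0.5) == y_true).mean()
print(f"accuracy before cleaning: {acc_noisy:.2f}, after: {acc_clean:.2f}")
```

Dropping low-confidence samples trades a smaller training set for cleaner labels; the paper's remediation step is more elaborate, but the filter-then-retrain structure is the same.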


Security for Machine Learning-based Software Systems: a survey of threats, practices and challenges

January 2022

·

535 Reads

The rapid development of Machine Learning (ML) has demonstrated superior performance in many areas, such as computer vision and video and speech recognition, and ML is now increasingly leveraged in software systems to automate core tasks. However, how to securely develop machine learning-based modern software systems (MLBSS) remains a major challenge, and insufficient attention to it will largely limit their application in safety-critical domains. One concern is that present MLBSS development tends to be rushed, so latent vulnerabilities and privacy issues exposed to external users and attackers are largely neglected and hard to identify. Additionally, machine learning-based software systems exhibit different liabilities to novel vulnerabilities at different development stages, from requirement analysis to system maintenance, due to inherent limitations of the model and data and to external adversary capabilities. In this work, we consider that security issues for machine learning-based software systems may arise from inherent system defects or external adversarial attacks, and that secure development practices should be applied throughout the whole lifecycle. While machine learning has become a new threat domain for existing software engineering practices, no review work yet covers the topic. Overall, we present a holistic review of security for MLBSS, providing a systematic understanding through a structured review of three distinct aspects of security threats. Moreover, it provides a thorough account of the state of the practice for MLBSS secure development. Finally, we summarise the literature on system security assurance and motivate future research directions with open challenges. We anticipate this work provides sufficient discussion and novel insights to incorporate system security engineering into future exploration.


Figures and tables: challenges and future directions for data-driven SV assessment and prioritization; comparison of contributions between this survey and existing related surveys/reviews; surveyed papers in the Exploit Time and Exploit Characteristics sub-themes of the Exploitation theme.
A Survey on Data-driven Software Vulnerability Assessment and Prioritization

July 2021 · 635 Reads

Software Vulnerabilities (SVs) are increasing in complexity and scale, posing great security risks to many software systems. Given the limited resources in practice, SV assessment and prioritization help practitioners devise optimal SV mitigation plans based on various SV characteristics. The surge in SV data sources and data-driven techniques such as Machine Learning and Deep Learning has taken SV assessment and prioritization to the next level. Our survey provides a taxonomy of the past research efforts and highlights the best practices for data-driven SV assessment and prioritization. We also discuss the current limitations and propose potential solutions to address such issues.

Citations (5)


... While this approach considers the possibility of using large language models for generating test cases, it has interpretability issues and high computational requirements that can limit its use in small organizations. Chen and Babar [13] proposed the threats to ML-based software systems' security and the practices implemented to combat them. Although they have attempted to meet the increasing demand for security, their work is mainly centered on security aspects, and the identification of software defects is not discussed. ...

Reference:

Deep ensemble optimal classifier-based software defect prediction for early detection and quality assurance
Security for Machine Learning-based Software Systems: A Survey of Threats, Practices, and Challenges
  • Citing Article
  • December 2023

ACM Computing Surveys

... DPAs can be classified into several distinct groups (Figure 2). These groups include label manipulation, where incorrect labels are assigned to training data [16]; data injection, which involves adding fraudulent data points [17]; feature space manipulation, where the features of the data are altered to mislead the model [18]; and relationship (or context) manipulation, which disrupts the underlying relationships between data points [19]. As shown in Figure 2, the different groups can be further divided into the following types. ...

Noisy label learning for security defects
  • Citing Conference Paper
  • October 2022

... Basic Graph Neural Networks (GNNs) are models designed to understand relationships between nodes in a network, much like identifying connections within a web of friends. In the context of source code analysis, GNN-based models have been utilized in 13 studies, where the source code is represented through various graph or tree-based structures [20,32,38,54,64,83,96,127,130,146,158,163,175]. These models effectively capture structural dependencies within the code, facilitating vulnerability detection. ...

LineVD: statement-level vulnerability detection using graph neural networks
  • Citing Conference Paper
  • October 2022

... Their approach employs a specially designed time-based cross-validation technique to identify the optimal model for each vulnerability characteristic, chosen from eight distinct natural language processing representations and six established machine learning models. Le et al. [46] conducted a review of earlier research activities, emphasizing optimal practices for assessing and ranking software vulnerabilities driven by data. Sun et al. [12] employed BERT-MRC to isolate vulnerability components from their descriptions and used these elements throughout the descriptions to assess six different metrics. ...

A Survey on Data-driven Software Vulnerability Assessment and Prioritization
  • Citing Article
  • April 2022

ACM Computing Surveys

... Authors used feature engineering to fetch meaning from the given document to produce a meaningful summary of the original document. Hin et al. [59] presented a DL architecture, termed LineVD, which focused on formulating vulnerability detection at the statement level as the node classification task. LineVD uses Graph Neural Network (GNN). ...

LineVD: Statement-level Vulnerability Detection using Graph Neural Networks