October 2022 · 9 Reads · 9 Citations
October 2022 · 6 Reads · 4 Citations
June 2022 · 386 Reads · 18 Citations
Empirical Software Engineering
Software developed on public platforms is a source of data that can be used to make predictions about those projects. While the activity of an individual developer may be random and hard to predict, development behavior at the project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes: algorithms like k-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. But that error rate can be greatly reduced using hyperparameter optimization. To the best of our knowledge, this is the largest study yet conducted using recent data to predict multiple health indicators of open-source projects. To facilitate open science (and replications and extensions of this work), all our materials are available online at https://github.com/arennax/Health_Indicator_Prediction.
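The setup described above lends itself to a short illustration. Below is a minimal sketch (not the paper's code) contrasting an off-the-shelf CART regressor with a hyperparameter-tuned one; the synthetic data, parameter ranges, and the median-MRE metric are illustrative assumptions.

```python
# Sketch: default CART vs. hyperparameter-tuned CART, as the abstract describes.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

def mre(y_true, y_pred):
    # Median magnitude of relative error, the metric family used in this line of work.
    return np.median(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), 1e-8))

default = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=1),
    {"max_depth": list(range(2, 21)),
     "min_samples_leaf": list(range(1, 21)),
     "min_samples_split": list(range(2, 21))},
    n_iter=50, cv=5, random_state=1)
tuned = search.fit(X_tr, y_tr).best_estimator_

print("default MRE:", mre(y_te, default.predict(X_te)))
print("tuned   MRE:", mre(y_te, tuned.predict(X_te)))
```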
May 2022 · 83 Reads
Background: Most existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In such a paradigm, models need a large amount of labeled data to learn the useful relationships between selected features and the target class. However, such labeled data can be scarce and expensive to acquire. Goal: To help security practitioners train useful security classification models when few labeled training data and many unlabeled training data are available. Method: We propose an adaptive framework called Dapper, which optimizes 1) semi-supervised learning algorithms that assign pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine learning classifier (i.e., random forest). When the dataset's classes are highly imbalanced, Dapper also adaptively integrates and optimizes a data oversampling method called SMOTE. We use Bayesian Optimization to search the large hyperparameter space of these tuning targets. Result: We evaluate Dapper with three security datasets, i.e., the Twitter spam dataset, the malware URLs dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that we can use as little as 10% of the original labeled data yet achieve classification performance close to, or even better than, using 100% of the labeled data in a supervised way. Conclusion: Based on those results, we recommend using hyperparameter optimization with semi-supervised learning when dealing with shortages of labeled security data.
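As a rough illustration of the pipeline described above, the sketch below chains propagation-based pseudo-labeling, SMOTE oversampling, and a random forest, assuming scikit-learn's LabelSpreading and imbalanced-learn's SMOTE as stand-ins; the real framework also Bayesian-optimizes each stage's hyperparameters, which this sketch omits.

```python
# Sketch of a Dapper-style pipeline (assumed stand-ins, not the paper's code).
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)

# Keep labels for only 10% of the rows; mark the rest as unlabeled (-1).
rng = np.random.default_rng(1)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.10] = -1

# Stage 1: propagate pseudo-labels to the unlabeled rows.
pseudo = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
y_pseudo = pseudo.transduction_

# Stage 2: the pseudo-labeled set is imbalanced, so oversample with SMOTE.
X_bal, y_bal = SMOTE(random_state=1).fit_resample(X, y_pseudo)

# Stage 3: train the downstream classifier (random forest in the paper).
clf = RandomForestClassifier(random_state=1).fit(X_bal, y_bal)
```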
March 2022 · 40 Reads
Background: Machine learning techniques have been widely used and demonstrate promising performance in many software security tasks such as software vulnerability prediction. However, the class ratio within software vulnerability datasets is often highly imbalanced (since the percentage of observed vulnerabilities is usually very low). Goal: To help security practitioners address class imbalance in software security data and build better prediction models from resampled datasets. Method: We introduce an approach called Dazzle, an optimized version of conditional Wasserstein Generative Adversarial Networks with gradient penalty (cWGAN-GP). Dazzle explores the architecture hyperparameters of cWGAN-GP via Bayesian Optimization. We use Dazzle to generate minority-class samples to resample the original imbalanced training dataset. Results: We evaluate Dazzle with three software security datasets, i.e., Moodle vulnerable files, Ambari bug reports, and JavaScript function code. We show that Dazzle is practical to use and demonstrates promising improvement over existing state-of-the-art oversampling techniques such as SMOTE (e.g., with an average improvement of about 60% over SMOTE in recall across all datasets). Conclusion: Based on this study, we suggest optimized GANs as an alternative method for addressing class imbalance in security vulnerability data.
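For readers unfamiliar with the "-GP" part of cWGAN-GP, here is a minimal PyTorch sketch of the gradient-penalty term (an assumed illustration, not Dazzle's code); the `critic` network and its (samples, labels) interface are placeholders, and its architecture is exactly the kind of hyperparameter Dazzle tunes.

```python
# Sketch: WGAN-GP gradient penalty for a conditional critic on tabular data.
import torch

def gradient_penalty(critic, real, fake, labels, device="cpu"):
    # Interpolate between real and generated minority-class samples.
    eps = torch.rand(real.size(0), 1, device=device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed, labels)  # placeholder conditional critic
    grads = torch.autograd.grad(
        outputs=scores, inputs=mixed,
        grad_outputs=torch.ones_like(scores),
        create_graph=True)[0]
    # Penalize deviation of the gradient norm from 1 (the "-GP" term).
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Critic loss: E[critic(fake)] - E[critic(real)] + lambda * gradient_penalty(...)
```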
January 2022 · 248 Reads · 18 Citations
Empirical Software Engineering
Context: Machine learning-based security detection models have become prevalent in modern malware and intrusion detection systems. However, previous studies show that such models are susceptible to adversarial evasion attacks. In this type of attack, inputs (i.e., adversarial examples) are specially crafted by intelligent malicious adversaries, with the aim of being misclassified by existing state-of-the-art models (e.g., deep neural networks). Once the attackers can fool a classifier into thinking that a malicious input is actually benign, they can render a machine learning-based malware or intrusion detection system ineffective. Objective: To help security practitioners and researchers build a model that is more robust against non-adaptive, white-box, and non-targeted adversarial evasion attacks through the idea of an ensemble model. Method: We propose an approach called Omni, the main idea of which is to explore methods that create an ensemble of “unexpected models”; i.e., models whose control hyperparameters have a large distance to the hyperparameters of an adversary’s target model, with which we then make an optimized weighted ensemble prediction. Results: In studies with five types of adversarial evasion attacks (FGSM, BIM, JSMA, DeepFool, and Carlini-Wagner) on five security datasets (NSL-KDD, CIC-IDS-2017, CSE-CIC-IDS2018, CICAndMal2017, and the Contagio PDF dataset), we show that Omni is a promising defense strategy against adversarial attacks when compared with other baseline treatments. Conclusions: When employing ensemble defense against adversarial evasion attacks, we suggest creating the ensemble from unexpected models that are distant from the attacker’s expected model (i.e., the target model) through methods such as hyperparameter optimization.
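A minimal sketch of the "unexpected models" idea follows, under stated assumptions: sample candidate hyperparameter configurations, keep those farthest from the adversary's assumed target configuration in normalized hyperparameter space, and combine them as a weighted soft-voting ensemble. The target configuration, distance metric, and weighting below are illustrative, not Omni's.

```python
# Sketch: build an ensemble of models distant from an assumed target config.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

rng = np.random.default_rng(1)
target = {"n_estimators": 100, "max_depth": 10}  # adversary's assumed model

def distance(cfg):
    # Normalized L2 distance from the target configuration (assumed metric).
    return np.hypot((cfg["n_estimators"] - target["n_estimators"]) / 500,
                    (cfg["max_depth"] - target["max_depth"]) / 30)

candidates = [{"n_estimators": int(rng.integers(10, 500)),
               "max_depth": int(rng.integers(2, 30))} for _ in range(50)]
unexpected = sorted(candidates, key=distance, reverse=True)[:5]

ensemble = VotingClassifier(
    [(f"m{i}", RandomForestClassifier(**cfg, random_state=i))
     for i, cfg in enumerate(unexpected)],
    voting="soft",
    weights=[distance(cfg) for cfg in unexpected])  # farther -> more weight
# ensemble.fit(X_train, y_train) on the defender's data as usual.
```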
May 2021 · 112 Reads · 25 Citations
Empirical Software Engineering
Background: So that the general public is not vulnerable to hackers, security bug reports need to be handled by small groups of engineers before being widely discussed. But learning to distinguish security bug reports from other bug reports is challenging since they occur rarely. Data mining methods that can find such scarce targets require extensive optimization effort. Goal: The goal of this research is to aid practitioners as they struggle to optimize methods that try to distinguish between rare security bug reports and other bug reports. Method: Our proposed method, called SWIFT, is a dual optimizer that optimizes both learner and pre-processor options. Since this is a large space of options, SWIFT uses a technique called 𝜖-dominance that learns how to avoid operations that do not significantly improve performance. Result: When compared to recent state-of-the-art results (from FARSEC, published in TSE'18), we find that SWIFT's dual optimization of both pre-processor and learner is more useful than optimizing each of them individually. For example, in a study of security bug reports from the Chromium dataset, the median recalls of FARSEC and SWIFT were 15.7% and 77.4%, respectively. As another example, in experiments with data from the Ambari project, the median recall improved from 21.5% to 85.7% (FARSEC to SWIFT). Conclusion: Overall, our approach can quickly optimize models that achieve better recalls than the prior state-of-the-art. These increases in recall are associated with moderate increases in false positive rates (from 8% to 24%, median). For future work, these results suggest that dual optimization is both practical and useful.
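As a sketch of what dual optimization with an 𝜖-dominance acceptance rule can look like (assumptions throughout, not SWIFT's code), the snippet below jointly samples pre-processor and learner options and only accepts a candidate that beats the incumbent by more than epsilon; the option grids, scorer, and epsilon value are placeholders.

```python
# Sketch: joint (pre-processor, learner) search with epsilon-dominance acceptance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def dual_search(texts, labels, n_trials=30, eps=0.01, seed=1):
    rng = np.random.default_rng(seed)
    best_score, best_pipe = -np.inf, None
    for _ in range(n_trials):
        # Jointly sample pre-processor AND learner options (the "dual" part).
        pre = TfidfVectorizer(max_features=int(rng.choice([100, 500, 1000])),
                              ngram_range=(1, int(rng.choice([1, 2]))))
        learner = (MultinomialNB() if rng.random() < 0.5 else
                   RandomForestClassifier(n_estimators=int(rng.choice([50, 200]))))
        pipe = make_pipeline(pre, learner)
        score = cross_val_score(pipe, texts, labels, scoring="recall", cv=3).mean()
        # Epsilon-dominance: ignore gains too small to matter.
        if score > best_score + eps:
            best_score, best_pipe = score, pipe
    return best_pipe, best_score
```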
December 2020 · 67 Reads · 25 Citations
IEEE Transactions on Software Engineering
Many methods have been proposed to estimate how much effort is required to build and maintain software. Much of that research tries to recommend a single method – an approach that makes the dubious assumption that one method can handle the diversity of software project data. To address this drawback, we apply a configuration technique called “ROME” (Rapid Optimizing Methods for Estimation), which uses sequential model-based optimization (SMO) to find what configuration settings of effort estimation techniques work best for a particular data set. We test this method using data from 1161 traditional waterfall projects and 120 contemporary projects (from GitHub). In terms of magnitude of relative error and standardized accuracy, we find that ROME achieves better performance than the state-of-the-art methods for both traditional waterfall and contemporary projects. In addition, we conclude that we should not recommend one method for estimation. Rather, it is better to search through a wide range of different methods to find what works best for the local data. To the best of our knowledge, this is the largest effort estimation experiment yet attempted and the only one to test its methods on traditional waterfall and contemporary projects.
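For context, here is a minimal sketch of the sequential model-based optimization loop that ROME builds on: a surrogate model learns configuration-to-error from evaluated configurations and proposes the next one to try. The configuration encoding and the `evaluate` and `sample_config` callables are hypothetical placeholders, not the paper's setup.

```python
# Sketch: generic SMBO loop with a random-forest surrogate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def smbo(evaluate, sample_config, n_init=10, n_iter=20, n_candidates=200, seed=1):
    rng = np.random.default_rng(seed)
    # Bootstrap the surrogate with a few random configurations.
    X = np.array([sample_config(rng) for _ in range(n_init)])
    y = np.array([evaluate(c) for c in X])
    for _ in range(n_iter):
        surrogate = RandomForestRegressor(random_state=0).fit(X, y)
        cands = np.array([sample_config(rng) for _ in range(n_candidates)])
        best = cands[np.argmin(surrogate.predict(cands))]  # lowest predicted error
        X = np.vstack([X, best])
        y = np.append(y, evaluate(best))
    return X[np.argmin(y)], y.min()
```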
November 2020 · 49 Reads
BACKGROUND: Machine learning-based security detection models have become prevalent in modern malware and intrusion detection systems. However, previous studies show that such models are susceptible to adversarial evasion attacks. In this type of attack, inputs (i.e., adversarial examples) are specially crafted by intelligent malicious adversaries, with the aim of being misclassified by existing state-of-the-art models (e.g., deep neural networks). Once the attackers can fool a classifier into thinking that a malicious input is actually benign, they can render a machine learning-based malware or intrusion detection system ineffective. GOAL: To help security practitioners and researchers build a model that is more robust against adversarial evasion attacks through the use of ensemble learning. METHOD: We propose an approach called OMNI, the main idea of which is to explore methods that create an ensemble of "unexpected models"; i.e., models whose control hyperparameters have a large distance to the hyperparameters of an adversary's target model, with which we then make an optimized weighted ensemble prediction. RESULTS: In studies with five adversarial evasion attacks (FGSM, BIM, JSMA, DeepFool, and Carlini-Wagner) on five security datasets (NSL-KDD, CIC-IDS-2017, CSE-CIC-IDS2018, CICAndMal2017, and the Contagio PDF dataset), we show that the improvement rate of OMNI's prediction accuracy over attack accuracy is about 53% (median value) across all datasets, with about an 18% (median value) loss rate when comparing pre-attack accuracy and OMNI's prediction accuracy. CONCLUSION: When using ensemble learning as a defense method against adversarial evasion attacks, we suggest creating an ensemble of unexpected models that are distant from the attacker's expected model (i.e., the target model) through methods such as hyperparameter optimization.
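The ensemble idea itself is sketched after the January 2022 entry above; here instead is a minimal PyTorch sketch of FGSM, the first of the attacks named in this abstract (an assumed illustration, not this paper's code): perturb the input one bounded step in the direction of the loss gradient's sign.

```python
# Sketch: Fast Gradient Sign Method (FGSM) adversarial example generation.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.1):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # One signed-gradient step of size epsilon: the "fast" in FGSM.
    return (x + epsilon * x.grad.sign()).detach()
```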
June 2020 · 57 Reads
Software developed on public platforms is a source of data that can be used to make predictions about those projects. While the activity of a single developer may be random and hard to predict, when large groups of developers work together on software projects, the resulting behavior can be predicted with good accuracy. To demonstrate this, we use 78,455 months of data from 1,628 GitHub projects to make various predictions about the current status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes. Algorithms like k-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates (usually more than 50% wrong, sometimes over 130% wrong, median values). But that error rate can be greatly reduced using DECART hyperparameter optimization. DECART is a differential evolution (DE) algorithm that tunes the CART data mining system to the particular details of a specific project. To the best of our knowledge, this is the largest study yet conducted, using the most recent data, for predicting multiple health indicators of open-source projects. Further, due to our use of hyperparameter optimization, it may be the most successful: our predictions have less than 10% error (median value), which is much smaller than the errors seen in related work. Our results are a compelling argument for open-sourced development. Companies that only build in-house proprietary products may be cutting themselves off from the information needed to reason about those projects.
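DECART's core idea can be sketched in a few lines, assuming SciPy's differential evolution and scikit-learn's CART regressor as stand-ins; the objective, bounds, and synthetic data below are illustrative, not the paper's setup.

```python
# Sketch: differential evolution tuning CART hyperparameters (DECART-style).
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=6, noise=5, random_state=1)

def objective(params):
    # DE works on real-valued vectors, so round to CART's integer hyperparameters.
    depth, leaf, split = (int(round(p)) for p in params)
    tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=leaf,
                                 min_samples_split=split, random_state=1)
    # Minimize cross-validated absolute error.
    return -cross_val_score(tree, X, y, cv=5,
                            scoring="neg_mean_absolute_error").mean()

result = differential_evolution(objective,
                                bounds=[(1, 20), (1, 20), (2, 20)],
                                maxiter=20, seed=1)
print("tuned hyperparameters:", np.round(result.x).astype(int))
```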
... At MSR'22, Majumder et al. [36] reported twelve clusters of projects within the projects explored by Xia et al. This paper applies our methods to the projects nearest to those twelve clusters. ...
October 2022
... DE has been widely applied [67]. Within software engineering, DE has been used for optimization tasks such as Fu et al.'s tuning study on defect prediction [68], Shu et al.'s study on tuning detectors for security issues [69], and Xia et al.'s study that tuned project health predictors for open-source Java systems [34]. ...
October 2022
... Understanding when a software project is in a healthy state remains a critical yet unsolved challenge in software development. While repositories provide extensive data about project activities, from code changes to community interactions, current approaches struggle to convert this wealth of information into actionable insights about project health [1,2]. This gap affects both practitioners managing projects and researchers studying software development. ...
Reference:
Introducing Repository Stability
June 2022
Empirical Software Engineering
... Notably, various categories of ransomware exist, each with unique characteristics. These categories encompass crypto worms in ref. [27], Human-operated Ransomware in ref. [28], Ransomware-as-a-Service (RaaS) in ref. [29], and Automated Active Adversary ransomware in ref. [30]. Table 2 encapsulates the essential features, propagation methods, exploitation strategies, and ransomware families associated with these diverse ransomware types. ...
January 2022
Empirical Software Engineering
... In addition, researchers also analyze vulnerabilities from various project artifacts (e.g., IRs, bug reports, etc.). Some researchers utilized text-mining methods to explore security bug reports to identify vulnerabilities [29,[82][83][84], while other works analyze the negative impact of vulnerabilities from IRs [62,64,66,75]. Other researchers focus on crowd-based security discussions, e.g., security posts in Stack Overflow and discussion groups in Gitter/Slack, to analyze the topics, attacks, and corresponding mitigations [40,52,67,89,92]. ...
May 2021
Empirical Software Engineering
... The annual PROMISE meeting knows it needs to revisit its goals and methods. Gema Rodríguez-Pérez cautions that in the early years of PROMISE, data sets were often not really raw data, but rather collections of metrics. E.g., see the 1100+ recent GitHub projects used by Xia et al. [50], or everything that can be extracted using CommitGuru [51]. ...
December 2020
IEEE Transactions on Software Engineering
... The framework [5] combines filtering and ranking methods to reduce the mislabelling of SBRs by text-based classification models. Shu et al. [2] replicated and improved the FARSEC approach by applying hyperparameter optimization that has been used before in software engineering (e.g., for software defect classification [11], [12] or effort estimation [13]). Wu et al. [7] explored the reasons that led to the poor performance of Peters et al. [5] and Shu et al. [2] and found one main reason: the quality of labels assigned to the bug reports in the datasets. ...
April 2018
... This study shows that this approach could be valuable in attaining accuracy over real-world data sets with characteristics like the NASA data sets. In the studies of Xia T et al. [128], we have seen the tool OIL, based on analogies, which was tested over publicly available datasets for effort estimation and feature optimization. These studies support using CART and FLASH, as they outperformed the others. ...
April 2018