Identifying security issues before attackers do has become a critical concern for software development teams and software users. While methods for finding programming errors (e.g. fuzzers¹, static code analysis [3], and vulnerability prediction models such as Scandariato et al. [10]) are valuable, identifying security issues that stem from a lack of secure design principles or from poor development processes could help ensure that programming errors are avoided before they are committed to source code.
Typical approaches (e.g. [4, 6-8]) to identifying security-related messages in software development project repositories use text mining based on pre-selected sets of standard security-related keywords, for instance: authentication, ssl, encryption, availability, or password. We hypothesize that these standard keywords may not capture the entire spectrum of security-related issues in a project, and that additional project-specific and/or domain-specific vocabulary may be needed to develop an accurate picture of a project's security.
For instance, Arnold et al. [1], in a review of bug-fix patches on Linux kernel version 2.6.24, identified a commit (commit message: "Fix ->vm_file accounting, mmap_region() may do do_munmap()"²) with serious security consequences that was misclassified as a non-security bug. While no typical security keyword appears in the message, memory mapping ('mmap') has security significance in the domain of kernel development, much as buffer overflows do in languages like C/C++. Whether the asset at stake is memory or currency, changes to the assets that the software manages are potentially security-related.
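To make the contrast concrete, a minimal sketch of such fixed-keyword matching (ours, for illustration; not code from any cited approach) flags the first message below but misses the kernel commit just discussed:

import re

# Illustrative standard keyword set of the kind used in prior keyword-based approaches.
SECURITY_KEYWORDS = {"authentication", "ssl", "encryption", "availability", "password"}

def is_security_related(message):
    # Flag a message as security-related if it contains any pre-selected keyword.
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    return bool(tokens & SECURITY_KEYWORDS)

is_security_related("Require a password for the admin console")                     # True
is_security_related("Fix ->vm_file accounting, mmap_region() may do do_munmap()")   # False: no standard keyword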
The goal of this research is to support researchers and practitioners in identifying security issues in software development project artifacts by defining and evaluating a systematic scheme for identifying project-specific security vocabularies that can be used for keyword-based classification.
We derive three research questions from our goal:
• RQ1: How does the vocabulary of security issues vary between software development projects?
• RQ2: How well do project-specific security vocabularies identify messages related to publicly reported vulnerabilities?
• RQ3: How well do existing security keywords identify project-specific security-related messages and messages related to publicly reported vulnerabilities?
To address these research questions, we collected developer email, bug tracking, commit message, and CVE record project artifacts from three open source projects: Dolibarr, Apache Camel, and Apache Derby. We manually classified 5400 messages from the three projects' commit messages, bug trackers, and emails, and linked the messages to each project's public vulnerability records. Adapting techniques from Bachmann and Bernstein [2], Schermann et al. [11], and Guzzi [5], we analyzed each project's security vocabulary and the vocabulary's relationship to the project's vulnerabilities. We trained two classifiers (Model.A and Model.B) on samples of the project data, and used the classifiers to predict security-related messages in the manually classified project oracles.
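The following is a simplified sketch of this kind of vocabulary-restricted, keyword-based classification (the messages, labels, and vocabulary shown are hypothetical and chosen for illustration; they are not our Model.A or Model.B):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical manually classified messages (1 = security-related, 0 = not).
train_messages = [
    "invoice total can be altered after order validation",
    "deadlock when updating a blob in the lock manager",
    "update documentation for the ftp endpoint",
    "fix typo in the build script",
]
train_labels = [1, 1, 0, 0]

# Restricting features to a (hypothetical) project security vocabulary makes
# the classifier effectively keyword-based.
project_vocabulary = ["invoice", "order", "endpoint", "blob", "clob", "deadlock", "password"]

model = make_pipeline(
    CountVectorizer(vocabulary=project_vocabulary),
    MultinomialNB(),
)
model.fit(train_messages, train_labels)

# Predict on an unseen artifact message, e.g. from a commit log or bug tracker.
model.predict(["possible deadlock while committing blob changes"])  # -> array([1])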
Our contributions include:
• A systematic scheme for linking CVE records to related messages in software development project artifacts
• An empirical evaluation of project-specific security vocabulary similarities and differences between project artifacts and between projects
To summarize our findings on RQ1, we present tables of our qualitative and quantitative results, tabulating counts of the words found in security-related messages. Traditional security keywords (e.g. password, encryption) are present, particularly in the explicit column, but each project also contains terms describing entities unique to the project, for example 'endpoint' for Camel; 'blob' (Binary Large Object), 'clob' (Character Large Object), and 'deadlock' for Derby; and 'invoice' and 'order' for Dolibarr. The presence of these terms in security-related issues suggests that they are assets worthy of careful attention during the development life cycle.
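A minimal sketch of this tabulation step (the messages shown are hypothetical; in practice stop words would also be filtered):

import re
from collections import Counter

# Hypothetical messages manually labeled as security-related.
security_messages = [
    "blob corruption after deadlock in the lock manager",
    "invoice total can be altered after the order is validated",
]

word_counts = Counter(
    token
    for message in security_messages
    for token in re.findall(r"[a-z]+", message.lower())
)
word_counts.most_common(5)  # e.g. [('after', 2), ('the', 2), ('blob', 1), ...]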
Table 1 lists the statistics for security-related messages from the three projects, broken down by security class and security property. Explicit security-related messages (messages that directly reference security properties) are in the minority in each project; implicit messages account for the majority of security-related messages.
In Table 2, we present the results of the classifiers built using the various project and literature security vocabularies to predict security-related messages in the oracle and CVE datasets. We have marked in bold the highest result for each performance measure for each dataset. Both Model.A and Model.B perform well when predicting on the oracle dataset of the project for which they were built. Further, the project-specific models outperform the literature-based models (Ray.vocab [9] and Pletea.vocab [7]) on the project oracle datasets. However, this performance is not sustained, and is inconsistent, when a model is applied to other projects' datasets.
To summarize our findings on RQ2, Table 3 presents performance results for the project vocabulary models on the CVE datasets for each project. We have marked in bold the highest result for each performance measure for each dataset. Model.A shows high recall for Derby and Camel and below-average recall for Dolibarr. For Model.B, however, recall is above 60% for Dolibarr and over 85% for both Derby and Camel. We attribute the low precision to our labeling approach, in which only CVE-related messages are labeled as security-related and all remaining messages are labeled as not security-related. The Dolibarr results are further complicated by the low proportion of security-related messages compared with the other two projects (as reported in Table 1).
To summarize our findings on RQ3, Table 2 and Table 3 present the classifier performance results for two sets of keywords drawn from the literature, Ray.vocab and Pletea.vocab. In each case, the project vocabulary model had the highest recall, precision, and F-Score on the project's oracle dataset. On the CVE datasets, the project vocabulary model has the highest recall; however, overall performance, as measured by F-Score, varied by dataset, with the Ray and Pletea keywords scoring higher than the project vocabulary model. The low precision of the classifiers built on the project vocabularies follows from the explanation provided under RQ2.
Our results suggest that domain vocabulary models achieve recall that outperforms standard security terms across our datasets. Our conjecture, supported by our data, is that augmenting standard security keywords with a project's security vocabulary yields a more accurate security picture. In future work, we aim to refine vocabulary selection to improve classifier performance, and to develop tools that implement the approach described in this paper to aid practitioners and researchers in identifying software project security issues.