A Defect Estimator for Source Code: Linking Defect Reports With
Programming Constructs Usage Metrics
RITU KAPUR and BALWINDER SODHI, Indian Institute of Technology Ropar, India
An important issue faced during software development is to identify defects and the properties of those defects, if found, in a given source file. Determining the defectiveness of source code assumes significance due to its implications on software development and maintenance cost.
We present a novel system to estimate the presence of defects in source code and detect attributes of the possible defects, such as the severity of defects. The salient elements of our system are: i) a dataset of newly introduced source code metrics, called PROgramming CONstruct (PROCON) metrics, and ii) a novel Machine-Learning (ML) based system, called Defect Estimator for Source Code (DESCo), that makes use of the PROCON dataset for predicting defectiveness in a given scenario. The dataset was created by processing 30400+ source files written in four popular programming languages, viz. C, C++, Java, and Python.
The results of our experiments show that the DESCo system outperforms one of the state-of-the-art methods with an improvement of 44.9%. To verify the correctness of our system, we compared the performance of 12 different ML algorithms with 50+ different combinations of their key parameters. Our system achieves the best results with the SVM technique, with a mean accuracy measure of 80.8%.
CCS Concepts: • Software and its engineering → Software maintenance tools; Software performance; Source code generation; Parsers; Language features; Software libraries and repositories; Syntax; Software design tradeoffs; • Information systems → Data mining; • Computing methodologies → Supervised learning;
Additional Key Words and Phrases: Maintaining software, Source code mining, Software defect prediction, Software metrics, Software
faults and failures, Automated software engineering, AI in software engineering
ACM Reference Format:
Ritu Kapur and Balwinder Sodhi. 2020. A Defect Estimator for Source Code: Linking Defect Reports With Programming Constructs
Usage Metrics. ACM Trans. Softw. Eng. Methodol. 1, 1, Article 1 (January 2020), 34 pages. https://doi.org/10.1145/3384517
1 INTRODUCTION
Maintenance of software, particularly identifying and fixing the defects, contributes about 90% of the total cost of developing and deploying the software [1]. Early detection of software defects is essential for lowering the cost of debugging and thus, the overall cost of software development [2]. Therefore, it is desirable to have tools and techniques by which one can determine the defectiveness of a program and possibly various properties of the apparent defects [3].
Authors’ address: Ritu Kapur; Balwinder Sodhi, Indian Institute of Technology Ropar, Department of Computer Science and Engineering, Rupnagar,
Punjab, 140001, India, ritu.kapur@iitrpr.ac.in, sodhi@iitrpr.ac.in.
Definition 1: Defectiveness
We define the defectiveness of a source file as the likelihood of finding any defects present in it. We express the value of defectiveness on a Likert scale of {Likely-defective, Unknown}.
We chose Unknown as the second value because the other plausible option – NonDefective – would have implied the absence of defects. We do not determine the absence of defects.
Stated broadly, the objective of the work presented in this paper is to determine the defectiveness of a given source file. If the file is likely-to-be-defective, then we also predict specific properties of the possible defects. The provided source file could be an artefact such as the definition of a single class (e.g., a Java class), multiple classes (e.g., a C# namespace in a single file), or a collection of functions (e.g., in a Python module). We do not try to predict the absence of defects (hence the item Unknown on the Likert scale). Also, we do not aim to determine the precise locations (for instance, the line numbers or code blocks) of the apparent defects within the given source file.
Definition 2: Software defect and software failure
A software defect is defined as a flaw in the software (or a certain component) that causes the software (or the specific component) to fail to perform its required function, as specified in the requirement specification. A defect, when encountered during the execution of the software, may lead to partial or complete software failure [4].
A software failure is defined as the inability of a software system or component to perform its required functions within the specified performance requirements [5].
We primarily aim to determine defectiveness related to the non-functional requirements [6] of a software; for instance, the defects related to the performance, security, and reliability of a software. Also, we do not consider syntax errors in source code that result in build failures.
Some examples of defects related to the non-functional requirements of software, as reported on different bug tracking portals a,b, are:
(1) Performance: Batik inside of Eclipse using bridge memory heap causes crash c.
(2) Security: XML vulnerabilities in Python d.
(3) Accessibility: Unlocking a file owned by another user doesn't work e.
(4) Programming Language compatibility: build.xml isn't forward compatible with java6 f.
(5) Maintainability: FAQ references third-party libraries that have been refactored or renamed g.
a https://bz.apache.org/Bugzilla/
b https://bugs.python.org/
c https://bz.apache.org/Bugzilla/show_bug.cgi?id=34631
d https://defects.python.org/issue17239
e https://bz.apache.org/Bugzilla/show_bug.cgi?id=36981
f https://bz.apache.org/Bugzilla/show_bug.cgi?id=43608
g https://bz.apache.org/Bugzilla/show_bug.cgi?id=36003
1.1 Existing approaches for determining the defectiveness
Many of the recent techniques for prediction of defects in source code employ an ML-based approach. The use of suitably crafted features derived from the input is a distinct characteristic of such methods. For example, the authors in [7] build a defect prediction model by training Deep Belief Networks (DBN) on the token vectors derived from a subset of Abstract Syntax Tree (AST) nodes of various programs. [8] present another such approach which builds the defect prediction model by training neural networks on the token-vectors built using AST nodes of source code. Similarly, [9] show that training a defect prediction model using AST n-grams improves the prediction performance.
The results from the works cited in the preceding paragraph show that the syntactic features of source code (extracted using AST nodes) can serve as useful features for building defect prediction models. In other words, the syntactic properties of a program can be treated as proxies for the programming decisions (see Definition 3) employed in the program.
Researchers in the past have established that the coding practices adopted by programmers significantly influence the quality of software [10, 11]. The programming decisions made while constructing the software influence its quality attributes such as readability, portability, ease-of-learning, reliability, and maintainability [10]. For instance, the choice of concise and consistent naming conventions (for identifier names) has been reported [11, 12] to result in better quality software. The above results indicate the existence of a relationship between the programming decisions employed in constructing a software and the quality of the software.
The above findings motivated us to develop a technique for determining the defectiveness of a program by analyzing the programming decisions employed in constructing that program. This intuition of determining the defectiveness of source code by analyzing the constituent programming decisions forms the motivation for the PROCON metrics that we propose in §2.2.2. Briefly stated, the PROCON metrics capture the occurrence frequency and AST depth of various programming constructs used in source code.
Definition 3: Programming Decision
It includes the following types of syntactic decisions made by the programmer during the construction of a program:
(1) Choice of the programming language constructs. For example, for vs do-while for looping, if-else vs switch for branching.
(2) The constructs' occurrence count and depth in the AST of the source code. For example, the if statement is used 100 times in a source file with the deepest nesting level of, say, 5.
(3) The length of the names of various elements of the program. For example, the names of variables, functions, and classes.
The following are not considered programming decisions:
(1) The exact names of the variables, functions, and classes. We only care about the length of names, not the names themselves.
(2) Abstract design choices such as those made during architecture design. For instance, whether to use the factory method vs the abstract factory design pattern.
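For illustration, the following minimal Python fragments (hypothetical; not taken from our corpus) show two of the decision types listed above: the choice between looping constructs, and the lengths of identifier names.

```python
# Hypothetical fragments illustrating Definition 3; only the decisions matter, not the logic.

def sum_with_for(values):
    total = 0
    for v in values:              # decision (1): `for` chosen as the looping construct
        total += v
    return total

def sum_with_while(values):
    index, accumulated_total = 0, 0
    while index < len(values):    # decision (1): `while` chosen instead of `for`
        accumulated_total += values[index]
        index += 1
    return accumulated_total

# Decision (3): only the lengths of identifier names are recorded,
# e.g. len("v") = 1 vs. len("accumulated_total") = 17; the names themselves are ignored.
```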
1.2 Hypothesis behind the proposed approach
The basic idea underlying our method is to train an ML model ∆ with a large number of source code samples for which the defectiveness is known; using ∆, we can then predict the defectiveness of unseen source code. The critical aspect that differentiates our approach is the choice of source code features, using which we derive an accurate representation of the source code from its defectiveness perspective.
Existing literature (e.g., [13, 14]) has shown that the choice of ML algorithm and the features employed to build a prediction model ∆ have a significant impact on the performance of ∆. Further, the use of various syntactic features of source code for effective detection of source code similarity has been reported by many researchers (as discussed in §1.1).
In view of the above observations, we hypothesize that:
(1) The programming decisions (see Definition 3) employed in the construction of software can:
(a) serve as useful features for computing a representation (from the defectiveness perspective) of the source code.
(b) be represented and measured by programming language constructs' usage metrics (e.g., PROCON metrics).
(2) A system can be built which can use an ML model, suitably trained with the PROCON dataset, to estimate the defectiveness of given source code.
The above points form the basis of developing a technique for determining the defectiveness of a program by analyzing the programming decisions employed in constructing that program.
Overall, our work makes the following key contributions:
• We propose a method for estimating the defectiveness of source code. If the code is found likely-to-be-defective, then we also identify the various properties of apparent defects.
• We propose a new set of software metrics (named PROCON) which we show are effective in capturing the programming decisions employed in the construction of software. These metrics comprise the PROgramming language CONstructs (PROCON) usage data measured for the given source code.
• We create a large dataset of PROCON metrics' values by processing more than 30000 source files taken from 20+ Open Source Software (OSS) repositories at GitHub. These source files were written in four major programming languages, viz., C, C++, Java, and Python.
• We implement our technique in an open-access tool called DESCo (Defect Estimator for Source Code), which can take a source file as input and suggest the defectiveness of that source code in the form of a Likert scale value along with the properties of the likely defects.
• We empirically evaluate DESCo by two methods: a) Testing it with known pairs of input source files and expected output, such that the input source files are not part of the PROCON dataset used by DESCo. This experiment is carried out in a lab environment. b) We conduct controlled experiments with ten professional programmers in an industrial setting. The results show that DESCo can correctly detect the defectiveness of the input source files. Fig. 10 shows the output obtained for one such test file (downloaded from a defect reporting engine). The highest accuracy (in the form of the MeanAcc score or Confidence) of 95.9% is achieved for the models estimating the presence of defects in a source file.
The organization of the rest of the paper is as follows. §2 describes the details of the proposed system. To verify the effectiveness of our system, we conducted several experiments, which are discussed in §3. §4 describes the related work. Conclusions drawn from our work are presented in §5.
2 PROPOSED SYSTEM
Two of the primary artefacts which can be used for determining the defectiveness of source code are a) the source code
itself for the program and b) the details of the defects associated with it.
The plethora of OSS available today at various OSS repository hosts, such as GitHub1 and the Apache Software Foundation (ASF)2, can be easily leveraged to extract the above two artefacts. These OSS hosts [15] also provide access to the details of defects associated with the software. Our system makes use of such data available from OSS repository hosts.
Table 1. Table of Notations
L ≜ The set of programming languages {C, C++, Java, Python}.
G ≜ The set of grammars associated with the programming languages L.
R ≜ The set of OSS repositories downloaded from GitHub.
B ≜ The set of defect reports associated with R.
f^R_λ ≜ A source file written in programming language λ ∈ L, and which is a part of R.
X^R_λ ≜ The feature set built by extracting PROCON metrics' values from source files f^R_λ, ∀λ ∈ L.
X^B_λ ≜ The feature set built by extracting defect information from defect reports β, such that (∀β ∈ B) ∃ f^R_λ, where λ ∈ L.
D_λ ≜ Database comprising X^R_λ and X^B_λ, where λ ∈ L.
Z ≜ The set of programming constructs considered in our work. For instance, if, switch, for, and while.
K^d_λ ≜ The set of source files f^R_λ which have at least one defect (d) reported against them, such that (∀ f^R_λ ∈ K^d_λ) ∃ β ∈ B, where λ ∈ L.
K^w_λ ≜ The set of source files f^R_λ which are without (w) any defect information associated with them, such that (∀ f^R_λ ∈ K^w_λ) ∄ β ∈ B, where λ ∈ L.
K_λ ≜ The complete set of files extracted from R, such that |K_λ| = |K^d_λ| + |K^w_λ| + θ, where θ represents the files which are either empty or very small in size, and are thus not included in the dataset.
M ≜ The mappings between source files f^R_λ and defect reports β, such that β ∈ B and λ ∈ L.
H ≜ The set of complete paths of f^R_λ, where λ ∈ L.
A ≜ The set of considered ML algorithms (discussed in §2.3.3).
Π ≜ The set of considered tuning parameters associated with A.
T ≜ The set of defect estimation tasks performed using DESCo (discussed in §2.3.1). For instance, detecting whether an input source file is likely-to-be-defective or not.
∆_{λ,ψ} ≜ The set of ML models built to perform the defect prediction task ψ when dealing with source files written in λ. Such an ML model is obtained by training an ML algorithm α ∈ A with tuning parameters π ∈ Π on the dataset D_λ.
E ≜ The set of evaluation metrics {F1 score, ROC area}.
For a programming language λ (∈ L), our system can be visualized as comprising two key elements:
(1) A large enough dataset, D_λ, of suitably crafted feature sets, X^R_λ and X^B_λ, about the source code (R) and its associated defects (B), respectively.
(2) A suitably designed system S, that picks the best performing Machine Learning (ML) model ∆_{λ,ψ} for performing a prediction task ψ when dealing with source files written in λ.
1 https://github.com/
2 https://www.apache.org/
Fig. 1. Details of the proposed system. (The block diagram shows the PROCON Dataset Builder, which processes OSS repositories and their defect reports into the PROCON dataset, and the DESCo system, in which the Query Interface accepts a source file and a selected query from the end user, the Model Builder trains ML models for various parameter combinations of ML techniques, the Model Selector picks the best performing models, and the Estimation Engine produces the estimates returned to the user.)
Fig. 1 shows the logical structure of the proposed system. There are two major subsystems here:
• The first one is the PROCON dataset builder. It extracts the necessary information from various OSS repositories and defect-tracking portals. The complete procedure for building the dataset is discussed in §2.2.
• Next, we build the DESCo system. For a given source file f and the dataset (PROCON), DESCo estimates the defectiveness of the source code f. The details of how DESCo works are described in §2.3.
2.1 Pivotal design decisions
We need to address the following crucial design issues and questions to realize the system outlined in Fig. 1:
(1) Source files written in which programming languages should be chosen for building the dataset used in our system? Will the overall design of our system be dependent on the choice of programming language?
(2) How to choose the language constructs for various programming languages?
(3) What source code metrics can be used to measure the usage of various constructs of a language reliably?
(4) What criteria to use for selecting the OSS repositories for training our ML models?
(5) What attributes of the defects should and can be estimated?
(6) What ML algorithms should we use for various tasks? How to choose the values for the tunable parameters of those algorithms?
(7) How to measure and verify the prediction accuracy of a model for a given task?
In the following sections, we describe how we build the various subsystems while also discussing how we address each of the above issues.
2.2 Creating the PROCON dataset
In §1.1 we described how various researchers have successfully leveraged features derived from the syntactic properties of source code for various types of prediction tasks that use source code as the input.
In the preceding sections, we have highlighted the influence of programming decisions on the quality of software. Further, [14] have shown that the choice of features used in building the dataset employed to train a model ∆ dramatically influences the performance of ∆ on predictive tasks.
Table 2. Programming constructs employed in PROCON metrics, grouped by category (constructs are drawn from C, C++, Java, and Python)
• Conditional statements: if; switch
• Iterative statements: for, while; do-while
• Control statements: return, break, continue; goto
• Exception statements: try, catch, throw; raise, with-stmt
• Scope statements: extern, static; global
• User-defined constructs: struct, enum, function; union; class
• Identifiers & operators: identifiers; arithmetic, logical & relational operators3
Given the above findings and our hypothesis as outlined in §1.2, we crafted a new set of metrics (called PROCON) which can be used as features for deriving a source code representation that can be utilized to estimate the defectiveness of source code. The metrics are listed in Table 3.
2.2.1 Choice of programming language and its constructs. For creating the PROCON dataset, we chose four programming languages (viz., C, C++, Java, and Python) for the input source files. One of the reasons for choosing these languages is their popularity with professional programmers [16] and the availability of a large volume of OSS source code written using these languages4.
The list of constructs that we have used is shown in Table 2. We performed a token-frequency and token-location analysis of a large corpus of source files. Our analysis has shown that the language constructs that we have selected for our metrics are the ones that are almost always present, and spread reasonably randomly, in the source code written using these programming languages. For example, it is hard to imagine a real-world program which does not employ the decision control (if-else) and looping constructs (such as for and while) at some point. The constructs that are used commonly and at uniformly random locations in the code can serve as robust features for an accurate representation of the source code [17, 18].
One may argue that the defects in the program are more likely to be caused by the usage of the infrequently used constructs; hence, leaving out the sparingly used constructs may adversely affect the ability to capture defectiveness. We would like to note that through these metrics, we are merely trying to compute a representation of the input source code. The metrics by themselves cannot indicate defectiveness. An ML model trained using a suitable algorithm and a properly labelled input dataset is what estimates the defectiveness of unseen source code.
3 Operators considered are: {division, plus, not, multiply, minus, less, modulus, negation, assignment}
4 Sources of stats: https://octoverse.github.com/projects.html and https://githut.info/
Table 3. Details of the newly crafted software metrics (PROCON metrics). The value in parentheses is the metric value for the Z = if construct used in Fig. 3 and Table 4.
maxZCount: maximum number of times a construct Z is used in a source code (maxIfCount: 2)
minZCount: minimum number of times a construct Z is used in a source code (minIfCount: 0)
avgZCount: average number of times a construct Z is used in a source code (avgIfCount: 1)
stdDevZCount: standard deviation of the number of times a construct Z is used in a source code (stdDevIfCount: 0.816)
maxZDepth: maximum depth at which a construct Z is used in the AST of the source code (maxIfDepth: 11)
minZDepth: minimum depth at which a construct Z is used in the AST of the source code (minIfDepth: 0)
avgZDepth: average depth at which a construct Z is used in the AST of the source code (avgIfDepth: 7.25)
stdDevZDepth: standard deviation of the depth at which a construct Z is used in the AST of the source code (stdDevIfDepth: 4.264)
maxZLength: maximum lexical length of a construct Z used in a source code (maxIfLength: 118)
minZLength: minimum lexical length of a construct Z used in a source code (minIfLength: 0)
avgZLength: average lexical length of a construct Z used in a source code (avgIfLength: 65.5)
stdDevZLength: standard deviation of the lexical length of a construct Z used in a source code (stdDevIfLength: 49.088)
Further, in our metrics, we also use the depth (indicating the location) in the program's AST at which a construct has been used. It helps maintain the uniqueness of the representation of an input source code. For instance, consider the code fragments shown in Fig. 2: both fragments use the same constructs but with a different ordering (and thus at different depths), and hence are different. Although the code fragments shown in Fig. 2 have the same values for the count and length metrics of the constructs, the depth metrics help to differentiate between the two.
(a) Program fragment 1 (b) Program fragment 2
Fig. 2. AST depth: an important PROCON metric
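Since the program fragments of Fig. 2 are not reproduced here, the following hypothetical pair (a minimal sketch, not the paper's exact fragments) illustrates the idea: both fragments use one for, one if, and one while, so their count metrics coincide, but the constructs are nested in a different order and therefore occur at different AST depths.

```python
# Both fragments use the same constructs (for, if, while) the same number of times;
# only the nesting order, and hence the AST depth of each construct, differs.

def fragment_1(items):
    for x in items:        # `for` at the shallowest depth
        if x > 0:          # `if` nested inside the `for`
            while x > 0:   # `while` at the deepest level
                x -= 1

def fragment_2(flag, items):
    if flag:                    # `if` at the shallowest depth
        while items:            # `while` nested inside the `if`
            for x in items:     # `for` at the deepest level
                print(x)
            items = items[:-1]
```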
We build separate datasets of the PROCON metrics for each programming language that we considered. Thus, the overall design of our system is independent of the programming languages that we consider.
2.2.2 Choice of metrics. At the source code level, a common manifestation of programming decisions is the usage pattern of various programming constructs. Such usage patterns can be derived or detected from lexical properties such as the count, depth, and length of a construct as used in a source file.
(a) Program fragment 1 (b) Program fragment 2 (c) Program fragment 3 (d) Partial screenshot of the ANTLR-generated AST corresponding to Fig. 3a
Fig. 3. Example 1: Program fragments to explain the PROCON metrics listed in Table 3
Table 4. Computing the PROCON metric values for the program fragments shown in Fig. 3
Programming construct: if. Lexical measures per program fragment, followed by the resulting PROCON metric values (max, min, avg, stdDev):
Count — Program 1: 2; Program 2: 0; Program 3: 1 → max 2, min 0, avg 1, stdDev 0.816
Depth — Program 1: 9, 11; Program 2: 0; Program 3: 9 → max 11, min 0, avg 7.25, stdDev 4.264
Length — Program 1: 79, 118; Program 2: 0; Program 3: 65 → max 118, min 0, avg 65.5, stdDev 49.088
As such, the PROCON metrics that we have introduced (see Table 3) are derived from the lexical properties count, depth, and length associated with the usage of different programming constructs Z. Some examples of Z include the if, while, for, and try constructs. The property depth associated with Z refers to the depth at which Z occurs in the Abstract Syntax Tree (AST) of the source code under consideration. For instance, for the program fragment referred to by Fig. 3a, the depths of the two if constructs shown in the respective AST of the program in Fig. 3d are 9 and 11, respectively. The lexical length of a construct Z is defined as the total count of characters present in the construct definition
or declaration (in the case of variables) statement. For instance, the lexical length of a for construct would mean the sum total of all the characters (excluding spaces) present in its definition (or body). Table 2 gives the complete category-wise list of programming constructs considered by us.
For illustration, we have worked out the computation of the PROCON metrics for the three program fragments shown in Fig. 3. The programming construct considered in this example is the if construct. To compute the PROCON metrics, we first find the lexical measure values corresponding to the if construct used in the program fragments. We then compute the PROCON metric values by applying statistical measures such as maximum (max), minimum (min), average (avg), and standard deviation (stdDev) over the computed lexical measure values (shown in Table 4). Table 3 gives the formal definitions of all the PROCON metrics with their corresponding values obtained for the program fragments shown in Fig. 3.
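A minimal sketch of this computation is given below. It assumes the per-occurrence lexical measures of the if construct have already been pooled across the three fragments of Fig. 3 (values taken from Table 4); whether the population or the sample standard deviation is used is not stated explicitly, so the population form (statistics.pstdev) is assumed here.

```python
from statistics import mean, pstdev

# Pooled lexical measures of the `if` construct for the three fragments of Fig. 3 (Table 4);
# a value of 0 is recorded for fragment 2, which contains no `if`.
if_counts  = [2, 0, 1]
if_depths  = [9, 11, 0, 9]
if_lengths = [79, 118, 0, 65]

def procon_metrics(construct, measure, values):
    """Apply max/min/avg/stdDev over the pooled lexical measures of one construct."""
    return {
        f"max{construct}{measure}": max(values),
        f"min{construct}{measure}": min(values),
        f"avg{construct}{measure}": mean(values),
        f"stdDev{construct}{measure}": round(pstdev(values), 3),
    }

print(procon_metrics("If", "Count", if_counts))    # maxIfCount 2, minIfCount 0, avgIfCount 1
print(procon_metrics("If", "Depth", if_depths))    # maxIfDepth 11, avgIfDepth 7.25
print(procon_metrics("If", "Length", if_lengths))  # maxIfLength 118, avgIfLength 65.5
```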
2.2.3 Structure of the PROCON dataset. As described in the preceding section, the PROCON dataset consists mainly of the following information:
• Measured values of the lexical properties representing usage patterns of various programming language constructs in the source files of our corpus.
• Information about the defect reports linked with the source files of our corpus.
Table 5. Description of source files present in various datasets
Columns: dataset for language; defect-linked file count5 (|K^d_λ|); files without defect information6 (|K^w_λ|); total files extracted (|K_λ| = |K^d_λ| + |K^w_λ| + others); total files in the dataset (|D_λ| = 2 * min(|K^d_λ|, |K^w_λ|)).
C: 3224; 3468; 11718; 6448 (= 3224 * 2)
C++: 174; 6827; 7202; 348 (= 174 * 2)
Java: 318; 7076; 7500; 636 (= 318 * 2)
Python: 1375; 2048; 4043; 2750 (= 1375 * 2)
Combined7: 5091; 19419; 30443; 1392 (= 174 * 4 * 2)
For each of the considered programming languages (viz., C, C++, Java, and Python), we build a separate dataset using the PROCON metrics to train various ML models. Table 5 gives the details of the source files from which the datasets have been created. K^d_λ represents the set of files linked with defect reports, and K^w_λ represents the files not linked with any defect reports. K_λ represents the complete set of files extracted from online portals. This consists of the files represented by K^d_λ, K^w_λ, and the files not considered for building the dataset, such as empty files or very small files. An equal number of random files are combined from both K^d_λ and K^w_λ to form the dataset D_λ. All the datasets follow the same high-level schema shown in Fig. 4. The description of the tables is provided in Appendix-A.2.
We consider a total of about 30400 source files from 20 different GitHub repositories for creating the dataset. Further, approximately 14950 defect reports associated with these source files were extracted to capture the characteristics of defects. Tables 5 and 6 describe the composition of the dataset by providing details such as the type of source files, OSS repositories, and defect tracking portals chosen for building the dataset. Our dataset is persisted in the form of a relational database and is available at http://bit.ly/2JFDlH9.
5 Candidate defect-linked files are obtained by removing very small or empty files from the category of defect-linked files present in the dataset.
6 Candidate non-defect-linked files are obtained by removing very small or empty files from the category of non-defect-linked files present in the dataset and by filtering the files in the same file-length range as in the defect-linked files category.
7 The Combined dataset is formed by taking an equal proportion of source files from each of the considered languages.
Fig. 4. Partial schema showing main entities of dataset.
Table 6. Details of GitHub repositories and associated defect reports
Columns: OSS repository; total source files; defect tracking portal; total reported defects; total source files linked to defects.
ant-master: 1220; Apache; 653; 96
batik-trunk: 1650; Apache; 251; 104
commons-bcel-trunk: 485; Apache; 29; 2
lenya-BRANCH-1-2-Z: 430; Apache; 234; 16
webdavjedit-master: 7; Apache; 0; 0
poi-trunk: 3284; Apache; 103; 86
pengyou-clients-master: 198; Apache; 15; 11
gcc-master: 18524; GCC; 6087; 3296
org.eclipse.paho.mqtt.python-master: 40; Eclipse; 77; 4
paho.mqtt.embedded-c-master: 33; Eclipse; 3; 2
paho.mqtt.-java-master: 228; Eclipse; 5; 3
cpython-master: 1336; Python; 3235; 525
bedevere-master: 16; Python; 82; 4
mypy-master: 182; Python; 253; 31
peps-master: 18; Python; 20; 4
planet-master: 16; Python; 35; 3
Python-2.7.14: 1325; Python; 1521; 443
Python-3.6.3: 1284; Python; 1712; 479
typeshed-master: 3; Python; 0; 0
pythondotorg-master: 168; Python; 642; 47
Total count: 30447 source files; 4 defect tracking portals; 14957 reported defects; 5156 defect-linked source files
2.2.4 Steps for building the PROCON dataset. We extracted the values of the PROCON metrics by processing a large corpus of source files written in different programming languages. To estimate the defectiveness of source code, we also extracted the relevant information from defect reports which reference the source files that we considered. The following are the main steps involved:
(1) Selecting the source code repository hosts to extract source code: OSS repository hosts such as GitHub and SourceForge contain a large number of OSS repositories. We selected GitHub to fetch the OSS repositories due to the following reasons:
Algorithm 1 Establishing the mapping between defect reports and source files
Input: R := the set of OSS repositories downloaded from GitHub.
B := the set of defect reports associated with R.
L := the set of programming languages {C, C++, Java, Python}.
Output: M := the mappings between source files f (∀f ∈ R) and defect reports β (∀β ∈ B).
{Filter relevant source files}
1: H ← ϕ
2: for all source files f ∈ R do
3:   if f is written in λ, and λ ∈ L then
4:     H ← H ∪ f.path
5:   end if
6: end for
{Identify names of affected source files from defect reports}
7: M ← ϕ
8: for all β ∈ B do
9:   if β has an associated patch then
10:     µβ ← scrapePatchInfoForDefect(β.id)
11:     Fµβ ← extractAffectedSourceFileNames(µβ)
12:     for all file names f.name ∈ Fµβ do
13:       M ← M ∪ ⟨f.name, β.id⟩
14:     end for
15:   end if
16: end for
{Update M to replace the file name with its full path}
17: for all f.name ∈ M.keys() do
18:   β.id ← M[f.name]
19:   for all file paths p ∈ H do
20:     if p contains f.name then
21:       Delete M[f.name]
22:       M ← M ∪ ⟨p, β.id⟩
23:     end if
24:   end for
25: end for
(a) SourceForge.net has discontinued its Version Control System (VCS) support8.
(b) GitLab and BitBucket, being newer, have very few OSS repositories present on them.
(c) A significant amount of research work [19] utilizes GitHub data, thus making GitHub a reliable source of information.
Therefore, we took the bulk of the raw content (source files) from GitHub for building our dataset.
(2) Selecting the defect tracking portals to fetch the defect information: To obtain the information on defects associated with the various OSS repositories fetched from GitHub, we downloaded the defect reports from four different bug tracking portals, viz., Apache Bugzilla9, Eclipse
8 https://sourceforge.net/blog/decommissioning-cvs-for-commits/
9 https://bz.apache.org/bugzilla/
Bugzilla10, GCC GNU Bugzilla11, and the Python bug tracker12. We selected these defect tracking portals due to the easy availability of defect reports and their wider use in the existing literature [19, 20].
(3) Selecting the OSS repositories: The following are the main criteria we considered while selecting an OSS repository:
(a) Size constraint: The size of the repository should be greater than a threshold (≥ 2 MB). Please note that repository size here stands for the total size of only the source files present in the repository. All binary files, such as multimedia and libraries, are excluded.
(b) Source file count constraint: The repository should contain at least one source file written in the considered programming languages, viz., C, C++, Java, and Python.
(c) Defect report count constraint: The repository should be associated with at least one defect report.
(d) Reputation constraint: The repository should have earned 500+ stars. This constraint was applied to ensure that the selected repositories are popular and are being used by a certain mass of programmers.
To eliminate programming language selection bias, we decided to select the repositories containing files written in one or more of the considered programming languages. These languages included Java, Python, C++, and C. These are among the most popular [16] enterprise programming languages used for creating software. The defect reports associated with these OSS projects were also extracted.
(4) Establishing the mapping between the defect reports and the source files: To find the source files associated with defect reports, we utilize the summary and the patch fields of the defect reports. We fetch this information (patch and summary) by scraping the defect reports available at the different defect tracking portals. The patch information associated with a defect report contains at least one mention of the affected source file. We use this information to establish a mapping between the source files present in the OSS repositories and the defects reported corresponding to them, as explained in Algorithm 1. We store the mapping information in the SourceFileTodefectMapping table described in §2.2.3.
(5) Feature extraction technique: To perform the feature extraction, we built a custom FeatureExtractor module using the ANTLR (ANother Tool for Language Recognition) [21] library. The FeatureExtractor module uses the grammar of a programming language to build its parsing program, using which the required features are extracted from an input source file (a minimal sketch is given at the end of §2.2.5).
The computed metrics (u) are stored in a database as shallow knowledge in the SourceCodeFeatures table (shown in Fig. 4). For each source file which is referred to in a defect report, we extract information such as priority, status, type, and user exposure from that defect report and store that information in the defectInfo table (shown in Fig. 4).
(6) Refine the shallow knowledge: The refining process involves, among other tasks, eliminating any bias such as that introduced due to the use of source files written only in one language, the presence of outliers in the form of file size or a specific feature, and so on.
We perform the following steps for refining the shallow knowledge (a minimal sketch follows this list):
(a) Filtering only those source files which are of similar size.
(b) Normalizing the attribute values on a suitable scale. For example, computed feature values for different source files can be normalized w.r.t. the file length.
(c) Removing the bias towards individual features by using the MinMaxScaler13 function of ScikitLearn [22].
The result of this step is our final dataset, persisted in the form of the RefinedFeatures table (shown in Fig. 4).
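The sketch below illustrates steps (b) and (c) of the refinement, assuming the shallow-knowledge metrics have been loaded into a pandas DataFrame; the column names are illustrative and do not reflect the actual schema of the SourceCodeFeatures table.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy shallow-knowledge table; real rows come from the SourceCodeFeatures table.
shallow = pd.DataFrame({
    "file_length": [1200, 3400, 800],
    "maxIfCount":  [2, 14, 1],
    "avgIfDepth":  [7.25, 10.0, 3.5],
})

# (b) Normalize feature values w.r.t. the file length so that longer files do not dominate.
per_length = shallow[["maxIfCount", "avgIfDepth"]].div(shallow["file_length"], axis=0)

# (c) Rescale every feature to [0, 1] with MinMaxScaler to remove per-feature bias.
refined = pd.DataFrame(MinMaxScaler().fit_transform(per_length), columns=per_length.columns)
print(refined)
```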
10 https://bugs.eclipse.org/bugs/
11 https://gcc.gnu.org/bugzilla/
12 https://bugs.python.org/
13 http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
Fig. 5. PROCON dataset builder
2.2.5 PROCON dataset builder. Fig. 5 shows the architecture of this module. The dataset produced as the output of the processing carried out by this module is stored in a relational database whose schema is shown in Fig. 4. The database holds the dataset content.
The major components of the PROCON dataset builder are described below:
(1) LexPar module: Given a set of grammars G, ANTLR [21] generates the Lexers and Parsers corresponding to each g ∈ G. The Lexer and Parser artefacts provide an API using which one can extract various lexical properties of a given source file. These Lexers and Parsers make up the LexPar module, which is used by the feature extraction module (described next).
(2) Feature Extraction module: This is a higher-level program developed to extract the values of various features from input source files. It makes use of the LexPar module to perform its task. The following are its main steps: for each source file f selected from the collection of OSS repositories (R), do:
• Identify the programming language λ (∈ L) in which the source file f is written. This is done by checking the file extension.
• Call the API in the LexPar module to extract the PROCON metric values.
• Store the extracted values in the database table (SourceCodeFeatures) of the PROCON dataset schema.
(3) Defect Information Collector: The input to this module is a set of defect reports B. A defect report β ∈ B is downloaded from the defect tracking portal of the source code repository, and in most cases, it is available in a text format such as a CSV file. β contains attributes such as the unique ID assigned to the defect, the severity assigned to the defect, the type of the defect, a detailed description, and the affected source files.
This module processes each β (∀β ∈ B) to extract such details, and persists them in the defectInfo table of the PROCON dataset.
(4) Patch Scraper: To establish the link between source files and the associated defects, we first need to identify the files which are possibly affected by the defect. Usually, when a defect is fixed, the developers update the defect report with the details of the files modified to fix the defect. This information is typically supplied in the patch and
comments fields of the defect report. If such patch information is not available in the text format (e.g., CSV file) dump of defects from a defect tracking portal, then we obtain that information by scraping the portal.
Given a set of input defect reports B, the patch scraper extracts the necessary information (present in HTML markup) from the respective web page corresponding to each defect report β ∈ B. The scraped information is then utilized (see the next item) to establish a mapping between the input source files and the defect reports.
(5) Mapping Builder: Given a collection of source repositories R and a set of defect reports B associated with R, this module establishes the mapping between the source files f ∈ R and the defect reports β ∈ B. This mapping is necessary to study the defect characteristics associated with the defects reported in various source files. To establish the mapping, we utilize the patch and the summary portions of the defect reports, extracted using the Patch Scraper. The mapping information is stored in the SourceFileTodefectMapping table (shown in Fig. 4).
(6) Information Refiner: This module filters out non-essential information from the dataset. The goal of this refining is to derive a normalized dataset which is free from bias. The information thus obtained is stored in the RefinedFeatures table (shown in Fig. 4).
The implementation details of the various modules are described in Appendix-A.1.
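As an illustration of the LexPar and Feature Extraction modules, the following sketch extracts the per-occurrence AST depth and lexical length of one construct from a Java file using the ANTLR Python runtime. The generated JavaLexer/JavaParser modules, the compilationUnit start rule, and the rule name passed to the collector are assumptions that depend on the grammar actually used; the real FeatureExtractor covers all constructs of Table 2 and all four languages.

```python
from antlr4 import FileStream, CommonTokenStream, ParseTreeListener, ParseTreeWalker
from JavaLexer import JavaLexer    # ANTLR-generated artefacts; module names are assumptions
from JavaParser import JavaParser

def ast_depth(ctx):
    """Number of ancestors of a parse-tree node, used as the construct's AST depth."""
    depth = 0
    while ctx.parentCtx is not None:
        ctx, depth = ctx.parentCtx, depth + 1
    return depth

class ConstructCollector(ParseTreeListener):
    """Records the depth and lexical length of every occurrence of one grammar rule."""
    def __init__(self, parser, rule_name):
        self.parser, self.rule_name = parser, rule_name
        self.depths, self.lengths = [], []

    def enterEveryRule(self, ctx):
        if self.parser.ruleNames[ctx.getRuleIndex()] == self.rule_name:
            self.depths.append(ast_depth(ctx))
            self.lengths.append(len(ctx.getText()))  # character count, whitespace excluded

def extract_measures(path, rule_name):
    parser = JavaParser(CommonTokenStream(JavaLexer(FileStream(path, encoding="utf-8"))))
    tree = parser.compilationUnit()              # start rule; depends on the grammar in use
    collector = ConstructCollector(parser, rule_name)
    ParseTreeWalker().walk(collector, tree)
    return collector.depths, collector.lengths

depths, lengths = extract_measures("Example.java", rule_name="statement")
```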
2.3 The DESCo sub-system
DESCo is an ML-based system which performs various defect estimation tasks for an input source file. The key components of DESCo are described in §2.3.2. Identifying the best ML model for a given estimation task is a key processing step of DESCo, and has been formulated as an optimization problem (discussed in §2.3.5). §2.3.4 describes the evaluation metrics used to measure the performance of the ML models comprising DESCo.
2.3.1 Goals of the DESCo sub-system. Given an input source file, the DESCo sub-system determines two things about it:
• Phase-1: Whether the input source code is likely-to-be-defective. We do not determine the absence of defects.
• Phase-2: If found likely-to-be-defective, then the system identifies the probable attributes of the apparent defects. The attributes include the defect's severity, category (e.g., performance, functionality, and crash), and so on.
It seeks to provide the answers to the questions mentioned in the following scenarios.
What is the likelihood that the input source code may contain defects:
(1) of a specific priority or severity (for instance, critical, high, low, or medium)?
(2) of a specific type (for instance, enhancement or defect-fix)?
(3) that occur mostly on a specific operating system (OS)?
(4) that occur mostly on specific hardware?
(5) that involve a specific level of user exposure (measured, for instance, via the number of comments on the defect reports)?
Why only this set of scenarios? These scenarios were chosen so as to reflect the important characteristics of the defects across multiple defect tracking platforms (e.g., ASF Bugzilla, Eclipse Bugzilla, GCC Bugzilla, and the Python defect tracker). It is assumed that the defect characteristics captured by these platforms are well-chosen, keeping their usefulness for developers in mind. To the best of our knowledge, such use of defect characteristics is not reported in the current literature.
The task of Phase-2 can be easily repeated for additional characteristics of defects. Further, it is not difficult to replicate (we plan to do it in future) the experiments for predicting defectiveness along with various qualitative aspects of programs such as:
• Which programming language is more likely to result in defects in a given scenario (for instance, on a certain OS)?
• Which type of defects is most likely with the use of a specific programming language, on a particular OS or hardware?
2.3.2 Key modules of DESCo. These are shown in Fig. 1 (on page 6), and are described as follows:
(1) Query Interface: This module provides an interactive interface to clients (human users and programs). The offered interfaces include a GUI and an API which allow the callers to specify a query type from a given list. The complete list of query types, their meaning, and the expected input and output are described in §2.3.1. The contents of a source file for which defectiveness is to be estimated is one of the inputs.
(2) Estimation Engine: This is the main module that performs the defect estimation for an input source file for a given scenario. To do so, it performs the following steps (a minimal sketch is given at the end of §2.3.2):
(a) First, it uses the Feature Extraction module to extract the features of the input source file.
(b) Next, it calls the Model Selector to select the best performing ML model for the given source code feature set and input scenario.
(c) Finally, it uses the selected best performing model to perform the various defect estimation tasks.
(3) Model Selector: In essence, DESCo estimates the defectiveness of an input source file (written in C, C++, Python, or Java) by performing ML classification via the most suitable ML model. Since different ML techniques outperform others in different scenarios [14], accurate classification requires using an ML model which performs the best for a given scenario. Thus, a crucial step is to identify the best performing models for each of the estimation tasks (or scenarios) specified in §2.3.1.
To find such models, we iterate through various parameter combinations of different ML algorithms (listed in §2.3.3). This model selection problem can be solved as an optimization problem, which we formulate in §2.3.5. This module implements a solver for such an optimization problem.
(4) Model Builder: It builds various ML models by training on the PROCON dataset using different parameter combinations of ML techniques. This module is used in conjunction with the Model Selector.
The implementation details of the DESCo modules are provided in Appendix-A.3.
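A minimal sketch of the Estimation Engine flow is shown below, assuming the best-performing models selected in §2.3.5 have been persisted (e.g., with joblib); the helper extract_procon_features and the model file-naming scheme are hypothetical placeholders, not DESCo's actual API.

```python
import joblib

def extract_procon_features(source_path):
    # Placeholder: in DESCo this is the Feature Extraction module of §2.2.5,
    # which returns the PROCON metric vector for the given file.
    raise NotImplementedError

def estimate_defectiveness(source_path, language="java", task="phase1"):
    features = extract_procon_features(source_path)
    model = joblib.load(f"models/{language}_{task}.pkl")   # best model picked by the Model Selector
    label = model.predict([features])[0]                   # binary classifier of §2.3.3
    return "likely-to-be-defective" if label == 1 else "unknown"
```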
2.3.3 Details of ML algorithms used by DESCo. Training and testing for the estimation tasks are performed using a variety of parameter combinations of the following 12 ML classification techniques:
(1) Linear SVM (LSVM) [23, 24]
(2) SVM [23, 25]
(3) Nu-SVM (NSVM) [23, 26]
(4) Gaussian Process (Gauss) classifier [27, 28]
(5) K Nearest Neighbors (KNN) classifier [29, 30]
(6) Random Forest (RF) classifier [31, 32]
(7) Multi-Layer Perceptron (MLP) classifier [33, 34]
(8) Supervised Deep Belief Network (DBN) classification [35, 36]
(9) Logistic Regression [37, 38]
(10) Bernoulli Naive Bayes [39, 40]
(11) Multinomial Naive Bayes [39, 41]
(12) Gaussian Naive Bayes [42, 43]
Table 7. Parameter combinations of different ML techniques used in different phases of our approach
Key Phase 1 Phase 2
a LSVM(0.1, ‘ovr’) LSVM(1.0, ‘ovr’)
b LSVM(0.1, ‘ovo’) LSVM(1.0, ‘ovo’)
c LSVM(1.0, ‘ovr’) LSVM(0.1, ‘ovr’)
d LSVM(1.0, ‘ovo’) LSVM(0.1, ‘ovo’)
e SVM(1.0, ‘ovr’, ‘l’) SVM(1.0, ‘ovo’, ‘l’)
f SVM(1.0, ‘ovo’, ‘l’) SVM(10, ‘ovo’, ‘r’, 0.2)
g SVM(10, ‘ovr’, ‘r’, 0.2) SVM(1.0, ‘ovo’, ‘p’, 2)
h SVM(10, ‘ovo’, ‘r’, 0.2) SVM(1.0, ‘ovo’, ‘p’, 3)
i SVM(1.0, ‘ovr’, ‘p’, 2) SVM(1.0, ‘ovo’, ‘s’)
j SVM(1.0, ‘ovo’, ‘p’, 2) NSVM(0.5, ‘ovo’, ‘l’)
k SVM(1.0, ‘ovr’, ‘p’, 3) NSVM(0.1, ‘ovo’, ‘l’)
l SVM(1.0, ‘ovo’, ‘p’, 3) NSVM(0.5, ‘ovo’, ‘r’, 0.2)
m SVM(1.0, ‘ovr’, ‘s’, 3) NSVM(0.1, ‘ovo’, ‘p’, 2)
n SVM(1.0, ‘ovo’, ‘s’, 3) NSVM(0.5, ‘ovo’, ‘p’, 3)
o RF(10) NSVM(0.1, ‘ovo’, ‘p’, 3)
p RF(5) NSVM(0.5, ‘ovo’, ‘s’)
q RF(7) NSVM(0.7, ‘ovo’, ‘s’)
r RF(20) Gauss(‘r’, ‘ovo’)
s RF(50) KNN(‘e’)
t RF(100) KNN(‘m’)
u NSVM(0.7, ‘l’, ‘ovo’) SVM(1.0, ‘ovr’, ‘l’)
v NSVM(0.7, ‘l’, ‘ovr’) SVM(10, ‘ovr’, ‘r’, 0.2)
w NSVM(0.5, ‘l’, ‘ovo’) SVM(1.0, ‘ovr’, ‘p’, 2)
x NSVM(0.5, ‘l’, ‘ovr’) SVM(1.0, ‘ovr’, ‘p’, 3)
y NSVM(0.7, ‘r’, ‘ovo’) SVM(1.0, ‘ovr’, ‘s’)
z NSVM(0.5, ‘r’, ‘ovo’) NSVM(0.5, ‘ovr’, ‘l’)
A NSVM(0.7, ‘r’, ‘ovo’, 0.2) NSVM(0.5, ‘ovr’, ‘r’, 0.2)
B NSVM(0.5, ‘r’, ‘ovo’, 0.2) NSVM(0.5, ‘ovr’, ‘p’, 2)
Key Phase 1 Phase 2
C NSVM(0.7, ‘r’, ‘ovr’) NSVM(0.5, ‘ovr’, ‘p’, 3)
D NSVM(0.5, ‘r’, ‘ovr’) NSVM(0.5, ‘ovr’, ‘s’)
E NSVM(0.7, ‘r’, ‘ovr’, 0.2) NSVM(0.7, ‘ovr’, ‘l’)
F NSVM(0.5, ‘r’, ‘ovr’, 0.2) NSVM(0.7, ‘ovr’, ‘r’, 0.2)
G NSVM(0.7, ‘s’, ‘ovo’) NSVM(0.7, ‘ovr’, ‘p’, 2)
H NSVM(0.5, ‘s’, ‘ovo’) NSVM(0.7, ‘ovr’, ‘p’, 3)
I NSVM(0.7, ‘s’, ‘ovr’) NSVM(0.7, ‘ovr’, ‘s’)
J NSVM(0.5, ‘s’, ‘ovr’) Gauss(‘r’, ‘ovr’)
K NSVM(0.7, ‘p’, ‘ovo’) LSVM(0.5, ‘ovr’)
L NSVM(0.5, ‘p’, ‘ovo’, 0.2) LSVM(10, ‘ovr’)
M NSVM(0.7, ‘p’, ‘ovo’, 0.2) LSVM(0.5, ‘ovo’)
N NSVM(0.5, ‘p’, ‘ovo’, 0.2) LSVM(10, ‘ovo’)
O NSVM(0.7, ‘p’, ‘ovr’) LSVM(5, ‘ovr’)
P NSVM(0.5, ‘p’, ‘ovr’) LSVM(5, ‘ovo’)
Q NSVM(0.7, ‘p’, ‘ovr’, 0.2) RF(10)
R NSVM(0.5, ‘p’, ‘ovr’, 0.2) RF(7)
S Gauss(‘r’, ‘ovo’) RF(20)
T Gauss(‘r’, ‘ovr’) RF(50)
U KNN() RF(75)
V MLP() RF(100)
W MLP(hLayers=100*[10], maxIter=200) MLP()
X LogisticRegression()
Y BernoulliNB()
Z MultinomialNB()
AA GaussianNB()
AB SupervisedDBNClassification(hiddenLayerStructure=[256,256], nIter=100, learningRateRbm=0.05, learningRate=0.1)
We used the implementations of these algorithms as provided by the ScikitLearn [22] library.
The parameter combinations that we tuned for these ML algorithms are described in Table 7. Tuning of algorithm-specific parameters was performed by carrying out several experiments that used different parameter combinations of these ML techniques. Each of the ML models built using these techniques acts as a binary classifier. For instance, when testing for the defectiveness of a source file, an ML model either labels it as likely-to-be-defective (with label = 1) or unknown (with label = 0). Similarly, when estimating the defect characteristics, ML models are trained to test for the presence of each of the characteristic values. For instance, we have built ML models which classify a source file as likely-to-contain high-priority defects or not-likely-to-contain high-priority defects, and similarly for all such scenarios discussed in §2.3.1 and §3.1.
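For instance, a binary target for the "high-priority defects" scenario can be derived from the defect attributes as sketched below; the column names and values are placeholders, not the actual defectInfo schema.

```python
import pandas as pd

defect_info = pd.DataFrame({
    "file_id":  [1, 2, 3, 4],
    "priority": ["high", "low", "high", "medium"],
})

# One binary label vector per characteristic value, here "likely-to-contain high-priority defects".
y_high_priority = (defect_info["priority"] == "high").astype(int)   # 1 = yes, 0 = no
```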
Rationale for choosing the above ML algorithms: Labelling a source file as {Likely-defective, Unknown} falls in the category of binary classification, which can be solved using a supervised learning ML technique [44]. To find the best possible ML technique for our case, we experimented with the 12 techniques listed above. These algorithms are the ones most used for binary classification tasks of diverse kinds [45]. In many cases, they are also the best performing options. Hence, we selected this set of ML algorithms for our experiments. Needless to say, this list can keep evolving as more advances are made in binary classification ML techniques.
Parameter Configurations: A brief description of the pertinent parameters (listed in Table 7) of the different ML algorithms that we tuned is as follows:
(1) Kernel configurations (η): We experimented with all four kernel types: linear (l), radial (r), sigmoid (s), and poly (p).
(2) Penalty (C) configurations: We experimented with both fractional and integral (low and high) penalty values (viz., 0.1, 0.5, 0.7, 1.0, 5.0, and 10.0).
(3) Number of estimators: These are also referred to as the number of decision trees in various ML algorithms. The set of values chosen was: 5, 7, 10, 20, 50, 75, and 100.
(4) Degree of the polynomial (in the case of a poly kernel): We used both fractional as well as integral values of the degree of the polynomial (viz., 2, 3, and 0.2).
(5) Gamma or the kernel coefficient (γ): Both the default value (auto) and a fractional value (0.2) were experimented with for various ML algorithms.
(6) Method of classification: We experimented with both classification method types, viz., one-vs-one (ovo) and one-vs-rest (ovr).
(7) Type of distances: We considered the manhattan (m) and euclidean (e) distance types.
Why only these specific configuration values? Each of the considered ML algorithms has a suggested or expected range of values for its configuration parameters. The parameter values that we have chosen comply with such suggested/expected boundaries and are selected to represent a variation within those boundaries such that it is meaningful for our tasks.
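To make the Table 7 notation concrete, the sketch below maps a few of its entries onto ScikitLearn estimators; the exact constructor arguments used in our implementation are not spelled out in the text, so this mapping is an assumption.

```python
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Key "a" of Phase 1: LSVM(0.1, 'ovr') -- penalty C = 0.1, one-vs-rest classification
lsvm_a = LinearSVC(C=0.1, multi_class="ovr")

# Key "h" of Phase 1: SVM(10, 'ovo', 'r', 0.2) -- C = 10, one-vs-one, radial kernel, gamma = 0.2
svm_h = SVC(C=10, decision_function_shape="ovo", kernel="rbf", gamma=0.2)

# Key "j" of Phase 2: NSVM(0.5, 'ovo', 'l') -- nu = 0.5, one-vs-one, linear kernel
nsvm_j = NuSVC(nu=0.5, decision_function_shape="ovo", kernel="linear")

# Key "s" of Phase 1: RF(50) -- 50 estimators (decision trees)
rf_s = RandomForestClassifier(n_estimators=50)
```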
2.3.4 Evaluation metrics. We selected the F1 score14 and the ROC curve area15 as the evaluation metrics of our work. The F1 score is defined as follows:

F1 score = (2 × Precision × Recall) / (Precision + Recall)    (1)

where Precision is mathematically defined as:

Precision = true positives / (true positives + false positives)    (2)

and Recall is defined as:

Recall = true positives / (true positives + false negatives)    (3)

Since the F1 score captures the effect of both Precision and Recall, we compute only the F1 score values and their respective standard deviation values (or error values). The higher the F1 score, the better the prediction accuracy of the model.
The ROC metric is used to evaluate the quality of the output. An ROC curve is a plot that features the true positive rate (on the Y-axis) vs the false positive rate (on the X-axis) for an experiment. The point at the top-left corner of the plot depicts the point of most 'ideal' behavior, having the {false-positive-rate, true-positive-rate} pair value {0, 1}. Thus, a larger area under the curve signifies a better quality output. We, therefore, select the ROC curve area as our second evaluation metric.
14 http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
15 http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
Since the highest ROC curve area value and the F1 score value may differ across the models, we take the average of the two as the final accuracy measure of a model (MeanAcc, defined in Equation 4). Further, all the results obtained are validated using k-fold cross validation. Higher values of k limit the number of data points in a validation set, and a lower value would increase the risk of bias in the dataset. We, therefore, selected a value of k = 5 for our experiments.
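A minimal sketch of this evaluation protocol with ScikitLearn is given below; synthetic data stands in for a PROCON dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for PROCON features (X) and defectiveness labels (y).
X, y = make_classification(n_samples=200, n_features=12, random_state=0)

scores = cross_validate(SVC(C=10, kernel="rbf", gamma=0.2), X, y,
                        cv=5, scoring=("f1", "roc_auc"))   # 5-fold cross validation
f1, roc = scores["test_f1"].mean(), scores["test_roc_auc"].mean()
print(f"F1 = {f1:.3f} (+/- {scores['test_f1'].std():.3f}), ROC area = {roc:.3f}")
```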
2.3.5 Selecting the best performing model: Problem formulation. We define the best performing ML model as the one with the highest MeanAcc measure value, where MeanAcc is given by Eq. 4. There, T represents the set of defect estimation tasks (mentioned in §2.3.1), A represents the set of ML algorithms, and Π the set of tuning parameter combinations, such that ∆^{α,π}_{λ,ψ} represents the ML model built using the ML algorithm α with the parameter combination π, where α ∈ A, π ∈ Π, and ψ ∈ T. Also, E represents the set of evaluation metrics, viz., F1 score and ROC area, discussed in §2.3.4.

MeanAcc(∆^{α,π}_{λ,ψ} | D_λ) = ( Σ_{k=1}^{|E|} AccVal(E_k, ∆^{α,π}_{λ,ψ} | D_λ) ) / |E|    (4)

where AccVal represents the accuracy measure value of metric E_k obtained by applying the ML model ∆^{α,π}_{λ,ψ} (an ML technique with a particular parameter setting) on the input dataset D_λ (explained in more detail with the help of an example below).
For each of the tasks ψ ∈ T, the problem of finding the best performing model trained on the PROCON dataset D_λ to perform the defect estimation for source files written in λ ∈ L can be defined as follows:

max_{α ∈ A, π ∈ Π} MeanAcc(∆^{α,π}_{λ,ψ} | D_λ)    (5)
Example scenario: For instance, let us consider the scenario of finding the most suitable model for defectiveness estimation using a dataset D_λ. Now, one of the instances of ∆^{α,π}_{λ,ψ} could be using the LSVM technique (α = LSVM) with the tuning parameters (π) set as ⟨0.1, 'ovr'⟩; this corresponds to the parameter combination LSVM(0.1, 'ovr') represented as row 'a' in the "Phase 1" column of Table 7.
For each of the estimation tasks ψ ∈ T, we experiment with different ML algorithms (α ∈ A and π ∈ Π) and report the obtained MeanAcc measure values in Fig. 6. Finally, for each of the listed scenarios, we select the best performing ML model – the one yielding the maximum MeanAcc measure value.
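The selection of Eq. 5 can be sketched as a simple search over candidate ⟨ML technique, parameter⟩ combinations, each scored with MeanAcc (Eq. 4); the candidate list below is a tiny illustrative subset of Table 7, and synthetic data again stands in for a PROCON dataset D_λ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=200, n_features=12, random_state=0)

candidates = {
    "a: LSVM(0.1, 'ovr')":    LinearSVC(C=0.1),
    "h: SVM(10, 'ovo', 'r')": SVC(C=10, kernel="rbf", gamma=0.2, decision_function_shape="ovo"),
    "s: RF(50)":              RandomForestClassifier(n_estimators=50, random_state=0),
}

def mean_acc(model):
    # Eq. 4 with E = {F1 score, ROC area}, estimated via 5-fold cross validation.
    s = cross_validate(model, X, y, cv=5, scoring=("f1", "roc_auc"))
    return (s["test_f1"].mean() + s["test_roc_auc"].mean()) / 2

best_key = max(candidates, key=lambda k: mean_acc(candidates[k]))
print("best performing model:", best_key, "MeanAcc =", round(mean_acc(candidates[best_key]), 3))
```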
3 PERFORMANCE EVALUATION AND COMPARISON
To validate the efficacy of our system, we performed relevant experiments for testing its performance and behaviour. The performance is evaluated by observing the accuracy achieved by the DESCo system when answering our stated prediction questions. The behaviour of DESCo is evaluated on the basis of its functional and non-functional aspects. The functional correctness of the system is evaluated by testing on source file samples and the corresponding defect reports downloaded from GitHub and Apache Bugzilla, while the non-functional aspects of the system are evaluated on the basis of its ease-of-use and response time. Table 8 shows a high-level summary of all the experiments.
3.1 Performance Evaluation
These experiments were aimed at identifying the best performing algorithm for:
• The {Likely-defective, Unknown} classification task (referred to as Phase 1 prediction) on the input source code.
• Estimating the associated defect characteristics (referred to as Phase 2 prediction) for the Likely-defective cases.
Table 8. Experiments summary
Experiment #1 (§3.1.1) – Objective: Determine the best ML algorithm for the following estimation task: determine if an input source file is likely-to-be-defective or not. Major findings: best classifier: SupervisedDBNClassifier; best dataset: Python dataset; accuracy achieved: 95.96%.
Experiment #2 (§3.1.2) – Objective: Determine the best ML algorithm for the following estimation task: what are the characteristics of defects that are likely to be associated with an input source file that has been classified as likely-to-be-defective in Experiment #1? Major findings: best classifier: LSVM classifier; best dataset: C dataset; accuracy achieved: 91.4%; defect characteristic: defects of type Enhancement.
Experiment #3 (§3.1.3) – Objective: How does the DESCo system perform in comparison to the state-of-the-art (SOA) techniques? Major findings: DESCo outperforms one of the SOA techniques [7] with an improvement of 44.9%.
Experiment #4 (§3.1.4) – Objective: Comparison at the dataset level: do PROCON datasets contribute towards the performance of a defect prediction system? Major findings: DESCo, when trained on the PROCON dataset, shows an improvement of 19.46% and 29.9% over the SOA datasets [7, 46], respectively.
Experiment #5 (§3.2.1) – Objective: To test the efficacy of DESCo on real or production-quality software. Major findings: estimates made by DESCo match the real characteristics.
Experiment #6 (§3.2.2) – Objective: To test if DESCo works in the expected manner for the user audience it is designed for. Major findings: DESCo obtains a feedback score of >8/10 for most of the cases.
To achieve this goal, we iterate through different parameter combinations of various ML algorithms and compare the performance of the generated ML models. The details of the experiments performed are discussed next.
3.1.1 Experiment 1.
Objective: Determine the best ML algorithm for performing the following estimation task: determine if an input source file is likely-to-be-defective or not.
Procedure: We train different ML models on the PROCON dataset using various parameter configurations of ML algorithms (listed in Table 7). The ML models perform the task of classifying the input source file as {likely-to-be-defective, unknown}, referred to as Phase 1 henceforth. To find the best performing model, we compare the models built using the averaged accuracy metric (MeanAcc metric values) as discussed in §2.3.5. The MeanAcc metric values obtained corresponding to 54 parameter combinations of 12 ML algorithms trained on various datasets are shown in Fig. 6(a). In Fig. 6(a), the x-axis represents the various ML model parameter combinations (listed in the “Phase 1” column of Table 7) with which the experiment is performed, and the corresponding MeanAcc values obtained are represented on the y-axis.
Salient observations from the experiment: The observations drawn from the Phase 1 results (shown in Fig. 6) are as follows:
• The Python dataset trained using the SupervisedDBNClassification technique gives the highest (best) MeanAcc value of 0.9596 (with an associated σ of 0.0022).
• Models built using the MLP classifier (point “W” in Fig. 6(a)) yield the lowest MeanAcc values for all the datasets; lowest (= 0.25) in the case of the C++ dataset.
• A sharp downfall in MeanAcc values is observed for the models trained using the Nu-SVM classifier on the C, C++, and Combined datasets.
Inferences drawn from the experiment: From the MeanAcc measure values shown in Fig. 6, we can draw the following inferences:
• Since the SVM and Nu-SVM classifiers with a poly kernel, and the MLP classifier, result in the lowest MeanAcc metric values, such ⟨ML technique – parameter⟩ combinations (viz., h, i, j, k, l, u, v, w, x, y, K, and W of Table 7) are not advisable for performing the task of defect prediction.
• SVM with a radial kernel, LSVM, Logistic Regression, Gaussian, and SupervisedDBN Classification models with tuned parameter combinations (viz., a, b, S, T, X, and AB listed in Table 7) are some of the suitable ⟨ML technique – parameter⟩ combinations for estimating the defectiveness associated with a source code.
• The performance of an ML model depends on the dataset it is trained on. The highest MeanAcc score values are obtained corresponding to the models trained using Python source files.
Fig. 6. Mean Accuracy (MeanAcc) measure values for prediction using our approach
3.1.2 Experiment 2.
Objective: Determine the best ML algorithm for performing the following estimation task: what are the characteristics of defects that are likely to be associated with an input source file that has been classified as likely-to-be-defective in Experiment 1?
Procedure: For the files classified as likely-to-be-defective in Experiment 1 (or Phase 1), we next estimate the type of defects present, referred to as Phase 2 henceforth. For each of the defect estimation tasks discussed in §2.3.1, we perform the following three steps:
(1) Filter the files linked with the particular defect types to create the sub-datasets of the PROCON dataset for training and testing ML models. For instance, to estimate the presence of “high” priority defect types in a file, we create a sub-dataset containing the ⟨source files linked to the “high” priority defects, and files linked to defects with other priority types⟩ in a 50:50 ratio (a small sketch of this construction follows the list).
(2) We iterate through various parameter combinations of ML algorithms and record the MeanAcc values (F1 score and ROC area values) corresponding to each.
(3) We select the model reporting the highest MeanAcc measure value for performing the testing with unlabelled source files (performed in §3.2.1).
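The 50:50 sub-dataset construction of step (1) can be pictured with the following sketch, assuming the PROCON feature rows are held in a pandas DataFrame with a defect-characteristic column; the column name, value, and sampling policy are illustrative assumptions rather than the exact PROCON schema.

    import pandas as pd

    def balanced_subdataset(df: pd.DataFrame, column: str, value: str,
                            seed: int = 42) -> pd.DataFrame:
        """Builds a 50:50 sub-dataset: files linked to defects having the given
        characteristic value versus files linked to defects with other values."""
        positives = df[df[column] == value]
        negatives = df[df[column] != value]
        n = min(len(positives), len(negatives))      # equal-sized halves
        return pd.concat([positives.sample(n, random_state=seed),
                          negatives.sample(n, random_state=seed)],
                         ignore_index=True)

    # e.g. sub = balanced_subdataset(procon_df, "priority", "high")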
Fig. 6(b)-(e) represent the results obtained when estimating the defect characteristics by training models on various datasets. For all the plots shown in Fig. 6(b)-(e), the x-axis represents the various ML model parameter combinations (listed in the “Phase 2” column of Table 7) with which the experiment is performed, and the corresponding MeanAcc values obtained are represented on the y-axis.
Salient observations from the experiment:
• For predicting Medium exposure defects, ML models trained using C++ source files outperform the rest; the LSVM technique achieves the highest MeanAcc value of 0.7773 (with an associated σ of 0.1435).
• The C dataset gives the best performance for predicting the defects of type Enhancement and those annotated as highest priority. The LSVM classifier yields the best results in both cases, with a MeanAcc of 0.8357 (with σ of 0.142) for predicting defects of type Enhancement, and a MeanAcc of 0.91397 in the case of defects annotated as highest priority (with σ of 0.017).
• When predicting defects annotated with a specific OS type, the SVM classifier with a sigmoid kernel trained using the Java dataset gives the best MeanAcc measure value (0.688 with σ of 0.115). (Note: because of the lack of an adequate amount of defect metadata about OS type in the case of C, C++, and Python source files, the performance comparison could not be presented for those languages.)
• In most cases, the SVM classifier and the Nu-SVM classifier with a poly kernel setup fared the worst on the PROCON dataset.
Inferences drawn from the experiment: The important inferences drawn from the results depicted in Fig. 6 are as follows:
• Different ML algorithms emerge as best-performing for predicting different types of defect characteristics. In other words, it is not advisable to employ a single ML algorithm when predicting the type of defect characteristics in an input source file. Therefore, the DESCo system, instead of using a single ML model, chooses the best performing ML model for each prediction task.
• In the case of Medium exposure defects, the F1 score measure and the ROC area measure differ in the ML technique they declare as the best. Hence, different evaluation measures may mark different ML techniques as the best for a given scenario. The DESCo system takes the average of both these (viz., F1 score and ROC curve area) measure values when handling the prediction tasks.
Fig. 7. Example 2: Similar program fragments resulting in different feature vectors ((a) Program fragment 1; (b) Program fragment 2)
Fig. 8. Comparison with state-of-the-art techniques on the basis of Mean Accuracy (MeanAcc) measure values
3.1.3 Experiment 3.
Objective: How does the DESCo system perform in comparison to the state-of-the-art techniques?
Procedure:
To the best of our knowledge, there does not exist any prior work for estimating defect characteristics. We, therefore, present our comparison for Phase 1 of our approach: estimating whether an input source file is likely-to-be-defective or not. Some of the existing works targeting this problem are [7, 47, 48].
Similar to our work, [7] builds feature vectors capturing the occurrences of various programming constructs (with order preserved) used in a source file. A Deep Belief Network (DBN) trained on these features is then used to perform the task of defect prediction. For instance, if the constructs for, if, and foo are assigned the integral codes 1, 2, and 3, respectively, then the feature vectors corresponding to the program fragments shown in Fig. 7 would become {1, 2, 3} and {3, 2, 1}, respectively.
In our work, we capture the count, depth, and length of various programming constructs (in the form of PROCON metrics) used in a source file. The collection of all such PROCON metrics' values extracted from the input set of source files forms our feature set. To compare [7] with our work, we implement it using pertinent libraries [36] of ScikitLearn [22]. We refer to [7] as one of the State-Of-the-Art (SOA) approaches in the upcoming sections.
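The contrast between the two representations can be illustrated with a small Python sketch. It is purely illustrative: the three-construct vocabulary mirrors the example above, and the count/depth computation is a simplified stand-in for the actual feature extraction of [7] and of PROCON.

    from collections import Counter

    vocab = {"for": 1, "if": 2, "foo": 3}          # integral codes from the example above

    def token_sequence_features(constructs):
        """Order-preserving encoding in the spirit of [7]: a sequence of integer codes."""
        return [vocab[c] for c in constructs]

    def procon_style_features(occurrences):
        """Simplified PROCON-style summary: per-construct (count, max depth)."""
        counts, max_depth = Counter(), Counter()
        for construct, depth in occurrences:
            counts[construct] += 1
            max_depth[construct] = max(max_depth[construct], depth)
        return {c: (counts[c], max_depth[c]) for c in counts}

    frag1 = ["for", "if", "foo"]                    # program fragment 1 of Fig. 7
    frag2 = ["foo", "if", "for"]                    # program fragment 2 of Fig. 7
    print(token_sequence_features(frag1))           # [1, 2, 3]
    print(token_sequence_features(frag2))           # [3, 2, 1] -- order-sensitive vectors
    print(procon_style_features([("for", 1), ("if", 2), ("foo", 3)]))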
[47] targets the defect reports instead of the source files; they build a model to classify the defect reports as predictable or unpredictable. Thus, we believe that a comparison with it would not be justifiable. Similarly, the work in [48] is focused on recommending (or detecting) defect reports that are similar to an input defect report. This, again, would not be a proper comparison to our work.
Fig. 8(a) presents a performance comparison of the 54 ML technique combinations when tried using our approach and one of the SOA techniques [7]. The considered ML model parameter combinations (listed in the “Phase 1” column of Table 7) are represented on the x-axis, while the y-axis represents the respective MeanAcc measure values.
Observations from the experiment: As shown in Fig. 8(a), our approach outperforms the state-of-the-art technique for 50 out of 54 parameter combinations (see Table 7) of the ML techniques considered. The highest MeanAcc measure value of 80.8%, with mean error (σ) of 0.047, is obtained in the case of configuration item “g” of Table 7. Our technique outperforms [7] (configuration item “AB”) with an improvement of 44.9%. Note: since the work in [7] was performed only for Java source files, we perform the comparison on the common Java source files. We present the comparison at the dataset level in the next section. Further, [7] implement their approach using the SupervisedLearningClassifier (viz., configuration item “AB”) only. However, to compare the performance of our approach with the approach mentioned in [7], we simulated [7] for all the ML parameter configurations shown in Figure 8 and Table 7.
Inference from the experiment: DESCo, when trained on the PROCON dataset, performs the task of defect estimation with better accuracy in comparison to [7] (configuration item “AB”). It achieves an improvement of 44.9%. Note: there exist some ML algorithm parameter configuration items, shown in Figure 8, where the ML models built using the semantic datasets generated using the approach in [7] outperform those trained on the PROCON dataset (the PROCON Java dataset, specifically). But DESCo outperforms the compared method in most of the cases.
3.1.4 Experiment 4.
Objective: Comparison at the dataset level: do PROCON datasets contribute towards the performance of a defect prediction system?
Description: In this experiment, we train different ML models on the PROCON datasets and compare their performance with that of models trained on the SOA datasets. Note: we train models for each of the considered languages, viz., C, C++, Java, and Python. The results of the experiments are shown in Fig. 8.
Label Notation: τ_λ_Dataset, where τ refers to the method used for building the metrics, viz., the PROCON-dataset building method, the Semantic-dataset building method [7], or the datasets present in the PROMISE repository [46], and λ represents the programming language for which the dataset is built, viz., C, C++, Java, or Python. For instance, PROCON_C_Dataset would represent a dataset built using PROCON metrics' values extracted from source files written in the C programming language.
Procedure: We selected the two state-of-the-art (SOA) datasets [7, 46] closest to our work. The procedure can be listed in the following steps:
(1) Find the common files among the three datasets, PROMISE, SOA, and PROCON.
(2) Build the semantic datasets using the approach listed in [7], which we refer to as Semantic datasets henceforth.
(3) Extract the Chidamber and Kemerer (CK) metrics' values [46] of these common files from the PROMISE repository (http://openscience.us/repo/defect/), which we refer to as the PROMISE datasets henceforth.
(4) Compare the performance of ML models trained on the language-wise combinations of the datasets formed in the previous steps, using the parameter combinations of various ML algorithms listed in Table 7 (for Phase 1).
The work in [7] targets only the Java programming language, but for comparing with our datasets built for other programming languages (viz., C, C++, and Python), we build the language-specific datasets using the approach in [7, 46]. We then compare the performance of ML models trained on the PROCON datasets and the newly generated datasets built using [7, 46].
The performance comparison corresponding to all the 11 datasets is presented in Fig. 8(b). The x-axis represents the parameter combinations of various ML algorithms, and the y-axis represents the pow(1500, MeanAcc) measure values obtained. The power function is applied in order to spread the closely placed MeanAcc values. The value 1500 was found after experimenting with a few values.
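As a quick illustration of the spreading effect (the MeanAcc values here are made up for the example), two closely placed values such as 0.80 and 0.84 map to pow(1500, 0.80) ≈ 347 and pow(1500, 0.84) ≈ 465, so the visual gap between them on the plot grows considerably while their ordering is preserved.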
Salient observations from the experiment: ML models trained on the PROCON C dataset outperform both the baseline datasets (viz., the PROMISE dataset and the Semantic dataset), yielding the highest MeanAcc measure value of 84.47% (with σ of 0.016) for the RF classifier with 100 estimators (viz., row “t” of the Phase 1 column in Table 7). The SOA datasets [7, 46] give the highest MeanAcc values of 70.9% and 65.2%, respectively, when trained using the SupervisedDBN classifier.
Inferences drawn from the experiment: The ML models trained on the PROCON dataset outperform the ones trained on the SOA datasets [7, 46], with improvements of 19.46% and 29.9%, respectively. Note: there are certain ML algorithm parameter combinations, shown in Figure 8(b), in which the models trained on datasets built using [7, 46] outperform those trained on the PROCON datasets. This dip in DESCo's performance with respect to certain ML algorithm parameter combinations does not affect its overall performance, since DESCo uses the best-performing model to perform a given defect-estimation task (discussed in §2.3.5), and thus the choice of such low-accuracy models is automatically avoided.
3.2 Quality Testing
Software quality is defined as the degree to which a system, component, or process meets the specified requirements [5]. The requirements of software are broadly categorized as functional and non-functional; the functional requirements model the functional correctness of the software, while the non-functional requirements capture the degree to which the software works as intended. Both the functional and non-functional requirements are listed in the requirements specification of the software.
To evaluate the functional correctness of DESCo, we test its performance by experimenting with real or production-quality software data that is not a part of the PROCON dataset. To evaluate the fulfilment of non-functional requirements by DESCo, we choose evaluation metrics such as the ease-of-use, response-time, and accuracy, as observed by the end-users. We discuss the details of these experiments in the upcoming sections.
Fig. 9. DESCo User Interface
3.2.1 Experiment 5. We performed this experiment to test the functional correctness of DESCo. For this purpose, we downloaded some source files (reported to be defective) from various defect reporting portals (such as Apache Bugzilla) and tested them using DESCo. The results obtained for a source file (https://github.com/apache/httpd/blob/trunk/server/request.c) which is linked to a defect report (https://bz.apache.org/Bugzilla/show_bug.cgi?id=45187) are shown in Fig. 10.
Objective: To test the efficacy of DESCo on real or production-quality software.
Procedure: To achieve the objective, we performed the following steps:
(1) First, we downloaded a collection of source files from various defect reporting engines. While downloading the source files, we ensured that the source files were associated with one or more defect reports.
(2) Next, we performed defect estimation on this collection of source files using DESCo.
(3) Finally, we validated the functionality of DESCo by analyzing the obtained defect estimates and the true defect characteristics of the input source files. The true characteristics were available with the defect report itself.
Salient observations from the experiment:
(1) The defect estimates made by DESCo for the input source file match those in reality.
(2) DESCo was also able to predict those defect characteristics which were originally missing from the corresponding defect reports. For instance, the given defect report only highlighted that the defect originated when working on Linux (an OSS OS), but DESCo provides additional information about other perspectives, such as the user-activity associated with it and the application area linked with the defect report. All these estimates are provided with associated ⟨accuracy, error⟩ pairs.
Fig. 10. Prediction results on a test file (ML model trained using PROCON_Combined_Dataset)
Inferences from the experiment: In DESCo, the prediction of defect characteristics is made by separately training the ML models for each of the characteristics. This allows us to independently predict the presence of more than one defect characteristic in an input source file.
3.2.2 Experiment 6. This experiment was conducted in a controlled industrial environment. We recruited 10 participants for the experiment.
Objective: To test if DESCo works as expected for the intended users and scenarios.
Procedure: The participants were asked to use the DESCo system to perform defect estimation on an input source code of their choice. The input source files were required to be at least 900 characters long (excluding comments), and written in C, C++, Java, or Python. We assume that source files containing less than 900 characters of source code carry insufficient construct information for confidence in the approach. Next, the participants were asked to provide their ratings on the performance of the DESCo system on three quality parameters: the ease-of-use of the interface, the response time of DESCo, and the accuracy of the defect prediction task performed by it. The users rated on a scale of 1–10, with 1 representing the worst performance and 10 representing the best. Table 9 gives the details of the ratings received by us.
Salient observations from the experiment:
(1) End users found DESCo to have a user-friendly interface.
(2) DESCo achieves a rating of 8/10 or better in most of the cases.
Interpretation of Table 9: The first row of the table shows that a score of 10 was awarded for ease-of-use and response time by 1 voter each, and for accuracy by 2 voters. Similarly, the last row shows that a score of 6 was awarded by only 1 voter, for the response time.
Table 9. Ratings recorded in Experiment 6

Score Y (scale 1–10)   Votes: ease-of-use   Votes: response time   Votes: accuracy
10                     1                    1                      2
9                      7                    0                      7
8                      2                    6                      1
7                      0                    2                      0
6                      0                    1                      0

3.3 Threats to validity
Since we have chosen only open-source projects for our experiments, the behaviour may differ in the case of closed-source projects. We have not tried every possible value of each parameter of the ML algorithms considered. Since the
PROCON dataset contains the features of only four languages (viz., C, C++, Python, and Java), our system can only predict the defectiveness associated with source files belonging to these languages. Different ML techniques, which we found to perform the best for our prediction tasks, may perform differently on a different set of features.
Another challenge is the authenticity of the defect reports themselves. In order to establish a mapping between the source files and the defect reports of various OSS, we search the defect reports for the presence of various source file names. Thus, a wrongly reported defect would lead to an incorrect source file ↔ defect report reference, leading to an incorrect defect mapping. This may give rise to false positives, i.e., a situation where a source file free from defects is labelled as defective. For the current work, we assume that all the defect reports are valid.
The programming constructs and the metrics that we use in PROCON are programming-language specific. Also, our experiments show that the performance of our system varies for the different programming languages that we considered. Thus, the results may or may not generalize to all programming languages.
Further, the PROCON metrics capture the specific programming decisions mentioned in Definition 3. They may not be able to capture the abstract design choices made during architectural design.
We remove the outliers present in our datasets by:
(1) Selecting the source files with similar size (measured in terms of character count). The range of characters differs among the source files written in different programming languages.
(2) Normalizing the feature values using the Min-Max Scaling method (a brief sketch follows this list). This was done to prevent a particular feature comprising large measure values (or feature values) from influencing the entire dataset, thus diminishing the effect of the other constituent features of the dataset.
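A minimal sketch of the Min-Max scaling step mentioned in item (2), assuming the feature matrix is held in a NumPy array (the values shown are illustrative, not real PROCON measurements):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[120.0, 3.0],      # e.g. <length, depth> values for three files
                  [45.0, 1.0],
                  [300.0, 7.0]])
    X_scaled = MinMaxScaler().fit_transform(X)   # each feature rescaled to [0, 1]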
We believe that the above outlier avoidance measures help us in eliminating the possibility of the swamping effect [49] in our system. However, as is the case with any experiment, it is not possible to cover all possibilities for the detection and avoidance of the swamping effect.
Definition 4: Swamping effect
It is defined as the situation where a “clean” data item is incorrectly labeled as an outlier, due to the presence of multiple clean sub-groupings within the data [49].
Further, while performing the testing of the DESCo system, we consider only the source files with an effective size (excluding comments and documentation) >= 900 characters. We assume that source files smaller than this size contain insufficient construct information and can be neglected for our case. Further, the use of small files (< 900 characters) might skew the dataset because of the sparse feature vectors generated for them.
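A rough sketch of this effective-size check is shown below; the comment stripping is deliberately simplified (line comments and C-style block comments only), whereas a production implementation would use a language-aware parser:

    import re

    def effective_size(source: str) -> int:
        """Character count after removing /* ... */ blocks and //- or #-style line comments."""
        without_blocks = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
        without_lines = re.sub(r"(//|#).*", "", without_blocks)
        return len(without_lines)

    def large_enough(source: str, threshold: int = 900) -> bool:
        return effective_size(source) >= threshold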
With the fast-paced advances in the machine learning (ML) space, the identification of the State of the Art (SOA) technique for the current problem is a challenge. The SOA appears to be a moving target: what we think of as the SOA today may not be so tomorrow. While doing the performance evaluation of our system, we found [7] to be the most relevant SOA work similar to ours. But, due to the evolving nature of ML, [7] may not remain the SOA for long.
4 RELATED WORK
Effects of programming styles on the quality of software have been well studied and reported in the research literature. Most such studies can be categorized into two broad areas – quality and maintenance of software. One of the works that addressed problems similar to ours is [7]. They propose a learning algorithm to extract the semantic representation associated with a program automatically. They train a DBN on the token vectors derived from the AST representations of various programs. They rely on the trained DBNs to detect the differences in the token vectors extracted from an input in order to predict defects. Another related contribution is presented by [47]. They propose a two-phase recommendation model for suggesting the files to be fixed by analyzing the defect reports associated with them. The recommendation model, in its first phase, categorizes a file as “predic” or “unpredic” depending upon the adequacy of the available content of the defect reports. Further, in its second phase, amongst the files categorized as predic, it recommends the top-k files to be fixed. They experimented on a limited set of projects, viz., the “Firefox” and “Core” packages of Mozilla.
An information-retrieval-based defect localization module, “defectLocator”, is proposed by [48]. The defectLocator uses the concept of textual similarity to find a set of defect reports similar to an input defect report and, using the linked source code, tries to identify potential defects related to the input. The use of textual similarity for code identification can pose problems because a given programming task may often be coded in more than one way. The authors have compared defectLocator's performance with other textual similarity techniques, such as Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI), on four OSS projects (Eclipse, AspectJ, SWT, and ZXing).
The software metrics used in building a predictive model have a more significant impact on the performance of the system than the ML technique used [14]. The authors in [13] present a review of 106 papers and state CK's Object-Oriented (OO) metrics as the most used software metrics. Further, the authors compare eight primary ML techniques (such as C4.5, SVM, Neural Network, and Logistic Regression) used in these 106 works by various evaluation methods, viz., precision, recall, the ROC curve method, and cost-effectiveness.
Similarly, the authors in [50] compare the performance of six ML techniques (such as Linear regression, Multilayer perceptron, and Decision tree regression) across 18 datasets of the PROMISE repository (such as Ant 1.7 and Camel 1.2) using the Average Absolute Error (AAE) and Average Relative Error (ARE) metrics. The Linear regression and Decision tree learning techniques were found to yield the best results, with the least AAE and ARE measures.
The authors in [51] employ static code features (viz., Lines Of Code (LOC), Halstead features [52], and McCabe features [53]) for defect prediction. They compare the performance of 12 ML techniques (such as SVM, Naive Bayes (NB), and Random Forest (RF)) on static metrics' values extracted from 10 open-source projects available in the NASA repository. The authors claim that the predictive models which maximize the AUC(effort, probability of detection) metric yield the best results. Such models, which maximize the AUC metric, generate the smallest set of modules containing the most errors. The authors claim that the WHICH meta-learning model proposed by them outperforms the state-of-the-art techniques.
Overall, some of the key gaps found in the majority of the existing works addressing problems similar to ours can be summarized as follows:
• In works that use (or propose) source code feature extraction for defect prediction and localization, only limited types of nodes from the program's AST have been utilized. For instance, node types such as identifier nodes, operator nodes, scope nodes, and user-defined data types have not usually been considered. In our work, we have captured all such types of nodes as per a language's grammar.
• While building the feature vectors, the mere presence or absence of programming constructs is considered. In our work, however, we also capture additional characteristics associated with the programming constructs, for example, the “length”, “count”, and “depth-of-occurrences” of various constructs.
• The association of “programming decisions” made during the development of software with the characteristics of the defects reported against such source code has not been adequately studied in the literature.
• Last but not least, most works reported their results using a small volume (less than five projects) of source code and defect reports, if any. Our study spans more than 30400 source files written in four different programming languages and taken from 20 OSS repositories.
5 CONCLUSIONS
We present a novel method to estimate the defectiveness associated with a software system by analyzing the programming construct usage patterns present in software built in the past. PROCON metrics are used to extract the programming construct usage information from several source files, which is then stored as language-specific datasets. ML models built using various SOA ML techniques, when trained with the PROCON datasets, constitute our system – DESCo. DESCo automates the defect estimation task, thus reducing the time and cost associated with the process. In addition to predicting the defectiveness associated with an input source code, DESCo also provides estimates of the likely defect characteristics.
Our results show that information associated with the programming constructs used in existing software (the PROCON dataset) improves the estimation of defects in source code. Our results also show that the accuracy of the defectiveness estimation varies with the programming language and the ML technique. For instance, for detecting the defectiveness of source files written in Python, the SupervisedDBN classifier performs the best with an accuracy of 95.9%, whereas for C++ programs, the LSVM classifier gives the best accuracy of 77.5%.
Our results indicate that the DESCo system and PROCON datasets outperform the existing state-of-the-art techniques and datasets, with the best MeanAcc measure value of 80.8% in the case of the RF technique (an improvement of 44.9%). DESCo and the PROCON datasets can be used for building software tools in areas such as defect localization, code review, and recommendation. We have shown that ML models trained using PROCON metrics give better results in comparison to the existing OO metrics (CK's metrics of the PROMISE repository).
REFERENCES
[1]
Sayed Mehdi Hejazi Dehaghani and Naseh Hajrahimi. Which factors affect software projects maintenance cost more? Acta Informatica Medica, 21(1):63, 2013.
[2]
Ritu Kapur and Balwinder Sodhi. Towards a knowledge warehouse and expert system for the automation of sdlc tasks. In Proceedings of the
International Conference on Software and System Processes, pages 5–8. IEEE Press, 2019.
[3]
Ritu Kapur and Balwinder Sodhi. Estimating defectiveness of source code: A predictive model using github content. arXiv preprint arXiv:1803.07764,
2018.
[4] Ilene Burnstein. Practical software testing: a process-oriented approach. Springer Science & Business Media, 2006.
[5]
IEEE Computer Society. Software Engineering Technical Committee. Ieee standard glossary of software engineering terminology. Institute of
Electrical and Electronics Engineers, 1983.
[6]
ISO/IEC. ISO/IEC25010:2011(en), systems and software engineering— systems and software quality requirements and evaluation (SQuaRE)— system
and software quality models. https://www.iso.org/obp/ui/#iso:std:iso-iec:25010:ed-1:v1:en, 2011. Retrieved: 11-03-2018.
[7]
Song Wang, Taiyue Liu, and Lin Tan. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International
Conference on Software Engineering, pages 297–308. ACM, 2016.
[8] Michael Pradel and Koushik Sen. Deep learning to find bugs. TU Darmstadt, Department of Computer Science, 2017.
[9]
Thomas Shippey, David Bowes, and Tracy Hall. Automatically identifying code features for software defect prediction: Using AST n-grams. Information and Software Technology, 106:142–160, 2019.
[10]
Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT
International Symposium on Foundations of Software Engineering, pages 281–293. ACM, 2014.
[11] Florian Deissenboeck and Markus Pizka. Concise and consistent naming. Software Quality Journal, 14(3):261–282, 2006.
[12]
Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. What's in a name? A study of identifiers. In Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on, pages 3–12. IEEE, 2006.
[13]
Danijel Radjenović, Marjan Heričko, Richard Torkar, and Aleš Živkovič. Software fault prediction metrics: A systematic literature review. Information
and Software Technology, 55(8):1397–1418, 2013.
[14]
Erik Arisholm, Lionel C Briand, and Eivind B Johannessen. A systematic and comprehensive investigation of methods to build and evaluate fault
prediction models. Journal of Systems and Software, 83(1):2–17, 2010.
[15]
Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive scale using language modeling. In Proceedings of the 10th
Working Conference on Mining Software Repositories, pages 207–216. IEEE Press, 2013.
[16] Stack Overflow. Stack Overflow developer survey results 2019: Most popular technologies, February 2019.
[17]
Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003.
[18] Yiming Yang and Jan O Pedersen. A comparative study on feature selection in text categorization. In Icml, volume 97, page 35, 1997.
[19]
Ferdian Thung, Tegawende F Bissyande, David Lo, and Lingxiao Jiang. Network structure of social coding in github. In Software maintenance and
reengineering (csmr), 2013 17th european conference on, pages 323–326. IEEE, 2013.
[20]
Audris Mockus, Roy T Fielding, and James D Herbsleb. Two case studies of open source software development: Apache and mozilla. ACM
Transactions on Software Engineering and Methodology (TOSEM), 11(3):309–346, 2002.
[21] Terence Parr. The Definitive ANTLR 4 Reference. Pragmatic Bookshelf, 2013.
[22]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research,
12:2825–2830, 2011.
[23]
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874, 2008.
[24] sklearn Linear SVM. (LSVM), February 2020.
[25] sklearn Support Vector Machine. (SVM), February 2020.
[26] sklearn Nu-Support Vector Machine. (NuSVM), February 2020.
[27] J Moćkus, V Tiesis, and A Źilinskas. The application of bayesian methods for seeking the extremum. vol. 2, 1978.
[28] sklearn Gaussian Process Classifier. (Gauss), February 2020.
[29] Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
[30] sklearn K Neighbors Classifier. (KNN), February 2020.
[31]
Tin Kam Ho. Random decision forests. In Document analysis and recognition, 1995., proceedings of the third international conference on, volume 1,
pages 278–282. IEEE, 1995.
[32] sklearn Random Forest. (RF), February 2020.
[33] Simon Haykin. Neural networks: a comprehensive foundation. Prentice Hall PTR, 1994.
[34] sklearn Multi-Layer Perceptron. (MLP), February 2020.
[35]
Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[36]
sklearn Supervised Learning. A Python implementation of Deep Belief Networks built upon NumPy and TensorFlow with scikit-learn compatibility,
February 2020.
[37]
Hsiang-Fu Yu, Fang-Lan Huang, and Chih-Jen Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine
Learning, 85(1-2):41–75, 2011.
[38] sklearn Logistic Regression. (LogReg), February 2020.
[39]
Jason D Rennie, Lawrence Shih, Jaime Teevan, and David R Karger. Tackling the poor assumptions of naive bayes text classifiers. In Proceedings of the 20th international conference on machine learning (icml-03), pages 616–623, 2003.
[40] sklearn Bernoulli Naive Bayes. (BernNB), February 2020.
[41] sklearn Multinomial Naive Bayes. (MultNB), February 2020.
[42] sklearn Gaussian Naive Bayes. (GaussNB), February 2020.
[43]
Andrew Y Ng and Michael I Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in neural information processing systems, pages 841–848, 2002.
[44] sklearn Supervised Learning. 1. Supervised learning, February 2020.
[45]
Issam H Laradji, Mohammad Alshayeb, and Lahouari Ghouti. Software defect prediction using ensemble learning on selected features. Information
and Software Technology, 58:388–402, 2015.
[46]
Shyam R Chidamber and Chris F Kemerer. A metrics suite for object oriented design. IEEE Transactions on software engineering, 20(6):476–493, 1994.
[47]
Dongsun Kim, Yida Tao, Sunghun Kim, and Andreas Zeller. Where should we fix this bug? A two-phase recommendation model. IEEE Transactions on Software Engineering, 39(11):1597–1610, 2013.
[48]
Jian Zhou, Hongyu Zhang, and David Lo. Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports. In Proceedings of the 34th International Conference on Software Engineering, pages 14–24. IEEE Press, 2012.
[49]
Jung-Tsung Chiang et al. The masking and swamping effects using the planted mean-shift outliers models. Int. J. Contemp. Math. Sciences, 2(7):297–307, 2007.
[50] Santosh S Rathore and Sandeep Kumar. An empirical study of some software fault prediction techniques for the number of faults prediction. Soft
Computing, 21(24):7417–7434, 2017.
[51]
Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, and Ayşe Bener. Defect prediction from static code features: current results,
limitations, new approaches. Automated Software Engineering, 17(4):375–407, 2010.
[52] Maurice Howard Halstead et al. Elements of software science, volume 7. Elsevier New York, 1977.
[53] Thomas J McCabe. A complexity measure. IEEE Transactions on software Engineering, (4):308–320, 1976.
[54] Leonard Richardson. Beautiful soup documentation, 2007.
A IMPLEMENTATION DETAILS
Our approach for estimating the defectiveness of a given source code involves the use of two key modules which we have built: the PROCON dataset builder and the DESCo system. We describe the details relevant to the implementation below. The implementation details provided in this section allow our results to be replicated by independently building our proposed system.
A.1 Details of PROCON dataset builder
The implementation details of the important libraries (and/or scripts) used in the PROCON dataset builder are described
below:
(1) LexPar Module: It consists of the collection of all the lexer and parser classes generated by ANTLR. These lexer/parser classes perform the parsing of input source files and produce a collection of lexer/parser tokens. These classes and tokens are then fed as input to the Feature Extraction module to extract the necessary PROCON metrics' values.
(2) Feature Extraction module: Given a set of grammars G, we use ANTLR [21] to generate one parse tree listener class (http://www.antlr.org/api/Java/org/antlr/v4/runtime/tree/ParseTreeListener.html) corresponding to each g ∈ G. The parse tree listener class contains methods for handling lexer/parser tokens (input by LexPar) as they are encountered during parsing of input source code conforming to the grammar g. We then design a class that uses the parse tree listeners to extract lexical features from input source code written in different programming languages. These lexical features correspond to the constructs mentioned in Table 2. The collection of PROCON metrics' values (see Table 3), extracted from various source files, forms the basis of our PROCON dataset (discussed in Section 2.2).
(3) Defect Information Collector: This is built as a script consisting of SQL commands written to populate the information of bug reports (provided as input CSV files) in the BugInfo table (shown in Fig. 4).
(4) Patch Scraper: This module is built using the Beautiful Soup [54] library of Python to extract the patch and summary information fields from the HTML markup of a defect report, d. The scraped information is written in the form of a text file, p, with its filename as the defect report ID, d.ID (a simplified sketch follows this list).
(5) Mapping Builder: This module is developed using the Java programming language. It establishes the mapping between the defect reports (∀β∈B) and the source files (∀f∈R) fetched from the GitHub OSS repositories. The basic working idea of this module is that we look for source file name patterns in the scraped information obtained via the Patch Scraper (a simplified sketch follows this list). The detailed steps are provided in Algorithm 1.
(6) Information Refiner: This module refines the information that the Feature Extractor module stored in the SourceCodeFeatures table. The refining is done by applying various transformations as discussed in Step-5 of §2.2.4. The transformation logic is implemented via a collection of SQL commands.
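The following two sketches illustrate the ideas behind the Patch Scraper and the Mapping Builder. Both are simplified Python illustrations rather than the actual modules: the Bugzilla element selectors in the first sketch are assumptions, and the real Mapping Builder is written in Java (its exact steps are given in Algorithm 1, not reproduced here).

    import requests
    from bs4 import BeautifulSoup

    def scrape_report(report_url: str, report_id: str) -> None:
        """Writes the summary and attachment text of a defect report to <report_id>.txt."""
        soup = BeautifulSoup(requests.get(report_url, timeout=30).text, "html.parser")
        summary = soup.find("span", id="short_desc_nonedit_display")   # assumed element id
        patches = [a.get_text(strip=True) for a in soup.select("table#attachment_table a")]
        with open(f"{report_id}.txt", "w", encoding="utf-8") as out:
            out.write((summary.get_text(strip=True) if summary else "") + "\n")
            out.write("\n".join(patches))

The scraped text can then be searched for source file name patterns, which is the essence of the Mapping Builder:

    import re

    def map_reports_to_files(scraped_texts: dict, repo_files: list) -> dict:
        """scraped_texts: {bug_id: scraped patch/summary text};
        repo_files: source file paths fetched from the OSS repositories.
        Returns {bug_id: [matching source file paths]}."""
        mapping = {}
        for bug_id, text in scraped_texts.items():
            hits = [path for path in repo_files
                    if re.search(r"\b" + re.escape(path.split("/")[-1]) + r"\b", text)]
            if hits:
                mapping[bug_id] = hits
        return mapping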
A.2 Details of database tables
A brief description of the main tables of the PROCON dataset schema is as follows:
(1) LanguageConstructs: In this table, we maintain all the programming language constructs which we consider while computing the PROCON metrics for the input source files. Each programming language construct X has a single record in this table. The attribute ConstructId is the primary key of the table, which assigns a unique identifier (ID) to each of the programming language constructs used in the considered source files. The ConstructId attribute acts as a foreign key in the tables SourceCodeFeatures and RefinedFeatures (discussed next).
(2) SourceCodeFeatures: This table stores the values of lexical properties computed for the various programming constructs (as described in §2.2) that are found in the source files. The MeasureType attribute holds the types, viz., count, depth, and length, of the lexical properties. The ConstructId references the programming construct ID from the LanguageConstructs table.
Since a source file f may contain multiple instances of a programming language construct p, the SourceCodeFeatures table has a one-to-many relationship with the LanguageConstructs table. The SourceFileId refers to the unique ID values assigned to various source files. A source file can contain multiple programming constructs at different nesting levels (i.e., with different ParentIds). Thus, the table has a composite primary key of {SourceFileId, FileName, ConstructId, ParentId, MeasureType}.
(3) BugInfo: In this table, we store the relevant meta-data fetched from the defect reports associated with the source files considered in the SourceCodeFeatures table. BugId represents a unique ID for the record of each such defect report. The attribute BugEngineId identifies the source from where the defect report was fetched.
The combination of BugEngineId and BugId acts as the primary key of this table. The attribute Product holds information about the module against which the defect was reported. The remaining attributes of the table (viz., Status, Priority, Stage, etc.) represent the various defect characteristics.
(4) SourceFileToBugMapping: It stores the mapping between the source files and the bug reports considered for building the dataset. Since it is a mapping between two tables (viz., SourceCodeFeatures and BugInfo), the table comprises the prime attributes of the two respective tables.
(5) RefinedFeatures: It stores the refined features obtained by transforming the shallow knowledge, as explained in §2.2.4 step (6). This table represents the values of the PROCON metrics extracted for various source files.
The attributes maxMeasureVal, minMeasureVal, avgMeasureVal, and stdDevMeasureVal store the measure values obtained by applying various aggregate operations (viz., max, min, avg, and stdDev) on the corresponding attribute values from the SourceCodeFeatures table.
When computing these aggregate values, we categorized a programming construct occurrence into three different types of parent entities: file, class, and function. These types are identified via a unique integer value held in the queryFlag attribute. Since a construct may occur under all these three parent entities simultaneously, the table has a composite key {SourceFileId, ConstructId, MeasureType, queryFlag} as the primary key. Further, each source file f present in the SourceCodeFeatures table can have n (with n > 1) records in the RefinedFeatures table due to the three queryFlag values.
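The aggregation behind RefinedFeatures can be pictured with a small pandas sketch. The column names mirror the schema description above, but the rows and the queryFlag coding are illustrative; the actual refiner is implemented as SQL commands (see the Information Refiner module in §A.1).

    import pandas as pd

    # SourceCodeFeatures-like rows: one row per construct occurrence (values illustrative).
    rows = pd.DataFrame({
        "SourceFileId": [1, 1, 1, 2],
        "ConstructId":  [10, 10, 12, 10],
        "MeasureType":  ["depth", "depth", "count", "depth"],
        "queryFlag":    [0, 0, 0, 0],        # assumed coding: 0 = file-level parent entity
        "MeasureValue": [3, 5, 2, 1],
    })

    refined = (rows
               .groupby(["SourceFileId", "ConstructId", "MeasureType", "queryFlag"])["MeasureValue"]
               .agg(maxMeasureVal="max", minMeasureVal="min",
                    avgMeasureVal="mean", stdDevMeasureVal="std")
               .reset_index())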
A.3 Details of DESCo modules
All modules of DESCo have been coded using the Python programming language. The implementation details of these modules are as follows.
(1) Model Builder: It builds various ML models using an input list of ML algorithms listed in Table 7 and the features present in the PROCON dataset. To get the best performing model for various scenarios, it considers the parameter combinations listed in Table 7. We use the ScikitLearn implementation of these ML algorithms. Details of these algorithms are discussed in §2.3.3.
(2) Model Selector: This module implements an optimization problem solver developed in Java. The objective of the optimization problem is to find, from a set of given ML models, the best performing ML model(s) for a specific task. The input to the Model Selector comprises i) the query selected by the user, ii) the features extracted from the input source file, and iii) the complete set of ML models built using the Model Builder (described above). The output of this module is the best performing model(s) as per the input.
(3) Query Interface: It provides an interface for user interaction with the DESCo system. The user selects a defect estimation task (listed in §2.3.1) from an available list of queries. The user can supply a source file for which the defect estimation task needs to be performed. Internally, the query interface uses the user-selected query and the provided source file as the inputs to the Estimation Engine for performing the defect estimation tasks.
(4) Estimation Engine: It is the main module that drives the defect estimation tasks. The key inputs to this module are a) a user-selected query for defect estimation and b) the source file whose defectiveness needs to be estimated.