Preprint (accepted by ACM TOSEM)
A Defect Estimator for Source Code: Linking Defect Reports With
Programming Constructs Usage Metrics
RITU KAPUR and BALWINDER SODHI, Indian Institute of Technology Ropar, India
An important issue faced during software development is to identify defects and the properties of those defects, if found, in a given source file. Determining the defectiveness of source code assumes significance due to its implications on software development and maintenance cost.
We present a novel system to estimate the presence of defects in source code and detect attributes of the possible defects, such as the severity of defects. The salient elements of our system are: i) a dataset of newly introduced source code metrics, called PROgramming CONstruct (PROCON) metrics, and ii) a novel Machine-Learning (ML) based system, called Defect Estimator for Source Code (DESCo), that makes use of the PROCON dataset for predicting defectiveness in a given scenario. The dataset was created by processing 30400+ source files written in four popular programming languages, viz. C, C++, Java, and Python.
The results of our experiments show that the DESCo system outperforms one of the state-of-the-art methods with an improvement of 44.9%. To verify the correctness of our system, we compared the performance of 12 different ML algorithms with 50+ different combinations of their key parameters. Our system achieves the best results with the SVM technique, with a mean accuracy measure of 0.808.
CCS Concepts: • Software and its engineering → Software maintenance tools; Software performance; Source code generation; Parsers; Language features; Software libraries and repositories; Syntax; Software design tradeoffs; • Information systems → Data mining; • Computing methodologies → Supervised learning.
Additional Key Words and Phrases: Maintaining software, Source code mining, Software defect prediction, Software metrics, Software
faults and failures, Automated software engineering, AI in software engineering
ACM Reference Format:
Ritu Kapur and Balwinder Sodhi. 2020. A Defect Estimator for Source Code: Linking Defect Reports With Programming Constructs
Usage Metrics. ACM Trans. Softw. Eng. Methodol. 1, 1, Article 1 (January 2020), 34 pages.
1 Introduction
Maintenance of software, particularly identifying and fixing the defects, contributes about 90% of the total cost of developing and deploying the software [ ]. Early detection of software defects is essential for lowering the cost of debugging and thus, the overall cost of software development [ ]. Therefore, it is desirable to have tools and techniques by which one can determine the defectiveness of a program and possibly various properties of the apparent defects [ ].
Authors' address: Ritu Kapur; Balwinder Sodhi, Indian Institute of Technology Ropar, Department of Computer Science and Engineering, Rupnagar, Punjab, 140001, India.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
© 2020 Association for Computing Machinery.
Manuscript submitted to ACM
Denition 1: Defectiveness
We dene the defectiveness of a source le as the likelihood of nding any defects present in it. We express the
value of defectiveness on a Likert scale of {Likely-defective, Unknown}.
We chose
as the second value because other plausible option –
– would have implied
absence of defects. We do not determine the absence of defects.
Stated broadly, the objective of the work presented in this paper is to determine the defectiveness of a given source file. If the file is likely-to-be-defective, then we also predict specific properties of the possible defects. The provided source file could be an artefact such as the definition of a single class (e.g., a Java class), multiple classes (e.g., a C# namespace in a single file), or a collection of functions (e.g., in a Python module). We do not try to predict the absence of defects (hence the item Unknown on the Likert scale). Also, we do not aim to determine the precise locations (for instance, the line numbers or code blocks) of the apparent defects within the given source file.
Denition 2: Software defect and software failure
Asoftware defect is dened as a aw in the software (or a certain component) that causes the software (or the
specic component) fail to perform its required function, as specied in the requirement specication. A defect,
when encountered during the execution of a software, may lead to partial or complete software failure [4].
Asoftware failure is dened as the inability of a software system or component to perform its required functions
within the specied performance requirements [5].
We primarily aim to determine defectiveness related to the non-functional requirements [ ] of a software; for instance, the defects related to the performance, security, and reliability of a software. Also, we do not consider syntax errors in source code that result in build failures.
Some of the examples of defects related to the non-functional requirements of the software, as reported on different bug tracking portals, are:
(1) Performance: Batik inside of Eclipse using bridge memory heap causes crash.
(2) Security: XML vulnerabilities in Python.
(3) Accessibility: Unlocking a file owned by another user doesn't work.
(4) Programming Language compatibility: build.xml isn't forward compatible with java6.
(5) Maintainability: FAQ references third-party libraries that have been refactored or renamed.
1.1 Existing approaches for determining defectiveness
Many of the recent techniques for prediction of defects in source code employ an ML-based approach. The use of suitably crafted features derived from the input is a distinct characteristic of such methods. For example, the authors in [ ] build a defect prediction model by training Deep Belief Networks (DBN) on the token vectors derived from a subset of Abstract Syntax Tree (AST) nodes of various programs. [ ] present another such approach, which builds the defect prediction model by training neural networks on the token vectors built using AST nodes of source code. Similarly, [ ] show that training a defect prediction model using AST n-grams improves the prediction performance.
The results from the works cited in the preceding paragraph show that the syntactic features of source code (extracted using AST nodes) can serve as useful features for building defect prediction models. In other words, the syntactic properties of a program can be treated as proxies for the programming decisions (see Definition-3) employed in the program.
Researchers in the past have established that the coding practices adopted by programmers significantly influence the quality of a software [ ]. The programming decisions made while constructing the software influence its quality attributes such as readability, portability, ease-of-learning, reliability, and maintainability [ ]. For instance, the choice of concise and consistent naming conventions (for identifier names) has been reported [ ] to result in better quality software. The above results indicate the existence of a relationship between the programming decisions employed in constructing a software and the quality of the software.
The above findings motivated us to develop a technique for determining the defectiveness of a program by analyzing the programming decisions employed in constructing that program. This intuition of determining the defectiveness of source code by analyzing the constituent programming decisions forms the motivation of the PROCON metrics that we propose in §2.2.2. Briefly stated, the PROCON metrics capture the occurrence frequency and AST depth of various programming constructs used in source code.
Denition 3: Programming Decision
It includes the following type of syntactic decisions made by the programmer during construction of a program:
Choice of the programming language constructs. For example,
for looping,
switch for branching.
The constructs’ occurrence
in the AST of the source code. For example, the
is used 100 times in a source le with the deepest nesting level of, say, 5.
of the names of various elements of the program. For example, the names of variables, functions,
and classes.
Following are not considered the programming decisions:
The exact names of the variables, functions and classes. We only care about the length of names, not the
names itself.
Abstract design choices such as those made during architecture design. For instance, whether to use factory
method vs abstract factory design pattern.
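As an illustration of Definition 3, the three kinds of decisions can be measured on Python source with the standard ast module. This is a hypothetical sketch of ours, not part of the paper's artifacts; the helper name decision_measures and the small set of tracked constructs are our own choices.

```python
import ast

def decision_measures(source):
    """Measure construct choice, occurrence frequency/depth, and name lengths."""
    tree = ast.parse(source)
    counts, depths, name_lengths = {}, {}, []

    def walk(node, depth):
        # Decision kind 1 & 2: which constructs are used, how often, how deep.
        if isinstance(node, (ast.If, ast.For, ast.While)):
            kind = type(node).__name__.lower()
            counts[kind] = counts.get(kind, 0) + 1
            depths.setdefault(kind, []).append(depth)
        # Decision kind 3: length of names only, never the names themselves.
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            name_lengths.append(len(node.name))
        for child in ast.iter_child_nodes(node):
            walk(child, depth + 1)

    walk(tree, 0)
    return counts, depths, name_lengths

src = "def total(xs):\n    s = 0\n    for x in xs:\n        if x > 0:\n            s += x\n    return s"
counts, depths, name_lengths = decision_measures(src)
```

Running this on the six-line function above records one for (depth 2) and one if (depth 3), plus a single name length of 5 for the function total.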
1.2 Hypothesis behind the proposed approach
The basic idea underlying our method is to train an ML model with a large number of source code samples for which the defectiveness is known; using the trained model, we can then predict the defectiveness of unseen source code. The critical aspect that differentiates our approach is the choice of source code features, using which we derive an accurate representation of the source code from its defectiveness perspective.
Existing literature (e.g., [ ]) has shown that the choice of ML algorithm and the features employed to build a prediction model have a significant impact on the model's performance. Further, the use of various syntactic features of source code for effective detection of source code similarity has been reported by many researchers (as discussed in §4).
In view of the above observations, we hypothesize that:
(1) The programming decisions (see Definition-3) employed in the construction of software can:
(a) serve as useful features for computing a representation (from a defectiveness perspective) of the source code.
(b) be represented and measured by programming language constructs' usage metrics (e.g., PROCON metrics).
(2) A system can be built which can use an ML model, suitably trained with the PROCON dataset, to estimate the defectiveness of given source code.
The above points form the basis of developing a technique for determining the defectiveness of a program by analyzing the programming decisions employed in constructing that program.
Overall, our work makes the following key contributions:
• We propose a method for estimating the defectiveness of source code. If the code is found likely-to-be-defective, then we also identify the various properties of the apparent defects.
• We propose a new set of software metrics (named PROCON) which we show are effective in capturing programming decisions employed in the construction of software. These metrics comprise the PROgramming language CONstructs (PROCON) usage data measured for the given source code.
• We create a large dataset of PROCON metrics' values by processing more than 30000 source files taken from 20 Open Source Software (OSS) repositories at GitHub. These source files were written in four major programming languages, viz., C, C++, Java, and Python.
• We implement our technique in an open-access tool called DESCo (Defect Estimator for Source Code), which can take a source file as input and suggest the defectiveness of that source code in the form of a Likert scale value along with the properties of the likely defects.
• We empirically evaluate DESCo by two methods: a) testing it with known pairs of input source files and expected output, such that the input source files are not part of the PROCON dataset used by DESCo (this experiment is carried out in a lab environment); and b) conducting controlled experiments with ten professional programmers in an industrial setting. The results show that DESCo can correctly detect the defectiveness of the input source files. Fig. 10 shows the output obtained for one such test file (downloaded from a defect reporting engine). The highest accuracy (in the form of the MeanAcc score or Confidence) of 95.9% is achieved for the models estimating the presence of defects in a source file.
The organization of the rest of the paper is as follows. §2 describes the details of the proposed system. To verify the effectiveness of our system, we conducted several experiments, which we discuss in §3. §4 describes the related work. Conclusions drawn from our work are presented in §5.
2 Details of the proposed system
Two of the primary artefacts which can be used for determining the defectiveness of source code are a) the source code itself for the program and b) the details of the defects associated with it.
The plethora of OSS available today at various OSS repository hosts, such as GitHub and the Apache Software Foundation, can be easily leveraged to extract the above two artefacts. These OSS hosts [ ] also provide access to the details of defects associated with the software. Our system makes use of such data available from OSS repository hosts.
Table 1. Table of Notations
L: The set of programming languages {C, C++, Java, Python}.
G: The set of grammars associated with the programming languages L.
R: The set of OSS repositories downloaded from GitHub.
B: The set of defect reports associated with R.
f_λ^R: A source file written in programming language λ ∈ L, which is a part of R.
X_λ^R: The feature set built by extracting PROCON metrics' values from source files f_λ^R.
X_λ^B: The feature set built by extracting defect information from defect reports β, such that (∀ β ∈ B) ∃ f_λ^R, where λ ∈ L.
D_λ: Database comprising of X_λ^R and X_λ^B, where λ ∈ L.
Z: The set of programming constructs considered in our work. For instance, if, switch, for, and while.
K_λ^d: The set of source files f_λ^R which have at least one defect (β) reported against them, such that (∀ f_λ^R) ∃ β ∈ B, where λ ∈ L.
K_λ^u: The set of source files f_λ^R which are without (∄) any defect information associated with them, such that (∀ f_λ^R) ∄ β ∈ B, where λ ∈ L.
K_λ: The complete set of files extracted from R, such that K_λ = K_λ^d ∪ K_λ^u ∪ θ, where θ represents the files which are either empty or very small in size, and are thus not included in the dataset.
M: The mappings between source files f_λ^R and defect reports β, such that β ∈ B and λ ∈ L.
H: The set of complete paths of f_λ^R, where λ ∈ L.
A: The set of considered ML algorithms (discussed in §2.3.3).
P: The set of considered tuning parameters associated with A.
Ψ: The set of defect estimation tasks performed using DESCo (discussed in §2.3.1). For instance, detecting whether an input source file is likely-to-be-defective or not.
M_λ^ψ: The set of ML models built to perform the defect prediction task ψ when dealing with the source files written in λ. Such an ML model is obtained by training an ML algorithm from A, with tuning parameters from P, on the dataset D_λ.
E: The set of evaluation metrics {F1 score, ROC area}.
For a programming language λ (∈ L), our system can be visualized as comprising two key elements:
• A large enough dataset, D_λ, of suitably crafted feature sets, X_λ^R and X_λ^B, about the source code (R) and its associated defects (B), respectively.
• A suitably designed system, DESCo, that picks the best-performing Machine Learning (ML) model for performing a prediction task ψ, when dealing with the source files written in λ.
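The model-picking element can be sketched as a simple search over candidate (algorithm, parameters) pairs, keeping the one with the highest score. The structure below is an assumed illustration, not the authors' implementation: the evaluate callback stands in for a scoring routine (e.g., mean cross-validation accuracy), the 0.808 figure is the SVM mean accuracy reported in the paper, and the other two scores are made-up placeholders.

```python
def select_best_model(candidates, evaluate):
    """Pick the (algorithm, params) pair with the highest evaluation score."""
    best, best_score = None, float("-inf")
    for algo, params in candidates:
        score = evaluate(algo, params)  # e.g. mean k-fold CV accuracy
        if score > best_score:
            best, best_score = (algo, params), score
    return best, best_score

# Toy score table: SVM/rbf uses the paper's 0.808; the rest are placeholders.
scores = {("svm", "rbf"): 0.808, ("svm", "linear"): 0.71, ("nb", None): 0.65}
best, acc = select_best_model(scores.keys(), lambda a, p: scores[(a, p)])
```

In the real system the candidate space would cover the 12 ML algorithms and 50+ parameter combinations mentioned earlier, with one such selection performed per task ψ and language λ.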
Fig. 1. Details of the proposed system
Fig. 1 shows the logical structure of the proposed system. There are two major subsystems here:
• The first one is the PROCON dataset builder. It extracts the necessary information from various OSS repositories and defect-tracking portals. The complete procedure for building the dataset is discussed in §2.2.
• Next, we build the DESCo system. For a given source file f and the dataset (PROCON), DESCo estimates the defectiveness of the source code f. The details of how DESCo works are described in §2.3.
2.1 Pivotal design decisions
We need to address the following crucial design issues and questions to realize the system outlined in Fig. 1:
(1) Source files written in which programming languages should be chosen for building the dataset used in our system? Will the overall design of our system be dependent on the choice of programming language?
(2) How to choose the language constructs for various programming languages?
(3) What source code metrics can be used to measure the usage of various constructs of a language reliably?
(4) What criteria to use for selecting the OSS repositories for training our ML models?
(5) What attributes of the defects should and can be estimated?
(6) What ML algorithms should we use for various tasks? How to choose the values for the tunable parameters of those algorithms?
(7) How to measure and verify the prediction accuracy of a model for a given task?
In the following sections, we describe how we build various subsystems while also discussing how we address each
of the above issues.
2.2 Creating the PROCON dataset
In §1.1 we described how various researchers have successfully leveraged features derived from the syntactic properties of source code for various types of prediction tasks that use source code as the input.
In the preceding sections, we have highlighted the influence of programming decisions on the quality of software. Further, [ ] have shown that the choice of features used in building the dataset employed to train a model dramatically influences the model's performance on predictive tasks.
Table 2. Programming Constructs employed in PROCON metrics
[Category | Programming Construct (Z) | Programming Language: C, C++, Java, Python]
User-defined: Identifiers & Operators (identifiers; arithmetic, logical & relational operators³)
Given the above findings and our hypothesis as outlined in §1.2, we crafted a new set of metrics (called PROCON) which can be used as features for deriving a source code representation that can be utilized to estimate the defectiveness of source code. The metrics are listed in Table-3.
2.2.1 Choice of programming language and its constructs. For creating the PROCON dataset, we chose four programming languages (viz., C, C++, Java, and Python) for the input source files. One of the reasons for choosing these very languages is their popularity with professional programmers [ ] and the availability of a large volume of OSS source code written using these languages⁴.
The list of constructs that we have used is shown in Table 2. We performed a token-frequency and token-location analysis of a large corpus of source files. Our analysis has shown that the language constructs that we have selected for our metrics are the ones that are almost always present, and spread reasonably randomly, in source code written using these programming languages. For example, it is hard to imagine a real-world program which does not employ decision control constructs (such as if) and looping constructs (such as for and while) at some point. The constructs that are used commonly and at uniformly random locations in the code can serve as robust features for an accurate representation of the source code [17, 18].
One may argue that the defects in the program are more likely to be caused by the usage of the infrequently used constructs. Hence, leaving out the sparingly used constructs may adversely affect the ability to capture defectiveness. We would like to note that through these metrics, we are merely trying to compute a representation of the input source code. The metrics individually by themselves cannot indicate defectiveness. An ML model trained using a suitable algorithm and a properly labelled input dataset is what estimates the defectiveness of unseen source code.
³ Operators considered are: {division, plus, not, multiply, minus, less, modulus, negation, assignment}.
⁴ Sources of stats: and
Further, in
Table 3. Details of newly crafted software metrics (PROCON metrics)
Software Metric | Description | Value for Z = if construct used in Fig. 3 (Table-4)
maxZCount | maximum number of times a construct Z is used in a source code | maxIfCount: 2
minZCount | minimum number of times a construct Z is used in a source code | minIfCount: 0
avgZCount | average number of times a construct Z is used in a source code | avgIfCount: 1
stdDevZCount | standard deviation of the number of times a construct Z is used in a source code | stdDevIfCount: 0.816
maxZDepth | maximum depth at which a construct Z is used in the AST of source code | maxIfDepth: 11
minZDepth | minimum depth at which a construct Z is used in the AST of source code | minIfDepth: 0
avgZDepth | average depth at which a construct Z is used in the AST of source code | avgIfDepth: 7.25
stdDevZDepth | standard deviation of depth at which a construct Z is used in the AST of source code | stdDevIfDepth: 4.264
maxZLength | maximum lexical length of a construct Z used in a source code | maxIfLength: 118
minZLength | minimum lexical length of a construct Z used in a source code | minIfLength: 0
avgZLength | average lexical length of a construct Z used in a source code | avgIfLength: 65.5
stdDevZLength | standard deviation of lexical length of a construct Z used in a source code | stdDevIfLength: 49.088
our metrics, we also use the depth (indicating the location) in the program's AST at which a construct has been used. It helps maintain the uniqueness of the representation of an input source code. For instance, consider the code fragments shown in Figure 2: both code fragments use the same constructs but with a different ordering (and thus at different depths), and hence are different. Although the code fragments shown in Figure 2 have the same values for the count and length metrics of constructs, the depth metrics help to differentiate between the two.
(a) Program fragment 1 (b) Program fragment 2
Fig. 2. AST depth: an important PROCON metric
We build separate datasets of the PROCON metrics for each programming language that we considered. Thus, the overall design of our system is independent of the programming languages that we consider.
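The differentiating role of depth can be reproduced on two tiny Python fragments of our own (a hypothetical sketch; these are not the fragments of Figure 2): both use exactly one for and one if, so the count metrics agree, but the AST depths of the if construct differ with the nesting order.

```python
import ast

def construct_depths(source, node_type):
    """Depth of each occurrence of node_type in the AST (root = depth 0)."""
    out = []
    def walk(node, depth):
        if isinstance(node, node_type):
            out.append(depth)
        for child in ast.iter_child_nodes(node):
            walk(child, depth + 1)
    walk(ast.parse(source), 0)
    return out

frag1 = "for i in xs:\n    if i:\n        print(i)"    # if nested inside for
frag2 = "if flag:\n    for i in xs:\n        print(i)"  # for nested inside if

d1 = construct_depths(frag1, ast.If)  # [2]
d2 = construct_depths(frag2, ast.If)  # [1]
```

Count-based metrics alone would map both fragments to the same feature values; the depth metrics (here [2] vs [1] for the if construct) keep their representations distinct.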
2.2.2 Choice of metrics. At the source code level, a common manifestation of programming decisions is the usage patterns of various programming constructs. Such usage patterns can be derived or detected from lexical properties such as the count, depth, and length of a construct as used in a source file.
(a) Program fragment 1
(b) Program fragment 2
(c) Program fragment 3
(d) Partial screenshot of ANTLR generated AST corresponding to Fig. 3a
Fig. 3. Example 1: Program fragments to explain the PROCON metrics listed in Table 3
Table 4. Computing the PROCON metric values for the program fragments shown in Fig. 3
Lexical measure | Program 1 | Program 2 | Program 3 | max | min | avg | stdDev
Count | 2 | 0 | 1 | 2 | 0 | 1 | 0.816
Depth | 9, 11 | 0 | 9 | 11 | 0 | 7.25 | 4.264
Length | 79, 118 | 0 | 65 | 118 | 0 | 65.5 | 49.088
As such, the PROCON metrics that we have introduced (see Table-3) are derived from the lexical properties, such as count, depth, and length, associated with the usage of different programming constructs Z. Some examples of such constructs are if, while, and for. The property depth associated with a construct Z refers to the depth at which Z occurs in the Abstract Syntax Tree (AST) of the source code under consideration. For instance, for the program fragment referred to by Fig. 3a, the depth of the two if constructs shown in the respective AST of the program in Fig. 3d is 9 and 11, respectively. The lexical length of a construct Z is defined as the total count of characters present in the construct definition
or declaration (in the case of variables) statement. For instance, the lexical length of a construct would mean the sum total of all the characters (excluding spaces) present in its definition (or body). Table-2 gives the complete category-wise list of programming constructs considered by us.
For illustration, we have worked out the computation of the PROCON metrics for the three program fragments shown in Fig. 3. The programming construct considered in this example is the if construct. To compute the PROCON metrics, we first find out the lexical measure values corresponding to the if construct used in the program fragments. We then compute the PROCON metrics' values by applying statistical measures, namely maximum (max), minimum (min), average (avg), and standard deviation (stdDev), over the computed lexical measure values (shown in Table-4). Table-3 gives the formal definitions of all the PROCON metrics with their corresponding values obtained for the program fragments shown in Fig. 3.
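The aggregation step just described can be sketched in a few lines (a hypothetical helper of ours, not the paper's code). Feeding it the if-construct counts [2, 0, 1] from the three fragments of Fig. 3 reproduces the Count row of Table 4; note that the 0.816 figure matches the population standard deviation rather than the sample one.

```python
from statistics import mean, pstdev

def procon_stats(values):
    """Apply the four PROCON statistical measures over per-fragment lexical values."""
    return {
        "max": max(values),
        "min": min(values),
        "avg": mean(values),
        "stdDev": round(pstdev(values), 3),  # population std dev, matching Table 4
    }

stats = procon_stats([2, 0, 1])  # stdDev comes out as 0.816, as in Table 4
```

The same function applied to the depth values [9, 11, 0, 9] yields avg 7.25 and stdDev 4.264, matching the Depth row.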
2.2.3 Structure of PROCON dataset. As described in the preceding section, the PROCON dataset consists mainly of the following information:
• Measured values of the lexical properties representing usage patterns of various programming language constructs in the source files of our corpus.
• Information about the defect reports linked with the source files of our corpus.
Table 5. Description of source files present in various datasets
Dataset for λ | Defect-linked file count⁵ (|K_λ^d|) | Files without defect info | Total files extracted (|K_λ^d| + |K_λ^u| + others) | Total files in the dataset |D_λ| = 2·min(|K_λ^d|, |K_λ^u|)
C | 3224 | 3468 | 11718 | 6448 (= 3224 × 2)
C++ | 174 | 6827 | 7202 | 348 (= 174 × 2)
Java | 318 | 7076 | 7500 | 636 (= 318 × 2)
Python | 1375 | 2048 | 4043 | 2750 (= 1375 × 2)
Combined⁷ | 5091 | 19419 | 30443 | 1392 (= 174 × 4 × 2)
For each of the considered programming languages (viz., C, C++, Java, and Python), we build a separate dataset using the PROCON metrics to train various ML models. Table 5 gives the details of the source files from which the datasets have been created. K_λ^d represents the set of files linked with defect reports, and K_λ^u represents the files not linked with any defect reports. K_λ represents the complete set of files extracted from online portals. This consists of the files represented by K_λ^d and K_λ^u, and the files not considered for building the dataset, such as empty files or very small files. An equal number of random files are combined from both K_λ^d and K_λ^u to form the dataset D_λ. All the datasets follow the same high-level schema shown in Fig. 4. The description of the tables is provided in Appendix-A.2.
We consider a total of about 30400 source files from 20 different GitHub repositories for creating the dataset. Further, approximately 14950 defect reports associated with these source files were extracted to capture the characteristics of defects. Tables 5 and 6 describe the composition of the dataset by providing details such as the type of source files, OSS repositories, and defect tracking portals chosen for building the dataset. Our dataset is persisted in the form of a relational database and is available at
⁵ Candidate defect-linked files are obtained by removing very small files or empty files from the category of defect-linked files present in the dataset. Candidate non-defect-linked files are obtained by removing very small files or empty files from the category of non-defect-linked files present in the dataset, and by filtering the files in the same file-length range as in the defect-linked files category.
⁷ The combined dataset is formed by taking an equal proportion of source files from each of the considered languages.
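The equal-proportion sampling that forms D_λ can be sketched as follows (a hypothetical helper of ours; the function name and seed are illustrative). Using the C++ counts from Table 5, it yields the 348-file dataset size shown in that row.

```python
import random

def build_balanced_dataset(defect_linked, non_defect_linked, seed=0):
    """Draw an equal number of random files from each class: |D| = 2*min(|Kd|, |Ku|)."""
    n = min(len(defect_linked), len(non_defect_linked))
    rng = random.Random(seed)
    return rng.sample(defect_linked, n) + rng.sample(non_defect_linked, n)

# Stand-in file lists sized like the C++ row of Table 5 (174 vs 6827 files).
files = build_balanced_dataset([f"d{i}" for i in range(174)],
                               [f"u{i}" for i in range(6827)])
# len(files) == 348 == 2 * min(174, 6827)
```

Balancing this way avoids training on a heavily skewed class distribution, at the cost of discarding most of the majority class for languages like C++.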
Fig. 4. Partial schema showing main entities of dataset.
Table 6. Details of GitHub repositories and associated defect reports
Repository | Total source files | Defect tracking portal | Total reported defects | Total source files linked to defects
ant-master | 1220 | | 653 | 96
batik-trunk | 1650 | | 251 | 104
commons-bcel-trunk | 485 | | 29 | 2
lenya-BRANCH-1-2-Z | 430 | | 234 | 16
webdavjedit-master | 7 | | 0 | 0
poi-trunk | 3284 | | 103 | 86
pengyou-clients-master | 198 | | 15 | 11
gcc-master | 18524 | GCC | 6087 | 3296
org.eclipse.paho.mqtt.python-master | 40 | | 77 | 4
paho.mqtt.embedded-c-master | 33 | | 3 | 2
paho.mqtt.-java-master | 228 | | 5 | 3
cpython-master | 1336 | | 3235 | 525
bedevere-master | 16 | | 82 | 4
mypy-master | 182 | | 253 | 31
peps-master | 18 | | 20 | 4
planet-master | 16 | | 35 | 3
Python-2.7.14 | 1325 | | 1521 | 443
Python-3.6.3 | 1284 | | 1712 | 479
typeshed-master | 3 | | 0 | 0
pythondotorg-master | 168 | | 642 | 47
Total Count | 30447 | 4 | 14957 | 5156
2.2.4 Steps for building the PROCON dataset. We extracted the values of the PROCON metrics by processing a large corpus of source files written in different programming languages. To estimate the defectiveness of source code, we also extracted the relevant information from defect reports which reference the source files that we considered. Following are the main steps involved:
(1) Selecting the source code repository hosts to extract source code: OSS repository hosts such as GitHub and SourceForge contain a large number of OSS repositories. We selected GitHub to fetch the OSS repositories due to the following reasons:
12 Ritu et al.
Algorithm 1 Establishing the mapping between defect reports and source files
Input: R := The set of OSS repositories downloaded from GitHub.
B := The set of defect reports associated with R.
L := The set of programming languages {C, C++, Java, Python}.
Output: M := The mappings between source files f (∀f ∈ R) and defect reports β (∀β ∈ B).
{Filter relevant source files}
1: H ← ϕ
2: for all source files f ∈ R do
3: if f is written in λ, and λ ∈ L then
4: H ← H ∪ f.path
5: end if
6: end for
{Identify names of affected source files from defect reports}
7: M ← ϕ
8: for all β ∈ B do
9: if β has an associated patch then
10: µβ ← scrapePatchInfoFordefect(β.id)
11: Fµβ ← extractAffectedSourceFileNames(µβ)
12: for all file names n ∈ Fµβ do
13: M ← M ∪ ⟨n, β.id⟩
14: end for
15: end if
16: end for
{Update M to replace each file name with its full path}
17: for all n ∈ M.keys() do
18: β.id ← M[n]
19: for all file paths p ∈ H do
20: if p contains n then
21: Delete M[n]
22: M ← M ∪ ⟨p, β.id⟩
23: end if
24: end for
25: end for
(a) SourceForge has discontinued its Version Control System (VCS) support8.
(b) GitLab and BitBucket, being newer, host far fewer OSS repositories.
(c) A significant amount of research work [ ] utilizes GitHub data, thus making GitHub a reliable source of data.
Therefore, we took the bulk of the raw content (source files) from GitHub for building our dataset.
(2) Selecting the defect tracking portals to fetch the defect information: To obtain the information about the defects associated with the various OSS repositories fetched from GitHub, we downloaded the defect reports from four different bug tracking portals, viz., Apache Bugzilla, Eclipse Bugzilla, GCC GNU Bugzilla, and Python bug tracker. We selected these defect tracking portals due to the easy availability of defect reports and their wider use in the existing literature [19, 20].
(3) Selecting the OSS repositories: Following are the main criteria we considered while selecting an OSS repository:
(a) Size constraint: The size of the repository should be greater than a threshold (2 MB). Please note that repository size here stands for the total size of only the source files present in the repository. All binary files, such as multimedia and libraries, are excluded.
(b) Source file count constraint: The repository should contain at least one source file written in the considered programming languages, viz., C, C++, Java, and Python.
(c) Defect report count constraint: The repository should be associated with at least one defect report.
(d) Reputation constraint: The repository should have earned 500+ stars. This constraint was applied to ensure that the selected repositories are popular and are being used by a certain mass of programmers.
To eliminate programming language selection bias, we selected repositories containing files written in one or more of the considered programming languages. These languages included Java, Python, C++, and C, which are among the most popular [16] enterprise programming languages used for creating software. The defect reports associated with these OSS projects were also extracted.
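The four selection criteria can be combined into a single predicate, as the sketch below shows. The field names of the repository record are illustrative assumptions for this sketch, not GitHub API fields.

```python
# Hedged sketch of the repository selection criteria described above.
SOURCE_EXTS = {'c', 'cpp', 'cc', 'h', 'java', 'py'}

def is_eligible(repo):
    source_size_mb = repo['source_size_bytes'] / (1024 * 1024)
    return (source_size_mb > 2                    # size constraint (source files only)
            and repo['source_file_count'] >= 1    # source file count constraint
            and repo['defect_report_count'] >= 1  # defect report count constraint
            and repo['stars'] >= 500)             # reputation constraint

candidate = {'source_size_bytes': 5 * 1024 * 1024, 'source_file_count': 120,
             'defect_report_count': 30, 'stars': 900}
print(is_eligible(candidate))                     # True
print(is_eligible({**candidate, 'stars': 10}))    # False
```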
(4) Establishing the mapping between the defect reports and the source files: To find the source files associated with defect reports, we utilize the summary and the patch fields of the defect reports. We fetch this information (patch and summary) by scraping the defect reports available at the different defect tracking portals. The patch information associated with a defect report contains at least one mention of the affected source file. We use this information to establish a mapping between the source files present in the OSS repositories and the defects reported corresponding to them, as explained in Algorithm 1. We store the mapping information in the SourceFileTodefectMapping table described in §2.2.3.
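The procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration: the dict-based report format is an assumption, and the portal scraping of steps 10-11 is replaced by a precomputed 'patch_files' field.

```python
# Sketch of Algorithm 1: map defect reports to full source-file paths.
# In the real system the affected file names come from scraping the
# patch attached to each report; here they are supplied directly.

def build_mapping(repo_paths, defect_reports, language_exts):
    # Steps 1-6: keep only paths of source files in a considered language
    H = [p for p in repo_paths if p.rsplit('.', 1)[-1] in language_exts]

    # Steps 7-16: collect <file name, defect id> pairs from reports with a patch
    M = {}
    for beta in defect_reports:
        for name in beta.get('patch_files', []):
            M.setdefault(name, []).append(beta['id'])

    # Steps 17-25: replace each bare file name with its full repository path
    resolved = {}
    for name, ids in M.items():
        for p in H:
            if p == name or p.endswith('/' + name):
                resolved[p] = ids
    return resolved

repos = ['src/util/Parser.java', 'src/main.c', 'docs/readme.md']
reports = [{'id': 42, 'patch_files': ['Parser.java']},
           {'id': 43, 'patch_files': []}]
print(build_mapping(repos, reports, {'c', 'cpp', 'java', 'py'}))
# {'src/util/Parser.java': [42]}
```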
(5) Feature extraction technique: To perform this feature extraction, we built a custom FeatureExtractor module using the ANTLR (ANother Tool for Language Recognition) [ ] library. The FeatureExtractor module uses the grammar of a programming language to build its parsing program, using which the required features are extracted from an input source file.
The computed metrics (u) are stored in a database as shallow knowledge in the SourceCodeFeatures table (shown in Fig. 4). For each source file which is referred to in a defect report, we extract information such as priority, status, type, and user exposure from that defect report and store that information in the defectInfo table (shown in Fig. 4).
Rene the shallow knowledge: The rening process involves, among other tasks, eliminating any biasing such as
that introduced due to the use of source les written only in one language, presence of outliers in the form of le
size or a specic feature, and so on.
We perform the following steps for rening the shallow knowledge:
(a) Filtering only those source les which are of similar size.
Normalizing the attribute values on a suitable scale. For example, computed feature values for dierent source
les can be normalized w.r.t the le length.
(c) Removing the bias towards individual features by using a MinMaxScalar13 function of ScikitLearn [22].
The result of this step is our nal dataset persisted in the form of the RenedFeatures table (shown in Fig. 4).
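The normalization and scaling steps can be sketched in pure Python as follows; the per-column scaling mirrors what ScikitLearn's MinMaxScaler does.

```python
# Sketch of the refinement steps: per-file-length normalization followed
# by per-feature min-max scaling of each column to the [0, 1] range.

def refine(rows, lengths):
    """rows: list of feature vectors; lengths: matching file lengths."""
    # normalize each feature vector w.r.t. its file length
    norm = [[v / L for v in row] for row, L in zip(rows, lengths)]
    # rescale each feature column to [0, 1]
    cols = list(zip(*norm))
    scaled_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0      # avoid division by zero for constant columns
        scaled_cols.append([(v - lo) / span for v in col])
    return [list(r) for r in zip(*scaled_cols)]

print(refine([[10, 4], [30, 3]], [100, 300]))
# [[0.0, 1.0], [0.0, 0.0]]
```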
Fig. 5. PROCON dataset builder
2.2.5 PROCON dataset builder. Fig. 5 shows the architecture of this module. The dataset produced by this module is stored in a relational database whose schema is shown in Fig. 4.
The major components of the PROCON dataset builder are described below:
LexPar module: Given a set of grammars, ANTLR [ ] generates the Lexers and Parsers corresponding to each grammar. The Lexer and Parser artefacts provide an API using which one can extract various lexical properties of a given source file. These Lexers and Parsers make up the LexPar module, which is used by the Feature Extraction module (described next).
Feature Extraction module: This is a higher-level program developed to extract the values of various features from input source files. It makes use of the LexPar module to perform its task. Following are its main steps. For each source file f selected from the collection of OSS repositories (R), do:
(a) Identify the programming language (λ) in which the source file f is written. This is done by checking the file extension.
(b) Call the API in the LexPar module to extract the PROCON metrics values.
(c) Store the extracted values in the database table (SourceCodeFeatures) of the PROCON dataset schema.
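The language-identification step, which checks the file extension, can be sketched as follows; the extension-to-language table is illustrative, not exhaustive.

```python
# Sketch of the dispatch step: infer the programming language from the
# file extension before invoking the matching lexer/parser in LexPar.
import os

EXT_TO_LANG = {'.c': 'C', '.h': 'C', '.cpp': 'C++', '.cc': 'C++',
               '.java': 'Java', '.py': 'Python'}

def identify_language(path):
    _, ext = os.path.splitext(path)
    return EXT_TO_LANG.get(ext.lower())   # None for unsupported languages

print(identify_language('src/Parser.java'))  # Java
print(identify_language('README.md'))        # None
```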
Defect Information Collector: The input to this module is a set of defect reports B. A defect report β is downloaded from the defect tracking portal of the source code repository and, in most cases, is available in a text format such as a CSV file. β contains attributes such as the unique ID assigned to the defect, the severity assigned to the defect, the type of the defect, a detailed description, and the affected source files. This module processes each β (∈ B) to extract such details and persists them in the table defectInfo of the PROCON dataset.
Patch Scraper: To establish the link between source files and the associated defects, we first need to identify the files which are possibly affected by the defect. Usually, when a defect is fixed, the developers update the defect report with the details of the files modified to fix the defect. This information is typically supplied in the patch and comments fields of the defect report. If such patch information is not available in the text format (e.g., CSV file) dump of defects from a defect tracking portal, then we obtain that information by scraping the portal.
Given a set of input defect reports B, the Patch Scraper extracts the necessary information (present in HTML markup) from the web page corresponding to each defect report β. The scraped information is then utilized (see next item) to establish a mapping between the input source files and the defect reports.
Mapping Builder: Given a collection of source repositories R and a set of defect reports B associated with R, this module establishes the mapping between the source files and the defect reports. This mapping is necessary to study the characteristics associated with the defects reported in various source files. To establish the mapping, we utilize the patch and the summary portions of the defect reports, extracted using the Patch Scraper. The mapping information is stored in the SourceFileTodefectMapping table (shown in Fig. 4).
Information Refiner: This module filters out non-essential information from the dataset. The goal of this refining is to derive a normalized dataset which is free from bias. The information thus obtained is stored in the RefinedFeatures table (shown in Fig. 4).
The implementation details of various modules are described in Appendix-A.1.
2.3 The DESCo sub-system
DESCo is an ML-based system which performs various defect estimation tasks for an input source file. The key components of DESCo are described in §2.3.2. Identifying the best ML model for a given estimation task is a key processing step of DESCo and has been formulated as an optimization problem (discussed in §2.3.5). §2.3.4 describes the evaluation metrics used to measure the performance of the ML models comprising DESCo.
2.3.1 Goals of the DESCo sub-system. Given an input source file, the DESCo sub-system determines two things about it:
Phase-1: Whether the input source code is likely-to-be-defective. We do not determine the absence of defects.
Phase-2: If found likely-to-be-defective, the system identifies the probable attributes of the apparent defects. The attributes include the defect's severity, category (e.g., performance, functionality, and crash), and so on.
It seeks to provide answers to the questions mentioned in the following scenarios. What is the likelihood that the input source code may contain defects:
(1) of a specific priority or severity (for instance, critical, high, low, and medium)?
(2) of a specific type (for instance, enhancement and defect-fix)?
(3) that occur mostly on a specific operating system (OS)?
(4) that occur mostly on a specific hardware?
(5) that involve a specific level of user exposure (measured, for instance, via the number of comments on the defects)?
Why only this set of scenarios? These scenarios were chosen to reflect the important characteristics of defects across multiple defect tracking platforms (e.g., ASF Bugzilla, Eclipse Bugzilla, GCC Bugzilla, and Python defect Tracker). It is assumed that the defect characteristics captured by these platforms have been well chosen, keeping their usefulness for developers in mind.
To the best of our knowledge, such use of defect characteristics is not reported in the current literature.
The task of Phase-2 can be easily repeated for additional characteristics of defects. Further, it is not difficult to replicate (we plan to do it in the future) the experiments for predicting defectiveness along with various qualitative aspects of programs, such as:
(1) Which programming language is more likely to result in defects in a given scenario (for instance, on a certain OS)?
(2) Which type of defects is most likely with the use of a specific programming language, on a particular OS or hardware?
2.3.2 Key modules of DESCo. These are shown in Fig. 1 (on page 6) and are described as follows:
Query Interface: This module provides an interactive interface to clients (human users and programs). The offered interfaces include a GUI and an API which allow the callers to specify a query type from a given list. The complete list of query types, their meaning, and expected input and output are described in §2.3.1. The contents of a source file for which defectiveness is to be estimated is one of the inputs.
Estimation Engine: This is the main module that performs the defect estimation for an input source file for a given scenario. To do so, it performs the following steps:
(a) First, it uses the Feature Extraction module to extract the features of the input source file.
(b) Next, it calls the Model Selector to select the best performing ML model for the given source code feature set and input scenario.
(c) Finally, it uses the selected best performing model to perform the defect estimation tasks.
Model Selector: In essence, DESCo estimates the defectiveness of an input source file (written in C, C++, Python, or Java) by performing ML classification via the most suitable ML model. Since different ML techniques outperform others in different scenarios [ ], accurate classification requires using an ML model which performs the best for a given scenario. Thus, a crucial step is to identify the best performing models for each of the estimation tasks (or scenarios) specified in §2.3.1.
To find such models, we iterate through various parameter combinations of different ML algorithms (listed in §2.3.3). This model selection problem can be solved as an optimization problem, which we formulate in §2.3.5. This module implements a solver for such an optimization problem.
Model Builder: It builds various ML models by training on the PROCON dataset using different parameter combinations of ML techniques. This module is used in conjunction with the Model Selector.
The implementation details of the DESCo modules are provided in Appendix-A.3.
2.3.3 Details of ML algorithms used by DESCo. Training and testing for the estimation tasks are performed using a variety of parameter combinations of the following 12 ML classification techniques.
(1) Linear SVM (LSVM) [23,24]
(2) SVM [23,25]
(3) Nu-SVM (NSVM) [23,26]
(4) Gaussian Process (Gauss) classier [27,28]
(5) K Nearest Neighbors (KNN) classier [29,30]
(6) Random Forest (RF) classier [31,32]
(7) Multi-Layer Perceptron (MLP) classier [33,34]
(8) Supervised Deep Belief Network (DBN) Classification [35,36]
(9) Logistic Regression [37,38]
(10) Bernoulli Naive Bayes [39,40]
(11) Multinomial Naive Bayes [39,41]
(12) Gaussian Naive Bayes [42,43]
Table 7. Parameter combinations of different ML techniques used in different phases of our approach
Key Phase 1 Phase 2
a LSVM(0.1, ‘ovr’) LSVM(1.0, ‘ovr’)
b LSVM(0.1, ‘ovo’) LSVM(1.0, ‘ovo’)
c LSVM(1.0, ‘ovr’) LSVM(0.1, ‘ovr’)
d LSVM(1.0, ‘ovo’) LSVM(0.1, ‘ovo’)
e SVM(1.0, ‘ovr’, ‘l’) SVM(1.0, ‘ovo’, ‘l’)
f SVM(1.0, ‘ovo’, ‘l’) SVM(10, ‘ovo’, ‘r’, 0.2)
g SVM(10, ‘ovr’, ‘r’, 0.2) SVM(1.0, ‘ovo’, ‘p’, 2)
h SVM(10, ‘ovo’, ‘r’, 0.2) SVM(1.0, ‘ovo’, ‘p’, 3)
i SVM(1.0, ‘ovr’, ‘p’, 2) SVM(1.0, ‘ovo’, ‘s’)
j SVM(1.0, ‘ovo’, ‘p’, 2) NSVM(0.5, ‘ovo’, ‘l’)
k SVM(1.0, ‘ovr’, ‘p’, 3) NSVM(0.1, ‘ovo’, ‘l’)
l SVM(1.0, ‘ovo’, ‘p’, 3) NSVM(0.5, ‘ovo’, ‘r’, 0.2)
m SVM(1.0, ‘ovr’, ‘s’, 3) NSVM(0.1, ‘ovo’, ‘p’, 2)
n SVM(1.0, ‘ovo’, ‘s’, 3) NSVM(0.5, ‘ovo’, ‘p’, 3)
o RF(10) NSVM(0.1, ‘ovo’, ‘p’, 3)
p RF(5) NSVM(0.5, ‘ovo’, ‘s’)
q RF(7) NSVM(0.7, ‘ovo’, ‘s’)
r RF(20) Gauss(‘r’, ‘ovo’)
s RF(50) KNN(‘e’)
t RF(100) KNN(‘m’)
u NSVM(0.7, ‘l’, ‘ovo’) SVM(1.0, ‘ovr’, ‘l’)
v NSVM(0.7, ‘l’, ‘ovr’) SVM(10, ‘ovr’, ‘r’, 0.2)
w NSVM(0.5, ‘l’, ‘ovo’) SVM(1.0, ‘ovr’, ‘p’, 2)
x NSVM(0.5, ‘l’, ‘ovr’) SVM(1.0, ‘ovr’, ‘p’, 3)
y NSVM(0.7, ‘r’, ‘ovo’) SVM(1.0, ‘ovr’, ‘s’)
z NSVM(0.5, ‘r’, ‘ovo’) NSVM(0.5, ‘ovr’, ‘l’)
A NSVM(0.7, ‘r’, ‘ovo’, 0.2) NSVM(0.5, ‘ovr’, ‘r’, 0.2)
B NSVM(0.5, ‘r’, ‘ovo’, 0.2) NSVM(0.5, ‘ovr’, ‘p’, 2)
C NSVM(0.7, ‘r’, ‘ovr’) NSVM(0.5, ‘ovr’, ‘p’, 3)
D NSVM(0.5, ‘r’, ‘ovr’) NSVM(0.5, ‘ovr’, ‘s’)
E NSVM(0.7, ‘r’, ‘ovr’, 0.2) NSVM(0.7, ‘ovr’, ‘l’)
F NSVM(0.5, ‘r’, ‘ovr’, 0.2) NSVM(0.7, ‘ovr’, ‘r’, 0.2)
G NSVM(0.7, ‘s’, ‘ovo’) NSVM(0.7, ‘ovr’, ‘p’, 2)
H NSVM(0.5, ‘s’, ‘ovo’) NSVM(0.7, ‘ovr’, ‘p’, 3)
I NSVM(0.7, ‘s’, ‘ovr’) NSVM(0.7, ‘ovr’, ‘s’)
J NSVM(0.5, ‘s’, ‘ovr’) Gauss(‘r’, ‘ovr’)
K NSVM(0.7, ‘p’, ‘ovo’) LSVM(0.5, ‘ovr’)
L NSVM(0.5, ‘p’, ‘ovo’, 0.2) LSVM(10, ‘ovr’)
M NSVM(0.7, ‘p’, ‘ovo’, 0.2) LSVM(0.5, ‘ovo’)
N NSVM(0.5, ‘p’, ‘ovo’, 0.2) LSVM(10, ‘ovo’)
O NSVM(0.7, ‘p’, ‘ovr’) LSVM(5, ‘ovr’)
P NSVM(0.5, ‘p’, ‘ovr’) LSVM(5, ‘ovo’)
Q NSVM(0.7, ‘p’, ‘ovr’, 0.2) RF(10)
R NSVM(0.5, ‘p’, ‘ovr’, 0.2) RF(7)
S Gauss(‘r’, ‘ovo’) RF(20)
T Gauss(‘r’, ‘ovr’) RF(50)
U KNN() RF(75)
V MLP() RF(100)
W MLP(maxIter = 200) MLP()
X LogisticRegression()
Y BernoulliNB()
Z MultinomialNB()
AA GaussianNB()
AB SupervisedDBNClassication
(hiddenLayerStructure=[256,256], nIter=100,
learningRateRbm=0.05, learningRate=0.1)
We used the implementations of these algorithms as provided by the ScikitLearn [22] library.
The parameter combinations that we tuned for these ML algorithms are described in Table-7. Tuning of algorithm-specific parameters was performed by carrying out several experiments that used different parameter combinations of these ML techniques. Each of the ML models built using these techniques acts as a binary classifier. For instance, when testing for the defectiveness of a source file, an ML model either labels it as likely-to-be-defective (with label = 1) or unknown (with label = 0). Similarly, when estimating the defect characteristics, ML models are trained to test for the presence of each of the characteristic values. For instance, we have built ML models which classify a source file as likely-to-contain high priority defects or not-likely-to-contain high priority defects, and similarly for all such scenarios discussed in §2.3.1 and §3.1.
Rationale for choosing the above ML algorithms: Labelling a source file as {Likely-defective, Unknown} falls in the category of binary classification, which can be solved using a supervised learning ML technique [ ]. To find the best possible ML technique for our case, we experimented with the 12 techniques listed above. These algorithms are the ones most used for binary classification tasks of diverse kinds [ ]. In many cases, they are also the best performing options. Hence, we selected this set of ML algorithms for our experiments. Needless to say, this list can keep evolving as more advances are made in binary classification ML techniques.
Parameter Congurations:
A brief description of the pertinent parameters (listed in Table-7) of dierent ML algo-
rithms that we tuned is as follows:
Kernel Congurations (
): We experimented with all the four kernel types: linear (
), radial (
), sigmoid (
), and
poly (p) kernel.
Penalty (
) Congurations: We experimented by considering both the fractional and integral (low and high both)
penalty values (viz., 0.1,0.5,0.7,1.0,5.0,and 10.0).
(3) Number of estimators: These are also referred to as the number of decision trees in various ML algorithms. The
set of values chosen were: 5,7,10,20,50,75,and 100.
Degree of the polynomial (in case of a
kernel): We used both fractional as well as the integral values of the
degree of a polynomial (viz., 2,3,and 0.2).
Gamma or the Kernel Coecient (
): Both the default value (
) and a fractional value (0
2) were experimented
with various ML algorithms.
Method of classication: We experimented with both the classication method types, viz., one-vs-one (
) and
one-vs-rest (ovr).
(7) Type of distances: We considered manhattan (m) and euclidean (e) distance types.
Why only these specic conguration values?
Each of the considered ML algorithms has a suggested or
expected range of values for its conguration parameters. The parameter values that we have chosen comply with
such suggested/expected boundaries and are selected to represent a variation within those boundaries such that it
is meaningful for our tasks.
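Written out as a grid, the tuned parameter space amounts to the following. This dictionary is an illustrative paraphrase of Table 7; the key names follow ScikitLearn conventions (e.g., C for the penalty) and are our assumption.

```python
# Illustrative grid of the tuned parameter values described above.
PARAM_GRID = {
    'kernel': ['linear', 'rbf', 'sigmoid', 'poly'],  # kernel configurations
    'C': [0.1, 0.5, 0.7, 1.0, 5.0, 10.0],            # penalty values
    'n_estimators': [5, 7, 10, 20, 50, 75, 100],     # number of decision trees
    'degree': [2, 3, 0.2],                           # poly-kernel degree
    'gamma': ['default', 0.2],                       # kernel coefficient
    'decision_function_shape': ['ovo', 'ovr'],       # classification method
    'metric': ['manhattan', 'euclidean'],            # KNN distance types
}
```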
2.3.4 Evaluation metrics. We selected the F1 score and the ROC curve area as the evaluation metrics for our work. The F1 score is defined as follows:

    F1 score = (2 × Precision × Recall) / (Precision + Recall)    (1)

where Precision is defined as:

    Precision = true positive / (true positive + false positive)    (2)

and Recall is defined as:

    Recall = true positive / (true positive + false negative)    (3)

Since the F1 score captures the effect of both Precision and Recall, we compute only the F1 score values and their respective standard deviation (or error) values. The higher the F1 score, the better the prediction accuracy of the model.
The ROC metric is used to evaluate the quality of output. An ROC curve is a plot of the true positive rate (Y-axis) vs the false positive rate (X-axis) for an experiment. The point at the top-left corner of the plot depicts the most 'ideal' behavior, having the {false-positive-rate, true-positive-rate} pair value {0, 1}. Thus, a larger area under the curve signifies a better quality output. We, therefore, select the ROC curve area as our second evaluation metric.
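Equations (1)-(3) can be computed directly from confusion-matrix counts, as the sketch below shows; the ROC curve area is not included here since it requires the full ranking of classifier scores.

```python
# The evaluation metrics of Eqs. (1)-(3), from confusion-matrix counts.

def precision(tp, fp):
    # fraction of predicted positives that are truly positive
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of actual positives that the model recovers
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# precision = recall = 0.8, so F1 is also ~0.8
print(f1_score(tp=80, fp=20, fn=20))
```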
Since the highest ROC curve area value and the F1 score value may differ across models, we take the average of the two as the final accuracy measure of a model (MeanAcc, defined in Equation 4). Further, all the results obtained are validated using k-fold cross-validation. Higher values of k limit the number of data points in a validation set, while a lower value would increase the risk of bias in the dataset. We, therefore, selected a value of k = 5 for our experiments.
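A plain k-fold split with k = 5 can be sketched as follows; this is a minimal illustration, and standard cross-validation utilities such as those in ScikitLearn provide the same functionality.

```python
# Split n samples into k contiguous folds; each fold serves once as the
# validation set while the remaining k-1 folds are used for training.

def k_fold_indices(n, k=5):
    folds = []
    base, extra = divmod(n, k)   # spread any remainder over the first folds
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

print(k_fold_indices(10, k=5))
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```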
2.3.5 Selecting the best performing model: Problem Formulation. We define the best performing ML model as the one with the highest MeanAcc measure value, where MeanAcc is represented by Eq. 4. There, E represents the set of defect estimation tasks (mentioned in §2.3.1), A represents the set of ML algorithms, and Π the set of tuning parameter combinations, such that M^{α,π}_{λ,ψ} represents the ML model built using the ML algorithm α with the parameter combination π, where α ∈ A and π ∈ Π. Also, K represents the set of evaluation metrics, viz., the F1 score and the ROC area discussed in §2.3.4.

    MeanAcc(M^{α,π}_{λ,ψ} | D_λ) = (1/|K|) Σ_{k∈K} AccVal(k, M^{α,π}_{λ,ψ} | D_λ)    (4)

where AccVal represents the accuracy measure value of metric k obtained by applying the ML model M^{α,π}_{λ,ψ} on the input dataset D_λ (explained in more detail with the help of an example below).
For each of the tasks ψ ∈ E, the problem of finding the best performing model, trained on the PROCON dataset D_λ, to perform the defect estimation for source files written in λ ∈ L can be defined as follows:

    argmax_{α,π} MeanAcc(M^{α,π}_{λ,ψ} | D_λ), where α ∈ A, π ∈ Π    (5)
Example scenario: For instance, let us consider the scenario of finding the most suitable model for defectiveness estimation using a dataset D_λ. One of the instances of M^{α,π}_{λ,ψ} could use the LSVM technique (α = LSVM) with the tuning parameters π = (0.1, 'ovr'), which corresponds to the parameter combination listed as row a in column "Phase 1" of Table 7.
For each of the estimation tasks ψ ∈ E, we experiment with different ML algorithms (α ∈ A) and report the obtained MeanAcc measure values in Fig. 6. Finally, for each of the listed scenarios, we select the best performing ML model – the one yielding the maximum MeanAcc measure value.
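The selection expressed by Eq. (5) amounts to an argmax over the (algorithm, parameter-combination) pairs. Below is a minimal sketch, assuming an evaluate callback that returns the F1 and ROC-area scores for a candidate.

```python
# Sketch of the model-selection step: evaluate every (algorithm, params)
# pair and keep the one with the highest MeanAcc, i.e. the average of
# the F1 score and the ROC curve area.

def select_best_model(candidates, evaluate):
    """candidates: iterable of (algorithm, params) pairs.
    evaluate: callable returning (f1, roc_area) for a pair."""
    best, best_acc = None, float('-inf')
    for algo, params in candidates:
        f1, roc = evaluate(algo, params)
        mean_acc = (f1 + roc) / 2
        if mean_acc > best_acc:
            best, best_acc = (algo, params), mean_acc
    return best, best_acc

# Illustrative pre-computed scores for two candidate models
scores = {('LSVM', 0.1): (0.80, 0.82), ('SVM', 1.0): (0.78, 0.90)}
best, acc = select_best_model(scores, lambda a, p: scores[(a, p)])
print(best, round(acc, 2))  # ('SVM', 1.0) 0.84
```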
To validate the ecacy of our system, we performed relevant experiments for testing its performance and behaviour.
The performance is evaluated by observing the accuracy achieved by the DESCo system when answering our stated
prediction questions. The behaviour of DESCo is evaluated on the basis of its functional and non-functional aspects.
The functional correctness of the system is evaluated by testing on source le samples and the corresponding defect
reports downloaded from GitHub and Apache Bugzilla, while the non-functional aspects of the system are evaluated on
the basis of its easy-of-use and response-time. Table-8shows a high-level summary of all the experiments.
3.1 Performance Evaluation
These experiments were aimed at identifying the best performing algorithm for:
(1) The {Likely-defective, Unknown} classification task (referred to as Phase 1 prediction) on the input source code.
(2) Estimating the associated defect characteristics (referred to as Phase 2 prediction) for the Likely-defective cases.
Table 8. Experiments summary

Experiment #1 (§3.1.1)
Objective: Determine the best ML algorithm for the following estimation task: determine if an input source file is likely-to-be-defective or not.
Major findings: Best classifier: SupervisedDBNClassifier; best dataset: Python dataset; accuracy achieved: 95.96%.

Experiment #2 (§3.1.2)
Objective: Determine the best ML algorithm for the following estimation task: what are the characteristics of the defects that are likely to be associated with an input source file that has been classified as likely-to-be-defective in Experiment #1?
Major findings: Best classifier: LSVM classifier; best dataset: C dataset; accuracy achieved: 91.4%; defect characteristic: defects of type Enhancement.

Experiment #3 (§3.1.3)
Objective: How does the DESCo system perform in comparison to the state-of-the-art (SOA) techniques?
Major findings: DESCo outperforms one of the SOA techniques [7] with an improvement of 44.9%.

Experiment #4 (§3.1.4)
Objective: Comparison at the dataset level: do PROCON datasets contribute towards the performance of a defect prediction system?
Major findings: DESCo, when trained on the PROCON dataset, shows improvements of 19.46% and 29.9% over the SOA datasets [7, 46], respectively.

Experiment #5 (§3.2.1)
Objective: To test the efficacy of DESCo on real or production-quality software.
Major findings: The estimates made by DESCo match the real characteristics.

Experiment #6 (§3.2.2)
Objective: To test if DESCo works in the expected manner for the user-audience it is designed for.
Major findings: DESCo obtains a feedback score of >8/10 for most of the cases.
To achieve this goal, we iterate through different parameter combinations of various ML algorithms and compare the performance of the generated ML models. The details of the experiments performed are discussed next.
3.1.1 Experiment 1.
Objective: Determine the best ML algorithm for performing the following estimation task: determine if an input source file is likely-to-be-defective or not.
We train different ML models on the PROCON dataset using various parameter configurations of ML algorithms (listed in Table-7). The ML models perform the task of classifying the input source file as {likely-to-be-defective, unknown}, referred to as Phase 1 henceforth. To find the best performing model, we compare the models built using the averaged accuracy metric (MeanAcc metric values) as discussed in §2.3.5. The MeanAcc metric values obtained corresponding to 54 parameter combinations of 12 ML algorithms trained on various datasets are shown in Fig. 6(a). In Fig. 6(a), the x-axis represents the various ML model parameter combinations (listed in the "Phase 1" column of Table 7) with which the experiment is performed, and the corresponding MeanAcc values obtained are represented on the y-axis.
Salient observations from the experiment: The observations drawn from the Phase 1 results (shown in Fig. 6) are as follows:
(1) The Python dataset trained using the SupervisedDBNClassification technique gives the highest (best) MeanAcc value of 0.9596 (with an associated σ of 0.0022).
(2) Models built using the MLP classifier (point "W" in Fig. 6(a)) yield the lowest MeanAcc values for all the datasets; lowest (= 0.25) in the case of the C++ dataset.
(3) A sharp fall in MeanAcc values is observed for the models trained using the Nu-SVM classifier on the C, C++, and Combined datasets.
Inferences drawn from the experiment: From the MeanAcc measure values shown in Fig. 6, we can draw the following inferences:
(1) Since the SVM and Nu-SVM classifiers with a poly kernel, and the MLP classifier, result in the lowest MeanAcc metric values, such ML technique parameter combinations (viz., h, i, j, k, l, u, v, w, x, y, K, of Table-7) are not advisable for performing the task of defect prediction.
(2) SVM with a radial kernel, as well as LSVM, Logistic Regression, Gaussian, and SupervisedDBN Classification models with tuned parameter combinations (viz., a, b, S, T, X, listed in Table-7), are some of the suitable ML technique – parameter combinations for estimating the defectiveness associated with a source code.
(3) The performance of an ML model depends on the dataset it is trained on. The highest MeanAcc score values are obtained corresponding to the models trained using Python source files.
Fig. 6. Mean Accuracy (MeanAcc) measure values for prediction using our approach
3.1.2 Experiment 2.
Objective: Determine the best ML algorithm for performing the following estimation task: what are the characteristics of defects that are likely to be associated with an input source file that has been classified as likely-to-be-defective in Experiment 1?
For the files classified as likely-to-be-defective in Experiment 1 (or Phase 1), we next estimate the type of defects present, referred to as Phase 2 henceforth. For each of the defect estimation tasks discussed in §2.3.1, we perform the following steps:
Filter the les linked with the particular defect-types to create the sub-datasets of PROCON dataset for training-
testing ML models. For instance, to estimate the presence of “high” priority defect types in a le, we create a
sub-dataset containing the
source les linked to the “high” priority defects, and les linked to defects with
other priority types in 50:50 ratio.
We iterate through various parameter combination of ML algorithms and record the MeanAcc values (F1 score
and ROC area values) corresponding to each.
We select the model reporting the highest MeanAcc measure value for performing the testing with unlabelled
source les (performed in §3.2.1).
Fig. 6(b)-(e) represent the results obtained when estimating the defect characteristics by training models on various datasets. For all the plots shown in Fig. 6(b)-(e), the x-axis represents the various ML model parameter combinations (listed in the "Phase 2" column of Table 7) with which the experiment is performed, and the corresponding MeanAcc values obtained are represented on the y-axis.
Salient observations from the experiment:
(1) For predicting Medium exposure defects, ML models trained using C++ source files outperform the rest: the LSVM technique achieves the highest MeanAcc value of 0.7773 (with an associated σ of 0.1435).
(2) The C dataset gives the best performance for predicting defects of type Enhancement and those annotated as highest priority. The LSVM classifier yields the best results in both cases, with a MeanAcc of 0.8357 (with σ of 0.142) for predicting defects of type Enhancement, and a MeanAcc of 0.91397 (with σ of 0.017) in the case of defects annotated as highest priority.
(3) When predicting defects annotated with a specific OS type, the SVM classifier with a sigmoid kernel trained using the Java dataset gives the best MeanAcc measure value (0.688, with σ of 0.115).16
(4) In most cases, the SVM classifier and the Nu-SVM classifier with a poly kernel setup fared the worst on the PROCON dataset.
Inferences drawn from the experiment:
The important inferences drawn from the results depicted in Fig. 6are as
Dierent ML algorithms emerge as best-performing for predicting dierent types of defect characteristics. In
other words, it is not advisable to employ a single ML algorithm when predicting the type of defect characteristics
in an input source le. Therefore, the DESCo system, instead of using a single ML model, chooses the best
performing ML model for each prediction task.
In the case of Medium exposure defects, the F1 score measure and the ROC area measure differ in the best-declared ML techniques. Hence, different evaluation measures may mark different ML techniques as the best for a given
Note: Because of the lack of an adequate amount of defect metadata about os-type in the case of C, C++, and Python source files, the performance comparison could not be presented for them.
Manuscript submitted to ACM
A Defect Estimator for Source Code 23
scenario. The DESCo system takes the average of both these measure values (viz., F1 score and ROC curve area) when handling the prediction tasks.
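The averaging of the F1 score and ROC-curve-area measures to rank candidate models can be sketched as follows. This is a minimal illustration, assuming a synthetic dataset and two hypothetical candidate models; the actual DESCo model pool and data are those of Table 7 and the PROCON dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-in for a PROCON-style feature matrix (binary defectiveness label).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Illustrative candidates; the paper evaluates many more combinations.
candidates = {
    "LSVM": LinearSVC(dual=False),
    "RF_100": RandomForestClassifier(n_estimators=100, random_state=0),
}

def combined_score(model, X, y, k=5):
    """Average of mean F1 and mean ROC-AUC over k-fold cross-validation."""
    f1 = cross_val_score(model, X, y, cv=k, scoring="f1").mean()
    auc = cross_val_score(model, X, y, cv=k, scoring="roc_auc").mean()
    return (f1 + auc) / 2.0

# Pick the model with the best combined measure for this prediction task.
best = max(candidates, key=lambda name: combined_score(candidates[name], X, y))
```

Averaging the two measures resolves the case noted above, where F1 and ROC area would each nominate a different technique as best.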
Fig. 7. Example 2: Similar program fragments resulting in different feature vectors. (a) Program fragment 1; (b) Program fragment 2.
Fig. 8. Comparison with state-of-the-art techniques on the basis of Mean Accuracy (MeanAcc) measure values
24 Ritu et al.
3.1.3 Experiment 3.
Objective: How does the DESCo system perform in comparison to the state-of-the-art techniques?
To the best of our knowledge, there does not exist any prior work for estimating defect characteristics. We, therefore, present our comparison for Phase-1 of our approach: estimating whether an input source file is likely to be defective or not. Some of the existing works targeting this problem are [7, 47, 48].
Similar to our work, [7] builds feature vectors capturing the occurrences of various programming constructs (with order preserved) used in a source file. A Deep Belief Network (DBN) trained on these features is then used to perform the task of defect prediction. For instance, if the constructs for, if, etc. are assigned the integral codes 1, 2, and 3, respectively, then the feature vectors corresponding to the program fragments shown in Fig. 7 would become {1, 2, 3} and {3, 2, 1}, respectively.
In our work, we capture properties such as the count and depth of various programming constructs (in the form of PROCON metrics) used in a source file. The collection of all such PROCON metrics' values extracted from the input set of source files forms our feature set. To compare [7] with our work, we implement it by using pertinent libraries of Scikit-learn [22]. We refer to [7] as one of the State-Of-the-Art (SOA) approaches in the upcoming sections.
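The contrast between the two feature representations can be shown with a small sketch. The integer codes, the while construct, and the fragments below are our own illustrative choices (the source elides the third construct of its example), not the paper's Fig. 7 data:

```python
# Order-preserving encoding (as in the DBN-based SOA approach): each
# construct gets an integer code; a file becomes a sequence of codes.
CODES = {"for": 1, "if": 2, "while": 3}  # illustrative assignment

def sequence_features(tokens):
    """Order-sensitive feature vector: the sequence of construct codes."""
    return [CODES[t] for t in tokens]

def count_features(tokens):
    """PROCON-style usage metric (simplified to counts only; the real
    metrics also capture depth and length of occurrences)."""
    return {c: tokens.count(c) for c in CODES}

# Two fragments using the same constructs in a different order.
frag1 = ["for", "if", "while"]
frag2 = ["while", "if", "for"]

assert sequence_features(frag1) != sequence_features(frag2)  # order matters
assert count_features(frag1) == count_features(frag2)        # counts agree
```

This mirrors the observation above: under the order-preserving encoding, similar fragments yield different feature vectors ({1, 2, 3} vs. {3, 2, 1}), whereas usage-metric features treat them alike.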
The work in [47] targets the defect reports instead of source files; they build a model to classify the defect reports as predictable or unpredictable. Thus, we believe that a comparison with it would not be justifiable. Similarly, the work in [48] is focused on recommending (or detecting) defect reports that are similar to an input defect report. This again would not be a proper comparison to our work.
Fig. 8(a) presents a performance comparison of the 54 ML technique combinations when tried using our approach and one of the SOA techniques [7]. The considered ML model parameter combinations (listed in column "Phase 1" of Table 7) are represented by the x-axis, while the y-axis represents the respective MeanAcc measure values.
Observation from the experiment:
As shown in Fig. 8(a), our approach outperforms the state-of-the-art technique for 50 out of 54 parameter combinations (see Table-7) of the ML techniques considered. The highest MeanAcc measure value of 80.8%, with a mean error (σ) of 0.047, is obtained in the case of configuration item "g" of Table-7. Our technique outperforms [7] (configuration item "AB") with an improvement of 44.9%. Note: Since the work in [7] was performed only for Java source files, we perform the comparison for the common Java source files. We present the comparison at the dataset level in the next section. Further, [7] implement their approach using the SupervisedLearningClassifier (viz., configuration item "AB") only. But, to compare the performance of our approach with the approach mentioned in [7], we simulated [7] for all the ML parameter configurations shown in Figure 8 and Table 7.
Inference from the experiment:
DESCo, when trained on the PROCON dataset, performs the task of defect estimation with better accuracy in comparison to [7] (configuration item "AB"), achieving an improvement of 44.9%. Note: There exist some ML algorithm parameter configuration items, shown in Figure 8, where the ML models built using the semantic datasets generated using the approach in [7] outperform those trained on the PROCON dataset (the PROCON Java dataset, specifically). But DESCo outperforms the compared method in most of the cases.
3.1.4 Experiment 4.
Objective: Comparison at the dataset level: Do PROCON datasets contribute towards the performance of a defect prediction system?
In this experiment, we train different ML models on PROCON datasets and compare their performance with that of models trained on SOA datasets. Note: we train models for each of the considered languages, viz., C, C++, Java, and Python. The results of the experiments are shown in Fig. 8.
Label Notation: τ_λ_Dataset, where τ refers to the method used for building the metrics, viz., the PROCON-dataset building method, the Semantic-dataset building method [7], or the datasets present in the PROMISE repository, and λ represents the programming language for which the dataset is built, viz., C, C++, Java, or Python. For instance, PROCON_C_Dataset would represent a dataset built using PROCON metrics' values extracted from source files written in the C programming language.
We selected the two state-of-the-art (SOA) datasets closest to our work. The procedure can be listed in the following steps:
(1) Find the common files among the three datasets: PROMISE, SOA, and PROCON.
(2) Build the semantic datasets using the approach listed in [7], which we refer to as Semantic datasets henceforth.
(3) Extract the Chidamber and Kemerer (CK) metrics' values [46] of these common files from the PROMISE repository, which we refer to as the PROMISE datasets henceforth.
(4) Compare the performance of ML models trained on the language-wise combinations of the datasets formed in the previous steps, using the parameter combinations of various ML algorithms listed in Table-7 (for Phase 1).
The approach in [7] is performed only on the Java programming language; to compare with our datasets built for the other programming languages (viz., C, C++, and Python), we build the language-specific datasets using the approach in [7]. We then compare the performance of ML models trained on the PROCON datasets and the newly generated datasets built using [7].
The performance comparison corresponding to all the 11 datasets is presented in Fig. 8(b). The x-axis represents the parameter combinations of various ML algorithms, and the y-axis represents the corresponding MeanAcc measure values obtained. A power function is applied in order to spread the closely placed MeanAcc values; the exponent value of 1500 was found after experimenting with a few values.
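The power-function spreading used for Fig. 8(b) can be reproduced in a few lines; the sample MeanAcc values below are invented for illustration:

```python
import numpy as np

# MeanAcc values across configurations often differ only in the third
# decimal place, which makes them hard to distinguish on a linear y-axis.
acc = np.array([0.801, 0.805, 0.809, 0.812])

# Raising to a large power stretches the relative gaps while preserving
# ordering; the paper reports settling on the exponent 1500 empirically.
spread = acc ** 1500

# Order is preserved, but the ratio between the extremes grows enormously.
assert list(np.argsort(spread)) == list(np.argsort(acc))
assert spread.max() / spread.min() > acc.max() / acc.min()
```

Since x ↦ x^k is monotonic for x > 0, the transform changes only the visual separation of the points, never their ranking.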
Salient observations from the experiment:
ML models trained on the PROCON C dataset outperform both the baseline datasets (viz., the PROMISE dataset and the Semantic dataset), yielding the highest MeanAcc measure value of 84.47% (with σ of 0.016) for the RF classifier with 100 estimators (viz., row "t" of the Phase 1 column of Table-7). The SOA datasets give the highest MeanAcc values of 70.9% and 65.2%, respectively, when trained using the SupervisedDBN classifier.
Inferences drawn from the experiment:
The ML models trained on the PROCON dataset outperform the ones trained on the SOA datasets with an improvement of 19.46% and 29.9%, respectively. Note: there are certain ML algorithm parameter combinations, shown in Figure 8(b), in which the models trained on datasets built using [7] outperform those trained on PROCON datasets. This dip in DESCo's performance with respect to certain ML algorithm parameter combinations does not affect its overall performance, since DESCo uses the best-performing model to perform a given defect-estimation task (discussed in §2.3.5), and thus the choice of such low-accuracy models is automatically avoided.
3.2 Quality Testing
Software quality is defined as the degree to which a system, component, or process meets the specified requirements [5]. The requirements of a software system are broadly categorized as functional and non-functional, with the functional requirements modelling the functional correctness of the software, while the non-functional requirements capture the degree to which the software works as intended. Both the functional and non-functional requirements are listed in the requirements specification of a software system.
To evaluate the functional correctness of DESCo, we test its performance by experimenting with real or production-quality software data that is not a part of the PROCON dataset. To evaluate the fulfillment of non-functional requirements by DESCo, we choose evaluation metrics such as the ease-of-use, response-time, and accuracy, as observed by the end-users. We discuss the details of these experiments in the upcoming sections.
Fig. 9. DESCo User Interface
3.2.1 Experiment 5. We performed this experiment to test the functional correctness of DESCo. For this purpose, we downloaded some source files (reported to be defective) from various defect reporting portals (such as Apache Bugzilla) and tested them using DESCo. The results obtained for a source file linked to a defect report are shown in Fig. 10.
Objective: To test the efficacy of DESCo on real or production-quality software.
Procedure: To achieve the objective, we performed the following steps:
(1) First, we downloaded a collection of source files from various defect reporting engines. While downloading the source files, we ensured that the source files were associated with one or more defect reports.
(2) Next, we performed defect estimation on this collection of source files using DESCo.
(3) Finally, we validated the functionality of DESCo by analyzing the obtained defect estimates and the true defect characteristics of the input source files. The true characteristics were available with the defect report itself.
Salient observations from the experiment:
(1) The defect estimates made by DESCo for the input source file match those in reality.
(2) DESCo was also able to predict those characteristics of the defects which were missing in the corresponding defect reports originally. For instance, the given defect report only highlighted that the defect originated when working on Linux (an OSS OS), but DESCo provides additional information about other perspectives such as the user-activity associated with it, and the application area linked with the defect report. All these estimates are provided with associated ⟨accuracy, error⟩ pairs.
Fig. 10. Prediction results on a test file (ML model trained using PROCON_Combined_Dataset)
Inferences from the experiment:
In DESCo, the prediction of defect characteristics is made by separately training the ML models for each of the characteristics. This allows us to independently predict the presence of more than one defect characteristic in an input source file.
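The one-model-per-characteristic design can be sketched as follows. The characteristic names, synthetic labels, and choice of LinearSVC are all illustrative, not DESCo's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in features plus one label vector per characteristic.
X, y_severity = make_classification(n_samples=200, n_features=10, random_state=0)
_, y_os_type = make_classification(n_samples=200, n_features=10, random_state=1)

# Train a separate classifier for each defect characteristic.
models = {}
for name, y in {"severity": y_severity, "os_type": y_os_type}.items():
    models[name] = LinearSVC(dual=False).fit(X, y)

# Each characteristic of a new file is then predicted independently,
# which is what lets DESCo report several characteristics at once.
estimates = {name: int(m.predict(X[:1])[0]) for name, m in models.items()}
```

Because each model is independent, a missing label for one characteristic in the training data does not block predictions for the others.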
3.2.2 Experiment 6. This experiment was conducted in a controlled industrial environment. We recruited 10 participants
for the experiment.
Objective: To test if DESCo works as expected for the intended users and scenarios.
Procedure: The participants were asked to use the DESCo system to perform defect estimation on an input source code of their choice. The input source files were required to be at least 900 characters long (excluding comments), and written in C, C++, Java, or Python. We assume that source files containing source code of less than 900 characters contain insufficient construct information for confidence in the approach. Next, the participants were asked to rate the performance of the DESCo system on three quality parameters: the ease-of-use of the interface, the response time of DESCo, and the accuracy of the defect prediction task performed by it. The users rated on a scale of 1-10, with 1 representing the worst performance and 10 representing the best. Table 9 gives the details of the ratings received.
Salient observations from the experiment:
(1) End users remark that DESCo has a user-friendly interface.
(2) DESCo achieves a rating of 8/10 or better in most of the cases.
Interpretation of Table 9:
The first row of the table shows that a score of 10 was awarded for ease-of-use and response time by 1 voter each, and for accuracy by 2 voters. Similarly, the last row says that the score of 6 was awarded by only 1 voter, for the response time.
3.3 Threats to validity
Since we have chosen only open-source projects for our experiments, the behaviour may differ in the case of closed-source projects. We have not tried every possible value of each parameter of the ML algorithms considered. Since the
Table 9. Ratings recorded in Experiment 6

Score Y (scale 1-10) | ease-of-use | response time | accuracy
10                   | 1           | 1             | 2
9                    | 7           | 0             | 7
8                    | 2           | 6             | 1
7                    | 0           | 2             | 0
6                    | 0           | 1             | 0
PROCON dataset contains the features of only four languages (viz., C, C++, Python, and Java), our system can only predict the defectiveness associated with source files belonging to these languages. The different ML techniques, which we found to perform the best for our prediction tasks, may perform differently on a different set of features.
Another challenge is the authenticity of the defect reports themselves. In order to establish a mapping between the source files and the defect reports of various OSS, we search the defect reports for the presence of various source file names. Thus, a wrongly reported defect would lead to an incorrect source-file-to-defect-report reference, leading to an incorrect defect mapping. This may give rise to false positives, i.e., a situation where a source file free from defects is labelled as defective. For the current work, we assume that all the defect reports are valid.
The programming constructs and the metrics that we use in PROCON are programming-language specific. Also, our experiments show that the performance of our system varies across the different programming languages that we considered. Thus, the results may or may not generalize to all programming languages.
Further, the PROCON metrics capture the specific programming decisions mentioned in Definition 3. They may not be able to capture the abstract design choices made during architectural design.
We remove the outliers present in our datasets by:
- Selecting the source files with similar size (measured in terms of character count). The range of characters differs among the source files written in different programming languages.
- Normalizing the feature-values using the Min-Max Scaling method. This was done to prevent a particular feature comprising large measure values (or feature values) from influencing the entire dataset, thus diminishing the effect of the other constituent features of the dataset.
We believe that the above outlier avoidance measures help us in eliminating the possibility of the swamping effect [49] in our system. However, as is the case with any experiment, it is not possible to cover all possibilities for the detection and avoidance of the swamping effect.
Denition 4: Swamping eect
It is dened as the situation where “clean” data item is incorrectly labeled as an outlier, due to the presence of
multiple clean sub-groupings within the data [49].
Further, while performing testing of the DESCo system, we consider the source files with effective size (excluding comments and documentation) >= 900 characters. We assume that source files smaller than this size contain insufficient construct information, and can be neglected for our case. Further, the use of small files (<900 characters) might skew the dataset because of the sparse feature vectors generated corresponding to them.
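The effective-size filter can be sketched as follows. The paper specifies only the 900-character threshold and the exclusion of comments, so the comment-stripping rules below (C-style line and block comments) are our own simplifying assumption:

```python
import re

def effective_size(source, line_comment="//", block=("/*", "*/")):
    """Character count of a source text excluding comments (a rough
    sketch; the exact filtering used by DESCo is not specified)."""
    # Strip block comments first, then line comments.
    pattern = re.escape(block[0]) + r".*?" + re.escape(block[1])
    no_block = re.sub(pattern, "", source, flags=re.DOTALL)
    lines = [ln.split(line_comment)[0] for ln in no_block.splitlines()]
    return sum(len(ln) for ln in lines)

def accept_for_testing(source, threshold=900):
    """Keep only files with enough construct-bearing content."""
    return effective_size(source) >= threshold

small = "/* header */ int x; // tiny file"
assert not accept_for_testing(small)
```

A production version would need per-language comment syntax (e.g. `#` for Python) and string-literal awareness, but the thresholding logic is the same.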
With the fast-paced advances in the machine learning (ML) space, the identification of the State of the Art (SOA) technique for the current problem is a challenge. The SOA appears to be a moving target: what we think of as the SOA today may not be so tomorrow. While doing the performance evaluation of our system, we found [7] to be the most relevant SOA work similar to ours. But, due to the evolving nature of ML, [7] may not remain the SOA for long.
Eects of programming styles on the quality of software have been well studied and reported in the research literature.
Most such studies can be categorized into two broad areas – Quality and maintenance of software. One of the works
that addressed problems similar to ours is [
]. They propose a learning algorithm to extract the semantic representation
associated with a program automatically. They train a DBN on the token vectors derived from the AST representations
of various programs. They rely on trained DBNs to detect the dierences in these token vectors extracted from an input
to predict defects. Another related contribution is presented by [
]. They propose a two-phase recommendation model
for suggesting the les to be xed by analyzing the defect reports associated with them. The recommendation model
in its rst phase categorizes a le as “predic” or “unpredic” depending upon the adequacy of the available content of
defect reports. Further, in its second phase, amongst the les categorized as predic, it recommends the top-k les to be
xed. They experimented on a limited set of projects, viz., “Firefox” and “Core” packages of Mozilla.
An information retrieval technique based defect localization module, "defectLocator", is proposed by [48]. The defectLocator uses the concept of textual similarity to find a set of defect reports similar to an input defect report and, using the linked source code, tries to identify potential defects related to the input. The use of textual similarity for code identification can pose problems because a given programming task may often be coded in more than one way. The authors have compared defectLocator's performance with other textual similarity techniques such as Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI), on four OSS projects (Eclipse, AspectJ, SWT, and ZXing).
The software metrics used in building a predictive model have a more significant impact on the performance of the system than the ML technique used [13]. The authors in [13] present a review of 106 papers and state that CK's Object-Oriented (OO) metrics are the most used software metrics. Further, the authors compare eight primary ML techniques (such as C4.5, SVM, Neural Network, and Logistic Regression) used in these 106 works by various evaluation methods, viz., precision, recall, the ROC curve method, and cost-effectiveness.
Similarly, the authors in [50] compare the performance of six ML techniques (such as Linear regression, Multilayer perceptron, and Decision tree regression) across 18 datasets of the PROMISE repository (such as Ant 1.7 and Camel) using the Average Absolute Error (AAE) and Average Relative Error (ARE) metrics. The Linear regression and Decision tree learning techniques were found to give the best results, with the least AAE and ARE measures.
The authors in [51] employ static code features (viz., Lines Of Code (LOC), Halstead features [52], and McCabe features [53]) for defect prediction. They compare the performance of 12 ML techniques (such as SVM, Naive Bayes (NB), and Random Forest (RF)) using static metrics' values extracted from 10 open-source projects available in the NASA repository. The authors claim that the predictive models which maximize the AUC(effort, probability of detection) metric yield the best results. Such models, which maximize the AUC metric, generate the smallest set of modules containing the most errors. The authors claim that the WHICH meta-learning model proposed by them outperforms the state-of-the-art techniques.
Overall, some of the key gaps found in the majority of the existing works addressing problems similar to ours can be summarized as follows:
- In works that use (or propose) source code feature extraction for defect prediction and localization, only limited types of nodes from the program's AST have been utilized. For instance, node types such as identifier nodes, operator nodes, scope nodes, and user-defined data types have not usually been considered. In our work, we have captured all such types of nodes as per a language's grammar.
- While building the feature vectors, the mere presence or absence of programming constructs is considered. In our work, however, we also capture additional characteristics associated with the programming constructs, for example, the "length", "count", and "depth-of-occurrences" of various constructs.
- The association of "programming decisions" made during the development of software with the characteristics of the defects reported against such source code has not been adequately studied in the literature.
- Last but not least, most works reported their results using a small volume (less than five projects) of source code and defect reports, if any. Our study spans more than 30400 source files written in four different programming languages and taken from 20 OSS repositories.
We present a novel method to estimate the defectiveness associated with a software system by analyzing the programming construct usage patterns present in software built in the past. PROCON metrics are used to extract the programming construct usage information from several source files, which is then stored as language-specific datasets. The ML models built using various SOA ML techniques, when trained with the PROCON datasets, constitute our system, DESCo. DESCo automates the defect estimation task, thus reducing the time and cost associated with the process. In addition to predicting the defectiveness associated with an input source code, DESCo also provides estimates of the likely defect characteristics.
Our results show that information associated with the programming constructs used in existing software (the PROCON dataset) improves the estimation of defects in source code. Our results also show that the accuracy of the defectiveness estimation varies with the programming language and the ML technique. For instance, for detecting the defectiveness of source files written in one of the considered languages, the SupervisedDBN classifier performs the best with an accuracy of 95.9%, whereas, for C++ programs, the LSVM classifier gives the best accuracy of 77.5%.
Our results indicate that the DESCo system and PROCON datasets outperform the existing state-of-the-art techniques and datasets, with the best MeanAcc measure value of 80.8% in the case of the RF technique (an improvement of 44.9%). DESCo and the PROCON datasets can be used for building software tools related to areas such as defect localization, code review, and recommendation. We have shown that the ML models trained using PROCON metrics give better results in comparison to the existing OO metrics (CK's metrics of the PROMISE repository).
[1] Sayed Mehdi Hejazi Dehaghani and Naseh Hajrahimi. Which factors affect software projects maintenance cost more? Acta Informatica Medica, 21(1):63, 2013.
[2] Ritu Kapur and Balwinder Sodhi. Towards a knowledge warehouse and expert system for the automation of sdlc tasks. In Proceedings of the International Conference on Software and System Processes, pages 5–8. IEEE Press, 2019.
[3] Ritu Kapur and Balwinder Sodhi. Estimating defectiveness of source code: A predictive model using github content. arXiv preprint arXiv:1803.07764, 2018.
[4] Ilene Burnstein. Practical software testing: a process-oriented approach. Springer Science & Business Media, 2006.
[5] IEEE Computer Society. Software Engineering Technical Committee. IEEE standard glossary of software engineering terminology. Institute of Electrical and Electronics Engineers, 1983.
[6] ISO/IEC. ISO/IEC 25010:2011(en), Systems and software engineering — Systems and software quality requirements and evaluation (SQuaRE) — System and software quality models, 2011. Retrieved: 11-03-2018.
[7] Song Wang, Taiyue Liu, and Lin Tan. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering, pages 297–308. ACM, 2016.
[8] Michael Pradel and Koushik Sen. Deep learning to find bugs. TU Darmstadt, Department of Computer Science, 2017.
[9] Thomas Shippey, David Bowes, and Tracy Hall. Automatically identifying code features for software defect prediction: Using AST n-grams. Information and Software Technology, 106:142–160, 2019.
[10] Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 281–293. ACM, 2014.
[11] Florian Deissenboeck and Markus Pizka. Concise and consistent naming. Software Quality Journal, 14(3):261–282, 2006.
[12] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. What's in a name? A study of identifiers. In Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on, pages 3–12. IEEE, 2006.
[13] Danijel Radjenović, Marjan Heričko, Richard Torkar, and Aleš Živkovič. Software fault prediction metrics: A systematic literature review. Information and Software Technology, 55(8):1397–1418, 2013.
[14] Erik Arisholm, Lionel C Briand, and Eivind B Johannessen. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software, 83(1):2–17, 2010.
[15] Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive scale using language modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories, pages 207–216. IEEE Press, 2013.
[16] StackOverflow. Stack Overflow developer survey results 2019: Most popular technologies, February 2019.
[17] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.
[18] Yiming Yang and Jan O Pedersen. A comparative study on feature selection in text categorization. In ICML, volume 97, page 35, 1997.
[19] Ferdian Thung, Tegawende F Bissyande, David Lo, and Lingxiao Jiang. Network structure of social coding in GitHub. In Software Maintenance and Reengineering (CSMR), 2013 17th European Conference on, pages 323–326. IEEE, 2013.
[20] Audris Mockus, Roy T Fielding, and James D Herbsleb. Two case studies of open source software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology (TOSEM), 11(3):309–346, 2002.
[21] Terence Parr. The definitive ANTLR 4 reference. Pragmatic Bookshelf, 2013.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[23] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.
[24] sklearn Linear SVM. (LSVM), February 2020.
[25] sklearn Support Vector Machine. (SVM), February 2020.
[26] sklearn Nu-Support Vector Machine. (NuSVM), February 2020.
[27] J Moćkus, V Tiesis, and A Źilinskas. The application of bayesian methods for seeking the extremum. vol. 2, 1978.
[28] sklearn Gaussian Process Classifier. (Gauss), February 2020.
[29] Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
[30] sklearn K Neighbors Classifier. (KNN), February 2020.
[31] Tin Kam Ho. Random decision forests. In Document Analysis and Recognition, 1995, Proceedings of the Third International Conference on, volume 1, pages 278–282. IEEE, 1995.
[32] sklearn Random Forest. (RF), February 2020.
[33] Simon Haykin. Neural networks: a comprehensive foundation. Prentice Hall PTR, 1994.
[34] sklearn Multi-Layer Perceptron. (MLP), February 2020.
Georey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
sklearn Supervised Learning. A Python implementation of Deep Belief Networks built upon NumPy and TensorFlow with scikit-learn compatibility,
February 2020.
Hsiang-Fu Yu, Fang-Lan Huang, and Chih-Jen Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine
Learning, 85(1-2):41–75, 2011.
[38] sklearn Logistic Regression. (LogReg), February 2020.
[39] Jason D Rennie, Lawrence Shih, Jaime Teevan, and David R Karger. Tackling the poor assumptions of naive Bayes text classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 616–623, 2003.
[40] sklearn Bernoulli Naive Bayes. (BernNB), February 2020.
[41] sklearn Multinomial Naive Bayes. (MultNB), February 2020.
[42] sklearn Gaussian Naive Bayes. (GaussNB), February 2020.
[43] Andrew Y Ng and Michael I Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, pages 841–848, 2002.
[44] sklearn Supervised Learning. 1. Supervised learning, February 2020.
[45] Issam H Laradji, Mohammad Alshayeb, and Lahouari Ghouti. Software defect prediction using ensemble learning on selected features. Information and Software Technology, 58:388–402, 2015.
[46] Shyam R Chidamber and Chris F Kemerer. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6):476–493, 1994.
[47] Dongsun Kim, Yida Tao, Sunghun Kim, and Andreas Zeller. Where should we fix this bug? A two-phase recommendation model. IEEE Transactions on Software Engineering, 39(11):1597–1610, 2013.
[48] Jian Zhou, Hongyu Zhang, and David Lo. Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports. In Proceedings of the 34th International Conference on Software Engineering, pages 14–24. IEEE Press, 2012.
[49] Jung-Tsung Chiang et al. The masking and swamping effects using the planted mean-shift outliers models. Int. J. Contemp. Math. Sciences, 2(7):297–307, 2007.
[50] Santosh S Rathore and Sandeep Kumar. An empirical study of some software fault prediction techniques for the number of faults prediction. Soft Computing, 21(24):7417–7434, 2017.
[51] Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, and Ayşe Bener. Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engineering, 17(4):375–407, 2010.
[52] Maurice Howard Halstead et al. Elements of Software Science, volume 7. Elsevier, New York, 1977.
[53] Thomas J McCabe. A complexity measure. IEEE Transactions on Software Engineering, (4):308–320, 1976.
[54] Leonard Richardson. Beautiful soup documentation, 2007.
Our approach for estimation of the defectiveness of a given source code involves the use of two key modules which we have built: the PROCON dataset builder and the DESCo system. We describe the details relevant to the implementation. The implementation details provided in this section allow replicating our results by independently building our proposed system.
A.1 Details of PROCON dataset builder
The implementation details of the important libraries (and/or scripts) used in the PROCON dataset builder are described below:
LexPar Module: It consists of the collection of all the lexer and parser classes generated by ANTLR. These lexer/parser classes perform the parsing of input source files and produce a collection of lexer/parser tokens. These classes and tokens are then fed as input to the Feature Extraction module to extract the necessary PROCON metrics' values.
Feature Extraction module: Given a set of grammars G, we generate, using ANTLR [21], one parse tree listener class corresponding to each grammar g in G. The parse tree listener class contains methods for handling the lexer/parser tokens (input by LexPar) as they are encountered during parsing of input source code conforming to the grammar g. We then design a class that uses the parse tree listeners to extract lexical features from input source code written in different programming languages. These lexical features correspond to the constructs mentioned in Table-2. The collection of PROCON metrics' values (see Table-3), extracted from various source files, forms the basis of our PROCON dataset (discussed in Section 2.2).
Defect Information Collector: This is built as a script of SQL commands that populates the information of bug reports (provided as input CSV files) into the BugInfo table (shown in Fig. 4).
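A minimal sketch of this step, using an in-memory SQLite database; the CSV layout and the subset of BugInfo columns shown here are assumptions for illustration, not the exact schema of Fig. 4:

```python
import csv
import io
import sqlite3

# Stand-in BugInfo table with a subset of the paper's attributes
# (BugEngineId + BugId form the primary key, as described in A.2).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE BugInfo (
        BugEngineId INTEGER,
        BugId       INTEGER,
        Product     TEXT,
        Status      TEXT,
        Priority    TEXT,
        PRIMARY KEY (BugEngineId, BugId)
    )
""")

# Hypothetical input CSV of bug-report metadata.
bug_csv = io.StringIO(
    "BugEngineId,BugId,Product,Status,Priority\n"
    "1,101,core,RESOLVED,P2\n"
    "1,102,ui,NEW,P3\n"
)
rows = [(int(r["BugEngineId"]), int(r["BugId"]), r["Product"],
         r["Status"], r["Priority"]) for r in csv.DictReader(bug_csv)]
conn.executemany("INSERT INTO BugInfo VALUES (?, ?, ?, ?, ?)", rows)

n_bugs = conn.execute("SELECT COUNT(*) FROM BugInfo").fetchone()[0]
```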
Manuscript submitted to ACM
Patch Scraper: This module is built using the Beautiful Soup [54] library of Python to extract the patch and summary information fields from the HTML markup of a defect report, d. The scraped information is written out as a text file, p, whose filename is the defect report ID, d.ID.
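The extraction can be sketched with the standard-library html.parser as a stand-in for Beautiful Soup; the element ids and markup below are hypothetical, not the actual layout of any bug tracker's pages:

```python
from html.parser import HTMLParser

class SummaryScraper(HTMLParser):
    """Collect the text of elements whose id is 'summary' or 'patch'.

    A stdlib stand-in for the Beautiful Soup based Patch Scraper;
    assumes flat (non-nested) target elements for simplicity.
    """
    def __init__(self):
        super().__init__()
        self._current = None   # id of the element we are inside, if any
        self.fields = {}       # field name -> scraped text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") in ("summary", "patch"):
            self._current = attrs["id"]

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields.setdefault(self._current, "")
            self.fields[self._current] += data.strip()

# Hypothetical fragment of a defect report page.
page = '<div id="summary">Crash in parser</div><pre id="patch">--- a/x.c</pre>'
scraper = SummaryScraper()
scraper.feed(page)
```

The scraped `fields` dictionary would then be serialized to the text file p named after d.ID.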
Mapping Builder: This module is developed in the Java programming language. It establishes the mapping between the defect reports and the source files fetched from the GitHub OSS repositories. The basic working idea of this module is to look for source file name patterns in the scraped information obtained via the Patch Scraper. The detailed steps are provided in Algorithm 1.
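The core idea of the mapping step can be sketched as follows (in Python, although the module itself is in Java). The regular expression, the basename-matching rule, and the function name are illustrative assumptions, not the exact logic of Algorithm 1:

```python
import re

# Match paths ending in one of the four studied languages' extensions.
FILE_PATTERN = re.compile(r"[\w/.-]+\.(?:c|cc|cpp|java|py)\b")

def map_bug_to_files(bug_id, patch_text, repo_files):
    """Map a defect report ID to repository files named in its patch text."""
    mentioned = {m.rsplit("/", 1)[-1] for m in FILE_PATTERN.findall(patch_text)}
    return [(bug_id, f) for f in repo_files
            if f.rsplit("/", 1)[-1] in mentioned]

patch = "diff --git a/src/parser.c b/src/parser.c\n--- a/src/parser.c"
mapping = map_bug_to_files(42, patch, ["src/parser.c", "src/lexer.c"])
```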
Information Rener: This module renes the information that the Feature Extractor module stored in SourceCode-
Features table. The rening is done by applying various transformations as discussed in Step-5 of §2.2.4. The
transformations logic is implemented via a collection of SQL commands.
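One such transformation is aggregating per-occurrence measure values into per-construct summaries (see the RefinedFeatures table in A.2). A minimal SQLite sketch, with abbreviated, assumed column names; note that SQLite has no built-in standard deviation, so stdDevMeasureVal would be computed application-side or via an extension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SourceCodeFeatures (
        SourceFileId INTEGER, ConstructId INTEGER,
        MeasureType TEXT, MeasureVal REAL
    )
""")
# Three occurrences of the same construct in one file.
conn.executemany(
    "INSERT INTO SourceCodeFeatures VALUES (?, ?, ?, ?)",
    [(1, 7, "count", 2.0), (1, 7, "count", 4.0), (1, 7, "count", 6.0)],
)
# Refine: aggregate each (file, construct, measure) group.
conn.execute("""
    CREATE TABLE RefinedFeatures AS
    SELECT SourceFileId, ConstructId, MeasureType,
           MAX(MeasureVal) AS maxMeasureVal,
           MIN(MeasureVal) AS minMeasureVal,
           AVG(MeasureVal) AS avgMeasureVal
    FROM SourceCodeFeatures
    GROUP BY SourceFileId, ConstructId, MeasureType
""")
row = conn.execute(
    "SELECT maxMeasureVal, minMeasureVal, avgMeasureVal FROM RefinedFeatures"
).fetchone()
```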
A.2 Details of database tables
A brief description of the main tables of the PROCON dataset schema is as follows:
LanguageConstructs: In this table, we maintain all the programming language constructs that we consider while computing the PROCON metrics for the input source files. Each programming language construct has a single record in this table. The attribute ConstructId is the primary key of the table; it assigns a unique IDentifier (ID) to each of the programming language constructs used in the considered source files. The ConstructId attribute acts as a foreign key in the tables SourceCodeFeatures and RefinedFeatures (discussed next).
SourceCodeFeatures: This table stores the values of the lexical properties computed for the various programming constructs (as described in §2.2) found in the source files. The MeasureType attribute holds the type (viz., count, depth, etc.) of the lexical properties. The ConstructId references the programming construct ID from the LanguageConstructs table.
Since a source file may contain multiple instances of a programming language construct, the SourceCodeFeatures table has a one-to-many relationship with the LanguageConstructs table. The SourceFileId refers to the unique ID values assigned to the various source files. A source file can contain multiple programming constructs at different nesting levels (i.e., with different ParentIds). Thus, the table has a composite primary key {SourceFileId, FileName, ConstructId, ParentId, MeasureType}.
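The composite-key design above can be sketched as a SQLite DDL fragment; the column types and the MeasureVal column are assumptions, only the key structure follows the description:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SourceCodeFeatures (
        SourceFileId INTEGER,
        FileName     TEXT,
        ConstructId  INTEGER,
        ParentId     INTEGER,
        MeasureType  TEXT,
        MeasureVal   REAL,
        PRIMARY KEY (SourceFileId, FileName, ConstructId, ParentId, MeasureType)
    )
""")
conn.execute("INSERT INTO SourceCodeFeatures VALUES (1, 'a.c', 7, 0, 'count', 3)")
# The same construct at a different nesting level (ParentId) is a distinct row:
conn.execute("INSERT INTO SourceCodeFeatures VALUES (1, 'a.c', 7, 2, 'count', 1)")
n = conn.execute("SELECT COUNT(*) FROM SourceCodeFeatures").fetchone()[0]
```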
BugInfo: In this table, we store the relevant metadata fetched from the defect reports associated with the source files considered in the SourceCodeFeatures table. BugId represents a unique ID for the record of each such defect report. The attribute BugEngineId identifies the source from which the defect report was fetched.
The combination of BugEngineId and BugId acts as the primary key of this table. The attribute Product holds information about the module against which the defect was reported. The remaining attributes of the table (viz., Status, Priority, Stage, etc.) represent the various defect characteristics.
SourceFileToBugMapping: It stores the mapping between the source files and the bug reports considered for building the dataset. Since it is a mapping between two tables (viz., SourceCodeFeatures and BugInfo), the table comprises the prime attributes of the two respective tables.
RenedFeatures: It stores the rened features obtained by transforming the shallow knowledge, as explained in
§2.2.4 step (6). This table represents the values of PROCON metrics extracted for various source les.
The attributes maxMeasureVal, minMeasureVal, avgMeasureVal, and stdDevMeasureVal store the measure values obtained by applying the corresponding aggregate operations (viz., max, min, avg, and stdDev) on the attribute values from the SourceCodeFeatures table.
When computing these aggregate values, we categorized a programming construct occurrence into three different types of parent entities: file, class, and function. These types are identified via a unique integer value held in the queryFlag attribute. Since a construct may occur under all three parent entities simultaneously, the table has the composite key {SourceFileId, ConstructId, MeasureType, queryFlag} as its primary key.
Further, each source file present in the SourceCodeFeatures table can have n (with n > 1) records in the RefinedFeatures table due to the three queryFlag values.
A.3 Details of DESCo modules
The modules of DESCo, except where noted otherwise, are coded in the Python programming language. The implementation details of these modules are as follows.
Model Builder: It builds various ML models using an input list of the ML algorithms listed in Table-7 and the features present in the PROCON dataset. To get the best performing model for various scenarios, it considers the parameter combinations listed in Table-7. We use the scikit-learn implementation of these ML algorithms. Details of these algorithms are discussed in §2.3.3.
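Enumerating the parameter combinations can be sketched with the standard library; the grids below are illustrative placeholders, not the actual contents of Table-7, and in the real Model Builder each combination would parameterize a scikit-learn model:

```python
from itertools import product

# Hypothetical per-algorithm parameter grids (not the paper's Table-7).
param_grids = {
    "SVM": {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]},
    "DecisionTree": {"max_depth": [3, 5]},
}

def combinations(grid):
    """Yield every dict formed by the cross product of the grid's values."""
    names = sorted(grid)
    for values in product(*(grid[n] for n in names)):
        yield dict(zip(names, values))

all_combos = {algo: list(combinations(grid))
              for algo, grid in param_grids.items()}
```

Each generated dict would be passed as keyword arguments when constructing the corresponding estimator, and the resulting models scored per scenario.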
Model Selector: This module implements an optimization problem solver developed in Java. The objective of the optimization problem is to find, from a set of given ML models, the best performing ML model(s) for a specific task. The input to the Model Selector comprises i) the query selected by the user, ii) the features extracted from the input source file, and iii) the complete set of ML models built using the Model Builder (described above). The output of this module is the best performing model(s) as per the input.
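In its simplest form, the selection objective reduces to an argmax over validation scores. The sketch below (in Python, although the module itself is in Java) uses invented task names, model names, and scores for illustration:

```python
# Hypothetical validation scores keyed by (task, model).
scores = {
    ("severity", "SVM"): 0.81,
    ("severity", "NaiveBayes"): 0.74,
    ("severity", "RandomForest"): 0.81,
}

def best_models(scores, task):
    """Return all models tied for the top score on the given task."""
    task_scores = {m: s for (t, m), s in scores.items() if t == task}
    top = max(task_scores.values())
    return sorted(m for m, s in task_scores.items() if s == top)

winners = best_models(scores, "severity")
```

Returning every tied model (rather than one) matches the description above, which allows the output to be the best performing model(s).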
Query Interface: It provides an interface for user interaction with the DESCo system. The user selects a defect estimation task (listed in §2.3.1) from an available list of queries. The user can supply a source file for which the defect estimation task needs to be performed. Internally, the Query Interface passes the user-selected query and the provided source file as inputs to the Estimation Engine for performing the defect estimation tasks.
Estimation Engine: It is the main module that drives the defect estimation tasks. The key inputs to this module are a) a user-selected query for defect estimation and b) the source file whose defectiveness needs to be estimated.
Programmers run into parsing problems all the time. Whether it's a data format like JSON, a network protocol like SMTP, a server configuration file for Apache, a PostScript/PDF file, or a simple spreadsheet macro language--ANTLR v4 and this book will demystify the process. ANTLR v4 has been rewritten from scratch to make it easier than ever to build parsers and the language applications built on top. This completely rewritten new edition of the bestselling Definitive ANTLR Reference shows you how to take advantage of these new features. Build your own languages with ANTLR v4, using ANTLR's new advanced parsing technology. In this book, you'll learn how ANTLR automatically builds a data structure representing the input (parse tree) and generates code that can walk the tree (visitor). You can use that combination to implement data readers, language interpreters, and translators. You'll start by learning how to identify grammar patterns in language reference manuals and then slowly start building increasingly complex grammars. Next, you'll build applications based upon those grammars by walking the automatically generated parse trees. Then you'll tackle some nasty language problems by parsing files containing more than one language (such as XML, Java, and Javadoc). You'll also see how to take absolute control over parsing by embedding Java actions into the grammar. You'll learn directly from well-known parsing expert Terence Parr, the ANTLR creator and project lead. You'll master ANTLR grammar construction and learn how to build language tools using the built-in parse tree visitor mechanism. The book teaches using real-world examples and shows you how to use ANTLR to build such things as a data file reader, a JSON to XML translator, an R parser, and a Java class-interface extractor. This book is your ticket to becoming a parsing guru!What You Need: ANTLR 4.0 and above. Java development tools. Ant build system optional (needed for building ANTLR from source)