Automatic metric thresholds derivation for
code smell detection
Francesca Arcelli Fontana, Vincenzo Ferme, Marco Zanoni, Aiko Yamashita
Department of Informatics, Systems and Communication
University of Milano-Bicocca, Milano, Italy
Email: arcelli@disco.unimib.it, marco.zanoni@disco.unimib.it
Faculty of Informatics
University of Lugano (USI), Switzerland
Email: vincenzo.ferme@usi.ch
Mesan AS
Henrik Ibsens gate 20, Oslo, Norway
Email: aikoy@mesan.no
Abstract—Code smells are archetypes of design shortcomings
in the code that can potentially cause problems during main-
tenance. One known approach for detecting code smells is via
detection rules: a combination of different object-oriented metrics
with pre-defined threshold values. The usage of inadequate
thresholds when using this approach could lead to either having
too few observations (too many false negatives) or too many
observations (too many false positives). Furthermore, without
a clear methodology for deriving thresholds, one is left with
those suggested in literature (or by the tool vendors), which
may not necessarily be suitable to the context of analysis. In
this paper, we propose a data-driven (i.e., benchmark-based)
method to derive threshold values for code metrics, which can
be used for implementing detection rules for code smells. Our
method is transparent and repeatable, and it enables the extraction of
thresholds that respect the statistical properties of the metric in
question (such as scale and distribution). Thus, our approach
enables the calibration of code smell detection rules by selecting
relevant systems as benchmark data. To illustrate our approach,
we generated a benchmark dataset based on 74 systems of the
Qualitas Corpus, and extracted the thresholds for five smell
detection rules.
I. INTRODUCTION
Code smells [1] are symptoms of design shortcomings in the
code that can potentially cause problems during maintenance
and evolution. Many of the available smell detection tools im-
plement a metrics-based approach, with ‘fixed’ or configurable
threshold values. In some cases, tool providers using ‘fixed’
threshold values do not provide a clear rationale on how these
thresholds have been devised. In other cases, such as [2], the
authors provide a rationale on how the threshold values for
some metrics have been identified (i.e., heuristics derived from
the interpretation of object-oriented design principles, together
with statistical analysis).
The usage of thresholds in code smell detection entails several challenges. First, one can question the adequacy of thresholds provided by tool vendors, for which the derivation rationale is often undisclosed. This leads to unclear assumptions on the validity and representativeness of these thresholds. Even if a rationale is given for the thresholds, if the process followed to arrive at them is not repeatable, it is not possible in practice to compare them with other values or techniques. Furthermore, the usage of inadequate thresholds could lead to either too many false negatives or too many false positives. Developers more often than not end up with long lists of ‘candidates’ for examination, and may have a hard time prioritizing the most critical instances.
In sum, without a clear methodology for deriving thresholds,
one is left with those suggested in the literature (or by tool
vendors), which may not necessarily be suitable to the context
of analysis. A transparent, repeatable mechanism is needed in
order to derive the thresholds most suitable to the context. In
that way, we can move towards more useful and adequate code
smell-based evaluations of source code quality. In this paper,
we propose a data-driven method (i.e., using a benchmark
dataset of software systems) to extract metric thresholds for
defining code smell detection rules. We call the extracted thresholds Benchmark-Based Thresholds (BBT).
Our method is characterized by three features: 1) it relies on
the statistical distribution of metrics for selecting thresholds;
2) it applies a non-parametric process to discard the part of
the distribution that is not useful for deriving thresholds; 3)
it customizes the definition of the thresholds for each code
smell, by filtering the dataset according to the code smell.
To illustrate our method, we extracted 40 different metrics
from 74 systems of the Qualitas Corpus (v. 20120401r)
curated by Tempero et al. [3] to build a benchmark dataset,
and we illustrate how the extracted threshold values can be
used in five code smell detection rules defined by Lanza and
Marinescu [2].
The paper is organized as follows. In Section II, we describe
our approach inspired by a previously suggested data-driven
technique. In Section III, we illustrate how some of the
thresholds derived through our method can be used in an
existing code smell detection approach. In Section IV, we
discuss the current limitations of the work. In Section V, we
review related work on metric threshold derivation. Finally,
Section VI provides concluding remarks and discusses avenues
for future research.
II. THRESHOLDS DERIVATION PROCESS
Our data-driven approach is inspired by the thresholds
derivation work reported by Alves et al. [4]. The primary
goal of the approach by Alves et al. is to use the extracted
thresholds for building maintainability assessment models [5].
Conversely, the main goal of our approach is to derive thresh-
olds for code smell detection rules, which can furthermore be
used for software quality and maintainability assessments.
Our method is meant to be applied to metrics that do not represent ratios of other metrics, which typically range in the interval [0,1]. An example of such a ratio metric is Tight Class Cohesion (TCC); metrics of this kind are not covered by our approach. We will hereon refer to the thresholds computed with our method as BBT.
In the remainder of this section, we will explain the process
for extracting thresholds, by first describing the dataset from
which the metrics were extracted. Then, we explain how we
used the distributions of the metrics for mapping symbolic
threshold values to actual values.
We focus on the metrics used in the book by Lanza and
Marinescu [2] for the definition and detection of the following
code smells: God Class, Data Class, Brain Method, Dispersed
Coupling, and Shotgun Surgery. The metrics used for the code
smell detection rules are reported in Table I.
A. Metric Thresholds Definition
Detection rules for code smells are often defined in terms of metric groups or classifications. An example can
be: “identify classes that have LOW cohesion” or “identify
methods that have a HIGH complexity”. We want to derive
thresholds in a way that they can be semantically mapped to
these informal requirements, to find out what LOW cohesion
or HIGH complexity means in terms of the metrics we use to
measure the cohesion and complexity of the software.
With this need in mind we compute, for each metric, a set of three thresholds that identify three points on the metric distribution. We define a threshold using a name and a corresponding percentile. We use the name to refer to a threshold with a semantic meaning. The names we use for the three thresholds are: LOW, MEDIUM, and HIGH.
The percentile corresponding to a threshold name is the percentile of the metric distribution at which the value for the given threshold is found. In other words, every name is an alias for a percentile, which in turn points to the place on the metric distribution from which the actual threshold value is taken.
B. Benchmark-Based Thresholds
The metrics we consider in our method have values belonging to the extended real line or to the natural numbers, mostly starting from zero.
The starting point of our method is the one proposed by
Alves et al. [4], which follows three core principles:
1) “The method should not be driven by expert opinion but by measurement data from a representative set of systems (data-driven);”
2) “The method should respect the statistical properties of the metric, such as metric scale and distribution, and should be resilient against outliers in metric values and system size (robust);”
3) “The method should be repeatable, transparent, and straightforward to carry out (pragmatic).”
Their method is aimed at finding thresholds for system-level
metrics, to assess the risk levels of a given system. Our method
makes similar assumptions on the nature of the raw data,
but is aimed at finding metric thresholds applicable to single
elements in the code (e.g., a class, a method), to determine if
they belong to a certain category or group.
The proposed method allows computing, in an automated way, the thresholds needed to detect each code smell. Its high level of automation allows extending the benchmark data with additional systems over time, to keep the set of thresholds meaningful and up to date.
The main steps of the method are:
1) Metrics computation: metric values of all classes and
methods of the analyzed systems are computed;
2) Aggregated metrics distribution analysis: the quantile
function of each metric distribution is computed;
3) Metrics thresholds derivation: the points on the quantile function to assign to the labels (LOW, MEDIUM, HIGH) are selected.
1) Metrics computation: The first step consists of iden-
tifying a set of systems to be used as a benchmark, and
computing the metrics values. Table I presents the metrics we
use to illustrate our work, and their definitions by Lanza and
Marinescu [2]. In addition, we report in Table VII all the other
metrics we analyzed with our method; metrics defined in the
literature have a reference, and the ones we defined are in bold
font.
The systems we selected as a benchmark are the ones
from the Qualitas Corpus (v. 20120401r), a curated collection
of 111 open source systems created by Tempero et al. [3].
We selected 74 of the available systems for our benchmark.
The size of the selected systems is reported in Table II. We
selected the systems that we were able to compile, adding missing libraries when required. Compilation is necessary for the computation of dependency metrics in our environment (we exploit the Eclipse JDT library), so that dependent classes can be resolved also in external libraries. We use the metadata from the Qualitas Corpus to consider only the “core” classes for our analysis; we exclude, for instance, tests and demos from the
dataset. During the manual inspection of the systems (e.g.,
to solve compilation problems), we also removed all the test
packages we were able to identify.
For each code smell, we compute a benchmark database that
contains only the entities that can be affected by that specific
smell (this is done by using a selection criterion for each
code smell). The set of values for all the metrics that are part
of a given code smell detection rule is hereon called the code smell metrics set.

Table I
METRICS DEFINITIONS

ATFD (Access To Foreign Data): The number of attributes from unrelated classes belonging to the system, accessed directly or by invoking accessor methods.
WMC (Weighted Methods Count): The sum of the complexities of the methods that are defined in the class. We compute the complexity with CYCLO.
NOAP (Number Of Public Attributes): The number of public attributes of a class.
NOAM (Number Of Accessor Methods): The number of accessor (getter and setter) methods of a class.
LOC (Lines of Code): The number of lines of code of an operation or of a class, including blank lines and comments.
CYCLO (Cyclomatic Complexity): The maximum number of linearly independent paths in a method. A path is linear if there is no branch in the execution flow of the corresponding code.
MAXNESTING (Maximum Nesting Level): The maximum nesting level of control structures within an operation.
NOAV (Number Of Accessed Variables): The total number of variables accessed directly or through accessor methods from the measured operation. Variables include parameters and local variables, but also instance variables and global variables declared in classes belonging to the system.
CINT (Coupling Intensity): The number of distinct operations called by the measured operation.
CM (Changing Methods): The number of distinct methods calling the measured method.
CC (Changing Classes): The number of classes in which the methods that call the measured method are defined.

The selection criteria for the code smells considered in this paper are the following (two of them are also expressed as illustrative filters after the list):
God Class: any class, including anonymous classes; in-
terfaces and enums are discarded;
Data Class: any concrete (non-abstract) class, and anony-
mous classes; interfaces and enums are discarded;
Brain Method and Dispersed Coupling: non-abstract
methods contained in a class, anonymous class or enum
(no interface); constructors are included, but not default
constructors;
Shotgun Surgery: non-abstract methods contained in a
class or enum (no anonymous classes or interfaces);
constructors are excluded.
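For illustration, two of these criteria can be expressed as simple filters over hypothetical entity records; the field names below are illustrative only, while the actual tooling is based on the Eclipse JDT library.

    # Illustrative filters for building the per-smell benchmark datasets,
    # following the selection criteria listed above (hypothetical fields).
    def god_class_candidates(types):
        # any class, including anonymous classes; interfaces and enums discarded
        return [t for t in types if t["kind"] == "class"]

    def brain_method_candidates(methods):
        # non-abstract methods in classes, anonymous classes or enums;
        # constructors included, default (compiler-generated) ones excluded
        return [m for m in methods
                if not m["is_abstract"]
                and m["owner_kind"] in ("class", "anonymous class", "enum")
                and not m["is_default_constructor"]]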
2) Metrics Aggregated Distribution Extraction: At the end of step 1, we obtain, for each code smell metrics set, an aggregated set of values from entities belonging to the different systems that are part of the benchmark. For each metric, we aggregate all the values, obtaining an aggregated distribution, and we plot the Inverse Cumulative Distribution Function, i.e., the Quantile Function (QF), for the aggregated metric distribution of each system.
There are many methods to estimate the quantiles of a
population and define the 𝑄𝐹 . We refer to the type 1 method
defined by Hyndman and Fan [6]. This kind of analysis is not
influenced by outliers, as it does not rely on the actual values
of the metrics.
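For illustration, the type 1 estimator is readily available in common statistical libraries; the following minimal sketch (dummy LOC values; numpy 1.22 or later is assumed for the method keyword) shows how the QF of an aggregated metric distribution can be sampled with it.

    import numpy as np

    # Illustrative only: sample the aggregated QF of a metric with the
    # Hyndman-Fan type 1 ("inverted CDF") estimator.
    def qf(values, percentiles):
        return np.quantile(np.asarray(values), percentiles, method="inverted_cdf")

    # e.g., LOC values aggregated over all classes of the benchmark (dummy data)
    loc = [3, 5, 5, 8, 12, 20, 45, 120, 870]
    print(qf(loc, [0.25, 0.50, 0.75, 0.90]))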
Figure 1 and Figure 2 show the 𝑄𝐹 plots regarding metrics
measured on types (classes and interfaces) and methods,
respectively.

Table II
STATISTICS ABOUT THE ANALYZED SYSTEMS

Number of systems       74
Lines of Code           6,785,568
Number of Packages      3,420
Number of Classes       51,826
Number of Methods       404,316

Figure 1. Type metrics QF overview (x-axis: percentile; y-axis: normalized metric)

Figure 2. Method metrics QF overview (x-axis: percentile; y-axis: normalized metric)

To represent the different curves together in the
plots, the values of metrics have been normalized in the range
[0,1]. The figures show that most metrics in the benchmarks
have a common heavily skewed distribution: most values are
concentrated in a small portion of the possible values, the one
closest to the lowest quantiles.
The 𝑄𝐹 makes this fact clear because it grows very slowly
for most of the quantiles, and after a given point, it grows
very fast. As an example, we plot in Figure 3 the QF for metric LOC on types, cropped at 90% (only the last 10% of the distribution).

Figure 3. Aggregated LOC QF Cropped at 90% (x-axis: percentile, 0.90–1.00; y-axis: LOC)

3) Metrics Cropped Distribution Extraction: If we use the distribution as-is to derive thresholds by applying common statistical approaches (e.g., the boxplot approach by Fenton et al. [7]), we run the risk of obtaining values that do not characterize problematic entities. For example, the first quartile is often equal to the lowest value the metric can have (e.g.,
zero for LOC); even the median can be one of the very few
first values in the distribution. For this reason, we define an
automatic method to select the point of the metric distribution
where the variability among values starts to be higher than
the average, and we use it as the starting point for deriving
thresholds.
The goal of the procedure is to find a cut-point on the x-
axis of the 𝑄𝐹 that splits the 𝑄𝐹 in two parts: the left part
contains the lower and most repetitive values, the right part
contains the higher and more variable values, which we will
use for the threshold computation; we call the right part the
cropped distribution. We start from the verified assumption
that the distribution is highly right-skewed.
The procedure samples 100 equally distributed values of the QF (i.e., percentiles), in the range (0.01, 1). The outcome is an array a of 100 elements; each element is the value of the QF in the respective position.
Given a function f representing the frequency of a value in a, a value f_mid is computed as the median of the frequencies of the values in a. With f_mid, we can compute v_mid as

    v_mid = min{ v ∈ a | f(v) ≤ f_mid ∧ (∀ w ∈ a, w ≥ v ⇒ f(w) ≤ f_mid) }.

v_mid represents the minimum value of the metric we will consider in the output distribution. Therefore, to select the quantile where to crop the distribution, we look for the maximum quantile having a value strictly less than v_mid.
As an example, Table III shows the different values of NOM (Number Of Methods) in descending order, associated with their frequency in the distribution. Figure 4 shows a histogram representing the same data. The median of the frequencies is 1. Therefore, we select in Table III the rightmost (smallest) value having all frequencies on its left less than or equal to 1. The resulting value is 13 (shown in Table III). Value 13 is then searched over the QF for metric NOM; the first element having value 13 is at the 87th percentile. As a consequence, the last 14% of the distribution is used for computing the threshold values, while the first 86% is discarded.
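The cut-point selection can be summarized by the following minimal sketch (not the implementation used for the paper; numpy is assumed): it samples 100 type 1 quantiles, computes the median frequency f_mid, derives v_mid, and returns the percentile at which the distribution is cropped. Applied to a metric whose sampled quantiles have the frequencies of Table III, it selects 13 as v_mid and crops at the 86th percentile, so that the last 14% is kept.

    from collections import Counter
    import numpy as np

    def crop_point(values):
        # 100 equally spaced percentiles in (0.01, 1.00]
        percentiles = np.arange(1, 101) / 100.0
        a = np.quantile(np.asarray(values), percentiles, method="inverted_cdf")

        freq = Counter(a)                       # f(v): frequency of each sampled value
        f_mid = np.median(list(freq.values()))  # median of the frequencies

        # v_mid: smallest sampled value such that it and every larger sampled
        # value occur at most f_mid times (the distribution is assumed to be
        # heavily right-skewed, as verified on the benchmark).
        v_mid = None
        for v in sorted(freq, reverse=True):
            if freq[v] > f_mid:
                break
            v_mid = v
        if v_mid is None:                       # degenerate case: nothing to crop
            return 0.0, a[0]

        # crop at the maximum percentile whose value is strictly below v_mid
        below = [p for p, q in zip(percentiles, a) if q < v_mid]
        return (max(below) if below else 0.0), v_mid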
Alves et al. [4] too, as part of their method, select a part of the distribution from which to pick dangerous reference values for a metric. They do this work by hand, and highlight the problem of identifying those points according to the metric distribution. They use the 70th, 80th, and 90th percentiles as a common choice and then they adapt the selection based on
the metric distribution. Our method executes this task in an
automated way. The cut points we found for the considered
metrics are reported in Table IV.
4) Thresholds Derivation: After cropping the distribution in the previous step, we compute the new QF and then derive the three thresholds. These are the labels associated with the thresholds, and the respective percentiles on the new QF (a short sketch of this step follows the list):
LOW: 25th percentile;
MEDIUM: 50th percentile;
HIGH: 75th percentile.
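Under the same assumptions as the previous sketch, and taking the cropped distribution as the tail of raw values from v_mid upward (one straightforward way of realizing the cropping), the derivation of the three thresholds can be written as follows.

    import numpy as np

    # Minimal sketch of step 4; v_mid is the value returned by crop_point.
    def derive_thresholds(values, v_mid):
        cropped = [v for v in values if v >= v_mid]   # right tail only
        q = lambda p: np.quantile(cropped, p, method="inverted_cdf")
        return {"LOW": q(0.25), "MEDIUM": q(0.50), "HIGH": q(0.75)}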
In Table V we show all the final extracted threshold values
of the metrics used for the detection of the code smells we
have considered in this paper.
Table III
NOM CROP DATA

NOM   674  60  42  34  29  25  23  21  19  17  16  15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0
Freq    1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   2   2   3   3   4   4   6   8  10  14  26   2

Figure 4. NOM crop graph (x-axis: NOM, log scale; y-axis: frequencies)

Table IV
CUT POINTS FOR DISTRIBUTION CROPPING

Code Smell          Metric        Value  Percentile
God Class           ATFD            8      91
                    WMC            19      79
Data Class          NOAP + NOAM     6      92
                    WMC            16      76
Brain Method        LOC            16      82
                    CYCLO           7      94
                    MAXNESTING      4      95
                    NOAV           10      90
Dispersed Coupling  MAXNESTING      4      95
                    CINT            6      94
Shotgun Surgery     CM              4      94
                    CC              4      96

It can happen that the values of two or more derived thresholds for a metric are the same, depending on the distribution of the values. For example, in Table V we can see that MAXNESTING has the same value (4) for both the LOW and MEDIUM thresholds. This happens because the values of the MAXNESTING metric are distributed over a small range. In this situation, associating different labels with the same actual value can be confusing, especially when the values are exploited for code smell detection: different rules could represent the same query. It is possible to recalibrate thresholds in this case. A simple recalibration procedure is to keep the MEDIUM threshold fixed, and move the other two thresholds on the distribution in opposite directions until the values are disambiguated. Following this procedure, we could simply change the LOW threshold to 3. However, this could lead to a misalignment of information between threshold values and percentiles. The end user inspecting the results would at least need to be notified that one of the applied thresholds is not tied to its default percentile. For this reason, we choose not to apply recalibration. We can think of recalibration as an optional procedure, to be explicitly activated by the user.
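The recalibration step can be read in more than one way; the sketch below is one possible, value-based reading for integer-valued metrics, given only as an illustration. It reproduces the MAXNESTING example, turning the LOW threshold into 3, and it is not part of the default method.

    # Keep MEDIUM fixed; push LOW down and HIGH up by one unit until the
    # three values differ (value-based variant for integer-valued metrics).
    def recalibrate(t):
        low, mid, high = t["LOW"], t["MEDIUM"], t["HIGH"]
        while low >= mid:
            low -= 1
        while high <= mid:
            high += 1
        return {"LOW": low, "MEDIUM": mid, "HIGH": high}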
III. EXTRACTING THRESHOLDS FOR EXISTING RULES:
AN EXAMPLE
We analyse the extracted thresholds in the light of the
work by Lanza and Marinescu [2], which reports a set of
code smell detection rules. The thresholds reported in the book
were initially derived from the statistical analysis of 45 Java
software systems, and then manually adapted or combined for each particular code smell, as needed.

Table V
FINAL EXTRACTED THRESHOLDS

Code Smell          Metric        LOW  MEDIUM  HIGH
God Class           ATFD           10    13     22
                    WMC            25    36     64
Data Class          NOAP + NOAM     7    10     17
                    WMC            21    32     57
Brain Method        LOC            20    28     45
                    CYCLO           8    10     15
                    MAXNESTING      4     4      5
                    NOAV           11    14     20
Dispersed Coupling  MAXNESTING      4     4      5
                    CINT            7     9     12
Shotgun Surgery     CM              5     7     12
                    CC              4     6     10
We focus on the rules regarding the detection of five
code smells: God Class, Data Class, Brain Method, Dispersed
Coupling, Shotgun Surgery. By using our method, we compute
the thresholds of all metrics involved with these rules and
compare them with the ones proposed in the book.
Table VI displays the detection rules for the smells consid-
ered and their respective metrics (definitions are available in
Table I). The metrics are reported alongside their comparison
with a symbolic threshold, as defined by each detection rule.
Detection rules in [2] are expressed as predicates in logical
composition, where metric values are compared with symbolic
thresholds. Figure 5 provides as an example, the detection rule
for the Brain Method code smell.
Table VI
CODE SMELLS AND METRICS WITH THRESHOLDS

Code Smell          Metric        Comparator  Threshold          Threshold OO  Threshold A  Threshold B  Threshold C
God Class           ATFD          >           FEW                2-5            0           10           10
                    WMC           ≥           VERYHIGH           47            15           59           64
Data Class          NOAP + NOAM   >           FEW                2-5            0            8            7
                    NOAP + NOAM   >           MANY               5              1           20           17
                    WMC           <           HIGH               31            15           59           57
                    WMC           <           VERYHIGH           47            15           59           57
Brain Method        LOC           >           HIGH (Class) / 2   65            61           94           91
                    CYCLO         ≥           HIGH               3              2           19           15
                    MAXNESTING    ≥           SEVERAL            2-5            2            5            5
                    NOAV          >           MANY               5              5           17           20
Dispersed Coupling  MAXNESTING    >           SHALLOW            1              1            4            4
                    CINT          >           Short Memory Cap   7-8            0            9            9
Shotgun Surgery     CM            >           Short Memory Cap   7-8            0            8            7
                    CC            >           MANY               5              1           10           10
Figure 5. Brain Method Detection Rule
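For illustration, the rule of Figure 5 can be written as the conjunction of the four predicates listed in Table VI. In the sketch below, m is a hypothetical record of the method's metric values, t maps the book's symbolic names to concrete values, and half_high_class_loc stands for the HIGH(Class)/2 value of the first predicate; with the BBT mapping introduced later in this section, HIGH, SEVERAL and MANY all resolve to the HIGH threshold of the corresponding metric.

    # Hedged reconstruction of the Brain Method rule of Figure 5 (illustrative).
    def is_brain_method(m, t, half_high_class_loc):
        return (m["LOC"] > half_high_class_loc        # LOC > HIGH(Class)/2
                and m["CYCLO"] >= t["HIGH"]           # high functional complexity
                and m["MAXNESTING"] >= t["SEVERAL"]   # deeply nested control flow
                and m["NOAV"] > t["MANY"])            # accesses many variables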
We derived from [2] the values assigned to the symbolic
thresholds (for Java), and we report them in the “Threshold
OO” column in Table VI. As the book does not explicitly
report the exact value for all the thresholds, we did our best
to estimate those not available. The last three columns of
Table VI are the thresholds we computed considering:
the aggregated distribution on the whole dataset, without
filtering the dataset according to the code smell (“Thresh-
old A”);
only the cropped distribution on the whole dataset, with-
out filtering the dataset according to the code smell
(“Threshold B”);
the cropped distribution on the code smell metrics set,
i.e., the dataset filtered according to the code smell
(“Threshold C”).
The symbolic names used in Lanza and Marinescu’s book for thresholds are different from our threshold names. To allow the comparison of the extracted values, we associated every symbolic name used in the book with one of the three names defined in our method:

SHALLOW / FEW → LOW
Short Memory Cap → MEDIUM
MANY / HIGH / SEVERAL / VERYHIGH → HIGH
In Table VI, one of the predicates of the Brain Method detection rule compares the LOC metric to the HIGH(Class)/2 threshold. This means that the LOC of the evaluated method is compared to half of the LOC of the method’s container class. The usage in the Brain Method rule of a metric regarding the class that contains the evaluated method does not fit our procedure. Our method filters the dataset, keeping only the elements (methods in this case) that can be affected by the particular smell, and computes thresholds on the metrics of only those elements. To allow the comparison of the thresholds we extract using our method with the ones used by Lanza and Marinescu [2], in Table VI we report as Threshold A the value of HIGH(Class)/2 as used by the reference, and as Thresholds B and C the value of our threshold HIGH/2, computed, however, on the dataset we use for God Class, not the one used for Brain Method. This exception makes the comparison between our thresholds and the reference ones more consistent. Following our approach, instead, we would prefer to use in a detection rule the regular HIGH threshold computed on the Brain Method dataset, obtaining different thresholds, i.e., 11, 43, and 45 for Thresholds A, B, and C respectively.
From Table VI we can see that the thresholds produced
automatically by our approach tend to be higher than the ones
from Lanza and Marinescu. Threshold C values are from 2 to
4 times Threshold OO, with the exception of the VERYHIGH
threshold of WMC, where the difference is less strong (47 vs
57–64). Another exception is the CM and CINT metrics w.r.t. the Short Memory Cap threshold, which was mapped to MEDIUM in our approach; in this case, the values differ by one unit at most. We can attribute the large difference in values between the thresholds from [2] and the thresholds from our approach to the fact that we are selecting very narrow portions of the whole distribution in our cropping procedure. This moves the focus to the highest values in the metric distribution.
It is also worth noting the difference among Thresholds A,
B and C. The highest difference is obviously between A and
B, since in B we are selecting a small part of the distribution
for threshold calculation. The difference between B and C is
less pronounced, yet it displays interesting properties. First
of all, different selection criteria can change the sign of the
difference between B and C. For example, WMC is increased
in God Class and decreased in Data Class. Second, the values
did not change for metrics with values in a small range, e.g.,
MAXNESTING. This is natural, since the chance of having a
different value in the respective distribution is much smaller,
also considering that we already cropped the datasets. Overall, in 12 cases out of 14, Threshold B ≥ Threshold C, meaning that the most evident effect of filtering the dataset is to lower the thresholds.
IV. CHALLENGES AND LIMITATIONS
Despite the greater level of automation that our proposed
method provides, we do not claim that there is no need for
human interpretation or adjustment. For instance, our method cannot identify whether an overall tendency in a given metric is symptomatic of a greater design issue. For example, one could verify that DIT should not exceed a certain threshold, but if all DIT values are too low, then the system does not benefit from the power of object-orientation and is probably facing a larger design issue. Instead, we propose this approach as
a complement to other analysis techniques, intended to sim-
plify the task of metric/code smell interpretation and quality
evaluation.
We are also aware that derivation of metrics thresholds from
a benchmark dataset introduces dependency on the dataset and
its representativeness. Also, threshold values may be depen-
dent on the way metrics were computed. Further investigation
needs to be done to determine the degree of sensitivity of
this approach with respect to different benchmark datasets,
precomputed metrics, and different metric extraction tools.
However, our intended contribution is the method/strategy to extract threshold values, and not the threshold values themselves.¹ The idea is that this approach can be used and repeated across different contexts so that code analysis (in particular code smell analysis) can be calibrated according to the context of analysis.
Furthermore, a full experiment evaluating the effect of
the application of our thresholds for the detection of code
smells has not been performed yet. The construction of
such an experiment is also complicated, because we have
no reference benchmark to evaluate our approach. From our
preliminary results, we can say that the reported thresholds are more selective than the reference ones from [2], i.e., they produce fewer false positives, while we cannot currently quantify the effect on false negatives. We are investigating the results of our approach from different perspectives, e.g., the effect of selecting thresholds using datasets from specific domains, or of using different criteria for selecting the QF cut point.

¹ These values are used primarily to illustrate our approach.
V. RELATED WORK
In the literature, besides the already cited work by Alves et
al. [4], different approaches and techniques to derive metric
thresholds can be found, e.g., via observation, correlation
with defects, metrics analysis, statistical analysis, and machine
learning.
In earlier work, Coleman, McCabe, and Nejmeh defined metric thresholds according to their experience [8], [9], [10].
Other efforts for defining metric threshold values can be
seen in the work by Henderson-Sellers [11], who suggested
three ranges for different metrics to classify classes into
three categories: safe, flag, and alarm. Rosenberg [12], [13]
suggested a set of threshold values for the Chidamber-Kemerer
(CK) metrics. These values can be used to select classes for
inspection or redesign.
Shatnawi et al. [14] and Benlarbi et al. [15] attempt to
identify thresholds by investigating the correlation of the
metrics with the existence of bugs or failures. Shatnawi et al.
derive thresholds to predict the existence of bugs using three
releases of the Eclipse project, while Benlarbi et al. investigate
the relation of metric thresholds and software failures for a
subset of the CK metrics using linear regression. However, both approaches are valid only for a specific error prediction model and thus lack general applicability.
Other authors tackle the threshold computation using metric analysis. An example is the work by Erni and Lewerentz [16], who proposed a threshold derivation approach based on the mean and the standard deviation of a given metric. Often these analyses assume that the metrics are normally distributed (as is the case in [16]). We postulate, alongside the authors of [17], [18], [19], that this assumption does not hold in most cases, as metrics often follow a power-law distribution.
More recent works have also used benchmarks for threshold derivation. Oliveira et al. [20] propose the concept of relative
thresholds for evaluating metrics data following heavy-tailed
distributions. The proposed thresholds are relative because
they assume that metric thresholds should be followed by most
source code entities, but that it is also natural to have a number
of entities in the “long-tail” that do not follow the defined
limits. They introduced the notion of a relative threshold, i.e.,
a pair including an upper limit and a percentage of classes
whose metric values should not exceed this limit and describe
an empirical method for extracting relative thresholds from
real systems. They argue that the proposed thresholds express
a balance between real and idealized design practices. Oliveira
et al. [21] extend their previous work and propose RTTool,
an open source tool for extracting relative thresholds from the
measurement data of a benchmark of software systems.
Ferreira et al. [22] defined thresholds for six object-oriented
metrics using a benchmark of 40 Java systems. They assert that the extracted metric values, except for DIT, follow a heavy-tailed distribution. Based on their experience, they defined
three threshold categories, but without establishing the per-
centage of expected classes for each category.
Another approach, proposed by Foucault et al. [23], assumes that thresholds depend greatly on the context (such as the programming language or application domain); thus, the threshold computation is done by taking the context into account, making a trade-off between the representativeness of the threshold and the computational cost. Their approach is based on an unbiased
selection of software entities and makes no assumptions on the
statistical properties of the metrics.
In order to facilitate metrics interpretation, Vasa et al. [24]
propose analyzing software metrics over a range of software
projects by using the Gini coefficient. They observed that many
metrics not only display remarkably high Gini values, but these
values are remarkably consistent as a project evolves over time.
Serebrenik et al. [25] propose a new approach for aggregating software metrics from the micro-level (e.g., methods, classes and packages) to the macro-level of the entire software system. They used the Theil index, an econometric measure of inequality, to study the impact of different categorizations of the artifacts, e.g., based on the development technology or developers’ teams, on the inequality of the measured metrics. The Theil index is not metric-specific and can be used to aggregate values produced by a wide range of metrics satisfying certain conditions.
Mordal et al. [26], considering the growing need for quality
assessment of software systems in the industry, proposed and
analyzed the requirements for aggregating metrics for quality
assessment purposes. They present a solution through the
Squale model for metric aggregation and validate the adequacy
of Squale through experiments on Eclipse. Moreover, they
compare the Squale model to both traditional aggregation
techniques (e.g., the arithmetic mean), and to econometric
inequality indices (e.g., the Gini or the Theil indices).
Finally, Herbold et al. [27] used a machine learning algorithm for calculating thresholds, where classes and methods are classified according to whether or not they satisfy the computed thresholds.
They used 4 metrics to analyze methods and functions and 7
metrics for the analysis of classes.
None of the above-mentioned approaches has the derivation of thresholds for code smell detection as its central focus. In the literature, we did not find any approaches comparable with our work, except for the approach of Lanza and Marinescu,
which combines statistical analysis and expert judgment (as
briefly described in Section III). In our approach, we provide
a filter mechanism to include in the benchmark only metrics
regarding particular methods or classes, according to the code
smell to be detected; and we automatically select values
within the metrics distribution, without manually adjusting the
percentiles representing the threshold values.
Our approach, like other previous work, 1) fully relies on data extracted from software systems, 2) takes into consideration the metric distribution properties, without making unproven assumptions about the nature of the data, and 3) is completely automatic and repeatable. Finally, we derive thresholds at a lower level of granularity (i.e., at the method or class level), in contrast to other approaches such as that of Alves et al. [4], which uses a system-level granularity.
VI. CONCLUSION AND FUTURE WORK
In this work, we proposed a data-driven approach to de-
rive metric thresholds. We claim our method is repeatable,
transparent and enables contextual customization. We have
illustrated our approach by deriving metric thresholds from
74 Open Source systems from the Qualitas Corpus.
Our method allows systematically quantifying the elements of an informal definition (e.g., a threshold for metrics that measure class complexity, to identify “high complexity” classes)
by observing the statistical properties of the metrics of inter-
est, and by taking into account the metrics distribution (via
non-parametric statistical analysis techniques). The proposed
method can be used for more general software quality as-
sessment tasks, as it allows defining ranges for metric values,
which in turn can be exploited for detecting different kinds of
design violations.
We are currently comparing the thresholds derived by this approach with those previously obtained through a machine
learning approach [28]. From our preliminary results, we
observe that thresholds compared with the operator have
comparable values for God Class and Brain Method smells.
For Data Class, the rules are different across approaches, and
so are the threshold values. The comparison of code smell
detection tools is a complex task [29] and we aim to work
towards this direction by exploiting our BBT method.
Future work includes the comparison of the results of our
BBT derivation approach with other thresholds available in
the literature or extracted by other tools, e.g., the recent
RTTool [21]. In addition, we aim to evaluate our approach by
considering aspects related to the context [30] of the projects
involved in the benchmark, such as the application domain,
size, age of the systems, and number of changes.
Furthermore, we are interested in integrating historical data
into our threshold derivation method. In particular, the evo-
lution of the metric values in different versions of a system
can be exploited for fine-tuning thresholds, or for predicting
possible future threshold violations.
Additional future work includes exploring the usefulness of
the thresholds by investigating if the derived code smell rules
help us to better distinguish classes and methods more prone to defects, by analysing medium-to-large industrial Java systems.
As mentioned previously, the derivation of metric thresholds from benchmark datasets introduces a dependency on the benchmarks and their representativeness. In this work, we have considered the Qualitas Corpus, but in the future we aim to consider industrial projects as part of the benchmark, as well as other corpora or precomputed metrics datasets, e.g., COMETS² [31], Helix³ and Promise⁴, and to consider other metrics within the datasets. More formal methods can also be applied to maximize the representativeness of the datasets [32].
² http://java.llp.dcc.ufmg.br/comets/index.html
³ http://www.ict.swin.edu.au/research/projects/helix/
⁴ https://code.google.com/p/promisedata
REFERENCES
[1] M. Fowler, Refactoring: Improving the Design of Existing Code.
Addison-Wesley, 1999.
[2] M. Lanza and R. Marinescu, Object-Oriented Metrics in Practice.
Springer, 2005.
[3] E. Tempero, C. Anslow, J. Dietrich, T. Han, J. Li, M. Lumpe, H. Melton,
and J. Noble, “The qualitas corpus: A curated collection of java code
for empirical studies,” in Proceedings of the 17th Asia Pacific Software
Engineering Conference. Sydney, NSW, Australia: IEEE Computer
Society, December 2010, pp. 336–345.
[4] T. L. Alves, C. Ypma, and J. Visser, “Deriving metric thresholds
from benchmark data,” in Proc. IEEE Int’l Conf. Softw. Maintenance
(ICSM2010). Timisoara, Romania: IEEE, Sep. 2010, pp. 1–10.
[5] R. Baggen, J. Correia, K. Schill, and J. Visser, “Standardized code
quality benchmarking for improving software maintainability,” Software
Quality Journal, vol. 20, no. 2, pp. 287–307, 2012.
[6] R. J. Hyndman and Y. Fan, “Sample quantiles in statistical packages,”
The American Statistician, vol. 50, no. 4, pp. 361–365, Nov. 1996.
[Online]. Available: http://www.jstor.org/stable/2684934
[7] N. E. Fenton and S. L. Pfleeger, Software Metrics - A Rigorous And
Practical Approach. PWS Publishing, 1998.
[8] D. Coleman, B. Lowther, and P. Oman, “The application of software
maintainability models in industrial software systems,” Journal of Sys-
tems and Software, vol. 29, no. 1, pp. 3–16, 1995.
[9] T. McCabe, “A complexity measure,” IEEE Transactions on Software
Engineering, no. 4, pp. 308–320, 1976.
[10] B. Nejmeh, “NPATH: a measure of execution path complexity and its
applications,” Communications of the ACM, vol. 31, no. 2, 1988.
[11] B. Henderson-Sellers, Object-oriented Metrics: Measures of Complexity.
Prentice Hall, 1996.
[12] L. H. Rosenberg, “Metrics for object oriented environment,” in Proc.
EFAITP/AIE 3rd Annual Software Metrics Conference, 1997.
[13] L. H. Rosenberg, “Applying and interpreting object oriented metrics,”
Software Assurance Technology Center at NASA Goddard Space Flight
Center, Tech. Rep., 2001.
[14] R. Shatnawi and W. Li, “Finding software metrics threshold values using
ROC curves,” Journal of Software Maintenance and Evolution: Research
and Practice, vol. 22, no. 1, pp. 1–16, 2010.
[15] S. Benlarbi, K. El Emam, N. Goel, and S. Rai, “Thresholds for
object-oriented measures,” in Proc. 11th Int’l Symp. Software Reliability
Engineering (ISSRE 2000). San Jose, CA, USA: IEEE, Oct. 2000, pp.
24–38.
[16] K. Erni and C. Lewerentz, “Applying design-metrics to object-oriented
frameworks,” in Proc. 3rd Int’l Software Metrics Symposium. Berlin,
Germany: IEEE, Mar. 1996, pp. 64–74.
[17] G. Concas, M. Marchesi, S. Pinna, and N. Serra, “Power-laws in a
large object-oriented software system,” IEEE Transactions on Software
Engineering, vol. 33, no. 10, pp. 687–708, Oct. 2007.
[18] P. Louridas, D. Spinellis, and V. Vlachos, “Power laws in software,”
ACM Transactions on Software Engineering and Methodology, vol. 18,
no. 1, pp. 1–26, Sep. 2008.
[19] R. Wheeldon and S. Counsell, “Power law distributions in class rela-
tionships,” in Proc. 3rd IEEE International Workshop on Source Code
Analysis and Manipulation (SCAM 2003). IEEE, 2005, pp. 45–54.
[20] P. Oliveira, M. Valente, and F. Paim Lima, “Extracting relative thresholds
for source code metrics,” in Software Evolution Week - IEEE Conference
on Software Maintenance, Reengineering and Reverse Engineering
(CSMR-WCRE 2014). Antwerp, Belgium: IEEE Computer Society,
Feb. 2014, pp. 254–263.
[21] P. Oliveira, F. Lima, M. Valente, and A. Serebrenik, “RTTool: A
tool for extracting relative thresholds for source code metrics,” in
IEEE International Conference on Software Maintenance and Evolution
(ICSME 2014). Victoria, BC, Canada: IEEE Computer Society, Sep.
2014, pp. 629–632.
[22] K. A. Ferreira, M. A. Bigonha, R. S. Bigonha, L. F. Mendes, and H. C.
Almeida, “Identifying thresholds for object-oriented software metrics,”
Journal of Systems and Software, vol. 85, no. 2, pp. 244–257, 2012.
[23] M. Foucault, M. Palyart, J.-R. Falleri, and X. Blanc, “Computing
contextual metric thresholds,” in Proceedings of the 29th Annual ACM
Symposium on Applied Computing (SAC’14). Gyeongju, Republic of
Korea: ACM, 2014, pp. 1120–1125.
[24] R. Vasa, M. Lumpe, P. Branch, and O. Nierstrasz, “Comparative analysis
of evolving software systems using the gini coefficient,” in IEEE Interna-
tional Conference on Software Maintenance (ICSM 2009). Edmonton,
AB, Canada: IEEE Computer Society, Sep. 2009, pp. 179–188.
[25] A. Serebrenik and M. van den Brand, “Theil index for aggregation of
software metrics values,” in IEEE International Conference on Software
Maintenance (ICSM 2010). Timisoara, Romania: IEEE Computer
Society, Sep. 2010, pp. 1–9.
[26] K. Mordal, N. Anquetil, J. Laval, A. Serebrenik, B. Vasilescu, and
S. Ducasse, “Software quality metrics aggregation in industry,” Journal
of Software: Evolution and Process, vol. 25, no. 10, pp. 1117–1135,
2013.
[27] S. Herbold, J. Grabowski, and S. Waack, “Calculation and optimization
of thresholds for sets of software metrics,” Empirical Software Engi-
neering, vol. 16, no. 6, pp. 812–841, 2011.
[28] F. Arcelli Fontana, M. Zanoni, A. Marino, and M. V. Mäntylä, “Code
smell detection: towards a machine learning-based approach,” in Proc.
29th IEEE International Conference on Software Maintenance (ICSM
2013), ERA Track. Eindhoven, The Netherlands: IEEE, Sep. 2013, pp.
396–399.
[29] F. Arcelli Fontana, P. Braione, and M. Zanoni, “Automatic detection
of bad smells in code: An experimental assessment,” Journal of Object
Technology, vol. 11, no. 2, pp. 5:1–38, Aug. 2012.
[30] F. Zhang, A. Mockus, Y. Zou, F. Khomh, and A. E. Hassan, “How
does context affect the distribution of software maintainability metrics?”
in Proceedings of the 29th IEEE International Conference on Software
Maintenance (ICSM 2013), Eindhoven, The Netherlands, Sep. 2013, pp.
350–359.
[31] C. Couto, C. Maffort, R. Garcia, and M. T. Valente, “COMETS: A
dataset for empirical research on software evolution using source code
metrics and time series analysis,” SIGSOFT Softw. Eng. Notes, vol. 38,
no. 1, pp. 1–3, Jan. 2013.
[32] M. Nagappan, T. Zimmermann, and C. Bird, “Representativeness in
software engineering research,” Microsoft Research, Tech. Rep. MSR-
TR-2012-93, September 2012.
[33] M. Lorenz and J. Kidd, Object-Oriented Software Metrics: A Practical
Guide. PTR Prentice Hall, 1994.
[34] R. Marinescu, “Measurement and Quality in Object Oriented Design,”
Doctoral Thesis, Politehnica University of Timisoara, 2002.
[35] S. R. Chidamber and C. F. Kemerer, “A metrics suite for object oriented
design,” IEEE Transactions on Software Engineering, vol. 20, no. 6, pp.
476–493, 1994.
APPENDIX A
CODE SMELLS DEFINITIONS
We provide the definition of the smells we have considered
in this paper according to the definitions proposed in the
book [2], plus the definition for the Message Chain smell [1].
We considered also this smell in our experiments; we defined
also metrics for its detection (see Table VII).
God Class: The God Class design flaw refers to classes that
tend to centralize the intelligence of the system. A God Class
performs too much work on its own, delegating only minor
details to a set of trivial classes and using the data from other
classes. This has a negative impact on the reusability and the
understandability of that part of the system.
Data Class: Data Classes are “dumb” data holders without complex functionality, but other classes strongly rely on them. The lack of functionally relevant methods may indicate that related data and behavior are not kept in one place; this is a sign of a non-object-oriented design. Data Classes are the manifestation of a lack of data encapsulation and of poor data-functionality proximity.
Brain Method: Often a method starts out as a “normal”
method but then more and more functionality is added to
it until it gets out of control, becoming hard to maintain or
understand. Brain Methods tend to centralize the functionality of a class, in the same way as a God Class centralizes the functionality of an entire subsystem, or sometimes even a whole system.

Table VII
ADDITIONAL EVALUATED METRICS

Quality Dimension  Metric Label  Metric Name                                                 Granularity
Size               LOCNAMM       Lines of Code Without Accessor or Mutator Methods           Class
                   NOPK          Number of Packages                                          Project
                   NOCS          Number of Classes                                           Project, Package
                   NOM           Number of Methods [33]                                      Project, Package, Class
                   NOMNAMM       Number of Not Accessor or Mutator Methods                   Project, Package, Class
                   NOA           Number of Attributes                                        Class
Complexity         WMCNAMM       Weighted Methods Count of Not Accessor or Mutator Methods   Class
                   AMW           Average Methods Weight [2]                                  Class
                   AMWNAMM       Average Methods Weight of Not Accessor or Mutator Methods   Class
                   CLNAMM        Called Local Not Accessor or Mutator Methods                Method
                   NOP           Number of Parameters                                        Method
                   ATLD          Access to Local Data                                        Method
                   NOLV          Number of Local Variables [34]                              Method
Coupling           FANOUT        - [2]                                                       Class, Method
                   FANIN         -                                                           Class
                   FDP           Foreign Data Providers [2]                                  Method
                   RFC           Response for a Class [35], [7]                              Class
                   CBO           Coupling Between Object Classes                             Class
                   CFNAMM        Called Foreign Not Accessor or Mutator Methods              Class, Method
                   MaMCL         Maximum Message Chain Length                                Method
                   MeMCL         Mean Message Chain Length                                   Method
                   NMCS          Number of Message Chain Statements                          Method
Encapsulation      LAA           Locality of Attribute Accesses [2]                          Method
Inheritance        DIT           Depth of Inheritance Tree [35]                              Class
                   NOI           Number of Interfaces                                        Project, Package
                   NOC           Number of Children [35]                                     Class
                   NMO           Number of Methods Overridden [33]                           Class
                   NIM           Number of Inherited Methods [33]                            Class
                   NOII          Number of Implemented Interfaces                            Class
Dispersed Coupling: This is the case of an operation which is excessively tied to many other operations in the system, and whose provider methods are, additionally, dispersed among many classes. In other words, this is the case where a single
operation communicates with an excessive number of provider
classes, whereby the communication with each of the classes is
not very intense, i.e., the operation calls one or a few methods
from each class.
Shotgun Surgery: Not only outgoing dependencies cause
trouble, but also incoming ones. This design disharmony
means that a change in an operation implies many (small)
changes to a lot of different operations and classes. This
disharmony tackles the issue of strong afferent (incoming) coupling, and it regards not only the coupling strength but also
the coupling dispersion.
Message Chain: You see message chains when a client asks
one object for another object, which the client then asks for
yet another object, which the client then asks for yet another
object, and so on. You may see these as a long line of getThis
methods, or as a sequence of temps.
APPENDIX B
ADDITIONAL METRICS FOR SMELL DETECTION
Table VII lists all the metrics, other than those listed in Table I, that we computed and for which we derived thresholds. For each metric, we report the quality dimension it is related to, and the kind of software elements it has been applied to (Granularity).
Metrics in bold are metrics we have defined for code
smell detection. We think these metrics capture useful features
for smell detection. We include other metrics defined in the
literature for software quality assessment.
In our metrics, we often exclude accessor methods (getters
and setters). We think that these methods often create noise in
metrics extraction, especially regarding complexity and size
measures, since they are usually created and perceived as
generated code.
5353
... lines of code) than other components in the system [6]. Originally, GC was defined using a fixed threshold on the lines of code [6], ARCAN however uses a variable benchmark based on the frequencies of the number of lines of code of the other packages in the system [23]. Adopting a benchmark to derive the detection threshold fits particularly well in this case because what is considered a "large component" depends on the size of other components in the system under analysis and in many other systems. ...
... 3.3.2.2 Size-based smells: God Component is a smell that is detected based on the number of lines of code an artefact has (calculated summing up the LOC of the directly contained files) and whether it exceeds a certain threshold. The threshold is calculated using an adaptive statistical approach that takes into consideration the number of LOC of the other packages in the system and in a benchmark of over 100 systems [23]. The adaptive threshold is defined in such a way that it is always larger than the median lines of code of the packages/components in the system and benchmark. ...
Article
Full-text available
A key aspect of technical debt (TD) management is the ability to measure the amount of principal accumulated in a system. The current literature contains an array of approaches to estimate TD principal, however, only a few of them focus specifically on architectural TD, but none of them satisfies all three of the following criteria: being fully automated, freely available, and thoroughly validated. Moreover, a recent study has shown that many of the current approaches suffer from certain shortcomings, such as relying on hand-picked thresholds. In this paper, we propose a novel approach to estimate architectural technical debt principal based on machine learning and architectural smells to address such shortcomings. Our approach can estimate the amount of technical debt principal generated by a single architectural smell instance. To do so, we adopt novel techniques from Information Retrieval to train a learning-to-rank machine learning model (more specifically, a gradient boosting machine) that estimates the severity of an architectural smell and ensure the transparency of the predictions. Then, for each instance, we statically analyse the source code to calculate the exact number of lines of code creating the smell. Finally, we combine these two values to calculate the technical debt principal. To validate the approach, we conducted a case study and interviewed 16 practitioners, from both open source and industry, and asked them about their opinions on the TD principal estimations for several smells detected in their projects. The results show that for 71% of instances, practitioners agreed that the estimations provided were representative of the effort necessary to refactor the smell.
... lines of code) than other components in the system [6]. Originally, GC was defined using a fixed threshold on the lines of code [6], ARCAN however uses a variable benchmark based on the frequencies of the number of lines of code of the other packages in the system [22]. Adopting a benchmark to derive the detection threshold fits particularly well in this case because what is considered a "large component" depends on the size of other components in the system under analysis and in many other systems. ...
... 3.3.2.2 Size-based smells: God Component is a smell that is detected based on the number of lines of code an artefact has (calculated summing up the LOC of the directly contained files) and whether it exceeds a certain threshold. The threshold is calculated using an adaptive statistical approach that takes into consideration the number of LOC of the other packages in the system and in a benchmark of over 100 systems [22]. The adaptive threshold is defined in such a way that it is always larger than the median lines of code of the packages/components in the system and benchmark. ...
Preprint
Full-text available
A key aspect of technical debt (TD) management is the ability to measure the amount of principal accumulated in a system. The current literature contains an array of approaches to estimate TD principal, however, only a few of them focus specifically on architectural TD, and none of these are fully automated, freely available, and thoroughly validated. Moreover, a recent study has shown that many of the current approaches suffer from certain shortcomings, such as relying on hand-picked thresholds. In this paper, we propose a novel approach to estimate architectural technical debt principal based on machine learning and architectural smells to address such shortcomings. Our approach can estimate the amount of technical debt principal generated by a single architectural smell instance. To do so, we adopt novel techniques from Information Retrieval to train a learning-to-rank machine learning model that estimates the severity of an architectural smell and ensure the transparency of the predictions. Then, for each instance, we statically analyse the source code to calculate the exact number of lines of code creating the smell. Finally, we combine these two values to calculate the technical debt principal. To validate the approach, we conducted a case study and interviewed 16 practitioners, from both open source and industry, and asked them about their opinions on the TD principal estimations for several smells detected in their projects. The results show that for 71\% of instances, practitioners agreed that the estimations provided were \emph{representative} of the effort necessary to refactor the smell.
... On the other hand, few studies specified the use of maintenance support tools (RQ-M5), with the main tool being static code analyzers. The handling of thresholds that activate metrics for this type of software was mentioned as a problem in previous publications [60], indicating code as defective when it is not and vice versa. Software has benefitted from such tools, but their use with factory configurations or by default should be avoided. ...
Article
Full-text available
While some areas of software engineering knowledge present great advances with respect to the automation of processes, tools, and practices, areas such as software maintenance have scarcely been addressed by either industry or academia, thus delegating the solution of technical tasks or human capital to manual or semiautomatic forms. In this context, machine learning (ML) techniques play an important role when it comes to improving maintenance processes and automation practices that can accelerate delegated but highly critical stages when the software launches. The aim of this article is to gain a global understanding of the state of ML-based software maintenance by using the compilation, classification, and analysis of a set of studies related to the topic. The study was conducted by applying a systematic mapping study protocol, which was characterized by the use of a set of stages that strengthen its replicability. The review identified a total of 3776 research articles that were subjected to four filtering stages, ultimately selecting 81 articles that were analyzed thematically. The results reveal an abundance of proposals that use neural networks applied to preventive maintenance and case studies that incorporate ML in subjects of maintenance management and management of the people who carry out these tasks. In the same way, a significant number of studies lack the minimum characteristics of replicability.
... Some researchers do not create full machine learning models but instead only tune the parameters of pre-created models [37,38]. Often, this tuning uses sophisticated statistical [39] or evolutionary [40] techniques. ...
Article
Full-text available
Context Code smells are patterns in source code associated with an increased defect rate and a higher maintenance effort than usual, but without a clear definition. Code smells are often detected using rules hard-coded in detection tools. Such rules are often set arbitrarily or derived from data sets tagged by reviewers without the necessary industrial know-how. Conclusions from studying such data sets may be unreliable or even harmful, since algorithms may achieve higher values of performance metrics on them than on models tagged by experts, despite not being industrially useful. Objective Our goal is to investigate the performance of various machine learning algorithms for automated code smell detection trained on a code smell data set (MLCQ) derived from actively developed and industry-relevant projects, with reviews performed by experienced software developers. Method We assign the severity of the smell to the code sample according to a consensus between the severities assigned by the reviewers, use the Matthews Correlation Coefficient (MCC) as our main performance metric to account for the entire confusion matrix, and compare median values to account for non-normal distributions of performance. We compare 6720 models built using eight machine learning techniques. The entire process is automated and reproducible. Results The performance of the compared techniques depends heavily on the analyzed smell. The median value of our performance metric for the best algorithm was 0.81 for Long Method, 0.31 for Feature Envy, 0.51 for Blob, and 0.57 for Data Class. Conclusions Random Forest and Flexible Discriminant Analysis performed the best overall, but in most cases the performance difference between them and the median algorithm was no more than 10% of the latter. The performance results were stable over multiple iterations. Although the F-score omits one quadrant of the confusion matrix (and thus may differ from MCC), in code smell detection the actual differences are minimal.
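The study's main performance metric, the Matthews Correlation Coefficient, can be computed directly from the four quadrants of the confusion matrix. The short sketch below is a generic illustration with made-up counts, not code from the cited study.

```python
# Matthews Correlation Coefficient from a binary confusion matrix.
# Unlike the F-score, MCC uses all four quadrants (TP, FP, TN, FN).
from math import sqrt

def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Example: a detector that finds 40 true smells, raises 10 false alarms,
# correctly rejects 930 clean samples, and misses 20 smells.
print(round(mcc(tp=40, fp=10, tn=930, fn=20), 2))  # about 0.71
```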
Article
Background Developers use Static Analysis Tools (SATs) to control for potential quality issues in source code, including defects and technical debt. Tool vendors have devised quite a number of tools, which makes it harder for practitioners to select the most suitable one for their needs. To better support developers, researchers have been conducting several studies on SATs to foster the understanding of their actual capabilities. Aims Despite the work done so far, there is still a lack of knowledge regarding (1) how much the tools agree with each other, and (2) how precise their recommendations are. We aim to bridge this gap by proposing a large-scale comparison of six popular SATs for Java projects: Better Code Hub, CheckStyle, Coverity Scan, FindBugs, PMD, and SonarQube. Methods We analyze 47 Java projects, applying the six SATs. To assess their agreement, we compare them by manually analyzing, at line and class level, whether they identify the same issues. Finally, we evaluate the precision of the tools against a manually defined ground truth. Results The key results show little to no agreement among the tools and a low degree of precision. Conclusion Our study provides the first overview of the agreement among different tools as well as an extensive analysis of their precision, which can be used by researchers, practitioners, and tool vendors to map the current capabilities of the tools and envision possible improvements.
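As a hedged illustration of the two questions studied (agreement and precision), the sketch below treats each tool's output as a set of flagged (file, line) locations and computes precision against a ground-truth set and pairwise Jaccard agreement. The set-based formulation and the sample data are assumptions for illustration, not the study's exact protocol.

```python
# Illustrative sketch (not the study's exact protocol): measuring the precision
# of one tool against a manually defined ground truth, and the pairwise
# agreement of two tools, over sets of flagged (file, line) locations.

def precision(flagged: set, ground_truth: set) -> float:
    return len(flagged & ground_truth) / len(flagged) if flagged else 0.0

def jaccard_agreement(flagged_a: set, flagged_b: set) -> float:
    union = flagged_a | flagged_b
    return len(flagged_a & flagged_b) / len(union) if union else 1.0

tool_a = {("Foo.java", 12), ("Foo.java", 40), ("Bar.java", 7)}
tool_b = {("Foo.java", 12), ("Baz.java", 3)}
truth  = {("Foo.java", 12), ("Bar.java", 7)}

print(precision(tool_a, truth))           # 0.666... (2 of 3 warnings are real)
print(jaccard_agreement(tool_a, tool_b))  # 0.25 (1 shared location out of 4)
```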
Article
Full-text available
Context Facebook’s React is a widely popular JavaScript library for building rich and interactive user interfaces (UI). However, due to the complexity of modern Web UIs, React applications can have hundreds of components and source code files. Therefore, front-end developers face increasing challenges when designing and modularizing React-based applications. As a result, it is natural to expect maintainability problems in React-based UIs due to suboptimal design decisions. Objective To help developers with these problems, we propose a catalog of twelve React-related code smells and a prototype tool to detect the proposed smells in React-based Web apps. Method The smells were identified by conducting a grey literature review and by interviewing six professional software developers. We also applied the tool to the top-10 most popular GitHub projects that use React and conducted a historical analysis to check how often developers remove the proposed smells. Results We detected 2,565 instances of the proposed code smells. The results show that removal rates range from 0.9% to 50.5%. The smell with the highest removal rate is Large File (50.5%). The smells with the lowest removal rates are Inheritance Instead of Composition (IIC) (0.9%) and Direct DOM Manipulation (14.7%). Conclusion The list of React smells proposed in this paper, as well as the tool to detect them, can assist developers in improving the source code quality of React applications. While the catalog describes common problems with React applications, our tool helps to detect them. Our historical analysis also shows the importance of each smell from the developers’ perspective, showing how often each smell is removed.
Chapter
The presence of bad smells in code hampers software’s maintainability, comprehensibility, and extensibility. A type of code smell, which is common in software projects, is “duplicated code” bad smell, also known as code clones. These types of smells generally arise in a software system due to the copy-paste-modify actions of software developers. They can either be exact copies or copies with certain modifications. Different clone detection techniques exist, which can be broadly classified as text-based, token-based, abstract syntax tree-based (AST-based), metrics-based, or program dependence graph-based (PDG-based) approaches based on the amount of preprocessing required on the input source code. Researchers have also built clone detection techniques using a hybrid of two or more approaches described above. In this paper, we did a narrative review of the metrics-based techniques (solo or hybrid) reported in the previously published studies and analyzed them for their quality in terms of run-time efficiency, accuracy values, and the types of clones they detect. This study can be helpful for practitioners to select an appropriate set of metrics, measuring all the code characteristics required for clone detection in a particular scenario. Keywords: Clone detection, Metrics-based techniques, Hybrid clone detection techniques, Categorization, Qualitative analysis.
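To make the metrics-based category concrete, the sketch below summarizes each code fragment as a small vector of metrics and reports fragment pairs whose vectors lie within a distance threshold as clone candidates. The chosen metrics, the Euclidean distance, and the threshold value are illustrative assumptions, not a specific technique from the reviewed studies.

```python
# Minimal sketch of the metrics-based clone detection idea: each code
# fragment is summarized by a vector of metrics (e.g., LOC, number of
# branches, number of calls), and fragment pairs whose vectors are within
# a small distance are reported as clone candidates.
from itertools import combinations
from math import dist

fragments = {
    "A.foo": (30, 4, 7),    # (LOC, branches, calls) -- illustrative values
    "B.bar": (31, 4, 7),
    "C.baz": (120, 15, 2),
}

THRESHOLD = 2.0  # maximum Euclidean distance to report a candidate pair

for (n1, v1), (n2, v2) in combinations(fragments.items(), 2):
    if dist(v1, v2) <= THRESHOLD:
        print(f"clone candidate: {n1} <-> {n2}")
# prints: clone candidate: A.foo <-> B.bar
```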
Article
Defect prediction is commonly used to reduce the effort in the testing phase of software development. A promising strategy is to use machine learning techniques to predict which software components may be defective. Features are key factors for the prediction’s success, and thus extracting significant features can improve the model’s accuracy. In particular, code smells are a category of features that have been shown to improve prediction performance significantly. However, Design code smells, a state-of-the-art collection of code smells based on violations of object-oriented programming principles, have not been studied in the context of defect prediction. In this paper, we study the performance of defect prediction models by training multiple classifiers for 97 real projects. We compare models that use Design code smells as features, models that use Traditional smells from the literature, and models that use both. Moreover, we cluster and analyze the models’ performance based on the categories of Design code smells. We conclude that the models trained with both the Design code smells and the Traditional smells performed the best, with an improvement of 4.1% in the AUC score compared to models trained with only Traditional smells. Consequently, Design smells are a good addition to the smells commonly studied in the literature for defect prediction.
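A minimal, hypothetical sketch of the kind of setup described: per-class smell counts used as features for a defect classifier evaluated with AUC. The tiny in-memory dataset, the feature names, and the choice of Random Forest are illustrative assumptions, not the study's actual pipeline.

```python
# Hypothetical sketch: training a defect-prediction classifier whose features
# are counts of Design and Traditional smell instances (plus size) per class.
# The data below is made up for illustration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# rows: [design_smells, traditional_smells, loc]; label: 1 = defective class
X = [[3, 1, 500], [0, 0, 120], [5, 2, 800], [1, 0, 200],
     [4, 3, 650], [0, 1, 150], [6, 2, 900], [0, 0, 100]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```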
Article
Full-text available
Software metrics have been developed to measure the quality of software systems. A proper use of metrics requires thresholds to determine whether the value of a metric is acceptable or not. Many approaches propose to define thresholds based on large analyses of software systems. However, it has been shown that thresholds depend greatly on the context of the project. Thus, there is a need for an approach that computes thresholds by taking this context into account. In this paper we propose such an approach, with the objective of reaching a trade-off between the representativeness of the threshold and the computation cost. Our approach is based on an unbiased selection of software entities and makes no assumptions about the statistical properties of the software metric values. It can therefore be used by anyone, from developer to manager, to compute a representative metric threshold tailored to their context.
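A minimal sketch of the general idea follows, assuming a percentile-based cut on a random sample of metric values drawn from context-relevant systems; the sample size, the 90th percentile, and the synthetic data are illustrative choices and not the cited paper's exact procedure.

```python
# Illustrative sketch: deriving a context-specific threshold by sampling
# metric values from systems in the same context and taking an upper
# percentile, with no distributional assumptions. The 90th percentile and
# the sample size are arbitrary illustrative choices.
import random
import statistics

def derive_threshold(metric_values, sample_size=500, percentile=90, seed=42):
    """Return the given upper percentile of a random sample of metric values."""
    values = list(metric_values)
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return statistics.quantiles(sample, n=100)[percentile - 1]

# Synthetic, heavy-tailed LOC-per-method values standing in for a benchmark
random.seed(1)
loc_per_method = [int(random.paretovariate(1.5) * 5) for _ in range(2000)]
print(derive_threshold(loc_per_method))
```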
Conference Paper
Full-text available
Meaningful thresholds are essential for promoting source code metrics as an effective instrument to control the internal quality of software systems. Despite the increasing number of source code measurement tools, no publicly available tools support the extraction of metric thresholds. Moreover, earlier studies suggest that in larger systems a significant number of classes exceeds recommended metric thresholds. Therefore, in our previous study we introduced the notion of a relative threshold, i.e., a pair including an upper limit and a percentage of classes whose metric values should not exceed this limit. In this paper we propose RTTOOL, an open source tool for extracting relative thresholds from the measurement data of a benchmark of software systems. RTTOOL is publicly available at http://aserg.labsoft.dcc.ufmg.br/rttool.
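The sketch below illustrates the relative-threshold concept, i.e., a pair (p, k) stating that at least p% of classes should have metric values no greater than k. Choosing k as the median of the per-system p-th percentiles is a deliberate simplification for illustration; it is not RTTOOL's actual extraction algorithm.

```python
# Simplified sketch of the relative-threshold idea: a pair (p, k) meaning
# "at least p% of the classes of a system should have metric values <= k".
# Here k is chosen as the median of the per-system p-th percentiles of a
# benchmark, a simplification of RTTOOL's actual extraction method.
import statistics

def relative_threshold(benchmark: dict, p: int = 90) -> tuple:
    """benchmark maps system name -> list of per-class metric values."""
    per_system_k = [
        statistics.quantiles(values, n=100)[p - 1]
        for values in benchmark.values()
    ]
    return p, statistics.median(per_system_k)

benchmark = {  # tiny made-up benchmark
    "sysA": [3, 5, 8, 10, 14, 30],
    "sysB": [2, 4, 4, 6, 9, 22, 45],
    "sysC": [1, 3, 5, 7, 12, 18],
}
p, k = relative_threshold(benchmark)
print(f"{p}% of classes should have metric <= {k:.1f}")
```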
Conference Paper
Full-text available
Software metrics have many uses, e.g., defect prediction, effort estimation, and benchmarking an organization against peers and industry standards. In all these cases, metrics may depend on the context, such as the programming language. Here we aim to investigate whether the distributions of commonly used metrics do, in fact, vary with six context factors: application domain, programming language, age, lifespan, the number of changes, and the number of downloads. For this preliminary study we select 320 nontrivial software systems from SourceForge. These software systems are randomly sampled from nine popular application domains of SourceForge. We calculate 39 metrics commonly used to assess software maintainability for each software system and use the Kruskal-Wallis test and the Mann-Whitney U test to determine whether there are significant differences among the distributions with respect to each of the six context factors. We use Cliff's delta to measure the magnitude of the differences and find that all six context factors affect the distribution of 20 metrics and that the programming language factor affects 35 metrics. We also briefly discuss how each context factor may affect the distribution of metric values. We expect our results to help software benchmarking and other software engineering methods that rely on these commonly used metrics to be tailored to a particular context.
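For illustration, the sketch below applies the same statistical machinery (Kruskal-Wallis, Mann-Whitney U, and a Cliff's delta derived from the U statistic) to synthetic metric values grouped by a hypothetical context factor; the data and group names are made up, not taken from the cited study.

```python
# Illustrative sketch on synthetic data: testing whether the distribution of
# a metric (e.g., methods per class) differs across a context factor (here,
# programming language), as the cited study does for real systems.
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(0)
java   = rng.lognormal(mean=2.0, sigma=0.6, size=300)  # synthetic metric values
csharp = rng.lognormal(mean=2.1, sigma=0.6, size=300)
c      = rng.lognormal(mean=1.6, sigma=0.7, size=300)

h_stat, p_overall = kruskal(java, csharp, c)
print(f"Kruskal-Wallis p-value: {p_overall:.4f}")

u_stat, p_pair = mannwhitneyu(java, c, alternative="two-sided")
print(f"Mann-Whitney U (Java vs C) p-value: {p_pair:.4f}")

# Cliff's delta effect size from the U statistic (no-ties approximation)
cliffs_delta = 2 * u_stat / (len(java) * len(c)) - 1
print(f"Cliff's delta (Java vs C): {cliffs_delta:.2f}")
```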
Conference Paper
Full-text available
Establishing credible thresholds is a central challenge for promoting source code metrics as an effective instrument to control the internal quality of software systems. To address this challenge, we propose the concept of relative thresholds for evaluating metrics data following heavy-tailed distributions. The proposed thresholds are relative because they assume that metric thresholds should be followed by most source code entities, but that it is also natural to have a number of entities in the “long-tail” that do not follow the defined limits. In the paper, we describe an empirical method for extracting relative thresholds from real systems. We also report a study on applying this method in a corpus with 106 systems. Based on the results of this study, we argue that the proposed thresholds express a balance between real and idealized design practices.
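Complementing the extraction sketch shown earlier, the snippet below checks whether a single system complies with a given relative threshold (p, k), tolerating a long tail of entities above k as long as the p% limit holds; the metric values and the chosen (p, k) are made up for illustration.

```python
# Small sketch: checking whether a system complies with a relative threshold
# (p, k), i.e., whether at least p% of its classes have metric values <= k.
# A long tail of classes above k is allowed as long as the p% limit holds.

def complies(metric_values, p: int, k: float) -> bool:
    within = sum(1 for v in metric_values if v <= k)
    return 100.0 * within / len(metric_values) >= p

wmc_per_class = [4, 7, 9, 11, 13, 15, 18, 22, 60, 140]  # illustrative values
print(complies(wmc_per_class, p=80, k=25))  # True: 8 of 10 classes are <= 25
```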
Article
Full-text available
With the growing need for quality assessment of entire software systems in the industry, new issues are emerging. First, because most software quality metrics are defined at the level of individual software components, there is a need for aggregation methods to summarize the results at the system level. Second, because a software evaluation requires the use of different metrics, with possibly widely varying output ranges, there is a need to combine these results into a unified quality assessment. In this paper we derive, from our experience on real industrial cases and from the scientific literature, requirements for an aggregation method. We then present a solution through the Squale model for metric aggregation, a model specifically designed to address the needs of practitioners. We empirically validate the adequacy of Squale through experiments on Eclipse. Additionally, we compare the Squale model to both traditional aggregation techniques (e.g., the arithmetic mean), and to econometric inequality indices (e.g., the Gini or the Theil indices), recently applied to aggregation of software metrics.
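As a small illustration of why the choice of aggregation matters, the sketch below aggregates per-class complexity values with the arithmetic mean and with the Gini inequality index, two of the alternatives compared in the cited work. The Gini computation is the standard formula, not the Squale model, and the data is made up.

```python
# Illustrative sketch: aggregating per-class metric values to the system
# level with (a) the arithmetic mean and (b) the Gini inequality index.
# This is a standard Gini computation, not the Squale model itself.

def gini(values):
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    cumulative = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * cumulative / (n * total) - (n + 1) / n

complexity_per_class = [1, 1, 2, 2, 3, 3, 4, 5, 25, 60]  # one very complex class
print(sum(complexity_per_class) / len(complexity_per_class))  # mean: 10.6
print(round(gini(complexity_per_class), 2))  # about 0.68: high inequality
```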
Article
While software metrics are a generally desirable feature in the software management functions of project planning and project evaluation, they are of especial importance with a new technology such as the object-oriented approach. This is due to the significant need to train software engineers in generally accepted object-oriented principles. This paper presents theoretical work that builds a suite of metrics for object-oriented design. In particular, these metrics are based upon measurement theory and are informed by the insights of experienced object-oriented software developers. The proposed metrics are formally evaluated against a widely accepted list of software metric evaluation criteria.
Conference Paper
Several code smell detection tools have been developed, providing different results, because smells can be subjectively interpreted and hence detected in different ways. Usually, the detection techniques are based on the computation of different kinds of metrics, while other aspects related to the domain of the system under analysis, its size, and other design features are not taken into account. In this paper we propose an approach, which we are currently studying, based on machine learning techniques. We outline some common problems faced in smell detection and describe the different steps of our approach and the algorithms we use for the classification.