Comparing Maintainability Index, SIG Method, and SQALE for
Technical Debt Identification
Peter Strečanský
Masaryk University
Brno, Czech Republic
xstrec05@mail.muni.cz
Stanislav Chren
Masaryk University
Brno, Czech Republic
chren@mail.muni.cz
Bruno Rossi
Masaryk University
Brno, Czech Republic
brossi@mail.muni.cz
ABSTRACT
Many techniques have emerged to evaluate software Technical Debt (TD). However, differences in reporting TD are not yet widely studied, even though they can give different perceptions about the evolution of TD in projects. The goal of this paper is to compare three TD identification techniques: i. Maintainability Index (MI), ii. SIG TD models, and iii. SQALE analysis. Considering 17 large open source Python libraries, we compare TD measurement time series in terms of trends in different sets of releases (major, minor, micro). While all methods report generally growing trends of TD over time, MI, SIG TD, and SQALE all report different patterns of TD evolution.
CCS CONCEPTS
• Social and professional topics → Management of computing and information systems; Software maintenance;
KEYWORDS
Software Technical Debt, Software Maintenance, Software Quality,
Maintainability Index, SIG Method, SQALE
1 INTRODUCTION
Technical Debt (TD) is a metaphor coined by Ward Cunningham in 1993 that draws an analogy between poor decisions during software development and economic debt. Even though short-term decisions can speed up development or the release process, there is an unavoidable interest that will have to be paid in the future. In general, the impact of TD can be quite relevant for industry. Many studies found that TD has negative financial impacts on companies (e.g., [14]). Every hour of developer time spent on fixing poor design or figuring out how badly documented code works with other modules, instead of developing new features, is essentially a waste of money from the company's point of view.
The goal of the paper is to compare three main techniques for source code TD identification that were proposed over time: i. the Maintainability Index (MI) (1994), which was one of the first attempts to measure TD and is still in use, ii. SIG TD models (2011), which were defined in search of proper code metrics for TD measurement, and iii. SQALE (2011), a framework that attempts to put into more practical terms the indications from the ISO/IEC 9126 software quality standard (recently replaced by ISO/IEC 25010:2011).
As a method, we compare the time series derived from the different techniques, looking at the results in terms of trends and time series similarities.
2 TECHNICAL DEBT (TD)
One of the most widespread definitions of TD is from McConnell in 2008: "A design or construction approach that's expedient in the short term but that creates a technical context in which the same work will cost more to do later than it would cost to do now (including increased cost over time)" [8]. In the research context, Guo et al. presented TD as an "incomplete, immature, or inadequate artifact in the software development lifecycle" [4]. Theodoropoulos proposed a new, broader definition of TD: "Technical debt is any gap within the technology infrastructure or its implementation which has a material impact on the required level of quality" [13]. Gat (2011) proposed an even more extensive definition of TD: "Quality issues in the code other than function/feature completeness", divided into intrinsic and extrinsic quality issues [1]. One of the most important shortcomings of TD definitions is the fact that there is yet to be a unified measurement unit [11]. It is generally complex to quantify most forms of TD. There are also different categories of technical debt: code debt, design and architectural debt, environment debt (connected to the hardware/software ecosystem), knowledge distribution and documentation debt, and testing debt [14].
There are not many studies that compare alternative TD identification methods. One reason could be the complexity and time required to implement the methods; another is the limited comparability of the metrics they define. Furthermore, Izurieta et al. [5] note that it can be difficult to compare alternative TD measurement methods due to the missing ground truth and the uncertainties of the measurement process. One of the earliest studies to compare metrics for TD identification was the study by Zazworka et al. [15], comparing four alternative methods across different versions of Apache Hadoop: a) modularity violations, b) design patterns grime build-up, c) source code smells, and d) static code analysis. The focus was on comparing how such methods behave at the class level. The findings were that the TD identification techniques indicate different classes as part of the problems, with few overlaps between the methods. Furthermore, Griffith et al. [3] compared ten releases of ten open source systems with three methods of TD identification (i. the SonarQube TD plug-in, ii. a method based on TD identification using a cost model based on detected violations, and iii. a method defining design disharmonies to derive issues in quality). These methods were compared against software quality models. The authors found that only one method had a strong correlation to the quality attributes of reusability and understandability.
3 MI, SQALE, SIG TD COMPARISON
For the definition of the experimental evaluation, we defined the following goal: to analyze technical debt evaluation techniques (MI, SQALE, SIG TD) for the purpose of comparing their similarity with respect to the trends and evolution of the measurements from the point of view of practitioners aiming at measuring TD. The goal was refined into three main research questions (RQs):
RQ1: Are the trends of the measurements provided by the three methods comparable? Metric: Pearson correlation between trends.
RQ2: How are trends of TD comparable across different release types? Metric: comparison of trends by release type.
RQ3: How much can one method be used to forecast another one? Metric: Granger causality between time series.
3.1 Compared TD Measurement Techniques
Of the many methods that have been proposed over time, we focus on three representative methods for TD identification: i. the Maintainability Index (MI), ii. SIG TD models, and iii. SQALE analysis. These are methods that practitioners can still find implemented in TD analysis tools.
3.1.1 Maintainability Index (MI). MI was introduced in 1994, with the goal of finding a simple, applicable model that is generic enough for a wide range of software [10]:

$MI = 171 - 5.2 \ln(HV) - 0.23\,CC - 16.2 \ln(LoC) + 0.99\,CMT$

where HV is the average Halstead Volume per module, CC is the average Cyclomatic Complexity per module, LoC is the average lines of code per module, and CMT is the average lines of comments per module. A derivative of the MI formula is still used in some popular code editors (e.g., Microsoft Visual Studio), so MI is relatively easy for practitioners to adopt to detect TD.
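As a minimal illustration, the formula can be computed directly from the four per-module averages. The metric values below are hypothetical; in practice, tools such as the Radon library (see Section 3.3.4) derive them from source code:

```python
import math

def maintainability_index(hv, cc, loc, cmt):
    """Four-metric MI variant [10]; all inputs are per-module averages:
    hv = Halstead Volume, cc = Cyclomatic Complexity,
    loc = lines of code, cmt = lines of comments."""
    return (171
            - 5.2 * math.log(hv)
            - 0.23 * cc
            - 16.2 * math.log(loc)
            + 0.99 * cmt)

# Hypothetical per-module averages for one release of a project
print(round(maintainability_index(hv=1500.0, cc=10.0, loc=200.0, cmt=40.0), 2))
```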
3.1.2 SIG TD Models. The Software Improvement Group (SIG) defined in 2011 a model that quantifies TD based on an estimation of repair effort and an estimation of maintenance effort, which provides a clear picture of the cost of repair, its benefits, and the expected payback period [9]. Quantifying the TD of a project is done in several steps and requires the calculation of three different variables: Rebuild Value (RV), Rework Fraction (RF), and Repair Effort (RE).
Rebuild Value is defined as an estimate of the effort (in man-months) that needs to be spent to rebuild a system using a particular technology. To calculate this value, the following formula is used:

$RV = SS \times TF$

where SS is the System Size in lines of code and TF is a Technology Factor, a language productivity factor.
Rework Fraction is defined as an estimate of the percentage of LoC to be changed in order to improve the quality by one level. The values of the RF between two quality levels are empirically defined [9].
Finally, the Repair Effort is calculated by multiplying the Rework Fraction by the Rebuild Value. It is possible to further multiply it by a Refactoring Adjustment (RA) metric, which captures external, context-specific aspects of the project that represent a discount on the overall technical debt of the project:

$RE = RF \times RV \times RA$
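A minimal sketch of this calculation follows; all input values are made up for illustration (the empirically calibrated TF and RF values are given in [9]):

```python
def sig_repair_effort(system_size_loc, technology_factor,
                      rework_fraction, refactoring_adjustment=1.0):
    """SIG TD model [9]: RV = SS * TF (man-months), RE = RF * RV * RA."""
    rebuild_value = system_size_loc * technology_factor
    return rework_fraction * rebuild_value * refactoring_adjustment

# Hypothetical inputs: a 100 kLoC system, an illustrative technology
# factor, a 12% rework fraction, and a 10% refactoring discount (RA = 0.9).
re_man_months = sig_repair_effort(system_size_loc=100_000,
                                  technology_factor=0.00036,
                                  rework_fraction=0.12,
                                  refactoring_adjustment=0.9)
print(f"Repair Effort: {re_man_months:.1f} man-months")
```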
3.1.3 SQALE. SQALE (Software Quality Assessment based on Lifecycle Expectations) focuses on the operationalization of the ISO 9126 software quality standard by means of several code metrics that are attached to the taxonomy defined in ISO 9126 [6]. Mimicking the ISO 9126 standard, SQALE has a first level defining characteristics (e.g., testability), then sub-characteristics (e.g., unit testing testability), and finally source-code-level requirements. Such source code requirements are then mapped to remediation indexes that translate into the time/effort required to fix the issues. For the calculation of TD, the Remediation Cost (RC) represents the cost to fix the violations of the rules that have been defined for each category [7]:

$RC = \frac{\sum_{rule} effortToFix(violations_{rule})}{8\,[hr/day]}$
For SQALE, we adopted the SonarQube implementation: the default set of rules was used, which is claimed to be the best-practice, minimum set of rules to assess technical debt.
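The aggregation itself is straightforward, as the sketch below shows. The rule names and per-violation remediation efforts are hypothetical; in SonarQube each rule ships with its own remediation function:

```python
def remediation_cost_days(effort_to_fix_hours, violation_counts):
    """SQALE Remediation Cost [7]: per-rule fix effort summed over all
    violations, converted to days using an 8-hour working day."""
    total_hours = sum(effort_to_fix_hours[rule] * count
                      for rule, count in violation_counts.items())
    return total_hours / 8.0

# Hypothetical rules with per-violation remediation effort (in hours)
efforts = {"too-complex-expression": 0.5, "too-many-parameters": 0.3}
violations = {"too-complex-expression": 12, "too-many-parameters": 7}
print(f"RC = {remediation_cost_days(efforts, violations):.2f} days")
```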
3.2 Rationale & Methods
To compare the three methods, we looked at the time series of all the measures collected by the three methods. The tested packages were randomly selected from the list of the 5000 most popular Python libraries (the full list can be found in [12]). We define a time series $T$, consisting of data points of TD at each release time $R = \{r_1, r_2, \ldots, r_n\}$, as $T = \{t_{r_1}, t_{r_2}, \ldots, t_{r_n}\}$. The MI measure is an inverse of the other measures, as it gives an indication of the maintainability of the project (the lower, the worse), while the other methods give an indication of TD accumulating (the higher, the worse). For the subsequent parts of the analysis, we therefore reversed the MI index to make the time series comparable.
For RQ1 (trends of measurements), we compute TD's $\Delta$ measurements between two consecutive releases for each of the projects. For release $r_i$, $\Delta TD$ is defined as:

$\Delta TD_{r_i} = \frac{t_{r_i} - t_{r_{i-1}}}{(t_{r_i} + t_{r_{i-1}})/2}$

We then compute the Pearson correlations between all the points of each of the compared methods. Results of the $\Delta TD_{r_i}$ for each of the time series are also shown in aggregated form in boxplots.
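A short sketch of this step follows, with hypothetical per-release TD values for two of the methods; the use of scipy here is our assumption, as the paper only states that the analysis was implemented in Python:

```python
import numpy as np
from scipy.stats import pearsonr

def td_deltas(series):
    """Relative TD change between consecutive releases:
    (t_i - t_{i-1}) / ((t_i + t_{i-1}) / 2)."""
    t = np.asarray(series, dtype=float)
    return (t[1:] - t[:-1]) / ((t[1:] + t[:-1]) / 2.0)

# Hypothetical per-release TD values for one project; an MI series
# would be reversed first so that higher values mean more TD.
sqale = [10.0, 12.0, 12.5, 15.0, 14.0, 16.0]
sig_td = [8.0, 9.5, 10.0, 13.0, 12.5, 14.5]
r, p = pearsonr(td_deltas(sqale), td_deltas(sig_td))
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```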
For RQ2, we consider different types of project releases: major (e.g., 0.7.3 → 1.0.0), minor (e.g., 0.7.3 → 0.8.0), and micro (e.g., 0.9.0 → 0.9.1), looking at differences in TD trends. To answer this research question, we look at TD's $\Delta$ as increasing trends (TD↑), as decreasing trends (TD↓), and as steady periods between releases in which TD did not change (TD→), where $\Delta TD_{r_i}$ is categorized into one of the categories:

$TD = \begin{cases} TD\uparrow, & \text{if } \Delta TD_{r_i} > 0 \\ TD\downarrow, & \text{if } \Delta TD_{r_i} < 0 \\ TD\rightarrow, & \text{otherwise} \end{cases}$
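The categorization reduces to a sign test on each delta, as in this sketch (the delta values are hypothetical):

```python
from collections import Counter

def categorize(delta):
    """Map a release-to-release TD delta to one of the three trend classes."""
    if delta > 0:
        return "TD-up"       # rising: TD increased
    if delta < 0:
        return "TD-down"     # falling: TD repaid
    return "TD-steady"       # unchanged between the two releases

# Hypothetical delta-TD values over a sequence of releases
deltas = [0.18, 0.04, 0.0, -0.02, 0.0, 0.11]
counts = Counter(categorize(d) for d in deltas)
for trend in ("TD-up", "TD-down", "TD-steady"):
    print(f"{trend}: {100 * counts[trend] / len(deltas):.1f}%")
```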
For RQ3, we look at how much one of the three methods can be used to forecast the results of another method. We take the time series of the measurements from the three methods (MI, SIG TD, SQALE) and compute Granger causality between methods in pairs. The Granger causality test, first proposed in 1969 by Clive Granger, is a statistical hypothesis test used to determine whether one time series can be used to predict another time series' values [2]. More precisely, we can report that $T1$ "Granger-causes" $T2$ if the lags of $T1$ (i.e., $T1_{t-1}$, $T1_{t-2}$, $T1_{t-3}$, ...) provide predictive capability over $T2$ beyond what is allowed by considering $T2$'s own lags. The null hypothesis is that T2 does not Granger-cause the time series of T1. We adopted the standard SSR-based F-test. If the probability value is less than 0.05, T2 Granger-causes T1.
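The paper reports using the statsmodels implementation (Section 3.3.4); a minimal sketch with synthetic series is shown below. The series are fabricated for illustration, built so that one lags the other:

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic series: 'sqale' is a random walk, and the reversed-MI series
# partly follows it with one release of lag, so we expect 'sqale' to
# Granger-cause 'mi_rev'.
rng = np.random.default_rng(42)
sqale = np.cumsum(rng.normal(1.0, 0.5, size=40))
mi_rev = np.empty_like(sqale)
mi_rev[0] = sqale[0]
mi_rev[1:] = 0.8 * sqale[:-1] + rng.normal(0.0, 0.3, size=39)

# grangercausalitytests checks whether the SECOND column Granger-causes
# the FIRST; the SSR-based F-test is one of the reported statistics.
data = np.column_stack([mi_rev, sqale])
results = grangercausalitytests(data, maxlag=2, verbose=False)
f_stat, p_value, _, _ = results[1][0]["ssr_ftest"]  # lag-1 results
print(f"SSR F-test (lag 1): F = {f_stat:.2f}, p = {p_value:.4f}")
```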
3.3 Results
3.3.1 RQ.1 Are the trends of the measurements provided by the three methods comparable? To compare TD over time between the techniques, we used the Pearson correlation coefficient to measure linear relationships between variables. Fig. 1 reports the boxplots of the correlations between the trends for each release ($\Delta TD_{r_i}$). Each datapoint in the boxplot is a correlation for one project. The three boxplots propose the comparisons SQALE-SIG (median: 0.71), SQALE-MI (0.56), and SIG-MI (0.75). SQALE and MI are the least comparable methods, with the highest number of negative correlations and much higher variability. SQALE & SIG and SIG & MI showed similar distributions of the correlations, in favor of SIG & MI (median: 0.75), which has fewer negative correlations and lower variance.
Figure 1: $\Delta TD_{r_i}$ correlation between the different methods.
We ran Wilcoxon Signed-Rank Tests, paired tests, to evaluate the mean rank differences of the correlations. For the Wilcoxon Signed-Rank Test, we calculate the effect size as $r = Z/\sqrt{N}$, where $N = \#cases \times 2$, to account for non-independent paired samples, using Cohen's definition to discriminate between small (0.0-0.3), medium (0.3-0.6), and large (>0.6) effects. The difference is statistically significant for SQALE-MI vs SIG-MI (p-value 0.044 ≤ 0.05, two-tailed, medium effect size, $r = 0.34$), while it is not significant for SQALE-SIG vs SQALE-MI (p-value 0.423 > 0.05, two-tailed).
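A sketch of this test and effect size computation follows, with hypothetical per-project correlations; recovering |Z| from the two-sided p-value is one common convention, and scipy is our assumption here:

```python
import math
from scipy.stats import norm, wilcoxon

# Hypothetical per-project trend correlations for two method pairs
sqale_mi = [0.31, 0.55, 0.48, 0.62, 0.12, 0.71, 0.40, 0.58, 0.35, 0.66]
sig_mi = [0.52, 0.70, 0.61, 0.80, 0.45, 0.83, 0.57, 0.74, 0.49, 0.78]

stat, p = wilcoxon(sqale_mi, sig_mi)  # paired, two-sided by default
z = norm.isf(p / 2)                   # |Z| recovered from the two-sided p
n = len(sqale_mi) * 2                 # N = #cases * 2 (paired samples)
r = z / math.sqrt(n)
print(f"p = {p:.3f}, effect size r = {r:.2f}")
```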
When we look at the trends in the comparisons between every two consecutive releases (Table 1), the trend is similar for SIG and MI (as per the correlations discussed above), with a slight difference in the falling trend. This seems to indicate that, according to MI, TD tends to be repaid more often than with SIG. With SQALE, however, we can observe that TD was more stable across the different releases (Table 1).
RQ1 Findings. SIG TD and MI are the models that show statistically significant comparability in terms of the correlation of TD change trends. SQALE and SIG TD also show similarities, though not statistically significant ones. Generally, SQALE and MI are the models that show the lowest correlation in trends.
Table 1: TD trends on all releases

         TD↑       TD↓       TD→
SQALE    33.15%    6.47%     60.38%
SIG TD   72.24%    13.75%    14.02%
MI       61.19%    21.83%    16.98%
3.3.2 RQ.2 How are trends of TD comparable across different release types? This RQ is similar to RQ1, but in RQ2 we look at the comparison based on release types, i.e., whether major, minor, and micro releases matter for the differences in TD identification. Comparisons solely between major releases yielded interesting results (Table 2), similar to the results on all releases. Across all comparisons, most major releases caused TD to rise under every analysis. SQALE again had the highest number of steady trends, and the most TD repayments (falling trends) were recorded with MI.
Table 2: TD trends on major releases

         TD↑       TD↓       TD→
SQALE    63.33%    13.33%    32.33%
SIG TD   73.33%    26.67%    0%
MI       56.67%    43.33%    0%
The rising of TD is stronger at the minor release level for both SIG TD and MI, as each of these methods shows an increase in the rising trend compared to major releases (see Table 3). SQALE, instead, showed a decrease in growing trends. As in the previous cases, SQALE recorded more periods of steady TD, and MI the most repayments of TD (to a much larger extent than SIG TD and SQALE).
Table 3: TD trends on minor releases

         TD↑       TD↓       TD→
SQALE    42.48%    8.50%     49.02%
SIG TD   85.62%    7.19%     7.19%
MI       71.24%    20.92%    7.84%
The last comparison was done on micro releases. The same trends were observed at this level: the vast majority of releases induced more TD under SIG and MI, while under SQALE the majority of releases did not change TD (see Table 4). Again, MI is the method that reports the most TD repayment (21.99%).
Table 4: TD trends on micro releases

         TD↑       TD↓       TD→
SQALE    39.27%    7.33%     53.40%
SIG TD   72.77%    14.66%    12.57%
MI       60.73%    21.99%    17.28%
RQ2 Findings. Considering major, minor, and micro releases, MI and SIG TD mostly show a majority of growing trends. SQALE shows the most TD steady states, while MI shows much larger TD repayment periods compared to the other methods. These patterns seem to be consistent across major, minor, and micro releases.
3.3.3 RQ.3 How much can one method be used to forecast another one? A time series $X$ can be said to Granger-cause another time series $Y$ if the probability of correctly forecasting $Y_{t+1}$, with $t = 1, \ldots, T$, increases by including information about $X_t$ in addition to the information included in $Y$ alone. In the case of the three methods reviewed, this means that the measurements from one method (e.g., SQALE) can be used together with the measurements of another method (e.g., MI) to provide a better prediction of future values of the complemented method (e.g., MI). Aggregated results of the Granger causality tests over all projects give us an indication of how much the time series resulting from one TD identification technique can help in forecasting the time series of the other methods.
Granger causality, differently from correlation, is generally an asymmetric property: the fact that in one project's time series SQALE Granger-causes the SIG TD model does not imply that the SIG TD model Granger-causes the SQALE results. For this reason, we provide all combinations of results, with counters for how many times the results were positive according to the F-test across the 17 projects. For some of the tested libraries, the linear trend of TD in SQALE caused the tests to end with an error, as the Granger causality test does not apply in the case of non-linear relationships. In general, SQALE-MI and SQALE-SIG TD had a significant F-statistic in around one third of the projects (29.4%), while in the majority of the other cases Granger causality was negative. These results could indicate that by considering the lagged values of the SQALE time series we can get better predictions of the values in the MI and SIG TD time series.
Table 5: Aggregated results of Granger causality tests (percentage of F-test results)

              True     False    None
SQALE - MI    29.4%    58.8%    11.8%
SQALE - SIG   29.4%    58.8%    11.8%
SIG - SQALE   17.6%    70.6%    11.8%
MI - SIG      11.8%    88.2%    0%
MI - SQALE    0%       88.2%    11.8%
SIG - MI      11.8%    88.2%    0%
RQ3 Findings. Results from Granger causality show that the TD identification measurements are rather independent. Only the SQALE time series Granger-causes both MI and SIG TD, in one third of the projects, while for the other methods there is mostly no Granger causality.
3.3.4 Replicability. The analysis was implemented and run using Python version 3.6.3. A derivative of the Python Radon library (https://radon.readthedocs.io/en/latest/index.html) was used. The sonar-python plugin (https://docs.sonarqube.org/display/PLUG/SonarPython) was used for the SQALE analysis. The statsmodels library was used for computing the Granger causality tests. Scripts can be run to execute, aggregate, and plot all the results in a semi-automated way [12].
4 CONCLUSIONS
Comparing three main methods for TD identification (Maintainability Index (MI), SIG TD models, SQALE) on a set of 17 Python projects, we can see in all cases increasing trends of reported TD. However, there are different patterns in the final evolution of the measurement time series. MI and SIG TD generally report more growing trends of TD compared to SQALE, which shows more periods of steady TD. MI is the method that reports considerably more repayments of TD compared to the other methods. SIG TD and MI are the models that show the most similarity in the way TD evolves, while SQALE and MI are the least comparable. Granger causality for all projects and combinations of methods shows that there is limited dependency between the TD time series. We could find some relationships between SQALE & MI and SQALE & SIG TD models, in the sense that previous lags of the SQALE time series could be used to improve the predictions of the other models in one third of the projects.
Acknowledgments. The work was supported by the ERDF/ESF project "CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence" (No. CZ.02.1.01/0.0/0.0/16_019/0000822).
REFERENCES
[1] Israel Gat. 2011. Technical Debt: Assessment and Reduction. Agile 2011: Technical Debt Workshop. https://agilealliance.org/wp-content/uploads/2016/01/Technical_Debt_Workshop_Gat.pdf
[2] C. W. J. Granger. 1969. Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica 37, 3 (1969), 424-438.
[3] I. Griffith, D. Reimanis, C. Izurieta, Z. Codabux, A. Deo, and B. Williams. 2014. The Correspondence Between Software Quality Models and Technical Debt Estimation Approaches. In Int. Workshop on Managing Technical Debt. 19-26.
[4] Yuepu Guo, Rodrigo Oliveira Spínola, and Carolyn Seaman. 2016. Exploring the costs of technical debt management - a case study. Empirical Software Engineering 21, 1 (2016), 159-182.
[5] Clemente Izurieta, Isaac Griffith, Derek Reimanis, and Rachael Luhr. 2013. On the uncertainty of technical debt measurements. In 2013 International Conference on Information Science and Applications (ICISA). IEEE, 1-4.
[6] Jean-Louis Letouzey. 2012. The SQALE method for evaluating technical debt. In Third International Workshop on Managing Technical Debt (MTD). IEEE, 31-36.
[7] Alois Mayr, Reinhold Plösch, and Christian Körner. 2014. A benchmarking-based model for technical debt calculation. In 2014 14th International Conference on Quality Software. IEEE, 305-314.
[8] Steve McConnell. 2008. Managing Technical Debt. Technical Report. Construx Software, Bellevue, WA. 14 pages. https://www.construx.com/developer-resources/whitepaper-managing-technical-debt/
[9] Ariadi Nugroho, Joost Visser, and Tobias Kuipers. 2011. An Empirical Model of Technical Debt and Interest. In Proceedings of the 2nd Workshop on Managing Technical Debt (MTD '11). ACM, New York, NY, USA, 1-8.
[10] Paul Oman and Jack Hagemeister. 1994. Construction and testing of polynomials predicting software maintainability. Journal of Systems and Software 24, 3 (1994), 251-266. https://doi.org/10.1016/0164-1212(94)90067-1 Oregon Workshop on Software Metrics.
[11] K. Schmid. 2013. On the limits of the technical debt metaphor: some guidance on going beyond. In 2013 4th International Workshop on Managing Technical Debt (MTD). 63-66. https://doi.org/10.1109/MTD.2013.6608681
[12] Peter Strečanský. 2019. Dealing with Software Development Technical Debt. Master Thesis. Masaryk University. https://is.muni.cz/auth/th/x0boz/master_thesis_digital.pdf
[13] Ted Theodoropoulos, Mark Hofberg, and Daniel Kern. 2011. Technical Debt from the Stakeholder Perspective. In Proceedings of the 2nd Workshop on Managing Technical Debt (MTD '11). ACM, New York, NY, USA, 43-46.
[14] Edith Tom, Aybüke Aurum, and Richard Vidgen. 2013. An exploration of technical debt. Journal of Systems and Software 86, 6 (2013), 1498-1516.
[15] Nico Zazworka, Clemente Izurieta, Sunny Wong, Yuanfang Cai, Carolyn Seaman, Forrest Shull, et al. 2014. Comparing four approaches for technical debt identification. Software Quality Journal 22, 3 (2014), 403-426.
Over recent years the topic of technical debt has gained significant attention in the software engineering community. The area of technical debt research is somewhat peculiar within software engineering as it is built on a metaphor. This has certainly benefited the field as it helps to achieve a lot of attention and eases communication about the topic, however, it seems it is to some extent also sidetracking research work, if the metaphor is used beyond its range of applicability. In this paper, we focus on the limits of the metaphor and the problems that arise when over-extending its applicability. We do also aim at providing some additional insights by proposing certain ways of handling these restrictions.