Comparing Maintainability Index, SIG Method, and SQALE for
Technical Debt Identification
Peter Strečanský
Masaryk University
Brno, Czech Republic
xstrec05@mail.muni.cz
Stanislav Chren
Masaryk University
Brno, Czech Republic
chren@mail.muni.cz
Bruno Rossi
Masaryk University
Brno, Czech Republic
brossi@mail.muni.cz
ABSTRACT
Many techniques have emerged to evaluate software Technical Debt (TD). However, differences in reporting TD are not yet widely studied, even though they can give different perceptions about the evolution of TD in projects. The goal of this paper is to compare three TD identification techniques: i. Maintainability Index (MI), ii. SIG TD models, and iii. SQALE analysis. Considering 17 large open source Python libraries, we compare TD measurement time series in terms of trends in different sets of releases (major, minor, micro). While all methods report generally growing trends of TD over time, MI, SIG TD, and SQALE all report different patterns of TD evolution.
CCS CONCEPTS
• Social and professional topics → Management of computing and information systems; Software maintenance;
KEYWORDS
Software Technical Debt, Software Maintenance, Software Quality,
Maintainability Index, SIG Method, SQALE
1 INTRODUCTION
Technical Debt (TD) is a metaphor coined by Ward Cunningham in 1993 that draws an analogy between poor decisions during software development and economic debt. Even though short-term decisions can speed up development or the release process, there is an unavoidable interest that will have to be paid in the future.
In general, the impact of TD can be quite relevant for industry. Many studies found that TD has negative financial impacts on companies (e.g., [14]). Every hour of a developer's time spent fixing poor design or figuring out how badly documented code works with other modules, instead of developing new features, is essentially a waste of money from the company's point of view.
The goal of the paper is to compare three main techniques for source code TD identification that were proposed over time: i. the Maintainability Index (MI) (1994), which was one of the first attempts to measure TD and is still in use; ii. SIG TD models (2011), which were defined in search of proper code metrics for TD measurement; and iii. SQALE (2011), a framework that attempts to put into more practical terms the indication from the ISO/IEC
9126 software quality standard (recently superseded by ISO/IEC 25010:2011). As a method, we compare the time series derived from the different techniques, looking at the results in terms of trends and time series similarities.
2 TECHNICAL DEBT (TD)
One of the most widespread definitions of TD is from McConnell in 2008: "A design or construction approach that's expedient in the short term but that creates a technical context in which the same work will cost more to do later than it would cost to do now (including increased cost over time)" [8]. In the research context, Guo et al. presented TD as an "incomplete, immature, or inadequate artifact in the software development lifecycle" [4]. Theodoropoulos proposed a new, broader definition of TD: "Technical debt is any gap within the technology infrastructure or its implementation which has a material impact on the required level of quality" [13]. Gat (2011) proposed an even more extensive definition of TD: "Quality issues in the code other than function/feature completeness", divided into intrinsic and extrinsic quality issues [1]. One of the most important shortcomings of TD definitions is the fact that there is yet no unified measurement unit [11]. It is generally complex to quantify most forms of TD. In addition, there are different categories of technical debt: code debt, design and architectural debt, environment debt (connected to the hardware/software ecosystem), knowledge distribution and documentation debt, and testing debt [14].
There are not many studies that compare alternative TD identification methods. One reason could be the complexity and time required to implement the methods; another is the limited comparability of the metrics defined. Furthermore, Izurieta et al. [5] note that it can be difficult to compare alternative TD measurement methods due to the missing ground truth and the uncertainties of the measurement process. One of the earliest studies to compare metrics for TD identification was the study by Zazworka et al. [15], comparing four alternative methods across different versions of Apache Hadoop: a) modularity violations, b) design patterns grime build-up, c) source code smells, d) static code analysis. The focus was on comparing how such methods behave at the class level. The findings were that the TD identification techniques indicate different classes as part of the problems, with few overlaps between the methods. Furthermore, Griffith et al. [3] compared ten releases of ten open source systems with three methods of TD identification (i. the SonarQube TD plug-in, ii. a method based on a cost model applied to detected violations, iii. a method defining design disharmonies to derive quality issues). These methods were compared against software quality models. The authors found that only one method had a strong correlation to the quality attributes of reusability and understandability.
3 MI, SQALE, SIG TD COMPARISON
For the definition of the experimental evaluation, we defined the following goal: to analyze technical debt evaluation techniques (MI, SQALE, SIG TD) for the purpose of comparing their similarity with respect to the trends and evolution of the measurements from the point of view of practitioners aiming at measuring TD. The goal was refined into three main research questions (RQs):
RQ1: Are the trends of the measurements provided by the three methods comparable? Metric: Pearson correlation between trends.
RQ2: How are trends of TD comparable across different release types? Metric: comparison of trends by release type.
RQ3: How much can one method be used to forecast another one? Metric: Granger causality between time series.
3.1 Compared TD Measurement Techniques
Of the many methods that were proposed over time, we focus on three representative methods for TD identification: i. the Maintainability Index (MI), ii. SIG TD models, and iii. SQALE analysis. These are methods that practitioners can still find implemented in TD analysis tools.
3.1.1 Maintainability Index (MI). MI was introduced in 1994, with the goal of finding a simple, applicable model which is generic enough for a wide range of software [10]:

MI = 171 − 5.2 ln(HV) − 0.23 CC − 16.2 ln(LoC) + 0.99 CMT

where HV is the average Halstead Volume per module, CC is the average Cyclomatic Complexity per module, LoC is the average lines of code per module, and CMT is the average lines of comments per module. A derivative formula of MI is still used in some popular code editors (e.g., Microsoft Visual Studio), so it is relatively easy for practitioners to adopt it to detect TD.
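As an illustration, the following is a minimal sketch of the formula above; the function name and the example per-module averages are our own assumptions, not values from the study.

```python
import math

def maintainability_index(hv: float, cc: float, loc: float, cmt: float) -> float:
    """MI from per-module averages, following the formula above.

    hv  -- average Halstead Volume per module
    cc  -- average Cyclomatic Complexity per module
    loc -- average lines of code per module
    cmt -- average lines of comments per module
    """
    return 171 - 5.2 * math.log(hv) - 0.23 * cc - 16.2 * math.log(loc) + 0.99 * cmt

# Hypothetical project with HV=620, CC=4.2, LoC=210, CMT=35 per module on average.
print(round(maintainability_index(620, 4.2, 210, 35), 2))  # lower values mean worse maintainability
```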
3.1.2 SIG TD Models. The Software Improvement Group (SIG) defined in 2011 a model which quantifies TD based on an estimation of the repair effort and an estimation of the maintenance effort, providing a clear picture of the cost of repair, its benefits, and the expected payback period [9]. Quantifying the TD of a project is done in several steps and requires the calculation of three different variables: Rebuild Value (RV), Rework Fraction (RF), and Repair Effort (RE).
The Rebuild Value is defined as an estimate of the effort (in man-months) that needs to be spent to rebuild a system using a particular technology. To calculate this value, the following formula is used:

RV = SS × TF

where SS is the System Size in lines of code and TF is a Technology Factor, a language productivity factor.
The Rework Fraction is defined as an estimate of the percentage of LoC to be changed in order to improve the quality by one level. The values of the RF between two quality levels are empirically defined [9].
Finally, the Repair Effort is calculated by multiplying the Rework Fraction and the Rebuild Value. It is possible to further multiply it by a Refactoring Adjustment (RA) metric, which captures external, context-specific aspects of the project that represent a discount on the overall technical debt of the project:

RE = RF × RV × RA
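A minimal sketch of the two formulas is shown below; the technology factor, rework fraction, and refactoring adjustment values are placeholders, not the empirically calibrated values from [9].

```python
def rebuild_value(system_size_loc: int, technology_factor: float) -> float:
    """RV = SS x TF: estimated effort (man-months) to rebuild a system of a given size."""
    return system_size_loc * technology_factor

def repair_effort(rework_fraction: float, rv: float, refactoring_adjustment: float = 1.0) -> float:
    """RE = RF x RV x RA: effort to bring the system to the next quality level."""
    return rework_fraction * rv * refactoring_adjustment

# Hypothetical 100 kLoC system with placeholder factors (see [9] for the calibrated values).
rv = rebuild_value(100_000, technology_factor=0.0005)  # man-months
re = repair_effort(rework_fraction=0.15, rv=rv, refactoring_adjustment=0.8)
print(f"RV = {rv:.1f} man-months, RE = {re:.1f} man-months")
```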
3.1.3 SQALE. The Software Quality Assessment based on Lifecycle Expectations (SQALE) method focuses on the operationalization of the ISO 9126 software quality standard by means of several code metrics that are attached to the taxonomy defined in ISO 9126 [6]. Mimicking the ISO 9126 standard, SQALE has a first level defining the characteristics (e.g., testability), then sub-characteristics (e.g., unit testing testability), and then source code level requirements. Such source code requirements are mapped to remediation indexes that translate into the time/effort required to fix the issues. For the calculation of TD, the Remediation Cost (RC) represents the cost of fixing the violations of the rules that have been defined for each category [7]:

RC = Σ_rule effortToFix(violations_rule) / 8 [hr/day]

For SQALE, we adopted the SonarQube implementation: a default set of rules was used, which is claimed to be the best-practice, minimum set of rules to assess technical debt.
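The aggregation can be sketched as follows; the rule identifiers and effort values are hypothetical and do not correspond to SonarQube's actual rule set.

```python
# Hypothetical violations: rule id -> (effort to fix one violation in hours, violation count).
violations = {
    "duplicated-string-literals": (0.2, 40),
    "cognitive-complexity-too-high": (1.0, 12),
    "fixme-comment": (0.1, 25),
}

def remediation_cost_days(violations: dict, hours_per_day: float = 8.0) -> float:
    """RC: total effort to fix all rule violations, expressed in 8-hour working days."""
    total_hours = sum(effort * count for effort, count in violations.values())
    return total_hours / hours_per_day

print(f"Remediation cost: {remediation_cost_days(violations):.2f} person-days")
```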
3.2 Rationale & Methods
To compare the three methods, we looked at the time series of all the measures collected by the three methods. The tested packages were randomly selected from the list of the 5000 most popular Python libraries (the full list can be found in [12]). We define a time series T, consisting of data points of TD at each release time R = {r1, r2, ..., rn}, as T = {t_r1, t_r2, ..., t_rn}. The MI measure is an inverse of the other measures, as it gives an indication of the maintainability of the project (the lower, the worse), while the other methods give an indication of TD accumulating (the higher, the worse). For the other parts of the analysis, to make the time series comparable, we reversed the MI index.
For RQ1 (trends of measurements), we compute the ∆TD measurements between two consecutive releases for each of the projects. For release r_i, ∆TD is defined as:

∆TD_ri = (t_ri − t_ri−1) / ((t_ri + t_ri−1) / 2)

We then compute the Pearson correlations between all the points of each of the compared methods. The results of ∆TD_ri for each of the time series are also shown in aggregated form as boxplots.
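A minimal sketch of this step is shown below; the two TD series are hypothetical, and only the per-release deltas and their Pearson correlation are computed.

```python
from scipy.stats import pearsonr

def delta_td(series):
    """Relative TD change between consecutive releases:
    (t_ri - t_ri-1) / ((t_ri + t_ri-1) / 2)."""
    return [(curr - prev) / ((curr + prev) / 2) for prev, curr in zip(series, series[1:])]

# Hypothetical TD values (one per release) for two methods on the same project.
sqale = [120.0, 125.0, 125.0, 140.0, 150.0]
sig_td = [30.0, 33.0, 34.0, 36.0, 41.0]

corr, p_value = pearsonr(delta_td(sqale), delta_td(sig_td))
print(f"Pearson r = {corr:.2f} (p = {p_value:.3f})")
```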
For RQ2, we consider different types of project releases: major (e.g., 0.7.3 → 1.0.0), minor (e.g., 0.7.3 → 0.8.0), and micro (e.g., 0.9.0 → 0.9.1) releases, looking at differences in TD trends. To answer this research question, we look at ∆TD↑ as increasing trends, ∆TD↓ as decreasing trends, and ∆TD− as periods between releases in which TD did not change, where each ∆TD_ri is categorized as:

∆TD↑, if ∆TD_ri > 0
∆TD↓, if ∆TD_ri < 0
∆TD−, otherwise
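A small sketch of the categorization step, using hypothetical per-release deltas.

```python
from collections import Counter

def categorize(deltas):
    """Map each per-release TD delta to a rising, falling, or steady trend."""
    return ["up" if d > 0 else "down" if d < 0 else "steady" for d in deltas]

# Hypothetical per-release deltas for one project and one method.
deltas = [0.05, 0.0, 0.12, -0.03, 0.0, 0.07]
counts = Counter(categorize(deltas))
print({trend: f"{100 * counts[trend] / len(deltas):.1f}%" for trend in ("up", "down", "steady")})
```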
For RQ3, we look at how much one of the three methods can be used to forecast the results of another method. We take into account the time series of the measurements from the three methods (MI, SIG TD, SQALE) and compute the Granger causality between methods in pairs. The Granger causality test, first proposed in 1969 by Clive Granger, is a statistical hypothesis test which is used to determine whether a time series can be used to predict another time
series values [2]. More precisely, we can report that T1 "Granger-causes" T2 if the lags of T1 (i.e., T1_{t−1}, T1_{t−2}, T1_{t−3}, ...) provide predictive capability over T2 beyond what is allowed by considering the own lags of T2. The null hypothesis is that T2 does not Granger-cause the time series T1. We adopted the standard SSR-based F-test. If the probability value is less than 0.05, T2 Granger-causes T1.
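A minimal sketch of how such a pairwise test can be run with the statsmodels library (the library used for the Granger tests, see Section 3.3.4); the two TD series below are hypothetical.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Hypothetical TD time series for one project (one value per release).
sqale = np.array([120, 125, 125, 140, 150, 151, 160, 166, 170, 181, 185, 190], dtype=float)
mi_rev = np.array([30, 31, 33, 34, 40, 41, 44, 45, 49, 52, 55, 57], dtype=float)

# grangercausalitytests checks whether the SECOND column Granger-causes the FIRST,
# i.e., here: do lagged SQALE values help predict the (reversed) MI series?
data = np.column_stack([mi_rev, sqale])
results = grangercausalitytests(data, maxlag=2)

# SSR-based F-test p-value at lag 1; p < 0.05 would indicate Granger causality.
p_value = results[1][0]["ssr_ftest"][1]
print(f"SSR F-test p-value (lag 1): {p_value:.3f}")
```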
3.3 Results
3.3.1 RQ1. Are the trends of the measurements provided by the three methods comparable? To compare TD over time between the techniques, we used the Pearson correlation coefficient to measure linear relationships between variables. Fig. 1 reports the boxplots of the correlations between the trends for each release (∆TD_ri). Each data point in a boxplot is the correlation for one project. The three boxplots show the comparisons SQALE-SIG (median: 0.71), SQALE-MI (0.56), and SIG-MI (0.75). SQALE and MI are the least comparable methods, with the highest number of negative correlations and much higher variability. SQALE & SIG and SIG & MI showed similar distributions of the correlations, in favor of SIG & MI (median: 0.75), which has fewer negative correlations and lower variance.
Figure 1: ∆TD_ri correlation between the different methods.
We ran Wilcoxon Signed-Rank Tests, paired tests to evaluate the mean rank differences of the correlations. For the Wilcoxon Signed-Rank Test, we calculate the effect size as r = Z/√N, where N = #cases × 2, to consider non-independent paired samples, using Cohen's definition to discriminate between small (0.0–0.3), medium (0.3–0.6), and large effects (> 0.6). The difference is statistically significant for SQALE-MI vs SIG-MI (p-value 0.044 < 0.05, two-tailed, medium effect size r = 0.34), while not significant for SQALE-SIG vs SQALE-MI (p-value 0.423 ≥ 0.05, two-tailed).
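The effect size is not returned directly by common statistics libraries; the sketch below recovers |Z| from the two-tailed p-value under a normal approximation, which is our own assumption, and the correlation values are hypothetical.

```python
import math
from scipy.stats import wilcoxon, norm

# Hypothetical per-project trend correlations for two method pairs (17 projects each).
sqale_mi = [0.56, 0.40, 0.61, 0.70, -0.10, 0.55, 0.62, 0.30, 0.58,
            0.45, 0.66, 0.20, 0.51, 0.72, 0.35, 0.60, 0.48]
sig_mi = [0.75, 0.68, 0.80, 0.77, 0.30, 0.74, 0.79, 0.52, 0.76,
          0.70, 0.82, 0.55, 0.73, 0.85, 0.60, 0.78, 0.69]

stat, p = wilcoxon(sqale_mi, sig_mi)   # paired, two-tailed by default
z = norm.isf(p / 2)                    # |Z| recovered from the two-tailed p-value
n = len(sqale_mi) * 2                  # N = #cases * 2 for non-independent paired samples
r = z / math.sqrt(n)
print(f"p = {p:.3f}, effect size r = {r:.2f}")
```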
When we look at the trends in the comparisons between every two consecutive releases (Table 1), the trend is similar for SIG and MI (as in the previously discussed correlations), with a slight difference in the falling trend. This seems to indicate that, according to MI, TD tends to be repaid more often than according to SIG. With SQALE, however, we can observe that TD was more stable across different releases (Table 1).
RQ1 Findings. SIG TD and MI are the models which show statistically significant comparability in terms of the correlation of the trends of TD changes. SQALE and SIG TD also show similarities, though not statistically significant. Generally, SQALE and MI are the models that show the lowest correlation in trends.
Table 1: TD trends on all releases

         ∆TD↑      ∆TD↓      ∆TD−
SQALE    33.15 %   6.47 %    60.38 %
SIG TD   72.24 %   13.75 %   14.02 %
MI       61.19 %   21.83 %   16.98 %
3.3.2 RQ2. How are trends of TD comparable across different release types? This RQ is similar to RQ1, but in RQ2 we look at the comparison based on the release types, i.e., whether major, minor, and micro releases matter for the differences in TD identification. Comparisons solely between major releases brought interesting results (Table 2), similar to the results on all releases. Throughout all comparisons, most major releases caused TD to rise for each analysis. SQALE again had the highest number of steady trends, and the most TD repayments (falling trend) were recorded with MI.
Table 2: TD trends on major releases

         ∆TD↑      ∆TD↓      ∆TD−
SQALE    63.33 %   13.33 %   32.33 %
SIG TD   73.33 %   26.67 %   0 %
MI       56.67 %   43.33 %   0 %
The rise of TD is stronger at the minor release level for both SIG TD and MI, as both methods show an increase in the rising trend compared to major releases (see Table 3). SQALE, instead, showed a decrease in growing trends. As in the previous cases, SQALE recorded more periods of steady TD, and MI the most repayments of TD (to a much larger extent than SIG TD and SQALE).
Table 3: TD trends on minor releases

         ∆TD↑      ∆TD↓      ∆TD−
SQALE    42.48 %   8.50 %    49.02 %
SIG TD   85.62 %   7.19 %    7.19 %
MI       71.24 %   20.92 %   7.84 %
The last comparison was done on micro releases. The same trends were observed at this level: the vast majority of releases induced more TD according to SIG and MI, while according to SQALE the majority of releases did not change TD (see Table 4). Again, MI is the method that reports the most TD repayment (21.99%).
Table 4: TD trends on micro releases

         ∆TD↑      ∆TD↓      ∆TD−
SQALE    39.27 %   7.33 %    53.40 %
SIG TD   72.77 %   14.66 %   12.57 %
MI       60.73 %   21.99 %   17.28 %
RQ2 Findings. Considering major, minor, and micro releases, MI and SIG TD mostly show a majority of growing trends. SQALE shows the most TD steady states, while MI shows much larger TD repayment periods compared to the other methods. These patterns seem to be consistent across major, minor, and micro releases.
3.3.3 RQ3. How much can one method be used to forecast another one? A time series X can be said to Granger-cause another time series Y if the probability of correctly forecasting Y_{t+1}, with t = 1, ..., T, increases by including information about X_t in addition to the information included in Y alone. In the case of the three methods reviewed, this means that the measurements from one method (e.g., SQALE) can be used together with the measurements of another method (e.g., MI) to provide better predictions of future values of the complemented method (e.g., MI). Aggregated results of the Granger causality tests over all projects can give us an indication of how much the time series resulting from one TD identification technique can help in forecasting the time series of the other methods.
Granger causality, differently from correlation, is generally an asymmetric property: the fact that in one project's time series SQALE Granger-causes the SIG TD model does not imply that the SIG TD model Granger-causes the SQALE results. For this reason, we provide all combinations of results, with counters for how many times the results were positive according to the F-test over the 17 projects. For some of the tested libraries, the linear trend of TD in SQALE caused the tests to end with an error, as the Granger causality test does not apply in the case of non-linear relationships. In general, SQALE-MI and SQALE-SIG TD had a significant F-statistic in around one third of the projects (29.4%), while in the majority of the other cases Granger causality was negative. These results could indicate that by considering the lagged values of the SQALE time series we can get better predictions of the values in the MI and SIG TD time series.
Table 5: Aggregated results of Granger causality tests (percentage of F-test results)

              True     False    None
SQALE - MI    29.4%    58.8%    11.8%
SQALE - SIG   29.4%    58.8%    11.8%
SIG - SQALE   17.6%    70.6%    11.8%
MI - SIG      11.8%    88.2%    —
MI - SQALE    —        88.2%    11.8%
SIG - MI      11.8%    88.2%    —
RQ3 Findings. The results of the Granger causality tests show that the TD identification measurements are rather independent. Only the SQALE time series Granger-causes both MI and SIG TD, in about 1/3 of the projects, while for the other methods there is mostly no Granger causality.
3.3.4 Replicability. The analysis was implemented and run using Python version 3.6.3. A derivative of the Python Radon library was used¹. The sonar-python plugin² was used for the SQALE analysis. The statsmodels library was used for computing the Granger causality tests. Scripts can be used to run, aggregate, and plot all the results in a semi-automated way [12].

¹ https://radon.readthedocs.io/en/latest/index.html
² https://docs.sonarqube.org/display/PLUG/SonarPython
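For reference, a minimal sketch of how the MI of a single Python source file could be obtained with the Radon library; this uses Radon's own MI variant and does not reproduce the derivative implementation mentioned above, and the file name is a placeholder.

```python
from radon.metrics import mi_visit

# Read one Python source file (placeholder path) and compute its Maintainability Index.
with open("some_module.py", encoding="utf-8") as fh:
    source = fh.read()

# multi=True counts multi-line strings as comment lines in Radon's MI computation.
print(f"MI: {mi_visit(source, multi=True):.2f}")
```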
4 CONCLUSIONS
Comparing three main methods for TD identification (Maintainability Index (MI), SIG TD models, SQALE) on a set of 17 Python projects, we can see in all cases increasing trends of reported TD. However, there are different patterns in the final evolution of the measurement time series. MI and SIG TD generally report more growing trends of TD compared to SQALE, which shows more periods of steady TD. MI is the method that reports considerably more repayments of TD compared to the other methods. SIG TD and MI are the models that show more similarity in the way TD evolves, while SQALE and MI are the least comparable. Granger causality for all projects and combinations of methods shows that there is limited dependency between the TD time series. We could find some relationships between SQALE & MI, and SQALE & SIG TD models, in the sense that previous lags of the SQALE time series could be used to improve the prediction of the other models in 1/3 of the projects.
Acknowledgments. The work was supported by ERDF/ESF "CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence" (No. CZ.02.1.01/0.0/0.0/16_019/0000822).
REFERENCES
[1] Israel Gat. 2011. Technical Debt: Assessment and Reduction. (2011). https://agilealliance.org/wp-content/uploads/2016/01/Technical_Debt_Workshop_Gat.pdf Agile 2011: Technical Debt Workshop.
[2] C. W. J. Granger. 1969. Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica 37, 3 (1969), 424–438.
[3] I. Griffith, D. Reimanis, C. Izurieta, Z. Codabux, A. Deo, and B. Williams. 2014. The Correspondence Between Software Quality Models and Technical Debt Estimation Approaches. In Int. Workshop on Managing Technical Debt. 19–26.
[4] Yuepu Guo, Rodrigo Oliveira Spínola, and Carolyn Seaman. 2016. Exploring the costs of technical debt management – a case study. Empirical Software Engineering 21, 1 (2016), 159–182.
[5] Clemente Izurieta, Isaac Griffith, Derek Reimanis, and Rachael Luhr. 2013. On the uncertainty of technical debt measurements. In 2013 International Conference on Information Science and Applications (ICISA). IEEE, 1–4.
[6] Jean-Louis Letouzey. 2012. The SQALE method for evaluating technical debt. In Third International Workshop on Managing Technical Debt (MTD). IEEE, 31–36.
[7] Alois Mayr, Reinhold Plösch, and Christian Körner. 2014. A benchmarking-based model for technical debt calculation. In 2014 14th International Conference on Quality Software. IEEE, 305–314.
[8] Steve McConnell. 2008. Managing Technical Debt. Technical Report. Construx Software, Bellevue, WA. 14 pages. https://www.construx.com/developer-resources/whitepaper-managing-technical-debt/
[9] Ariadi Nugroho, Joost Visser, and Tobias Kuipers. 2011. An Empirical Model of Technical Debt and Interest. In Proceedings of the 2nd Workshop on Managing Technical Debt (MTD '11). ACM, New York, NY, USA, 1–8.
[10] Paul Oman and Jack Hagemeister. 1994. Construction and testing of polynomials predicting software maintainability. Journal of Systems and Software 24, 3 (1994), 251–266. https://doi.org/10.1016/0164-1212(94)90067-1 Oregon Workshop on Software Metrics.
[11] K. Schmid. 2013. On the limits of the technical debt metaphor: some guidance on going beyond. In 2013 4th International Workshop on Managing Technical Debt (MTD). 63–66. https://doi.org/10.1109/MTD.2013.6608681
[12] Peter Strečanský. 2019. Dealing with Software Development Technical Debt. Master Thesis. Masaryk University. https://is.muni.cz/auth/th/x0boz/master_thesis_digital.pdf
[13] Ted Theodoropoulos, Mark Hofberg, and Daniel Kern. 2011. Technical Debt from the Stakeholder Perspective. In Proceedings of the 2nd Workshop on Managing Technical Debt (MTD '11). ACM, New York, NY, USA, 43–46.
[14] Edith Tom, Aybüke Aurum, and Richard Vidgen. 2013. An exploration of technical debt. Journal of Systems and Software 86, 6 (2013), 1498–1516.
[15] Nico Zazworka, Clemente Izurieta, Sunny Wong, Yuanfang Cai, Carolyn Seaman, Forrest Shull, et al. 2014. Comparing four approaches for technical debt identification. Software Quality Journal 22, 3 (2014), 403–426.