Estimating the Size, Cost, and Types of Technical Debt
Bill Curtis Jay Sappidi Alexandra Szynkarski
CAST CAST CAST
Fort Worth, Texas, USA New York, NY, USA Paris, France
curtis@acm.org j.sappidi@castsoftware.com a.szynkarski@castsoftware.com
Abstract— This paper summarizes the results of a study of Technical Debt across 745 business applications comprising
365 million lines of code collected from 160 companies in 10
industry segments. These applications were submitted to a
static analysis that evaluates quality within and across
application layers that may be coded in different languages.
The analysis consists of evaluating the application against a
repository of over 1200 rules of good architectural and coding
practice. A formula for estimating Technical Debt with
adjustable parameters is presented. Results are presented for
Technical Debt across the entire sample as well as for different
programming languages and quality factors.
Keywords- software metrics; software structural quality;
technical debt; static analysis; benchmarking
I. INTRODUCTION
Although there are several definitions of Technical Debt,
we define it as the cost of fixing structural quality problems
in production code that the organization knows must be
eliminated to control development costs or avoid operational
problems. We believe this is the definition most relevant to industrial practice, where the Technical Debt metaphor has
provided a new means of communicating with executive
management about the costs and risks of poor structural
quality in their application portfolio.
The purpose of this study is to explore a method for
quantifying an estimate of the Technical Debt within a
business application. Such studies are needed to help IT
organizations make visible the costs and risks hidden within
their application portfolio, as well as establish a benchmark
for making decisions about investments in application
quality, and especially structural quality.
Structural quality involves the non-functional, internal
characteristics of software. It reflects the engineering
soundness of an application’s architecture and coding, rather
than the correctness with which the application implements
functional requirements. Structural quality characteristics
are critical because they are often difficult to detect through
standard testing, yet they are frequent causes of operational
problems such as outages, performance degradation,
breaches by unauthorized users, and data corruption [1].
Internal quality metrics have been shown to correlate with
criteria such as maintenance effort and defect detection [2,
3]. The first enumeration of such quality characteristics was
provided by Boehm and his colleagues at TRW in the 1970s
[4].
II. THE SAMPLE AND DATA
The data for this study are drawn from the Appmarq
benchmarking repository maintained by CAST, comprised of
745 applications submitted by 160 organizations for analysis,
and consisting of 365 million lines of code or 11.3 million
Backfired Function Points. No applications were accepted into the sample if they consisted of fewer than 10 KLOC (thousand lines of code). Sixty applications were between 10 and 20 KLOC, 240 were between 20 and 100 KLOC, 271 were between 100 and 500 KLOC, 82 were between 500 KLOC and 1 million lines of code (MLOC), and 93 were greater than 1 MLOC.
These organizations are located primarily in the United
States, Europe, and India. Since there is no rigorously
developed population description of the global trove of
business applications, it is impossible to assess the
generalizability of these results. Although these results may
not characterize the global population of IT business
applications, they do emerge from what is believed to be the
largest sample of applications ever to be statically analyzed
and measured against internal quality characteristics across
different technologies. Figure 1 presents the number of
applications by industry sector and by language/technology
type. Because of the selection process for submitting
applications to deep structural analysis, we believe this
sample is biased toward business critical applications.
These business applications were analyzed using CAST’s
Application Intelligence Platform (AIP) which performs a
static analysis of an entire application using over 1200 rules
to detect violations of good architectural and coding practice.
These rules have been drawn from an exhaustive study of
software engineering texts, online discussion groups focused
on application best practices and defects, and customer
experience drawn from defect logs and application architects.
Examples of violations in the area of security would include
SQL injection, cross-site scripting, buffer overflows, and
similar problems from the Common Weakness Enumeration
(cwe.mitre.org).
The AIP begins by parsing an entire application at build time to develop a representation of the elements from which the application is built and its data flows. This analysis is normally performed during the build in order to analyze the source code at the application level across various language and technical platforms. The AIP includes parsers for 28 languages such as Java, JavaEE, .NET, Visual Basic, JSP, PHP, C, C++, C#, ABAP, XML, JavaScript, SQL, and COBOL, as well as a universal analyzer that provides an 80% parse for languages lacking a dedicated parser.
Once the code is parsed, AIP looks for violations of its
architectural and coding rules and identifies the number of
violations versus the number of opportunities for violations
for each rule. The results are aggregated to the application
level where each violation is weighted by its severity and
summed into both a specific measure for a quality
characteristic (called Health Factors) such as Changeability
or Security, and a Total Quality Index that aggregates scores
for violations across all Health Factors. AIP provides a
series of management reports and a portal that guide
developers to locations in the source code for specific
violations that need remediation. More information about
AIP can be obtained at www.castsoftware.com.
Industry            Total     Language          Total
Energy & Utility      40      C                    14
Financial            150      C++                   9
Insurance             70      .NET                 51
IT Consulting        109      J2EE                339
Manufacturing         94      Visual Basic         14
Other                 30      ABAP                 59
Government            78      Oracle Forms         39
Retail                32      Oracle ERP           12
Technology            21      COBOL                80
Telecom              121      Mixed & Other       128
Total                745      Total               745
Figure 1. Distribution of applications in the Appmarq sample by
industry segment and language/technology.
The application health factors in AIP were selected after
reviewing ISO/IEC 9126 [5]. However, since the quality
characteristics in this standard have not been defined down
to a level that can be computed from the source code, some
Health Factor names differ from those in ISO/IEC 9126, based on the content analyzed and the meaningfulness of the names to users of the technology. The five Health Factors reported in this study are as follows:
• Robustness—the stability of an application and the
ease of recovery from failures.
• Performance Efficiency—the responsiveness of the
application.
• Security—an application’s ability to prevent
unwanted intrusions.
• Transferability—the ease with which a new team
can understand the application and quickly become
productive working on it.
• Changeability—an application's ability to be quickly and easily modified.
In order to bring more standardization to computable measures of internal quality, the Consortium for IT Software Quality (CISQ) [6], sponsored by the Software Engineering Institute at Carnegie Mellon University and the Object Management Group, is developing standard definitions for automatable software quality metrics. CISQ intends to make these
metrics as consistent as possible with the emerging ISO
25000 [7] series standards which will replace ISO 9126.
The number of rules evaluated for each Health Factor
ranged between 176 and 506. Scores for each of these
internal quality characteristics are aggregated from the
component to the application level and reported on a scale of
1 (high risk) to 4 (low risk), using an algorithm that weights
the severity of each violation and its relevance to each
individual Health Factor.
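To illustrate the general shape of such an aggregation, the following Python sketch computes a single Health Factor score on the 1 (high risk) to 4 (low risk) scale from rule-level violation ratios weighted by severity and relevance. The rule records, weights, and linear mapping onto the 1-4 scale are purely hypothetical and do not represent CAST's proprietary scoring algorithm.

# Minimal sketch (not CAST's actual algorithm): aggregate rule-level
# violation ratios into a Health Factor score on a 1 (high risk) to
# 4 (low risk) scale, weighting each rule by severity and by its
# relevance to the Health Factor. All rules and weights are hypothetical.

def health_factor_score(rule_results):
    """rule_results: list of dicts with violation and opportunity counts,
    a severity weight, and a relevance weight for this Health Factor."""
    weighted_risk = 0.0
    total_weight = 0.0
    for r in rule_results:
        if r["opportunities"] == 0:
            continue  # rule does not apply to this application
        violation_ratio = r["violations"] / r["opportunities"]  # 0..1
        weight = r["severity"] * r["relevance"]
        weighted_risk += weight * violation_ratio
        total_weight += weight
    risk = weighted_risk / total_weight if total_weight else 0.0
    return 4.0 - 3.0 * risk  # map risk 0..1 onto the 4 (low) .. 1 (high) scale

# Example with hypothetical rule data
rules = [
    {"violations": 12, "opportunities": 400, "severity": 3, "relevance": 1.0},
    {"violations": 3,  "opportunities": 150, "severity": 9, "relevance": 0.5},
]
print(round(health_factor_score(rules), 2))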
III. THE TECHNICAL DEBT METAPHOR
Ward Cunningham initiated the Technical Debt metaphor
in 1992 by referring to violations of good architectural and
coding practice as ‘debt’. According to Cunningham,
“Shipping first time code is like going into debt. A little debt
speeds development so long as it is paid back promptly with
a rewrite…The danger occurs when the debt is not repaid.
Every minute spent on not-quite-right code counts as interest
on that debt. Entire engineering organizations can be brought
to a stand-still under the debt load of an unconsolidated
implementation.”
However, the fundamental problem underlying Technical
Debt was formulated in the 1970s by Meir Lehman [8] who
posited in one of his laws of software evolution that as a
“system evolves its complexity increases unless work is done
to maintain or reduce it.” This complexity degrades the
performance of business applications, increases their
likelihood of failure, and multiplies the cost of owning them.
Technical Debt is created when developers write
software that violates good architectural or coding practices,
creating structural flaws in the code. Although Cunningham
was only referring to structural problems that result from
conscious design tradeoffs or coding shortcuts to get
functionality running quickly, we embrace a broader
approach to include all structural problems that an IT
organization prioritizes as ‘must fix’. According to industry
luminary Steve McConnell [9], sometimes Technical Debt is
an unintentional consequence of inexperience or incorrect
assumptions, while in other cases it is intentional, as in
Cunningham’s definition, in order to get new functionality
running quickly. In either case, the development team
knows, or ultimately learns, that it has released software with
structural flaws that must be fixed or the cost and risk of the
application will grow unacceptably.
Technical Debt must be distinguished from defects or
failures. Failures during test or operation may be symptoms
of Technical Debt, but most of the structural flaws creating
Technical Debt have not caused test or operational failures.
Some may never cause test or operational failures but instead
make an application less efficient, less scalable, more
difficult to enhance, or more penetrable by hackers. In
essence, Technical Debt emerges from poor structural
quality and affects a business both as IT cost and business
risk.
Choosing ‘debt’ as a metaphor engages a set of financial
concepts that help executives think about software quality in
business terms. In this section we will define the concepts
required to apply the full Technical Debt metaphor so that
each factor can be measured and used in analyzing the
structural quality of applications financially.
Technical Debt—the future costs attributable to known
structural flaws in production code that need to be fixed, a
cost that includes both principal and interest. A structural
flaw in production code is only included in Technical Debt
calculations if those responsible for the application believe it
is a ‘must-fix’ problem. Technical Debt is a primary
component of the cost of application ownership.
Principal—the cost of remediating must-fix problems in
production code. At a minimum the principal is calculated
from the number of hours required to remediate must-fix
problems in production code, multiplied by the fully
burdened hourly cost of those involved in designing,
implementing, and testing these fixes.
Interest—the continuing costs primarily in IT attributable
to must-fix problems in production code. These continuing
costs can result from the excessive effort to modify
unnecessarily complex code, greater resource usage by
inefficient code, and similar costs.
Business risk—the potential costs to the business if must-
fix problems in production code cause damaging operational
events or other problems that reduce the value to be derived
from the application.
Liability—the costs to the business resulting from
operational problems caused by flaws in production code.
Such operational problems would include outages, incorrect
computations, lost productivity from performance
degradation, and security breaches. From a risk perspective,
flaws in the code include both must-fix problems included in
the calculation of Technical Debt as well as problems not
listed as must-fix because their risk was underestimated.
Risk—the potential liability to the business if a must-fix
problem in production code was to cause a liability-inducing
event. Risk will be expressed in terms of potential liability
to the business rather than the IT costs which are accounted
for under ‘interest’.
Opportunity cost—benefits that could have been
achieved had resources been committed to developing new
capability rather than being assigned to retire Technical
Debt. Opportunity cost represents the tradeoff that
application managers and executives must weigh when
deciding how much effort to devote to retiring Technical
Debt.
Structural quality problems give rise to Technical Debt,
which contains both principal and interest on the debt. The
cost to fix these structural problems constitutes the principal
of this debt. Structurally flawed code creates inefficiencies
such as greater maintenance effort or excessive computing
resources whose costs represent interest on the debt.
The structural problems underlying Technical Debt also create business risks. When these risks translate into negative operational events such as outages and security breaches, they create a liability such as lost revenue from Website sales or costly clean-up from a security breach.
Remediating Technical Debt requires schedule and effort
that could have been devoted to creating new business
functionality. Effort committed to retiring Technical Debt
represents an opportunity cost related to lost benefits that
might otherwise have been achieved by the business.
IV. ESTIMATING PRINCIPAL IN TECHNICAL DEBT
There is no exact measure of Technical Debt, since its
calculation must be based only on the structural flaws that
the organization intends to fix, some of which may not have
been detected yet. However, modern software analysis and
measurement technology allows us to estimate the amount of
principal in the Technical Debt of an application based on
actual counts of detectable structural problems. By
analyzing the structural quality of an application, rating the
severity of each problem, and prioritizing the must-fix
problems, IT organizations can now estimate the amount of
Principal in the Technical Debt (hereafter called TD-
Principal) from empirical evidence.
Within this context, TD-Principal is a function of three
variables—the number of must-fix problems in an
application, the time required to fix each problem, and the
cost for fixing a problem. Each of these variables can be
measured or estimated and entered into a formula for
estimating TD-Principal. This formula produces results that
do not include interest on the debt, liability, or any of the
other costs associated with Technical Debt other than the
principal.
The number of must-fix structural problems in an
application can be measured through the static analysis of an
application’s source code. However, with limited
application budgets, IT organizations will never fix all the
problems in an application. Therefore each of the structural
problems detected through static analysis must be weighted
by its potential severity. If severity scores are grouped into
categories—for instance, low, medium, and high—then IT
management can determine what percentage of problems in
each category are must-fix.
The time to fix a structural quality problem includes the
time to analyze the problem, the time to understand the code
and determine a correction, the time to evaluate potential
side-effects, the time to implement and test the correction,
and the time to release the correction into operations.
The third variable, the cost of fixing a problem, can be set to the average burdened hourly rate for the developers assigned to fix structural problems. Although
burdened hourly rates may vary by experience and location,
we have found that a rate of between $70 and $80 per hour
reflects the average costs for many IT organizations. If an
organization’s labor rates vary widely, this variable can also
be measured as a frequency distribution of costs.
Although the data presented here were calculated using
the TD-Principal formula as parameterized above, different
assumptions about the parameters might be more appropriate
for the specific conditions within different organizations.
We encourage organizations to adjust the parameters in this
formula to best fit their objectives, experiences, and costs.
V. INITIAL RESULTS IN MEASURING TD-PRINCIPAL
In an initial exploration of measuring TD-Principal, we
assumed that an IT organization would fix 50% of the high
severity problems, 25% of the medium severity problems,
and no more than 10% of the low severity problems. To
keep the estimate of TD-Principal conservative we assumed
that defects would be fixed in 1 hour. However, this figure likely describes only the repair of simple violations in single components. We set the labor rate to an average of
$75 per hour.
However, the parameters in this formula can be easily
adjusted to better reflect the experience and objectives of a
specific organization. For instance, an organization can set
the parameters for the percentage of problems it will fix in
each severity category according to its own maintenance and
structural quality objectives. In future work we anticipate changing the parameters to 0% of low severity violations, 50% of medium severity violations, and 100% of high severity violations, since field discussions with IT organizations suggest that they are primarily interested in fixing high priority defects and some medium priority defects.
In this initial exploration we made the very conservative assumption that all problems would be fixed within one hour; this value was chosen deliberately to keep our initial TD-Principal results low. However, preliminary data from operational
environments show wide variation in correction times based
on the complexities of the structural problems involved.
Based on distributions from the limited data available from
operational environments, we anticipate using a
parameterized Weibull distribution to represent fix times in
future calculations.
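As an illustration of this anticipated refinement, the following Python sketch replaces the fixed 1-hour fix time with the mean of an assumed Weibull distribution of correction times. The shape and scale parameters and the violation count are purely hypothetical and are not fitted to field data.

# Hypothetical sketch: use the mean of an assumed Weibull distribution
# of correction times in place of the fixed 1-hour fix time.
import math

shape, scale = 1.5, 2.0          # assumed Weibull parameters (hours)
mean_fix_hours = scale * math.gamma(1.0 + 1.0 / shape)

labor_rate = 75.0                # dollars per hour
must_fix_violations = 1200       # hypothetical count of must-fix problems
td_principal = must_fix_violations * mean_fix_hours * labor_rate
print(f"mean fix time = {mean_fix_hours:.2f} hrs, "
      f"TD-Principal = ${td_principal:,.0f}")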
The initial formula and parameterization we used for
calculating TD-Principal in this paper are as follows.
TD-Principal =
  ((Σ high severity violations) × 0.50 × 1 hr × $75/hr) +
  ((Σ medium severity violations) × 0.25 × 1 hr × $75/hr) +
  ((Σ low severity violations) × 0.10 × 1 hr × $75/hr)
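A direct transcription of this formula into code makes the adjustable parameters explicit. The Python sketch below is a minimal illustration: the default percentages, fix time, and hourly rate mirror the values above, while the violation counts in the usage example are invented.

# Minimal sketch of the TD-Principal formula with adjustable parameters.
# Defaults mirror the parameterization used in this paper: 50% of high,
# 25% of medium, and 10% of low severity violations are treated as
# must-fix, each fixed in 1 hour at a burdened rate of $75/hour.

def td_principal(high, medium, low,
                 pct=(0.50, 0.25, 0.10),
                 hours_per_fix=1.0,
                 hourly_rate=75.0):
    counts = (high, medium, low)
    return sum(c * p * hours_per_fix * hourly_rate
               for c, p in zip(counts, pct))

# Usage with invented violation counts for a single application
print(td_principal(high=800, medium=2500, low=6000))   # parameters used in this paper
print(td_principal(high=800, medium=2500, low=6000,
                   pct=(1.00, 0.50, 0.00)))            # anticipated future parameters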
To develop an initial estimate of the average TD-Principal
across the Appmarq sample, we first calculated TD-Principal
individually for each of the 745 applications using the
formula presented above. These individual application
scores were then averaged across the Appmarq sample to
produce an average TD-Principal of $3.61 per line of code.
Based on this formulation, a typical application accrues
$361,000 of TD-Principal for each 100,000 lines of code,
and applications of 300,000 or more lines carry more than $1
million of TD-Principal ($1,083,000). This is an estimate of
the cost to repair only the must-fix problems and is
conservative based on the initial parameter values chosen.
Had we used the parameters we anticipate using in the
future, the TD-Principal per line of code would have been
closer to $10 which is closer to estimates provided by some
analysts.
Although IT organizations could estimate their total TD-
Principal by multiplying an estimate of the size of their
application code base by $3.61, it would be more accurate to
analyze it by technology and language type. Significant
differences were found in TD-Principal estimates between
languages, with the lowest figure being reported for ABAP
($0.43) and the highest for Java-EE ($5.42). C++ ($4.33)
and Oracle Forms also had above average TD-Principal
estimates. Since these figures are based on very conservative
parameters, the actual TD-Principal in most applications is
likely to be significantly higher. Thus, these estimates
should be treated as lower bounds.
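As a usage illustration, an organization could combine these per-line figures with the size of its own code base by language. In the sketch below the portfolio sizes are invented, and the sample-wide average of $3.61 per line is applied to code in languages for which no figure is reported above.

# Hypothetical usage: estimate portfolio TD-Principal by language using
# the per-line figures reported above (ABAP $0.43, Java EE $5.42,
# C++ $4.33) and the sample-wide average of $3.61 for other code.
# The portfolio sizes (in lines of code) are invented.

td_per_loc = {"ABAP": 0.43, "JavaEE": 5.42, "C++": 4.33, "other": 3.61}
portfolio_loc = {"ABAP": 400_000, "JavaEE": 1_200_000,
                 "C++": 250_000, "other": 600_000}

total = sum(portfolio_loc[lang] * td_per_loc[lang] for lang in portfolio_loc)
print(f"Estimated portfolio TD-Principal: ${total:,.0f}")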
The greatest variability in TD-Principal results occurred
for C++ (s.d.=$7.02) and Oracle Forms (s.d.=$6.70). These
results demonstrate that even for applications developed
using the same language and technology, TD-Principal
results can vary widely. Consequently, in order to be used
effectively for management decisions, TD-Principal should
be measured and analyzed individually for each application,
or at a minimum category of applications, rather than using
an average value across all applications regardless of the
language or technology platform on which the application
was developed.
These figures could change if the mix of application
characteristics in each technology/language category changes
in the Appmarq repository as the sample of applications
grows. Consequently we urge caution in interpreting these
figures as industry benchmarks, especially since they are
based on very conservative assumptions. Nevertheless they
provide a starting point for estimating TD-Principal, and one
that can be adjusted based on different assumptions about the
parameters used in the TD-Principal formula presented in Section V.
VI. COMPONENTS OF TECHNICAL DEBT
Although TD-Principal can be measured as violations of
good structural quality, these violations consist of different
types of threats to the business or costs to IT. In order to use
TD-Principal effectively in making decisions about how
much resource to allocate to eliminating these violations,
management needs to distinguish among its quality priorities
and then prioritize the importance of eliminating TD-
Principal in each area. Our data allow us to measure the TD-
Principal associated with each of the five Health Factors
since they represent different types of costs to IT or risks to
the business.
The amount of TD-Principal in an application associated
with each of these Health Factors differs. Seventy percent of
the TD-Principal measured in this sample was contained in
the IT cost-related Health Factors of Changeability (30%)
and Transferability (40%). Thirty percent of the TD-
Principal was associated with the business risk Health
Factors of Robustness (18%), Security (7%), and
Performance Efficiency (5%). We cannot determine from
the data whether IT organizations are spending more time
eliminating TD-Principal related to business risk or whether
TD-Principal is disproportionately created in IT cost-related
factors. Nevertheless, a single high severity violation related
to business risk can be devastating if it eventually causes an
operational problem.
Although the comparative percentages of TD-Principal
remain generally consistent among Health Factors across
language/technology categories, some variation is apparent.
In particular the TD-Principal scores for Robustness appear
much higher for ABAP (42%), Oracle Forms (32%), and
Visual Basic (23%).
These results indicate that the analysis and measurement
of TD-Principal can guide critical management decisions
about how to allocate resources for reducing business risk
and IT cost. Trying to make decisions about retiring TD-
Principal at a global level is overwhelming and it is difficult
to visualize what the expected payoff will be. However,
when TD-Principal can be analyzed into its constituent parts,
management can set specific reduction targets based on
strategic quality priorities with an expectation of the benefit
to be achieved.
For instance, removing the highest severity violations
affecting Robustness reduces the risk of catastrophic
operational crashes, thus improving IT’s ability to achieve
availability targets. As IT collects more data, management
will be able to develop a quantitative understanding of how
much TD-Principal related to Robustness it can sustain in an
application without risking its availability goals. Such
reasoning can be applied to decisions regarding the amounts
to invest in reducing the TD-Principal associated with each
Health Factor. When TD-Principal is measured and
estimated, it will become a standard referent for managing
applications and portfolios. Further exploration of these and
related results can be found in the CRASH Report [10].
REFERENCES
[1] Spinellis, D. (2006). Code Quality: The Open Source Perspective.
Boston: Addison-Wesley.
[2] Curtis, B., Sheppard, S.B., Milliman, P., Borst, A., & Love, T.
(1979a). Measuring the psychological complexity of software
maintenance tasks with the Halstead and McCabe metrics. IEEE
Transactions on Software Engineering, 5 (2), 96-104.
[3] Curtis, B., Sheppard, S.B., and Milliman, P. (1979b). Third time
charm: Stronger prediction of programmer performance by software
complexity metrics. Proceedings of the 4th International Conference
on Software Engineering. Washington, DC: IEEE Computer Society,
356-360.
[4] Boehm, B.W., Brown, J.R., & Lipow, M. (1976). Quantitative
evaluation of software quality. Proceedings of the 2nd International
Conference on Software Engineering. Los Alamitos, CA: IEEE
Computer Society Press, 592-605.
[5] ISO/IEC JTC 1, SC 7 (2001). ISO 9126. Geneva: ISO.
[6] Consortium for IT Software Quality (2010). www.it-cisq.org.
[7] ISO/IEC JTC1/SC7 (2010). ISO 25000. Montreal: École de
technologie supérieure – Department of Software and IT Engineering,
1100 Notre Dame Ouest, Montréal, Québec Canada H3C 1K3.
[8] Lehman, M. M. (1980). Programs, life cycles, and laws of software
evolution. Proceedings of the IEEE, 68 (9), 1060–1076.
[9] McConnell, S. (2007). Technical Debt.
http://blogs.construx.com/blogs/stevemcc/archive/2007/11/01/technic
al-debt-2.aspx.
[10] Sappidi, J., Curtis, B., & Szynkarski, A. (2010). CRASH Report:
CAST Report on Application Software Health—2011/2012. New
York: CAST Software.