Challenges in Assessing Technical Debt based on
Dynamic Runtime Data
Marcus Ciolkowski∗, Liliana Guzmán†, Adam Trendowicz†, Anna Maria Vollmer†
∗QAware GmbH
Aschauer Str. 32, 81549 München, Germany
Email: marcus.ciolkowski@qaware.de
†Fraunhofer IESE
Fraunhofer Platz 1, 67663, Kaiserslautern, Germany
Email: {liliana.guzman, adam.trendowicz, anna-maria.vollmer}@iese.fraunhofer.de
Abstract—Existing definitions and metrics of technical debt
(TD) tend to focus on static properties of software artifacts, in
particular on code measurement. Our experience from software
renovation projects is that dynamic aspects — runtime indicators
of TD — often play a major role. In this position paper,
we present insights and solution ideas gained from numerous
software renovation projects at QAware and from a series of
interviews held as part of the ProDebt research project. We interviewed ten practitioners from two German software companies in
order to understand current requirements and potential solutions
to current problems regarding TD. Based on the interview results,
we motivate the need for measuring dynamic indicators of TD
from the practitioners’ perspective, including current practical
challenges. We found that the main challenges include a lack
of production-ready measurement tools for runtime indicators,
the definition of proper metrics and their thresholds, as well
as the interpretation of these metrics in order to understand
the actual debts and derive countermeasures. Measuring and
interpreting dynamic indicators of TD is especially difficult to
implement for companies because the related metrics are highly
dependent on runtime context and thus difficult to generalize.
We also sketch initial solution ideas by presenting examples of
dynamic indicators for TD and outline directions for future work.
Index Terms—Technical debt, measurement, dynamic data,
runtime quality
I. INTRODUCTION
An important principle in software engineering is to avoid “broken windows” of software quality in any type of software project. That is, it is important to find and fix
quality deficits (e.g., bad design, wrong architectural decisions,
or poor code) as soon as they are detected to avoid signs of
neglect — similar to a broken window in a building that invites
vandalism [1]. This principle has become particularly prominent with the advent of agile software development [2]. Agile
software development facilitates flexible, rapid, and continuous
development by providing practitioners with iterative methods
that rely on extensive collaboration. To succeed in this context,
organizations need tools for continuously analyzing software
quality along short release cycles [3]. Otherwise, developers tend to focus on functional requirements, neglecting quality
requirements [4]. Thus, continuous measurement of quality
deficits is crucial for ensuring software quality.
We define the following terms (see Ampatzoglou et al. [5, p. 63]): Principal denotes the effort required to address the difference between the current and the target level of design-time quality. Interest denotes the additional cost (e.g., effort) needed for maintaining and operating the software due to its decayed design-time quality. Technical debt itself comprises the principal plus the interest to be paid within a certain time interval (e.g., the next year).
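As a minimal illustration with hypothetical numbers (assumed for this example, not taken from the interviews): if remediating a design deficit requires a principal of P = 10 person-days, and the deficit causes interest of I = 2 person-days of additional maintenance and operation effort per year, then the debt to be paid within the next year amounts to

TD(1 year) = P + I · t = 10 + 2 · 1 = 12 person-days.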
The experience at QAware, a software development company with extensive experience in analyzing and renovating software systems, has shown that TD spreads throughout systems and affects code as well as non-code artifacts. Moreover, many forms of TD are difficult or even impossible to
spot statically and can only be observed dynamically; that is,
at system runtime.
In recent years, the research on TD has focused mainly on
static analysis of source code and test artifacts. For example,
Alves et al. [6, pp. 107–108] report that only two of 100 papers mention dynamic properties. In both cases, it is unclear whether
this means an analysis at runtime or static code analysis. A
notable exception is TD related to testing, which is addressed
in 24 of 100 papers. However, all of these look at static
indicators of testing: Although they require running the system
(or, at least, executing the test cases), they ignore runtime data
such as test execution time. Thus, there is a research gap with respect to runtime and dynamic indicators of TD.
This position paper is based on experience gained from numerous software renovation projects at
QAware, and on interview results from the ProDebt research
project on strategic planning of TD.
II. ON THE IMPORTANCE OF DYNAMIC ASPECTS OF TECHNICAL DEBT
The background of this position paper is that of QAware, an
independent software development and consultancy enterprise
that analyzes, renovates, invents, and implements software
systems for customers whose success heavily depends on IT.
One important field of business is system troubleshooting and
software renovation [7], [8]. System troubleshooting makes intensive use of runtime analysis of distributed systems: QAware
experts install measurement infrastructure that measures fine-
grained data on the system and application level, and collects
and stores this data in one repository for analysis purposes.
From QAware’s experience, runtime aspects of system behav-
ior are critical in practice, contribute heavily to TD interest
and software bankruptcy, and are often difficult to detect.
Existing measurement support for TD focuses on static
aspects, such as component coupling or other code-based
metrics [9]. This is important, for instance, for managing debt
associated with software maintenance cost. Because many tools are available for measuring static aspects [10], static TD can be monitored and controlled to a certain degree: e.g., if a project follows a zero-violations policy (no violation may be carried over across sprints or releases), there are no violations against the measured (static) TD metrics.
Dynamic aspects of TD, however, are often overlooked in existing tool support and research [9], although they often contribute significantly to the debt’s interest. Thus, despite their importance, measurement and interpretation of dynamic aspects lack ready-to-use tool support: TD can create runtime symptoms such as excessive (i.e., larger than necessary) consumption of resources such as time, memory, or computing
power. Later on, during tests, dynamic symptoms of TD can
hinder automatic testing or increase the effort and complexity
of manual testing. Finally, during operation, TD may surface
as a decrease in performance, stability, or scalability, which in
turn affects operation costs and customer satisfaction.
Detecting runtime symptoms of TD is difficult, but this is
also where the tricky cases in maintenance are often hidden
(e.g., leaks or locks) [7]. Although expert tools are available to support runtime analysis (e.g., profilers, heap walkers, stack-tracing tools), these are typically difficult to use and not
suitable for continuous monitoring, in particular for distributed
systems. In addition, as the required metrics include system-
level as well as application-specific metrics, they stem from
many tools and have to be collected and stored in a common
repository in order to make them accessible for analysis
and interpretation. Moreover, they have to be measured and
collected from many components in distributed systems, often
over networks that are only partially secured.
However, measuring dynamic indicators of TD is important,
because some forms of TD are difficult or impossible to spot
statically and only surface at runtime. For example, problems
with the runtime architecture (i.e., actual call relationships
at runtime) are often difficult to spot in software code and
their detection is often obfuscated by the intensive use of inheritance, code injection, reflection, or soft links in modern information systems. Thus, they can only be measured
reliably at runtime; yet corresponding measurement typically
has to be built from scratch [11].
III. IDEAS FOR DEFINING & MEASURING TECHNICAL DEBT
A. Context
The research project ProDebt aimed at developing an innovative tool-supported method for the strategic planning of TD. At the beginning of the ProDebt project (http://www.prodebt.de/), we aimed at understanding the meaning, management, and problems
related to TD from the perspective of different stakeholders in
German software companies involved in agile software devel-
opment. Therefore, we designed and performed ten individual
semi-structured interviews with product owners and developers
in the software companies involved in the ProDebt project,
namely QAware and Insiders Technologies.
Each interview lasted up to 45 minutes and was performed
by two researchers. The collected data was analyzed using
thematic analysis. For a detailed description of QAware and
Insiders Technologies and the related use cases, refer to [3].
B. Indicators of Dynamic Aspects
Besides traditional code-related static indicators of TD,
the interviews identified a number of non-code and dynamic
indicators of TD. Table I summarizes the most prominent aspects of runtime behavior and the associated measurement ideas from the perspective of product owners and developers.
C. Measurement and Challenges
The interviewees also discussed potential challenges of
measuring dynamic aspects of TD. The main measurement
challenges include defining and interpreting appropriate metrics, as such metrics are highly dependent on the context in which TD and the associated dynamic software aspects are considered. From the implementation viewpoint, one of the main obstacles perceived was the lack of tools for automatic measurement of software behavior at runtime. The
interviewees noted several opportunities for collecting runtime data: On the one hand, there are several software lifecycle stages in which software behavior can be observed. These stages include various levels of testing (e.g., unit, integration, acceptance), which are typically at least partially automated, as well as the software in production. On the other hand, a number of
tools exist that may be utilized for collecting basic data on software behavior. Examples include tools that come with the operating system (e.g., Unix and Windows); within-application metric frameworks (e.g., JMX/Jolokia for Java) that offer many metrics out of the box, such as garbage collection statistics or thread counters; or external monitoring
tools (e.g., Nagios). Other ways of collecting runtime data
include embedding measurement mechanisms (e.g., writing
custom code that uses JMX/Jolokia) or analyzing log files
produced during runtime (e.g., using the ELK/Elastic stack).
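As a minimal sketch of the embedded-measurement option mentioned above, the following Java snippet reads two of the built-in JMX metrics (garbage collection statistics and heap usage) via the standard java.lang.management API; in practice, such values would be exported, e.g., via Jolokia, rather than printed:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Minimal sketch: reading built-in JMX metrics from inside a Java application.
public class RuntimeMetricsProbe {
    public static void main(String[] args) {
        // Garbage collection statistics exposed by the JVM out of the box.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("GC %s: collections=%d, totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        // Current heap usage; trends in these values over time feed TD indicators.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("Heap: used=%d of max=%d bytes%n", heap.getUsed(), heap.getMax());
    }
}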
From the perspective of the interviewees, one critical aspect
of measuring software at runtime is potential intrusion into the
system under measurement and the related impact on, e.g.,
performance. Non-invasive measurement is particularly important for systems in operation. In this context, highly
invasive measurement approaches, such as profiling tools or measurement through bytecode injection, are typically not
feasible for production systems but may be applicable in
testing scenarios [11].
Measurement data creates the basis for managing TD, but
to support decision making the data must be interpreted
properly. Interpreting runtime measurement data and deriving
TABLE I
EXCERPT OF MEASUREMENT IDEAS FOR DYNAMIC INDICATORS OF TECHNICAL DEBT

Indicator: Service usage
Meaning: Usage profile of the services provided by the software system. On the one hand, service usage is an indicator of the runtime load of a software system. On the other hand, it specifies the context for measuring other performance indicators in a comparable way: i.e., performance indicators are measured and compared under the same service usage profile.
Metric: Count of service calls per individual service.

Indicator: Appropriateness of test profile
Meaning: Correspondence between the test profile and the real usage profile of the software system. This factor is not a direct indicator of performance efficiency but determines the reliability of the performance indicators assessed during tests. The less a test profile corresponds to the real usage of the software application, the less reliable are the performance indicators assessed during such tests.
Metric: Difference between the test profile (e.g., using the Limbo tool) and the real usage profile (measured by analyzing the log files). Similarity can be defined through a common statistical distance metric between probability distributions, or through a distance metric between two points in an n-dimensional space, where the test profile and the usage profile represent the two points and each of the n services represents one dimension.

Indicator: Service call duration
Meaning: Duration of a service/method call. With automation, the duration of use cases can be measured. Trends as well as outliers are valuable.
Metric: Service or method call duration: log files (ELK), JMX/Jolokia for Java. Testing environments: test execution time, supported by most testing frameworks, or measured using bytecode injection.

Indicator: Intensity of DB calls
Meaning: Efficiency of DB operations (e.g., per service call or per unit/integration/performance test). For example, DB operations may be realized in a way that leads to excessive runtime when the system runs under high load (e.g., multiple small DB accesses in a short time instead of one large access).
Metric: Count of outgoing, opened DB connections initiated by the system at runtime, and their status.

Indicator: Processor load
Meaning: Utilization of processor capacity (per component, service, or system) over time. Processor capacity should be utilized as expected in a given scenario; in practice, this can mean minimal or maximal utilization, depending on the specific context.
Metric: Production: runtime data collection via operating system tools (e.g., ps under Unix) or metric frameworks such as JMX for Java.

Indicator: Memory load
Meaning: Utilization of memory (per component, service, or system). Available memory should be utilized as expected in a given scenario; e.g., usage should not increase continuously and/or should not exceed a specific threshold.
Metric: Production: runtime data collection via operating system tools (e.g., ps under Unix) or metric frameworks such as JMX for Java. Testing: supported by most testing frameworks; results need to be stored afterwards (e.g., with SonarQube’s Surefire plugin).
judgments regarding the actual debt is the second major
challenge of data-driven TD management. Unlike for code-based static software measurements, defining absolute thresholds is typically not possible for runtime behavior, because such thresholds are highly dependent on the runtime context and thus difficult to generalize. Although for some indicators
a meaningful absolute threshold can be defined in specific
contexts as an early indicator of potential debt, the best way
of assessing runtime behavior remains to observe trends over
time, from the short-term (e.g., sprints) as well as long-term
(e.g., releases) perspective. Typical interpretation patterns of
runtime data are thus:
Threshold: An issue is raised when values of interest
approach, reach, exceed or fall below a defined limit. In such
cases, not only an instant value but also the frequency or
duration with which a certain limit is reached should raise
an issue. The following types of limit values are particularly
interesting in practice: hard limits (e.g., 100% CPU consumption), soft limits (e.g., >90% CPU consumption is critical),
and empirical limits (significant outliers according to baseline
data and a given significance level).
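The following Java sketch illustrates the threshold pattern; the limit values and the three-sigma rule for the empirical limit are assumptions chosen for illustration, not recommendations:

// Illustrative sketch of the threshold pattern; limit values are assumptions.
public class ThresholdCheck {
    static final double SOFT_CPU_LIMIT = 0.90; // soft limit: critical above 90%
    static final double HARD_CPU_LIMIT = 1.00; // hard limit: full saturation

    // Returns a diagnosis for one CPU-load sample, given baseline statistics.
    static String evaluate(double cpuLoad, double baselineMean, double baselineStdDev) {
        if (cpuLoad >= HARD_CPU_LIMIT) return "hard limit reached";
        if (cpuLoad > SOFT_CPU_LIMIT) return "soft limit exceeded";
        // Empirical limit: a significant outlier relative to baseline data (here
        // approximated as more than three standard deviations above the mean).
        if (cpuLoad > baselineMean + 3 * baselineStdDev) return "empirical outlier";
        return "ok";
    }
}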
Trend: An issue is raised when a TD indicator (e.g., consumption of resources) shows a continuous trend or even a tendency to go beyond a limit. Although an absolute limit
is difficult to define, TD can be reliably detected based on
patterns and trends.
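As a sketch of how the trend pattern can be operationalized, a least-squares slope can be fitted over a sliding window of samples; a persistently positive slope for a resource metric then raises an issue (window size and sampling interval are left to the concrete context):

// Sketch of the trend pattern: least-squares slope over a window of samples.
public class TrendCheck {
    // Returns the slope of a linear fit over equally spaced samples;
    // a persistently positive slope for a resource metric raises an issue.
    static double slope(double[] samples) {
        int n = samples.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += i;
            sumY += samples[i];
            sumXY += i * samples[i];
            sumXX += (double) i * i;
        }
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }
}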
Correlation: An issue is raised when current values correlate with the values of an already known anomaly that was related to TD in the past. A typical example of such a pattern is CPU workload.
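A minimal sketch of this pattern, assuming a recorded signature of a past anomaly with the same length and sampling as the current measurement window:

// Sketch of the correlation pattern: Pearson correlation between the current
// metric window and the recorded signature of a known, TD-related anomaly.
public class CorrelationCheck {
    static double pearson(double[] current, double[] knownAnomaly) {
        int n = current.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += current[i]; meanY += knownAnomaly[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            cov += (current[i] - meanX) * (knownAnomaly[i] - meanY);
            varX += (current[i] - meanX) * (current[i] - meanX);
            varY += (knownAnomaly[i] - meanY) * (knownAnomaly[i] - meanY);
        }
        return cov / Math.sqrt(varX * varY); // values close to 1.0 raise an issue
    }
}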
D. Example Dynamic Indicators of Debt
This section discusses selected indicators of TD (Table I)
and how they can be measured and interpreted. The following
examples stem from QAware’s experience in troubleshooting
projects in many companies and domains. The setting in a
troubleshooting project is typically that a (distributed) system is unstable or suffers from performance problems, causing high
cost. QAware experts analyze the system to provide quick
fixes. Typically, troubleshooting projects end with lists of
short-term fixes and mid- to long-term TDs to be removed.
1) Connections Quantity: Connections refer to runtime
input-output operations. This represents a typical situation for
saturation patterns: While it is difficult to define an absolute
threshold, a growing trend is always problematic.
For example, in one troubleshooting project there was a
distributed system that had been rolled out world-wide but
was too slow to be used, causing high follow-up costs. An
analysis of file handles showed a steady increase in the number of handles used by the operating system, which references each of its objects (e.g., threads, mutex objects, or files) via a handle. In this case, a missing call to Release() was responsible for COM objects never being released, which could only be detected with a debugger.
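As an illustration of how such handle counts can be watched continuously rather than with a debugger, the following JVM-based sketch uses the JDK-specific com.sun.management API (Unix only); the class name and the one-minute interval are our assumptions, and the COM case above would require the corresponding Windows counters instead:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

// Sketch: periodically sample the open file descriptor count of the running JVM
// (JDK-specific API, Unix only). A steadily growing count indicates a handle leak.
public class HandleTrendProbe {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof UnixOperatingSystemMXBean)) {
            return; // API not available on this platform
        }
        UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
        while (true) {
            System.out.printf("open file descriptors: %d of %d%n",
                    unixOs.getOpenFileDescriptorCount(),
                    unixOs.getMaxFileDescriptorCount());
            Thread.sleep(60_000); // sampling interval chosen arbitrarily
        }
    }
}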
2) Intensity of (DB) calls: A situation occurring often
in troubleshooting projects is that one incoming connection
causes many outgoing connections; for example, one incoming web request resulting in many database calls. An
indicator for TD is therefore the relation between incoming
and outgoing calls, and the trend of system-internal calls vs.
external calls. In one troubleshooting project, we found that an
incoming web request caused up to 2,400 JavaScript calls, 30 service requests, and 1,600 database calls. Root causes are most often programming errors: for example, inefficient algorithms or data structures, or inefficient database access with many tiny queries and joins performed in Java code instead of within the database.
Another typical root cause is inappropriate usage of complex frameworks (e.g., Hibernate for database operations requires expert knowledge about its side effects) or, sometimes, defects in frameworks or the Java virtual machine.
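A hypothetical sketch of how the incoming/outgoing call ratio can be tracked with little intrusion; the class and method names are ours, and in practice the counters would be hooked into, e.g., a servlet filter and a JDBC wrapper:

import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: track the fan-out of outgoing DB calls per incoming
// request. A rising fan-out over releases is a dynamic indicator of TD.
public class CallRatioTracker {
    private final LongAdder incomingRequests = new LongAdder();
    private final LongAdder outgoingDbCalls = new LongAdder();

    public void onIncomingRequest() { incomingRequests.increment(); }
    public void onDbCall() { outgoingDbCalls.increment(); }

    // Average number of DB calls triggered per incoming request so far.
    public double fanOut() {
        long in = incomingRequests.sum();
        return in == 0 ? 0.0 : (double) outgoingDbCalls.sum() / in;
    }
}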
3) Garbage Collection (GC): Garbage collection refers to
garbage collector activity and the amount of heap memory,
which typically indicate TD (e.g., a memory leak) if the
frequency of GC rises, or if maximum/minimum peaks of heap
memory rise over the course of GC activity (see Figure 1).
Typical root causes for class or memory leaks include inefficient programming (e.g., forgetting to release handles), but often also defects in frameworks or side effects of frameworks that are known but not well documented.
Fig. 1. Technical debt: Garbage Collection activity rises, min/max memory
rises. This hints at a memory leak.
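Such a rising GC trend can be detected automatically by combining the JMX probe and the slope sketch shown earlier; a minimal illustration (window size and sampling interval are arbitrary assumptions):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative sketch: sample the per-minute GC rate and flag a rising trend.
public class GcTrendMonitor {
    public static void main(String[] args) throws InterruptedException {
        double[] ratePerMinute = new double[30]; // window size chosen arbitrarily
        long previous = totalGcCount();
        for (int i = 0; i < ratePerMinute.length; i++) {
            Thread.sleep(60_000); // one sample per minute (arbitrary interval)
            long current = totalGcCount();
            ratePerMinute[i] = current - previous; // collections during this minute
            previous = current;
        }
        // A clearly positive slope of the GC rate, together with rising min/max
        // heap peaks, hints at a memory leak (cf. Figure 1).
        System.out.println("GC rate slope: " + slope(ratePerMinute));
    }

    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gc.getCollectionCount();
        }
        return total;
    }

    // Least-squares slope as in the trend sketch of Section III-C.
    static double slope(double[] samples) {
        int n = samples.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += i;
            sumY += samples[i];
            sumXY += i * samples[i];
            sumXX += (double) i * i;
        }
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }
}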
IV. CONCLUSION
Current approaches for measuring TD focus mainly on
static aspects of code or near-code artifacts, measured at
development or build time. There is a need for extending
current approaches to address dynamic indicators—that is,
properties emerging at system runtime—which are often an
important cause of software bankruptcy. In this paper, we
presented initial insights on requirements and measurement
ideas gained during commercial software renovation projects
and a research project on TD management.
The presented examples of dynamic indicators of TD frequently occur in practice. All of them were difficult, or even impossible, to detect with static analysis techniques, were observable only at system runtime, and resulted in quality deficits that need to be detected and fixed as soon as possible. This is why indicators
for dynamic aspects of TD are important in practice and need
ready-to-use tools for measuring and storing runtime data;
advances are also needed with regard to their interpretation.
Continuous monitoring of dynamic aspects of TD can
supplement static measurement. This is becoming increasingly important in the context of DevOps and cloud-native approaches. There are plenty of measurement opportunities in
such environments: Every build or nightly build can include,
for example, automated load tests; systems in operation can
use metrics (e.g., based on JMX) or log files (e.g., based on the
ELK stack) to collect dynamic indicators and thereby detect
anomalies as well as slow trends over time. Detecting these forms of TD early is important; otherwise, a long time may pass before an issue is detected, if it is detected at all (e.g., it may just
cause higher cost without being recognized as a defect). Tools for defining and, in part, for collecting such data exist (e.g., the ELK stack, Prometheus). The challenge is to make these data
accessible for analysis and interpretation.
ACKNOWLEDGMENT
We would like to thank the members of QAware GmbH and
Insiders Technologies who participated in the interviews. Parts
of this work have been supported by the German Ministry of
Education and Research (BMBF) under grant no. 01IS15008D
and 01IS15008A.
REFERENCES
[1] A. Hunt and D. Thomas, “Zero-tolerance construction,” IEEE Software,
vol. 19, pp. 100–102, 2002.
[2] I. Inayat, S. S. Salim, S. Marczak, M. Daneva, and S. Shamshirband,
“A systematic literature review on agile requirements engineering
practices and challenges,” Computers in Human Behavior, vol.
51, Part B, pp. 915–929, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S074756321400569X
[3] L. Guzman, M. Oriol, P. Rodriguez, X. Franch, A. Jedlitschka, and
M. Oivo, “How can quality-awareness support rapid software development? A vision paper,” in Proceedings of the 23rd Working Conference on
Requirements Engineering Foundation for Software Quality, ser. REFSQ
2017. Berlin, Heidelberg: Springer-Verlag, 2017, pp. 1–7.
[4] S. Wagner, Software product quality control. Springer Verlag, 2013.
[5] A. Ampatzoglou, A. Ampatzoglou, A. Chatzigeorgiou, and P. Avgeriou,
“The financial aspect of managing technical debt: A systematic
literature review,” Information and Software Technology, vol. 64, pp. 52–73, 2015. [Online]. Available:
http://linkinghub.elsevier.com/retrieve/pii/S0950584915000762
N. S. Alves, T. S. Mendes, M. G. de Mendonça, R. O. Spínola, F. Shull, and C. Seaman, “Identification and management of technical debt: A systematic mapping study,” Information and Software Technology, vol. 70, pp. 100–121, 2016. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0950584915001743
[7] J. Weigend, J. Siedersleben, and J. Adersberger, “Dynamische Analyse
mit dem Software-EKG,” Informatik-Spektrum, vol. 34, no. 5, pp. 484–
495, Oct. 2011.
J. Adersberger and J. Weigend, “IT-Sanierung. Prävention statt Herzstillstand: Wie sich kranke IT-Systeme kurieren lassen,” Objekt-Spektrum,
no. 2015-01, pp. 54–55, Jan. 2015.
N. S. Alves, L. F. Ribeiro, V. Caires, T. S. Mendes, and R. O. Spínola, “Towards an ontology of terms on technical debt,” in Proceedings of the Sixth International Workshop on Managing Technical Debt (MTD). IEEE, 2014, pp. 1–7. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6974882
Z. Li, P. Avgeriou, and P. Liang, “A systematic mapping study on technical debt and its management,” Journal of Systems and Software, vol. 101, pp. 193–220, 2015. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0164121214002854
[11] M. Ciolkowski, S. Faber, and S. von Mammen, “3-D visualization of
dynamic runtime structures.” Gothenburg: ACM Press, Oct. 2017, pp.
189–198.