
Application Performance Management: Measuring and Optimizing the Digital Customer Experience


Abstract

Nowadays, the success of most companies is determined by the quality of their IT services and application systems. To make sure that application systems provide the expected quality of service, it is crucial to have up-to-date information about the system and the user experience to detect problems and to be able to solve them effectively. Application performance management (APM) is a core IT operations discipline that aims to achieve an adequate level of performance during operations. APM comprises methods, techniques, and tools for i) continuously monitoring the state of an application system and its usage, as well as for ii) detecting, diagnosing, and resolving performance-related problems using the monitored data. This book provides an introduction by covering a common conceptual foundation for APM. On top of the common foundation, we introduce today's tooling landscape and highlight current challenges and directions of this discipline.
Copyright © 2018 SIGS DATACOM GmbH
Lindlaustraße 2c
53842 Troisdorf
This publication is protected by copyright. All rights reserved.
Any use of the texts and figures, even in part, without the written consent of the publisher
is a violation of copyright law and therefore punishable. This applies in particular to
reproduction, translation, or use in electronic systems. Please note that the software and
hardware names as well as the brand and product names of the respective companies used in this
brochure are generally subject to trademark, brand, or patent protection. All information and
programs in this brochure have been checked with the greatest care. However, neither the
author nor the publisher can be held liable for damages arising in connection with the use of
this brochure. Unless otherwise stated, the sources linked in the text were used.
SIGS DATACOM GmbH has granted the authors the right to publish this version of the
e-book online as open access. The original version of the e-book can be requested via
André van Hoorn is a junior research group leader with the Institute of Software Technology
at the University of Stuttgart, Germany. André's research focuses on novel methods,
techniques, and tools for designing, operating, and evolving trustworthy distributed software
systems. Of particular interest are quality attributes such as performance, reliability, and
resilience, and how they can be assessed and optimized using a smart combination of
model-based and measurement-based approaches. He has been researching in the APM context for
more than ten years and is one of the people behind the open-source APM framework Kieker.
Currently, André investigates challenges and opportunities to apply such approaches in the
context of continuous software engineering and DevOps.
Stefan Siegl lives by the slogan "Performance matters!". His goal is to convey performance
awareness to business, operators, and developers. He has 15 years of experience in integrating
APM solutions and processes in big enterprises. He heads the practice area "Application
Performance Management and Data Intelligence" at NovaTec Consulting GmbH. In his spare time
he loves playing with his two sons, watching endless episodes with his wife, and saving the
virtual world in computer games.
About the Authors
1. Introduction
2. Collection of Performance Measurements
2.1 Where to Collect Data?
2.2 What Data to Collect?
2.3 How to Collect Data?
3. Extraction of Performance-Relevant System Information
3.1 Time Series
3.2 Execution Traces
3.3 Architectural Information
4. Visualization of APM Information
4.1 APM Views
Traffic Lights and Gauges
Time Series
Graphs
Execution Traces
Other Types of Views
4.2 APM Dashboards
5. Data Interpretation and Use
5.1 Health Monitoring and Architecture Discovery
5.2 Problem Diagnosis
5.3 Problem Diagnosis and Root Cause Isolation
5.4 System Refactoring and Adaptation
6. APM Tooling
6.1 Features of APM Tools
6.2 Technical Stack
6.3 Commercial vs. Open-source APM Tools
7. Challenges and Directions
7.1 Automation of Supporting APM Activities
7.2 Data Protection Regulations
7.3 Automated Problem Detection, Diagnosis and Prediction
7.4 Tooling Interoperability
7.5 Development Paradigms, Architectural Styles, and Programming Languages
8. Conclusion
Acknowledgments
References
New Relic
The Secret to Faster, Better Software - How to Improve Performance and Optimize Your
Business with APM
The sponsor contents have been removed for this open-access version of the e-book.
The full version of the e-book, including the sponsor contents, can be requested via
1. Introduction
Nowadays, the success of most companies is determined by the quality of their IT services. In
addition to internal application systems, e.g., ERP systems, the application systems including
the digital services provided to end users have become business-critical assets of virtually
every company. As a part of the digital transformation, customers access digital services as
end users through web browsers or dedicated smartphone apps. Examples include buying products
via online retail stores and reserving concert tickets via respective booking sites. Studies
have shown that the way customers experience the digital services of a company has a direct
impact on the business KPIs: satisfied customers buy products; customers who experience
services that are unavailable or slow will move to competitors. For example, Google loses
20 % of traffic if their websites get 500 ms slower; Amazon loses 1 % of revenue for every
100 ms in latency¹. Hence, application systems and the resulting end-user experience have to
be measured and optimized continuously. To make sure that application systems provide the
expected quality of service, it is crucial to have up-to-date information about the system and
the user experience to detect problems and to be able to solve them effectively.
Application performance management (APM) is a core IT operations discipline that aims to
achieve an adequate level of performance during operations. APM comprises methods, techniques,
and tools for i) continuously monitoring the state of an application system and its usage, as
well as for ii) detecting, diagnosing, and resolving performance-related problems using the
monitored data. It needs to be emphasized that in the APM context, the notion of performance
includes a comprehensive set of non-functional properties (e.g., timeliness, resource usage,
reliability, availability, and security) and maybe even functional aspects. It is important
not to limit APM to the back-end of the system hosted in data centers but to get a holistic
view starting from the devices and experiences of the end users.
The APM market has been growing steadily and strongly over the years. 2016 saw a growth rate
of 18 % leading to a total revenue of approximately $3.7 billion² for commercial suites alone.
In recent years, big technology companies have pushed open-source solutions to the market, and
cloud providers have integrated APM offerings into their cloud platforms.
² According to Gartner's "APM Magic Quadrant" 2018
The adoption of APM and the current maturity level of the APM process differ strongly
between enterprises. This book thus provides an introduction by covering a common
conceptual foundation for APM. This foundation is helpful to practitioners already
using commercial APM tooling, for enterprises building their own APM tooling, and
for people new to APM alike. On top of the common foundation, we introduce today's
tooling landscape and highlight current challenges and directions of this discipline.
Regardless of the actual technical realization concerning tools, APM involves the
following four activities, which are covered in more details in Chapters 2 to 5:
1. Collection of performance measurements. Performance measurements are collected from the
different system tiers (including devices of end users), layers, and locations by a
combination of complementary techniques and tools.
2. Extraction of performance-relevant system information. The collected data is combined into
higher-level data structures, such as time series or execution traces.
3. Visualization of APM information. Data is made available for visual inspection on different
levels for the various stakeholders of an APM solution, from rather abstract overview
dashboards for management and operations to very precise and technical deep analysis traces.
4. Data interpretation and use. The data is used to manually or automatically reason about and
act upon the current state.
Chapter 6 gives an overview of the landscape of APM tools that implement these
activities. Given the increasing importance of open-source solutions in the APM
space, we cover both commercial and open-source tools, and provide guidance on
how to select the right tools.
With changing technologies and development principles (e.g., container-based virtualization
and microservices), new challenges for APM arise. In Chapter 7, we discuss selected challenges
and directions towards which APM as a discipline and APM tools will move. Chapter 8 concludes
the book.
2. Collection of Performance Measurements
Managing application performance requires the continuous collection of data about
all relevant parts of the system starting from the end user all the way through
the system.
This collected data is the basis for getting a holistic end-to-end and up-to-date view
of the application state including the end-user experience. In this chapter, we will
discuss what data to collect, and from where and how to collect the data in order to
achieve this view. Figure 1 provides some examples.
2.1 Where to Collect Data?
Modern application systems are multi-tiered, distributed, multi-layered, and accessed via
different types of clients and devices (e.g., third-party systems and humans using desktop or
mobile devices).
Most application systems are implemented in a way that, in addition to the application logic
executed at the provider's site (referred to as the back-end), parts of the application are
executed at the client's site. The client site usually constitutes a system tier accessing the
back-end via (graphical) interfaces such as fat clients, thin clients realized in web
browsers, or native apps on mobile devices. Communication between clients and the back-end may
be conducted by networked, wireless, and/or cellular connections.
Figure 1: Examples of where, what, and how to collect performance data
What? (example measures per layer)
• Sales data, conversion and bounce rate
• User interactions: length of stay, load time, errors, number of resources on HTML pages
• Component interactions, method response times, trace data
• Queuing statistics, pooling, garbage collection
• Operating system: file handling statistics, virtualization, thread statistics
• CPU load, memory consumption, I/O statistics
How?
• Active: stimulation of the system by periodic sampling, e.g., with synthetic user requests
• Passive: collection of runtime data from real system usage, e.g., by injection of code or
analysis of network traffic, resource utilization, or log files
• Some technologies on lower levels provide standard interfaces for data collection, e.g.,
Nagios, JMX
Tip box: Do not forget to monitor your network and third parties
Third-party components also belong to your application. Users cannot and will not
differentiate between the part of the application running in "your" cloud and the part that
you just integrated as a third party. Often people argue that monitoring components that you
are not responsible for is not necessary as you cannot directly change and improve them. We
argue that you are monitoring the end-user experience and this includes everything: you need
to know what the end user experiences and need to break it down to the root cause. You can
only do this if you are gathering the data.
In our consulting engagements we saw various examples:
- No monitoring of the network. Usually this allowed for the argument that any problem that
could not be pinpointed actually has to be the network.
- No monitoring of the content delivery network, leading to very high loading times due to
misconfigurations.
- No monitoring of technological components (service brokers, queuing systems, etc.) as they
did not provide easy-to-integrate monitoring interfaces, leading to finger-pointing back and
forth between departments instead of solving a problem together based on facts.
To provide an end-to-end view on application performance, APM requires the collection of
relevant performance measures from all of the mentioned locations.
2.2 What Data to Collect?
The types of measures that can and should be collected depend on the previously mentioned
locations, but also on the architectural style used by the application system. As mentioned
previously, the primary goal of any application system is to understand the experience that
the customers had with the digital service and to provide support for business processes.
Hence, in addition to technical measures (e.g., CPU utilization and response times),
business-level measures need to be considered. Examples include the number of completed orders
per hour for a specific business use case. This information needs to be further tailored to
specific needs by adding contextual information. In the aforementioned example, it would be
useful to integrate the geolocation to be able to differentiate the completed orders by
country or city if this is relevant to the business.
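As a minimal sketch of such a business-level measure, the following Python snippet counts completed orders per hour and enriches them with the country as contextual information. The event format, the sample data, and the function name `completed_orders_per_hour` are assumptions made for illustration only; they are not taken from any APM tool.

```python
from collections import Counter
from datetime import datetime

# Hypothetical order events: (completion timestamp, country code).
orders = [
    (datetime(2018, 3, 1, 13, 5), "DE"),
    (datetime(2018, 3, 1, 13, 40), "DE"),
    (datetime(2018, 3, 1, 13, 55), "FR"),
    (datetime(2018, 3, 1, 14, 10), "DE"),
]

def completed_orders_per_hour(events):
    """Count completed orders per (hour, country)."""
    counts = Counter()
    for timestamp, country in events:
        # Truncate the timestamp to the full hour to form the bucket key.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(hour, country)] += 1
    return counts

per_hour = completed_orders_per_hour(orders)
for (hour, country), count in sorted(per_hour.items()):
    print(hour.strftime("%Y-%m-%d %H:00"), country, count)
```

Keeping the country in the key is what allows the same measure to be broken down later, for example to compare order volumes between countries.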
Hence, following a top-down perspective of APM, measures related to business
and end-user experience (often called the customer journey, which refers to the
sequence of requests of a specific user) are of primary interest. On a business level,
these measures include data about completed and uncompleted conversions (e.g.,
statistics about client sessions with and without buy transactions). Measures about
the end-user experience obtained on the client site include end-to-end response
times of interactions, (page) load times, errors, and data about the UI usage.
On the application level, measures about the application-internal behavior can be collected,
including executions of methods, occurrences of exceptions, calls to remote services or
databases, etc. On the system level, i.e., middleware, operating system, and hardware,
measures about the state of hardware and software resources are collected in particular.
Additional example measures for the different levels are included in Figure 1.
Tip box: Focus on the right measures
Data is often collected just for the sake of having it and is never used. This results in
massive amounts of raw data that nobody will look at and nobody will be able to understand.
This in turn results in substantial monitoring environments. Without a meaningful (automatic)
interpretation of the data, collecting every data point available is usually not meaningful.
It is hard to know which data will be used and which is just "garbage". In fact, knowing which
data to keep and which to drop is one of the hardest questions and strongly impacts the
monitoring infrastructure. Due to this, commercial solutions often include internal
limitations.
There are various approaches available:
- Reduce the impact on the monitoring infrastructure by aggregating the raw data.
- Start small and smart and extend in case you need more data. "Smart" refers to selecting the
measures that are usually interesting for monitoring and problem diagnosis. Commercial tools
usually provide presets that you can extend. In Figure 1, we provide an (incomplete) list of
meaningful measures.
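The first approach, aggregating raw data, can be sketched in a few lines of Python. The class `RunningAggregate` and the sample values are hypothetical; the sketch only illustrates that a constant-size summary can replace an arbitrarily long stream of raw measurements, at the price of losing the individual values.

```python
from dataclasses import dataclass

@dataclass
class RunningAggregate:
    """Constant-size summary that replaces a stream of raw measurements."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        """Fold one raw measurement into the summary and discard it."""
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0

agg = RunningAggregate()
for response_time_ms in [120.0, 80.0, 100.0, 300.0]:  # hypothetical raw values
    agg.add(response_time_ms)
print(agg.count, agg.mean, agg.minimum, agg.maximum)
```

Note that such a summary cannot provide percentiles afterwards; which statistics to keep has to be decided before the raw data is dropped.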
2.3 How to Collect Data?
There are many approaches to collecting the data, which can be categorized into two main
groups: active and passive. We will explain both and list their benefits and drawbacks.
Active approaches stimulate the system under diagnosis and create measurements based on the
responses of the system. Often, active data collection is performed by periodic sampling of
system services or resources. This includes the emulation of customers using synthetic
requests.
The hypothesis of active approaches is that the synthetic requests behave like the actual
user requests. Thus, if a synthetic request encounters problems, it is likely that the actual
users will encounter problems as well.
Benefits:
• They are easy to set up, as tests for use cases are often already available and only need
to be automated.
• The measurement quality is independent of the actual load.
• The business mapping to an end-user use case is easy as it is known which use case is being
executed.
• The overhead is directly dependent on the number of synthetic invocations.
• The tests are always the same, thus deviations of the gathered measurements could directly
point to potential problems. Often service level agreements are based on a defined set of
tests with defined test parameters.
Drawbacks:
• Only data from an end-user perspective can be captured.
• Synthetic requests may differ from the actual end-user behavior.
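The active approach described above can be sketched as a small Python loop that repeatedly executes a scripted use case and records whether it succeeded and how long it took. The function `run_synthetic_checks` and the stand-in use case `fake_checkout` are hypothetical; a real setup would issue actual requests against the system and run on a schedule.

```python
import time
from typing import Callable, List, Tuple

def run_synthetic_checks(
    use_case: Callable[[], None], repetitions: int, pause_s: float = 0.0
) -> List[Tuple[bool, float]]:
    """Execute a scripted use case repeatedly; record success and duration in seconds."""
    results = []
    for _ in range(repetitions):
        start = time.perf_counter()
        try:
            use_case()
            ok = True
        except Exception:
            # Any exception raised by the use case counts as a failed check.
            ok = False
        results.append((ok, time.perf_counter() - start))
        time.sleep(pause_s)
    return results

def fake_checkout() -> None:
    """Hypothetical stand-in for a scripted request against the real system."""
    time.sleep(0.01)

results = run_synthetic_checks(fake_checkout, repetitions=3)
availability = sum(ok for ok, _ in results) / len(results)
print(f"availability={availability:.0%}")
```

Because the same use case is executed every time, the recorded durations are directly comparable, which is the property that makes this style of measurement attractive for service level agreements.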
Tip Box: Combine passive and active monitoring approaches
A good practice is to combine active and passive monitoring approaches to get the best of both
worlds. We usually see passive approaches used to capture detailed measurements (like traces),
supplemented by active approaches for SLA monitoring, availability monitoring, and sometimes
management reporting.
3. Extraction of Performance-Relevant System Information
The previous chapter focused on the collection of performance measurements from the relevant
locations of the application system. This chapter focuses on the representation of
higher-level performance-relevant information about the system and its end users that can be
extracted from this data and that is used for APM visualization and reasoning, as detailed in
the next chapters. Notably, we will focus on three commonly used representations, namely time
series, execution traces, and augmented information about the architecture of the application
system. While time series represent summary statistics (e.g., counts, percentiles, etc.) over
time, execution traces provide a detailed representation of the application-internal control
flow that results from individual system requests. From this data, architectural information,
including logical and physical deployments and interactions (topology), can be extracted. For
all cases, we will highlight examples and use cases in the context of APM.
3.1 Time Series
Informally, a time series is a chronological sequence of data points. In the context of APM,
time series are frequently used to show and analyze the evolution of performance measurements
over time. As an example, Figure 2 depicts a schematic time series with response time
measurements between 1 PM and 4 PM. The horizontal axis represents the calendar time, while
the vertical axis represents the scale of measurements, in this case response times in
seconds. Very often, a data point of the time series does not represent a single performance
measurement but, instead, aggregates a set of measurements observed during a defined time
period. The aggregation is conducted using common statistics, such as computing average,
minimum, or maximum values, or by computing percentiles, e.g., representing the value under
which 95 % of the measurements fall. For instance, we can assume that each data point in
Figure 2 depicts the average value of all response times observed in the past 3 minutes. Time
series may be used for various types of measures as introduced in the previous chapter.
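The aggregation into data points can be illustrated with a short Python sketch. It groups raw measurements into fixed 3-minute windows and computes average, minimum, maximum, and a 95th percentile per window. The function names, the nearest-rank percentile definition, and the sample data are assumptions chosen for this illustration; APM tools may use different percentile definitions and sliding instead of fixed windows.

```python
import math
from collections import defaultdict
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p % of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def aggregate(measurements, window_s=180):
    """Group (timestamp_s, value) pairs into fixed windows and summarize each window."""
    windows = defaultdict(list)
    for timestamp_s, value in measurements:
        # Integer division maps each timestamp to the start of its window.
        windows[timestamp_s // window_s * window_s].append(value)
    return {
        start: {
            "avg": mean(vals),
            "min": min(vals),
            "max": max(vals),
            "p95": percentile(vals, 95),
        }
        for start, vals in sorted(windows.items())
    }

# Hypothetical response times in seconds, keyed by seconds since the start of observation.
raw = [(0, 0.2), (60, 0.4), (120, 0.3), (180, 0.5), (200, 2.5), (300, 0.6)]
series = aggregate(raw)
for start, stats in series.items():
    print(start, stats)
```

Each key of `series` corresponds to one data point of the time series; plotting the `avg` values over the window start times would yield a chart in the style of Figure 2.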
Time series are very useful to assess the overall state of a system or its components.
Depending on the type of measure, the time series can be used to detect anomalies.
For instance, if the time series of memory usage shows a steadily increasing trend,
this may indicate a memory leak in the system.
When depicting the number of users accessing a system, time series usually show a periodic
pattern, e.g., based on the weekdays and the hours of the day. Other interesting patterns are
spikes, for instance, indicating peaks in workload or hiccups in the system.
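A steadily increasing trend, as in the memory leak example, can be detected with a simple least-squares slope. The following Python sketch is one possible way to do this; the function `linear_trend`, the sample values, and the threshold of 5 MB per minute are assumptions for illustration and would need tuning for a real system.

```python
def linear_trend(values):
    """Least-squares slope of values over their index (units per sample).

    Requires at least two samples.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance

memory_mb = [512, 530, 547, 566, 584, 601]  # hypothetical samples, one per minute
slope = linear_trend(memory_mb)
if slope > 5:  # assumed threshold in MB per minute
    print(f"possible memory leak: +{slope:.1f} MB/min")
```

A periodic pattern, such as a daily usage cycle, would also produce temporary positive slopes, so in practice the trend is usually evaluated over a window that spans several periods.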
A common challenge with time series is to select the right statistics and aggregation windows.
For instance, the average of performance measurements is highly influenced by outliers, which
are common for response times, e.g., due to garbage collection. Larger aggregation windows
smooth the time series but may hide important patterns. Conversely, aggregation windows that
are too small may highlight unimportant patterns.
To summarize, time series are useful to provide an aggregate view on performance
measurements over time. However, they are not suited to analyze individual requests.
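The sensitivity of the average to outliers is easy to demonstrate. In the following Python snippet, the response times are made-up values in which a single request is delayed, for example by a garbage collection pause:

```python
from statistics import mean, median

# Hypothetical response times in milliseconds; the last one coincides with a GC pause.
response_times_ms = [100, 110, 95, 105, 100, 2000]

average = mean(response_times_ms)    # pulled up strongly by the single outlier
typical = median(response_times_ms)  # stays close to the typical request
print(average, typical)
```

This is one reason why percentiles or the median are often shown alongside, or instead of, the average.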
3.2 Execution Traces
We concluded the previous section with the statement that time series are not suitable for
analyzing individual requests. A data structure commonly used in APM for this purpose is an
execution trace. Informally, an execution trace is a representation of the execution flow of a
request through the system, ideally starting from the end user. As an example, Figure 3
depicts a schematic execution trace. The execution trace starts with an operation called
doFilter that is commonly found as an entry point in web-based applications. It can be
observed that the execution of the doFilter operation includes a sequence of additional nested
operation executions, until the list operation performs a sequence of calls to a database.
Figure 2: Example time series
In addition to the execution flow, capturing components (e.g., Java classes or microservices)
and operations, and locations (e.g., application server, IP address), execution traces usually
include further measurements. One type of performance measurement commonly found in execution
traces is the response time (or duration) of each operation execution. In the example, the
response time for each operation execution is included in the second column. Moreover,
execution traces may include information such as the parameters of the operation executions.
A common use case of execution traces is the detailed analysis of erroneous or slow user
requests. In this case, the execution trace is inspected for undesirable patterns such as high
response times of operation executions or frequent remote calls, e.g., to a database. A big
challenge when working with execution traces is the amount of data to be processed. First, an
execution trace may exist for every single request to the system. Second, each execution trace
contains tens or hundreds of operations. Hence, the manual analysis of execution traces is
time-consuming.
Figure 3: example execution trace
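To make the data structure more concrete, the following Python sketch models an execution trace as a tree of operation executions and searches it for the two undesirable patterns mentioned above. The class `OperationExecution`, the component names, and all response times are hypothetical; only the operation names doFilter and list are borrowed from the example in the text.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

@dataclass
class OperationExecution:
    """One node of an execution trace: an operation with its nested calls."""
    component: str
    operation: str
    response_time_ms: float
    children: List["OperationExecution"] = field(default_factory=list)

    def walk(self) -> Iterator["OperationExecution"]:
        """Yield this execution and all nested executions in call order."""
        yield self
        for child in self.children:
            yield from child.walk()

# Hypothetical trace loosely following the doFilter/list example in the text.
trace = OperationExecution("WebFilter", "doFilter", 120.0, [
    OperationExecution("ProductService", "list", 110.0, [
        OperationExecution("Database", "query", 30.0),
        OperationExecution("Database", "query", 35.0),
        OperationExecution("Database", "query", 40.0),
    ]),
])

# Pattern 1: frequent remote calls, here calls to the database.
db_calls = [op for op in trace.walk() if op.component == "Database"]
# Pattern 2: operation executions with high response times (assumed limit: 100 ms).
slow = [op.operation for op in trace.walk() if op.response_time_ms > 100]
print(len(db_calls), slow)
```

Automating such checks over all traces is one way to cope with the amount of data that makes manual inspection time-consuming.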
3.3 Architectural Information
Time series and execution traces allow analyzing the chronological order of performance
measurements and of individual requests, respectively. This information is commonly used to
derive and represent performance-relevant architectural information of a system. The
architecture of a system includes structural and dynamic information. Examples of structural
information are the existence and deployment of software and hardware components. The dynamic
information includes interactions (e.g., number of calls, average response times) between
components and associated information about the runtime behavior, e.g., a health state or time
series. In Chapter 4, we include examples of performance-augmented architectural information.
This representation is useful to have an overall state of the system and it provides a
basis for a detailed manual or automated analysis. Moreover, it is the basis for the
APM visualization covered in the next chapter.
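As a rough sketch of how dynamic architectural information can be derived from execution traces, the following Python snippet aggregates caller-to-callee interactions with call counts and average response times. The tuple-based trace format, the component names, and the function `add_trace` are assumptions for illustration and do not reflect the format of any specific APM tool.

```python
from collections import defaultdict

def add_trace(graph, node):
    """Add the caller -> callee edges of one trace to the graph.

    A trace node is a tuple (component, response_time_ms, children).
    """
    component, _, children = node
    for child in children:
        callee, response_time_ms, _ = child
        edge = graph[(component, callee)]
        edge["calls"] += 1
        edge["total_ms"] += response_time_ms
        add_trace(graph, child)

# Hypothetical traces of two requests.
traces = [
    ("web", 120.0, [("inventory", 80.0, [("mysql", 30.0, []), ("mysql", 20.0, [])])]),
    ("web", 90.0, [("inventory", 60.0, [("mysql", 40.0, [])])]),
]

graph = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
for trace in traces:
    add_trace(graph, trace)

for (caller, callee), edge in graph.items():
    print(caller, "->", callee, edge["calls"], edge["total_ms"] / edge["calls"])
```

The resulting edges correspond to the interactions described above; the set of components appearing in the keys gives a simple view of the structure.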
4. Visualization of APM Information
APM information, such as the representations introduced in the previous chapter, needs to be
visually presented in a meaningful and comprehensible way. In the APM context, this is
achieved using different interrelated and navigable views, which are usually presented as a
dashboard. The dashboards are included in fat-client desktop applications or web-based
applications.
In this chapter, we will present typical views of APM information and give guidelines
about how to design a useful dashboard by combining the different views.
4.1 APM Views
As depicted in Figure 4, the views can be categorized using two dimensions: the
scope (business vs. technology) and the level of abstraction. Views can contain
detailed business information such as the status of user devices, geolocations, as
well as the health of the available services. On the other hand, data can be presented
in the form of traces, time series, page flows, underlying topologies, server health,
etc. These views vary from more abstract to more detailed, depending on what is
required to answer the respective concern.
We will not be able to cover all types of views. Instead, we cover the most common
examples based on Figure 4.
Figure 4: dimensions of APM visualization
Traffic Lights and Gauges
Common ways to indicate the current value of a performance measure are textual or graphical
visualizations such as absolute (integer or decimal) numbers or traffic lights. Their meaning
should be self-explanatory. Colors like green, orange, and red indicate whether the respective
measure is in a normal, warning, or critical range. The visualizations may be augmented by an
indicator of the current trend. A simple example is an arrow next to the visualization
indicating a decrease or an increase compared to previous values in time. Also,
speedometer-like gauges are frequently used. Figure 5 shows examples³.
Figure 5: Example views to display numbers, trends, and status
© by APM vendor
³ Figures 5 to 9 show publicly available screenshots of tools from APM vendors such as
AppDynamics, CA, Dynatrace, and New Relic. The copyrights of the figures are held by the
respective vendors.
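The logic behind a traffic light with a trend indicator is simple, as the following Python sketch shows. The function names, the thresholds, and the 5 % tolerance are assumptions for illustration; the sketch also assumes a measure for which higher values are worse, such as a response time.

```python
def traffic_light(value, warning, critical):
    """Map a measure to a color; assumes that higher values are worse."""
    if value >= critical:
        return "red"
    if value >= warning:
        return "orange"
    return "green"

def trend_arrow(current, previous, tolerance=0.05):
    """Return an arrow for the relative change compared to the previous value."""
    if previous == 0:
        return "→"
    change = (current - previous) / previous
    if change > tolerance:
        return "↑"
    if change < -tolerance:
        return "↓"
    return "→"

# Hypothetical response time in seconds, compared to the previous data point.
print(traffic_light(1.4, warning=1.0, critical=2.0), trend_arrow(1.4, 1.0))
```

For measures where lower values are worse, such as throughput or availability, the comparisons in `traffic_light` would have to be inverted.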
Time Series
We have previously introduced time series, which are commonly used to represent the evolution
of a performance metric over time. Time series views are a basic component of any APM
dashboard. In the simplest case, the views show one or more combined time series. However,
very often, the time series are augmented by additional information that may be a static part
of the view or that is only displayed on demand in an interactive way when pointing to or
selecting parts of the time series. Figure 6 shows examples.
Figure 6: Example views to display time series © by APM vendor
Graphs
Performance-augmented architectural information is commonly displayed in graph-based views.
The graphs comprise nodes and edges, where their meaning differs based on the purpose of the
respective view. Common examples include nodes representing applications or application
components with edges representing calls between them. The graph is annotated by information
about the type of node (e.g., MySQL database, Linux server, inventory service) and additional
status information as introduced previously; likewise, the color or thickness of edges may
represent the health status or intensity of calls, and edges may be annotated by quantitative
information such as response time statistics or numbers of calls. Because the underlying
architecture may impose a large number of nodes and edges, special attention is paid to the
layout. Graph-based views are usually interactive in that nodes and edges can be selected, the
views can be increased and decreased in scale, and subgraphs may be (de-)collapsed. Figure 7
shows examples.
Figure 7: Example graph-based views
© by APM vendor
Execution Traces
Views of execution traces allow visual inspection of the control flow of individual requests
through the system. Due to the large number of execution traces, the views usually offer a
grouping, e.g., by transaction types, error types, or performance properties such as high
response times. Due to the length of individual traces, they can be (de-)collapsed. Figure 8
shows example views to display and inspect execution traces.
Figure 8: Example views related to execution traces
© by APM vendor
Other Types of Views
Various other views are commonly used in dashboards. First, standard graphics for displaying
statistics, such as box plots, pie charts, and histograms, are usually supported. Additional
types of views include geographical maps to associate information with geolocations (e.g.,
countries), lists of messages and alerts, or flow graphs to visualize the interaction of the
end users. Figure 9 includes selected examples of other view types.
Figure 9: Examples of other views
© by APM vendor
4.2 APM Dashboards
Dashboards provide visual access to APM information about a system and its usage. A dashboard
gives stakeholders interactive access to different, interconnected views as introduced in the
previous sections. Dashboards are often considered the result of good monitoring. Dashboards
are accessed by individual stakeholders or are put on big screens throughout the building to
inform all stakeholders of the current state of the digital services. An example use of the
different views is as follows. A service status view shows that all services are healthy. As
soon as a service is indicated not to be healthy any more, the incident needs to be analyzed.
This is achieved by navigating to the other, more detailed views. Dashboards are usually
pre-configured and possibly refined, or created manually.
Creating dashboards may seem to be one of the easiest things in the world. However, at the
same time, it is one of the hardest. Arguably, today's tools provide easy ways to put data on
screens: they just require selecting data, selecting the widget to display a view, and
positioning it on the screen. Approaches like these will create thousands of dashboards, and
we often see thousands of dashboards at customers. The crucial question is whether these
dashboards fulfill their goal and whether the information that these dashboards provide is
also understandable to the user.
In the following, we sketch a process and list criteria to obtain good dashboards.
1. Think about what you want to achieve.
2. Think about the target group.
3. Discuss with the target group.
4. Design the dashboard and discuss again.
5. Ensure good design principles.
6. Test your dashboard and ask others what they understand from it.
7. Take the feedback, optimize, or even delete the dashboard.
Criteria for good dashboards:
• Know your target group and design your dashboard accordingly. If you want to put the
dashboard on a wall, you should not implement any mouse-over effects or require any drill-down
actions. Also note that the common passerby has only seconds to see and interpret the data on
the dashboard. More might be less.
• Be precise: dashboards often visualize data without providing details on what this data
actually is. What data is presented? Where and how is the data collected?
• Provide context and reference: humans tend to put data into reference to make sense of it.
Instead of visualizing response times for services as a time series, it might be more
meaningful to provide the number of requests that were normal, slow, and failing. Another
approach is to at least provide a baseline as reference to indicate whether the displayed data
is good or bad.
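The last criterion can be illustrated with a small Python sketch that turns raw response times into counts of normal, slow, and failing requests relative to a baseline. The function `classify`, the baseline of 125 ms, and the factor of 2 that separates normal from slow requests are assumptions for illustration.

```python
from collections import Counter

def classify(response_time_ms, failed, baseline_ms, slow_factor=2.0):
    """Label a request relative to a baseline; slow_factor is an assumed tuning knob."""
    if failed:
        return "failing"
    if response_time_ms > slow_factor * baseline_ms:
        return "slow"
    return "normal"

# Hypothetical requests: (response time in ms, failed?)
requests = [(120, False), (130, False), (480, False), (90, True), (110, False)]
summary = Counter(classify(rt, failed, baseline_ms=125) for rt, failed in requests)
print(dict(summary))
```

Three numbers of this kind can be read at a glance, whereas a response time chart requires the viewer to know what a good value looks like.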
Tip Box: Do not overdo dashboards, but think about actionable information
Dashboards are often considered the go-to solution for monitoring. In our experience,
we often see people creating dashboards to get all data in one place to analyze
a problem or to check whether the same problem arose again. Most dashboards
improve as soon as you aggregate data and transform it into information. Instead of
showing the response time, the error rate, and the throughput of a service, you could
create a single traffic light that interprets the data and indicates
whether there is a problem with the service, instead of requiring the viewer
to come up with her own interpretation. Most dashboards are just lying around, broken
and without data. Often, alerts with notifications are far better than dashboards,
as issues go unnoticed when the dashboard is not inspected.
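A single traffic light that interprets response time, error rate, and throughput could look like the sketch below. The limits and the rule "one violation is yellow, two or more are red" are assumptions chosen for illustration; real limits would come from service level agreements or learned baselines.

```python
def traffic_light(p95_ms, error_rate, throughput_rps, *,
                  max_p95_ms=500.0, max_error_rate=0.01, min_throughput_rps=1.0):
    """Collapse three service measures into 'green', 'yellow', or 'red'."""
    violations = sum([
        p95_ms > max_p95_ms,                  # service responds too slowly
        error_rate > max_error_rate,          # too many failing requests
        throughput_rps < min_throughput_rps,  # suspiciously little traffic
    ])
    if violations == 0:
        return "green"
    if violations == 1:
        return "yellow"
    return "red"


print(traffic_light(p95_ms=800, error_rate=0.001, throughput_rps=50))
```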
Dashboards are often part of the APM solution. The first thing to do is to check whether
these dashboards are already good enough.
Chapter 3 already presented representations of APM knowledge, namely time
series, execution traces, and augmented architectural models. As mentioned previously,
the representations may serve different use cases. This chapter highlights selected
A basic use case of APM is to monitor the overall state of the deployed application
system, its usage, and the way the users experience the application.
Therefore, it is important to have well-designed dashboards that give insights into
key performance measures and their interpretation. We have previously covered the
topic of dashboard design in Chapter 4. A basic use case of the dashboards is to provide
information about the health state to stakeholders. Monitoring the health
state also helps to verify that changes to the system, e.g., new deployments or
reconfigurations, did not lead to negative effects. Therefore, the dashboards should
not only display measures but also provide an interpretation of them, e.g., whether measures
are in a problematic state or not. The underlying analyses will be covered in
the following sections.
APM is used in complex infrastructures. Very often, the exact architecture is not
known explicitly. Therefore, as presented in Chapters 3 and 4, APM helps with its
so-called architecture discovery features, i.e., by extracting and visualizing architectural
views from the monitoring data. Hence, APM may serve as a valuable means to
comprehend the actual architecture of the system and its interplay with third-party
During production, it is normal that problems happen. Problems include user-perceived
failures such as services being unavailable or slow, but also internal failures
such as crashed or overutilized computing or storage nodes. A basic feature of
APM is to detect these problems and send an alert to the appropriate stakeholders.
To this end, the extracted information needs to be analyzed. Intuitively, this can be
done by humans continuously inspecting the health state via APM dashboards. Clearly,
this manual problem detection is not reliable. Hence, automation support
for problem detection and alerting is desirable.
Statistical techniques can be applied to the data to detect anomalies, which can indicate
problems. For instance, time series are analyzed for violations of thresholds (e.g.,
service level agreements or other manually defined reference values), for undesired
trends (e.g., increasing response times or resource usage), or for deviations from normal
behavior, which is very often learned from historical data. Figure 10 displays an
example of time series analyses that reveal problems in the underlying performance
measurements. In general, there are various established techniques to analyze the
collected data and information, including basic statistics, data mining, and machine
learning. These techniques can also be proactive, i.e., instead of detecting a problem
that is already present, prediction is used to alert on upcoming problems.
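The three kinds of time series checks named above (threshold violations, undesired trends, and deviations from learned normal behavior) can be sketched with basic statistics. This is a simplified illustration under assumptions: the function names are hypothetical, the trend is a plain least-squares slope over the sample index, and "normal behavior" is modeled as the historical mean plus or minus `k` standard deviations.

```python
from statistics import mean, stdev


def threshold_violations(series, limit):
    """Indices of samples that exceed a fixed reference value."""
    return [i for i, value in enumerate(series) if value > limit]


def trend_slope(series):
    """Least-squares slope per sample; needs at least two samples."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = mean(series)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(series))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def baseline_deviations(history, current, k=3.0):
    """Indices in `current` farther than k standard deviations from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    return [i for i, value in enumerate(current) if abs(value - mu) > k * sigma]


history = [100, 102, 98, 101, 99, 100, 103, 97]   # response times in ms
current = [101, 99, 140, 100]

print(threshold_violations(current, limit=120))
print(trend_slope([100, 110, 120, 130]))
print(baseline_deviations(history, current))
```

A positive slope from `trend_slope` can be extrapolated to estimate when a limit will be crossed, which is the simplest form of the proactive use mentioned in the text.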
In case a problem is detected, alerts can be sent out to system operators. The task
of sending alerts is easy and includes channels such as e-mail, pagers, bots, etc.
The critical part (and this is more related to the analysis techniques than to the
alerting) is when and whom to alert. To retain trust in automatic detection and alerting,
a high classification quality (e.g., in terms of precision, recall, and related measures)
is desired. An APM system that never alerts has no false alarms but misses
actual problems. On the other hand, a system that is too sensitive rarely misses
a problem but also sends out many false alarms.
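The trade-off between the two extremes can be made concrete with precision and recall. The sketch below assumes that alerts and actual problems are recorded as one boolean per time window; this representation and the function name are assumptions for illustration.

```python
def precision_recall(alerts, problems):
    """Precision and recall of alerts against actual problems per time window.

    Returns None for a measure whose denominator is zero.
    """
    pairs = list(zip(alerts, problems))
    tp = sum(1 for a, p in pairs if a and p)        # alert and real problem
    fp = sum(1 for a, p in pairs if a and not p)    # false alarm
    fn = sum(1 for a, p in pairs if not a and p)    # missed problem
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


problems = [False, True, False, True]
never_alerts = [False, False, False, False]
always_alerts = [True, True, True, True]

print(precision_recall(never_alerts, problems))
print(precision_recall(always_alerts, problems))
```

The detector that never alerts has no false alarms (precision is undefined) but a recall of 0, whereas the detector that always alerts reaches a recall of 1 at the price of a precision of only 0.5 in this example.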
Figure 10: Example time series analysis highlighting a problem (time axis from 5:50 am to 6 am; vertical axis from 0 to 5 apps)
Tip Box: Alerting along the customer journey
Once a problem occurs in a system, many hardware and software components may
be affected. Usually, the root cause of the problem is not immediately known. Often,
alerting is based on technical entities, e.g., front end, back end,
database. In this case, it is hard to decide who should be informed. We suggest
aligning both the organization and the alerting with the customer journey, i.e., with a
business functionality.
Once a problem has occurred, the goal of problem diagnosis is to isolate its root
cause, again manually or automatically. As opposed to problem detection, problem
diagnosis and root-cause isolation are challenging tasks, which are impossible
to automate in general due to the diversity of reasons for problems.
Manual analysis is usually conducted by employing the previously mentioned data
representations and visualizations, e.g., by navigating from status lights, via application
topologies, to component drill-downs, execution traces, and time series.
Particularly in performance engineering research, there are approaches to automatically
detect root causes of typical performance problems, for instance, based
on so-called performance antipatterns. These antipatterns document recurring reasons
and symptoms for performance problems. Figure 11 visualizes the so-called
N+1 problem in an execution trace. An N+1 problem is a common antipattern for
database queries, where the result set of an initial query is iterated over and each
of its N entries triggers a subsequent short query. This usually leads to long response times of
the corresponding user request. An automated diagnosis can search for the presence
of this antipattern in the corresponding execution trace.
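A search for the N+1 antipattern in an execution trace can be sketched as follows. The sketch assumes that a trace is available as the ordered list of SQL statements issued by one user request; it replaces literals by placeholders and flags statements that repeat at least `min_repeats` times. The trace format, the function names, and the threshold are assumptions, and real APM tools work on richer trace models.

```python
import re
from collections import Counter

# Quoted strings and standalone numbers are treated as query parameters.
_LITERAL = re.compile(r"'[^']*'|\b\d+\b")


def normalize(sql):
    """Replace literals by '?' so that queries differing only in parameters match."""
    return _LITERAL.sub("?", sql).strip().lower()


def find_n_plus_one(queries, min_repeats=5):
    """Return normalized statements that occur at least min_repeats times in one trace."""
    counts = Counter(normalize(query) for query in queries)
    return {query: n for query, n in counts.items() if n >= min_repeats}


trace = ["SELECT id FROM orders WHERE customer_id = 42"] + [
    f"SELECT * FROM order_items WHERE order_id = {i}" for i in range(1, 11)
]
suspects = find_n_plus_one(trace)
print(suspects)
```

In this example the initial query returns ten orders and each of them triggers its own short query, so the repeated statement is reported with a count of 10. A typical fix is to fetch the items with a single join or an `IN` query.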
Other diagnosis techniques exist that usually leverage information
about one or more performance measures in combination with execution traces and
augmented architectural information.
The next step after having isolated the root cause of the problem is to remove it.
There are several ways to resolve the problem depending on the nature of the root
cause. Examples include bug fixes in the code, adjustments of configurations
or deployments, architectural changes, or repair or reboot of software
or hardware components. Again, the aforementioned approaches can be conducted
in a reactive or proactive manner, i.e., after a problem has occurred or when a
problem is predicted to occur.
Figure 11: Example of an N+1 problem observed in an execution trace
So far, we discussed the necessary theoretical steps that give insights into the
performance of an application system and its end users. However, without tooling to
collect the data based on the performance measures of interest, transform the data
into information, display the information to the stakeholders, and allow reasoning about
the information, implementing APM in the real world is not possible.
The market for APM tools is ever growing. The research institute Gartner estimates
that the licensing and maintenance costs of APM tools in 2016 were $3.6 billion, which
corresponds to an 18% increase over the previous year. In addition to the strong and
competitive commercial market, big tech companies and smaller, yet innovative,
companies push open-source solutions that provide a valid alternative.
In this chapter, we provide an overview of the APM tooling landscape by describing
the typical features and the typical technical stack, and by giving guidance for
the decision whether and when to use open-source or commercial APM tools.
As mentioned in Chapter 1, APM aims to achieve an adequate level of performance
during operations, which is a very high-level goal and does not dictate which features
an APM tool should provide. And indeed, it is not so easy (maybe even impossible)
to list all features.
An intuitive approach based on this book is to require the tools to support the activities
described in Chapters 2 to 5, namely the ability to collect, interpret, and visualize
the data, as well as to act on it.
The approach followed by Gartner in their annually published Magic Quadrant for
Application Performance Monitoring Suites is to evaluate APM tools based on functional
dimensions. Their latest report uses three dimensions, namely i) digital
experience monitoring (DEM), ii) application discovery, tracing and diagnostics
(ADTD), and iii) artificial intelligence for IT operations (AIOps) for applications. DEM
covers the ability to reason about the end-user perspective. ADTD covers the ability to
understand and reason about the application-internal details, including application
discovery and visualization, as well as end-to-end tracing of requests. AIOps comprises
all abilities to automatically analyze the data, e.g., based on statistical and
machine-learning techniques.
Given today’s tooling landscape, comprising full-blown APM tools and compositions
of smaller tools to achieve the overall APM goals, both approaches are helpful to
describe and evaluate the tools’ features. In particular, the commercial tools try
to cover Gartner’s dimensions and achieve this by providing the technical features
based on the chapters in this book. The approach using open-source tools is to
compose multiple tools with these features to achieve the overall goals.
APM tools usually support subsets of the previously mentioned features. In this section,
we outline the components that are usually part of an end-to-end APM solution,
which may be a stand-alone APM tool or a composition of different tools focusing on
specific aspects. The following description is based on the categories and descriptions
of the OpenAPM initiative:4
• Agent: In traditional APM architectures, the agents are responsible for collecting
the performance data from applications, processes, hosts, etc. They usually run
as part of applications or as an independent process. Together with instrumentation
libraries, they can be seen as the sources of the performance-related data,
as described in Chapter 2.
• Library: Instrumentation frameworks and libraries can be used to collect different
types of performance data from applications, including the parts deployed
on the end users’ devices. They usually represent a direct dependency of
the application being monitored and, in contrast to the agents, require (source or
byte) code changes in order to collect performance data.
• Collector: The main responsibility of collectors in the APM landscape is to
receive data from the agents or instrumentation frameworks. The received data
is usually persisted to some kind of storage or piped to another tool. Depending
on the collector type, performance data enhancement and modification is also
possible inside the collector. In addition, collectors can have other responsibilities.
For example, some expose the data access API, configuration points for the
agents, or user interfaces for the interaction with the stored data.
• Pipeline: The pipeline tools can be seen as transportation and processing pipelines
for the performance data. Similar to classic pipes, the pipeline tools receive
performance data in one format, buffer, transport, or generate additional value
on the raw data, and usually output or store it in another or the same format.
Each tool usually ingests the data from a multitude of sources and also sends the
results to different destinations.
• Storage: Persisting the collected performance data to disk (or memory) is the
responsibility of storage components. Some storage types, like time series databases
(TSDBs), were specially designed to support high volumes of metrics
being monitored. Other storage types were primarily designed for general
usage. However, they also found their place in the APM landscape, especially
NoSQL databases with their great flexibility and scalability when storing and providing
structured data.
• Visualization: As the name suggests, the visualization components offer
some kind of user interface to interact with the APM tools or data. They can cover
different APM aspects, like showing traces or application flow maps. In addition,
they can offer interfaces for configuration and setup or allow interaction with
views that have been generated on top of the performance data, for example,
health and problem views, business transaction overviews, end-user action
visualizations, views showing results of artificial intelligence working with performance
data, etc.
• Dashboarding: The dashboarding tools are a special kind of visualization tool
that displays performance data on dashboards. Dashboards often provide at-a-glance
views of KPIs (key performance indicators) relevant to a particular measure
or a set of measures. They are usually constructed from a set of graphs,
but dashboarding tools often offer additional components like single stats, heatmaps,
tables, etc.
• Alerting: The alerting tools are capable of creating alerts based on the operational
performance measures and dispatching those alerts via different notification
channels such as e-mail, SMS, applications, etc. In addition to notifications, some
alerting tools support script execution or deployment updates when an alert is
fired, thus enabling problem resolution through automation.
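How some of these components fit together can be sketched in a few lines. The classes below are hypothetical in-memory stand-ins (they are not the API of any OpenAPM tool): an agent forwards measurements, a collector enriches and persists them, a storage component keeps the series, and an alerting component checks the latest value against an assumed limit.

```python
from collections import defaultdict


class Storage:
    """In-memory stand-in for a time series database."""

    def __init__(self):
        self._series = defaultdict(list)

    def write(self, metric, timestamp, value, tags):
        self._series[metric].append((timestamp, value, tags))

    def read(self, metric):
        return list(self._series[metric])


class Collector:
    """Receives records from agents, enriches them, and persists them."""

    def __init__(self, storage, environment):
        self.storage = storage
        self.environment = environment

    def receive(self, record):
        tags = {"host": record["host"], "env": self.environment}
        self.storage.write(record["metric"], record["timestamp"], record["value"], tags)


class Agent:
    """Runs next to the application and forwards measurements to a collector."""

    def __init__(self, collector, host):
        self.collector = collector
        self.host = host

    def report(self, metric, value, timestamp):
        self.collector.receive(
            {"host": self.host, "metric": metric, "value": value, "timestamp": timestamp}
        )


class Alerting:
    """Checks the latest stored value against a limit and notifies a channel."""

    def __init__(self, storage, notify):
        self.storage = storage
        self.notify = notify

    def check(self, metric, limit):
        points = self.storage.read(metric)
        if points and points[-1][1] > limit:
            _, value, tags = points[-1]
            self.notify(f"{metric}={value} exceeds {limit} on {tags['host']}")


storage = Storage()
agent = Agent(Collector(storage, environment="prod"), host="web-1")
sent = []  # stands in for an e-mail or chat channel
alerting = Alerting(storage, notify=sent.append)

agent.report("response_time_ms", 120, timestamp=1)
alerting.check("response_time_ms", limit=500)  # below the limit, no alert
agent.report("response_time_ms", 950, timestamp=2)
alerting.check("response_time_ms", limit=500)  # above the limit, one alert
print(sent)
```

Keeping each responsibility behind its own small interface mirrors the composition idea: any of the stand-ins could be swapped for a dedicated tool without touching the others.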
The most mature and feature-rich APM tools are commercial products such as AppDynamics,
CA APM, Dynatrace, and New Relic, which are regularly reviewed by Gartner. Two
usual modes of installation are available, namely on-premise and SaaS-based (software
as a service). As an alternative to commercial solutions, open-source tools are
often used to implement the technical APM infrastructure. Mature open-source tools
for system-level monitoring have been around for many years (e.g., Nagios).
Open-source application-level monitoring tools (e.g., Kieker and inspectIT) are also available.
Other tools are available for collecting distributed execution traces (e.g., Zipkin)
and are being used together with emerging technologies for data storage and
analytics, e.g., logging infrastructure, NoSQL databases, big data. The OpenAPM
website5 provides a comprehensive list of open-source tools and how to compose
The challenges, possibilities, and requirements with respect to APM are as diverse
as the technologies, people, and processes in the corresponding companies. There
are different dimensions and aspects that influence the requirements towards an
APM tool.
First, each company has its own culture, organizational requirements, and also individual
processes in place. Examples include:
• the degree to which development and operations are integrated
• in-house development vs. third-party software
• preference or ability to use commercial vs. open-source software
• preference for on-premise vs. software-as-a-service solutions
• privacy regulations
• user and role management
Second, the variety of technology stacks is huge and differs from context to context.
Examples include:
• architectural styles (e.g., microservices)
• used programming languages and frameworks
• used operating systems
• cloud vs. in-house hosting
Finally, the range of features in the area of APM is vast, as discussed previously. Not
every company needs all of them. It is rather typical that, depending on the context,
different combinations of features are required.
With this number of aspects, the number of possible combinations and manifestations
of APM solutions is huge. This is why it is nearly impossible to provide one APM
solution that would cover all practically relevant contexts and requirements.
Commercial APM vendors do a great job in building feature-rich APM solutions
that cover many scenarios and potential contexts. Still, all of them have individual
strengths and weaknesses. Individual open-source APM tools usually have significantly
fewer features than commercial solutions; they rather have a clear individual
focus. However, open-source tools are usually designed in an open way, such that
they can be easily combined and integrated to provide sophisticated APM solutions,
again potentially requiring development effort.
Regardless of commercial or open source, APM solutions need to be selected and
tailored in a way that they match the requirements of the individual target contexts.
Recent trends in software engineering and progress on the open-source APM market
open new opportunities for composing open, tailored APM solutions.
APM can be seen as a mature topic for a certain class of application systems, e.g.,
traditional 3-tier Java EE systems. However, modern technologies and development
practices for application systems pose challenges and open promising directions for
In this chapter, we will highlight selected challenges and directions.
APM practice requires expertise and effort. For instance, expertise is required for
setting up and maintaining APM configurations (e.g., deciding which parts of the
software to instrument), as well as for the analysis and visualization of the data.
Even for experts, manual tasks can be error-prone, costly, and frustrating because
various tasks and problems are recurring. Automation of these tasks could be achieved
by formalizing the expert knowledge and using it to solve these tasks. APM
solutions nowadays come with features to minimize installation and update
procedures. On modern software stacks, APM tools can often be seamlessly integrated
in a matter of hours.
With the advent of stronger data protection regulations, APM data also needs to be
scrutinized. In particular, contextual information about requests might contain information
that allows specific individuals to be identified and might thus be affected by the EU
General Data Protection Regulation (GDPR). The “right to be forgotten” might also
lead to additional functionality in APM tools for deleting any collected information
about a single person without affecting monitoring in general, while still allowing for
detailed root cause analysis for a specific user request.
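Such a deletion function could, under simplifying assumptions, look like the sketch below. It assumes that monitoring records are dictionaries and that the personal context is confined to a known set of fields; the field names in `PERSONAL_FIELDS` and the record layout are hypothetical. The performance data itself (endpoint and duration) is kept so that aggregated monitoring is not affected.

```python
PERSONAL_FIELDS = {"user_id", "ip_address", "session_id"}  # hypothetical field names


def erase_user(records, user_id):
    """Return new records in which the given user's personal fields are removed."""
    cleaned = []
    for record in records:
        if record.get("user_id") == user_id:
            # Keep only the non-personal performance data of this record.
            record = {k: v for k, v in record.items() if k not in PERSONAL_FIELDS}
        cleaned.append(record)
    return cleaned


records = [
    {"user_id": "u1", "ip_address": "203.0.113.7", "endpoint": "/cart", "duration_ms": 120},
    {"user_id": "u2", "ip_address": "203.0.113.9", "endpoint": "/cart", "duration_ms": 340},
]
cleaned = erase_user(records, "u1")
print(cleaned)
```

Whether stripping these fields is sufficient depends on what else in a record can identify a person (for example, request parameters), which has to be decided per system.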
Regarding the root cause analysis of performance problems, today’s tools provide
an increasing amount of support to automatically pinpoint the root cause of
a problem. Further systematization of expert knowledge, machine learning approaches,
and dependency models will provide key support to improve this feature even
more. Predictive approaches are still very limited in the APM market, whereas APM
research already provides many approaches. We thus expect to see further
adoption of these algorithms in APM tools in the future.
Currently, many tools store the data in their own format. As a consequence, it is
common practice that the same analysis approach has to be re-implemented for different
APM tools. Some tools allow export of data in some machine-readable format,
such as XML, and this data can then be parsed with more or less effort. However,
there is ongoing work on developing APIs and formats for APM tool interoperability,
mainly driven by open-source APM technologies of the big tech players:
• OpenTracing provides a vendor-neutral API for distributed tracing.
• OpenCensus (Google) is a vendor-agnostic library for collecting metrics and
Modern development paradigms such as DevOps aim for frequent releases, posing
additional APM challenges, e.g., how to calculate new baselines when the time span
between releases is extremely short. Emerging architectural styles and new
multi-cloud-based delivery models, e.g., microservice and serverless architectures,
further extend the challenge of cross-platform, multi-cloud monitoring and require
the tools to cope with the different technologies and locations the software is running
on. The diversity of programming languages requires the development of dedicated
agents for collecting performance measurements.
APM provides deep insights into the run-time behavior of application systems
and their user experience, and supports the detection, diagnosis, and resolution
of incidents. In the previous chapters, we gave an overview of typical APM activities,
tooling support, and current challenges.
We would like to emphasize the following key takeaways:
• APM is not a purely technical topic anymore. Instead, it is a cross-cutting concern
involving all organizational layers and units of a company.
• APM needs to be thought of from the perspective of the user experience instead
of focusing purely on the back-end systems.
The contents of this book are based on previous contents developed for our SIGS
DATACOM poster on Application Performance Management that appeared in the
OBJEKTspektrum magazine in 2016 as well as our tutorial on Application Performance
Management presented at the ACM/SPEC International Conference on Performance
Engineering (ICPE 2016). A short paper appeared as [Christoph
Heger, André van Hoorn, Mario Mann, Dusan Okanovic...]. We have benefited from
countless discussions with our colleagues, including (in alphabetical order) Christoph
Heger, Dusan Okanovic, and Alexander Wert.
• Katrin Angerbauer, Dusan Okanovic, André van Hoorn, Christoph Heger: The
Back End is Only One Part of the Picture: Mobile-Aware Application Performance
Monitoring and Problem Diagnosis. VALUETOOLS 2017: 82-89
• Andreas Brunnert, André van Hoorn, Felix Willnecker, Alexandru Danciu, Wilhelm
Hasselbring, Christoph Heger, Nikolas Roman Herbst, Pooyan Jamshidi,
Reiner Jung, Jóakim von Kistowski, Anne Koziolek, Johannes Kroß, Simon Spinner,
Christian Vögele, Jürgen Walter, Alexander Wert: Performance-oriented
DevOps: A Research Agenda. Technical Report, SPEC Research Group, Standard
Performance Evaluation Corporation, 2015
• Christoph Heger, André van Hoorn, Mario Mann, Dusan Okanovic: Application
Performance Management: State of the Art and Challenges for the Future. ICPE
2017: 429-432
• Christoph Heger, André van Hoorn, Dusan Okanovic, Stefan Siegl, Alexander
Wert: Expert-Guided Automatic Diagnosis of Performance Problems in Enterprise
Applications. EDCC 2016: 185-188
• André van Hoorn, Jan Waller, Wilhelm Hasselbring: Kieker: a framework for
application performance monitoring and dynamic software analysis. ICPE 2012:
• Dusan Okanovic, André van Hoorn, Christoph Heger, Alexander Wert, Stefan
Siegl: Towards Performance Tooling Interoperability: An Open Format for Representing
Execution Traces. EPEW 2016: 94-108