Citation: Kratzke, N. Cloud-Native Observability: The Many-Faceted Benefits of Structured and Unified Logging—A Multi-Case Study. Future Internet 2022, 14, 274. https://doi.org/10.3390/fi14100274
Academic Editor: Seng W. Loke
Received: 24 August 2022; Accepted: 22 September 2022; Published: 26 September 2022
Copyright: © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cloud-Native Observability: The Many-Faceted Benefits of
Structured and Unified Logging—A Multi-Case Study
Nane Kratzke
Department of Electrical Engineering and Computer Science, Lübeck University of Applied Sciences,
23562 Lübeck, Germany
Abstract: Background: Cloud-native software systems often have a much more decentralized structure and many independently deployable and (horizontally) scalable components, making it more complicated to create a shared and consolidated picture of the overall decentralized system state. Today, observability is often understood as a triad of collecting and processing metrics, distributed tracing data, and logging. The result is often a complex observability system composed of three stovepipes whose data are difficult to correlate. Objective: This study analyzes whether these three historically emerged observability stovepipes of logs, metrics, and distributed traces could be handled in a more integrated way and with a more straightforward instrumentation approach. Method: This study applied an action research methodology used mainly in industry–academia collaboration and common in software engineering. The research design utilized iterative action research cycles, including one long-term use case. Results: This study presents a unified logging library for Python and a unified logging architecture that uses the structured logging approach. The evaluation shows that several thousand events per minute are easily processable. Conclusions: The results indicate that a unification of the current observability triad is possible without the necessity to develop utterly new toolchains.
Keywords: cloud-native; observability; cloud computing; logging; structured logging; logs; metrics; traces; distributed tracing; log aggregation; log forwarding; log consolidation
1. Introduction
A “crypto winter” basically means that the prices for so-called cryptocurrencies such as Bitcoin, Ethereum, Solana, etc. fell sharply on the crypto exchanges and then stayed low.
The signs were all around in 2022: the failure of the Terra Luna crypto project in May 2022
sent an icy blast through the market, then the cryptocurrency lending platform Celsius
Network halted withdrawals, prompting a sell-off that pushed Bitcoin to a 17-month low.
This study logged such a “crypto winter” on Twitter more by accident than by intention. Twitter was simply selected as an appropriate use case to evaluate a unified logging solution for cloud-native systems. The intent was to log Tweets containing stock symbols like $USD or $EUR. It turned out that most symbols used on Twitter are not related to currencies like $USD (US-Dollar) or stocks like $AAPL (Apple) but to cryptocurrencies like $BTC (Bitcoin) or $ETH (Ethereum). Although some data of this 2022 crypto winter will be presented, this paper puts the methodical part into focus and addresses how such and further data could be collected more systematically in distributed cloud-native applications. The paper will at least show that even complex observability of distributed systems can be reached simply by logging events to stdout.
Observability measures how well a system’s internal state can be inferred from knowledge of its external outputs. The concept of observability was initially introduced by the Hungarian-American engineer Rudolf E. Kálmán for linear dynamical systems [1,2]. However, observability also applies to information systems and is of particular interest
to fine-grained and distributed cloud-native systems that come with their very own set of observability challenges.
Traditionally, the responsibility for observability is (was?) with operations (Ops).
However, this evolved into a collection of different technical methods and a culture for
collaboration between software development (Dev) and IT operations (Ops). With this
emergence of DevOps, we can observe a shift of Ops responsibilities to developers. Thus,
observability is evolving more and more into a Dev responsibility. Observability should ide-
ally already be considered during the application design phase and not be regarded as some
“add-on” feature for later expansion stages of an application. The current discussion about
observability began well before the advent of cloud-native technologies like Kubernetes.
A widely cited blog post by Cory Watson from 2013 shows how engineers at Twitter looked for ways to monitor their systems as the company moved from a monolithic to a distributed architecture [3–5]. One of the ways Twitter did this was by developing a command-line tool that engineers could use to create their own dashboards to keep track of the charts they were creating. While Continuous Integration and Continuous Delivery/Deployment (CI/CD) tools and container technologies often bridge Dev and Ops in one direction, observability solutions close the loop in the opposite direction, from Ops to Dev [4]. Observability is thus the basis for data-driven software development (see Figure 1 and [6]). As developments around cloud(-native) computing progressed, more and more engineers began to “live in their dashboards”. They learned that it is not enough to collect and monitor data points but that it is necessary to address this problem more systematically.
Figure 1. Observability can be seen as a feedback channel from Ops to Dev (adopted from [4,6]).
2. Problem Description
Today, observability is often understood as a triad. Observability of distributed in-
formation systems is typically achieved through the collection and processing of metrics
(quantitative data primarily as time-series), distributed tracing data (execution durations of
complex system transactions that flow through services of a distributed system), and log-
ging (qualitative data of discrete system events often associated with timestamps but
encoded as unstructured strings). Consequently, three stacks of observability solutions
have emerged; the following summarizes the current state of the art.
Metrics: Here, quantitative data are often collected in time series, e.g., how many requests a system is currently processing. The metrics technology stack is often characterized by tools such as Prometheus and Grafana.
Distributed tracing involves following the path of transactions along the components of a distributed system. The tracing technology stack is characterized by tools such as Zipkin or Jaeger, and the technologies are used to identify and optimize particularly slow or error-prone substeps of distributed transaction processing.
Logging is probably as old as software development itself, and many developers, because of the log ubiquity, are unaware that logging should be seen as part of holistic observability. Logs are usually stored in so-called log files. Primarily qualitative events are logged (e.g., user XYZ logs in/out). An event is usually attached to a log file as a text line. Often, the implicit and historically justifiable assumption prevails with developers that these log files are read and evaluated primarily by administrators (thus humans). However, this is hardly the case anymore. It is becoming increasingly common for the contents of these log files to be forwarded to a central database through “log forwarders” so that they can be evaluated and analyzed centrally. The technology stack is often characterized by tools such as Fluentd, FileBeat, and LogStash for log forwarding, databases such as ElasticSearch, Cassandra, or simply S3, and user interfaces such as Kibana.
Incidentally, all three observability pillars have in common that software to be de-
veloped must be somehow instrumented. This instrumentation is normally done using
programming language-specific libraries. Developers often regard distributed tracing in-
strumentation in particular as time-consuming. In addition, which metric types (counter,
gauge, histogram, history, and more) are to be used in metric observability solutions such
as Prometheus often depends on Ops experience and is not always immediately apparent
to developers. Certain observability hopes fail simply because of wrongly chosen metric
types. Only system metrics such as Central Processing Unit (CPU), memory, and storage
utilization can be easily captured in a black-box manner (i.e., without instrumentation
in the code). However, these data are often only of limited use for the functional assess-
ment of systems. For example, CPU utilization provides little information about whether
conversion rates in an online store are developing in the desired direction.
Thus, current observability solutions are often based on these three stovepipes for logs, metrics, and traces. The result is an application surrounded by a complex observability system whose isolated datasets can be difficult to correlate. Figure 2 focuses on the application (i.e., the object to be monitored) and raises the question of whether it is justified to use three complex subsystems and three types of instrumentation, which always means three times the instrumentation and data analysis effort for isolated data silos.
The often-used tool combination of ElasticSearch, LogStash, and Kibana is used for logging and has even been given a catchy acronym: ELK-Stack [7,8]. The ELK stack can be used to collect metrics and, using the Application Performance Management (APM) plugin [9], also for distributed tracing. Thus, at least for the ELK stack, the three stovepipes are not clearly separable or disjoint. The separateness is somewhat historically “suggested” rather than technologically given. Nevertheless, this tripartite division into metrics, tracing, and logging is very formative for the industry, as shown, for example, by the OpenTelemetry project [10]. OpenTelemetry is currently in the incubation stage at the Cloud Native Computing Foundation and provides a collection of standardized tools, Application Programming Interfaces (APIs), and Software Development Kits (SDKs) to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to analyze the performance and behaviour of software systems. OpenTelemetry thus standardizes observability but hardly aims to overcome the columnar separation into metrics, tracing, and logging.
Figure 2. An application is quickly surrounded by a complex observability system when metrics, tracing, and logs are captured with different observability stacks.
In past and current industrial action research [4,6,11,12], I came across various cloud-native applications and corresponding engineering methodologies like the 12-factor app (see Section 4.1). However, this previous research was not primarily concerned with observability or instrumentation per se. In particular, no instrumentation libraries were developed, as was done in this research. Instrumentation and observability were—as so often—only used in the context of evaluation or assessment of system performance. The instrumentation usually followed the analysis stack used. Developers who perform distributed tracing use distributed tracing libraries for instrumentation. Developers who perform metric instrumentation use metric libraries. Those who log events use logging libraries. This instrumentation approach is so obvious that hardly any developer thinks about it. However, the result is disjoint observability data silos. This paper takes up this observation and asks whether uniform instrumentation helps avoid these observability data silos. In various projects, we have used instrumentation oriented towards the least complex case, logging, and have only slightly extended it for metrics and distributed tracing.
We learned that the discussion around observability is increasingly moving beyond
these three stovepipes and taking a more nuanced and integrated view. There is a growing
awareness of integrating and unifying these three pillars, and more emphasis is being
placed on analytics.
Each of the three pillars of observability (logs, metrics, traces) is little more than a specific application of time series analysis. Therefore, the obvious question is how to instrument systems to capture events in an unobtrusive way so that operation platforms can efficiently feed them into existing time series analysis solutions [13]. In a perfect world, developers should not have to worry too much about such kind of instrumentation, whether it concerns qualitative events, quantitative metrics, or tracing data from transactions moving along the components of distributed systems.
In statistics, time series analysis deals with the inferential statistical analysis of time series. It is a particular form of regression analysis. The goal is often the prediction of trends (trend extrapolation) regarding their future development. Another goal might be detecting time series anomalies, which might indicate unwanted system behaviours. A time series is a chronologically ordered sequence of values or observations in which the ordering of the observed characteristic values follows necessarily from the course of time (e.g., stock prices, population development, weather data, but also typical metrics and events occurring in distributed systems, like CPU utilization or login attempts of users).
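Where metrics are consolidated as time series, simple statistical checks already go a long way. The following sketch is a hedged illustration only (it is not part of the study's tooling; the function and the sample values are hypothetical) of how anomalies in a recorded metric series could be flagged with a z-score test:

import statistics

def detect_anomalies(samples, threshold=2.0):
    # Return indices of samples deviating more than `threshold` standard
    # deviations from the mean of the series.
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples) or 1.0  # guard against zero deviation
    return [i for i, v in enumerate(samples) if abs(v - mean) / stdev > threshold]

cpu_utilization = [0.31, 0.29, 0.33, 0.30, 0.95, 0.32]  # hypothetical samples
print(detect_anomalies(cpu_utilization))  # -> [4], the outlier sample

In practice, such analyses would run in the analytical database or dashboard layer rather than in the instrumented service itself.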
The research question arises whether these three historically emerged observability stovepipes of logs, metrics, and distributed traces could be handled in a more integrated way and with a more straightforward instrumentation approach. The results of this action research study show that this unification potential could be surprisingly easy to realize if we consistently exploit the shared characteristic of time-series analysis in all three stovepipes. This paper presents the followed research methodology in Section 3 and its results in Section 4 (including a logging prototype in Section 4.4 as the main contribution of this paper to the field). The evaluation of this logging prototype is presented in Section 5. A critical discussion follows in Section 6. Furthermore, the study presents related work in Section 7 and concludes its findings as well as promising future research directions in Section 8.
3. Methodology
This study followed the action research methodology as a proven and well-established research methodology model for industry–academia collaboration in the software engineering context to analyze the research question mentioned above. Following the recommendations of Petersen et al. [14], a research design was defined that applied iterative action research cycles (see Figure 3):
1. Diagnosis (Diagnosing according to [14]);
2. Prototyping (Action planning, design and taking according to [14]);
3. Evaluation, including a possibly required redesign (Evaluation according to [14]);
4. Transfer of learning outcomes to further use cases (Specifying learning according to [14]).
Figure 3. Action research methodology of this study.
With each of the following use cases, insights were transferred from the previous use case into a structured logging prototype (see Figure 3). The following use cases (UC) have been studied and evaluated.
Use Case 1: Observation of qualitative events occurring in an existing solution (online code editor; https://codepad.th-luebeck.dev (accessed on 20 September 2022); this use case was inspired by our research [15]);
Use Case 2: Observation of distributed events along distributed services (distributed tracing in an existing solution of an online code editor, see UC1);
Use Case 3: Observation of quantitative data generated by a technical infrastructure (Kubernetes platform; this use case was inspired by our research [11,12,16]);
Use Case 4: Observation of a massive online event stream to gain experiences with high-volume event streams (we used Twitter as a data source and tracked worldwide occurrences of stock symbols; this use case was inspired by our research [17]).
4. Results of the Software Prototyping
The analysis of cloud-native methodologies like the 12-factor app [18] has shown that, to build observability, one should take a more nuanced and integrated view to integrate and unify these three pillars of metrics, traces, and logs to enable more agile and convenient analytics in the feedback information flow of DevOps cycles (see Figure 1). Two aspects that gained momentum in cloud-native computing are of interest:
Recommendations on how to handle log forwarding and log consolidation in cloud-native applications;
Recommendations to apply structured logging.
Because both aspects deeply guided the implementation of the logging prototype, they will be explained in more detail to provide the reader with the necessary context.
4.1. Twelve-Factor Apps
The 12-factor app is a method [18] for building software-as-a-service applications that pays special attention to the dynamics of organic growth of an application over time, the dynamics of collaboration between developers working together on a codebase, and avoiding the cost of software erosion. At its core, 12 rules (factors) should be followed to develop well-operational and evolutionarily developable distributed applications. This methodology harmonizes very well with microservice architecture approaches [3,19–23] and cloud-native operating environments like Kubernetes [24], which is why the 12-factor methodology is becoming increasingly popular. Incidentally, the 12-factor methodology does not contain any factor explicitly referring to observability, certainly not in the triad of metrics, tracing, and logging. However, factor XI recommends how to handle logging:
Logs are the stream of aggregated events sorted by time and summarized from the output streams of all running processes and supporting services. Logs are typically a text format with one event per line. [...] A twelve-factor app never cares about routing or storing its output stream. It should not attempt to write to or manage log files. Instead, each running process writes its stream of events to stdout. [...] On staging or production deploys, the streams of all processes are captured by the runtime environment, combined with all other streams of the app, and routed to one or more destinations for viewing or long-term archiving. These archiving destinations are neither visible nor configurable to the app—they are managed entirely from the runtime environment.
4.2. From Logging to Structured Logging
The logging instrumentation is quite simple for developers and works mostly in a programming-language-specific way, but basically according to the following principle, illustrated here in Python. A logging library must often be imported, defining so-called log levels such as DEBUG, INFO, WARNING, ERROR, FATAL, and others. While the application is running, a log level is usually set via an environment variable, e.g., INFO. All log calls at or above this level are then written to a log file.
1 import logging
2 logging.basicConfig(filename="example.log", level=logging.DEBUG)
3 logging.debug("Performing user check")
4 user = 'Nane Kratzke'
5 logging.info(f'User {user} tries to log in.')
6 logging.warning(f'User {user} not found')
7 logging.error(f'User {user} has been banned.')
For example, line 5 would create the following entry in a log file:
INFO 2022-01-27 16:17:58 - User Nane Kratzke tries to log in
In a 12-factor app, this logging would be configured so that events are written directly to stdout (console). The runtime environment (e.g., Kubernetes with a FileBeat service installed) then routes the log data to the appropriate database, taking work away from the developer that they would otherwise have to invest in log processing. This type of logging
is well supported across many programming languages and can be consolidated excellently
with the ELK stack (or other observability stacks).
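As a minimal sketch (assuming only the Python standard library; the format string is illustrative), redirecting the standard logging module from a log file to stdout is a one-line configuration change:

import logging
import sys

# 12-factor style: write log events to stdout and leave routing to the
# runtime environment (e.g., Kubernetes with a log forwarder such as FileBeat).
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(levelname)s %(asctime)s - %(message)s",
)
logging.info("User Nane Kratzke tries to log in")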
Logging (unlike distributed tracing and metrics collection) is often not even perceived
as (complex) instrumentation by developers. Often, it is done on their own initiative.
However, one can systematize this instrumentation somewhat and extend it to so-called
“structured logging”. Again, the principle is straightforward. One simply does not log lines
of text like
INFO 2022-01-27 16:17:58 - User Nane Kratzke tries to log in
but, instead, the same information in a structured form, e.g., using JSON:
{"loglevel": "info", "timestamp": "2022-01-27 16:17:58", "event": "Login", "user": "Nane Kratzke", "result": "success"}
In both cases, the text is written to the console. In the second case, however, a struc-
tured text-based data format is used that is easier to evaluate. In the case of a typical
logging statement like "User Max Mustermann tries to log in", the text must first be analyzed
to determine the user. This text parsing is costly on a large scale and can also be very
computationally intensive and complex if there is plenty of log data in a variety of formats
(which is the common case in the real world).
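The following sketch illustrates this difference; it is a hedged example (the regular expression and field names are illustrative, not taken from the study):

import json
import re

unstructured = "INFO 2022-01-27 16:17:58 - User Nane Kratzke tries to log in"
structured = ('{"loglevel": "info", "timestamp": "2022-01-27 16:17:58", '
              '"event": "Login", "user": "Nane Kratzke", "result": "success"}')

# Unstructured: the user has to be recovered by format-specific text parsing.
match = re.search(r"User (?P<user>.+?) tries to log in", unstructured)
user_from_text = match.group("user") if match else None

# Structured: the user is directly addressable as a field.
user_from_json = json.loads(structured)["user"]

print(user_from_text, user_from_json)  # both print: Nane Kratzke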
However, in the case of structured logging, this information can be easily extracted from the JavaScript Object Notation (JSON) data field "user". In particular, more complex evaluations become much easier with structured logging as a result. However, the instrumentation does not become significantly more complex, especially since there are logging libraries for structured logging. In the logging prototype log12 of this study, the logging looks like this:
import log12
[...]
log12.error('Login', user=user, result='Not found', reason='Banned')
The resulting log files are still readable for administrators and developers (even if a bit more unwieldy) but much better processable and analyzable by databases such as ElasticSearch. Structured logging can thus also be used for the recording of quantitative metrics:
import log12
[...]
log12.info('Open requests', requests=len(requests))

{"event": "Open requests", "requests": 42}
Furthermore, this structured logging approach can also be used to create tracings.
In distributed tracing systems, a trace identifier (ID) is created for each transaction that
passes through a distributed system. The individual steps are so-called spans. These are
also assigned an identifier (span ID). The span ID is then linked to the trace ID, and the
runtime is measured and logged. In this way, the time course of distributed transactions
can be tracked along the components involved, and, for example, the duration of individual
processing steps can be determined.
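A hedged sketch of this idea (not the log12 implementation; identifiers and field names are illustrative) shows how a span can be expressed as just another structured log event carrying trace and span identifiers plus a measured duration:

import contextlib
import json
import time
import uuid

@contextlib.contextmanager
def span(event, trace_id=None, parent_span_id=None, **fields):
    # Run a code block as a traced span and emit it as one structured event.
    record = {
        "event": event,
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_span_id": parent_span_id,
        **fields,
    }
    start = time.time()
    try:
        yield record
    finally:
        record["duration_ms"] = round((time.time() - start) * 1000, 3)
        print(json.dumps(record))

with span("Checkout") as parent:
    with span("Payment", trace_id=parent["trace_id"],
              parent_span_id=parent["span_id"], amount=42.0):
        time.sleep(0.01)  # stands in for the actual processing step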
4.3. Resulting and Simplified Logging Architecture
Thus, the two principles of printing logs simply to standard output (stdout) and logging in a structured, text-based data format are consistently applied. The resulting observability system complexity is thus reduced from Figure 2 to Figure 4 because all system components can emit log, metric, and trace information in the same style, which can be routed seamlessly by a log forwarder provided by the operation platform (already existing technology) to a central analytical database.
Figure 4. An observability system consistently based on structured logging with significantly reduced complexity.
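The core idea behind this architecture can be condensed into a few lines. The following sketch is illustrative only (event and field names are hypothetical): every signal type, whether qualitative event, metric, or trace span, becomes one JSON line on stdout, which any existing log forwarder can pick up unchanged:

import json
import sys
import time

def emit(event, **fields):
    # One structured line on stdout per observability signal.
    record = {"timestamp": time.time(), "event": event, **fields}
    print(json.dumps(record), file=sys.stdout, flush=True)

emit("Login", user="Nane Kratzke", result="success")                       # qualitative event
emit("Open requests", requests=42)                                         # quantitative metric
emit("Span completed", trace_id="a1b2", span_id="c3d4", duration_ms=12.7)  # tracing information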
4.4. Study Outcome: Unified Instrumentation via a Structured Logging Library (Prototype)
This paper will briefly explain below the way to capture events, metrics, and traces using the logging prototype that emerged. The prototype library log12 was developed in Python 3 but can be implemented in other programming languages analogously.
log12 automatically creates additional key–value attributes for each event, such as a unique identifier (that is used to relate child events to parent events and even remote events in distributed tracing scenarios) and start and completion timestamps that can be used to measure the runtime of events (known from distributed tracing libraries but not common for logging libraries). It is explained:
how to create a log stream;
how an event in a log stream is created and logged;
how a child event can be created and assigned to a parent event (to trace and record runtimes of more complex and dependent chains of events within the same process);
and how to make use of the distributed tracing features to trace events that pass through a chain of services in a distributed service-of-services system.
The following lines of code create a log stream with the name “logstream” that is
logged to stdout, see Listing 1:
Listing 1. Creating an event log stream in log12.
import log12
log = log12.logging("logstream",
    general="value", tag="foo", service_mark="test"
)
Each event and child events of this stream are assigned a set of key–value pairs:
general=“value”
tag=“foo”
service_mark=“test”
These log-stream-specific key–value pairs can be used to define selection criteria in an-
alytical databases like ElasticSearch to filter events of a specific service only. The following
lines of code demonstrate how to create a parent event and child events, see Listing 2.
Listing 2. Event logging in log12 using blocks as structure.
# Log events using the with clause
with log.event("Test", hello="World") as event:
    event.update(test="something")
    # adds event-specific key-value pairs to the event

    with event.child("Subevent 1 of Test") as ev:
        ev.update(foo="bar")
        ev.error("Catastrophe")
        # Explicit call of log (here on error level)

    with event.child("Subevent 2 of Test") as ev:
        ev.update(bar="foo")
        # Implicit call of ev.info("Success") (at block end)

    with event.child("Subevent 3 of Test") as ev:
        ev.update(bar="foo")
        # Implicit call of ev.info("Success") (at block end)
Furthermore, it is possible to log events in the event stream without the block style,
see Listing 3. That might be necessary for programming languages that do not support
closing resources (here a log stream) at the end of a block. In this case, programmers are
responsible for closing events using the .info(), .warn(), and .error() log levels.
Listing 3. Event logging in log12 without blocks.
# To log events without with-blocks is possible as well.
ev = log.event("Another test", foo="bar")
ev.update(bar="foo")
child = ev.child("Subevent of Another test", foo="bar")
ev.info("Finished")
# <= However, then you are responsible to log events explicitly.
# If parent events are logged, all subsequent child events
# are assumed to have closed successfully as well.
Using this type of logging to forward events along Hypertext Transfer Protocol (HTTP) requests is also possible. This usage of HTTP headers is the usual method in distributed tracing. Two main capabilities are required for this [25]. First, it must be possible to extract header information received by an HTTP service process. Secondly, it must be possible to inject the tracing information into follow-up upstream HTTP requests (in particular, the trace ID and span ID of the process initiating the request).
Listing 4 shows how log12 supports this with an extract attribute at event creation and an inject method of the event that extracts the relevant key–value pairs from the event so that they can be passed as header information along an HTTP request.
Listing 4. Extraction and injection of tracing headers in log12.
import log12
import requests              # To generate HTTP requests
from flask import request    # To demonstrate header extraction

with log.event("Distributed tracing", extract=request.headers) as ev:

    # Here is how to pass tracing information along remote calls
    with ev.child("Task 1") as event:
        response = requests.get(
            "https://qr.mylab.th-luebeck.dev/route?url=https://google.com",
            headers=event.inject()
        )
        event.update(length=len(response.text), status=response.status_code)
5. Evaluation of the Logging Prototype
The study evaluated the software prototype in the defined use cases to determine its
suitability for capturing qualitative events, quantitative metrics, and traces in distributed
systems. The evaluation also performed long-term recordings of high-volume event streams
in the sense of stress tests:
The study designed the use cases 1 and 2 mainly to evaluate the instrumentation of
qualitative system events;
Use case 3 was primarily used to capture quantitative metrics that often occur in IT
infrastructures or platforms and are essential to multi-level observability;
Use case 4 was used to monitor systems that were intentionally not under the direct
control of the researchers and, therefore, could not be instrumented directly. Further-
more, the use case was intended to provide insight into both long-term detection and
the detection of high-volume event streams.
5.1. Evaluation of Use Cases 1 and 2 (Event-Focused Observation of a Distributed Service)
Use Cases 1 and 2: Codepad is an online coding tool to quickly share short code snippets in online and offline teaching scenarios. It was introduced during the Corona pandemic shutdowns to share short code snippets mainly in online educational settings for 1st or 2nd semester computer science students. Meanwhile, the tool is used in presence lectures and labs as well. The reader is welcome to try out the tool at https://codepad.th-luebeck.dev (accessed on 20 September 2022). This study used the Codepad tool in steps 1, 2, 3, and 4 of its action research methodology as an instrumentation use case (see Figure 3) to evaluate the instrumentation of qualitative system events according to Section 4.4. Figure 5 shows the Web-UI on the left and the resulting dashboard on the right. In a transfer step (steps 12, 13, 14, and 15 of the action research methodology, see Figure 3), the same product was used to evaluate distributed tracing instrumentation (not covered in detail by this report).
Figure 5. Use Cases 1 and 2: Codepad is an online coding tool to quickly share short code snippets in online and offline teaching scenarios—on the left, the Web-UI; on the right, the Kibana dashboard used for observability in this study. Codepad was used as an instrumentation object of investigation.
5.2. Evaluation of Use Case 3 (Metrics-Focused Observation of Infrastructure)
Use Case 3 (steps 5, 6, 7, and 8 of the research methodology; Figure 3) observed an institute's infrastructure, the so-called myLab infrastructure. myLab (available online: https://mylab.th-luebeck.de, accessed on 20 September 2022) is a virtual laboratory that can be used by students and faculty staff to develop and host web applications. This use case was chosen to demonstrate that it is possible to collect primarily metrics-based data over a long term using the same approach as in Use Case 1. A pod tracked mainly the resource consumption of various differing workloads deployed by more than 70 student web projects of different university courses. To observe this resource consumption, the pod simply ran periodically
kubectl top nodes;
kubectl top pods --all-namespaces
against the cluster. This observation pod parsed the output of both shell commands and printed the parsed results in the structured logging approach presented in Section 4.4 (a sketch of such an observer is shown below). Figure 6 shows the resulting Kibana dashboard for demonstration purposes.
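The following hedged sketch assumes the log12 API from Section 4.4 and the standard tabular output of kubectl top; it is not the exact code that was run in the study, and the stream name is hypothetical:

import subprocess
import log12

log = log12.logging("mylab observer")  # hypothetical stream name

def observe_nodes():
    # Run `kubectl top nodes`, parse its tabular output, and emit one
    # structured event per node (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).
    output = subprocess.run(["kubectl", "top", "nodes"],
                            capture_output=True, text=True, check=True).stdout
    for line in output.splitlines()[1:]:  # skip the header row
        name, cpu, cpu_pct, mem, mem_pct = line.split()[:5]
        with log.event("node metrics", node=name) as event:
            event.update(cpu=cpu, cpu_percent=cpu_pct,
                         memory=mem, memory_percent=mem_pct)

observe_nodes()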
Figure 6. Use Case 3: The dashboard of the Kubernetes infrastructure under observation (myLab).
5.3. Evaluation of Use Case 4 (Long-Term and High-Volume Observation)
Use Case 4 (steps 9, 10, and 11 of the research methodology; Figure 3) left our own ecosystem and observed the public Twitter event stream as a representative of a high-volume and long-term observation of an external system, i.e., a system that was intentionally not under the direct administrative control of the study investigators. Use Case 4 was designed as a two-phase study.
5.3.1. Screening Phase of Use Case 4
The first screening phase was designed to gain experience in logging high-volume event streams and to provide necessary features and performance optimizations to the structured logging library prototype. The screening phase was designed to screen the complete and representative Twitter traffic as a kind of "ground truth". We were interested in the distribution of languages and stock symbols in relation to the general Twitter "background noise". This screening phase lasted from 20 January 2022 to 30 January 2022 and identified the most used stock symbols (see Figure 7).
Figure 7. Recorded events (screening phase of use case 4).
5.3.2. Long-Term Evaluation Phase of Use Case 4
A long-term recording was then done as a second, long-term evaluation phase and was used to track and record the most frequently used stock symbols identified in the screening phase. This evaluation phase lasted from February 2022 until the middle of August 2022. In this evaluation phase, just one infrastructure downtime occurred due to an electricity shutdown at the author's institute. However, this downtime was not due to or related to the presented unified logging stack (see Figure 8).
Figure 8. Recorded events (long-term evaluation phase of use case 4).
5.3.3. The Instrumentation Approach Applied in Both Phases
The recording was done using the following source code (see Listing 5), compiled into a Docker container, that was executed on the Kubernetes cluster that was logged in Use Cases 1, 2, and 3. FileBeat was used as a log forwarding component to a background ElasticSearch database. The resulting event log was analyzed and visualized using Kibana. Kibana was also used to collect the data in the form of CSV files for the screening and the evaluation phase. Figures 7, 9, and 10 have been compiled from that data. This setting followed exactly the unified and simplified logging architecture presented in Figure 4.
Listing 5. The logging program used to record Twitter stock symbols from the public Twitter Stream API.
import log12, tweepy, os

KEY = os.environ.get('CONSUMER_KEY')
SECRET = os.environ.get('CONSUMER_SECRET')
TOKEN = os.environ.get('ACCESS_TOKEN')
TOKEN_SECRET = os.environ.get('ACCESS_TOKEN_SECRET')

LANGUAGES = [l.strip() for l in os.environ.get('LANGUAGES', '').split(',')]
TRACK = [t.strip() for t in os.environ.get('TRACKS').split(',')]

log = log12.logging('twitter stream')

class Twista(tweepy.Stream):

    def on_status(self, status):
        with log.event('tweet', tweet_id=status.id_str,
                       user_id=status.user.id_str, lang=status.lang
                       ) as event:
            kind = 'status'
            kind = 'reply' if status._json['in_reply_to_status_id'] else kind
            kind = 'retweet' if 'retweeted_status' in status._json else kind
            kind = 'quote' if 'quoted_status' in status._json else kind
            event.update(lang=status.lang, kind=kind, message=status.text)

            with event.child('user') as usr:
                name = status.user.name if status.user.name else 'unknown'
                usr.update(lang=status.lang, id=status.user.id_str,
                           name=name,
                           screen_name=f'@{status.user.screen_name}',
                           message=status.text,
                           kind=kind
                           )

            for tag in status.entities['hashtags']:
                with event.child('hashtag') as hashtag:
                    hashtag.update(lang=status.lang,
                                   tag=f"#{tag['text'].lower()}",
                                   message=status.text,
                                   kind=kind
                                   )

            for sym in status.entities['symbols']:
                with event.child('symbol') as symbol:
                    symbol.update(lang=status.lang,
                                  symbol=f"${sym['text'].upper()}",
                                  message=status.text,
                                  kind=kind
                                  )
                    symbol.update(screen_name=f'@{status.user.screen_name}')

            for user_mention in status.entities['user_mentions']:
                with event.child('mention') as mention:
                    mention.update(lang=status.lang,
                                   screen_name=f"@{user_mention['screen_name']}",
                                   message=status.text,
                                   kind=kind
                                   )

record = Twista(KEY, SECRET, TOKEN, TOKEN_SECRET)
if LANGUAGES:
    record.filter(track=TRACK, languages=LANGUAGES)
else:
    record.filter(track=TRACK)
5.3.4. Observed and Recorded Twitter Behaviour in Both Phases of Use Case 4
According to Figures 7 and 8, just every 100th observed event in the screening phase was a stock symbol. That is simply the "ground truth" on Twitter: if one observes the public Twitter stream without any filter, that is what one gets. Thus, the second evaluation phase recorded a very specific "filter bubble" of the Twitter stream. The reader should be aware that the data presented in the following are clearly biased and not a representative Twitter event stream; they form a clearly stock-market-focused subset or, to be even more precise, a cryptocurrency-focused subset, because almost all stock symbols on Twitter are related to cryptocurrencies.
It is possible to visualize the resulting effects using the recorded data. Figure 9 shows the difference in language distributions between the screening phase (unfiltered ground truth) and the evaluation phase (activated symbol filter). While in the screening phase English (en), Spanish (es), Portuguese (pt), and Turkish (tr) are responsible for more than 3/4 of all traffic, in the evaluation phase almost all recorded Tweets are in English. Thus, on Twitter, the most stock-symbol-related language is clearly English.
Figure 9. Observed languages (screening and evaluation phase of Use Case 4).
Although the cryptocurrency logging was used mainly as a use case for technical
evaluation purposes of the logging library prototype, some interesting insights could be
gained. For example, although Bitcoin (BTC) is likely the most prominent cryptocurrency,
it is by far not the most frequently used stock symbol on Twitter. The most prominent stock
symbols on Twitter are:
ETH: Ethereum cryptocurrency;
SOL: Solana cryptocurrency;
BTC: Bitcoin cryptocurrency;
LUNA: Terra Luna cryptocurrency (replaced by a new version after the crash in
May 2022);
BNB: Binance Coin cryptocurrency.
Furthermore, we can see interesting details in trends (see Figure 10).
The ETH usage on Twitter seems to decline throughout the observed period;
The SOL usage is, on the contrary, increasing, although we observed a sharp decline
in July.
The LUNA usage has a clear peak that correlates with the LUNA cryptocurrency crash
in the middle of May 2022 (this crash was heavily reflected in the investor media).
The Twitter usage was not correlated with the currency rates on cryptocurrency stock markets. However, changes in usage patterns of stock market symbols might be of interest to cryptocurrency investors as indicators to observe. As this study shows, these
changes can be easily tracked using structured logging approaches. Of course, this can be
transferred to other social media streaming or general event streaming use cases like IoT
(Internet of Things) as well.
Figure 10. Recorded symbols per day (evaluation phase of Use Case 4).
6. Discussion
This style of unified and structured observability was successfully evaluated on several use cases that made use of a FileBeat/ElasticSearch-based observability stack. However, other observability stacks that can forward and parse structured text in a JSON format will likely show the same results. The evaluation included a long-term test over more than six months for a high-volume evaluation use case.
First, it could be proven that such a type of logging can easily be used to perform classic metrics collection. For this purpose, black-box metrics such as CPU, memory, and storage for the infrastructure (nodes) but also the "payload" (pods) were successfully collected and evaluated in several Kubernetes clusters (see Figure 6).
Second, a high-volume use case was investigated and analyzed in depth. Here, all English-language tweets on the public Twitter stream were logged. About 1 million events per hour were logged over a week and forwarded to an ElasticSearch database using the log forwarder FileBeat. Most systems will generate far fewer events (see Figure 7).
In addition, the prototype logging library log12 is meanwhile used in several internal systems, including web-based development environments, QR code services, and e-learning systems, to record access frequencies to learning content and to study learning behaviours of students.
6.1. Lessons Learned
All use cases have shown that structured logging is easy to instrument and harmonizes well with existing observability stacks (esp. Kubernetes, FileBeat, ElasticSearch, Kibana). However, some aspects should be considered:
1. It is essential to apply structured logging, since this can be used to log events, metrics, and traces in the same style.
2. Very often, only error-prone situations are logged. However, if you want to act in the sense of DevOps-compliant observability, you should also log normal—completely regular—behaviour. DevOps engineers can gain many insights from how normal users use systems in standard situations. Thus, the log level should be set to INFO, not WARNING, ERROR, or below.
3. Cloud-native system components should rely on the log forwarding and log aggregation of the runtime environment. Never implement this on your own. You will duplicate logic and end up with complex and maybe incompatible log aggregation systems.
4. To simplify analysis for engineers, one should push key–value pairs of parent events down to child events. This logging approach simplifies analysis in centralized log analysis solutions—it simply reduces the need to derive event contexts that might be difficult to deduce in JSON document stores. However, this comes at the cost of more extensive log storage.
5. Do not collect aggregated metrics data. The aggregation (mean, median, percentile, standard deviation, sum, count, and more) can be done much more conveniently in the analytical database. The instrumentation should focus on recording metrics data in a point-in-time style (see the sketch after this list). According to our developer experience, developers are happy to be authorized to log only such simple metrics, especially when there is not much background knowledge in statistics.
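To illustrate the last point, the following hedged sketch (stream, event, and field names are hypothetical) records a point-in-time observation with log12 and leaves any aggregation to the analytical database:

import log12

log = log12.logging("webshop")  # hypothetical stream name

# Preferred: record the raw, point-in-time observation ...
with log.event("Open requests") as event:
    event.update(requests=42)

# ... and let the analytical database compute means, percentiles, or sums.
# Avoid logging values that the service has already aggregated itself, e.g.:
# event.update(mean_requests_last_hour=37.5)   # discouraged by lesson 5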
6.2. Threats to Validity and Limitations of the Study Design
Action research is prone to drawing incorrect or non-generalizable conclusions. Logically, the significance is consistently the highest within the considered use cases. In order to minimize such risks, the reader should consider the following threats to validity (see [26,27]):
Construct validity refers to the question of whether the employed measures appropriately reflect the constructs they represent.
Internal validity refers to the question of whether observed relationships are due to a cause–effect relationship. Thus, it is only relevant for case studies in which a cause–effect is to be determined [28], which is not the case for the presented software prototype study.
External validity refers to the question of whether the findings of the case study can be generalized.
Reliability or experimental validity refers to the question of whether the study can be repeated with the same results.
Therefore, considerations on construct and external validity are reported, as well as considerations on repeatability.
6.2.1. Considerations on Construct Validity
In order to draw generalizable conclusions, this study defined use cases in such a way
that intentionally different classes of telemetry data (logs, metrics, traces) were considered.
It should be noted that the study design primarily considered logs and metrics but traces
only marginally. Traces were not wholly neglected, however, but were analyzed less
intensively. However, the study design deliberately covered logs, traces, and metrics to see
if they could be recorded using a consistent instrumentation approach. This instrumentation
approach was designed to generate a structured time-series dataset that can be consolidated
and analyzed using existing observability tool stacks.
The study evaluated the resulting software prototype in four use cases to determine its suitability for capturing qualitative events, quantitative metrics, and traces in distributed systems. The evaluation also performed long-term recordings of high-volume event streams in the sense of stress tests. Although this setting was constructed to cover a broad range of web-service and web-application domains, the study outcomes should not be used to draw any conclusions outside the scope of this setting. For instance, the reader should not make any assumptions on the applicability in hard real-time scenarios where microsecond latencies often have to be considered.
6.2.2. Considerations on External Validity
This study did the proof of concept by creating a software prototype for the Python
3 programming language on the instrumentation side and using the ELK stack on the
analysis side. Nevertheless, further efforts are needed for transferring the results to other
programming languages and observability stacks. Since both basic principles (structured
logging and time-series databases and analyses) can be assumed to be known and mastered,
the reader can expect no significant difficulties of a technical nature here.
The long-term acquisition was performed with a high-volume use case to cover certain stress-test aspects. However, the reader must be aware that the screening phase of Use Case 4 generated significantly higher data volumes than the evaluation phase. Therefore, to use stress-test data from this study, one should look at the event volume of the screening phase of Use Case 4. Here, about ten thousand events per minute were logged for more than a week, giving an impression of the performance of the proposed approach. The study data show that the saturation limit should be far beyond these ten thousand events per minute. However, the study design did not push the system to its event recording saturation limits.
Furthermore, this study should not be used to derive any cryptocurrency-related conclusions. Although some interesting aspects from Use Case 4 could be of interest for cryptocurrency trading indicator generation, no detailed analysis of correlations between stock prices and usage frequencies of stock symbols on Twitter has been done.
6.2.3. Considerations on Repeatability
To give the reader the opportunity to reproduce the use cases presented here, the source code of log12 [29] has been made available as open source software. We also reported on the observability stack that has been used in this study (ELK stack).
7. Related Work
The three pillars of metrics, traces, and logs have also been tackled in a recently published survey on anomaly detection and root cause analysis [30]. However, very often observability is reduced to tools and benchmarks for automated log parsing [31], or interesting publications like [32] focus on advances and challenges in log analysis. Publications like [33,34] report on empirical studies regarding how developers log or how to improve logging in general. Studies like [35,36] focus more on anomaly detection in logs. However, all log-related studies are meanwhile a bit outdated. In addition, very often logging is related to log file analysis in the context of IT security and anomaly detection only [37].
More recent studies look at observability from a more all-encompassing point of view. Ref. [38] focuses explicitly on the observability and monitoring of distributed systems. Ref. [39] focuses on microservices in this observability context. Ref. [40] focuses on the need for multi-level observability, especially in orchestration approaches. In addition, Ref. [41] considers scalable observability data management. An interesting and recent overview of observability of distributed edge and container-based microservices is provided by [42]. This survey provides a list of microservice-focused managed and unified observability services (Dynatrace, Datadog, New Relic, Sumo Logic, Solar Winds, Honeycomb). The research prototype presented in this study heads in the same direction but tries to pursue the problem primarily on the instrumentation side using a more lightweight and unified approach. Addressing the client side of the problem is obviously harder to exploit economically, which is why the industry might address the problem preferably on the managed service side.
Of logs, metrics, and distributed traces, distributed tracing is still considered in the most detail. In particular, the papers around Dapper [25] (a large-scale distributed systems tracing infrastructure initially operated by Google) should be mentioned here, which had a significant impact on this field. A black-box approach without instrumentation needs for distributed tracing is presented by [43]. It reports on end-to-end performance analysis of large-scale internet services mainly by statistical means. However, these black-box approaches are pretty limited in their expressiveness since operations must record large data sets to derive transactions along distributed systems' components simply from their observable network behaviour. In the meantime, it has become accepted to take on the effort of white-box instrumentation to be able to determine and statistically evaluate precise transaction flows. In this context, Ref. [44] compares and evaluates existing open tracing tools. Ref. [45] provides an overview of how to trace distributed component-based systems. In addition, Ref. [46] focuses on automated analysis of distributed tracing and corresponding challenges and research directions. This study, however, has seen tracing as only one of three aspects of observability and therefore follows a broader approach. Most importantly, this study has placed its focus on the instrumentation side of observability and less on the database and time series analysis side.
7.1. Existing Instrumenting Libraries and Observability Solutions
Although the academic coverage of the observability field is expandable, in practice, there is an extensive set of existing solutions, especially for time series analysis and instrumentation. A complete listing is beyond the scope of this paper. However, from the disproportion between the number of academic papers and the number of existing solutions, one quickly recognizes the practical relevance of the topic. Table 1 contains a list of existing database products often used for telemetry data consolidation to give the reader an overview, without claiming completeness. This study used ElasticSearch as an analytical database.
Table 1. Often seen databases for telemetry data consolidation. Products used in this study are marked bold, without claiming completeness.
Product Organization License Often Seen Scope
APM [9] Elastic Apache 2.0 Tracing (add-on to ElasticSearch database)
ElasticSearch [47] Elastic Apache/Elastic License 2.0 Logs, Tracing, (rarely Metrics)
InfluxDB [48] Influxdata MIT Metrics
Jaeger [49] Linux Foundation Apache 2.0 Tracing
OpenSearch [50] Amazon Web Services Apache 2.0 Logs, Tracing, (rarely Metrics); fork from ElasticSearch
Prometheus [51] Linux Foundation Apache 2.0 Metrics
Zipkin [52] OpenZipkin Apache 2.0 Tracing
Table 2 lists several frequently used forwarding solutions that developers can use to forward data from the point of capture to the databases listed in Table 1. In the context of this study, FileBeat was used as a log forwarding solution. It could be proved that this solution is also capable of forwarding traces and metrics if applied in a structured logging setting.
Table 2. Often seen forwarding solutions for log consolidation. Products used in this study are marked bold, without claiming completeness.
Product Organization License
Fluentd [53] FluentD Project Apache 2.0
Flume [54] Apache Apache 2.0
LogStash [55] Elastic Apache 2.0
FileBeat [56] Elastic Apache/Elastic License 2.0
Rsyslog [57] Adiscon GPL
syslog-ng [58] One Identity GPL
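As noted above, a structured logging setting allows a plain log forwarder to ship traces and metrics as well. The following minimal Python sketch is a hypothetical illustration of this idea, not code from the log12 prototype; field names such as trace_id, metric, and span_id are assumptions. It emits a log message, a metric sample, and a trace span as self-contained JSON lines that a forwarder such as FileBeat, configured for JSON input, could pick up unchanged.

```python
# Hypothetical sketch: one JSON object per line, so that a log shipper such as
# FileBeat can forward logs, metrics, and traces over the same channel.
# All field names are illustrative and not the schema used by this study.
import json
import time
import uuid


def emit(event: str, **fields) -> str:
    """Serialize one structured event as a single JSON line."""
    record = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "event": event,
        **fields,
    }
    line = json.dumps(record)
    print(line)  # in practice: append to a file that the forwarder tails
    return line


trace_id = uuid.uuid4().hex

# A classic log message, a metric sample, and a trace span share one format.
emit("user.login", level="info", user="alice", trace_id=trace_id)
emit("queue.depth", metric=42, unit="messages", trace_id=trace_id)
emit("http.request", span_id=uuid.uuid4().hex, duration_ms=12.7, trace_id=trace_id)
```

Because all three kinds of events end up as uniformly shaped JSON lines, no separate metric or trace agent is required on the forwarding path.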
Table 3 gives an undoubtedly incomplete overview of instrumentation libraries for different products and languages; the list can hardly be complete because each programming language comes with its own form of logging in the shape of specific libraries. Avoiding this language binding is hardly possible in the instrumentation context unless one pursues “esoteric approaches” like [43]. The logging library prototype is strongly influenced by the Python standard logging library and by structlog for structured logging, without actually using these libraries; a minimal structlog-style sketch follows Table 3.
Table 3. Often seen instrumenting libraries. Products that inspired the research prototype are marked in bold, without claiming completeness.
Product Use Case Organization License Remark
APM Agents [9] Tracing Elastic BSD 3
Jaeger Clients [49] Tracing Linux Foundation Apache 2.0
log [59] Logging Go Standard Library BSD 3 Logging for Go
log4j [60] Logging Apache Apache 2.0 Logging for Java
logging [61] Logging Python Standard Library GPL compatible Logging for Python
Micrometer [62] Metrics Pivotal Apache 2.0
Open Telemetry [10] Tracing Open Telemetry Apache 2.0
prometheus [51] Metrics Linux Foundation Apache 2.0
Splunk APM [63] Tracing Splunk Apache 2.0
structlog [64] Logging Hynek Schlawack Apache 2.0, MIT structured logging for Python
winston [65] Logging Charlie Robbins MIT Logging for node.js
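To illustrate the structured logging style that inspired the prototype, the following sketch uses the structlog library listed in Table 3. It is only an illustration of that library's key-value approach under an assumed processor configuration; it is not the API of the log12 prototype.

```python
# Minimal structlog sketch (illustrating the style referenced in Table 3);
# this is not the API of the log12 prototype. Requires a recent structlog.
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),  # add an ISO timestamp
        structlog.processors.add_log_level,           # add the log level
        structlog.processors.JSONRenderer(),          # render one JSON line
    ]
)

log = structlog.get_logger(service="checkout")

# Key-value pairs become machine-readable fields instead of free-text messages.
log.info("order.placed", order_id="A-4711", amount=19.99, currency="EUR")
```

The point of this style is that every logged key becomes a queryable field once the events reach the analytical database.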
7.2. Standards
There are hardly any observability standards. However, a noteworthy standardization approach is the OpenTelemetry Specification [10] of the Cloud Native Computing Foundation [66], which tries to standardize the way of instrumentation. This approach corresponds to the core idea that this study also follows. Nevertheless, the standard is still divided into Logs [67], Metrics [68] and Traces [69], which means that the conceptual triad of observability is not questioned. On the other hand, approaches like the OpenTelemetry Operator [70] for Kubernetes enable injecting auto-instrumentation libraries for Java, Node.js and Python into Kubernetes-operated applications, a feature that is currently not addressed by the present study. However, so-called service meshes [71,72] also use auto-instrumentation. A developing standard here is the so-called Service Mesh Interface (SMI) [73].
8. Conclusions and Future Research Directions
Cloud-native software systems often have a much more decentralized structure and many independently deployable and (horizontally) scalable components, making it more complicated to create a shared and consolidated picture of the overall decentralized system state [74,75]. Today, observability is often understood as a triad of collecting and processing metrics, distributed tracing data, and logging, but why, except for historical reasons?
This study presents a unified logging library for Python [29] and a unified logging architecture (see Figure 4) that uses a structured logging approach. The evaluation of four use cases shows that several thousand events per minute are easily processable and that logs, traces, and metrics can be handled in the same way. Not least, this study was able, with a straightforward approach, to log the world-wide Twitter event stream of stock market symbols over a period of six months without any noteworthy problems. As a side effect, some interesting aspects of how crypto-currencies are reflected on Twitter could be derived. This is of minor relevance for this study itself but shows the overall potential of a unified and structured logging-based observability approach.
The presented approach relies on an easy-to-use, programming-language-specific logging library that follows the structured logging approach. The long-term observation results of more than six months indicate that a unification of the current observability triad of logs, metrics, and traces is possible without the necessity to develop utterly new toolchains. The reason is the flexibility of the underlying structured logging approach. This kind of flexibility is a typical effect of data format standardization. The trick is to use structured logging and to apply log forwarding to a central analytical database in a systematic, infrastructure- or platform-provided way.
Further research should therefore concentrate on the instrumentation layer and less on the log forwarding and consolidation layer. If we instrument logs, traces, and metrics in the same style using the same log forwarding, we automatically generate correlatable data in a single source of truth, and we simplify analysis; the sketch below illustrates such a correlation query.
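As a hedged illustration of such correlation, the following sketch uses the Elasticsearch Python client (version 8.x keyword-argument style) to fetch all events that share one trace id. The index name unified-logging-*, the field names, and the example trace id are assumptions for illustration, not the actual setup of this study.

```python
# Hypothetical sketch: once logs, metrics, and traces land in one analytical
# database, a single query per correlation key is sufficient. Index and field
# names are illustrative only.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

trace_id = "2f9a6c1e0b7d4f58"  # example value, normally taken from an alert or log line

# One query returns log lines, metric samples, and spans of the same request.
response = es.search(
    index="unified-logging-*",
    query={"match": {"trace_id": trace_id}},
    sort=[{"@timestamp": {"order": "asc"}}],
)

for hit in response["hits"]["hits"]:
    source = hit["_source"]
    print(source.get("@timestamp"), source.get("event"))
```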
Thus, the observability road ahead may have several paths. On the one hand, we should standardize logging libraries in a structured style, as log12 does in this study or as the OpenTelemetry project does in the “wild”. Logging libraries should be implemented comparably in different programming languages and should generate the same structured logging data. Thus, we have to standardize both the logging SDKs and the data format, and both should be designed to cover logs, metrics, and distributed traces in a structured format. On the other hand, to simplify instrumentation further, we should additionally consider auto-instrumentation approaches, as proposed, for instance, by the OpenTelemetry Kubernetes Operator [70], by several service meshes like Istio [76,77], and by corresponding standards like SMI [73].
Funding: This research received no external funding.
Data Availability Statement: The resulting research prototype of the developed structured logging library log12 can be accessed here [29]. However, the reader should be aware that this is prototype software, still in progress.
Conflicts of Interest: The author declares no conflict of interest.
References
1. Kalman, R. On the general theory of control systems. IFAC Proc. Vol. 1960, 1, 491–502. https://doi.org/10.1016/S1474-6670(17)70094-8.
2. Kalman, R.E. Mathematical Description of Linear Dynamical Systems. J. Soc. Ind. Appl. Math. Ser. A Control 1963, 1, 152–192. https://doi.org/10.1137/0301010.
3. Newman, S. Building Microservices, 1st ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2015.
4. Kim, G.; Humble, J.; Debois, P.; Willis, J.; Forsgren, N. The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations; IT Revolution: Sebastopol, CA, USA, 2016.
5. Davis, C. Cloud Native Patterns: Designing Change-Tolerant Software; Simon and Schuster: New York, NY, USA, 2019.
6. Kratzke, N. Cloud-native Computing: Software Engineering von Diensten und Applikationen für die Cloud; Carl Hanser Verlag GmbH Co. KG: Munich, Germany, 2021.
7. Rochim, A.F.; Aziz, M.A.; Fauzi, A. Design Log Management System of Computer Network Devices Infrastructures Based on ELK Stack. In Proceedings of the 2019 International Conference on Electrical Engineering and Computer Science (ICECOS), Batam Island, Indonesia, 2–3 October 2019; pp. 338–342.
8. Lahmadi, A.; Beck, F. Powering monitoring analytics with ELK stack. In Proceedings of the 9th International Conference on Autonomous Infrastructure, Management and Security (AIMS 2015), Ghent, Belgium, 22–25 June 2015.
9. APM Authors. APM: Application Performance Monitoring. 2022. Available online: https://www.elastic.co/observability/application-performance-monitoring (accessed on 20 September 2022).
10. The OpenTelemetry Authors. The OpenTelemetry Specification. 2021. Available online: https://github.com/open-telemetry/opentelemetry-specification/releases/tag/v1.12.0 (accessed on 20 September 2022).
11. Kratzke, N.; Quint, P.C. Understanding Cloud-native Applications after 10 Years of Cloud Computing-A Systematic Mapping Study. J. Syst. Softw. 2017, 126, 1–16. https://doi.org/10.1016/j.jss.2017.01.001.
12. Kratzke, N. A Brief History of Cloud Application Architectures. Appl. Sci. 2018, 8, 1368. https://doi.org/10.3390/app8081368.
13. Bader, A.; Kopp, O.; Falkenthal, M. Survey and comparison of open source time series databases. In Datenbanksysteme für Business, Technologie und Web (BTW 2017)-Workshopband; Gesellschaft für Informatik: Bonn, Germany, 2017.
14. Petersen, K.; Gencel, C.; Asghari, N.; Baca, D.; Betz, S. Action Research as a Model for Industry-Academia Collaboration in the Software Engineering Context. In Proceedings of the 2014 International Workshop on Long-Term Industrial Collaboration on Software Engineering, WISE ’14, Vasteras, Sweden, 16 September 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 55–62. https://doi.org/10.1145/2647648.2647656.
15. Kratzke, N. Smart Like a Fox: How clever students trick dumb programming assignment assessment systems. In Proceedings of the 11th International Conference on Computer Supported Education (CSEDU 2019), Heraklion, Greece, 2–4 May 2019.
16. Truyen, E.; Kratzke, N.; Van Landuyt, D.; Lagaisse, B.; Joosen, W. Managing Feature Compatibility in Kubernetes: Vendor Comparison and Analysis. IEEE Access 2020, 8, 228420–228439. https://doi.org/10.1109/ACCESS.2020.3045768.
17. Kratzke, N. The #BTW17 Twitter Dataset-Recorded Tweets of the Federal Election Campaigns of 2017 for the 19th German Bundestag. Data 2017, 2, 34. https://doi.org/10.3390/data2040034.
18. Wiggins, A. The Twelve-Factor App. 2017. Available online: https://12factor.net (accessed on 20 September 2022).
19. Dragoni, N.; Giallorenzo, S.; Lafuente, A.L.; Mazzara, M.; Montesi, F.; Mustafin, R.; Safina, L. Microservices: Yesterday, today, and tomorrow. In Present and Ulterior Software Engineering; Springer: Berlin/Heidelberg, Germany, 2017; pp. 195–216.
20. Taibi, D.; Lenarduzzi, V.; Pahl, C. Architectural patterns for microservices: A systematic mapping study. In Proceedings of the CLOSER 2018: The 8th International Conference on Cloud Computing and Services Science, Funchal, Portugal, 19–21 March 2018; SciTePress: Setubal, Portugal, 2018.
21. Di Francesco, P.; Lago, P.; Malavolta, I. Architecting with microservices: A systematic mapping study. J. Syst. Softw. 2019, 150, 77–97.
22. Soldani, J.; Tamburri, D.A.; Van Den Heuvel, W.J. The pains and gains of microservices: A systematic grey literature review. J. Syst. Softw. 2018, 146, 215–232.
23. Baškarada, S.; Nguyen, V.; Koronios, A. Architecting microservices: Practical opportunities and challenges. J. Comput. Inf. Syst. 2020, 60, 428–436.
24. The Kubernetes Authors. Kubernetes. 2014. Available online: https://kubernetes.io (accessed on 20 September 2022).
25. Sigelman, B.H.; Barroso, L.A.; Burrows, M.; Stephenson, P.; Plakal, M.; Beaver, D.; Jaspan, S.; Shanbhag, C. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure; Technical Report; Google, Inc.: Mountain View, CA, USA, 2010.
26. Feldt, R.; Magazinius, A. Validity Threats in Empirical Software Engineering Research-An Initial Survey. In Proceedings of the SEKE, San Francisco, CA, USA, 1–3 July 2010.
27. Wohlin, C.; Runeson, P.; Höst, M.; Ohlsson, M.C.; Regnell, B.; Wesslén, A. Case Studies. In Experimentation in Software Engineering; Springer: Berlin/Heidelberg, Germany, 2012; pp. 55–72. https://doi.org/10.1007/978-3-642-29044-2_5.
28. Yin, R. Case Study Research and Applications: Design and Methods; Supplementary Textbook; SAGE Publications: New York, NY, USA, 2017.
29. Kratzke, N. log12-A Single and Self-Contained Structured Logging Library. 2022. Available online: https://github.com/nkratzke/log12 (accessed on 20 September 2022).
30. Soldani, J.; Brogi, A. Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey. ACM Comput. Surv. 2022, 55, 1–39. https://doi.org/10.1145/3501297.
31. Zhu, J.; He, S.; Liu, J.; He, P.; Xie, Q.; Zheng, Z.; Lyu, M.R. Tools and benchmarks for automated log parsing. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Montreal, QC, Canada, 25–31 May 2019; pp. 121–130.
32. Oliner, A.; Ganapathi, A.; Xu, W. Advances and challenges in log analysis. Commun. ACM 2012, 55, 55–61.
33. Fu, Q.; Zhu, J.; Hu, W.; Lou, J.G.; Ding, R.; Lin, Q.; Zhang, D.; Xie, T. Where do developers log? An empirical study on logging practices in industry. In Companion Proceedings of the 36th International Conference on Software Engineering, Hyderabad, India, 31 May–7 June 2014; pp. 24–33.
34. Zhu, J.; He, P.; Fu, Q.; Zhang, H.; Lyu, M.R.; Zhang, D. Learning to log: Helping developers make informed logging decisions. In Proceedings of the 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Florence, Italy, 16–24 May 2015; Volume 1, pp. 415–425.
35. Guan, Q.; Fu, S. Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In Proceedings of the 2013 IEEE 32nd International Symposium on Reliable Distributed Systems, Braga, Portugal, 1–3 October 2013; pp. 205–214.
36. Pannu, H.S.; Liu, J.; Fu, S. AAD: Adaptive anomaly detection system for cloud computing infrastructures. In Proceedings of the 2012 IEEE 31st Symposium on Reliable Distributed Systems, Irvine, CA, USA, 8–11 October 2012; pp. 396–397.
37. He, S.; Zhu, J.; He, P.; Lyu, M.R. Experience report: System log analysis for anomaly detection. In Proceedings of the 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), Ottawa, ON, Canada, 23–27 October 2016; pp. 207–218.
38. Niedermaier, S.; Koetter, F.; Freymann, A.; Wagner, S. On observability and monitoring of distributed systems-An industry interview study. In Proceedings of the International Conference on Service-Oriented Computing, Dubai, United Arab Emirates, 14–17 December 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 36–52.
39. Marie-Magdelaine, N.; Ahmed, T.; Astruc-Amato, G. Demonstration of an observability framework for cloud native microservices. In Proceedings of the 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), Bordeaux, France, 18–19 May 2021; pp. 722–724.
40. Picoreti, R.; do Carmo, A.P.; de Queiroz, F.M.; Garcia, A.S.; Vassallo, R.F.; Simeonidou, D. Multilevel observability in cloud orchestration. In Proceedings of the 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), Athens, Greece, 12–15 August 2018; pp. 776–784.
41. Karumuri, S.; Solleza, F.; Zdonik, S.; Tatbul, N. Towards observability data management at scale. ACM SIGMOD Rec. 2021, 49, 18–23.
42. Usman, M.; Ferlin, S.; Brunstrom, A.; Taheri, J. A Survey on Observability of Distributed Edge & Container-based Microservices. IEEE Access 2022, 10, 86904–86919. https://doi.org/10.1109/ACCESS.2022.3193102.
43. Chow, M.; Meisner, D.; Flinn, J.; Peek, D.; Wenisch, T.F. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Carlsbad, CA, USA, 11–13 July 2022; USENIX Association: Broomfield, CO, USA, 2014; pp. 217–231.
44. Janes, A.; Li, X.; Lenarduzzi, V. Open Tracing Tools: Overview and Critical Comparison. arXiv 2022, arXiv:2207.06875.
45. Falcone, Y.; Nazarpour, H.; Jaber, M.; Bozga, M.; Bensalem, S. Tracing distributed component-based systems, a brief overview. In Proceedings of the International Conference on Runtime Verification, Limassol, Cyprus, 10–13 November 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 417–425.
46. Bento, A.; Correia, J.; Filipe, R.; Araujo, F.; Cardoso, J. Automated Analysis of Distributed Tracing: Challenges and Research Directions. J. Grid Comput. 2021, 19, 9. https://doi.org/10.1007/s10723-021-09551-5.
47. ElasticSearch Authors. ElasticSearch Database. 2022. Available online: https://www.elastic.co/elasticsearch/ (accessed on 20 September 2022).
48. InfluxDB Authors. InfluxDB Time Series Data Platform. 2022. Available online: https://www.influxdata.com/ (accessed on 20 September 2022).
49. Jaeger Authors. Jaeger. 2022. Available online: https://jaegertracing.io (accessed on 20 September 2022).
50. OpenSearch Authors. OpenSearch. 2022. Available online: https://opensearch.org (accessed on 20 September 2022).
51. Prometheus Authors. Prometheus. 2022. Available online: https://prometheus.io (accessed on 20 September 2022).
52. Zipkin Authors. Zipkin. 2022. Available online: https://zipkin.io (accessed on 20 September 2022).
53. Fluentd Authors. Fluentd. 2022. Available online: https://fluentd.org (accessed on 20 September 2022).
54. Flume Authors. Flume. 2022. Available online: https://flume.apache.org (accessed on 20 September 2022).
55. LogStash Authors. LogStash. 2022. Available online: https://www.elastic.co/logstash (accessed on 20 September 2022).
56. FileBeat Authors. FileBeat. 2022. Available online: https://www.elastic.co/filebeat (accessed on 20 September 2022).
57. Rsyslog Authors. RSYSLOG-The Rocket-Fast Syslog Server. 2020. Available online: https://www.rsyslog.com (accessed on 20 September 2022).
58. Syslog-Ng Authors. Syslog-Ng. 2022. Available online: https://www.syslog-ng.com (accessed on 20 September 2022).
59. Go Standard Library Authors. Log. 2022. Available online: https://pkg.go.dev/log (accessed on 20 September 2022).
60. Log4j Authors. Log4j. 2022. Available online: https://logging.apache.org/log4j/2.x (accessed on 20 September 2022).
61. Python Standard Library Authors. Logging. 2022. Available online: https://docs.python.org/3/howto/logging.html (accessed on 20 September 2022).
62. Micrometer Authors. Micrometer Application Monitor. 2022. Available online: https://micrometer.io/ (accessed on 20 September 2022).
63. Splunk APM Authors. Splunk Application Performance Monitoring. 2022. Available online: https://www.splunk.com/en_us/products/apm-application-performance-monitoring.html (accessed on 20 September 2022).
64. Schlawack, H. Structlog. 2022. Available online: https://pypi.org/project/structlog (accessed on 20 September 2022).
65. Winston Authors. Winston. 2022. Available online: https://github.com/winstonjs/winston (accessed on 20 September 2022).
66. Linux Foundation. Cloud-Native Computing Foundation. 2015. Available online: https://cncf.io (accessed on 20 September 2022).
67. The OpenTelemetry Authors. The OpenTelemetry Specification-Logs Data Model. 2021. Available online: https://opentelemetry.io/docs/reference/specification/logs/data-model/ (accessed on 20 September 2022).
68. The OpenTelemetry Authors. The OpenTelemetry Specification-Metrics SDK. 2021. Available online: https://opentelemetry.io/docs/reference/specification/metrics/sdk/ (accessed on 20 September 2022).
69. The OpenTelemetry Authors. The OpenTelemetry Specification-Tracing SDK. 2021. Available online: https://opentelemetry.io/docs/reference/specification/trace/sdk/ (accessed on 20 September 2022).
70. The OpenTelemetry Authors. The OpenTelemetry Operator. 2021. Available online: https://github.com/open-telemetry/opentelemetry-operator (accessed on 20 September 2022).
71. Li, W.; Lemieux, Y.; Gao, J.; Zhao, Z.; Han, Y. Service mesh: Challenges, state of the art, and future research opportunities. In Proceedings of the 2019 IEEE International Conference on Service-Oriented System Engineering (SOSE), San Francisco, CA, USA, 4–9 April 2019; pp. 122–1225.
72. Malki, A.E.; Zdun, U. Guiding architectural decision making on service mesh based microservice architectures. In Proceedings of the European Conference on Software Architecture, Paris, France, 9–13 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 3–19.
73. Service Mesh Interface Authors. SMI: A Standard Interface for Service Meshes on Kubernetes. 2022. Available online: https://smi-spec.io (accessed on 20 September 2022).
74. Al-Debagy, O.; Martinek, P. A comparative review of microservices and monolithic architectures. In Proceedings of the 2018 IEEE 18th International Symposium on Computational Intelligence and Informatics (CINTI), Budapest, Hungary, 21–22 November 2018; pp. 000149–000154.
75. Balalaie, A.; Heydarnoori, A.; Jamshidi, P.; Tamburri, D.A.; Lynn, T. Microservices migration patterns. Softw. Pract. Exp. 2018, 48, 2019–2042.
76. Sheikh, O.; Dikaleh, S.; Mistry, D.; Pape, D.; Felix, C. Modernize digital applications with microservices management using the Istio service mesh. In Proceedings of the 28th Annual International Conference on Computer Science and Software Engineering, Toronto, ON, Canada, 29–31 October 2018; pp. 359–360.
77. Istio Authors. The Istio Service Mesh. 2017. Available online: https://istio.io/ (accessed on 20 September 2022).