Cloud-native Observability: The Many-faceted Benefits of
Structured and Unified Logging - A Case Study
Nane Kratzke
Lübeck University of Applied Sciences;
Abstract: Background: Cloud-native software systems often have a much more decentralized
structure and many independently deployable and (horizontally) scalable components, making it
more complicated to create a shared and consolidated picture of the overall decentralized system state.
Today, observability is often understood as a triad of collecting and processing metrics, distributed
tracing data, and logging. The result is often a complex observability system composed of three
stovepipes whose data is difficult to correlate. Objective: This study analyzes whether these three
historically emerged observability stovepipes of logs, metrics and distributed traces could be handled
more integrated and with a more straightforward instrumentation approach. Method: This study
applied an action research methodology used mainly in industry-academia collaboration and common
in software engineering. The research design utilized iterative action research cycles, including one
long-term use case. Results: This study presents a unified logging library for Python and a unified
logging architecture that uses the structured logging approach. The evaluation shows that several
thousand events per minute are easily processable. Conclusion: The results indicate that a unification
of the current observability triad is possible without the necessity to develop utterly new toolchains.
Keywords: cloud-native; observability; cloud computing; logging; structured logging; logs; metrics;
traces; distributed tracing; log aggregation; log forwarding; log consolidation
1. Introduction
A "crypto winter" basically means that the prices for so-called cryptocurrencies such as
Bitcon, Ethereeum, Solana, etc. fell sharply on the crypto exchanges and then stay low. The
signs were all around in 2022: the failure of the TerraUSD crypto project in May 2022 sent
an icy blast through the market, then the cryptocurrency lending platform Celsius Network
halted withdrawals, prompting a sell-off that pushed Bitcoin to a 17-month low. 22
This study logged such a "crypto winter" on Twitter more by accident than by intention.
Twitter was simply selected as an appropriate use case to evaluate a unified logging solution
for cloud-native systems, and the study decided to log Tweets containing stock symbols like $USD or
$EUR. It turned out that most symbols used on Twitter are not related to currencies like
$USD (US Dollar) or stocks like $AAPL (Apple) but to cryptocurrencies like $BTC (Bitcoin)
or $ETH (Ethereum). The Twitter community therefore seems to be quite cryptocurrency-savvy.
So, although some data of this 2022 crypto winter will be presented in this paper,
the focus is on the methodical part: how such and further data could be collected more
systematically in distributed cloud-native applications. The paper will at least show that
even complex observability of distributed systems can be reached simply by logging events to stdout.
Observability measures how well a system’s internal state can be inferred from knowledge
of its external outputs. The concept of observability was initially introduced by
the Hungarian-American engineer Rudolf E. Kálmán for linear dynamical systems [ ].
However, observability also applies to information systems and is of particular interest
for fine-grained and distributed cloud-native systems that come with their very own set of
observability challenges.
Preprints ( | NOT PEER-REVIEWED | Posted: 25 August 2022 doi:10.20944/preprints202208.0427.v1
© 2022 by the author(s). Distributed under a Creative Commons CC BY license.
Traditionally, the responsibility for observability is (was?) with operations (Ops). With
the emergence of DevOps, we can observe a shift of Ops responsibilities to developers, so
observability is evolving more and more into a Dev responsibility. Observability should
ideally already be considered during the application design phase and not be regarded
as some "add-on" feature for later expansion stages of an application. The current discussion
about observability began well before the advent of cloud-native technologies like
Kubernetes. A widely cited blog post by Cory Watson from 2013 shows how engineers at
Twitter looked for ways to monitor their systems as the company moved from a monolithic
to a distributed architecture [ ]. One of the ways Twitter did this was by developing a
command-line tool that engineers could use to create their own dashboards to keep track of the
charts they were creating. While CI/CD tools and container technologies often bridge Dev
and Ops in one direction, observability solutions close the loop in the opposite direction,
from Ops to Dev [ ]. Observability is thus the basis for data-driven software development
(see Fig. 1 and [ ]). As developments around cloud(-native) computing progressed, more
and more engineers began to "live in their dashboards." They learned that it is not enough
to collect and monitor data points but that it is necessary to address this problem more
systematically.
Figure 1. Observability can be seen as a feedback channel from Ops to Dev (adopted from [ ] and [ ]).
2. Problem description
Today, observability is often understood as a triad. Observability of distributed information
systems is typically achieved through the collection and processing of metrics (quantitative
data primarily as time-series), distributed tracing data (execution durations of complex
system transactions that flow through services of a distributed system), and logging (qual-
itative data of discrete system events often associated with timestamps but encoded as
unstructured strings). Consequently, three stacks of observability solutions have emerged,
and the following somehow summarizes the current state of the art.
Metrics: Here, quantitative data is often collected in time series, e.g., how many
requests a system is currently processing. The metrics technology stack is often
characterized by tools such as Prometheus and Grafana.
Distributed tracing involves following the path of transactions along the components
of a distributed system. The tracing technology stack is characterized by tools such as
Zipkin or Jaeger, and the technologies are used to identify and optimize particularly
slow or error-prone substeps of distributed transaction processing.
Logging is probably as old as software development itself, and many developers,
because of the ubiquity of logs, are unaware that logging should be seen as part of holistic
observability. Logs are usually stored in so-called log files. Primarily qualitative events
are logged (e.g., user XYZ logs in/out), and an event is usually appended to a log file as
a line of text. Often the implicit and historically justifiable assumption prevails with
developers that these log files are read and evaluated primarily by administrators
(thus humans). However, that is hardly the case anymore. It is becoming increasingly
common for the contents of these log files to be forwarded to a central database
through "log forwarders" so that they can be evaluated and analyzed centrally. The
technology stack is often characterized by tools such as Fluentd, FileBeat, and LogStash
for log forwarding, databases such as ElasticSearch, Cassandra, or simply S3, and user
interfaces such as Kibana.
Figure 2. An application is quickly surrounded by a complex observability system when metrics,
tracing and logs are captured with different observability stacks.
Incidentally, all three observability pillars have in common that the software to be developed
must somehow be instrumented. This instrumentation is normally done using programming-language-specific
libraries. Developers often regard distributed tracing instrumentation
in particular as time-consuming. Also, which metric types (counter, gauge, histogram,
summary, and more) are to be used in metric observability solutions such as Prometheus
often depends on Ops experience and is not always immediately apparent to developers.
Certain observability hopes fail simply because of wrongly chosen metric types. Only
system metrics such as CPU, memory, and storage utilization can be easily captured in a
black-box manner (i.e., without instrumentation in the code). However, these data are often
only of limited use for the functional assessment of systems. For example, CPU utilization
provides little information about whether conversion rates in an online store are developing
in the desired direction.
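The consequence of a wrongly chosen metric type can be sketched in a few lines of plain Python. The following is a minimal, hypothetical illustration (not a real Prometheus client; class and attribute names are invented for this sketch): a counter may only ever increase, while a gauge can rise and fall.

```python
class Counter:
    """Monotonic metric; decreasing it would corrupt rate queries."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        assert amount >= 0, "counters must never decrease"
        self.value += amount


class Gauge:
    """Level metric; represents a current value that moves in both directions."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


requests_total = Counter()  # suited for, e.g., total handled requests
requests_total.inc()
requests_total.inc()

open_requests = Gauge()     # suited for, e.g., currently open requests
open_requests.set(42)
open_requests.set(7)
```

Modeling "currently open requests" as a counter would be exactly the kind of wrongly chosen metric type mentioned above: the value could never go down again.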
So, current observability solutions are often based on these three stovepipes for logs,
metrics, and traces. The result is an application surrounded by a complex observability
system whose isolated datasets can be difficult to correlate. Fig. 2 focuses on the application
(i.e., the object to be monitored) and raises the question of whether it is justified to use three
complex subsystems and three types of instrumentation, which always means three times
the instrumentation and data analysis effort across isolated data silos.
The tool combination of ElasticSearch, LogStash, and Kibana is often used
for logging and has even been given a catchy acronym: the ELK stack [3]. The ELK stack
can also be used to collect metrics and, using the APM plugin, for distributed tracing. So,
at least for the ELK stack, the three stovepipes are not clearly separable or disjoint. The
separateness is more historically "suggested" than technologically given. Nevertheless,
this tripartite division into metrics, tracing, and logging is very formative for the industry,
as shown, for example, by the OpenTelemetry project [ ]. OpenTelemetry is currently in the
incubation stage at the Cloud Native Computing Foundation and provides a collection of
standardized tools, APIs, and SDKs to instrument, generate, collect, and export telemetry
data (metrics, logs, and traces) to analyze the performance and behaviour of software
systems. OpenTelemetry thus standardizes observability but hardly aims to overcome the
columnar separation into metrics, tracing, and logging.
In past and current industrial action research [ ], I came across various cloud-native
applications and corresponding engineering methodologies like the 12-factor app
(see Sec. 4.1) and learned that the discussion around observability is increasingly moving
beyond these three stovepipes and taking a more nuanced and integrated view. There is a
growing awareness of integrating and unifying these three pillars, and more emphasis is
being placed on analytics.
The research question arises whether these three historically emerged observability
stovepipes of logs, metrics, and distributed traces could be handled in a more integrated fashion
and with a more straightforward instrumentation approach. The results of this action research
study show that this unification potential could be surprisingly easy to realize. This paper
presents the methodology in Sec. 3 and its results in Sec. 4 (including a logging prototype
in Sec. 4.4 and its evaluation results in Sec. 4.5 as the main contribution of this paper to the
field). The results are discussed in Sec. 5. Furthermore, the study presents related work in
Sec. 6 and concludes its findings as well as promising future research directions in Sec. 7.
3. Methodology
This study followed the action research methodology as a proven and well-established research
methodology model for industry-academia collaboration in the software engineering
context to analyze the research question mentioned above. Following the recommendations
of Petersen et al. [ ], a research design was defined that applied iterative action research
cycles (see Fig. 3):
1. Diagnosis (diagnosing according to [15])
2. Prototyping (action planning, design, and taking according to [15])
3. Evaluation, including a possibly required redesign (evaluation according to [15])
4. Transfer of learning outcomes to further use cases (specifying learning according to [15])
Figure 3. Action research methodology of this study
With each of the following use cases, insights were transferred from the previous use case
into a structured logging prototype (see Fig. 3). The following use cases have been studied
and evaluated:
Use Case 1: Observation of qualitative events occurring in an existing solution (online
code editor; this use case was inspired by our research [11])
Use Case 2: Observation of distributed events along distributed services (distributed
tracing in an existing solution of an online code editor, see UC1)
Use Case 3: Observation of quantitative data generated by a technical infrastructure
(Kubernetes platform; this use case was inspired by our research [14])
Use Case 4: Observation of a massive online event stream to gain experiences with
high-volume event streams (we used Twitter as a data source and tracked worldwide
occurrences of stock symbols; this use case was inspired by our research [16,17])
4. Results
The analysis of cloud-native methodologies like the 12-factor app [ ] has shown that to
build observability, one should take a more nuanced and integrated view to integrate and
unify these three pillars of metrics, traces, and logs to enable more agile and convenient
analytics in the feedback information flow of DevOps cycles (see Fig. 1). Two aspects that
gained momentum in cloud-native computing are of interest:
Recommendations on how to handle log forwarding and log consolidation in cloud-native
applications
Recommendations to apply structured logging
Because both aspects deeply guided the implementation of the logging prototype, they will
be explained in more detail, providing the reader with the necessary context.
4.1. Twelve-factor apps
The 12-factor app is a method [ ] for building software-as-a-service applications that
pays special attention to the dynamics of organic growth of an application over time,
the dynamics of collaboration between developers working together on a codebase, and
avoiding the cost of software erosion. At its core, 12 rules (factors) should be followed to
develop well-operable and evolutionarily developable distributed applications. This
methodology harmonizes very well with microservice architecture approaches [ ] and
cloud-native operating environments like Kubernetes [ ], which is why the 12-factor
methodology is becoming increasingly popular. Incidentally, the 12-factor methodology
does not contain any factor explicitly referring to observability, certainly not in the triad of
metrics, tracing, and logging. However, factor XI recommends how to handle logging:
Logs are the stream of aggregated events sorted by time and summarized from the output
streams of all running processes and supporting services. Logs are typically a text format
with one event per line.
[...]
A twelve-factor app never cares about routing or storing its output stream. It should
not attempt to write to or manage log files. Instead, each running process writes its
stream of events to stdout. [...] On staging or production deploys, the streams of all
processes are captured by the runtime environment, combined with all other streams of
the app, and routed to one or more destinations for viewing or long-term archiving. These
archiving destinations are neither visible nor configurable to the app - they are managed
entirely from the runtime environment.
4.2. From logging to structured logging
The logging instrumentation is quite simple for developers and works in a programming-language-specific
way, but basically according to the following principle, illustrated here in Python.
A logging library must usually be imported, defining so-called log levels such as DEBUG,
INFO, WARNING, ERROR, FATAL, and others. While the application is running, a log
level is usually set via an environment variable, e.g., INFO. All log calls at or above this level are
then written to a log file.
1 import logging
2 logging.basicConfig(filename="example.log", level=logging.DEBUG)
3 logging.debug("Performing user check")
4 user = "Nane Kratzke"
5 logging.info(f"User {user} tries to log in.")
6 logging.warning(f"User {user} not found")
7 logging.error(f"User {user} has been banned.")
For example, line 5 would create the following entry in a log file:
INFO 2022-01-27 16:17:58 - User Nane Kratzke tries to log in
In a 12-factor app, this logging would be configured so that events are written directly to
stdout (console). The runtime environment (e.g., Kubernetes with a FileBeat service installed)
then routes the log data to the appropriate database, taking work away from the developer
that they would otherwise have to invest in log processing. This type of logging is well
supported across many programming languages and can be consolidated excellently with
the ELK stack (or other observability stacks).
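Such a 12-factor-style configuration boils down to attaching the standard logging machinery to a plain output stream. A minimal sketch using Python's standard logging module (a StringIO stands in for stdout here so that the emitted line can be inspected; in a real deployment one would pass sys.stdout and let the platform's log forwarder do the routing):

```python
import io
import logging

# Stand-in for stdout; a 12-factor app would use sys.stdout instead.
stream = io.StringIO()

handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("twelve-factor-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The application only emits events; routing and storage are not its concern.
logger.info("User %s tries to log in.", "Nane Kratzke")
line = stream.getvalue().strip()
```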
Logging (unlike distributed tracing and metrics collection) is often not even perceived
as (complex) instrumentation by developers; often it is done on their own initiative.
However, one can systematize this instrumentation somewhat and extend it to so-called
"structured logging". Again, the principle is straightforward. One simply does not log lines
of text like
INFO 2022-01-27 16:17:58 - User Nane Kratzke tries to log in
but instead, the same information in a structured form, e.g., using JSON:
1{ " lo g l e ve l " : " i nf o " , " t im e s ta m p " : " 20 2 2 - 01 - 2 7 1 6: 1 7 :5 8 " , " e ve n t " : " Lo g i n " , 214
" us e r ": " N an e K ra t zk e " , " re s ul t " : " su c ce s s "} 215
In both cases, the text is written to the console. In the second case, however, a structured text-
based data format is used that is easier to evaluate. In the case of a typical logging statement
like "User Max Mustermann tries to log in" the text must first be analyzed to determine the
user. This text parsing is costly on a large scale and can also be very computationally
intensive and complex if there is plenty of log data in a variety of formats (which is the
common case in the real world). 221
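The difference in evaluation effort can be sketched directly. In the following sketch, the regular expression is illustrative and would have to be maintained separately for every log format in use, whereas the structured variant is a plain field access:

```python
import json
import re

# Unstructured: the user name must be recovered by brittle text parsing.
unstructured = "INFO 2022-01-27 16:17:58 - User Nane Kratzke tries to log in"
match = re.search(r"User (.+) tries to log in", unstructured)
user_from_text = match.group(1) if match else None

# Structured: the same information is a simple field access after JSON parsing.
structured = ('{"loglevel": "info", "timestamp": "2022-01-27 16:17:58", '
              '"event": "Login", "user": "Nane Kratzke", "result": "success"}')
user_from_json = json.loads(structured)["user"]
```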
However, in the case of structured logging, this information can easily be extracted
from the JSON data field "user". In particular, more complex evaluations become much
easier with structured logging. However, the instrumentation does not become
significantly more complex, especially since there are logging libraries for structured
logging. In the logging prototype log12 of this study, the logging looks like this:
import log12
[...]
log12.error("Login", user=user, result="Not found", reason="Banned")
The resulting log files are still readable for administrators and developers (even if a bit more
unwieldy) but much better processable and analyzable by databases such as ElasticSearch.
Quantitative metrics can also be recorded in this way; structured logging can thus also be
used for recording quantitative metrics.
import log12
[...]
log12.info("Open requests", requests=len(requests))

{"event": "Open requests", "requests": 42}
What is more, this structured logging approach can also be used to create tracings. In
distributed tracing systems, a trace ID is created for each transaction that passes through a
distributed system. The individual steps are so-called spans. These are also assigned an
ID (span ID). The span ID is then linked to the trace ID, and the runtime is measured and
logged. In this way, the time course of distributed transactions can be tracked along the
components involved, and, for example, the duration of individual processing steps can be
determined.
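This principle can be sketched in a few lines of plain Python. The helper below is hypothetical (it is not part of log12); the field names trace_id and span_id merely follow common distributed tracing conventions:

```python
import json
import time
import uuid

def span_event(event, trace_id=None, **fields):
    """Builds a structured log event that doubles as a tracing span."""
    started = time.time()
    # ... the traced unit of work would run here ...
    return {
        "event": event,
        # All spans of one distributed transaction share the same trace ID.
        "trace_id": trace_id or str(uuid.uuid4()),
        # Each processing step gets its own span ID.
        "span_id": str(uuid.uuid4()),
        "duration": time.time() - started,
        **fields,
    }

root = span_event("Checkout")
child = span_event("Payment", trace_id=root["trace_id"], provider="demo")
print(json.dumps(child))
```

Because the span data is just another structured log event, it travels through the same stdout-based pipeline as all other events.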
4.3. Resulting and simplified logging architecture
So, the two principles, printing logs simply to stdout and logging in a structured, text-based
data format, should be applied consistently. The resulting observability system complexity
then reduces from Fig. 2 to Fig. 4 because all system components can emit log, metric, and
trace information in the same style, which can be routed seamlessly by a log forwarder
provided by the operation platform (already existing technology) to a central analytical database.
Figure 4. An observability system consistently based on structured logging with significantly reduced complexity.
4.4. Study outcome: Unified instrumentation via a structured logging library (prototype)
This paper will briefly explain below the way to capture events, metrics, and traces using
the logging prototype that emerged. The prototype library log12 was developed in Python
3 but could be implemented analogously in other programming languages.
log12 automatically creates for each event additional key-value attributes, such as a
unique identifier (used to relate child events to parent events and even remote events
in distributed tracing scenarios) and start and completion timestamps that can be used to
measure the runtime of events (known from distributed tracing libraries but not
common for logging libraries). It is explained
how to create a log stream,
how an event in a log stream is created and logged,
how a child event can be created and assigned to a parent event (to trace and record
runtimes of more complex and dependent chains of events within the same process),
and how to make use of the distributed tracing features to trace events that pass
through a chain of services in a distributed service-of-services system.
The following lines of code create a log stream with the name "logstream" that is logged to
stdout.
Listing 1: Creating an event log stream in log12

import log12
log = log12.logging("logstream",
    general="value", tag="foo", service_mark="test"
)
Each event and all child events of this stream are assigned a set of key-value pairs:
general="value"
tag="foo"
service_mark="test"
These log-stream-specific key-value pairs can be used to define selection criteria in analytical
databases like ElasticSearch to filter events of a specific service only. The following lines of
code demonstrate how to create a parent event and child events.
Listing 2: Event logging in log12 using blocks as structure

# Log events using the with clause
with log.event("Test", hello="World") as event:
    event.update(test="something")
    # adds event-specific key-value pairs to the event

    with event.child("Subevent 1 of Test") as ev:
        ev.update(foo="bar")
        ev.error("Catastrophe")
        # Explicit call of log (here on error level)

    with event.child("Subevent 2 of Test") as ev:
        ev.update(bar="foo")
        # Implicit call of ev.info("Success") (at block end)

    with event.child("Subevent 3 of Test") as ev:
        ev.update(bar="foo")
        # Implicit call of ev.info("Success") (at block end)
Furthermore, it is possible to log events in the event stream without the block style. That
might be necessary in programming languages that do not support closing resources (here,
a log stream) at the end of a block. In this case, programmers are responsible for closing events
using the .info(), .warn(), .error() log levels.
Listing 3: Event logging in log12 without blocks

# To log events without with-blocks is possible as well.
ev = log.event("Another test", foo="bar")
ev.update(bar="foo")
child = ev.child("Subevent of Another test", foo="bar")
ev.info("Finished")
# <= However, then you are responsible to log events explicitly.
# If parent events are logged, all subsequent child events
# are assumed to have closed successfully as well.
Using this type of logging to forward events along HTTP-based requests is also possible.
This usage of HTTP headers is the usual method in distributed tracing. Two main
capabilities are required for this [ ]. First, it must be possible to extract header information
received by an HTTP service process. Secondly, it must be possible to inject the tracing
information into follow-up upstream HTTP requests (in particular, the trace ID and span ID
of the process initiating the request).
Listing 4 shows how log12 supports this with an extract attribute at event creation
and an inject method of the event that extracts relevant key-value pairs from the event so
that they can be passed as header information along an HTTP request.
Listing 4: Extraction and injection of tracing headers in log12

import log12
import requests            # To generate HTTP requests
from flask import request  # To demonstrate header extraction

with log.event("Distributed tracing", extract=request.headers) as ev:
    # Here is how to pass tracing information along remote calls
    with ev.child("Task 1") as event:
        response = requests.get(
            "https://qr.mylab.th-luebeck.dev/route?url=https://google.com",
            headers=event.inject()
        )
        event.update(length=len(response.text), status=response.status_code)
4.5. Evaluation of the logging prototype in the defined use cases
Use Cases 1 and 2: Codepad is an online coding tool to quickly share short code snippets in
online and offline teaching scenarios. It was introduced during the Corona pandemic
shutdowns to share short code snippets, mainly in online educational settings for 1st- or
2nd-semester computer science students. Meanwhile, the tool is used in on-site lectures
and labs as well. The reader is welcome to try out the tool at
dev. This study used the Codepad tool in steps 1, 2, 3, and 4 of its action research
methodology as an instrumentation use case (see Fig. 3) to evaluate the instrumentation of
qualitative system events according to Sec. 4.4. Fig. 5 shows the Web-UI on the left and the
resulting dashboard on the right. In a transfer step (steps 12, 13, 14, and 15 of the action
research methodology, see Fig. 3), the same product was used to evaluate distributed tracing
instrumentation (not covered in detail by this report).
Figure 5. Use Cases 1 and 2: Codepad is an online coding tool to quickly share short code snippets in
online and offline teaching scenarios. On the left, the Web-UI; on the right, the Kibana dashboard
used for observability in this study. Codepad was used as an instrumentation object of investigation.
Use Case 3 (steps 5, 6, 7, and 8 of the research methodology; Fig. 3) observed an institute's
infrastructure, the so-called myLab infrastructure. myLab
is a virtual laboratory that can be used by students and faculty staff to develop and host
web applications. This use case was chosen to demonstrate that it is possible to collect
primarily metrics-based data over the long term using the same approach as in Use Case 1. A
pod tracked mainly the resource consumption of the various differing workloads deployed by
more than 70 student web projects of different university courses. To observe this resource
consumption, the pod simply ran
kubectl top nodes
kubectl top pods --all-namespaces
periodically against the cluster. This observation pod parsed the output of both shell commands and
printed the parsed results in the structured logging approach presented in Sec. 4.4. Fig. 6
shows the resulting Kibana dashboard for demonstration purposes.
Figure 6. Use Case 3: The dashboard of the Kubernetes infrastructure under observation (myLab)
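The observation loop of Use Case 3 can be sketched as follows. The parser below is hypothetical (the study's actual parsing code is not shown in this paper); the column layout follows kubectl's tabular output, and the sample data is invented for illustration:

```python
import json

# Invented sample of `kubectl top nodes` output (header row plus data rows).
sample = """NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   250m         12%    2048Mi          53%
node-2   125m         6%     1024Mi          26%"""

def parse_top_nodes(output):
    """Turns kubectl's tabular output into structured log events."""
    events = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        name, cpu, cpu_pct, mem, mem_pct = line.split()
        events.append({"event": "node metrics", "node": name,
                       "cpu": cpu, "cpu_percent": cpu_pct,
                       "memory": mem, "memory_percent": mem_pct})
    return events

for ev in parse_top_nodes(sample):
    print(json.dumps(ev))  # to stdout; a log forwarder ships it onward
```

Emitted this way, infrastructure metrics flow through exactly the same structured logging pipeline as the qualitative events of Use Cases 1 and 2.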
Use Case 4 (steps 9, 10, and 11 of the research methodology; Fig. 3) left our own ecosystem and
observed the public Twitter event stream as a representative of a high-volume and
long-term observation of an external system, i.e., a system that was intentionally not under
the direct administrative control of the study investigators. Use Case 4 was designed as
a two-phase study: The first screening phase was designed to gain experience in logging high-volume
event streams and to provide necessary features and performance optimizations
to the structured logging library prototype. The screening phase was designed to screen
the complete and representative Twitter traffic as a kind of "ground truth". We were
interested in the distribution of languages and stock symbols in relation to the general
Twitter "background noise". This screening phase lasted from 20/01/2022 to 02/02/2022
and identified the most used stock symbols. A long-term recording was then done as a second
evaluation phase and was used to track and record the most frequently used stock
symbols identified in the screening phase. This evaluation phase lasted from February 2022 until
mid-August 2022. In this evaluation phase, just one infrastructure downtime occurred,
due to an electricity shutdown at the author's institute. However, this downtime was not
due to or related to the presented unified logging stack (see Fig. 9).
Preprints ( | NOT PEER-REVIEWED | Posted: 25 August 2022 doi:10.20944/preprints202208.0427.v1
Figure 7. Recorded events (screening and evaluation phase of Use Case 4).
The recording was done using the following source code, compiled into a Docker container
and executed on the same Kubernetes cluster that was logged in Use Cases 1, 2, and 3.
FileBeat was used as the log forwarding component to a background ElasticSearch database.
The resulting event log was analyzed and visualized using Kibana. Kibana was also used
to export the data as CSV files for the screening and the evaluation phase.
Figs. 7, 8, and 9 have been compiled from that data. This setting followed exactly the
unified and simplified logging architecture presented in Fig. 4.
Listing 5: The used logging program to record Twitter stock symbols from the public
Twitter Stream API

import log12, tweepy, os

KEY = os.environ.get("CONSUMER_KEY")
SECRET = os.environ.get("CONSUMER_SECRET")
TOKEN = os.environ.get("ACCESS_TOKEN")
TOKEN_SECRET = os.environ.get("ACCESS_TOKEN_SECRET")
LANGUAGES = [l.strip() for l in os.environ.get("LANGUAGES", "").split(",")]
TRACK = [t.strip() for t in os.environ.get("TRACKS").split(",")]

log = log12.logging("twitter stream")

class Twista(tweepy.Stream):

    def on_status(self, status):
        with log.event("tweet", tweet_id=status.id_str,
                       user_id=status.user.id_str, lang=status.lang
        ) as event:
            kind = "status"
            kind = "reply" if status._json['in_reply_to_status_id'] else kind
            kind = "retweet" if 'retweeted_status' in status._json else kind
            kind = "quote" if 'quoted_status' in status._json else kind
            event.update(lang=status.lang, kind=kind, message=status.text)

            with event.child("user") as usr:
                name = status.user.name if status.user.name else "unknown"
                usr.update(lang=status.lang, id=status.user.id_str,
                           name=name,
                           screen_name=f"@{status.user.screen_name}",
                           message=status.text,
                           kind=kind
                )

            for tag in status.entities['hashtags']:
                with event.child("hashtag") as hashtag:
                    hashtag.update(lang=status.lang,
                                   tag=f"#{tag['text'].lower()}",
                                   message=status.text,
                                   kind=kind
                    )

            for sym in status.entities['symbols']:
                with event.child("symbol") as symbol:
                    symbol.update(lang=status.lang,
                                  symbol=f"${sym['text'].upper()}",
                                  message=status.text,
                                  kind=kind
                    )
                    symbol.update(screen_name=f"@{status.user.screen_name}")

            for user_mention in status.entities['user_mentions']:
                with event.child("mention") as mention:
                    mention.update(lang=status.lang,
                                   screen_name=f"@{user_mention['screen_name']}",
                                   message=status.text,
                                   kind=kind
                    )

record = Twista(KEY, SECRET, TOKEN, TOKEN_SECRET)
if LANGUAGES:
    record.filter(track=TRACK, languages=LANGUAGES)
else:
    record.filter(track=TRACK)
According to Fig. 7, only every 100th observed event in the screening phase contained a stock
symbol. That is simply the "ground truth" on Twitter: if one observes the public Twitter
stream without any filter, that is what one gets. So, the second evaluation phase recorded
a very specific "filter bubble" of the Twitter stream. The reader should be aware that the
data presented in the following is clearly biased and not a representative Twitter event stream;
it is a stock-market-focused subset or, to be even more precise, a cryptocurrency-focused
subset, because almost all stock symbols on Twitter are related to cryptocurrencies.
It is possible to visualize the resulting effects using the recorded data. Fig. 8 shows the
difference in the language distributions of the screening phase (unfiltered ground truth) and
the evaluation phase (activated symbol filter). While in the screening phase English (en),
Spanish (es), Portuguese (pt), and Turkish (tr) are responsible for more than 3/4 of all traffic,
in the evaluation phase almost all recorded Tweets are in English. So, on Twitter, the most
stock-symbol-related language is clearly English.
Figure 8. Observed languages (screening and evaluation phase of Use Case 4).
Although the cryptocurrency logging was used mainly as a use case for the technical evaluation
of the logging library prototype, some interesting insights could be gained. E.g.,
although Bitcoin (BTC) is likely the most prominent cryptocurrency, it is by far not the most
frequently used stock symbol on Twitter. The most prominent stock symbols on Twitter are:
ETH: Ethereum cryptocurrency
SOL: Solana cryptocurrency
BTC: Bitcoin cryptocurrency
LUNA: Terra Luna cryptocurrency (replaced by a new version after the crash in May 2022)
BNB: Binance Coin cryptocurrency
What is more, we can see interesting details in the trends (see Fig. 9).
The ETH usage on Twitter seems to be declining throughout the observed period.
The SOL usage is, on the contrary, increasing, although we observed a sharp decline in July.
The LUNA usage has a clear peak that correlates with the LUNA cryptocurrency crash
in mid-May 2022 (this crash was heavily reflected in the investor media).
The Twitter usage was not correlated with the currency rates on cryptocurrency stock
markets. However, changes in the usage patterns of stock market symbols might be of interest
to cryptocurrency investors as indicators worth observing. As this study shows, these
changes can be easily tracked using structured logging approaches. Of course, this can be
transferred to other social media streaming or general event streaming use cases such as IoT
(Internet of Things) as well.
5. Discussion
This style of unified and structured observability was successfully evaluated in several
use cases that made use of a FileBeat/ElasticSearch-based observability stack. However,
other observability stacks that can forward and parse structured text in JSON format will
likely show the same results. The evaluation included a long-term test over more than six
months for a high-volume use case.
On the one hand, it could be shown that such a type of logging can easily be used to
perform classic metrics collection. For this purpose, black-box metrics such as CPU,
memory, and storage for the infrastructure (nodes) but also the "payload" (pods) were
successfully collected and evaluated in several Kubernetes clusters (see Fig. 6).
Second, a high-volume use case was investigated and analyzed in depth. Here, all
English-language tweets on the public Twitter stream were logged. About 1 million
events per hour were logged over a week and forwarded to an ElasticSearch database
using the log forwarder FileBeat. Most systems will generate far fewer events (see
Figure 7).
Figure 9. Recorded symbols per day (evaluation phase of Use Case 4).
In addition, the prototype logging library log12 is meanwhile used in several internal
systems, including web-based development environments, QR code services, and
e-learning systems, to record access frequencies to learning content and to study the
learning behaviour of students.
5.1. Lessons learned
All use cases have shown that structured logging is easy to instrument and harmonizes
well with existing observability stacks (esp. Kubernetes, FileBeat, ElasticSearch, Kibana).
However, some aspects should be considered:
It is essential to apply structured logging because it can be used to log events, metrics,
and traces in the same style.
Very often, only error-prone situations are logged. However, if you want to act in the
sense of DevOps-compliant observability, you should also log normal - completely
regular - behaviour. DevOps engineers can gain many insights from how normal
users use systems in standard situations. So, the log level should be set to INFO, and
not to WARNING, ERROR, or above.
Cloud-native system components should rely on the log forwarding and log aggregation
of the runtime environment. Never implement this on your own. You will duplicate
logic and end up with complex and possibly incompatible log aggregation systems.
To simplify analysis for engineers, one should push the key-value pairs of parent events
down to child events. This logging approach simplifies analysis in centralized log
analysis solutions: it reduces the need to derive event contexts that might be
difficult to deduce in JSON document stores. However, this comes at the cost of
more extensive log storage.
Do not collect aggregated metrics data. The aggregation (mean, median, percentile,
standard deviation, sum, count, and more) can be done much more conveniently in
the analytical database. The instrumentation should focus on recording metrics data
in a point-in-time style. According to our developer experience, developers are glad
to be authorized to log only such simple metrics, especially when there is not much
background knowledge in statistics.
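The push-down of parent key-value pairs can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the actual log12 implementation; all class, function, and field names are chosen for illustration only:

```python
import contextlib
import json

emitted = []  # stand-in for a log sink, e.g. stdout scraped by FileBeat

class Event:
    """A structured event that inherits all key-value pairs of its parent."""
    def __init__(self, name, parent_fields=None, **fields):
        self.fields = {**(parent_fields or {}), "event": name, **fields}

    def update(self, **fields):
        self.fields.update(fields)

    @contextlib.contextmanager
    def child(self, name, **fields):
        # the child starts with a copy of the parent's current fields
        child = Event(name, parent_fields=self.fields, **fields)
        yield child
        emitted.append(json.dumps(child.fields))  # one JSON line per event

@contextlib.contextmanager
def event(name, **fields):
    root = Event(name, **fields)
    yield root
    emitted.append(json.dumps(root.fields))

with event("request", trace_id="abc123") as e:
    e.update(path="/login")
    with e.child("db-query") as q:
        q.update(duration_ms=12)

# the child record carries trace_id and path without re-deriving the context
print(emitted[0])
```

Because the child record repeats `trace_id` and `path`, a centralized analysis can filter child events directly, at the price of the larger log storage mentioned above.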
5.2. Threats to validity and limitations of the study design
Action research is prone to drawing incorrect or non-generalizable conclusions. Logically,
the significance is always highest within the considered use cases. In order to draw
generalizable conclusions, this study defined use cases in such a way that intentionally
different classes of telemetry data (logs, metrics, traces) were considered. It should be noted
that the study design primarily considered logs and metrics and traces only marginally.
Traces were not wholly neglected, however, but were analyzed less intensively.
The long-term acquisition was performed with a high-volume use case to cover certain
stress-test aspects. However, the reader must be aware that the screening phase of Use
Case 4 generated significantly higher data volumes than the evaluation phase. Therefore, to use
stress-test data from this study, one should look at the event volume of the screening phase
of Use Case 4. Here, about ten thousand events per minute were logged for more than a
week, giving an impression of the performance of the proposed approach. The study data
shows that the saturation limit should be far beyond these ten thousand events per minute.
However, the study design did not push the system to its event recording saturation
limits.
What is more, this study should not be used to derive any cryptocurrency-related
conclusions, although some aspects of Use Case 4 could be of interest for the generation of
cryptocurrency trading indicators. However, no detailed analysis of correlations
between stock prices and the usage frequencies of stock symbols on Twitter has been done.
6. Related work
There are relatively few studies dealing with observability as a main object of investigation
in an academic understanding. The field is currently treated somewhat neglectfully.
However, an interesting and recent overview is provided by the survey of Usman et al.
[21]. This survey provides a list of microservice-focused managed and unified observability
services (Dynatrace, Datadog, New Relic, Sumo Logic, Solar Winds, Honeycomb). The
research prototype presented in this study heads in the same direction but tries to pursue
the problem primarily on the instrumenting side using a more lightweight and unified
approach. Addressing the client side of the problem is obviously harder to exploit
economically, which is why the industry might prefer to address the problem on the
managed service side.
Of logs, metrics, and distributed traces, distributed tracing is still considered in the
most detail. In particular, the papers around Dapper [20] should be mentioned here, which
had a significant impact on this field. A black-box approach without instrumenting needs
for distributed tracing is presented by [22]. This study, however, has seen tracing as only
one of three aspects of observability and therefore follows a broader approach. A more
recent review on current challenges and approaches of distributed tracing is presented by
Bento et al. [23].
6.1. Existing instrumenting libraries and observability solutions
Although the academic coverage of the observability field is expandable, in practice there is
an extensive set of existing solutions, especially for time series analysis and instrumentation.
A complete listing is beyond the scope of this paper. However, from the disproportion
between the number of academic papers and the number of existing solutions, one quickly
recognizes the practical relevance of the topic. Table 1 contains a list of existing database
products often used for telemetry data consolidation to give the reader an overview without
claiming completeness. This study used ElasticSearch as an analytical database.
Table 1. Often seen databases for telemetry data consolidation. Products used in this study are
marked in bold. Without claiming completeness.
Product Organization License often seen scope
APM Elastic Apache 2.0 Tracing (add-on to ElasticSearch database)
ElasticSearch Elastic Apache/Elastic License 2.0 Logs, Tracing, (rarely Metrics)
InfluxDB Influxdata MIT Metrics
Jaeger Linux Foundation Apache 2.0 Tracing
OpenSearch Amazon Web Services Apache 2.0 Logs, Tracing, (rarely Metrics); fork from ElasticSearch
Prometheus Linux Foundation Apache 2.0 Metrics
Zipkin OpenZipkin Apache 2.0 Tracing
Table 2 lists several frequently used forwarding solutions that developers can use to forward
data from the point of capture to the databases listed in Table 1. In the context of this study,
FileBeat was used as the log forwarding solution. It could be shown that this solution is
also capable of forwarding traces and metrics if applied in a structured logging setting.
Table 2. Often seen forwarding solutions for log consolidation. Products used in this study are
marked in bold. Without claiming completeness.
Product Organization License
Fluentd FluentD Project Apache 2.0
Flume Apache Apache 2.0
LogStash Elastic Apache 2.0
FileBeat Elastic Apache/Elastic License 2.0
Rsyslog Adiscon GPL
syslog-ng One Identity GPL
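The claim that one forwarding path can carry all three signal types rests on the records sharing one structured format. The sketch below shows hypothetical JSON records (the field names are illustrative, not a prescribed schema) for a log event, a metric sample, and a trace span that a forwarder such as FileBeat could ship unchanged to the same analytical database:

```python
import json
import time
import uuid

# one structured shape covers all three observability signal types
records = [
    {"type": "log", "level": "INFO", "message": "user logged in",
     "ts": time.time()},
    {"type": "metric", "name": "cpu_percent", "value": 42.0,
     "node": "node-1", "ts": time.time()},
    {"type": "span", "trace_id": uuid.uuid4().hex, "operation": "db-query",
     "duration_ms": 12.5, "ts": time.time()},
]

for record in records:
    print(json.dumps(record))  # one JSON line per record for the forwarder
```

Since all three record kinds land in the same database with shared fields such as `ts`, they can be correlated in a single query instead of across three stovepipes.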
An undoubtedly incomplete overview of instrumentation libraries for different products
and languages is given in Table 3, presumably because each programming language
comes with its own form of logging in the shape of specific libraries. Avoiding this
language binding is hardly possible in the instrumentation context unless one pursues
"esoteric approaches" like [ ]. The logging library prototype is strongly influenced by the Python
standard logging library but also by structlog for structured logging, without actually
using these libraries.
Table 3. Often seen instrumenting libraries. Products that inspired the research prototype are
marked in bold. Without claiming completeness.
Product Use Case Organization License Remark
APM Agents Tracing Elastic BSD 3
Jaeger Clients Tracing Linux Foundation Apache 2.0
log Logging Go Standard Library BSD 3 Logging for Go
log4j Logging Apache Apache 2.0 Logging for Java
logging Logging Python Standard Library GPL compatible Logging for Python
Micrometer Metrics Pivotal Apache 2.0
OpenTracing Tracing OpenTracing Apache 2.0
prometheus Metrics Linux Foundation Apache 2.0
Splunk APM Tracing Splunk Apache 2.0
structlog Logging Hynek Schlawack Apache 2.0, MIT structured logging for Python
winston Logging Charlie Robbins MIT Logging for node.js
6.2. Standards
There are hardly any observability standards. However, a noteworthy standardization
approach is the OpenTelemetry Specification [7] of the Cloud Native Computing Foundation
[24], which tries to standardize the way of instrumentation. This approach corresponds to the
core idea which this study also follows. Nevertheless, the standard is still divided into Logs
[25], Metrics [26], and Traces [27], which means that the conceptual triad of observability
is not questioned. On the other hand, approaches like the OpenTelemetry Operator [28]
for Kubernetes enable the injection of auto-instrumentation libraries for Java, Node.js, and Python
into Kubernetes-operated applications, which is a feature that is currently not addressed
by the present study. However, so-called service meshes also use auto-instrumentation. A
developing standard here is the so-called Service Mesh Interface (SMI) [29].
7. Conclusions and Future Research Directions
Cloud-native software systems often have a much more decentralized structure and many
independently deployable and (horizontally) scalable components, making it more complicated
to create a shared and consolidated picture of the overall decentralized system state.
Today, observability is often understood as a triad of collecting and processing metrics,
distributed tracing data, and logging. But why, except for historical reasons?
This study presents a unified logging library for Python [30] and a unified logging
architecture (see Fig. 4) that uses a structured logging approach. The evaluation of four
use cases shows that several thousand events per minute are easily processable and can
be used to handle logs, traces, and metrics in the same way. At least, this study was able,
with a straightforward approach, to log the worldwide Twitter event stream of stock market
symbols over a period of six months without any noteworthy problems. As a side effect,
some interesting aspects of how cryptocurrencies are reflected on Twitter could be derived.
This might be of minor relevance for this study, but it shows the overall potential of a unified
and structured-logging-based observability approach.
The presented approach relies on an easy-to-use, programming-language-specific
logging library that follows the structured logging approach. The long-term observation
results of more than six months indicate that a unification of the current observability
triad of logs, metrics, and traces is possible without the necessity of developing utterly new
toolchains. The trick is to
use structured logging and
apply log forwarding to a central analytical database
in a systematic, infrastructure- or platform-provided way.
Further research should therefore concentrate on the instrumenting layer and less on the
log forwarding and consolidation layer. If we instrument logs, traces, and metrics in the
same style using the same log forwarding, we automatically generate correlatable data in a
single source of truth, and we simplify analysis.
So, the observability road ahead may have several paths. On the one hand, we
should standardize the logging libraries in a structured style like log12 in this study
or the OpenTelemetry project in the "wild". Logging libraries should be comparably
implemented in different programming languages and shall generate the same structured
logging data. So, we have to standardize the logging SDKs and the data format. Both
should be designed to cover logs, metrics, and distributed traces in a structured format. To
simplify instrumentation further, we should additionally think about auto-instrumentation
approaches, as proposed, for instance, by the OpenTelemetry Kubernetes Operator [28] and
several service meshes like Istio [31] and corresponding standards like SMI [29].
Funding: This research received no external funding.
Data Availability Statement: The resulting research prototype of the developed structured logging
library log12 can be accessed here [30]. However, the reader should be aware that this is
prototype software in progress.
Conflicts of Interest: The author declares no conflict of interest.
References
1. Kalman, R. On the general theory of control systems. IFAC Proceedings Volumes 1960, 1, 491-502. 1st International IFAC Congress on Automatic and Remote Control, Moscow, USSR, 1960.
2. Kalman, R.E. Mathematical Description of Linear Dynamical Systems. Journal of the Society for Industrial and Applied Mathematics Series A Control 1963, 1, 152-192.
3. Newman, S. Building Microservices, 1st ed.; O'Reilly Media, Inc., 2015.
4. Kim, G.; Humble, J.; Debois, P.; Willis, J.; Forsgren, N. The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations; IT Revolution, 2016.
5. Davis, C. Cloud Native Patterns: Designing change-tolerant software; Simon and Schuster, 2019.
6. Kratzke, N. Cloud-native Computing: Software Engineering von Diensten und Applikationen für die Cloud; Carl Hanser Verlag GmbH Co. KG, 2021.
7. The OpenTelemetry Authors. The OpenTelemetry Specification, 2021.
8. Kratzke, N.; Peinl, R. ClouNS - a Cloud-Native Application Reference Model for Enterprise Architects. In Proceedings of the 2016 IEEE 20th International Enterprise Distributed Object Computing Workshop (EDOCW), 2016, pp. 1-10.
9. Kratzke, N.; Quint, P.C. Understanding Cloud-native Applications after 10 Years of Cloud Computing - A Systematic Mapping Study. Journal of Systems and Software 2017, 126, 1-16.
10. Kratzke, N. A Brief History of Cloud Application Architectures. Applied Sciences 2018, 8.
11. Kratzke, N. How programming students trick and what JEdUnit can do against it. In Computer Supported Education; Lane, H.C.; Zvacek, S.; Uhomoibhi, J., Eds.; Springer International Publishing, 2020; pp. 1-25. CSEDU 2019 - Revised Selected Best Papers.
12. Kratzke, N. Einfachere Observability durch strukturiertes Logging. Informatik Aktuell 2022.
13. Kratzke, N.; Siegfried, R. Towards Cloud-native Simulations - Lessons learned from the front-line of cloud computing. Journal of Defense Modeling and Simulation 2020.
14. Truyen, E.; Kratzke, N.; Van Landuyt, D.; Lagaisse, B.; Joosen, W. Managing Feature Compatibility in Kubernetes: Vendor Comparison and Analysis. IEEE Access 2020, 8, 228420-228439.
15. Petersen, K.; Gencel, C.; Asghari, N.; Baca, D.; Betz, S. Action Research as a Model for Industry-Academia Collaboration in the Software Engineering Context. In Proceedings of the 2014 International Workshop on Long-Term Industrial Collaboration on Software Engineering; Association for Computing Machinery: New York, NY, USA, 2014; WISE '14, pp. 55-62.
16. Kratzke, N. The #BTW17 Twitter Dataset - Recorded Tweets of the Federal Election Campaigns of 2017 for the 19th German Bundestag. Data 2017, 2.
17. Kratzke, N. Monthly Samples of German Tweets, 2022.
18. Wiggins, A. The Twelve-Factor App, 2017.
19. The Kubernetes Authors. Kubernetes, 2014.
20. Sigelman, B.H.; Barroso, L.A.; Burrows, M.; Stephenson, P.; Plakal, M.; Beaver, D.; Jaspan, S.; Shanbhag, C. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical report, Google, Inc., 2010.
21. Usman, M.; Ferlin, S.; Brunstrom, A.; Taheri, J. A Survey on Observability of Distributed Edge & Container-based Microservices. IEEE Access 2022, pp. 1-1.
22. Chow, M.; Meisner, D.; Flinn, J.; Peek, D.; Wenisch, T.F. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14); USENIX Association: Broomfield, CO, 2014; pp. 217-231.
23. Bento, A.; Correia, J.; Filipe, R.; Araujo, F.; Cardoso, J. Automated Analysis of Distributed Tracing: Challenges and Research Directions. Journal of Grid Computing 2021, 19, 9.
24. Linux Foundation. Cloud-native Computing Foundation, 2015.
25. The OpenTelemetry Authors. The OpenTelemetry Specification - Logs Data Model, 2021.
26. The OpenTelemetry Authors. The OpenTelemetry Specification - Metrics SDK, 2021.
27. The OpenTelemetry Authors. The OpenTelemetry Specification - Tracing SDK, 2021.
28. The OpenTelemetry Authors. The OpenTelemetry Operator, 2021.
29. Service Mesh Interface Authors. SMI: A standard interface for service meshes on Kubernetes, 2022.
30. Kratzke, N. log12 - a single and self-contained structured logging library, 2022.
31. Istio Authors. The Istio service mesh, 2017.
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Edge computing is proposed as a technical enabler for meeting emerging network technologies (such as 5G and Industrial Internet of Things), stringent application requirements and key performance indicators (KPIs). It aims to alleviate the problems associated with centralized cloud computing systems by placing computational resources to the network's edge, closer to the users. However, the complexity of distributed edge infrastructures grows when hosting containerized workloads as microservices, resulting in hard to detect and troubleshoot outages on critical use cases such as industrial automation processes. Observability aims to support operators in managing and operating complex distributed infrastructures and microservices architectures by instrumenting end-to-end runtime performance. To the best of our knowledge, no survey article has been recently proposed for distributed edge and containerized microservices observability. Thus, this article surveys and classifies state-of-the-art solutions from various communities. Besides surveying state-of-the-art, this article also discusses the observability concept, requirements, and design considerations. Finally, we discuss open research issues as well as future research directions that will inspire additional research in this area.
Full-text available
Microservice-based architectures are gaining popularity for their benefits in software development. Distributed tracing can be used to help operators maintain observability in this highly distributed context, and find problems such as latency, and analyse their context and root cause. However, exploring and working with distributed tracing data is sometimes difficult due to its complexity and application specificity, volume of information and lack of tools. The most common and general tools available for this kind of data, focus on trace-level human-readable data visualisation. Unfortunately, these tools do not provide good ways to abstract, navigate, filter and analyse tracing data. Additionally, they do not automate or aid with trace analysis, relying on administrators to do it themselves. In this paper we propose using tracing data to extract service metrics, dependency graphs and work-flows with the objective of detecting anomalous services and operation patterns. We implemented and published open source prototype tools to process tracing data, conforming to the OpenTracing standard, and developed anomaly detection methods. We validated our tools and methods against real data provided by a major cloud provider. Results show that there is an underused wealth of actionable information that can be extracted from both metric and morphological aspects derived from tracing. In particular, our tools were able to detect anomalous behaviour and situate it both in terms of involved services, work-flows and time-frame. Furthermore, we identified some limitations of the OpenTracing format—as well as the industry accepted tracing abstractions—, and provide suggestions to test trace quality and enhance the standard.
Full-text available
Kubernetes (k8s) is a kind of cluster operating system for cloud-native workloads that has become a de-facto standard for container orchestration. Provided by more than one hundred vendors, it has the potential to protect the customer from vendor lock-in. However, the open-source k8s distribution consists of many optional and alternative features that must be explicitly activated and may depend on pre-configured system components. As a result, incompatibilities still may ensue among Kubernetes vendors. Mostly managed k8s services typically restrict the customizability of Kubernetes. This paper firstly compares the most relevant k8s vendors and, secondly, analyses the potential of Kubernetes to detect and configure compatible support for required features across vendors in a uniform manner. Our comparison is performed based on documented features, by testing, and by inspection of the configuration state of running clusters. Our analysis focuses on the potential of the end-to-end testing suite of Kubernetes to detect support for a desired feature in any Kubernetes vendor and the possibility of reconfiguring the studied vendors with missing features in a uniform manner. Our findings are threefold: First, incompatibilities arise between default cluster configurations of the studied vendors for approximately 18% of documented features. Second, matching end-to-end tests exist only for around 64% of features and for 17% of features these matching tests are not well developed for all vendors. Third, almost all feature incompatibilities can be resolved using a vendor-agnostic API. These insights are beneficial to avoid feature incompatibilities already in cloud-native application engineering processes. Moreover, the end-to-end testing suite can be extended in currently unlighted areas to provide better feature coverage.
Full-text available
Cloud computing can be a game-changer for computationally intensive tasks like simulations. The computational power of Amazon, Google, or Microsoft is even available to a single researcher. However, the pay-as-you-go cost model of cloud computing influences how cloud-native systems are being built. We transfer these insights to the simulation domain. The major contributions of this paper are twofold: (A) we propose a cloud-native simulation stack and (B) derive expectable software engineering trends for cloud-native simulation services. Our insights are based on systematic mapping studies on cloud-native applications, a review of cloud standards, action research activities with cloud engineering practitioners, and corresponding software prototyping activities. Two major trends have dominated cloud computing over the last 10 years. The size of deployment units has been minimized and corresponding architectural styles prefer more fine-grained service decompositions of independently deployable and horizontally scalable services. We forecast similar trends for cloud-native simulation architectures. These similar trends should make cloud-native simulation services more microservice-like, which are composable but just ''simulate one thing well.'' However, merely transferring existing simulation models to the cloud can result in significantly higher costs. One critical insight of our (and other) research is that cloud-native systems should follow cloud-native architecture principles to leverage the most out of the pay-as-you-go cost model.
This paper presents a review of cloud application architectures and their evolution. It reports observations made during a research project that tackled the problem of transferring cloud applications between different cloud infrastructures. As a side effect, we learned a lot about the commonalities and differences of a wide variety of cloud applications, which might be of value for cloud software engineers and architects. Throughout the research project, we analyzed industrial cloud standards, performed systematic mapping studies of cloud-native application-related research papers, did action research activities in cloud engineering projects, modeled a cloud application reference model, and performed software and domain-specific language engineering activities. Two primary (and sometimes overlooked) trends can be identified. First, cloud computing and its related application architecture evolution can be seen as a steady process of optimizing resource utilization in cloud computing. Second, these resource utilization improvements resulted over time in an architectural evolution of how cloud applications are built and deployed. A shift from monolithic service-oriented architectures (SOA), via independently deployable microservices, towards so-called serverless architectures is observable. In particular, serverless architectures are more decentralized and distributed and make more intentional use of separately provided services. In other words, a decentralizing trend in cloud application architectures is observable that emphasizes decentralized architectures known from former peer-to-peer based approaches. This is astonishing because, with the rise of cloud computing (and its centralized service provisioning concept), research interest in peer-to-peer based approaches (and their decentralizing philosophy) decreased. However, this seems to be changing. Cloud computing could head into a future of more decentralized and more meshed services.
The German Bundestag elections are the most important elections in Germany. This dataset comprises Twitter interactions related to German politicians of the most important political parties over several months in the (pre-)phase of the German federal election campaigns in 2017. The Twitter accounts of more than 360 politicians were followed for four months. The collected data comprise a sample of approximately 10 GB of Twitter raw data, covering more than 120,000 active Twitter users and more than 1,200,000 recorded tweets. Even without sophisticated data analysis techniques, it was possible to deduce a likely political party proximity for more than half of these accounts simply by looking at the re-tweet behavior. This might be of interest to innovative data-driven party campaign strategists in the future. Furthermore, it is observable that, in Germany, supporters and politicians of populist parties use Twitter much more intensively and aggressively than supporters of other parties. Moreover, established left-wing parties seem to be more active on Twitter than established conservative parties. The dataset can be used to study how political parties, their followers, and their supporters make use of social media channels in political election campaigns, and what kind of content is shared.
It is common sense that cloud-native applications (CNA) are intentionally designed for the cloud. Although this understanding is broadly shared, it does not guide or explain what a cloud-native application exactly is. The term "cloud-native" was used quite frequently in the early days of cloud computing (2006), which seems somewhat obvious nowadays, but the term then disappeared almost completely. In recent years, however, it has been used more and more frequently again and shows increasing momentum. This paper summarizes the outcomes of a systematic mapping study analyzing research papers covering "cloud-native" topics, research questions, and engineering methodologies. We summarize research focuses and trends dealing with cloud-native application engineering approaches. Furthermore, we provide a definition for the term "cloud-native application" which takes all findings and insights of the analyzed publications as well as already existing and well-defined terminology into account.
Markets are changing ever faster and customer needs are taking center stage; many companies face challenges that can only be mastered digitally. Cloud-native technologies lend themselves to meeting these demands. However, it is not enough to simply create an account with a cloud provider; one also has to understand the various factors that influence the success of cloud-native projects. Based on real-world examples, the book shows what went well and what went badly during adoption in different industries, and which best practices can be derived from this. The migration of legacy code is also considered. This book provides IT architects with the fundamental knowledge to introduce cloud-native technologies and the DevOps culture in their projects or across the entire company. It examines the cloud-native transformation from different perspectives: from corporate culture, cloud economics, and customer involvement (co-creation), via project management (agility) and software architecture, to quality assurance (continuous delivery) and operations (DevOps). ▪️ Fundamentals of cloud computing (service models and cloud economics) ▪️ The everything-as-code paradigm (DevOps, deployment pipelines, IaC) ▪️ Automating system operations with container orchestration ▪️ Understanding microservice and serverless architectures, and designing cloud-native architectures with domain-driven design. EXTRA: CC0-licensed and editable handouts and labs for trainers and lecturers can be found here:
According to our data, about 15% of programming students cheat if they are aware that only a "dumb" robot evaluates their programming assignments, unattended by programming experts. Especially in large-scale formats like MOOCs, this might become a problem, because tricking current automated programming assignment assessment systems (APAAS) is astonishingly easy, and the question arises whether unattended grading components grade the capability to program or the capability to cheat. This study analyzed what kinds of tricks students apply beyond the well-known "copy-paste" code plagiarism, in order to derive possible mitigation options. Therefore, this study analyzed student cheat patterns that occurred in two programming courses and developed a unit testing framework, JEdUnit, as a solution proposal that intentionally targets such tricky educational aspects of programming. The validation phase validated JEdUnit in another programming course. This study identified and analyzed four recurring cheat patterns (overfitting, evasion, redirection, and injection) that hardly occur in "normal" software development and are not addressed by the common unit testing frameworks frequently used to test the correctness of student submissions. Therefore, the concept of well-known unit testing frameworks was extended by three "countermeasures": randomization, code inspection, and separation. The validation showed that JEdUnit detected these patterns and, in consequence, reduced cheating entirely to zero. From a student's perspective, JEdUnit makes the grading component more intelligent, and cheating no longer pays off. This chapter explains the cheat patterns and which features of JEdUnit mitigate them, using a continuous example.
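The randomization countermeasure named above can be sketched in a few lines. JEdUnit itself is a Java framework; the following is merely an illustrative Python analogue with hypothetical names, showing how freshly randomized test data defeats the "overfitting" cheat pattern (submissions that hard-code the expected outputs of fixed test cases):

```python
import random

def grade_sum_function(student_sum, trials=100, seed=None):
    """Grade a student-submitted sum() implementation against randomized
    inputs, so that hard-coded ("overfitted") return values cannot pass."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Fresh random test data on every grading run: a submission that
        # merely replays memorized outputs will mismatch almost immediately.
        data = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
        expected = sum(data)
        if student_sum(list(data)) != expected:
            return False
    return True

# An honest implementation passes the randomized grading ...
assert grade_sum_function(lambda xs: sum(xs), seed=42)
# ... while an "overfitted" submission returning a fixed answer fails.
assert not grade_sum_function(lambda xs: 42, seed=42)
```

Seeding the generator (as the test harness might do per grading run) keeps a single run reproducible while still making the concrete inputs unpredictable to students across runs.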